Deriving Correlation Matrices for Missing Financial Time-Series Data

The problem of missing data is prevalent in financial time series, particularly in data such as foreign exchange rates and interest rate indices. Reasons for missing data include the closure of financial markets over weekends and holidays, and the fact that index data sometimes do not change between consecutive dates, resulting in stale data (also treated as missing data). Most statistical software packages function best when applied to complete datasets. Listwise deletion, a commonly used approach to dealing with missing data, is straightforward to use and implement, but it can exclude large portions of the original dataset (Allison, 2002). Where data are randomly missing or if the deleted data are insignificant (measured by statistical power), listwise deletion may add value. Techniques to handle missing data were suggested and implemented. These techniques were assessed to ascertain which provided the most accurate reconstructed datasets compared with the complete dataset.


Background
The problem of missing data is widespread in financial time series data such as foreign exchange (FX) and interest rates. Data may be absent from financial time series for various reasons: financial markets close on weekends and holidays (with the latter differing between countries), and indices sometimes do not change value over certain periods (called "stale data" and treated as "missing data" in this paper). Missing data pose a serious problem for statistical analysis that requires complete datasets. For example, the Pearson correlation requires the same number of observations in each of the two series being compared: errors result if missing data violate this constraint.
Several techniques exist to handle missing or stale data. Listwise deletion (Note 1) (Complete Case Analysis), a commonly used method, is easy to use and implement. Although its simplicity is an advantage, some significant disadvantages exist, such as the potential exclusion of a large proportion of the original data (Allison, 2002). It can be a valuable technique when data are missing randomly (Notes 2 & 3), or if the deleted cases do not result in significant differences between calculated values.
The scope of this paper is to investigate, evaluate and test different techniques to deal with missing data and then apply these techniques to financial time series data to construct correlation matrices. The statistical software R Studio (Note 4) was used to implement, test, and evaluate the different techniques used to handle missing data.
The remainder of the paper is structured as follows. A review of the available literature is provided in Section 1, followed by a description of the data used and the methodology employed in Section 2. The results are documented in Section 3 and Section 4 concludes.
Figure 1. Missing data patterns. From left to right: univariate, unit non-response, monotone, general, planned missing data and latent variable patterns (Enders, 2010)

1.3 Classification of Missing Data
Rubin (1976) first introduced a classification system for missing data, by treating the missing data indicators as random variables and assigning a distribution to them. Little and Rubin (2002) continued the idea of mechanisms that lead to missing data. They reviewed the theory first introduced by Rubin (1976), using different notation and terminology from those of the original paper. Enders (2010) elaborated on the classification systems and on the fact that missing data mechanisms do not necessarily provide causal explanations for the missing data, but that they do represent the generic mathematical relationships between the data and missingness. Missing data mechanisms can be described as possible relationships between measured variables and the probability of missing data (Enders, 2010).
The definitions that follow are derived from Little and Rubin (2002) and Enders (2010).
The missing indicator is denoted as $R$ and the missing data mechanisms describe the different relationships between $R$ and the data. $Y_{obs}$ are the observed parts of the data and $Y_{mis}$ are the missing parts. The parameter (or set of parameters), $\phi$, describes the relationship between $R$ and the data. The missing data mechanisms are defined formally below.
Missing Completely at Random (MCAR) requires that missingness is completely unrelated to the data, sometimes referred to as haphazard missingness (Enders, 2010). Little and Rubin (2002) emphasise that this assumption does not mean that the pattern itself is random, but that the missingness does not depend on the data $Y$, which includes $Y_{obs}$ and $Y_{mis}$. The probability distribution of MCAR is:
$$P(R \mid \phi)$$
Missing at Random (MAR) is defined as missingness that depends on $Y_{obs}$ but not on $Y_{mis}$ (Graham, 2012). Occasionally, MAR is defined as a missingness mechanism in which there is no relationship between the propensity for missing data on a variable $Y$ and the values of $Y$ after partialling out other variables (partialling is the process by which a single variable is assigned a fixed value to identify correlations between the other variables) (Enders, 2010). MAR does not mean that the data are missing randomly, but that a systematic relationship exists between one or more measured variables and the probability of missing data (Enders, 2010). In simple terms this means that the missingness was not caused by a completely random process (Graham, 2012). The probability distribution of MAR is:
$$P(R \mid Y_{obs}, \phi)$$
Missing Not at Random (MNAR) is also known as Not Missing at Random (NMAR); these terms can be used interchangeably (Little & Rubin, 2002; Enders, 2010). It is sometimes defined as the type of missingness in which the cause of missingness is correlated with $Y$ (Graham, 2012). This variable can itself be missing or contain missingness, depending on the way it is measured. Enders (2010) describes MNAR as a missingness mechanism where the probability of missing data on a variable $Y$ is related to the values of $Y$ itself.
The probability distribution of MNAR, defined below, is useful because it contains all the information about the missingness:
$$P(R \mid Y_{obs}, Y_{mis}, \phi)$$
where $\phi$ is a parameter that describes the relationship between $R$ and $Y$ ($Y_{obs}$ and $Y_{mis}$).
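The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the function `make_missing`, the synthetic series and the rule of deleting the largest values are assumptions introduced here, and Python stands in for the R Studio implementation used in the paper.

```python
import random

def make_missing(x, y, mechanism, rate=0.3, seed=0):
    """Return a copy of y with None marking the values made missing.

    MCAR: each y value is dropped with probability `rate`, ignoring the data.
    MAR:  y is dropped where the fully observed x is largest.
    MNAR: y is dropped where y itself is largest.
    """
    n = len(y)
    if mechanism == "MCAR":
        rng = random.Random(seed)
        drop = {i for i in range(n) if rng.random() < rate}
    elif mechanism in ("MAR", "MNAR"):
        driver = x if mechanism == "MAR" else y
        ranked = sorted(range(n), key=lambda i: driver[i], reverse=True)
        drop = set(ranked[: round(rate * n)])
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [None if i in drop else v for i, v in enumerate(y)]

def observed_mean(values):
    """Mean of the values that are still observed."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)

# Synthetic, correlated series (hypothetical data, not the paper's).
rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
y = [0.8 * a + 0.6 * rng.gauss(0.0, 1.0) for a in x]

full_mean = sum(y) / len(y)
means = {m: observed_mean(make_missing(x, y, m)) for m in ("MCAR", "MAR", "MNAR")}
```

Under this deliberately extreme deletion rule, the observed mean of `y` falls below the full-sample mean for MAR and MNAR, whereas MCAR deletion is not designed to shift it in either direction.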

Imputation of Missing Data Approaches
Little and Rubin (2002) provide information about statistical analysis from missing data experts. A comprehensive overview of data editing and imputation may be found in De Waal, Pannekoek, and Scholtus (2011). Although not specific to the niche field of financial time series, it covers the topic thoroughly. Enders (2010) covers applied missing data analysis and translates state-of-the-art technical missing data literature into an accessible, reliable reference.
The approach used in this paper to address the problem of missing data is "multiple imputation", a common approach employed in several fields. Statistical models are used to extract relevant information from observed data sets and to use this to impute several values for the missing values. Several "complete" data sets are thus produced in which the observed values are the same in each, but the imputed values differ depending on the approach used. The appeal of the process is that, after imputation, statistical methods (which would have been employed had there been no missing values) can be applied to each of the completed data sets. Simple procedures can then be applied to combine results.
The most common statistical analysis methods require data sets with no gaps. Real, empirical data, however, are replete with gaps and scattered missingness throughout. A possible circumvention of the problem is to employ listwise deletion, a process by which any record/observation or case is excluded from an analysis if any single observation in the record is missing (Graham, 2012). Listwise deletion discards all the (often substantial) information which exists in the partially-observed observations and which encodes the variables' relationships.
Better solutions than listwise deletion involve plugging the gaps with statistical estimates, but since missing data cannot be replaced with the "true" values of the missing data (in which case there would be no missing data), missing data must be imputed. Imputed values, however, cause statistical analysis software to exaggerate confidence in the output (by biasing standard errors and confidence intervals), because the software treats imputed values as observations although fewer data were empirically observed.
The missing data theory that preceded this section lays the foundation for understanding the techniques to treat missing values. In this paper, daily returns, calculated from closing prices, are used for statistical analysis. According to Taylor (2008), daily returns are more convenient to analyse than prices directly, because consecutive prices are highly correlated and the variances of the prices increase with time. This growth in variance reflects the non-stationarity of prices, and working with daily returns makes the process stationary.
The techniques to handle missing values that are listed below have been identified using a pragmatic approach. Hastie et al. (2009) list three ways to handle missing variables, assuming they are MCAR. Next, the techniques that can be used to handle missing values are described. Each technique is discussed briefly in terms of its mechanics, advantages, disadvantages and implementability.
According to Honaker et al. (2011), these approaches can lead to serious biases and covariances. Ad hoc techniques to treat missing data, for example listwise deletion, tend to discard information about variables to make the estimation problem more tractable (Schafer, 1997). This could lead to a loss in statistical power, and listwise deletion may be biased if the missingness mechanism is MAR and not MCAR.
The ad hoc method of mean imputation may preserve the observed sample mean, but it distorts the covariance structure, biasing estimated variances and covariances toward zero. Imputing variables from a regression model inflates the correlations, biasing them away from zero. When the missingness mechanism is complex, the derivation of an imputation scheme that preserves the important aspects of a joint distribution can prove to be very difficult (Schafer, 1997).
Ad hoc approaches are, therefore, not necessarily a good way of treating missing values as these approaches are outdated.
Listwise deletion means that cases with missing data on any of the variables used for the statistical analysis will be excluded from the analysis. The advantages of this technique are that it is easy to implement and that it is the default setting of most statistical software programs. Thus, the implementation of this technique is straightforward. The disadvantage, however, is that this technique can greatly reduce the sample size that is used for statistical analysis. There is also the risk of potential bias in the mean as well as the variance of the statistical analysis on the data.
Mean imputation is a simple method which is easy to understand and implement in any statistical software program. This method replaces all the missing values with the mean of the observed values of each variable; thus the mean of the observed variables is not biased and remains unchanged. The disadvantage of this method is that the variance of the data is reduced.
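The effect of mean imputation on the variance and on the correlation can be seen in a short experiment. This is a sketch on synthetic data: the series, the 30% missingness rate and the helper `pearson` are assumptions chosen for illustration and are not taken from the paper's datasets.

```python
import random
import statistics

def pearson(a, b):
    """Pearson correlation of two equally long, complete series."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [0.8 * a + 0.6 * rng.gauss(0.0, 1.0) for a in x]

# Remove 30% of y completely at random, then fill the gaps with the observed mean.
missing = set(rng.sample(range(len(y)), k=150))
observed = [v for i, v in enumerate(y) if i not in missing]
fill = statistics.fmean(observed)
y_imputed = [fill if i in missing else v for i, v in enumerate(y)]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(y_imputed)   # smaller: imputed points add no spread
r_complete = pearson(x, y)
r_imputed = pearson(x, y_imputed)              # pulled toward zero
```

The mean of `y_imputed` equals the observed mean by construction, while its sample variance and its correlation with `x` are both smaller than the corresponding values before imputation.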
Multiple Imputation (MI) was designed by Rubin (1976) to be practical without neglecting statistical elements. MI reduces bias and increases efficiency compared to ad hoc approaches (Honaker et al., 2012). MI is very flexible and can be used with any kind of model or data, but it is difficult to implement due to the extremely technical nature of the process and subsequent algorithms. Another drawback of MI is that a different result is obtained every time it is used.
Maximum Likelihood (ML) is preferred to Multiple Imputation (MI) because ML is consistent and asymptotically efficient under the MAR assumption (Allison, 2012). If the missingness mechanism can be described by MAR then the standard errors are unbiased. The ML method can be used incorporating the Expectation Maximisation (EM) algorithm. Dempster et al. (1977) first introduced the EM algorithm. The EM algorithm follows a two-step approach to get ML estimates from the missing data:
- Expectation (E): Calculate the expected value of the log-likelihood, given the observed data and the current parameter estimates.
- Maximisation (M): Maximise the expected likelihood to obtain new parameter estimates.
The above process is repeated until convergence is obtained.
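The E and M steps can be sketched for the simplest case relevant to correlation estimation: two jointly normal series in which one is complete and the other has gaps. The function `em_bivariate_normal`, the synthetic data and the deletion rule are assumptions made for illustration; this is not the implementation used in the paper.

```python
import random

def em_bivariate_normal(x, y, n_iter=200, tol=1e-10):
    """EM estimates of (mu_x, mu_y, s_xx, s_xy, s_yy) when some y are None.

    Assumes x is fully observed, (x, y) is bivariate normal and y is MAR.
    """
    n = len(x)
    obs = [i for i in range(n) if y[i] is not None]
    # x is complete, so its ML estimates are available directly.
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    # Start the y parameters from the complete cases.
    mu_y = sum(y[i] for i in obs) / len(obs)
    s_yy = sum((y[i] - mu_y) ** 2 for i in obs) / len(obs)
    s_xy = sum((x[i] - mu_x) * (y[i] - mu_y) for i in obs) / len(obs)
    for _ in range(n_iter):
        # E-step: expected y and y^2 for missing entries, given x and current estimates.
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for i in range(n):
            if y[i] is None:
                ey = mu_y + beta * (x[i] - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = y[i], y[i] * y[i]
            sum_y += ey
            sum_yy += eyy
            sum_xy += x[i] * ey
        # M-step: re-estimate the parameters from the expected sufficient statistics.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy) + abs(new_s_xy - s_xy)
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return mu_x, mu_y, s_xx, s_xy, s_yy

# Synthetic data with true correlation 0.8; y is missing where x is in its top 40%.
rng = random.Random(7)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y_full = [0.8 * a + 0.6 * rng.gauss(0.0, 1.0) for a in x]
cutoff = sorted(x)[int(0.6 * len(x))]
y = [None if a >= cutoff else b for a, b in zip(x, y_full)]

mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate_normal(x, y)
r_em = s_xy / (s_xx * s_yy) ** 0.5
```

Because the missingness here depends only on the observed series `x`, the mechanism is MAR, and the EM estimate `r_em` should lie close to the correlation used to generate the data.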
Missing data patterns and missing data mechanisms should be considered when dealing with missing data. When using a technique to handle missing data, the missing data pattern must be checked to see which pattern the missingness belongs to. This can provide valuable insight into the data in cases when data are not missing at random or when a specific pattern is evident in the missing data. Missingness mechanisms can influence the techniques used to handle the missing data, thus the missingness mechanism must be tested if it is not known. When known, the choice of techniques to handle missing data can be made with more certainty.
Figure 2 indicates the extent of missing data with black vertical lines for two interest rates. The remainder of the data look similar, whether for interest or FX rates. Table 2 shows the extent of missing data as a percentage of total sample data. Note that interest rates have more missing data than FX rates. Continuously-compounded daily returns were used for the calculations of the correlation matrix. These returns were used as opposed to simple returns, because of the additive property of the continuously compounded daily returns. The model was constructed in R Studio.
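The additive property of continuously compounded returns can be checked numerically. The sketch below uses hypothetical closing prices and Python purely for illustration (the paper's model was built in R Studio); `log_returns` is a name introduced here.

```python
import math

def log_returns(prices):
    """Continuously compounded returns ln(P_t / P_{t-1}) for consecutive prices."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

prices = [100.0, 101.5, 99.8, 102.3]   # hypothetical closing prices
daily = log_returns(prices)

# Additivity: the daily log returns sum to the log return over the whole period.
period_return = math.log(prices[-1] / prices[0])
```

Simple returns do not share this property, since they compound multiplicatively rather than add.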

Methodology
The principal aim of this paper is to investigate, evaluate and test different techniques to deal with missing data and then apply these techniques to financial time series data with the ultimate aim of constructing correlation matrices (which are used for risk management, asset allocation, portfolio performance and regulatory capital calculations).
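The correlation technique used in this paper is the Pearson correlation, and correlations (Note 5) are calculated between FX rates and interest rate indices. The Pearson correlation coefficient is defined in Rice (2006) as
$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
where $x_i$ is the ith observation of variable $x$, $y_i$ is the ith observation of variable $y$, $\bar{x}$ is the sample mean of variable $x$, and $\bar{y}$ is the sample mean of variable $y$.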
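Because the coefficient is built from paired observations, any date on which either series is missing cannot contribute. A minimal sketch of the pairwise-complete calculation follows; the function name and the toy series are hypothetical, with `None` standing for a missing value.

```python
import math

def pearson_pairwise(x, y):
    """Pearson correlation over the dates on which both series are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    s_xx = sum((a - mean_x) ** 2 for a, _ in pairs)
    s_yy = sum((b - mean_y) ** 2 for _, b in pairs)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [1.0, 2.0, 3.0, None, 5.0]
y = [2.0, 4.0, 6.0, 8.0, None]
r = pearson_pairwise(x, y)   # uses only the first three dates
```

Only three of the five dates survive here, which illustrates how quickly deletion erodes the sample when gaps in different series fall on different dates.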
Pearson correlations are used to calculate market parameters used in Monte-Carlo (MC) simulations that simulate market scenarios. Market scenarios are used for calculating the potential future exposure (PFE) for counterparty risk and default risk. These risks can be used in fair value measurements as defined by IFRS (International Financial Reporting Standards) 13 or regulatory capital as stipulated by the Basel Committee. To calculate the aforementioned, correlation matrices are needed. This paper focuses on the correlation matrices and the construction thereof with infrequently observable time series data, i.e. time series data with missing data. The scope does not include the calculation of market scenarios and subsequent steps to calculate the PFE for counterparty and default risk.

FX Rates
FX rates (Note 6) belong to the most efficient and liquid segments of financial markets (Tichy, 2006). Because of this property, FX rates do not usually have problems with stale data. Brownian motion is a continuous time version of a random walk with steps being normally distributed random variables (Rice, 2007). FX rates are assumed to follow a geometric Brownian motion (GBM). If $S_t$ is an exchange rate between two currencies at time $t$, $S_t$ behaves like a geometric Brownian motion, i.e. it follows a stochastic differential equation (SDE) of the form
$$dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$$
where $S_t$ is an FX rate, $\mu$ and $\sigma$ are the drift and volatility, and $W_t$ is a Wiener process.

Interest Rates
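Interest rate indices (Note 7) are unsecured short-term borrowing rates between banks (Hull, 2012). The short rate $r_t$ is assumed to follow a mean-reverting SDE of the form
$$dr_t = \left(\theta(t) - a\,r_t\right)dt + \sigma\,dW_t$$
where $\theta(t)$ is a deterministic function of time, $a$ is the mean reversion speed, $\sigma$ is the volatility and $W_t$ is a standard Brownian motion under the risk neutral measure.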
This paper employs both FX and interest rate indices for testing different techniques to handle missing data. The aim is to test the techniques that will not change the underlying distribution of the data with regard to the correlation matrix of the returns. In other words, the technique whose correlation matrix is "nearest" to the correlation matrix calculated with complete data is considered the "best" technique.
The correlation matrices constructed from the techniques to handle missing data were compared with the matrix from the complete dataset, by taking the absolute difference between the correlation matrices constructed by using the techniques and the complete case matrix. To illustrate this, a simplified example is used. Assume a correlation matrix is calculated from a complete dataset. This correlation matrix is then reduced to an upper triangle, as shown in Table 3. Missing data are created manually in the dataset and then treated (there are various ways of doing this: imputation techniques and listwise deletion).
Table 3. Example correlation matrix from complete dataset (and upper triangle)
After the data have been treated with a technique to handle the missing data, a second correlation matrix is calculated (Table 4).
Table 4. Example correlation matrix from reconstructed data (and upper triangle)
The absolute difference between the two correlation matrices is then calculated, resulting in an upper triangle difference matrix. This matrix is an indication of "how far" away the two correlation matrices are from each other, as shown in Table 5.
Table 5. Difference matrix between Table 3 and Table 4
Assume another imputation technique was used to treat the manually-created missing data. After following the approach outlined above, the difference matrix as shown in Table 6 was produced.
Table 6. Difference matrix from another data-imputation process

The differences in the correlation matrices (in Tables 5 and 6) are now summed to obtain a single value. From Table 5, this value is 0.2 + 0.2 + 0.3 = 0.7. From Table 6 this value is 0.1 + 0.3 + 0.2 = 0.6. The difference matrix with the lowest value is then the "best" difference matrix. Thus, to compare different techniques to handle missing data the single values (summed differences) were calculated and the lowest values identified. The lowest values indicated which techniques are better than the others at handling the missing data. The aim of this paper is to ascertain which technique results in the best "difference matrix value" to handle missing data used for the calculation of correlation matrices.
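The summed-difference comparison can be sketched as follows. The three matrices are hypothetical and were chosen only so that the element-wise differences reproduce the values quoted above (0.2, 0.2, 0.3 and 0.1, 0.3, 0.2); the function names are likewise introduced here.

```python
def upper_triangle_abs_diff(a, b):
    """Absolute differences of the strictly upper-triangular entries of a and b."""
    n = len(a)
    return [abs(a[i][j] - b[i][j]) for i in range(n) for j in range(i + 1, n)]

def difference_score(a, b):
    """Single summed value used to rank a technique against the complete data."""
    return sum(upper_triangle_abs_diff(a, b))

# Hypothetical 3x3 correlation matrices.
complete = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
technique_a = [[1.0, 0.7, 0.5], [0.7, 1.0, 0.7], [0.5, 0.7, 1.0]]
technique_b = [[1.0, 0.6, 0.6], [0.6, 1.0, 0.6], [0.6, 0.6, 1.0]]

scores = {
    "technique_a": difference_score(complete, technique_a),   # 0.2 + 0.2 + 0.3
    "technique_b": difference_score(complete, technique_b),   # 0.1 + 0.3 + 0.2
}
best = min(scores, key=scores.get)
```

With these inputs the second technique has the lower summed difference and would be ranked as the better of the two.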
These models are then compared with each other using measures of imputation accuracy. The root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) were used to compare the different techniques to each other. All the above-mentioned comparison criteria are popular in the literature and may be used to compare and evaluate different techniques.
The RMSE is
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
where $\hat{y}_i$ is the predicted value, $y_i$ is the observed value, and $n$ is the number of observations.
The MAE is
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$
The MAPE is
$$\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
From these additional information criteria, the best techniques overall were chosen. To remove any bias, the tests were run over 1 000 simulations where missing data were generated, and the techniques used to handle the missing data were applied.
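The three accuracy measures are straightforward to compute. The sketch below uses the standard definitions, with the MAPE expressed in percent (an assumption about the scaling used in the paper); the sample values are hypothetical.

```python
import math

def rmse(predicted, observed):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def mae(predicted, observed):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / n

def mape(predicted, observed):
    """Mean absolute percentage error, in percent; observed values must be non-zero."""
    n = len(observed)
    return 100.0 * sum(abs((o - p) / o) for p, o in zip(predicted, observed)) / n

observed = [100.0, 200.0, 400.0]    # true values that were removed
predicted = [110.0, 190.0, 400.0]   # values produced by an imputation technique

errors = {
    "RMSE": rmse(predicted, observed),
    "MAE": mae(predicted, observed),
    "MAPE": mape(predicted, observed),
}
```

The MAPE is undefined when an observed value is zero, which matters for returns; applying it to levels rather than returns avoids that case.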
A schematic of the process thus followed is shown in Figure 3.
Figure 3. Simplified approach of model development and validation process

Imputation Techniques
Different imputation techniques were used to impute missing values. All methods were implemented, tested and compared with each other as well as with listwise deletion to ascertain which method imputed the best missing values used to construct the correlation matrices. A short description of each imputation technique follows:
Linear interpolation - the gap (missing data) between two points is replaced using a linear polynomial between the two known points.
Spline interpolation - a piecewise cubic (or Hermite) polynomial is fitted between the two known points to replace the missing data in between (second derivatives = 0 at endpoints) (Wagon, 2010).
Stineman interpolation - solves the non-monotonic problem of linear and spline interpolation. Each data point has a slope obtained by fitting a circle to that point and its neighbours and using the slope of the tangent to the circle. Given the two points with assigned slopes, a smooth function is obtained that connects the points and respects the slopes (Wagon, 2010). The Stineman method is not as smooth as the linear and spline methods (Stineman, 1980).
Mode imputation - replaces the missing data with the mode of the observed data.
Random imputation - replaces missing data with a randomly drawn value from the observed data.
Mean imputation - (also arithmetic mean or unconditional mean imputation) replaces missing data with the mean of the observed data. Mean imputation creates a complete dataset, but severely distorts the resulting parameter estimates, even if the data are MCAR (Enders, 2010). It also reduces the variability of the dataset and therefore will also affect the standard deviation and variance.
Median imputation - missing data are replaced by the median of the observed data.
Last observation carried forward - missing values are replaced with the previous observed value.
Next observation carried backward - missing values are replaced with the following observed value.
Moving average imputation - the moving average of the observed data replaces missing values.
Exponential weighted moving average - missing data are replaced with the exponential weighted moving average of the observed data with a specified weighting parameter.
Linear weighted moving average imputation - missing data are replaced with the linear weighted moving average of the observed data.
k-nearest neighbour imputation - a non-parametric learning algorithm used by Google to autocomplete Google searches. k-nearest neighbour imputation does not provide an exact match, but is a scenario-driven method in which different scenarios are compared with each other. The difference between scenarios is a certain uniqueness and these scenarios are the neighbours. New scenarios are compared with each scenario in the model and matched according to the closest neighbour to the case (Waqas et al., 2016). The k in the name is the number of neighbours each case comprises.
Random forest imputation - uses random forests to impute the missing data by running forests for many iterations.
Kalman imputation - uses Kalman smoothing on structural time series models or on the state space representation of an ARIMA model.
Listwise deletion - a process by which any record/observation or case is excluded from an analysis if any single observation in the record is missing (Graham, 2012). Listwise deletion discards all the (often substantial) information which exists in the partially-observed observations and which encodes the variables' relationships.
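Three of the simpler techniques listed above can be sketched in a few lines. The functions below are illustrative stand-ins written for this example, not the routines used in the paper; they assume equally spaced observations, mark missing values with `None`, and leave gaps at the start or end of a series unfilled where no neighbouring value exists.

```python
def locf(series):
    """Last observation carried forward; leading gaps stay missing."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def nocb(series):
    """Next observation carried backward; trailing gaps stay missing."""
    return locf(series[::-1])[::-1]

def linear_interpolate(series):
    """Fill interior gaps on the straight line between the neighbouring observations."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out

rates = [1.0, None, None, 4.0, None, 6.0]   # hypothetical series with gaps
filled_linear = linear_interpolate(rates)
filled_locf = locf(rates)
filled_nocb = nocb(rates)
```

On this toy series the linear fill follows the underlying trend, while the two carry methods produce flat segments, which is the same stale-data pattern the paper treats as missing.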
Correlation matrices arising from altered input data must be not only real symmetric, but also positive semidefinite. This is an absolute requirement: even if the new, estimated correlation matrices are believed to be econometrically reliable, they may not be mathematically feasible (Rebonato & Jäckel, 1999). Once missing data had been filled in the time series by whichever technique, correlation matrices were calculated and tested for real symmetry and positive semi-definiteness (non-negative eigenvalues) using the technique described in Rebonato and Jäckel (1999). All correlation matrices were found to satisfy the positive semi-definite requirement.
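The feasibility conditions can be checked directly from the eigenvalues. The sketch below is a plain symmetry and eigenvalue test using NumPy; it is not the Rebonato and Jäckel (1999) procedure itself, and the function name, tolerance and example matrices are assumptions made for illustration.

```python
import numpy as np

def is_valid_correlation_matrix(c, tol=1e-10):
    """True if c is symmetric, has a unit diagonal and no eigenvalue below -tol."""
    c = np.asarray(c, dtype=float)
    symmetric = np.allclose(c, c.T, atol=tol)
    unit_diagonal = np.allclose(np.diag(c), 1.0, atol=tol)
    # eigvalsh assumes a symmetric input, so symmetrise before calling it.
    eigenvalues = np.linalg.eigvalsh((c + c.T) / 2.0)
    return bool(symmetric and unit_diagonal and eigenvalues.min() >= -tol)

feasible = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
# Pairwise correlations that cannot hold simultaneously (negative determinant).
infeasible = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]

checks = (is_valid_correlation_matrix(feasible), is_valid_correlation_matrix(infeasible))
```

The second matrix shows why the test matters: each entry is a legitimate correlation on its own, yet the three together are mathematically inconsistent.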

FX Rates
FX rates are highly liquid and therefore do not have many stale data, nor many missing data.
Correlation matrices were compared with each other according to the approach detailed in Section 3. The techniques were first tested using time series with 0.5% and 1% missing data. These were created manually: empirical data were used, and random data points removed until the requisite level of missingness was reached.
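Creating a test series with a prescribed level of missingness can be sketched as below. The function `remove_at_random`, the seed and the placeholder series are assumptions for illustration; the paper's own removal procedure in R Studio may differ in detail.

```python
import random

def remove_at_random(series, fraction, seed=None):
    """Return a copy of series with `fraction` of its points replaced by None."""
    rng = random.Random(seed)
    n_remove = round(fraction * len(series))
    removed = set(rng.sample(range(len(series)), n_remove))
    return [None if i in removed else v for i, v in enumerate(series)]

complete_series = [float(i) for i in range(1000)]   # placeholder for empirical data
half_percent = remove_at_random(complete_series, 0.005, seed=1)
one_percent = remove_at_random(complete_series, 0.01, seed=1)
```

Repeating the call with different seeds gives the independent draws needed when the comparison is run over many simulations.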
Differences between the difference matrices using the Quandl dataset for each approach are shown in Figure 4. To standardise the results, the best technique (for both 0.5% and 1.0% missing data, this was the Stine interpolation) was rebased to 100. This allows a percentage comparison to be made with results obtained from other methods. The techniques that performed the worst were the mode, random, mean, and median imputation, which gave results more than 1 000% higher than those obtained for the Stine interpolation.
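The rebasing step is simple arithmetic, sketched here with made-up scores. The technique names and values are hypothetical and do not reproduce the paper's results.

```python
def rebase_to_best(scores):
    """Scale scores so that the lowest (best) score equals 100."""
    best = min(scores.values())
    return {name: 100.0 * value / best for name, value in scores.items()}

# Hypothetical summed correlation differences for three techniques.
raw_scores = {"stine": 0.020, "linear": 0.025, "mean": 0.300}
rebased = rebase_to_best(raw_scores)
```

On this scale a rebased value of 125 reads as 25% worse than the best technique, and a result more than 1 000% higher corresponds to a rebased value above 1 100.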
Figure 4. Comparing techniques using correlation matrix differences (for 0.5% and 1.0% missing data) for Quandl FX rates
Each technique was evaluated using other tests of imputation accuracy, namely the RMSE, the MAE and the MAPE when the quantity of missing data was 0.5% and 1.0% (still with the Quandl data set). The results, which are again presented such that the best performing technique is rebased to 100 (again, the Stine interpolation), appear in Figure 5. The moving average and random forest approaches fare the worst of all imputation methods at both levels of missing data. Using the Bloomberg dataset, differences between the difference matrices for each approach are shown in Figure 6. Results are again standardised with the best technique (for both 0.5% and 1.0% missing data, this was the Stine interpolation) rebased to 100. The techniques that performed the worst were again the mode, random, mean, and median imputation, which gave differences between difference matrices more than 1 500% higher than the results obtained for the Stine interpolation.
Figure 6. Comparing techniques using correlation matrix differences (for 0.5% and 1.0% missing data) for Bloomberg FX rates
Each technique was then evaluated using the imputation accuracy tests when the quantity of missing data was 0.5% and 1.0% (using the Bloomberg data set). The results appear in Figure 7, again with the best performing technique rebased to 100 (again, the Stine interpolation). The random forest approach fares the worst of all approaches at both levels of missing data.

Interest Rates
The Bloomberg dataset was used for testing the different techniques to handle missing data present in interest rates. Interest rates are generally not highly liquid and are characterised by many stale and missing data (in the range 2% to 45%). Because of this, higher thresholds of missing (or stale) data were used, namely 2%, 15%, 30% and 45% missing data. These levels correspond to observed missing data percentages.
The three most liquid interest rates (3m EURIBOR, 3m GBP LIBOR and 3m USD LIBOR) were used as the benchmark to test the imputation techniques: these also had the least missing data.
Differences between the difference matrices for each approach for interest rate data are shown in Figure 8. To standardise the results, the best technique (for all levels of missing data, this was the Stine interpolation) was rebased to 100. The techniques that performed the worst were (again) the mode, random, mean, and median imputation, which gave results more than 1 000% higher than those obtained for the Stine interpolation.
Figure 8. Comparing techniques using correlation matrix differences (for 2%, 15%, 30% and 45% missing data) for Bloomberg IR rates
The better techniques were then selected and evaluated using other tests of imputation accuracy, namely the RMSE, the MAE and the MAPE when the quantity of missing data was 2%, 15%, 30% and 45%. Results, which are again presented such that the best performing technique is rebased to 100 (again, the Stine interpolation), appear in Figure 9. The random forest method again fares the worst of all imputation methods at all levels of missing data.

Conclusion
The results provide a clear indication of the best and worst methods for missing data imputation and may be used for further research on different techniques and methods.There are, however, novel issues which have been set aside for future research.
Techniques to handle missing data were suggested and implemented. These techniques were used to construct correlation matrices to see which techniques created the correlation matrix that was the most accurate when compared to the correlation matrix of a complete dataset. For FX and interest rate data, the mean, median, mode, random and listwise deletion methods were the worst performing methods.
From the RMSE, MAE and the MAPE results, the Stine and linear interpolation methods are the best techniques to deal with missing data in interest rate indices. Spline interpolation and the Exponentially Weighted Moving Average methods may also be used as both these give good results. The worst performing methods are, across all levels of missingness, the random, mean, median and mode imputation approaches. These should be avoided. The flexibility of multiple imputation methods provides a considerable advantage over methods that are reliant on the underlying data distribution. Although these methods are difficult to implement, using outputs and comparing these to univariate imputation methods can be of substantial interest.
Imputation methods such as random forest and the k-nearest neighbour use parameters inputted by the user. These parameters have not been tested for optimality, nor calibrated, in this work. Future work could establish such calibration and optimality. The parameters obtained could also be calibrated to determine which techniques are the best for use with FX, interest rates, and other time series data.
The techniques that performed the best to handle missing data in FX rate data were the Stine interpolation, linear interpolation and the Kalman method using a structural model.The techniques that performed the best to handle missing data in interest rate index data were Stine interpolation, linear interpolation and, to a lesser extent, the Exponentially Weighted Moving Average and the spline interpolation imputation methods.

Figure 2. (a) UK 3m LIBOR and (b) 3m JIBAR interest rates, showing missing data as vertical lines

Figure 9. Comparing techniques (a) 2% (b) 15%, (c) 30% and (d) 45% missing data

Two datasets were used for the implementation of the different techniques to handle missing data, one from Quandl (only FX data) and one from Bloomberg (both FX and interest rate data). Each dataset differs in terms of the number of observations, the number of missing data and variables. The two datasets, which comprised daily data for 10.5 years (Jan 06 - Jun 17) for the Bloomberg dataset and daily data for six years (Jan 2011 - Dec 2016) for the Quandl dataset, were used for implementing and comparing the different techniques to handle missing data. These indices comprise 2 947 incomplete observations because of missing and stale data. The London Interbank Offered Rate (LIBOR) and Johannesburg Interbank Average Rate (JIBAR) are indexed rates that are used as benchmarks for determining interest rates. The data appear in Table 1 below.

Table 2. Missing or stale data as a percentage of total sample data. The percentages for the same rates differ because of the different sample times