A New Method to Detect Outliers in High-frequency Time Series

The objective of this research is to develop a fast, simple method for detecting and replacing extreme spikes in highfrequency time series data. The method primarily consists of a nonparametric procedure that pursues a balance between fidelity to observed data and smoothness. Furthermore, through examination of the absolute difference between original and smoothed values, the technique is also able to detect and, where necessary, replace outliers with less extreme data. Unlike other filtering procedures found in the literature, our method does not require a model to be specified for the data. Additionally, the filter makes only a single pass through the time series. Experiments show that the new method can be validly used as a data preparation tool to ensure that time series modeling is supported by clean data, particularly in a complex context such as one with high-frequency data.


Introduction
An important topic in time series analysis is how to deal with data that consist of on-the-minute, hourly, daily or weekly observations.Among the reasons for this interest is that high-frequency time series inevitably show unexpected spikes (peaks and troughs) that appear to be grossly inconsistent with neighboring values.Since occasional large disturbances may have serious consequences for model identification and parameter estimation in time series, it is important to attenuate their adverse effects before the data are used.This paper presents a new filter intended to remove or reduce potentially troublesome behavior in a time series, even though, in the preliminary stage, we ignore the specific model that is eventually to be applied to the data.
Let p t ≥ 0 be the observed values at period t and n be the length of the time series p t , t = 1, 2, • • • , n.We assume that, for each point of time, p t is given by where a t is a random variable with zero mean and finite variance σ 2 a .The values p 1 , p 2 , • • • , p n lie on a function (the reference curve) that should be flexible enough to represent a wide range of curvatures, but that should also be as smooth as possible.In this article, we detect extreme spikes (or outliers) by examining the absolute difference between observed values and the corresponding point in the reference curve.Therefore, peaks or troughs that are too high are considered anomalous and become candidates for an appropriate statistical treatment.However, only one pass is made through the data because of the length of the time series.The reference curve is obtained by solving the following problem: given a real λ > 0 and a positive integer m, find the values of p = ( p 1 , • • • , p n ) that minimize the convex combination: with where ∇ denotes the difference ∇ p t = p t − p t−1 .The function (2) has two terms: goodness of fit and smoothness.F ( p ) measures fidelity to the data in terms of the squared deviations between smoothed and observed values.In particular, F m is the maximum of F( p), which occurs when all m-th differences are equal to zero.In this case, the reference curve is determined by fitting to p a polynomial of degree (m−1) by the least squares.For example, if m = 1 then F m = n • var(p), where var(p) is the variance of the observed values.The term S ( p ) expresses the smoothness as the sum of squares of m-th differences between smoothed values.The constant S m is the maximum of S ( p), which occurs when p t = p t , ∀t, implying that S m = ∑ n t=m+1 (∇ m p t ) 2 .The constants F m and S m re-scale Q( p, m, λ) to the [0, 1] interval, so that smoothness and goodness-of-fit are balanced consistently.In short, the normalized linear filter (2)-( 3) can be considered a variant of the Whittaker-Henderson graduation.See Knorr (1984).Central to (2) is the trade-off between F( p), which relates to the goodness-of-fit and S ( p), relating to the smoothness.The equilibrium between goodness of fit and smoothness is achieved by a reasoned choice of the smoothing constant λ.If λ → 0, then the dominant component will be the squared Euclidean norm ∥ p−p∥ 2 and p will increasingly resemble the original values, no matter how irregular p may be.As λ → 1, smoothed prices approach the polynomial p t = ∑ m−1 j=0 b j t j t = 1, 2, • • • , n regardless of the goodness-of-fit component.A very simple choice is λ = 0.5, which implies that fidelity and smoothness are equally balanced.Apart from these three cases, the solution of ( 2) is a serious concern because the degree to which we are justified in sacrificing fidelity in order to obtain smoothness varies greatly from one problem to another.See Whittaker (1923).The paper is organized as follows: in the next section, we present the normalized linear filter (NLF) together with computation of the thresholds beyond which outliers are detected.The effectiveness of the proposed method is assessed in section 3 by comparing the results of SARIMA models fitted to original time series with those of the same models fitted to filtered time series.The final section discusses our findings and points out some improvements for further applications.

Optimal Smoothing
The smoothness component of the NLF smoother can be rewritten as where D m is an m-th differencing matrix with (n − m) rows and n columns The matrix D m transforms a column vector into the column of m-th differences of the elements in thew vector.A typical row of D m contains n − (m + 1) zeros and the 1 × (m + 1) vector starting from column t and ending with column t + m + 1.The elements d m are the successive binomial coefficients of order m with alternating signs The nonzero elements of D m form a diagonal band from the upper left to the lower right.Moreover, thanks to D m , (2) can be presented as To minimize Q( p, m, λ), its derivatives with respect to the p have to be equated to zero which leads to [ ] where I n is the (n × n) identity matrix of order n.The second order condition for a minimum of ( 7) is It is easy to show that the matrix A = is a symmetrical and positive definite and therefore p = A −1 p.The computation of p is simple as long as the scheme described above is applied to short time series, but the solution appears problematic for long time series.However, there are fewer difficulties than at first appear.Indeed, several authors, have devised very efficient computing software by exploiting the characteristics of the matrices involved.See, for example, Garcia (2010) and Cornea-Madeira (2017).

Choice of the Smoothing Constant
There are various techniques for choosing the smoothing constant.For example, Brooks et al. (1988) applied the general-ized cross-validation (GCV) score suggested in Golub & Wahba (1979).Guerrero (2008) outlines a new formalization of the concept of trend to choose the smoothing constant.Frasso & Eilers (2015) proposes the use of two curves for finding the appropriate value of λ automatically.The problem that both methods have is that the optimal value leads to substantial under (over)-smoothing and the resulting long-term pattern tends to have too many (or too few) wiggles that often show up in the wrong places.Other techniques, such as Morozov's discrepancy principle (see, for example, O'Leary ( 2001)) often work well, but produce quite unexpected results when their outcome is inaccurate.These considerations lead us to a direct calculation of the smoothing constant.As in Knorr (1984), our point of departure is an appealing analogy between λ ∈ [0, 1] and the confidence level of a statistical test where 0.50 or 0.60 are of little interest, but levels such as 0.95 or 0.99 may be important in terms of identifying the appropriate result of a test.To achieve a reasonable degree of smoothness, the recommended constant is the weighted average: The rationale is that when a time series behaves like a polynomial of degree (m−1), then F m is near its maximum (and hence S m → 0), we need less smoothness; when the reference values are very similar to the observations (that is, F m ≈ 0), we need more smoothness.Therefore, strategy (10) can provide adequate smoothness without producing unnecessarily large deviations from the observed values.At first glance, the interval [0.95, 1] may appear to be narrow, but it is not, because of the extreme reactivity of the NLF toward λ.

Detection of Extreme Spikes
Let a t = p t − p t , t = 1, 2, • • • , n be the absolute difference between observed values and points on the reference curve.We look for deviations a t that fall outside the following range where μ is a measure of central tendency, σ is a measure of scale and K is a positive multiplier.Since high-frequency time series often contain atypical values, the two statistics need to be robust to the presence of outliers.In this regard, in our experiments, we considered two well-known robust statistics.The measure of location is chosen to be the Sen rank weighted mean (Sen (1964)) where a (i) is the i-order statistic with 0 < j < (ν−1)/2.Our choice is j = 2.As a robust statistic of scale we use the first quartile of the sorted pair-wise absolute differences: where q = ( n 2 ) /4. See Rousseeuw & Croux (1993).If a t surpasses the warning limits in (11), then the corresponding value is considered an extreme spike.This, however, does not imply that the spike should automatically be eliminated.The presence of sharp peaks and/or narrow valleys is a rule rather than an exception in high-frequency time series.If too many of them are deleted and/or imputed, by using an average of the remaining data for example, then prediction modes may be applied to an unrealistic time series.It would be better to down-weight dubious observations rather than reject them.We recommend replacing a suspect spike p t with a linear combination of observed and smoothed values In so doing, we preserve the peak or trough nature of the data point.In other words, we assume that the direction of the changes is compatible with the local behavior of the time series, but the magnitude is substantially larger than what is expected under standard conditions.The NLF has three parameters that need to be specified: K, m and γ.We note that the choice of m is not related to the degree of non-stationarity of a stochastic process or even to the representation of a trend by an algebraic curve.Rather, m is associated with the leveling necessary to correct the oscillations: the higher the degree of the polynomial in the smoothness component, the greater is the danger of getting misleading results when using smoothed values.The appropriate value of m, K and γ must be determined experimentally.

Segmentation
While NLF is a useful tool, it can be slow and inefficient if the time series is so long that the robust statistics of location and scale used in ( 12)-( 13) completely loose their representativeness with respect to the various "local behavior" of time data.A possible strategy that could be used is a breakdown of long time series into contiguous non-overlapping periods (or segments) and separate, indipendent application of the NLF filter to each segment.Let n b be the number of segments and let n c = ⌊n/n b ⌋ be the common length of the segments.The pairs of integers indicate the start and the end of each segment, respectively.Note that if n(u b n c ) < n, then n b is set equal to n. NLF can now be applied to each segment Clearly, the segments do not have common elements and together constitute the original time series.In many applications, the calculation of the number of segments and the determination of the boundaries is formulated as an optimization problem.There are various methods and systems that can be used to solve the problem of dividing a time series into periods of similar behavior.See, for example, Keogh et al. (2004).Given the variety of requirements that can be proposed in any segmentation procedure and the lack of specific esperiences we restrict our attention to a mere subdivision of the time series into a prefixed number of segments of equal size.The combination of optimal smoothing and extreme spikes detection described in this section has been implemented in an R script which is available from the authors upon request.

Monte Carlo Analysis
When a filter is applied to a time series, an obvious question arises: how effective is the filter?This section reports a simulation study on the performance of the NLF detection/correction method in a specific example of high-frequency time series (hourly electricity prices) analyzed in the framework of seasonal ARIMA models.Accuracy is assessed relative to the number and size of observations being classified, correctly or wrongly, as outliers.See Janczura et al. (2013).

SARIMA Models
We analyze hourly time series of electricity spot prices from 1am on Friday, 1 January 2016 to 12 pm on Sunday, 31 December 2017, one for each macro-region of the Italian electricity market GME, 2018.Every time series is n = 17544 hours long.The dominant seasonality is s = 24.For each time series we will identify and estimate a SARIMA(p, where a t , t = 1, 2, • • • , are mutually uncorrelated random variables with zero mean and finite variance σ 2 a .The symbol B denotes the backward shift operator and ϕ * (B) and θ * (B) are polynomials where p * = p + sP, q * = q + sQ are the orders of the AR and MA polynomials, respectively.If all the roots of ϕ * (B) are greater than one in absolute value, then the process is stationary.What is more, if all roots of θ * (B) are greater than one in modulus, with no single root common to both polynomials, the process is invertible.We do not expect to obtain accurate results in terms of fitting ability.One reason for this is that , although hourly electricity prices exhibit numerous seasonalities ranging from daily to weekly to monthly, our study only considered s = 24.Moreover, we have ignored many factors that could act as external regressors: holiday effects, temperatures, alternative energy sources and heteroskedasticity.We also ignored the interconnections used for managing possible congestion occurring in the electricity market.Nonetheless, we think that, for the purposes of the present work, it is sufficient just to achieve an acceptable fit to the observed values.We have studied six time series p t, j , t = 1, 2, • • • , n; N; j = 1, 2, • • • , 6, one for each zone of the Italian electricity market, by using the auto.arimafunction of the R package f orecast (Hyndman (2015)), which chooses whether to include autoregressive and moving average components (and how many terms to include for each one) by using the AICc criterion.In particular, we use 0 ≤ p, d, q, P, D, Q ≤ 2 which include 729 distinct processes to be explored for each time series.The search for the best model is carried out in a non-stepwise automatic mode with parameters that are constrained to be stationary.The models reported in Table 1 satisfy the suggested criterion.

Effects of Smoothing on Point Forecast Accuracy
In order to assess the effects of the normalized linear filter (NLF), we compare some accuracy measures before and after the filter usage.In this regard, the time series analyzed in Table 1 are considered to be the "training" period of our analysis.
To this, we add the days from Monday, 1 January 2018 to Friday, 5 January 2018 inclusive (120 hours), which acts as the "validation" period.
The point forecast p n,t, j at origin n and lead time t of the j-th time series is obtained by identifying and estimating a SARIMA process for the training period.The search for the best model is conducted as described in the preceding paragraph.Forecast errors are obtained from the difference between actual values in the validation period and the corresponding forecast produced using the values in the training period: e n,t, j = p n+t, j − p n,t, j , t = 1, 2, • • • , L where L = 120 is the forecast horizon To evaluate the specific impact of NLF, we use the coefficient proposed by Hyndman & Koehler (2006) As the numerator and denominator both involve values on the scale of the original data, the quantity g is independent of the scale of the data.
For ease of comparison, we also report the more common forecasting criteria: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE).See, for example, Samir & White ( 2017) Table 2 shows the results for a filter with the following parameter' settings : m = 2, γ = 0.25, K = 5.25, n b = 4.It can immediately be noticed that the smoothing brings about an improvement in forecasting results in all of the six zones.This is confirmed if one looks at the values of all four accuracy measures before and after the use of the NLF method.
The coefficients after smoothing are always lower than the same coefficients computed before smoothing.Although the findings in Table 2 do not appear striking, the fact that the progress, although small, is obtained at a small computational cost should not be ignored.

Simulation Design
The models in Table 1 are used to generate N = 250 time series In this regard, we used simulate.Arima of the R package f orecast (Hyndman, 2015).Successively, the time series are intentionally corrupted with gamma distributed noise, that is, the original data points are replaced with simulated outliers.
In view of the poor fitting, simulated time series may be affected by two types of systematic error.First, some of the p t,i, j may be negative, so contradicting the standard assumption in models of electricity prices.Second, although generated by a stationary Gaussian process, unwanted spikes are always possible in a very long time series.A necessary consequence is the appearance of "endogenous" spikes, which usually give rise to false positive errors.The simulation design must cancel or attenuate both forms of bias.Our strategy is to do as follows: • 0. Simulate a time series p i, j from one of the models in Table 1.Set t = 1 and ν a = 0, where ν a is the number of artificial outliers.• 1. Remove zero values.Let n z be the number of simulated values less than or equal to zero and let P z be the corresponding time points.Set p t∈P z ,i, j = p t∈P z ,i, j .Hence, negative values are replaced with the corresponding observed values.If, however, the negative values are more than 5 % of the total length, then reject the time series and return to point 0. • 2. Modify "endogenous" outliers.Let Q θ 1 < Q θ 2 be the quantiles of p i, j defining the thresholds outside which simulated values may be confused with "exogenous" outliers, but will not be taken into account for accuracy.Let P L and P U be the observations in p i, j less than Q θ 1 and greater than Q θ 2 , respectively.Change p t∈P L ,i, j = (1 − u 1 ) µ i, j and p t∈P U ,i, j = (1 + u 2 ) µ i, j where u 1 and u 2 are random numbers in the [0, 0.25] interval and µ i, j = E( p i, j ).Values which are potentially too low (too high) are replaced with random values near to, but lower than (but greater than) µ i, j .• 3. Insert an outlier.Set t 1 = 1 + (t − 1), t 2 = r + (t − 1).Compute the mean µ i 1 :i 2 and the standard deviation The positive constant η, in practice, governs the range of values entitled to become outliers.• 4. If L t < p t,i, j < U t then p t,i, j it is not a good candidate because the risk of generating a non-detectable outlier is too high.Increase t by one and repeat step 3, provided that t ≤ (n − r + 1).Otherwise stop.• 5. Generate a random number u in the [0, 1] interval.If u ≥ τ then increase t by one and go to step 3.The constant τ controls the rarity of outliers.• 6. Generate g t from a gamma(a, β) distribution probability, where a = α µ i, j , E(g t ) = (α/β) µ i, j , var(g t ) = (α/β 2 ) µ i, j .
If p t,i, j < µ i, j , then set p * t,i, j = p t,i, j −g t , otherwise set p * t,i, j = p t,i, j +g t .Increase t and ν by one and return to step 3. Figure 1 shows an example simulated from the first model in Table 1.One aim of the simulation design is to keep error rates as low as possible, while remaining realistic.Despite the efforts made, not all the simulated spikes can be considered "extreme" in a meaningful sense and, thus, are likely to induce false negative errors.Additionally, false-positive errors remain a possibility when adhering strictly to the simulated time series.The occurrence of both false positive and false negative errors leads to incorrect decisions that could represent a limitation to the present work.

Performance of the Normalized Linear Filter (NLF)
The ability of NLF to identify outliers can be assessed by comparing the number of artificial outliers ν a inserted into the time series with the number of outliers identified by the method ν d .More particularly, we refer to the 2×2 table of the decisions taken A valid anomaly detection technique maximizes decisions of type A while, at the same time, keeping decisions of the types B and C at the lowest levels possible.Obvious measures of performance are The frequency A of values correctly considered as extreme spikes is central for both the coefficients.The frequency D of non-outliers, not detected as outliers is not included because outliers, by nature, are rare and, consequently, D is much larger than A, BorC and its involvement would give a distorted picture of the degree of success.Coefficients ( 21) are plausible indices of performance, but have an evident drawback: they are not symmetrical with respect to B and C. It is therefore reasonable to choose some kind of mean of C 1 and C 2 .We apply the harmonic mean of C 1 and C 2 , known as the "coincidence index" (Dice, 1945).
We have 0 ≤ C 3 ≤ 1 where 0 implies that no outliers are detected and 1 indicates that all, and only, the outliers are detected.The larger the coefficient becomes, the greater the effectiveness of the detection method is.We have evaluated ( 22) for all the simulation runs (250 time series of 17, 544 time points for each zone).The parameter' setting of the NLF method is the same as in paragraph 3.2.In Table 3 we report mean and standard deviations of ν s , ν d , C 1 , C 2 and C 3 (averaged over the repetitions).The column headed ∆% refers to the relative variation ∆% = (ν a − ν d )100/ν a of the true/detected outliers.With an initial general examination, we note the consistent behavior of the mean value of the coincidence index C 3 .As expected, C 3 (as well as C 1 and C 2 ) increases as τ, the frequency of the simulated outliers, increases.Practically the same conclusions apply to the standard deviations of all of the coefficients.Regarding the number of simulated and detected extremes, we observe that the latter is slightly, but systematically higher than the former.This is presumably due to the parameter setting, which is the same throughout the SARIMA models whereas a more differentiated approach seems warranted.Coefficient C 2 is known as sensitivity, that is, the probability the proposed method will discover a real outlier if there is one.In all the repetitions, the average value of C 2 is around 99% (with a negligible standard deviation), so indicating that there are few false negatives.Coefficient C 1 is known as the positive predictive value, that is, the probability that a detected outlier is indeed an extreme spike.Table 3 shows that the average values of C 1 are relatively low (but remain at acceptable levels) only for τ = 0.10, that is, in the case of time series that are contaminated by sporadic outliers.The findings in Table 3 show that the NLF method is capable of great accuracy in detecting extreme spikes in high-frequency time series.The values reached by the Dice coefficient C 3 are particularly remarkable and, hence, we can state that when NLF classifies an observed value as an outlier, it does so with a high degree of reliability.

Conclusions and Future Research
The filtering of high-frequency time series removes noise introduced by anomalous conditions, which are mostly selfexplanatory and might not contribute significantly to modeling and forecasting.The NLF method described in this paper is a fast, easily applicable and versatile pre-processing treatment of sequences affected by a spike phenomenon.Experi- mental findings show that the proposed methods can efficiently clean long time series and improve their quality.There are currently several ways in which a time series can be smoothed and filtered.One example is the Savitzky-Golay method, which is based on local least-squares polynomial approximation.See Barak (1995) and Shafer (2011) for more details.Another example is the moving weighted average discussed by Borgan (1979) in which each observation consists of a value determined by a local polynomial plus a random error term that satisfies the usual constant variance and uncorrelated assumptions of linear statistical models.A further possibility is the de-noising technique based on wavelets transform (Weron, 2006 [section 2.4.8]).Weron & Zator (2015) also show the validity of a smoother based on the Hodrick-Prescott filter.Finally, there is the tsoutliers function(of the R package forecast) for the automatic detection and replacement of outliers.We have not compared the aforementioned techniques with the NLS method.This is because of both the lack of software tools to assist application of these methods and the strict dependence of their algorithm on certain parameter settings, which have to be supplied by the user according to the desired strategy and which cannot be generalized beyond their own area of specialization.We plan in the future to compare our results with those obtained through the other techniques looking at a careful design of the experiment, which precludes one method from being preferred merely because data used for the comparison are more in accordance with the theory on which the method is based.

Table 1 .
Best SARIMA models for electricity zonal prices in Italy.

Table 2 .
Accuracy of point forecasts before and after smoothing.

Table 3 .
Comparison of average performance measures over 250 runs.