An Investigation towards the Suitability of Vector Autoregressive Approach on Modeling Meteorological Data

In most meteorological problems, two or more variables evolve over time. These variables are not only related to each other but also depend on each other. In many situations, however, the interest has been in modelling a single variable without considering the impact that other variables have on it, rather than in modelling the variables jointly as a vector time series. The vector autoregression (VAR) approach to multiple time series analysis is potentially useful in many situations that involve building models for discrete multivariate time series. The approach has four important stages: data pre-processing, model identification, parameter estimation, and model adequacy checking. In this research, the VAR modeling strategy was applied to three meteorological variables: temperature, wind speed and rainfall. All data are monthly, taken from the Kuala Krai station from January 1985 to December 2009. Two models were suggested by the information criteria; however, the VAR(3) model is the most suitable for the data sets based on model adequacy checking and accuracy testing.


Introduction
Climate change has been a global issue and has always been one of the most imperative topics in water resources. Modelling and forecasting weather parameters such as precipitation, temperature, wind speed and relative humidity could be practically useful in risk management, water resource management and decision making on climate change. These variables have undeniable effects on the hydrological cycle, agriculture and the environment. Modelling these physical processes deterministically may become a very challenging task due to the complexity of natural systems. As an alternative, stochastic models are used. Stochastic models have long been applied to these climatological variables around the globe. However, most of the literature deals with a single variable; for example, studies on precipitation can be seen in (Gil-Alana, 2012; Valdez-Cepeda et al., 2012; Ibrahim & Fadhilah, 2013; etc.). Studies on temperature can be seen in (Smith, 1993; Fraedrich & Blender, 2003; etc.). Studies on relative humidity can be seen in (Shiri et al., 2011; Jäntschi, 2011; Jamiyansharav, 2011; etc.).
In various time series problems, two or more random variables evolve over time. These variables are not only related to each other but are also dependent. Generally, if variables are empirically dependent, then multivariate models should be considered. Although in many situations we are interested in modelling and predicting only one variable, there is still a need to consider all of these variables as a vector time series (Li and Genton, 2009). For example, the fluctuating nature of precipitation as a result of anthropogenic climate change has been a subject of significant interest in recent modelling frameworks (e.g. Wong et al., 2009; Wan Zawiah, 2012 and many others). Although many uncertainties remain, it is generally agreed that as temperatures increase, the intensity of heavy precipitation events will also increase (Meehl et al., 2007).
The vector autoregression (VAR) model, introduced by Sims (1980), is a technique that can be used to capture the linear interdependencies among multiple time series as well as to characterize the joint dynamic behavior of a collection of variables without requiring strong restrictions of the kind needed to identify underlying structural parameters. It has become a prevalent method of time-series modeling. VAR models generalize the univariate autoregression (AR) model by allowing for more than one evolving variable. All variables in a VAR are treated symmetrically in a structural sense (although the estimated quantitative response coefficients will not in general be the same); each variable has an equation explaining its evolution based on its own lags and the lags of the other model variables. VAR modeling does not require as much knowledge about the forces influencing a variable as do structural models with simultaneous equations: the only prior knowledge required is a list of variables that can be hypothesized to affect each other intertemporally.
The main advantage of the VAR is that there is no need to specify which variables are endogenous and which are explanatory, because in the VAR all selected variables are treated as endogenous. That is, each variable depends on the lagged values of all selected variables, which helps in capturing the complex dynamic properties of the data (Brooks, 2002).
However, Engle and Granger (1987) suggest that if a time-series system under study includes variables integrated of order 1 that satisfy the conditions for cointegration, then the system is more appropriately specified as a vector error-correction model (VECM) rather than a VAR. These modeling strategies have received little application to meteorological data sets, particularly to Malaysian meteorological phenomena. Therefore, the aim of the present work is to investigate the suitability of the vector autoregressive method for modeling meteorological data sets of Kuala Krai in the northeast of Malaysia.

Methodology
A description of the data used and a brief overview of the methodology implemented in this research work are presented. The theoretical model, which serves as the basic framework of our analysis, is the vector autoregressive model of order p, VAR(p). The overall methodology is summarized in the framework below.

Study Area and Data Collection
The data used were taken from the Kuala Krai station in the center of the State of Kelantan, in the northeast of Peninsular Malaysia. The Kuala Krai station is located at latitude 5° 32' N and longitude 102° 12' E. The land is hilly and was once an area of tropical rain forest. Kuala Krai lies at the meeting point of two main rivers that form Sungai Kelantan, which flows to the estuary in the South China Sea near the state capital of Kota Bharu (Ababa, 2012). Kuala Krai is influenced by an extreme monsoon; the average annual temperature is 26.8 °C and the average annual rainfall is 2713 mm. The daily meteorological data were collected from Jabatan Meteorologi Malaysia and contained 24-hour mean temperature (°C), maximum wind speed (m/s) and rainfall (mm) from 1985 to 2009.

Lag Selection
The information criteria of Akaike, Schwarz and Hannan-Quinn were used to determine the lag length p of the VAR (Misztal, 2010). In the definitions of these criteria, û_t(p) denotes the estimated residuals of the AR(p) process, while m is the number of estimated parameters.
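For reference, one common textbook form of the three criteria is sketched below. This form is an assumption based on standard references and is not necessarily the exact expression used in this study; T, the number of observations, is a symbol introduced here for illustration.

```latex
\begin{align}
\mathrm{AIC}(p) &= \ln\!\left(\frac{1}{T}\sum_{t=1}^{T}\hat{u}_t(p)^2\right) + \frac{2m}{T},\\
\mathrm{SC}(p)  &= \ln\!\left(\frac{1}{T}\sum_{t=1}^{T}\hat{u}_t(p)^2\right) + \frac{m\,\ln T}{T},\\
\mathrm{HQ}(p)  &= \ln\!\left(\frac{1}{T}\sum_{t=1}^{T}\hat{u}_t(p)^2\right) + \frac{2m\,\ln(\ln T)}{T}.
\end{align}
```

In each case the first term rewards a good fit and the second penalizes the number of parameters; the lag p with the smallest value of the criterion is selected.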

Unit Root
In this study, we focused on three unit root tests. The decision rule for each test is as follows.
If p-value > significance level 0.05, then do not reject the null hypothesis.
If p-value < significance level 0.05, then reject the null hypothesis.
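The decision rule can be sketched as a small helper. The function name and return labels are hypothetical; the only facts relied on are that the ADF and PP tests take a unit root as the null hypothesis, while the KPSS test takes stationarity as the null.

```python
def interpret_unit_root_test(test_name, p_value, alpha=0.05):
    """Translate a unit root test p-value into 'stationary' or 'unit root'.

    Hypothetical helper for illustration only. Not rejecting a null
    hypothesis is reported here as the null's label for brevity.
    """
    name = test_name.upper()
    reject_null = p_value < alpha
    if name in ("ADF", "PP"):
        # Null hypothesis: the series has a unit root.
        return "stationary" if reject_null else "unit root"
    if name == "KPSS":
        # Null hypothesis: the series is stationary.
        return "unit root" if reject_null else "stationary"
    raise ValueError(f"unknown test: {test_name}")


print(interpret_unit_root_test("ADF", 0.01))
print(interpret_unit_root_test("KPSS", 0.01))
```

The same p-value therefore leads to opposite conclusions for ADF/PP and for KPSS, which is why the three tests are read together.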

Model Estimation
A VAR specification was used to model each variable as a function of all the lagged endogenous variables in the system. Johansen (1988) considered that the process y_t is defined by an unrestricted VAR system of order p.
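In standard notation, an unrestricted VAR of order p for a K-dimensional process can be written as follows. The symbols c, A_i, u_t and \Sigma_u are introduced here for illustration and are an assumption about notation, not a reproduction of the original equation; in this study K = 3 (temperature, wind speed and rainfall).

```latex
y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_p y_{t-p} + u_t,
\qquad \mathrm{E}(u_t) = 0, \quad \mathrm{E}(u_t u_t') = \Sigma_u ,
```

Here y_t is the K x 1 vector of observations at time t, c is a K x 1 vector of constants, each A_i is a K x K coefficient matrix, and u_t is a K x 1 vector of serially uncorrelated errors.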

Data Analysis
The data of this paper consist of monthly observations of meteorological variables in northern Malaysia for the period 1985-2009, namely mean temperature (°C), wind speed (m/s) and rainfall (mm). The summary statistics are presented and the time series plots and correlograms are plotted, followed by unit root tests, parameter estimation, model checking, a causality test and, lastly, the impulse response function. Table 3.1 presents the descriptive statistics of the meteorological variables; all variables have a positive mean. The standard deviations of temperature and wind speed were small compared with that of rainfall, indicating that these observations did not vary far from their means, whereas the standard deviation of rainfall was rather large, indicating the possibility of some outliers in the rainfall data set. The wind speed and rainfall distributions were positively skewed, while temperature was negatively skewed, although its skewness was still close to zero. In terms of kurtosis, the values for both temperature and wind speed were close to 3, indicating an approximately normal shape, while the rainfall data had a leptokurtic distribution. From the time series plots of the original data (Figure 3.1 (a)), we can see that the data follow a seasonal pattern and that all of the series fluctuate around their means. Although the data seemed to be stationary, seasonal differencing might be needed to remove the seasonal pattern. Figure 3.1 (b) shows the graphs of all variables after seasonal differencing.
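Seasonal differencing of monthly data subtracts from each observation the value observed 12 months earlier. A minimal sketch is given below; the function name and the example values are hypothetical and are not taken from the Kuala Krai data.

```python
def seasonal_difference(series, period=12):
    """Return the seasonally differenced series x[t] - x[t - period]."""
    if period <= 0:
        raise ValueError("period must be positive")
    if len(series) <= period:
        raise ValueError("series must be longer than the seasonal period")
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical monthly values: a constant level plus a fixed 12-month cycle.
cycle = [3, 5, 8, 6, 4, 2, 1, 2, 4, 7, 9, 6]
monthly = [25 + cycle[t % 12] for t in range(36)]

differenced = seasonal_difference(monthly)
print(len(differenced))  # 24: the first 12 observations are lost
print(set(differenced))  # {0}: a purely periodic pattern is removed entirely
```

The differenced series is shorter than the original by one seasonal period, which should be kept in mind when counting the observations available for estimation.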

Summary Statistics
To confirm that all three variables have a seasonal pattern, the autocorrelation function was computed. Figure 3.2 displays the seasonal pattern in the autocorrelation function, which repeats periodically every 12 months. Table 3.2 presents the p-values for all of the tests. At level, the p-values for the ADF and PP tests were 0.01, showing that the data for all three variables have no unit root. In the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, the p-values for temperature and wind speed indicate the existence of a unit root, while the rainfall data are stationary, since the p-value was more than 0.05.
After seasonal differencing, all p-values for the ADF and PP tests were less than the significance level and the p-values for the KPSS test were more than the significance level, which indicates that the series were stationary.
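The 12-month repetition in the autocorrelation function can be illustrated with a short sketch of the sample autocorrelation. The function name and the synthetic sine series are assumptions made for illustration; they are not the study's data.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Hypothetical series: 10 years of a pure 12-month cycle.
monthly = [math.sin(2 * math.pi * t / 12) for t in range(120)]
acf = sample_acf(monthly, max_lag=24)
print(round(acf[6], 2), round(acf[12], 2), round(acf[24], 2))
```

For a series with a 12-month cycle, the autocorrelation is strongly negative at lag 6 and peaks again near lags 12 and 24, which is the kind of periodic pattern that signals seasonality in a correlogram.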

Lag Length Selection
The lag order is one of the most important aspects of VAR modeling, because a different choice of lag order produces different results, which could lead to misleading interpretation. After removing the seasonal pattern, the correlograms of the seasonally adjusted meteorological series (Figure 3.3) suggested that, for instance, the first six lags of the temperature variable might be autocorrelated. Four criteria that measure the relative quality of a statistical model were computed in order to identify the appropriate lag order p: Akaike's Information Criterion (AIC), the Hannan-Quinn information criterion (HQC), the Schwarz Criterion (SC) and the Final Prediction Error (FPE). According to Table 3.3, AIC and FPE suggested an optimal lag length of p = 3, while HQC and SC suggested p = 1. We therefore decided to model the VAR process using both lag orders, p = 1 and p = 3, and to identify which model performs better by comparing the mean square error of the two models. After identifying the lag order, the VAR model, including the constant and trend, was estimated.
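The way the four criteria can disagree is sketched below using the standard multivariate textbook forms, in which each criterion combines the log-determinant of the residual covariance matrix with a penalty on the number of lagged coefficients. The function names, the sample size of 288 and the log-determinant values are all hypothetical; they are not the values behind Table 3.3, and software packages may count parameters differently.

```python
import math


def var_information_criteria(log_det_sigma, n_obs, n_vars, lag):
    """Textbook AIC, HQ, SC and FPE for a VAR(lag) with n_vars variables.

    log_det_sigma is the log-determinant of the residual covariance matrix.
    Only lagged coefficients are counted in the penalty (an assumption).
    """
    n_coefs = lag * n_vars ** 2
    ratio = (n_obs + n_vars * lag + 1) / (n_obs - n_vars * lag - 1)
    return {
        "AIC": log_det_sigma + 2.0 * n_coefs / n_obs,
        "HQ": log_det_sigma + 2.0 * math.log(math.log(n_obs)) * n_coefs / n_obs,
        "SC": log_det_sigma + math.log(n_obs) * n_coefs / n_obs,
        "FPE": ratio ** n_vars * math.exp(log_det_sigma),
    }


def select_lag(criteria_by_lag):
    """Map each criterion name to the lag with the smallest value."""
    names = list(next(iter(criteria_by_lag.values())).keys())
    return {
        name: min(criteria_by_lag, key=lambda lag: criteria_by_lag[lag][name])
        for name in names
    }


# Hypothetical log-determinants for lags 1 to 4.
log_dets = {1: 2.00, 2: 1.95, 3: 1.80, 4: 1.79}
table = {
    lag: var_information_criteria(ld, n_obs=288, n_vars=3, lag=lag)
    for lag, ld in log_dets.items()
}
print(select_lag(table))
```

With these made-up inputs, AIC and FPE select lag 3 while HQ and SC select lag 1. Disagreement of this kind is common because SC and HQ penalize additional parameters more heavily than AIC and FPE do.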
The parameter estimates of VAR(1) and VAR(3) are summarized in equations 3.1 and 3.2. Comparing the coefficients and the p-values of the two equations, there was not much difference between them. However, the p-values for the equations of all variables (temperature, wind speed and rainfall) were significant in both cases, so the null hypothesis is rejected, which indicates that the data series might not be suitable for VAR modeling. Nevertheless, the model checking carried out in the next step determines the suitability of VAR modeling further.

Johansen Cointegration Rank Test
The Johansen cointegration rank test was applied to check whether cointegration exists among the variables and to determine the cointegration rank. A VECM is applied instead of a VAR when a cointegration relationship exists between the variables. The results of the test are presented in Table 3.6. The null hypothesis of a given cointegration rank is rejected when the test statistic is greater than the critical value. From Table 3.6, all the test statistics are greater than the critical values, so the null hypothesis is rejected for every rank from 0 to 2. This full-rank outcome is what is expected when the series are themselves stationary, as the unit root tests indicated after seasonal differencing, so there is no cointegration relationship to be modelled with a VECM.
Model checking was needed after fitting the model. Table 3.7 shows the results for VAR(1) and VAR(3): the normality test (Jarque-Bera test), the autocorrelation test (Breusch-Godfrey LM test) and the ARCH test (ARCH-LM test). For VAR(1), the p-values for all tests were less than 0.05, indicating that the residuals of this model were not normally distributed, were autocorrelated and showed a heteroscedasticity effect. For the VAR(3) model, however, the results show that the residuals were not normally distributed but had no autocorrelation and no heteroscedasticity effect. When the residuals are autocorrelated and heteroscedastic, the estimated variances of the estimated coefficients are biased and inconsistent, and hypothesis testing is therefore no longer valid.
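The idea behind the Jarque-Bera normality test can be sketched for a single residual series: the statistic grows as the skewness departs from 0 and the kurtosis departs from 3. This univariate version is an assumption made for illustration; the test reported in Table 3.7 may be a multivariate variant, and the function name and example values are hypothetical.

```python
def jarque_bera(residuals):
    """Return (JB statistic, skewness, kurtosis) for one residual series."""
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    if m2 == 0:
        raise ValueError("residuals are constant")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, skewness, kurtosis


# Hypothetical residuals, symmetric around zero.
jb, skew, kurt = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(round(jb, 4), round(skew, 4), round(kurt, 4))
```

Under normality the statistic is asymptotically chi-square distributed with 2 degrees of freedom, so values above roughly 5.99 lead to rejection at the 5% level.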

Causality Test
The causality test is one of the advantages of working with a multivariate time series model; in other words, a causality test cannot be performed when analyzing a univariate time series model. The results of the Granger causality test are shown in Table 3.8. At the 95% confidence level, temperature and wind speed Granger-caused the other variables for VAR(1), but for VAR(3) temperature and wind speed did not Granger-cause the other variables. Rainfall, however, Granger-caused temperature and wind speed for both VAR(1) and VAR(3). Saying that rainfall Granger-causes temperature and wind speed means that temperature and wind speed can be predicted better using all three variables (temperature, wind speed and rainfall) than using temperature and wind speed alone.
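In terms of the VAR coefficients, Granger non-causality is a zero restriction on lagged terms. The sketch below writes out the equation for one variable in a three-variable VAR(p); the indices and coefficient symbols are assumptions introduced for illustration (for example, variable 1 as temperature and variable 3 as rainfall).

```latex
% Equation for variable 1 in a three-variable VAR(p)
y_{1,t} = c_1 + \sum_{i=1}^{p} \left( a_{11}^{(i)} y_{1,t-i}
        + a_{12}^{(i)} y_{2,t-i} + a_{13}^{(i)} y_{3,t-i} \right) + u_{1,t}

% Null hypothesis: variable 3 does not Granger-cause variable 1
H_0 :\; a_{13}^{(1)} = a_{13}^{(2)} = \cdots = a_{13}^{(p)} = 0
```

Rejecting H_0 means that lagged values of variable 3 improve the prediction of variable 1. Testing whether one variable Granger-causes all the others jointly imposes the same kind of restriction in each of the other equations.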

Impulse Response
The impulse response function (IRF) is used to investigate the impact of each variable within the system. Through the dynamic structure, the IRF shows the effect of a one standard deviation shock in the error term of one variable on the other endogenous variables (Gujarati, 2004). In this study, one might be interested in understanding how a sudden and unexpected change in one variable would affect another variable over a period of time.
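How a shock propagates through the system can be sketched for the simplest case, a VAR(1) with coefficient matrix A, where the response h periods after a shock is the matrix power A^h. This is a simplified illustration with unit, non-orthogonalized shocks rather than the one standard deviation orthogonalized shocks described above, and the coefficient values are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def var1_impulse_responses(coef, horizon):
    """Return [Phi_0, ..., Phi_horizon] with Phi_h = coef**h for a VAR(1)."""
    k = len(coef)
    phi = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    responses = [phi]
    for _ in range(horizon):
        phi = mat_mul(coef, phi)
        responses.append(phi)
    return responses


# Hypothetical coefficient matrix of a stable two-variable VAR(1).
A = [[0.5, 0.1],
     [0.0, 0.3]]
irf = var1_impulse_responses(A, horizon=10)

# irf[h][i][j] is the response of variable i, h periods after a unit shock
# to variable j.
for h in (0, 1, 2, 10):
    print(h, [[round(x, 4) for x in row] for row in irf[h]])
```

For a stable system the responses shrink towards zero as the horizon grows, which matches the pattern of impacts that start out significant and then die out.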
Figure 3.4. Impulse response function (panel title: Orthogonal Impulse Response from Rainfall)
Figure 3.4 shows the responses of the endogenous variables to a one standard deviation shock in temperature, wind speed and rainfall. All of the endogenous variables showed positive impacts on temperature and wind speed in the initial periods, after which the impacts became insignificant and slowly approached zero. Temperature showed a negative impact on rainfall in the first few periods, which slowly became insignificant and approached zero. Wind speed and rainfall initially showed a positive and significant impact on rainfall. Thereafter the effect remained zero (Pervez Zamurrad Janjuasamad et al., 2010).
From the results, we clearly see that VAR(3) was the best model, based on the smallest values obtained from any one of the four accuracy measures listed in Table 3.9.

Discussion and Conclusion
In this study, we presented VAR modeling of meteorological data consisting of three variables, namely temperature, wind speed and rainfall amount, collected from the Kuala Krai station from 1985 to 2009. Data plotting and unit root testing showed that the data were stationary with a seasonal pattern. Seasonal differencing was required to remove the seasonal pattern. The lag length, which determines the order of the VAR, was chosen by information criteria. For this data set, VAR(1) and VAR(3) were chosen to be estimated.
The estimation of VAR(1) and VAR(3) showed that most of the parameters were significant, since they had small p-values; a small p-value indicates that the null hypothesis is rejected.
The correlation of residuals between variables showed that temperature was negatively correlated with both wind speed and rainfall, whereas wind speed was positively correlated with rainfall.
Model checking was required to check whether the estimated model was adequate. The VAR(3) model showed no autocorrelation and no heteroscedasticity effects. VAR(1), however, showed autocorrelation and a heteroscedasticity effect, and the presence of heteroscedasticity can invalidate statistical tests of significance. Accuracy measures, namely ME, RMSE, MAE and MASE, were computed to find the model with the smallest error. VAR(3) gave the smallest error on all of the measures.
In conclusion, VAR modeling might not be suitable for these data sets, since the p-values for all equations are significant in rejecting the null hypothesis. However, the model checking verifies that VAR is suitable for these data sets. VAR(3) is more suitable than VAR(1) in terms of model checking and accuracy testing. Furthermore, the Johansen cointegration rank tests showed that there is no cointegration relationship among the variables, which leads to the conclusion that a VECM is not suitable for the data set. Future work will focus on other multivariate time series methods to find a more suitable model for these data sets.

Figure 3.1. (a) Time series plot for original data set; (b) time series plot after seasonal differencing

Figure 3.2. Autocorrelation function of variables

Table 3.1. Descriptive statistics of the variables

Testing for Unit Root
Three tests were used to test whether the time series data were stationary, namely the Augmented Dickey-Fuller (ADF), Phillips-Perron (PP) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests. For the ADF and PP tests, the null hypothesis is rejected (i.e., the data do not have a unit root) when the p-value is less than the significance level of 0.05. For the KPSS test, in contrast, rejecting the null hypothesis when the p-value is less than the significance level of 0.05 indicates that the data have a unit root.

Table 3
When comparing the two models, we looked for the smallest values of the measures listed in Table 3.9, namely the mean error (ME), root mean square error (RMSE), mean absolute error (MAE) and mean absolute scaled error (MASE), to identify which model gives better predictions.
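The four accuracy measures can be sketched as follows. The function name and example values are hypothetical, and scaling the MASE by the in-sample error of a naive one-step forecast is an assumption; a seasonal naive forecast is another common choice.

```python
import math


def accuracy_measures(actual, predicted, training):
    """Return ME, RMSE, MAE and MASE for a set of predictions.

    MASE divides the MAE by the mean absolute one-step change in the
    training series (the in-sample error of a naive forecast).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    if len(training) < 2:
        raise ValueError("training series needs at least two observations")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    me = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    naive_errors = [abs(training[t] - training[t - 1]) for t in range(1, len(training))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is constant; MASE is undefined")
    return {"ME": me, "RMSE": rmse, "MAE": mae, "MASE": mae / scale}


# Hypothetical values, not taken from Table 3.9.
result = accuracy_measures(
    actual=[3.0, 5.0], predicted=[2.0, 7.0], training=[1.0, 2.0, 4.0]
)
print(result)
```

ME can be close to zero even for a poor model because positive and negative errors cancel, so RMSE, MAE and MASE are usually read alongside it when ranking competing models.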