Forecasting a Mix of Temporal and Non-Temporal Economic Variables with a Mixture-of-Experts Neural Network

This study investigates a versatile forecasting technique using an integrated system of Artificial Neural Networks (ANN) and Genetic Algorithms (GA) in a mixture-of-experts architecture to solve a general economic forecasting problem involving a mix of temporal and non-temporal variables. Using Klein Model I as a context and previous estimations from traditional methods as benchmarks, the study provides evidence on the effectiveness and efficiency of this integrated system. The ANN helps overcome the imposition of assumptions on the behaviors of related variables, the specification of exact relationships, and the difficulty of nonlinear estimation of the economic model. The GA helps overcome the sub-optimality of the tedious trial-and-error process in network building. The flexibility of the mixture-of-experts network architecture offers many alternative configurations to capture the peculiarities of the variables in context before aggregating intermediate estimations into the final result. The integrated system has demonstrated its ability to process the mixture of economic variables effectively and to produce efficient estimations and forecasts.


Introduction
General business forecasting problems, particularly those dealing with socio-economic variables, usually involve many temporal and non-temporal interactions. Very often, the value of an economic variable is not only related to its predecessors in time but also to the current and past values of other variables. Hence macroeconomic models have to incorporate various interrelated variables in the economy.
Most econometric models have difficulty providing accurate estimates/forecasts due to the complexity of the economic system, the impossibility of validation with controlled experiments on the economy, and the existence of non-quantifiable factors in economic activities (Moody, 1995). Also, many assumptions are imposed on the behaviors of related variables in the modeling process (Cromwell et al., 1994). In addition, one may encounter estimation complexity when dealing with nonlinear models (Mills, 1991).
This study focuses on overcoming these constraints of traditional modeling and forecasting methods through the implementation of an integrated system of Artificial Neural Networks (ANN; Rosenblatt, 1959) and Genetic Algorithms (GA; Holland, 1975) in a modular network architecture (Jacobs et al., 1991a, 1991b). Such a mixture network system is able to handle effectively a general family of business forecasting problems, i.e., forecasting with a mixture of temporal and non-temporal variables, for which an econometric model provides a useful context. An ANN has been proven to be a universal function approximator (Funahashi, 1989; Cybenko, 1989; Hornik et al., 1989), learning nonlinear relationships inherent in the data without an a priori functional form or imposed assumptions on the behavior of the data. With a mixture-of-experts network architecture, one can also partition the problem space into domains and assign them to modular ANNs to learn the related peculiar patterns. A GA can explore a large number of alternatives in the problem space, specifically all possible ANN architectures, in order to avoid sub-optimality (Goldberg, 1989).
The paper is organized as follows. Section 2 reviews issues in forecasting a mixture of temporal and non-temporal variables present in an economic system, with Klein Model I serving as an example. Section 3 discusses the use of ANN in forecasting. Section 4 proposes a mixture of ANNs for effective forecasting. Section 5 presents and discusses the findings of this study. Finally, Section 6 concludes the paper with some remarks for future applications and practices.

Forecasting an Economic System
An economic system involving a mixture of temporal and non-temporal variables can represent a general family of business forecasting problems (Tong, 2011). An econometric model is a set of simultaneous equations describing the working of an economy as a whole or of one of its sectors (Ruud, 2000).

Structural Equations of an Economic System
Equations of an econometric model usually contain information on the following variables (Judge et al., 1985):
-Endogenous or jointly determined variables have outcome values determined through the joint interaction with other variables within the system.
-Exogenous variables affect the outcome of the endogenous variables, but their values are determined outside the system. Exogenous variables are assumed to condition the outcome values of the endogenous variables but are not reciprocally affected, because no feedback relation is assumed.
-Lagged endogenous variables can be placed in the same category as the exogenous variables since their observed values are predetermined for the current period. The exogenous variables and lagged endogenous variables, which may involve any length of lag, are called predetermined variables.
-Non-observable random errors, also called random shocks or disturbances.

Nonlinearity and Dynamics of Economic Variables
Economic variables change over time, so the linearity of an economic model is a strong assumption. Therefore, a major concern in forecasting is how to capture the nonlinearity and the dynamics of economic events in economic modeling.
Nonlinearity in economic models can exist in the variables and/or in the parameters. In such cases, the traditional approach is to find a transformation, such as the Box-Cox transformation, to convert the model into a linear specification. But there are intrinsically nonlinear models, which cannot be linearly transformed. The estimation of these models is based on minimizing or maximizing an objective function such as the sum of squared errors or the likelihood function (Judge et al., 1985). However, even with current optimization methods, one may encounter estimation complexity when dealing with nonlinear optimization problems (Mills, 1991).
Dynamics in forecasting is usually captured with dynamic regression models (Pankratz, 1991), in which an output is linearly related to current and past values of one or more inputs. An alternative approach is simultaneous equation modeling, which takes into account the relationships among a set of macroeconomic time series (Sims, 1980). In this multivariate perspective, a given time series may be influenced not only by certain exogenous events occurring at a particular point in time but also by contemporaneous, lagged, and leading values of a second variable or of many other variables (Cromwell et al., 1994). Judge et al. (1985) note that it is not unusual for the parameters entering a regression model simply to reflect one's uncertainty about which model would adequately represent the relationship among the variables.

Klein Model I Revisited
Econometric models reported in the literature range from fewer than ten endogenous variables to more than one hundred (Bodkin et al., 1991). The classic Klein Model I of the US interwar economy from 1921 to 1941 (Klein, 1950) has been used as an example of modeling and estimation in econometrics. This model has three behavioral equations and three identities. All endogenous variables are in 1934 dollars, and all relationships are strictly linear. The model specification and variable descriptions are summarized in the following. For simplicity, time subscripts are omitted unless they indicate lagged effects.
-Consumption Equation: C = α0 + α1 P + α2 Pt-1 + α3 (Wp + Wg) + ε1, where C is consumption, Wp is private wage bill, Wg is government wage bill, and P is non-wage income (profits).
-Investment Equation: I = β0 + β1 P + β2 Pt-1 + β3 Kt-1 + ε2, where I is net investment, P is profits, and Kt-1 is stock of capital at the beginning of the year.
-Private Wages (Demand for Labor): Wp = γ0 + γ1 (Y + T − Wg) + γ2 (Y + T − Wg)t-1 + γ3 t + ε3, where Y is output, T is taxes, and t is time trend (year minus 1931).
-Equilibrium Demand: Y + T = C + I + G, where G is government spending.
-Income: Y = P + Wp + Wg.
-Capital Stock: K = Kt-1 + I.
Therefore, the system has six endogenous variables C, I, Wp, P, K, Y, and four exogenous variables T, Wg, G, and t.
This system can be represented in "reduced form" with respect to the endogenous variables (Theil & Boot, 1962). Given the assumption of linearity of the system, the reduced form can be specified as yt = A yt-1 + B xt + C xt-1 + u, where yt is the vector of endogenous variables and xt the vector of exogenous variables. In the reduced form, each endogenous variable in year t is described linearly in terms of the same variables lagged one year (A yt-1), the exogenous variables in the same year (B xt), the exogenous variables lagged one year (C xt-1), and the reduced-form disturbances u. Since C, Wp and I do not occur in lagged form, the corresponding columns in the coefficient matrix consist of zeros.
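As an illustration of the recursion, a minimal sketch of one reduced-form step; the matrices below are arbitrary placeholders, not estimated Klein Model I coefficients:

```python
import numpy as np

def reduced_form_step(A, B, C, y_prev, x_t, x_prev, u_t):
    """One step of the reduced form: y_t = A y_{t-1} + B x_t + C x_{t-1} + u_t."""
    return A @ y_prev + B @ x_t + C @ x_prev + u_t

# Toy 2-variable system; the second column of A is zero because the
# second variable (like C, Wp, I in the model) does not occur lagged.
A = np.array([[0.5, 0.0],
              [0.2, 0.0]])
B = np.array([[0.3],
              [0.1]])
C = np.zeros((2, 1))
y_t = reduced_form_step(A, B, C,
                        y_prev=np.array([1.0, 2.0]),
                        x_t=np.array([1.0]),
                        x_prev=np.array([1.0]),
                        u_t=np.zeros(2))
```

Iterating this step forward, feeding each y_t back in as y_prev, produces a dynamic forecast path for the endogenous variables.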
Klein Model I has been estimated by various traditional methods, which address either single equations or the whole system of equations (Klein, 1950; Theil, 1971; Greene, 2011). These traditional estimations serve as benchmarks for comparison with those from the techniques proposed in this study.

Single-Equation Method of Least Squares
This method treats each equation independently of all others in the system. Klein noted that one had to make an arbitrary choice of the dependent variable for each of the three behavioral equations (Klein, 1950).

Two-Stage Least Squares (2SLS)
Consider a system of simultaneous equations in which the nonzero terms of the jth equation are yj = Yj γj + Xj βj + εj, where Yj contains the included endogenous variables and Xj the included exogenous variables. In the first stage, the Ordinary Least Squares prediction Yj* is obtained from a regression of Yj on X, the full set of predetermined variables. Then the 2SLS estimator is obtained by Ordinary Least Squares regression of yj on Yj* and Xj (Greene, 2011).
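A minimal numerical sketch of the two-stage procedure; the simulated data and variable names are illustrative, not Klein Model I data:

```python
import numpy as np

def two_stage_least_squares(y_j, Y_j, X_j, X):
    """2SLS sketch following the text's notation:
    y_j: dependent variable of equation j; Y_j: included endogenous regressors;
    X_j: included exogenous regressors; X: all predetermined variables."""
    # Stage 1: OLS regression of Y_j on X gives fitted values Y_j*.
    Pi, *_ = np.linalg.lstsq(X, Y_j, rcond=None)
    Y_star = X @ Pi
    # Stage 2: OLS regression of y_j on [Y_j*, X_j].
    Z = np.hstack([Y_star, X_j])
    beta, *_ = np.linalg.lstsq(Z, y_j, rcond=None)
    return beta

# Noise-free illustration: Y depends exactly on the instruments, and
# y_j = 2*Y + 1*x1, so the procedure should recover the coefficients (2, 1).
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
X = np.column_stack([x1, x2])
Y_j = (x1 + x2).reshape(-1, 1)
beta = two_stage_least_squares(2 * (x1 + x2) + x1, Y_j, x1.reshape(-1, 1), X)
```

With simultaneity and noise, the stage-1 projection is what removes the correlation between the endogenous regressors and the disturbance.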

Limited Information Maximum Likelihood
In Limited Information Maximum Likelihood estimation, one takes into account the absence of certain variables from a particular system equation (Theil, 1971). Using the reduced form of the system, the joint density of the endogenous variables is formulated and maximized subject to the constraints that relate the structure to the reduced form (Klein, 1950).

Three-Stage Least Squares (3SLS)
The Three-Stage Least Squares (3SLS) method applies Generalized Least Squares estimation to the system of equations, each of which has first been estimated with 2SLS.
In the first stage, the reduced form of the system is estimated. Using the Ordinary Least Squares method, this yields Yj* for each equation. Then the fitted values of the endogenous variables are used to obtain 2SLS estimates of all the equations in the system. The residuals of each equation are also used to estimate the cross-equation variances and covariances. In the last stage, Generalized Least Squares parameters are obtained for the system (Greene, 2011).

Full Information Maximum Likelihood
This method assumes that the disturbances of the three equations for C, Wp, and I are non-autocorrelated and that there is no correlation between the disturbances of any of the structural equations. The estimators treat all equations and all parameters jointly in formulating the likelihood function, which is maximized subject to all restrictions imposed by the structure. Estimation with Full Information Maximum Likelihood was reported in Klein's monograph (Klein, 1950).
Estimated parameters of the three equations for C, Wp, and I obtained from different methods of limited- and full-information estimation are reported in Greene (2011). In this study, the comparison across methods is based on the residuals of the related estimations reported in Klein (1950) and SAS/ETS (SAS, 1984).

Function Approximation with Artificial Neural Networks
An Artificial Neural Network (ANN) topology consists of nodes, as autonomous processing units, connected by directed arcs and arranged into layers. Every node other than an input node computes its output S as a function of the weighted sum of the inputs directed to it from other nodes, S = f(Σi wi xi), where f(.) is a transfer function, usually a nonlinear, bounded, and piecewise differentiable function, such as the sigmoid function f(z) = 1/(1 + e−z). Such an ANN produces a response that is the superposition of n sigmoid functions, where n is the number of hidden nodes, to map a complex function. As one adds more hidden layers, the ANN will be able to map higher-order functions (Haykin, 2009; Graupe, 2013; Schmidhuber, 2014).
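A minimal sketch of this node-and-layer computation; the weights are random placeholders and no training is shown:

```python
import numpy as np

def sigmoid(z):
    # Logistic transfer function: nonlinear, bounded, differentiable.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer network: each hidden node computes S = f(weighted sum),
    and the output is a superposition of n sigmoid activations."""
    hidden = sigmoid(W_hidden @ x + b_hidden)
    return float(w_out @ hidden + b_out)

# 2 inputs, 3 hidden nodes; weights are arbitrary for illustration.
rng = np.random.default_rng(42)
W_h, b_h = rng.normal(size=(3, 2)), np.zeros(3)
w_o = rng.normal(size=3)
y = forward(np.array([0.5, -1.0]), W_h, b_h, w_o, 0.0)
```

Training would adjust W_h, b_h, and w_o (e.g., by backpropagation) so that the superposition of sigmoids approximates the target function.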
The ability of an ANN in function approximation is due to its capability of learning the underlying functional relationship from the data itself, thereby minimizing the necessary a priori non-sample information. A multi-layer network can produce a mapping between inputs and outputs consistent with any underlying functional relationship, regardless of its true functional form. It eliminates the need for unjustified a priori restrictions, such as the Gauss-Markov assumptions, frequently used to facilitate estimation in regression analysis.
In traditional statistics, the appropriateness of the Ordinary Least Squares method is an empirical question; therefore, the test of assumptions is a routine part of any application. In contrast, whether these assumptions hold or not, the ANN still yields a similar solution, since the image of any underlying mapping can always be projected into a perfectly flexible mapping.
It has been shown that standard multi-layer networks using arbitrary transfer functions can approximate any Borel measurable function to any desired degree of accuracy (Hassoun, 1995; Steeb, 2005). The similarity between ANN techniques and traditional methods in statistics and econometrics has been investigated in the literature (Cheng & Titterington, 1994; Ripley, 1994; Hwang et al., 1994; Kuan & White, 1994; Dreyfus, 2005).

First Attempt of Forecasting Klein Model I with ANN
The first attempt to capture nonlinear relationships among the economic variables of a structural system with ANN was undertaken by Caporaletti et al. (1994) with an in-sample estimation of Klein Model I. Three ANNs are constructed and trained, each of which is used to forecast one of the three endogenous variables of the model, i.e., consumption, investment, and private wage bills. Each ANN has thirteen input nodes corresponding to the seven predetermined variables plus six exogenous variables of the model. The hidden layer contains eight nodes. The output layer has a single node corresponding to a particular endogenous variable. The authors conduct ex-post forecasts and find that the results are significantly better than those from traditional estimation methods.
This attempt has some shortcomings. First, with a single output node the network does not account for the contemporaneous and simultaneous effects of the endogenous variables. As such, it has a drawback similar to that of the traditional single-equation estimation method. In a simultaneous equation system, the appropriate estimation should instead be based on a multivariate approach.
Second, the current values of endogenous variables in this setting are treated as inputs of the network. In addition, there is no feedback to account for the dynamics of the system. As such, this network cannot estimate and forecast a particular endogenous variable without predetermined current as well as lagged values of all other endogenous variables.
Finally, this network architecture does not handle a mixture of non-temporal and temporal variables. As such, one cannot effectively account for the contemporaneous and lagged effects of the related variables of an economic system.
Our study experiments with a network architecture that has the ability to account for the simultaneous and contemporaneous effects of the variables in an economic model. Using a recurrent network design, the proposed network also accounts for the dynamics of the system. As such, ANNs can effectively provide not only ex-post estimations but also ex-ante forecasts of an economic system.

Temporal Pattern Recognition with ANN
An ANN, if configured appropriately, does have the ability to recognize and store the temporal nature of patterns. This study experiments with a combination of the static representation of temporal information and the storing of temporal patterns in a recurrent network.
In the static representation, a sequence of incoming temporal data is represented simultaneously in the network, with an input node for the value of an economic variable corresponding to each time lag. For instance, if the variable X has three lags Xt-1, Xt-2, Xt-3, then three input nodes are needed to capture these lagged values.
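This static lag representation can be sketched as follows; the series values are illustrative:

```python
def lagged_inputs(series, lags):
    """Static representation of a temporal sequence: one input node per lag.
    The row for time t holds [X_{t-1}, X_{t-2}, ..., X_{t-lags}]."""
    return [[series[t - k] for k in range(1, lags + 1)]
            for t in range(lags, len(series))]

X = [10, 11, 13, 16, 20, 25]          # illustrative values of variable X
rows = lagged_inputs(X, 3)            # 3 lags -> 3 input nodes per pattern
# rows[0] is the input pattern for t = 3: [X_2, X_1, X_0]
```

Each row then feeds the three lag-specific input nodes of the network simultaneously, so the temporal sequence is presented as a static pattern.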
In dynamic forecasting, the predicted values of the economic variables of concern are used in forecasting the next period.
In an ANN, one can store and generate temporal patterns via recurrent connections. In this configuration, the output just produced by the network is fed back to the input level to represent the state of the network at the preceding moment in time. Also, nodes can be created to keep some residue of the previous signals and allow slow decay of historical information. Jordan (1986) proposes an architecture in which the value of the output layer is fed back to a context unit to create memory traces. Both input units and context units activate the hidden units to produce the next network output. A context unit retains the past value of its input with an exponential decay. It can be considered a low-pass filter that creates an output that is a weighted average of some of its recent past inputs,
ct = μ ct-1 + yt-1, where 0 ≤ μ ≤ 1 is a time constant to control the degree to which past values are factored in. The time constant could be set to μ = 1 − 1/D, where D > 0 represents the memory depth, i.e., how long a given value fed to the context unit is remembered.
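A sketch of this decaying context unit, assuming the standard Jordan update in which the context state combines its own previous value with the network's previous output:

```python
def context_trace(outputs, mu):
    """Jordan-style context unit: c_t = mu * c_{t-1} + y_{t-1}.
    The state is an exponentially decaying weighted sum of past outputs,
    i.e., a low-pass filter over the network's output history."""
    c, trace = 0.0, []
    for y in outputs:
        trace.append(c)
        c = mu * c + y
    return trace

D = 4                      # memory depth
mu = 1.0 - 1.0 / D         # time constant, 0 <= mu <= 1
trace = context_trace([1.0, 0.0, 0.0, 0.0], mu)
# An impulse fed at t=0 decays geometrically: 0, 1, mu, mu^2, ...
```

A larger D gives μ closer to 1, so the context unit remembers the impulse for more periods before it decays away.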
The literature has reported on the performance of recurrent ANNs versus Vector Autoregression (VAR) and asserted the comparable ability of ANNs in multivariate time series forecasting (Nguyen & Kira, 1997; Moshiri & Cameron, 2000; Aydin & Cavdar, 2015). The study reported herein extends previous works by forecasting a mix of temporal as well as non-temporal economic data of an economic model.

Mixture-of-Experts Architecture
In complex situations, one needs a system of networks in which many specialized networks are integrated or interact with each other in logical or real parallelism. The mixture approach builds complex models out of simple parts.
Function approximation with ANN is traditionally based on a superposition of simple basis functions such as logistic functions. Instead of relying solely on superposition, one can also use the principle of divide-and-conquer to split an input space into smaller regions, which can be fitted with simpler functions by a set of function approximators called expert networks (Jacobs et al., 1991a, 1991b; Jordan & Jacobs, 1995a, 1995b).
The assumption is that data can be well described by a collection of functions, each of which is defined over a relatively local region of the input space (Jordan & Jacobs, 1995a, 1995b). The expert networks could be arranged in modular and/or hierarchical systems. They offer the ability of solving a complex problem by dividing it into a set of sub-problems, each of which may be simpler to solve than the original one. With the assumption that a particular type of network - an "expert" - is appropriate in a region of the input space, the network architecture requires a mechanism that identifies the experts or the mixture of experts that most likely produce the correct output from given associated inputs. This is accomplished with an auxiliary network, called a gating network, to provide the weight of contribution of the various experts.
The final output is O = Σi gi Si, where gi is the weight assigned by the gating network to expert i and Si is the modular/intermediate estimation of that expert (Equations 12-13).
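The aggregation can be sketched as follows; softmax gating is assumed here, since the text does not spell out the gating network's output function:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mixture_output(expert_outputs, gating_scores):
    """Final output O = sum_i g_i * S_i: the gating weights g_i are
    nonnegative and sum to one, weighting each expert's estimation S_i."""
    g = softmax(gating_scores)
    return sum(gi * Si for gi, Si in zip(g, expert_outputs))

# Two experts (e.g., a recurrent and a standard module); equal gating
# scores reduce the mixture to a simple average of the experts.
S = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
O = mixture_output(S, np.array([0.0, 0.0]))
```

In a trained system, the gating scores depend on the input, so different regions of the input space are delegated to different experts.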
Various network training algorithms have been proposed to take advantage of the modularity of mixture-of-experts systems (Masoudnia & Ebrahimpour, 2014).

Genetic Algorithms in ANN Optimization
Genetic Algorithms (GA; Holland, 1975; Goldberg, 1989) have been applied to the optimization of ANNs. They are implemented to search for an optimal set of network weights, an optimal network architecture, or both.
GAs have been used to search for optimal interconnection weights in the weight space of a multilayer, feedforward network without using any gradient information (Montana & Davis, 1989). Unlike backpropagation, which uses a gradient method, a GA can avoid local-minimum traps while performing a global search for the best set of connection weights. The literature has reported on the superiority of sets of network weights selected by GA (Whitley et al., 1990; Sexton et al., 1998).
GAs have also been used to search the space of all possible ANN architectures. Schaffer et al. (1990) propose the use of a GA to evolve ANN architectures. Their method of representing a network architecture as a string allows for the possibility of including or excluding a hidden node/layer and changing network learning parameters during the evolutionary process. The method of optimizing network architecture with GA has been investigated by many other researchers (Davis, 1991). A neural genetic network behaves similarly to nonlinear, nonparametric stepwise regression without any a priori assumption on the functional form of the relationship among the data (Reeves & Rowe, 2002).
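A minimal sketch of such an evolutionary architecture search, with a toy fitness that trades estimation error against network size; the bit-string encoding and all parameters are illustrative:

```python
import random

def fitness(genome, error_fn):
    """Two-criterion fitness: estimation error plus a penalty for
    complexity (number of active hidden nodes). Smaller is better."""
    return error_fn(genome) + 0.01 * sum(genome)

def evolve(error_fn, n_bits=8, pop_size=20, generations=30, seed=1):
    """Minimal genetic search over bit strings that switch hidden nodes
    on or off: keep the better half, create children by bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, error_fn))
        parents = pop[:pop_size // 2]
        children = []
        for p in parents:
            child = p[:]
            child[rng.randrange(n_bits)] ^= 1   # flip one random bit
            children.append(child)
        pop = parents + children                # elitism: parents survive
    return min(pop, key=lambda g: fitness(g, error_fn))

# Toy error surface: pretend three active hidden nodes is optimal.
best = evolve(lambda g: 0.1 * abs(sum(g) - 3))
```

In a real application the error term would come from training and validating the candidate network, which makes each fitness evaluation far more expensive than this toy surface.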

Mixture of Neural Networks in Estimation and Forecasting of Klein Model I
This study experiments with the mixture-of-experts network architecture for the estimation and forecasting of a mixture of temporal and non-temporal variables. The proposed network is able to account not only for the nonlinearity and dynamics but also for the simultaneous and contemporaneous effects of the variables in an economic system.
For comparative purposes on estimation and forecasting of Klein Model I, the relative performances of previous estimations versus those of the proposed system are evaluated on ex-post forecasts for the period 1921-1941. Data for estimation are taken from Klein (1950). Then, the Klein Model I framework is used to train and validate the ex-ante forecast ability of the ANN on a moving-window scheme from 1950 to 1994. Data are taken from the National Income and Product Accounts of the United States 1929-1994 (U.S. Department of Commerce, 1998). Within this time horizon, a moving window frame is implemented. In each window, 20 annual periods are used for estimation, the 5 subsequent periods for validation, and the next 5 periods for testing.
If one relaxes the linear restriction on the relationships among the variables of Klein Model I, then the reduced form of the model (Theil & Boot, 1962; Intriligator, 1978) can be specified as yt = f(yt-1, xt, xt-1) + ut, where f is an unknown, possibly nonlinear, mapping. Since the system has a group of temporal variables and a group of non-temporal variables, the ANN needs two modules to learn the specific patterns of each type of variable. The ANN also has a gating network to aggregate modular estimations into final results and to account for the simultaneous and contemporaneous effects of the endogenous variables. The proposed mixture-of-experts network estimates Klein Model I with two-stage and modular architectures. Since the endogenous variables are contemporaneously related, it is not accurate to estimate them with a single-equation approach. The relationships of the endogenous variables with the other variables of the system are therefore estimated in the instrumental stage. Although these variables are estimated simultaneously, their contemporaneous effect has not yet been taken into account. Consequently, these instrumental estimations are mapped to their actual values to account for this contemporaneous effect in the final stage.

Modular Estimation
In an economic system, some endogenous variables are affected by their lagged values. Also, the depth of the lagged effects may vary across endogenous variables. In addition, some variables of the model may be affected a priori by a certain exogenous variable. Without modular estimation for each effect, it could be very difficult to approximate accurately the mixture of temporal and non-temporal variables. Consequently, the ANN should have different modules at the instrumental stage to capture these lagged effects or specified effects separately. Instrumental output results from modular estimations are aggregated at the final stage to account for the contemporaneous effect of all endogenous variables on the network outcome. Specifically, in the modular estimation of Klein Model I, the instrumental stage has two modules: a recurrent module to estimate Pt, Yt, Kt, taking into account their lagged effects, i.e., Pt-1, Yt-1, Kt-1, and a standard module for the remaining variables. From the initial structure of the mixture of networks, GAs are used to select the optimal network topology at each stage and for each module in the ANN estimation. The fitness of each topology is evaluated on two criteria: simplicity, in terms of the number of hidden layers and hidden nodes, and relative performance, in terms of the discrepancy between network outputs and desired targets.
In the following, a network configuration is represented as I-H1F-H2F-OF, where I is the number of input nodes, H1 is the number of nodes in the first hidden layer, H2 is the number of hidden nodes in the second layer, O is the number of output nodes, and F is the transfer function chosen from a pool of logistic sigmoid functions (L), hyperbolic tangent functions (T), and linear functions (Lin). For instance, the notation 9-7L-3T denotes a network configuration of 9 input nodes, one hidden layer with 7 nodes using logistic sigmoid transfer functions, and 3 output nodes using hyperbolic tangent transfer functions.
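This configuration notation can be parsed mechanically; a small sketch:

```python
def parse_config(spec):
    """Parse the I-H1F-H2F-OF notation, e.g. '9-7L-3T': 9 input nodes,
    a hidden layer of 7 logistic nodes, and 3 hyperbolic-tangent outputs."""
    transfer = {"L": "logistic", "T": "tanh", "Lin": "linear"}
    parts = spec.split("-")
    layers = [{"nodes": int(parts[0]), "transfer": None}]   # input layer
    for p in parts[1:]:
        digits = "".join(ch for ch in p if ch.isdigit())
        layers.append({"nodes": int(digits),
                       "transfer": transfer[p[len(digits):]]})
    return layers

cfg = parse_config("9-7L-3T")
```

Such a parser makes it easy to translate a GA-selected string like 6-6L-3T directly into a concrete network topology.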

Two-Stage ANN Estimation of the Klein Model I
At each stage of estimation (i.e., the instrumental stage and the final stage), the selected network is trained in 30 runs; each run lasts 1,000 epochs with different initial random weights. At each stage of estimation, minimum and maximum errors are recorded. This results in two streams of data representing instrumental estimations, one with maximum error and the other with minimum error. These streams of instrumental estimations are used in the final estimation of the system equations. The GA selects a network configuration of 9-7L-6T for the instrumental stage and 6-6L-3T for the final stage. Table 1 reports the performance of the network with minimum/maximum error at the instrumental stage and at the final stage. The rationale of this recording is to evaluate, for the case where the network is trained in just one run at each stage, what the boundary of errors would be between the best and the worst estimations from 30 runs.
Results in Table 1 show that the performance of the ANN is superior to those of traditional methods (Klein, 1950; SAS, 1984) and of Caporaletti et al. (1994). The comparison is based on the Sum of Squared Errors (SSE) in the estimation of each endogenous variable as well as the total SSE in the estimation of the whole system at different training times. Caporaletti et al. (1994) use an ANN to estimate each endogenous variable of the system, one by one. This single-equation estimation approach may not capture the simultaneous and contemporaneous effects of the other endogenous variables in the economic system as well as the system estimation approach used in this study does.

Modular ANN Estimation of the Klein Model I
Using the mixture-of-experts network architecture, the network configuration in this study has two modules: a recurrent module and a standard one. The recurrent module is refined to learn the lagged effects on the related temporal variables. The standard module learns the inter-relationships of the other variables in the system. Instrumental estimations from these two modules are then processed in the final stage, with a mapping of the instrumental estimations to the desired targets of the system. This mapping accounts for the contemporaneous and simultaneous effects on the final estimation of the endogenous variables. In the instrumental stage, the GA selects the configuration 9-6L-3T for the recurrent module and 9-7L-3T for the standard module. Each network module is trained in 30 runs; each run lasts 1,000 epochs with different initial random weights. This results in two streams of data representing instrumental estimations to be used in the final estimation of the system equations. In the final stage, the GA selects the configuration 6-6T-3T for the network. The final network is trained in 30 runs; each run lasts 1,000 epochs with different initial random weights. In each module, the minimum and maximum errors of estimation are recorded to define the boundary of errors. Results from the modular ANN estimation are reported in Table 2.
In all cases, the results obtained from modular ANN estimations are superior to those of the two-stage ANN and the traditional methods reported in the previous section. The reason for this improvement is that the temporal effect of lagged endogenous variables on the system is taken into account explicitly in the modular estimation.

Modular ANN Forecasting of the Klein Model I
This study uses the variables defined in the Klein Model I to forecast the related endogenous variables for the period from 1950 to 1994. As the US economy grew dramatically, the levels of macroeconomic variables in this period increased accordingly. It would be difficult for a network to deal with variables whose values grow without bound and are spaced with big gaps. As an alternative, this study considers a more compact space and focuses on the growth rates of the related endogenous variables. Consequently, related data in the period are transformed into first differences of their natural logarithmic values to capture their growth rates. In the following, the growth rates of consumption, private wages, and net investment are denoted DLC, DLWp, and DLI, their instrumental estimates DLC*, DLWp*, and DLI*, and their final estimates DLC**, DLWp**, and DLI**, respectively.
Following Klein (1950) and Klein-Goldberger (1955), which concentrated on the sign of the forecast residual, the current analysis focuses on the ability of the ANN to pick up the future direction of the related variables. The experiments have been conducted with available data from the US Bureau of Census from 1950 to 1994, divided into 30-year moving time windows. For each window, 20 yearly periods are used for training, the next 5 for testing, and the subsequent 5 for validation or forecasting. GA is used to select the appropriate configuration for each module. The best network is used to make forecasts for the next 5 out-of-sample periods of the time frame.

Period from 1950 to 1979
For this period, data from 1950-1969 are used for training, 1970-1974 for testing, and 1975-1979 for forecasting. GA selects a network configuration of 9-4L-3T for the recurrent module, 9-5L-3T for the standard module, and 6-4L-3T for the final stage.
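The growth-rate transformation described above (first differences of natural logs) can be sketched as follows; the level data are illustrative, not NIPA values:

```python
import math

def log_growth(levels):
    """First differences of natural logs: DLX_t = ln(X_t) - ln(X_{t-1}),
    an approximation of the period-to-period growth rate of X."""
    return [math.log(curr) - math.log(prev)
            for prev, curr in zip(levels, levels[1:])]

consumption = [100.0, 103.0, 105.0, 104.0]   # illustrative annual levels
DLC = log_growth(consumption)                # three growth-rate observations
```

The transformed series is bounded and roughly stationary, which keeps the network inputs within the effective range of its sigmoid transfer functions.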

For DLC, the network learns well, as it captures correctly the changes in direction of this variable with an SSE of .000331. However, in the test/forecasting period, the network projects a slight fluctuation at a lower level while the related variable starts fluctuating in an upward trend (Figure 3). The SSEs in the test and forecast periods are .001794 and .008602, respectively. The network has not experienced these high growth levels of DLC in an upward trend. Consequently, it provides moderate forecasts.
For DLWp, the network learns the data patterns well and follows closely the changes in direction of the variable with an SSE of .000832. In the test and forecast periods, the network picks up the changes in direction with SSEs of .001821 and .001209, respectively. One notes that, as the network has learned the large fluctuation patterns in the training set, it is able to forecast at a moderate level while following the future directions of the data (Figure 4).
For DLI, network forecasting picks up the changes in direction of the variable, as it has already learned the fluctuating patterns in the data. The SSEs for the training and testing periods are .017397 and .09676, respectively. However, when the variable fluctuates over a wide range (in 1974-75), the network has not experienced this new pattern and cannot make a close prediction. Therefore, it produces forecasts at moderate levels. The SSE for this forecast period is .267968 (Figure 5).

Period from 1955 to 1984
For this period, data from 1955-1974 are used for training, 1975-1979 for testing, and 1980-1984 for forecasting. GA selects a network configuration of 9-6T-3T for the recurrent module, 9-6T-3T for the standard module, and 6-7T-3T for the final stage.
For DLC, the network learns well the upward trend in the training set by following correctly the changes in direction of the variable, with an SSE of .000682. It is able to pick up the patterns in the test period with an SSE of .000751. When the future data (1980-84) fluctuate in a new downward pattern, the network produces a dampened forecast at a moderate level (Figure 6). As the network has not experienced this pattern, it produces an SSE of .004883 for the forecast period.
For DLWp, after learning well the upward trend in the training set with an SSE of .001172, the network forecasts a slight fluctuation at a moderate level when the future data (1980-84) fluctuate in a new pattern (Figure 7). SSEs for the test and forecast periods are .002552 and .010367, respectively.
For DLI, after learning well the fluctuation in the training set with an SSE of .008684, the ANN forecasts follow the future data pattern. However, the network has not experienced the large changes in levels of the extreme variation in the forecast period (e.g., the changes in 1982 to 1984). Consequently, when the future data start fluctuating widely (1980-84), the network produces forecasts at moderate levels (Figure 8). SSEs for the test and forecast periods are .074401 and .758536, respectively. The large errors in the forecast period are due to the large changes in the levels of data that the network is unable to capture.

For the next experiment period, GA selects a network configuration of 9-5T-3L for the recurrent module, 9-6T-3T for the standard module, and 9-3L-3T for the final stage.
For DLC, the network learns well the upward pattern of the training set with an SSE of .000752. When the future data start a downward trend (1985-89), the network has not experienced such a large change in levels and cannot produce closer forecasts. As a result, the network forecasts follow the future directions but at higher levels (Figure 9). SSEs for the test and forecast periods are .006123 and .012539, respectively.
For DLWp, the network learns well the upward pattern of the training set with an SSE of .001325. When the future data start a downward trend, the ANN forecasts follow the trend but at higher levels (Figure 10). As with DLC in this period, the network has not learned the large change in levels of the future fluctuation and so cannot provide closer forecasts. SSEs for the test and forecast periods are .00773 and .024928, respectively.
For DLI, the network learns well the fluctuation in the training set with an SSE of .019973. When the future data drop (1985) and then fluctuate at a lower level (1985-89), the network forecasts follow the trend but at a higher level (Figure 11). SSEs for the test and forecast periods are .082321 and .727144, respectively. The large error in the forecast period is due to the change in data patterns, which fluctuate at moderate levels that the network is unable to follow closely.

For the next experiment period, GA selects a network configuration of 9-5L-3T for the recurrent module, 9-3L-3T for the standard module, and 9-5L-3T for the final stage.
For DLC, the network learns well the upward trend in the training set with an SSE of .001085. When the future data start a long downward trend, the network has not learned these patterns and cannot accurately predict the future level and, on some occasions, the changes of direction, e.g., in 1990 (Figure 12). SSEs for the test and forecast periods are .002213 and .014209, respectively.
For DLWp, the network learns well the upward trend in the training set with an SSE of .001471. When the future data follow a downward trend, the network does not accurately predict the level and changes of direction from the pattern it has learned (Figure 13). SSEs for the test and forecast periods are .002426 and .012759, respectively.
For DLI, the network learns well the fluctuation in the training set. Since the training set contains patterns of large fluctuations, the network forecast is able to follow the trend of the future data, though with the large fluctuating pattern it has learned (Figure 14). SSEs for the test and forecast periods are .136035 and .164281, respectively. The large errors in these periods are due to the sudden change in levels of fluctuation.

The following are some observations on ANN behavior in learning the patterns of the training set and producing forecasts on unseen data:
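The SSE figures quoted throughout are sums of squared errors computed separately over the training, test, and forecast periods. A minimal sketch of that computation, on made-up numbers (the actual series values are in the tables):

```python
def sse(actual, predicted):
    """Sum of squared errors between two equal-length series."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Illustrative values only, not data from the study
actual    = [0.1, 0.2, 0.3]
predicted = [0.1, 0.25, 0.28]
print(round(sse(actual, predicted), 6))  # 0.0029
```

Because SSE is a sum rather than a mean, comparing it across periods is only meaningful when the periods have the same length, as they do here (five-year spans).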
- The network does not learn and project the recent extreme trend; it tends to provide a moderate forecast in terms of directions and levels.
- If the network has been trained on data with an upward trend and the variable to be predicted fluctuates in a downward trend, the network forecasts will be dampened at a middle level.
- If the network has not experienced drastic level changes in the training set, it produces a forecast that follows the trend but at a higher level for a future downward change and at a lower level for a future upward change.
- If the network is trained on a fluctuating pattern, its forecasts follow the future trend but with a moderate change in level. The larger the variation in the training set, the more closely the ANN follows the patterns in the forecast period in terms of directions and levels.
- The network cannot accurately predict a level outside the range of the patterns it has learned from the training set. When it encounters such a case, it produces a forecast at an average level of the data in the training set.
Consequently, in order to improve its forecasting ability, a network should be exposed to the variation in trend (upward/downward, long/short fluctuations) and to the highest and lowest possible levels of the data patterns.
The experiments illustrate that the more variation exists in the training set, the more closely the ANN follows future fluctuations in terms of directions and levels. In any case, ANN forecasting tends to be conservative, not immediately following a drastic upward or downward trend.
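The mixture-of-experts flow described in these experiments can be sketched end to end: a recurrent module handling temporal (lagged) inputs and a standard module handling current inputs each emit intermediate estimates of the three variables, and a final stage aggregates them. The sketch below assumes the final stage's 6 inputs are the two modules' concatenated 3-output vectors (consistent with the 6-7T-3T final stage reported above); weights are random stand-ins for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(n_in, n_out):
    """Random weights and zero biases for one layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def mlp(x, layers, acts):
    """Forward pass; 'T' = tanh layer, 'L' = linear layer."""
    for (w, b), act in zip(layers, acts):
        x = x @ w + b
        if act == "T":
            x = np.tanh(x)
    return x

# Two expert modules, each 9 inputs -> 3 outputs (DLC, DLWp, DLI)
recurrent = [dense(9, 6), dense(6, 3)]   # fed lagged (temporal) inputs
standard  = [dense(9, 6), dense(6, 3)]   # fed current (non-temporal) inputs

# Final stage 6-7T-3T: aggregates the two intermediate 3-vectors
final = [dense(6, 7), dense(7, 3)]

x_lagged  = rng.normal(size=9)
x_current = rng.normal(size=9)
intermediate = np.concatenate([mlp(x_lagged, recurrent, ["T", "T"]),
                               mlp(x_current, standard, ["T", "T"])])
y = mlp(intermediate, final, ["T", "T"])
print(y.shape)  # (3,) -- final estimates of DLC, DLWp, DLI
```

Partitioning the inputs this way lets each module specialize in one character of the data (temporal vs. non-temporal) before the final stage reconciles their estimates.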

Concluding Remarks
From extensive experiments with Klein Model I, the integrated ANN and GA in a mixture-of-experts network architecture has provided evidence of an effective alternative to traditional estimation/forecasting techniques for handling a mix of temporal and non-temporal variables. One can use hierarchical networks to conduct instrumental estimations. One can also partition the problem space into domains and assign them to modular ANNs to learn the related patterns.
The versatile technique of the integrated mixture-of-experts ANN overcomes the imposed assumptions on the behaviors of related variables, the specification of exact relationships, and the difficulty of nonlinear estimation of the economic model. The GA helps overcome the sub-optimality of the tedious trial-and-error process in network building. The flexible network architecture offers many alternative configurations to capture the peculiarities of variables in a problem space before aggregating intermediate estimations into final results. The integrated system effectively processes the mixture of variables and produces efficient estimations and forecasts.
In future work, seasonal patterns of temporal data would be explicitly recognized in an integrated ANN system. In that architecture, GA could be used to determine the minimum number of input nodes (time lags) that is still capable of providing accurate forecasts. Such a design would free the ANN technique from relying on a traditional statistical technique, such as the Box-Jenkins method (1976), to determine the time lags, and consequently the memory, of a neural forecasting system.
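As a rough illustration of that proposal, a GA can search over lag counts directly, scoring each candidate by in-sample error plus a parsimony penalty. The sketch below is a toy, not the study's GA: the fitness function uses a simple moving-average predictor as a stand-in for a trained network, and all parameters (population size, mutation step, penalty weight) are arbitrary assumptions:

```python
import random

random.seed(0)

def fitness(n_lags, series):
    """Toy fitness: SSE of a moving-average predictor over n_lags past
    values, plus a small penalty per input node to favor parsimony."""
    errs = 0.0
    for t in range(n_lags, len(series)):
        pred = sum(series[t - n_lags:t]) / n_lags
        errs += (series[t] - pred) ** 2
    return errs + 0.01 * n_lags

def ga_select_lags(series, max_lags=8, pop=10, gens=20):
    """Minimal GA over lag counts: binary tournament plus +/-1 mutation."""
    popn = [random.randint(1, max_lags) for _ in range(pop)]
    for _ in range(gens):
        a, b = random.sample(popn, 2)
        winner = a if fitness(a, series) < fitness(b, series) else b
        child = max(1, min(max_lags, winner + random.choice([-1, 0, 1])))
        popn.remove(a if winner is b else b)   # drop the tournament loser
        popn.append(child)                     # keep population size fixed
    return min(popn, key=lambda n: fitness(n, series))

series = [0.1 * (i % 4) for i in range(40)]    # toy series with period 4
best = ga_select_lags(series)
print(1 <= best <= 8)  # True
```

In the envisioned system, the fitness evaluation would instead train and validate a network with the candidate number of input nodes, so the GA discovers the forecasting memory directly rather than inheriting it from a Box-Jenkins analysis.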

Figure 1. Two-stage ANN Estimation of Klein Model I

Figure 2. Modular ANN Estimation of Klein Model I

Figure 3. Estimation and Forecasting of DLC

Figure 6. Estimation and Forecasting of DLC

Figure 9. Estimation and Forecasting of DLC

Figure 12. Estimation and Forecasting of DLC

Table 1. Two-Stage ANN Estimation of the Klein Model I

Table 2. Modular ANN Estimation of the Klein Model I

Table 3. Modular ANN Estimation and Forecasting of the Klein Model I