Multi Factor Stock Selection Model Based on LSTM

This paper takes CSI 300 stocks as the research object and combines an LSTM model, which has memory characteristics, with traditional multi-factor analysis to build an improved multi-factor stock selection model. In back-testing experiments, we use the trained LSTM model to forecast stock returns and classify stocks into portfolios to construct the investment strategy. The results show that the multi-factor stock selection model based on LSTM has good return-forecasting ability and profitability.


Introduction
Quantitative investment has the advantages of objective rationality, accuracy, controllability, efficiency and sensitivity (Liang & Yongping, 2018), which have attracted the attention of the financial industry and academia. Within this field, the multi-factor stock selection model (Malkiel & Fama, 1970; Asmess, 1997; Chen & Zhang, 1998; Mohanram, 2005) is widely used as a classic model of stock investment. In recent years, the rapid development of machine learning algorithms has provided new ideas for quantitative investment research, such as the use of support vector machines (Lifang et al., 2006; Yanfeng & Feng, 2006) and neural network algorithms (Hongxing & Zhaojun, 2002; Kun, Yong, & Wei, 2009; Bo, 2010; Wei, Weiqiang, & Bo, 2001) to predict stock prices. As an improved recurrent neural network (Xuejun & Win, 2016; Xiong, Nichols, & Shen, 2015; Ruiqi, 2015; Zhang, 2001), the LSTM is more suitable for predicting stock market time series than a shallow neural network because it stores historical information through a recurrent feedback structure.
To sum up, this paper takes CSI 300 stocks as the research object and combines the LSTM model, which has memory characteristics, with traditional multi-factor analysis to build an improved multi-factor stock selection model. In the back test, the trained LSTM model predicts and classifies stock returns, with the aim of building a stock investment strategy with high yield and high accuracy and of providing new ideas for cross-disciplinary research between neural networks and quantitative investment.

RNN Model
The biggest difference between the RNN model and traditional neural networks is that it is tied to time: the network contains cycles. The result at a given time step is affected not only by the input at that step but also by the output of the previous step, which means information has a lasting impact.
Figure 1. Recurrent neural network and its unrolled form
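The recurrence described above can be illustrated with a minimal sketch. It assumes a single hidden unit with a tanh activation; the weight names `w_x`, `w_h` and `b` and the toy input are hypothetical and are not taken from the model used in this paper.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Unroll a one-unit recurrent network over a sequence of scalar inputs."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# A single impulse at the first step, followed by zero inputs (toy data).
states = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
```

Because `w_h` is non-zero, the states at the second and third steps remain non-zero even though their inputs are zero, which is the lasting impact of earlier information.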

LSTM Model
The LSTM is a specially designed RNN. Both the LSTM and the original RNN contain three layers: the input layer, the hidden layer, and the output layer. However, the two differ greatly in the design of the hidden layer, mainly because the LSTM has a special cell structure in the hidden layer. Comparing the following two charts illustrates this. It can also be seen from Figure 3 that the LSTM replaces a single simple activation with a combination of several parts acting on a storage cell, so that at each step the information passed on to the next step can be controlled: for example, whether to include the previous information and how much of it to carry over.
Each storage unit consists of three major components: input gates, output gates, and forget gates.
Forget gate: to transmit information more effectively, information is filtered to decide which parts can be forgotten.
Output gate: update the information in the new cell state.
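For reference, a widely used formulation of the LSTM cell with a forget gate can be written as follows. The text does not state which variant was used, so the notation here is an assumption: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $W$, $U$ and $b$ are trainable weights and biases, $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

In this formulation the forget gate scales the previous cell state, the input gate scales the new candidate, and the output gate controls how much of the cell state is exposed to the next step.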

Multi Factor Stock Selection Model Based on LSTM
The data processed by the multi-factor model are standard panel data with three dimensions: stocks, time, and factors; the corresponding dependent variable is the return in period T+1.
When applied to the LSTM network structure, there are some differences from the traditional multi-factor model: the rate of return in period T+1 is still the training label, each factor corresponds to a feature of the sample, and each stock corresponds to a sample, but the time dimension becomes a recurrent process in the LSTM: the factor data of the past periods back to T-n are included in the forecast of the return in period T+1.
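The mapping from panel data to LSTM samples can be sketched as below. This is only an illustration under stated assumptions: the nested-dictionary layout, the function name `build_sequences` and the toy numbers are hypothetical, and the window is taken to cover `steps` consecutive months ending at T.

```python
def build_sequences(panel, returns, steps):
    """Build (stock, T, window, label) samples from panel data.

    panel[stock][t]   -> list of factor values for that stock in month t
    returns[stock][t] -> return of that stock in month t
    Each sample holds the factors of the `steps` months ending at T and is
    labelled with the return of month T+1.
    """
    samples = []
    for stock, factor_rows in panel.items():
        n_months = len(factor_rows)
        # T must leave room for a full window before it and a label after it.
        for end in range(steps - 1, n_months - 1):
            window = factor_rows[end - steps + 1 : end + 1]
            label = returns[stock][end + 1]
            samples.append((stock, end, window, label))
    return samples

# Toy data: 2 stocks, 5 months, 2 factors (hypothetical values).
panel = {
    "A": [[0.1 * t, -0.1 * t] for t in range(5)],
    "B": [[0.2 * t, 0.3] for t in range(5)],
}
returns = {
    "A": [0.01, 0.02, -0.01, 0.04, -0.03],
    "B": [0.00, 0.05, 0.01, -0.02, 0.02],
}
samples = build_sequences(panel, returns, steps=3)
```

Each sample is one stock at one month end, so the number of samples grows with both the number of stocks and the number of months, while the time dimension is consumed inside the window.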

Parameter Setting
(1). Back-testing time: from May 1, 2007 to April 30, 2016. The number of monthly training samples in this interval exceeds 180,000 (each stock at the end of each month is one sample).
(2). Strategy time: May 1, 2016 to April 30, 2017.
(3). LSTM time length (steps): 24 months. Each training sample contains the factor data of the past 24 months. The data are fed into the neural network starting from the first month, and the output of each step is fed back into the network together with the factors of the next month, and so on, until the forecast value for the twenty-fourth month is obtained.
(4). Number of factors: Because the model is trained as a neural network, we do not evaluate the validity of the factors at the beginning of the period, nor do we combine the factors; we input all of them into the model, excluding only some factors that are highly correlated and belong to the same category, which reduces the possibility of overfitting during training. In the end, 48 sub-factors belonging to 10 common style factors are selected.
(5). Number of classes: To verify forecasting accuracy and exclude some of the noise in the sample, we divide sample returns into three classes: rising (monthly return greater than 3%), falling (monthly return less than -3%), and neutral (monthly return between -3% and 3%).
(6). Batch size: a system parameter of the LSTM that sets the number of samples used to compute the gradient. That is, in every training step, a batch of this size is randomly drawn from the full training set.
(7). Number of hidden neurons: this is also a system parameter of the LSTM. It is the number of units connecting the input sample to the hidden-layer cells. Limited by the performance of the computer, it could only be set to a three-digit number, with 2 hidden layers.
(8). Learning rate: a system parameter of the LSTM that sets the step size of gradient descent during training. If it is too high, training tends to become unstable and may fail to converge. If it is too low, training is too slow.
(9). Cross-validation ratio: To prevent overfitting, 90% of the more than 180,000 samples are used as the training set to fit the model parameters, while the remaining 10% do not participate in training and serve only as a test set. If the accuracy on the training set and the accuracy on the test set increase at the same time, the model is unlikely to be overfitting.
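Items (5) and (9) above can be sketched as follows. The 3% threshold and the 90/10 ratio come from the text; the class codes, the function names and the use of a seeded random shuffle are assumptions made for illustration, since the text does not say how the held-out 10% is drawn.

```python
import random

UP, NEUTRAL, DOWN = 0, 1, 2  # hypothetical class codes

def label_return(monthly_return, threshold=0.03):
    """Map a monthly return to one of the three classes in item (5)."""
    if monthly_return > threshold:
        return UP
    if monthly_return < -threshold:
        return DOWN
    return NEUTRAL

def holdout_split(samples, test_ratio=0.10, seed=0):
    """Hold out a share of the samples for testing, as in item (9)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

labels = [label_return(r) for r in (0.05, -0.04, 0.01, 0.03)]
train, held_out = holdout_split(range(100))
```

Returns exactly at the threshold fall into the neutral class here, which matches the "greater than 3%" and "less than -3%" wording of item (5).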

Model Training
(1). Data preprocessing: following the standard multi-factor procedure, extreme values are removed from each cross-sectional factor and the factor is standardized. In addition, to eliminate industry effects, each cross-sectional factor is regressed on the industry dummy matrix, and the residual is taken as the final input factor data.
(2). In-sample training: after 100 iterations, training was observed to converge. The out-of-sample accuracy finally converges to a level that is merely above 50%, so it is necessary to assess how much real predictive power this level reflects. To test the out-of-sample stock selection effect of the LSTM model intuitively, we use the model's monthly predictions as the stock selection criterion.
Figure 8. A-share forecast portfolio net value
It can be seen that in the most recent year the model has a higher win rate for the high-return and low-return classes, but it is less effective at forecasting the neutral class.
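The preprocessing in step (1) can be sketched for a single factor on a single cross section as below. The text does not specify the extreme-value rule, so clipping at the median plus or minus three scaled median absolute deviations is an assumption, as are the function names and the toy data. With one dummy per industry and no intercept, the regression residual equals the value minus its industry mean, which is what the last function computes.

```python
import statistics

def winsorize(values, n_mad=3.0):
    """Clip values to median +/- n_mad scaled MADs (assumed rule)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    spread = n_mad * 1.4826 * mad
    return [min(max(v, med - spread), med + spread) for v in values]

def standardize(values):
    """Scale a cross section to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def neutralize_by_industry(values, industries):
    """Residual of regressing the factor on industry dummies."""
    groups = {}
    for v, ind in zip(values, industries):
        groups.setdefault(ind, []).append(v)
    means = {ind: statistics.mean(vs) for ind, vs in groups.items()}
    return [v - means[ind] for v, ind in zip(values, industries)]

# Toy cross section: one factor, six stocks, two industries (hypothetical).
raw = [1.0, 2.0, 3.0, 4.0, 100.0, 2.5]
industries = ["bank", "bank", "tech", "tech", "tech", "bank"]
clipped = winsorize(raw)
factor = neutralize_by_industry(standardize(clipped), industries)
```

The order follows the text: extreme values first, then standardization, then industry neutralization, so that a single outlier does not dominate the mean and standard deviation used for scaling.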
Figure 9. The cumulative net value of all A-share long and short portfolios
Over the past 12 months, the excess yield has been 75%. Judging from the cumulative net value of the long-short portfolio, the excess return over the past 12 months was 4.5%.
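The quantities reported here can be computed from monthly portfolio returns as in the following sketch. The monthly returns are made-up numbers for illustration only, and treating the long-short return as the long-leg return minus the short-leg return is an assumption about how the portfolio is constructed.

```python
def long_short_returns(long_returns, short_returns):
    """Monthly return of holding the long leg and shorting the short leg."""
    return [l - s for l, s in zip(long_returns, short_returns)]

def cumulative_net_value(monthly_returns, start=1.0):
    """Compound monthly returns into a net value curve starting at `start`."""
    nav = [start]
    for r in monthly_returns:
        nav.append(nav[-1] * (1.0 + r))
    return nav

def win_rate(monthly_returns):
    """Share of months with a strictly positive return."""
    wins = sum(1 for r in monthly_returns if r > 0)
    return wins / len(monthly_returns)

# Hypothetical monthly returns of the predicted-up and predicted-down portfolios.
long_r = [0.04, 0.01, -0.02, 0.03]
short_r = [-0.01, 0.02, -0.03, 0.00]
excess = long_short_returns(long_r, short_r)
nav = cumulative_net_value(excess)
rate = win_rate(excess)
```

The last element of `nav` minus one gives the cumulative long-short return over the period, and `rate` gives the monthly win rate.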
To further verify the accuracy of the model's stock forecasts, we change the stock selection criterion from the model's output class to the activation values just before the final prediction. Because we classify the prediction target into three classes (high, medium, and low), the neural network chooses the class with the largest activation value as the predicted class. The activation value therefore reflects the probability that the model assigns to each class of future stock return.
Based on this, we reconstruct the three types of stock portfolios: in each period, the 30% of stocks with the largest activation value for a class are selected as the corresponding portfolio. It can be seen that the model's prediction of neutral returns still has not improved, but the forecast of the long-short return is more accurate than for the full A-share universe. The out-of-sample excess return was 9%, while the monthly win rate over these 12 months exceeded 90%. Through back testing on out-of-sample data, we found that LSTM learning makes the prediction of stock returns more accurate. At the same time, the probabilities that the model assigns to the different return classes further reflect the probability of a stock rising or falling.
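The selection rule based on activation values can be sketched as follows. It assumes that the activation values are turned into class probabilities with a softmax, which the text does not state explicitly; the stock codes, the activation numbers and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert final-layer activations into class probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_fraction(scores, fraction=0.30):
    """Return the stocks with the highest scores, keeping `fraction` of them."""
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]

# Hypothetical activations for ten stocks, ordered as (up, neutral, down).
activations = {
    f"S{i:02d}": [0.5 * i - 2.0, 0.0, 2.0 - 0.5 * i] for i in range(10)
}
probs = {stock: softmax(a) for stock, a in activations.items()}
long_leg = top_fraction({s: p[0] for s, p in probs.items()})
short_leg = top_fraction({s: p[2] for s, p in probs.items()})
```

Ranking by probability rather than by predicted class keeps only the stocks for which the model is most confident, which is the motivation given above for the 30% portfolios.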

Conclusion
Multi-factor models are maturing, and the alpha yield of factors has declined. If maintaining the returns of multi-factor models is a core issue in the quantitative field, we believe the directions for expansion include mining new factors, differentiating stock pools, and exploring non-linear factor features. Machine learning is an effective approach to non-linear problems. The LSTM used in this article extends the time dimension and expands the depth of the network to map the current factor space into a higher-dimensional space, and finds an effective path in that space to realize the prediction of the factor model.
After strictly separating the training set, the test set, and the back-test data set, we obtain convergent results with high accuracy through training and significant excess returns in the back test. The cross-validation accuracy is close to 90%, and the out-of-sample win rate of the long-short portfolio exceeds 90% over the most recent 12 months. The surprising point of these results is that with only the basic LSTM structure, such high accuracy and significance are obtained before any parameter optimization, so further improvement and optimization of the model can be expected. At the same time, these results are within expectation: the powerful data processing capabilities of machine learning and neural networks will show in the field of investment once we no longer treat them as complex "black boxes".

Figure 2. Design of the hidden layer of the original RNN

Figure 4. The unit structure of LSTM

Figure 5. The loss rate of LSTM

Figure 10. 30% space combination net value

Figure 11. 30% cumulative net value of long-short portfolio