Multi-factor Stock Selection Model Based on Kernel Support Vector Machine

In recent years, the combination of machine learning methods with traditional financial investment has become a hotspot in academia and industry. This paper takes CSI 300 and CSI 500 stocks as the research objects. First, it carries out kernel function tests and parameter optimization for the kernel support vector machine; it then predicts and optimizes portfolios for the market-neutral stock selection strategy and the individual-stock equal-weight strategy. The experimental results show that the multi-factor model based on SVM has strong predictive power for stock selection, and that this predictive power differs across kernel functions.


Introduction
Built on computer technology, quantitative investment uses well-founded algorithms to guide investment decision-making. Combined with traditional investment concepts, it can obtain higher excess returns, and it has therefore gradually been accepted by investors. In stock investment, the multi-factor stock selection model is widely used. Early on, Fama et al. (1970) showed that stock prices are jointly determined by multiple factors and that a single factor is insufficient to describe the intrinsic value of listed companies accurately. Asness et al. (1997) explored more representative fundamental factors to construct stock selection portfolios and obtain excess returns. More recently, Wang (2016) used a scoring model based on equal-weight rating to select factors, build portfolios and conduct empirical analysis. Su (2018) constructed a multi-factor stock selection model by removing redundant factors with a fuzzy clustering algorithm.
Vapnik (1995) proposed the support vector machine algorithm, which later became one of the most widely used machine learning models in finance. Sai (2013) used a genetic algorithm and a particle swarm optimization algorithm to optimize the kernel function of support vector machines, and then used the optimized SVM to predict the price of stock index futures, achieving good results. Han (2016) showed that the initial predictions of time series models such as GARCH or AR can improve the prediction accuracy of SVM. Huang (2017) combined the support vector machine with the traditional Fama-French three-factor model to construct a new stock selection strategy; an empirical analysis of A-shares showed the new strategy to be more profitable.
In summary, existing research has mostly focused on using machine learning algorithms to optimize traditional time series prediction models, or on optimizing traditional factors to select a single strategy; empirical analysis of strategy combinations is comparatively scarce. Therefore, this paper tests various kernel functions of the kernel support vector machine. Based on the test results, a portfolio model of stock selection strategies applied to the CSI 300 and CSI 500 is constructed. At the same time, through comparative analysis of two portfolio strategies, the market-neutral strategy and the individual-stock equal-weight strategy, we build a more profitable and more robust multi-factor stock selection model, providing new ideas for the application and development of machine learning methods in financial investment.

Classical Multifactor Model
The classical multifactor model is essentially a linear regression of future returns on current factor exposures.
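A commonly used textbook form of such a model is sketched below. The symbols are assumptions chosen for illustration and may differ from the authors' notation: $r_i$ is the next-period return of stock $i$, $X_{ik}$ is the current exposure of stock $i$ to factor $k$, $f_k$ is the return of factor $k$, $u_i$ is the residual return, and $K$ is the number of factors.

```latex
r_i = \sum_{k=1}^{K} X_{ik} f_k + u_i
```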

Nonlinear Classification
The core idea of the kernel support vector machine is to transform nonlinear classification into linear classification. First, the original data are mapped into a high-dimensional feature space by a nonlinear mapping; the data in the high-dimensional space are then classified by a linear support vector machine, which solves the nonlinear classification problem:

$\phi(x) = (\phi_1(x), \ldots, \phi_k(x), \ldots)$

In practice, evaluating the objective function of the dual optimization problem in the high-dimensional space is too expensive, so the kernel function technique is used to avoid an explicit expression of the high-dimensional features, which neatly sidesteps the "curse of dimensionality".
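The idea that a kernel replaces the explicit high-dimensional mapping can be checked on a small case. The sketch below assumes a 2-D input and a homogeneous degree-2 polynomial kernel, chosen only for illustration; the names `phi`, `dot` and `poly2_kernel` are hypothetical and not taken from the paper.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # inner product taken in the 3-D feature space
implicit = poly2_kernel(x, z)   # same quantity, computed in the 2-D input space
print(explicit, implicit)
```

Both numbers agree up to floating-point rounding, although `poly2_kernel` never builds the mapped vectors. For kernels such as the Gaussian kernel the feature space is infinite-dimensional, so only the implicit route is available.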

Kernel Function
After the original data $x$ are transformed into high-dimensional data $\phi(x)$ by the nonlinear mapping, the objective function of the dual problem of the linear support vector machine involves the mapped samples only through their inner products, which the kernel function computes directly. For a new sample, we calculate the discriminant function and determine which class the sample belongs to according to whether its value is greater than or less than zero. Note that the objective function does not contain an explicit expression of the low-dimensional to high-dimensional mapping and depends only on the choice of kernel function, so this classifier is called the kernel support vector machine (Kernel SVM). Common kernel functions are shown in Table 1.
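For reference, the standard soft-margin forms of the dual objective and the discriminant function are sketched below, followed by typical parameterizations of the four kernel families named in this paper. The notation is an assumption and may not match the authors' symbols: $\alpha_i$ are Lagrange multipliers, $C$ is the penalty coefficient, $b$ is the bias, $N$ is the number of training samples, and $\gamma$, $r$, $d$ are kernel parameters in the form used by common SVM libraries.

```latex
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C,
\qquad K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)

f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b

K_{\text{linear}}(x_i, x_j) = x_i \cdot x_j, \qquad
K_{\text{poly}}(x_i, x_j) = (\gamma \, x_i \cdot x_j + r)^d

K_{\text{Gaussian}}(x_i, x_j) = \exp\left(-\gamma \, \lVert x_i - x_j \rVert^2\right), \qquad
K_{\text{sigmoid}}(x_i, x_j) = \tanh(\gamma \, x_i \cdot x_j + r)
```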

Table 1. Common kernel functions (columns: kernel function, expression; first row: linear kernel function)
Figure 1. Classification results of support vector machines with different kernel functions
The classification performance and decision boundaries of different kernel functions differ. Figure 1 shows the classification of the same data set using different kernel functions.
b). Missing value processing: after the new factor exposure sequence is obtained, missing factor exposures are set to the average value of stocks in the same CITIC first-level industry.
c). Industry and market-capitalization neutralization: the factor exposure after missing-value filling is regressed linearly on industry dummy variables and the logarithm of market capitalization, and the residual is taken as the new factor exposure.
d). Standardization: the neutralized factor exposure sequence has its current cross-sectional mean subtracted and is divided by its standard deviation, giving a new sequence that approximately follows the $N(0,1)$ distribution.
e). Principal component analysis: to avoid collinearity between features, principal component analysis is conducted on the exposures of the 70 standardized factors, and the transformation yields new features in 70 dimensions.
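Steps b) to d) can be sketched for a single factor on one cross-section as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy data are hypothetical, every industry is assumed to have at least one observed value, the population standard deviation is used, and the PCA of step e) is omitted. The neutralization uses the fact that an OLS regression on a full set of industry dummies plus one extra regressor is equivalent to demeaning within industry and then running a one-variable regression.

```python
import math
from statistics import mean, pstdev


def fill_missing(values, industries):
    """Step b): replace None with the mean of observed values in the same industry."""
    observed = {}
    for v, ind in zip(values, industries):
        if v is not None:
            observed.setdefault(ind, []).append(v)
    return [v if v is not None else mean(observed[ind])
            for v, ind in zip(values, industries)]


def neutralize(values, industries, market_caps):
    """Step c): residual of an OLS fit on industry dummies and log market cap."""
    def demean(series):
        # Subtract each industry's mean, which removes the dummy-variable part.
        groups = {}
        for s, ind in zip(series, industries):
            groups.setdefault(ind, []).append(s)
        means = {ind: mean(g) for ind, g in groups.items()}
        return [s - means[ind] for s, ind in zip(series, industries)]

    y = demean(values)
    x = demean([math.log(c) for c in market_caps])
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx if sxx > 0 else 0.0
    return [yi - beta * xi for xi, yi in zip(x, y)]


def standardize(values):
    """Step d): subtract the cross-sectional mean and divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Toy cross-section: six stocks in two hypothetical industries.
exposure = [1.0, None, 3.0, 2.0, 6.0, 4.0]
industry = ["bank", "bank", "bank", "tech", "tech", "tech"]
cap = [100.0, 200.0, 400.0, 50.0, 100.0, 200.0]

filled = fill_missing(exposure, industry)
residual = neutralize(filled, industry, cap)
z = standardize(residual)
```

After these steps the residual exposure is uncorrelated with log market capitalization and sums to zero within each industry, which is the property the neutralization step is meant to deliver.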

Multi-Factor Stock Selection Model Based on Kernel Support Vector Machine
(4). Combining the training set and cross-validation set:
a). Classification problem: for the support vector machine model (hereinafter referred to as SVM), in the cross-section at the end of each month, the stocks ranked in the top 30% and bottom 30% by next month's return are selected as positive examples ($y = 1$) and negative examples ($y = -1$), respectively. The samples from the 72 months are merged; 90% of the samples are randomly selected as the training set and the remaining 10% as the cross-validation set.
b). Regression problem: for the support vector regression model (hereinafter referred to as SVR), the sample data from the 72 in-sample months are merged directly and likewise divided into a training set and a cross-validation set in proportions of 90% and 10%.
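The labeling and splitting rule of step (4) a) can be sketched as below. The function names, the fixed random seed and the toy returns are assumptions made for illustration; in particular, reusing one month's labels for all 72 months only serves to show the sample counts.

```python
import random


def label_cross_section(next_returns, frac=0.3):
    """Label one month: +1 for the top `frac` by next-month return, -1 for the
    bottom `frac`; stocks in the middle are left out of the sample."""
    n = len(next_returns)
    k = int(n * frac)
    order = sorted(range(n), key=lambda i: next_returns[i])  # ascending returns
    labels = {}
    for i in order[:k]:
        labels[i] = -1
    for i in order[n - k:]:
        labels[i] = 1
    return labels  # maps stock index to label


def split_train_cv(samples, train_frac=0.9, seed=0):
    """Randomly split pooled samples into training and cross-validation sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


# Toy next-month returns for ten stocks in one cross-section.
returns = [0.05, -0.02, 0.10, 0.00, -0.07, 0.03, 0.08, -0.01, 0.02, -0.04]
labels = label_cross_section(returns)

# Pool 72 monthly cross-sections (the same toy month repeated) and split 90/10.
pooled = [(month, stock, y) for month in range(72) for stock, y in labels.items()]
train, cv = split_train_cv(pooled)
```

Dropping the middle 40% of stocks keeps only the clearest examples of each class, at the cost of a smaller training sample.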

(5). In-sample training:
The training set is used to train the SVM. Five kernel functions are tested: the linear kernel, the 3rd-order polynomial kernel, the 7th-order polynomial kernel, the Sigmoid kernel and the Gaussian kernel. A linear regression model back-tested on a 12-month rolling window serves as the unified control group. All models are shown in Table 3.
After the optimal parameters are determined, the preprocessed features of all samples (i.e., individual stocks) at the end of month $T$ are taken as the model input, and the predicted value $f(x)$ for month $T+1$ (the discriminant function value, which measures how far the sample lies from the separating hyperplane) is obtained for each sample. Strategy portfolios can be constructed from this predicted value.
(8). Model evaluation:
Evaluation indicators cover two aspects. First, the test-set accuracy, AUC and other indicators measure the performance of the model. Second, the performance of the strategy portfolio constructed in the previous step is assessed (including annualized excess return, information ratio, etc.).
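The two groups of indicators can be sketched as follows. The definitions are common conventions assumed for illustration (geometric annualization, information ratio scaled by the square root of 12, drawdown measured on the cumulative excess-return curve); the paper does not state its exact formulas, and all function names and numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def accuracy(labels, scores):
    """Share of samples whose sign of f(x) matches the +1/-1 label."""
    hits = sum(1 for y, s in zip(labels, scores) if (s > 0) == (y > 0))
    return hits / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y > 0]
    neg = [s for y, s in zip(labels, scores) if y <= 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def annualized_excess_return(monthly_excess):
    """Geometric annualization of a series of monthly excess returns."""
    growth = math.prod(1 + r for r in monthly_excess)
    return growth ** (12 / len(monthly_excess)) - 1


def information_ratio(monthly_excess):
    """Mean monthly excess return over its standard deviation, annualized."""
    return mean(monthly_excess) / stdev(monthly_excess) * math.sqrt(12)


def max_drawdown(monthly_excess):
    """Largest peak-to-trough loss of the cumulative excess-return curve."""
    nav, peak, worst = 1.0, 1.0, 0.0
    for r in monthly_excess:
        nav *= 1 + r
        peak = max(peak, nav)
        worst = max(worst, 1 - nav / peak)
    return worst


def calmar_ratio(monthly_excess):
    """Annualized excess return divided by maximum drawdown (undefined if it is zero)."""
    return annualized_excess_return(monthly_excess) / max_drawdown(monthly_excess)


# Toy inputs: four labeled samples and six months of excess returns.
y_true = [1, 1, -1, -1]
f_x = [0.8, -0.1, 0.3, -0.5]
excess = [0.02, -0.01, 0.03, -0.04, 0.01, 0.02]

print(accuracy(y_true, f_x), auc(y_true, f_x))
print(annualized_excess_return(excess), information_ratio(excess),
      max_drawdown(excess), calmar_ratio(excess))
```

AUC depends only on the ranking of $f(x)$, so it is unaffected by the zero threshold that accuracy uses; this is why the two are usually reported together.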

Empirical Analysis
This study constructs stock selection strategies within the CSI 300 and CSI 500 and back-tests them. The strategies fall into two categories. One is the industry-neutral strategy, in which the industry allocation of the portfolio is consistent with that of the benchmark (CSI 300 or CSI 500). The other is the individual-stock equal-weight strategy.
For the industry-neutral strategy within the CSI 300 constituents, when the number of stocks selected in each industry is greater than or equal to 10, the annualized excess returns, information ratios and Calmar ratios of all SVM models except the 7th-order polynomial kernel are higher than those of the linear regression control group, and their maximum drawdowns of excess returns are smaller than that of linear regression; the Gaussian kernel and the 3rd-order polynomial kernel perform best.
For the individual-stock equal-weight strategy within the CSI 300 constituents, when the total number of stocks selected is greater than or equal to 100, the annualized excess return and information ratio of the Gaussian kernel and the 3rd-order polynomial kernel are higher than those of linear regression.
For the industry-neutral strategy within the CSI 500 constituents, when the number of stocks selected in each industry is between 5 and 10, the annualized excess returns, information ratios and Calmar ratios of all SVM models except the 7th-order polynomial kernel are significantly higher than those of the linear regression model; the Gaussian kernel, the 3rd-order polynomial kernel and the Sigmoid kernel perform best.
For the individual-stock equal-weight strategy within the CSI 500 constituents, when the total number of stocks selected is greater than or equal to 100, the annualized excess returns, information ratios and Calmar ratios of all SVM models except the 7th-order polynomial kernel are significantly higher than those of the linear regression model, and their maximum drawdowns of excess returns are smaller than that of linear regression.
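The two portfolio constructions compared above can be sketched from the predicted values as follows. This is an illustration under assumptions rather than the authors' procedure: stocks are weighted equally within each industry, the tickers, industries and benchmark weights are made up, and the function names are hypothetical.

```python
from collections import defaultdict


def industry_neutral_portfolio(scores, industries, benchmark_weights, n_per_industry):
    """Pick the top-n stocks by predicted value in each industry and scale them
    so that each industry's total weight equals its benchmark weight."""
    by_industry = defaultdict(list)
    for stock, score in scores.items():
        by_industry[industries[stock]].append((score, stock))
    weights = {}
    for ind, members in by_industry.items():
        picked = sorted(members, reverse=True)[:n_per_industry]
        for _, stock in picked:
            weights[stock] = benchmark_weights[ind] / len(picked)
    return weights


def equal_weight_portfolio(scores, n_total):
    """Pick the top-n stocks overall by predicted value and weight them equally."""
    picked = sorted(scores, key=scores.get, reverse=True)[:n_total]
    return {stock: 1.0 / len(picked) for stock in picked}


# Toy predicted values f(x) for six stocks in two hypothetical industries.
scores = {"A": 0.9, "B": 0.2, "C": -0.3, "D": 0.5, "E": 0.7, "F": -0.1}
industries = {"A": "bank", "B": "bank", "C": "bank",
              "D": "tech", "E": "tech", "F": "tech"}
benchmark = {"bank": 0.6, "tech": 0.4}

neutral = industry_neutral_portfolio(scores, industries, benchmark, n_per_industry=2)
equal = equal_weight_portfolio(scores, n_total=3)
print(neutral)
print(equal)
```

The industry-neutral portfolio holds A and B at 0.3 each and D and E at 0.2 each, matching the benchmark's 60/40 industry split, whereas the equal-weight portfolio holds A, E and D at one third each and lets industry weights drift with the predictions.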

Conclusion
This study tested support vector machines with several kernel functions, including the linear, polynomial, Gaussian and sigmoid kernels, and used the SVM model to construct stock selection strategies within the CSI 300 and CSI 500. Back-test analysis shows that the multi-factor model based on SVM has a strong ability to predict stock selection returns, and that prediction ability differs across kernel functions. These results provide a further theoretical basis for the application of machine learning in quantitative investment.

Figure 2. Schematic diagram of the multi-factor stock selection model based on the kernel support vector machine
Figure 3. Cross-validation AUC for the Gaussian kernel model parameters C and γ
Figure 4. Comparison of key indicators of SVM models with different kernel functions (market neutral, stock selection within the CSI 300 constituents)

Figure 6. Comparison of key indicators of SVM models with different kernel functions (market neutral, stock selection within the CSI 500 constituents)

Table 3. Test model overview