Introduction

Predicting the Stock return is a challenging endeavour, given the nonlinear nature of the stock market and the different approaches to predict the stock change. Though, advancements in artificial intelligence and other superior models have been used to increase forecasting accuracy, the prediction accuracy rate is still an unresolved issues1.

Enormous amount of attention in the empirical asset pricing literature has been directed to answer the questions of what drives the stock prices and what input features play major role in generating accurate results. In early years, Fama proposed in a weak-form market, people can make abnormal returns by mastering fundamental information, such as financial statements2. However, many scholars doubt the financial ratios do not consistently outperform the historical average benchmark forecast out of sample3. In addition other researchers started with the price trend itself, using technical indicators and found that technical indicators were efficient in predicting the market in the past3.

Thomas Fischer's 2018 study in the US market, utilizing the LSTM model, stands as one of the most referenced papers in cross-sectional stock selection. It unveiled a 51.4% accuracy rate between 1992 and 2015. However, the assessment of alpha between 2010 and 2015 showed a stagnant cumulative alpha of zero, highlighting limitations in this strategy during that period4. A subsequent parallel study by Ghosh et al. expanded input variables from one to three features but omitted classification accuracy. Notably, it demonstrated significant alpha improvements with a positive trajectory from 2010 to 2015, which unfortunately turned negative from 2016 to 20195.

The persistence of challenges in maintaining consistent alpha generation seems to stem from limitations in input variables. Hanauer et al. addressed this issue by integrating fundamental indicators into their machine learning (ML) models, resulting in an average risk-adjusted alpha of approximately 6% for European stocks6. In contrast, Liu et al. study focusing on China's stock market employed a Deep Neural Network (DNN) incorporating 36 price-related trend features and 5 fundamental factors. Despite initially achieving a validation accuracy of 55.46%, the subsequent inclusion of trend-related features led to a decrease in accuracy, reaching 49.79% and falling short of the established 50% benchmark7.

As observed in the previous paragraph, advancements in input sources have significantly bolstered the accuracy and alpha performance within DNN models. This progress is especially notable in their proficiency for pattern recognition and predicting price changes, resulting in substantial enhancements in alpha generation. However, despite these advancements, several drawbacks persist, primarily centred around challenges in data integration and feature engineering. Multiple data sources exist, including technical and fundamental indicators, yet a comprehensive framework for their cohesive integration remains absent.

From the array of fundamental and technical indicators discussed earlier, the initial selection of features often involves manual intervention. This process heavily leans on existing domain expertise to guide and determine which features are chosen for inclusion in the analysis8,9. The ascent of smart beta investing has significantly reshaped the financial domain. Over the last decade, the surge in smart beta funds has been remarkable. In the past, the market exhibited a higher prevalence of discernible anomalies or 'alpha features'. However, the adoption of smart beta strategies based on existing alpha formulas by more funds has led to a decline in alpha effects due to increased capital flow. Consequently, even seasoned experts in the field face mounting challenges in identifying distinctive features. The pursuit of formula-driven, linear, and easily explicable features—vital elements in expert-driven extraction—is becoming less effective. This has spurred the emergence of AI-based feature engineering methods.

This paper aims to elevate the accuracy of cross-sectional stock return prediction and augment the average risk-adjusted return ('alpha') within the DNN framework. It builds upon Thomas Fischer's LSTM model by integrating additional input sources and proposing a novel feature engineering method involving symbolic genetic programming (SGP). This approach aims to address feature engineering limitations, enriching both fundamental and technical features. Furthermore, tailored LSTM models are crafted to suit the distinctive attributes of the dataset. Consequently, significant enhancements in accuracy, precision, and recall rates are observed, surpassing the performance of both Thomas Fischer's and Ghosh's LSTM models. Additionally, our method notably amplifies the Rank Information Coefficient (IC) and the Information Ratio of IC (ICIR), resulting in a substantial improvement in alpha compared to the benchmarks set within the frameworks of Fischer and Ghosh.

Moreover, we aim to synthesize the findings of our study into a simple and rule-based strategy for a complete active index fund strategy for selecting winning and losing stocks, compared with the benchmark. Our hybrid model exhibits superior performance compared to the CSI 300 and CSI 500 indexes. Notably, our strategy consistently outperforms these indexes by an average of 31% and 24.48% per year, respectively. Additionally, it surpasses the average returns of the entire market by 17.38% annually. We also calculate the information ratio of the strategy, and it is found that it is 2.49, and this further highlighting its effectiveness.

The remaining sections of this paper are organized as follows: Section "Related works" will cover related works, including existing DNN models and their combinations with Genetic Algorithms. In Section "The proposed deep neural network", we provide an in-depth discussion of the methodology, including enhanced SGP for new features, the proposed architecture of the Symbolic Genetic Programming (SGP-DNN model), input data descriptions, forecasting horizon, segmentation predictions method and the trading strategy setting. Section "Experiments with five classical DNN frameworks for comparison" will focus on the experiments. Section "Conclusion" will conclude the paper.

Related works

The earliest study on applying machine learning in the stock domain can be traced back to 2006, where an accurate event weighting method and an automated event extraction system were presented10. However, there are several limitations to machine learning models. The challenges come from the employed dataset. Traditional machine learning models are best suited for small or medium-sized datasets and have limitations in processing high-dimensional datasets. They are prone to encountering the curse of dimensionality, especially for big or massive datasets, such as high-frequency or unstructured data11.

Comparing with machine learning algorithms, the Deep Neural Networks (DNNs) have significant advantages when it comes to handling large sets of time series data. LSTM is the most used model and advantageous over the conventional RNN due to the reason that it overcomes the problems of gradient vanishing or exploding. In 2015, Chen et al. built an LSTM-based model for the China stock market12. The most referenced paper for LSTM in the application in finance data was done by Thomas Fischer and Benedikt Kraus. They were the first to deploy the LSTM network on large-scale financial time series data and explained the source of the black box, which is high volatility, below-mean momentum, and extremal directional movement4. Following Fisher's work, four primary variants or supplementary approaches emerged as extensions to the single LSTM model: data decomposition, data dimension reduction, data augmentation techniques and Genetic Algorithm (GA) combination techniques.

Primarily, in the realm of data decompositions, traditional methods such as wavelet de-noising has been employed to stock index prediction since 201913,14. The utilization of state-of-the-art techniques like Empirical Mode Decomposition (EMD) and Complex Empirical Mode Decomposition (CEEMD) has been prominent. This trend has notably continued since 2020. EMD and CEEMD have been notably applied to indices like the SP500, Dow Jones, HSI, DAX, SSE, and Nikkei. These methods break down the data into 6 to 8 frequency components, which are subsequently fed either individually or collectively (alongside residuals) into different Convolutional Neural Networks (CNNs). The output from these CNNs is then directed to LSTM components, or in some cases, directly to individual LSTM components. This complex pipeline is designed for the purpose of forecasting index price changes15,16.

Secondly data dimension reduction techniques have also been used with LSTM, numerous scholars have integrated Principal Component Analysis (PCA) with DNN models to achieve dimension reduction. Yong’an Zhang introduced the CEEMD-PCA-LSTM model for time series prediction. Preceding the LSTM model, input sources undergo processing by a PCA model to condense dimensions, thereby extracting abstract and advanced features. This process not only enhances computational efficiency but also contributes to improved predictive capabilities17. By 2023, even transformer models with fused multi-source features have been proposed, leveraging the ITD (intrinsic time-scale decomposition) method to manage feature dimensions effectively18.

Third supplementary approach is data augmentation. Fisher’s attempt of LSTM is single LSTM module, and the attributes of overfitting was challenged by other scholars due to the limited availability of data points. Yujin presented a novel data augmentation approach to avoid the overfitting and propose ModAugNet Framework including two modules, one is overfitting prevention LSTM module, and another is prediction LSTM module. The number of data point has been increased by 252 times19. We could also find the Phase Space Reconstruction (PSR) method13 or feature expansion method20 for data augmentation.

Finally, the combination of Genetic Algorithm (GA) and Deep Neural Network (DNN) or other Machine Learning models has been utilized by many researchers to improve prediction accuracy. For the application of GA in conjunction with Deep Neural Networks (DNNs), two main applications can be observed: hyperparameter tuning and feature selection.

Hyperparameter tuning is a crucial aspect that needs to be addressed in the optimization process, including parameters setting such as the number of layers, nodes per layer, and number of time lags. GA is frequently employed to search for optimal hyperparameters for DNN. In 2018, Chung and Shin employed GA to identify the optimal number of time lags and LSTM units for hidden layers in stock prediction models21. In a similar study in 2019, Chung and Shin optimized the kernel size, kernel window, and pooling window size for CNN22. In addition, GA has been used to determine appropriate hyperparameters and input data sizes for Generative Adversarial Networks (GANs) in stock prediction by He and Kita in 202123. These studies demonstrate the effectiveness of GA in optimizing the hyperparameters of various deep learning models for stock prediction.

As for the feature selection, many researchers combine GA and other DNN model to reduce input variables and enhance calculation speed by selecting appropriate factors from a large pool of candidate variables. For instance, Chen and Zhou used GA to rank factor importance and select features for a Long Short-Term Memory (LSTM) model, while Milad employed GA as a heuristic approach for selecting relevant features for an Artificial Neural Network (ANN)24,25. Li utilized a multilayer GA to select features and reduce high dimensionality in a stock dividend dataset26. Recently, Yun revised GA-based selection methods to a two-stage process, using a wrapper method to select important features to avoid the curse of dimensionality, followed by the filter method to select more critical factors27.

The challenges associated with the aforementioned methods are distinct:

  1. 1

    Data decomposition methods are commonly utilized in stock index prediction rather than in the individual stock selection process. The unequal frequencies obtained from time series data pose a significant challenge, hindering the parallel aggregation of decomposed features for individual stocks.

  2. 2

    Principal Component Analysis (PCA) limitations align with theoretical expectations but often diverge from expected performance in empirical scenarios, demonstrating diminished effectiveness.

  3. 3

    Existing data augmentation methods are relatively simplistic and exhibit limited efficacy in improving accuracy or alpha effects. These methods mainly expand existing features without notably enhancing their value or informativeness.

  4. 4

    Tuning a Deep Neural Network (DNN) faces challenges from models with numerous parameters. Even with genetic algorithm integration, computational demands persist. While the genetic algorithm only reduces factors in feature selection.

An encouraging approach is to integrate Genetic Algorithms (GA) principles into the Data Augmentation method. This innovative strategy aims to leverage GA concepts to actively evolve factors and select features from this Genetic Evolved Method. This may lead to more effective feature sets. We'll delve into this proposed novel GA-based data augmentation method in methodology part.

The proposed deep neural network

In Artificial Intelligence (AI), Deep Neural Network (DNN) falls under the subset of Machine Learning and Neural Network28. DNN is based on the artificial neural network (ANN) which contained one or several layers between the input and output layers. In each layer it consists of the same components, and they are neurons, synapses, weights, biases, and functions29.

The proposed SGP-DNN framework comprises four primary phases: data pre-processing, data augmentation, filtered factor transformation, and feature extraction. As depicted in Fig. 1, during Phase 1, we acquire the dataset from the Alpha Factor Library by S&P Global Market Intelligence, which includes raw fundamental and technical indicators. Instead of conducting feature selection at this stage, our focus is on standard data pre-processing steps such as handling missing values, deleting outliers or noise, and performing feature normalization using the z-score method. Concurrently, we define heuristic formulas to aid the subsequent SGP programming phase.

Figure 1
figure 1

Flowchart of the proposed Deep Neural Network Framework.

In Phase 2, SGP operates similarly to a conventional Genetic Algorithm, involving steps like selection, crossover, and mutation. However, it notably differs by utilizing both heuristic generators and normalized features from phase 1 as inputs. A comprehensive explanation of SGP will be provided in Section "data augmentation: symbolic genetic programming". The outcomes of phase 2 consist of optimal features developed through the evolutionary process of SGP, tailored to its specific customized fitness function.

After Phase 2, a set of optimal features (augmented data) is obtained. Phase 3 involves defining rules for feature filtering, considering both their fitness for the model's function and their relevance to the prediction target. These filtered features then undergo transformation to ensure they are of a suitable size and sequence for subsequent use in the LSTM model. The output of Phase 3 comprises the filtered features arranged in an appropriate sequence, in this instance, with lags of 15 days.

The last phase (4) revolves around feature extraction using a DNN model. Specifically, we utilized the Long Short-Term Memory (LSTM) method to discern non-linear patterns, aiming to enhance the accuracy of stock return prediction. Selected features, arranged in sequence from Phase 3, are inputted into a two-layered LSTM model and utilize the final hidden layer output for the final prediction. However, in the feature extraction phase, we also experiment the filtered factors with Multi-Layer Perceptron (MLP). The objective is to observe whether LSTM or MLP could handle the raw data in the extracting pattern according to the nature of feature source (fundamental or technical indicators). A detailed explanation of the DNN model selection process will be presented in Section "Feature extraction method: LSTM vs MLP".

The effectiveness for classification of the prediction from phase I to phase IV is measured using Accuracy, Precision, Recall, Rank Information Coefficient (Rank IC) and Information Ratio of IC (ICIR) as performance (sensitivity) metrics for cross-sectional price change prediction, as demonstrated in Eqs. 1 to 5 In the next Section, we present the discussion on the dataset, software and hardware used in this study, as well as the elaboration on phase I, data augmentation and phase II, feature extraction.

Rank Information Coefficient (Rank IC) serves as a pivotal tool for appraising predictive model efficacy, particularly in portfolio formation during stock selection across a range. This metric evaluates the correlation in rankings between predicted scores of diverse securities and their realized returns, prioritizing relative rankings over precise predictions. It notably facilitates cross-sectional selections aimed at securing alpha or risk-adjusted returns for portfolios. Worth noting is that the sign of Rank IC holds less significance than its magnitude. A positive Rank IC suggests that higher stock values anticipate relatively larger returns, while a negative Rank IC signifies that lower stock values predict larger returns. Additionally, the Information Ratio of Rank IC (ICIR) parallels the Sharpe ratio for a portfolio, providing further insights into its performance.

$${\text{Accuracy}}=\frac{Number\, of\, correct\, predictions}{Total\, number\, of\, predictions}$$
(1)
$${\text{Precision}}=\frac{True\, Positives}{True\, Positives+False\, Positives}$$
(2)
$${\text{Recall}}\frac{True\, Positives}{True\, Positives+False\, Negatives}$$
(3)
$$\mathrm{Rank\, Information\, Coefficient}\left(\mathrm{Rank IC}\right)=\frac{\sum_{i=1}^{n}({Rx}_{i}-\overline{Rx})({Ry}_{i}-\overline{Ry})}{\sqrt{\sum_{i=1}^{n}{({Rx}_{i}-\overline{Rx})}^{2}}\sqrt{\sum_{i=1}^{n}{({Ry}_{i}-\overline{Ry})}^{2}}}$$
(4)

(R denoted Rank).

$$\mathrm{Information\, ratio\, of\, IC}({\text{ICIR}})=\frac{{\text{IC}}}{\mathrm{Standadized\, Deviation\, of\, IC}}$$
(5)

Dataset, software and hardware

In this study, two types of data were utilized during the experiments: fundamental indicators and technical indicators. Fundamental indicators comprise data derived from three types of financial statements, namely the balance sheet, profit and loss report, and cash flow report. On the other hand, technical indicators are based on price and volume, providing users with patterns of momentum and reversal. Prior to processing the data using the proposed method, an analysis based on Rank IC was conducted. Rank IC describes the correlation between predicted and actual stock returns, thereby indicating the degree of alignment between the analyst's fundamental and technical forecasts and the actual financial results. The Information Coefficient (Rank IC) is a numerical measure that ranges from 1.0 to − 1.0. A value of − 1 indicates a perfect negative relationship between the analyst's forecasts and the actual results, while a value of 1 indicates a perfect positive match between the forecasts and the actual results. This metric is highly important when making informed investment decisions, especially in the evaluation of cross-sectional stock returns forecasting. Typically, an information ratio of IC (ICIR) within the range of 0.40 to 0.60, and Rank IC values exceeding 5% in absolute terms, are considered highly favorable in this context.

The data used in this study is dataset of The Alpha Factor Library by S&P Global Market Intelligence30, which includes explainable factors for all A-listed stocks (around 4500 listed companies) in the Shanghai and Shenzhen Stock Exchange Market, including fundamental and technical indicators. The appendix contains a comprehensive description of both types of quantitative indicators (304) and their corresponding Rank IC values from 2015 to 2022. Table 1 presents the average Rank IC (Information Coefficient) of two specific type of quantitative indicators, while Fig. 2 illustrates the ICIR (Information Coefficient Information Ratio) of these indicators.

Table 1 Rank IC mean of the dataset.
Figure 2
figure 2

ICIR mean of the dataset.

For the data preparation and pre-processing, Python 3.8 was employed along with the numpy and pandas packages. The design of DNN models, including LSTM and MLP, was achieved using KERAS 2.4, a package based on Google TensorFlow 2.4. The Symbolic Genetic Programming (SGP) was implemented using the gplearn 0.0.2 package in Python. While the DNN network was trained on NVIDIA GPUs, the remaining models, such as SGP part, were trained on a CPU cluster. Detailed information regarding the software and hardware specifications utilized can be found in Table 2.

Table 2 Descriptions on the software and hardware.

The main aim of this study is to anticipate and forecast changes in cross-sectional stock prices. The target variable is categorized into two statuses: a value of 1 signifies a stock return higher than the medium of cross-sectional stock returns, while a value of 0 indicates a stock return lower than the medium of cross-sectional stock returns over short-term intervals.

The research investigates standard timeframes frequently used in stock predictions, spanning short-term intervals of 5 days (1 week), 10 days (2 weeks), and long-term intervals of 20 days (1 month). This study integrates short-term technical indicators and long-term fundamental indicators. Consequently, the 5-day forecasting period is chosen to assess prediction accuracy. By amalgamating these varied features, the study aims to provide accurate forecasts regarding stock price changes, specifically determining whether they will surpass or fall below the medium of cross-sectional stock prices within the defined 5-day window.

Data augmentation: symbolic genetic programming

The second step of the proposed DNN framework is to investigate the Genetic Algorithm (GA) in the data augmentation phase. Based on literature, Genetic Algorithms are a type of learning algorithm, that would result in a better neural network by crossing over the weights of two good neural networks. This algorithm could also generate and evaluates consecutive generations of humans in order to achieve optimization objectives. The algorithm creates mutation from the stock related indicators by randomly changing the chromosomes or genes of the individual parents. In this situation, GA can be complicated and costly when implemented on the stock related indicators which is nonlinear and having lots of noise or outliers. Therefore, to solve the problem of nonlinear type of data, the Symbolic Genetic Programming (SGP) is employed in this study. SGP has several advantages as it evolve by building blocks. In SGP, it employed the regression analysis which is more robust to search the space in finding the best model to fit the given stock return data. Different from GA, SGP find an intrinsic relationship between two or more variables which is hidden. Typically, there are two types of genes that contribute to the generations.

The first type in the study refers to the input features, while the second type represents the processing operators, encompassing mathematical functions like addition, subtraction, division, and multiplication. Predicting stock price data can be a daunting task, given its complex, dynamic, and non-linear nature. To tackle this challenge, mainstream hedge funds like World Quant, Cubist, and Menelia employ various heuristic operators such as correlation, covariance, and variance. These operators help them analyze and interpret the data, enabling them to make informed investment decisions31, as depicted in Table 3, to enhance the analysis and prediction of stock price data. In this study, an improved Symbolic Genetic Programming (SGP) is proposed, which utilizes symbolic tree expressions to handle and solve complex optimization problems, providing greater flexibility. The four-step approach outlined in Fig. 3 is applied to enhance the performance of the SGP.

Table 3 Heuristic operators.
Figure 3
figure 3

The structure of the proposed Symbolic Genetic Programming.

The first step in our proposed SGP, is to initiate the population of the genes. Here, we introduce the heuristic operators like the Table 3 shows in the reproduction of the genes. To guide the evolution of the genes, we set certain parameters. For instance, we established a probability of 40% for crossover, which involves exchanging genes between two individuals in the population. Additionally, we set a 40% probability for replacement, which involves copying an individual gene in the population. Finally, we assigned a very low probability for three types of mutation to prevent an excessive influx of new input features, which could lead to unpredictability. This helps maintain stability in the incorporation of new genetic material into the population.

Second, we designed and added rolling windows for all heuristic operators to the original SGP. To this end, we randomly generate rolling window seeds between 3 and 20 for rolling window to produce additional symbolic formulas. The third step is to design the fitness function. In this study, calculations are performed to determine the fitness target. In addition to using the original Pearson correlation (Rank IC) between the value of the symbolic formula and future price change as the fitness target, a combined formula will be used. This combined formula takes into consideration both the relatively high cumulated return of the bottom group among all cross-sectional stocks and the maintenance of monotonicity in the cumulated return of k groups based on the order of values in the symbolic formula. By incorporating these factors, the fitness target aims to optimize the performance of the symbolic formula in predicting stock price changes.

The formula is shown from Eqs. (6) to (8) below:

$$To{p}_{R}={\text{max}}(TopR-mean\left(totalR\right),FlopR-mean\left(totalR\right))$$
(6)
$${\text{Monotonicity}}={\text{max}}(\frac{1}{N}\sum_{k=1}^{N}{\text{max}}(0,{Sign(R}_{k}- {R}_{k+1})),\frac{1}{N}\sum_{k=1}^{N}{\text{max}}(0,{Sign(R}_{k+1}- {R}_{k})))$$
(5)
$$\mathrm{Rank\, Information\, Coefficient}=\frac{\sum_{i=1}^{n}({Rx}_{i}-\overline{Rx})({Ry}_{i}-\overline{Ry})}{\sqrt{\sum_{i=1}^{n}{({Rx}_{i}-\overline{Rx})}^{2}}\sqrt{\sum_{i=1}^{n}{({Ry}_{i}-\overline{Ry})}^{2}}}$$
(7)
$${\text{Fitness}}={{\text{Top}}}_{{\text{R}}} +{\uplambda }_{1}\times \mathrm{Monotonicity }+{\uplambda }_{2}\times \mathrm{ Information\, }{{\text{Coefficient}}}^{\mathrm{^{\prime}}}$$
(8)
$$({\mathrm{Default \lambda }}_{1}=0.4 {\mathrm{Default \lambda }}_{2}=2)$$

After obtaining many symbolic formulas based on the above algorithms, the final amendment for SGP is the filter system for the outcomes. The success ratio of Pearson correlation (Rank IC) and the profit and loss ratio (P&L ratio) of Pearson correlation from Eqs. (9) to (10) will be employed to select the final synthetic symbolic formulas generated by the SGP model. These above two ratios will also be used as metrics for the experiment part

$$\mathrm{Success\, Ratio\, of\, Rank\, IC }\left(\mathrm{IC\, success\, Ratio}\right)=\frac{Numbers\, of\, Correct\, Pearson\, IC}{Total\, num\, of\, Pearson\, IC}$$
(9)
$$\mathrm{Profit\, and\, Loss\, Ratio\, }\left(\mathrm{IC\, PNL}\right)=\frac{Mean(\left|Pearson\, IC\right|)}{Standard\, devation(Pearson\, IC)}$$
(10)

Feature extraction method: LSTM vs MLP

The final step in the proposed hybrid DNN framework involves extracting features from the augmented selected data obtained through the SGP process. Feature selection is carried out by creating a Hybrid DNN model that accommodates individual data sources based on their specific characteristics.

Since the development of DNN, the Multiple Layer Perceptron (MLP) was initially introduced as a basic supervised learning algorithm with multiple layers, each consisting of several neurons. However, MLPs have a significant drawback in their ability to handle sequence or time series data effectively. This limitation poses a crucial challenge in stock return forecasting, which heavily relies on the historical states of stocks, following a Markov Chain. To address this issue, a more suitable approach is to utilize the LSTM (Long Short-Term Memory) model, which falls under the category of Recurrent Neural Networks (RNN). LSTMs are specifically designed for sequence modelling tasks and overcome the limitations of MLP. Both LSTM and MLP models are chosen for comparisons, as shown in Fig. 4.

Figure 4
figure 4

Feature extract method: LSTM vs MLP.

In the first step, the original indicators are either inputted into the SGP model (as depicted in Fig. 4) to obtain selected features, which are then fed to the MLP or LSTM model. Alternatively, the original indicators can be directly fed into the MLP or LSTM model for comparison.

The performance of the four experiments is evaluated using metrics such as Rank IC and ICIR to determine the best model based on the dataset's unique characteristics. The optimization goal for all network settings is to minimize Mean Squared Error (MSE), while the performance quality is assessed using Rank IC and ICIR as metric indicators. Finally, the trained network is used to recognize feature patterns, and based on the Enhanced SGP-DNN Framework, simple trading rules suitable for the stock market are formulated. These rules are then ‘backtested’ in stock trading scenarios.

To ensure simulating the real stock investing and considering the ‘backtest’32, the forward rolling window and the segmentation prediction method were followed, the specific details are illustrated in Fig. 5. The whole sample period will be divided into three parts, in the training part, the dataset length is 1020 days which is used to update the model parameters. As for validation part, we use 160 days for tunning and the test part is 20 days and as a result the rolling window is also set as 20 days. The ratio of training set, validation set is taken as 8.5:1 and the real test days is 720 days from 2019-11-30 to 2022-12-31-resulting in a total of 36 non-overlapping trading periods.

Figure 5
figure 5

Train/validation/test set for rolling window.

Forecasting, ranking, and trading

The SGP-DNN model utilizes available information prior to time t to forecast the future price change of each stock. Its objective is for each stock to surpass the average price changes observed in the cross-sectional market during the subsequent period t + 1. To achieve this, the model ranks all cross-sectional stocks (4500 in total) in ascending order based on the predicted return by SGP-DNN for the next period. The highest-ranked stocks form the top group, and historically, we have divided the entire cross-sectional stocks into 10 groups, each containing 450 stocks. This ranking score serves as a basis for long only portfolio construction.

Long-Only Portfolio Strategy: The Long-Only Portfolio Strategy focuses on taking long positions in the top k stock portfolios, which are then held for a single period (t + 1). To gauge the effectiveness of this strategy, we will compare its performance against the CSI 300 and CSI 500 benchmarks (denoted as Excess R above 300 and Excess R above 500). These benchmarks represent broad-based indexes in the Chinese stock market. Moreover, we will also consider the average performance of an equal-weighted portfolio as a third performance benchmark (denoted as Excess R above average), the sharp ratio of Excess R above average (Sharp Ratio) will be also measured as the metrics in experiment part.

Experiments with five classical DNN frameworks for comparison

In this study, two types of raw data—fundamental and technical indicators—are utilized to evaluate the Neural Network's performance. The objective is to experiment with both datasets and identify potential discrepancies in outcomes.

Preceding this analysis, an extensive comparative test was conducted. The subsequent metrics will illustrate comparisons among several models: classical Thomas Fischer's LSTM model, Ghosh’s three features LSTM model, single LSTM model with raw features, PCA-LSTM model with raw features, and proposed SGP-LSTM model with raw features.

Figure 6 displays our comparison among these 5 models, utilizing a train/validation/test split without employing rolling window or segmentation prediction methods for direct comparisons. The entire sample period was divided into a 70% train set, a 20% validation set, and a 10% test set, enabling a thorough evaluation of the comparison metrics.

Figure 6
figure 6

Train/validation/test set for the whole sample period.

In Fig. 7, the study employs Thomas Fischer’s LSTM and Ghosh’s Three Factors LSTM models for cross-sectional stock selection using Chinese stock data. Additionally, a PCA-LSTM model was developed based on Zhang et al.'s methodology17. The figure also presents the Single-LSTM model and the proposed SGP-LSTM model, both utilizing raw features.

Figure 7
figure 7

The DNN frameworks comparisons for 5 models.

The examination of metrics across five models in Table 4 unveils intriguing discoveries. Thomas Fischer's LSTM model notably improves accuracy, achieving 53.8% in the Chinese stock market, surpassing its previously reported 51.4% accuracy in the US stock market (1992 to 2015). Unexpectedly, despite incorporating two additional features into Ghosh's model, there's no observable accuracy enhancement within the Chinese stock market (53.2% for Ghosh compared to Thomas' 53.8%).

Table 4 Metrics of Comparisons among different models.

In the related work section, I highlighted the persistent trend of negative accumulated alpha observed in both Thomas and Ghosh's LSTM models over the past decade. Table 5's data reveals a compelling pattern: the ratio between forecasted positive and negative outcomes exceeds 2, notably peaking at 3.75 in Ghosh's model. This underscores a consistent bias toward predicting positive outcomes. For instance, in Ghosh's model, out of over 4700 stocks, 3727 are forecasted as positive compared to 995 forecasted as negative, showcasing a recurring inclination to predict individual stock returns surpassing the median of all sectional stocks. This trend may significantly contribute to the negative accumulated alpha observed within the top group portfolio (trading rules for the 10 groups outlined in Section "Forecasting, ranking, and trading"), to be further explored in Section "Proposed models comparison with Fisher’s LSTM model, Ghosh’s three features LSTM model" through experimentation. A similar pattern is evident in Thomas's model. Conversely, the other three models exhibit a more balanced distribution between positive and negative forecasted outcomes.

Table 5 The metrics of fundamental indicators for DNN with MLP or LSTM.

Table 4 showcases the SGP-LSTM model achieving the highest accuracy rate of 53.70% in the test set. However, upon comparing accuracy and Rank IC, the Single-LSTM model (52.80%, 7.3%) appears less effective compared to the PCA-LSTM model (53.50%, 9.0%), suggesting an advantage of PCA-LSTM in predictive effectiveness over Single-LSTM. Notably, substituting PCA with SGP led to an improvement in accuracy and Rank IC from (53.5%, 9.0%) to (53.7%, 9.3%). These results signify that SGP-LSTM outperforms PCA-LSTM, validating the efficacy of data augmentation or decomposition methods.

Additionally, considering both the balanced distributions for positive and negative forecasted outcomes and the Rank-IC, the Proposed SGP-LSTM model demonstrates superiority over the other four models. Further exploration of this superiority will be detailed in Section "Proposed models comparison with Fisher’s LSTM model, Ghosh’s three features LSTM model" through experimentation.

After completing experiments on the five models using machine learning metrics, our next step involves a more in-depth exploration of the alpha effect employing the Proposed SGP-LSTM model. This exploration will utilize two distinct sets of raw features processed through rolling windows, following the trading rules outlined in Section "Forecasting, ranking, and trading".

In our stock trading experiment, we conducted 'backtests' spanning 720 days. The prediction horizon was fixed at 5 days, and we implemented a rolling cycle of 20 days. This setup allowed the hybrid model to optimize its parameters every 20 days, utilizing the 1180 data points mentioned earlier. These optimized parameters remained consistent for the subsequent 20-day period. Additionally, every 5 days, the model ranked its stock prediction values, selecting the top 10% (450 stocks) for portfolio construction. Equal weights were assigned for buying and holding, and stocks with limitations were excluded to address potential trading issues. This approach ensured systematic assessment while constructing the portfolio based on the model's predictions within this timeframe.

Following the procedure of Proposed SGP-DNN model in Section "The proposed deep neural network", we also nee to examine the efficacy of SGP when integrated with MLP and LSTM, thereby assessing the suitability of SGP in conjunction with both methods. The experiment will be segmented into two main sections: Section "Experiment with fundamental indicator" will elaborate on experiments using fundamental indicators, while Section "Experiment using technical indicator." will focus on experiments employing technical indicators. This structured approach aims to comprehensively explore the impact and potential of integrating SGP within different feature processing methods.

Experiment with fundamental indicator

We execute the experiment with the fundamental indicator. This experiment is to observe 8 metrics of the cross-sectional stock return prediction based on the fundamental indicator, whether the integration of SGP give improvement or vice versa. First, we experiment the fundamental indicator directly using the MLP method. Then followed by experimenting it using LSTM method. This experiment is without integration of SGP. To observe the capability of SGP, we executed an experiment based on the using the LSTM and MLP respectively with the integration of SGP method. Table 5, illustrates the results of the experiments conducted, where the LSTM or MLP is integrated with SGP, is labelled as SGP-MLP and SGP-LSTM respectively. While the results obtained without the integration of SGP is shown in the column labelled as MLP and LSTM respectively.

Table 5 shows the results executed from the experiment for data in the year of 2020 to 2022. Whereas Fig. 6 summarize the data from 2020 to 2022 based on its average mean. Based on the results shown in Fig. 8, the results indicate that when the raw fundamental indicators were used as input for LSTM or MLP models, the average IC values were − 1.85% and − 2.88%, respectively. The average value of IC in this situation is considered low whereby the ideal average value should be above 8%. While the average value for ICIR were − 1.55 and − 3.22, respectively. This value for cross-sectional stock price change prediction is considered average or acceptable. The ideal value for ICIR is above 3. However, after integrating the models with the SGP algorithm, the IC absolute values increased to 8.05% for MLP and 7.98% for LSTM which is considered as ideal outcome.

Figure 8
figure 8

The metrics from 2020 to 2022 for fundamental indicators.

The SGP-LSTM model attained the highest average value of − 5.46 for ICIR, surpassing the performance of other models. It exhibited superior results in terms of IC-success ratio and IC-PNL, with values of 78.91% and 2.13, respectively. Furthermore, both the SGP-LSTM and SGP-MLP models showcased notable advantages over the single DNN models by employing a straightforward rule-based strategy for a long-only approach. Specifically, the SGP-LSTM model demonstrated a Excess R exceeding the CSI 300 index by 22.35% and surpassing the CSI 500 index by 16.26%. Moreover, it achieved an Excess R above the average by 9.89% per year, positioning it among the top 10% of mutual fund managers in China.

Experiment using technical indicator.

In contrast, according to the findings presented in Table 6 and Fig. 9, SGP-MLP or SGP-LSTM does not demonstrate significant advantages over single DNN models when it comes to technical indicators. On average, the single LSTM model for technical indicators produced the best results in terms of normal metrics such as IC, ICIR, and IC-success, with percentages of -8.64%, -6.561, and 85.71% respectively (the original IC mean of technical indicators is 2.82% and ICIR mean is 0.23 from Table 1 and Fig. 2). When comparing the performance of the two single DNN models in relation to a simple rule-based strategy, the LSTM model outperformed the MLP model. This could be attributed to the fact that technical indicators represent sequential time series data, which is better suited for the LSTM model, as explained in the methodology section. Notably, when considering a long-only strategy, the LSTM model exhibited a significantly higher Excess R above average at 13.28%, compared to the MLP model's 3.94%.

Table 6 The metrics of technical indicators for DNN with MLP or LSTM.
Figure 9
figure 9

The metrics of technical indicators for DNN with MLP or LSTM.

Based on the experiments conducted earlier, we could summarize that the fundamental indicator will achieve the best result, when the indicators are fed into SGP algorithm, while the technical indicator will achieve the best result without integrating the SGP but directly through LSTM technique. Therefore, we design a new DNN framework that could work well with both fundamental and technical indicators. Figure 10 below illustrates the proposed DNN framework where both fundamental and technical indicators are fed as the raw data. The explanation on the experiment on this proposed framework will be discussed in the next section.

Figure 10
figure 10

The proposed SGP-DNN framework for raw indicators.

Experiment using fundamental and technical features.

Based on Fig. 10, the fundamental indicators are first fed as the raw data into the framework. As mentioned earlier, the results shows better when SGP is integrated with LSTM or MLP. Therefore, the fundamental indicators are processed based on SGP and the output is being an input for the Phase I, the augmentation phase. The output for the augmentation phase is combined with the technical indicators to be an input for the Phase II, the feature selection. Here, only LSTM is utilized as from the experiment executed earlier, LSTM outperformed the MLP in terms of a better results. Two layers of LSTM are performed with 100 nodes each, where the final feature selection is only 30 notes for the price changes prediction. In Section "Experiment using fundamental and technical features.", we present the results based on the experiments conducted using the new proposed framework as shown in Fig. 10.

According to Table 7 our hybrid model showcased a significant improvement of 1128% in information coefficient (IC) and an impressive surge of 5360% in IC information ratio (ICIR) when applied to fundamental indicators. For technical indicators, the hybrid model achieved a commendable 206% increase in IC and a remarkable surge of 2752% in ICIR. According to Table 8 and Fig. 11, the proposed SGP-LSTM model attained a rank IC value of 9.24% and an ICIR of 7.24 for a five-day prediction horizon.

Table 7 The metrics based on the proposed SGP-LSTM framework.
Table 8 The metrics based on the proposed SGP-DNN framework.
Figure 11
figure 11

The metrics based on the proposed SGP-LSTM framework for the average mean data.

Proposed models comparison with Fisher’s LSTM model, Ghosh’s three features LSTM model

This section focuses on comparing the alpha effect of the top group portfolio among the Proposed SGP-LSTM model, Thomas Fischer's, and Ghosh's models. The rolling windows test, executed with identical settings between Section "Feature extraction method: LSTM vs MLP" and Section "Forecasting, ranking, and trading", aimed to compare the performance of the proposed SGP-LSTM model against Thomas Fischer's and Ghosh's LSTM models. This comprehensive analysis involved evaluating metrics for both the DNN model and the portfolio's risk-adjusted accumulated return.

Table 9 presents the outcomes derived from the 2020 to 2022 test set, delineating the average metrics for the three models scrutinized in this study. As expounded in Section "Experiments with five classical DNN frameworks for comparison", limitations regarding the uneven distribution of forecasted positive and negative outcomes are evident for both the Thomas Fischer and Ghosh models. Specifically, as highlighted in Table 9, the precision metric exhibits poor performance for Thomas and Ghosh models, registering values of 52.1% and 51.5%, respectively, compared to 53.07% for the Proposed SGP-LSTM model. Despite their higher Rank IC values, as anticipated in Section "Experiments with five classical DNN frameworks for comparison", all models resulted in negative Excess Return (alpha). Ultimately, the Proposed SGP-LSTM model demonstrated a 16.38% Excess Return (alpha) for the test set, accompanied by an information ratio of around 2.66.

Table 9 The mean of metrics comparisons among three models.

Furthermore, Fig. 12 showcases the cumulative performance comparison of Excess Returns over the average return of all stocks within three long-only portfolios: Ghosh’s LSTM model, Thomas Fischer’s LSTM model, and our proposed enhanced SGP-LSTM model, which yields the most favorable outcomes. Over the three-year out-of-sample period, it achieves a relative annual return of 16.38% and accumulates a total return of 57.62%. In contrast, despite exhibiting relatively high accuracy rates, the accumulated excess returns of Thomas Fischer and Ghosh remain negative.

Figure 12
figure 12

Comparisons of excess R above average.

Figure 13 presents a comparison of the cumulative return curves for the proposed SGP-LSTM model portfolio and two broad-based indices, as well as Ghosh’s LSTM and Thomas Fischer’s LSTM portfolio, during the period of 2020–2022. The results clearly demonstrate that the proposed model outperformed the average portfolio, as well as the CSI 300 and CSI 500 indices. Notably, the SGD-LSTM hybrid model exhibited significant outperformance compared to the CSI 300 index, the CSI 500 index as well as two benchmark portfolios, as shown in Fig. 13. Over the span of three years, the proposed SGP-LSTM model accrued an accumulated R of about 67.75%. In contrast, the CSI 500 achieved 11.32% in accumulated R, whereas both CSI 300 and the two benchmark portfolios accumulated negative returns.

Figure 13
figure 13

Comparison of the cumulative return curves.

Conclusion

This paper introduced a methodology to enhance the cross-sectional stock return prediction by utilizing Symbolic Genetic Programming (SGP) for input generation and integrating it with Deep Neural Network (DNN) models. The study demonstrated significant improvements in prediction, outperforming popular market indexes. A hybrid model combining SGP with Long Short-Term Memory (LSTM) showcased superior performance, consistently surpassing market returns a simple rule-based strategy based on the proposed hybrid SGP-LSTM model outperforms major Chinese stock indexes, generating average annualized excess returns of 31.00%, 24.48%, and 16.38% compared to the CSI 300 index, CSI 500 index, and the average portfolio, respectively. The findings highlight the potential of the proposed approach in generating profitable investment strategies and provide insights into addressing challenges in data integration and feature engineering.

This study focused solely on financial time series data, which is known for its high autocorrelation. However, recent research has explored the incorporation of diverse data sources such as social media data, news, macroeconomic data, and high-frequency data. Moreover, the proposed hybrid SGP-DNN model could benefit from additional optimization targets, such as Excess Return of top groups or monotonicity of ten groups of target stocks, instead of solely relying on MSE as the optimization goal. Additionally, recent advancements in reinforcement learning or generative adversarial networks (GANs), such as ChartGPT application, have been suggested to be combined with hybrid DNN models. Therefore, it could be worthwhile to consider supplementing the suggested hybrid SGP-DNN model with GANs or reinforcement learning techniques to leverage multi-source information and improve prediction performance.