Abstract
Due to the rapid development of hardware, the analytical processing and algorithmic capabilities of computers are constantly being enhanced, which makes machine learning play an increasingly important role in quantitative investment. For this reason, the possibility of replacing traditional human traders with extensively trained automated investment algorithms has become a hot topic in recent years. The majority of machine learning algorithms used in today's stock trading market are supervised learning algorithms, which are still unable to objectively analyse the market and find the optimal market trading solution on their own. To address the two major challenges of environment awareness and automated decision-making, this study uses three core algorithms, PPO, A2C, and SAC, to build an ensemble automated trading strategy within a deep reinforcement learning framework. The ensemble trading strategy combines the advantages of the three algorithms to make the original reinforcement learning algorithms more adaptive, and, to avoid consuming a large amount of memory when training the networks, the study uses the PCA method to compress the dimensionality of the stock feature vector. We test our algorithm on 40 A-share stocks with sufficient liquidity and compare it with different trading strategies. The results show that the ensemble strategy proposed in this study outperforms the three independent algorithms and two selected baselines, achieving an accumulated return of around 70%.
1. Introduction
Increasing computer processing power has led to the gradual digitization of financial market transactions, and it has become a reality to use computers to access large amounts of financial market data and complete high-frequency transactions within milliseconds [1]. The massive amount of data in financial markets is often high-dimensional and noisy, and traditional econometrics cannot accurately quantify such multidimensional market data, so forecasting financial market data has always been a challenging problem in finance [2]. Most quantitative trading algorithms in the market today use supervised learning methods, which rely on manual design to select features and construct labels [3]. Quantitative trading systems under this approach require external managers with a certain level of financial knowledge to change labels and parameters in a timely manner in response to market conditions, and therefore the objectivity and independence of the algorithm cannot be guaranteed [4]. In contrast to supervised learning methods, reinforcement learning methods have the ability to learn control strategies from high-dimensional data [5]. Reinforcement learning methods do not require supervision by external managers and can continuously optimize their decision paths to achieve the best cumulative returns through the rewards and penalties received while interacting with market transactions [6].
Although reinforcement learning offers a new way of thinking about the analysis and prediction of financial data, there is still room for improvement. First, algorithms derived from reinforcement learning, in the pursuit of decision objectivity, often ignore the dominant role of irrational investor sentiment in financial markets; researchers frequently combine sentiment mining with supervised algorithms, but few apply it to reinforcement learning methods [7]. This study addresses this issue by expressing market sentiment as a sentiment indicator, effectively improving the adaptability of the algorithm to irrational markets. Second, it has been shown that reinforcement learning relies on a large amount of environmental data as input, but existing trading algorithms tend to select fewer technical indicators as the state space of the underlying investment in order to relieve memory pressure [8], making the benefits of reinforcement learning methods less impressive than those of supervised learning methods. This study uses the PCA algorithm to compress the dimensionality of the state vector, incorporating more technical indicators to extend the state space of the underlying assets under the same memory pressure. In addition, researchers have found that different agents are suited to different stock market conditions, which means that a model built on a single agent lacks effectiveness and generalization ability when faced with different stock assets and different market environments [9]. Therefore, this study constructs an ensemble strategy of three algorithms: PPO, A2C, and SAC. The ensemble strategy can choose the appropriate agent to maximize the accumulated return for different market environments and situations, making it more stable and reliable than a single strategy.
In the design of the ensemble model, we first construct a deep reinforcement learning framework, setting up the environment, the state space, and the action space. Second, we design a reward and punishment function for the agent to ensure that it can effectively optimize its own decisions. Third, we connect the three different agents through the Sharpe ratio. Finally, we demonstrate the effectiveness of the ensemble algorithm through ablation experiments and two baselines.
The study is structured as follows: Section 2 reviews the relevant literature in this research area, Section 3 describes the theoretical approach used in this study, Section 4 describes the detailed construction process of the ensemble model, Section 5 presents the data processing and empirical results, and Section 6 concludes.
2. Literature Review
Reinforcement learning has evolved from its initial stand-alone applications to a wide range of applications in combination with deep learning. It has been extensively investigated by researchers in the field of finance because of its ability to deal effectively with sequential decision-making.
In terms of practical applications of reinforcement learning, Moody and Saffell proposed to construct portfolios and trade stocks with recurrent reinforcement learning and eventually proved that the returns of the reinforcement learning strategy were higher than those of the buy-and-hold strategy [10]. Tan et al. added an adaptive network fuzzy inference system as a supplement to the reinforcement learning framework to design a high-frequency trading strategy [11]. Sun and Bi selected convolutional neural networks and LSTM neural networks to build up and down classification models, respectively, based on which a high-frequency trading strategy was proposed and backtested on the main asphalt futures contract, demonstrating that the high-frequency trading strategy based on convolutional neural networks and long short-term memory neural networks had better profitability [12]. Dai and Zhang demonstrated that the reinforcement learning model outperformed the buy-and-hold strategy and the MACD strategy in stock selection [13]. Lu and Salem applied the long short-term memory (LSTM) model to reinforcement learning and backtested it through forex trading; the results demonstrated that the improved reinforcement learning model could effectively control the number of trades and maintain stable profitability [14]. Hu et al. constructed a cointegrated pair trading model based on the reinforcement learning SARSA algorithm, conducted simulation trading experiments on the Chinese bond market, and demonstrated that the model outperformed the traditional model in all aspects and could significantly improve the profitability of the trading system [15].
In terms of algorithm design for reinforcement learning, Zhang, Wang, and other scholars (2015) constructed a stock prediction model based on neural networks that addresses the high dimensionality of the input data through genetic algorithms, and the results showed that genetic algorithms can improve model training efficiency [16]. Yung added news headlines for market opinion mining on top of stock time-series price data and in this way effectively improved the accuracy of model decisions [17]. Zhou et al. improved the traditional quantitative trading algorithm using the sentiment indicator ARBR, enabling the improved reinforcement learning algorithm to earn richer returns in irrational markets [18]. Li et al. proposed a new deep reinforcement learning trading model that uses two different network structures, stacked denoising autoencoders (SDAEs) and long short-term memory (LSTM), to effectively extract features from the raw data and build a robust trading agent; experimental results show that the model achieves stable risk-adjusted returns in both the stock and futures markets [19]. Gabrielsson and Johansson introduced seven new features based on Japanese candlesticks into the reinforcement learning input, and their HFT system outperformed the S&P 500 index and significantly outperformed the basic RRL algorithm in testing [20].
It is therefore easy to see that trading models built on reinforcement learning are well established and widely used in finance, achieving good returns in both the equity and foreign exchange markets. In terms of improvements to reinforcement learning algorithms, researchers have focused on improving neural network structures and expanding the state space of the model.
3. Methodology
3.1. Reinforcement Learning Theory
Reinforcement learning is an important machine learning approach in current quantitative trading research [21]. Unlike common machine learning algorithms, the core of reinforcement learning is to let an agent in an interactive environment calculate the reward value of different actions based on the current state and continuously optimize its internal policy in the direction of the best reward value until the best policy is found [22]. In summary, the reinforcement learning framework consists of the state of the agent in the current environment, the different actions resulting from its decisions, the policy for the decision made in the current state, the reward function used to calculate the reward value at the end of an action, the value function for different actions and states, and the environment used to implement the agent's interaction process.
The flow of the whole reinforcement learning algorithm is shown in Figure 1. At time $t$, the agent obtains the current state $s_t$ of the environment and uses the policy function $\pi(a_t \mid s_t)$ to output the action $a_t$ for the current state. After the action is completed, $a_t$ is applied to the environment, causing the environment state to change from $s_t$ to $s_{t+1}$, and the reward function then uses this state transition to calculate the reward value $r_t$ for the action $a_t$. The agent can use $r_t$ to continuously optimize future action strategies and ultimately maximize the cumulative reward value. The optimal strategy can be written as in (1), where $\pi$ is the chosen strategy, $\gamma$ is the discount rate, $T$ is the total number of interaction steps, and $s_t$ and $a_t$ are the state and action at time $t$:

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t}\, r\left(s_t, a_t\right)\right]. \tag{1}$$
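To make this interaction loop concrete, the following minimal Python sketch (with a hypothetical `env`/`policy` interface, not the paper's implementation) accumulates the discounted reward that the optimal policy in (1) maximizes:

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode and accumulate the discounted cumulative reward."""
    state = env.reset()
    total_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                  # a_t ~ pi(. | s_t)
        state, reward, done = env.step(action)  # environment moves from s_t to s_(t+1)
        total_return += discount * reward       # adds gamma^t * r_t
        discount *= gamma
        if done:
            break
    return total_return
```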

3.2. Markov Decision Process
The Markov decision process (MDP) is the classical framework for reinforcement learning modelling, and its main idea is to perform dynamic programming by finding the maximum cumulative payoff on the MDP [23]. If the current state, which contains all relevant historical information, is sufficient to determine the future cumulative payoff, then the state is said to be Markovian, and this property can be described as follows:

$$P\left(s_{t+1} \mid s_t\right) = P\left(s_{t+1} \mid s_1, s_2, \ldots, s_t\right). \tag{2}$$
We can describe the MDP using a tuple, as in (3), where $S$ is the state space, which stores all states in the environment, $A$ is the action space, which represents all actions the agent can take, $P$ is the transfer probability, which represents the probability that an action taken by the agent in a given state results in a particular state transition and which must satisfy $\sum_{s' \in S} P\left(s' \mid s, a\right) = 1$, and $R$ is the reward generated by the action:

$$\text{MDP} = \left(S, A, P, R\right). \tag{3}$$
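As a toy illustration of this constraint on the transfer probability (the numbers below are arbitrary and only show the structure of the tuple):

```python
import numpy as np

# Toy MDP with 3 states and 2 actions; P[s, a, s'] is the transfer probability.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
    [[0.2, 0.2, 0.6], [0.4, 0.4, 0.2]],
])
R = np.zeros((3, 2))  # reward for each (state, action) pair

# For every state-action pair, the next-state probabilities must sum to one.
assert np.allclose(P.sum(axis=-1), 1.0)
```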
3.3. A2C
The actor-critic approach is one of the mainstream approaches in reinforcement learning, combining the advantages of both value-based and policy-based classical algorithms [24]. The core idea is to use the value of the state actions predicted by the critic model to optimize the decision-making behaviour of the actor model, and by alternating training, the generated actions can be made to better match the current environment and state. The structure of the model is shown in Figure 2.

Since the original AC model was slow to converge, the A2C algorithm was proposed: it reduces the variance of the policy gradient by adding a baseline to the gradient while keeping the expectation of the policy gradient estimator unchanged, allowing for faster convergence.
The gradient formula of the AC algorithm in its original form can be expressed as follows:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_t \mid s_t\right)\, Q^{\pi}\left(s_t, a_t\right)\right].$$

Since the baseline function $b(s_t)$ is related only to the state $s_t$ and not to the action $a_t$, the above equation can be rewritten as follows:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_t \mid s_t\right)\left(Q^{\pi}\left(s_t, a_t\right) - b\left(s_t\right)\right)\right].$$

Because the baseline is independent of the action $a_t$, we can further optimize the formula by using the state value function $V^{\pi}(s_t)$ as the baseline and defining the advantage function $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$. Eventually, we obtain the optimized gradient as follows:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_t \mid s_t\right)\, A^{\pi}\left(s_t, a_t\right)\right].$$
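A minimal sketch of the advantage estimate behind this gradient, using discounted returns as a stand-in for the action value and a learned value estimate as the baseline (the arrays are illustrative, not the paper's data):

```python
import numpy as np

def advantage_estimates(rewards, values, gamma=0.99):
    """A(s_t, a_t) ~ G_t - V(s_t), where G_t is the discounted return from step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values  # a positive advantage reinforces the chosen action

adv = advantage_estimates(np.array([1.0, 0.0, -0.5, 2.0]),
                          np.array([0.8, 0.4, 0.1, 1.5]))
```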
In this study, the MLP neural network is chosen to build the A2C algorithm. The algorithm is updated step by step during training, and its training process is seen in Table 1.
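The paper does not name the library used for its implementation; as one hedged example, an A2C agent with an MLP policy network can be set up with stable-baselines3, using a standard Gymnasium environment as a placeholder for the trading environment of Section 4:

```python
# Hypothetical setup; the paper does not specify its implementation library.
import gymnasium as gym
from stable_baselines3 import A2C

env = gym.make("CartPole-v1")        # placeholder for the stock-trading environment
model = A2C("MlpPolicy", env,        # MLP actor-critic network
            learning_rate=7e-4, gamma=0.99, verbose=0)
model.learn(total_timesteps=50_000)  # on-policy, step-by-step updates during training
```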
3.4. PPO
The principle of the PPO algorithm is to represent the policy parametrically as $\pi_{\theta}$, using a parametrized linear function or neural network to represent the policy [25].

The PPO policy gradient is implemented by computing an estimator of the gradient and applying stochastic gradient ascent; the update formula can be written as follows:

$$\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta} J\left(\theta_{k}\right),$$

where $\theta_{k}$ is the policy parameter before the update, $\theta_{k+1}$ is the updated policy parameter, $\alpha$ is the learning rate, and $r_t(\theta) = \pi_{\theta}\left(a_t \mid s_t\right)/\pi_{\theta_{k}}\left(a_t \mid s_t\right)$ is the importance weight that enters the objective. $J(\theta)$ is the optimization objective, that is, the expected value of the future reward in state $s_t$.
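As a hedged sketch of how the importance weight typically enters the PPO objective, the commonly used clipped surrogate can be written in PyTorch as follows (the paper does not spell out the exact surrogate it uses):

```python
import torch

def ppo_surrogate(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective; ratio = pi_theta / pi_theta_old is the importance weight."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()  # maximized by stochastic gradient ascent
```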
3.5. SAC
SAC is an off-policy AC algorithm developed for maximum entropy reinforcement learning [26]. Unlike other methods, the SAC algorithm changes the goal of reinforcement learning by introducing the concept of entropy, which improves the exploration ability and robustness of the algorithm [27].

The entropy [28–35] of a distribution $P(x)$ can be expressed as follows:

$$H(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right] = -\sum_{x} P(x) \log P(x).$$

In order not to miss any valid actions and trajectories and to promote policy randomization for greater robustness and exploration, the maximum entropy reinforcement learning algorithm requires the policy to output actions that maximize both the expected return and the policy entropy, as follows:

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t} r\left(s_t, a_t\right) + \alpha H\left(\pi\left(\cdot \mid s_t\right)\right)\right],$$

where $\alpha$ is the temperature coefficient controlling the weight of the entropy term.
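A small numeric illustration of the entropy term for a discrete action distribution (the probabilities are chosen arbitrarily):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) * log p(x) for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # ignore zero-probability outcomes
    return -np.sum(p * np.log(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform policy: maximum entropy, log(4) ~ 1.386
print(entropy([0.97, 0.01, 0.01, 0.01]))  # near-deterministic policy: low entropy, ~ 0.17
```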
4. Ensemble Model Development
The ensemble model can be divided into two parts: the first part selects a suitable algorithm from A2C, PPO, and SAC as the agent, and the second part builds the state space of the stock from stock prices, technical indicators, and sentiment indicators to describe the stock trading market environment [36–39]. Once these two parts are completed, the action space and the reward function link the agent with the environment so that the agent can make continuous decisions to maximize the cumulative return. The structure of the ensemble model is shown in Figure 3.

4.1. Select the Agent
In this study, we use a training window several months long to train all three agents simultaneously, and we retrain the three agents every three months. At the same time, we use the last three months of the training window to validate the performance of the three agents and select the agent with the highest Sharpe ratio as the agent of the ensemble strategy for the underlying investment. The Sharpe ratio is calculated as follows:

$$\text{Sharpe ratio} = \frac{\mathbb{E}\left[R_{p}\right] - R_{f}}{\sigma_{p}},$$

where $R_{p}$ is the portfolio return, $R_{f}$ is the risk-free rate, and $\sigma_{p}$ is the standard deviation of the portfolio return.
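The following sketch of the selection step assumes 250 trading days per year and a zero risk-free rate (assumptions, not values stated in the paper); it computes an annualized Sharpe ratio on the validation window for each agent and picks the best one:

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=250):
    """Annualized Sharpe ratio of a series of daily returns."""
    excess = np.asarray(daily_returns) - risk_free_daily
    return np.sqrt(periods_per_year) * excess.mean() / excess.std()

def select_agent(validation_returns):
    """validation_returns: dict mapping agent name -> daily returns on the validation window."""
    scores = {name: sharpe_ratio(r) for name, r in validation_returns.items()}
    return max(scores, key=scores.get)  # e.g. 'PPO', 'A2C', or 'SAC'
```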
4.2. Setting Up the Environment
We incorporate the daily opening and closing prices of the stock, the technical indicators of the stock, and the sentiment indicator ARBR, which describes market sentiment, as the state space of the stock.
It is worth mentioning that, to incorporate more indicators into the state space of the stock while relieving the memory pressure on the algorithm, we use the PCA algorithm to compress the original 24-dimensional feature vector to 20 dimensions. Figure 4 shows the correlation heatmap of the technical indicators selected by the ensemble model. It is easy to see that there is a strong positive correlation between vol10 and vol20 and a strong negative correlation between the bias ratio (BIAS) and the moving average (MA), so the PCA algorithm can be used to reduce the dimensionality.
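A sketch of this compression step with scikit-learn, assuming a matrix of daily 24-dimensional indicator vectors (the random data here are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = np.random.randn(1000, 24)               # placeholder: 1000 days x 24 raw indicators

scaled = StandardScaler().fit_transform(features)  # PCA is sensitive to feature scale
pca = PCA(n_components=20)
state_vectors = pca.fit_transform(scaled)          # 1000 x 20 compressed state vectors

print(pca.explained_variance_ratio_.sum())         # share of variance retained after compression
```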

4.3. Transaction Cost
Since every transaction in the stock market incurs transaction costs and the rules for transaction costs in the stock exchange vary from country to country, we have set a uniform transaction cost of 0.1% of the value of each transaction.
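As a trivial illustration, a flat 0.1% cost rate deducts the following fee from each trade:

```python
def transaction_cost(trade_value, rate=0.001):
    """Fee charged on each buy or sell, at 0.1% of the traded value."""
    return trade_value * rate

print(transaction_cost(10_000))  # trading 10,000 units of currency costs 10.0
```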
4.4. Reward Function
We define the reward value as the maximum profit that each group of stocks can obtain in a given period of time, expressed as follows:

$$r_t = p_t - p_{t-1},$$

where $r_t$ is the reward value currently received, $p_t$ is the price of the stock at time $t$, and $p_{t-1}$ is the price of the stock at the previous time step. The reward value is therefore the difference between the two momentary prices. When the current price is greater than the past price, a positive reward value is obtained; when the current price is lower than the past price, a negative reward value is obtained. The final cumulative return is as follows:

$$R = \sum_{t=1}^{T} r_t = \sum_{t=1}^{T}\left(p_t - p_{t-1}\right).$$
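A minimal sketch of this reward and its accumulation over a short, made-up price series:

```python
import numpy as np

prices = np.array([10.0, 10.4, 10.1, 10.9])

rewards = np.diff(prices)            # r_t = p_t - p_(t-1) for each step
cumulative_return = rewards.sum()    # equals prices[-1] - prices[0]

print(rewards)            # roughly [ 0.4 -0.3  0.8]
print(cumulative_return)  # roughly 0.9
```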
4.5. Action Space
To make it easier to calculate the profitability of the ensemble model in the stock market, we do not consider short selling and simply use buy, sell, hold, and wait and see as the action states of the stock. This can be expressed as follows:

$$A = \{\text{buy}, \text{sell}, \text{hold}, \text{wait and see}\}.$$
5. Empirical Results
5.1. Data Preprocessing
We select the constituent stocks of the CSI 100 index as the pool of stocks to be traded by the ensemble strategy. We use historical daily data from 1 January 2010 to 12 February 2021 to evaluate model returns. The stock data used in this study are downloaded via the Wind terminal. As mentioned above, we split the historical stock data into two parts: one for training the three agents, PPO, A2C, and SAC, and the other for validating the three agents, including tuning the learning rate and key parameters. After selecting a suitable agent by comparing Sharpe ratios, we start the real trading test and compare the results of the three agent algorithms against two baselines: a MACD strategy and a min-variance portfolio strategy. A breakdown of the training data used is shown in Figure 5.
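The following sketch shows one way to lay out the rolling train/validation windows described in Section 4.1, assuming business-day dates and a hypothetical 24-month training window (the paper only says "several months"):

```python
import pandas as pd

def rolling_windows(dates, train_months=24, valid_months=3):
    """Yield (train, validation) windows; validation is the last `valid_months`
    of each training window, and the windows roll forward every `valid_months` months."""
    start = dates.min()
    while True:
        train_end = start + pd.DateOffset(months=train_months)
        if train_end > dates.max():
            break
        valid_start = train_end - pd.DateOffset(months=valid_months)
        train = dates[(dates >= start) & (dates < train_end)]
        valid = dates[(dates >= valid_start) & (dates < train_end)]
        yield train, valid
        start += pd.DateOffset(months=valid_months)

dates = pd.date_range("2010-01-01", "2021-02-12", freq="B")  # business days in the sample
windows = list(rolling_windows(dates))
```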

5.2. Test Result
The backtest results of the ensemble model and the comparison models are shown in Figure 6. It is easy to see that the ensemble model achieved a cumulative return of 71.92% and an annual return of 17.98%, which is higher in terms of return than two of the individual agent models and the two baselines. More detailed backtesting data are shown in Table 2. The ensemble model has the lowest annual volatility, which shows that it is more stable and reliable than the other models, while the min-variance model has the highest annual volatility. The ensemble model also achieved the highest Sharpe ratio and the lowest maximum drawdown. Overall, the A2C, PPO, and SAC models all achieved above-baseline returns, demonstrating that all three models have some portfolio management capability. By comparison, the ensemble model achieved a cumulative return of around 70%, while its stability, Sharpe ratio, and maximum drawdown were better than those of the other models, demonstrating the effectiveness of the model in the equity market.

6. Conclusion
In this study, we propose an ensemble trading strategy within a reinforcement learning framework, which selects the appropriate strategy from among PPO, A2C, and SAC as the agent according to the Sharpe ratio and incorporates more stock indicators and data into the state space of the stock using the PCA method. Backtesting on the CSI 100 shows that the proposed model outperforms the two agent models A2C and SAC in terms of return and outperforms all three independent agent models and the two baselines in terms of Sharpe ratio, annual volatility, and maximum drawdown, so the ensemble model is innovative and superior and has research and application value.
Data Availability
The dataset can be accessed upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.