0% found this document useful (0 votes)
208 views11 pages

Automated Stock Trading with DRL

This document proposes a novel deep reinforcement learning approach for automated stock trading using cascaded LSTM networks. The approach uses LSTM to extract time-series features from stock data to represent the initial state for the reinforcement learning agent. The agent is then trained using proximal policy optimization to learn strategies that maximize expected returns. Experiments on US and Chinese stock market data show the proposed approach outperforms previous baseline models, particularly in the Chinese stock market.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
208 views11 pages

Automated Stock Trading with DRL

This document proposes a novel deep reinforcement learning approach for automated stock trading using cascaded LSTM networks. The approach uses LSTM to extract time-series features from stock data to represent the initial state for the reinforcement learning agent. The agent is then trained using proximal policy optimization to learn strategies that maximize expected returns. Experiments on US and Chinese stock market data show the proposed approach outperforms previous baseline models, particularly in the Chinese stock market.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

A Novel Deep Reinforcement Learning Based Automated Stock Trading System

Using Cascaded LSTM Networks

Jie Zoua , Jiashu Loua , Baohua Wanga,∗, Sixue Liua


a College of Mathematics and Statistics, Shenzhen University, 518060, Guangdong, China

Abstract
arXiv:2212.02721v1 [[Link]] 6 Dec 2022

More and more stock trading strategies are constructed using deep reinforcement learning (DRL) algorithms, but DRL
methods originally widely used in the gaming community are not directly adaptable to financial data with low signal-
to-noise ratios and unevenness, and thus suffer from performance shortcomings. In this paper, to capture the hidden
information, we propose a DRL based stock trading system using cascaded LSTM, which first uses LSTM to extract
the time-series features from stock daily data, and then the features extracted are fed to the agent for training, while
the strategy functions in reinforcement learning also use another LSTM for training. Experiments in DJI in the US
market and SSE50 in the Chinese stock market show that our model outperforms previous baseline models in terms
of cumulative returns and Sharp ratio, and this advantage is more significant in the Chinese stock market, a merging
market. It indicates that our proposed method is a promising way to build a automated stock trading system.
Keywords: Deep Reinforcement Learning, Long Short-Term Memory, Automated stock trading, Proximal policy
optimization, Markov Decision Process

1. Introduction based on supervised learning and require training sets


that are labeled with the state of the market, but such
In recent years, more and more institutional and machine learning classifiers are prone to suffer from
individual investors are using machine learning and overfitting, which reduces the generalization ability of
deep learning methods for stock trading and asset man- the model[9].
agement, such as stock price prediction using Ran-
dom Forests, Long Short-Term Memory(LSTM) Neu- Fundamental data from financial statements and other
ral Networks or Support Vector Machines[1], which data from business news, etc. are combined with ma-
help traders to get optimal online strategies and ob- chine learning algorithms that can obtain investment
tain higher returns than strategies using only traditional signals or make predictions about the prospects of a
factors[2][3][4] . company[10][11][12][13] to screen for good investment
However, there are three main limitations of machine targets. Such an algorithm solves the problem of stock
learning methods for stock market prediction: (i) finan- screening, but it cannot solve how to allocate positions
cial market data are filled with noise and are unstable, among the investment targets. In other words, it is still
and also contain the interaction of many unmeasurable up to the trader to judge the timing of entry and exit.
factors. Therefore, it is very difficult to take into ac- To overcome the main limitations listed above, in this
count all relevant factors in complex and dynamic stock paper we use a deep reinforcement learning approach to
markets[5][6][7]. (ii) Stock prices can be influenced by construct low-frequency automated stock trading strate-
many other factors, such as political events, the behav- gies to maximize expected returns. We consider stock
ior of other stock markets or even the psychology of trading as a Markov decision process that will be repre-
investors[8]. (iii) Most of the methods are performed sented by states, actions, rewards, strategies, and values
in a reinforcement learning algorithm. Instead of rely-
∗ Corresponding
ing on labels (e.g., up and down in the market) to learn,
author
Email addresses: ianzou2000@[Link] (Jie Zou),
the reinforcement learning approach learns how to max-
loujiashu@[Link] (Jiashu Lou), bhwang@[Link] (Baohua imize the value function during the training phase. We
Wang), 18801117020@[Link] (Sixue Liu) mainly use the PPO algorithm to train the agent and
Preprint submitted to Expert Systems With Applications December 7, 2022
combine it with LSTM to extract time-series features for defining the necessary constraints in the environment
an initial state of a certain time window length, while the and the framework of the agent. Section 4 presents the
initial state is represented by the adjusted stock price, results and analysis of the experiments, covering the in-
available balance, shares of the underlying asset plus troduction of datasets, baseline models and evaluation
some technical indicators: MACD, RSI, CCI and ADX, metrics, process of finding the optimal parameters, and
illustrated in Figure 1. We take the ensemble strategy experiments in both Chinese and U.S. markets. In Con-
proposed by Yang, Liu[14] as the baseline and further clusion, we summarize the whole work and give direc-
expand on their work, so the training environment, state tions of improvement for the future.
space, behavior space and value function we use are
consistent with it.
2. Related Work

This section briefly summarizes the application of re-


inforcement learning and LSTM in quantitative trad-
ing, reviewing three learning methods in reinforcement
learning that are frequently applied to financial markets
and LSTM neural networks that are applied to predict
stock prices. These three learning methods are: critical-
only learning, actor-only learning and actor-critic learn-
ing.

2.1. Critic-only
The Critic-only approach, the most common of the
Figure 1: Agent-environment interaction in reinforcement learning three, uses only the action-value function to make de-
cisions with the aim of maximizing the expected re-
Experiments show that the automated stock trad- ward for each action choice given the current state. The
ing strategy based on LSTM outperforms the ensemble action-value function Q receives the current state and
strategy and the buy-and-hold strategy represented by the possible actions to be taken as input, then outputs
the Dow Jones Industrial (DJI) index in terms of cumu- an expected Q value as a reward. One of the most
lative return, and also has better performance in Chinese popular and successful approaches is Deep Q Network
stock market in terms of cumulative return and Sharpe (DQN)[15] and its extensions[16]. Chen[17], Dang[18]
ratio. and Jeong[19] used this method to train agents on a sin-
The main contributions of this paper are two-fold: gle stock or asset. Chen[17], Huang[20] used Deep Re-
(i) In the literature, security’s past price and related current Q Network (DRQN) trained agents to achieve
technical indicators are often used to represent state higher cumulative performance on quantitative trading
spaces. Instead of using the raw data, we use LSTM than baseline models and DQN to achieve higher cu-
to extract the time-series features from stock daily data mulative returns than baseline models. However, the
to represent state spaces, since the memory property main limitation of the method is that it performs bet-
of the LSTM can discover features of the stock mar- ter on discrete state spaces, however, the stock prices
ket that change over time and integrate hidden infor- are continuous. If a larger number of stocks or assets
mation in the time dimension, thus making it more are selected, the state space and action space will grow
likely that the POMDP is closer to the MDP. (ii) Dif- exponentially[21], which will weaken the performance
ferent from the previous DRL based methods which use of DQN.
multi-layer neural networks or convolutional networks
in agent training, we use LSTM as the training network 2.2. Actor-only
because it is a type of recurrent neural network capa- The actor-only approach is able to learn policies di-
ble of learning order dependence in sequence prediction rectly and the action space can be considered as contin-
problems. uous. Therefore, the advantage of this approach is that
The rest of this paper is organized as follows. Sec- it can directly learn the optimal mapping from a par-
tion 2 is a brief introduction to the work related to stock ticular state to an action, which can be either discrete
trading using reinforcement learning, categorized by the or continuous. Its disadvantages are that it requires a
training algorithm. Section 3 focuses on our algorithm, large amount of data for experiments and a long time
2
to obtain the optimal strategy[22]. Deng[23] used this the OpenAI gym[28][29][30], which is able to give the
method and applied Recurrent Deep Neural Network to agent various information for training, such as current
real-time financial trading for the first time. Wu[24] also stock prices, shareholdings and technical indicators. We
explored the actor-only method in quantitative trading, use Markov Decision Process (MDP) to model stock
where he compared deep neural networks (LSTM) with trading[36], so the information that should be included
fully connected networks in detail and discussed the im- in this multi-stock trading environment are: state, ac-
pact of some combinations of technical indicators on the tion, reward, policy and Q-value. Suppose we have a
daily data performance of the Chinese market, proving portfolio with 30 stocks.
that deep neural networks are superior. The results of
his experiments are mixed, and he shows that the pro- 3.1.1. State Space
posed approach can yield decent profits in some stocks, A 181-dimensional vector consisting of seven parts of
but performs mediocrely in others. information represents the state space of the multi-stock
trading environment: [bt , pt , ht , Mt , Rt , Ct , Xt ]. Each of
2.3. Actor-critic these components is defined as follows. In [14], this
The actor-critic approach aims to train two models si- 181-dimensional vector is fed as a state directly into
multaneously, with the actor learning how to make the the reinforcement learning algorithm for learning. How-
agent respond in a given state and the critic evaluat- ever, our approach is to give T (T is the time window of
ing the responses. Currently, this approach is consid- LSTM) 181-dimensional vectors to the LSTM for learn-
ered to be one of the most successful algorithms in RL, ing first, and the feature vector generated by the LSTM
while Proximal Policy Optimization (PPO) is the most are given to the agent for learning as our state.
advanced actor-critic approach available. It performs
better because it solves the well-known problems when 1. bt ∈ R+ : available balance at current time step t.
applying RL to complex environments, such as instabil- 2. pt ∈ R30
+ : adjusted close price of each stock at cur-
ity due to the distribution of observations and rewards rent time step t.
that constantly change as the agent learns[25]. In this 3. ht ∈ Z30+ : shares owned of each stock at current
paper, the baseline model[14] is constructed based on time step t.
the actor-critic approach, using a combination of three
4. Mt ∈ R30 : Moving Average Convergence Diver-
DRL algorithms: PPO, A2C, and DDPG. However, the
gence (MACD) is calculated using close price of
agent that learns only with PPO outperforms the resem-
each stock at current time step t. MACD is one
ble strategy in terms of cumulative return.
of the most commonly used momentum indicators
that identifies moving averages[32].
2.4. LSTM in stock system
5. Rt ∈ R30 + : Relative Strength Index (RSI) is cal-
Although Long Short-Term Memory Networks
culated using close price of each stock at current
(LSTM)[26] are traditionally used in natural language
time step t. RSI quantifies the extent of recent price
processing, many recent works have applied them in
changes[32].
financial markets to filter some noise in raw market
data[41][42][43][44]. Stock prices and some techni- 6. Ct ∈ R30 + : Commodity Channel Index (CCI) is
cal indicators generated from stock prices are intercon- calculated using high, low and close price. CCI
nected, so LSTM can be used as a feature extractor to compares current price to average price over a time
extract potentially profitable patterns in the time series window to indicate a buying or selling action[33].
of these indicators. Zhang[22], Wu[24] have tried to in- 7. Xt ∈ R30 : Average Directional Index (ADX)
tegrate LSTM for feature extraction while training agent is calculated using high, low and close price of
using DRL algorithm and the experiments have shown each stock at current time step t. ADX identifies
that it works better than baseline model. And the work trend strength by quantifying the amount of price
of Lim[45] shows that LSTM delivers superior perfor- movement[34].
mance on modelling daily financial data.
3.1.2. Action Space
3. Our method A set containing 2k + 1 elements represents the
action space of the multi-stock trading environment:
3.1. Stock Market Environment {−k, ..., −1, 0, 1, ..., k}, where k, −k represents the num-
The stock market environment used in this paper is ber of shares we can buy and sell at once. It satisfies
a simulation environment developed in [14] based on some conditions as follows.
3
1. hmax represents the maximum number of shares we 3.2. Stock Trading Agent
can be able to buy at a time.
3.2.1. Framework
2. The action space can be considered continuous
since the entire action space is of size (2k + 1)30 . We introduce LSTM as a feature extractor to improve
the model in [14], as shown in Figure 2.
3. The action space will next be normalized to [−1, 1].

3.1.3. Reward
We define the reward value of the multi-stock trading
environment as the change in portfolio value from state
s taking action a to the next state s0 (in this case two
days before and after), with the training objective of ob-
taining a trading strategy that maximizes the return:
   
Returnt (st , at , st+1 ) = bt+1 + pTt+1 ht+1 − bt + pTt ht −ct
(1)
where ct represents the transaction cost. We assume that
the per-transaction cost is 0.1% of each transaction, as
defined in [10]:
Figure 2: Overview of our model
ct = 0.1% × pT kt (2)
At the time step t, the environment automatically gen-
3.1.4. Turbulence Threshold erates the current state S t and passes it to the LSTM
We employ this financial index turbulencet [14] that network. It then remembers the state and uses its mem-
measures extreme asset price movements to avoid the ory to retrieve the past T stock market states to obtain
risk of sudden events that may cause stock market the state sequence Ft = [S t−T +1 , ..., S t ]. The LSTM
crash[35], such as March 2020 stock market caused by analyzes and extracts the hidden time-series features
COVID-19, wars and financial crisis: or potentially profitable patterns in Ft , and then out-
puts the encoded feature vector Ft0 and passes it to the
turbulencet = (yt − µ) Σ−1 (yt − µ)0 ∈ R (3) agent, which is guided by the policy function π(Ft0 )
to perform the optimal action at . The environment
Where yt ∈ R30 denotes the stock returns for current then returns the reward Rt , the next state S t+1 , and a
period t, µt ∈ R30 denotes the average of historical re- boolean dt to determine if the state is terminated accord-
turns, and Σ ∈ R30×30 denotes the covariance of his- ing to the agent’s behavior. Then, the obtained quintet
torical returns. Considering the historical volatility of (S t , at , Rt , S t+1 , dt ) is stored in the experience pool.
the stock market, we set the turbulence threshold to Actor computes At from the targetst computed by critic
90th percentile of all historical turbulence indexes. If using the advantage function. After a certain number of
turbulencet is greater than this threshold, it means that steps, actor back-propagates the error through the loss
extreme market conditions are occurring and the agent function J θi (θ) = Tt=1 ( ππθθ (a(att|s|stt)) Aπt θ − βKL(θ, θi )) and up-
P
will stop trading until the turbulence index falls below i
dates θ using the gradient descent method. And critic
this threshold. updates the parameters using the mean square error loss
function. And the environment will keep repeating the
3.1.5. Other Parameters process until the end of the training phase.
In addition to defining the state space, action space
and reward functions, some necessary constraints need
to be added to the multi-stock trading environment. 3.2.2. LSTM as Feature Extractor
Reinforcement Learning (RL) was initially applied to
1. Initial capital: $1 million. games, which have a limited state space, a limited action
2. Maximum number of shares in a single trade hmax : space, clear stopping conditions and a more stable envi-
100. ronment, so there is room to improve the use of RL for
3. Reward scaling factor: 1e-4, which means the re- stock trading. It is well known that financial markets are
ward returned by the environment will be only 1e-4 full of noise and uncertainty, and that the factors affect-
of the original one. ing stock prices are multiple and changing over time.
4
This makes the stock trading process more like a par- ratio between the new policy and the old one:
tially observable Markov decision process (POMDP),
since the states we use are not the real states in the πθ (at |st )
rt (θ) = (4)
stock trading environment. Therefore, we can use the πθold (at |st )
memory property of the LSTM to discover features of
So we can get rt (θold ) = 1 from it. The clipped surrogate
the stock market that change over time. LSTM can in-
objective function of PPO is:
tegrate information hidden in the time dimension, thus
making it more likely that the POMDP is closer to the HCLIP (θ) = Ê[min(rt (θ) Â (st , at ) ,
MDP[24][31].
In this paper, an LSTM-based feature extractor is de- clip (rt (θ) , rt (θold ) − , rt (θold ) +  ) Â (st , at ))]
(5)
veloped using the customized feature extraction inter-
That is,
face provided by stable-baselines3. The network struc-
ture of the LSTM feature extractor is shown in Figure HCLIP (θ) = Ê[min(rt (θ) Â (st , at ) ,
3. (6)
clip (rt (θ) , 1 − , 1 +  ) Â (st , at ))]

Where rt (θ) Â (st , at ) is the normal policy gradient ob-


jective in TRPO, and  (st , at ) is the estimated advan-
tage function. Term clip(rt (θ) , 1 − , 1 +  ) clips the
ratio rt (θ) to be within [1 − , 1 + ]. Finally, HCLIP (θ)
takes the minimum of the clipped and unclipped objec-
tive.
PPO contains some additional main optimizations ex-
cept the value function clipping[38]: (i) Reward scaling:
Instead of feeding rewards directly from the environ-
ment to the target, the PPO implementation performs
some sort of discount-based scaling scheme. (ii) Or-
thogonal initialization and layer scaling: Rather than us-
Figure 3: Overview of our model ing the default weight initialization scheme in the policy
and value network, an orthogonal initialization scheme
As shown in the figure 3, each input to the LSTM is is used, with the proportions varying between the lay-
a state list of length T arranged in time order. Starting ers. (iii) Adam learning rate annealing: PPO sometimes
from the farthest state, the hidden layer of the LSTM anneals the learning rate for optimization[40].
remembers the information of the state and passes it to The operation of LSTM in PPO is similar to the
the next point in time. We use the feature vector of the LSTM in above section, which is equivalent to mak-
most recent state, which is the current state, after one ing the agent recall the behavioral information of the
hidden layer of the LSTM with three linear layers. This previous moment while receiving new data, so that the
feature vector Ft0 will be used as input feature of goes decision made by the agent at this moment is based on
into the PPO for the agent to learn. the previous decision.
In the LSTM network, the initial number of features
is 181, the final number of output features is 128 and the
3.2.3. Proximal Policy Optimization (PPO) hidden size is 128. Linear layer 1 is (15 × 128, 128) and
PPO[37] is one of the most advanced of the cur- then passes the Tanh activation function. Linear layer2
rent policy-based approaches which use multiple epochs and 3 are the same two layers of (128, 128), and then
of stochastic gradient ascent to perform each policy pass Tanh.
update[39], and also performs best in stock trading
among the three DRL algorithms in [14], which is an 4. Performance Evaluations
important reason for our consideration of it. PPO is
very closely related to Trust Region Policy Optimization In this section, we first tuned some parameters in the
(TRPO): it can be seen as a refinement of TRPO[38]. In model, then used 30 Dow constituent stocks to evaluate
PPO, parameters of actor(agent) is θ. our model, and finally performed robustness tests on 30
First, a notation is used to represent the probability stocks in the SSE 50.
5
4.1. Description of Datasets 4.3. Baseline Methods
In this paper, 60 stocks are selected as the stock pool: Our model is compared with baseline models includ-
30 Dow constituent stocks, and 30 stocks randomly se- ing:
lected from the SSE50 in Chinese stock market. The • S&P 500: the typical Buy-And-Hold strategy.
stocks from Dow Jones are the same as the pool in [14]
to facilitate comparison with their ensemble strategy, • SSE50: another Buy-And-Hold strategy in Chi-
while the 30 stocks from SSE 50 are used to explore the nese market.
applicability of this paper’s model in the Chinese stock
market, a merging market. • Ensemble Strategy in [14]: they train agents for
The daily data for backtesting starts from 01/01/2009 three months simultaneously in the training phase
and ends on 05/08/2020, and the data set is divided into using A2C, DDPG and PPO algorithms and then
two parts: in-sample period and out-sample period. The select the agent with the highest Sharpe ratio as the
data in the in-sample period is used for training and val- trader for the next quarter. This process is repeated
idation, and the data in the out-sample period is used for until the end of the training.
trading. We only use PPO during the whole process.
4.4. Evaluation Measures
The entire dataset is split as shown in the Figure 4.
The training data is from 01/01/2009 to 09/30/2015, the • Comulative Return (CR): calculated by subtract-
validation data is from 10/01/2015 to 12/31/2015, and ing the portfolio’s final value from its initial value,
the trading data is from 01/01/2016 to 05/08/2020. In and then dividing by the initial value. It reflects
order to better exploit the data and allow the agent to the total return of a portfolio at the end of trading
better adapt to the dynamic changes of the stock market, stage.
Pend − P0
the agent can continue to be trained during the trading CR = (7)
phase. P0
• Max Earning Rate (MER): the maximum per-
centage profit during the trading period. It mea-
sures the robustness of a model and reflects the
trader’s ability to discover the potential maximum
profit margin.
Figure 4: Stock Data Splitting
max(A x − Ay )
MER = (8)
Ay
4.2. Training Parameters of PPO Where A x , Ay is the total asset of the strategy and
The agent is trained using only the actor-critic based x > y, Ay < A x .
PPO method, and the training parameters of PPO are set
• Maximum Pullback (MPB): the maximum per-
as shown in the table 1.
centage loss during the trading period. It measures
the robustness of a model.
Table 1: Training parameters of PPO
max(A x − Ay )
Parameter Value MPB = (9)
Ay
Reward Discount Factor 0.99
Update Frequency 128 Where A x , Ay is the total asset of the strategy and
Loss Function Weight of Critic 0.5 x > y, Ay > A x .
Loss Function Weight of Distribution Entropy 0.01
Clip Range 0.2 • Average Profitability Per Trade (APPT): refers
Maximum of Gradient Truncation 0.5 to the average amount that you can expect to win or
Optimizer Adam lose per trade. It measures the trading performance
β1 0.9 of the model.
β2 0.999 Pend − P0
 1e-8 APPT = (10)
NT
Learning Rate 3e-4
Where Pend − P0 means the returns at the end of
trading stage, and NT is the number of trading.
6
Figure 5: Trading results of different time windows in LSTM

• Sharpe Ratio (SR): calculated by subtracting the


Table 2: Comparison of different time windows in LSTM
annualized risk free rate from the annualized re- TW=5 TW=15 TW=30 TW=50
turn, and the dividing by the annualized volatility.
CR 32.69% 51.53% 90.81% 68.74%
It considers benefits and risks synthetically and re-
MER 68.27% 63.46% 113.50% 79.32%
flects the excess return over unit systematic risk.
MPB 58.93% 24.75% 46.51% 37.01%
E (RP ) − R f APPT 18.29 21.77 35.27 23.31
SR = (11)
σP SR 0.2219 0.7136 1.1540 0.9123
4.5. Exploration of Optimal Hyperparameters
We performed parameter tuning on two important layers) and then show the trading results of the model in
parts of the model: (i) the time window size of the Figure 6. (time window of LSTM is 30)
LSTM as a feature extractor. (ii) the hidden size of As can be seen from the figure 6, when hidden
LSTM in PPO training. size=512, the cumulative yield is significantly higher
than the other choices. It has a smaller drawback com-
4.5.1. Best Time Window of LSTM pared to hidden size=512*2 and was able to stop trading
For the time window of the LSTM, we tested the in the big drawback in March 2020, indicating that the
cases of TW=5,15,30,50, and then show the trading re- agent can be a smart trader under the right training con-
sults of the model in Figure 5. (hidden size of LSTM ditions of DRL.
in PPO is 512) From the figure, it can be seen that the
agent at TW=30 is able to achieve the highest cumula-
Table 3: Comparison of different hidden sizes of LSTM in PPO
tive return during the trading period, ahead of the agent HS=128 HS=256 HS=512 HS=1024 HS=512*2
at TW=50 by more than 20% and a figure that exceeds CR 69.94% 72.58% 90.81% 56.27% 89.13%
40% in agent without LSTM extraction. This verifies MER 84.29% 82.32% 113.50% 60.64% 92.34%
MPB 38.57% 29.92% 46.51% 30.39% 58.31%
the feasibility of our model: the stock price movements
APPT 28.07 30.79 35.27 23.96 33.26
in the stock market are correlated with their past trajec- SR 0.9255 1.0335 1.1540 0.8528 0.8447
tories, and the LSTM is able to extract the time-series
features of them.
The data in the table 3 shows the difference between
The data in the table 2 shows the difference between
these options in more detail: for HiddenSize=512, CR,
these options in more detail: for TimeWindow=30, CR,
MER, APPT and SR are much higher than the other op-
MER, APPT and SR are much higher than the other op-
tions, but MPB does not perform well. On balance, HS
tions, but MPB does not perform well. On balance, TW
= 512 is the optimal parameter.
= 30 is the optimal parameter.

4.5.2. Best Hidden Size of LSTM in PPO 4.6. Performance in U.S. Markets
For the hidden size of the LSTM in PPO, we tested The optimal parameters found (TW=30, HS=512)
the cases of HS=128, 256, 512, 1024, 512*2(two hidden were used as the parameters of the final model and the
7
Figure 6: Trading results of different hidden sizes of LSTM in PPO

Figure 7: Trading results by agents with LSTM in PPO plus LSTM feature extraction, LSTM in PPO, Ordinary PPO, Ensemble Strategy in [14]
and Buy-And-Hold strategy on DJI

results of this model were compared with the trading re- of risk tolerance and the ability to identify down markets
sults of the PPO model with LSTM in PPO and another and stop trading. In this respect, ensemble strategy does
with MlpPolicy and the Ensemble Strategy in [14], as a little better. But for a long time the agent of ensem-
shown in Figure 7. ble strategy has a negative attitude towards investing,
Table 4 shows the details of trading results in U.S. choosing not to trade whether the market is down or up.
Markets. This can lead to the loss of a large profit margin in the
long run. Overall, our model has the strongest profit-
taking ability, excels at finding profits within volatile
Table 4: Details of the trading results in U.S. Markets
PPO LSTM in PPO Our Model Ensemble DJI markets, and recovers quickly after pullbacks. In terms
CR 54.37% 49.77% 90.81% 70.40% 50.97% of the Sharpe ratio, our strategy is very close to the base-
MER 67.28% 63.45% 113.50% 65.32% 63.90%
MPB 28.30% 29.39% 46.51% 15.74% 72.32%
line model, but does not require the help of other algo-
APPT 20.02 22.84 35.27 28.54 N.A. rithms, which is much easier and faster.
SR 0.8081 0.6819 1.1540 1.3000 0.4149
Analyzing in more detail, all strategies can be divided
into two phases which are consistent with DJI index: (i)
Our model obtains the highest cumulative return of Accumulation phase: until 06/2017, our strategies were
90.81% and the maximum profitability of 113.5% on able to achieve stable growth with little difference in re-
30 Dow components, better than the ensemble strategy turns from the integrated strategies. However, after that,
in [14], the baseline model. However, in terms of max- our agent quickly captured profits and was able to grow
imum pullback, our model has an MPB=45.51% com- total returns rapidly. This phase lasted until 01/2018,
pared to DJI’s MPB=63.90%, indicating the possession when the cumulative return had reached a level where
8
Figure 8: Trading results by agents with LSTM in PPO plus LSTM feature extraction, Ensemble Strategy in [14] and Buy-And-Hold strategy on
SSE50

the difference with the final return was not significant. with our analysis of our model: the ability to capture re-
(ii) Volatility phase: Starting from 01/2018, our agent’s turns in volatility quickly and accurately, and the returns
trading style became very aggressive and courageous, obtained are positively correlated with positive volatil-
as reflected in the large fluctuations in returns. The re- ity.
turns were generally more stable during this phase and On further analysis, the cumulative return of our
were able to bounce back quickly within two months model increased rapidly from 01/2016 to 02/2018, fi-
after suffering a pullback on 01/2019. nally reaching almost three times that of the integrated
strategy. Subsequently, as the volatility of the SSE 50
4.7. Performance in Chinese Markets increased, the volatility of our model increased accord-
ingly. Within 02/2018 to 01/2019, the pullbacks of
The same model is used to trade the samples in the the two models were similar, but then our model cap-
Chinese stock market, as Figure 8 shown, and the en- tured nearly 80% of the return at 01/2019 within three
semble strategy is chosen as the most important base- months. Even though it suffered a decline in 01/2020
line model. We show the cumulative returns at the end due to the black swan event of Covid-19, it quickly
of the trade as follows. bounced back to its highest point six months later.
And table 5 shows the details of trading results in
Chinese Markets. In the Chinese market, the advan-
5. Conclusion
Table 5: Details of the trading results in Chinese Markets
Our Model Ensemble SSE50 In this paper, we propose a PPO model using cas-
CR 222.91% 120.87% 51.46% caded LSTM networks and compare the ensemble strat-
MER 222.91% 120.87% 51.46% egy in [14] as a baseline model in the U.S. market and
MPB 74.81% 39.95% 41.27% the Chinese market, respectively. The results show that
APPT 66.96 47.55 25.78 our model has a stronger profit-taking ability, and this
SR 2.3273 1.6938 0.4149 feature is more prominent in the Chinese market. How-
ever, according to the risk-return criterion, our model is
tages of our model are greater than the ensemble strat- exposed to higher pullback risk while obtaining high re-
egy. Our CR=222.91% is nearly twice that of ensemble turns. Finally, we believe that the strengths of our model
strategy (Ensemble CR=120.87%). Although our pull- can be more fully exploited in markets where the over-
backs are greater, the volatility is more reflected in the all trend is smoother and less volatile, as returns can be
rise in cumulative returns. Also, the Sharpe ratio tells more consistently accumulated in such an environment
us that our model (SR=2.3273) is better in the Chinese (such as the A-share market in China in recent years).
market when combining return and risk. This illustrates This suggests that there is indeed a potential return pat-
our model’s superior performance in emerging markets, tern in the stock market, and the LSTM as a time-series
which typically have greater volatility, and is consistent feature extractor plays an active role. Also, it shows that
9
the Chinese market is a suitable environment for devel- ing: An ensemble strategy.” In Proceedings of the First ACM
oping quantitative trading. International Conference on AI in Finance, pp. 1-8. 2020.
[15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei
In the subsequent experiments, improvements can be A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Mar-
made in the following aspects: (i) The amount of train- tin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
ing data. The training of PPO requires a large amount Human-level control through deep reinforcement learning. na-
of historical data to achieve good learning results, so ture, 518(7540):529-533, 2015.
[16] Hado Van Hasselt, Arthur Guez, and David Silver. Deep rein-
expanding the amount of training data may help to im- forcement learning with double qlearning. In Proceedings of the
prove the results. (ii) Reward function. Some improved AAAI conference on artificial intelligence, volume 30, 2016.
reward functions for stock trading have emerged, which [17] Lin Chen and Qiang Gao, “Application of deep reinforcement
learning on automated stock trading,” in 2019 IEEE 10th Inter-
can enhance the stability of the algorithm.
national Conference on Software Engineering and Service Sci-
ence (ICSESS), 2019, pp. 29–33.
[18] Quang-Vinh Dang, “Reinforcement learning in stock trading,”
References in Advanced Computational Methods for Knowledge Engineer-
ing. ICCSAMA 2019. Advances in Intelligent Systems and
[1] Fang, Y.-C. et al. “Research on Quantitative Investment Strate- Computing, vol 1121. Springer, Cham, 01 2020.
gies Based on Deep Learning.” Algorithms 12 (2019): 35. [19] Gyeeun Jeong and Ha Kim, “Improving financial trading deci-
[2] Kuo, Ren Jie. “A Decision Support System for the Stock Mar- sions using deep q-learning: predicting the number of shares,
ket Through Integration of Fuzzy Neural Networks and Fuzzy action strategies, and transfer learning,” Expert Systems with
Delphi.” Appl. Artif. Intell. 12 (1998): 501-520. Applications, vol. 117, 09 2018.
[3] [Link] and M. Kozaki, “An intelligent forecasting system of [20] Chien Yi Huang. Financial trading as a game: A deep reinforce-
stock price using neural networks,” Proc. IJCNN, Baltimore, ment learning approach. arXiv:1807.02787, 2018.
Maryland, 652-657, 1992. [21] Zhuoran Xiong, Xiao-Yang Liu, Shan Zhong, Hongyang Yang,
[4] S . Cheng, A Neural Network Approach for Forecasting and and Anwar Walid. Practical deep reinforcement learning ap-
Analyzing the Price-Volume Relationship in the Taiwan Stock proach for stock trading. arXiv preprint arXiv:1811.07522,
2018.
Market, Master’s thesis, National Jow-Tung University, 1994.
[5] Stelios D. Bekiros, “Fuzzy adaptive decision-making for bound- [22] Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep rein-
forcement learning for trading. The Journal of Financial Data
edly rational traders in speculative stock markets,” European
Journal of Operational Research, vol. 202, no. 1, pp. 285–293, Science, 2(2):25-40, 2020.
[23] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qiong-
2010.
[6] Yong Zhang and Xingyu Yang, “Online portfolio selection strat- hai Dai. Deep direct reinforcement learning for financial sig-
nal representation and trading. IEEE transactions on neural net-
egy based on combining experts’ advice,” Computational Eco-
nomics, vol. 50, 05 2016. works and learning systems, 28(3):653-664, 2016.
[24] WU Jia, WANG Chen, Lidong Xiong, and SUN Hongyong.
[7] Youngmin Kim, Wonbin Ahn, Kyong Joo Oh, and David Enke,
“An intelligent hybrid trading system for discovering trading Quantitative trading on stock market based on deep reinforce-
rules for the futures market using rough sets and genetic al- ment learning. In 2019 International Joint Conference on Neural
gorithms,” Applied Soft Computing, vol. 55, pp. 127–140, 02 Networks (IJCNN), pages 1-8. IEEE, 2019
2017. [25] Pricope, Tidor-Vlad. ”Deep reinforcement learning in quan-
[8] Zhang, Y., & Wu, L, Stock market prediction of s&p 500 via titative algorithmic trading: A review.” arXiv preprint
combination of improved bco approach and bp neural network. arXiv:2106.00123, 2021.
Expert Systems with Applications, 36, 8849–8854, 2009. [26] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term
[9] Carta, S.M., Ferreira, A., Podda, A.S., Recupero, D.R., & memory. Neural computation, 9(8):1735-1780, 1997.
Sanna, A. ,Multi-DQN: An ensemble of Deep Q-learning agents [27] A. Ilmanen, “Expected returns: An investor’s guide to harvest-
for stock market forecasting. Expert Syst. Appl., 164, 113820, ing market rewards,” 05 2012.
2021. [28] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas
[10] Hongyang Yang, Xiao-Yang Liu, and Qingwei Wu, “A practi- Schneider, John Schulman, Jie Tang, and Wojciech Zaremba,
cal machine learning approach for dynamic stock recommen- “Openai gym,” 2016.
dation,” in IEEE TrustCom/BiDataSE, 2018., 08 2018, pp. [29] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex
1693–1697. Nichol, Matthias Plappert, Alec Radford, John Schulman, Szy-
[11] Yunzhe Fang, Xiao-Yang Liu, and Hongyang Yang, “Practical mon Sidor, Yuhuai Wu, and Peter Zhokhov, “Openai baselines,”
machine learning approach to capture the scholar data driven [Link] 2017.
alpha in ai industry,” in 2019 IEEE International Conference on [30] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam
Big Data (Big Data) Special Session on Intelligent Data Mining, Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal,
12 2019, pp. 2230–2239. Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plap-
[12] Wenbin Zhang and Steven Skiena, “Trading strategies to exploit pert, Alec Radford, John Schulman, Szymon Sidor, and
blog and news sentiment.,” in Fourth International AAAI Con- Yuhuai Wu, “Stable baselines,” [Link]
ference on Weblogs and Social Media, 2010, 01 2010. baselines, 2018.
[13] Qian Chen and Xiao-Yang Liu, “Quantifying esg alpha using [31] Lample, Guillaume, and Devendra Singh Chaplot. ”Playing FPS
scholar big data: An automated machine learning approach,” gameswith deep reinforcement learning.” Thirty-First AAAI
ACM International Conference on AI in Finance, ICAIF 2020, Conference on Artificial Intelligence. 2017.
2020. [32] Terence Chong, Wing-Kam Ng, and Venus Liew, “Revisiting
[14] Yang, Hongyang, Xiao-Yang Liu, Shan Zhong, and Anwar the performance of macd and rsi oscillators,” Journal of Risk
Walid. ”Deep reinforcement learning for automated stock trad- and Financial Management, vol. 7, pp. 1–12, 03 2014.

10
[33] Mansoor Maitah, Petr Prochazka, Michal Cermak, and Karel
Sredl, “Commodity channel index: evaluation of trading rule of
agricultural commodities,” International Journal of Economics
and Financial Issues, vol. 6, pp. 176–178, 03 2016.
[34] Ikhlaas Gurrib, “Performance of the average directional index
as a market timing tool for the most actively traded usd based
currency pairs,” Banks and Bank Systems, vol. 13, pp. 58–70,
08 2018.
[35] Mark Kritzman and Yuanzhen Li, “Skulls, financial turbulence,
and risk management,” Financial Analysts Journal, vol. 66, 10
2010.
[36] A. Ilmanen, “Expected returns: An investor’s guide to harvest-
ing market rewards,” 05 2012.
[37] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford,
and Oleg Klimov, “Proximal policy optimization algorithms,”
arXiv:1707.06347, 07 2017.
[38] Engstrom, Logan, Andrew Ilyas, Shibani Santurkar, Dimitris
Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry.
”Implementation matters in deep policy gradients: A case study
on PPO and TRPO.” arXiv preprint arXiv:2005.12729, 2020.
[39] Y. Gu, Y. Cheng, C. L. P. Chen, and X. Wang, “Proximal Pol-
icy Optimization With Policy Feedback”, IEEE Transactions on
Systems, Man, and Cybernetics: Systems, pp. 1–11, 2021.
[40] Diederik P. Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. CoRR, abs/1412.6980, 2014.
[41] Wei Bao, Jun Yue, and Yulei Rao. A deep learning framework
for financial time series using stacked autoencoders and long-
short term memory. PloS one, 12(7):e0180944, 2017.
[42] Luca Di Persio and Oleksandr Honchar. Artificial neural net-
works architectures for stock price prediction: Comparisons and
applications. International journal of circuits, systems and signal
processing, 10:403–413, 2016.
[43] Thomas Fischer and Christopher Krauss. Deep learning
with long short-term memory networks for financial mar-
ket predictions. European Journal of Operational Research,
270(2):654–669, 2018.
[44] Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho
Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Using
deep learning to detect price change indications in financial mar-
kets. In 2017 25th European Signal Processing Conference (EU-
SIPCO), pages 2511–2515. IEEE, 2017.
[45] Bryan Lim, Stefan Zohren, and Stephen Roberts. Enhancing
time-series momentum strategies using deep neural networks.
The Journal of Financial Data Science, 2019.

11

You might also like