Automated Stock Trading with DRL
Abstract
More and more stock trading strategies are constructed using deep reinforcement learning (DRL) algorithms, but DRL methods, originally developed for the gaming community, are not directly adaptable to financial data with low signal-to-noise ratios and unevenness, and thus suffer from performance shortcomings. In this paper, to capture the hidden information, we propose a DRL-based stock trading system using cascaded LSTM, which first uses an LSTM to extract time-series features from daily stock data and then feeds the extracted features to the agent for training, while the policy function in reinforcement learning is also trained with another LSTM. Experiments on the DJI in the US market and the SSE50 in the Chinese stock market show that our model outperforms previous baseline models in terms of cumulative return and Sharpe ratio, and this advantage is more significant in the Chinese stock market, an emerging market. This indicates that our proposed method is a promising way to build an automated stock trading system.
Keywords: Deep Reinforcement Learning, Long Short-Term Memory, Automated stock trading, Proximal policy
optimization, Markov Decision Process
Figure 1: Agent-environment interaction in reinforcement learning

Experiments show that the automated stock trading strategy based on LSTM outperforms the ensemble strategy and the buy-and-hold strategy represented by the Dow Jones Industrial (DJI) index in terms of cumulative return, and it also performs better in the Chinese stock market in terms of both cumulative return and Sharpe ratio.

The main contributions of this paper are two-fold: (i) In the literature, a security's past prices and related technical indicators are often used to represent the state space. Instead of using the raw data, we use an LSTM to extract time-series features from daily stock data to represent the state space, since the memory property of the LSTM can discover features of the stock market that change over time and integrate hidden information in the time dimension, thus bringing the POMDP closer to an MDP. (ii) Different from previous DRL-based methods, which use multi-layer neural networks or convolutional networks in agent training, we use an LSTM as the training network, because it is a type of recurrent neural network capable of learning order dependence in sequence prediction problems.

The rest of this paper is organized as follows. Section 2 is a brief introduction to work related to stock trading using reinforcement learning, categorized by training algorithm. Section 3 focuses on our algorithm.

2.1. Critic-only
The critic-only approach, the most common of the three, uses only the action-value function to make decisions, with the aim of maximizing the expected reward of each action choice given the current state. The action-value function Q receives the current state and the possible actions as input and outputs an expected Q value as the reward. One of the most popular and successful approaches is the Deep Q Network (DQN)[15] and its extensions[16]. Chen[17], Dang[18] and Jeong[19] used this method to train agents on a single stock or asset. Chen[17] and Huang[20] used agents trained with a Deep Recurrent Q Network (DRQN) to achieve higher cumulative returns on quantitative trading than baseline models and DQN. However, the main limitation of the method is that it performs better on discrete state spaces, whereas stock prices are continuous. If a larger number of stocks or assets is selected, the state space and action space grow exponentially[21], which weakens the performance of DQN.

2.2. Actor-only
The actor-only approach is able to learn policies directly, and the action space can be considered continuous. Therefore, the advantage of this approach is that it can directly learn the optimal mapping from a particular state to an action, which can be either discrete or continuous. Its disadvantages are that it requires a large amount of data for experiments and a long time to obtain the optimal strategy[22].
Deng[23] used this method and applied a recurrent deep neural network to real-time financial trading for the first time. Wu[24] also explored the actor-only method in quantitative trading, comparing deep neural networks (LSTM) with fully connected networks in detail and discussing the impact of some combinations of technical indicators on daily-data performance in the Chinese market, showing that deep neural networks are superior. The results of his experiments are mixed: the proposed approach can yield decent profits in some stocks, but performs mediocrely in others.

2.3. Actor-critic
The actor-critic approach aims to train two models simultaneously, with the actor learning how to make the agent respond in a given state and the critic evaluating the responses. Currently, this approach is considered one of the most successful algorithms in RL, and Proximal Policy Optimization (PPO) is the most advanced actor-critic approach available. It performs better because it addresses well-known problems that arise when applying RL to complex environments, such as instability due to the distribution of observations and rewards constantly changing as the agent learns[25]. In this paper, the baseline model[14] is constructed based on the actor-critic approach, using a combination of three DRL algorithms: PPO, A2C, and DDPG. However, the agent that learns only with PPO outperforms the ensemble strategy in terms of cumulative return.

2.4. LSTM in stock system
Although Long Short-Term Memory networks (LSTM)[26] are traditionally used in natural language processing, many recent works have applied them in financial markets to filter some of the noise in raw market data[41][42][43][44]. Stock prices and the technical indicators generated from them are interconnected, so an LSTM can be used as a feature extractor to extract potentially profitable patterns in the time series of these indicators. Zhang[22] and Wu[24] have tried to integrate an LSTM for feature extraction while training the agent with a DRL algorithm, and their experiments show that it works better than the baseline model. The work of Lim[45] also shows that LSTM delivers superior performance in modelling daily financial data.

3. Our method

3.1. Stock Market Environment
The stock market environment used in this paper is a simulation environment developed in [14] based on the OpenAI gym[28][29][30], which is able to give the agent various information for training, such as current stock prices, shareholdings and technical indicators. We use a Markov Decision Process (MDP) to model stock trading[36], so the information that should be included in this multi-stock trading environment is: state, action, reward, policy and Q-value. Suppose we have a portfolio with 30 stocks.

3.1.1. State Space
A 181-dimensional vector consisting of seven parts of information represents the state space of the multi-stock trading environment: [b_t, p_t, h_t, M_t, R_t, C_t, X_t]. Each of these components is defined below. In [14], this 181-dimensional vector is fed as a state directly into the reinforcement learning algorithm for learning. Our approach, however, is to first give T such 181-dimensional vectors (T is the time window of the LSTM) to the LSTM, and the feature vector generated by the LSTM is then given to the agent as our state; a short code sketch follows the list below.

1. b_t ∈ R_+: available balance at current time step t.
2. p_t ∈ R_+^30: adjusted close price of each stock at current time step t.
3. h_t ∈ Z_+^30: shares owned of each stock at current time step t.
4. M_t ∈ R^30: Moving Average Convergence Divergence (MACD), calculated using the close price of each stock at current time step t. MACD is one of the most commonly used momentum indicators based on moving averages[32].
5. R_t ∈ R_+^30: Relative Strength Index (RSI), calculated using the close price of each stock at current time step t. RSI quantifies the extent of recent price changes[32].
6. C_t ∈ R_+^30: Commodity Channel Index (CCI), calculated using the high, low and close prices. CCI compares the current price to the average price over a time window to indicate a buying or selling action[33].
7. X_t ∈ R^30: Average Directional Index (ADX), calculated using the high, low and close prices of each stock at current time step t. ADX identifies trend strength by quantifying the amount of price movement[34].
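To make the state construction above concrete, here is a minimal sketch (with our own helper names, not the paper's code) that assembles the 181-dimensional vector (1 + 6 × 30) and maintains the rolling window F_t of the last T states fed to the LSTM:

```python
import numpy as np
from collections import deque

N_STOCKS = 30                    # Dow 30 portfolio, as in the paper
STATE_DIM = 1 + 6 * N_STOCKS     # = 181: balance + price/holdings/MACD/RSI/CCI/ADX

def build_state(balance, prices, holdings, macd, rsi, cci, adx):
    """Concatenate the seven parts [b_t, p_t, h_t, M_t, R_t, C_t, X_t] into one 181-d vector."""
    state = np.concatenate((
        [balance],               # b_t  (1,)
        prices,                  # p_t  (30,)
        holdings,                # h_t  (30,)
        macd, rsi, cci, adx      # M_t, R_t, C_t, X_t, each (30,)
    )).astype(np.float32)
    assert state.shape == (STATE_DIM,)
    return state

class StateWindow:
    """Keeps the last T states and returns F_t = [S_{t-T+1}, ..., S_t] for the LSTM."""
    def __init__(self, T):
        self.T = T
        self.buffer = deque(maxlen=T)

    def push(self, state):
        if not self.buffer:                  # pad the very first step by repeating S_1
            self.buffer.extend([state] * self.T)
        else:
            self.buffer.append(state)
        return np.stack(self.buffer)         # shape (T, 181)
```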
3.1.2. Action Space
A set containing 2k + 1 elements represents the action space of the multi-stock trading environment: {−k, ..., −1, 0, 1, ..., k}, where k and −k represent the number of shares we can buy and sell at once. It satisfies the following conditions (a sketch of the action handling follows the list):

1. h_max represents the maximum number of shares we are able to buy at a time.
2. The action space can be considered continuous, since the entire action space is of size (2k + 1)^30.
3. The action space is then normalized to [−1, 1].
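As an illustration of how the normalized actions can be turned back into trades, the sketch below maps values in [−1, 1] to integer share counts bounded by h_max, the current holdings and the remaining balance; the decoding rules and the helper name are our assumptions, not code from the paper:

```python
import numpy as np

H_MAX = 100  # maximum shares per stock in a single trade (Section 3.1.5)

def decode_actions(raw_actions, prices, holdings, balance):
    """Map normalized actions in [-1, 1] to integer buy/sell quantities per stock.

    A negative entry sells up to that fraction of H_MAX (bounded by current holdings);
    a positive entry buys up to that fraction of H_MAX (bounded by available cash).
    """
    shares = (np.clip(raw_actions, -1.0, 1.0) * H_MAX).astype(int)
    trades = np.zeros_like(shares)
    for i, k in enumerate(shares):
        if k < 0:                                   # sell
            trades[i] = -min(-k, holdings[i])
        elif k > 0:                                 # buy, limited by remaining balance
            affordable = int(balance // prices[i])
            trades[i] = min(k, affordable)
            balance -= trades[i] * prices[i]
    return trades
```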
3.1.3. Reward
We define the reward of the multi-stock trading environment as the change in portfolio value from state s, taking action a, to the next state s' (here, two consecutive trading days), with the training objective of obtaining a trading strategy that maximizes the return:

Return_t(s_t, a_t, s_{t+1}) = (b_{t+1} + p_{t+1}^T h_{t+1}) − (b_t + p_t^T h_t) − c_t    (1)

where c_t represents the transaction cost. We assume the per-transaction cost to be 0.1% of the value of each transaction, as defined in [10]:

c_t = 0.1% × p_t^T k_t    (2)
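A small sketch of Eqs. (1)–(2) in code, assuming the 0.1% cost is charged on the absolute value of the shares traded and applying the 1e-4 reward scaling described in Section 3.1.5; the function and variable names are ours:

```python
import numpy as np

COST_RATE = 0.001        # 0.1% per-transaction cost, as in Eq. (2)
REWARD_SCALING = 1e-4    # scaling factor from Section 3.1.5

def step_reward(balance_t, prices_t, holdings_t,
                balance_t1, prices_t1, holdings_t1, shares_traded):
    """Return_t = (b_{t+1} + p_{t+1}^T h_{t+1}) - (b_t + p_t^T h_t) - c_t, then scaled."""
    cost = COST_RATE * (np.abs(shares_traded) @ prices_t)   # c_t = 0.1% * p_t^T |k_t| (our reading)
    value_t = balance_t + prices_t @ holdings_t
    value_t1 = balance_t1 + prices_t1 @ holdings_t1
    return REWARD_SCALING * (value_t1 - value_t - cost)
```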
3.1.4. Turbulence Threshold
We employ the financial turbulence index turbulence_t [14], which measures extreme asset price movements, to avoid the risk of sudden events that may cause a stock market crash[35], such as the March 2020 crash caused by COVID-19, wars and financial crises:

turbulence_t = (y_t − μ) Σ^{−1} (y_t − μ)' ∈ R    (3)

where y_t ∈ R^30 denotes the stock returns for the current period t, μ ∈ R^30 denotes the average of historical returns, and Σ ∈ R^{30×30} denotes the covariance of historical returns. Considering the historical volatility of the stock market, we set the turbulence threshold to the 90th percentile of all historical turbulence indexes. If turbulence_t is greater than this threshold, extreme market conditions are occurring, and the agent stops trading until the turbulence index falls below the threshold again.
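Eq. (3) and the 90th-percentile stop-trading rule can be written directly with numpy; this is a sketch with our own names, using a pseudo-inverse in case the historical covariance is near-singular:

```python
import numpy as np

def turbulence_index(returns_t, hist_returns):
    """Eq. (3): (y_t - mu) Sigma^{-1} (y_t - mu)', with mu and Sigma from historical returns."""
    mu = hist_returns.mean(axis=0)                    # (30,)
    sigma = np.cov(hist_returns, rowvar=False)        # (30, 30)
    dev = returns_t - mu
    return float(dev @ np.linalg.pinv(sigma) @ dev)

def should_halt_trading(turb_t, hist_turbulence, q=90):
    """Stop trading while the index exceeds the 90th percentile of its history."""
    return turb_t > np.percentile(hist_turbulence, q)
```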
3.1.5. Other Parameters
In addition to defining the state space, action space and reward function, some necessary constraints need to be added to the multi-stock trading environment:

1. Initial capital: $1 million.
2. Maximum number of shares in a single trade h_max: 100.
3. Reward scaling factor: 1e-4, which means the reward returned by the environment is only 1e-4 of the original one.

3.2. Stock Trading Agent

3.2.1. Framework
We introduce an LSTM as a feature extractor to improve the model in [14], as shown in Figure 2.

Figure 2: Overview of our model

At time step t, the environment automatically generates the current state S_t and passes it to the LSTM network. The LSTM remembers the state and uses its memory to retrieve the past T stock market states, obtaining the state sequence F_t = [S_{t−T+1}, ..., S_t]. The LSTM analyzes and extracts the hidden time-series features, or potentially profitable patterns, in F_t, then outputs the encoded feature vector F'_t and passes it to the agent, which is guided by the policy function π(F'_t) to perform the optimal action a_t. The environment then returns the reward R_t, the next state S_{t+1}, and a boolean d_t indicating whether the episode has terminated as a result of the agent's behavior. The obtained quintet (S_t, a_t, R_t, S_{t+1}, d_t) is stored in the experience pool. The actor computes the advantage A_t from the targets computed by the critic using the advantage function. After a certain number of steps, the actor back-propagates the error through the loss function J_{θ_i}(θ) = Σ_{t=1}^{T} [ (π_θ(a_t|s_t) / π_{θ_i}(a_t|s_t)) A_t^{π_θ} − β KL(θ, θ_i) ] and updates θ by gradient descent, while the critic updates its parameters using the mean-squared-error loss. The environment keeps repeating this process until the end of the training phase.
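The interaction loop just described can be summarized in schematic Python; env, window, lstm_extractor, agent and buffer are illustrative stand-ins for the gym environment, the cascaded LSTM, the PPO actor-critic and the experience pool, not the paper's implementation:

```python
# Schematic training loop for Figure 2 (interfaces are illustrative only).
def run_episode(env, window, lstm_extractor, agent, buffer):
    state = env.reset()                       # S_1, the 181-d market state
    done = False
    while not done:
        F_t = window.push(state)              # F_t = [S_{t-T+1}, ..., S_t]
        feat = lstm_extractor(F_t)            # encoded feature vector F'_t
        action = agent.act(feat)              # a_t ~ pi(F'_t)
        next_state, reward, done, _ = env.step(action)
        buffer.add((state, action, reward, next_state, done))  # (S_t, a_t, R_t, S_{t+1}, d_t)
        state = next_state
    agent.update(buffer, lstm_extractor)      # PPO-style actor/critic updates after enough steps
```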
3.2.2. LSTM as Feature Extractor
Reinforcement Learning (RL) was initially applied to games, which have a limited state space, a limited action space, clear stopping conditions and a more stable environment, so there is room for improvement when applying RL to stock trading. It is well known that financial markets are full of noise and uncertainty, and that the factors affecting stock prices are numerous and change over time. This makes the stock trading process more like a partially observable Markov decision process (POMDP), since the states we use are not the real states of the stock trading environment. Therefore, we can use the memory property of the LSTM to discover features of the stock market that change over time. The LSTM can integrate information hidden in the time dimension, thus bringing the POMDP closer to an MDP[24][31].

In this paper, an LSTM-based feature extractor is developed using the customized feature extraction interface provided by stable-baselines3. The network structure of the LSTM feature extractor is shown in Figure 3.
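A minimal sketch of such an extractor, built on stable-baselines3's BaseFeaturesExtractor interface, assuming the environment exposes the stacked window of shape (T, 181) as its observation and using the best hidden size reported later (HS = 512); the exact layers of the paper's network are in Figure 3 and are not reproduced here:

```python
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class LSTMExtractor(BaseFeaturesExtractor):
    """LSTM feature extractor: encodes a (T, 181) observation window into one feature vector."""
    def __init__(self, observation_space, hidden_size=512):
        super().__init__(observation_space, features_dim=hidden_size)
        n_features = observation_space.shape[-1]          # 181 in our setting
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)

    def forward(self, observations):
        # observations: (batch, T, 181) -> use the last hidden state as F'_t
        _, (h_n, _) = self.lstm(observations)
        return h_n[-1]                                     # (batch, hidden_size)

# Plugged into PPO through policy_kwargs (environment construction omitted):
# model = PPO("MlpPolicy", env,
#             policy_kwargs=dict(features_extractor_class=LSTMExtractor,
#                                features_extractor_kwargs=dict(hidden_size=512)))
```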
PPO uses the probability ratio between the new policy and the old one:

r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)    (4)

from which we get r_t(θ_old) = 1. The clipped surrogate objective function of PPO is:

H^{CLIP}(θ) = Ê[min(r_t(θ) Â(s_t, a_t), clip(r_t(θ), r_t(θ_old) − ε, r_t(θ_old) + ε) Â(s_t, a_t))]    (5)

That is,

H^{CLIP}(θ) = Ê[min(r_t(θ) Â(s_t, a_t), clip(r_t(θ), 1 − ε, 1 + ε) Â(s_t, a_t))]    (6)
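For reference, Eq. (6) corresponds to the following PyTorch sketch of the clipped surrogate objective (to be maximized; an optimizer would minimize its negative). This is an illustration, not the stable-baselines3 implementation:

```python
import torch

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Eq. (6): E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]."""
    ratio = torch.exp(log_prob_new - log_prob_old)          # r_t(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()
```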
4.5.2. Best Hidden Size of LSTM in PPO
For the hidden size of the LSTM in PPO, we tested the cases HS = 128, 256, 512, 1024 and 512×2 (two hidden layers); the trading results are shown in Figure 6.

Figure 6: Trading results of different hidden sizes of LSTM in PPO

4.6. Performance in U.S. Markets
The optimal parameters found (TW = 30, HS = 512) were used as the parameters of the final model, and the results of this model were compared with the trading results of the PPO model with LSTM in PPO, another with MlpPolicy, and the Ensemble Strategy in [14], as shown in Figure 7.

Figure 7: Trading results by agents with LSTM in PPO plus LSTM feature extraction, LSTM in PPO, ordinary PPO, the Ensemble Strategy in [14] and the Buy-And-Hold strategy on DJI
Table 4 shows the details of the trading results in U.S. markets.

Table 4: Details of the trading results in U.S. Markets

        PPO       LSTM in PPO   Our Model   Ensemble   DJI
CR      54.37%    49.77%        90.81%      70.40%     50.97%
MER     67.28%    63.45%        113.50%     65.32%     63.90%
MPB     28.30%    29.39%        46.51%      15.74%     72.32%
APPT    20.02     22.84         35.27       28.54      N.A.
SR      0.8081    0.6819        1.1540      1.3000     0.4149

Our model obtains the highest cumulative return of 90.81% and the maximum profitability of 113.5% on the 30 Dow components, better than the ensemble strategy in [14], the baseline model. In terms of maximum pullback, our model has MPB = 45.51% compared to DJI's 63.90%, indicating some risk tolerance and the ability to identify down markets and stop trading; in this respect, however, the ensemble strategy does a little better. But for long stretches the agent of the ensemble strategy takes a negative attitude towards investing, choosing not to trade whether the market is down or up, which can forfeit a large profit margin in the long run. Overall, our model has the strongest profit-taking ability, excels at finding profits in volatile markets, and recovers quickly after pullbacks. In terms of the Sharpe ratio, our strategy is very close to the baseline model, but does not require the help of other algorithms, which makes it much simpler and faster.
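The headline metrics in Table 4 can be reproduced from a daily portfolio-value series along the following lines; the exact formulas (e.g., annualization with 252 trading days) are common conventions we assume here, since the paper does not spell them out:

```python
import numpy as np

def cumulative_return(values):
    """CR: total growth of the portfolio value over the test period."""
    return values[-1] / values[0] - 1.0

def max_pullback(values):
    """MPB: worst peak-to-trough drawdown of the value curve."""
    running_peak = np.maximum.accumulate(values)
    return np.max((running_peak - values) / running_peak)

def sharpe_ratio(values, risk_free=0.0, periods_per_year=252):
    """SR: annualized mean excess daily return over its standard deviation."""
    daily = np.diff(values) / values[:-1]
    excess = daily - risk_free / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std()
```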
Analyzing in more detail, all strategies can be divided into two phases consistent with the DJI index. (i) Accumulation phase: until 06/2017, our strategy achieved stable growth with little difference in returns from the ensemble strategy. After that, however, our agent quickly captured profits and grew total returns rapidly. This phase lasted until 01/2018, by which time the cumulative return had reached a level not far from the final return. (ii) Volatility phase: starting from 01/2018, our agent's trading style became very aggressive, as reflected in the large fluctuations in returns. Returns were generally more stable during this phase and bounced back within two months after suffering a pullback in 01/2019.

4.7. Performance in Chinese Markets
The same model is used to trade the samples in the Chinese stock market, as shown in Figure 8, and the ensemble strategy is chosen as the most important baseline model. We show the cumulative returns at the end of the trading period below; Table 5 gives the details of the trading results in Chinese markets.

Figure 8: Trading results by agents with LSTM in PPO plus LSTM feature extraction, Ensemble Strategy in [14] and Buy-And-Hold strategy on SSE50

Table 5: Details of the trading results in Chinese Markets

        Our Model   Ensemble   SSE50
CR      222.91%     120.87%    51.46%
MER     222.91%     120.87%    51.46%
MPB     74.81%      39.95%     41.27%
APPT    66.96       47.55      25.78
SR      2.3273      1.6938     0.4149

In the Chinese market, the advantages of our model over the ensemble strategy are even greater. Our CR = 222.91% is nearly twice that of the ensemble strategy (CR = 120.87%). Although our pullbacks are larger, the volatility is reflected mostly in the rise of cumulative returns. The Sharpe ratio also shows that our model (SR = 2.3273) is better in the Chinese market when combining return and risk. This illustrates our model's superior performance in emerging markets, which typically have greater volatility, and is consistent with our analysis of the model: it captures returns in volatility quickly and accurately, and the returns obtained are positively correlated with upside volatility.

On further analysis, the cumulative return of our model increased rapidly from 01/2016 to 02/2018, finally reaching almost three times that of the ensemble strategy. Subsequently, as the volatility of the SSE 50 increased, the volatility of our model increased accordingly. From 02/2018 to 01/2019 the pullbacks of the two models were similar, but our model then captured nearly 80% of its return around 01/2019 within three months. Even though it suffered a decline in 01/2020 due to the black swan event of COVID-19, it quickly bounced back to its highest point six months later.

5. Conclusion
In this paper, we propose a PPO model using cascaded LSTM networks and compare it against the ensemble strategy in [14] as a baseline model in the U.S. market and the Chinese market, respectively. The results show that our model has a stronger profit-taking ability, and this feature is more prominent in the Chinese market. However, according to the risk-return criterion, our model is exposed to higher pullback risk while obtaining high returns. Finally, we believe that the strengths of our model can be more fully exploited in markets where the overall trend is smoother and less volatile, as returns can be accumulated more consistently in such an environment (such as the A-share market in China in recent years). This suggests that there is indeed a potential return pattern in the stock market, and that the LSTM as a time-series feature extractor plays an active role. It also shows that the Chinese market is a suitable environment for developing quantitative trading.
In subsequent experiments, improvements can be made in the following aspects: (i) The amount of training data. The training of PPO requires a large amount of historical data to achieve good learning results, so expanding the training data may help to improve the results. (ii) The reward function. Some improved reward functions for stock trading have emerged, which can enhance the stability of the algorithm.

References

[1] Fang, Y.-C. et al. "Research on Quantitative Investment Strategies Based on Deep Learning." Algorithms 12 (2019): 35.
[2] Kuo, Ren Jie. "A Decision Support System for the Stock Market Through Integration of Fuzzy Neural Networks and Fuzzy Delphi." Appl. Artif. Intell. 12 (1998): 501-520.
[3] [Link] and M. Kozaki, "An intelligent forecasting system of stock price using neural networks," Proc. IJCNN, Baltimore, Maryland, 652-657, 1992.
[4] S. Cheng, A Neural Network Approach for Forecasting and Analyzing the Price-Volume Relationship in the Taiwan Stock Market, Master's thesis, National Jow-Tung University, 1994.
[5] Stelios D. Bekiros, "Fuzzy adaptive decision-making for boundedly rational traders in speculative stock markets," European Journal of Operational Research, vol. 202, no. 1, pp. 285–293, 2010.
[6] Yong Zhang and Xingyu Yang, "Online portfolio selection strategy based on combining experts' advice," Computational Economics, vol. 50, 05 2016.
[7] Youngmin Kim, Wonbin Ahn, Kyong Joo Oh, and David Enke, "An intelligent hybrid trading system for discovering trading rules for the futures market using rough sets and genetic algorithms," Applied Soft Computing, vol. 55, pp. 127–140, 02 2017.
[8] Zhang, Y., & Wu, L., Stock market prediction of S&P 500 via combination of improved BCO approach and BP neural network. Expert Systems with Applications, 36, 8849–8854, 2009.
[9] Carta, S.M., Ferreira, A., Podda, A.S., Recupero, D.R., & Sanna, A., Multi-DQN: An ensemble of Deep Q-learning agents for stock market forecasting. Expert Syst. Appl., 164, 113820, 2021.
[10] Hongyang Yang, Xiao-Yang Liu, and Qingwei Wu, "A practical machine learning approach for dynamic stock recommendation," in IEEE TrustCom/BigDataSE 2018, 08 2018, pp. 1693–1697.
[11] Yunzhe Fang, Xiao-Yang Liu, and Hongyang Yang, "Practical machine learning approach to capture the scholar data driven alpha in AI industry," in 2019 IEEE International Conference on Big Data (Big Data) Special Session on Intelligent Data Mining, 12 2019, pp. 2230–2239.
[12] Wenbin Zhang and Steven Skiena, "Trading strategies to exploit blog and news sentiment," in Fourth International AAAI Conference on Weblogs and Social Media, 2010, 01 2010.
[13] Qian Chen and Xiao-Yang Liu, "Quantifying ESG alpha using scholar big data: An automated machine learning approach," ACM International Conference on AI in Finance, ICAIF 2020, 2020.
[14] Yang, Hongyang, Xiao-Yang Liu, Shan Zhong, and Anwar Walid. "Deep reinforcement learning for automated stock trading: An ensemble strategy." In Proceedings of the First ACM International Conference on AI in Finance, pp. 1-8, 2020.
[15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[16] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[17] Lin Chen and Qiang Gao, "Application of deep reinforcement learning on automated stock trading," in 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS), 2019, pp. 29–33.
[18] Quang-Vinh Dang, "Reinforcement learning in stock trading," in Advanced Computational Methods for Knowledge Engineering. ICCSAMA 2019. Advances in Intelligent Systems and Computing, vol. 1121, Springer, Cham, 01 2020.
[19] Gyeeun Jeong and Ha Kim, "Improving financial trading decisions using deep Q-learning: predicting the number of shares, action strategies, and transfer learning," Expert Systems with Applications, vol. 117, 09 2018.
[20] Chien Yi Huang. Financial trading as a game: A deep reinforcement learning approach. arXiv:1807.02787, 2018.
[21] Zhuoran Xiong, Xiao-Yang Liu, Shan Zhong, Hongyang Yang, and Anwar Walid. Practical deep reinforcement learning approach for stock trading. arXiv preprint arXiv:1811.07522, 2018.
[22] Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep reinforcement learning for trading. The Journal of Financial Data Science, 2(2):25-40, 2020.
[23] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653-664, 2016.
[24] WU Jia, WANG Chen, Lidong Xiong, and SUN Hongyong. Quantitative trading on stock market based on deep reinforcement learning. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2019.
[25] Pricope, Tidor-Vlad. "Deep reinforcement learning in quantitative algorithmic trading: A review." arXiv preprint arXiv:2106.00123, 2021.
[26] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[27] A. Ilmanen, "Expected returns: An investor's guide to harvesting market rewards," 05 2012.
[28] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba, "OpenAI Gym," 2016.
[29] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov, "OpenAI baselines," [Link], 2017.
[30] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu, "Stable baselines," [Link], 2018.
[31] Lample, Guillaume, and Devendra Singh Chaplot. "Playing FPS games with deep reinforcement learning." Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[32] Terence Chong, Wing-Kam Ng, and Venus Liew, "Revisiting the performance of MACD and RSI oscillators," Journal of Risk and Financial Management, vol. 7, pp. 1–12, 03 2014.
[33] Mansoor Maitah, Petr Prochazka, Michal Cermak, and Karel Sredl, "Commodity channel index: evaluation of trading rule of agricultural commodities," International Journal of Economics and Financial Issues, vol. 6, pp. 176–178, 03 2016.
[34] Ikhlaas Gurrib, "Performance of the average directional index as a market timing tool for the most actively traded USD based currency pairs," Banks and Bank Systems, vol. 13, pp. 58–70, 08 2018.
[35] Mark Kritzman and Yuanzhen Li, "Skulls, financial turbulence, and risk management," Financial Analysts Journal, vol. 66, 10 2010.
[36] A. Ilmanen, "Expected returns: An investor's guide to harvesting market rewards," 05 2012.
[37] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347, 07 2017.
[38] Engstrom, Logan, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. "Implementation matters in deep policy gradients: A case study on PPO and TRPO." arXiv preprint arXiv:2005.12729, 2020.
[39] Y. Gu, Y. Cheng, C. L. P. Chen, and X. Wang, "Proximal Policy Optimization With Policy Feedback," IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–11, 2021.
[40] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[41] Wei Bao, Jun Yue, and Yulei Rao. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PloS ONE, 12(7):e0180944, 2017.
[42] Luca Di Persio and Oleksandr Honchar. Artificial neural networks architectures for stock price prediction: Comparisons and applications. International Journal of Circuits, Systems and Signal Processing, 10:403–413, 2016.
[43] Thomas Fischer and Christopher Krauss. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2):654–669, 2018.
[44] Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Using deep learning to detect price change indications in financial markets. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 2511–2515. IEEE, 2017.
[45] Bryan Lim, Stefan Zohren, and Stephen Roberts. Enhancing time-series momentum strategies using deep neural networks. The Journal of Financial Data Science, 2019.