Contention Window Optimization in IEEE 802.11ax Networks With Deep Reinforcement Learning

Abstract—The proper setting of contention window (CW) values has a significant impact on the efficiency of Wi-Fi networks. Unfortunately, the standard method used by 802.11 networks is not scalable enough to maintain stable throughput for an increasing number of stations, despite 802.11ax being designed to improve Wi-Fi performance in dense scenarios. To this end, we propose a new method of CW control which leverages deep reinforcement learning principles to learn the correct settings under different network conditions. Our method supports two trainable control algorithms, which, as we demonstrate through simulations, offer efficiency close to optimal while keeping computational cost low.

Index Terms—Contention window, IEEE 802.11, MAC protocols, reinforcement learning, WLAN

W. Wydmański and S. Szott are with AGH University, Krakow, Poland. This work was supported by the Polish Ministry of Science and Higher Education with the subvention funds of the Faculty of Computer Science, Electronics and Telecommunications of AGH University. This research was supported in part by PLGrid Infrastructure. The authors wish to thank Jakub Mojsiejuk for his remarks on an early draft of the paper.

I. INTRODUCTION

The latest IEEE 802.11 amendment (802.11ax) is scheduled for release in 2020, with the goal of increasing Wi-Fi network efficiency. However, to ensure backward compatibility, one efficiency-related aspect remains unchanged in 802.11ax: the basic channel access method [1]. This method is an implementation of carrier-sense multiple access with collision avoidance (CSMA/CA) wherein each station backs off by waiting a certain number of time slots before accessing the channel. This number is chosen at random from 0 to CW (the contention window). To reduce the probability of several stations selecting the same random number, CW is doubled after each collision. IEEE 802.11 defines static CW minimum and maximum values, and this approach, while being robust to network changes and requiring few computations, can lead to inefficient operation, especially in dense networks [2].

CW optimization has a direct impact on network performance and has been frequently studied (e.g., using control theory [3]). With the proliferation of network devices with high computational capabilities, CW optimization can now be analyzed using reinforcement learning (RL) [4]. RL is well-suited to the problem of improving the performance of wireless networks because it deals with intelligent software agents (network nodes) taking actions (e.g., optimizing parameters) in an environment (wireless radio) to maximize a reward (e.g., throughput) [4]. RL is an example of model-free policy optimization, offering better generalization capabilities than conventional, model-based optimization approaches such as control theory¹. Recent examples of applying RL to wireless local area networks include a jamming countermeasure [5] and an ML-enabling architecture [6]. RL performance can be further improved by using deep artificial neural networks with their potential for interpolation and superior scalability. A recent example of using deep RL (DRL) in wireless networks is an adaptable MAC protocol [7]. The authors of [8] also claim to use DRL in the area of CW optimization. However, a careful reading reveals that they use Q-learning (a typical RL method) but without the neural network (deep) component. Thus, we conclude that DRL has not yet been successfully applied to study IEEE 802.11 CW optimization.

¹This means that while RL algorithms try to directly learn an optimal policy without learning the model of the environment, model-based approaches need to make assumptions about the model's next state before choosing an action.

In this letter, we describe CCOD (Centralized Contention window Optimization with DRL), our proposed method of applying DRL to the task of optimizing saturation throughput of 802.11 networks by correctly predicting CW values. While CCOD is universally applicable to any 802.11 network, we exhibit its operation under 802.11ax using two DRL methods: Deep Q-Network (DQN) [9] and Deep Deterministic Policy Gradient (DDPG) [10]. The former is considered a showcase DRL algorithm, while the latter is a more advanced method, able to directly learn the optimal policy, which we expect will lead to increased network performance, especially in dense scenarios. Additionally, we demonstrate how we applied time series analysis to the recurrent neural networks of both DRL methods. Finally, we provide the complete source code so that the work can serve as a stepping stone for further development of DRL-based methods in 802.11 networks².

²https://github.com/wwydmanski/RLinWiFi

II. DRL BACKGROUND

In general, RL is based on interactions, in which the agent and environment exchange information regarding the state of the environment, the action the agent can take, and the reward given to the agent by the environment. Through a training process, the agent enhances its decision-making policy until it learns the best possible decision in every state of the environment that the agent can visit. In DRL, the agent's policy is based on a deep neural network which requires training. We consider two DRL methods differing in their action space: discrete (DQN) and continuous (DDPG).
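To make this interaction loop concrete, the following is a minimal Gym-style sketch in the spirit of ns3-gym [11]; the environment id, the placeholder RandomAgent, and the action encoding are illustrative assumptions rather than CCOD's actual interface.

import gym
import numpy as np
import ns3gym.ns3env  # noqa: F401  (registers the "ns3-v0" environment, see [11])

class RandomAgent:
    """Placeholder agent: picks a random CW exponent a in {0, ..., 6}."""
    def act(self, obs):
        return np.random.randint(0, 7)

    def step(self, obs, action, reward):
        pass  # a DRL agent would update its neural networks here

env = gym.make("ns3-v0")   # the ns-3 simulation exposed through the Gym interface
agent = RandomAgent()
obs = env.reset()
done = False
while not done:
    action = agent.act(obs)                     # map the observed state to an action
    obs, reward, done, info = env.step(action)  # advance one interaction period
    agent.step(obs, action, reward)             # learn from the received reward
env.close()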
DQN is based on Q-learning [4], which attempts to predict an expected reward for each action, making it an example of a value-based method. DQN's additional deep neural network
allows for more efficient extrapolation of rewards for yet unseen states than in basic Q-learning.

Conversely, DDPG is an example of a policy-based method, because it tries to learn the optimal policy directly. Additionally, it can produce unbounded continuous output, meaning that it can recognize that the action space is an ordered set (as in the case of CW optimization)³. DDPG comprises two neural networks: an actor and a critic. The actor makes decisions based on the environment state, while the critic is a DQN-like neural network that tries to learn the expected reward for the actor's actions.

³Discrete algorithms, like DQN, consider all possible actions as abstract alternatives.
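To make the actor and critic concrete (and to anticipate the recurrent state processing described in Section III), here is a minimal PyTorch sketch; the LSTM layer, the hidden size, and the output scaling are illustrative assumptions, not the exact architecture used in CCOD.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the preprocessed state (a short time series of [mean, std] pairs)
    to a continuous CW exponent a in [0, 6]."""
    def __init__(self, state_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, state_seq):                 # state_seq: (batch, steps, state_dim)
        _, (h, _) = self.rnn(state_seq)
        return 6.0 * torch.sigmoid(self.head(h[-1]))

class Critic(nn.Module):
    """DQN-like network estimating the expected reward of a (state, action) pair."""
    def __init__(self, state_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(state_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, state_seq, action):         # action: (batch, 1)
        _, (h, _) = self.rnn(state_seq)
        return self.head(torch.cat([h[-1], action], dim=1))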
III. APPLYING DRL TO WI-FI

To apply DRL principles to Wi-Fi networks, we propose the CCOD method, which comprises an agent, the environment states, the available actions, and the received rewards. In summary, the CCOD agent is a module which observes the state of the Wi-Fi network and selects appropriate CW values (from the available actions) in order to maximize network performance (the reward).

The agent is located in the access point (AP) because the AP has a global view of the network, it can control its associated stations in a centralized manner (through beacon frames), and it can handle the computational requirements of DRL. Furthermore, a CCOD AP can potentially exchange information with other APs and become part of an SDN-based multi-agent Wi-Fi architecture [2].

We define the environment state as the current collision probability p_col observed in the network, calculated based on the number of transmitted frames N_tx and correctly received frames N_rx:

    p_col = (N_tx − N_rx) / N_tx.    (1)

The p_col measurements are done within predefined interaction periods and reflect the performance of the currently selected CW value. In practice, p_col is not immediately available to the agent, but since the AP takes part in all frame transmissions (as sender or recipient), the agent only needs to obtain N_tx from each station, which can be piggybacked onto data frames (N_rx is known at the AP based on the number of sent or received acknowledgement frames). Note that this overhead is required only in the learning phase (described below).

The action of the agent is to configure the AP by setting CW = 2^(a+4) − 1, where a ∈ [0, 6]. This range was chosen so that CW fits into the original span of 802.11 values: from 15 to 1023. We explore two algorithms with different outputs: discrete (a ∈ N) for DQN and continuous (a ∈ R) for DDPG.
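A minimal sketch of the state and action mappings described above; the helper names are ours, not taken from the CCOD source code.

def collision_probability(n_tx: int, n_rx: int) -> float:
    """Eq. (1): share of transmitted frames that were not correctly received."""
    return (n_tx - n_rx) / n_tx if n_tx > 0 else 0.0

def action_to_cw(a: float) -> int:
    """Map an agent action a in [0, 6] to a CW value in the 802.11 range 15..1023.
    DQN produces integer actions; DDPG produces real-valued ones."""
    a = min(max(float(a), 0.0), 6.0)
    return int(round(2 ** (a + 4))) - 1

assert action_to_cw(0) == 15 and action_to_cw(6) == 1023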
We use network throughput (the number of successfully delivered bits per second) as the reward in CCOD. This is indicative of current network performance and can be observed at the AP. Since rewards in DRL should be a real number between 0 and 1, we normalize the throughput based on the expected maximum throughput so that the rewards are centered around 0.5 (i.e., rewards above 0.5 indicate throughput exceeding expectations).
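One possible reward normalization consistent with this description (the exact scaling used by CCOD may differ) is:

def normalize(throughput_mbps: float, expected_mbps: float) -> float:
    """Map throughput to a reward in [0, 1] such that the expected maximum
    throughput yields 0.5; values above 0.5 then indicate better-than-expected
    performance. This scaling is an assumption, not necessarily CCOD's formula."""
    return float(min(max(0.5 * throughput_mbps / expected_mbps, 0.0), 1.0))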
CCOD operates in three phases. In the first, pre-learning phase, the Wi-Fi network is controlled by legacy 802.11. This serves as a warm-up for CCOD's DRL algorithms. Afterwards, in the learning phase, the agent makes decisions regarding the CW value following the TRAIN procedure of Algorithm 1. The preprocessing in the algorithm consists of calculating the mean and standard deviation of the history of recently observed collision probabilities H(p_col) (of length h) using a moving window of a fixed size and stride. This operation changes the data's shape from one- to two-dimensional (each step of the moving window yields two data points). The resulting collection can then be interpreted as a time series, which means it can be analysed by a recurrent neural network. Such networks allow for a more in-depth understanding of both the immediate and indirect relations between agent actions and network congestion compared to a one-dimensional analysis with a dense neural network.
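The preprocessing step could be sketched as follows; the window and stride values are illustrative, not those used in the actual implementation.

import numpy as np

def preprocess(history: np.ndarray, window: int = 300, stride: int = 30) -> np.ndarray:
    """Turn the 1-D history of collision probabilities H(p_col) into a 2-D time
    series: each moving-window step yields a (mean, std) pair."""
    steps = []
    for start in range(0, len(history) - window + 1, stride):
        chunk = history[start:start + window]
        steps.append((chunk.mean(), chunk.std()))
    return np.asarray(steps)   # shape: (num_steps, 2), ready for a recurrent network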
To enable exploration, each action is modified by a noise factor, which decays over the course of the learning phase. For DQN, noise is the probability of overriding the agent's action with a random action. For DDPG, noise is sampled from a Gaussian distribution and added to the decision of the agent.
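In code, the two exploration schemes could be sketched as follows; the noise magnitudes and decay rule are assumptions.

import numpy as np

rng = np.random.default_rng()

def dqn_explore(action: int, eps: float) -> int:
    """With probability eps, override the agent's discrete action with a random one."""
    return int(rng.integers(0, 7)) if rng.random() < eps else action

def ddpg_explore(action: float, sigma: float) -> float:
    """Add zero-mean Gaussian noise to the continuous action, keeping it in [0, 6]."""
    return float(np.clip(action + rng.normal(0.0, sigma), 0.0, 6.0))

# Both eps and sigma decay over the learning phase, e.g. multiplicatively after
# each episode: eps *= 0.99; sigma *= 0.99 (decay rates are illustrative).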
The final, operational phase starts after training completes, as determined by a user-set time limit. The agent is then considered fully trained and no longer receives any updates, so rewards are no longer needed. In this phase, CW is updated using the OPTIMIZE procedure of Algorithm 1. Once an agent is trained, it can be shared among APs.

Algorithm 1 CW optimization using CCOD
1: procedure TRAIN(H(p_col), load, a)
2:     ▷ load - data sent since last interaction
3:     ▷ a - previous action
4:     ▷ s - state
5:     s ← preprocess(H(p_col))
6:     r ← normalize(load)
7:     agent.step(s, a, r)        ▷ Train the neural network
8:     a′ ← agent.act(s) + noise
9:     CW ← 2^(a′+4)
10:    return CW
11: end procedure
12: procedure OPTIMIZE(H(p_col))  ▷ Observed collision prob.
13:    s ← preprocess(H(p_col))
14:    a ← agent.act(s)           ▷ Pass through neural network
15:    CW ← 2^(a+4)
16:    return CW
17: end procedure
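For reference, the two procedures of Algorithm 1 translate directly into Python, reusing the helper sketches above (agent.step and agent.act follow the interfaces assumed earlier; this is not the repository's code):

def train(agent, history, load, prev_action, noise_fn, expected_load):
    """One learning-phase interaction (TRAIN procedure of Algorithm 1)."""
    s = preprocess(history)              # mean/std time series of H(p_col)
    r = normalize(load, expected_load)   # reward from data sent since last interaction
    agent.step(s, prev_action, r)        # train the neural network(s)
    a = noise_fn(agent.act(s))           # add exploration noise (learning phase only)
    return action_to_cw(a), a            # new CW and the action to remember

def optimize(agent, history):
    """One operational-phase interaction (OPTIMIZE procedure of Algorithm 1)."""
    s = preprocess(history)
    a = agent.act(s)                     # a single pass through the trained network
    return action_to_cw(a)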
The application of DRL algorithms also requires configuring certain key parameters. First, the performance of RL algorithms depends on their reward discount γ, which corresponds to the importance of long-term rewards over immediate ones. Second, the introduction of deep learning into RL algorithms creates an impediment in the form of many new hyperparameters, so each neural network requires configuring a learning rate as an update coefficient. Third, since the learning is done
[Fig. 1: curves compare Standard 802.11, Look-up table, CCOD w/ DQN, and CCOD w/ DDPG.]
Fig. 1. Static scenario results: (a) network throughput, (b) mean CW values selected in each round (for 30 stations).
[Fig. 2 (dynamic scenario): (a) number of stations vs. simulation time [s]; (b) network throughput [Mb/s] and CW vs. simulation time [s]; (c) network throughput [Mb/s] vs. number of stations; curves compare Standard 802.11, Look-up table, CCOD w/ DQN, and CCOD w/ DDPG.]
instantaneous throughput (Fig. 2b). Standard 802.11 leads to a decrease of up to 28% of the network throughput as the number of stations increases. CCOD is able to maintain efficiency at a similar level: the decrease of throughput when moving from 5 to 50 stations is only about 1% for both DDPG and DQN. Ultimately, the operation of both CCOD algorithms in the dynamic scenario leads to improved network performance (Fig. 2c), both exceeding standard 802.11 and matching the look-up table approach.

VI. CONCLUSIONS

We have presented CCOD, a method which leverages deep reinforcement learning principles to learn the correct CW settings for 802.11ax under varying network conditions using two trainable control algorithms: DQN and DDPG. Our experiments have shown that DRL can be successfully applied to the problem of CW optimization: both algorithms offer efficiency close to optimal (with DDPG being only slightly better than DQN), while keeping the computational cost low (around 22 kflops, according to our estimations, excluding the one-time training cost). As a result of the learning process, we have obtained a trained agent which can be directly installed in an 802.11ax AP.

We conclude that the problem of CW optimization has provided the opportunity to showcase the features of DRL. Future studies should focus on analyzing more realistic network conditions, where we expect DRL to outperform any analytical model-based CW optimization methods, which are based on simplifying assumptions. Also worth investigating are other DRL algorithms as well as implementing a distributed version of CCOD.

REFERENCES

[1] B. Bellalta and K. Kosek-Szott, “AP-initiated multi-user transmissions in IEEE 802.11ax WLANs,” Ad Hoc Networks, vol. 85, pp. 145–159, 2019.
[2] P. Gallo, K. Kosek-Szott, S. Szott, and I. Tinnirello, “CADWAN: A Control Architecture for Dense WiFi Access Networks,” IEEE Communications Magazine, vol. 56, no. 1, pp. 194–201, 2018.
[3] P. Serrano et al., “Control theoretic optimization of 802.11 WLANs: Implementation and experimental evaluation,” Computer Networks, vol. 57, no. 1, pp. 258–272, 2013.
[4] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2224–2287, 2019.
[5] F. Yao and L. Jia, “A collaborative multi-agent reinforcement learning anti-jamming algorithm in wireless networks,” IEEE Wireless Communications Letters, vol. 8, no. 4, pp. 1024–1027, 2019.
[6] F. Wilhelmi, S. Barrachina-Muñoz, B. Bellalta, C. Cano, A. Jonsson, and V. Ram, “A Flexible Machine Learning-Aware Architecture for Future WLANs,” arXiv preprint, http://arxiv.org/abs/1910.03510, 2019.
[7] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1277–1290, 2019.
[8] R. Ali et al., “Deep Reinforcement Learning Paradigm for Performance Optimization of Channel Observation-Based MAC Protocols in Dense WLANs,” IEEE Access, vol. 7, pp. 3500–3511, 2019.
[9] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[10] D. Silver et al., “Deterministic policy gradient algorithms,” in Proceedings of ICML’14, vol. 32, pp. I-387–I-395, 2014.
[11] P. Gawłowicz and A. Zubow, “ns-3 meets OpenAI Gym: The Playground for Machine Learning in Networking Research,” in ACM MSWiM, 2019.