
Contention Window Optimization in IEEE 802.11ax Networks with Deep Reinforcement Learning

Witold Wydmański and Szymon Szott

Abstract—The proper setting of contention window (CW) values has a significant impact on the efficiency of Wi-Fi networks. Unfortunately, the standard method used by 802.11 networks is not scalable enough to maintain stable throughput for an increasing number of stations, despite 802.11ax being designed to improve Wi-Fi performance in dense scenarios. To this end, we propose a new method of CW control which leverages deep reinforcement learning principles to learn the correct settings under different network conditions. Our method supports two trainable control algorithms, which, as we demonstrate through simulations, offer efficiency close to optimal while keeping the computational cost low.

Index Terms—Contention window, IEEE 802.11, MAC protocols, reinforcement learning, WLAN

I. INTRODUCTION

The latest IEEE 802.11 amendment (802.11ax) is scheduled for release in 2020, with the goal of increasing Wi-Fi network efficiency. However, to ensure backward compatibility, one efficiency-related aspect remains unchanged in 802.11ax: the basic channel access method [1]. This method is an implementation of carrier-sense multiple access with collision avoidance (CSMA/CA), wherein each station backs off by waiting a certain number of time slots before accessing the channel. This number is chosen at random from 0 to CW (the contention window). To reduce the probability of several stations selecting the same random number, CW is doubled after each collision. IEEE 802.11 defines static CW minimum and maximum values, and this approach, while robust to network changes and requiring few computations, can lead to inefficient operation, especially in dense networks [2].

CW optimization has a direct impact on network performance and has been frequently studied (e.g., using control theory [3]). With the proliferation of network devices with high computational capabilities, CW optimization can now be analyzed using reinforcement learning (RL) [4]. RL is well suited to the problem of improving the performance of wireless networks because it deals with intelligent software agents (network nodes) taking actions (e.g., optimizing parameters) in an environment (wireless radio) to maximize a reward (e.g., throughput) [4]. RL is an example of model-free policy optimization, offering better generalization capabilities than conventional, model-based optimization approaches such as control theory¹. Recent examples of applying RL to wireless local area networks include a jamming countermeasure [5] and an ML-enabling architecture [6]. RL performance can be further improved by using deep artificial neural networks, with their potential for interpolation and superior scalability. A recent example of using deep RL (DRL) in wireless networks is an adaptable MAC protocol [7]. The authors of [8] also claim to use DRL in the area of CW optimization. However, a careful reading reveals that they use Q-learning (a typical RL method) but without the neural network (deep) component. Thus, we conclude that DRL has not yet been successfully applied to IEEE 802.11 CW optimization.

In this letter, we describe CCOD (Centralized Contention window Optimization with DRL), our proposed method of applying DRL to the task of optimizing the saturation throughput of 802.11 networks by correctly predicting CW values. While CCOD is universally applicable to any 802.11 network, we exhibit its operation under 802.11ax using two DRL methods: Deep Q-Network (DQN) [9] and Deep Deterministic Policy Gradient (DDPG) [10]. The former is considered a showcase DRL algorithm, while the latter is a more advanced method, able to directly learn the optimal policy, which we expect will lead to increased network performance, especially in dense scenarios. Additionally, we demonstrate how we applied time series analysis to the recurrent neural networks of both DRL methods. Finally, we provide the complete source code so that the work can serve as a stepping stone for further development of DRL-based methods in 802.11 networks².

W. Wydmański and S. Szott are with AGH University, Krakow, Poland. This work was supported by the Polish Ministry of Science and Higher Education with the subvention funds of the Faculty of Computer Science, Electronics and Telecommunications of AGH University. This research was supported in part by PLGrid Infrastructure. The authors wish to thank Jakub Mojsiejuk for his remarks on an early draft of the paper.

¹This means that while RL algorithms try to directly learn an optimal policy without learning the model of the environment, model-based approaches need to make assumptions about the model's next state before choosing an action.

²https://github.com/wwydmanski/RLinWiFi

II. DRL BACKGROUND

In general, RL is based on interactions, in which the agent and the environment exchange information regarding the state of the environment, the action the agent can take, and the reward given to the agent by the environment. Through a training process, the agent enhances its decision-making policy until it learns the best possible decision in every state of the environment that the agent can visit. In DRL, the agent's policy is based on a deep neural network, which requires training. We consider two DRL methods differing in their action space: discrete (DQN) and continuous (DDPG).
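To make this distinction concrete, the following minimal PyTorch sketch contrasts how a value-based agent and a policy-based agent turn a state into an action. The layer sizes and the four-dimensional placeholder state are illustrative assumptions, not CCOD's actual settings.

    import torch
    import torch.nn as nn

    state = torch.randn(1, 4)  # placeholder environment state (batch of one)

    # Value-based (DQN-style): the network outputs one expected reward (Q-value)
    # per discrete action and the agent picks the action with the highest value.
    q_network = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 7))
    q_values = q_network(state)               # shape (1, 7): one value per action
    discrete_action = q_values.argmax(dim=1)  # index of the best action

    # Policy-based (DDPG-style): the actor network outputs the action itself,
    # so the action space can be continuous and treated as an ordered set.
    actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
    continuous_action = actor(state)          # a real value in (-1, 1), later rescaled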


DQN is based on Q-learning [4], which attempts to predict an expected reward for each action, making it an example of a value-based method. DQN's additional deep neural network allows for more efficient extrapolation of rewards for yet unseen states than in basic Q-learning.

Conversely, DDPG is an example of a policy-based method, because it tries to learn the optimal policy directly. Additionally, it can produce unbounded continuous output, meaning that it can recognize that the action space is an ordered set (as in the case of CW optimization)³. DDPG comprises two neural networks: an actor and a critic. The actor makes decisions based on the environment state, while the critic is a DQN-like neural network that tries to learn the expected reward for the actor's actions.

³Discrete algorithms, like DQN, consider all possible actions as abstract alternatives.

III. APPLYING DRL TO WI-FI

To apply DRL principles to Wi-Fi networks, we propose the CCOD method, which comprises an agent, the environment states, the available actions, and the received rewards. In summary, the CCOD agent is a module which observes the state of the Wi-Fi network and selects appropriate CW values (from the available actions) in order to maximize network performance (the reward).

The agent is located in the access point (AP) because the AP has a global view of the network, it can control its associated stations in a centralized manner (through beacon frames), and it can handle the computational requirements of DRL. Furthermore, a CCOD AP can potentially exchange information with other APs and become part of an SDN-based multi-agent Wi-Fi architecture [2].

We define the environment state as the current collision probability p_col observed in the network, calculated based on the number of transmitted frames N_tx and correctly received frames N_rx:

    p_col = (N_tx − N_rx) / N_tx.                                (1)

The p_col measurements are done within predefined interaction periods and reflect the performance of the currently selected CW value. In practice, p_col is not immediately available to the agent, but since the AP takes part in all frame transmissions (as sender or recipient), the agent only requires obtaining N_tx from each station, which can be piggybacked onto data frames (N_rx is known at the AP based on the number of sent or received acknowledgement frames). Note that this overhead is required only in the learning phase (described below).

The action of the agent is to configure the AP by setting CW = 2^(a+4) − 1, where a ∈ [0, 6]. This range was chosen so that CW fits into the original span of 802.11 values: from 15 to 1023. We explore two algorithms with different outputs: discrete (a ∈ N) for DQN and continuous (a ∈ R) for DDPG.

We use network throughput (the number of successfully delivered bits per second) as the reward in CCOD. This is indicative of current network performance and can be observed at the AP. Since rewards in DRL should be a real number between 0 and 1, we normalize the throughput based on the expected maximum throughput so that the rewards are centered around 0.5 (i.e., rewards above 0.5 indicate throughput exceeding expectations).
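The state, action, and reward definitions above can be summarized in a few lines of Python. This is only a sketch: the helper names and the expected-maximum-throughput constant are assumptions made for illustration, not values taken from CCOD.

    def collision_probability(n_tx: int, n_rx: int) -> float:
        """State input (Eq. 1): fraction of transmitted frames that were not received."""
        return (n_tx - n_rx) / n_tx if n_tx > 0 else 0.0

    def action_to_cw(a: float) -> int:
        """Map an action a in [0, 6] to CW = 2^(a+4) - 1, i.e. the range 15..1023.
        DQN outputs an integer a, while DDPG outputs a real-valued a."""
        a = min(max(a, 0.0), 6.0)        # clamp to the allowed action range
        return round(2 ** (a + 4)) - 1

    def reward(throughput_mbps: float, expected_max_mbps: float = 40.0) -> float:
        """One possible normalization consistent with the text: the reward equals
        0.5 when throughput matches the expected maximum and exceeds 0.5 when the
        network does better than expected. The 40 Mb/s default is a placeholder."""
        return throughput_mbps / (2.0 * expected_max_mbps)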

CCOD operates in three phases. In the first, pre-learning phase, the Wi-Fi network is controlled by legacy 802.11. This serves as a warm-up for CCOD's DRL algorithms. Afterwards, in the learning phase, the agent makes decisions regarding the CW value following the TRAIN procedure of Algorithm 1. The preprocessing in the algorithm consists of calculating the mean and standard deviation of the history of recently observed collision probabilities H(p_col) (of length h) using a moving window of a fixed size and stride. This operation changes the data's shape from one- to two-dimensional (each step of the moving window yields two data points). The resulting collection can then be interpreted as a time series, which means it can be analyzed by a recurrent neural network. Their design allows for a more in-depth understanding of both the immediate and indirect relations between agent actions and network congestion compared to a one-dimensional analysis with a dense neural network.

To enable exploration, each action is modified by a noise factor, which decays over the course of the learning phase. For DQN, noise is the probability of overriding the agent's action with a random action. For DDPG, noise is sampled from a Gaussian distribution and added to the decision of the agent.

The final, operational phase starts after completing training, which is determined by a user-set time limit. The agent is considered to be fully trained and no longer receives any updates, so rewards are no longer needed. In this phase, CW is updated using the OPTIMIZE procedure of Algorithm 1. Once an agent is trained, it can be shared among APs.

Algorithm 1: CW optimization using CCOD
 1: procedure TRAIN(H(p_col), load, a)
 2:     ▷ load - data sent since the last interaction
 3:     ▷ a - previous action
 4:     ▷ s - state
 5:     s ← preprocess(H(p_col))
 6:     r ← normalize(load)
 7:     agent.step(s, a, r)            ▷ Train the neural network
 8:     a′ ← agent.act(s) + noise
 9:     CW ← 2^(a′+4)
10:     return CW
11: end procedure
12: procedure OPTIMIZE(H(p_col))       ▷ Observed collision prob.
13:     s ← preprocess(H(p_col))
14:     a ← agent.act(s)               ▷ Pass through the neural network
15:     CW ← 2^(a+4)
16:     return CW
17: end procedure

The application of DRL algorithms also requires configuring certain key parameters. First, the performance of RL algorithms depends on the reward discount γ, which corresponds to the importance of long-term rewards over immediate ones. Second, the introduction of deep learning into RL algorithms creates an impediment in the form of many new hyperparameters, so each neural network requires configuring a learning rate as an update coefficient. Third, since the learning is done by mini-batch stochastic gradient descent, the correct choice of batch size is also critical. Finally, both algorithms use a replay buffer B, which records every interaction between the agent and the environment and serves as a base for mini-batch sampling. The values used in this work are listed in Table I.

TABLE I
CCOD'S DRL SETTINGS

    Parameter                     Value
    Interaction period            10 ms
    History length h              300
    DQN's learning rate           4 × 10⁻⁴
    DDPG's actor learning rate    4 × 10⁻⁴
    DDPG's critic learning rate   4 × 10⁻³
    Batch size                    32
    Reward discount γ             0.7
    Replay buffer B size          18,000
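As an illustration of the preprocess step used in both procedures of Algorithm 1, the sketch below converts the one-dimensional history H(p_col) into the two-dimensional time series described above (mean and standard deviation per window position). The function is a simplified stand-in, not the reference implementation from the repository.

    import numpy as np

    def preprocess(history, window, stride):
        """Slide a window over H(p_col) and emit (mean, std) per window position,
        turning the 1-D history into a short 2-channel time series that a
        recurrent neural network can consume."""
        history = np.asarray(history, dtype=np.float32)
        features = [
            (history[i:i + window].mean(), history[i:i + window].std())
            for i in range(0, len(history) - window + 1, stride)
        ]
        return np.array(features)

    # With the history length h = 300 from Table I and the h/2 window and h/4
    # stride reported in Section IV, the state becomes a 3-step, 2-channel series.
    h = 300
    state = preprocess(np.random.rand(h), window=h // 2, stride=h // 4)
    print(state.shape)  # (3, 2)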

IV. SIMULATION MODEL

We implemented CCOD in ns3-gym [11], which is a framework for connecting ns-3 (a network simulator) with OpenAI Gym (a tool for DRL analysis). The neural networks of DDPG and DQN were implemented in PyTorch and TensorFlow, respectively.

The ns-3 simulations used the following settings: error-free radio channels, IEEE 802.11ax at the PHY/MAC layers, the highest modulation and coding scheme (1024-QAM with a 5/6 coding rate), single-user transmissions, a 20 MHz channel, frame aggregation disabled⁴, and constant bit-rate UDP uplink traffic to a single AP, with 1500 B packets and an equal offered load calibrated to saturate the network. Also, we assumed perfect and immediate transfer of state information to the agent (i.e., the current values of N_tx and N_rx are known at the AP) as well as the immediate setting of CW at each station separately⁵. The idealized simulation settings allow for assessing the base performance of CCOD before moving to more realistic topologies.

The DRL algorithms were run with the parameters in Table I, which were determined empirically through a lengthy simulation campaign to provide good performance for both algorithms (their universality is left for further study). The neural network architecture was the same for both algorithms: one recurrent long short-term memory (LSTM) layer followed by two dense layers, resulting in an [8-128-64] configuration. Using a recurrent layer with a wide history window allowed the algorithms to take previous observations into account. The preprocessing window was set to h/2 with a stride of h/4, where h is the history length.
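A plausible PyTorch rendering of this architecture is sketched below, reading the [8-128-64] configuration as an LSTM with a hidden size of 8 followed by dense layers of 128 and 64 units; the exact widths, the output head, and the use of PyTorch for both agents are assumptions here; the reference implementation is available in the linked repository.

    import torch
    import torch.nn as nn

    class CWEstimator(nn.Module):
        """One recurrent LSTM layer followed by two dense layers, with widths
        following one reading of the [8-128-64] configuration."""
        def __init__(self, n_features: int = 2, n_outputs: int = 7):
            super().__init__()
            self.lstm = nn.LSTM(input_size=n_features, hidden_size=8, batch_first=True)
            self.head = nn.Sequential(
                nn.Linear(8, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, n_outputs),  # e.g. Q-values for the 7 discrete actions
            )

        def forward(self, x):              # x: (batch, sequence length, n_features)
            _, (h_n, _) = self.lstm(x)     # keep only the final hidden state
            return self.head(h_n[-1])

    # The preprocessed history (a short two-channel time series) is the input.
    net = CWEstimator()
    q_values = net(torch.randn(1, 3, 2))   # output shape: (1, 7)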
Randomness was incorporated into both agent behavior and the network simulation. Each experiment was run for 15 rounds of 60-second simulations (the first 14 rounds constituted the learning phase, the last one being the operational phase). Each simulation consisted of 10 ms interaction periods, between which Algorithm 1 was run.

⁴Frame aggregation was disabled to speed up the experiments at the cost of throughput. This does not qualitatively affect the network behavior because, if frame aggregation were enabled, the improvement would have been proportional to the gain in throughput.

⁵In practice, relaxing the former assumption would require an overhead of around 100-200 B/s sent from the stations to the AP, while relaxing the latter assumption would require the dissemination of CW values by the AP through periodic beacon frames.

V. RESULTS

CCOD was evaluated in two different scenarios, with a static and a dynamic number of stations, to assess various performance aspects. We used two baselines for comparison: (a) the current operation of 802.11ax, denoted as standard 802.11, in which CW_min = 2^4 − 1 and CW_max = 2^10 − 1, and (b) an idealized case of a look-up table in which CW_min = CW_max = CW and CW ∈ {2^x − 1 | x ∈ [4, 10]}, where x depends on the number of stations currently in the network. The look-up table (a mapping between the number of stations and CW) was prepared a priori by determining (with simulations) which CW values provide the best network performance (for multiples of five stations).
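The two baselines can be written down compactly; the encoding below is purely illustrative, and the actual station-to-CW mapping of the look-up table is not reproduced here because it was tuned by simulation.

    # Standard 802.11 baseline: CW grows from CW_min to CW_max after collisions.
    STANDARD_80211 = {"cw_min": 2**4 - 1, "cw_max": 2**10 - 1}   # 15 and 1023

    # Look-up table baseline: a single fixed CW per network size, chosen from
    # the candidate set {2^x - 1 | x in [4, 10]} for multiples of five stations.
    CW_CANDIDATES = [2**x - 1 for x in range(4, 11)]             # 15, 31, ..., 1023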
A. Static Scenario

In the static scenario, there was a fixed number of stations connected to the AP throughout the simulation. In theory, a constant value of CW should be optimal in these conditions [2]. This scenario was designed to test whether CCOD's algorithms are able to recognize this value and what the improvement over standard 802.11 is. For the look-up table approach, the CW values remained static throughout the experiment.

The results show that while 802.11 performance degrades for larger networks, CCOD with both DDPG and DQN can optimize the CW value in static network conditions (Fig. 1a). The improvement over standard 802.11 ranges from 1.5% (for 5 stations) to 40% (for 50 stations). As anticipated, CCOD's operation reflects the performance of the look-up table approach.

Fig. 1b presents the mean CW value selected by both of CCOD's algorithms in each round of simulating the static scenario for 30 stations. These results are from a single experiment run, and evidently 14 rounds of the learning phase are enough to converge to stable CW values.

B. Dynamic Scenario

In the dynamic scenario, the number of transmitting stations steadily increased from 5 to N_total, increasing the collision rate in the network. This scenario was designed to test whether the algorithms are able to react to network changes. For the look-up table approach, the CW values were updated after every 5 stations joined the network.

Fig. 2a shows how the number of stations increased in a simulation run and how the CW values were updated accordingly. CCOD, with both algorithms, decides to increase the CW value as the number of stations grows. DQN relies strongly on oscillations between two (discrete) neighboring CW values as a way of increasing throughput. DDPG's continuous approach is able to follow the network behavior more closely and (in this run) settled on a lower final CW value.

Fig. 1. Static scenario results: (a) network throughput, (b) mean CW values selected in each round (for 30 stations).

Fig. 2. Dynamic scenario results. For a single round of CCOD's operational phase and N_total = 50: evolution of (a) number of stations and CW values and (b) instantaneous network throughput. For varying N_total: (c) network throughput.

The change in CW in each simulation run is reflected in the change of instantaneous throughput (Fig. 2b). Standard 802.11 leads to a decrease of up to 28% in network throughput as the number of stations increases. CCOD is able to maintain efficiency at a similar level: the decrease in throughput when moving from 5 to 50 stations is only about 1% for both DDPG and DQN. Ultimately, the operation of both of CCOD's algorithms in the dynamic scenario leads to improved network performance (Fig. 2c), both exceeding standard 802.11 and matching the look-up table approach.

VI. CONCLUSIONS

We have presented CCOD, a method which leverages deep reinforcement learning principles to learn the correct CW settings for 802.11ax under varying network conditions using two trainable control algorithms: DQN and DDPG. Our experiments have shown that DRL can be successfully applied to the problem of CW optimization: both algorithms offer efficiency close to optimal (with DDPG being only slightly better than DQN) while keeping the computational cost low (around 22 kflops, according to our estimations, excluding the one-time training cost). As a result of the learning process, we have obtained a trained agent which can be directly installed in an 802.11ax AP.

We conclude that the problem of CW optimization has provided the opportunity to showcase the features of DRL. Future studies should focus on analyzing more realistic network conditions, where we expect DRL to outperform analytical, model-based CW optimization methods, which rely on simplifying assumptions. Also worth investigating are other DRL algorithms as well as implementing a distributed version of CCOD.

REFERENCES

[1] B. Bellalta and K. Kosek-Szott, "AP-initiated multi-user transmissions in IEEE 802.11ax WLANs," Ad Hoc Networks, vol. 85, pp. 145-159, 2019.
[2] P. Gallo, K. Kosek-Szott, S. Szott, and I. Tinnirello, "CADWAN: A Control Architecture for Dense WiFi Access Networks," IEEE Communications Magazine, vol. 56, no. 1, pp. 194-201, 2018.
[3] P. Serrano et al., "Control theoretic optimization of 802.11 WLANs: Implementation and experimental evaluation," Computer Networks, vol. 57, no. 1, pp. 258-272, 2013.
[4] C. Zhang, P. Patras, and H. Haddadi, "Deep learning in mobile and wireless networking: A survey," IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2224-2287, 2019.
[5] F. Yao and L. Jia, "A collaborative multi-agent reinforcement learning anti-jamming algorithm in wireless networks," IEEE Wireless Communications Letters, vol. 8, no. 4, pp. 1024-1027, 2019.
[6] F. Wilhelmi, S. Barrachina-Muñoz, B. Bellalta, C. Cano, A. Jonsson, and V. Ram, "A Flexible Machine Learning-Aware Architecture for Future WLANs," arXiv preprint http://arxiv.org/abs/1910.03510, 2019.
[7] Y. Yu, T. Wang, and S. C. Liew, "Deep-reinforcement learning multiple access for heterogeneous wireless networks," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1277-1290, 2019.
[8] R. Ali et al., "Deep Reinforcement Learning Paradigm for Performance Optimization of Channel Observation-Based MAC Protocols in Dense WLANs," IEEE Access, vol. 7, pp. 3500-3511, 2019.
[9] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[10] D. Silver et al., "Deterministic policy gradient algorithms," in Proceedings of ICML'14, vol. 32, pp. I-387-I-395, 2014.
[11] P. Gawłowicz and A. Zubow, "ns-3 meets OpenAI Gym: The Playground for Machine Learning in Networking Research," in ACM MSWiM, 2019.
