
Deep Reinforcement Learning for Multi-user Massive MIMO with Channel Aging
Zhenyuan Feng, Member, IEEE, Bruno Clerckx, Fellow, IEEE
arXiv:2302.06853v2 [eess.SP] 13 Jul 2023

Abstract

The design of beamforming for downlink multi-user massive multi-input multi-output (MIMO) relies on accurate downlink channel state information (CSI) at the transmitter (CSIT). In fact, it is difficult for the base station (BS) to obtain perfect CSIT due to user mobility and latency/feedback delay (between downlink data transmission and CSI acquisition). Hence, robust beamforming under imperfect CSIT is needed. In this paper, considering multiple antennas at all nodes (base station and user terminals), we develop a multi-agent deep reinforcement learning (DRL) framework for massive MIMO under imperfect CSIT, where the transmit and receive beamforming are jointly designed to maximize the average information rate of all users. Leveraging this DRL-based framework, interference management is explored and three DRL-based schemes, namely the distributed-learning-distributed-processing scheme, the partial-distributed-learning-distributed-processing scheme, and the central-learning-distributed-processing scheme, are proposed and analyzed. This paper 1) highlights the fact that the DRL-based strategies outperform the random action-chosen strategy and the delay-sensitive strategy named the sample-and-hold (SAH) approach, and achieve over 90% of the information rate of two selected benchmarks with lower complexity: the zero-forcing channel-inversion (ZF-CI) with perfect CSIT and the Greedy Beam Selection strategy, 2) demonstrates the inherent robustness of the proposed designs in the presence of channel aging, and 3) conducts a detailed convergence and scalability analysis of the proposed framework.

Index Terms

Deep learning, interference management, massive MIMO, reinforcement learning, wireless communication

Z. Feng is with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: z.feng19@imperial.ac.uk). B. Clerckx is with the Department of Electrical and Electronic Engineering at Imperial College London, London SW7 2AZ, UK and with Silicon Austria Labs (SAL), Graz A-8010, Austria (email: b.clerckx@imperial.ac.uk; bruno.clerckx@silicon-austria.com).


I. INTRODUCTION

Due to the increasing demand for data and connectivity in fifth-generation (5G) [1] and sixth-generation (6G) [2] systems, multi-antenna technologies have attracted great attention in academia and industry. The research on multi-antenna techniques has promoted the development of multi-input multi-output (MIMO) technology. MIMO nowadays plays an indispensable role in the physical layer, media access control (MAC) layer, and network layer of wireless communications and networking [3]. At the physical layer, multi-antenna beamforming strategies have attracted great interest due to their ability to achieve considerable antenna gains, multiplexing gains, and diversity gains [4], [5], and have gradually evolved into massive MIMO systems, in which the number of antennas at the BS reaches tens or even hundreds, attracting a larger number of users. To enable high throughput in a massive MIMO system, the base station (BS) relies on global and instantaneous channel state information (CSI) obtained through efficient channel estimation techniques [6], [7]. Nevertheless, ground/air/space platforms such as high-speed trains, unmanned aerial vehicles (UAVs), and satellites share a common characteristic of 3D mobility, which imposes a stringent time constraint on CSI acquisition and can even cause misalignment of narrow beams. Therefore, in future communication systems, how to maintain good connectivity and system capacity without perfect channel state information at the transmitter (CSIT) (so-called imperfect CSIT) is regarded as an important problem that calls for prompt solutions.
Imperfect CSIT is usually caused by drastic changes in the propagation environment due to user mobility [8] and by the CSI feedback/acquisition delay between the base station (BS) and the users [9]. The CSI delay, whether due to user mobility or to feedback or acquisition latency, is the time gap between the moment when downlink training happens and the moment when the BS starts downlink data transmission with the estimated channel. Such delay can be on the order of milliseconds, which causes the estimated channels to be outdated when the downlink transmission actually happens. This delay becomes more catastrophic at high user mobility, since rapid channel variation inevitably causes performance degradation in massive MIMO systems [10]. This phenomenon, known as channel aging, describes the mismatch between the estimated and up-to-date channels. That is to say, it represents the divergence between the channel estimated at the BS/users and the actual channel through which the data transmission occurs.

A. Related Works

To address the issue above, many papers have studied massive MIMO systems with imperfect CSIT [8]-[16], since CSIT is pivotal to the performance of systems that involve a great number of antennas and users. In [8], the impact of channel aging due to mobility is partially overcome through a finite impulse response Wiener predictor without considering hardware phase noise, which is further studied in [10]. To tackle the CSI feedback/acquisition delay, one strategy is to use space-time interference alignment to optimize the degrees of freedom (DoF) with delayed CSIT [11], [12]. Another line of work investigates channel prediction based on the channel correlation [13] and past CSI [14]. In addition, to maintain multi-user connectivity and mitigate the degrading effect of user mobility, low-complexity power allocation methods are derived in [15] for Space Division Multiple Access (SDMA), which is outperformed by Rate-Splitting Multiple Access (RSMA) in [16] in terms of ergodic sum-rate.
On the one hand, the channel prediction approaches in the papers cited above [8]-[14] demonstrate good performance but suffer from extremely high complexity in the channel prediction algorithms due to the increasing dimension of the antenna arrays. On the other hand, the power allocation strategies in [15], [16] exhibit lower complexity but sacrifice performance for tractability. To maintain a better balance between performance and complexity, an alternative strategy with lower complexity and looser CSI requirements urgently needs to be developed.
Machine learning (ML) [17] has demonstrated great usefulness in wireless systems [18]-[38]. To cope with complex problems in large-dimensional MIMO systems, deep learning (DL) has drawn research interest not only in beamforming design [20], [21], by feeding CSI to a neural network, but also in channel prediction [22]-[26], by treating the time-varying channel as a time series, thanks to the strong representation capability of deep neural networks (DNN). Nevertheless, under the stringent time constraints of mobility scenarios, the excellent generalization performance of DNNs cannot be fully exploited due to an insufficient number of data samples. In view of this, by elaborately treating the time-varying channel problem as a Markov decision process (MDP), deep reinforcement learning (DRL) has been regarded as a useful technology for designing wireless communication systems, leveraging the fast convergence of DL frameworks as well as the continuous improvement characteristic of reinforcement learning (RL) algorithms [27]-[39]. A comprehensive tutorial in [39] reviews the applications of DRL for 5G and beyond. DRL is used to solve the power allocation problem in time-varying channels in [27] for single-transmit-antenna scenarios, and is further studied in [28] for multi-antenna beamforming and in [35] for multi-user conditions. In [30]-[32], [37], DRL is utilized to tackle the passive beamforming design problem in reconfigurable intelligent surface (RIS)-aided communications and helps reduce the computations compared to alternative frameworks. In terms of active beamforming using DRL, several efforts have been made on designing low-complexity algorithms based on deep Q-network (DQN) [28], [34]-[36] and partially observed MDP [38] frameworks.

B. Motivation and Specific Contributions of the Paper

Existing works [27]-[37] assume that perfect CSIT or instantaneous channel gains via receiver feedback are known at the transmitter. Unfortunately, such an assumption is impractical in real-world systems with CSI feedback/acquisition delay and user mobility [8], [9]. In addition, beamforming is not limited to the transmitter and can also be used at the receiver to perform better interference management. To the best of our knowledge, predicting the beamformers of both the transmitter and the receiver with imperfect CSIT has not been considered in DRL-based papers. Instead, all the existing work focuses on multi-cell single-user (SU) single-input single-output (SISO) [27], [29] (no transmit and receive beamforming) and multi-input single-output (MISO) (only transmit beamforming) [28], [31], [34], [38] scenarios without considering the multiple-receive-antenna case, which motivates this work. In addition to DRL-based strategies, a traditional beamforming procedure is to use the frequency-division duplexing (FDD) pilot-based channel estimation procedure together with the zero-forcing channel inversion (ZF-CI) scheme, as shown in Fig. 1 [4]. Compared with DRL-based approaches, a key disadvantage of this method is that the system performance is heavily dependent on how fast the channel is changing as well as on the feedback delay.
Motivated by the above, we study the joint transmit precoder and receive combiner design in
massive MIMO downlink transmission with channel aging. The contributions of this paper are
summarized as follows.
• We construct an efficient multi-agent DRL-based framework for massive MIMO downlink transmission¹, in light of which three DRL-based algorithms are derived based on stream-level, user-level, and system-level agent modeling. This is the first paper showing that a DRL-based framework can be used to address very high-dimensional optimization problems, and it demonstrates 1) robustness to the degrading effect of channel aging and 2) stringent interference management, especially of the inter-stream and multi-user interference.

¹The terminology massive MIMO in this paper implies multiple receive antennas.


• To address the challenge of high-dimensional antenna beamforming problems, by utilizing the DRL-based framework, three DRL-based schemes, namely the distributed-learning-distributed-processing DRL-based scheme (DDRL), the partial-distributed-learning-distributed-processing DRL-based scheme (PDRL), and the central-learning-distributed-processing DRL-based scheme (CDRL), are proposed, analyzed, and evaluated. For DDRL, each stream is modeled as an agent. All the agents save their experiences in a private experience pool for later training. In contrast, in CDRL, the whole system is modeled as a central agent. Moreover, to bridge DDRL and CDRL, we demonstrate another algorithm, i.e., PDRL, which offers a more flexible design by modeling each user as an agent to balance performance and complexity. Note that DDRL and CDRL are different from those in [27], [28] since we are tackling the problem with 1) receive beamforming with multiple receive antennas, 2) transmit beamforming under imperfect CSIT, 3) multiple streams for each user, and 4) a large number of transmit antennas at the BS, compared with SISO in [27] and 4-antenna MISO in [28], [35], respectively.
• Leveraging the DRL-based framework mentioned above, the precoders at the BS and the combiners at the users are jointly designed by gradually maximizing the average information rate through the observed reward. In particular, the BS decides the transmit precoder and receive combiner for each stream with imperfect CSIT and perfect CSIR. The merits of this design are shown through extensive simulations by benchmarking our schemes against the conventional sample-and-hold (SAH) approach [26], the zero-forcing channel-inversion (ZF-CI) strategy [4], greedy beam selection, and a random action-chosen scheme.
• We demonstrate the advantages of the DRL-based strategies over the benchmarks above. In particular, the proposed algorithms show 1) fast convergence to an efficient beamforming policy, 2) robustness in tracking the channel dynamics under the channel uncertainty caused by channel aging, and 3) lower complexity compared with the traditional beamforming strategy. All of these properties are essential in practical wireless networks.
• Through numerical results, we show that our proposed DRL-based schemes outperform the SAH approach and the random action-chosen scheme. In particular, DDRL can achieve nearly 90% of the performance of the state-of-the-art ZF-CI method with perfect CSI (ZF-CI PCSI) and 95% of the performance of the Greedy Beam Selection method, but incurs more hardware complexity and more uplink overhead in an FDD setup. By increasing the resolution of the codebook and performing hyper-parameter tuning on the reward function, the performance can be further improved.
Organization: Section II is devoted to the system model, the channel model, and the formulated sum-rate problem. In Section III, the basics of DRL are introduced and three practical multi-agent DRL-based approaches are proposed. The simulation results are presented in Section IV and the paper is concluded in Section V.
Notations: Boldface lower-case and upper-case letters, h and H, denote vectors and matrices, respectively. E{·} represents statistical expectation. (·)⁻¹, (·)ᵀ, (·)*, and (·)ᴴ indicate inversion, transpose, conjugate, and conjugate-transpose, respectively. R and I denote the real and imaginary parts of a complex number, respectively. I_M denotes an M × M identity matrix. 0 denotes an all-zero matrix. ||a|| denotes the norm of a vector a. |a| denotes the modulus of a scalar a.

II. SYSTEM MODEL

Consider the MIMO broadcast channel (BC) with one M-antenna BS and K N-antenna users indexed by K = {1, . . . , K} [40]. The BS aims to deliver Ms streams in the time instant of interest. For simplicity, KNs streams are transmitted simultaneously from the M antennas of the BS. Each group of Ns streams, indexed by n ∈ N = {1, . . . , Ns}, is targeted at one of the K users. Note that we consider a setting where M ≥ KNs to ensure the spatial multiplexing gain. The transmit power P is uniformly allocated to all KNs streams. We assume that the BS and all users operate in the same time-frequency resource and are synchronized. The transmitted signal, i.e., the precoded data vector, at time slot t can be written as
$$\mathbf{x}(t) = \sqrt{\frac{P}{KN_s}}\sum_{k=1}^{K}\sum_{n=1}^{N_s}\mathbf{p}_{k,n}(t)\,s_{k,n}(t) \qquad (1)$$
where sk,n, ∀k ∈ K, ∀n ∈ N, is the encoded message obtained from message Wk,n, with zero mean and E(|sk,n|²) = 1, and the precoder pk,n(t) ∈ C^{M×1} is subject to ||pk,n(t)||² = 1. The received signal at user k can be expressed as
$$\mathbf{y}_k(t) = \sqrt{\frac{P}{KN_s}}\,\mathbf{H}_k(t)\sum_{n=1}^{N_s}\mathbf{p}_{k,n}(t)s_{k,n}(t) + \sqrt{\frac{P}{KN_s}}\,\mathbf{H}_k(t)\sum_{j=1,\,j\neq k}^{K}\sum_{i=1}^{N_s}\mathbf{p}_{j,i}(t)s_{j,i}(t) + \mathbf{n}_k(t) \qquad (2)$$
where the noise vector nk ∈ C^{N×1} is assumed to follow a complex normal distribution, i.e., nk ∼ CN(0, σn² I_N). At the user side, the combiner vector for each stream is denoted as wk,n(t) ∈ C^{N×1}, ||wk,n(t)||² = 1, ∀k ∈ K, n ∈ N. Then, the achievable rate of user k and the average user rate at time slot t can be written as
$$R_k(t) = \sum_{n=1}^{N_s} G_{k,n}(t), \qquad \bar{R}(t) = \frac{\sum_{k=1}^{K} R_k(t)}{K} \qquad (3)$$

where Gk,n is the achievable rate of stream n for user k. To characterize the downlink information rate of each stream, by adopting the Shannon capacity formula, Gk,n is given as
$$G_{k,n}(\mathbf{W}_k(t),\mathbf{P}_k(t)) = \log\!\left(1 + \gamma_{k,n}(\mathbf{W}_k(t),\mathbf{P}_k(t))\right) \qquad (4)$$
where, for consistency with the notation in the following sections, γk,n(Wk(t), Pk(t)) denotes the signal-to-interference-plus-noise ratio (SINR) of stream n for user k,
$$\gamma_{k,n}(\mathbf{W}_k(t),\mathbf{P}_k(t)) = \frac{\frac{P}{KN_s}\left|\mathbf{w}_{k,n}^H(t)\mathbf{H}_k(t)\mathbf{p}_{k,n}(t)\right|^2}{I_{k,n}(t) + I_{c,k}(t) + \|\mathbf{w}_{k,n}(t)\|^2\sigma_n^2(t)} \qquad (5)$$
where Wk(t) = [wk,1(t), . . . , wk,Ns(t)] denotes the combining matrix and Pk(t) = [pk,1(t), . . . , pk,Ns(t)] denotes the precoding matrix. The inter-stream interference and the multi-user interference for stream n of user k are given by
$$I_{k,n}(t) = \sum_{i=1,\,i\neq n}^{N_s}\frac{P}{KN_s}\left|\mathbf{w}_{k,n}^H(t)\mathbf{H}_k(t)\mathbf{p}_{k,i}(t)\right|^2 \qquad (6)$$
and
$$I_{c,k}(t) = \sum_{j\in\mathcal{K},\,j\neq k}\sum_{i=1}^{N_s}\frac{P}{KN_s}\left|\mathbf{w}_{k,n}^H(t)\mathbf{H}_k(t)\mathbf{p}_{j,i}(t)\right|^2 \qquad (7)$$
respectively.
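To make the rate expressions above concrete, the following minimal NumPy sketch evaluates the per-stream SINR (5) and the rates in (3)-(4) for given channel, precoding, and combining matrices; the function name, array layout, and dimensions are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def stream_rates(H, P_list, W_list, P_tx, sigma2):
    """Per-stream rates G_{k,n} from (4)-(7).

    H       : list of K channel matrices, each N x M
    P_list  : list of K precoding matrices, each M x Ns (unit-norm columns)
    W_list  : list of K combining matrices, each N x Ns (unit-norm columns)
    P_tx    : total transmit power P
    sigma2  : noise variance sigma_n^2
    """
    K = len(H)
    Ns = P_list[0].shape[1]
    scale = P_tx / (K * Ns)                       # uniform power allocation P/(K*Ns)
    G = np.zeros((K, Ns))
    for k in range(K):
        for n in range(Ns):
            w = W_list[k][:, n]
            # effective gains |w^H H_k p_{j,i}|^2 for every stream (j, i)
            gains = np.array([[np.abs(w.conj() @ H[k] @ P_list[j][:, i]) ** 2
                               for i in range(Ns)] for j in range(K)])
            desired = scale * gains[k, n]
            interference = scale * (gains.sum() - gains[k, n])   # I_{k,n} + I_{c,k}
            sinr = desired / (interference + np.linalg.norm(w) ** 2 * sigma2)
            G[k, n] = np.log2(1.0 + sinr)
    return G, G.sum(axis=1).mean()                # per-stream rates and average user rate
```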

A. Channel Model

We assume an extended Saleh-Valenzuela geometric model [41]. The channel between the BS and user k is modeled as an L-path channel as shown below
$$\mathbf{H}_k(t) = \sqrt{\frac{\eta_k MN}{L}}\sum_{l=1}^{L}\alpha_{k,l}(t)\,\mathbf{u}_{k,l}(t)\mathbf{v}_{k,l}^H(t) \qquad (8)$$

where ηk denotes the large-scale fading coefficient, and the complex gain αk,l (∀k ∈ K, ∀l ∈ {1, 2, . . . , L}) is assumed to remain constant within each time slot and to vary between adjacent time slots according to the first-order Gauss-Markov process
$$\alpha_{k,l}(t) = \rho\,\alpha_{k,l}(t-1) + \sqrt{1-\rho^2}\,e_{k,l}(t) \qquad (9)$$
where ek,l(t) ∼ CN(0, 1) and ρ is the time correlation coefficient obeying Jakes' model [42],
$$\rho = J_0(2\pi f_d \Delta t \cos\theta) \qquad (10)$$
where fd and ∆t denote the Doppler frequency and the channel instantiation interval, respectively, and J0 denotes the zeroth-order Bessel function of the first kind. Since the users are assumed to move directly toward or away from the BS, i.e., θ = 0, the maximum Doppler frequency fdmax is attained and (10) reduces to
$$\rho = J_0(2\pi f_d^{\max}\Delta t). \qquad (11)$$

In the typical case of a uniform linear array (ULA), where the antennas are deployed at both ends of the transmission, the array steering vectors uk,l and vk,l, corresponding to the angle of arrival (AoA) φA,k,l and the angle of departure (AoD) φD,k,l in the azimuth, are written as
$$\mathbf{u}_{k,l} = \frac{1}{\sqrt{N}}\left[1,\ e^{j2\pi\frac{d}{\lambda}\cos\phi_{A,k,l}},\ \dots,\ e^{j2\pi\frac{d}{\lambda}(N-1)\cos\phi_{A,k,l}}\right]^T \qquad (12)$$
and
$$\mathbf{v}_{k,l} = \frac{1}{\sqrt{M}}\left[1,\ e^{j2\pi\frac{d}{\lambda}\cos\phi_{D,k,l}},\ \dots,\ e^{j2\pi\frac{d}{\lambda}(M-1)\cos\phi_{D,k,l}}\right]^T \qquad (13)$$
respectively, where λ is the wavelength of the signal and d denotes the inter-antenna spacing, which is usually set as d = λ/2, φA,k,l ∼ U(θA,k,l − δA/2, θA,k,l + δA/2) and φD,k,l ∼ U(θD,k,l − δD/2, θD,k,l + δD/2), with {θA,k,l, θD,k,l} referring to the elevation angles and {δA, δD} denoting the angular spreads for arrival and departure, respectively [43].
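As a minimal illustration of the channel model in (8)-(11), the NumPy/SciPy sketch below draws one aged channel realization; all numerical values, helper names, and path-angle distributions are placeholders chosen for illustration, not parameters taken from the paper.

```python
import numpy as np
from scipy.special import j0

def ula_steering(num_ant, angle):
    """Half-wavelength ULA steering vector, cf. (12)-(13) with d = lambda/2."""
    n = np.arange(num_ant)
    return np.exp(1j * np.pi * n * np.cos(angle)) / np.sqrt(num_ant)

def aged_channel(M, N, L, eta, rho, alpha_prev, aoa, aod):
    """One time-slot step of the channel (8) with Gauss-Markov path gains (9)."""
    e = (np.random.randn(L) + 1j * np.random.randn(L)) / np.sqrt(2)   # e_{k,l} ~ CN(0, 1)
    alpha = rho * alpha_prev + np.sqrt(1.0 - rho ** 2) * e
    H = np.zeros((N, M), dtype=complex)
    for l in range(L):
        u = ula_steering(N, aoa[l])               # receive steering vector u_{k,l}
        v = ula_steering(M, aod[l])               # transmit steering vector v_{k,l}
        H += alpha[l] * np.outer(u, v.conj())     # rank-one path contribution
    return np.sqrt(eta * M * N / L) * H, alpha

# time correlation from Jakes' model, cf. (11); fd_max and delta_t are placeholders
fd_max, delta_t = 100.0, 1e-3
rho = j0(2 * np.pi * fd_max * delta_t)

L_paths, M_ant, N_ant = 3, 32, 4
aoa = np.random.uniform(0, np.pi, L_paths)
aod = np.random.uniform(0, np.pi, L_paths)
alpha = (np.random.randn(L_paths) + 1j * np.random.randn(L_paths)) / np.sqrt(2)
H, alpha = aged_channel(M_ant, N_ant, L_paths, eta=1.0, rho=rho,
                        alpha_prev=alpha, aoa=aoa, aod=aod)
```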

B. Problem Formulation

As described above, the system performance heavily relies on the design of the precoding and combining vectors. However, there is an inevitable feedback delay between the time point when the user estimates the channel and the time when the BS starts transmitting data with the estimated channel fed back by the users. As can be seen in Section II-A, such delay becomes quite problematic in high mobility scenarios since the channel changes fast and the correlation coefficient ρ decreases dramatically. Therefore, it is necessary to develop strategies that are robust to feedback delay and user mobility, which, in this paper, is interpreted as maximizing the sum-rate of the K users based on the knowledge of past channels. The problem can be formulated as follows
$$\max_{\mathbf{W}_k(t),\,\mathbf{P}_k(t)}\ \ \sum_{k=1}^{K}\sum_{n=1}^{N_s} G_{k,n}(\mathbf{W}_k(t),\mathbf{P}_k(t)) \qquad (14a)$$
$$\text{s.t.}\ \ \|\mathbf{p}_{k,n}(t)\|^2 = 1,\ \forall k,n, \qquad (14b)$$
$$\|\mathbf{w}_{k,n}(t)\|^2 = 1,\ \forall k,n, \qquad (14c)$$
$$F(\mathbf{H}_k(t')),\ \forall k,\ \text{until } t'=t-1,\ \text{are available}, \qquad (14d)$$
where F(Hk(t′)) is a function of Hk(t′) which is specified in Section III. Problem (14) aims at optimizing the precoder and combiner to maximize the sum-rate of the served users subject to
Fig. 1. The system model for the FDD-based pilot process. The CSI feedback or acquisition delay ∆t is the time gap between the time point when the channel is estimated and the time when the BS starts downlink data transmission with the estimated channel.

0!"# 1P+2 3!! , "! / 0!"$ 1P+2 3!!"# , "!"# / 0!"% 1P+2 3!!"$ , "!"$ /

!! "! !!"# "!"# !!"$ "!"$ !!"% !


#! #!"# #!"$

"! $ %&'(%)& *+!! , "- ./ "!"$ $ %&'(%)& *+!!"$ , "- ./

"!"# $ %&'(%)& *+!!"# , "- ./

Fig. 2. Markov decision process of Q-learning.

constraints (14b)-(14d), which is a non-convex problem. To solve this problem, three efficient DRL-based strategies are proposed in Section III.

III. MULTI-AGENT DEEP REINFORCEMENT LEARNING FOR MULTI-USER MIMO DOWNLINK TRANSMISSION

To build up the foundation for the proposed DRL-based designs, an overview of DQN is
illustrated first, followed by the description of the state, action, reward function, and three multi-
agent DRL-based algorithms for the problem (14).

A. A Brief Overview of DQN

In reinforcement learning (RL), an agent learns the optimal action policy to maximize the reward through trial-and-error interactions with the environment. RL is commonly formalized as an approach for Markov Decision Process (MDP) problems, which consist of S, A, R, P, and γ, referring to a set of states, a set of actions, a reward function, a state transition function, and the discount factor, respectively. To be specific, at time t, an agent in state st ∈ S takes an action at ∈ A according to policy π(at|st), obtains a reward rt = R(at, st), and reaches the next state st+1 ∈ S with probability P(st, at, st+1) in return for the action taken. Formally, each transition (the so-called experience of an agent in DQN) can be written as the tuple
$$e_t = \langle s_t, a_t, r_t, s_{t+1}\rangle. \qquad (15)$$
The optimal policy π∗(at|st) is a mapping function between state and action that maximizes the future accumulated reward
$$R_t = \sum_{\tau=0}^{\infty}\gamma^{\tau} R(s_{t+\tau+1}, a_{t+\tau+1}) \qquad (16)$$
where the discount factor γ ∈ [0, 1] balances the significance between immediate and future rewards.
The optimal policy can be achieved by using dynamic programming (DP) methods that require
detailed knowledge of the environment, i.e., P(st , at , st+1 ), which is unavailable due to the
variation of propagation channels.
To tackle this issue, as illustrated in Fig. 2, model-free Q-learning algorithms are employed to continuously improve the policy through interactions with the environment. To be specific, the state-action value (called the Q-value) is denoted as the expected reward of (s, a) under policy π
$$Q_\pi(s_t,a_t) = \mathbb{E}_\pi\!\left(R_t \,\middle|\, s_t = s, a_t = a\right) \qquad (17)$$
where the expectation is calculated over all the possible (s, a) pairs given by policy π, which can be iteratively computed from the Bellman equation
$$Q_\pi(s_t,a_t) = R(r_{t+1}\,|\,s_t=s, a_t=a) + \gamma\sum_{s'\in\mathcal{S}}\mathcal{P}(s_{t+1}=s'\,|\,s_t=s, a_t=a)\max_{a'\in\mathcal{A}} Q_\pi(s',a') \qquad (18)$$
where P(st+1 = s′ | st = s, at = a) denotes the transition probability from state s to s′ after taking action a. The optimal policy returns the maximum expected cumulative reward at each s, i.e., π∗ = arg maxπ Qπ(s, a). Then the Q-value function can be represented as
$$Q_{\pi^*}(s_t,a_t) = r_{t+1}(s_t=s, a_t=a, \pi=\pi^*) + \gamma\sum_{s'\in\mathcal{S}}\mathcal{P}(s_{t+1}=s'\,|\,s_t=s, a_t=a)\max_{a'\in\mathcal{A}} Q_{\pi^*}(s',a'). \qquad (19)$$
P

In classical Q-learning, a Q-value table q(s, a), named the Q-table, is constructed to represent the Q-value function Qπ(s, a). This table consists of a discrete set of |S| × |A| entries and is randomly initialized. The agent then takes actions according to an ε-greedy policy, receives the reward r = R(s, a), and transfers to the next state st+1 to complete the experience et. The Q-table is updated as
$$q(s_t,a_t) \leftarrow (1-\alpha)\,q(s_t,a_t) + \alpha\left(r_{t+1} + \gamma\max_{a'} q(s_{t+1},a')\right) \qquad (20)$$
where α ∈ [0, 1) is the learning rate.
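As a one-line illustration of the tabular update in (20), the snippet below performs a single Q-learning step; the integer state/action encodings are hypothetical.

```python
import numpy as np

def q_learning_step(Q, s, a, r_next, s_next, alpha=0.1, gamma=0.1):
    """One tabular update of (20); Q is an |S| x |A| array of Q-values."""
    target = r_next + gamma * np.max(Q[s_next])          # bootstrapped target
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q
```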
However, it is challenging to directly obtain the optimal Qπ∗(st, at) due to the uncertain variation of the dynamic channel environment, i.e., an unlimited number of states. To address problems with such an enormous state space, a deep Q-network (DQN) is utilized here to approximate the Q-value function, which can be expressed as q(st, at, θ), with θ denoting the weights of the DQN. The optimal policy π∗ can then be represented by a group of DQN weights. In addition, two techniques are exploited to strengthen the stability of DRL: the target network and experience replay. The target network q(st, at, θ̄) is another network that is initialized with the same set of weights as the trained DQN. The target DQN is used to generate the target Q-value, which is exploited to formulate the loss function of the trained DQN. The weights of the target DQN are updated periodically every fixed number of slots Ts by replicating the weights of the trained DQN, which stabilizes the training of the trained DQN. The experience replay is intrinsically a first-in-first-out (FIFO) queue that stores Em historical experiences across the training slots. During training, Eb experiences are sampled from the experience pool O to train the trained DQN so as to minimize the prediction error between the trained DQN and the target DQN. The loss function is defined as
$$L(\theta) = \frac{1}{2E_b}\sum_{\langle s,a,r,s'\rangle\in\mathcal{O}}\left(r' - q(s,a;\theta)\right)^2 \qquad (21)$$
where r′ = r + γ max_{a′} q(s′, a′; θ̄). The weights θ of the DQN are updated by adopting a proper optimizer (e.g., RMSprop, Adam, or SGD). The specific gradient update is
$$\nabla_\theta L(\theta) = \mathbb{E}_{\langle s,a,r,s'\rangle\in\mathcal{O}}\left[\left(r' - q(s,a;\theta)\right)\nabla_\theta q(s,a;\theta)\right]. \qquad (22)$$
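The training step built around (21)-(22) can be sketched as follows. The paper reports a TensorFlow implementation, but the function names, layer sizes, and batch handling below are illustrative assumptions, not the authors' code.

```python
import tensorflow as tf

def build_dqn(state_dim, num_actions, hidden=(256, 256)):
    """Fully connected DQN: state vector in, one Q-value per discrete action out."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden[0], activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(hidden[1], activation="relu"),
        tf.keras.layers.Dense(num_actions),
    ])

def train_step(dqn, target_dqn, batch, optimizer, gamma=0.1):
    """One minibatch update of the loss (21), sampled from the experience pool."""
    s, a, r, s_next = batch                           # tensors of shape (Eb, ...)
    # target r' = r + gamma * max_a' q(s', a'; theta_bar)
    r_prime = r + gamma * tf.reduce_max(target_dqn(s_next), axis=1)
    with tf.GradientTape() as tape:
        q_sa = tf.gather(dqn(s), a, batch_dims=1)     # q(s, a; theta) for the taken actions
        loss = 0.5 * tf.reduce_mean(tf.square(r_prime - q_sa))
    grads = tape.gradient(loss, dqn.trainable_variables)
    optimizer.apply_gradients(zip(grads, dqn.trainable_variables))
    return loss

# every Ts slots, copy the trained weights into the target network:
#   target_dqn.set_weights(dqn.get_weights())
```

An Adam optimizer (tf.keras.optimizers.Adam) can be passed as the optimizer, matching the optimizer choice described later in Section IV.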

B. The Distributed-learning-distributed-processing DRL-based Algorithm

In this section, we cast problem (14) as a sequential decision-making process and tailor three multi-agent DRL algorithms to solve it. The DRL-based framework is elaborated first, followed by the derived algorithms. To the best of our knowledge, this is the first paper tackling the problem with 1) receive beamforming with multiple receive antennas, 2) transmit beamforming under imperfect CSIT, 3) multiple streams for each user, and 4) multiple users in a single cell, compared with SISO in [27] and MISO in [28] with perfect CSIT, respectively. In addition, PDRL is also demonstrated for the first time in this paper to bridge DDRL and CDRL and balance performance and complexity.

Fig. 3. The downlink training and uplink feedback of the proposed DRL framework. The detailed structure of the distributed-learning-distributed-processing framework is shown in Fig. 5.

Fig. 4. Timing of time slot t − 1.
1) Downlink Training and Uplink Feedback: As is shown in Fig. 3 and Fig. 4, at time slot
t − 1, the BS sends downlink pilots to users, based on which the downlink channels are perfectly
estimated. User k can estimate the designed state information in Section III-B4 and feed it back
to the base station. With feedback from users, the BS can predict the indexes of precoders and
combiners for time slot t and start downlink data transmission.
2) The Proposed DRL-based Algorithm: To bring this insight to fruition, each stream is modeled as an agent, for a total of KNs agents in our scheme. We adopt a distributed-learning-distributed-processing framework, as shown in Fig. 5 and demonstrated in Algorithm 1. At the initialization stage, all KNs pairs of DQNs are established at the BS. For instance, one pair of DQNs, namely the trained DQN q(sk,n, ak,n; θk,n) and the target DQN q(sk,n, ak,n; θ̄k,n), is possessed by agent (k, n). The input and output of the trained DQN q(sk,n, ak,n; θk,n) are the local state sk,n and action ak,n. In terms of the distributed learning procedure for agent (k, n), due to the feedback delay from the users, only outdated CSI information is used to formulate the observation sk,n at the beginning of each time slot. Then, the DRL agent adopts an ε-greedy policy to balance exploitation and exploration when choosing actions, i.e., the precoder pk,n and combiner wk,n according to sk,n: the agent executes a random action with probability ε, or executes the action ak,n = arg max_a q(sk,n, a; θk,n) with probability 1 − ε. Regarding the distributed learning process, the agent accumulates and stores the experience ek,n = ⟨sk,n, ak,n, rk,n, s′k,n⟩ in its experience pool, and the historical experiences are utilized to train the DQN with local state-action pairs together with the corresponding reward. Each agent thus has a profound view of the relationship between local state-action pairs and the local long-term reward, which, in return, leads the whole system to operate in a distributed-learning-distributed-processing manner.

Fig. 5. The framework of the distributed-learning-distributed-processing scheme.
3) Actions of the proposed multi-agent DRL approach for the massive MIMO scenario: As described in Section II, we aim to optimize the precoder pk,n and combiner wk,n, ∀k, n. The problem can then be addressed by building two codebooks, i.e., St and Sr, which contain St and Sr beamforming vectors, respectively. In the decision-making stage, each agent chooses one precoder from St and one combiner from Sr. The action space can be represented as
$$\mathcal{A} = \{(\mathbf{c}_t, \mathbf{c}_r)\,:\, \mathbf{c}_t\in\mathcal{S}_t,\ \mathbf{c}_r\in\mathcal{S}_r\} \qquad (23)$$
where ct and cr denote the codewords of the two codebooks, and the cardinality of the action space A is St Sr. The codebook design comes from [44], which is also applied in [28], [34], [35], and is introduced here as a quantization of the beam directions. To specify each element, we define the matrix Ct ∈ C^{M×St} as
$$\mathbf{C}_t[p,q] = \frac{1}{\sqrt{M}}\exp\!\left(j\frac{2\pi}{T}\left\lfloor\frac{p\,\mathrm{mod}\!\left(q+\frac{S_t}{2},\,S_t\right)}{S_t/T}\right\rfloor\right) \qquad (24)$$
where T is the number of available phase values, and Cr ∈ C^{N×Sr} can be obtained by substituting M and St with N and Sr accordingly. Each column of Ct and Cr corresponds to a specified codeword, and the whole matrix forms a beamsteering-based beamformer codebook.
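The snippet below builds a beamsteering codebook of the form in (24), under the assumption that the row index p runs over the antenna elements; it is an illustrative reconstruction rather than the authors' exact construction.

```python
import numpy as np

def beam_codebook(num_ant, num_beams, num_phases):
    """Quantized beamsteering codebook, cf. (24): one column per codeword."""
    C = np.zeros((num_ant, num_beams), dtype=complex)
    for p in range(num_ant):                      # antenna index (assumed meaning of p)
        for q in range(num_beams):                # codeword index
            phase_idx = np.floor(p * np.mod(q + num_beams / 2, num_beams)
                                 / (num_beams / num_phases))
            C[p, q] = np.exp(1j * 2 * np.pi * phase_idx / num_phases)
    return C / np.sqrt(num_ant)

# transmit and receive codebooks; the action space is all (c_t, c_r) pairs, cf. (23)
C_t = beam_codebook(num_ant=32, num_beams=32, num_phases=4)   # illustrative sizes
C_r = beam_codebook(num_ant=4,  num_beams=4,  num_phases=4)
```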
4) States of the proposed DRL-based approach for massive MIMO scenarios: Under the mobility scenario, the receiver feedback is delayed at time slot t, and the state of agent (k, n) is constructed from representative features of the observations of the last two successive time slots t − 1 and t − 2, without observations from time slot t. That is to say, at the beginning of time slot t, due to the feedback delay, the BS is unable to instantaneously obtain the power of the received signal, i.e., $|w_{k,n}^H(t)H_k(t)p_{k,n}(t)|^2$ and $|w_{k,n}^H(t)H_k(t-1)p_{k,n}(t)|^2$. However, the historical feedback, i.e., $|w_{k,n}^H(t-1)H_k(t-1)p_{k,n}(t-1)|^2$ and $|w_{k,n}^H(t-1)H_k(t-2)p_{k,n}(t-1)|^2$, is usually available to the BS. Based on this assumption, the state sk,n(t) is designed as follows.
• The "desired" information of agent (k, n), which consists of 5 parameters: the channel gain $|w_{k,n}^H(t-1)H_k(t-1)p_{k,n}(t-1)|^2$, the chosen precoder index Uk,n(t − 1), the chosen combiner index Vk,n(t − 1), the achievable rate of stream n for user k, i.e., Gk,n(Pk(t − 1), Wk(t − 1)), and the interference-plus-noise Ik,n(t − 1) + Ic,k(t − 1) + σk².
• The interference information of agent (k, n), which is represented by 8 parameters: $\{\sum_{i=1,i\neq n}^{N_s}|w_{k,n}^H(t-u)H_k(t-u)p_{k,i}(t-u)|^2$, $\sum_{i=1,i\neq n}^{N_s}|w_{k,n}^H(t-1-u)H_k(t-u)p_{k,i}(t-1-u)|^2$, $\sum_{j\neq k}\sum_{i=1}^{N_s}|w_{k,n}^H(t-u)H_k(t-u)p_{j,i}(t-u)|^2$, $\sum_{j\neq k}\sum_{i=1}^{N_s}|w_{k,n}^H(t-1-u)H_k(t-u)p_{j,i}(t-1-u)|^2 \mid u\in\{1,2\}\}$. It is worth noting that, in such a system, the interference information plays a key role in the maximization of the agent's own information rate (the rate of stream n of user k) and should therefore be included in the state space.
• The information of the other agents (j, i), (j, i) ≠ (k, n), ∀j, i, which consists of 10(KNs − 1) terms: $\{U_{j,i}(t-u), V_{j,i}(t-u), G_{j,i}(W_j(t-u), P_j(t-u)), \frac{P}{KN_s}|w_{j,i}^H(t-u)H_j(t-u)p_{j,i}(t-u)|^2, \frac{P}{KN_s}|w_{j,i}^H(t-u)H_j(t-u)p_{k,n}(t-u)|^2 \mid u\in\{1,2\}\}$. The information of the other agents plays an irreplaceable role for agent (k, n) in minimizing the interference it causes to them and should therefore also be included in the state space.
To sum up, the cardinality of the state space is 10KNs + 3. Note that the adopted design is not guaranteed to be optimal but empirically achieves good performance, as demonstrated by the evaluation results in Section IV. The output size of the DQN is S = St Sr, which is equal to the number of available actions.
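As a sanity check on the state dimension 10KNs + 3, the sketch below packs the three groups of observations of agent (k, n) into one flat vector; the argument names and ordering are assumptions about how one might organize the features.

```python
import numpy as np

def build_state(own, interference, others, K, Ns):
    """Flatten the observation of agent (k, n) into one state vector.

    own          : the 5 'desired' scalars (channel gain, precoder index,
                   combiner index, rate G_{k,n}, interference-plus-noise)
    interference : the 8 delayed interference power measurements (u in {1, 2})
    others       : 10 scalars for each of the other K*Ns - 1 agents
    """
    state = np.concatenate([np.asarray(own, dtype=float).ravel(),
                            np.asarray(interference, dtype=float).ravel(),
                            np.asarray(others, dtype=float).ravel()])
    assert state.size == 10 * K * Ns + 3          # 5 + 8 + 10*(K*Ns - 1)
    return state
```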
5) The reward of the proposed DRL-based approach for the massive MIMO scenario: In this massive MIMO scenario, if agent (k, n) only tries to maximize the achievable rate of stream (k, n) without taking the inter-stream and multi-user interference into consideration, a large interference will be delivered to the other agents. Therefore, our proposed reward function rk,n consists of a penalty coefficient λ and a penalty term Pk,n(Wk(t), Pk(t)) to quantify the adverse impact each agent causes to the other agents. The penalty term Pk,n(Wk(t), Pk(t)) is given as
$$\begin{aligned}
P_{k,n}(\mathbf{W}_k(t),\mathbf{P}_k(t)) ={}& \sum_{j=1,\,j\neq k}^{K}\sum_{i=1}^{N_s}\left[\log_2\!\left(1+\frac{\frac{P}{KN_s}\left|\mathbf{w}_{j,i}^H(t)\mathbf{H}_j(t)\mathbf{p}_{j,i}(t)\right|^2}{\sigma^2+\hat{I}_{k,n_1}(t)+\hat{I}_{c_1,k}(t)}\right)-G_{j,i}(\mathbf{W}_k(t),\mathbf{P}_k(t))\right] \\
&+\sum_{i=1,\,i\neq n}^{N_s}\left[\log_2\!\left(1+\frac{\frac{P}{KN_s}\left|\mathbf{w}_{k,i}^H(t)\mathbf{H}_k(t)\mathbf{p}_{k,i}(t)\right|^2}{\sigma^2+\hat{I}_{k,n_2}(t)+\hat{I}_{c_2,k}(t)}\right)-G_{k,i}(\mathbf{W}_k(t),\mathbf{P}_k(t))\right]
\end{aligned} \qquad (25)$$

where Î_{k,n_1}(t), Î_{k,n_2}(t), Î_{c_1,k}(t) and Î_{c_2,k}(t) are given by
$$\hat{I}_{k,n_1}(t) = \sum_{i=1,\,i\neq l}^{N_s}\frac{P}{KN_s}\left|\mathbf{w}_{j,i}^H(t)\mathbf{H}_j(t)\mathbf{p}_{k,i}(t)\right|^2, \qquad (26)$$
$$\hat{I}_{c_1,k}(t) = \sum_{q\in\mathcal{K},\,q\neq k}\sum_{i=1}^{N_s}\frac{P}{KN_s}\left|\mathbf{w}_{j,i}^H(t)\mathbf{H}_j(t)\mathbf{p}_{q,i}(t)\right|^2 - \frac{P}{KN_s}\left|\mathbf{w}_{j,i}^H(t)\mathbf{H}_j(t)\mathbf{p}_{j,i}(t)\right|^2, \qquad (27)$$
$$\hat{I}_{k,n_2}(t) = \sum_{h=1,\,h\neq i,n}^{N_s}\frac{P}{KN_s}\left|\mathbf{w}_{k,i}^H(t)\mathbf{H}_k(t)\mathbf{p}_{k,h}(t)\right|^2, \qquad (28)$$
and
$$\hat{I}_{c_2,k}(t) = \sum_{q\in\mathcal{K},\,q\neq k}\sum_{h=1}^{N_s}\frac{P}{KN_s}\left|\mathbf{w}_{k,i}^H(t)\mathbf{H}_k(t)\mathbf{p}_{q,h}(t)\right|^2, \qquad (29)$$
respectively. Note that Pk,n(Wk(t), Pk(t)) is always a positive value due to the extraction of the interference from a specified stream. Then, the achievable rate of stream (k, n), i.e., Gk,n(Wk(t), Pk(t)), is added into rk,n to highlight the contribution of agent (k, n) to the total information rate. Hence, rk,n at time slot t is given as
$$r_{k,n}(t) = G_{k,n}(\mathbf{W}_k(t),\mathbf{P}_k(t)) - \lambda\, P_{k,n}(\mathbf{W}_k(t),\mathbf{P}_k(t)) \qquad (30)$$

where the penalty coefficient λ is used here as a weight parameter to control the amount of negative effect in the reward function. Regarding the reward function, the rationale behind such a design is to reward agent (k, n) with the rate improvement that would be achieved if the interference caused by stream (k, n) were totally eliminated. This design not only maximizes the achievable rate of stream (k, n), i.e., Gk,n, but also minimizes the negative effect it causes to the other streams, i.e., Pk,n. Similar designs are comprehensively discussed in [27], [28], which also confirm that a well-formulated reward function should act as a catalyst for the best decisions obtained by multiple agents.
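A minimal sketch of the reward in (30): the rate of stream (k, n) minus the weighted penalty in (25). The dictionaries rate_without_agent and rate_with_agent stand in for the log terms and the G values in (25) and are an assumed way of organizing the computation, not the authors' code.

```python
def agent_reward(G_kn, rate_without_agent, rate_with_agent, lam=1.0):
    """Reward (30) for agent (k, n).

    G_kn               : achievable rate of stream (k, n), cf. (4)
    rate_without_agent : dict {(j, i): rate of stream (j, i) with the interference
                         from stream (k, n) removed}, cf. the log terms in (25)
    rate_with_agent    : dict {(j, i): actual rate G_{j,i}}
    lam                : penalty coefficient lambda
    """
    penalty = sum(rate_without_agent[s] - rate_with_agent[s]
                  for s in rate_without_agent)          # P_{k,n} in (25)
    return G_kn - lam * penalty
```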

TABLE I
COMPARISON BETWEEN STRATEGIES

Uplink CSI overhead:
- FDD & ZF-CI [45]: MN + KN (MN channel coefficients and KN CSI pilot symbols).
- TDD & ZF-CI [45]: KN sounding pilot symbols.
- FDD & MA-DRL: MN + KN + 11 (MN channel coefficients, KN CSI pilot symbols, and the feedback {|w_{k,n}^H(t−1)H_k(t−1)p_{k,n}(t−1)|², G_{k,n}(P_k(t−1), W_k(t−1)), I_{k,n}(t−1) + I_{c,k}(t−1) + σ_k², Σ_{i≠n}|w_{k,n}^H(t−v−u)H_k(t−u)p_{k,i}(t−v−u)|², Σ_{j≠k}Σ_i|w_{k,n}^H(t−v−u)H_k(t−u)p_{j,i}(t−v−u)|², u ∈ {1, 2}, v ∈ {0, 1}}).

Downlink CSI overhead:
- FDD & ZF-CI [45]: MN CSI pilot symbols.
- TDD & ZF-CI [45]: 0.
- FDD & MA-DRL: MN + 2N (MN channel coefficients and the indexes of the precoders and combiners p_{k,n} and w_{k,n}).

Computing complexity of the precoding and combining matrices:
- FDD & ZF-CI [45]: O((MN)³).
- TDD & ZF-CI [45]: O((MN)³).
- FDD & MA-DRL: O((10KNs + 3)L1 + L1L2 + L2S).

6) Discussion on the overhead and complexity of the proposed framework: As is shown in Table I, if the base station has to tell the users which combiner to use, this consumes additional overhead on the downlink transmission. Fortunately, this overhead is negligible since only the indexes of the combiners are delivered to the users. Note that the precoders are also sent to the terminals for the calculation of the state information listed in the table. This reduces the computational burden on the base station for processing this state information.
In terms of the computational complexity of the precoders and combiners in the demonstrated DRL-based approaches, the designed structure of the target/trained DQNs includes four fully connected layers. Specifically, the input layer consists of 10KNs + 3 neurons, followed by two hidden layers with L1 and L2 neurons and a specified activation function. The fourth layer serves as the output layer with S neurons. We employ two hidden layers in our design, as a two-layer feedforward neural network is sufficient to approximate any nonlinear continuous function based on the universal approximation theorem [46]. The computational complexity of the fully connected DNN can be written as O((10KNs + 3)L1 + L1L2 + L2S) for each agent. This is much smaller than that of the ZF-CI scheme, due to the fact that ZF-CI involves a matrix inversion, which limits its scalability to a large number of transmit and receive antennas.
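Plugging in the configuration used later in Section IV (K = 4, Ns = 1, L1 = L2 = 256, S = St·Sr = 32·4 = 128), the per-agent forward complexity evaluates as follows; this is plain arithmetic on the stated expression, with the parameter values taken from Section IV.

```python
# per-agent DQN cost O((10*K*Ns + 3)*L1 + L1*L2 + L2*S), cf. the expression above
K, Ns, L1, L2, St, Sr = 4, 1, 256, 256, 32, 4
S = St * Sr
macs = (10 * K * Ns + 3) * L1 + L1 * L2 + L2 * S
print(macs)   # 11008 + 65536 + 32768 = 109312 multiply-accumulates per decision
```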
Remark 1: Note that, different from [23] where mobility estimation and channel prediction are needed, our work does not predict the channels sequentially. In this paper, we demonstrate a low-complexity and efficient DRL-based framework, and since this is the first work proposing DRL-based joint transmit and receive beamforming for massive MIMO downlink transmission, we would like to keep the benchmarks as clear and simple as possible so that researchers can understand the fundamental benefits of the proposed strategies and carry out their studies in more practical scenarios in the future. The comparison with mobility estimation and channel prediction methods (such as the VFK and MLP methods in [23]) could be addressed in future research, but it is beyond the scope of this paper.

C. The Low-complexity Centralized-learning-distributed-processing DRL-based Algorithm

In this section, we demonstrate an extra algorithm for problem (14), for three reasons. First, a lower computational complexity is achieved in the centralized scheme by building and training one extra pair of DQNs instead of training KNs agents in a distributed manner. Second, a lower storage space is required, with only a central experience pool during the learning process. Third, by saving and sampling the experiences of all distributed agents, the central agent can learn the common features from the channels of all users and intelligently guide the decision-making procedure of all distributed agents. It should be noted that CDRL is trained more efficiently using parameter sharing, which relies on homogeneous agents. This allows the policy to be trained with the experiences of all agents simultaneously. However, it still allows different actions between agents, due to the fact that each agent receives different observations. This algorithm focuses on the decentralized parameter-sharing training scheme since we found it to be scalable as the number of users and streams continues to increase.
There are also some similarities between CDRL and DDRL. On the one hand, they have the same state, action, and reward function, without the need to design new ones. On the other hand, the executing phase is also performed by distributed agents.
The whole process is shown in Algorithm 2. At the initialization stage, only one pair of target and trained DQNs is built for the central agent. For each distributed agent, one trained DQN is established.

Algorithm 1 DDRL Algorithm


1: Initialize: Establish a trained DQN and target DQN with random weights θk,n and θ̄k,n ,

respectively, ∀k ∈ {1, 2, . . . , K} , ∀n ∈ {1, 2, . . . , Ns }, update the weights of θ̄k,n with θk,n .


2: In the first Es time slots, agent (k, n) randomly selects an action from action space A, and stores the corresponding experience ⟨sk,n, ak,n, rk,n, s′k,n⟩ in its pool, ∀k, n.
3: for each time slot t do
4: for each agent (k, n) do
5: Obtain state sk,n from the observation of agent (k, n).
6: Generate a random number ω.
7: If ω < ε then:
8: Randomly select an action from action space A.
9: Else
10: Choose the action ak,n according to the Q-function q(sk,n, a; θk,n), ∀k, n.
11: End if.
12: Agent (k, n) executes ak,n, immediately receives the reward rk,n, and steps into the next state s′k,n, ∀k, n.
13: Agent (k, n) puts the experience ⟨sk,n, ak,n, rk,n, s′k,n⟩ into the experience pool Ok,n and randomly samples a minibatch of size Eb. Then, the weights of the trained DQN θk,n are updated using the backpropagation approach. The weights of the target DQN θ̄k,n are updated every Ts steps.
14: end for
15: end for

In the first several time slots, each agent randomly selects an action and saves the experiences into the central experience pool. When the episode begins, the central agent adopts an ε-greedy strategy to balance exploitation and exploration so as to find the optimal policy. After learning from the sampled experiences, the central agent broadcasts the updated weights of the central trained DQN to all other distributed agents for decision-making purposes.
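The parameter-sharing step of CDRL (step 15 of Algorithm 2) amounts to copying the central weights into every distributed DQN; a Keras-style get_weights/set_weights interface is assumed in the sketch below.

```python
def broadcast_central_weights(central_dqn, distributed_dqns):
    """CDRL parameter sharing: theta_{k,n} <- theta_c for all agents (k, n)."""
    weights = central_dqn.get_weights()
    for dqn in distributed_dqns.values():       # one DQN per agent (k, n)
        dqn.set_weights(weights)
```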

D. Bridging the DDRL and CDRL: Partial-distributed-learning-distributed-processing Scheme

In contrast with DDRL and CDRL, the partial-distributed-learning-distributed-processing DRL-


based scheme (PDRL) offers a more flexible solution to the problem (14) by modeling each user
as an agent. In the extreme case of Ns = 1, K > 1, PDRL boils down to DDRL by simply


Algorithm 2 CDRL Algorithm


1: Initialize: Establish a central trained DQN and central target DQN with random weights θc

and θ̄c for the central agent, update the weights of θ̄c with θc . Establish a trained DQN with
random weight θk,n , ∀k ∈ {1, 2, . . . , K} , ∀n ∈ {1, 2, . . . , Ns } for each distributed agent.
2: In the first Es time slots, agent (k, n) randomly selects an action from action space A, and stores the experience ⟨sk,n, ak,n, rk,n, s′k,n⟩, ∀k, n, in the experience pool of the central agent Oc.
3: for each time slot t do
4: for each agent (k, n) do
5: Obtain state sk,n from the observation of agent (k, n).
6: Generate a random number ω.
7: If ω < ε then:
8: Randomly select an action from action space A.
9: Else
10: Choose the action ak,n according to the Q-function q(sk,n, a; θk,n), ∀k, n.
11: End if.
12: Agent (k, n) executes ak,n, immediately receives the reward rk,n, and steps into the next state s′k,n, ∀k, n.
13: Agent (k, n) puts the experience ⟨sk,n, ak,n, rk,n, s′k,n⟩ into the central experience pool Oc.
14: end for
15: The central agent randomly samples a minibatch of size Eb. Then, the weights of the central trained DQN θc are updated using the backpropagation approach. The weights of the target DQN θ̄c are updated every Ts steps. Then, the central agent broadcasts the weights θc to all the distributed agents, i.e., θk,n = θc, ∀k, n.
16: end for

treating each stream as an agent. In the other extreme case of K = 1, Ns > 1, PDRL boils down to CDRL by forcing one central agent to do the training work. Compared with CDRL, PDRL demonstrates a better performance-complexity balance by learning the representative features of the propagation environment of a specified user, as demonstrated in Fig. 13. The whole algorithm is illustrated in Algorithm 3.


IV. RESULT EVALUATION

This section demonstrates the performance of our proposed multi-agent DRL-based algorithm
to maximize the average throughput of all the users. We first illustrate the simulation setup,
followed by the simulation results in different scenarios.

A. Simulation Setup

We consider a downlink transmission from one BS to multiple users. The BS serves K = 4 users in a single cell. The maximum transmit power P is fixed to 20 dBm and the noise variance σ² at the users is fixed to −114 dBm. The BS is equipped with M = 32 transmit antennas and the users are equipped with N = 4 receive antennas unless otherwise stated. Without loss of generality, uniform linear arrays (ULA) are deployed at both the transmitter and receiver sides with half-wavelength inter-antenna spacing. The large-scale channel fading is characterized by the log-distance path-loss model expressed below
$$\eta = L(d_0) + 10\,\omega\log_{10}\frac{d}{d_0} \qquad (31)$$
where d = 10 m is the BS-user distance. According to Table III of [47], the value of L(d0) for d0 = 1 m is 68 dB and the fading coefficient ω is 1.7. In terms of the shadowing model, the log-normal shadowing standard deviation βk is set to 1.8 dB. The small-scale fading channel is generated according to the channel model introduced in Section II. Regarding the parameters of Jakes' model with a user speed of 3.55 km/h, the maximum Doppler frequency fdmax and the channel instantiation interval Ti are set as 800 Hz and 1 × 10⁻³ s, respectively [47]. The corresponding correlation coefficient ρ is 0.6514 ≈ 0.65.
As illustrated in Fig. 5, the whole framework can be divided into two phases, the learning phase and the processing phase. Before the learning phase, we randomly generate channels obeying Jakes' model, randomly choose actions, observe the reward, and accumulate and store the corresponding experiences into the experience pool with size 1000 for the first 200 time slots, i.e., Em = 1000, Es = 200. In addition, the mini-batch size Eb is set as 32. Stepping into the learning stage, for the DNN, the numbers of neurons in the two hidden layers, i.e., L1 and L2, are both set as 256, followed by the ReLU activation function. The initial learning rate α(0) is 5e−3 and the decay rate dc is 10⁻⁴, such that the learning rate continues to decay with the number of time slots following α(t) = α(t − 1) · 1/(1 + dc·t). In terms of optimization, adaptive moment estimation (Adam) is utilized to prevent the diminishing-learning-rate problem. To minimize the prediction error between the trained DQN and the target DQN, the weights of the trained DQN are copied into the target DQN every 120 time slots, i.e., Ts = 120, with the discount factor γ and the penalty coefficient λ set as 0.1 and 1, respectively. During the processing phase, for the ε-greedy strategy, we set the initial exploration coefficient ε as 0.7, which decays exponentially to 0.001. Note that the adopted parameters are not guaranteed to be optimal ones, but they experimentally perform well in this setup. In the legend of the simulation figures, DDRL and CDRL come from Algorithm 1 and Algorithm 2, respectively. The value of each point is a moving average over the previous 500 time slots unless otherwise stated.
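The exploration and learning-rate schedules described above can be written compactly as follows; the exponential decay constant for ε is an assumption, since only the initial and final values are stated.

```python
import numpy as np

def learning_rate(t, alpha0=5e-3, dc=1e-4):
    """alpha(t) = alpha(t-1) / (1 + dc * t), unrolled from alpha(0) = alpha0."""
    alpha = alpha0
    for step in range(1, t + 1):
        alpha /= (1.0 + dc * step)
    return alpha

def epsilon(t, eps0=0.7, eps_min=0.001, decay=1e-4):
    """Exponentially decaying exploration rate (decay constant assumed)."""
    return max(eps_min, eps0 * np.exp(-decay * t))
```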
To demonstrate the effectiveness of our DRL-based approaches, four benchmark schemes are
evaluated, which are as follows:
• ZF-CI PCSI: Each agent executes the action from the scheme in [40] with instantaneous
and perfect CSI, i.e., Hk (t), ∀k.
• SAH: This approach stores the most recent estimated channel, i.e., Hk(t − 1), ∀k, and always sends the channel coefficients to the base station, which are then used for calculating the precoders using ZF-CI. This strategy essentially ignores the non-negligible delay between the channel estimation and the time point when the actual DL transmission happens [26]. When ρ = 1, SAH is the same as ZF-CI PCSI. SAH only captures the delay but assumes perfect knowledge of the CSI at t − 1.
• Random: Each agent randomly chooses actions. The performance serves as a lower bound
in the simulation.
• Greedy Beam Selection (GBS): Each agent exhaustively selects an action in a greedy manner; the actions with the highest sum information rate are chosen as the solution for each channel realization. This benchmark serves as the upper bound for the DRL-based strategies. Note that the size of the beam selection set increases exponentially with the size of the codebooks ((NK)^{St Sr}). For instance, when K = 4, N = 1, St = 32, Sr = 4, the total number of action combinations is 4^32, which is quite large considering the hardware constraint. Thus, we consider (8, 1) in this benchmark.

B. DDRL vs CDRL

Fig. 6 depicts the average achievable information rate versus the number of time slots for different transmit and receive beamformer codebook sizes (St, Sr). A first observation is that the performance gaps between the two DRL-based schemes and SAH gradually increase

Fig. 6. Average information rate versus the number of time slots with different codebook sizes (St, Sr).

Fig. 7. Average achievable information rate versus the number of time slots with different correlation coefficients.

Fig. 8. Average information rate versus the number of time slots with users rescheduling at the 5e4th, 1e5th, and 15e4th time slots.

Fig. 9. Average information rate versus the number of time slots with different numbers of users K.

with the number of St and Sr, and DDRL roughly observes a gain of 380% over SAH when St = 32 and Sr = 4. The reason behind this phenomenon is that, in the DRL-based strategies, better interference management can be achieved with a higher-resolution codebook, which significantly reduces the quantization error and effectively alleviates the interference from the other streams. A second observation is that DDRL can achieve nearly 90% of the system capacity of ZF-CI PCSI and 95% of that of the Beam Selection strategy, utilizing only a few pieces of information in the designed features from itself and the other users. An interpretation is that the lack of instantaneous CSI and the imperfect codebook design degrade the system performance, and this 10% and 5% gap cannot be fully eliminated. A third observation is that DDRL (32, 8) only demonstrates slightly better performance than DDRL (32, 4) and DDRL (16, 4), which shows robustness to the codebook size. A fourth observation is that CDRL always demonstrates instability before convergence. An explanation is that the huge differences between the dynamic environments of different users make it extremely difficult to find the commonality among them. Each agent could then be misled by the experiences of other agents, which thus results in fluctuations before convergence and a degradation in system performance. Conversely, in DDRL, each agent selects a specific precoder and combiner for its intended stream, which is relevant to its own propagation environment and differs considerably among streams. This local adaptability greatly improves the performance of DDRL. After comprehensive consideration of computational complexity, system performance, and convergence speed, (32, 4) is chosen as the codebook baseline in the simulations of both DRL-based schemes. Compared with the Greedy Beam Selection method, the DRL-based strategies exhibit far lower complexity on the large action space but achieve roughly 95% of the Greedy Beam Selection performance. Hence, CDRL is not suitable for practical systems where a large codebook is not available, while DDRL demonstrates more robustness regarding the codebook size but requires a much larger amount of memory and computing resources for training. We implemented the demonstrated algorithms with TensorFlow on a general-purpose computer, i.e., an i7-8700 CPU at 3.20 GHz. The running time for the different algorithms is listed in Table II.

TABLE II
RUNNING TIME FOR EACH CHANNEL REALIZATION

Scheme:  DDRL (32, 4) | ZF PCSI | SAH   | Beam Selection (8, 1) | Random Selection
Time:    0.2 s        | 0.8 s   | 0.8 s | 10 s                  | 0.1 s

Fig. 7 exhibits the average achievable information rate versus the number of time slots for different values of the correlation coefficient ρ. The DDRL scheme with ρ = 0.65 and ρ = 0.1 exceeds the SAH benchmark by approximately 380% and 500%, respectively. This result clearly embodies the superiority of our DRL-based framework over the traditional massive MIMO optimization scheme in mobility scenarios, since a 20% performance degradation is caused by the fast-changing channels in SAH. In addition, it can be observed that DDRL with ρ = 0.1 demonstrates slightly lower performance than with ρ = 0.65, which is also observed for CDRL. An explanation is that the DDRL scheme is not sensitive to the dynamic and fast-changing wireless environment, whereas CDRL needs more time steps to learn the representative features of the rapidly changing environment in high mobility scenarios, which results in slower convergence. Note that the DRL-based methods have a certain adaptability to environmental changes in user speed, which can be interpreted as robustness to the maximum Doppler frequency. Even though the correlation between adjacent channels is very small, the DRL-based frameworks still benefit from the exploration-exploitation strategy. Similar results are also observed in [34].
In Fig. 8, we assume that user rescheduling happens at the 50000th, 100000th, and 150000th time slots. Instead of re-initializing the weights of all the DQNs in each agent, training continues based on the designed state information from the newly scheduled users. First, DDRL achieves a much higher starting point and a comparable convergence time without a significant performance collapse relative to the ZF-CI PCSI scheme. This can be attributed to each agent learning features that are common to the initially scheduled and rescheduled users; decisions based on these shared features remain good, which demonstrates the ability to maintain connectivity under user rescheduling in mobile networks. Second, with each additional rescheduling event, CDRL achieves a higher information rate and faster convergence. An interpretation is that the common features learned from previously scheduled users accelerate training after rescheduling. After learning and 'storing' more and more feature information in the DQN weights, the central agent generalizes to the channel uncertainty of the rescheduled users, and in both schemes the weights of a previously trained DQN are a good candidate for initializing the DQN trained next. However, if rescheduling happens too frequently, i.e., before 25000 time slots have elapsed, the proposed DRL-based schemes cannot converge to a good set of parameters, although a jump-start is still obtained when a group of trained DQNs is applied to the new users.
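The warm-start behavior described above amounts to reusing the weights of an already trained DQN when users are rescheduled instead of re-initializing them randomly. A minimal Keras-style sketch of this weight transfer is given below; the fully connected network builder and its input/output dimensions are simplifying assumptions for illustration, not the exact architecture used in this paper (apart from the (256, 256) hidden-layer size adopted as the baseline).

```python
# Minimal sketch (assumed architecture) of warm-starting a DQN after user rescheduling:
# the new agent's network is initialized from previously trained weights rather than
# from a random initialization.
import tensorflow as tf

def build_dqn(state_dim, num_actions, hidden=(256, 256)):
    """A simple fully connected DQN; (256, 256) matches the baseline size used here."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(hidden[0], activation="relu"),
        tf.keras.layers.Dense(hidden[1], activation="relu"),
        tf.keras.layers.Dense(num_actions),  # one Q-value per codebook action
    ])

trained_dqn = build_dqn(state_dim=64, num_actions=32)  # DQN trained before rescheduling
new_dqn = build_dqn(state_dim=64, num_actions=32)      # DQN for the rescheduled users

# Warm start: copy the learned weights instead of keeping the random initialization.
new_dqn.set_weights(trained_dqn.get_weights())
```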
Fig. 9 investigates the average achievable information rate versus the number of time slots with K = 4 and K = 6. Although the average per-user rate decreases, the total cell throughput improves, which suggests that both DRL-based approaches benefit from the multi-user diversity provided by time-varying channels across different users. In addition, although the 11.7% performance degradation observed in CDRL is smaller than the 19.8% in DDRL, CDRL achieves a much smaller performance gain over SAH than DDRL does when K = 6. This suggests that DDRL is more robust than CDRL in multi-user scenarios.


Fig. 10. Average information rate versus the number of time slots with different numbers of transmit antennas M when N = 1, Ns = 1, K = 1.

Fig. 11. Average information rate versus the number of time slots with different numbers of transmit antennas M when N = 4, Ns = 2, K = 4.

To sum up, on the one hand, CDRL and DDRL are not equally suitable for massive MIMO systems with channel aging: CDRL is less computationally complex but is unstable and incurs a performance loss, whereas DDRL offers a promising gain over CDRL by enabling an adaptive decision-making process and facilitating cooperation among all agents to mitigate interference, at the cost of higher hardware complexity. DDRL is also more robust than CDRL to rescheduling in terms of convergence speed and stability. On the other hand, both schemes are robust to the fading characteristics of the environment and to changes in the interference conditions.

C. Multi-antenna and Multi-stream

Fig. 10 illustrates the DRL-based scheme with six different numbers of transmit antennas M when N = 1, Ns = 1, K = 1. Without any penalty, i.e., without inter-stream or multi-user interference, a near-optimal result with a stable increase of the information rate is observed by leveraging the proposed state, action, and reward design in this interference-free scenario, which validates the effectiveness of the codebook design in Section III. The convergence time is proportional to M, which limits scalability; intuitively, it takes longer to learn the representative features of high-dimensional CSIT in sequential states. In contrast with Fig. 10, severe multi-user and inter-stream interference must be managed in Fig. 11, where N = 4, Ns = 2, K = 4.


Fig. 12. Average information rate versus the number of time slots with different numbers of streams Ns when M = 32, N = 4, K = 4.

Fig. 13. Average information rate versus the number of time slots for the different DRL-based schemes when M = 32, N = 4, K = 4, Ns = 2.

It can be observed that the transmit diversity and array gain cannot be fully achieved in the proposed DRL-based scheme if the strong interference is not properly suppressed, owing to the limited codebook precision and CSIT imperfections. Hence, a drawback of the DRL-based methods is that inter-stream interference cannot be sufficiently alleviated when an agent fails to choose an action that causes little interference to the other agents during exploration and exploitation.
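Since the codebook resolution repeatedly appears as the limiting factor, it is useful to recall how a beamforming codebook of a given resolution can be generated. The sketch below builds a simple DFT-based codebook for a uniform linear array; this is a generic construction shown for illustration and is not necessarily identical to the codebook designed in Section III.

```python
# Minimal sketch (generic assumption): a DFT-based beamforming codebook for a uniform
# linear array with M antennas and 2^B unit-norm codewords; 2^B sets the beam resolution.
import numpy as np

def dft_codebook(M, B):
    """Return a (2^B, M) matrix whose rows are unit-norm DFT beamforming vectors."""
    num_codewords = 2 ** B
    angles = np.arange(num_codewords) / num_codewords  # normalized spatial frequencies
    n = np.arange(M)
    return np.exp(1j * 2 * np.pi * np.outer(angles, n)) / np.sqrt(M)

# Example: a 32-antenna array with a 5-bit codebook (32 candidate transmit beams).
F = dft_codebook(M=32, B=5)
print(F.shape)  # (32, 32)
```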
Fig. 12 characterizes the average achievable information rate versus the number of time slots with different numbers of streams per user. First, DDRL significantly outperforms the conventional SAH scheme for all numbers of streams. Second, a 15.7% performance degradation is observed for the DRL-based scheme between the 1-stream and 2-stream scenarios, which is smaller than that of SAH (around 28%, between the black and blue dotted lines). This reveals the advantage of DDRL in inter-stream interference management. An overview of the average information rate versus the number of time slots for the three DRL-based algorithms is given in Fig. 13. Compared with DDRL and PDRL, a performance collapse is observed in CDRL due to the degrading effect of inter-stream interference. By flexibly modeling each user as an agent, PDRL greatly mitigates the inter-stream interference by learning from the local observations of the target user's propagation channel.

D. Reward, Penalty Analysis and Statistical Test

To reveal the significance of the neural network size (L1, L2), the learning rate α, and the discount factor γ, Fig. 14 shows the sum reward versus the number of time slots for different values of (L1, L2), α, and γ.


Fig. 14. Sum reward versus the number of time slots with different (L1, L2), α, and γ, when M = 32, N = 4, Ns = 2, K = 4.

Fig. 15. Average information rate versus the number of time slots with different penalty values λ, when M = 32, N = 4, Ns = 1, K = 4, ρ = 0.65.

The first observation is that convergence is faster with a larger α; this is intuitive, since a larger learning rate takes larger gradient-descent steps. The second observation is that, compared with (256, 256), a reward degradation appears with (32, 32), which suggests that increasing the DNN size provides a stronger capability to represent the input features and boosts the performance of the DRL-based scheme. Since the performance improvement with (512, 512) is negligible, (256, 256) is chosen as the baseline to maintain a balance between performance and computational burden.
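For reference, the roles of α and γ in the learning process can be made explicit through the standard DQN update with a target network; the sketch below is a minimal illustration under standard DQN assumptions, in which the replay batch and the optimizer choice are placeholders rather than the exact implementation used here.

```python
# Minimal sketch of one DQN update step (standard DQN assumptions): the learning rate
# alpha scales the gradient step, while the discount factor gamma weights the
# bootstrapped target computed from the target network (weights theta_bar).
import tensorflow as tf

def dqn_update(trained_dqn, target_dqn, batch, alpha=1e-3, gamma=0.9):
    # batch: tensors sampled from the experience pool; actions is an int tensor of shape (B,).
    states, actions, rewards, next_states = batch
    optimizer = tf.keras.optimizers.Adam(learning_rate=alpha)  # kept across steps in practice
    # Bootstrapped target: r + gamma * max_a' q(s', a'; theta_bar)
    targets = rewards + gamma * tf.reduce_max(target_dqn(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_all = trained_dqn(states)                                # q(s, a; theta) for all actions
        q_taken = tf.gather(q_all, actions, axis=1, batch_dims=1)  # Q-value of the chosen action
        loss = tf.reduce_mean(tf.square(tf.stop_gradient(targets) - q_taken))
    grads = tape.gradient(loss, trained_dqn.trainable_variables)
    optimizer.apply_gradients(zip(grads, trained_dqn.trainable_variables))
    return loss
```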
Fig. 15 offers insight into the impact of different penalty values λ. This penalty term intrinsically adjusts the reward function of each agent. Different from [27], [28], Fig. 15 shows that the system capacity gradually increases as the penalty value grows from 0.1 to 5. An interpretation is that each agent causes high interference to the other agents while still trying to maximize its own information rate. Due to the uncertainty of the dynamic environment, the lack of perfect CSI introduces unpredictable interference for all the agents, and increasing the penalty value remedies this by steering each agent toward actions that minimize the interference caused to other agents rather than maximizing its own received power. This result also indicates that the decision-making process of all the agents is robust in unexpected high-interference scenarios.
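To make the role of λ explicit, the reward of each agent can be viewed as its own achievable rate minus a λ-weighted penalty on the interference it causes to the other agents; the sketch below shows this structure as an assumed illustrative form, not necessarily the exact reward function defined earlier in the paper.

```python
# Minimal sketch (assumed illustrative reward): an agent's reward equals its own rate
# minus lambda times the total interference power it leaks to the other agents.
# A larger lambda steers agents toward less-interfering actions.
import numpy as np

def agent_reward(own_rate, leaked_interference_powers, lam=1.0):
    """own_rate: achievable rate of this agent (bps/Hz);
    leaked_interference_powers: interference power caused to each other agent."""
    return own_rate - lam * np.sum(leaked_interference_powers)

# Example: an agent achieving 3.2 bps/Hz while leaking interference to three other agents.
print(agent_reward(3.2, np.array([0.05, 0.10, 0.02]), lam=5.0))
```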
The cumulative distribution functions (CDFs) of the different DRL-based methods and benchmarks are plotted in Fig. 16. The CDF curves confirm the earlier discussion of the superiority of DDRL over the other schemes.

Fig. 16. Cumulative distribution function (CDF) of the average information rate over different DRL-based methods and benchmarks (M = 32, N = 4, K = 4, Ns = 1).

Fig. 17. Boxplot of the average information rate over different DRL-based methods and benchmarks (M = 32, N = 4, K = 4, Ns = 1); the compared schemes are DDRL, CDRL, ZF-CI PCSI, Random, SAH, and GBS. The bottom and top of each box are the 25th and 75th percentiles of the information rate values, respectively. The whisker length is set to infinity to ensure there are no outliers.

Also, the performance of DDRL is significantly limited by the codebook resolution, which is confirmed by the boxplot of the different methods shown in Fig. 17.

V. CONCLUSION AND FUTURE RESEARCH

In this paper, we studied beamforming optimization for massive MIMO downlink transmission with channel aging. A DRL-based optimization framework was developed, and three DRL-based algorithms were derived from stream-level, user-level, and system-level agent modeling. Specifically, the transmit precoder at the BS and the receive combiners at the user terminals were jointly optimized to maximize the average information rate. Furthermore, we analyzed the performance loss of the DRL-based approaches compared to the ideal case of continuous beamforming for different codebook sizes, numbers of users, antennas, and streams, and user speeds. Interestingly, it was shown that even with a very low-resolution codebook, DDRL still achieves 95% and 90% of the performance obtained with GBS and ZF-CI, respectively. Simulation results showed that significant robustness to user mobility can be achieved by exploiting a few received power values of the imperfect CSIT at the expense of additional uplink overhead. The convergence speed and scalability of the proposed algorithms were also discussed: the convergence time increases linearly with the number of transmit antennas, and the performance degradation in the multiuser case is non-negligible due to severe co-channel interference.


In addition, CDRL has lower computational complexity but demonstrates instability and incurs a performance loss, whereas DDRL offers a promising gain over CDRL by enabling an adaptive decision-making process and facilitating cooperation among all agents to mitigate interference, at the cost of higher hardware complexity and non-stationarity. Finally, the reward and penalty analysis and the statistical test confirm that the performance of the proposed algorithms is greatly limited by the resolution of the codebook. Several important issues are not yet addressed in this paper; some of them are listed as follows to motivate future research.
• Multi-cell: This paper considered single-cell multiuser conditions. When multiple cells are considered, the transmit power of each BS also needs to be optimized, which makes the corresponding optimization problem more challenging to solve and thus worthy of further investigation.
• Extremely Large-scale MIMO: To overcome the capacity constraints of conventional MIMO, extremely large-scale MIMO (XL-MIMO) is being proposed, which can provide a much stronger beamforming gain to compensate for severe path loss. As such, it is worth comparing the proposed massive MIMO scheme with XL-MIMO in future investigations.
• Beamforming Codebook Design: As shown in this paper, the performance is greatly limited by the resolution of the designed codebook. A better codebook would enable the system to handle larger and more complex channel conditions without compromising performance.

APPENDIX A
PDRL ALGORITHM

ACKNOWLEDGMENT

The authors would like to thank Hongyu Li, Yumeng Zhang, and Dr. Onur Dizdar for stimulating discussions.

REFERENCES
[1] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong, and J. C. Zhang, “What will 5G be?” IEEE
Journal on selected areas in communications, vol. 32, no. 6, pp. 1065–1082, 2014.
[2] S. Dang, O. Amin, B. Shihada, and M.-S. Alouini, “What should 6G be?” Nature Electronics, vol. 3, no. 1, pp. 20–29,
2020.
[3] Z. Xiao, Z. Han, A. Nallanathan, O. A. Dobre, B. Clerckx, J. Choi, C. He, and W. Tong, “Antenna array enabled
space/air/ground communications and networking for 6G,” arXiv preprint arXiv:2110.12610, 2021.
[4] V. Stankovic and M. Haardt, “Generalized design of multi-user MIMO precoding matrices,” IEEE Transactions on Wireless
Communications, vol. 7, no. 3, pp. 953–961, 2008.


Algorithm 3 PDRL Algorithm

1: Initialize: Establish K pairs of trained/target DQNs with random weights θk and θ̄k, ∀k ∈ {1, 2, . . . , K}, as user-specified agents, and initialize the weights θ̄k with the random θk. Build an experience pool Ok, ∀k. Establish a trained DQN with weights θk,n, ∀k ∈ {1, 2, . . . , K}, ∀n ∈ {1, 2, . . . , N}, for each distributed agent.
2: In the first Es time slots, agent (k, n) randomly selects an action from the action space A and stores the experience ⟨sk,n, ak,n, rk,n, s′k,n⟩, ∀k, n, in the experience pool Ok of the corresponding user-specified agent.
3: for each time slot t do
4:   for each agent (k, n) do
5:     Obtain the state sk,n from the observation of agent (k, n).
6:     Generate a random number ω.
7:     if ω < ε then
8:       Randomly select an action from the action space A.
9:     else
10:      Choose the action ak,n according to the Q-function q(sk,n, a; θk,n), ∀k, n.
11:     end if
12:     Agent (k, n) executes ak,n, immediately receives the reward rk,n, and steps into the next state s′k,n, ∀k, n.
13:     Agent (k, n) puts the experience ⟨sk,n, ak,n, rk,n, s′k,n⟩ into the experience pool Ok of the corresponding user-specified agent.
14:   end for
15:   User-specified agent k randomly samples a minibatch of size Eb. Then, the weights θk of its trained DQN are updated using the back-propagation approach, and the weights θ̄k of its target DQN are updated every Ts steps. Finally, the user-specified agent broadcasts the weights θk to the corresponding distributed agents, i.e., θk,n = θk, ∀n.
16: end for
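As a companion to steps 5-11 and 15 of Algorithm 3, the following minimal Python sketch illustrates the ε-greedy action selection of a distributed agent and the periodic weight broadcast from the user-specified agent; the Keras-style Q-network and the data structures are simplified placeholders rather than the exact implementation.

```python
# Minimal sketch of the epsilon-greedy choice (steps 7-11) and the weight broadcast
# (step 15) of Algorithm 3; q_network is assumed to be a Keras-style model.
import numpy as np

def epsilon_greedy_action(q_network, state, num_actions, epsilon, rng=np.random.default_rng(0)):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))               # exploration
    q_values = q_network(state[np.newaxis, :]).numpy()[0]   # q(s, a; theta_{k,n}) for all actions
    return int(np.argmax(q_values))                         # exploitation

def broadcast_weights(user_agent_dqn, distributed_dqns):
    """Step 15: copy theta_k from the user-specified agent to its N distributed agents."""
    weights = user_agent_dqn.get_weights()
    for dqn in distributed_dqns:
        dqn.set_weights(weights)
```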

[5] B. Clerckx and C. Oestges, MIMO wireless networks: channels, techniques and standards for multi-antenna, multi-user
and multi-cell systems. Academic Press, 2013.
[6] W. Ding, F. Yang, C. Pan, L. Dai, and J. Song, “Compressive sensing based channel estimation for OFDM systems under
long delay channels,” IEEE Transactions on Broadcasting, vol. 60, no. 2, pp. 313–321, 2014.
[7] J. Meng, W. Yin, Y. Li, N. T. Nguyen, and Z. Han, “Compressive sensing based high-resolution channel estimation for
OFDM system,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 1, pp. 15–25, 2011.


[8] K. T. Truong and R. W. Heath, “Effects of channel aging in massive MIMO systems,” Journal of Communications and
Networks, vol. 15, no. 4, pp. 338–351, 2013.
[9] T. Ramya and S. Bhashyam, “Using delayed feedback for antenna selection in MIMO systems,” IEEE Transactions on
Wireless Communications, vol. 8, no. 12, pp. 6059–6067, 2009.
[10] A. K. Papazafeiropoulos, “Impact of general channel aging conditions on the downlink performance of massive MIMO,”
IEEE Transactions on Vehicular Technology, vol. 66, no. 2, pp. 1428–1442, 2016.
[11] N. Lee and R. W. Heath, “Space-time interference alignment and degree-of-freedom regions for the MISO broadcast
channel with periodic CSI feedback,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 515–528, 2013.
[12] N. Lee, R. Tandon, and R. W. Heath, “Distributed space–time interference alignment with moderately delayed CSIT,”
IEEE Transactions on Wireless Communications, vol. 14, no. 2, pp. 1048–1059, 2014.
[13] H. Yin, H. Wang, Y. Liu, and D. Gesbert, “Addressing the curse of mobility in massive MIMO with Prony-based angular-
delay domain channel predictions,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 12, pp. 2903–2917,
2020.
[14] A. Papazafeiropoulos and T. Ratnarajah, “Linear precoding for downlink massive MIMO with delayed CSIT and channel
prediction,” in 2014 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2014, pp. 809–914.
[15] C. Kong, C. Zhong, A. K. Papazafeiropoulos, M. Matthaiou, and Z. Zhang, “Sum-rate and power scaling of massive MIMO
systems with channel aging,” IEEE transactions on communications, vol. 63, no. 12, pp. 4879–4893, 2015.
[16] O. Dizdar, Y. Mao, and B. Clerckx, “Rate-Splitting Multiple Access to Mitigate the Curse of Mobility in (Massive) MIMO
Networks,” IEEE Transactions on Communications, vol. 69, no. 10, pp. 6765–6780, 2021.
[17] Y. Anzai, Pattern recognition and machine learning. Elsevier, 2012.
[18] M. Nerini, V. Rizzello, M. Joham, W. Utschick, and B. Clerckx, “Machine Learning-Based CSI Feedback With Variable
Length in FDD Massive MIMO,” arXiv preprint arXiv:2204.04723, 2022.
[19] N. Ma, K. Xu, X. Xia, C. Wei, Q. Su, M. Shen, and W. Xie, “Reinforcement learning-based dynamic anti-jamming power
control in uav networks: An effective jamming signal strength based approach,” IEEE Communications Letters, vol. 26,
no. 10, pp. 2355–2359, 2022.
[20] J. Kim, H. Lee, and S.-H. Park, “Learning Robust Beamforming for MISO Downlink Systems,” IEEE Communications
Letters, vol. 25, no. 6, pp. 1916–1920, 2021.
[21] T. Lin and Y. Zhu, “Beamforming design for large-scale antenna arrays using deep learning,” IEEE Wireless Communica-
tions Letters, vol. 9, no. 1, pp. 103–107, 2019.
[22] J. Yuan, H. Q. Ngo, and M. Matthaiou, “Machine learning-based channel prediction in massive MIMO with channel aging,”
IEEE Transactions on Wireless Communications, vol. 19, no. 5, pp. 2960–2973, 2020.
[23] H. Kim, S. Kim, H. Lee, C. Jang, Y. Choi, and J. Choi, “Massive MIMO channel prediction: Kalman filtering vs. machine
learning,” IEEE Transactions on Communications, vol. 69, no. 1, pp. 518–528, 2020.
[24] C. Wu, X. Yi, Y. Zhu, W. Wang, L. You, and X. Gao, “Channel prediction in high-mobility massive MIMO: From
spatio-temporal autoregression to deep learning,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp.
1915–1930, 2021.
[25] Z. Qin, H. Yin, Y. Cao, W. Li, and D. Gesbert, “A Partial Reciprocity-based Channel Prediction Framework for FDD
massive MIMO with High Mobility,” arXiv preprint arXiv:2202.05564, 2022.
[26] Y. Zhang, A. Alkhateeb, P. Madadi, J. Jeon, J. Cho, and C. Zhang, “Predicting Future CSI Feedback For Highly-Mobile
Massive MIMO Systems,” arXiv preprint arXiv:2202.02492, 2022.
[27] Y. S. Nasir and D. Guo, “Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,”
IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2239–2250, 2019.


[28] J. Ge, Y.-C. Liang, J. Joung, and S. Sun, “Deep reinforcement learning for distributed dynamic MISO downlink-
beamforming coordination,” IEEE Transactions on Communications, vol. 68, no. 10, pp. 6070–6085, 2020.
[29] L. Zhang and Y.-C. Liang, “Deep Reinforcement Learning for Multi-Agent Power Control in Heterogeneous Networks,”
IEEE Transactions on Wireless Communications, vol. 20, no. 4, pp. 2551–2564, 2020.
[30] C. Huang, R. Mo, and C. Yuen, “Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep
reinforcement learning,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 8, pp. 1839–1850, 2020.
[31] H. Yang, Z. Xiong, J. Zhao, D. Niyato, L. Xiao, and Q. Wu, “Deep reinforcement learning-based intelligent reflecting
surface for secure wireless communications,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 375–388,
2020.
[32] C. Huang, Z. Yang, G. C. Alexandropoulos, K. Xiong, L. Wei, C. Yuen, Z. Zhang, and M. Debbah, “Multi-hop RIS-
empowered terahertz communications: A DRL-based hybrid beamforming design,” IEEE Journal on Selected Areas in
Communications, vol. 39, no. 6, pp. 1663–1677, 2021.
[33] W. Li, W. Ni, H. Tian, and M. Hua, “Deep Reinforcement Learning for Energy-Efficient Beamforming Design in Cell-Free
Networks,” in 2021 IEEE Wireless Communications and Networking Conference Workshops (WCNCW). IEEE, 2021, pp.
1–6.
[34] R. Zhang, K. Xiong, Y. Lu, B. Gao, P. Fan, and K. B. Letaief, “Joint Coordinated Beamforming and Power Splitting Ratio
Optimization in MU-MISO SWIPT-Enabled HetNets: A Multi-Agent DDQN-Based Approach,” IEEE Journal on Selected
Areas in Communications, vol. 40, no. 2, pp. 677–693, 2021.
[35] H. Chen, Z. Zheng, X. Liang, Y. Liu, and Y. Zhao, “Beamforming in Multi-User MISO Cellular Networks with Deep
Reinforcement Learning,” in 2021 IEEE 93rd Vehicular Technology Conference (VTC2021-Spring). IEEE, 2021, pp. 1–5.
[36] Q. Hu, Y. Liu, Y. Cai, G. Yu, and Z. Ding, “Joint deep reinforcement learning and unfolding: Beam selection and precoding
for mmWave multiuser MIMO with lens arrays,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp.
2289–2304, 2021.
[37] H. Ren, C. Pan, L. Wang, W. Liu, Z. Kou, and K. Wang, “Long-Term CSI-based Design for RIS-Aided Multiuser MISO
Systems Exploiting Deep Reinforcement Learning,” IEEE Communications Letters, 2022.
[38] M. Fozi, A. R. Sharafat, and M. Bennis, “Fast MIMO Beamforming via Deep Reinforcement Learning for High Mobility
mmWave Connectivity,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 1, pp. 127–142, 2021.
[39] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement
learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp.
3133–3174, 2019.
[40] H. Sung, S.-R. Lee, and I. Lee, “Generalized channel inversion methods for multiuser MIMO systems,” IEEE Transactions
on Communications, vol. 57, no. 11, pp. 3489–3499, 2009.
[41] V. Raghavan, J. Cezanne, S. Subramanian, A. Sampath, and O. Koymen, “Beamforming tradeoffs for initial UE discovery
in millimeter-wave MIMO systems,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 3, pp. 543–559,
2016.
[42] T. Kim, D. J. Love, and B. Clerckx, “MIMO systems with limited rate differential feedback in slowly varying channels,”
IEEE Transactions on Communications, vol. 59, no. 4, pp. 1175–1189, 2011.
[43] Y.-C. Liang and F. P. S. Chin, “Downlink channel covariance matrix (DCCM) estimation and its applications in wireless
DS-CDMA systems,” IEEE Journal on Selected Areas in Communications, vol. 19, no. 2, pp. 222–232, 2001.
[44] W. Zou, Z. Cui, B. Li, Z. Zhou, and Y. Hu, “Beamforming codebook design and performance evaluation for 60GHz
wireless communication,” in 2011 11th International Symposium on Communications & Information Technologies (ISCIT).
IEEE, 2011, pp. 30–35.


[45] E. Björnson, E. G. Larsson, and T. L. Marzetta, “Massive MIMO: Ten myths and one critical question,” IEEE
Communications Magazine, vol. 54, no. 2, pp. 114–123, 2016.
[46] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, “The expressive power of neural networks: A view from the width,” Advances
in neural information processing systems, vol. 30, 2017.
[47] P. F. Smulders, “Statistical characterization of 60-GHz indoor radio channels,” IEEE Transactions on Antennas and
Propagation, vol. 57, no. 10, pp. 2820–2829, 2009.
