
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 68, NO. 8, AUGUST 2019 7957

Trajectory Design and Power Control for Multi-UAV Assisted Wireless Networks: A Machine Learning Approach

Xiao Liu, Student Member, IEEE, Yuanwei Liu, Senior Member, IEEE, Yue Chen, Senior Member, IEEE, and Lajos Hanzo, Fellow, IEEE

Abstract—A novel framework is proposed for the trajectory design of multiple unmanned aerial vehicles (UAVs) based on the prediction of users' mobility information. The problem of joint trajectory design and power control is formulated for maximizing the instantaneous sum transmit rate while satisfying the rate requirement of users. In an effort to solve this pertinent problem, a three-step approach is proposed, which is based on machine learning techniques to obtain both the position information of users and the trajectory design of UAVs. First, a multi-agent Q-learning-based placement algorithm is proposed for determining the optimal positions of the UAVs based on the initial location of the users. Second, in an effort to determine the mobility information of users based on a real dataset, their position data is collected from Twitter to describe the anonymous user-trajectories in the physical world. In the meantime, an echo state network (ESN) based prediction algorithm is proposed for predicting the future positions of users based on the real dataset. Third, a multi-agent Q-learning-based algorithm is conceived for predicting the position of UAVs in each time slot based on the movement of users. In this algorithm, multiple UAVs act as agents to find optimal actions by interacting with their environment and learning from their mistakes. Additionally, we also prove that the proposed multi-agent Q-learning-based trajectory design and power control algorithm can converge under mild conditions. Numerical results are provided to demonstrate that as the size of the reservoir increases, the proposed ESN approach improves the prediction accuracy. Finally, we demonstrate that throughput gains of about 17% are achieved.

Index Terms—Multi-agent Q-learning, power control, trajectory design, Twitter, unmanned aerial vehicle (UAV).

I. INTRODUCTION

A. Motivation

AS a benefit of their agility, as well as line-of-sight (LoS) propagation, unmanned aerial vehicles (UAVs) have received significant research interest as a means of mitigating a wide range of challenges in commercial and civilian applications [2], [3]. Future wireless communication systems are expected to meet unprecedented demands for high-quality wireless services, which imposes challenges on the conventional terrestrial communication networks, especially in traffic hotspots such as a football stadium or a rock concert [4]–[6]. Firstly, UAVs may be relied upon as aerial base stations to complement and/or support the existing terrestrial communication infrastructure [5], [7], [8], since they can be flexibly redeployed in temporary traffic hotspots or after natural disasters. Secondly, UAVs have also been deployed as relays between ground-based terminals and as aerial base stations for enhancing the link performance [9]. Thirdly, UAVs can also be used as aerial base stations to collect data from Internet of Things (IoT) devices on the ground, where building a complete cellular infrastructure is unaffordable [7], [10]. Fourthly, combined terrestrial and UAV communication networks are capable of substantially improving the reliability, security, coverage and throughput of the existing point-to-point UAV-to-ground communications [11].

Key examples of recent advances include the Google Loon project [12], Facebook's Internet-delivery drone [13], and the AT&T project of [14]. The drone manufacturing industry faces both opportunities and challenges in the design of UAV-assisted wireless networks. Before fully reaping all the aforementioned benefits, several technical challenges have to be tackled, including the optimal three-dimensional (3D) deployment of UAVs, their interference management [15], [16], energy supply [9], [17], trajectory design [18], the channel model between the UAV and users [19], [20], resource allocation [7], as well as the compatibility with the existing infrastructure.
Manuscript received April 3, 2019; accepted May 21, 2019. Date of publication May 31, 2019; date of current version August 13, 2019. The work of L. Hanzo was supported in part by the Engineering and Physical Sciences Research Council Projects EP/N004558/1, EP/P034284/1, and COALESCE of the Royal Society's Global Challenges Research Fund Grant and in part by the European Research Council's Advanced Fellow Grant QuantCom. The review of this paper was coordinated by Guest Editors of the Special Issue on Vehicle Connectivity and Automation using 5G. This paper was presented in part at the IEEE Global Communication Conference (GLOBECOM), Waikoloa, HI, USA, Dec. 9–13, 2019. (Corresponding author: Lajos Hanzo.)

X. Liu, Y. Liu, and Y. Chen are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, U.K. (e-mail: [email protected]; [email protected]; [email protected]).

L. Hanzo is with the School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, U.K. (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVT.2019.2920284

The wide use of online social networks over smartphones has accumulated a rich set of geographical data that describes the anonymous users' mobility information in the physical world [21]. Many social networking applications, such as Facebook, Twitter, WeChat and Weibo, allow users to 'check-in' and explicitly share their locations, while some other applications implicitly record the users' GPS coordinates [22], which holds the promise of estimating the geographic user distribution for improving the performance of the system. Reinforcement learning has seen increasing applications in next-generation wireless networks [23]. More importantly, reinforcement learning models may be trained by interacting with an environment (states),
0018-9545 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

and they can be expected to find the optimal behaviors (actions) of agents by exploring the environment in an iterative manner and by learning from their mistakes. The model is capable of monitoring the reward resulting from its actions and is chosen for solving problems in UAV-assisted wireless networks.

B. Related Works

1) Deployment of UAVs: Among all these challenges, the geographic UAV deployment problems are fundamental. Early research contributions have studied the deployment of a single UAV either to provide maximum radio coverage on the ground [24], [25] or to maximize the number of users served by using the minimum transmit power [26]. As the research evolved further, UAV-assisted systems have received significant attention and have been combined with other promising technologies. Specifically, the authors of [27]–[29] employed non-orthogonal multiple access (NOMA) for improving the performance of UAV-enabled communication systems, which is capable of outperforming orthogonal multiple access (OMA). In [30], UAV-aided D2D communication was investigated and the tradeoff between the coverage area and the time required for covering the entire target area (delay) by UAV-aided data acquisition was also analyzed. The authors of [13] proposed a framework using multiple static UAVs for maximizing the average data rate provided for users, while considering fairness amongst the users. The authors of [31] used sphere packing theory for determining the most appropriate 3D positions of the UAVs while jointly maximizing both the total coverage area and the battery operating period of the UAVs.

2) Trajectory Design of UAVs: It is intuitive that moving UAVs are capable of improving the coverage provided by static UAVs, yet the existing research has mainly considered the scenario in which users are static [10], [32]. Having said that, the authors of [33] jointly considered the UAV trajectory and transmit power optimization problem for maintaining fairness among users. An iterative algorithm was invoked for solving the resultant non-convex problem by applying the classic block coordinate descent and successive convex optimization techniques. In [17], the new design paradigm of jointly optimizing the communication throughput and the UAV's energy consumption was conceived for determining the trajectory of the UAV, including its initial/final locations and velocities, as well as its minimum/maximum speed and acceleration. In [10], a pair of practical UAV trajectories, namely the circular flight and the straight flight, were pursued for collecting a given amount of data from a ground terminal (GT) at a fixed location, while considering the associated energy dissipation tradeoff. By contrast, a novel cyclical trajectory was considered in [32] to serve each user via TDMA. As shown in [32], a significant throughput gain was achieved over a static UAV. In [34], a simple circular trajectory was used along with maximizing the minimum average throughput of all users. In addition to designing the UAV's trajectory for its action as an aerial base station, the authors of [35] studied a cellular-enabled UAV communication system, in which the UAV flew from an initial location to a final location, while maintaining a reliable wireless connection with the cellular network by associating the UAV with one of the ground base stations (GBSs) at each time instant. The design objective was to minimize the UAV's mission completion time by optimizing its trajectory.

C. Our New Contributions

The aforementioned research contributions considered the deployment and trajectory design of UAVs in the scenario that users are static, or studied the movement of UAVs based on the current user location information, where only the user location information of the current time slot is known. Studying the pre-deployment of UAVs based on the full user location information implicitly assumes that the position and mobility information of users is known or can be predicted. With this proviso, the flight trajectory of UAVs may be designed in advance for maintaining a high service quality and hence reducing the response time. Meanwhile, no interaction is needed between the UAVs and the ground control center after the pre-deployment of UAVs. To the best of our knowledge, this important problem is still unsolved.

Again, deploying UAVs as aerial BSs is able to provide reliable services for the users [35]. However, there is a paucity of research on the problem of 3D trajectory design of multiple UAVs based on the prediction of the users' mobility information, which motivates this treatise. More particularly, i) most existing research contributions mainly focus on the 2D placement of multiple UAVs or on the movement of a single UAV in the scenario where the users are static; ii) the prediction of the users' position and their mobility information based on a real dataset has never been considered, which helps us to design the trajectory of UAVs in advance, thus reducing both the response time and the interaction between the UAVs as well as the control center; iii) the transmit power of UAVs is controlled for obtaining a tradeoff between the received signal power and the interference power, which in turn increases the received signal-to-interference-plus-noise ratio (SINR). Therefore, we formulate the problem of joint trajectory design and power control of UAVs to improve the users' throughput, while satisfying the rate requirement of users. Against the above background, the primary contributions of this paper are as follows:

• We propose a novel framework for the trajectory design of multiple UAVs, in which the UAVs move around in a 3D space to offer down-link service to users. Based on the proposed model, we formulate a throughput maximization problem by designing the trajectory and power control of multiple UAVs.

• We develop a three-step approach for solving the proposed problem. More particularly, i) we propose a multi-agent Q-learning based placement algorithm for determining the initial deployment of UAVs; ii) we propose an echo state network based prediction algorithm for predicting the mobility of users; iii) we conceive a multi-agent Q-learning based trajectory-acquisition and power-control algorithm for UAVs.

• We invoke the ESN algorithm for acquiring the mobility information of users relying on a real dataset of users collected from Twitter, which consists of the GPS coordinates and recorded time stamps of tweets.

• We conceive a multi-agent Q-learning based solution for the joint trajectory design and power control problem of

TABLE I
LIST OF NOTATIONS

UAVs. In contrast to a single-agent Q-learning algorithm, the multi-agent Q-learning algorithm is capable of supporting the deployment of cooperative UAVs. We also demonstrate that the proposed algorithm is capable of converging to an optimal state.

D. Organization and Notations

The rest of the paper is organized as follows. In Section II, the problem formulation of joint trajectory design and power control of UAVs is presented. In Section III, the prediction of the users' mobility information is proposed, relying on the ESN algorithm. In Section IV, our multi-agent Q-learning based deployment algorithm is proposed for designing the trajectory and power control of UAVs. Our numerical results are presented in Section V, which is followed by our conclusions in Section VI. The list of notations is illustrated in Table I.

II. SYSTEM MODEL

We consider the downlink of UAV-assisted wireless communication networks. Multiple UAVs are deployed as aerial BSs to support the users in a particular area, where the terrestrial infrastructure was destroyed or had not been installed. The users are partitioned into N clusters and each user belongs to a single cluster. Users in this particular area are denoted as K = {K_1, . . . , K_N}, where K_n is the set of users that belong to the n-th cluster, n ∈ N = {1, 2, . . . , N}. Then, we have K_n ∩ K_{n'} = ∅, n ≠ n', ∀n, n' ∈ N, while K_n = |K_n| denotes the number of users in the n-th cluster. For any cluster n, n ∈ N, we consider a UAV-enabled FDMA system [36], where the UAVs are connected to the core network by satellite. At any time during the UAVs' working period of T_n, each UAV communicates simultaneously with multiple users by employing FDMA.

We assume that the energy of UAVs is supplied by laser charging, as detailed in [37]. A compact distributed laser charging (DLC) receiver can be mounted on a battery-powered off-the-shelf UAV for charging the UAV's battery. A DLC transmitter (termed as a power base station) on the ground is assumed to provide a laser based power supply for the UAVs. Since the DLC is capable of self-alignment and LoS propagation is usually available because of the high altitude of UAVs, the UAVs can be charged as long as they are flying within the DLC's coverage range. Thus, these DLC-equipped UAVs can operate for a long time without landing until maintenance is needed. The scenario in which the energy of UAVs is limited will be discussed in our future work, in which DLC will also be utilized.

A. Mobility Model

Since the users are able to move continuously during the flying period of UAVs, the UAVs have to travel based on the tele-traffic of users. Datasets can be collected to model the mobility of users. Again, in this work, the real-time position information of users is collected from Twitter by the Twitter API, where the data consists of the GPS coordinates and recorded time stamps. When users post tweets, their GPS coordinates are recorded, provided that they give their consent, for example in exchange for calling credits. The detailed discussion of the data collection process is in Section III. The mobility pattern of each user will then be used to determine the optimal location of each UAV, which will naturally impact the service quality of users. The coordinates of each user can be expressed as w_{k_n} = [x_{k_n}(t), y_{k_n}(t)]^T ∈ R^{2×1}, k_n ∈ K_n, where R^{M×1} denotes the M-dimensional real-valued vector space, while x_{k_n}(t) and y_{k_n}(t) are the X-coordinate and Y-coordinate of user k_n at time t, respectively.

Since the users are moving continuously, the location of the UAVs must be adjusted accordingly so as to efficiently serve them. The aim of the model is to design the trajectory of UAVs in advance according to the prediction of the users' movement. At any time slot during the UAVs' flight period, both the vertical trajectory (altitude) and the horizontal trajectory of the UAV can be adjusted to offer a high quality of service. The vertical trajectory is denoted by h_n(t) ∈ [h_min, h_max], 0 ≤ t ≤ T_n, while the horizontal one by q_n(t) = [x_n(t), y_n(t)]^T ∈ R^{2×1}, with 0 ≤ t ≤ T_n. The UAVs' operating period is discretized into N_T equal-length time slots.

B. Transmission Model

In our model, the downlink between the UAVs and users can be regarded as air-to-ground communication. The LoS condition and the non-line-of-sight (NLoS) condition are assumed to be encountered randomly. The LoS probability can be expressed as [13]

P_LoS(θ_{k_n}) = b_1 ( (180/π) θ_{k_n} − ζ )^{b_2},   (1)

where θ_{k_n}(t) = sin^{−1}( h_n(t) / d_{k_n}(t) ) is the elevation angle between the UAV and user k_n. Furthermore, b_1 and b_2 are constant values reflecting the environmental impact, while ζ is also a constant value which is determined both by the antenna and the environment. Naturally, the NLoS probability is given by P_NLoS = 1 − P_LoS.

Following the free-space path loss model, the channel's power gain between the UAV and user k_n at time instant t is given by

g_{k_n}(t) = K_0^{−1} d_{k_n}^{−α}(t) [ P_LoS μ_LoS + P_NLoS μ_NLoS ]^{−1},   (2)

where K_0 = (4π f_c / c)^2, α is the path loss exponent, μ_LoS and μ_NLoS are the attenuation factors of the LoS and NLoS links, f_c is the carrier frequency, and finally c is the speed of light.

The distance from UAV n to user k_n at time t is assumed to be a constant that can be expressed as

d_{k_n}(t) = sqrt( h_n^2(t) + [x_n(t) − x_{k_n}(t)]^2 + [y_n(t) − y_{k_n}(t)]^2 ).   (3)

Fig. 1. Deployment of multiple UAVs in wireless communications based on the mobility information of users.
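The air-to-ground link of (1)-(3) can be combined into a small numeric sketch. The environment constants b_1, b_2, ζ and the radio parameters (carrier frequency, path-loss exponent, attenuation factors) below are illustrative placeholders, not values taken from this paper:

```python
import math

C = 3.0e8  # speed of light [m/s]

def distance(uav_xyh, user_xy):
    # Eq. (3): 3D distance between UAV n at (x, y, h) and a ground user at (xu, yu)
    (x, y, h), (xu, yu) = uav_xyh, user_xy
    return math.sqrt(h * h + (x - xu) ** 2 + (y - yu) ** 2)

def p_los(h, d, b1=0.36, b2=0.21, zeta=0.0):
    # Eq. (1): P_LoS = b1 * ((180/pi) * theta - zeta)^b2 with theta = asin(h/d).
    # b1, b2, zeta are environment-dependent constants (illustrative values here).
    theta_deg = math.degrees(math.asin(h / d)) - zeta
    return min(max(b1 * theta_deg ** b2, 0.0), 1.0)  # clamp to a valid probability

def channel_gain(h, d, fc=2.0e9, alpha=2.0, mu_los=3.0, mu_nlos=23.0):
    # Eq. (2): g = K0^{-1} d^{-alpha} [P_LoS mu_LoS + P_NLoS mu_NLoS]^{-1},
    # with K0 = (4 pi fc / c)^2; mu_* are linear attenuation factors.
    k0 = (4.0 * math.pi * fc / C) ** 2
    p = p_los(h, d)
    return 1.0 / (k0 * d ** alpha * (p * mu_los + (1.0 - p) * mu_nlos))
```

For instance, a UAV hovering at h = 100 m above a user 50 m away horizontally sees an elevation angle of about 63 degrees; the resulting gain is what enters the SINR of (6).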
The transmit power of UAV n has to obey

0 ≤ P_n(t) ≤ P_max,   (4)

where P_max is the maximum allowed transmit power of the UAV. Then the transmit power allocated to user k_n at time t is p_{k_n}(t) = P_n(t)/|K_n|.

Lemma 1: In order to ensure that every user is capable of connecting to the UAV-assisted network, the lower bound for the transmit power of UAVs has to satisfy

P_max ≥ |K_n| μ_NLoS σ^2 K_0 ( 2^{|K_n| r_0 / B} − 1 ) · max{h_1, h_2, . . . , h_n}.   (5)

Proof: See Appendix A.

Lemma 1 sets out the lower bound of the UAV's transmit power for each user's rate requirement to be satisfied.

Remark 1: Since the users tend to roam continuously, the optimal position of the UAVs changes during each time slot. In this case, the UAVs may also move to offer a better service. When a particular user supported by UAV A moves closer to UAV B while leaving UAV A, the interference may be increased, hence reducing the received SINR, which emphasizes the importance of accurate power control.

Accordingly, the received SINR Γ_{k_n}(t) of user k_n connected to UAV n at time t can be expressed as

Γ_{k_n}(t) = p_{k_n}(t) g_{k_n}(t) / ( I_{k_n}(t) + σ^2 ),   (6)

where σ^2 = B_{k_n} N_0, with N_0 denoting the power spectral density of the additive white Gaussian noise (AWGN) at the receivers. Furthermore, I_{k_n}(t) = Σ_{n'≠n} p_{k_{n'}}(t) g_{k_{n'}}(t) is the interference imposed on user k_n at time t by the UAVs, except for UAV n. Then the instantaneous achievable rate of user k_n at time t, denoted by r_{k_n}(t) and expressed in bps/Hz, becomes

r_{k_n}(t) = B_{k_n} log_2( 1 + p_{k_n}(t) g_{k_n}(t) / ( I_{k_n}(t) + σ^2 ) ).   (7)

The overall achievable sum rate at time t can be expressed as

R_sum = Σ_{n=1}^{N} Σ_{k_n=1}^{K_n} r_{k_n}(t).   (8)

C. Problem Formulation

Let P = {p_{k_n}(t), k_n ∈ C_n, 0 ≤ t ≤ T_n}, Q = {q_n(t), 0 ≤ t ≤ T_n} and H = {h_n(t), 0 ≤ t ≤ T_n}. Again, we aim for determining both the UAV trajectory and the transmit power control at each time slot, i.e., {P_1(t), P_2(t), . . . , P_N(t)} and {x_n(t), y_n(t), h_n(t)}, n = 1, 2, . . . , N, t = 0, 1, . . . , T_n, for maximizing the total transmit rate, while satisfying the rate requirement of each user.

Let us assume that each user's minimum rate requirement r_0 has to be satisfied. This means that all users must have a capacity higher than the rate r_0. Our optimization problem is then formulated as

max_{C,P,Q,H}  R_sum = Σ_{n=1}^{N} Σ_{k_n=1}^{K_n} r_{k_n}(t)   (9a)

s.t.  K_n ∩ K_{n'} = ∅, n ≠ n', ∀n, n' ∈ N,   (9b)

      h_min ≤ h_n(t) ≤ h_max, 0 ≤ t ≤ T_n,   (9c)

      r_{k_n}(t) ≥ r_0, ∀k_n, t,   (9d)

      0 ≤ P_n(t) ≤ P_max, ∀n, t,   (9e)

where K_n is the set of users that belong to cluster n, h_n(t) is the altitude of UAV n at time slot t, while P_n(t) is the total transmit power of UAV n assigned to all users supported by it at time slot t. Furthermore, (9b) indicates that each user belongs to a specific cluster which is covered by a single UAV; (9c) formulates the altitude bound of the UAVs; (9d) qualifies the rate requirement of each user; (9e) represents the power control constraint of the UAVs. Here we note that designing the trajectory of the UAVs will ensure that they are in the optimal position at each time slot. This, in turn, will lead to improving the instantaneous transmit rate. Meanwhile, designing the trajectory of UAVs in advance based on the prediction of the users' mobility will also reduce the

response time of UAVs, while also reducing the interactions among the UAVs and the ground control center. Fig. 2 summarizes the framework proposed for solving the problem considered. Given this framework, we utilize the ESN-based predictions of the users' movement.

Fig. 2. The procedure and algorithms used for solving the joint problem of trajectory design and power control of UAVs.

Fig. 3. The initial positions of the users derived from Twitter.

Remark 2: The instantaneous transmit rate depends on the transmit power, on the number of UAVs, and on the location of the UAVs (horizontal position and altitude).

Problem (9a) is challenging, since the objective function is non-convex as a function of x_n(t), y_n(t) and h_n(t) [7], [17]. Indeed, it has been shown that problem (9a) is NP-hard even if we only consider the users' clustering [38]. Exhaustive search exhibits an excessive complexity. In order to solve this problem at a low complexity, a multi-agent Q-learning algorithm will be invoked in Section IV for finding the optimal solution with a high probability, despite searching through only a small fraction of the entire design-space.

III. ECHO STATE NETWORK ALGORITHM FOR PREDICTION OF USERS' MOVEMENT

In this section, we formulate our ESN algorithm for predicting the movement of users. A variety of mobility models have been utilized in [39], [40]. However, in these mobility models, the direction of each user's movement tends to be uniformly distributed among left, right, forward and backward, which does not fully reflect the real movement of users. In this section, we tackle this problem by predicting the mobility of users based on a real dataset collected from Twitter.

A. Data Collection of Users

In order to obtain real mobility information, the relevant position data has to be collected. Serendipitously, the wide use of online social network (OSN) apps over smartphones has accumulated a rich set of geographical data that describes anonymous user trajectories in the physical world, which holds the promise of providing a lightweight means of studying the mobility of users. For example, many social networking applications like Facebook and Weibo allow users to 'check-in' and explicitly show their locations. Some other applications implicitly record the users' GPS coordinates [22].

The users' locations can be predicted by mining data from social networks, given that the observed movement is associated with certain reference locations. One of the most effective methods of collecting position information relies on the Twitter API. When Twitter users tweet, their GPS-related position information is recorded by the Twitter API and it becomes available to the general public. We relied on 12 000 tweets collected near Oxford Street, London, on the 14th of March 2018.1 Among these Twitter users, 50 users who tweeted more than 3 times were encountered. In this case, the movement of these 50 users is recorded. Fig. 3 illustrates the distribution of these 50 users at the initial time of collecting data. In an effort to obtain more information about a user to characterise the movement more specifically, classic interpolation methods were used to make sure that the position information of each user was recorded every 200 seconds. In this case, the trajectory of each user during this period was obtained. The positions of the users during the n-th time slot can be expressed as u(n) = [u_1(n), u_2(n), . . . , u_{N_u}(n)]^T, where N_u is the total number of users.

1 The dataset has been shared by the authors on GitHub: https://github.com/pswf/Twitter-Dataset/blob/master/Dataset. Our approach can accommodate other datasets without loss of generality.

B. Echo State Network Algorithm for the Prediction of Users' Movement

The ESN model's input is the position vector of users collected from Twitter, namely u(n) = [u_1(n), u_2(n), . . . , u_{N_u}(n)]^T, while its output vector is the position information of the users predicted by the ESN algorithm, namely y(n) = [y_1(n), y_2(n), . . . , y_{N_u}(n)]^T. For each different user, the ESN model is initialized before it imports the new inputs. As illustrated in Fig. 4, the ESN model essentially consists of three layers: the input layer, the neuron reservoir and the output layer [40]. The matrices W^in and W^out represent the connections between these three layers. W is another matrix that represents the connections between the neurons in the neuron reservoir. Every segment is fixed once the whole network is established, except W^out, which is the only trainable part of the network.
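The three-layer structure just described, with fixed random W^in and W and a trainable readout W^out, can be sketched as a standard leaky-integrator ESN. The sizes, leakage rate and spectral radius below are illustrative assumptions, not the paper's settings; the update form follows Eqs. (11)-(13):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_esn(n_in, n_x, n_out, sparsity=0.9, spectral_radius=0.9):
    """W_in and W are randomly generated and then frozen;
    only W_out is trainable (initialized to zero here)."""
    w_in = rng.uniform(-0.5, 0.5, (n_x, 1 + n_in))
    w = rng.uniform(-0.5, 0.5, (n_x, n_x))
    w[rng.random((n_x, n_x)) < sparsity] = 0.0  # sparsely connected reservoir
    rho = max(abs(np.linalg.eigvals(w)))
    if rho > 0:
        w *= spectral_radius / rho  # rescale the spectral radius of W
    w_out = np.zeros((n_out, 1 + n_in + n_x))  # trainable readout
    return w_in, w, w_out

def esn_step(u, x, w_in, w, leak=0.3):
    """Leaky reservoir update in the form of Eqs. (11)-(12)."""
    x_tilde = np.tanh(w_in @ np.concatenate(([1.0], u)) + w @ x)
    return (1.0 - leak) * x + leak * x_tilde

def esn_output(u, x, w_out):
    """Linear readout over [bias; input; state], matching the stated
    W_out dimension N_y x (1 + N_u + N_x) of Eq. (13)."""
    return w_out @ np.concatenate(([1.0], u, x))
```

In use, each resampled user trajectory would be fed step by step through `esn_step`, and `w_out` would be fitted (e.g., by least squares) to minimize the MSE of Eq. (10).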

Fig. 4. The structure of the Echo State Network for predicting the mobility of the users.

Fig. 5. The structure of multi-agent Q-learning for the trajectory design and power control of the UAVs.

Algorithm 1: ESN Algorithm for Predicting Users' Movement.
Input: 75% of the dataset for the training process, 25% of the dataset for the testing process.
1: Initialize: W_j^{a,in}, W_j^a, W_j^{a,out}, y_i = 0.
2: Training stage:
3: for i from 0 to N_u do
4:   for n from 0 to N_x do
5:     Compute the update equations according to Eq. (12).
6:     Update the network outputs according to Eq. (13).
7:   end for
8: end for
9: Prediction stage:
10: Get the prediction of the users' mobility information based on the output weight matrix W^out.
Return: Predicted coordinates of users.

Algorithm 2: The Proposed Multi-agent Q-learning Algorithm for Deployment of Multiple UAVs.
1: Let t = 0, Q_n^0(s_n, a_n) = 0 for all s_n and a_n
2: Initialize: the starting state s^t
3: Loop:
4:   send Q_n^t(s_n^t, :) to all other cooperating agents j
5:   receive Q_j^t(s_j^t, :) from all other cooperating agents j
6:   if random < ε then
7:     select an action randomly
8:   else
9:     choose action: a_n^t = arg max_a Σ_{1≤j≤N} Q_j^t(s_j^t, a)
10:  receive reward r_n^t
11:  observe the next state s_n^{t+1}
12:  update the Q-table as
       Q_n(s_n, a_n) ← (1 − α)Q_n(s_n, a_n) + α[ r_n(s_n, a_n) + β max_{b∈A_n} Q_n(s'_n, b) ]
13:  s_n^t = s_n^{t+1}
14: end loop

The classic mean square error (MSE) metric is invoked for evaluating the prediction accuracy [40]:

MSE(y, y^target) = (1/N_u) Σ_{n=1}^{N_u} (1/T) Σ_{i=1}^{T} [ y_i(n) − y_i^target(n) ]^2,   (10)

where y and y^target are the predicted and the real positions of the users, respectively.

Remark 3: The aim of the ESN algorithm is to train a model with the aid of its input and output to minimize the MSE.

The neuron reservoir is a sparse network, which consists of sparsely connected neurons, having a short-term memory of the previous states encountered. In the neuron reservoir, the typical update equations are given by

x̃(n) = tanh( W^in [0; u(n)] + W · x(n−1) ),   (11)

x(n) = (1 − α)x(n−1) + α x̃(n),   (12)

where x(n) ∈ R^{N_x} is the updated version of the variable x̃(n), N_x is the size of the neuron reservoir, α is the leakage rate, while tanh(·) is the activation function of the neurons in the reservoir. Additionally, W^in ∈ R^{N_x·(1+N_u)} and W ∈ R^{N_x·N_x} are the input and the recurrent weight matrices, respectively. The input matrix W^in and the recurrent connection matrix W are randomly generated, while the leakage rate α is from the interval [0, 1).

After the data echoes in the pool, it flows to the output layer, which is characterized as

y(n) = W^out [0; x(n)],   (13)

where y(n) ∈ R^{N_y} represents the network outputs, while W^out ∈ R^{N_y·(1+N_u+N_x)} is the weight matrix of the outputs.

The neuron reservoir is determined by four parameters: the size of the pool, its sparsity, the distribution of its nonzero elements and the spectral radius of W.

• Size of the Neuron Reservoir N_x: represents the number of neurons in the reservoir, which is the most crucial parameter of the ESN algorithm. The larger N_x, the more precise

The neuron reservoir is determined by four parameters: the size of the pool, its sparsity, the distribution of its nonzero elements and the spectral radius of W.

• Size of Neuron Reservoir N_x: N_x represents the number of neurons in the reservoir, which is the most crucial parameter of the ESN algorithm. The larger N_x is, the more precise the prediction becomes, but at the same time the probability of overfitting increases.

• Sparsity: Sparsity characterizes the density of the connections between neurons in the reservoir. When the density is reduced, the non-linear closing capability is increased, whilst the operation becomes more complex.

• Distribution of Nonzero Elements: The matrix W is typically a sparse one, representing a network which has normally distributed elements centered around zero. In this paper, we use a continuous-valued bounded uniform distribution, which provides an excellent performance [41], outperforming many other distributions.

• Spectral Radius of W: The spectral radius scales the matrix W and hence also the variance of its nonzero elements. This parameter is fixed once the neuron reservoir is established.

Remark 4: The size of the neuron reservoir has to be carefully chosen to satisfy the memory constraint, but N_x should also be at least equal to the number of independent real values the reservoir has to remember from the input in order to solve the task.

A larger memory capacity implies that the ESN model is capable of storing more of the locations that the users have visited, which tends to improve the prediction accuracy of the users' movements. In the ESN model, typically 75% of the dataset is used for training and 25% for the testing process.

Remark 5: For challenging tasks, as large a neuron reservoir has to be used as one can computationally afford.

IV. JOINT TRAJECTORY DESIGN AND TRANSMIT POWER CONTROL OF UAVS

In this section, we assume that in any cluster n, the UAV serves the users relying on an adaptively controlled flight trajectory and transmit power, with the goal of maximizing the sum transmit rate in each time slot by determining the flight trajectory and transmit power of the UAVs. User clustering constitutes the first step of establishing the association between the UAVs and the users. The users are partitioned into different clusters, and each cluster is served by a single UAV. The process of cell partitioning has been discussed in our previous work [38], which has demonstrated that the genetic K-means (GAK-means) algorithm is capable of obtaining globally optimal clustering results. The process of clustering is also detailed in [38], hence it will not be elaborated on here.

A. Single-Agent Q-learning Algorithm

In this section, a multi-agent Q-learning algorithm is invoked for obtaining the movement of the UAVs. Before introducing the multi-agent Q-learning algorithm, the single-agent Q-learning algorithm is introduced as its theoretical basis. In the single-agent model, each UAV acts as an agent, moving without cooperating with the other UAVs. In this case, the geographic positioning of each UAV is not affected by the movement of the other UAVs. The single-agent Q-learning model relies on four core elements: the states, actions, rewards and Q-values. The aim of this algorithm is that of conceiving a policy (a set of actions to be carried out by the agent) that maximizes the rewards observed during the interaction time of the agent. During the iterations, the agent observes a state s_t, in each time slot t, from the state space S. Accordingly, the agent carries out an action a_t, from the action space A, selecting its specific flying direction and transmit power based on the policy J. The decision policy J is determined by a Q-table Q(s_t, a_t). The policy promotes choosing the specific actions which enable the model to attain the maximum Q-values. Following each action, the state of the agent traverses to a new state s_{t+1}, while the agent receives a reward, r_t, which is determined by the instantaneous sum rate of the users, as formulated in (14):

r_n(t) = { Σ_{k_n=1}^{K_n} B_{k_n} log2( 1 + p_{k_n}(t) g_{k_n}(t) / (I_{k_n}(t) + σ²) ),  Sumrate_new ≥ Sumrate_old;
           0,  Sumrate_new < Sumrate_old.   (14)

B. State-Action Construction of the Multi-Agent Q-learning Algorithm

In the multi-agent Q-learning model, each agent has to keep a Q-table that includes data both about its own states as well as about the other agents' states and actions. More explicitly, it takes account of the other agents' actions with the goal of promoting cooperative actions among the agents so as to glean the highest possible rewards.

In the multi-agent Q-learning model, the individual agents are represented by a four-tuple state: ξ_n = (x_UAV^(n), y_UAV^(n), h_UAV^(n), P_UAV^(n)), where (x_UAV^(n), y_UAV^(n)) is the horizontal position of UAV n, while h_UAV^(n) and P_UAV^(n) are the altitude and the transmit power of UAV n, respectively. Since the UAVs operate across a particular area, the corresponding state space is denoted as: x_UAV^(n) ∈ {0, 1, . . . , X_d}, y_UAV^(n) ∈ {0, 1, . . . , Y_d}, h_UAV^(n) ∈ {h_min, . . . , h_max}, P_UAV^(n) ∈ {0, . . . , P_max}, where X_d and Y_d represent the maximum coordinates of this particular area, while h_min and h_max are the lower and upper altitude bounds of the UAVs, respectively. Finally, P_max is the maximum transmit power derived from Lemma 1.

We assume that the initial state of the UAVs is determined randomly. Then the convergence of the algorithm is determined by the number of users and UAVs, as well as by the initial positions of the UAVs. A faster convergence is attained when the UAVs are placed closer to their respective optimal positions.

At each step, each UAV carries out an action a_t ∈ A, which includes choosing a specific direction and transmit power level, depending on its current state, s_t ∈ S, based on the decision policy J. The UAVs may fly in arbitrary directions (with different angles), which makes the problem non-trivial to solve.
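The single-agent update and the reward rule of (14) can be sketched in a few lines. The grid-world state encoding and the closed-form stand-in for the users' sum rate below are illustrative assumptions (the paper computes the sum rate from the SINR model); the reward, however, follows the structure of (14): the attained sum rate when it improves on the previous one, and 0 otherwise:

```python
import random

random.seed(1)
ALPHA, BETA, EPS = 0.8, 0.9, 0.1           # learning rate, discount factor, exploration
ACTIONS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1), (0,0,0)]

def sum_rate(state):
    """Toy stand-in for the users' instantaneous sum rate; peaks near (5, 5)."""
    x, y, h = state
    return max(0.0, 100.0 - ((x - 5) ** 2 + (y - 5) ** 2) - 0.1 * h)

Q = {}                                      # tabular Q(s, a), default 0
state, prev_rate = (0, 0, 2), float("-inf")
for t in range(2000):
    # epsilon-greedy action selection based on the Q-table (the decision policy J)
    if random.random() < EPS:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda act: Q.get((state, act), 0.0))
    nxt = tuple(min(9, max(0, s + d)) for s, d in zip(state, a))  # stay on the grid
    new_rate = sum_rate(nxt)
    # reward rule patterned after (14): the sum rate if it improved, otherwise 0
    r = new_rate if new_rate >= prev_rate else 0.0
    best_next = max(Q.get((nxt, b), 0.0) for b in ACTIONS)
    Q[(state, a)] = (1 - ALPHA) * Q.get((state, a), 0.0) + ALPHA * (r + BETA * best_next)
    state, prev_rate = nxt, new_rate
```

The multi-agent extension described next differs mainly in that each agent's reward depends on the joint action of all agents.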
However, by assuming that the UAVs fly at a constant velocity and obey coordinated turns, the model may be simplified to as few as 7 directions (left, right, forward, backward, upward, downward and remaining static). The number of directions has to be appropriately chosen in practice to strike a tradeoff between the accuracy and the algorithmic complexity. Additionally, we assume that the transmit power of the UAVs only takes 3 values, namely 0.08 W, 0.09 W and 0.1 W.²

Remark 6: In real applications of UAVs as aerial base stations, they can fly in arbitrary directions, but we constrain their mobility to as few as 7 directions.

We choose the 3D positions of the UAVs (horizontal coordinates and altitudes) and the transmit power to define their states. The actions of each agent are determined by a set of coordinates specifying the travel directions and the candidate transmit power of the UAVs.³ Explicitly, (1, 0, 0) means that the UAV turns right; (−1, 0, 0) indicates that the UAV turns left; (0, 1, 0) represents that the UAV flies forward; (0, −1, 0) means that the UAV flies backward; (0, 0, 1) implies that the UAV rises; (0, 0, −1) means that the UAV descends; and (0, 0, 0) indicates that the UAV stays static. In terms of power, we assume the levels 0.08 W, 0.09 W and 0.1 W. Again, we set the initial transmit power to 0.08 W, and each UAV carries out an action from the set {increase, decrease, maintain} at each time slot. Then, the entire action space has as few as 3 × 7 = 21 elements.

C. Reward Function of the Multi-Agent Q-learning Algorithm

One of the main limitations of reinforcement learning is its slow convergence. The beneficial design of the reward function requires a sophisticated methodology for accelerating the convergence to the optimal solution [42]. In the multi-agent Q-learning model, each agent receives the same reward or punishment. The reward function is directly related to the instantaneous sum rate of the users. When a UAV carries out an action at time instant t and this action improves the sum rate, then the UAV receives a reward, and vice versa. The global reward function is formulated in (14).

Remark 7: Altering the value of the reward does not change the final result of the algorithm, but its convergence rate is indeed influenced. A continuous reward function is capable of faster convergence than a binary reward function [42].

D. Transition of the Multi-Agent Q-learning Algorithm

In this part, we extend the model from single-agent Q-learning to multi-agent Q-learning. First, we redefine the Q-values for the multi-agent model, and then present the algorithm conceived for learning the Q-values.

To adapt the single-agent model to the multi-agent context, the first step is that of recognizing joint actions, rather than merely carrying out individual actions. For an N-agent system, the Q-function of any individual agent is Q(s, a_1, . . . , a_N), rather than the single-agent Q-function, Q(s, a). Given the extended notion of the Q-function, we define the Q-value as the expected sum of discounted rewards when all agents follow specific strategies from the next period onwards. This definition differs from the single-agent model, where the future rewards are simply based on the agent's own optimal strategy. More precisely, we refer to Q_n^∗ as the Q-function of agent n.

Remark 8: The difference of the multi-agent model compared to the single-agent model is that the reward function of the multi-agent model is dependent on the joint action →a of all agents.

Sparked by Remark 8, the update rule has to obey

Q_n(s_n, a_n) ← (1 − α) Q_n(s_n, a_n) + α [ r_n(s_n, →a) + β max_{b∈A_n} Q_n(s'_n, b) ].   (15)

The nth agent shares the row of its Q-table that corresponds to its current state with all the other cooperating agents j, j = 1, . . . , N. Then the nth agent selects its action according to

a_n^t = arg max_a ( Σ_{1≤j≤N} Q_j^t (s_j^t, a) ).   (16)

In order to carry out multi-agent training, we train one agent at a time, and keep the policies of all the other agents fixed during this period.

The main idea behind this strategy relies on the global Q-value Q(s, a), which represents the Q-value of the whole model. This global Q-value can be decomposed into a linear combination of local agent-dependent Q-values as follows: Q(s, a) = Σ_{1≤j≤N} Q_j(s_j, a_j). Thus, if each agent j maximizes its own Q-value, the global Q-value will be maximized.

The transition from the current state s_t to the state s_{t+1} of the next time slot with reward r_t when action a_t is taken can be characterized by the conditional transition probability p(s_{t+1}, r_t | s_t, a_t). The goal of learning is that of maximizing the gain defined as the expected cumulative discounted reward

G_t = E[ Σ_{n=0}^{∞} β^n r_{t+n} ],   (17)

where β is the discount factor. The model relies on the learning rate α, the discount factor β and a greedy policy J associated with the probability ε of taking exploratory actions. The learning process is divided into episodes, and the UAVs' states are re-initialized at the beginning of each episode. At each time slot, each UAV has to find the optimal action for the objective function.

Theorem 1: The multi-agent Q-learning process MQ_{k+1} converges to an optimal state MQ^∗[k + 1], where k is the episode time.

Proof: See Appendix B.

² In this paper, the proposed algorithm can accommodate an arbitrary number of power levels without loss of generality. We choose three power levels to strike a tradeoff between the performance and the complexity of the system.

³ In our future work, we will consider the online design of the UAVs' trajectories, and the mobility of the UAVs will be constrained to 360 degrees of angles instead of 7 directions. Given that the state-action space is then huge, a deep multi-agent Q-network based algorithm will be proposed in our future work.
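The scheme of Sections IV-B through IV-D — a per-agent Q-table over (position, power) states, a 7 × 3 = 21-element action space, update rule (15), and joint action selection by maximizing the sum of the shared Q-rows as in (16) — can be sketched as follows. The grid bounds, the toy global reward and the exploration rate are illustrative assumptions, not the paper's SINR-based configuration:

```python
import itertools
import random

random.seed(0)
N = 2                                               # number of UAV agents
DIRS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1), (0,0,0)]
POWER = [-1, 0, 1]                                  # decrease / maintain / increase
ACTIONS = list(itertools.product(DIRS, POWER))      # 7 x 3 = 21 actions
ALPHA, BETA = 0.8, 0.9

Q = [dict() for _ in range(N)]                      # one Q-table per agent
states = [(0, 0, 2, 1) for _ in range(N)]           # (x, y, h, power-level index)

def shared_argmax(cur):
    """Eq. (16): pick the action maximizing the sum of the agents' shared Q-rows."""
    return max(ACTIONS, key=lambda a: sum(Q[j].get((cur[j], a), 0.0) for j in range(N)))

def step(s, a):
    """Apply a (direction, power-change) action, clamped to the operating area."""
    (dx, dy, dh), dp = a
    x, y, h, p = s
    return (min(9, max(0, x + dx)), min(9, max(0, y + dy)),
            min(9, max(0, h + dh)), min(2, max(0, p + dp)))

def reward(all_states):
    """Toy global reward shared by all agents (stand-in for the users' sum rate)."""
    return -sum((x - 5) ** 2 + (y - 5) ** 2 for x, y, h, p in all_states)

for episode in range(500):
    a = shared_argmax(states) if random.random() > 0.2 else random.choice(ACTIONS)
    n = episode % N                  # train one agent at a time, others kept fixed
    nxt = states[:]
    nxt[n] = step(states[n], a)
    r = reward(nxt)
    best = max(Q[n].get((nxt[n], b), 0.0) for b in ACTIONS)
    Q[n][(states[n], a)] = (1 - ALPHA) * Q[n].get((states[n], a), 0.0) \
                           + ALPHA * (r + BETA * best)
    states = nxt
```

Because every agent maximizes its own local Q-value while sharing Q-rows, the decomposition Q(s, a) = Σ_j Q_j(s_j, a_j) drives the global Q-value toward its maximum.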
Fig. 6. Comparison of real tracks and predicted tracks for different neuron reservoir sizes.

TABLE II: SIMULATION PARAMETERS

TABLE III: PERFORMANCE COMPARISON BETWEEN THE ESN ALGORITHM AND THE BENCHMARKS

Fig. 7. Convergence of the proposed algorithm vs. the number of training episodes.

Fig. 8. Comparison between the movement and static scenarios in terms of throughput.

E. Complexity of the Algorithm

The complexity of the algorithm has two main contributors, namely the complexity of the GAK-means based clustering algorithm and that of the multi-agent Q-learning based trajectory-acquisition and power control algorithm. In terms of the first one, the proposed scheme involves three steps during each iteration. The first stage calculates the Euclidean distance between each user and the cluster centers. For N_u users and N clusters, calculating all Euclidean distances requires on the order of O(6N N_u) floating-point operations. The second stage allocates each user to the specific cluster having the closest center, which requires O[N_u(N − 1)] comparisons. Furthermore, the complexity of recalculating the cluster centers is O(4N N_u). Therefore, the total computational complexity of GAK-means clustering is on the order of O[6N N_u + N_u(N − 1) + 4N N_u] ≈ O(N N_u).

In the multi-agent Q-learning model, the learning agent has to handle N Q-functions, one for each agent in the model. These Q-functions are handled internally by the learning agent, assuming that it can observe the other agents' actions and rewards. The learning agent updates Q_1, . . . , Q_N, where each Q_n, n = 1, . . . , N, is constructed of Q_n(s, a_1, . . . , a_N) for all s, a_1, . . . , a_N. Assume |A_1| = · · · = |A_N| = |A|, where |S| is the number of states and |A_n| is the size of agent n's action space A_n. Then, the total number of entries in Q_n is |S| · |A|^N. Finally, the total storage space requirement is N|S| · |A|^N. Therefore, the space size of the model increases linearly with the number of states, polynomially with the number of actions, but exponentially with the number of agents.
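The N|S| · |A|^N storage requirement can be checked numerically; the state and action counts below are illustrative (e.g., a 10 × 10 × 10 grid and the 21-element action space):

```python
def q_table_entries(num_states, num_actions, num_agents):
    """Total storage for the multi-agent model: N * |S| * |A|**N entries."""
    return num_agents * num_states * num_actions ** num_agents

S, A = 1000, 21                      # illustrative |S| and |A|
for N in (1, 2, 3, 4):
    print(N, q_table_entries(S, A, N))
```

Even for these modest |S| and |A|, moving from one agent to four grows the table from 21,000 to roughly 7.8 × 10^8 entries, which is exactly the exponential blow-up in the number of agents noted above.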
Fig. 9. Positions of the users and the UAVs, as well as the trajectory design of the UAVs both with and without power control.

Fig. 10. Trajectory design of one of the UAVs on Google Maps.

V. NUMERICAL RESULTS

Our simulation parameters are given in Table II. The initial locations of the UAVs are randomized. The maximum transmit power of each UAV is the same, and the transmit power is uniformly allocated to the users. On this basis, we analyze the instantaneous transmit rate of the users, the position prediction of the users, and the 3D trajectory design and power control of the UAVs.

A. Predicted Users' Positions

Fig. 6 characterizes the prediction accuracy of a user's position parameterized by the reservoir size. It can be observed that increasing the reservoir size of the ESN algorithm reduces the error between the real tracks and the predicted tracks. Again, the larger the neuron reservoir size, the more precise the prediction becomes, but the probability of overfitting is also increased. This is due to the fact that the size of the ESN reservoir directly affects the ESN's memory capacity, which in turn directly affects the number of user positions that the ESN algorithm is capable of recording. When the neuron reservoir size is 1000, a high accuracy is attained.

Table III characterizes the performance of the proposed ESN model. The so-called historical average (HA) model and the long short-term memory (LSTM) model are also used as our benchmarks. It can be observed that the ESN having a neuron reservoir size of 1000 attains a lower MSE than the HA model and the LSTM model, even though the complexity of the ESN model is far lower than that of the LSTM model. Overall, the proposed ESN algorithm outperforms the benchmarks.

B. Trajectory Design and Power Control of UAVs

Fig. 7 characterizes the throughput vs. the number of training episodes. It can be observed that the UAVs are capable of carrying out their actions in an iterative manner and learn from their mistakes to improve the throughput. When three UAVs are employed, convergence is achieved after about 45,000 episodes, whilst 30,000 more training episodes are required for convergence when the number of UAVs is four. Additionally, the learning rate of 0.80 used for the multi-agent Q-learning model outperforms those of 0.60 and 0.70 in terms of the throughput. Although the model relying on a learning rate of 0.90 converges faster than the other models, it is more likely to converge to a sub-optimal Q∗ value, which leads to a lower throughput.

Fig. 8 characterizes the throughput relying on the movement derived from multi-agent Q-learning. The throughput of the scenario in which the users remain static and the throughput relying on the movement derived by GAK-means are also illustrated as benchmarks. It can be observed that the instantaneous transmit rate decreases as time elapses. This is because the users are roaming during each time slot. At the initial time slot, the users (namely the people who tweet) are flocking together around Oxford Street in London, but after a few hundred seconds, some of the users move away from Oxford Street. In this case, the density of the users is reduced, which affects the instantaneous sum transmit rate. It can also be observed that re-deploying the UAVs based on the movement of the users is an effective method of mitigating this downward trend compared to the static scenario. Fig. 8 also illustrates that the movement of the UAVs relying on power control is more capable of maintaining a high-quality service than the mobility scenario operating without power control.
Additionally, it also demonstrates that the proposed multi-agent Q-learning based trajectory-acquisition and power control algorithm outperforms the GAK-means algorithm, which is also used as a benchmark.

Fig. 9 characterizes the designed 3D trajectory of one of the UAVs both in the scenario of moving with power control and in its counterpart operating without power control. Compared to only considering the trajectory design of the UAVs, jointly considering both the trajectory design and the power control results in different trajectories for the UAVs. However, the main flying direction of the UAVs remains the same. This is because the interference is also considered in our model, and the power control of the UAVs is capable of striking a tradeoff between increasing the received signal power and the interference power, which in turn increases the received SINR.

Fig. 10 characterizes the trajectory designed for one of the UAVs on Google Maps. The trajectory consists of 16 hovering points, where each UAV stops for about 200 seconds. The number of hovering points has to be appropriately chosen based on the specific requirements of the real scenario. Meanwhile, the trajectory of the UAVs may be designed in advance on the map with the aid of predicting the users' movements. In this case, the UAVs are capable of obeying a beneficial trajectory for maintaining a high quality of service without extra interaction from the ground control center.

VI. CONCLUSION

The trajectory design and power control of multiple UAVs were jointly conceived for maintaining a high quality of service. Three steps were taken for tackling the formulated problem. More particularly, firstly, a multi-agent Q-learning based placement algorithm was proposed for deploying the UAVs at the initial time slot. Secondly, a real dataset was collected from Twitter for representing the users' position information, and an ESN based prediction algorithm was proposed for predicting the future positions of the users. Thirdly, a multi-agent Q-learning based trajectory-acquisition and power-control algorithm was conceived for determining both the position and the transmit power of the UAVs at each time slot. It was demonstrated that the proposed ESN algorithm is capable of predicting the movement of the users at a high accuracy. Additionally, re-deploying (trajectory design) and power control of the UAVs based on the movement of the users is an effective method of maintaining a high quality of downlink service.

APPENDIX A
PROOF OF LEMMA 1

The rate requirement of each user is given by r_{k_n}(t) ≥ r_0; then, we have

r_0 ≤ B_{k_n} log2( 1 + p_{k_n}(t) g_{k_n}(t) / (I_{k_n}(t) + σ²) ).   (A.1)

Rearranging (A.1) yields

p_{k_n}(t) ≥ (I_{k_n}(t) + σ²)(2^{r_0/B_{k_n}} − 1) / g_{k_n}(t).   (A.2)

Then we have

P_max ≥ |K_n| K_0 d^a_{k_n}(t) (P_LoS μ_LoS + P_NLoS μ_NLoS) · (I_{k_n}(t) + σ²)(2^{|K_n| r_0/B} − 1).   (A.3)

It can be proved that (P_LoS μ_LoS + P_NLoS μ_NLoS) ≤ μ_NLoS, where equality holds when the probability of an NLoS connection is 1. Following from the condition for equality, the maximum transmit power of each UAV has to obey

P_max ≥ |K_n| μ_NLoS σ² K_0 (2^{|K_n| r_0/B} − 1) · max{h_1, h_2, . . . , h_n}.   (A.4)

The proof is completed.

APPENDIX B
PROOF OF THEOREM 1

Two steps are taken for proving the convergence of the multi-agent Q-learning algorithm. Firstly, the convergence of the single-agent model is proved. Secondly, we extend the results from the single-agent domain to the multi-agent domain.

The update rule of the Q-learning algorithm is given by

Q_{t+1}(s_t, a_t) = (1 − α_t) Q_t(s_t, a_t) + α_t [ r_t + β max Q_t(s_{t+1}, a_{t+1}) ].   (B.1)

Subtracting the quantity Q^∗(s_t, a_t) from both sides of the equation, we have

Δ_t(s_t, a_t) = Q_t(s_t, a_t) − Q^∗(s_t, a_t)
= (1 − α_t) Δ_t(s_t, a_t) + α_t [ r_t + β max Q_t(s_{t+1}, a_{t+1}) − Q^∗(s_t, a_t) ].   (B.2)

We write F_t(s_t, a_t) = r_t + β max Q_t(s_{t+1}, a_{t+1}) − Q^∗(s_t, a_t); then we have

E[F_t(s_t, a_t) | F_t] = Σ_{s_t∈S} P_{a_t}(s_t, s_{t+1}) [ r_t + β max Q_t(s_{t+1}, a_{t+1}) − Q^∗(s_t, a_t) ]
= (H Q_t)(s_t, a_t) − Q^∗(s_t, a_t).   (B.3)

Using the fact that HQ^∗ = Q^∗, we have E[F_t(s_t, a_t) | F_t] = (H Q_t)(s_t, a_t) − (H Q^∗)(s_t, a_t). It has been proved that ||HQ_1 − HQ_2||_∞ ≤ β ||Q_1 − Q_2||_∞ [43]. In this case, we have ||E[F_t(s_t, a_t) | F_t]||_∞ ≤ β ||Q_t − Q^∗||_∞ = β ||Δ_t||_∞. Finally,

VAR[F_t(s_t, a_t) | F_t] = E[ ( r_t + β max Q_t(s_{t+1}, a_{t+1}) − (H Q_t)(s_t, a_t) )² ]
= VAR[ r_t + β max Q_t(s_{t+1}, a_{t+1}) | F_t ].   (B.4)

Due to the fact that r is bounded, this clearly verifies VAR[F_t(s_t, a_t) | F_t] ≤ C(1 + ||Δ_t||²) for some constant C. Then, as Δ_t converges to zero under the assumptions in [44], the single-agent model converges to the optimal Q-function as long as 0 ≤ α_t ≤ 1, Σ_t α_t = ∞ and Σ_t α_t² < ∞.
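The single-agent convergence conditions on α_t can be illustrated numerically with the update rule (B.1) on a toy single-state, single-action problem, where the fixed point Q^∗ = E[r]/(1 − β) is known in closed form. The learning-rate schedule α_t = t^(−0.6) (which satisfies Σα_t = ∞ and Σα_t² < ∞) and the bounded reward distribution are illustrative choices:

```python
import random

random.seed(2)
BETA = 0.9
# One state, one action, bounded random reward with mean 1: the fixed point of
# (B.1) is Q* = E[r] + BETA * Q*, i.e. Q* = 1 / (1 - BETA) = 10.
Q = 0.0
for t in range(1, 20001):
    alpha = t ** -0.6                # sum(alpha) diverges, sum(alpha**2) converges
    r = random.uniform(0.5, 1.5)     # bounded reward, mean 1
    Q = (1 - alpha) * Q + alpha * (r + BETA * Q)
print(round(Q, 1))                   # approaches 10
```

With a schedule violating Σα_t² < ∞ (e.g., a constant α_t), the iterate keeps fluctuating around Q^∗ instead of settling, which is why the diminishing-step-size conditions matter.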

Then, we improve the results to multi-agent domain. We as- [5] A. Osseiran et al., “Scenarios for 5G mobile and wireless communications:
sume that, there is an initial card which contains an initial value The vision of the METIS project,” IEEE Commun. Mag., vol. 52, no. 5,
pp. 26–35, May 2014.
M Qn∗ (sn , 0 , an ) at the bottom of the multi deck. The Q [6] Y. Zeng, R. Zhang, and T. J. Lim, “Wireless communications with un-
value for episode 0 in multi-agent algorithm has the same as manned aerial vehicles: Opportunities and challenges,” IEEE Commun.
an initial value, which is expressed as M Qn∗ (sn , 0 , an ) = Mag., vol. 54, no. 5, pp. 36–42, May 2016.
[7] Q. Wang, Z. Chen, H. Li, and S. Li, “Joint power and trajectory design for
M Qn0 (sn , an ). physical-layer secrecy in the UAV-aided mobile relaying system,” IEEE
For episode k, an optimal value is equivalent to the Q Access, vol. 6, pp. 62849–62855, 2018.
value, M Qn∗ (sn , k , an ) = M Qnk (sn , an ). Next, we con- [8] W. Yi, Y. Liu, E. Bodanese, A. Nallanathan, and G. K. Karagiannidis,
“A unified spatial framework for UAV-aided MmWave networks,” unpub-
sider a value function which selects optimal action by using an lished paper, 2019, arXiv:1901.01432.
equilibrium strategy. At the k level, an optimal value function is [9] S. Kandeepan, K. Gomez, L. Reynaud, and T. Rasheed, “Aerial-terrestrial
the same as the Q value function. communications: Terrestrial cooperation and energy-efficient transmis-
sions to aerial base stations,” IEEE Trans. Aerosp. Electron. Syst., vol. 50,
no. 4, pp. 2715–2735, Dec. 2014.
V n∗ (sn+1 , k) = Vkn (sn+1 ) [10] D. Yang, Q. Wu, Y. Zeng, and R. Zhang, “Energy tradeoff in ground-to-
⎡ ⎤ UAV communication via trajectory design,” IEEE Trans. Veh. Technol.,

n vol. 67, no. 7, pp. 6721–6726, Jul. 2018.
= EQnk ⎣ βj × max M Qnk (sn+1 , an+1 )⎦ . (B.5) [11] S. Zhang, Y. Zeng, and R. Zhang, “Cellular-enabled UAV communi-
j=1 cation: A connectivity-constrained trajectory optimization perspective,”
IEEE Trans. Commun., vol. 67, no. 3, pp. 2580–2604, 2018.
[12] S. Katikala, “Google project loon,” InSight, Rivier Academic J., vol. 10,
One of the agents maintains the previous Q value for episode no. 2, pp. 1–6, 2014.
k + 1. Then, we have [13] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Wireless communi-
cation using unmanned aerial vehicles (UAVs): Optimal transport theory
M Qnk+1 (sn , an ) = M Qnk (sn , an ) = M Qn∗
k (sn , k , an ) for hover time optimization,” IEEE Trans. Wireless Commun., vol. 16,
no. 12, pp. 8052–8066, Dec. 2017.
= M Qn∗
k (sn , k + 1 , an ) . (B.6) [14] X. Zhang and L. Duan, “Optimization of emergency UAV deployment for
providing wireless coverage,” in Proc. IEEE Global Commun. Conf., Dec.
Otherwise, the Q value holds the previous multi-agent Q value 2017, pp. 1–6.
[15] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Drone small cells in
with probability of 1 − αk+1
n
and takes two types of rewards with the clouds: Design, deployment and performance analysis,” in Proc. IEEE
n
probability of αk+1 . Then we have Proc. Global Commun. Conf., 2015, pp. 1–6.
[16] B. Van der Bergh, A. Chiumento, and S. Pollin, “LTE in the sky: Trading
M Qn∗ (sn , k + 1 , an ) off propagation benefits with interference costs for aerial nodes,” IEEE
Commun. Mag., vol. 54, no. 5, pp. 44–50, May 2016.
= (1 − αk+1
n
)M Qn∗
k (sn , k , an )
[17] Y. Zeng and R. Zhang, “Energy-efficient UAV communication with tra-
jectory optimization,” IEEE Trans. Wireless Commun., vol. 16, no. 6,
⎡ ⎤ pp. 3747–3760, Mar. 2017.
 [18] L. Liu, S. Zhang, and R. Zhang, “CoMP in the sky: UAV placement
+ αk+1
n ⎣rk+1
n
+β Psnn →sn+1 [an ] Vkn (sn+1 )⎦ and movement optimization for multi-user communications,” IEEE Trans.
sn+1 Commun., pp. 1–14, 2019.
[19] R. Sun and D. W. Matolak, “Air-ground channel characterization for un-
= (1 − αk+1
n
)M Qnk (sn , an ) manned aircraft systems part II: Hilly and mountainous settings,” IEEE
⎡ ⎤ Trans. Veh. Technol., vol. 66, no. 3, pp. 1913–1925, Mar. 2017.
[20] L. Bing, “Study on modeling of communication channel of UAV,” Proce-

+ αk+1
n ⎣rk+1
n
+β Psnn →sn+1 [an ] Vkn (sn+1 )⎦
dia Comput. Sci., vol. 107, pp. 550–557, 2017.
[21] N. Zhang, P. Yang, J. Ren, D. Chen, L. Yu, and X. Shen, “Syn-
sn+1 ergy of big data and 5G wireless networks: Opportunities, approaches,
and challenges,” IEEE Wireless Commun., vol. 25, no. 1, pp. 12–18,
= M Qnk+1 (sn , an ) . (B.7) Feb. 2018.
[22] B. Yang, W. Guo, B. Chen, G. Yang, and J. Zhang, “Estimating mobile
In this case, if M Qnk+1 (sn , an ) converge to an optimal valve traffic demand using Twitter,” IEEE Wireless Commun. Lett., vol. 5, no. 4,
pp. 380–383, Aug. 2016.
M Qn∗ (sn , k + 1 , an ), then a state equation of multi-agent [23] A. Galindo-Serrano and L. Giupponi, “Distributed Q-Learning for aggre-
Q-learning MQk+1 converges to an optimal state equation gated interference control in cognitive radio networks,” IEEE Trans. Veh.
MQ∗ [k + 1]. Technol., vol. 59, no. 4, pp. 1823–1834, May 2010.
[24] J. Kosmerl and A. Vilhar, “Base stations placement optimization in wire-
The proof is completed. less networks for emergency communications,” in Proc. IEEE Int. Com-
mun. Conf., 2014, pp. 200–205.
[25] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP altitude for
REFERENCES maximum coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–
[1] X. Liu, Y. Liu, and Y. Chen, “Machine learning aided trajectory design 572, Jul. 2014.
and power control of multi-UAV,” in Proc. IEEE Global Commun. Conf., [26] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, “3D placement
2019, pp. 1–6. of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient
[2] Y. Zhou et al., “Improving physical layer security via a UAV friendly maximal coverage,” IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 434–
jammer for unknown eavesdropper location,” IEEE Trans. Veh. Technol., 437, May 2017.
vol. 67, no. 11, pp. 11 280–11 284, Nov. 2018. [27] P. K. Sharma and D. I. Kim, “UAV-enabled downlink wireless system with
[3] W. Khawaja, O. Ozdemir, and I. Guvenc, “UAV air-to-ground channel non-orthogonal multiple access,” in Proc. IEEE GLOBECOM Workshops,
characterization for mmwave systems,” in Proc. IEEE Veh. Technol. Conf., Dec. 2017, pp. 1–6.
2017, pp. 1–5. [28] M. F. Sohail, C. Y. Leow, and S. Won, “Non-orthogonal multiple access for
[4] F. Cheng et al., “UAV trajectory optimization for data offloading at the unmanned aerial vehicle assisted communication,” IEEE Access, vol. 6,
edge of multiple cells,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6732– pp. 22716–22727, 2018.
6736, Jul. 2018.
[29] T. Hou, Y. Liu, Z. Song, X. Sun, and Y. Chen, "Multiple antenna aided NOMA in UAV networks: A stochastic geometry approach," IEEE Trans. Commun., vol. 67, no. 2, pp. 1031–1044, 2018.
[30] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, "Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs," IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 3949–3963, Jun. 2016.
[31] M. Mozaffari, W. Saad, and M. Bennis, "Efficient deployment of multiple unmanned aerial vehicles for optimal wireless coverage," IEEE Commun. Lett., vol. 20, no. 8, pp. 1647–1650, Jun. 2016.
[32] J. Lyu, Y. Zeng, and R. Zhang, "Cyclical multiple access in UAV-aided communications: A throughput-delay tradeoff," IEEE Wireless Commun. Lett., vol. 5, no. 6, pp. 600–603, Dec. 2016.
[33] Q. Wu, Y. Zeng, and R. Zhang, "Joint trajectory and communication design for multi-UAV enabled wireless networks," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 2109–2121, Mar. 2018.
[34] Q. Wu and R. Zhang, "Common throughput maximization in UAV-enabled OFDMA systems with delay consideration," IEEE Trans. Commun., vol. 66, no. 12, pp. 6614–6627, 2018.
[35] S. Zhang, Y. Zeng, and R. Zhang, "Cellular-enabled UAV communication: Trajectory optimization under connectivity constraint," in Proc. IEEE Int. Commun. Conf., 2018, pp. 1–6.
[36] H. He, S. Zhang, Y. Zeng, and R. Zhang, "Joint altitude and beamwidth optimization for UAV-enabled multiuser communications," IEEE Commun. Lett., vol. 22, no. 2, pp. 344–347, Feb. 2018.
[37] Q. Liu et al., "Charging unplugged: Will distributed laser charging for mobile wireless power transfer work?" IEEE Veh. Technol. Mag., vol. 11, no. 4, pp. 36–45, Dec. 2016.
[38] X. Liu, Y. Liu, and Y. Chen, "Deployment and movement for multiple aerial base stations by reinforcement learning," in Proc. IEEE Global Commun. Conf., Dec. 2018, pp. 1–6.
[39] J. Ren, G. Zhang, and D. Li, "Multicast capacity for VANETs with directional antenna and delay constraint under random walk mobility model," IEEE Access, vol. 5, pp. 3958–3970, 2017.
[40] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong, "Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience," IEEE J. Sel. Areas Commun., vol. 35, no. 5, pp. 1046–1061, May 2017.
[41] M. Chen, W. Saad, C. Yin, and M. Debbah, "Echo state networks for proactive caching in cloud-based radio access networks with mobile users," IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 3520–3535, Jun. 2017.
[42] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Reward function and initial values: Better choices for accelerated goal-directed reinforcement learning," in Proc. Int. Conf. Artif. Neural Netw., 2006, pp. 840–849.
[43] T. Jaakkola, M. I. Jordan, and S. P. Singh, "Convergence of stochastic iterative dynamic programming algorithms," in Proc. Advances Neural Inf. Process. Syst., 1994, pp. 703–710.
[44] M. L. Littman and C. Szepesvári, "A generalized reinforcement-learning model: Convergence and applications," in Proc. ICML, vol. 96, 1996, pp. 310–318.

Yuanwei Liu (S'13–M'16–SM'19) received the B.S. and M.S. degrees from the Beijing University of Posts and Telecommunications, Beijing, China, in 2011 and 2014, respectively, and the Ph.D. degree in electrical engineering from the Queen Mary University of London, London, U.K., in 2016. He was with the Department of Informatics, King's College London, from 2016 to 2017, where he was a Postdoctoral Research Fellow. He has been a Lecturer (Assistant Professor) with the School of Electronic Engineering and Computer Science, Queen Mary University of London, since 2017. His research interests include 5G and beyond wireless networks, the Internet of Things, machine learning, and stochastic geometry.

Mr. Liu received the Exemplary Reviewer Certificate of the IEEE WIRELESS COMMUNICATIONS LETTERS in 2015, the IEEE TRANSACTIONS ON COMMUNICATIONS in 2016 and 2017, and the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS in 2017. Currently, he is serving as an Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS, the IEEE COMMUNICATIONS LETTERS, and IEEE ACCESS. He is also a Guest Editor of the IEEE Journal of Selected Topics in Signal Processing special issue on "Signal Processing Advances for Non-Orthogonal Multiple Access in Next-Generation Wireless Networks". He was the Publicity Co-Chair of VTC2019-Fall and a TPC Member of many IEEE conferences, such as GLOBECOM and ICC.

Yue Chen (S'02–M'03–SM'15) received the bachelor's and master's degrees from the Beijing University of Posts and Telecommunications, Beijing, China, in 1997 and 2000, respectively, and the Ph.D. degree from the Queen Mary University of London (QMUL), London, U.K., in 2003. She is a Professor of telecommunications engineering with the School of Electronic Engineering and Computer Science, QMUL, U.K. Her current research interests include intelligent radio resource management for wireless networks, cognitive and cooperative wireless networking, mobile edge computing, HetNets, smart energy systems, and the Internet of Things.

Lajos Hanzo (F'04) received the five-year degree in electronics in 1976 and the doctorate degree from the Technical University of Budapest, Budapest, Hungary, in 1983. In 2009, he was awarded an honorary doctorate by the Technical University of Budapest and, in 2015, by the University of Edinburgh. In 2016, he joined the Hungarian Academy of Sciences. During his 40-year career in telecommunications he has held various research and academic posts in Hungary, Germany, and the U.K. Since 1986,
he has been with the School of Electronics and Com-
puter Science, University of Southampton, U.K., where he holds the chair in
telecommunications. He has successfully supervised 119 Ph.D. students, coau-
thored 18 Wiley/IEEE Press books on mobile radio communications totaling in
excess of 10 000 pages, published 1800+ research contributions at IEEE Xplore,
acted both as TPC and General Chair of IEEE conferences, presented keynote
lectures and has been awarded a number of distinctions. He is currently directing
Xiao Liu (S’18) received the B.S. and M.S. degrees in a 60-strong academic research team, working on a range of research projects in
2013 and 2016, respectively. He is currently working the field of wireless multimedia communications sponsored by industry, the En-
toward the Ph.D. degree with Communication Sys- gineering and Physical Sciences Research Council U.K., the European Research
tems Research Group, School of Electronic Engineer- Council’s Advanced Fellow Grant and the Royal Society’s Wolfson Research
ing and Computer Science, Queen Mary University Merit Award. He is an enthusiastic supporter of industrial and academic liai-
of London, London, U.K. son and he offers a range of industrial courses. He is also the Governor of the
His research interests include unmanned aerial ve- IEEE ComSoc and VTS. He is a former Editor-in-Chief of the IEEE Press and
hicle aided networks, machine learning, nonorthog- a former Chaired Professor at Tsinghua University, Beijing. He is fellow of the
onal multiple access techniques, and autonomous Royal Academy of Engineering, Institution of Engineering and Technology, and
driving. EURASIP.