Liu 2019
Abstract—A novel framework is proposed for the trajectory design of multiple unmanned aerial vehicles (UAVs) based on the prediction of users’ mobility information. The problem of joint trajectory design and power control is formulated for maximizing the instantaneous sum transmit rate while satisfying the rate requirement of users. In an effort to solve this pertinent problem, a three-step approach is proposed, which is based on machine learning techniques to obtain both the position information of users and the trajectory design of UAVs. First, a multi-agent Q-learning-based placement algorithm is proposed for determining the optimal positions of the UAVs based on the initial location of the users. Second, in an effort to determine the mobility information of users based on a real dataset, their position data is collected from Twitter to describe the anonymous user-trajectories in the physical world. In the meantime, an echo state network (ESN) based prediction algorithm is proposed for predicting the future positions of users based on the real dataset. Third, a multi-agent Q-learning-based algorithm is conceived for predicting the position of UAVs in each time slot based on the movement of users. In this algorithm, multiple UAVs act as agents to find optimal actions by interacting with their environment and learning from their mistakes. Additionally, we prove that the proposed multi-agent Q-learning-based trajectory design and power control algorithm can converge under mild conditions. Numerical results are provided to demonstrate that as the size of the reservoir increases, the proposed ESN approach improves the prediction accuracy. Finally, we demonstrate that throughput gains of about 17% are achieved.

Index Terms—Multi-agent Q-learning, power control, trajectory design, Twitter, unmanned aerial vehicle (UAV).

Manuscript received April 3, 2019; accepted May 21, 2019. Date of publication May 31, 2019; date of current version August 13, 2019. The work of L. Hanzo was supported in part by the Engineering and Physical Sciences Research Council Projects EP/N004558/1, EP/P034284/1, and COALESCE of the Royal Society’s Global Challenges Research Fund Grant and in part by the European Research Council’s Advanced Fellow Grant QuantCom. The review of this paper was coordinated by Guest Editors of the Special Issue on Vehicle Connectivity and Automation using 5G. This paper was presented in part at the IEEE Global Communication Conference (GLOBECOM), Waikoloa, HI, USA, Dec. 9–13, 2019. (Corresponding author: Lajos Hanzo.)
X. Liu, Y. Liu, and Y. Chen are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, U.K.
L. Hanzo is with the School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, U.K.
Digital Object Identifier 10.1109/TVT.2019.2920284

I. INTRODUCTION
A. Motivation
As a benefit of their agility, as well as line-of-sight (LoS) propagation, unmanned aerial vehicles (UAVs) have attracted significant research interest as a means of mitigating a wide range of challenges in commercial and civilian applications [2], [3]. The future wireless communication systems are expected to meet unprecedented demands for high quality wireless services, which imposes challenges on the conventional terrestrial communication networks, especially in traffic hotspots such as in a football stadium or rock concert [4]–[6]. UAVs may be relied upon as aerial base stations to complement and/or support the existing terrestrial communication infrastructure [5], [7], [8] since they can be flexibly redeployed in temporary traffic hotspots or after natural disasters. Secondly, UAVs have also been deployed as relays between ground-based terminals and as aerial base stations for enhancing the link performance [9]. Thirdly, UAVs can also be used as aerial base stations to collect data from Internet of Things (IoT) devices on the ground, where building a complete cellular infrastructure is unaffordable [7], [10]. Fourthly, combined terrestrial and UAV communication networks are capable of substantially improving the reliability, security, coverage and throughput of the existing point-to-point UAV-to-ground communications [11].
Key examples of recent advances include the Google Loon project [12], Facebook’s Internet-delivery drone [13], and the AT&T project of [14]. The drone manufacturing industry faces both opportunities and challenges in the design of UAV-assisted wireless networks. Before fully reaping all the aforementioned benefits, several technical challenges have to be tackled, including the optimal three-dimensional (3D) deployment of UAVs, their interference management [15], [16], energy supply [9], [17], trajectory design [18], the channel model between the UAV and users [19], [20], resource allocation [7], as well as the compatibility with the existing infrastructure.
The wide use of online social networks over smartphones has accumulated a rich set of geographical data that describes the anonymous users’ mobility information in the physical world [21]. Many social networking applications like Facebook, Twitter, WeChat, Weibo, etc. allow users to ’check in’ and explicitly share their locations, while some other applications have implicitly recorded the users’ GPS coordinates [22], which holds the promise of estimating the geographic user distribution for improving the performance of the system. Reinforcement learning has seen increasing applications in next-generation wireless networks [23]. More importantly, reinforcement learning models may be trained by interacting with an environment (states),
0018-9545 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
7958 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 68, NO. 8, AUGUST 2019
and they can be expected to find the optimal behaviors (actions) of agents by exploring the environment in an iterative manner and by learning from their mistakes. The model is capable of monitoring the reward resulting from its actions and is chosen for solving problems in UAV-assisted wireless networks.

B. Related Works
1) Deployment of UAVs: Among all these challenges, the geographic UAV deployment problems are fundamental. Early research contributions have studied the deployment of a single UAV either to provide maximum radio coverage on the ground [24], [25] or to maximize the number of users by using the minimum transmit power [26]. As the research evolved further, UAV-assisted systems have received significant attention and have been combined with other promising technologies. Specifically, the authors of [27]–[29] employed non-orthogonal multiple access (NOMA) for improving the performance of UAV-enabled communication systems, which is capable of outperforming orthogonal multiple access (OMA). In [30], UAV-aided D2D communications was investigated and the tradeoff between the coverage area and the time required for covering the entire target area (delay) by UAV-aided data acquisition was also analyzed. The authors of [13] proposed a framework using multiple static UAVs for maximizing the average data rate provided for users, while considering fairness amongst the users. The authors of [31] used sphere packing theory for determining the most appropriate 3D position of the UAVs while jointly maximizing both the total coverage area and the battery operating period of the UAVs.
2) Trajectory Design of UAVs: It is intuitive that moving UAVs are capable of improving the coverage provided by static UAVs, yet the existing research has mainly considered the scenario that users are static [10], [32]. Having said that, the authors of [33] jointly considered the UAV trajectory and transmit power optimization problem for maintaining fairness among users. An iterative algorithm was invoked for solving the resultant non-convex problem by applying the classic block coordinate descent and successive convex optimization techniques. In [17], the new design paradigm of jointly optimizing the communication throughput and the UAV’s energy consumption was conceived for determining the trajectory of the UAV, including its initial/final locations and velocities, as well as its minimum/maximum speed and acceleration. In [10], a pair of practical UAV trajectories, namely the circular flight and straight flight, were pursued for collecting a given amount of data from a ground terminal (GT) at a fixed location, while considering the associated energy dissipation tradeoff. By contrast, a novel cyclical trajectory was considered in [32] to serve each user via TDMA. As shown in [32], a significant throughput gain was achieved over a static UAV. In [34], a simple circular trajectory was used along with maximizing the minimum average throughput of all users. In addition to designing the UAV’s trajectory for its action as an aerial base station, the authors of [35] studied a cellular-enabled UAV communication system, in which the UAV flew from an initial location to a final location, while maintaining reliable wireless connection with the cellular network by associating the UAV with one of the ground base stations (GBSs) at each time instant. The design objective was to minimize the UAV’s mission completion time by optimizing its trajectory.

C. Our New Contributions
The aforementioned research contributions considered the deployment and trajectory design of UAVs in the scenario that users are static or studied the movement of UAVs based on the current user location information, where only the user location information of the current time slot is known. Studying the pre-deployment of UAVs based on the full user location information implicitly assumes that the position and mobility information of users is known or can be predicted. With this proviso the flight trajectory of UAVs may be designed in advance for maintaining a high service quality and hence reducing the response time. Meanwhile, no interaction is needed between the UAVs and the ground control center after the pre-deployment of UAVs. To the best of our knowledge, this important problem is still unsolved.
Again, deploying UAVs as aerial BSs is able to provide reliable services for the users [35]. However, there is a paucity of research on the problem of 3D trajectory design of multiple UAVs based on the prediction of the users’ mobility information, which motivates this treatise. More particularly, i) most existing research contributions mainly focus on the 2D placement of multiple UAVs or on the movement of a single UAV in the scenario where the users are static; ii) the prediction of the users’ position and their mobility information based on a real dataset has never been considered, although it helps us to design the trajectory of UAVs in advance, thus reducing both the response time and the interaction between the UAVs and the control center. The transmit power of UAVs is controlled for obtaining a tradeoff between the received signal power and the interference power, which in turn increases the received signal-to-interference-plus-noise ratio (SINR). Therefore, we formulate the problem of joint trajectory design and power control of UAVs to improve the users’ throughput, while satisfying the rate requirement of users. Against the above background, the primary contributions of this paper are as follows:
• We propose a novel framework for the trajectory design of multiple UAVs, in which the UAVs move around in a 3D space to offer down-link service to users. Based on the proposed model, we formulate a throughput maximization problem by designing the trajectory and power control of multiple UAVs.
• We develop a three-step approach for solving the proposed problem. More particularly, i) we propose a multi-agent Q-learning based placement algorithm for determining the initial deployment of UAVs; ii) we propose an echo state network based prediction algorithm for predicting the mobility of users; iii) we conceive a multi-agent Q-learning based trajectory-acquisition and power-control algorithm for UAVs.
• We invoke the ESN algorithm for acquiring the mobility information of users relying on a real dataset of users collected from Twitter, which consists of the GPS coordinates and recorded time stamps of Twitter.
• We conceive a multi-agent Q-learning based solution for the joint trajectory design and power control problem of
LIU et al.: TRAJECTORY DESIGN AND POWER CONTROL FOR MULTI-UAV ASSISTED WIRELESS NETWORKS 7959
TABLE I
LIST OF NOTATIONS
UAVs. In contrast to a single-agent Q-learning algorithm, the multi-agent Q-learning algorithm is capable of supporting the deployment of cooperative UAVs. We also demonstrate that the proposed algorithm is capable of converging to an optimal state.

D. Organization and Notations
The rest of the paper is organized as follows. In Section II, the problem formulation of joint trajectory design and power control of UAVs is presented. In Section III, the prediction of the users’ mobility information is proposed, relying on the ESN algorithm. In Section IV, our multi-agent Q-learning based deployment algorithm is proposed for designing the trajectory and power control of UAVs. Our numerical results are presented in Section V, which is followed by our conclusions in Section VI. The list of notations is illustrated in Table I.

II. SYSTEM MODEL
We consider the downlink of UAV-assisted wireless communication networks. Multiple UAVs are deployed as aerial BSs to support the users in a particular area, where the terrestrial infrastructure was destroyed or had not been installed. The users are partitioned into $N$ clusters and each user belongs to a single cluster. Users in this particular area are denoted as $\mathcal{K} = \{\mathcal{K}_1, \ldots, \mathcal{K}_N\}$, where $\mathcal{K}_n$ is the set of users that belong to the $n$-th cluster, $n \in \mathcal{N} = \{1, 2, \ldots, N\}$. Then, we have $\mathcal{K}_n \cap \mathcal{K}_{n'} = \emptyset$, $n' \neq n$, $\forall n', n \in \mathcal{N}$, while $K_n = |\mathcal{K}_n|$ denotes the number of users in the $n$-th cluster. For any cluster $n$, $n \in \mathcal{N}$, we consider a UAV-enabled FDMA system [36], where the UAVs are connected to the core network by satellite. At any time during the UAVs’ working period of $T_n$, each UAV communicates simultaneously with multiple users by employing FDMA.
We assume that the energy of UAVs is supplied by laser charging as detailed in [37]. A compact distributed laser charging (DLC) receiver can be mounted on a battery-powered off-the-shelf UAV for charging the UAV’s battery. A DLC transmitter (termed as a power base station) on the ground is assumed to provide a laser based power supply for the UAVs. Since the DLC is capable of self-alignment and LoS propagation is usually available because of the high altitude of UAVs, the UAVs can be charged as long as they are flying within the DLC’s coverage range. Thus, these DLC-equipped UAVs can operate for a long time without landing until maintenance is needed. The scenario that the energy of UAVs is limited will be discussed in our future work, in which DLC will also be utilized.

A. Mobility Model
Since the users are able to move continuously during the flying period of UAVs, the UAVs have to travel based on the tele-traffic of users. Datasets can be collected to model the mobility of users. Again, in this work, the real-time position information of users is collected from Twitter by the Twitter API, where the data consists of the GPS coordinates and recorded time stamps. When users post tweets, their GPS coordinates are recorded, provided that they give their consent, for example in exchange for calling credits. The detailed discussion of the data collection process is in Section III. The mobility pattern of each user will then be used to determine the optimal location of each UAV, which will naturally impact the service quality of users. The coordinate of each user can be expressed as $w_{k_n} = [x_{k_n}(t), y_{k_n}(t)]^T \in \mathbb{R}^{2 \times 1}$, $k_n \in \mathcal{K}_n$, where $\mathbb{R}^{M \times 1}$ denotes the $M$-dimensional real-valued vector space, while $x_{k_n}(t)$ and $y_{k_n}(t)$ are the X-coordinate and Y-coordinate of user $k_n$ at time $t$, respectively.
Since the users are moving continuously, the location of the UAVs must be adjusted accordingly so as to efficiently serve them. The aim of the model is to design the trajectory of UAVs in advance according to the prediction of the users’ movement. At any time slot during the UAVs’ flight period, both the vertical trajectory (altitude) and the horizontal trajectory of the UAV can be adjusted to offer a high quality of service. The vertical trajectory is denoted by $h_n(t) \in [h_{\min}, h_{\max}]$, $0 \le t \le T_n$, while the horizontal one by $q_n(t) = [x_n(t), y_n(t)]^T \in \mathbb{R}^{2 \times 1}$, with $0 \le t \le T_n$. The UAVs’ operating period is discretized into $N_T$ equal-length time slots.

B. Transmission Model
In our model, the downlink between the UAVs and users can be regarded as air-to-ground communications. The LoS condition and Non-Line-of-Sight (NLoS) condition are assumed to be encountered randomly. The LoS probability can be expressed as [13]
$$P_{\mathrm{LoS}}(\theta_{k_n}) = b_1 \left( \frac{180}{\pi}\,\theta_{k_n} - \zeta \right)^{b_2}, \tag{1}$$
where $\theta_{k_n}(t) = \sin^{-1}\!\left( \frac{h_n(t)}{d_{k_n}(t)} \right)$ is the elevation angle between the UAV and the user $k_n$. Furthermore, $b_1$ and $b_2$ are constant values reflecting the environmental impact, while $\zeta$ is also a constant value which is determined both by the antenna and the environment. Naturally, the NLoS probability is given by $P_{\mathrm{NLoS}} = 1 - P_{\mathrm{LoS}}$.
Following the free-space path loss model, the channel’s power gain between the UAV and user $k_n$ at time instant $t$ is given by
$$g_{k_n}(t) = K_0^{-1}\, d_{k_n}^{-\alpha}(t) \left[ P_{\mathrm{LoS}}\,\mu_{\mathrm{LoS}} + P_{\mathrm{NLoS}}\,\mu_{\mathrm{NLoS}} \right]^{-1}, \tag{2}$$
where $\mu_{\mathrm{LoS}}$ and $\mu_{\mathrm{NLoS}}$ are the attenuation factors of the LoS and NLoS links, $f_c$ is the carrier frequency, and finally $c$ is the speed of light.
The distance from UAV $n$ to user $k_n$ at time $t$ is assumed to be a constant that can be expressed as
$$d_{k_n}(t) = \sqrt{ h_n^2(t) + [x_n(t) - x_{k_n}(t)]^2 + [y_n(t) - y_{k_n}(t)]^2 }. \tag{3}$$

Fig. 1. Deployment of multiple UAVs in wireless communications based on the mobility information of users.
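As a minimal numerical sketch of (1)–(3), and not the authors' implementation, the functions below evaluate the LoS probability, the UAV-to-user distance and the channel power gain. Two points are assumptions of this sketch rather than statements from the text: the constant $K_0$ is taken to be the usual free-space reference term $(4\pi f_c / c)^2$, and every numerical value in `PARAMS` is a hypothetical placeholder. The LoS probability is additionally clipped to $[0, 1]$ so that it remains a valid probability for any parameter choice.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def los_probability(theta_rad, b1, b2, zeta):
    """Eq. (1): b1 * (180/pi * theta - zeta) ** b2, clipped to [0, 1]."""
    base = max(math.degrees(theta_rad) - zeta, 0.0)  # avoid a negative base
    return min(max(b1 * base ** b2, 0.0), 1.0)

def distance(h, uav_xy, user_xy):
    """Eq. (3): distance between a UAV at altitude h and a ground user."""
    dx = uav_xy[0] - user_xy[0]
    dy = uav_xy[1] - user_xy[1]
    return math.sqrt(h ** 2 + dx ** 2 + dy ** 2)

def channel_gain(h, uav_xy, user_xy, *, fc, alpha, mu_los, mu_nlos, b1, b2, zeta):
    """Eq. (2): average channel power gain over the LoS and NLoS conditions."""
    d = distance(h, uav_xy, user_xy)
    theta = math.asin(h / d)                      # elevation angle
    p_los = los_probability(theta, b1, b2, zeta)
    p_nlos = 1.0 - p_los
    # ASSUMPTION: K0 = (4*pi*fc/c)^2; its definition is not shown in the text.
    k0 = (4.0 * math.pi * fc / SPEED_OF_LIGHT) ** 2
    return 1.0 / (k0 * d ** alpha * (p_los * mu_los + p_nlos * mu_nlos))

# Hypothetical parameter values, chosen only to exercise the functions.
PARAMS = dict(fc=2e9, alpha=2.0, mu_los=3.0, mu_nlos=23.0,
              b1=0.36, b2=0.21, zeta=0.0)

g_near = channel_gain(100.0, (0.0, 0.0), (50.0, 0.0), **PARAMS)
g_far = channel_gain(100.0, (0.0, 0.0), (500.0, 0.0), **PARAMS)
print(f"gain near: {g_near:.3e}, gain far: {g_far:.3e}")
```

With these placeholder values the more distant user sees both a longer path and a lower elevation angle, so its gain is smaller than that of the nearby user.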
The transmit power of UAV $n$ has to obey
The overall achievable sum rate at time $t$ can be expressed as

Fig. 2. The procedure and algorithms used for solving the joint problem of trajectory plan and power control of UAVs.
Fig. 3. The initial positions of the users derived from Twitter.
Fig. 4. The structure of Echo State Network for predicting the mobility of the users.
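To illustrate the kind of echo state network used for mobility prediction, the following is a minimal generic ESN sketch under stated assumptions, not the authors' model. The reservoir matrix $W$ is sparse, drawn from a bounded uniform distribution and rescaled to a chosen spectral radius; only the linear readout is trained, here by ridge regression. The function names, the reservoir size of 100, the connection density, the spectral radius of 0.9 and the synthetic circular track (standing in for a user's recorded positions) are all hypothetical choices made for the example. The 75%/25% training/testing split follows the text.

```python
import numpy as np

def make_reservoir(n_res, density, spectral_radius, rng):
    """Sparse W with bounded uniform entries, rescaled to a target spectral radius."""
    W = rng.uniform(-1.0, 1.0, size=(n_res, n_res))
    W[rng.random((n_res, n_res)) >= density] = 0.0   # keep roughly `density` of the links
    current_radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / current_radius)

def run_reservoir(W_in, W, inputs):
    """Drive the reservoir with the input sequence and collect its states."""
    x = np.zeros(W.shape[0])
    states = np.empty((len(inputs), W.shape[0]))
    for t, u in enumerate(inputs):
        x = np.tanh(W_in @ np.concatenate(([1.0], u)) + W @ x)
        states[t] = x
    return states

def train_readout(states, targets, ridge=1e-6):
    """Ridge-regression readout, the only trained part of the ESN."""
    gram = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(gram, states.T @ targets)

rng = np.random.default_rng(0)
n_res = 100                                        # hypothetical reservoir size
W = make_reservoir(n_res, density=0.1, spectral_radius=0.9, rng=rng)
W_in = rng.uniform(-0.5, 0.5, size=(n_res, 3))     # columns: bias, x, y

# Synthetic circular track standing in for a user's (x, y) positions.
t = np.linspace(0.0, 8.0 * np.pi, 400)
track = np.column_stack((np.cos(t), np.sin(t)))
inputs, targets = track[:-1], track[1:]            # one-step-ahead prediction

split = int(0.75 * len(inputs))                    # 75% training, 25% testing
states = run_reservoir(W_in, W, inputs)
W_out = train_readout(states[:split], targets[:split])
predictions = states[split:] @ W_out
mse = float(np.mean((predictions - targets[split:]) ** 2))
print(f"test MSE: {mse:.3e}")
```

Because the reservoir weights stay fixed after initialization, increasing `n_res` only enlarges the linear system solved for the readout, which is one reason an ESN can be cheaper to train than a recurrent network trained end to end.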
prediction becomes, but at the same time it increases the probability of causing overfitting.
• Sparsity: Sparsity characterizes the density of the connections between neurons in the reservoir. When the density is reduced, the non-linear closing capability is increased, whilst the operation becomes more complex.
• Distribution of Nonzero Elements: The matrix $W$ is typically a sparse one, representing a network, which has normally distributed elements centered around zero. In this paper, we use a continuous-valued bounded uniform distribution, which provides an excellent performance [41], outperforming many other distributions.
• Spectral Radius of $W$: The spectral radius of $W$ scales the matrix $W$ and hence also the variance of its nonzero elements. This parameter is fixed, once the neuron reservoir is established.
Remark 4: The size of the neuron reservoir has to be carefully chosen to satisfy the memory constraint, but $N_x$ should also be at least equal to the estimated number of independent real values the reservoir has to remember from the input in order to solve the task.
A larger memory capacity implies that the ESN model is capable of storing more locations that the users have visited, which tends to improve the prediction accuracy of the users’ movements. In the ESN model, typically 75% of the dataset is used for training and 25% for the testing process.
Remark 5: For challenging tasks, as large a neuron reservoir has to be used as one can computationally afford.

IV. JOINT TRAJECTORY DESIGN AND TRANSMIT POWER CONTROL OF UAVS
In this section, we assume that in any cluster $n$, the UAV is serving the users relying on an adaptively controlled flight trajectory and transmit power, with the goal of maximizing the sum transmit rate in each time slot by determining the flight trajectory and transmit power of the UAVs. User clustering constitutes the first step of achieving the association between the UAVs and the users. The users are partitioned into different clusters, and each cluster is served by a single UAV. The process of cell partitioning has been discussed in our previous work [38], which has demonstrated that the genetic K-means (GAK-means) algorithm is capable of obtaining globally optimal clustering results. The process of clustering is also detailed in [38], hence it will not be elaborated on here.

A. Single-Agent Q-learning Algorithm
In this section, a multi-agent Q-learning algorithm is invoked for obtaining the movement of the UAVs. Before introducing the multi-agent Q-learning algorithm, the single-agent Q-learning algorithm is introduced as the theoretical basis. In the single-agent model, each UAV acts as an agent, moving without cooperating with other UAVs. In this case, the geographic positioning of each UAV is not affected by the movement of other UAVs. The single-agent Q-learning model relies on four core elements: the states, actions, rewards and Q-values. The aim of this algorithm is that of conceiving a policy (a set of actions to be carried out by the agent) that maximizes the rewards observed during the interaction time of the agent. During the iterations, the agent observes a state $s_t$ in each time slot $t$ from the state space $S$. Accordingly, the agent carries out an action $a_t$ from the action space $A$, selecting its specific flying directions and transmit power based on policy $J$. The decision policy $J$ is determined by a Q-table $Q(s_t, a_t)$. The policy promotes choosing specific actions, which enable the model to attain the maximum Q-values. Following each action, the state of the agent traverses to a new state $s_{t+1}$, while the agent receives a reward $r_t$, which is determined by the instantaneous sum rate of users, as given by
$$r_n(t) = \begin{cases} \displaystyle\sum_{k_n=1}^{K_n} B_{k_n} \log_2\!\left( 1 + \frac{p_{k_n}(t)\, g_{k_n}(t)}{I_{k_n}(t) + \sigma^2} \right), & \mathrm{Sumrate}_{\mathrm{new}} \ge \mathrm{Sumrate}_{\mathrm{old}}, \\ 0, & \mathrm{Sumrate}_{\mathrm{new}} < \mathrm{Sumrate}_{\mathrm{old}}. \end{cases} \tag{14}$$

B. State-Action Construction of the Multi-Agent Q-learning Algorithm
In the multi-agent Q-learning model, each agent has to keep a Q-table that includes data both about its own states as well as about the other agents’ states and actions. More explicitly, it takes account of the other agents’ actions with the goal of promoting cooperative actions among agents so as to glean the highest possible rewards.
In the multi-agent Q-learning model, the individual agents are represented by a four-tuple state: $\xi_n = (x_{\mathrm{UAV}}^{(n)}, y_{\mathrm{UAV}}^{(n)}, h_{\mathrm{UAV}}^{(n)}, P_{\mathrm{UAV}}^{(n)})$, where $(x_{\mathrm{UAV}}^{(n)}, y_{\mathrm{UAV}}^{(n)})$ is the horizontal position of UAV $n$, while $h_{\mathrm{UAV}}^{(n)}$ and $P_{\mathrm{UAV}}^{(n)}$ are the altitude and the transmit power of UAV $n$, respectively. Since the UAVs operate across a particular area, the corresponding state space is denoted as: $x_{\mathrm{UAV}}^{(n)}: \{0, 1, \ldots, X_d\}$, $y_{\mathrm{UAV}}^{(n)}: \{0, 1, \ldots, Y_d\}$, $h_{\mathrm{UAV}}^{(n)}: \{h_{\min}, \ldots, h_{\max}\}$, $P_{\mathrm{UAV}}^{(n)}: \{0, \ldots, P_{\max}\}$, where $X_d$ and $Y_d$ represent the maximum coordinates of this particular area, while $h_{\min}$ and $h_{\max}$ are the lower and upper altitude bounds of UAVs, respectively. Finally, $P_{\max}$ is the maximum transmit power derived from Lemma 1.
We assume that the initial state of UAVs is determined randomly. Then the convergence of the algorithm is determined by the number of users and UAVs, as well as by the initial position of UAVs. A faster convergence is attained when the UAVs are placed closer to the respective optimal positions.
At each step, each UAV carries out an action $a_t \in A$, which includes choosing a specific direction and transmit power level, depending on its current state, $s_t \in S$, based on the decision policy $J$. The UAVs may fly in arbitrary directions (with different angles), which makes the problem non-trivial to solve. However, by assuming the UAVs fly at a constant velocity, and
obey coordinated turns, the model may be simplified to as few as 7 directions (left, right, forward, backward, upward, downward and maintaining static). The number of the directions has to be appropriately chosen in practice to strike a tradeoff between the accuracy and algorithmic complexity. Additionally, we assume that the transmit power of the UAVs only has 3 values, namely 0.08 W, 0.09 W and 0.1 W.
Remark 6: In the real application of UAVs as aerial base stations, they can fly in arbitrary directions, but we constrain their mobility to as few as 7 directions.
We choose the 3D position of the UAVs (horizontal coordinates and altitudes) and the transmit power to define their states. The actions of each agent are determined by a set of coordinates for specifying their travel directions and the candidate transmit power of the UAVs. Explicitly, (1, 0, 0) means that the UAV turns right; (−1, 0, 0) indicates that the UAV turns left; (0, 1, 0) represents that the UAV flies forward; (0, −1, 0) means that the UAV flies backward; (0, 0, 1) implies that the UAV rises; (0, 0, −1) means that the UAV descends; (0, 0, 0) indicates that the UAV stays static. In terms of power, we assume 0.08 W, 0.09 W and 0.1 W. Again, we set the initial transmit power to 0.08 W, and each UAV carries out an action from the set increase, decrease and maintain at each time slot. Then, the entire action space has as few as 3 × 7 = 21 elements.

rather than the single-agent Q-function, $Q(s, a)$. Given the extended notion of the Q-function, we define the Q-value as the expected sum of discounted rewards when all agents follow specific strategies from the next period. This definition differs from the single-agent model, where the future rewards are simply based on the agent’s own optimal strategy. More precisely, we refer to $Q_n^{*}$ as the Q-function for agent $n$.
Remark 8: The difference of the multi-agent model compared to the single-agent model is that the reward function of the multi-agent model is dependent on the joint action of all agents $\vec{a}$.
Sparked by Remark 8, the update rule has to obey
$$Q_n(s_n, a_n) \leftarrow (1 - \alpha)\, Q_n(s_n, a_n) + \alpha \left[ r_n(s_n, \vec{a}) + \beta \max_{b \in \mathcal{A}_n} Q_n(s'_n, b) \right]. \tag{15}$$
The $n$th agent shares the row of its Q-table that corresponds to its current state with all other cooperating agents $j$, $j = 1, \ldots, N$. Then the $n$th agent selects its action according to
Fig. 6. Comparison of real tracks and predicted tracks for different neuron reservoir size.
TABLE II
SIMULATION PARAMETERS
TABLE III
PERFORMANCE COMPARISON BETWEEN ESN ALGORITHM AND BENCHMARKS
Fig. 7. Convergence of the proposed algorithm vs. the number of training episodes.
Fig. 9. Positions of the users and the UAVs as well as the trajectory design of UAVs both with and without power control.
long short term memory (LSTM) model are also used as our
benchmarks. It can be observed that the ESN having a neuron
reservoir size of 1000 attains a lower MSE than the HA model
and the LSTM model, even though the complexity of the ESN
model is far lower than that of the LSTM model. Overall, the
proposed ESN algorithm outperforms the benchmarks.
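To make the learning scheme of Section IV concrete, the following minimal sketch assembles its per-agent building blocks: the 21-element action space (7 directions × 3 power adjustments), the sum-rate-based reward of (14) and the update rule of (15). It is an illustration under assumptions, not the authors' implementation: the Q-table is a plain dictionary keyed by hypothetical `(state, action)` pairs, unseen entries default to zero, and the sharing of Q-table rows among cooperating agents is not modeled. The function and variable names are introduced here for illustration only.

```python
import itertools
import math

# Seven flight directions as (dx, dy, dh), following the description in the text.
DIRECTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0),
              (0, 0, 1), (0, 0, -1), (0, 0, 0)]
POWER_STEPS = ("increase", "decrease", "maintain")
ACTIONS = list(itertools.product(DIRECTIONS, POWER_STEPS))  # 7 x 3 = 21 actions

def sum_rate(bandwidths, powers, gains, interference, noise_power):
    """Sum over the users of B * log2(1 + p * g / (I + sigma^2))."""
    return sum(b * math.log2(1.0 + p * g / (i + noise_power))
               for b, p, g, i in zip(bandwidths, powers, gains, interference))

def reward(new_sum_rate, old_sum_rate):
    """Eq. (14): the new sum rate if it did not decrease, otherwise zero."""
    return new_sum_rate if new_sum_rate >= old_sum_rate else 0.0

def q_update(q_table, state, action, r, next_state, alpha, beta):
    """Eq. (15): blend the old Q-value with the reward plus discounted best next value."""
    best_next = max(q_table.get((next_state, b), 0.0) for b in ACTIONS)
    old_value = q_table.get((state, action), 0.0)
    q_table[(state, action)] = (1.0 - alpha) * old_value + alpha * (r + beta * best_next)
    return q_table[(state, action)]

# Tiny worked example with hypothetical numbers.
# States are (x, y, h, P) tuples, mirroring the four-tuple state of the text.
s0, s1 = (0, 0, 100, 0.08), (1, 0, 100, 0.08)
q = {(s1, ACTIONS[0]): 2.0}
rate = sum_rate([1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [0.0, 0.0], 1.0)  # 2 users, SINR = 1
updated = q_update(q, s0, ACTIONS[0], reward(rate, 1.5), s1, alpha=0.5, beta=0.9)
print(len(ACTIONS), rate, updated)
```

Keeping the Q-table as a sparse dictionary avoids allocating an entry for every combination of position, altitude and power level before it has been visited, which matters because the state space grows with the product of all four state dimensions.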
Then, we improve the results to multi-agent domain. We as- [5] A. Osseiran et al., “Scenarios for 5G mobile and wireless communications:
sume that, there is an initial card which contains an initial value The vision of the METIS project,” IEEE Commun. Mag., vol. 52, no. 5,
pp. 26–35, May 2014.
M Qn∗ (sn , 0 , an ) at the bottom of the multi deck. The Q [6] Y. Zeng, R. Zhang, and T. J. Lim, “Wireless communications with un-
value for episode 0 in multi-agent algorithm has the same as manned aerial vehicles: Opportunities and challenges,” IEEE Commun.
an initial value, which is expressed as M Qn∗ (sn , 0 , an ) = Mag., vol. 54, no. 5, pp. 36–42, May 2016.
[7] Q. Wang, Z. Chen, H. Li, and S. Li, “Joint power and trajectory design for
M Qn0 (sn , an ). physical-layer secrecy in the UAV-aided mobile relaying system,” IEEE
For episode k, an optimal value is equivalent to the Q Access, vol. 6, pp. 62849–62855, 2018.
value, M Qn∗ (sn , k , an ) = M Qnk (sn , an ). Next, we con- [8] W. Yi, Y. Liu, E. Bodanese, A. Nallanathan, and G. K. Karagiannidis,
“A unified spatial framework for UAV-aided MmWave networks,” unpub-
sider a value function which selects optimal action by using an lished paper, 2019, arXiv:1901.01432.
equilibrium strategy. At the k level, an optimal value function is [9] S. Kandeepan, K. Gomez, L. Reynaud, and T. Rasheed, “Aerial-terrestrial
the same as the Q value function. communications: Terrestrial cooperation and energy-efficient transmis-
sions to aerial base stations,” IEEE Trans. Aerosp. Electron. Syst., vol. 50,
no. 4, pp. 2715–2735, Dec. 2014.
V n∗ (sn+1 , k) = Vkn (sn+1 ) [10] D. Yang, Q. Wu, Y. Zeng, and R. Zhang, “Energy tradeoff in ground-to-
⎡ ⎤ UAV communication via trajectory design,” IEEE Trans. Veh. Technol.,
n vol. 67, no. 7, pp. 6721–6726, Jul. 2018.
= EQnk ⎣ βj × max M Qnk (sn+1 , an+1 )⎦ . (B.5) [11] S. Zhang, Y. Zeng, and R. Zhang, “Cellular-enabled UAV communi-
j=1 cation: A connectivity-constrained trajectory optimization perspective,”
IEEE Trans. Commun., vol. 67, no. 3, pp. 2580–2604, 2018.
[12] S. Katikala, “Google project loon,” InSight, Rivier Academic J., vol. 10,
One of the agents maintains the previous Q value for episode no. 2, pp. 1–6, 2014.
k + 1. Then, we have [13] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Wireless communi-
cation using unmanned aerial vehicles (UAVs): Optimal transport theory
for hover time optimization,” IEEE Trans. Wireless Commun., vol. 16, no. 12, pp. 8052–8066, Dec. 2017.
[14] X. Zhang and L. Duan, “Optimization of emergency UAV deployment for providing wireless coverage,” in Proc. IEEE Global Commun. Conf., Dec. 2017, pp. 1–6.
[15] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Drone small cells in the clouds: Design, deployment and performance analysis,” in Proc. IEEE Global Commun. Conf., 2015, pp. 1–6.
[16] B. Van der Bergh, A. Chiumento, and S. Pollin, “LTE in the sky: Trading off propagation benefits with interference costs for aerial nodes,” IEEE Commun. Mag., vol. 54, no. 5, pp. 44–50, May 2016.
[17] Y. Zeng and R. Zhang, “Energy-efficient UAV communication with trajectory optimization,” IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 3747–3760, Mar. 2017.
[18] L. Liu, S. Zhang, and R. Zhang, “CoMP in the sky: UAV placement and movement optimization for multi-user communications,” IEEE Trans. Commun., pp. 1–14, 2019.
[19] R. Sun and D. W. Matolak, “Air-ground channel characterization for unmanned aircraft systems part II: Hilly and mountainous settings,” IEEE Trans. Veh. Technol., vol. 66, no. 3, pp. 1913–1925, Mar. 2017.
[20] L. Bing, “Study on modeling of communication channel of UAV,” Procedia Comput. Sci., vol. 107, pp. 550–557, 2017.
[21] N. Zhang, P. Yang, J. Ren, D. Chen, L. Yu, and X. Shen, “Synergy of big data and 5G wireless networks: Opportunities, approaches, and challenges,” IEEE Wireless Commun., vol. 25, no. 1, pp. 12–18, Feb. 2018.
[22] B. Yang, W. Guo, B. Chen, G. Yang, and J. Zhang, “Estimating mobile traffic demand using Twitter,” IEEE Wireless Commun. Lett., vol. 5, no. 4, pp. 380–383, Aug. 2016.
[23] A. Galindo-Serrano and L. Giupponi, “Distributed Q-Learning for aggregated interference control in cognitive radio networks,” IEEE Trans. Veh. Technol., vol. 59, no. 4, pp. 1823–1834, May 2010.
[24] J. Kosmerl and A. Vilhar, “Base stations placement optimization in wireless networks for emergency communications,” in Proc. IEEE Int. Commun. Conf., 2014, pp. 200–205.

$$MQ^{n}_{k+1}(s_n, a_n) = MQ^{n}_{k}(s_n, a_n) = MQ^{n*}_{k}(s_n, k, a_n) = MQ^{n*}_{k}(s_n, k+1, a_n). \tag{B.6}$$

Otherwise, the Q value holds the previous multi-agent Q value with probability of $1 - \alpha^{n}_{k+1}$ and takes two types of rewards with the probability of $\alpha^{n}_{k+1}$. Then we have

$$\begin{aligned}
MQ^{n*}(s_n, k+1, a_n)
&= (1 - \alpha^{n}_{k+1})\, MQ^{n*}_{k}(s_n, k, a_n) + \alpha^{n}_{k+1} \Big[ r^{n}_{k+1} + \beta \sum_{s_{n+1}} P^{n}_{s_n \to s_{n+1}}[a_n]\, V^{n}_{k}(s_{n+1}) \Big] \\
&= (1 - \alpha^{n}_{k+1})\, MQ^{n}_{k}(s_n, a_n) + \alpha^{n}_{k+1} \Big[ r^{n}_{k+1} + \beta \sum_{s_{n+1}} P^{n}_{s_n \to s_{n+1}}[a_n]\, V^{n}_{k}(s_{n+1}) \Big] \\
&= MQ^{n}_{k+1}(s_n, a_n).
\end{aligned} \tag{B.7}$$

In this case, if $MQ^{n}_{k+1}(s_n, a_n)$ converges to an optimal value $MQ^{n*}(s_n, k+1, a_n)$, then a state equation of multi-agent Q-learning $MQ_{k+1}$ converges to an optimal state equation $MQ^{*}[k+1]$.

The proof is completed.
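The update in (B.7) can be illustrated numerically. The following is a minimal sketch rather than the authors' implementation: it assumes a single agent, a hypothetical two-state, two-action model with made-up transition probabilities `P` and rewards `R`, a constant step size `alpha` in place of the iteration-dependent $\alpha^{n}_{k+1}$, and $V_k(s') = \max_{a} Q_k(s', a)$. All function names are illustrative.

```python
# Toy 2-state, 2-action model. Every number here is an illustrative assumption.
P = {  # P[s][a] -> transition probabilities to next states 0 and 1
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}  # reward r(s, a)
BETA = 0.9  # discount factor (beta in (B.7))


def backup(Q, s, a):
    """Bracketed term of (B.7): r + beta * sum_{s'} P(s -> s' | a) * V(s')."""
    V = [max(Q[s2].values()) for s2 in sorted(Q)]  # assumed V(s') = max_a Q(s', a)
    return R[s][a] + BETA * sum(p * v for p, v in zip(P[s][a], V))


def q_update(Q, alpha):
    """One synchronous sweep of Q <- (1 - alpha) * Q + alpha * backup."""
    return {
        s: {a: (1 - alpha) * Q[s][a] + alpha * backup(Q, s, a) for a in Q[s]}
        for s in Q
    }


def run(iterations=2000, alpha=0.5):
    """Iterate the update from an all-zero table."""
    Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
    for _ in range(iterations):
        Q = q_update(Q, alpha)
    return Q


def residual(Q):
    """Largest gap between Q and its own backup; zero at the fixed point."""
    return max(abs(Q[s][a] - backup(Q, s, a)) for s in Q for a in Q[s])


if __name__ == "__main__":
    Q_final = run()
    print("residual:", residual(Q_final))
```

Once the residual is negligible, applying `q_update` with any step size leaves the table unchanged, which mirrors the fixed-point relation $MQ^{n}_{k+1} = MQ^{n}_{k}$ in (B.6). This sketch uses the expected backup with a known model, whereas sample-based Q-learning would replace the sum over next states by observed transitions.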
REFERENCES
[1] X. Liu, Y. Liu, and Y. Chen, “Machine learning aided trajectory design and power control of multi-UAV,” in Proc. IEEE Global Commun. Conf., 2019, pp. 1–6.
[2] Y. Zhou et al., “Improving physical layer security via a UAV friendly jammer for unknown eavesdropper location,” IEEE Trans. Veh. Technol., vol. 67, no. 11, pp. 11 280–11 284, Nov. 2018.
[3] W. Khawaja, O. Ozdemir, and I. Guvenc, “UAV air-to-ground channel characterization for mmwave systems,” in Proc. IEEE Veh. Technol. Conf., 2017, pp. 1–5.
[4] F. Cheng et al., “UAV trajectory optimization for data offloading at the edge of multiple cells,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6732–6736, Jul. 2018.
[25] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP altitude for maximum coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, Jul. 2014.
[26] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, “3D placement of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient maximal coverage,” IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 434–437, May 2017.
[27] P. K. Sharma and D. I. Kim, “UAV-enabled downlink wireless system with non-orthogonal multiple access,” in Proc. IEEE GLOBECOM Workshops, Dec. 2017, pp. 1–6.
[28] M. F. Sohail, C. Y. Leow, and S. Won, “Non-orthogonal multiple access for unmanned aerial vehicle assisted communication,” IEEE Access, vol. 6, pp. 22716–22727, 2018.
[29] T. Hou, Y. Liu, Z. Song, X. Sun, and Y. Chen, “Multiple antenna aided NOMA in UAV networks: A stochastic geometry approach,” IEEE Trans. Commun., vol. 67, no. 2, pp. 1031–1044, 2018.
[30] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs,” IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 3949–3963, Jun. 2016.
[31] M. Mozaffari, W. Saad, and M. Bennis, “Efficient deployment of multiple unmanned aerial vehicles for optimal wireless coverage,” IEEE Commun. Lett., vol. 20, no. 8, pp. 1647–1650, Jun. 2016.
[32] J. Lyu, Y. Zeng, and R. Zhang, “Cyclical multiple access in UAV-Aided communications: A throughput-delay tradeoff,” IEEE Wireless Commun. Lett., vol. 5, no. 6, pp. 600–603, Dec. 2016.
[33] Q. Wu, Y. Zeng, and R. Zhang, “Joint trajectory and communication design for multi-UAV enabled wireless networks,” IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 2109–2121, Mar. 2018.
[34] Q. Wu and R. Zhang, “Common throughput maximization in UAV-enabled OFDMA systems with delay consideration,” IEEE Trans. Commun., vol. 66, no. 12, pp. 6614–6627, 2018.
[35] S. Zhang, Y. Zeng, and R. Zhang, “Cellular-enabled UAV communication: Trajectory optimization under connectivity constraint,” in Proc. IEEE Int. Commun. Conf., 2018, pp. 1–6.
[36] H. He, S. Zhang, Y. Zeng, and R. Zhang, “Joint altitude and beamwidth optimization for UAV-enabled multiuser communications,” IEEE Commun. Lett., vol. 22, no. 2, pp. 344–347, Feb. 2018.
[37] Q. Liu et al., “Charging unplugged: Will distributed laser charging for mobile wireless power transfer work?” IEEE Veh. Technol. Mag., vol. 11, no. 4, pp. 36–45, Dec. 2016.
[38] X. Liu, Y. Liu, and Y. Chen, “Deployment and movement for multiple aerial base stations by reinforcement learning,” in Proc. IEEE Global Commun. Conf., Dec. 2018, pp. 1–6.
[39] J. Ren, G. Zhang, and D. Li, “Multicast capacity for VANETs with directional antenna and delay constraint under random walk mobility model,” IEEE Access, vol. 5, pp. 3958–3970, 2017.
[40] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong, “Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience,” IEEE J. Sel. Areas Commun., vol. 35, no. 5, pp. 1046–1061, May 2017.
[41] M. Chen, W. Saad, C. Yin, and M. Debbah, “Echo state networks for proactive caching in cloud-based radio access networks with mobile users,” IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 3520–3535, Jun. 2017.
[42] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Reward function and initial values: Better choices for accelerated goal-directed reinforcement learning,” in Proc. Int. Conf. Artif. Neural Netw., 2006, pp. 840–849.
[43] T. Jaakkola, M. I. Jordan, and S. P. Singh, “Convergence of stochastic iterative dynamic programming algorithms,” in Proc. Advances Neural Inf. Process. Syst., 1994, pp. 703–710.
[44] M. L. Littman and C. Szepesvári, “A generalized reinforcement-learning model: Convergence and applications,” in ICML, vol. 96, 1996, pp. 310–318.

Xiao Liu (S’18) received the B.S. and M.S. degrees in 2013 and 2016, respectively. He is currently working toward the Ph.D. degree with the Communication Systems Research Group, School of Electronic Engineering and Computer Science, Queen Mary University of London, London, U.K.
His research interests include unmanned aerial vehicle aided networks, machine learning, nonorthogonal multiple access techniques, and autonomous driving.

Yuanwei Liu (S’13–M’16–SM’19) received the B.S. and M.S. degrees from the Beijing University of Posts and Telecommunications, Beijing, China, in 2011 and 2014, respectively, and the Ph.D. degree in electrical engineering from the Queen Mary University of London, London, U.K., in 2016. He was with the Department of Informatics, King’s College London, from 2016 to 2017, where he was a Postdoctoral Research Fellow. He has been a Lecturer (Assistant Professor) with the School of Electronic Engineering and Computer Science, Queen Mary University of London, since 2017. His research interests include 5G and beyond wireless networks, Internet of Things, machine learning, and stochastic geometry.
Mr. Liu received the Exemplary Reviewer Certificate of the IEEE WIRELESS COMMUNICATION LETTERS in 2015, the IEEE TRANSACTIONS ON COMMUNICATIONS in 2016 and 2017, and the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS in 2017. Currently, he is serving as an Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS, IEEE COMMUNICATION LETTERS, and the IEEE ACCESS. He is also a Guest Editor for IEEE Journal of Selected Topics in Signal Processing special issue on “Signal Processing Advances for Nonorthogonal Multiple Access in Next-Generation Wireless Networks”. He was the Publicity Co-Chair for VTC2019-Fall and a TPC Member for many IEEE conferences, such as GLOBECOM and ICC.

Yue Chen (S’02–M’03–SM’15) received the bachelor’s and master’s degrees from the Beijing University of Posts and Telecommunications, Beijing, China, in 1997 and 2000, respectively, and the Ph.D. degree from the Queen Mary University of London (QMUL), London, U.K., in 2003.
She is a Professor of telecommunications engineering with the School of Electronic Engineering and Computer Science, QMUL, U.K. Her current research interests include intelligent radio resource management for wireless networks, cognitive and cooperative wireless networking, mobile edge computing, HetNets, smart energy systems, and Internet of Things.

Lajos Hanzo (F’04) received the five-year degree in electronics in 1976 and the doctorate degree from the Technical University of Budapest, Budapest, Hungary, in 1983. In 2009, he was awarded an honorary doctorate by the Technical University of Budapest and in 2015 by the University of Edinburgh. In 2016, he joined the Hungarian Academy of Science. During his 40-year career in telecommunications he has held various research and academic posts in Hungary, Germany, and the U.K. Since 1986, he has been with the School of Electronics and Computer Science, University of Southampton, U.K., where he holds the chair in telecommunications. He has successfully supervised 119 Ph.D. students, coauthored 18 Wiley/IEEE Press books on mobile radio communications totaling in excess of 10 000 pages, published 1800+ research contributions at IEEE Xplore, acted both as TPC and General Chair of IEEE conferences, presented keynote lectures and has been awarded a number of distinctions. He is currently directing a 60-strong academic research team, working on a range of research projects in the field of wireless multimedia communications sponsored by industry, the Engineering and Physical Sciences Research Council U.K., the European Research Council’s Advanced Fellow Grant and the Royal Society’s Wolfson Research Merit Award. He is an enthusiastic supporter of industrial and academic liaison and he offers a range of industrial courses. He is also the Governor of the IEEE ComSoc and VTS. He is a former Editor-in-Chief of the IEEE Press and a former Chaired Professor at Tsinghua University, Beijing. He is a Fellow of the Royal Academy of Engineering, Institution of Engineering and Technology, and EURASIP.