
This article has been accepted for publication in IEEE Transactions on Network Science and Engineering.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TNSE.2024.3350710

Federated Deep Reinforcement Learning for Task Offloading in Digital Twin Edge Networks
Yueyue Dai, Member, IEEE, Jintang Zhao, Jing Zhang, Yan Zhang, Fellow, IEEE
and Tao Jiang, Fellow, IEEE

Abstract—Digital twin edge networks provide a new paradigm that combines mobile edge computing (MEC) and digital twins to improve network performance and reduce communication cost by utilizing digital twin models of physical objects. The construction of digital twin models requires powerful computing ability. However, distributed devices with limited computing resources cannot complete high-fidelity digital twin construction. Moreover, weak communication links between these devices may hinder the potential of digital twins. To address these issues, we propose a two-layer digital twin edge network, in which the physical network layer offloads training tasks using passive reflecting links, and the digital twin layer establishes a digital twin model to record the dynamic states of physical components. We then formulate a system cost minimization problem to jointly optimize task offloading, configurations of passive reflecting links, and computing resources. Finally, we design a federated deep reinforcement learning (DRL) scheme to solve the problem, where local agents train offloading decisions and global agents optimize the allocation of edge computing resources and configurations of passive reflecting elements. Numerical results show the effectiveness of the proposed federated DRL, which can reduce the system cost by up to 67.1% compared to the benchmarks.

Index Terms—Digital twin edge networks, task offloading, federated deep reinforcement learning

This work was supported in part by the National Natural Science Foundation of China under Grant 62201219 and the State Key Laboratory of Rail Traffic Control and Safety (Contract No. RCS2023K010), Beijing Jiaotong University.
Y. Dai, J. Zhao, and T. Jiang are with the Research Center of 6G Mobile Communications, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China (email: [email protected], jintang [email protected], [email protected]).
J. Zhang is with the Institute of Space Integrated Ground Network, Hefei 230088, China (email: [email protected]).
Y. Zhang is with the Department of Informatics, University of Oslo, Norway (email: [email protected]).

I. INTRODUCTION

The rapid development of the Internet of Things (IoT), the fast deployment of wireless communication infrastructures, and the popularity of artificial intelligence (AI) have paved the way towards supporting new applications, intelligent services, and superior network performance in 6G networks. However, the large-scale and complex distribution of network entities, the diverse and stringent performance requirements of devices, and the dynamics of network status make network resource management, on-demand service design, and fine-grained network arrangement face enormous challenges. Moreover, due to the lack of effective virtual simulation, complex network optimizations have to be applied directly to the existing network infrastructures to address the challenges in efficient and flexible network coordination.

Digital twin (DT) is a promising technology to tackle network resource management [1]. Generally, digital twin models collect sensor data, channel states, and environment characteristics from physical networks, predict the mobile behaviors of physical objects, detect the dynamic changes of traffic, and then design optimization strategies. Through the mutual communication between physical objects and virtual twins, digital twin models can support intelligent decision-making and highly efficient operation, thereby helping the network achieve extremely simplified and intelligent network orchestration [2]. However, the realization of digital twins is challenging because the construction of digital twin models requires plenty of data and abundant computing power for high-accuracy model construction. In reality, distributed devices contain massive data but their computing resources are limited, while base stations with powerful computing resources cannot provide high-quality raw data. To this end, a digital twin edge network is proposed to integrate digital twins with edge computing, to gather massive local data at the network edge and provide powerful task-processing ability in proximity to devices [3], [4]. In digital twin edge networks, devices can offload real-time collected data or virtual model training tasks to edge servers, and edge servers (e.g., base stations) can build virtual twins based on the offloaded information [5]. Further, edge nodes can also perceive the dynamic states of physical items to maintain high consistency of the virtual models [6].

Since DT can help mobile edge computing (MEC) networks monitor and predict the network states, the combination of DT and MEC can effectively help the MEC system build the network topology and make resource allocation decisions. In [7], a DT-empowered MEC architecture for industrial automation was proposed to minimize the end-to-end delay by iteratively optimizing the transmit power of IoT devices, user association, and intelligent task offloading. In [8], the authors proposed a novel architecture for the industrial IoT by combining the DT capturing industrial features with federated learning to enhance the learning performance under resource constraints. However, although edge servers deploy powerful computing capabilities near devices, the weak communication links and complex signal propagation environment hinder the optimal potential of digital twin-based edge networks.

Deep reinforcement learning (DRL), as an intelligent method, has been widely adopted to solve complex optimization problems with dynamic features. Since DRL can leverage its agent to interact with the varying network and utilize deep neural networks to explore the complex relationships between different variables, it can design optimal


schemes with real-time state updates for complex networks [9], [10]. For instance, in [11], DRL was used to allocate wireless resource blocks to maximize network capacity while ensuring load balancing. In [12], a DRL-based strategy was proposed to learn the joint decisions of model partitioning and resource allocation for edge intelligence. However, traditional DRL operates in a centralized manner, in which data is collected by a central server for unified decision-making. This centralized operation lacks flexibility and scalability. Compared with DRL, federated DRL is an efficient distributed machine learning approach that aims to solve the collaborative decision-making problem with multiple agents.

In this paper, we utilize federated DRL to jointly optimize task offloading and computing resources in digital twin edge networks. We first propose digital twin edge networks with a physical network layer and a digital twin layer. The physical network layer includes distributed devices, passive reflecting elements, and base stations, in which devices offload training tasks to powerful base stations with the assistance of passive reflecting links. Each physical component has a replica digital twin model to record its state in the digital twin layer. Then, we utilize digital twin models to build the communication and computing models and formulate a system cost minimization problem to jointly optimize task offloading and computing resources. To solve the formulated problem, we develop a federated DRL-based scheme with efficient cooperation between devices and base stations. We summarize the main contributions of this paper as follows:

• We propose a digital twin edge network to allow resource-constrained devices to offload training tasks to base stations. To perceive the status of network entities and the dynamic characteristics of channels, we construct digital twin models of physical objects to record computing capability and channel states with estimation divergence.

• The joint optimization of multi-device task offloading, configurations of passive reflecting elements, and computing resources is formulated to minimize the system cost under a stringent delay constraint. Due to its ability to interact with the environment, we leverage DRL to explore feasible solutions under dynamic changes.

• A federated DRL scheme is designed to minimize the system cost, in which a global agent explores edge computation resource allocation and configurations of passive reflecting elements, and a local agent trains the offloading decision by estimating local computing ability. Numerical results demonstrate that the proposed federated DRL algorithm can accelerate the convergence speed.

The remainder of this paper is organized as follows. We review related work in Section II. We present the digital twin edge networks and formulate the system cost minimization problem in Section III. The federated DRL algorithm to jointly optimize task offloading and resource allocation is designed in Section IV. Numerical results are shown in Section V. Finally, the paper is concluded in Section VI.

II. RELATED WORK

To fulfill the requirements of future edge networks, a series of technologies have emerged, such as digital twin, machine learning, federated learning, and edge AI. Among them, digital twin and machine learning have been widely studied in edge networks to solve the problems of computation offloading, parameter configuration, and resource allocation. In this section, we present the latest research on digital twin and DRL for edge networks.

A. Application of Digital Twin in Edge Networks

Digital twin refers to an exact digital replica of a real-world object across multiple granularity levels. Specifically, DT is a digital representation of a physical entity or process that combines data processing techniques, mathematical models, machine learning algorithms, and other technologies to monitor, control, optimize, and predict the behavior of a physical entity throughout its entire lifecycle [13]. Recently, there has been increasing attention on integrating DT with edge networks. The existing works on digital twin edge networks mainly pay attention to low-latency task offloading [14], heterogeneous resource allocation [15], reliable edge association [16], and incentive mechanisms for model training [17]. More specifically, in [14], a digital twin edge network was proposed to estimate user mobility and predict network states, and a task offloading method was designed to reduce latency by utilizing the predicted user mobility from digital twins. The authors in [18] considered digital twin-assisted edge networks and jointly optimized transmit power, association, and offloading decisions to minimize task offloading latency by using an alternating optimization approach. The authors in [15] utilized digital twin-driven edge networks to allocate diverse resources to meet the dynamic service requirements of devices. In [16], digital twins were deployed at the edge servers to collect data from distributed devices. The authors in [17] conceived a lightweight space-air-ground digital twin network to build a ground digital twin model with real-time states and designed an incentive scheme to motivate ground nodes to take part in model training.

B. DRL for Computation Offloading

DRL is a promising technique to address problems with uncertain and stochastic features, and it can also be utilized to solve sophisticated optimization problems. Recently, plenty of studies have utilized DRL for optimizing computation offloading and allocating resources. The authors in [19] proposed a DRL-based method for joint server selection, cooperative offloading, and switching in dynamic environments. In [20], the authors proposed an intelligent online offloading framework based on reinforcement learning to maximize the computation rate of the system. In [21], a wireless-powered MEC network architecture that includes edge computing servers and multiple edge devices was proposed; the authors employed a DRL-based algorithm to jointly optimize the wireless power transmission duration, the transmission time allocation for each edge device, and the partial offloading decisions to achieve the maximum sum computation rate. The authors in [22] investigated the resource allocation problem in multi-unmanned aerial vehicle (UAV) assisted


uplink communication scenarios. By utilizing DRL to optimize UAV movement and multi-user association, they aimed to minimize the weighted sum of the total system delay and energy consumption.

There are few studies that apply federated DRL to optimize the computation offloading and resource management of edge networks. The authors in [23] proposed hierarchical federated learning with MEC servers, which utilizes a deep deterministic policy gradient algorithm to jointly optimize latency, energy consumption, and model accuracy. In [24], to provide effective task offloading and energy-saving strategies in uncertain aerial environments, an offloading approach based on adaptive federated DRL was employed, ensuring privacy protection and offloading decisions.

Fig. 1: The framework of digital twin edge networks.

III. SYSTEM MODEL

A. Digital Twin Edge Network

We propose a digital twin edge network consisting of a physical network layer and a digital twin layer, as shown in Fig. 1. The physical layer includes three kinds of distributed network objects: base stations, passive reflecting elements, and mobile devices. Each base station is equipped with computing and caching resources to support edge computing services for mobile devices via wireless communications. Due to the complex propagation environment, it is unreliable to use weak wireless communication links to offload computing tasks because of long transmission distances, obstacles, multipath effects, or interference from other wireless signals. Thus, passive reflecting elements are introduced to assist communications through the reflecting link and establish a high-quality communication link between base stations and mobile devices. Mobile devices are distributed under the coverage area of the base stations, and each of them needs to execute a machine learning-based training task within a deadline.

In the digital twin layer, we consider that each physical object corresponds to a digital mirror model, also called the digital twin model. Through the real-time perception of network models and parameters, DT can perform accurate prediction, estimation, and analysis. Based on the physical layer, we define three types of DT models: 1) devices' digital twin models, 2) DT models of passive reflecting elements, and 3) DT models of base stations. For the DT models of devices, each one is a replica of its physical element, including the real-time battery state, training task, and local computing resources. Since digital twins cannot fully record the state of local computing resources, there is a deviation between the true value and the estimated value. For device i, we can use a 4-tuple digital twin model to characterize its features, as

$DT_{dv}(i) = \{b_i(t), D_i, f_i^{loc}(t) + \hat{f}_i^{loc}(t)\}$, (1)

where $b_i(t)$ is defined as the battery state of device i at the t-th time slot. $D_i = (d_i, c_i, \tau_i, \varrho_i)$ denotes the training task of device i, where $d_i$ is the training task's size, $c_i$ denotes the computing resource needed to train one bit, $\tau_i$ is the deadline to complete the training task, and $\varrho_i$ is the required training accuracy, respectively. $f_i^{loc}(t)$ is the estimated local computing resource and $\hat{f}_i^{loc}(t)$ is the local computing resource deviation. The key behavior of passive reflecting elements is to improve the wireless propagation environment, and their key parameters are the amplitude and phase shift. Thus, we define the digital twin model of each passive reflecting element as

$DT_{psrf}(k) = \{\theta_k(t) + \hat{\theta}_k(t), \beta_k(t) + \hat{\beta}_k(t)\}$, (2)

where $\theta_k(t)$ is the estimated phase shift at time slot t, $\hat{\theta}_k(t)$ is its phase shift deviation, $\beta_k(t)$ is the estimated amplitude, and $\hat{\beta}_k(t)$ is its amplitude deviation, respectively.

Base stations provide edge computing services via the wireless link. Thus, we consider that the digital twin of a base station is characterized by its communication and computing resource states. For base station m, its digital twin can be expressed as

$DT_{bs}(m) = \{B_m, f_m^{bs}(t) + \hat{f}_m^{bs}(t)\}$, (3)

where $B_m$ is the channel bandwidth of BS m, $f_m^{bs}(t)$ denotes the estimated computing resources of base station m, and $\hat{f}_m^{bs}(t)$ is its computing resource deviation.
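The three digital twin models in Eqs. (1)-(3) can be viewed as simple records that pair estimated quantities with their deviations. The Python sketch below is purely illustrative; the class and field names are our own and are not notation from the paper.

```python
from dataclasses import dataclass

@dataclass
class TrainingTask:            # D_i = (d_i, c_i, tau_i, rho_i)
    data_size: float           # d_i, size of the training data
    cycles_per_bit: float      # c_i, computing resource needed per bit
    deadline: float            # tau_i, completion deadline
    accuracy: float            # rho_i, required training accuracy

@dataclass
class DeviceTwin:              # Eq. (1)
    battery: float             # b_i(t)
    task: TrainingTask         # D_i
    f_loc_est: float           # f_i^loc(t), estimated local computing resource
    f_loc_dev: float           # deviation hat f_i^loc(t)

    @property
    def f_loc(self) -> float:  # value used in the delay/energy models of Section III-C
        return self.f_loc_est + self.f_loc_dev

@dataclass
class ReflectorTwin:           # Eq. (2)
    theta_est: float           # estimated phase shift
    theta_dev: float           # phase shift deviation
    beta_est: float            # estimated amplitude
    beta_dev: float            # amplitude deviation

@dataclass
class BaseStationTwin:         # Eq. (3)
    bandwidth: float           # B_m
    f_bs_est: float            # f_m^bs(t)
    f_bs_dev: float            # deviation hat f_m^bs(t)

    @property
    def f_total(self) -> float:  # F_m^total used in Section III-C
        return self.f_bs_est + self.f_bs_dev
```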


B. Digital Twin for Communication Model

Since the computing resources of devices are limited, training tasks will be offloaded to base stations via enhanced wireless links. As shown in Fig. 1, there are two communication links: one is the direct communication link from the device to the base station, and the other is the device-passive reflecting element-base station link. We assume that there are I antennas on each base station and one antenna on each device. Thus, the direct device-base station channel vector can be denoted as $h_i^d \in \mathbb{C}^{I\times 1}$. The reflecting link is a concatenation of the device-to-passive-element link, the passive element reflection, and the passive-element-to-base-station link. Assume that each passive reflecting element has K items. The channel vectors of the device-to-passive-element link and the passive-element-to-base-station link can be denoted as $h_i^r \in \mathbb{C}^{K\times 1}$ and $G \in \mathbb{C}^{I\times K}$, respectively. Based on the digital twin model of a passive element, the reflection coefficients are defined as a diagonal matrix $\Phi = \mathrm{diag}(\ldots, (\beta_k(t) + \hat{\beta}_k(t))e^{j(\theta_k(t)+\hat{\theta}_k(t))}, \ldots)$. When passive reflecting elements serve the target device, the signals from other devices become interference to the target signal. We consider there are N devices, and the channel vector from device j is either $h_j^d \in \mathbb{C}^{I\times 1}$ ($\forall j \in \{1,\ldots,N\}, j \neq i$) or $h_j^r \in \mathbb{C}^{K\times 1}$. Thus, the signal received at the BS from device i is expressed as

$y_i = \sqrt{p_i}(h_i^d + G\Phi h_i^r)s_i + \sum_{j=1, j\neq i}^{N} \sqrt{p_j}(h_j^d + G\Phi h_j^r)s_j + n_i$, (4)

where $p_i$ and $p_j$ are the transmit powers of device i and device j, and $s_i$ and $s_j$ are the signals sent from device i and device j, respectively. $n_i$ is the noise, following the Gaussian distribution $\mathcal{N}(0, \sigma^2)$. In Eq. (4), the first item is the desired signal from the direct link and the reflection link, and the second item is the interference from other devices. The base station uses linear beamforming to decode the transmitted signal. Let $w_i$ be the beamforming vector for device i. We enhance the signal in the target direction and suppress multipath signals from other directions by modifying the matrix of beamforming coefficients. The uplink SINR for task offloading can be denoted as

$\gamma_i = \dfrac{p_i |w_i^H (h_i^d + G\Phi h_i^r)|^2}{\sum_{j=1, j\neq i}^{N} p_j |w_i^H (h_j^d + G\Phi h_j^r)|^2 + \sigma^2}$, (5)

Based on Eq. (5), we can adjust the reflection coefficient matrix $\Phi$ to compensate for the effects of signal fading and distortion caused by factors such as multipath propagation, long-distance transmission, and obstacles. Specifically, we can adjust the amplitude and phase in $\Phi$ to strengthen the channel gain of the target direction and suppress multipath signals from other directions. Thus, the transmission data rate for task offloading is denoted as

$R_{im} = B_m \log_2\left(1 + \dfrac{p_i |w_i^H (h_i^d + G\Phi h_i^r)|^2}{\sum_{j=1, j\neq i}^{N} p_j |w_i^H (h_j^d + G\Phi h_j^r)|^2 + \sigma^2}\right)$, (6)

where $B_m$ is the channel bandwidth of base station m. Based on the transmission data rate, we can estimate the offloading overhead in terms of latency and energy consumption.

C. Digital Twin for Computing Model

The training tasks of device i are executed on both the local device and the edge server concurrently. Let $\lambda_i$ be the variable that describes the ratio of the data executed on edge servers to the total training data. We consider that each device offloads $\lambda_i d_i$ to the edge server and computes $(1-\lambda_i)d_i$ locally.

1) Local training: In the training process, we consider that the whole local computing resource is used for training. On the basis of the device digital twin model, we can estimate the local computing resources. Thus, the training time of each device can be estimated by

$T_i^{loc} = \dfrac{(1-\lambda_i) d_i c_i}{f_i^{loc}(t) + \hat{f}_i^{loc}(t)}$, (7)

The energy consumed to perform a unit of computing is $\varsigma (f_i^{loc}(t) + \hat{f}_i^{loc}(t))^2$, where $\varsigma$ represents the effective switched capacitance [25]. The consumed energy for local training can be written as

$E_i^{loc} = \varsigma (1-\lambda_i) d_i c_i (f_i^{loc}(t) + \hat{f}_i^{loc}(t))^2$. (8)

2) Offloading to the edge server: Multiple devices may offload their training tasks to their associated base station simultaneously, and the base station needs to assign a certain amount of computing resources to these devices. The estimated total computing resource can be denoted as $F_m^{total} = f_m^{bs}(t) + \hat{f}_m^{bs}(t)$. We consider the computing resource of the base station appointed to device i to be $f_{im}^{edge}$. If we prefer to offload parts of the training task to the base station, the time consumption will contain three items: 1) the task offloading delay from device i for uplink transmission, 2) the delay of task training by consuming edge server resources, and 3) the delay of the downlink wireless transmission of the computation feedback. The transmission delay of the feedback can be ignored due to the small size of the data. Thus, the delay of task execution on offloading can be presented as

$T_{im} = \dfrac{\lambda_i d_i c_i}{f_{im}^{edge}} + \dfrac{\lambda_i d_i}{R_{im}}$, (9)

where $\sum_{i=1}^{N} f_{im}^{edge} \leq F_m^{total}$. Similarly, the consumed energy on offloading and edge training can be expressed as

$E_{im}^{edge} = p_i \dfrac{\lambda_i d_i}{R_{im}} + \lambda_i d_i c_i \cdot e$, (10)

where e is the unit energy consumption.

D. System Cost

The local execution and edge offloading are a parallel process, so the training time of the device is determined by either the local training time or the edge execution time, i.e., the maximum of these two items. We can denote the task execution time as

$T_i = \max\left\{ \dfrac{(1-\lambda_i) d_i c_i}{f_i^{loc}(t) + \hat{f}_i^{loc}(t)}, \dfrac{\lambda_i d_i (c_i R_{im} + f_{im}^{edge})}{R_{im} f_{im}^{edge}} \right\}$. (11)

Since both local execution and edge computing consume energy, the total consumed energy on task training is the sum of the two items, which can be expressed as

$E_i = \varsigma (1-\lambda_i) d_i c_i (f_i^{loc}(t) + \hat{f}_i^{loc}(t))^2 + p_i \dfrac{\lambda_i d_i}{R_{im}} + \lambda_i d_i c_i \cdot e$. (12)

We consider that the system cost is the weighted sum of the energy consumption and the transmission and computing time of all users, which can be defined as

$C(\mathbf{f}, \mathbf{w}, \Phi, \lambda) = \sum_{i=1}^{N} [\alpha_i T_i(\mathbf{f}, \mathbf{w}, \Phi, \lambda) + (1-\alpha_i) E_i(\mathbf{f}, \mathbf{w}, \Phi, \lambda)]$, (13)

where $\alpha_i$ is the weight of training delay relative to training energy consumption in the total system cost, with a range of [0, 1]. According to the diverse performance requirements of users, $\alpha_i$ can be adjusted based on the user's performance requirements on latency and energy consumption.
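To make Eqs. (5)-(13) concrete, the following sketch evaluates the achievable rate, the delays and energies, and one device's weighted cost term. It is a minimal numerical illustration using NumPy; all parameter values in the toy usage are chosen arbitrarily rather than taken from the paper.

```python
import numpy as np

def rate(B_m, p_i, w_i, h_d_i, h_r_i, G, Phi, interferers, sigma2):
    """Eqs. (5)-(6): uplink SINR and achievable rate for device i."""
    eff = h_d_i + G @ Phi @ h_r_i                      # effective direct + reflected channel
    signal = p_i * np.abs(w_i.conj() @ eff) ** 2
    interf = sum(p_j * np.abs(w_i.conj() @ (h_d_j + G @ Phi @ h_r_j)) ** 2
                 for p_j, h_d_j, h_r_j in interferers)
    return B_m * np.log2(1.0 + signal / (interf + sigma2))

def cost_term_i(lam, d, c, f_loc, f_edge, R, p, varsigma, e, alpha):
    """Eqs. (7)-(13) for one device; f_loc and f_edge already include the deviations."""
    T_loc = (1 - lam) * d * c / f_loc                   # Eq. (7)
    T_edge = lam * d / R + lam * d * c / f_edge         # Eq. (9)
    T = max(T_loc, T_edge)                              # Eq. (11)
    E_loc = varsigma * (1 - lam) * d * c * f_loc ** 2   # Eq. (8)
    E_edge = p * lam * d / R + lam * d * c * e          # Eq. (10)
    E = E_loc + E_edge                                  # Eq. (12)
    return alpha * T + (1 - alpha) * E                  # one summand of Eq. (13)

# Toy usage with arbitrary numbers (I = 3 BS antennas, K = 5 reflecting elements).
rng = np.random.default_rng(0)
I, K = 3, 5
h_d = (rng.standard_normal(I) + 1j * rng.standard_normal(I)) / np.sqrt(2)
h_r = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
G = (rng.standard_normal((I, K)) + 1j * rng.standard_normal((I, K))) / np.sqrt(2)
Phi = np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, K)))   # unit-amplitude phase shifts
w = h_d / np.linalg.norm(h_d)                              # a valid unit-norm beamformer
R = rate(5e6, 0.1, w, h_d, h_r, G, Phi, interferers=[], sigma2=1e-14)
cost = cost_term_i(lam=0.5, d=8e6, c=80, f_loc=0.3e9, f_edge=5e9,
                   R=R, p=0.1, varsigma=1e-27, e=1e-9, alpha=0.5)
print(R, cost)
```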


E. Problem Formulation

Our objective is to jointly optimize task offloading, the coefficients of passive elements, and computation resources by minimizing the weighted system cost. Thus, the optimization problem of system cost can be formulated as

$\min_{\mathbf{f}, \mathbf{w}, \Phi, \lambda} C(\mathbf{f}, \mathbf{w}, \Phi, \lambda)$
s.t. $\sum_{i=1}^{N} f_{im}^{edge} \leq F_m^{total}$ (14a)
$T_i \leq \tau_i, \ \forall i \in \{1, \ldots, N\}$ (14b)
$w_i^H w_i = 1, \ \forall i \in \{1, \ldots, N\}$ (14c)
$\theta_k \in [0, 2\pi), \ \beta_k \in [0, 1], \ \forall k \in \{1, \ldots, K\}$ (14d)
$\lambda_i \in [0, 1], \ \forall i \in \{1, \ldots, N\}$ (14e)

where (14a) indicates that the total computing resource that base station m allocates to devices cannot exceed its computing resource capacity, and (14b) ensures that the training task can be completed within $\tau_i$. (14c) is the normalization constraint on the beamforming vectors, and (14d) is the range of the phase shift and amplitude of the passive elements. (14e) is the range of the offloading variables. Since the objective function and the constraints are non-convex, the formulated problem is NP-hard. Further, the highly complex correlation between the variables makes the problem quite intractable. As DRL is very effective for dealing with highly coupled optimization problems, we leverage DRL and design a new end-edge federated DRL to address the optimization problem.

IV. FEDERATED DRL-BASED SOLUTION

In this section, we first present the federated DRL used to solve the joint optimization problem. Then, we propose two different DRL schemes, which are deployed on the edge server and the devices, respectively.

A. Framework of Federated DRL

Here, we integrate federated learning with DRL to build the cooperation of devices and base stations and develop a feasible strategy for minimizing the system cost, as shown in Fig. 2. Different from traditional federated learning, where the global server only needs to perform model aggregation, in our federated DRL the global server (i.e., the base station) not only needs to perform the model aggregation but also needs to learn the policy of computation resource allocation and the configuration of passive element coefficients. There are two reasons for this design. First, it is difficult for devices to gather real-time information on edge computation resources and the coefficients of passive elements, but the base station can easily gather this information to learn the optimization policy. Second, based on the powerful computing capabilities of edge servers and the contextual information from offloading decisions, the learning process can be enhanced by exploring the configuration of passive reflection elements and computational resources. The learning processes of the two DRLs are as follows:

• Global learning: Based on the received local models and offloading decisions of devices, the base station first aggregates the local models of all devices and transmits the offloading decisions of devices to the global DRL agent. Then, the global DRL agent learns the policy to optimize $(\mathbf{f}, \mathbf{w}, \Phi)$ and concurrently produces a new global model. After finishing the global learning process, the base station transmits the latest global model and the learned policy to all devices.

• Local learning: Each device sends the latest global model, the current configuration of passive elements, and the computing resource allocation to the local DRL agent. The local DRL aims to learn the offloading decision $\lambda$ by utilizing partial decision information as contextual knowledge. After finishing local learning, devices upload the latest local model and the learned policies to the base station for the next round of iterations.

The global and local learning processes repeat until they converge. The details of the global DRL and local DRL are introduced below.

B. Global DRL for Configuration of Passive Elements and Computing Resource Allocation

With the offloading decisions of devices, problem (14) can be simplified as

$\min_{\mathbf{f}, \mathbf{w}, \Phi} C(\mathbf{f}, \mathbf{w}, \Phi)$ (15)
s.t. (14a), (14b), (14d)

In this case, we consider that $\lambda$ is given and obtain the value of $f_i^{loc}(t) + \hat{f}_i^{loc}(t)$ according to the devices' digital twin models. That is, $\frac{(1-\lambda_i)d_i c_i}{f_i^{loc}(t)+\hat{f}_i^{loc}(t)}$, $\varsigma(1-\lambda_i)d_i c_i (f_i^{loc}(t)+\hat{f}_i^{loc}(t))^2$, and $\lambda_i d_i c_i \cdot e$ are constant. Thus, problem (15) can be reformulated as

$\min_{\mathbf{f}, \mathbf{w}, \Phi} \sum_{i=1}^{N} \left[ \alpha_i \dfrac{\lambda_i d_i (c_i R_{im} + (1 + p_i/\alpha_i) f_{im}^{edge})}{R_{im} f_{im}^{edge}} \right]$ (16a)
s.t. $\dfrac{\lambda_i d_i (c_i R_{im} + f_{im}^{edge})}{R_{im} f_{im}^{edge}} \leq \tau_i$ (16b)
(14a), (14d)

To solve problem (16), we first need to convert it to the DRL form. A typical DRL is defined by the state, action, and immediate reward (i.e., $<s, a, R>$). In the process of DRL, each agent observes the current state before using a deep neural network to explore an action to execute. Based on problem (16), we define the global state, action, and reward.

1) Global state: At time step $\tau^g$, the global DRL agent observes the system state, which includes wireless communication information and task information:

$s(\tau^g) = \{CH(\tau^g), Task^g(\tau^g)\}$. (17)

The wireless communication information includes the transmission power, channel information, noise power, and bandwidth:

$CH(\tau^g) = \{p(\tau^g), H(\tau^g), \sigma^2, B_m\}$, (18)

where $p(\tau^g) = [p_1(\tau^g), \ldots, p_i(\tau^g), \ldots, p_N(\tau^g)]$ is the vector of transmission powers and $H(\tau^g) = [h^d(\tau^g), h^r(\tau^g), G(\tau^g)]$ denotes the channel information. The device information includes


the local computation resources, task information, and the offloading decisions of devices:

$Task^g(\tau^g) = \{D(\tau^g), \lambda(\tau^g)\}$, (19)

where $D(\tau^g) = [D_1(\tau^g), \ldots, D_i(\tau^g), \ldots, D_N(\tau^g)]$ is the task information. Each $D_i(\tau^g)$ contains three items, that is, the data size, the computation resource required per unit of data, and the deadline for task training. $\lambda(\tau^g) = [\lambda_1(\tau^g), \ldots, \lambda_N(\tau^g)]$ denotes the offloading decisions of the devices.

Fig. 2: Federated DRL with the cooperation of the global agent and local agents.

2) Global action: According to problem (16), the global DRL agent needs to explore the action

$a(\tau^g) = \{f(\tau^g), w(\tau^g), \Phi(\tau^g)\}$, (20)

where $f(\tau^g) = [f_{11}^{edge}(\tau^g), \ldots, f_{NM}^{edge}(\tau^g)]$ is the action of edge computation resource allocation and $w(\tau^g) = [w_1(\tau^g), \ldots, w_M(\tau^g)]$ is the beamforming vector. Note that each element in $w(\tau^g)$ has a real part and an imaginary part. $\Phi(\tau^g) = [\theta_1(\tau^g), \ldots, \theta_K(\tau^g), \beta_1(\tau^g), \ldots, \beta_K(\tau^g)]$ is the coefficient vector of the passive elements. With $a(\tau^g)$, the agent can estimate the immediate reward.

3) Global reward: The immediate reward measures the goodness of the explored action under an observed state. Here, we define the immediate reward as the objective value of (16). Based on the state and action, we can obtain

$r(\tau^g) = -\sum_{i=1}^{N} \left\{ \dfrac{\alpha_i \lambda_i(\tau^g) d_i(\tau^g) c_i(\tau^g)}{f_{im}^{edge}(\tau^g)} + \dfrac{\alpha_i \lambda_i(\tau^g) d_i(\tau^g)(1 + p_i(\tau^g)/\alpha_i)}{R_{im}(\tau^g)} \right\}$. (21)

If the learned actions satisfy constraint (16b), the DRL agent gets this reward. Otherwise, it receives a penalty, which is a negative constant:

$\Upsilon_t = \begin{cases} r(\tau^g), & \text{if (16b) is satisfied;} \\ plt_g, & \text{otherwise.} \end{cases}$ (22)

Based on the immediate reward, the agent explores the cumulative reward. Generally, the cumulative reward can be defined as

$R_t = \sum_{i=0}^{\infty} \zeta^i \Upsilon_{\tau^g+i+1}$, (23)

where $\zeta \in [0, 1]$ represents the discount weight. Note that the policy indicates the probability of choosing action $a(\tau^g)$ in a certain state. If the agent is following a policy at time $\tau^g$, the policy has a probability of $\pi = p(a(\tau^g) = a \mid s(\tau^g) = s)$. DRL algorithms evaluate and improve the policy based on the value function. Based on $\pi$, the agent's state-action value function is represented as

$Q^\pi(s(\tau^g), a(\tau^g)) = \mathbb{E}[\Upsilon_{\tau^g+1} + \zeta\Upsilon_{\tau^g+2} + \ldots \mid s(\tau^g), a(\tau^g)] = \mathbb{E}[\Upsilon_{\tau^g+1} + \zeta Q^\pi(s(\tau^g+1), a(\tau^g+1)) \mid s(\tau^g), a(\tau^g)]$. (24)

When the state-action value function reaches the optimum, the Bellman optimality equation is obtained as

$Q^*(s(\tau^g), a(\tau^g)) = \mathbb{E}[\Upsilon_{\tau^g+1} + \zeta \max Q^*(s(\tau^g+1), a(\tau^g+1)) \mid s(\tau^g), a(\tau^g)]$. (25)

Thus, the optimal policy is achieved with

$\pi(s(\tau^g))^* = \arg\max_a Q^*(s(\tau^g), a(\tau^g))$. (26)

Next, we introduce the detailed global learning on the base station.

Global learning: We use the deep deterministic policy gradient (DDPG) in global learning, as the action space of problem (16) is continuous. The objective of global learning is to explore the actions of passive element configuration and computing resource allocation. DDPG is based on actor and critic networks and has three components: the main network, the target network, and the replay buffer. The main network produces the passive element configuration and computing resource allocation policy with two deep neural networks: the main actor and main critic networks.
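A minimal way to realize the main actor and critic networks is with small fully connected networks, as sketched below in PyTorch. The layer widths and the bounded output head are assumptions made for illustration and are not specified by the paper, apart from the (tanh(x)+1)/2 output activation for global training noted in Section V-A.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the global state to a bounded action vector, cf. Eq. (20)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # (tanh(x) + 1) / 2 squashes every action component into [0, 1];
        # each component is rescaled to its physical range elsewhere.
        return (torch.tanh(self.net(s)) + 1.0) / 2.0

class Critic(nn.Module):
    """Estimates Q(s, a) for a state-action pair, cf. Eq. (24)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))
```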


The structure of the target network is the same as the structure of the main network; that is, it also has two networks: the target actor and target critic networks. The replay buffer stores experience tuples to break up the correlation among the experiences. Each experience tuple contains the current state, current action, current reward, and next state. The parameters of the critic and actor networks can be updated by using the samples of experience in the replay buffer.

In DDPG, based on $\theta_\pi^g$, the actor network learns the current policy $\pi(s(\tau^g)\mid\theta_\pi^g)$ by mapping states to a specific action in a deterministic way. To evaluate the learned action performance, the critic network utilizes $Q^\pi(s(\tau^g), a(\tau^g) \mid \theta_Q^g)$ based on $\theta_Q^g$. In the main network, the critic updates its parameter $\theta_Q^g$ through the loss function. Specifically, the agent first uniformly samples V experiences $<s(j), a(j), \Upsilon_j, s(j+1)>$ from the replay buffer and generates the target value $y_j$, which is given by

$y_j = \Upsilon_j + \zeta Q'(s(j+1), \pi'(s(j+1)\mid\theta_{\pi'}^g)\mid\theta_{Q'}^g)$, (27)

where $Q'(s(j+1), \pi'(s(j+1)\mid\theta_{\pi'}^g)\mid\theta_{Q'}^g)$ is measured by the target critic network, $\theta_{\pi'}^g$ is the network parameter of the target actor network, and $\theta_{Q'}^g$ is the parameter of the target critic network. Meanwhile, the agent calculates the loss based on

$Loss(\theta_Q^g) = \mathbb{E}\left[(y_j - Q^\pi(s(j), a(j)\mid\theta_Q^g))^2\right]$. (28)

Then, we can calculate the loss function and its gradient to update the parameter of the main critic network. The gradient of the loss function is given by

$\nabla_{\theta_Q^g} Loss(\theta_Q^g) = \mathbb{E}\left[2(y_j - Q^\pi(s(j), a(j)\mid\theta_Q^g)) \nabla_{\theta_Q^g} Q^\pi(s(j), a(j))\right]$. (29)

Based on the gradient, we can update the network parameter of the main critic network by

$\theta_Q^g = \theta_Q^g - \dfrac{\alpha_Q}{V}\sum_{j=1}^{V}\left[2(y_j - Q^\pi(s(j), a(j)\mid\theta_Q^g)) \nabla_{\theta_Q^g} Q^\pi(s(j), a(j))\right]$, (30)

where $\alpha_Q$ is the learning rate for updating $\theta_Q^g$.

The sampled policy gradient is used to update the parameter $\theta_\pi^g$ of the main actor network. The sampled gradient is defined as

$\nabla_{\theta_\pi^g} J(\pi) \approx \mathbb{E}\left[\nabla_{\theta_\pi^g} \pi(s(j)\mid\theta_\pi^g) \nabla_a Q^\pi(s(j), \pi(s(j)\mid\theta_\pi^g))\right]$, (31)

where $J(\pi) = \mathbb{E}[R_t]$ is the expected return. According to the mini-batch of V sampled experiences, we update the parameter of the main actor network based on

$\theta_\pi^g = \theta_\pi^g - \dfrac{\alpha_\pi}{V}\sum_{j=1}^{V}\left[\nabla_a Q^\pi(s(j), a(j)\mid\theta_Q^g) \nabla_{\theta_\pi^g} \pi(s(j)\mid\theta_\pi^g)\right]$, (32)

where $\alpha_\pi$ is the learning rate used to update $\theta_\pi^g$.

Based on the previous neural network parameters of the target and main networks, the current parameters $\theta_{\pi'}^g$ and $\theta_{Q'}^g$ of the target network are updated with the soft updating method to gradually approach the structure of the main network. Specifically, the parameters $\theta_{\pi'}^g$ and $\theta_{Q'}^g$ are updated based on $\theta_{\pi'}^g = \omega\theta_\pi^g + (1-\omega)\theta_{\pi'}^g$ and $\theta_{Q'}^g = \omega\theta_Q^g + (1-\omega)\theta_{Q'}^g$, where $\omega \in [0, 1]$.

The details of global learning using DDPG are shown in Algorithm 1. In the initialization step, the DDPG agent initializes the main network with parameters $\theta_\pi^g$ and $\theta_v^g$, and the target network with $\theta_{\pi'}^g$ and $\theta_{Q'}^g$. It also initializes the experience replay buffer, which is used to store the sample data generated by the interaction between the agent and the environment. After the initialization, the DDPG agent first catches the environment information in the training process, such as the wireless channel information and training task information. Based on this state information, the agent explores the action, calculates its immediate reward, and updates the next state. The replay buffer packages the information of state, reward, and action as an experience and stores all experiences. When calculating the parameters of the neural networks, a mini-batch of experiences is sampled from the replay buffer before calculating the target value. The loss function is used to update the parameters of the main critic network. The parameters of the main actor network are updated using policy gradients.

Algorithm 1 Global Learning for Configuration of Passive Elements and Computing Resource Allocation
1: Initialize the global shared counter $T^g = 0$;
2: /* DDPG-based global learning */
3: Reset the main actor and critic networks (i.e., $\pi(s(\tau^g)\mid\theta_\pi^g)$, $Q^\pi(s(\tau^g), a(\tau^g)\mid\theta_Q^g)$) with parameters $\theta_\pi^g$ and $\theta_v^g$;
4: Reset the target actor and critic networks (i.e., $\pi'(s\mid\theta_{\pi'})$, $Q'(s, a\mid\theta_{Q'})$) with parameters $\theta_{\pi'}^g$ and $\theta_{Q'}^g$;
5: Set the replay buffer;
6: for every episode do
7:   Set up the current environment;
8:   for every step $\tau^g$ do
9:     Explore $a(\tau^g) = \pi(s(\tau^g)\mid\theta_\pi^g) + N_t$;
10:    Observe the reward $\Upsilon_{\tau^g}$ and update the state $s(\tau^g + 1)$;
11:    Record the experience $<s(\tau^g), a(\tau^g), \Upsilon_{\tau^g}, s(\tau^g + 1)>$ in the replay buffer;
12:    Randomly catch a mini-batch of V experiences for network parameter updates;
13:    Estimate the target value $y_j$;
14:    Update the parameter $\theta_Q^g$ based on (30) using the loss function;
15:    Update the parameter $\theta_\pi^g$ based on (32) with the sampled policy gradient;
16:    Update the target actor and critic network parameters by $\theta_{\pi'}^g \leftarrow \omega\theta_\pi^g + (1-\omega)\theta_{\pi'}^g$ and $\theta_{Q'}^g \leftarrow \omega\theta_Q^g + (1-\omega)\theta_{Q'}^g$.
17:  end for
18: end for
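For readers who prefer code, one DDPG update of the kind performed in lines 12-16 of Algorithm 1 might look as follows in PyTorch, using the Actor/Critic modules sketched earlier. This is an illustrative single-step sketch under our own naming, not the authors' implementation; as is standard for DDPG, the actor step maximizes the critic value.

```python
import torch
import torch.nn.functional as F

def ddpg_step(batch, actor, critic, target_actor, target_critic,
              actor_opt, critic_opt, zeta=0.99, omega=0.005):
    # batch: a mini-batch of V experiences; r is assumed to have shape (V, 1)
    s, a, r, s_next = batch

    # Eq. (27): target value from the target actor and target critic
    with torch.no_grad():
        y = r + zeta * target_critic(s_next, target_actor(s_next))

    # Eqs. (28)-(30): mean-squared TD loss trains the main critic
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Eqs. (31)-(32): deterministic policy gradient trains the main actor
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks with mixing factor omega
    for target, source in ((target_actor, actor), (target_critic, critic)):
        for p_t, p in zip(target.parameters(), source.parameters()):
            p_t.data.mul_(1.0 - omega).add_(omega * p.data)
```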


C. Local DRL for Task Offloading

With the coefficients of the passive elements and the computation resource allocation of the base station (i.e., $\{f(\tau^g), w(\tau^g), \Phi(\tau^g)\}$), problem (14) can be simplified as

$\min_{\lambda_i} C_i(\lambda) = \alpha_i \max\left\{ \dfrac{(1-\lambda_i) d_i c_i}{f_i^{loc}(t) + \hat{f}_i^{loc}(t)}, \dfrac{\lambda_i d_i (c_i R_{im} + f_{im}^{edge})}{R_{im} f_{im}^{edge}} \right\} + \varsigma(1-\lambda_i) d_i c_i (f_i^{loc}(t) + \hat{f}_i^{loc}(t))^2 + p_i \dfrac{\lambda_i d_i}{R_{im}} + \lambda_i d_i c_i \cdot e$ (33a)

s.t. $\dfrac{(1-\lambda_i) d_i c_i}{f_i^{loc}(t) + \hat{f}_i^{loc}(t)} \leq \tau_i$ (33b)

$\dfrac{\lambda_i d_i (c_i R_{im} + f_{im}^{edge})}{R_{im} f_{im}^{edge}} \leq \tau_i$ (33c)

(14e)

According to the aforementioned DRL form, we transform problem (33) into a local state, local action, and local reward.

1) Local state: At step $\tau^l$, each local DRL agent on a device gathers the state, including the communication data rate, task information, local computation resource, and edge computation resource:

$s_i^l(\tau^l) = \{R_{im}(\tau^l), D_i(\tau^l), f_{im}^{edge}(\tau^l)\}$. (34)

Based on the global policy $\{w(\tau^g), \Phi(\tau^g)\}$ and the transmission power $p_i(\tau^l)$, the wireless transmission data rate $R_{im}(\tau^l)$ can be obtained. $D_i(\tau^l)$ is the training task information. $f_{im}^{edge}(\tau^l)$ is the computing resource that base station m allocates to device i, which is obtained from $f(\tau^g)$.

2) Local action: According to problem (33), each device needs to decide the computation offloading $\lambda_i$. Hence, the action of the local DRL is denoted as $a_i^l(\tau^l) = \lambda_i(\tau^l)$. Compared to the global DRL agent, a local DRL agent only needs to take a low-dimensional action.

3) Local reward: The immediate reward function is defined as $r_i(\tau^l) = -C_i(\lambda)$. If the offloading decision satisfies constraints (33b) and (33c), the agent gets this reward. Otherwise, it gets a penalty,

$r_i^{loc}(\tau^l) = \begin{cases} r_i(\tau^l), & \text{if (33b) and (33c) are satisfied;} \\ plt_l, & \text{otherwise.} \end{cases}$ (35)

The goal of each local DRL is to search for an optimal policy that maximizes the local cumulative discounted rewards. $\delta \in [0, 1]$ is the discount factor; thus we have

$R_i^{loc}(\tau^l) = \sum_{i=0}^{\infty} \delta^i r_i^{loc}(\tau^l + i + 1)$. (36)

Local learning: We use asynchronous actor-critic DRL to explore the task offloading decisions on each device. Asynchronous actor-critic DRL is both policy-based and value-based for solving continuous time and state space problems. This DRL also includes actors and critics. The actors are responsible for interacting with the environment and generating actions, which are evaluated and guided by the critics. Different from DDPG, the asynchronous DRL contains a global agent and multiple local learning agents. Multiple local agents run in parallel with a similar network structure to the global agent. We consider that the global agent is deployed at the base station to perform global model aggregation and parameter collection, while local agents are distributed on each device. In each local agent, a policy gradient scheme is used to produce offloading decisions under the given parameterized policy with the actor network parameter $\theta_\pi^l$. We denote $a_i(\tau^l) = \pi(s_i(\tau^l)\mid\theta_\pi^l)$ as the offloading policy learned from the actor, which is generated by the method of nonlinear function fitting.

Algorithm 2 Local Learning for Offloading
1: Obtain the configuration of passive elements and the computation resource allocation;
2: Obtain the global agent parameters $\theta_\pi^g$ and $\theta_v^g$;
3: Initialize the learning step counter $\tau^l$;
4: repeat
5:   Reset $d\theta_\pi^g \leftarrow 0$ and $d\theta_v^g \leftarrow 0$;
6:   /* Sharing the global model with devices */
7:   Synchronize the local agent parameters $\theta_\pi^l \leftarrow \theta_\pi^g$ and $\theta_v^l \leftarrow \theta_v^g$;
8:   for each device do
9:     Observe the current state $s(\tau^l)$ and the configuration of passive elements and computation resource allocation (i.e., $\{f(\tau^g), w(\tau^g), \Phi(\tau^g)\}$);
10:    $t_{start} = \tau^l$;
11:    repeat
12:      Explore $a(\tau^l)$ based on $\pi(s(\tau^l)\mid\theta_\pi^l)$;
13:      Update $s(\tau^l + 1)$ and $R^{loc}(s(\tau^l), a(\tau^l))$;
14:      $\tau^l \leftarrow \tau^l + 1$; $T \leftarrow T + 1$;
15:    until $\tau^l - t_{start} == t_{max}$
16:    $R^{loc} = V(s(\tau^l)\mid\theta_v^l)$;
17:    while receiving $R^{loc}$ from each local agent do
18:      for $j \in \{\tau^l - 1, \ldots, t_{start}\}$ do
19:        $R^{loc} \leftarrow r_i^{loc}(j) + \delta R^{loc}$;
20:        Accumulate gradients and update the critic network parameter with $\theta_v^l \leftarrow \theta_v^l - \alpha_v \sum_{\tau^l} \nabla_{\theta_v^l}(R_i^{loc}(\tau^l) + \delta V(s(\tau^l+1)\mid\theta_v^l) - V(s(\tau^l)\mid\theta_v^l))^2$;
21:        Accumulate gradients and update the actor network parameters with $\theta_\pi^l \leftarrow \theta_\pi^l + \alpha_\pi \sum_{\tau^l} \nabla_{\theta_\pi^l}\log\pi(s_i(\tau^l)\mid\theta_\pi^l)A^\pi(s_i(\tau^l), a_i(\tau^l)) + \varphi\nabla_{\theta_\pi^l}H(\pi(s(\tau^l)\mid\theta_\pi^l))$;
22:      end for
23:      /* Global model updates */
24:      Asynchronously update $\theta_\pi^l$ and $\theta_v^l$;
25:      Update $\theta_\pi^g \leftarrow \omega_l\theta_\pi^l + (1-\omega_l)\theta_\pi^g$ and $\theta_v^g \leftarrow \omega_l\theta_v^l + (1-\omega_l)\theta_v^g$;
26:      Upload the action (i.e., the offloading decision) to the base station.
27:    end while
28:  end for
29: until $T > T_{max}$
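A compact Python rendering of the per-device routine in Algorithm 2, namely pulling the global parameters, collecting t_max steps of experience, and applying an entropy-regularized advantage actor-critic update before pushing the result back, is sketched below. It assumes PyTorch actor/critic modules (here the critic is a state-value network V(s), unlike the Q-critic of the global agent) and a gym-style environment; all names are our own, and the precise update rules are given in Eqs. (37)-(42) below.

```python
import torch

def local_agent_round(env, actor, critic, actor_opt, critic_opt,
                      global_actor_state, t_max=20, delta=0.95, phi=0.01):
    # Algorithm 2, line 7: synchronize the local actor with the global model
    actor.load_state_dict(global_actor_state)

    # Algorithm 2, lines 11-15: interact with the environment for t_max steps
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(t_max):
        dist = actor(torch.as_tensor(s, dtype=torch.float32))   # policy over lambda_i
        a = dist.sample()
        s_next, r, _, _ = env.step(a.item())
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # Discounted returns bootstrapped from V(s_last), cf. Algorithm 2, lines 16-19
    R = critic(torch.as_tensor(s, dtype=torch.float32)).item()
    returns = []
    for r in reversed(rewards):
        R = r + delta * R
        returns.insert(0, R)

    S = torch.as_tensor(states, dtype=torch.float32)
    A = torch.stack(actions)
    G = torch.as_tensor(returns, dtype=torch.float32)

    # Critic update (cf. Eqs. (38)-(40)): mean-squared error against the returns
    values = critic(S).squeeze(-1)
    critic_loss = ((G - values) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update (cf. Eqs. (41)-(42)): advantage-weighted log-likelihood plus entropy bonus
    dist = actor(S)
    advantage = (G - values).detach()
    actor_loss = -(dist.log_prob(A) * advantage).mean() - phi * dist.entropy().mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    return actor.state_dict()        # pushed back to the base station for aggregation
```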


The critic network is utilized to estimate the output performance of the actor with the critic network parameter $\theta_v^l$. Here, a temporal-difference-based gradient descent method is used to update the network parameter of the critic network with a state-value function. We define the state-value function as the long-term cumulative reward obtained by applying strategy $\pi$ in state $s_i(\tau^l)$. Thus, it can be denoted as

$V(s_i(\tau^l)\mid\theta_v^l) = \mathbb{E}[R_i^{loc}(\tau^l) + \delta V(s_i(\tau^l+1)\mid\theta_v^l) \mid s_i(\tau^l)]$. (37)

Based on $V(s_i(\tau^l))$, the temporal difference error can be calculated by

$\Delta = R_i^{loc}(\tau^l) + \delta V(s_i(\tau^l+1)\mid\theta_v^l) - V(s_i(\tau^l)\mid\theta_v^l)$. (38)

The loss function is defined as a mean square error to update the network parameters of the critic network. The loss can be defined as

$Loss(\theta_v^l) = \sum_{\tau^l}(R_i^{loc}(\tau^l) + \delta V(s(\tau^l+1)\mid\theta_v^l) - V(s(\tau^l)\mid\theta_v^l))^2$. (39)

We define $\alpha_v$ as the learning rate of the critic network. Thus, the parameter $\theta_v^l$ is updated by

$\theta_v^l = \theta_v^l - \alpha_v\sum_{\tau^l}\nabla_{\theta_v^l}(R_i^{loc}(\tau^l) + \delta V(s(\tau^l+1)\mid\theta_v^l) - V(s(\tau^l)\mid\theta_v^l))^2$. (40)

The actor network uses an advantage function and the gradient ascent method to update its network parameters. The advantage function $A^\pi(s_i(\tau^l), a_i(\tau^l))$ is defined as the difference between the reward and the state-value function, which indicates whether the explored action is better than expected. $A^\pi(s_i(\tau^l), a_i(\tau^l))$ is defined as

$A^\pi(s_i(\tau^l), a_i(\tau^l)) = R_i^{loc}(\tau^l) - V(s_i(\tau^l)\mid\theta_v^l) = \sum_{j=0}^{i-1}\delta^j R_j^{loc}(\tau^l+j) + \delta^i V(s_j(\tau^l+i)\mid\theta_v^l) - V(s_j(\tau^l)\mid\theta_v^l)$. (41)

$\theta_\pi^l$ is the parameter of the actor network, and it is updated based on

$\theta_\pi^l = \theta_\pi^l + \alpha_\pi\sum_{\tau^l}\nabla_{\theta_\pi^l}\log\pi(s_i(\tau^l)\mid\theta_\pi^l)A^\pi(s_i(\tau^l), a_i(\tau^l)) + \varphi\nabla_{\theta_\pi^l}H(\pi(s(\tau^l)\mid\theta_\pi^l))$, (42)

where $\alpha_\pi$ is the learning rate and $H(\cdot)$ represents the regularized entropy value, which ensures that the decision-making agent fully explores the environment space. $\varphi$ is the entropy coefficient; it is set to a large value at the beginning so that the parameters of the actor change in the direction of increasing entropy.

Algorithm 2 shows the details of local learning using asynchronous actor-critic DRL. In the initialization step, the agent initializes the actor with parameter $\theta_\pi^l \leftarrow \theta_\pi^g$ and the critic with $\theta_v^l \leftarrow \theta_v^g$. On each device, the agent first catches the network state and then explores the offloading action. After running $t_{max}$ steps, the agent uses the state value $V(s(\tau^l)\mid\theta_v^l)$ to calculate the future reward R. After multiple interactions with the environment, the agent utilizes the received reward, state value, action, and state to calculate the loss and gradient. The agent deployed on the base station uses the obtained loss and aggregated gradients to update the actor network parameter $\theta_\pi^l$ and the critic network parameter $\theta_v^l$.

The complexity of the proposed federated DRL is jointly determined by the efficiency of the local and global DRLs. Specifically, the complexity primarily depends on the number of agents, the neural network structures of the actor and critic networks used by each agent, and the training frequency. We consider that the actor and critic networks of the local DRL contain J and L fully connected layers, and the actor and critic networks of the global DRL contain Z and Q fully connected layers. Moreover, we denote the training frequencies of the local DRL and global DRL as E and B, respectively. Then the overall complexity of the proposed federated DRL can be expressed as

$O\Big(B \times N \times E \times \big(2\sum_{j=0}^{J} n_{A1,j}n_{A1,j+1} + 2\sum_{l=0}^{L} n_{C1,l}n_{C1,l+1}\big) + B \times \big(2\sum_{z=0}^{Z} n_{A2,z}n_{A2,z+1} + 2\sum_{q=0}^{Q} n_{C2,q}n_{C2,q+1}\big)\Big)$, (43)

where $n_{A1,j}$, $n_{C1,l}$ and $n_{A2,z}$, $n_{C2,q}$ represent the unit numbers of the actor and critic networks for the local DRL and global DRL, respectively. The values of $n_{A1,0}$, $n_{C1,0}$ and $n_{A2,0}$, $n_{C2,0}$ are equal to the input sizes of the corresponding networks.

V. NUMERICAL RESULTS

A. Simulation Settings

The digital twin network for the experimental simulation includes 10 single-antenna devices under the coverage of a base station with 3 antennas. To improve the quality of wireless communication, a set of 5 passive reflecting elements is deployed between the devices and the base station. The base station uses a linear beamforming strategy to decode the signals directly transmitted by the mobile terminals and the signals of the passive reflecting elements. The bandwidth of the wireless channel for each device is 5 MHz. The noise power follows a Gaussian distribution with $\sigma^2 = 10^{-11}$ mW, and $\lambda$ is randomly distributed in [0, 1]. The digital twin models of devices and base stations estimate their real-time computing capabilities. We consider the computing capabilities of devices and base stations to follow the uniform distributions U[0.1, 0.5] GHz and U[10, 50] GHz, respectively. Each device needs to compute a training task with an accuracy constraint, whose data size follows the distribution U[1, 100] MB and whose number of CPU cycles required per byte follows the distribution U[50, 100]. In the federated DRL, we utilize ReLU in local DRL training and (tanh(x)+1)/2 as the activation function for global training. The penalty is proportional to the number of devices, i.e., 100 × N.
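The simulation parameters above can be generated directly from the stated distributions; the short snippet below shows one possible way to draw them (NumPy, arbitrary seed), purely as a reading aid.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
N = 10                                        # devices, each with one antenna
I, K = 3, 5                                   # BS antennas, passive reflecting elements
B = 5e6                                       # 5 MHz channel bandwidth per device
sigma2 = 1e-11                                # noise power in mW

f_loc = rng.uniform(0.1e9, 0.5e9, size=N)     # device computing capability, U[0.1, 0.5] GHz
f_bs = rng.uniform(10e9, 50e9)                # base station capability, U[10, 50] GHz
d = rng.uniform(1e6, 100e6, size=N)           # task data size, U[1, 100] MB (in bytes)
c = rng.uniform(50, 100, size=N)              # CPU cycles required per byte, U[50, 100]
lam = rng.uniform(0.0, 1.0, size=N)           # initial offloading ratios
penalty = 100 * N                             # penalty proportional to the number of devices
```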


10

A3C-based DRL to determine offloading variable λ with


a stringent latency constraint.
• DDPG-based configuration of passive elements and
computing resource allocation: devices choose to of-
fload half of the training tasks to the base station and
execute half of it locally. The base station interacts with
the environment to configure phase shifts and amplitude
of passive reflecting elements and allocate computing
resources for each task based on its latency constraint.

B. Performance Analysis
Figs. 3-4 illustrate the impact of different variables on the cumulative system cost. Here, the x-axis is the number of episodes and the y-axis is the cumulative system cost. To facilitate depicting the trend on the y-axis, we replace the original values with their logarithmic values (i.e., y = log(R_t)). Fig. 3 shows the efficiency of the proposed federated DRL against two benchmarks. Here, the reward is the combination of the system overhead in time and energy; the smaller the reward, the smaller the system overhead. The solid lines and shaded areas in Fig. 3 represent the mean and standard deviation of the system overhead, respectively.

Fig. 3: Comparison of the performance of A3C-based offloading, DDPG-based passive elements configuration and computing resource allocation, and federated DRL-based joint optimization.

We can draw two conclusions from Fig. 3. First, the proposed federated DRL algorithm outperforms the two baseline algorithms in terms of reward by leveraging the cooperation of local and global agents. Specifically, compared with the A3C-based offloading, the federated DRL-based joint optimization algorithm optimizes the task offloading, edge computing resource allocation, and antenna beamforming coefficients, and adjusts the optimal reflection angles of the passive elements, resulting in smaller system overhead. The A3C-based offloading can only determine the offloading proportion of each device in a fixed communication and computing environment, resulting in larger system overhead. In comparison with the DDPG-based configuration of passive elements and resource allocation, the federated DRL algorithm can further adjust the amount of tasks each device offloads, i.e., further adjust the configuration of passive elements and computing resources while optimizing the offloading coefficients, thereby further improving the wireless transmission rate and reducing the edge computing overhead. Moreover, the flexible offloading capability of the federated DRL algorithm can further reduce energy and time consumption under different training requirements. Second, the federated DRL algorithm demonstrates a faster convergence speed than the two benchmarks. This is because local-global collaborative learning is executed in parallel, and the parameters of the agents are shared to jointly explore strategies, accelerating the convergence of the system.

Fig. 4: Impact of different learning rates.

To evaluate the effect of the learning rate on the performance of the activity evaluation and feedback network, we consider three different learning rates: {0.05, 0.005, 0.0005}. Fig. 4 illustrates the convergence behavior of the federated DRL algorithm under these three learning rates. The results indicate that the convergence behavior varies significantly with the learning rate. Specifically, the federated DRL model with a learning rate of 0.005 achieves the lowest cost, although it requires more episodes to converge than the case with a learning rate of 0.05. The cumulative cost under the learning rate of 0.05 is greater than that under 0.005 because an excessively large learning rate can cause the model to miss the optimal value and leads to large oscillations in the convergence curve. When the learning rate is 0.0005, the curve converges quickly, but the cumulative cost is greater than that under a learning rate of 0.005 because an excessively small learning rate can cause the model to get stuck in local optima. Therefore, a learning rate of 0.005 is the best setting for the proposed federated DRL algorithm.
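As a small reading aid for these curves, the lines below reproduce the plotting convention just described: the per-episode cost combines time and energy overhead, and the y-axis shows the logarithm of the cumulative cost. The equal weights and the random placeholder traces are assumptions for illustration only.

import numpy as np

# Plotting convention used in the figures: y = log(R_t), where R_t is the
# cumulative cost and the per-episode cost weighs time and energy overhead.
w_time, w_energy = 0.5, 0.5           # assumed weighting coefficients
time_cost = np.random.rand(500)       # placeholder per-episode latency overhead
energy_cost = np.random.rand(500)     # placeholder per-episode energy overhead
episode_cost = w_time * time_cost + w_energy * energy_cost
R_t = np.cumsum(episode_cost)         # cumulative system cost over episodes
y = np.log(R_t)                       # value shown on the y-axis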


Fig. 5: Effects of different numbers of terminal devices: (a) cumulative system cost, (b) cumulative delay cost, (c) cumulative energy cost.

Fig. 6: Effects of different task offloading schemes: (a) cumulative system cost, (b) cumulative delay cost, (c) cumulative energy cost.

Fig. 5 illustrates how the number of terminal devices affects the cumulative system cost, cumulative latency cost, and cumulative energy cost. The experiment uses N = {10, 20, 30, 40} devices. As depicted in Fig. 5(a), the system cost of all three algorithms rises when the number of terminal devices grows from 10 to 40. However, the proposed federated DRL algorithm always has the smallest cumulative system cost compared with the two benchmarks because it calculates the task offloading size, allocates edge computing resources, and adjusts the parameters of the passive elements based on the real-time computing capabilities and task volumes of the devices. This also verifies that the proposed federated DRL algorithm is suitable for dynamic scenarios with various numbers of terminal devices. In Fig. 5(b), the system's latency cost experiences a sharp increase when the number of devices grows from 30 to 40. This occurs because the edge server's computing resources are fixed, causing each device's share of resources to diminish as the system expands, leading to higher latency costs. The allocated computing resources may not meet the needs of the devices, resulting in a violation of the latency constraint and possible penalties. Moreover, Fig. 5(c) reveals that as the number of devices increases, the energy costs of the A3C-based and DDPG-based schemes alternate in their growth, while the federated DRL algorithm consistently maintains the lowest system energy cost. Therefore, the proposed federated DRL algorithm achieves better performance in terms of cumulative system cost, latency cost, and energy cost as the number of devices changes.

Fig. 6 shows the convergence curves of the cumulative system cost, cumulative delay cost, and cumulative energy cost under different task offloading schemes. These schemes are 1) local computing, 2) full offloading, 3) random offloading, and 4) federated DRL-optimized offloading. From Fig. 6(a), we can see that the all-local computing and full offloading schemes have the highest system cost among the four convergence curves, while in Fig. 6(b) and Fig. 6(c) they have the highest system delay cost and energy cost, respectively. This is because, in the all-local computing scheme, the limited computing capacity may not be able to complete the computational tasks within the specified time limit, resulting in penalties. When all the computational tasks are offloaded to the edge, a large amount of energy is consumed in transmitting data over the communication link, resulting in the highest energy cost. The federated DRL scheme considers both the system delay and energy cost, seeking a balance point between the two to achieve the lowest system cost.
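The delay-energy trade-off behind these curves can be seen with a generic partial-offloading cost model. The formulas and parameter values below are textbook-style assumptions rather than the paper's exact system model; with these numbers, all-local computing incurs the largest delay, full offloading the largest transmission energy, and a partial split the lowest weighted cost.

# Generic partial-offloading cost sketch (all parameters are assumptions):
# split = fraction of the task offloaded to the edge (0 = all local,
# 1 = fully offloaded).
def weighted_cost(split, task_bits=1e6, cycles_per_bit=1e3, f_local=1e9,
                  f_edge=1e10, rate=2e6, p_tx=1.0, kappa=1e-28,
                  w_time=0.5, w_energy=0.5):
    local_bits = (1.0 - split) * task_bits
    edge_bits = split * task_bits
    t_local = local_bits * cycles_per_bit / f_local
    e_local = kappa * f_local ** 2 * local_bits * cycles_per_bit
    t_tx = edge_bits / rate
    e_tx = p_tx * t_tx
    t_edge = edge_bits * cycles_per_bit / f_edge
    delay = max(t_local, t_tx + t_edge)   # local and offloaded parts run in parallel
    energy = e_local + e_tx
    return w_time * delay + w_energy * energy

for split in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"split={split:.2f}  cost={weighted_cost(split):.3f}")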


In summary, compared with the other task offloading schemes, the federated DRL-optimized offloading scheme achieves a smaller system cost.

Fig. 7: Impact of different signal transmission power.

Fig. 7 shows the changes in the system cost of the three algorithms as the transmission power increases from -20 dB to 20 dB. When Pt = {−20, −10, 0} dB, the system costs are roughly equal and relatively small. This is because increasing the device transmission power increases the energy cost of transmitting data, but it also increases the SINR, improves the data transmission speed, and reduces the transmission delay. In addition, selecting a larger device transmission power can improve the signal quality. However, when the transmission power increases from 10 dB to 20 dB, the slope of the curve increases sharply, indicating that the transmission energy cost dominates the system cost. Furthermore, Fig. 7 shows that the federated DRL algorithm performs better than the two benchmarks at all transmission power levels.
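The shape of these curves follows from standard rate-power reasoning. The snippet below uses a Shannon-style rate with assumed bandwidth, channel gain, and noise power (not the paper's exact channel model) to show that the transmission delay shrinks only slowly with power, while the transmission energy grows rapidly once the rate gain saturates.

import numpy as np

# Back-of-the-envelope sketch: SINR and rate grow only logarithmically with
# transmit power, while transmission energy is power times delay.
bandwidth = 1e6          # Hz (assumed)
noise = 1e-9             # W (assumed noise-plus-interference power)
gain = 1e-6              # assumed end-to-end channel gain
data_bits = 1e6          # assumed amount of offloaded data

for p_dB in (-20, -10, 0, 10, 20):
    p = 10 ** (p_dB / 10.0)                  # dB -> linear scale (relative to 1 W)
    sinr = p * gain / noise
    rate = bandwidth * np.log2(1.0 + sinr)   # bit/s
    t_tx = data_bits / rate                  # transmission delay
    e_tx = p * t_tx                          # transmission energy
    print(f"{p_dB:>4} dB: rate={rate:.2e} bit/s, delay={t_tx:.3f} s, energy={e_tx:.3f} J")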
VI. CONCLUSIONS

In this paper, a digital twin edge network is proposed to enable resource-constrained devices to offload training tasks to a powerful edge server over an enhanced wireless communication link. In the proposed digital twin edge network, the physical objects comprise devices, passive reflecting elements, and base stations. Utilizing digital twins, we construct virtual twins for each of them to record their dynamic changes. We then formulate a system cost minimization problem by jointly considering task offloading, the configuration of passive reflecting elements, and computing resource allocation. To solve this complex optimization problem, we design a federated DRL scheme, in which the global agent explores the edge computation resource allocation and the coefficients of the passive reflecting elements, while the local agents are responsible for offloading decision-making. Numerical results indicate that the proposed federated DRL scheme can gradually improve its performance to adapt to different scenarios compared with other DRL methods.
