Federated Deep Reinforcement Learning For Task Offloading in Digital Twin Edge Networks
Abstract—Digital twin edge networks provide a new paradigm that combines mobile edge computing (MEC) and digital twins to improve network performance and reduce communication cost by utilizing digital twin models of physical objects. The construction of digital twin models requires powerful computing ability. However, distributed devices with limited computing resources cannot complete high-fidelity digital twin construction. Moreover, weak communication links between these devices may hinder the potential of digital twins. To address these issues, we propose a two-layer digital twin edge network, in which the physical network layer offloads training tasks using passive reflecting links, and the digital twin layer establishes a digital twin model to record the dynamic states of physical components. We then formulate a system cost minimization problem to jointly optimize task offloading, the configurations of passive reflecting links, and computing resources. Finally, we design a federated deep reinforcement learning (DRL) scheme to solve the problem, where local agents train offloading decisions and global agents optimize the allocation of edge computing resources and the configurations of passive reflecting elements. Numerical results show the effectiveness of the proposed federated DRL, which can reduce the system cost by up to 67.1% compared to the benchmarks.

Index Terms—Digital twin edge networks, task offloading, federated deep reinforcement learning

This work was supported in part by the National Natural Science Foundation of China under Grant 62201219 and the State Key Laboratory of Rail Traffic Control and Safety (Contract No. RCS2023K010), Beijing Jiaotong University.

Y. Dai, J. Zhao, and T. Jiang are with the Research Center of 6G Mobile Communications, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China (email: [email protected], jintang [email protected], [email protected]). J. Zhang is with the Institute of Space Integrated Ground Network, Hefei 230088, China (email: [email protected]). Y. Zhang is with the Department of Informatics, University of Oslo, Norway (email: [email protected]).

I. INTRODUCTION

The rapid development of the Internet of Things (IoT), the fast deployment of wireless communication infrastructures, and the popularity of artificial intelligence (AI) have paved the way towards supporting new applications, intelligent services, and superior network performance in 6G networks. However, the large-scale and complex distribution of network entities, the diverse and stringent performance requirements of devices, and the dynamics of network status make network resource management, on-demand service design, and fine-grained network arrangement face enormous challenges. Moreover, due to the lack of effective virtual simulation, complex network optimizations have to be applied directly to the existing network infrastructures to address the challenges of efficient and flexible network coordination.

Digital twin (DT) is a promising technology to tackle network resource management [1]. Generally, digital twin models collect sensor data, channel states, and environment characteristics from physical networks, predict the mobile behaviors of physical objects, detect dynamic changes of traffic, and then design optimization strategies. Through the mutual communication between physical objects and virtual twins, digital twin models can make intelligent decisions and operate highly efficiently, thereby helping the network achieve extremely simplified and intelligent network orchestration [2]. However, the realization of digital twins is challenging because the construction of digital twin models requires plenty of data and abundant computing power for high-accuracy model construction. In reality, distributed devices hold massive data but their computing resources are limited, while base stations with powerful computing resources cannot provide high-quality raw data. To this end, digital twin edge networks have been proposed to integrate digital twins with edge computing, gathering massive local data at the network edge and providing powerful task-processing ability in proximity to devices [3], [4]. In digital twin edge networks, devices can offload real-time collected data or virtual model training tasks to edge servers, and edge servers (e.g., base stations) can build virtual twins based on the offloaded information [5]. Further, edge nodes can also perceive the dynamic states of physical items to maintain high consistency of the virtual models [6].

Since DT can help mobile edge computing (MEC) networks monitor and predict network states, the combination of DT and MEC can effectively help the MEC system build the network topology and make resource allocation decisions. In [7], a DT-empowered MEC architecture for industrial automation was proposed to minimize the end-to-end delay by iteratively optimizing the transmit power of IoT devices, user association, and intelligent task offloading. In [8], the authors proposed a novel architecture for the industrial IoT by combining a DT capturing industrial features with federated learning to enhance the learning performance under resource constraints. However, although edge servers deploy powerful computing capabilities near devices, the weak communication link and complex signal propagation environment hinder the full potential of digital twin-based edge networks.

Deep reinforcement learning (DRL), as an intelligent method, has been widely adopted to solve complex optimization problems with dynamic features. Since DRL can leverage its agent to interact with the varying network and utilize deep neural networks to explore the complex relationships between different variables, it can design optimal
schemes with real-time state updates for complex networks [9], [10]. For instance, in [11], DRL was used to allocate wireless resource blocks to maximize network capacity while ensuring load balancing. In [12], a DRL-based strategy was proposed to learn the joint decisions of model partitioning and resource allocation for edge intelligence. However, traditional DRL operates in a centralized manner, in which data is collected by a central server for unified decision-making. This centralized operation lacks flexibility and scalability. Compared with centralized DRL, federated DRL is an efficient distributed machine learning approach that aims to solve the collaborative decision-making problem with multiple agents.

In this paper, we utilize federated DRL to jointly optimize task offloading and computing resources in digital twin edge networks. We first propose a digital twin edge network with a physical network layer and a digital twin layer. The physical network layer includes distributed devices, passive reflecting elements, and base stations, in which devices offload training tasks to powerful base stations with the assistance of passive reflecting links. Each physical component has a replica digital twin model that records its state in the digital twin layer. Then, we utilize the digital twin models to build the communication and computing models and formulate a system cost minimization problem to jointly optimize task offloading and computing resources. To solve the formulated problem, we develop a federated DRL based scheme with efficient cooperation between devices and base stations. We summarize the main contributions of this paper as follows:

• We propose a digital twin edge network to allow resource-constrained devices to offload training tasks to base stations. To perceive the status of network entities and the dynamic characteristics of channels, we construct digital twin models of physical objects to record computing capability and channel states with estimation divergence.

• The problem of jointly determining multi-device task offloading, the configurations of passive reflecting elements, and computing resources is formulated to minimize the system cost under stringent delay constraints. Due to its ability to interact with the environment, we leverage DRL to explore feasible solutions under dynamic changes.

• A federated DRL scheme is designed to minimize the system cost, in which a global agent explores edge computation resource allocation and the configurations of passive reflecting elements, while local agents are used to train the offloading decisions by estimating local computing ability. Numerical results demonstrate that the proposed federated DRL algorithm can accelerate the convergence speed.

The remainder of this paper is organized as follows. We present the digital twin edge networks and formulate the system cost minimization problem in Section III. The federated DRL algorithm to jointly optimize task offloading and resource allocation is designed in Section IV. Numerical results are shown in Section V. Finally, the paper is concluded in Section VI.

II. RELATED WORK

To fulfill the requirements of future edge networks, a series of technologies have emerged, such as digital twin, machine learning, federated learning, and edge AI. Among them, digital twin and machine learning have been widely studied in edge networks to solve the problems of computation offloading, parameter configuration, and resource allocation. In this section, we present the latest research on digital twin and DRL for edge networks.

A. Application of Digital Twin in Edge Networks

Digital twin refers to an exact digital replica of a real-world object across multiple granularity levels. Specifically, a DT is a digital representation of a physical entity or process that combines data processing techniques, mathematical models, machine learning algorithms, and other technologies to monitor, control, optimize, and predict the behavior of a physical entity throughout its entire lifecycle [13]. Recently, there has been increasing attention on integrating DT with edge networks. The existing works on digital twin edge networks mainly pay attention to low-latency task offloading [14], heterogeneous resource allocation [15], reliable edge association [16], and incentive mechanisms for model training [17]. More specifically, in [14], a digital twin edge network was proposed to estimate user mobility and predict network states, and a task offloading method was designed to reduce latency by utilizing the user mobility predicted from digital twins. The authors in [18] considered digital twin-assisted edge networks and jointly optimized transmit power, association, and offloading decisions to minimize task offloading latency by using an alternating optimization approach. The authors in [15] utilized digital twin-driven edge networks to allocate diverse resources to meet the dynamic service requirements of devices. In [16], digital twins were deployed at the edge servers to collect data from distributed devices. The authors in [17] conceived a lightweight space-air-ground digital twin network to build a ground digital twin model with real-time states and designed an incentive scheme to motivate ground nodes to take part in model training.

B. DRL for Computation Offloading

DRL is a promising technique to address problems with uncertain and stochastic features, and it can also be utilized to solve sophisticated optimization problems. Recently, plenty of studies have utilized DRL for optimizing computation offloading and allocating resources. The authors in [19] proposed a DRL-based method for joint server selection, cooperative offloading, and switching in dynamic environments. In [20], the authors proposed an intelligent online offloading framework based on reinforcement learning to maximize the computation rate of the system. In [21], a wireless-powered MEC network architecture that includes edge computing servers and multiple edge devices was proposed, and the authors employed a DRL-based algorithm to jointly optimize the wireless power transmission duration, the transmission time allocation for each edge device, and the partial offloading decisions to achieve the maximum sum computation rate. The authors in [22] investigated the resource allocation problem in multi-unmanned aerial vehicle (UAV) assisted
layer, we define three types of DT models: 1) devices' digital
passive reflecting elements serve the target device, the signals from other devices become interference to the target signal. We consider there are $N$ devices, and the channel vector from device $j$ is either $\mathbf{h}_j^d \in \mathbb{C}^{I\times 1}$ ($\forall j \in \{1,..,N\}, j \neq i$) or $\mathbf{h}_j^r \in \mathbb{C}^{K\times 1}$. Thus, the signal received at the BS from device $i$ is expressed as

$$y_i = \sqrt{p_i}(\mathbf{h}_i^d + \mathbf{G}\Phi\mathbf{h}_i^r)s_i + \sum_{j=1, j\neq i}^{N}\sqrt{p_j}(\mathbf{h}_j^d + \mathbf{G}\Phi\mathbf{h}_j^r)s_j + n_i, \quad (4)$$

where $p_i$ and $p_j$ are the transmit power of device $i$ and device $j$, and $s_i$ and $s_j$ are the signals sent from device $i$ and device $j$, respectively. $n_i$ is the noise, following the Gaussian distribution $\mathcal{N}(0, \sigma^2)$. In Eq. (4), the first item is the desired signal from the direct link and the reflection link, and the second item is the interference from other devices. The base station uses linear beamforming to decode the transmitted signal. Let $\mathbf{w}_i$ be the beamforming vector for device $i$. We enhance the signal in the target direction and suppress multipath signals from other directions by modifying the matrix of beamforming coefficients. The uplink SINR for task offloading can be denoted as

$$\gamma_i = \frac{p_i|\mathbf{w}_i^H(\mathbf{h}_i^d + \mathbf{G}\Phi\mathbf{h}_i^r)|^2}{\sum_{j=1, j\neq i}^{N} p_j|\mathbf{w}_i^H(\mathbf{h}_j^d + \mathbf{G}\Phi\mathbf{h}_j^r)|^2 + \sigma^2}. \quad (5)$$

Based on Eq. (5), we can adjust the reflection coefficient matrix $\Phi$ to compensate for the effects of signal fading and distortion caused by factors such as multipath propagation, long-distance transmission, and obstacles. Specifically, we can adjust the amplitude and phase in $\Phi$ to strengthen the channel gain of the target direction and suppress multipath signals from other directions. Thus, the transmission data rate for task offloading is denoted as

$$R_{im} = B_m \log_2\left(1 + \frac{p_i|\mathbf{w}_i^H(\mathbf{h}_i^d + \mathbf{G}\Phi\mathbf{h}_i^r)|^2}{\sum_{j=1, j\neq i}^{N} p_j|\mathbf{w}_i^H(\mathbf{h}_j^d + \mathbf{G}\Phi\mathbf{h}_j^r)|^2 + \sigma^2}\right), \quad (6)$$

where $B_m$ is the channel bandwidth of base station $m$. Based on the transmission data rate, we can estimate the offloading overhead in terms of latency and energy consumption.

C. Digital Twin for Computing Model

The training tasks of device $i$ are executed on both local and edge servers concurrently. Let $\lambda_i$ be the variable that describes the ratio of the data executed on edge servers to the total training data. We consider that each device offloads $\lambda_i d_i$ to the edge server and computes $(1-\lambda_i)d_i$ locally.

1) Local training: In the training process, we consider that the whole local computing resource is used for training. On the basis of the device digital twin model, we can estimate the local computing resources. Thus, the training time of each device can be estimated by

$$T_i^{loc} = \frac{(1-\lambda_i)d_i c_i}{f_i^{loc}(t) + \hat{f}_i^{loc}(t)}. \quad (7)$$

The energy consumed to perform unit computing is $\varsigma(f_i^{loc}(t) + \hat{f}_i^{loc}(t))^2$, where $\varsigma$ represents the effective switched capacitance [25]. The consumed energy for local training can be written as

$$E_i^{loc} = \varsigma(1-\lambda_i)d_i c_i (f_i^{loc}(t) + \hat{f}_i^{loc}(t))^2. \quad (8)$$

2) Offloading to edge server: Multiple devices may offload their training tasks to their associated base station simultaneously, and the base station needs to assign a certain amount of computing resources to these devices. The estimated total computing resource can be denoted as $F_m^{total} = f_m^{bs}(t) + \hat{f}_m^{bs}(t)$. We consider the computing resource of the base station assigned to device $i$ to be $f_{im}^{edge}$. If we prefer to offload part of the training task to the base station, the time consumption contains three items: 1) the task offloading delay from device $i$ for uplink transmission, 2) the delay of task training by consuming edge server resources, and 3) the delay of the downlink wireless transmission of the computation feedback. The transmission delay of the feedback can be ignored due to the small size of the data. Thus, the delay of task execution on offloading can be presented as

$$T_{im} = \frac{\lambda_i d_i c_i}{f_{im}^{edge}} + \frac{\lambda_i d_i}{R_{im}}, \quad (9)$$

where $\sum_{i=1}^{N} f_{im}^{edge} \leq F_m^{total}$. Similarly, the consumed energy on offloading and edge training can be expressed as

$$E_{im}^{edge} = p_i\frac{\lambda_i d_i}{R_{im}} + \lambda_i d_i c_i \cdot e, \quad (10)$$

where $e$ is the unit energy consumption.

D. System Cost

Local execution and edge offloading form a parallel process, so the training time of the device is determined by either the local training time or the edge execution time, i.e., the maximum value of these two items. We can denote the task execution time as

$$T_i = \max\left\{\frac{(1-\lambda_i)d_i c_i}{f_i^{loc}(t) + \hat{f}_i^{loc}(t)}, \frac{\lambda_i d_i (c_i R_{im} + f_{im}^{edge})}{R_{im} f_{im}^{edge}}\right\}. \quad (11)$$

Since both local execution and edge computing consume energy, the total consumed energy on task training is the sum of the two items, which can be expressed as

$$E_i = \varsigma(1-\lambda_i)d_i c_i (f_i^{loc}(t) + \hat{f}_i^{loc}(t))^2 + p_i\frac{\lambda_i d_i}{R_{im}} + \lambda_i d_i c_i \cdot e. \quad (12)$$

We consider the system cost to be the weighted sum of the energy consumption and the transmission and computing time of all users, which can be defined as

$$C(\mathbf{f}, \mathbf{w}, \Phi, \boldsymbol{\lambda}) = \sum_{i=1}^{N}\left[\alpha_i T_i(\mathbf{f}, \mathbf{w}, \Phi, \boldsymbol{\lambda}) + (1-\alpha_i)E_i(\mathbf{f}, \mathbf{w}, \Phi, \boldsymbol{\lambda})\right], \quad (13)$$

where $\alpha_i \in [0,1]$ is the weight of the training delay relative to the training energy consumption in the total system cost. According to the diverse performance requirements of users, $\alpha_i$ can be adjusted based on each user's performance requirements on latency and energy consumption.
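As a concrete reading of the cost model, the short sketch below evaluates Eqs. (7)-(13) for a single device; the function name and every numerical value are illustrative assumptions rather than settings taken from the paper.

```python
def device_cost(lam, d, c, f_loc, f_loc_hat, f_edge, R, p, varsigma, e, alpha):
    """Per-device term of the system cost in Eq. (13), built from Eqs. (7)-(12).

    lam: offloading ratio lambda_i; d: task data size (bits); c: CPU cycles per bit;
    f_loc, f_loc_hat: local CPU frequency and its digital-twin estimation deviation;
    f_edge: edge CPU frequency assigned to the device; R: uplink rate R_im;
    p: transmit power; varsigma: effective switched capacitance; e: unit edge energy;
    alpha: delay/energy weight alpha_i.
    """
    f_local = f_loc + f_loc_hat
    T_loc = (1 - lam) * d * c / f_local                   # Eq. (7): local training time
    E_loc = varsigma * (1 - lam) * d * c * f_local ** 2   # Eq. (8): local training energy
    T_off = lam * d * c / f_edge + lam * d / R            # Eq. (9): uplink + edge delay
    E_off = p * lam * d / R + lam * d * c * e             # Eq. (10): uplink + edge energy
    T_i = max(T_loc, T_off)                               # Eq. (11): parallel execution
    E_i = E_loc + E_off                                   # Eq. (12): total energy
    return alpha * T_i + (1 - alpha) * E_i                # one summand of Eq. (13)

# Example with assumed values: a 20 MB task, half of it offloaded.
print(device_cost(lam=0.5, d=20e6 * 8, c=80, f_loc=0.3e9, f_loc_hat=0.02e9,
                  f_edge=20e9, R=10e6, p=0.1, varsigma=1e-28, e=1e-9, alpha=0.5))
```

Summing this per-device term over all devices gives the system cost that the federated DRL scheme minimizes.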
Fig. 2: Federated DRL with the cooperation of global agent and local agent.
local computation resources, task information, and the offloading decisions of devices:

$$Task^g(\tau^g) = \{D(\tau^g), \lambda(\tau^g)\}, \quad (19)$$

where $D(\tau^g) = [D_1(\tau^g), ..., D_i(\tau^g), ..., D_N(\tau^g)]$ is the task information. Each $D_i(\tau^g)$ contains three items, that is, the data size, the unit required computation resource, and the deadline for task training. $\lambda(\tau^g) = [\lambda_1(\tau^g), ..., \lambda_N(\tau^g)]$ denotes the offloading decisions of devices.

2) Global action: According to the problem (16), the global DRL agent needs to explore the action

$$a(\tau^g) = \{\mathbf{f}(\tau^g), \mathbf{w}(\tau^g), \Phi(\tau^g)\}, \quad (20)$$

where $\mathbf{f}(\tau^g) = [f_{11}^{edge}(\tau^g), ..., f_{NM}^{edge}(\tau^g)]$ is the action of edge computation resource allocation and $\mathbf{w}(\tau^g) = [w_1(\tau^g), ..., w_M(\tau^g)]$ is the beamforming vector. Note that each element in $\mathbf{w}(\tau^g)$ has a real part and an imaginary part. $\Phi(\tau^g) = [\theta_1(\tau^g), ..., \theta_K(\tau^g), \beta_1(\tau^g), ..., \beta_K(\tau^g)]$ is the coefficient of the passive elements. With $a(\tau^g)$, the agent can estimate the immediate reward.

3) Global reward: The immediate reward measures the goodness of the explored action under an observed state. Here, we define the immediate reward as the objective value of (16). Based on the state and action, we can obtain

$$r(\tau^g) = -\sum_{i=1}^{N}\left\{\frac{\alpha_i\lambda_i(\tau^g)d_i(\tau^g)c_i(\tau^g)}{f_{im}^{edge}(\tau^g)} + \frac{\alpha_i\lambda_i(\tau^g)d_i(\tau^g)(1 + p_i(\tau^g)/\alpha_i)}{R_{im}(\tau^g)}\right\}. \quad (21)$$

If the learned actions satisfy the constraint (16b), the DRL agent gets this reward. Otherwise, it receives a penalty, which is a negative constant:

$$\Upsilon_t = \begin{cases} r(\tau^g), & \text{if (16b) is satisfied;} \\ plt_g, & \text{otherwise.} \end{cases} \quad (22)$$

Based on the immediate reward, the agent explores the cumulative reward. Generally, the cumulative reward can be defined as

$$R_t = \sum_{i=0}^{\infty}\zeta^i\Upsilon_{\tau^g+i+1}, \quad (23)$$

where $\zeta \in [0,1]$ represents the discount weight. Note that the policy indicates the probability of choosing action $a(\tau^g)$ at a certain state. If the agent is following a policy at time $\tau^g$, the policy has a probability of $\pi = p(a(\tau^g) = a\,|\,s(\tau^g) = s)$. DRL algorithms evaluate and improve the policy based on the value function. Based on $\pi$, the agent's state-action value function is represented as

$$Q^\pi(s(\tau^g), a(\tau^g)) = \mathbb{E}[\Upsilon_{\tau^g+1} + \zeta\Upsilon_{\tau^g+2} + ...\,|\,s(\tau^g), a(\tau^g)] = \mathbb{E}[\Upsilon_{\tau^g+1} + \zeta Q^\pi(s(\tau^g+1), a(\tau^g+1))\,|\,s(\tau^g), a(\tau^g)]. \quad (24)$$

Upon the state-action value function reaching the optimum, the Bellman optimality equation is obtained by

$$Q^*(s(\tau^g), a(\tau^g)) = \mathbb{E}[\Upsilon_{t+1} + \zeta\max Q^*(s(\tau^g+1), a(\tau^g+1))\,|\,s(\tau^g), a(\tau^g)]. \quad (25)$$

Thus, the optimal policy is achieved with

$$\pi(s(\tau^g))^* = \arg\max_a Q^*(s(\tau^g), a(\tau^g)). \quad (26)$$

Next, we introduce the detailed global learning on the base station.

Global learning: We use the deep deterministic policy gradient (DDPG) in global learning, as the action space of the problem (16) is continuous. The objective of global learning is to explore the actions of the configuration of passive elements and the computing resource allocation. DDPG is based on actor and critic networks, and it has three components: the main network, the target network, and the replay buffer. The main network produces the configuration of passive elements and the computing resource allocation policy with two deep neural networks: the main actor and main critic networks. The structure of the target network is the same as the structure of the main network.
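Before turning to the DDPG learner, the following sketch illustrates how the immediate reward of Eqs. (21)-(22) and the discounted return of Eq. (23) could be evaluated; the array-based interface, the function names, the penalty value, and all numbers in the example are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def global_reward(alpha, lam, d, c, p, f_edge, R, delay_ok, penalty):
    """Immediate global reward, Eqs. (21)-(22); all arrays hold one entry per device.

    Returns the negated weighted offloading cost when the delay constraint (16b)
    holds, otherwise the constant penalty (an assumed negative value).
    """
    edge_delay_cost = alpha * lam * d * c / f_edge        # edge execution term of Eq. (21)
    uplink_cost = alpha * lam * d * (1 + p / alpha) / R   # transmission term of Eq. (21)
    r = -np.sum(edge_delay_cost + uplink_cost)            # Eq. (21)
    return r if delay_ok else penalty                     # Eq. (22)

def discounted_return(rewards, zeta):
    """Cumulative reward of Eq. (23): R_t = sum_i zeta^i * reward_{t+i+1}."""
    return sum(zeta ** i, * []) if not rewards else sum(zeta ** i * r for i, r in enumerate(rewards))

# Toy example with assumed values for N = 2 devices.
ones = np.ones(2)
r = global_reward(alpha=0.5 * ones, lam=0.4 * ones, d=8e7 * ones, c=80 * ones,
                  p=0.1 * ones, f_edge=2e10 * ones, R=1e7 * ones,
                  delay_ok=True, penalty=-1000.0)
print(r, discounted_return([r, r, r], zeta=0.9))
```

A cleaner way to write the helper is simply `sum(zeta ** i * r for i, r in enumerate(rewards))`; the guard above only avoids summing an empty generator and can be dropped if the reward list is never empty.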
Algorithm 1 Global Learning for Configuration of Passive Elements and Computing Resource Allocation
1: Initialize global shared counter $T^g = 0$;
2: /*DDPG-based global learning*/
17: end for
18: end for

The replay buffer is used to break up the correlation among the experiences. Each experience tuple contains the current state, current action, current reward, and next state. The parameters of the critic and actor networks can be updated by using the samples of experience in the replay buffer.

In DDPG, based on $\theta_\pi^g$, the actor network learns the current policy $\pi(s(\tau^g)|\theta_\pi^g)$ by mapping states to a specific action in a deterministic way. To evaluate the performance of the learned action, the critic network utilizes $Q^\pi(s(\tau^g), a(\tau^g)|\theta_Q^g)$ based on $\theta_Q^g$. In the main network, the critic updates its parameter $\theta_Q^g$ through the loss function. Specifically, the agent first uniformly samples $U$ experiences $<s(j), a(j), \Upsilon_j, s(j+1)>$ from the replay buffer and generates the target value $y_j$, which is given by

$$y_j = \Upsilon_j + \zeta Q'(s(j+1), \pi'(s(j+1)|\theta_\pi^{g\prime})|\theta_Q^{g\prime}), \quad (27)$$

where $Q'(s(j+1), \pi'(s(j+1)|\theta_\pi^{g\prime})|\theta_Q^{g\prime})$ is measured by the target critic network, $\theta_\pi^{g\prime}$ is the network parameter of the target actor network, and $\theta_Q^{g\prime}$ is the parameter of the target critic network. Meanwhile, the agent calculates the loss based on

$$Loss(\theta_Q^g) = \mathbb{E}\left[(y_j - Q^\pi(s(j), a(j)|\theta_Q^g))^2\right]. \quad (28)$$

Then, we can calculate the loss function and its gradient to update the parameter of the main critic network. The gradient of the loss function is given by

$$\nabla_{\theta_Q^g}Loss(\theta_Q^g) = \mathbb{E}\left[2(y_j - Q^\pi(s(j), a(j)|\theta_Q^g))\nabla_{\theta_Q^g}Q^\pi(s(j), a(j))\right]. \quad (29)$$

Based on the previous neural network parameters of the target and main networks, the current parameters $\theta_\pi^{g\prime}$ and $\theta_Q^{g\prime}$ of the target network are updated with the soft updating method, where $\omega \in [0,1]$.

The details of global learning by using DDPG are shown in Algorithm 1. In the initialization step, the DDPG agent initializes the main network with parameters $\theta_\pi^g$ and $\theta_v^g$, and the target network with $\theta_\pi^{g\prime}$ and $\theta_Q^{g\prime}$. It also initializes the experience replay buffer, which is used to store the sample data generated by the interaction between the agent and the environment. After the initialization, the DDPG agent first catches the environment information in the training process, such as the wireless channel information and the training task information. Based on this state information, the agent explores the action, calculates its immediate reward, and updates the next state. The replay buffer packages the state, reward, and action as an experience and stores all experiences. When calculating the parameters of the neural networks, a mini-batch of experiences is sampled from the replay buffer before calculating the target value. The loss function is used to update the parameters of the main critic network, and the parameters of the main actor network are updated using policy gradients.
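The critic update of Eqs. (27)-(29) and the soft target update can be sketched in a few lines of PyTorch; the network sizes, the optimizer, the random mini-batch, and the explicit soft-update rule below are assumptions made for illustration, since the paper only states that the target parameters are softly updated with coefficient ω.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a | theta_Q): a minimal critic used only for illustration."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def ddpg_critic_step(batch, actor_t, critic, critic_t, critic_opt, zeta=0.9, omega=0.01):
    """One critic update of the global DDPG agent, sketching Eqs. (27)-(29).

    batch = (s, a, r, s_next) sampled from the replay buffer; actor_t and critic_t
    are the target networks, critic is the main critic. The final loop is the
    assumed standard DDPG soft target update with coefficient omega in [0, 1].
    """
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + zeta * critic_t(s_next, actor_t(s_next))   # target value y_j, Eq. (27)
    loss = nn.functional.mse_loss(critic(s, a), y)          # Eq. (28); autograd yields Eq. (29)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    for p_t, p in zip(critic_t.parameters(), critic.parameters()):
        p_t.data.mul_(1 - omega).add_(omega * p.data)        # soft update of the target critic
    return loss.item()

# Toy usage with assumed dimensions (state 8, action 4) and a random mini-batch of U = 32.
s_dim, a_dim, U = 8, 4, 32
actor_t = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, a_dim), nn.Tanh())
critic, critic_t = Critic(s_dim, a_dim), Critic(s_dim, a_dim)
critic_t.load_state_dict(critic.state_dict())
opt = torch.optim.Adam(critic.parameters(), lr=5e-3)
batch = (torch.randn(U, s_dim), torch.randn(U, a_dim), torch.randn(U, 1), torch.randn(U, s_dim))
print(ddpg_critic_step(batch, actor_t, critic, critic_t, opt))
```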
C. Local DRL for Task Offloading

With the coefficients of the passive elements and the computation resource allocation of the base station (i.e., $\{\mathbf{f}(\tau^g), \mathbf{w}(\tau^g), \Phi(\tau^g)\}$), problem (14) can be simplified as

$$\min_{\lambda_i} C_i(\lambda) = \alpha_i\max\left\{\frac{(1-\lambda_i)d_i c_i}{f_i^{loc}(t)+\hat{f}_i^{loc}(t)}, \frac{\lambda_i d_i(c_i R_{im}+f_{im}^{edge})}{R_{im}f_{im}^{edge}}\right\} + \varsigma(1-\lambda_i)d_i c_i(f_i^{loc}(t)+\hat{f}_i^{loc}(t))^2 + p_i\frac{\lambda_i d_i}{R_{im}} + \lambda_i d_i c_i\cdot e \quad (33a)$$

$$\text{s.t.}\ \frac{(1-\lambda_i)d_i c_i}{f_i^{loc}(t)+\hat{f}_i^{loc}(t)} \leq \tau_i, \quad (33b)$$

$$\frac{\lambda_i d_i(c_i R_{im}+f_{im}^{edge})}{R_{im}f_{im}^{edge}} \leq \tau_i, \quad (33c)$$

$$\text{(14e)}$$

According to the aforementioned DRL form, we transform the problem (33) with a local state, local action, and local reward.

1) Local state: At step $\tau^l$, each local DRL agent on a device gathers the state, including the communication data rate, task information, local computation resource, and edge computation resource:

$$s_i^l(\tau^l) = \{R_{im}(\tau^l), D_i(\tau^l), f_{im}^{edge}(\tau^l)\}. \quad (34)$$

Algorithm 2 Local Learning for Offloading
1: Obtain the configuration of passive elements and the computation resource allocation;
2: Obtain global agent parameters $\theta_\pi^g$ and $\theta_v^g$;
3: Initialize learning step counter $\tau^l$;
4: repeat
5: Reset $d\theta_\pi^g \leftarrow 0$ and $d\theta_v^g \leftarrow 0$;
6: /*Sharing global model to devices*/
7: Synchronize local agent parameters $\theta_\pi^l \leftarrow \theta_\pi^g$ and $\theta_v^l \leftarrow \theta_v^g$;
8: for each device do
9: Observe current state $s(\tau^l)$ and the configuration of passive elements and computation resource allocation (i.e., $\{\mathbf{f}(\tau^g), \mathbf{w}(\tau^g), \Phi(\tau^g)\}$);
10: $t_{start} = \tau^l$;
11: repeat
12: Explore $a(\tau^l)$ based on $\pi(s(\tau^l)|\theta_\pi^l)$;
13: Update $s(\tau^l+1)$ and $R^{loc}(s(\tau^l), a(\tau^l))$;
14: $\tau^l \leftarrow \tau^l + 1$; $T \leftarrow T + 1$;
15: until $\tau^l - t_{start} == t_{max}$
16: $R^{loc} = V(s(\tau^l)|\theta_v^l)$;
17: while receive $R^{loc}$ from each local agent do
18: for $j \in \{\tau^l-1, ..., \tau_{start}^l\}$ do
19: $R^{loc} \leftarrow R^{loc} + \delta R^{loc}$;
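The discounted return accumulated in lines 15-19 of Algorithm 2 can be sketched as follows; the helper name and the reward values are assumptions, and line 19 is read here as the standard asynchronous actor-critic accumulation R ← r_j + δR.

```python
def nstep_returns(rewards, bootstrap_value, delta=0.9):
    """Backward accumulation of the local return used in Algorithm 2 (lines 15-19).

    rewards: immediate local rewards collected over t_max steps, oldest first;
    bootstrap_value: V(s(tau^l)|theta_v^l) at the final state (line 16);
    delta: the local discount factor.
    Returns the target return for each visited step, oldest first.
    """
    R = bootstrap_value
    targets = [0.0] * len(rewards)
    for j in reversed(range(len(rewards))):   # j in {tau^l - 1, ..., tau_start}
        R = rewards[j] + delta * R            # assumed reading of line 19
        targets[j] = R
    return targets

# Example: three local steps with assumed rewards and a bootstrapped tail value.
print(nstep_returns([-1.2, -0.8, -1.0], bootstrap_value=-5.0, delta=0.9))
```

These per-step targets are what the local critic and actor updates of the next subsection consume.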
long-term cumulative rewards by applying strategy $\pi$ in state $s_i(\tau^l)$. Thus, it can be denoted as

$$V(s_i(\tau^l)|\theta_v^l) = \mathbb{E}\left[R_i^{loc}(\tau^l) + \delta V(s_i(\tau^l+1)|\theta_v^l)\,|\,s_i(\tau^l)\right]. \quad (37)$$

Based on $V(s_i(\tau^l))$, the temporal difference error can be calculated by

$$\Delta = R_i^{loc}(\tau^l) + \delta V(s_i(\tau^l+1)|\theta_v^l) - V(s_i(\tau^l)|\theta_v^l). \quad (38)$$

The loss function is presented as a mean square error to update the network parameters of the critic network. The loss can be defined as

$$Loss(\theta_v^l) = \sum_{\tau^l}\left(R_i^{loc}(\tau^l) + \delta V(s(\tau^l+1)|\theta_v^l) - V(s(\tau^l)|\theta_v^l)\right)^2. \quad (39)$$

We define $\alpha_v$ as the learning rate of the critic network. Thus, the parameter $\theta_v^l$ is updated by

$$\theta_v^l = \theta_v^l - \alpha_v\sum_{\tau^l}\nabla_{\theta_v^l}\left(R_i^{loc}(\tau^l) + \delta V(s(\tau^l+1)|\theta_v^l) - V(s(\tau^l)|\theta_v^l)\right)^2. \quad (40)$$

The actor network uses an advantage function and the gradient ascent method to update its network parameters. The advantage function $A^\pi(s_i(\tau^l), a_i(\tau^l))$ is denoted as the difference between the reward and the state-value function, which indicates whether the explored action is better than expected. $A^\pi(s_i(\tau^l), a_i(\tau^l))$ is defined as

$$A^\pi(s_i(\tau^l), a_i(\tau^l)) = R_i^{loc}(\tau^l) - V(s_i(\tau^l)|\theta_v^l) = \sum_{j=0}^{i-1}\delta^j R_j^{loc}(\tau^l+j) + \delta^j V(s_j(\tau^l+i)|\theta_v^l) - V(s_j(\tau^l)|\theta_v^l). \quad (41)$$

$\theta_\pi^l$ is the parameter of the actor network, and it is updated based on

$$\theta_\pi^l = \theta_\pi^l + \alpha_\pi\sum_{\tau^l}\nabla_{\theta_\pi^l}\log\pi(s_i(\tau^l)|\theta_\pi^l)A^\pi(s_i(\tau^l), a_i(\tau^l)) + \varphi\nabla_{\theta_\pi^l}H(\pi(s(\tau^l)|\theta_\pi^l)), \quad (42)$$

where $\alpha_\pi$ is the learning rate and $H(\cdot)$ represents the regularized entropy value to ensure that the decision-making agent fully explores the environment space. $\varphi$ is the coefficient of entropy and is set to a large value at the beginning so that the parameters of the actor change in the direction of increasing entropy.

Algorithm 2 shows the details of local learning by using asynchronous actor-critic DRL. In the initialization step, the agent initializes the actor with parameter $\theta_\pi^l \leftarrow \theta_\pi^g$ and the critic with $\theta_v^l \leftarrow \theta_v^g$. In each device, the agent first catches the network state and then explores the offloading action. After running $t_{max}$ steps, the agent uses the state-action value $V(s(\tau^l)|\theta_v^l)$ to calculate the future reward $R$. After multiple interactions with the environment, the agent utilizes the received reward, state-action value, action, and state to calculate the loss and gradient. The agent deployed on the base station uses the obtained loss and aggregated gradients to update the actor network parameter $\theta_\pi^l$ and the critic network parameter $\theta_v^l$.

The complexity of the proposed federated DRL is jointly determined by the efficiency of the local and global DRLs. Specifically, the complexity primarily depends on the number of agents, the neural network structures of the actor and critic networks used by each agent, and the training frequency. We consider that the actor and critic networks of the local DRL contain $J$ and $L$ fully connected layers, and the actor and critic networks of the global DRL contain $Z$ and $Q$ fully connected layers. Moreover, we denote the training frequencies of the local DRL and global DRL as $E$ and $B$, respectively. Then the overall complexity of the proposed federated DRL can be expressed as

$$O\Big(B\times N\times E\times\big(2\sum_{j=0}^{J}n_{A1,j}n_{A1,j+1} + 2\sum_{l=0}^{L}n_{C1,l}n_{C1,l+1}\big) + B\times\big(2\sum_{z=0}^{Z}n_{A2,z}n_{A2,z+1} + 2\sum_{q=0}^{Q}n_{C2,q}n_{C2,q+1}\big)\Big), \quad (43)$$

where $n_{A1,j}$, $n_{C1,l}$ and $n_{A2,z}$, $n_{C2,q}$ represent the unit numbers of the actor and critic networks for the local DRL and global DRL, respectively. The values of $n_{A1,0}$, $n_{C1,0}$ and $n_{A2,0}$, $n_{C2,0}$ are equal to the input sizes of the corresponding networks.

V. NUMERICAL RESULTS

A. Simulation Settings

The digital twin network for the experimental simulation includes 10 single-antenna devices under the coverage of a base station with 3 antennas. To improve the quality of wireless communication, a surface consisting of 5 passive reflecting elements is deployed between the devices and the base station. The base station uses a linear beamforming strategy to decode the signals directly transmitted by the mobile terminals and the signals from the passive reflecting elements. The bandwidth of the wireless channel for each device is 5 MHz. The noise power follows a Gaussian distribution with $\sigma^2 = 10^{-11}$ mW, and $\lambda$ is randomly distributed in [0, 1]. The digital twin models of devices and base stations estimate their real-time computing capabilities. We consider the computing capabilities of devices and base stations to follow the uniform distributions $U[0.1, 0.5]$ GHz and $U[10, 50]$ GHz, respectively. Each device needs to compute a training task with an accuracy constraint; its data size follows the distribution $U[1, 100]$ MB, and the number of CPU cycles required per byte follows the distribution $U[50, 100]$. In the federated DRL, we utilize ReLU in local DRL training and $\frac{\tanh(x)+1}{2}$ as the activation function for global training. The penalty is proportional to the number of devices, i.e., $100\times N$.

To verify the effectiveness of the proposed algorithm, we consider two reinforcement learning algorithms as benchmark strategies:
• A3C-based offloading: each device estimates its own computing resource and interacts with its surrounding environment to predict the offloading overhead, and then uses
B. Performance Analysis

Fig. 3: Comparison of the performance of A3C-based offloading, DDPG-based passive elements configuration and computing resource allocation, and federated DRL-based joint optimization.

Fig. 4: Impact of different learning rates.

Figs. 3-4 illustrate the impact of different variables on the cumulative system cost. Here, the x-axis is the number of episodes, and the y-axis refers to the cumulative system cost. To facilitate depicting the trend on the y-axis, we replace the original values on the y-axis with their logarithmic values (i.e., $y = \log(R_t)$). Fig. 3 shows the efficiency of the proposed federated DRL against two benchmarks. Here, the reward is the combination of the system overhead on time and energy; the smaller the reward, the smaller the system overhead. The solid line and shaded area in Fig. 3 represent the mean and standard deviation of the system overhead, respectively.

We can draw two conclusions from Fig. 3. First, the proposed federated DRL algorithm outperforms the two baseline algorithms in terms of reward by leveraging the cooperation of local and global agents. Specifically, compared with the A3C-based offloading, the federated DRL-based joint optimization algorithm optimizes the task offloading, edge computing resource allocation, and antenna beamforming coefficients, and adjusts the optimal reflection angle of the passive elements, resulting in a smaller system overhead. The A3C-based offloading can only determine the offloading proportion of each device in a fixed communication and computing environment, resulting in a larger system overhead. In comparison to the DDPG-based configuration of passive elements and resource allocation algorithm, the federated DRL algorithm can additionally adjust the offloading amount of each device, i.e., it adjusts the configuration of passive elements and computing resources while also optimizing the offloading coefficients, thereby further improving the wireless transmission rate and reducing the edge computing overhead. Moreover, the flexible offloading capability of the federated DRL algorithm can further reduce energy and time consumption based on different training requirements. Second, the federated DRL algorithm demonstrates a faster convergence speed than the two benchmarks. This is because local-global collaborative learning is executed in parallel, and the parameters of the agents are shared to jointly explore strategies, accelerating the convergence of the system.

To evaluate the effect of the learning rate on the performance of the activity evaluation and feedback network, we consider three different learning rates: {0.05, 0.005, 0.0005}. Fig. 4 illustrates the convergence behavior of the federated DRL algorithm under these three learning rates. From Fig. 4, our results indicate that the convergence behavior varies significantly with different learning rates. Specifically, the federated DRL model with a learning rate of 0.005 achieves the lowest cost, although it requires more episodes to converge compared to the case with a learning rate of 0.05. The cumulative cost under the learning rate of 0.05 is greater than that under 0.005 because an overly large learning rate can cause the model to miss the optimal value and lead to large oscillations in the convergence curve. When the learning rate is 0.0005, the curve converges quickly due to the small learning rate, but the cumulative cost is greater than that under a learning rate of 0.005 because an excessively small learning rate can cause the model to get stuck in local optima. Therefore, a learning rate of 0.005 is the best setting for the proposed federated DRL algorithm.

Fig. 5 illustrates how the number of terminal devices affects the cumulative system cost, cumulative latency cost, and cumulative energy cost. The experiment uses N = {10, 20, 30, 40} devices. As depicted in Fig. 5 (a), the system cost for all three algorithms rises when the number of terminal devices grows from 10 to 40. However, the proposed federated DRL algorithm always has the smallest cumulative system cost compared to the two benchmarks because it calculates the task offloading size, allocates edge computing resources, and
(a) Cumulative system cost (b) Cumulative delay cost (c) Cumulative energy cost
to achieve the lowest system cost. In summary, compared with other task offloading schemes, the federated DRL optimized offloading scheme can achieve a smaller system cost.

Fig. 7 shows the changes in the system cost of the three algorithms as the transmission power increases from -20 dB to 20 dB. When $P_t = \{-20, -10, 0\}$ dB, the system cost is roughly equal and relatively small. This is because increasing the device transmission power increases the energy cost of transmitting data, but it also increases the SINR, improves the data transmission speed, and reduces the transmission delay. In addition, selecting a larger device transmission power can improve the signal quality. However, when the transmission power increases from 10 dB to 20 dB, the slope of the curve increases sharply, indicating that the system transmission energy cost dominates. Furthermore, Fig. 7 shows that the federated DRL algorithm performs better than the two benchmarks at all transmission power levels.

VI. CONCLUSIONS

In this paper, a digital twin edge network is proposed to enable resource-constrained devices to offload training tasks to a powerful edge server over an enhanced wireless communication link. In the proposed digital twin edge network, the physical objects comprise devices, passive reflecting elements, and base stations. Utilizing digital twins, we construct virtual twins for each of them to record their dynamic changes. We then formulate a system cost minimization problem by jointly considering task offloading, the configurations of passive reflecting elements, and computing resource allocation. To solve the complex optimization problem, we design a federated DRL scheme, where the global agent is used to explore edge computation resource allocation and the coefficients of passive reflecting elements, and the local agents are responsible for offloading decision-making. Numerical results indicate that the proposed federated DRL can gradually improve its performance to adapt to different scenarios compared with other DRL methods.

REFERENCES

[1] R. Minerva, G. M. Lee, and N. Crespi, "Digital twin in the IoT context: A survey on technical features, scenarios, and architectural models," Proceedings of the IEEE, vol. 108, no. 10, pp. 1785–1824, 2020.
[2] Y. Dai, K. Zhang, S. Maharjan, and Y. Zhang, "Deep reinforcement learning for stochastic computation offloading in digital twin networks," IEEE Transactions on Industrial Informatics, vol. 17, no. 7, pp. 4968–4977, 2021.
[3] L. U. Khan, W. Saad, D. Niyato, Z. Han, and C. S. Hong, "Digital-twin-enabled 6G: Vision, architectural trends, and future directions," IEEE Communications Magazine, vol. 60, no. 1, pp. 74–80, 2022.
[4] X. Yuan, J. Chen, N. Zhang, J. Ni, F. R. Yu, and V. C. M. Leung, "Digital twin-driven vehicular task offloading and IRS configuration in the Internet of Vehicles," IEEE Transactions on Intelligent Transportation Systems, pp. 1–15, 2022.
[5] Y. Dai, K. Zhang, S. Maharjan, and Y. Zhang, "Edge intelligence for energy-efficient computation offloading and resource allocation in 5G beyond," IEEE Transactions on Vehicular Technology, vol. 69, no. 10, pp. 12175–12186, 2020.
[6] P. Gorla, K. V, V. Chamola, and M. Guizani, "A novel framework of federated and distributed machine learning for resource provisioning in 5G and beyond using mobile-edge SCBSs," IEEE Transactions on Network and Service Management, pp. 1–1, 2022.
[7] T. Do-Duy, D. Van Huynh, O. A. Dobre, B. Canberk, and T. Q. Duong, "Digital twin-aided intelligent offloading with edge selection in mobile edge computing," IEEE Wireless Communications Letters, vol. 11, no. 4, pp. 806–810, 2022.
[8] W. Sun, S. Lei, L. Wang, Z. Liu, and Y. Zhang, "Adaptive federated learning and digital twin for industrial Internet of Things," IEEE Transactions on Industrial Informatics, vol. 17, no. 8, pp. 5605–5614, 2021.
[9] G. He, S. Cui, Y. Dai, and T. Jiang, "Learning task-oriented channel allocation for multi-agent communication," IEEE Transactions on Vehicular Technology, vol. 71, no. 11, pp. 12016–12029, 2022.
[10] M. S. Allahham, A. A. Abdellatif, N. Mhaisen, A. Mohamed, A. Erbad, and M. Guizani, "Multi-agent reinforcement learning for network selection and resource allocation in heterogeneous multi-RAT networks," IEEE Transactions on Cognitive Communications and Networking, vol. 8, no. 2, pp. 1287–1300, 2022.
[11] S. Zhang, Z. Ni, L. Kuang, C. Jiang, and X. Zhao, "Load-aware distributed resource allocation for MF-TDMA ad hoc networks: A multi-agent DRL approach," IEEE Transactions on Network Science and Engineering, vol. 9, no. 6, pp. 4426–4443, 2022.
[12] F. Dong, H. Wang, D. Shen, Z. Huang, Q. He, J. Zhang, L. Wen, and T. Zhang, "Multi-exit DNN inference acceleration based on multi-dimensional optimization for edge intelligence," IEEE Transactions on Mobile Computing, vol. 22, no. 9, pp. 5389–5405, 2023.
[13] Y. Wu, K. Zhang, and Y. Zhang, "Digital twin networks: A survey," IEEE Internet of Things Journal, vol. 8, no. 18, pp. 13789–13804, 2021.
[14] W. Sun, H. Zhang, R. Wang, and Y. Zhang, "Reducing offloading latency for digital twin edge networks in 6G," IEEE Transactions on Vehicular Technology, vol. 69, no. 10, pp. 12240–12251, 2020.
[15] M. Adhikari, A. Munusamy, N. Kumar, and S. N. Srirama, "Cybertwin-driven resource provisioning for IoE applications at 6G-enabled edge networks," IEEE Transactions on Industrial Informatics, vol. 18, no. 7, pp. 4850–4858, 2022.
[16] Y. Lu, X. Huang, K. Zhang, S. Maharjan, and Y. Zhang, "Low-latency federated learning and blockchain for edge association in digital twin empowered 6G networks," IEEE Transactions on Industrial Informatics, vol. 17, no. 7, pp. 5098–5107, 2021.
[17] W. Sun, S. Lian, H. Zhang, and Y. Zhang, "Lightweight digital twin and federated learning with distributed incentive in air-ground 6G networks," IEEE Transactions on Network Science and Engineering, pp. 1–13, 2022.
[18] D. Van Huynh, V.-D. Nguyen, S. R. Khosravirad, V. Sharma, O. A. Dobre, H. Shin, and T. Q. Duong, "URLLC edge networks with joint optimal user association, task offloading and resource allocation: A digital twin approach," IEEE Transactions on Communications, vol. 70, no. 11, pp. 7669–7682, 2022.
[19] T. M. Ho and K.-K. Nguyen, "Joint server selection, cooperative offloading and handover in multi-access edge computing wireless network: A deep reinforcement learning approach," IEEE Transactions on Mobile Computing, vol. 21, no. 7, pp. 2421–2435, 2022.
[20] E. Mustafa, J. Shuja, K. Bilal, S. Mustafa, T. Maqsood, F. Rehman, and A. u. R. Khan, "Reinforcement learning for intelligent online computation offloading in wireless powered edge networks," Cluster Computing, vol. 26, no. 2, pp. 1053–1062, 2023.
[21] S. Zhang, H. Gu, K. Chi, L. Huang, K. Yu, and S. Mumtaz, "DRL-based partial offloading for maximizing sum computation rate of wireless powered mobile edge computing network," IEEE Transactions on Wireless Communications, vol. 21, no. 12, pp. 10934–10948, 2022.
[22] J. Chen, X. Cao, P. Yang, M. Xiao, S. Ren, Z. Zhao, and D. O. Wu, "Deep reinforcement learning based resource allocation in multi-UAV-aided MEC networks," IEEE Transactions on Communications, vol. 71, no. 1, pp. 296–309, 2023.
[23] T. Zhao, F. Li, and L. He, "DRL-based secure aggregation and resource orchestration in MEC-enabled hierarchical federated learning," IEEE Internet of Things Journal, vol. 10, no. 20, pp. 17865–17880, 2023.
[24] Y. Liu, L. Jiang, Q. Qi, and S. Xie, "Energy-efficient space–air–ground integrated edge computing for Internet of remote things: A federated DRL approach," IEEE Internet of Things Journal, vol. 10, no. 6, pp. 4845–4856, 2023.
[25] Y. Dai, D. Xu, S. Maharjan, and Y. Zhang, "Joint computation offloading and user association in multi-task mobile edge computing," IEEE Transactions on Vehicular Technology, vol. 67, no. 12, pp. 12313–12325, 2018.