
IEEE TRANSACTIONS ON CYBERNETICS, VOL. 51, NO. 6, JUNE 2021

Deep Reinforcement Learning for Multiobjective Optimization

Kaiwen Li, Tao Zhang, and Rui Wang

Abstract—This article proposes an end-to-end framework for solving multiobjective optimization problems (MOPs) using deep reinforcement learning (DRL), which we call the DRL-based multiobjective optimization algorithm (DRL-MOA). The idea of decomposition is adopted to decompose the MOP into a set of scalar optimization subproblems. Then, each subproblem is modeled as a neural network. Model parameters of all the subproblems are optimized collaboratively according to a neighborhood-based parameter-transfer strategy and the DRL training algorithm. Pareto-optimal solutions can be directly obtained through the trained neural-network models. Specifically, the multiobjective traveling salesman problem (MOTSP) is solved in this article using the DRL-MOA method by modeling the subproblem as a Pointer Network. Extensive experiments have been conducted to study the DRL-MOA, and various benchmark methods are compared with it. It is found that once the trained model is available, it can scale to newly encountered problems with no need for retraining the model. The solutions can be directly obtained by a simple forward calculation of the neural network; thereby, no iteration is required and the MOP can always be solved within a reasonable time. The proposed method provides a new way of solving the MOP by means of DRL. It has shown a set of new characteristics, for example, strong generalization ability and fast solving speed in comparison with the existing methods for multiobjective optimization. The experimental results show the effectiveness and competitiveness of the proposed method in terms of model performance and running time.

Index Terms—Deep reinforcement learning (DRL), multiobjective optimization, Pointer Network, traveling salesman problem.

Manuscript received September 27, 2019; revised January 3, 2020; accepted February 25, 2020. Date of publication March 18, 2020; date of current version May 18, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 61773390 and Grant 71571187. This article was recommended by Associate Editor H. Ishibuchi. (Corresponding author: Rui Wang.) The authors are with the College of Systems Engineering, National University of Defense Technology, Changsha 410073, China, and also with the Hunan Key Laboratory of Multi-Energy System Intelligent Interconnection Technology, Changsha 410073, China (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCYB.2020.2977661. Digital Object Identifier 10.1109/TCYB.2020.2977661

I. INTRODUCTION

MULTIOBJECTIVE optimization problems arise regularly in the real world where two or more objectives are required to be optimized simultaneously. Without loss of generality, a multiobjective optimization problem (MOP) can be defined as follows:

    min_x  f(x) = (f_1(x), f_2(x), ..., f_M(x))
    s.t.   x ∈ X                                            (1)

where f(x) consists of M different objective functions and X ⊆ R^D is the decision space. Since the M objectives usually conflict with each other, a set of tradeoff solutions called Pareto-optimal solutions is expected to be found for MOPs.

Among MOPs, various multiobjective combinatorial optimization problems have been investigated in recent years. A canonical example is the multiobjective traveling salesman problem (MOTSP): given n cities and M cost functions for traveling from city i to city j, one needs to find a cyclic tour of the n cities that minimizes the M cost functions. This is an NP-hard problem even for the single-objective TSP. The best known exact method, a dynamic programming algorithm, has a complexity of O(2^n n^2) for the single-objective TSP, and the multiobjective version appears to be much harder. Hence, in practice, approximate algorithms are commonly used to solve MOTSPs, that is, to find near-optimal solutions.

During the last two decades, multiobjective evolutionary algorithms (MOEAs) have proven effective in dealing with MOPs since they can obtain a set of solutions in a single run due to their population-based characteristic. NSGA-II [1] and MOEA/D [2] are two of the most popular MOEAs that have been widely studied and applied in many real-world applications. The two algorithms, as well as their variants, have also been applied to solve the MOTSP (see [3]–[5]).

In addition, several handcrafted heuristics especially designed for the TSP have been studied, such as the Lin–Kernighan heuristic [6] and the 2-opt local search [7]. By adopting these carefully designed tricks, a number of specialized methods have been proposed to solve the MOTSP, such as the Pareto local search method (PLS) [8], the multiple objective genetic local search algorithm (MOGLS) [9], and other similar variants [10]–[12]. More methods and details can be found in the review [13].

Evolutionary algorithms and/or handcrafted heuristics have long been recognized as suitable methods to handle such problems. However, these algorithms, as iteration-based solvers, suffer obvious limitations that have been widely discussed [13]–[15]. First, to find near-optimal solutions, especially when the dimension of the problem is large, a large number of iterations are required for population updating or iterative searching, usually leading to a long running time. Second, once there is a slight change in the problem, for example, a change in the city locations of the MOTSP, the algorithm may need to be rerun to compute the solutions. When it comes to newly encountered
problems, or even new instances of a similar problem, the algorithm needs to be revised to obtain a good result, which is known as the No Free Lunch theorem [16]. Furthermore, such problem-specific methods are usually optimized for one task only.

Carefully handcrafted evolution strategies and heuristics can certainly improve the performance. However, the recent advances in machine-learning algorithms have shown their ability to replace humans as the engineers of algorithms for solving different problems. Several years ago, most people used man-engineered features in the field of computer vision, but now deep neural networks (DNNs) have become the main technique. While DNNs focus on making predictions, deep reinforcement learning (DRL) is mainly used to learn how to make decisions. Thereby, we believe that DRL is a possible way of learning how to solve various optimization problems automatically, demanding no man-engineered evolution strategies and heuristics.

In this article, we explore the possibility of using DRL to solve MOPs, the MOTSP in particular, in an end-to-end manner, that is, given n cities as input, the optimal solutions can be directly obtained through a forward propagation of the trained network. The network model is trained through the trial-and-error process of DRL and can be viewed as a black-box heuristic or a meta-algorithm [17] with strong learned heuristics. Because of the exploring characteristic of DRL training, the obtained model can have a strong generalization ability, that is, it can solve problems that it never saw before.

This article is originally motivated by several recently proposed neural-network-based single-objective TSP solvers. Vinyals et al. [18] first proposed a Pointer Network that uses the attention mechanism [19] to predict the city permutation. This model is trained in a supervised way that requires an enormous number of TSP examples and their optimal tours as a training set. It is hard to use, and the supervised training process prevents the model from obtaining better tours than the ones provided in the training set. To resolve this issue, Bello et al. [20] adopted an Actor–Critic DRL training algorithm to train the Pointer Network with no need of providing the optimal tours. Nazari et al. [17] simplified the Pointer Network model and added dynamic element inputs to extend the model to solve the vehicle routing problem (VRP). Moreover, the advanced Transformer model has been employed to solve routing problems and proves to be effective as well [21], [22].

The recent progress in solving the TSP by means of DRL is really appealing and inspiring due to its noniterative yet efficient characteristic and strong generalization ability. However, there are no such studies concerning solving MOPs (or the MOTSP in particular) by DRL-based methods.

Therefore, this article proposes to use the DRL method to deal with MOPs based on a simple but effective framework, and the MOTSP is taken as a specific test problem to demonstrate its effectiveness.

The main contributions of this article are as follows.
1) This article provides a new way of solving the MOP by means of DRL. Some encouragingly new characteristics of the proposed method have been found in comparison with classical methods, for example, strong generalization ability and fast solving speed.
2) With a slight change of the problem instance, the classical methods usually need to be reconducted from scratch, which is impractical for applications, especially for large-scale problems. In contrast, our method is robust to problem changes. Once the model is trained, it can scale to problems that the algorithm never saw before in terms of the number and locations of the cities of the MOTSP.
3) The proposed method requires much lower running time than the classical methods, since the Pareto-optimal solutions can be directly obtained by a simple forward propagation of the trained networks without any population updating or iterative searching procedures.
4) Empirical studies show that the proposed method significantly outperforms the classical methods, especially for large-scale MOTSPs, in terms of both convergence and diversity, while requiring much less running time.

It is noted that several papers [23], [24] have introduced the concept of multiobjective DRL in the field of RL. However, they mainly focus on how to apply RL to control a robot with multiple goals, such as controlling a mountain car or controlling a submarine searching for treasures, as investigated in [23]. These studies are not explicitly proposed to deal with mathematical optimization problems like (1), and are thus out of the scope of this article.

The remainder of this article is organized as follows. Section II-A introduces the general framework of the DRL-based multiobjective optimization algorithm (DRL-MOA) that describes the idea of using DRL for solving MOPs. Section II-B elaborates the detailed modeling and the training process of solving the specific MOTSP problem by means of the proposed DRL-MOA framework. Finally, the effectiveness of the method is demonstrated through experiments in Sections III and IV.

II. DEEP-REINFORCEMENT-LEARNING-BASED MULTIOBJECTIVE OPTIMIZATION ALGORITHM

In this section, we propose to solve the MOP by means of DRL based on a simple but effective framework (DRL-MOA). First, the decomposition strategy [2] is adopted to decompose the MOP into a number of subproblems. Each subproblem is modeled as a neural network. Then, model parameters of all the subproblems are optimized collaboratively according to the neighborhood-based parameter-transfer strategy and the Actor–Critic [25] training algorithm. In particular, the MOTSP is taken as a specific problem to elaborate how to model and solve the MOP based on the DRL-MOA.

A. General Framework

Decomposition Strategy: Decomposition, as a simple yet efficient way to design multiobjective optimization algorithms, has fostered a number of studies in the community, for example, cellular-based MOGA [26], MOEA/D, MOEA/DD [27], DMOEA-εC [28], and NSGA-III [29]. The idea of decomposition is also adopted as the basic framework

of the proposed DRL-MOA in this article. Specifically, the MOP, for example, the MOTSP, is explicitly decomposed into a set of scalar optimization subproblems and solved in a collaborative manner. Solving each scalar optimization problem usually leads to a Pareto-optimal solution. The desired Pareto front (PF) can be obtained when all the scalar optimization problems are solved.

Specifically, the well-known weighted-sum approach [30] is employed. Certainly, other scalarizing methods can also be applied, for example, the Chebyshev and the penalty-based boundary intersection (PBI) methods [31], [32]. First, a set of uniformly spread weight vectors λ^1, ..., λ^N is given, for example, (1, 0), (0.9, 0.1), ..., (0, 1) for a biobjective problem, as shown in Fig. 1. Here, λ^j = (λ_1^j, ..., λ_M^j)^T, where M represents the number of objectives. Thus, the original MOP is converted into N scalar optimization subproblems by the weighted-sum approach. The objective function of the jth subproblem is as follows:

    minimize  g^ws(x | λ^j) = Σ_{i=1}^{M} λ_i^j f_i(x).        (2)

Fig. 1. Illustration of the decomposition strategy.

Therefore, the PF can be formed by the solutions obtained by solving all the N subproblems.
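To make the decomposition step concrete, the following is a minimal sketch (not the authors' released code) of how a set of uniformly spread weight vectors and the weighted-sum objective of (2) could be produced for a biobjective problem; the function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def uniform_weights(n_subproblems, n_objectives=2):
    """Evenly spread weight vectors (1,0), (0.9,0.1), ..., (0,1) for M = 2."""
    w1 = np.linspace(1.0, 0.0, n_subproblems)
    return np.stack([w1, 1.0 - w1], axis=1)        # shape (N, 2)

def weighted_sum(objectives, weight):
    """Scalarized objective g^ws(x | lambda^j) of Eq. (2).

    objectives: the M objective values f_1(x), ..., f_M(x)
    weight:     one weight vector lambda^j
    """
    return float(np.dot(weight, objectives))

# Example: scalarize a candidate solution under the 3rd subproblem.
weights = uniform_weights(100)                      # N = 100 subproblems
g = weighted_sum(np.array([12.4, 7.9]), weights[2])
```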
Neighborhood-Based Parameter-Transfer Strategy: To solve each subproblem by means of DRL, the subproblem is modeled as a neural network. Then, the N scalar optimization subproblems are solved in a collaborative manner by the neighborhood-based parameter-transfer strategy, which is introduced as follows.

According to (2), it is observed that two neighboring subproblems could have very close optimal solutions [2], as their weight vectors are adjacent. Thus, a subproblem can be solved with the assistance of the knowledge of its neighboring subproblems. Specifically, as the subproblem in this article is modeled as a neural network, the parameters of the network model of the (i−1)th subproblem can be expressed as [ω_{λ^{i−1}}, b_{λ^{i−1}}]. Here, [ω*, b*] represents the parameters of the neural-network model that have already been optimized and [ω, b] represents the parameters that are not optimized yet. Assume that the (i−1)th subproblem has been solved, that is, its network parameters have been optimized to a near optimum. Then, the best network parameters [ω*_{λ^{i−1}}, b*_{λ^{i−1}}] obtained in the (i−1)th subproblem are set as the starting point for the network training of the ith subproblem. Briefly, the network parameters are transferred from the previous subproblem to the next subproblem in a sequence, as depicted in Fig. 2. The neighborhood-based parameter-transfer strategy makes the training of the DRL-MOA model possible; otherwise, a tremendous amount of time would be required for training the N subproblems.

Fig. 2. Illustration of the parameter-transfer strategy.

Each subproblem is modeled and solved by the DRL algorithm, and all subproblems can be solved in sequence by transferring the network weights. Thus, the PF can finally be approximated according to the obtained models. Employing the decomposition in conjunction with the neighborhood-based parameter-transfer strategy, the general framework of DRL-MOA is presented in Algorithm 1.

Algorithm 1 General Framework of DRL-MOA
Input: the model of the subproblem M = [ω, b], weight vectors λ^1, ..., λ^N
Output: the optimal model M* = [ω*, b*]
 1: [ω_{λ^1}, b_{λ^1}] ← Random_Initialize
 2: for i ← 1 : N do
 3:   if i == 1 then
 4:     [ω*_{λ^1}, b*_{λ^1}] ← Actor_Critic([ω_{λ^1}, b_{λ^1}], g^ws(λ^1))
 5:   else
 6:     [ω_{λ^i}, b_{λ^i}] ← [ω*_{λ^{i−1}}, b*_{λ^{i−1}}]
 7:     [ω*_{λ^i}, b*_{λ^i}] ← Actor_Critic([ω_{λ^i}, b_{λ^i}], g^ws(λ^i))
 8:   end if
 9: end for
10: return [ω*, b*]
11: Given inputs of the MOP, the PF can be directly calculated by [ω*, b*].

One obvious advantage of the DRL-MOA is its modularity and simplicity of use. For example, the MOTSP can be solved by integrating any of the recently proposed DRL-based TSP solvers [17], [21] into the DRL-MOA framework. Also, other problems, such as the VRP and the knapsack problem, can be easily handled with the DRL-MOA framework by simply replacing the model of the subproblem. Moreover, once the trained model is available, the PF can be directly obtained by a simple forward propagation of the model.
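The outer loop of Algorithm 1 can be summarized in a few lines of pseudo-Python. This is a schematic sketch only: train_actor_critic and copy_parameters are hypothetical helpers standing in for the Actor–Critic training of Section II-B and for copying network weights, respectively.

```python
def drl_moa(model_init, weights, train_actor_critic, copy_parameters):
    """Solve the N subproblems in sequence, warm-starting each one
    from the optimized parameters of its neighbor (Algorithm 1)."""
    trained_models = []
    params = model_init()                          # random initialization for subproblem 1
    for j, lam in enumerate(weights):              # weights = [lambda^1, ..., lambda^N]
        if j > 0:
            # neighborhood-based parameter transfer from subproblem j-1
            params = copy_parameters(trained_models[-1])
        params = train_actor_critic(params, lam)   # optimize g^ws(. | lambda^j)
        trained_models.append(params)
    return trained_models                          # forward passes of these models give the PF
```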

The proposed DRL-MOA acts as an outer loop. The next issue is how to model and solve the decomposed scalar subproblems. Therefore, we take the MOTSP as a specific example and introduce how to model and solve the subproblem of the MOTSP in the next section.

B. Modeling the Subproblem of MOTSP

To solve the MOTSP, we first decompose it into a set of subproblems and solve each one collaboratively based on the foregoing DRL-MOA framework. Each subproblem is modeled and solved by means of DRL. This section introduces how to model the subproblem of the MOTSP in a neural-network manner. Here, a modified Pointer Network similar to [17] is used to model the subproblem, and the Actor–Critic algorithm is used for training.

Fig. 3. Input structure of the neural-network model for solving Euclidean biobjective TSPs.

1) Formulation of MOTSP: We recall the formulation of an MOTSP. One needs to find a tour of n cities, that is, a cyclic permutation ρ, to minimize M different cost functions simultaneously

    min  z_k(ρ) = Σ_{i=1}^{n−1} c^k_{ρ(i),ρ(i+1)} + c^k_{ρ(n),ρ(1)},   k = 1, ..., M        (3)

where c^k_{ρ(i),ρ(i+1)} is the kth cost of traveling from city ρ(i) to ρ(i+1). The cost functions may, for example, correspond to tour length, safety index, or tourist attractiveness in practical applications.
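As an illustration of (3), the following sketch evaluates the M tour costs of a permutation when each cost is defined by a Euclidean distance over its own coordinate set, as in the Euclidean biobjective instances used later; the array layout is an assumption made for this example.

```python
import numpy as np

def tour_costs(coords, tour):
    """Costs z_k(rho) of Eq. (3) for a cyclic tour.

    coords: array of shape (M, n, 2), one coordinate set per objective
    tour:   permutation of the n city indices (rho)
    """
    closed = np.append(tour, tour[0])             # return to the start city
    costs = []
    for k in range(coords.shape[0]):              # one cost function per objective
        legs = coords[k, closed[1:]] - coords[k, closed[:-1]]
        costs.append(np.linalg.norm(legs, axis=1).sum())
    return np.array(costs)

# Example: a random biobjective 40-city instance and a random tour.
rng = np.random.default_rng(0)
coords = rng.random((2, 40, 2))
print(tour_costs(coords, rng.permutation(40)))
```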
Fig. 4. Illustration of the model structure. The attention mechanism, in conjunction with the encoder and decoder, produces the probability of selecting the next city.

2) Model: In this part, the above problem is modeled using a modified Pointer Network [17].

First, the input and output structures of the network model are introduced. Let the given set of inputs be X = {s^i, i = 1, ..., n}, where n is the number of cities. Each s^i is represented by a tuple s^i = (s_1^i, ..., s_M^i), where s_j^i is the attribute of the ith city that is used to calculate the jth cost function. For instance, s_1^i = (x_i, y_i) represents the x- and y-coordinates of the ith city and is used to calculate the distance between two cities. Taking a biobjective TSP as an example where both cost functions are defined by the Euclidean distance [13], the input structure is shown in Fig. 3. The input is 4-D and consists of 4 × n values in total. Moreover, the output of the model is a permutation of the cities Y = {ρ_1, ..., ρ_n}. To map input X to output Y, the probability chain rule is used

    P(Y | X) = Π_{t=1}^{n} P(ρ_{t+1} | ρ_1, ..., ρ_t, X_t).        (4)

First, an arbitrary city is selected as ρ_1. At each decoding step t = 1, 2, ..., we choose ρ_{t+1} from the available cities X_t. The available cities X_t are updated every time a city is visited. In a nutshell, (4) provides the probability of selecting the next city according to ρ_1, ..., ρ_t, that is, the already visited cities.

Then, a modified Pointer Network similar to [17] is used to model (4). Its basic structure is the Sequence-to-Sequence model [33], a recently proposed powerful model in the field of machine translation, which maps one sequence to another. The general Sequence-to-Sequence model consists of two RNNs, called the encoder and the decoder. An encoder RNN encodes the input sequence into a code vector that contains the knowledge of the input. Based on the code vector, a decoder RNN is used to decode the knowledge vector into a desired sequence. Thus, the nature of the Sequence-to-Sequence model, which maps one input sequence to an output sequence, is suitable for solving the TSP.

In this article, the architecture of the model is shown in Fig. 4, where the left part is the encoder and the right part is the decoder. The model is elaborated as follows.

Encoder: An encoder is used to condense the input sequence into a vector. Since the coordinates of the cities convey no sequential information [17] and the order of city locations in the inputs is not meaningful, an RNN is not used in the encoder in this article. Instead, a simple embedding layer is used to encode the inputs into a vector, which decreases the complexity of the model and reduces the computational cost. Specifically, a 1-D convolution layer is used to encode the inputs into a high-dimensional vector [17] (d_h = 128 in this article), as shown in Fig. 4. The number of in-channels equals the dimension of the inputs. For example, the Euclidean biobjective TSP has a 4-D input as shown in Fig. 3, and thus the number of in-channels is four. The encoder finally results in an n × d_h vector, where n indicates the number of cities. It is noteworthy that the parameters of the 1-D convolution layer are shared among all the cities. This means that, no matter how many cities there are, each city shares the same set of parameters that encode the city information into a high-dimensional vector. Thus, the encoder is robust to the number of cities.
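A minimal sketch of such a shared 1-D convolution embedding is given below; PyTorch is assumed here purely for illustration and this is not necessarily the authors' implementation. With kernel size and stride equal to 1, the same linear map is applied to every city, so the layer is independent of the number of cities.

```python
import torch
import torch.nn as nn

class CityEncoder(nn.Module):
    """Shared per-city embedding: D_input channels -> d_h channels."""

    def __init__(self, d_input=4, d_hidden=128):
        super().__init__()
        # kernel_size=1, stride=1: one shared linear projection applied to each city
        self.conv = nn.Conv1d(d_input, d_hidden, kernel_size=1, stride=1)

    def forward(self, x):
        # x: (batch, D_input, n) -> embeddings e_1, ..., e_n of shape (batch, d_h, n)
        return self.conv(x)

# Example: a batch of 8 Euclidean biobjective instances with 40 cities each.
emb = CityEncoder()(torch.rand(8, 4, 40))   # -> torch.Size([8, 128, 40])
```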
Decoder: The decoder is used to unfold the obtained high-dimensional vector, which stores the knowledge of the inputs, into the output sequence. Different from the encoder, an RNN

is required in the decoder, as we need to summarize the information of the previously selected cities ρ_1, ..., ρ_t so as to make the decision ρ_{t+1}. An RNN has the ability of memorizing the previous outputs. In this article, we adopt the gated recurrent unit (GRU) [34], which has similar performance but fewer parameters than the long short-term memory (LSTM) employed in the original Pointer Network [17]. It is noted that the RNN is not directly used to output the sequence. What we need is the RNN decoder hidden state d_t at decoding step t, which stores the knowledge of the previous steps ρ_1, ..., ρ_t. Then, d_t and the encodings of the inputs e_1, ..., e_n are used together to calculate the conditional probability P(ρ_{t+1} | ρ_1, ..., ρ_t, X_t) over the next step of city selection. This calculation is realized by the attention mechanism. As shown in Fig. 4, to select the next city at step t + 1, we first obtain the hidden state d_t through the decoder. In conjunction with e_1, ..., e_n, the index of the next city can then be calculated using the attention mechanism.

Attention Mechanism: Intuitively, the attention mechanism calculates how relevant every input is at the next decoding step t. The most relevant one is given more attention and can be selected as the next visiting city. The calculation is as follows:

    u_j^t = v^T tanh(W_1 e_j + W_2 d_t),   j ∈ {1, ..., n}
    P(ρ_{t+1} | ρ_1, ..., ρ_t, X_t) = softmax(u^t)                  (5)

where v, W_1, and W_2 are learnable parameters. d_t is a key variable for calculating P(ρ_{t+1} | ρ_1, ..., ρ_t, X_t) as it stores the information of the previous steps ρ_1, ..., ρ_t. Then, for each city j, its score u_j^t is computed from d_t and its encoder hidden state e_j, as shown in Fig. 4. The softmax operator is used to normalize u_1^t, ..., u_n^t, and finally the probability of selecting each city j at step t is obtained. A greedy decoder can be used to select the next city. For example, in Fig. 4, city 2 has the largest P(ρ_{t+1} | ρ_1, ..., ρ_t, X_t) and so is selected as the next visiting city. Instead of selecting the city with the largest probability greedily, during training, the model selects the next city by sampling from the probability distribution.

3) Training Method: The model of the subproblem is trained using the well-known Actor–Critic method similar to [17] and [20]. However, as [17] and [20] train the model for the single-objective TSP, the training procedure is different for the MOTSP case, as presented in Algorithm 2. Next, we briefly introduce the training procedure.

Algorithm 2 Actor–Critic Training Algorithm
Input: θ, φ ← initialized parameters given in Algorithm 1
Output: the optimal parameters θ, φ
 1: for iteration ← 1, 2, ... do
 2:   generate N problem instances from {M_1, ..., M_M} for the MOTSP
 3:   for k ← 1, ..., N do
 4:     t ← 0
 5:     while not terminated do
 6:       select the next city ρ^k_{t+1} according to P(ρ^k_{t+1} | ρ^k_1, ..., ρ^k_t, X^k_t)
 7:       update X^k_t to X^k_{t+1} by leaving out the visited cities
 8:     end while
 9:     compute the reward R^k
10:   end for
11:   dθ ← (1/N) Σ_{k=1}^{N} (R^k − V(X^k_0; φ)) ∇_θ log P(Y^k | X^k_0)
12:   dφ ← (1/N) Σ_{k=1}^{N} ∇_φ (R^k − V(X^k_0; φ))^2
13:   θ ← θ + η dθ
14:   φ ← φ + η dφ
15: end for

Two networks are required for training: 1) an actor network, which is exactly the Pointer Network in this article and gives the probability distribution for choosing the next action, and 2) a critic network that evaluates the expected reward given a specific problem state. The critic network employs the same architecture as the Pointer Network's encoder and maps the encoder hidden state to the critic output.

The training is conducted in an unsupervised way. During the training, we generate the MOTSP instances from distributions {M_1, ..., M_M}. Here, M_m denotes the distribution of one type of input feature of the cities, for example, the city locations or the security indices of the cities. For example, for Euclidean instances of a biobjective TSP, M_1 and M_2 are both city coordinates, and each of them can be a uniform distribution over [0, 1] × [0, 1].

To train the actor and critic networks with parameters θ and φ, N instances are sampled from {M_1, ..., M_M}. For each instance, we use the actor network with the current parameters θ to produce the cyclic tour of the cities, and the corresponding reward can be computed. Then, the policy gradient is computed in line 11 (refer to [35] for the derivation of the policy gradient) to update the actor network. Here, V(X^k_0; φ) is the reward approximation of instance k calculated by the critic network. The critic network is then updated in line 12 by reducing the difference between the true observed rewards and the approximated rewards.
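The update in lines 11–14 of Algorithm 2 amounts to a REINFORCE-style policy gradient with a learned baseline. The snippet below is a schematic, PyTorch-flavored sketch of one such update under the weighted-sum reward; the actor and critic objects and their interfaces are assumptions made for illustration, not the released implementation.

```python
import torch

def actor_critic_step(actor, critic, batch, weight, actor_opt, critic_opt):
    """One training step on a batch of MOTSP instances for subproblem lambda^j.

    batch:  city inputs X_0 of shape (N, D_input, n)
    weight: torch tensor lambda^j used to scalarize the M tour costs
    """
    tours, log_prob = actor.sample(batch)          # sampled tours and sum of log P(rho_{t+1} | ...)
    costs = actor.tour_costs(batch, tours)         # (N, M) objective values of Eq. (3)
    reward = -(costs * weight).sum(dim=1)          # negative weighted-sum cost, Eq. (2)
    baseline = critic(batch).squeeze(-1)           # V(X_0; phi)

    actor_loss = -((reward - baseline).detach() * log_prob).mean()    # line 11
    critic_loss = torch.nn.functional.mse_loss(baseline, reward)      # line 12

    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()    # line 13
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step() # line 14
```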

Once all of the models of the subproblems are trained, the Pareto-optimal solutions can be directly output by a simple forward propagation of the models. The time complexity of a forward calculation of the encoder is O(d_h n), and the time complexity of a forward calculation of the decoder is O(d_h^2 n), where O(d_h^2) is the approximate time complexity of the RNN. Thus, the approximate time complexity of using DRL-MOA for solving the MOTSP is O(N n d_h^2), where N is the number of subproblems. As a forward propagation of the encoder–decoder neural network can be quite fast, the solutions can always be obtained within a reasonable time.

III. EXPERIMENTAL SETUPS

The proposed DRL-MOA is tested on biobjective TSPs. All experiments are conducted on a single GTX 2080Ti GPU. The code is written in Python and is publicly available1 to reproduce the experimental results and to facilitate future studies. Meanwhile, all experiments of the compared MOEAs are conducted on the standard software platform PlatEMO2 [36], which is written in MATLAB. The compared MOGLS is written in Python.3 All of the compared algorithms are run on an Intel 16-core i7-9800X CPU with 64-GB memory.

1 https://github.com/kevin031060/RL_TSP_4static
2 http://bimk.ahu.edu.cn/index.php?s=/Index/Software/index.html
3 https://github.com/kevin031060/Genetic_Local_Search_TSP

A. Test Instances

The considered biobjective TSP instances are described as follows [13].

Euclidean Instances: Euclidean instances are the commonly used test instances for the MOTSP [13]. Intuitively, both cost functions are defined by the Euclidean distance. The first cost is defined by the distance between the real coordinates of two cities i and j. The second cost of traveling from city i to city j is defined by another set of virtual coordinates used to calculate the second objective. Thus, the input is 4-D, as shown in Fig. 3.

Mixed Instances: In order to test the ability of our model to adapt to different input structures, Mixed instances with a 3-D input are tested. Here, the first cost function is still defined by the Euclidean distance between two points representing the real city locations, which is a 2-D input with x- and y-coordinates. Moreover, the second cost of traveling from city i to j is defined by a 1-D input. This 1-D input of city i can be interpreted as the altitude of city i, and the objective is to minimize the altitude variation when traveling between two cities. A smoother tour can be obtained with less altitude variation; therefore, we can reduce the fuel cost or improve the comfort of the journey. Thereby, Mixed instances have a 3-D input.

Training Set: As an unsupervised learning method, only the model input and the reward function are required during the training process, with no need for the best tours as labels. Euclidean instances with a 4-D input and Mixed instances with a 3-D input are generated as the training set. They are all generated from a uniform distribution over [0, 1].

Test Set: The standard TSP test problems kroA and kroB in the TSPLIB library [37] are used to construct the Euclidean test instances kroAB100, kroAB150, and kroAB200, which are commonly used MOTSP test instances [12], [13]. kroA and kroB are two different sets of city locations and are used to calculate the two Euclidean costs. For Mixed test instances, randomly generated 40-, 70-, 100-, 150-, and 200-city instances are constructed.

In this article, the model is trained on 40-city MOTSP instances and is used to approximate the PFs of 40-, 70-, 100-, 150-, and 200-city test instances.

B. Parameter Settings of Model and Training

Most parameters of the model and training are similar to those in [17], which solves single-objective TSPs effectively. Specifically, the parameter settings of the network model are shown in Table I. D_input represents the dimension of the input, that is, D_input = 4 for Euclidean biobjective TSPs. We employ a one-layer GRU RNN with a hidden size of 128 in the decoder. For the critic network, the hidden size is also set to 128.

TABLE I. Parameter settings of the model. 1D-Conv means the 1-D convolution layer. D_input represents the dimension of the input. Kernel size and stride are essential parameters of the 1-D convolution layer.

We train both the actor and critic networks using the Adam optimizer [38] with a learning rate η of 0.0001 and a batch size of 200. The Xavier initialization method [39] is used to initialize the weights for the first subproblem. Weights for the following subproblems are generated by the introduced neighborhood-based parameter-transfer strategy.

In addition, a different number of generated instances is required for training different types of models. Compared with the Mixed MOTSP, the model for the Euclidean MOTSP requires more weights to be optimized because its input dimension is larger, thus requiring more training instances in each iteration. In this article, we generate 500 000 instances to train the Euclidean biobjective TSP model and 120 000 instances to train the Mixed one. All the problem instances are generated from a uniform distribution over [0, 1] and are used for training for five epochs. It costs about 3 h to train on the Mixed instances and 7 h to train on the Euclidean instances. Once the model is trained, it can be used to directly output the PFs.
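As a concrete picture of the training data described in Section III, the following sketch samples random Euclidean (4-D) and Mixed (3-D) MOTSP instances from the uniform distribution over [0, 1]; the array shapes are assumptions chosen for this illustration.

```python
import numpy as np

def sample_euclidean_instance(n_cities, rng):
    """4-D input: real coordinates plus a second, virtual coordinate set."""
    return rng.random((4, n_cities))       # rows: x, y, x', y'

def sample_mixed_instance(n_cities, rng):
    """3-D input: real coordinates plus a 1-D altitude per city."""
    return rng.random((3, n_cities))       # rows: x, y, altitude

rng = np.random.default_rng(42)
train_batch = np.stack(
    [sample_euclidean_instance(40, rng) for _ in range(200)]
)                                           # (batch=200, D_input=4, n=40)
```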

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, DRL-MOA is compared with the classical MOEAs NSGA-II and MOEA/D on different MOTSP instances. The maximum number of iterations for NSGA-II and MOEA/D is set to 500, 1000, 2000, and 4000, respectively. The population size is set to 100 for NSGA-II and MOEA/D. The number of subproblems for DRL-MOA is set to 100 as well. The Tchebycheff approach, which we found performs better on the MOTSP, is used for MOEA/D. In addition, only the nondominated solutions are reserved in the final PF.

A. Results on Mixed-Type Biobjective TSP

We first test the model that is trained on 40-city Mixed-type biobjective TSP instances. The model is then used to approximate the PF of 40-, 70-, 100-, 150-, and 200-city instances.

The performance of the PFs obtained by all the compared algorithms on various instances is shown in Figs. 5–8. It is observed that the trained model can efficiently scale to biobjective TSPs with different numbers of cities. Although the model is obtained by training on the 40-city instances, it still exhibits good performance on the 70-, 100-, 150-, and 200-city instances. Moreover, the hypervolume (HV) performance indicator and the running time, obtained based on five runs, are also listed in Table II.

Fig. 5. Randomly generated 40-city Mixed biobjective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, and 4000 iterations are applied, respectively. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

Fig. 6. Randomly generated 100-city Mixed biobjective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, and 4000 iterations are applied, respectively. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

Fig. 7. Randomly generated 150-city Mixed biobjective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, and 4000 iterations are applied, respectively. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

Fig. 8. Randomly generated 200-city Mixed biobjective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, and 4000 iterations are applied, respectively. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

As shown in Fig. 5, all of the compared algorithms work well for small-scale problems, for example, the 40-city instances. By increasing the number of iterations, NSGA-II and MOEA/D even show a better ability of convergence. However, a large number of iterations leads to a large amount of computing time. For example, 4000 iterations cost 130.2 s for MOEA/D and 28.3 s for NSGA-II, while our method requires just 2.7 s.

As can be seen in Figs. 6–8, as the number of cities increases, the competitors NSGA-II and MOEA/D struggle to converge, while the DRL-MOA exhibits a much better ability of convergence.

For the 100-city instance in Fig. 6, MOEA/D shows a slightly better performance in terms of convergence than the other methods by running 4000 iterations in 140.3 s. However, the diversity of the solutions found by our method is much better than that of MOEA/D.

For the 150- and 200-city instances depicted in Figs. 7 and 8, NSGA-II and MOEA/D exhibit obviously inferior performance compared with our method in terms of both convergence and diversity. Even though the competitors are run for 4000 iterations, which is a fairly large number of iterations, DRL-MOA still shows a far better performance.

In addition, the DRL-MOA achieves the best HV compared to the other algorithms, as shown in Table II. Also, its running time is much lower in comparison with the competitors. Overall, the experimental results clearly indicate the effectiveness of DRL-MOA in solving large-scale biobjective TSPs. The trained model has learned how to select the next city given the city information and the already selected cities; thus, it does not suffer a deterioration of performance with an increasing number of cities. In contrast, NSGA-II and MOEA/D fail to converge within a reasonable computing time for large-scale biobjective TSPs. In addition, the PF obtained by the DRL-MOA method shows a significantly better diversity compared with NSGA-II and MOEA/D, whose PFs have a much smaller spread.
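For reference, the HV indicator reported in Tables II–V measures the objective-space region dominated by a PF with respect to a reference point (larger is better). The following is a small self-contained sketch of a 2-D hypervolume computation; the reference point and the function name are illustrative assumptions, since the article does not specify how HV is computed.

```python
import numpy as np

def hypervolume_2d(front, ref_point):
    """Area dominated by a biobjective (minimization) front w.r.t. ref_point."""
    pts = np.asarray(front, dtype=float)
    pts = pts[np.argsort(pts[:, 0])]             # sort by the first objective
    hv, prev_f2 = 0.0, ref_point[1]
    for f1, f2 in pts:
        if f1 >= ref_point[0] or f2 >= prev_f2:
            continue                             # dominated point or outside the box
        hv += (ref_point[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

print(hypervolume_2d([[1.0, 4.0], [2.0, 2.0], [3.0, 1.0]], ref_point=(5.0, 5.0)))
```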

TABLE II. HV values obtained by DRL-MOA, NSGA-II, and MOEA/D. Instances of 40-, 70-, 100-, 150-, and 200-city Mixed-type biobjective TSP are tested. The running time is listed. The best HV is marked with a gray background and the longest running time is marked in bold.

TABLE III. HV values obtained by DRL-MOA, NSGA-II, and MOEA/D. Instances of 40-, 70-, 100-, 150-, and 200-city Euclidean-type biobjective TSP are tested. The running time is listed. The best HV is marked with a gray background and the longest running time is marked in bold.

B. Results on Euclidean-Type Biobjective TSP

We then test the model on Euclidean-type instances. The DRL-MOA model is trained on 40-city instances and applied to approximate the PF of 40-, 70-, 100-, 150-, and 200-city instances. For the 100-, 150-, and 200-city problems, we adopt the commonly used kroAB100, kroAB150, and kroAB200 instances [13]. The HV indicator and computing time are shown in Table III.

Fig. 9. KroAB100 Euclidean biobjective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, and 4000 iterations are applied, respectively. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

Fig. 10. KroAB150 Euclidean biobjective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, and 4000 iterations are applied, respectively. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

Figs. 9–11 show the experimental results on the kroAB100, kroAB150, and kroAB200 instances. For the kroAB100 instance, by increasing the number of iterations to 4000, NSGA-II, MOEA/D, and DRL-MOA achieve a similar level of convergence, while MOEA/D performs slightly better. However, MOEA/D performs the worst in terms of diversity, with all solutions crowded in a small region, and its running time is not acceptable.

When the number of cities increases to 150 and 200, DRL-MOA significantly outperforms the competitors in terms of both convergence and diversity, as shown in Figs. 10 and 11. Even though 4000 iterations are conducted for NSGA-II and MOEA/D, there is still an obvious performance gap between the two methods and the DRL-MOA.

In terms of the HV indicator, as demonstrated in Table III, DRL-MOA performs the best on all instances. The running time of DRL-MOA is much lower than that of the compared MOEAs. Increasing the number of iterations for MOEA/D and NSGA-II can certainly improve the performance but results in a large amount of computing time. It requires more than 150 s for MOEA/D to reach an acceptable level of convergence. The computing time of NSGA-II is less, approximately 30 s for 4000 iterations; however, the performance of NSGA-II is always the worst among the compared methods.
Fig. 11. KroAB200 Euclidean biobjective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, and 4000 iterations are applied, respectively. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

Fig. 12. Randomly generated 500-city Euclidean biobjective TSP problem instance: the PF obtained using our method (trained using 40-city instances) in comparison with NSGA-II and MOEA/D. 500, 1000, 2000, and 4000 iterations are applied, respectively. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

Fig. 13. PFs obtained by DRL-MOA, NSGA-II, and MOEA/D on a randomly generated 3-objective 100-city TSP instance. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

Fig. 14. PFs obtained by DRL-MOA, NSGA-II, and MOEA/D on a randomly generated 3-objective 200-city TSP instance. (a) DRL-MOA and NSGA-II. (b) DRL-MOA and MOEA/D.

TABLE IV. HV values obtained by DRL-MOA, NSGA-II, and MOEA/D on 3- and 5-objective TSP instances. The best HV is marked with a gray background.

We further evaluate the performance of the model on 500-city instances. The model is still the one trained on 40-city instances, and it is used to approximate the PF of a 500-city instance. The results are shown in Fig. 12. It is observed that DRL-MOA significantly outperforms the competitors, and the performance gap is even larger than that on smaller-scale problems.

C. Extension to MOTSPs With More Objectives

In this section, the efficiency of our method is further evaluated on 3- and 5-objective TSPs. The model is still trained on 40-city instances and is used to approximate the PF of 100- and 200-city instances. The 3-objective TSP instances are constructed by combining two 2-D inputs and a 1-D input, similar to the Mixed-type instances. The 5-objective TSP instances are constructed in the same way with two 2-D inputs and three 1-D inputs.

The results of the experiments on the 3-objective TSP are visualized in Figs. 13 and 14. It can be observed that DRL-MOA significantly outperforms the classical MOEAs on both the 100- and 200-city instances. Moreover, DRL-MOA performs clearly better on the 200-city instance than on the 100-city instance, while the competitors struggle to converge.

In addition, the HV values on the 3- and 5-objective TSP instances, obtained based on five runs, are presented in Table IV. It can be seen that the DRL-MOA outperforms NSGA-II and MOEA/D on all instances. NSGA-II is still the least effective method.

D. Comparisons With Local-Search-Based Methods

In this section, DRL-MOA is compared with a local-search-based method. Ishibuchi and Murata first proposed the MOGLS [40] for multiobjective combinatorial optimization. It was further improved and specialized to solve the MOTSP in [41], which significantly outperforms the original MOGLS. In this section, the DRL-MOA is compared with the improved MOGLS.

Note that local search has been widely developed and various local-search-based methods that use a number of specialized techniques to improve effectiveness have been
TABLE V. HV values of the PFs obtained by MOGLS-100, MOGLS-200, MOGLS-300, DRL-MOA, and DRL-MOA+LS. The best and the second best HVs are marked with gray and light-gray backgrounds. The longest computational time is marked in bold.

Fig. 15. PFs obtained by MOGLS-100, MOGLS-200, and DRL-MOA on 200-city biobjective TSP instances. (a) PFs of Mixed instances. (b) PFs of Euclidean instances.

proposed in recent years. However, the goal of our method is not to outperform a nonlearned, specialized MOTSP algorithm. Rather, we show the fast solving speed and high generalization ability of our method via the combination of DRL and multiobjective optimization. Thus, we did not consider further local-search-based methods in this article.

As no source code was found, the improved MOGLS is implemented by ourselves strictly according to [41]. The algorithm is written in Python and run on the same machine to make fair comparisons, and the code is publicly available.4

The parameters, for example, the size of the temporary population and the number of initial solutions, are consistent with the settings in [41]. The local search uses a standard 2-opt algorithm, which is terminated when a prespecified number of iterations N_LS is completed. N_LS is used to control the balance between computational time and model performance [40]. The improved MOGLS with N_LS = 100, 200, 300 is used for comparison. It should be noted that increasing N_LS or repeating the local search until no better solution is found might further improve the performance of MOGLS, but it would be quite time consuming and is not experimented with in this article.

Since local search can further improve the quality of the solutions obtained by DRL, as reported in [22], we use a simple 2-opt local search to post-process the solutions obtained by DRL-MOA, leading to the results of DRL-MOA+LS. It is noted that the 2-opt is conducted only once for each solution and only costs several seconds in total. The results of the experiments on MOGLS-100, MOGLS-200, MOGLS-300, DRL-MOA, and DRL-MOA+LS are presented in Table V. For clarity, only the PFs obtained by MOGLS-100, MOGLS-200, and our method are visualized in Fig. 15.

It is found that using local search to post-process the solutions can further improve the performance. It can be observed that DRL-MOA+LS outperforms the compared MOGLS on all instances while requiring much less computational time. DRL-MOA without local search can also outperform MOGLS on all 200-city instances even when the MOGLS has run for 1000 s, while DRL-MOA only requires about 13 s. Although MOGLS performs slightly better than DRL-MOA on the 100-city instances, DRL-MOA can always obtain a comparable result within 7 s. The results of the experiments indicate the fast computing speed and guaranteed performance of the DRL-MOA method. Moreover, using local search to post-process the solutions can further improve the performance.

4 https://github.com/kevin031060/Genetic_Local_Search_TSP

E. Effectiveness of Parameter-Transfer Strategy

In this section, the effectiveness of the neighborhood-based parameter-transfer strategy is checked experimentally. First, the performances of the models that are trained with and without the parameter-transfer strategy are compared. Both are trained on 120 000 20-city MOTSP instances for five epochs. The PFs obtained by the two models are presented in Fig. 16(a). It is obvious that the performance of the model is dramatically poor if the parameter-transfer strategy is not used for its training.

Moreover, we train the model without applying the parameter-transfer strategy on 240 000 instances for ten epochs, that is, the model is trained four times longer than before. The result is presented in Fig. 16(b). It can be seen that, without the parameter-transfer strategy, even if the model is trained four times longer, it still exhibits a poor performance. Thus, it is effective and efficient to apply the parameter-transfer strategy for training; otherwise, it is impossible to obtain a promising model within a reasonable time.

F. Impact of Training on Different Numbers of Cities

The foregoing models are trained on 40-city instances. In this part, we try to figure out whether there is any difference in the model performance if the model is trained on 20-city instances. The performance of the model that is trained on 20-city instances is presented in Fig. 17(a).

It can be observed that the model trained on 20-city instances exhibits an apparently worse performance than the one trained on 40-city instances. A large number of solutions obtained by the 20-city model are crowded in several regions and there are fewer nondominated solutions. A possible reason for the deteriorated result is that, when training on 40-city instances, 40 city-selection decisions are made and evaluated in the process of training each instance, which is twice the number made when training on 20-city instances. Loosely speaking, if
both models use 120 000 instances, the 40-city model is trained on 120 000 × 40 cities, which is twice that of the 20-city model. Therefore, the model trained on 40-city instances is better. We can simply increase the number of training instances to improve the performance.

Fig. 16. Performances of the models that are trained with and without the parameter-transfer strategy. (a) Performances of two models: one is trained via the parameter-transfer strategy, and the other is trained without transferring the network weights; both are trained on 120 000 instances for five epochs. (b) Performances of two models: the first model is the same as that in (a); the second model is trained on 240 000 instances for ten epochs without applying the parameter-transfer strategy.

Fig. 17. Two models trained, respectively, on 20- and 40-city Euclidean biobjective TSP instances. They are used to approximate the PF of 40-, 100-, and 200-city problems. (a) Model trained on 20-city instances. (b) Model trained on 40-city instances.

Finally, it is also interesting to see that the solutions output by DRL-MOA are not all nondominated. Moreover, these solutions are not distributed evenly (along the provided search directions). These issues deserve more study in the future.

G. Summary of the Results

From the experimental results, we can conclude that the DRL-MOA is able to handle the MOTSP both effectively and efficiently. In comparison with classical multiobjective optimization methods, DRL-MOA has shown some encouragingly new characteristics, for example, strong generalization ability, fast solving speed, and promising quality of the solutions, which can be summarized as follows.

1) Strong Generalization Ability: Once the trained model is available, it can scale to newly encountered problems with no need of retraining the model. Moreover, its performance is less affected by the increase in the number of cities compared to existing methods.

2) A Better Balance Between the Solving Speed and the Quality of Solutions: The Pareto-optimal solutions can always be obtained within a reasonable time while the quality of the solutions is still guaranteed.

V. CONCLUSION

Multiobjective optimization, appearing in various disciplines, is a fundamental mathematical problem. Evolutionary algorithms have long been recognized as suitable methods to handle such problems. However, evolutionary algorithms, as iteration-based solvers, are difficult to use for online optimization. Moreover, without a large number of iterations and/or a large population size, evolutionary algorithms can hardly solve large-scale optimization problems [13]–[15].

Inspired by the very recent work on DRL for single-objective optimization, this article provides a new way of solving the MOP by means of DRL and has found very encouraging results. Specifically, on MOTSP instances, the proposed DRL-MOA significantly outperforms NSGA-II, MOEA/D, and MOGLS in terms of solution convergence, spread performance, and computing time, thus making a strong case for using the DRL-MOA, a noniterative solver, to deal with MOPs in the future.

With respect to future studies, first, in the current DRL-MOA, a 1-D convolution layer that corresponds to the city information is used for the inputs. A distance matrix used as the input, that is, a 2-D convolution layer, can be further studied. Second, the distribution of the solutions obtained by the DRL-MOA is not as even as expected; therefore, it is worth investigating how to improve the distribution of the obtained solutions. Overall, multiobjective optimization by DRL is still in its infancy. It is expected that this article will motivate more researchers to investigate this promising direction, developing more advanced methods in the future.

REFERENCES

[1] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.
[2] Q. Zhang and H. Li, "MOEA/D: A multiobjective evolutionary algorithm based on decomposition," IEEE Trans. Evol. Comput., vol. 11, no. 6, pp. 712–731, Dec. 2007.
[3] L. Ke, Q. Zhang, and R. Battiti, "MOEA/D-ACO: A multiobjective evolutionary algorithm using decomposition and ant colony," IEEE Trans. Cybern., vol. 43, no. 6, pp. 1845–1859, Dec. 2013.
[4] B. A. Beirigo and A. G. dos Santos, "Application of NSGA-II framework to the travel planning problem using real-world travel data," in Proc. IEEE Congr. Evol. Comput. (CEC), Vancouver, BC, Canada, 2016, pp. 746–753.
[5] W. Peng, Q. Zhang, and H. Li, "Comparison between MOEA/D and NSGA-II on the multi-objective travelling salesman problem," in Multi-Objective Memetic Algorithms. Heidelberg, Germany: Springer, 2009, pp. 309–324.
[6] S. Lin and B. W. Kernighan, "An effective heuristic algorithm for the traveling-salesman problem," Oper. Res., vol. 21, no. 2, pp. 498–516, 1973.
[7] D. Johnson, "Local search and the traveling salesman problem," in Proc. 17th Int. Colloquium Automata Lang. Program. (Lecture Notes Comput. Sci.), 1990, pp. 443–460.
[8] E. Angel, E. Bampis, and L. Gourvès, "A dynasearch neighborhood for the bicriteria traveling salesman problem," in Metaheuristics for Multiobjective Optimisation. Heidelberg, Germany: Springer, 2004, pp. 153–176.
[9] A. Jaszkiewicz, "On the performance of multiple-objective genetic local search on the 0/1 knapsack problem—A comparative experiment," IEEE Trans. Evol. Comput., vol. 6, no. 4, pp. 402–412, Aug. 2002.
[10] L. Ke, Q. Zhang, and R. Battiti, "Hybridization of decomposition and local search for multiobjective optimization," IEEE Trans. Cybern., vol. 44, no. 10, pp. 1808–1820, Oct. 2014.
[11] X. Cai, Y. Li, Z. Fan, and Q. Zhang, "An external archive guided multiobjective evolutionary algorithm based on decomposition for combinatorial optimization," IEEE Trans. Evol. Comput., vol. 19, no. 4, pp. 508–523, Aug. 2015.
[12] X. Cai, H. Sun, Q. Zhang, and Y. Huang, "A grid weighted sum Pareto local search for combinatorial multi and many-objective optimization," IEEE Trans. Cybern., vol. 49, no. 9, pp. 3586–3598, Sep. 2018.
[13] T. Lust and J. Teghem, "The multiobjective traveling salesman problem: A survey and a new approach," in Advances in Multi-Objective Nature Inspired Computing. Heidelberg, Germany: Springer, 2010, pp. 119–141.
[14] X. Zhang, Y. Tian, R. Cheng, and Y. Jin, "A decision variable clustering-based evolutionary algorithm for large-scale many-objective optimization," IEEE Trans. Evol. Comput., vol. 22, no. 1, pp. 97–112, Feb. 2018.
[15] M. Ming, R. Wang, and T. Zhang, "Evolutionary many-constraint optimization: An exploratory analysis," in Proc. Int. Conf. Evol. Multi Criterion Optim., 2019, pp. 165–176.
[16] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization," IEEE Trans. Evol. Comput., vol. 1, no. 1, pp. 67–82, Apr. 1997.
[17] M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác, "Reinforcement learning for solving the vehicle routing problem," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 9839–9849.
[18] O. Vinyals, M. Fortunato, and N. Jaitly, "Pointer networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2692–2700.
[19] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014. [Online]. Available: arXiv:1409.0473.
[20] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, "Neural combinatorial optimization with reinforcement learning," 2016. [Online]. Available: arXiv:1611.09940.
[21] W. Kool, H. van Hoof, and M. Welling, "Attention, learn to solve routing problems!" 2018. [Online]. Available: arXiv:1803.08475.
[22] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau, "Learning heuristics for the TSP by policy gradient," in Proc. Int. Conf. Integr. Constraint Program. Artif. Intell. Oper. Res., 2018, pp. 170–181.
[23] C.-H. Hsu et al., "MONAS: Multi-objective neural architecture search using reinforcement learning," 2018. [Online]. Available: arXiv:1806.10332.
[24] H. Mossalam, Y. M. Assael, D. M. Roijers, and S. Whiteson, "Multi-objective deep reinforcement learning," 2016. [Online]. Available: arXiv:1610.02707.
[25] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[26] T. Murata, H. Ishibuchi, and M. Gen, "Specification of genetic search directions in cellular multi-objective genetic algorithms," in Proc. Int. Conf. Evol. Multi Criterion Optim., 2001, pp. 82–95.
[27] K. Li, K. Deb, Q. Zhang, and S. Kwong, "An evolutionary many-objective optimization algorithm based on dominance and decomposition," IEEE Trans. Evol. Comput., vol. 19, no. 5, pp. 694–716, Oct. 2015.
[28] J. Chen, J. Li, and B. Xin, "DMOEA-εC: Decomposition-based multiobjective evolutionary algorithm with the ε-constraint framework," IEEE Trans. Evol. Comput., vol. 21, no. 5, pp. 714–730, Oct. 2017.
[29] K. Deb and H. Jain, "An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: Solving problems with box constraints," IEEE Trans. Evol. Comput., vol. 18, no. 4, pp. 577–601, Aug. 2014.
[30] K. Miettinen, Nonlinear Multiobjective Optimization, vol. 12. New York, NY, USA: Springer, 2012.
[31] R. Wang, Z. Zhou, H. Ishibuchi, T. Liao, and T. Zhang, "Localized weighted sum method for many-objective optimization," IEEE Trans. Evol. Comput., vol. 22, no. 1, pp. 3–18, Feb. 2018.
[32] R. Wang, Q. Zhang, and T. Zhang, "Decomposition-based algorithms using Pareto adaptive scalarizing methods," IEEE Trans. Evol. Comput., vol. 20, no. 6, pp. 821–837, Dec. 2016.
[33] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
[34] K. Cho et al., "Learning phrase representations using RNN encoder–decoder for statistical machine translation," 2014. [Online]. Available: arXiv:1406.1078.
[35] V. R. Konda and J. N. Tsitsiklis, "Actor–Critic algorithms," in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 1008–1014.
[36] Y. Tian, R. Cheng, X. Zhang, and Y. Jin, "PlatEMO: A MATLAB platform for evolutionary multi-objective optimization," IEEE Comput. Intell. Mag., vol. 12, no. 4, pp. 73–87, Nov. 2017.
[37] G. Reinelt, "TSPLIB—A traveling salesman problem library," ORSA J. Comput., vol. 3, no. 4, pp. 376–384, 1991.
[38] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014. [Online]. Available: arXiv:1412.6980.
[39] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Stat., 2010, pp. 249–256.
[40] H. Ishibuchi and T. Murata, "A multi-objective genetic local search algorithm and its application to flowshop scheduling," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 28, no. 3, pp. 392–403, Aug. 1998.
[41] A. Jaszkiewicz, "Genetic local search for multi-objective combinatorial optimization," Eur. J. Oper. Res., vol. 137, no. 1, pp. 50–71, 2002.

Kaiwen Li received the B.S. and M.S. degrees from the National University of Defense Technology (NUDT), Changsha, China, in 2016 and 2018, respectively. He is a student with the College of Systems Engineering, NUDT. His research interests include prediction techniques, multiobjective optimization, reinforcement learning, data mining, and optimization methods for the energy Internet.

Tao Zhang received the B.S., M.S., and Ph.D. degrees from the National University of Defense Technology (NUDT), Changsha, China, in 1998, 2001, and 2004, respectively. He is a Professor with the College of Systems Engineering, NUDT. His research interests include multicriteria decision making, optimal scheduling, data mining, and optimization methods for the energy Internet.

Rui Wang received the B.S. degree from the National University of Defense Technology (NUDT), Changsha, China, in 2008, and the Ph.D. degree from the University of Sheffield, Sheffield, U.K., in 2013. He is a Lecturer with the College of Systems Engineering, NUDT. His research interests include evolutionary computation, multiobjective optimization, machine learning, and various applications using evolutionary algorithms.