
Constrained Reinforcement Learning for Dynamic Material Handling

Chengpeng Hu1,2, Ziming Wang1,2, Jialin Liu2,1, Junyi Wen3, Bifei Mao3 and Xin Yao2,1

1 Research Institute of Trustworthy Autonomous Systems (RITAS), Southern University of Science and Technology, Shenzhen, China.
2 Guangdong Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China.
3 Trustworthiness Theory Research Center, Huawei Technologies Co., Ltd, Shenzhen, China.

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Corresponding author: Jialin Liu ([email protected]).

arXiv:2305.13824v1 [cs.LG] 23 May 2023

Abstract—As one of the core parts of flexible manufacturing systems, material handling involves the storage and transportation of materials between workstations with automated vehicles. Improvements in material handling can boost the overall efficiency of the manufacturing system. However, the occurrence of dynamic events during the optimisation of task arrangements poses a challenge that requires adaptability and effectiveness. In this paper, we aim at the scheduling of automated guided vehicles for dynamic material handling. Motivated by real-world scenarios, unknown new tasks and unexpected vehicle breakdowns are regarded as dynamic events in our problem. We formulate the problem as a constrained Markov decision process which takes into account tardiness and available vehicles as cumulative and instantaneous constraints, respectively. An adaptive constrained reinforcement learning algorithm that combines Lagrangian relaxation and invalid action masking, named RCPOM, is proposed to address the problem with the two hybrid constraints. Moreover, a gym-like dynamic material handling simulator, named DMH-GYM, is developed and equipped with diverse problem instances, which can be used as benchmarks for dynamic material handling. Experimental results on the problem instances demonstrate the outstanding performance of our proposed approach compared with eight state-of-the-art constrained and non-constrained reinforcement learning algorithms, and with widely used dispatching rules for material handling.

Index Terms—Dynamic material handling, constrained reinforcement learning, automated guided vehicle, manufacturing system, benchmark

I. INTRODUCTION

Material handling can be widely found in manufacturing, warehouses and other logistic scenarios. It aims to transport goods from their storage locations to delivery sites. With the help of automated guided vehicles (AGVs), tasks and jobs can be accomplished quickly and automatically. An example of material handling is shown in Fig. 1. In real-world flexible manufacturing, AGV scheduling plans usually need to be changed due to dynamics such as newly arrived tasks, due time changes, as well as vehicle and site breakdowns [1]. These dynamics pose a serious challenge to vehicle scheduling, called dynamic material handling (DMH).

Fig. 1. Simulation of material handling.

Dispatching rules [2], [3] are classic and common methods for DMH. Although easy to implement, they suffer in complex situations, which usually leads to suboptimal performance that is hard to improve further. Considering such issues, search-based methods come to prominence. For example, a hybrid genetic algorithm and ant colony optimisation is proposed for an intelligent manufacturing workshop [4]. However, the search process is often time-consuming. The aforementioned methods are therefore less suitable in real-world scenarios where fast response and adaptability are needed.

Recently, some work has leveraged reinforcement learning (RL) for DMH [5], [6]. The problem is formulated as a Markov decision process (MDP) and the reward function is constructed manually according to makespan and travel distance [5], [6]. Trained parameterised policies then schedule the vehicles in real time when new tasks arrive. However, more complicated scenarios involving vehicle breakdowns, or vital problem constraints such as task delay, are not considered while optimising the policy [5], [6]. It is impossible to overlook these scenarios when dealing with real-world problems. For example, manufacturing accidents may happen if a newly emerged task is assigned to a broken vehicle, which leads to a task delay, or the task may never be completed.
Motivated by real-world scenarios in flexible manufacturing systems, we consider a DMH problem with multiple AGVs and two dynamic events, namely newly arrived tasks and vehicle breakdowns during the handling process. To tackle the dynamics and constraints, we formulate the problem as a constrained Markov decision process (CMDP), considering the tardiness of tasks as a cumulative constraint. In addition, the status of vehicles, including "idle", "working" and "broken" caused by vehicle breakdowns, is considered as an instantaneous constraint which determines whether a vehicle can be assigned a task or not. A constrained reinforcement learning (CRL) approach is proposed to perform real-time scheduling considering two widely used performance indicators in manufacturing, makespan and tardiness. The proposed approach, named RCPOM, incorporates Lagrangian relaxation and invalid action masking to handle both the cumulative constraint and the instantaneous constraint. Moreover, due to the lack of free simulators and instance datasets for DMH, we develop an OpenAI gym-like [7] simulator and provide diverse problem instances as a benchmark (code available at https://github.com/HcPlu/DMH-GYM).

The main contributions of our work are summarised as follows.
• We formulate a dynamic material handling problem considering multiple vehicles, newly arrived tasks and vehicle breakdowns as a CMDP with both cumulative and instantaneous constraints.
• Extending OpenAI's gym, we develop an open-source simulator, named DMH-GYM, and provide a dataset comprised of diverse instances for researching the dynamic material handling problem.
• Considering the occurrence of newly arrived tasks and vehicle breakdowns as dynamic events, a CRL algorithm combining a Lagrangian relaxation method and the invalid action masking technique, named RCPOM, is proposed to address this problem.
• Extensive experiments show the adaptability and effectiveness of the proposed method for solving this problem, compared with eight state-of-the-art algorithms, including CRL agents, RL agents and several dispatching rules.

II. BACKGROUND

AGVs are widely used in material handling systems [8], meeting the high requirement of automation. Scheduling and routing are usually regarded as two separate parts of material handling. The former refers to the task dispatching considered in this paper. The latter aims to find feasible routing paths for AGVs to transfer loads. Pathfinding is not the main focus of this paper, since fixed paths are commonly found in many real-world cases [9], [10], [11].

Scheduling of DMH [12] refers to the problem with real-time dynamic events, such as newly arrived tasks, due time changes and vehicle breakdowns [1], which are commonly found in modern flexible manufacturing systems and warehouses. Traditional dispatching rules [3], e.g., first come first served (FCFS), earliest due date first (EDD), and nearest vehicle first (NVF), are widely used to solve the dynamic optimisation problem due to their simplicity. Although reasonable solutions may be found by these rules, they hardly get further improvement. Dispatching rules are usually constructed with some specific considerations, thus few single rules work well in general cases [13]. A mixed rule policy that combines multiple dispatching rules improves the performance of a single one [14]. New AGVs and jobs are considered as dynamic events and solved with a multi-agent negotiation method [15]. An adaptive mixed integer programming model triggered by events is applied to solve the single-AGV scheduling problem with time windows [16]. Nevertheless, these methods are usually limited by poor adaptability and require domain knowledge.

Search-based methods can also be applied to DMH. Since prior conditions such as the number of tasks are not known in advance, these methods seek to decompose the dynamic process into static sub-problems. A genetic algorithm [17] is applied to DMH: when new tasks arrive or the status of machines changes, the algorithm reschedules the plan. Analytic hierarchy process based optimisation [18] combines task sets and then uses mixed attributes for dispatching. NSGA-II is used by Wang et al. [19] to achieve proactive dispatching considering distance and energy consumption at the same time. However, these search-based methods usually suffer from long computation time [17], and thus are not able to cater to the requirements of fast response and adaptability in real-world scenarios.

Reinforcement learning, which performs well in real-time decision-making problems [20], has been applied to optimise real-world DMH problems. A Q−λ algorithm is proposed by Chen et al. [21] to address a multiple-load carrier scheduling problem in a general assembly line. Xue et al. [5] consider a non-stationary case where AGVs' actions may affect each other. Kardos et al. [22] use Q-learning to schedule tasks dynamically with real-time information. Hu et al. [6] improve the mixed rule policy [14] with a deep reinforcement learning algorithm to minimise the makespan and delay ratio; multiple AGVs are scheduled by the algorithm in real time. Govindaiah and Petty [23] apply reinforcement learning with a weighted-sum reward function to material handling. Although the algorithms of the above work perform promisingly, constraints in real-world applications are not considered in their problems.

Constrained reinforcement learning (CRL) follows the principles of the constrained Markov decision process (CMDP), which maximises the long-term expected reward while respecting constraints [24]. It has been applied to solving problems such as robot control [25], [26] and resource allocation [27], [28]; however, to the best of our knowledge, it has never been considered for solving DMH problems.

III. DYNAMIC MATERIAL HANDLING WITH MULTIPLE VEHICLES

This section describes our DMH problem, considering newly arrived tasks and vehicle breakdowns. DMH-GYM, the simulator developed in this work, is also presented.
A. Problem description

In our DMH problem with multiple vehicles, transporting tasks can be raised over time dynamically. These newly arrived tasks, called unassigned tasks, need to be assigned to AGVs with a policy π. The problem is formed as a graph G(L, T), where L and T denote the sets of sites and paths, respectively. Three kinds of sites, namely pickup points, delivery points and parking positions, are located in graph G(L, T). The set of parking positions is denoted as K. The total number of tasks that need to be completed is unknown in advance. All unassigned tasks are stored temporarily in a staging list U. A task u = ⟨s, e, τ, o⟩ ∈ U is determined by its pickup point u.s, delivery point u.e, arrival time u.τ and expiry time u.o, where u.s, u.e ∈ L.

A fleet of AGVs V works in the system to serve tasks. A task can only be picked by one and only one AGV at once. Each AGV v = ⟨vl, rp, pl, ψ⟩ ∈ V is denoted by its velocity v.vl, repairing time v.rp, parking position v.pl and status v.ψ. The location v.pl denotes the initial parking location of AGV v, where v.pl ∈ K. An AGV keeps serving its currently assigned task v.u during the process. If there is no task to be performed, the AGV stays at its current place until a new task is assigned. All tasks finished by an AGV v are recorded in the historical task list HL(v). A starting task u0 is first added to the historical list, where u0 denotes that the AGV v leaves its parking site for the pickup point of the first assigned task u1. The three statuses of an AGV are denoted by the set Ψ = {I, W, B}, representing Idle, Working and Broken, respectively. An AGV can break down at any time. In the broken status, the broken AGV v stops in place and releases its current task back to the task pool. After being repaired, which takes time v.rp, the AGV becomes available to accept tasks again.

While the system is running, a newly arrived task u can be assigned to an available AGV v if and only if v.ψ = I. The decision policy π assigns the task. Makespan and tardiness are selected as the optimising objectives of the problem; in particular, the tardiness is regarded as a constraint. Makespan is the maximal time cost of the AGVs for finishing all the assigned tasks, defined in Eq. (1). Tardiness denotes the delay in the completion of tasks, as formalised in Eq. (2):

Fm(π) = max_{v∈V} FT(u^v_{|HL(v)|}, v),   (1)

Ft(π) = (1/m) Σ_{v∈V} Σ_{i=1}^{|HL(v)|} max(FT(u^v_i, v) − u^v_i.o − u^v_i.τ, 0),   (2)

where u^v_i denotes the i-th task in HL(v), the historical list of tasks completed by v, and v.pl ∈ K. The time cost of waiting to handle material at pickup points and delivery points is ignored for AGVs as it is often constant.
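To make the two objectives concrete, the following is a minimal sketch of how makespan and tardiness could be computed from per-vehicle task histories. The per-task record layout (finish, arrival, expiry) is an assumption for illustration, not the actual DMH-GYM data structure.

```python
# Minimal sketch of Eqs. (1)-(2); the per-task record format (finish, arrival,
# expiry) is an assumption for illustration, not the DMH-GYM implementation.
from typing import Dict, List, Tuple

History = Dict[str, List[Tuple[float, float, float]]]  # AGV id -> completed tasks

def makespan(history: History) -> float:
    """Eq. (1): the latest finish time over all vehicles."""
    return max(finish for tasks in history.values() for finish, _, _ in tasks)

def tardiness(history: History) -> float:
    """Eq. (2): average delay max(FT - o - tau, 0) over all m completed tasks."""
    m = sum(len(tasks) for tasks in history.values())
    total_delay = sum(max(finish - expiry - arrival, 0.0)
                      for tasks in history.values()
                      for finish, arrival, expiry in tasks)
    return total_delay / m if m else 0.0
```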
Notations for our problem are summarised as follows.
| · |: Size of a given set or list.
V: A fleet of AGVs.
n: Number of AGVs, i.e., n = |V|.
m: Number of tasks that need to be accomplished.
L: Set of pickup points, delivery points, and parking lots.
T : Set of paths that connect sites.
K: Parking set, K ⊂ L.
U: Staging list of tasks.
u: Task u ∈ U, denoted as a tuple ⟨s, e, τ, o⟩ that refers to pickup point s, delivery point e, arrival time τ and expiry time o.
u.s: Pickup point of a given task u, u.s ∈ L.
u.e: Delivery point of a given task u, u.e ∈ L.
u.τ : Arrival time of a given task u.
u.o: Expiry time of a given task u.
u0 : Start task that makes an AGV leave its parking lot.
v: AGV v, denoted as a tuple ⟨vl, rp, pl, ψ⟩ that refers to its velocity vl, repair time rp, parking location pl and status ψ.
v.vl: Velocity of a given AGV v.
v.rp: Repairing time of a given AGV v.
v.pl: Parking position of a given AGV v, v.pl ∈ K.
v.ψ: Status of a given AGV v. Three statuses are defined as Idle, Working and Broken, represented by the set Ψ = {I, W, B}.
v.u: Currently assigned task of a given AGV v.
HL(v): List of completed tasks of a given AGV v.
FT(u, v): Finish time of a task u by an AGV v.
π: Decision policy for task assignments.

B. DMH-GYM

To the best of our knowledge, no existing work has studied this problem and there is no open-source simulator of related problems. To study the problem, we develop a simulator that is compatible with OpenAI's gym [7], named DMH-GYM. Diverse instances are also provided that show various performance under different dispatching rules, so as to fully evaluate the given algorithms. The layout of the shop floor is shown in Fig. 2. Stations st1 to st8 are the workstations on the floor, which can be pickup points and delivery points. The warehouse can only be a delivery point. All AGVs depart from the carport, i.e., the initial parking position, to accomplish tasks that are generated randomly. The start site of a task can only be a workstation, while its end site can be a workstation or the warehouse. All AGVs move on preset paths with the same velocity, and they are guided by an A∗ algorithm for path finding while performing tasks. The distance between stations is calculated according to their coordinates and path. For example, the distance between st8 (0,45) and st1 (20,70) in Fig. 2 is 45, as a vehicle should pass the coordinate of the corner point (0,70). Collisions and traffic congestion are not considered in this paper for simplification.

Fig. 2. Layout of DMH-GYM.
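Since DMH-GYM follows the gym interface, an episode can be driven with the usual reset/step loop. The snippet below is only a usage sketch: the fallback random action and the assumption that the final info dictionary exposes the episode objectives are illustrative, not the simulator's documented API.

```python
# Usage sketch of a gym-like episode loop for DMH-GYM; the random fallback
# action and the contents of the info dictionary are assumptions.
def run_episode(env, policy=None):
    state = env.reset()
    done, episode_return, info = False, 0.0, {}
    while not done:
        # an action is a <dispatching rule, AGV> pair; a trained policy would
        # map the state to such a pair, here we sample a random one instead
        action = policy(state) if policy else env.action_space.sample()
        state, reward, done, info = env.step(action)
        episode_return += reward  # gym-compatible scalar reward per step
    return episode_return, info   # info is assumed to carry makespan/tardiness
```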
IV. CONSTRAINED REINFORCEMENT LEARNING

To meet the needs of real-world applications, we extend the DMH problem to a CMDP by considering the tardiness and the constraint of vehicle availability.

A. Modeling DMH as a CMDP

The CMDP is defined as a tuple (S, A, R, C, P, γ), where S is the state space, A is the action space, R represents the reward function, P is the transition probability function and γ is the discount factor. A cumulative constraint C = g(c(s0, a0, s1), . . . , c(st, at, st+1)), which consists of per-step costs c(s, a, s′), can be restricted by a corresponding constraint threshold ϵ; g can be any aggregation function for constraints, such as the average or the sum. J_C^π denotes the expectation of the cumulative constraint. A policy optimised over parameters θ is defined as πθ, which determines the probability π(at|st) of taking an action at at state st at time t. The goal of the problem is to maximise the long-term expected reward with the optimised policy πθ while satisfying the constraints, as formulated below:

max_θ E_{πθ}[ Σ_{t=0}^{∞} γ^t R(st, at, st+1) ]   (3)
s.t.  J_C^{πθ} = E_{τ∼πθ}[C] ≤ ϵ.   (4)

1) State: At each decision time t, we consider the current state of the whole system consisting of tasks and AGVs, denoted as St = ρ(Ut, Vt), where Ut is the set of unassigned tasks and Vt represents the information of all AGVs at time t. We encode the information of the system at decision time t with a feature extraction function ρ(Ut, Vt) as a structured representation. Specifically, a state is mapped into a vector by ρ consisting of the number of unassigned tasks, the tasks' remaining time, waiting time and distance between pickup and delivery points, as well as the statuses of vehicles and the minimal time from the current position to the pickup point and then the delivery point.

2) Action: The goal of dynamic material handling is to execute all tasks, thus a discrete time step refers to one single, independent task assignment. Different from a regular MDP as in games, the number of steps in our problem is usually fixed. In other words, the length of an episode is decided by the number of tasks to be assigned. In our CMDP, making an action means choosing a suitable dispatching rule and a corresponding vehicle. A hybrid action space At = D × Vt is considered for the CMDP, where D and Vt are defined as the dispatching rule space and the AGV space, respectively. Four dispatching rules, namely first come first served (FCFS), shortest travel distance (STD), earliest due date first (EDD) and nearest vehicle first (NVF), form the dispatching rule space D. These dispatching rules are formulated [3] as follows:
• FCFS: arg min_{u∈U} u.τ ;
• EDD: arg min_{u∈U} (u.o − u.τ );
• NVF: arg min_{u∈U} d(v.c, u.s);
• STD: arg min_{u∈U} [d(v.c, u.s) + d(u.s, u.e)],
where v.c is the current position of an AGV v and d(p, p′) determines the distance between p and p′. An action at = ⟨dt, vt⟩ ∈ At at time t determines a single task assignment, in which a task ut is assigned to AGV vt by dispatching rule dt. With the action space At, the policy can decide the rule and the AGV at the same time, which breaks the tie of the multiple-vehicle case.
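In code, each rule reduces to an arg-min selection over the staging list U. The sketch below assumes task objects expose the attributes τ (written tau), o, s and e, that v.c is the vehicle's current position and that d(·,·) is the path distance used by the simulator; these names are illustrative rather than the DMH-GYM API.

```python
# Sketch of the four dispatching rules as arg-min selections over the staging
# list U; attribute names (u.tau, u.o, u.s, u.e, v.c) and d() are assumptions.
def fcfs(U, v, d):   # first come first served: earliest arrival time
    return min(U, key=lambda u: u.tau)

def edd(U, v, d):    # earliest due date first
    return min(U, key=lambda u: u.o - u.tau)

def nvf(U, v, d):    # nearest vehicle first: closest pickup point
    return min(U, key=lambda u: d(v.c, u.s))

def std(U, v, d):    # shortest travel distance: pickup leg plus delivery leg
    return min(U, key=lambda u: d(v.c, u.s) + d(u.s, u.e))

RULES = (fcfs, edd, nvf, std)  # the dispatching rule space D

def apply_action(rule_index, vehicle, U, d):
    """Hybrid action <rule, vehicle>: select a task for the chosen AGV."""
    return RULES[rule_index](U, vehicle, d)
```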

3) Constraints: A cumulative constraint and an instantaneous constraint are both considered in the CMDP. We treat the tardiness as a cumulative constraint J_C, formulated in Eq. (5). The task assignment constraint of vehicle availability is regarded as an instantaneous constraint: only vehicles with status Idle are considered as available, denoted as Eq. (6).

J_C^{πθ} = E_{πθ}[Ft(πθ)] ≤ ϵ,   (5)
v.ψ = I.   (6)

4) Reward function: The negative makespan returned at the last timestep is set as the reward. τ denotes a trajectory (s0, a0, s1, a1, . . . , st, at, st+1, . . . ) sampled from the current policy π. The per-step reward function is defined as follows:

R(st, at, st+1) = −Fm(π) if the episode terminates, and 0 otherwise.   (7)

B. Constraint handling

We combine invalid action masking and reward constrained policy optimisation (RCPO) [26], a Lagrangian relaxation approach, to handle the instantaneous and cumulative constraints at the same time.
1) Feasible action selection via masking: We can determine the available actions according to the current state. For example, in our problem, the dimension of the current action space can be 4, 8, ..., depending on the number of available vehicles at any time step. However, since the dimensions of the input and output of the neural networks used are usually fixed, it is hard to address the variable action space directly.

To handle the instantaneous constraint, we adapt the invalid action masking technique that has been widely applied in games [29], [30]. The action space is compressed into At = D × AVt by the masking, where AVt is the set of available AGVs. Logits corresponding to all invalid actions are replaced with a large negative number, and the final action is then resampled according to the new probability distribution. Masking helps the policy to choose valid actions and avoid dangerous situations such as getting stuck somewhere. Pseudo code is shown in Alg. 1.

Algorithm 1 Invalid action masking.
Input: policy πθ, state s
Output: action a
1: Logits l ← πθ(·|s)
2: Compute l′ by replacing πθ(a′|s) with −∞ for every invalid action a′
3: πθ′(·|s) ← softmax(l′)
4: Sample a ∼ πθ′(·|s)
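In code, Alg. 1 amounts to overwriting the logits of invalid actions with a large negative constant before the softmax. The following is a PyTorch-style sketch, assuming a Boolean validity mask over the hybrid action space is available from the environment; the tensor names are illustrative.

```python
# Sketch of Alg. 1 (invalid action masking) for a categorical policy head.
# The mask is assumed to flag valid <rule, AGV> pairs with True.
import torch

def masked_sample(logits: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    """Replace logits of invalid actions with a large negative number,
    renormalise with softmax and sample only among valid actions."""
    masked_logits = torch.where(valid_mask,
                                logits,
                                torch.full_like(logits, -1e8))
    dist = torch.distributions.Categorical(logits=masked_logits)
    return dist.sample()

# Example: 8 hybrid actions, only the first four are currently feasible.
logits = torch.randn(8)
mask = torch.tensor([True, True, True, True, False, False, False, False])
action = masked_sample(logits, mask)  # always indexes a feasible assignment
```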
It is notable that, to the best of our knowledge, no existing work has applied the masking technique to handle the instantaneous constraint in DMH. The closest work is presented by [6], which applies DQN [31] with a designed reward function. According to [6], the DQN agent selects the action with the maximal Q value and repeats this step until a feasible action is selected, i.e., one that assigns a task to an available vehicle. However, when applying the method of [6] to our problem, the agent keeps being trapped in this step if the selected action is infeasible: even when resampling, the same infeasible action is always selected by this deterministic policy at a given state.

2) Reward constrained policy optimisation with masking: Pseudo code of our proposed method is shown in Alg. 2. We combine RCPO [26] and invalid action masking, named RCPOM, to handle the hybrid constraints at the same time, as illustrated in Fig. 3. RCPO is a Lagrangian relaxation type method based on the actor-critic framework [32], which converts a constrained problem into an unconstrained one by introducing a multiplier λ, see Eq. (8):

min_λ max_θ [R^π − λ(J_C^π − ϵ)],   (8)

where λ > 0. The reward function is reshaped with the Lagrangian multiplier λ as follows:

R̂(s, a, s′, λ) = R(s, a, s′) − λc(s, a, s′).   (9)

On a slower time scale than the update of the actor and critic, the multiplier is updated with the collected constraint values according to Eq. (10):

λ ← max{λ + η(J_C^π − ϵ), 0},   (10)

where η is the learning rate of the multiplier.

Fig. 3. Reward constrained policy optimisation with masking.

Algorithm 2 RCPO with masking (RCPOM).
Input: Epochs N, Lagrange multiplier λ, learning rate of the Lagrange multiplier η, temperature parameter α
Output: Policy πθ
1: Initialise an experience replay buffer BR
2: Initialise policy πθ and two critics Q̂ψ1 and Q̂ψ2
3: for n = 1 to N do
4:   for t = 0, 1, . . . do
5:     a′t ← Masking(πθ, st)  ▷ Alg. 1
6:     Get st+1, rt, ct by executing a′t
7:     Store ⟨st, a′t, st+1, rt, ct⟩ into BR
8:     Randomly sample a minibatch B of transitions T = ⟨s, a, s′, r, c⟩ from BR
9:     Compute y ← r − λc + γ(min_{j=1,2} Q̂ψj(s′, ã′) − α log πθ(ã′|s′)), where ã′ ∼ πθ(·|s′)
10:    Update the critics with ∇ψj (1/|B|) Σ_{T∈B} (y − Q̂ψj(s, a))² for j = 1, 2
11:    Update the actor with ∇θ (1/|B|) Σ_{T∈B} (min_{j=1,2} Q̂ψj(s, ãθ) − α log πθ(ãθ|s)), where ãθ is sampled from πθ(·|s) via the reparametrisation trick
12:    Apply soft updates on the target networks
13:  end for
14:  λ ← max(λ + η(J_C^π − ϵ), 0)  ▷ Eq. (10)
15: end for

Specifically, we implement RCPOM with soft actor-critic (SAC) because of its remarkable performance [20]. SAC is a maximum entropy reinforcement learning algorithm that maximises the cumulative reward as well as the expected entropy of the policy. The objective function is adapted as follows:

J(πθ) = E_{πθ}[ Σ_{t=0}^{∞} γ^t (R(st, at, st+1) + αH(π(·|st))) ],   (11)

where the temperature α decides how stochastic the policy is, which encourages exploration.

3) Invariant reward shaping: The raw reward function is reshaped with a linear transformation for more positive feedback and a better estimation of the value function, see Eq. (12):

R̃(s, a, s′, λ) = βR(s, a, s′) + b,   (12)

where β > 0 and b is a positive constant. It is easy to guarantee policy invariance under this reshaping [33], as declared in Remark 1.

Remark 1: Given a CMDP (S, A, R, C, P, γ), the optimal policy remains invariant under the linear transformation in Eq. (12), where β > 0 and b ∈ R:

∀s ∈ S, V∗(s) = arg max_π V^π(s) = arg max_π [βV^π(s) + b/(1 − γ)].

V. EXPERIMENTS

To verify our proposed method, numerous experiments are conducted and discussed in this section.
TABLE I
TIME CONSUMED FOR MAKING A DECISION, AVERAGED OVER 2000 TRIALS.

RCPOM RCPOM-NS IPO L-SAC L-PPO SAC PPO FCFS EDD NVF STD Random Hu et al. [6]
Time (ms) 2.143 2.112 2.038 2.74 2.514 2.79 2.626 0.0169 0.0199 0.0239 0.0354 0.0259 Timeout

The proposed method, denoted as "RCPOM", is compared with five groups of methods: (i) an advanced constrained reinforcement learning agent, Interior-point Policy Optimisation (IPO) [34]; (ii) state-of-the-art reinforcement learning agents, including proximal policy optimisation (PPO) [35] and soft actor-critic (SAC) [20]; (iii) SAC and PPO with a fixed Lagrangian multiplier, named "L-SAC" and "L-PPO"; (iv) commonly used dispatching rules, including FCFS, STD, EDD and NVF, as presented in Section IV-A2; and (v), to validate the invariant reward shaping, an ablation in which the RCPOM agent without the invariant reward shaping, denoted as "RCPOM-NS", is compared. A random agent is also compared.

The simulator DMH-GYM and 16 problem instances are used in our experiments, where instance01-08 are training instances and instance09-16 are unseen during training. Trials on the training instances and unseen instances verify the effectiveness and adaptability of our proposed method.

A. Settings

We apply the invalid action masking technique to all the learning agents to ensure safety when assigning tasks, since the instantaneous constraints should be satisfied at every time step. All learning agents except "RCPOM-NS" are equipped with the reward shaping for a fair comparison with the proposed method. To ensure instantaneous constraint satisfaction and to explore more possible plans, we adapt the dispatching rules: they consider the current feasible task assignments and shuffle these possible assignments when multiple vehicles are available.

All learning agents are adapted based on the implementation of the Tianshou framework [36] (https://github.com/thu-ml/tianshou). The network structure is formed by two hidden fully connected layers of size 128 × 128. α is 0.1 and γ is 0.97. The initial multiplier λ and its learning rate are set to 0.001 and 0.0001, respectively. The constraint threshold ϵ is set to 50. The reward shaping parameters β and b are set to 1 and 2000 in terms of dispatching rules, respectively. All learning agents are trained for 5e5 steps on an Intel Xeon Gold 6240 CPU and four TITAN RTX GPUs. The best policies during training are selected. All dispatching policies and learning agents are tested 30 times on the instances independently. Parameter values are either set following previous studies [26], [36] or arbitrarily chosen.
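For convenience, the hyperparameters reported above can be gathered in one place. The dictionary below only mirrors the stated values; it is a sketch, not a configuration file shipped with DMH-GYM.

```python
# Convenience sketch collecting the hyperparameters reported in this section.
RCPOM_CONFIG = {
    "hidden_layers": (128, 128),  # two fully connected hidden layers
    "alpha": 0.1,                 # SAC temperature
    "gamma": 0.97,                # discount factor
    "lambda_init": 0.001,         # initial Lagrange multiplier
    "lambda_lr": 0.0001,          # learning rate eta of the multiplier
    "epsilon": 50,                # tardiness constraint threshold
    "beta": 1,                    # reward shaping scale
    "b": 2000,                    # reward shaping offset
    "train_steps": 500_000,       # 5e5 training steps
    "test_runs": 30,              # independent test repetitions
}
```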
B. Experimental results

Tab. II and Tab. III present the results on the training and unseen instances, respectively. The average time consumption per task assignment is reported in Tab. I.

It is obvious that the random agent performs the worst. In terms of dispatching rules, EDD shows its ability to optimise the time-based metric, namely tardiness. In all instances, the tardiness of EDD is almost always under the tardiness constraint threshold. For Instance01-Instance03, EDD gets the lowest tardiness, 33.9, 29.6 and 26.5 respectively, compared with the other policies. FCFS is also a time-related rule that always assigns the tasks that arrive first. It only performs better than the other three rules in Instance07, for its low makespan and tardiness. The two distance-related rules, NVF and STD, which optimise the travel distance, can achieve good results on makespan. For example, STD has the smallest makespan, 1883.5, in Instance08 among all the policies. However, both rules fail in terms of tardiness, as shown in Tab. II by their relatively high constraint values.

Learning agents usually show better performance on makespan than dispatching rules. In Instance01, the SAC agent gets the best makespan of 1798.4. The proposed RCPOM performs the best among all the policies in terms of makespan on most of the training instances except Instance01, Instance03 and Instance08. Although STD achieves 1883.5 on Instance08, RCPOM still gets a very close makespan value of 1898.0. On tardiness, constrained learning agents show lower values than the others in most instances. On Instance01, Instance04 and Instance08, the SAC agent gets the lowest tardiness. However, it is notable that we care more about constraint satisfaction than about minimising tardiness. Even though SAC gets the lowest tardiness value of 28.9 in Instance04, we consider RCPOM as the best agent since it gets the lowest makespan value of 1956.9, while its tardiness is under the constraint threshold, i.e., 40.5 < 50. A similar observation holds on unseen instances, as demonstrated in Tab. III. Learning agents still perform better on unseen instances compared with dispatching rules, except STD, which gets the best result on Instance16 with constraint satisfaction. RCPOM, the proposed method, achieves the best average makespan and is statistically better than almost all other policies on Instance10-14.

C. Discussion

1) Mediocre dispatching rules: Dispatching rules perform promisingly on the 16 instances, as shown in Tab. II and Tab. III. EDD usually has the lowest tardiness, and STD even gets the lowest makespan on Instance08 and Instance16. FCFS and EDD are time-related dispatching rules. FCFS always chooses tasks according to their arrival time, i.e., it assigns the task that arrives first. EDD selects the task that has the earliest due date. In contrast to time-related rules, NVF and STD are distance-related rules. They both optimise objectives that strongly relate to distance, such as makespan.
TABLE II
AVERAGE MAKESPAN AND TARDINESS OVER 30 INDEPENDENT TRIALS ON THE TRAINING SET (INSTANCE01-08). THE BOLD NUMBER INDICATES THE BEST MAKESPAN AND TARDINESS. "+/≈/-" INDICATES THE AGENT PERFORMS STATISTICALLY BETTER/SIMILAR/WORSE THAN THE RCPOM AGENT. "M/C (P)" INDICATES THE AVERAGE NORMALISED MAKESPAN, TARDINESS AND PERCENTAGE OF CONSTRAINT SATISFACTION. THE NUMBER OF POLICIES THAT RCPOM OUTPERFORMS IN TERMS OF MAKESPAN AND TARDINESS ON EACH INSTANCE IS GIVEN IN THE BOTTOM ROW.

Algorithm Instance01 Instance02 Instance03 Instance04 Instance05 Instance06 Instance07 Instance08 M/C (P)
(Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft)
RCPOM 1840.0/56.7 1908.4/48.2 1914.8/45.0 1956.9/40.5 1852.5/21.3 1911.1/41.0 1927.0/8.4 1898.0/21.0 0.97/0.89 (75%)
RCPOM-NS 1891.9-/58.8≈ 1979.5-/51.6≈ 1931.1≈/46.9≈ 2003.9-/53.0- 2028.8-/58.1- 2028.9-/88.8- 1951.4-/15.2- 1951.0-/41.8- 0.68/0.70 (56%)
IPO 1860.5-/48.6+ 1933.5≈/50.4≈ 1924.9≈/47.6≈ 1979.1-/49.6≈ 1951.1-/47.0- 1951.5-/71.4- 1977.0-/14.8- 1943.0-/34.9- 0.80/0.77 (61%)
L-SAC 1884.8-/60.3≈ 1987.4-/57.2- 1933.7≈/44.0≈ 1973.5≈/42.2≈ 1955.4-/44.7- 1992.9-/68.5- 1983.9-/19.8- 1923.6-/32.8- 0.74/0.76 (58%)
L-PPO 1872.1-/59.5≈ 1963.5-/52.3≈ 1910.3≈/49.7≈ 1984.0-/49.7- 1954.0-/55.5- 1968.6-/75.6- 1979.7-/18.9- 1899.2≈/32.6- 0.80/0.73 (53%)
SAC 1798.4+/49.1+ 1950.0-/68.9- 1965.0-/53.4- 2002.0-/28.9+ 1900.0-/26.6- 2011.9-/71.3- 1927.0≈/12.6- 1914.1-/15.7+ 0.82/0.83 (62%)
PPO 1858.0≈/52.3+ 1941.8≈/49.9≈ 1918.2≈/44.3≈ 1995.3-/55.6- 1937.1-/47.6- 1961.0-/82.0- 1988.6-/14.6- 1920.7≈/35.6- 0.79/0.75 (61%)
FCFS 2081.1-/90.2- 2084.5-/82.3- 2014.7-/64.2- 2136.7-/105.0- 2194.6-/122.5- 2123.5-/85.2- 1927.9≈/11.2- 1933.6-/27.4≈ 0.31/0.48 (37%)
EDD 1903.2≈/33.9+ 1968.6-/29.6+ 1977.8-/26.5+ 1988.1-/32.8+ 1950.7-/33.7- 2016.8-/46.8- 1940.5-/12.1- 2020.8-/45.3≈ 0.66/0.91 (86%)
NVF 1876.5≈/69.6≈ 1958.4-/56.4≈ 1946.7-/49.6≈ 2040.6-/66.7- 1933.9-/56.3- 1953.4-/78.5- 1996.5-/21.8- 1944.6-/35.5- 0.71/0.67 (51%)
STD 1868.8≈/66.6≈ 1961.7-/49.2≈ 1921.3≈/51.7- 1970.7-/35.6+ 1917.1-/41.4- 1955.4-/62.7- 1983.7-/30.4- 1883.5+/35.1- 0.83/0.75 (53%)
Random 2098.7-/124.5- 2113.8-/103.1- 2091.2-/143.7- 2135.1-/123.1- 2149.3-/119.0- 2159.1-/129.3- 2083.0-/70.2- 2067.8-/89.5- 0.02/0.00 (12%)
6/2 9/4 5/4 10/6 11/11 11/11 9/11 8/8

TABLE III
AVERAGE MAKESPAN AND TARDINESS OVER 30 INDEPENDENT TRIALS ON THE TEST SET (INSTANCE09-16). THE BOLD NUMBER INDICATES THE BEST MAKESPAN AND TARDINESS. "+/≈/-" INDICATES THE AGENT PERFORMS STATISTICALLY BETTER/SIMILAR/WORSE THAN THE RCPOM AGENT. "M/C (P)" INDICATES THE AVERAGE NORMALISED MAKESPAN, TARDINESS AND PERCENTAGE OF CONSTRAINT SATISFACTION. THE NUMBER OF POLICIES THAT RCPOM OUTPERFORMS IN TERMS OF MAKESPAN AND TARDINESS ON EACH INSTANCE IS GIVEN IN THE BOTTOM ROW.

Algorithm Instance09 Instance10 Instance11 Instance12 Instance13 Instance14 Instance15 Instance16 M/C (P)
(Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft) (Fm/Ft)
RCPOM 1840.0/57.1 1917.6/49.4 1900.3/45.8 1954.3/38.8 1862.7/18.4 1914.1/41.8 1960.8/19.6 1896.4/21.3 0.95/0.90 (74%)
RCPOM-NS 1904.6-/60.2≈ 1985.4-/52.0≈ 1931.6-/47.3≈ 1995.6-/51.0- 2009.1-/55.0- 2034.9-/92.6- 1956.6≈/16.2+ 1941.0≈/40.3- 0.68/0.72 (55%)
IPO 1860.5-/48.6+ 1933.5≈/50.4≈ 1924.9-/47.6≈ 1979.1-/49.6- 1951.1-/47.0- 1951.5-/71.4- 1977.0-/14.8+ 1943.0-/34.9- 0.80/0.80 (61%)
L-SAC 1892.3-/62.0≈ 1983.6-/56.9≈ 1933.1-/43.4+ 1974.5-/43.0≈ 1954.4-/41.0- 1999.4-/70.9- 1981.5-/18.6+ 1930.3-/34.8- 0.73/0.78 (58%)
L-PPO 1877.5-/60.8≈ 1961.3-/52.6≈ 1905.8≈/49.5≈ 1980.9-/49.8- 1960.2-/50.1- 1965.5-/78.5- 1975.1-/18.2+ 1901.5≈/33.1- 0.81/0.76 (57%)
SAC 1801.3+/49.3+ 1950.0-/69.1- 1956.0-/56.8- 2000.0-/28.6+ 1904.0-/27.3- 2015.1-/71.1- 1930.0+/12.6+ 1987.0-/38.2- 0.76/0.81 (62%)
PPO 1863.1-/53.5+ 1941.8≈/50.1≈ 1913.2≈/44.1≈ 1986.3-/55.0- 1927.4-/41.4- 1965.5-/85.9- 1990.1-/15.0+ 1908.2≈/35.4- 0.81/0.77 (62%)
FCFS 2089.3-/92.5- 2045.4-/75.7- 1996.9-/67.6- 2107.1-/95.2- 2191.9-/121.7- 2130.1-/89.9- 1934.1≈/11.1+ 1946.8-/35.8- 0.32/0.48 (35%)
EDD 1996.9-/107.7- 1976.3-/33.4+ 1978.7-/22.0+ 1997.6-/36.9≈ 1962.1-/37.0- 1993.5-/44.6- 1934.8≈/11.0+ 2052.3-/69.4- 0.59/0.79 (73%)
NVF 1847.5≈/57.2- 1958.8-/54.3≈ 1926.6≈/51.8≈ 2021.7-/65.9- 1939.2-/41.2- 1975.5-/89.8- 2003.5-/19.3+ 1933.8-/40.9- 0.73/0.71 (58%)
STD 1894.1-/72.5- 1958.2≈/49.7≈ 1923.1≈/52.4- 1985.2-/38.1≈ 1905.1-/31.4- 1974.3-/71.9- 1990.3≈/33.0- 1885.2+/38.6- 0.80/0.76 (55%)
Random 2132.7-/135.3- 2100.9-/99.5- 2076.2-/144.4- 2097.5-/106.3- 2144.1-/102.6- 2142.9-/123.6- 2101.3-/81.1- 2055.1-/94.1- 0.03/0.02 (11%)
9/5 8/3 7/4 11/7 11/11 11/11 6/2 7/11

The difference is that NVF selects the task with the nearest pickup point, while STD selects the task with the shortest distance from the current position to the pickup point and then to the delivery point.

Dispatching rules use simple manual mechanisms to schedule tasks and achieve a promising performance, usually better than the Random agent. However, such mechanisms may not be able to handle more complex scenarios, and it is hard to improve a dispatching rule due to its poor adaptability. From Fig. 4, it is clear that learning agents usually perform better than the dispatching rules, with lower averaged makespan on the instances. Dispatching rules keep using the same mechanism throughout the long sequential decision process. It makes sense that they show limited performance, since one rule usually works only in certain specific situations, whereas instantaneous modifications have to be made when unexpected events occur. For example, EDD and FCFS consider time-related objectives such as the tardiness that can be easily determined by Eq. (2). Both rules optimise the objective only partially. When the instance has a large concession space for delayed tasks, EDD and FCFS may hardly work. In the problem considered in this paper, two objectives (i.e., distance and time) are involved. Although tardiness is considered as a constraint, it is hard to identify the correlation between makespan and tardiness. It is shown on Instance08 that NVF and STD are better in both makespan and tardiness. But it can also be seen that in Instance01 and Instance03, EDD has lower tardiness while its makespan is worse than NVF and STD. This observation implies that the ability of simple dispatching rules is limited. Our hybrid action space is motivated by this phenomenon and is expected to provide more optimisation possibilities and adaptability.
2) Constraint handling helps: A vehicle should be available when assigning a task, which is critical for safe manufacturing. Previous work [6] resamples an action in case of constraint violations, which does not work out in our scenario. Tab. I shows the time cost for determining task assignments. The agents using the invalid action masking technique take much shorter decision-making time, in contrast to the timeout of [6]. The application of the invalid action masking technique guarantees instantaneous safety: sampling once is enough to obtain a valid action. Another benefit of masking is the compression of the action space. The agent can focus more on the choice of valid actions and further improve the exploration efficiency.

Even though the SAC and PPO agents show remarkable performance in terms of makespan, they fail on tardiness. The SAC agent gets a high tardiness value on the instances indexed 2, 3, 6, 10, 11 and 14. This is attributed to the single-attribute reward function, which only relies on the makespan and does not consider tardiness. A straightforward way to handle the cumulative constraint is to augment the objective function. L-SAC and L-PPO reshape the reward function with a fixed Lagrangian multiplier, see Eq. (9). However, it is hard to decide the multiplier. From Tab. II, L-SAC and L-PPO do not outperform SAC and PPO much, even though L-PPO has the lowest makespan on Instance03. RCPOM provides a more flexible way to restrict the behaviour of the policy with a suitable multiplier that is updated during training. It is demonstrated in Tab. II, Tab. III and Fig. 4 that RCPOM outperforms the other learning agents on average. RCPOM has a slight gap to EDD on tardiness but outperforms it by a large margin on makespan. Although IPO is also a CRL algorithm, its assumption that the policy should satisfy the constraints upon initialisation [37] limits its performance when solving the DMH problem.
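The adaptive multiplier behind this flexibility is a projected gradient step on λ (Eq. (10)) applied on a slower time scale, combined with the reshaped reward of Eq. (9). The following is a minimal sketch; estimating J_C by averaging episode tardiness values is an assumption for illustration.

```python
# Sketch of the adaptive Lagrangian update (Eqs. (9)-(10)); estimating J_C from
# a batch of episode tardiness values is an assumption for illustration.
def shaped_reward(r, c, lam):
    """Eq. (9): penalise the per-step cost c with the current multiplier."""
    return r - lam * c

def update_multiplier(lam, episode_tardiness, epsilon=50.0, eta=1e-4):
    """Eq. (10): push lambda up while the constraint is violated, down otherwise,
    projected back onto lambda >= 0."""
    j_c = sum(episode_tardiness) / len(episode_tardiness)  # empirical J_C
    return max(lam + eta * (j_c - epsilon), 0.0)

# Example: average tardiness 61.7 > epsilon = 50, so lambda grows slightly.
lam = update_multiplier(0.001, [60.0, 70.0, 55.0])
```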
3) Promising performance on unseen instances: To further validate the performance of the proposed method, we also test it on unseen instances, Instance09 to Instance16. These unseen instances are generated by mutating the training instances. Our method still outperforms the others on the unseen instances, according to Tab. III. Such stable performance increases the possibility of applying our method to real-world problems with dynamic and complex situations.

4) Invariant reward shaping improves: The raw reward of the process is the negative makespan, which is only given at the end, as denoted in Eq. (7). It is challenging for an RL agent to learn from such a sparse reward function with only negative values. The consequence of the lack of positive feedback is that agents may be stagnant and conservative. In almost all instances, the RCPOM agent with reward shaping performs better than the one without it, as shown in Tab. II and Tab. III. Although the relative reward values are modified by the reward shaping, it is proved that the optimal policy remains invariant. Invariant reward shaping helps agents explore more and make the most of the positive feedback.

Fig. 4. Performance averaged over 16 instances.

VI. CONCLUSION

This paper studies the dynamic material handling problem, in which newly arrived tasks and vehicle breakdowns are considered as dynamic events. Due to the lack of free simulators and problem instances, we develop a gym-like simulator, namely DMH-GYM, and provide a dataset of diverse instances. Considering the constraints of tardiness and vehicle availability, we formulate the problem as a constrained Markov decision process. A constrained reinforcement learning algorithm that combines Lagrangian relaxation and invalid action masking, named RCPOM, is proposed to meet the hybrid cumulative and instantaneous constraints. We validate the outstanding performance of our approach on both training and unseen instances. The experimental results show that RCPOM statistically outperforms state-of-the-art reinforcement learning agents, constrained reinforcement learning agents, and several commonly used dispatching rules. It is also validated that the invariant reward shaping helps. The effectiveness of RCPOM provides a new perspective for dealing with real-world scheduling problems with constraints: instead of constructing a complicated reward function manually, it is possible to restrict the policy directly with constrained reinforcement learning methods.

As future work, we are interested in extending the problem by introducing more dynamic events and constraints that widely exist in real-world scenarios.

REFERENCES
[1] V. Kaplanoğlu, C. Şahin, A. Baykasoğlu, R. Erol, A. Ekinci, and M. Demirtaş, “A multi-agent based approach to dynamic scheduling of machines and automated guided vehicles (AGV) in manufacturing systems by considering AGV breakdowns,” International Journal of Engineering Research & Innovation, vol. 7, no. 2, pp. 32–38, 2015.
[2] T. Raghu and C. Rajendran, “An efficient dynamic dispatching rule for scheduling in a job shop,” International Journal of Production Economics, vol. 32, no. 3, pp. 301–313, 1993.
[3] I. Sabuncuoglu, “A study of scheduling rules of flexible manufacturing systems: A simulation approach,” International Journal of Production Research, vol. 36, no. 2, pp. 527–546, 1998.
[4] W. Xu, S. Guo, X. Li, C. Guo, R. Wu, and Z. Peng, “A dynamic scheduling method for logistics tasks oriented to intelligent manufacturing workshop,” Mathematical Problems in Engineering, vol. 2019, pp. 1–18, 2019.
[5] T. Xue, P. Zeng, and H. Yu, “A reinforcement learning method for multi-AGV scheduling in manufacturing,” in 2018 IEEE International Conference on Industrial Technology. IEEE, 2018, pp. 1557–1561.
[6] H. Hu, X. Jia, Q. He, S. Fu, and K. Liu, “Deep reinforcement learning based AGVs real-time scheduling with mixed rule for flexible shop floor in industry 4.0,” Computers & Industrial Engineering, vol. 149, p. 106749, 2020.
[7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.
[8] L. Qiu, W.-J. Hsu, S.-Y. Huang, and H. Wang, “Scheduling and routing algorithms for AGVs: A survey,” International Journal of Production Research, vol. 40, no. 3, pp. 745–760, 2002.
[9] L. Qiu, J. Wang, W. Chen, and H. Wang, “Heterogeneous AGV routing problem considering energy consumption,” in 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2015, pp. 1894–1899.
[10] N. Singh, Q.-V. Dang, A. Akcay, I. Adan, and T. Martagan, “A matheuristic for AGV scheduling with battery constraints,” European Journal of Operational Research, vol. 298, no. 3, pp. 855–873, 2022.
[11] Q.-V. Dang, N. Singh, I. Adan, T. Martagan, and D. van de Sande, “Scheduling heterogeneous multi-load AGVs with battery constraints,” Computers & Operations Research, vol. 136, p. 105517, 2021.
[12] D. Ouelhadj and S. Petrovic, “A survey of dynamic scheduling in manufacturing systems,” Journal of Scheduling, vol. 12, no. 4, pp. 417–431, 2009.
[13] J. H. Blackstone, D. T. Phillips, and G. L. Hogg, “A state-of-the-art survey of dispatching rules for manufacturing job shop operations,” The International Journal of Production Research, vol. 20, no. 1, pp. 27–45, 1982.
[14] C. Chen, L.-f. Xi, B.-h. Zhou, and S.-s. Zhou, “A multiple-criteria real-time scheduling approach for multiple-load carriers subject to LIFO loading constraints,” International Journal of Production Research, vol. 49, no. 16, pp. 4787–4806, 2011.
[15] C. Sahin, M. Demirtas, R. Erol, A. Baykasoğlu, and V. Kaplanoğlu, “A multi-agent based approach to dynamic scheduling with flexible processing capabilities,” Journal of Intelligent Manufacturing, vol. 28, no. 8, pp. 1827–1845, 2017.
[16] S. Liu, P. H. Tan, E. Kurniawan, P. Zhang, and S. Sun, “Dynamic scheduling for pickup and delivery with time windows,” in 2018 IEEE 4th World Forum on Internet of Things. IEEE, 2018, pp. 767–770.
[17] G. Chryssolouris and V. Subramaniam, “Dynamic scheduling of manufacturing job shops using genetic algorithms,” Journal of Intelligent Manufacturing, vol. 12, no. 3, pp. 281–293, 2001.
[18] Y. Zhang, G. Zhang, W. Du, J. Wang, E. Ali, and S. Sun, “An optimization method for shopfloor material handling based on real-time and multi-source manufacturing data,” International Journal of Production Economics, vol. 165, pp. 282–292, 2015.
[19] W. Wang, Y. Zhang, and R. Y. Zhong, “A proactive material handling method for CPS enabled shop-floor,” Robotics and Computer-Integrated Manufacturing, vol. 61, p. 101849, 2020.
[20] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[21] C. Chen, B. Xia, B.-h. Zhou, and L. Xi, “A reinforcement learning based approach for a multiple-load carrier scheduling problem,” Journal of Intelligent Manufacturing, vol. 26, no. 6, pp. 1233–1245, 2015.
[22] C. Kardos, C. Laflamme, V. Gallina, and W. Sihn, “Dynamic scheduling in a job-shop production system with reinforcement learning,” Procedia CIRP, vol. 97, pp. 104–109, 2021.
[23] S. Govindaiah and M. D. Petty, “Applying reinforcement learning to plan manufacturing material handling,” Discover Artificial Intelligence, vol. 1, no. 1, pp. 1–33, 2021.
[24] E. Altman, Constrained Markov Decision Processes: Stochastic Modeling. Routledge, 1999.
[25] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in International Conference on Machine Learning. PMLR, 2017, pp. 22–31.
[26] C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=SkfrvsA9FX
[27] A. Bhatia, P. Varakantham, and A. Kumar, “Resource constrained deep reinforcement learning,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 29, 2019, pp. 610–620.
[28] Y. Liu, J. Ding, and X. Liu, “A constrained reinforcement learning based approach for network slicing,” in 2020 IEEE 28th International Conference on Network Protocols (ICNP). IEEE, 2020, pp. 1–6.
[29] D. Ye, Z. Liu, M. Sun, B. Shi, P. Zhao, H. Wu, H. Yu, S. Yang, X. Wu, Q. Guo, Q. Chen, Y. Yin, H. Zhang, T. Shi, L. Wang, Q. Fu, W. Yang, and L. Huang, “Mastering complex control in MOBA games with deep reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 6672–6679.
[30] S. Huang and S. Ontañón, “A closer look at invalid action masking in policy gradient algorithms,” in Proceedings of the Thirty-Fifth International Florida Artificial Intelligence Research Society Conference, vol. 35, 2022, pp. 1–6.
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[32] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning. PMLR, 2016, pp. 1928–1937.
[33] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[34] Y. Liu, J. Ding, and X. Liu, “IPO: Interior-point policy optimization under constraints,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4940–4947.
[35] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[36] J. Weng, H. Chen, D. Yan, K. You, A. Duburcq, M. Zhang, Y. Su, H. Su, and J. Zhu, “Tianshou: A highly modularized deep reinforcement learning library,” Journal of Machine Learning Research, vol. 23, no. 267, pp. 1–6, 2022. [Online]. Available: http://jmlr.org/papers/v23/21-1127.html
[37] Y. Liu, A. Halev, and X. Liu, “Policy learning with constraints in model-free reinforcement learning: A survey,” in the International Joint Conference on Artificial Intelligence, 2021, pp. 4508–4515.
