Constrained Reinforcement Learning For Dynamic Material Handling
Fig. 2. Layout of DMH-GYM.

γ is the discount factor. A cumulative constraint C = g(c(s_0, a_0, s_1), ..., c(s_t, a_t, s_{t+1})), which consists of per-step costs c(s, a, s'), can be restricted by a corresponding constraint threshold ϵ. g can be any aggregation function for constraints, such as the average or the sum. J_C^π denotes the expectation of the cumulative constraint. A policy optimised by parameters θ is defined as π_θ, which determines the probability π(a_t|s_t) of taking action a_t at state s_t at time t. The goal of the problem is to maximise the long-term expected reward with the optimised policy π_θ while satisfying the constraints, as formulated below:

$$\max_{\theta} \; \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\Big] \quad (3)$$
$$\text{s.t.} \quad J_C^{\pi_\theta} = \mathbb{E}_{\tau \sim \pi_\theta}[C] \le \epsilon. \quad (4)$$
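As a concrete illustration of how such a cumulative constraint could be evaluated from sampled episodes (a minimal sketch with hypothetical helper names, assuming g is the sum or the average of per-step costs; not code from the paper):

```python
import numpy as np

def cumulative_constraint(costs, g="sum"):
    """Aggregate per-step costs c(s_t, a_t, s_{t+1}) of one episode into C = g(c_0, ..., c_T)."""
    costs = np.asarray(costs, dtype=float)
    return costs.sum() if g == "sum" else costs.mean()

def estimate_J_C(episode_costs, g="sum"):
    """Monte-Carlo estimate of J_C = E[C] over several sampled episodes."""
    return float(np.mean([cumulative_constraint(c, g) for c in episode_costs]))

# Example: three sampled episodes of per-step costs, checked against a threshold of 50.
episodes = [[0.0, 3.2, 10.5], [1.1, 0.0, 0.0, 7.4], [0.0, 25.0]]
J_C = estimate_J_C(episodes, g="sum")
print(J_C, "feasible" if J_C <= 50 else "violated")
```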
1) State: At each decision time t, we consider the current state of the whole system, consisting of tasks and AGVs, denoted as S_t = ρ(U_t, V_t), where U_t is the set of unassigned tasks and V_t represents the information of all AGVs at time t. We encode the information of the system at decision time t with a feature extraction function ρ(U_t, V_t) as a structured representation. Specifically, a state is mapped into a vector by ρ, consisting of the number of unassigned tasks, the tasks' remaining time, waiting time and distance between pickup and delivery points, as well as the statuses of the vehicles and the minimal time from the current position to the pickup point and then the delivery point.
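Purely for illustration, a minimal sketch of such a feature extraction function, assuming hypothetical Task and AGV structures with the attributes listed above (the actual DMH-GYM encoding may differ):

```python
from dataclasses import dataclass

@dataclass
class Task:
    remaining_time: float       # time left before the task's due date
    waiting_time: float         # time elapsed since the task arrived
    pickup_to_delivery: float   # distance between pickup and delivery points

@dataclass
class AGV:
    status: int                 # e.g. 0 = Idle, 1 = busy
    min_travel_time: float      # minimal time: current position -> pickup -> delivery

def rho(unassigned_tasks, agvs):
    """Map the system state (U_t, V_t) to a flat feature vector."""
    features = [float(len(unassigned_tasks))]
    for u in unassigned_tasks:
        features += [u.remaining_time, u.waiting_time, u.pickup_to_delivery]
    for v in agvs:
        features += [float(v.status), v.min_travel_time]
    # In practice the vector would be padded or truncated to a fixed length,
    # since the policy network's input dimension is fixed.
    return features
```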
2) Action: The goal of dynamic material handling is to execute all tasks, thus a discrete time step refers to one single, independent task assignment. Different from regular MDPs such as games, the number of steps in our problem is usually fixed. In other words, the length of an episode is decided by the number of tasks to be assigned. In our CMDP, making an action means choosing a suitable dispatching rule and a corresponding vehicle. A hybrid action space A_t = D × V_t is considered for the CMDP, where D and V_t are defined as the dispatching rule space and the AGV space, respectively. Four dispatching rules, namely first come first served (FCFS), shortest travel distance (STD), earliest due date first (EDD) and nearest vehicle first (NVF), form the dispatching rule space D. Those dispatching rules are formulated [3] as follows:

• FCFS: $\arg\min_{u \in U} u.\tau$;
• EDD: $\arg\min_{u \in U} (u.o - u.\tau)$;
• NVF: $\arg\min_{u \in U} d(v.c, u.s)$;
• STD: $\arg\min_{u \in U} \big(d(v.c, u.s) + d(u.s, u.e)\big)$,

where v.c is the current position of an AGV v and d(p, p') determines the distance between p and p'. An action a_t = ⟨d_t, v_t⟩ ∈ A_t at time t determines a single task assignment: task u_t is assigned to AGV v_t by dispatching rule d_t. With the action space A_t, the policy can decide the rule and the AGV at the same time, which breaks the tie in the multiple-vehicle case.
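Purely as an illustration (hypothetical field names, not the DMH-GYM API), a sketch of the four rules and of decoding a hybrid action ⟨d_t, v_t⟩ into a concrete task assignment:

```python
# Hypothetical task fields: arrival (u.tau), due (u.o), pickup (u.s), delivery (u.e).
# Hypothetical AGV field: pos (v.c). dist() stands in for the distance function d(., .).
def dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])  # e.g. Manhattan distance on the layout

RULES = {
    "FCFS": lambda u, v: u["arrival"],
    "EDD":  lambda u, v: u["due"] - u["arrival"],
    "NVF":  lambda u, v: dist(v["pos"], u["pickup"]),
    "STD":  lambda u, v: dist(v["pos"], u["pickup"]) + dist(u["pickup"], u["delivery"]),
}

def assign(rule_name, vehicle, unassigned_tasks):
    """Decode a hybrid action <d_t, v_t>: rule d_t picks the task for vehicle v_t."""
    score = RULES[rule_name]
    return min(unassigned_tasks, key=lambda u: score(u, vehicle))
```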
3) Constraints: A cumulative constraint and an instantaneous constraint are both considered in the CMDP. We treat the tardiness as a cumulative constraint J_C, formulated in Eq. (5). The task assignment constraint of vehicle availability is regarded as an instantaneous constraint: only vehicles with status Idle are considered as available, denoted as Eq. (6).

$$J_C^{\pi_\theta} = \mathbb{E}_{\pi_\theta}\big(F_t(\pi_\theta)\big) \le \epsilon, \quad (5)$$
$$v.\psi = I. \quad (6)$$

4) Reward function: The negative makespan returned at the last timestep is set as the reward. τ denotes a trajectory (s_0, a_0, s_1, a_1, ..., s_t, a_t, s_{t+1}, ...) sampled from the current policy π. The per-step reward function is defined as follows:

$$R(s_t, a_t, s_{t+1}) = \begin{cases} -F_m(\pi), & \text{if the episode terminates,} \\ 0, & \text{otherwise.} \end{cases} \quad (7)$$
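As an illustrative sketch only (F_m and F_t stand for the makespan and tardiness of a finished episode; the helper names, and the assumption that tardiness is accrued per assignment, are ours, not the paper's definitions):

```python
def step_reward(done, makespan):
    """Eq. (7): the reward is the negative makespan at the last step, zero elsewhere."""
    return -makespan if done else 0.0

def step_cost(task_due, finish_time):
    """One plausible per-step cost feeding the cumulative tardiness constraint:
    how late the assigned task finishes (zero if it is on time)."""
    return max(0.0, finish_time - task_due)
```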
B. Constraint handling

We combine invalid action masking and reward constrained policy optimisation (RCPO) [26], a Lagrangian relaxation approach, to handle the instantaneous and cumulative constraints at the same time.

1) Feasible action selection via masking: We can indeed determine the available actions according to the current state. For example, in our problem, the dimension of the current action space can be 4, 8, ..., depending on the vehicles available at any time step.
However, since the dimensions of the input and output of the neural networks used are usually fixed, it is hard to address a variable action space directly.

To handle the instantaneous constraint, we adapt the invalid action masking technique that has been widely applied in games [29], [30]. With the masking, the action space is compressed into A_t = D × AV_t, where AV_t is the set of available AGVs. Logits corresponding to all invalid actions are replaced with a large negative number, and the final action is then resampled according to the new probability distribution. Masking helps the policy choose valid actions and avoid dangerous situations such as getting stuck somewhere. Pseudo code is shown in Alg. 1.

Algorithm 1 Invalid action masking.
Input: policy π_θ, state s
Output: action a
1: Logits l ← π_θ(·|s)
2: Compute l′ by replacing the logit of every invalid action a′ with −∞
3: π′_θ(·|s) ← softmax(l′)
4: Sample a ∼ π′_θ(·|s)
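A minimal PyTorch-style sketch of this masking step (illustrative only; the paper's implementation builds on Tianshou and may differ):

```python
import torch

def masked_sample(logits, valid_mask):
    """Replace the logits of invalid actions with a large negative number, then resample.

    logits: tensor of shape (num_actions,) produced by the policy network.
    valid_mask: boolean tensor of the same shape, True for feasible actions.
    """
    masked_logits = torch.where(valid_mask, logits, torch.full_like(logits, -1e8))
    probs = torch.softmax(masked_logits, dim=-1)   # invalid actions get ~0 probability
    return torch.multinomial(probs, num_samples=1).item()
```

The mask itself would be built from the instantaneous constraint, e.g. marking only those ⟨rule, AGV⟩ pairs whose AGV status is Idle as valid.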
It is notable that, to the best of our knowledge, no existing work has applied the masking technique to handle the instantaneous constraint in DMH. The closest work is presented by [6], which applies DQN [31] with a designed reward function. According to [6], the DQN agent selects the action with the maximal Q value and repeats this step until a feasible action is selected, i.e., one that assigns a task to an available vehicle. However, when applying the method of [6] to our problem, the agent keeps being trapped in this step if the selected action is infeasible: even when the action is resampled, the same infeasible action will always be selected by this deterministic policy at a certain state.
2) Reward constrained policy optimisation with masking: Pseudo code of our proposed method is shown in Alg. 2. We combine RCPO [26] and invalid action masking, named RCPOM, to handle the hybrid constraints at the same time, see Fig. 3. RCPO is a Lagrangian relaxation type of method based on the actor-critic framework [32], which converts a constrained problem into an unconstrained one by introducing a multiplier λ, see Eq. (8):

$$\min_{\lambda} \max_{\theta} \big[ R^\pi - \lambda (J_C^\pi - \epsilon) \big], \quad (8)$$

where λ > 0. The reward function is reshaped with the Lagrangian multiplier λ as follows:

$$\hat{R}(s, a, s', \lambda) = R(s, a, s') - \lambda c(s, a, s'). \quad (9)$$

On a slower time scale than the update of the actor and critic, the multiplier is updated with the collected constraint values according to Eq. (10):

$$\lambda \leftarrow \max\{\lambda + \eta (J_C^\pi - \epsilon), 0\}, \quad (10)$$

where η is the learning rate of the multiplier.
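A compact sketch of these two ingredients, the reshaped reward of Eq. (9) and the slower multiplier update of Eq. (10), for illustration only:

```python
def shaped_reward(r, c, lam):
    """Eq. (9): penalise the raw reward with the constraint cost weighted by lambda."""
    return r - lam * c

def update_multiplier(lam, J_C, epsilon, eta):
    """Eq. (10): increase lambda when the constraint is violated, keep it non-negative."""
    return max(lam + eta * (J_C - epsilon), 0.0)

# Example with the settings reported later (epsilon = 50, eta = 0.0001): a constraint
# estimate of 62.0 exceeds the threshold, so lambda grows slightly.
lam = update_multiplier(lam=0.001, J_C=62.0, epsilon=50.0, eta=0.0001)
```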
Algorithm 2 RCPO with masking (RCPOM).
Input: Epochs N, Lagrange multiplier λ, learning rate of the Lagrange multiplier η, temperature parameter α
Output: Policy π_θ
1: Initialise an experience replay buffer B_R
2: Initialise policy π_θ and two critics Q̂_{ψ1} and Q̂_{ψ2}
3: for n = 1 to N do
4:   for t = 0, 1, . . . do
5:     a′_t ← Masking(π_θ, s_t)   ▷ Alg. 1
6:     Get s_{t+1}, r_t, c_t by executing a′_t
7:     Store ⟨s_t, a′_t, s_{t+1}, r_t, c_t⟩ into B_R
8:     Randomly sample a minibatch B of transitions T = ⟨s, a, s′, r, c⟩ from B_R
9:     Compute y ← r − λc + γ(min_{j=1,2} Q̂_{ψj}(s′, ã′) − α log π_θ(ã′|s′)), where ã′ ∼ π_θ(·|s′)
10:    Update critics with ∇_{ψj} (1/|B|) Σ_{T∈B} (y − Q̂_{ψj}(s, a))² for j = 1, 2
11:    Update actor with ∇_θ (1/|B|) Σ_{T∈B} (min_{j=1,2} Q̂_{ψj}(s, ã_θ) − α log π_θ(ã_θ|s)), where ã_θ is sampled from π_θ(·|s) via the reparametrisation trick
12:    Apply soft update on target networks
13:  end for
14:  λ ← max(λ + η(J_C^π − ϵ), 0)   ▷ Eq. (10)
15: end for
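An illustrative sketch of the core of line 9, the constraint-penalised SAC-style critic target (assuming PyTorch tensors and hypothetical policy/critic callables; not the paper's exact code):

```python
import torch

def critic_target(r, c, done, next_obs, policy, q1_target, q2_target,
                  lam=0.001, gamma=0.97, alpha=0.1):
    """y = (r - lam * c) + gamma * (min_j Q_j(s', a~') - alpha * log pi(a~'|s'))."""
    with torch.no_grad():
        next_action, log_prob = policy.sample(next_obs)   # a~' ~ pi(.|s'), with log-prob
        q_min = torch.min(q1_target(next_obs, next_action),
                          q2_target(next_obs, next_action))
        # (1 - done) masks bootstrapping at episode end (standard practice,
        # not written explicitly in Alg. 2).
        return (r - lam * c) + gamma * (1.0 - done) * (q_min - alpha * log_prob)
```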
Specifically, we implement RCPOM with soft actor-critic (SAC) because of its remarkable performance [20]. SAC is a maximum entropy reinforcement learning algorithm that maximises the cumulative reward as well as the expected entropy of the policy. The objective function is adapted as follows:

$$J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t \big(R(s_t, a_t, s_{t+1}) + \alpha \mathcal{H}(\pi(\cdot|s_t))\big)\Big], \quad (11)$$

where the temperature α decides how stochastic the policy is, which encourages exploration.
3) Invariant reward shaping: The raw reward function is reshaped with a linear transformation for more positive feedback and a better estimation of the value function, see Eq. (12):

$$\tilde{R}(s, a, s', \lambda) = \beta R(s, a, s') + b, \quad (12)$$

where β > 0 and b is a positive constant. It is easy to guarantee policy invariance under this reshaping [33], as stated in Remark 1.

Remark 1: Given a CMDP (S, A, R, C, P, γ), the optimal policy keeps invariant under the linear transformation in Eq. (12), where β > 0 and b ∈ R:

$$\forall s \in S, \quad V^*(s) = \arg\max_{\pi} V^{\pi}(s) = \arg\max_{\pi} \beta V^{\pi}(s) + \frac{b}{1-\gamma}.$$
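For completeness, the one-line argument behind Remark 1 (a standard discounted-sum identity, spelled out here rather than quoted from [33]):

$$\tilde{V}^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(\beta R(s_t, a_t, s_{t+1}) + b\big)\Big] = \beta V^{\pi}(s) + \sum_{t=0}^{\infty}\gamma^{t} b = \beta V^{\pi}(s) + \frac{b}{1-\gamma},$$

so every policy's value undergoes the same positive affine transformation, and the policy attaining the maximum is unchanged.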
V. EXPERIMENTS

To verify the proposed method, extensive experiments are conducted and discussed in this section.
TABLE I
TIME CONSUMED FOR MAKING A DECISION, AVERAGED OVER 2000 TRIALS.

Method     RCPOM   RCPOM-NS   IPO     L-SAC   L-PPO   SAC     PPO     FCFS     EDD      NVF      STD      Random   Hu et al. [6]
Time (ms)  2.143   2.112      2.038   2.74    2.514   2.79    2.626   0.0169   0.0199   0.0239   0.0354   0.0259   Timeout
The proposed method, denoted as "RCPOM", is compared with five groups of methods: (i) an advanced constrained reinforcement learning agent, Interior-point Policy Optimisation (IPO) [34]; (ii) state-of-the-art reinforcement learning agents, including proximal policy optimisation (PPO) [35] and soft actor-critic (SAC) [20]; (iii) SAC and PPO with a fixed Lagrangian multiplier, named "L-SAC" and "L-PPO"; (iv) commonly used dispatching rules, including FCFS, STD, EDD and NVF, as presented in Section IV-A2; and (v), to validate the invariant reward shaping, an ablation in which the RCPOM agent without the invariant reward shaping, denoted as "RCPOM-NS", is compared. A random agent is also compared.

The simulator DMH-GYM and 16 problem instances are used in our experiments, where Instance01-08 are training instances and Instance09-16 are unseen during training. Trials on the training instances and the unseen instances verify the effectiveness and adaptability of our proposed method.
A. Settings

We apply the invalid action masking technique to all the learning agents to ensure safety when assigning tasks, since the instantaneous constraint should be satisfied at every time step. All learning agents except "RCPOM-NS" are equipped with the reward shaping for a fair comparison with the proposed method. To ensure instantaneous constraint satisfaction and to explore more possible plans, we adapt the dispatching rules: they consider only the currently feasible task assignments and shuffle these possible assignments when multiple vehicles are available.

All learning agents are implemented based on the Tianshou framework [36] (https://github.com/thu-ml/tianshou). The network structure is formed by two hidden fully connected layers of size 128 × 128. α is 0.1 and γ is 0.97. The initial multiplier λ and its learning rate are set as 0.001 and 0.0001, respectively. The constraint threshold ϵ is set as 50. The reward shaping parameters β and b are set as 1 and 2000, respectively, with b chosen with reference to the dispatching rules' results. All learning agents are trained for 5e5 steps on an Intel Xeon Gold 6240 CPU and four TITAN RTX GPUs. The best policies found during training are selected. All dispatching policies and learning agents are tested 30 times on each instance independently. Parameter values are either set following previous studies [26], [36] or arbitrarily chosen.
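For readability, the hyperparameters listed above collected into a single configuration dictionary (key names are illustrative, not Tianshou's API):

```python
# Hyperparameters as reported in Section V-A (key names are ours).
CONFIG = {
    "hidden_layers": (128, 128),   # two fully connected hidden layers
    "alpha": 0.1,                  # SAC temperature
    "gamma": 0.97,                 # discount factor
    "lambda_init": 0.001,          # initial Lagrangian multiplier
    "lambda_lr": 0.0001,           # learning rate eta of the multiplier
    "epsilon": 50,                 # tardiness constraint threshold
    "beta": 1,                     # reward shaping scale
    "b": 2000,                     # reward shaping offset
    "train_steps": 500_000,        # 5e5 training steps
    "test_trials": 30,             # independent test runs per instance
}
```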
B. Experimental results

Tab. II and Tab. III present the results on the training and unseen instances, respectively. The average time consumption per task assignment is reported in Tab. I.

It is obvious that the random agent performs the worst. In terms of dispatching rules, EDD shows its ability to optimise the time-based metric, namely tardiness. In all instances, the tardiness of EDD is almost always under the tardiness constraint threshold. For Instance01-Instance03, EDD gets the lowest tardiness, 33.9, 29.6 and 26.5, respectively, compared with the other policies. FCFS is also a time-related rule that always assigns the task that arrives first. It only performs better than the other three rules in Instance07, owing to its low makespan and tardiness. The two other, distance-related rules, NVF and STD, which optimise the travel distance, can achieve good results on makespan. For example, STD has the smallest makespan, 1883.5, in Instance08 among all the policies. However, both rules fail in terms of tardiness, as shown in Tab. II by their relatively high constraint values.

Learning agents usually show better performance on makespan than dispatching rules. In Instance01, the SAC agent gets the best makespan of 1798.4. The proposed RCPOM performs the best among all the policies in terms of makespan on most of the training instances, except Instance01, Instance03 and Instance08. Although STD achieves 1883.5 on Instance08, RCPOM still gets a very close makespan value of 1898.0. On tardiness, the constrained learning agents show lower values than the others in most instances. On Instance01, Instance04 and Instance08, the SAC agent gets the lowest tardiness. However, it is notable that we care more about constraint satisfaction than about minimising tardiness. Even though SAC gets the lowest tardiness value of 28.9 in Instance04, we consider RCPOM the best agent there, since it gets the lowest makespan value of 1956.9 while its tardiness stays under the constraint threshold, i.e., 40.5 < 50. A similar pattern is also observed on the unseen instances, as shown in Tab. III. Learning agents still perform better on unseen instances compared with dispatching rules, except for STD, which gets the best result on Instance16 with constraint satisfaction. RCPOM, the proposed method, achieves the best average makespan and is statistically better than almost all other policies on Instance10-14.

C. Discussion

1) Mediocre dispatching rules: Dispatching rules perform promisingly on the 16 instances, as shown in Tab. II and Tab. III. EDD usually has the lowest tardiness, and STD even gets the lowest makespan on Instance08 and Instance16. FCFS and EDD are time-related dispatching rules: FCFS always chooses tasks according to their arrival time, i.e., assigns the task that arrives first, while EDD selects the task that has the earliest due date. In contrast to the time-related rules, NVF and STD are distance-related rules. They both optimise objectives that strongly relate to distance, such as makespan.
TABLE II
AVERAGE MAKESPAN AND TARDINESS OVER 30 INDEPENDENT TRIALS ON THE TRAINING SET (INSTANCE01-08). THE BOLD NUMBER INDICATES THE BEST MAKESPAN AND TARDINESS. "+/≈/-" INDICATES THAT THE AGENT PERFORMS STATISTICALLY BETTER/SIMILAR/WORSE THAN THE RCPOM AGENT. "M/C (P)" INDICATES THE AVERAGE NORMALISED MAKESPAN, TARDINESS AND PERCENTAGE OF CONSTRAINT SATISFACTION. THE NUMBER OF POLICIES THAT RCPOM OUTPERFORMS IN TERMS OF MAKESPAN AND TARDINESS ON EACH INSTANCE IS GIVEN IN THE BOTTOM ROW.
TABLE III
AVERAGE MAKESPAN AND TARDINESS OVER 30 INDEPENDENT TRIALS ON THE TEST SET (INSTANCE09-16). THE BOLD NUMBER INDICATES THE BEST MAKESPAN AND TARDINESS. "+/≈/-" INDICATES THAT THE AGENT PERFORMS STATISTICALLY BETTER/SIMILAR/WORSE THAN THE RCPOM AGENT. "M/C (P)" INDICATES THE AVERAGE NORMALISED MAKESPAN, TARDINESS AND PERCENTAGE OF CONSTRAINT SATISFACTION. THE NUMBER OF POLICIES THAT RCPOM OUTPERFORMS IN TERMS OF MAKESPAN AND TARDINESS ON EACH INSTANCE IS GIVEN IN THE BOTTOM ROW.
The difference is that NVF selects the task with the nearest pickup point, while STD selects the task with the shortest distance from the current position to the pickup point and then to the delivery point.

Dispatching rules use simple manual mechanisms to schedule tasks and achieve promising performance, usually better than the Random agent. However, such mechanisms may not be able to handle more complex scenarios, and it is hard to improve a dispatching rule due to its poor adaptability. From Fig. 4, it is clear that learning agents usually perform better than the dispatching rules, with lower average makespan on the instances. Dispatching rules keep using the same mechanism throughout the long sequential decision process. It makes sense that they show limited performance, since one rule usually works only in certain specific situations. Instantaneous modifications have to be made when unexpected events occur. For example, EDD and FCFS consider time-related objectives such as makespan, which can be easily determined in Eq. 2. Both rules optimise the objective only partially. When the instance has a large concession space for delayed tasks, EDD and FCFS may hardly work. In the problem considered in this paper, two objectives (i.e., distance and time) are involved. Although tardiness is considered as a constraint, it is hard to identify the correlation between makespan and tardiness. It is shown on Instance08 that NVF and STD are better in both makespan and tardiness. But it can also be seen that in Instance01 and Instance03, EDD has lower tardiness, while its makespan is worse than NVF and STD. This observation implies that the ability of simple dispatching rules is limited. Our hybrid action space is motivated by this phenomenon and is expected to provide more optimisation possibilities and adaptability.
multiplier that is updated during training. As demonstrated in Tab. II, Tab. III and Fig. 4, RCPOM outperforms the other learning agents on average. RCPOM has a slight gap to EDD on tardiness but outperforms it by a large margin on makespan.
Although IPO is also a CRL algorithm, its assumption that
the policy should satisfy constraints upon initialisation [37]
limits its performance when solving the DMH problem.
3) Promising performance on unseen instances: To further
validate the performance of the proposed method, we also test
it on some unseen instances from Instance09 to Instance16.
These unseen instances are generated by mutating the training
instances. Our method still outperforms others on the unseen
instances according to Tab. III. Such stable performance makes it more feasible to apply our method to real-world problems that involve dynamic and complex situations.
4) Invariant reward shaping improves: The raw reward function of the process is the negative makespan, which is only given at the last timestep, as denoted in Eq. (7). It is challenging for an RL agent to learn from such a sparse reward function with only negative values. The consequence of this lack of positive feedback is that agents may become stagnant and conservative. In almost all instances, the RCPOM agent with reward shaping performs better than the one without it, as shown in Tab. II and Tab. III. Although the reward values are modified by the shaping, the optimal policy is proved to remain invariant (Remark 1). Invariant reward shaping helps agents explore more and make the most of the positive feedback.