Constrained Reinforcement Learning For Dynamic Material Handling
Fig. 2. Layout of DMH-GYM.

γ is the discount factor. A cumulative constraint C = g(c(s_0, a_0, s_1), ..., c(s_t, a_t, s_{t+1})), which consists of per-step costs c(s, a, s'), can be restricted by a corresponding constraint threshold ϵ. g can be any aggregation function for constraints, such as the average or the sum. J_C^π denotes the expectation of the cumulative constraint. A policy optimised by parameters θ is defined as π_θ, which determines the probability π(a_t|s_t) of taking action a_t at state s_t at time t. The goal of the problem is to maximise the long-term expected reward with the optimised policy π_θ while satisfying the constraints, as formulated below:

$$\max_{\theta} \; \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\Big] \quad (3)$$
$$\text{s.t.} \quad J_C^{\pi_\theta} = \mathbb{E}_{\tau \sim \pi_\theta}[C] \le \epsilon. \quad (4)$$
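As a concrete illustration of how such a cumulative constraint could be evaluated from sampled episodes (a minimal sketch with hypothetical helper names, assuming g is the sum or the average of per-step costs; not code from the paper):

```python
import numpy as np

def cumulative_constraint(costs, g="sum"):
    """Aggregate per-step costs c(s_t, a_t, s_{t+1}) of one episode into C = g(c_0, ..., c_T)."""
    costs = np.asarray(costs, dtype=float)
    return costs.sum() if g == "sum" else costs.mean()

def estimate_J_C(episode_costs, g="sum"):
    """Monte-Carlo estimate of J_C = E[C] over several sampled episodes."""
    return float(np.mean([cumulative_constraint(c, g) for c in episode_costs]))

# Example: three sampled episodes of per-step costs, checked against a threshold of 50.
episodes = [[0.0, 3.2, 10.5], [1.1, 0.0, 0.0, 7.4], [0.0, 25.0]]
J_C = estimate_J_C(episodes, g="sum")
print(J_C, "feasible" if J_C <= 50 else "violated")
```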
1) State: At each decision time t, we consider the current state of the whole system, consisting of tasks and AGVs, denoted as S_t = ρ(U_t, V_t), where U_t is the set of unassigned tasks and V_t represents the information of all AGVs at time t. We encode the information of the system at decision time t with a feature extraction function ρ(U_t, V_t) as a structured representation. Specifically, a state is mapped into a vector by ρ, consisting of the number of unassigned tasks, the tasks' remaining time, waiting time and distance between pickup and delivery points, as well as the statuses of the vehicles and the minimal time from the current position to the pickup point and then the delivery point.
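Purely for illustration, a minimal sketch of such a feature extraction function, assuming hypothetical Task and AGV structures with the attributes listed above (the actual DMH-GYM encoding may differ):

```python
from dataclasses import dataclass

@dataclass
class Task:
    remaining_time: float       # time left before the task's due date
    waiting_time: float         # time elapsed since the task arrived
    pickup_to_delivery: float   # distance between pickup and delivery points

@dataclass
class AGV:
    status: int                 # e.g. 0 = Idle, 1 = busy
    min_travel_time: float      # minimal time: current position -> pickup -> delivery

def rho(unassigned_tasks, agvs):
    """Map the system state (U_t, V_t) to a flat feature vector."""
    features = [float(len(unassigned_tasks))]
    for u in unassigned_tasks:
        features += [u.remaining_time, u.waiting_time, u.pickup_to_delivery]
    for v in agvs:
        features += [float(v.status), v.min_travel_time]
    # In practice the vector would be padded or truncated to a fixed length,
    # since the policy network's input dimension is fixed.
    return features
```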
2) Action: The goal of dynamic material handling is to execute all tasks, thus a discrete time step refers to one single, independent task assignment. Different from regular MDPs such as games, the number of steps in our problem is usually fixed. In other words, the length of an episode is decided by the number of tasks to be assigned. In our CMDP, making an action means choosing a suitable dispatching rule and a corresponding vehicle. A hybrid action space A_t = D × V_t is considered for the CMDP, where D and V_t are defined as the dispatching rule space and the AGV space, respectively. Four dispatching rules, namely first come first served (FCFS), shortest travel distance (STD), earliest due date first (EDD) and nearest vehicle first (NVF), form the dispatching rule space D. Those dispatching rules are formulated [3] as follows:

• FCFS: $\arg\min_{u \in U} u.\tau$;
• EDD: $\arg\min_{u \in U} (u.o - u.\tau)$;
• NVF: $\arg\min_{u \in U} d(v.c, u.s)$;
• STD: $\arg\min_{u \in U} \big(d(v.c, u.s) + d(u.s, u.e)\big)$,

where v.c is the current position of an AGV v and d(p, p') determines the distance between p and p'. An action a_t = ⟨d_t, v_t⟩ ∈ A_t at time t determines a single task assignment: task u_t is assigned to AGV v_t by dispatching rule d_t. With the action space A_t, the policy can decide the rule and the AGV at the same time, which breaks the tie in the multiple-vehicle case.
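Purely as an illustration (hypothetical field names, not the DMH-GYM API), a sketch of the four rules and of decoding a hybrid action ⟨d_t, v_t⟩ into a concrete task assignment:

```python
# Hypothetical task fields: arrival (u.tau), due (u.o), pickup (u.s), delivery (u.e).
# Hypothetical AGV field: pos (v.c). dist() stands in for the distance function d(., .).
def dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])  # e.g. Manhattan distance on the layout

RULES = {
    "FCFS": lambda u, v: u["arrival"],
    "EDD":  lambda u, v: u["due"] - u["arrival"],
    "NVF":  lambda u, v: dist(v["pos"], u["pickup"]),
    "STD":  lambda u, v: dist(v["pos"], u["pickup"]) + dist(u["pickup"], u["delivery"]),
}

def assign(rule_name, vehicle, unassigned_tasks):
    """Decode a hybrid action <d_t, v_t>: rule d_t picks the task for vehicle v_t."""
    score = RULES[rule_name]
    return min(unassigned_tasks, key=lambda u: score(u, vehicle))
```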
3) Constraints: A cumulative constraint and an instantaneous constraint are both considered in the CMDP. We treat the tardiness as a cumulative constraint J_C, formulated in Eq. (5). The task assignment constraint of vehicle availability is regarded as an instantaneous constraint: only vehicles with status Idle are considered as available, denoted as Eq. (6).

$$J_C^{\pi_\theta} = \mathbb{E}_{\pi_\theta}\big(F_t(\pi_\theta)\big) \le \epsilon, \quad (5)$$
$$v.\psi = I. \quad (6)$$

4) Reward function: The negative makespan returned at the last timestep is set as the reward. τ denotes a trajectory (s_0, a_0, s_1, a_1, ..., s_t, a_t, s_{t+1}, ...) sampled from the current policy π. The per-step reward function is defined as follows:

$$R(s_t, a_t, s_{t+1}) = \begin{cases} -F_m(\pi), & \text{if the episode terminates,} \\ 0, & \text{otherwise.} \end{cases} \quad (7)$$
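As an illustrative sketch only (F_m and F_t stand for the makespan and tardiness of a finished episode; the helper names, and the assumption that tardiness is accrued per assignment, are ours, not the paper's definitions):

```python
def step_reward(done, makespan):
    """Eq. (7): the reward is the negative makespan at the last step, zero elsewhere."""
    return -makespan if done else 0.0

def step_cost(task_due, finish_time):
    """One plausible per-step cost feeding the cumulative tardiness constraint:
    how late the assigned task finishes (zero if it is on time)."""
    return max(0.0, finish_time - task_due)
```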
B. Constraint handling

We combine invalid action masking and reward constrained policy optimisation (RCPO) [26], a Lagrangian relaxation approach, to handle the instantaneous and cumulative constraints at the same time.

1) Feasible action selection via masking: We can indeed determine the available actions according to the current state. For example, in our problem, the dimension of the current action space can be 4, 8, ..., depending on the vehicles available at any time step.
However, since the dimensions of the input and output of the neural networks used are usually fixed, it is hard to address a variable action space directly.

To handle the instantaneous constraint, we adapt the invalid action masking technique that has been widely applied in games [29], [30]. With the masking, the action space is compressed into A_t = D × AV_t, where AV_t is the set of available AGVs. Logits corresponding to all invalid actions are replaced with a large negative number, and the final action is then resampled according to the new probability distribution. Masking helps the policy choose valid actions and avoid dangerous situations such as getting stuck somewhere. Pseudo code is shown in Alg. 1.

Algorithm 1 Invalid action masking.
Input: policy π_θ, state s
Output: action a
1: Logits l ← π_θ(·|s)
2: Compute l′ by replacing the logit of every invalid action a′ with −∞
3: π′_θ(·|s) ← softmax(l′)
4: Sample a ∼ π′_θ(·|s)
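A minimal PyTorch-style sketch of this masking step (illustrative only; the paper's implementation builds on Tianshou and may differ):

```python
import torch

def masked_sample(logits, valid_mask):
    """Replace the logits of invalid actions with a large negative number, then resample.

    logits: tensor of shape (num_actions,) produced by the policy network.
    valid_mask: boolean tensor of the same shape, True for feasible actions.
    """
    masked_logits = torch.where(valid_mask, logits, torch.full_like(logits, -1e8))
    probs = torch.softmax(masked_logits, dim=-1)   # invalid actions get ~0 probability
    return torch.multinomial(probs, num_samples=1).item()
```

The mask itself would be built from the instantaneous constraint, e.g. marking only those ⟨rule, AGV⟩ pairs whose AGV status is Idle as valid.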
It is notable that, to the best of our knowledge, no existing work has applied the masking technique to handle the instantaneous constraint in DMH. The closest work is presented by [6], which applies DQN [31] with a designed reward function. According to [6], the DQN agent selects the action with the maximal Q value and repeats this step until a feasible action is selected, i.e., one that assigns a task to an available vehicle. However, when applying the method of [6] to our problem, the agent keeps being trapped in this step if the selected action is infeasible: even when the action is resampled, the same infeasible action will always be selected by this deterministic policy at a certain state.
2) Reward constrained policy optimisation with masking: Pseudo code of our proposed method is shown in Alg. 2. We combine RCPO [26] and invalid action masking, named RCPOM, to handle the hybrid constraints at the same time, see Fig. 3. RCPO is a Lagrangian relaxation type of method based on the actor-critic framework [32], which converts a constrained problem into an unconstrained one by introducing a multiplier λ, see Eq. (8):

$$\min_{\lambda} \max_{\theta} \big[ R^\pi - \lambda (J_C^\pi - \epsilon) \big], \quad (8)$$

where λ > 0. The reward function is reshaped with the Lagrangian multiplier λ as follows:

$$\hat{R}(s, a, s', \lambda) = R(s, a, s') - \lambda c(s, a, s'). \quad (9)$$

On a slower time scale than the update of the actor and critic, the multiplier is updated with the collected constraint values according to Eq. (10):

$$\lambda \leftarrow \max\{\lambda + \eta (J_C^\pi - \epsilon), 0\}, \quad (10)$$

where η is the learning rate of the multiplier.
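A compact sketch of these two ingredients, the reshaped reward of Eq. (9) and the slower multiplier update of Eq. (10), for illustration only:

```python
def shaped_reward(r, c, lam):
    """Eq. (9): penalise the raw reward with the constraint cost weighted by lambda."""
    return r - lam * c

def update_multiplier(lam, J_C, epsilon, eta):
    """Eq. (10): increase lambda when the constraint is violated, keep it non-negative."""
    return max(lam + eta * (J_C - epsilon), 0.0)

# Example with the settings reported later (epsilon = 50, eta = 0.0001): a constraint
# estimate of 62.0 exceeds the threshold, so lambda grows slightly.
lam = update_multiplier(lam=0.001, J_C=62.0, epsilon=50.0, eta=0.0001)
```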
Algorithm 2 RCPO with masking (RCPOM).
Input: Epochs N, Lagrange multiplier λ, learning rate of the Lagrange multiplier η, temperature parameter α
Output: Policy π_θ
1: Initialise an experience replay buffer B_R
2: Initialise policy π_θ and two critics Q̂_{ψ1} and Q̂_{ψ2}
3: for n = 1 to N do
4:   for t = 0, 1, . . . do
5:     a′_t ← Masking(π_θ, s_t)   ▷ Alg. 1
6:     Get s_{t+1}, r_t, c_t by executing a′_t
7:     Store ⟨s_t, a′_t, s_{t+1}, r_t, c_t⟩ into B_R
8:     Randomly sample a minibatch B of transitions T = ⟨s, a, s′, r, c⟩ from B_R
9:     Compute y ← r − λc + γ(min_{j=1,2} Q̂_{ψj}(s′, ã′) − α log π_θ(ã′|s′)), where ã′ ∼ π_θ(·|s′)
10:    Update critics with ∇_{ψj} (1/|B|) Σ_{T∈B} (y − Q̂_{ψj}(s, a))² for j = 1, 2
11:    Update actor with ∇_θ (1/|B|) Σ_{T∈B} (min_{j=1,2} Q̂_{ψj}(s, ã_θ) − α log π_θ(ã_θ|s)), where ã_θ is sampled from π_θ(·|s) via the reparametrisation trick
12:    Apply soft update on target networks
13:  end for
14:  λ ← max(λ + η(J_C^π − ϵ), 0)   ▷ Eq. (10)
15: end for
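An illustrative sketch of the core of line 9, the constraint-penalised SAC-style critic target (assuming PyTorch tensors and hypothetical policy/critic callables; not the paper's exact code):

```python
import torch

def critic_target(r, c, done, next_obs, policy, q1_target, q2_target,
                  lam=0.001, gamma=0.97, alpha=0.1):
    """y = (r - lam * c) + gamma * (min_j Q_j(s', a~') - alpha * log pi(a~'|s'))."""
    with torch.no_grad():
        next_action, log_prob = policy.sample(next_obs)   # a~' ~ pi(.|s'), with log-prob
        q_min = torch.min(q1_target(next_obs, next_action),
                          q2_target(next_obs, next_action))
        # (1 - done) masks bootstrapping at episode end (standard practice,
        # not written explicitly in Alg. 2).
        return (r - lam * c) + gamma * (1.0 - done) * (q_min - alpha * log_prob)
```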
Specifically, we implement RCPOM with soft actor-critic (SAC) because of its remarkable performance [20]. SAC is a maximum entropy reinforcement learning algorithm that maximises the cumulative reward as well as the expected entropy of the policy. The objective function is adapted as follows:

$$J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t \big(R(s_t, a_t, s_{t+1}) + \alpha \mathcal{H}(\pi(\cdot|s_t))\big)\Big], \quad (11)$$

where the temperature α decides how stochastic the policy is, which encourages exploration.
3) Invariant reward shaping: The raw reward function is reshaped with a linear transformation for more positive feedback and a better estimation of the value function, see Eq. (12):

$$\tilde{R}(s, a, s', \lambda) = \beta R(s, a, s') + b, \quad (12)$$

where β > 0 and b is a positive constant. It is easy to guarantee policy invariance under this reshaping [33], as stated in Remark 1.

Remark 1: Given a CMDP (S, A, R, C, P, γ), the optimal policy keeps invariant under the linear transformation in Eq. (12), where β > 0 and b ∈ R:

$$\forall s \in S, \quad V^*(s) = \arg\max_{\pi} V^{\pi}(s) = \arg\max_{\pi} \beta V^{\pi}(s) + \frac{b}{1-\gamma}.$$
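For completeness, the one-line argument behind Remark 1 (a standard discounted-sum identity, spelled out here rather than quoted from [33]):

$$\tilde{V}^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(\beta R(s_t, a_t, s_{t+1}) + b\big)\Big] = \beta V^{\pi}(s) + \sum_{t=0}^{\infty}\gamma^{t} b = \beta V^{\pi}(s) + \frac{b}{1-\gamma},$$

so every policy's value undergoes the same positive affine transformation, and the policy attaining the maximum is unchanged.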
V. EXPERIMENTS

To verify the proposed method, extensive experiments are conducted and discussed in this section.
TABLE I
TIME CONSUMED FOR MAKING A DECISION, AVERAGED OVER 2000 TRIALS.

Method     RCPOM   RCPOM-NS   IPO     L-SAC   L-PPO   SAC     PPO     FCFS     EDD      NVF      STD      Random   Hu et al. [6]
Time (ms)  2.143   2.112      2.038   2.74    2.514   2.79    2.626   0.0169   0.0199   0.0239   0.0354   0.0259   Timeout
The proposed method, denoted as "RCPOM", is compared with five groups of methods: (i) an advanced constrained reinforcement learning agent, Interior-point Policy Optimisation (IPO) [34]; (ii) state-of-the-art reinforcement learning agents, including proximal policy optimisation (PPO) [35] and soft actor-critic (SAC) [20]; (iii) SAC and PPO with a fixed Lagrangian multiplier, named "L-SAC" and "L-PPO"; (iv) commonly used dispatching rules, including FCFS, STD, EDD and NVF, as presented in Section IV-A2; and (v), to validate the invariant reward shaping, an ablation in which the RCPOM agent without the invariant reward shaping, denoted as "RCPOM-NS", is compared. A random agent is also compared.

The simulator DMH-GYM and 16 problem instances are used in our experiments, where Instance01-08 are training instances and Instance09-16 are unseen during training. Trials on the training instances and the unseen instances verify the effectiveness and adaptability of our proposed method.
A. Settings

We apply the invalid action masking technique to all the learning agents to ensure safety when assigning tasks, since the instantaneous constraint should be satisfied at every time step. All learning agents except "RCPOM-NS" are equipped with the reward shaping for a fair comparison with the proposed method. To ensure instantaneous constraint satisfaction and to explore more possible plans, we adapt the dispatching rules: they consider only the currently feasible task assignments and shuffle these possible assignments when multiple vehicles are available.

All learning agents are implemented based on the Tianshou framework [36] (https://github.com/thu-ml/tianshou). The network structure is formed by two hidden fully connected layers of size 128 × 128. α is 0.1 and γ is 0.97. The initial multiplier λ and its learning rate are set as 0.001 and 0.0001, respectively. The constraint threshold ϵ is set as 50. The reward shaping parameters β and b are set as 1 and 2000, respectively, with b chosen with reference to the dispatching rules' results. All learning agents are trained for 5e5 steps on an Intel Xeon Gold 6240 CPU and four TITAN RTX GPUs. The best policies found during training are selected. All dispatching policies and learning agents are tested 30 times on each instance independently. Parameter values are either set following previous studies [26], [36] or arbitrarily chosen.
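For readability, the hyperparameters listed above collected into a single configuration dictionary (key names are illustrative, not Tianshou's API):

```python
# Hyperparameters as reported in Section V-A (key names are ours).
CONFIG = {
    "hidden_layers": (128, 128),   # two fully connected hidden layers
    "alpha": 0.1,                  # SAC temperature
    "gamma": 0.97,                 # discount factor
    "lambda_init": 0.001,          # initial Lagrangian multiplier
    "lambda_lr": 0.0001,           # learning rate eta of the multiplier
    "epsilon": 50,                 # tardiness constraint threshold
    "beta": 1,                     # reward shaping scale
    "b": 2000,                     # reward shaping offset
    "train_steps": 500_000,        # 5e5 training steps
    "test_trials": 30,             # independent test runs per instance
}
```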
B. Experimental results

Tab. II and Tab. III present the results on the training and unseen instances, respectively. The average time consumption per task assignment is reported in Tab. I.

It is obvious that the random agent performs the worst. In terms of dispatching rules, EDD shows its ability to optimise the time-based metric, namely tardiness. In all instances, the tardiness of EDD is almost always under the tardiness constraint threshold. For Instance01-Instance03, EDD gets the lowest tardiness, 33.9, 29.6 and 26.5, respectively, compared with the other policies. FCFS is also a time-related rule that always assigns the task that arrives first. It only performs better than the other three rules in Instance07, owing to its low makespan and tardiness. The two other, distance-related rules, NVF and STD, which optimise the travel distance, can achieve good results on makespan. For example, STD has the smallest makespan, 1883.5, in Instance08 among all the policies. However, both rules fail in terms of tardiness, as shown in Tab. II by their relatively high constraint values.

Learning agents usually show better performance on makespan than dispatching rules. In Instance01, the SAC agent gets the best makespan of 1798.4. The proposed RCPOM performs the best among all the policies in terms of makespan on most of the training instances, except Instance01, Instance03 and Instance08. Although STD achieves 1883.5 on Instance08, RCPOM still gets a very close makespan value of 1898.0. On tardiness, the constrained learning agents show lower values than the others in most instances. On Instance01, Instance04 and Instance08, the SAC agent gets the lowest tardiness. However, it is notable that we care more about constraint satisfaction than about minimising tardiness. Even though SAC gets the lowest tardiness value of 28.9 in Instance04, we consider RCPOM the best agent there, since it gets the lowest makespan value of 1956.9 while its tardiness stays under the constraint threshold, i.e., 40.5 < 50. A similar pattern is also observed on the unseen instances, as shown in Tab. III. Learning agents still perform better on unseen instances compared with dispatching rules, except for STD, which gets the best result on Instance16 with constraint satisfaction. RCPOM, the proposed method, achieves the best average makespan and is statistically better than almost all other policies on Instance10-14.

C. Discussion

1) Mediocre dispatching rules: Dispatching rules perform promisingly on the 16 instances, as shown in Tab. II and Tab. III. EDD usually has the lowest tardiness, and STD even gets the lowest makespan on Instance08 and Instance16. FCFS and EDD are time-related dispatching rules: FCFS always chooses tasks according to their arrival time, i.e., assigns the task that arrives first, while EDD selects the task that has the earliest due date. In contrast to the time-related rules, NVF and STD are distance-related rules. They both optimise objectives that strongly relate to distance, such as makespan.
TABLE II
AVERAGE MAKESPAN AND TARDINESS OVER 30 INDEPENDENT TRIALS ON THE TRAINING SET (INSTANCE01-08). THE BOLD NUMBER INDICATES THE BEST MAKESPAN AND TARDINESS. "+/≈/-" INDICATES THAT THE AGENT PERFORMS STATISTICALLY BETTER/SIMILAR/WORSE THAN THE RCPOM AGENT. "M/C (P)" INDICATES THE AVERAGE NORMALISED MAKESPAN, TARDINESS AND PERCENTAGE OF CONSTRAINT SATISFACTION. THE NUMBER OF POLICIES THAT RCPOM OUTPERFORMS IN TERMS OF MAKESPAN AND TARDINESS ON EACH INSTANCE IS GIVEN IN THE BOTTOM ROW.
TABLE III
AVERAGE MAKESPAN AND TARDINESS OVER 30 INDEPENDENT TRIALS ON THE TEST SET (INSTANCE09-16). THE BOLD NUMBER INDICATES THE BEST MAKESPAN AND TARDINESS. "+/≈/-" INDICATES THAT THE AGENT PERFORMS STATISTICALLY BETTER/SIMILAR/WORSE THAN THE RCPOM AGENT. "M/C (P)" INDICATES THE AVERAGE NORMALISED MAKESPAN, TARDINESS AND PERCENTAGE OF CONSTRAINT SATISFACTION. THE NUMBER OF POLICIES THAT RCPOM OUTPERFORMS IN TERMS OF MAKESPAN AND TARDINESS ON EACH INSTANCE IS GIVEN IN THE BOTTOM ROW.
The difference is that NVF selects the task with the nearest pickup point, while STD selects the task with the shortest distance from the current position to the pickup point and then to the delivery point.

Dispatching rules use simple manual mechanisms to schedule tasks and achieve promising performance, usually better than the Random agent. However, such mechanisms may not be able to handle more complex scenarios, and it is hard to improve a dispatching rule due to its poor adaptability. From Fig. 4, it is clear that learning agents usually perform better than the dispatching rules, with lower average makespan on the instances. Dispatching rules keep using the same mechanism throughout the long sequential decision process. It makes sense that they show limited performance, since one rule usually works only in certain specific situations. Instantaneous modifications have to be made when unexpected events occur. For example, EDD and FCFS consider time-related objectives such as makespan, which can be easily determined in Eq. 2. Both rules optimise the objective only partially. When the instance has a large concession space for delayed tasks, EDD and FCFS may hardly work. In the problem considered in this paper, two objectives (i.e., distance and time) are involved. Although tardiness is considered as a constraint, it is hard to identify the correlation between makespan and tardiness. It is shown on Instance08 that NVF and STD are better in both makespan and tardiness. But it can also be seen that in Instance01 and Instance03, EDD has lower tardiness, while its makespan is worse than NVF and STD. This observation implies that the ability of simple dispatching rules is limited. Our hybrid action space is motivated by this phenomenon and is expected to provide more optimisation possibilities and adaptability.
multiplier that is updated during training. As demonstrated in Tab. II, Tab. III and Fig. 4, RCPOM outperforms the other learning agents on average. RCPOM has a slight gap to EDD on tardiness but outperforms it by a large margin on makespan.
Although IPO is also a CRL algorithm, its assumption that
the policy should satisfy constraints upon initialisation [37]
limits its performance when solving the DMH problem.
3) Promising performance on unseen instances: To further
validate the performance of the proposed method, we also test
it on some unseen instances from Instance09 to Instance16.
These unseen instances are generated by mutating the training
instances. Our method still outperforms others on the unseen
instances according to Tab. III. Such stable performance makes it more feasible to apply our method to real-world problems that involve dynamic and complex situations.
4) Invariant reward shaping improves: The raw reward function of the process is the negative makespan, which is only given at the last timestep, as denoted in Eq. (7). It is challenging for an RL agent to learn from such a sparse reward function with only negative values. The consequence of this lack of positive feedback is that agents may become stagnant and conservative. In almost all instances, the RCPOM agent with reward shaping performs better than the one without it, as shown in Tab. II and Tab. III. Although the reward values are modified by the shaping, the optimal policy is proved to remain invariant (Remark 1). Invariant reward shaping helps agents explore more and make the most of the positive feedback.