A Monte-Carlo tree search algorithm for the flexible job-shop scheduling problem
https://doi.org/10.1007/s10696-021-09437-4
Abstract
The flexible job-shop scheduling problem (FJSP) is an extension of the simple JSP with the additional feature of routing flexibility. It is an essential class of sequencing and planning problems with many real-life applications, especially in manufacturing systems and production management. Finding a schedule for the sequential operations of various jobs by processing them on a defined number of machines, subject to various constraints, with the goal of minimizing the completion time of all operations, known as the Makespan, is a challenging problem. To address it, we propose a Monte Carlo Tree Search-based flexible job-shop scheduling algorithm called MCTS-FJS for scheduling highly complex jobs in a real-time job-shop environment. MCTS is a tree search technique for making sequential decisions under uncertainty; it computes reward values from sub-trees and repeatedly explores the most promising sub-tree. Experimental results showed that the MCTS-based scheduler outperformed various baseline scheduling algorithms and achieved the best evaluation performance on our sample dataset. More importantly, the results showed that the performance of the proposed algorithm improves as the number of jobs increases. Hence, this novel approach can be used to solve the complex FJSP in manufacturing systems.
* J. Y. Lee
[email protected]
M. Saqlain
[email protected]
S. Ali
[email protected]
1 Department of Computer Science, Chungbuk National University, Cheongju, Chungbuk 28644, Republic of Korea
Abbreviations
JSP Job-shop scheduling problem
FJSP Flexible job-shop scheduling problem
MCTS Monte Carlo Tree Search
RL Reinforcement learning
MDP Markov decision process
FIFO First in first out
SJF Shortest job first
LJF Longest job first
AWT Average waiting time
ARUT Average resource utilization time
1 Introduction
profitability (Vinod and Sridharan 2008). The planning of the JSP is one of the most challenging optimization problems and is known to be NP-hard (Asadzadeh 2015).
Many uncertainties can occur in manufacturing systems, e.g., processing time variation, diversity of products, unpredictable events such as order changes/cancellations and machine failures, complex priorities between various jobs, random process yield, or rush orders (Chaari et al. 2014). To overcome these issues, the flexible job-shop scheduling problem (FJSP) is considered; it is an extended form of the JSP that allows operations to be executed on different machines. The three most important elements of an FJSP are: (1) the sequence of operations, (2) the start and end times of operations, and (3) the resources assigned to all operations. The FJSP minimizes the effect of a sudden breakdown or overall disturbance in manufacturing production lines.
Many traditional methods for solving the FJSP focus on metaheuristic approaches or static scheduling methods rather than dynamic scheduling methods. Among the static scheduling methods are baseline scheduling techniques such as First In First Out (FIFO), Shortest Job First (SJF), and Longest Job First (LJF). Owing to their simple structure and easy decision making, these baseline techniques are widely used to solve scheduling problems (Zhang and Rose 2013; Pinedo 2008), but their strength is also their weakness: they are unable to adapt to varying situations in the manufacturing processes (Floudas and Lin 2005). Metaheuristic methods have the advantage of producing good results with comparatively low computational effort. Some of them have been developed with inspiration from nature and are thus named Evolutionary Algorithms (EAs) (Chiang and Lin 2012). One of the best-known and most frequently applied EAs for the FJSP is the Genetic Algorithm (GA), inspired by the natural selection process (Hou et al. 1994). An EA consists of static strategies in which all constraints and rules must be developed through a manual mathematical description. Adapting it to the FJSP therefore requires a high engineering effort, making it expensive for industrial operations. Additionally, in a real-world industrial environment the numbers of operations of different jobs differ rather than being the same (Bierwirth and Mattfeld 1999). This shows that the industrial system is stochastic, i.e., the state of the system keeps changing with the passage of time. This is a significant bottleneck that prevents EAs from achieving optimal performance. Thus, there is a need for a flexible scheduling policy for an FJSP modeled as a Markov Decision Process (MDP), which defines a rule to assign the different operations of a job to the available machines (Sutton and Barto 2018).
Machine learning (ML) is a prevalent domain of Artificial Intelligence (AI) that has become very popular owing to state-of-the-art algorithms with the ability to learn behaviors, functions, models, and patterns, and to use that knowledge to make intelligent decisions in the future. In recent years, reinforcement learning (RL) has gained huge attention in the field of ML research (Gosavi 2009). In an RL model, an agent receives rewards or penalties corresponding to its decisions or actions while interacting with the environment to achieve its goals (Shahrabi et al. 2017). RL algorithms have many applications, such as robotics, intelligent assistants, industrial control, and logistics, among others (Mnih et al. 2015). Monte-Carlo tree search (MCTS) is a kind of RL algorithm that has achieved remarkable success in the field of AI and outperformed human world champions in classic games such as chess (Campbell et al. 2002), Go (Silver et al. 2016), checkers (Schaeffer et al. 1992), and poker (Segler et al. 2018). On the other hand, the FJSP requires dynamic scheduling, in which a real-time decision can be taken for the next operation at a specific machine (Waschneck et al. 2016). Thus, we applied MCTS, a stochastic algorithm that obtains accurate results through random sampling. For the FJSP, the accuracy of the Monte-Carlo evaluation can be improved by the tree search, so a highly effective FJSP scheduling policy can be found with MCTS, given its ability to solve any problem that can be modeled as an MDP. An MDP is a model-based RL formulation consisting of two elements: a state transition model, which predicts the RL agent's next state after it takes an action, and a reward model, which predicts the expected reward resulting from the corresponding state transition. Once an MDP model has been constructed, the optimal policy or optimal value is computed by applying MCTS (Coulom 2006) or value iteration (Zhang et al. 2017).
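As a minimal illustration of this model-based view, the sketch below defines a generic interface with the two elements just described; the class and method names are ours, not from the paper, and the paper's own implementation is in C#.

```python
from abc import ABC, abstractmethod
from typing import Any


class MDPModel(ABC):
    """Generic model-based RL interface: a transition model plus a reward model."""

    @abstractmethod
    def transition(self, state: Any, action: Any) -> Any:
        """Predict the agent's next state after taking `action` in `state`."""

    @abstractmethod
    def reward(self, state: Any, action: Any, next_state: Any) -> float:
        """Predict the expected reward resulting from the state transition."""
```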
The main contribution of our study can be summarized as follows:
2 Related work

This section presents previous work on various scheduling models and the MCTS algorithm.
2.1 Job‑shop scheduling
Numerous researchers have contributed to the literature on the classic JSP. Jain and Meeran wrote a complete review paper investigating both approximate algorithms and exact methods applied to the JSP (Jain and Meeran 1999). Scheduling is a decision-making approach that is used regularly in all situations where a specific number of operations of different jobs must be processed on a particular number of resources (Zhang et al. 2019). Computer-based manufacturing scheduling is a very active area for optimization. It plays an important role in improving the productivity of a company, as it efficiently allocates the limited available resources to the numerous sequential industrial operations that arrive continuously. The resource allocation must follow a set of conditions or constraints that reflect the relationships between industrial operations and the limited capacities of machines.
Scheduling problems are classified according to various characteristics of the jobs and machines, for example, whether job pre-emption is allowed, whether all jobs require equal processing time, and whether a single machine or multiple parallel machines are scheduled. If a job has a specific number of operations requiring various machines for their processing, the problem is called a shop problem. It is known as one of the critical manufacturing problems because of its consequences for the supply chain and the whole company's performance. Depending on its constraints, a shop problem is classified as flow-shop, open-shop, or job-shop (Dios and Framinan 2016). All these shop problems are NP-hard and are solved by metaheuristic or approximation methods.
A JSP is a fundamental type of manufacturing problem that combines various similar production devices into closed units. In the JSP, each job may have multiple operations rather than a single operation (Leung 2004). It involves a set of machines and a set of jobs, where each job contains a fixed number of operations and each operation must be processed by one of the defined machines among all available machines within its corresponding time duration (Leusin et al. 2018). The main objective criterion of the JSP is to minimize the Makespan, which is achieved by allocating the available machines to the operations in such a way that the processing of all jobs finishes in the minimum duration (Reyna et al. 2015). A better understanding of the JSP with classic examples can be found in Gabel and Riedmiller (2008). The multiprocessor JSP (MJSP) is an extension of the classical JSP in which each machine is replaced by multiple parallel machines. Carballo et al. (2013) proposed a reduction algorithm to generate a feasible solution for the MJSP. Their method successfully reduces the solution space of the MJSP, as shown by intensive computational experiments.
A JSP contains various constraints such as logistic constraints (e.g., sudden
machine failure, varying lot sizes), technological constraints (e.g., varying pro-
cessing time, time coupling), and production quantity constraints (Waschneck
et al. 2018). Under the above conditions and constraints, the JSP becomes a com-
plex JSP. Planning and dispatching of such complex scheduling problems are cru-
cial to improving a manufacturing system’s economic and logistic efficiency. Var-
ious dynamic scheduling problems such as two machines, limiting the number of
machines, minimum preemptive schedule, non-preemptive schedule, etc., and their
solutions have been discussed in Leung (2004). An evaluation function and a mathematical model were proposed to minimize the Makespan of the JSP with mixed blocking constraints (Sauvey et al. 2020); their compatibility was evaluated with two metaheuristics, particle swarm optimization and a genetic algorithm. An algorithm was also proposed to solve a complex and generalized JSP that considerably reduces the solution space (Vakhania and Shchepin 2002); it generates far fewer feasible schedules than the total number of feasible active schedules.
2.2 Monte‑Carlo tree search

MCTS incrementally builds a search tree by repeating four basic steps:

• Selection This step starts from the root node and recursively selects the best child node based on an evaluation function until an expandable leaf node is reached. An expandable leaf node contains one or more unvisited child nodes, so it is a non-terminal state.
• Expansion Unvisited child nodes are added according to the available actions, unless no more actions can be taken at the current node.
• Simulation A random downward path is followed according to an evaluation function called the roll-out policy (default policy) until a leaf node (i.e., terminal state) is reached. Simulation is always applied at unvisited nodes and results in an evaluation, such as whether the goal is achieved or not (i.e., win/loss).
• Backpropagation The simulation result is propagated back up the tree, updating the statistics (e.g., visit count and reward value) of every node on the visited path.
One complete iteration of a single search of the MCTS algorithm with its four basic steps is shown in Fig. 1, which is a redesign of the original model from Browne et al. (2012). These four steps are repeated until either a predefined time runs out or a specific number of iterations has been processed. Once the search is terminated, the best-performing action at the root is selected according to the gathered statistics. The statistics of a visited node provide both exploration and exploitation information. For instance, a node with a low visit count should be explored more, whereas a high reward value shows how promising a node is and indicates that it should be exploited more.

Various evaluation functions are used to balance the exploration and exploitation rates in MCTS (Kocsis et al. 2020). The Upper Confidence Bound applied to trees (UCT) is one of the most commonly used evaluation functions for balancing the exploration/exploitation dilemma during the Selection phase of MCTS (Kocsis and Szepesvári 2006). This function helps to select the best child node to traverse among all child nodes. The UCT function is shown in Eq. 1, where s denotes a specific node and s′ a child node of s. Q(s′) is the total simulation reward of this node and N(s′) is the number of visits to this node. The first part of the function calculates the average reward value obtained if node s′ is selected and is therefore called the exploitation component. The second part increases when N(s′) is small relative to N(s), where N(s) denotes the number of visits to the predecessor node. If the sibling nodes have been visited more often, the overall value of the second part increases and the function prefers to explore this node; it is therefore called the exploration component. The constant c is used to control the trade-off between exploration and exploitation in MCTS and is generally determined empirically.
$$\mathrm{UCT}(s') = \frac{Q(s')}{N(s')} + c\sqrt{\frac{2\log N(s)}{N(s')}} \qquad (1)$$
Note that in this study, UCT is applied for reward minimization (i.e., Makespan
minimization) instead of reward maximization, which can be achieved by apply-
ing a negative sign between the first component and the second component of the
UCT-formula.
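For illustration, Eq. 1 with this sign convention can be written as a small Python function; the function names are ours, c = 0.05 is the value reported later for training, and the paper's own implementation is in C#, so this is only a sketch.

```python
import math


def uct(avg_makespan: float, n_child: int, n_parent: int, c: float = 0.05) -> float:
    """Eq. 1 with the negative sign described above: the average simulation
    reward Q(s')/N(s') minus the exploration term c*sqrt(2 log N(s) / N(s'))."""
    return avg_makespan - c * math.sqrt(2.0 * math.log(n_parent) / n_child)


def selection_score(avg_makespan: float, n_child: int, n_parent: int, c: float = 0.05) -> float:
    """The agent maximizes -UCT(s'), i.e. it prefers children with a low
    average Makespan and a comparatively low visit count."""
    return -uct(avg_makespan, n_child, n_parent, c)
```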
Initially, the main application area of MCTS was building game-playing algorithms that could defeat human players and reach master-level performance. In 2016, an MCTS-based algorithm introduced by Google DeepMind, namely AlphaGo, defeated the 18-time Korean world champion Lee Sedol in the game of Go by 4–1 (Silver et al. 2016). After this success in games, MCTS has recently been applied to many other areas. Runarsson et al. (2012) used Rollout, Pilot, and MCTS methods to solve a set of 300 JSP instances of various sizes. They found that MCTS outperformed the other methods for small and medium scheduling problems (i.e., fewer than 14 jobs and 14 machines), whereas for larger and more complex problems MCTS was not the better choice. The ε-greedy policy was used to balance the exploitation and exploration rates in the selection phase of MCTS. An MCTS algorithm combined with constraint programming (CP) was introduced to solve the JSP (Loth et al. 2013). The CP helps the MCTS to expand the tree at the child node with the best possible solution. Different evaluation functions were examined for the selection phase, but UCT was selected as the best choice for their specific problem.
Wu et al. (2013) proposed an MCTS-based multi-objective FJSP algorithm by combining MCTS with a variable neighborhood descent algorithm (VNDA). They improved the performance of their algorithm by applying various other techniques such as RAVE, LSONE, prior knowledge, subtree pruning, and a transposition table, and evaluated it with three objectives: Makespan, total workload, and maximum workload. Lubosch et al. (2018) combined MCTS with gradient boosted decision trees (GBDT), a machine learning method, to solve complex industrial scheduling problems. The GBDT was used to predict the best possible value of the parameter c in UCT, which helped to create a fully automated job-shop scheduling system. However, most of the previous studies apply MCTS to the simple JSP instead of the FJSP. An FJSP makes the scheduling environment much more complex and requires more efficient ways of using MCTS.
In this section, we focus on solving the FJSP with the proposed MCTS-based algorithm. The details are given in the following subsections.
To illustrate the FJSP, we used a sample dataset that contains three jobs with ten operations that can be processed by one or more machines from a given set of four machines. Various constraints, such as time constraints and level values of all operations, are given in Fig. 3a. For instance, Operation-0 of Job-0 can be executed on Machine-1 and Machine-2 with corresponding processing times of 7 and 12 min, while Machine-0 and Machine-3 are marked with "–", showing that they cannot process this operation. Its level value is Level-0, which shows that it is the lowest-priority operation and will be executed last among all the operations of Job-0.
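For concreteness, the single operation just described could be represented as follows; this is an illustrative Python sketch with our own field names (the paper's implementation is in C#), and "–" entries are simply omitted.

```python
# Alternative machines, processing times (minutes), and level value for
# Operation-0 of Job-0 from Fig. 3a; machines M0 and M3 cannot process it.
operation_0_of_job_0 = {
    "job": 0,
    "operation": 0,
    "machines": {"M1": 7, "M2": 12},  # eligible machines with processing times
    "level": 0,                        # lowest priority: executed last within Job-0
}
```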
Figure 3b shows a Gantt chart obtained by applying one of the simplest baseline scheduling algorithms, First In, First Out (FIFO), to our sample three-job FJSP dataset. FIFO processes the operations on a first-come, first-served basis. This chart follows all the constraints of the FJSP and thus represents a solution. The horizontal axis represents the processing time of the machines in minutes and the vertical axis represents all available machines. Each rectangle shows a specific operation labeled $O_y^x$, where x denotes the job number and y the operation number within that job. All operations of a single job are drawn in the same color for better observation.

Fig. 3 Sample problem of FJSP: a constraint table, and b corresponding Gantt chart with FIFO algorithm

It is clear that multiple operations of different jobs are processed at a time on parallel machines, while the multiple operations of an individual job are being
processed one after the other. The second task of a job starts only after the first task of the same job has been processed, and so on (i.e., operation preemption is not allowed). Moreover, the schedule also fulfills the sequential condition by following the level values. For instance, the processing sequence of all operations of Job-1 is $O_7^1 \rightarrow O_6^1 \rightarrow O_4^1 \rightarrow O_5^1 \rightarrow O_3^1$, following their corresponding priorities with level values 3 → 2 → 1 → 1 → 0. All machines process only a single operation in a given time slot. Operation $O_3^1$ is the last operation processed among all jobs on all parallel machines, so its finishing time is the Makespan (Cmax). Minimizing the Cmax value is the well-known efficiency evaluation method in production scheduling (Sriboonchandr et al. 2019).
3.2.1 Input state
The number of states of the scheduling environment can be huge due to the complex and flexible architecture. These states are the input of the MCTS-scheduling agent. Each state contains the information of the remaining jobs and the available machines. For better understanding, we use the notation [x, y, z, t, l] to represent each state as a job vector. Each job vector consists of numerous operations, and each operation contains five basic attributes: (1) job number x, (2) operation number y, (3) the machine at which the operation should be processed z (i.e., the requested machine), (4) processing time t, and (5) level value l. Thus, the agent implicitly finds a relationship between requested machines and available machines for each operation. For instance, in Fig. 4, we have two job vectors, [2, 2, M0, 7, 2] and [2, 3, M1, 10, 3], which show that two operations $O_2^2$ and $O_3^2$ of Job-2 are processed at machines M0 and M1, respectively, with corresponding execution times of 7 and 10 time units. The second operation has level value 3, so it is processed earlier, with higher priority, at machine M1 than the first operation, which has level value 2. Although these two operations could be assigned to multiple machines with different processing times, our input job vectors are generated after observing the job-shop environment and selecting only the best possible operations given the availability of the machines.
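A minimal sketch of this job-vector representation, populated with the two example operations above, is given below; it is an illustrative Python data structure, not the authors' C# code.

```python
from dataclasses import dataclass


@dataclass
class JobVector:
    """One entry of the input state: [x, y, z, t, l]."""
    job: int          # x: job number
    operation: int    # y: operation number
    machine: str      # z: requested machine for this operation
    proc_time: int    # t: processing time (time units)
    level: int        # l: level (priority) value


# The two example operations of Job-2 from Fig. 4.
state = [
    JobVector(job=2, operation=2, machine="M0", proc_time=7, level=2),
    JobVector(job=2, operation=3, machine="M1", proc_time=10, level=3),
]
# A higher level value means a higher priority, so the second vector is scheduled first.
```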
3.2.2 MCTS‑scheduling agent
The most essential component of RL is its agent, which learns precisely from the input state and applies accurate actions. The scheduling agent implements the MCTS algorithm to define the policy. It starts at the root node, which denotes the beginning state of the tree and becomes the initial state of the first job to be scheduled. The following nodes indicate the possible states attained after the agent chooses the possible actions, and the tree edges indicate the possible scheduling actions applied to process certain operations of a job on the available machines. The algorithm takes the operations of the currently awaiting jobs as input and provides an optimal schedule of those operations on the available machines as output. Our modified MCTS algorithm for the policy update is given in Algorithm 1. During each iteration of the while loop, the algorithm selects the best possible action or move and adds it to the final schedule. Here, TreePolicy() selects a leaf node from all visited nodes in the search tree during the selection and expansion phases of each iteration, and DefaultPolicy() plays out a simulation at a non-terminal state using the roll-out policy function to create a value estimate. This process continues until all jobs $J_n$ are processed and $S_{in}$ becomes $S_{complete}$. Each step of our modified MCTS algorithm is explained as follows.
Selection Starting at the root node, the best-child selection policy is recursively applied to descend the tree until an expandable leaf node is reached. An expandable leaf node has an unvisited child node and represents a non-terminal state (lines 1–5). In this paper, our goal is to minimize the Makespan while updating the policy for the selection of the best child node, which means that the agent should try to maximize –UCT(s′) (see Eq. 1).
Expansion There are three possible actions for the MCTS-scheduling agent at each node: FIFO, SJF, and LJF. According to these actions, new child nodes are added from the expandable leaf node (i.e., now the parent node). The above actions differ in how they allocate the various operations of the same job to machines. The best action is determined by comparing the overall processing time of all operations of a single job (lines 10–12).

Simulation At the leaf node, a simulation/playout step is started by following the default policy to create successive scheduling actions. This step continues until there are no more jobs to schedule, i.e., the agent reaches the terminal state. While selecting an action, we choose a random job $J_n$ and apply a heuristic approach to choose a machine $M_m$ to process that job. To enhance simulation quality, the method prefers to select one of the available machines with the lowest Makespan value; thus, this method is also called the greedy simulation method (lines 6–8).
Backpropagation After the completion of the simulation step, the MCTS-scheduler receives the simulation result in the form of a Makespan value. This result is then backpropagated to all the ancestor nodes in the tree by updating their statistics, such as the visit count and the average Makespan value. The updated statistics of the nodes are used to predict the possibility of selecting these nodes in the future (line 9).
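The statistics update described above can be written incrementally as in the following sketch; it is our own minimal Python version with our own names, not the authors' implementation.

```python
class Node:
    """Minimal search-tree node holding the statistics described above."""
    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0            # N(s'): number of visits
        self.avg_makespan = 0.0    # average Makespan of simulations through this node


def backpropagate(node, makespan):
    """Propagate a simulation result (a Makespan value) to all ancestor nodes."""
    while node is not None:
        node.visits += 1
        # incremental update of the running-average Makespan
        node.avg_makespan += (makespan - node.avg_makespan) / node.visits
        node = node.parent
```

These statistics are then used by the –UCT(s′) selection score during the next Selection phase.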
The above method helps to select the best scheduling action among the three actions for each job during the training process. As the number of input jobs increases, the tree becomes bigger and thus requires a higher number of simulations per iteration, which results in a more accurate selection of possible scheduling actions. While selecting an action, the performance of UCT is significantly enhanced by incorporating domain knowledge into the default policy of the tree. The main advantages of the proposed MCTS-based method are its effective simulation technique and the ability to stop the simulation at any time according to the available computation capacity. The disadvantage of this method is that it is challenging to set the value of the constant c of the UCT function that controls the trade-off between exploration and exploitation.
3.2.3 Job‑shop environment
3.2.4 Scheduling action
An RL agent gets the state and observation from the environment and, in response, takes the best action. In the scheduling problem, the action is simply selecting a job from the job vector and scheduling it on the available machines. There are three possible actions in our problem: FIFO, SJF, and LJF. According to the availability of machines, the best possible action is selected to schedule the job, as sketched below.
3.2.5 Reward function
In an RL-based job-shop environment, the reward function plays a vital role in improving the overall performance of the algorithm. A reward is the feedback that an RL agent gets from the environment after applying an action, and it strongly influences the scheduling policy, which is essential for updating the final policy. In every episode, the agent applies an action and receives a reward. This is an immediate reward, not the average reward. To calculate the average reward, the whole sequence of all jobs is scheduled until the terminal state is reached, and the agent then receives the final reward –UCT(s′)avg. The UCT calculates the final reward score by combining all exploration terms that encourage sampling infrequently chosen actions. A full T-length trajectory is a sequence of state–action pairs $s_0 a_0, s_1 a_1, \ldots, s_{T-1} a_{T-1}$. Consequently, the reward value of a state, action, and policy parameter (s, a, θ) is computed as the average return acquired after experiencing various states.
3.3 Training MCTS‑algorithm
We train a scheduling agent with the MCTS algorithm. The training process relies on policy gradient techniques with various Monte-Carlo simulation episodes. The algorithm receives a set of input states, the possible actions, and a random policy value (i.e., the default policy) as input, defined as $\pi(a \mid s, \theta)$, where a denotes the set of actions, s the set of states, and θ the random policy parameters. During training, we run various episodes to improve the overall policy. Each episode is a complete schedule of the different operations of a single job, beginning from the initial state s0, action a0, and the corresponding reward value r0, to the last state sn, action an, and the final reward value rn. After every step of each episode, the policy parameters θ are updated according to Eq. 2 (Cheng et al. 2019).
$$\nabla \ln \pi\left(a_t \mid s_t, \theta\right) = \frac{\nabla \pi\left(a_t \mid s_t, \theta\right)}{\pi\left(a_t \mid s_t, \theta\right)} \qquad (2)$$
The agent gets the reward rt at each step t between the initial and the final state. This is the immediate reward, calculated after scheduling job Jt on the available machines; thus, rt is dynamically created according to the various scheduling actions. For each step of a training episode, the MCTS algorithm defines the long-term reward value R = –UCT(s′)avg using a discount factor of γ = 0.99 and a constant parameter c = 0.05. This long-term reward R is used to compute the final optimal policy, as defined in Eq. 3.
$$\pi^{*}\left(a_t \mid s_t, \theta\right) = \alpha \gamma^{t} R \nabla \ln \pi\left(a_t \mid s_t, \theta\right) \qquad (3)$$
where α represents the step size of each training update, which is always greater than zero. The maximum tree length for each simulation is set to 163 episodes, the total number of operations, after which the episode is considered a tie. The MCTS-scheduling algorithm processes 7 episodes per second and takes an average of 23.3 s to completely schedule a problem of 50 jobs.
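As a hedged illustration of Eqs. 2 and 3, the sketch below shows a REINFORCE-style reading with a softmax policy over the three actions; the step size ALPHA and the example call values are our assumptions, while γ = 0.99 follows the text, and this is not the authors' C# implementation.

```python
import numpy as np

ACTION_NAMES = ["FIFO", "SJF", "LJF"]
ALPHA, GAMMA = 0.01, 0.99      # assumed step size; discount factor from the text


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def policy_gradient_update(theta, episode_actions, long_term_reward):
    """Update the softmax-policy parameters theta for one episode.
    `episode_actions` holds the chosen action indices and `long_term_reward`
    is R = -UCT(s')_avg returned by the MCTS roll-outs."""
    for t, a in enumerate(episode_actions):
        pi = softmax(theta)
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0   # gradient of ln pi(a|theta) for a softmax policy (Eq. 2)
        theta = theta + ALPHA * (GAMMA ** t) * long_term_reward * grad_log_pi   # Eq. 3
    return theta


# Example episode with three scheduling decisions and a Makespan-scale reward.
theta = policy_gradient_update(np.zeros(len(ACTION_NAMES)), [0, 1, 1], -705.0)
```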
3.4 Performance evaluation
The MCTS-scheduling agent was integrated with the job-shop environment through the MDP interface and was implemented with the objective of exploring an optimal scheduling policy. The simulation results were evaluated using the following three performance criteria:

(a) Makespan (Cmax) This criterion is applied to calculate the total length of the schedule, which determines the overall performance of the model.
(b) Average waiting time (AWT) The waiting time of a job includes its waiting frequencies, completion time, and response time. The AWT criterion is used to determine whether the waiting time of the jobs decreases or not.
(c) Average resource utilization time (ARUT) This represents the total utilization of each machine, including the length of the queue for each machine. This criterion is applied to determine whether the applied scheduling methods improve the efficiency and productivity of the scheduling system or not (see the sketch below).
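The three criteria can be computed from a finished schedule as in the following sketch. The exact AWT and ARUT formulas are not spelled out in the text, so the definitions below (per-job idle time and per-machine busy fraction, with machines indexed 0..n_machines-1) are assumptions made purely for illustration.

```python
def evaluate(schedule, n_machines):
    """Compute (Cmax, AWT, ARUT) from a schedule given as a list of
    (machine, start, end, job) tuples with times in minutes."""
    makespan = max(end for _, _, end, _ in schedule)                      # Cmax
    jobs = {j for _, _, _, j in schedule}
    waits = []
    for j in jobs:
        ops = sorted((s, e) for _, s, e, jj in schedule if jj == j)
        busy = sum(e - s for s, e in ops)
        waits.append(ops[-1][1] - ops[0][0] - busy)   # idle time between a job's operations
    awt = sum(waits) / len(waits)
    busy_per_machine = [sum(e - s for m, s, e, _ in schedule if m == i)
                        for i in range(n_machines)]
    arut = sum(b / makespan for b in busy_per_machine) / n_machines       # busy fraction
    return makespan, awt, arut


# Example: evaluate(schedule, n_machines=5) for the five-machine case study.
```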
4.1 Case study
Table 1 Processing times (min) of the sample jobs' operations on the five machines (excerpt)

| Job | Operation | Level | M0 | M1 | M2 | M3 | M4 |
|-----|-----------|-------|----|----|----|----|----|
| J0  | $O_0^0$   | 0     | –  | –  | 17 | 22 | –  |
|     | $O_1^0$   | 1     | –  | –  | –  | –  | 20 |
|     | $O_2^0$   | 2     | –  | 20 | –  | –  | –  |
|     | $O_3^0$   | 2     | –  | –  | –  | 23 | 18 |
| J1  | $O_4^1$   | 0     | 16 | 22 | –  | –  | –  |
| J2  | $O_5^2$   | 0     | –  | –  | –  | 18 | 22 |
| J3  | $O_6^3$   | 0     | –  | –  | 21 | –  | 22 |
|     | $O_7^3$   | 1     | –  | –  | 17 | –  | –  |
|     | $O_8^3$   | 1     | –  | –  | 21 | 25 | –  |
|     | $O_9^3$   | 2     | –  | –  | –  | –  | 19 |
|     | $O_{10}^3$ | 3    | –  | –  | –  | 18 | –  |
| ⋮   | ⋮         | ⋮     |    |    |    |    |    |
| J48 | $O_{157}^{48}$ | 0 | 20 | –  | 23 | –  | –  |
|     | $O_{158}^{48}$ | 1 | –  | 15 | –  | 23 | –  |
|     | $O_{159}^{48}$ | 1 | –  | 23 | –  | –  | –  |
|     | $O_{160}^{48}$ | 1 | 24 | –  | –  | –  | –  |
| J49 | $O_{161}^{49}$ | 0 | –  | –  | 22 | –  | 20 |
|     | $O_{162}^{49}$ | 1 | 17 | –  | 16 | –  | –  |

*$J_x$ denotes the xth job; $O_y^x$ the yth operation of the xth job; and $M_m$ the mth machine
last two jobs of the dataset. Each job contains numerous operations, some of which can be processed on more than one machine. In total, 163 sample operations are generated to be executed on five different machines. The simulation model is designed and executed using Visual Studio 2017 and the C# language, and it is run on an Intel CPU E5-2696 v5 @ 4.40 GHz with 512 GB RAM under the Windows 10 operating system.

Moreover, all operations face various processing time uncertainties (PTU): constant, uniform, and triangular. The processing time of constant operations does not vary significantly and has a fixed value. The uniform operations have limited information and take processing time values between a lower bound and an upper bound (i.e., 13–17). The triangular operations have little information and are described by three processing time parameters: lower bound, mode, and upper bound (i.e., 6–11–19). For simplicity, we took the average value of each uniform and triangular operation. The processing times of all operations are given in minutes under their corresponding machines in Table 1. Additionally, we assume that all machines are failure-free and process the jobs continuously.
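A small sketch of this averaging step, under our reading of the text, is shown below; the function name and representation are ours.

```python
def expected_processing_time(kind, params):
    """Average processing time used to simplify uncertain operations: the midpoint
    of a uniform range and the mean of the three parameters of a triangular one,
    e.g. (13, 17) -> 15 and (6, 11, 19) -> 12."""
    if kind == "constant":
        return params[0]
    if kind == "uniform":        # (lower, upper)
        return sum(params) / 2
    if kind == "triangular":     # (lower, mode, upper)
        return sum(params) / 3
    raise ValueError(kind)


assert expected_processing_time("uniform", (13, 17)) == 15
assert expected_processing_time("triangular", (6, 11, 19)) == 12
```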
4.2 Experimental results
MCTS-FJS starts with the Selection phase. It selects a random job and applies all three possible actions, FIFO, SJF, and LJF, thus visiting all child nodes at least once. The best child node or action is selected by applying the UCT evaluation function (see Sect. 2.2). When an action is applied and the agent reaches a new child node, the Expansion phase is triggered by following the tree policy: all child nodes with no visits so far are selected and added to the tree as new nodes. Simulation is then applied one by one from each of these new nodes down to the leaf nodes by following the default policy and randomly selecting nodes until the agent reaches the terminal node. Each node visited using the tree policy is updated with the simulation results in the final Backpropagation phase. These four phases of our proposed MCTS-FJS algorithm are shown in Fig. 5, where the value V gives the number of visits of each node and the finally selected actions/nodes are shown in blue. For instance, SJF is selected as the best action to schedule the first job and FIFO to schedule the second job in our experiment, and so on. This process continues until all jobs have been successfully scheduled.
The performance results, with a graphical comparison using the three performance criteria, are shown in Fig. 6. The MCTS-FJS algorithm determined a schedule of fifty jobs with a Makespan of 705 min, which is significantly lower than the Makespans of FIFO, SJF, and LJF of 733, 760, and 789 min, respectively, as shown in Fig. 6a. Our method outperformed all the baseline scheduling methods and reduced the scheduling time by up to 3.8%, 7.2%, and 10.6% compared with FIFO, SJF, and LJF, respectively. MCTS-FJS achieved this improvement because it explored early the information on the processing times of incoming operations and their corresponding available machines by
ing available machines by applying a simulation using UCT evaluation function to
13
M. Saqlain et al.
Fig. 6 Performance comparison of different scheduling algorithms using various performance criteria: a
Makespan, b AWT, and c ARUT. Note FIFO denotes first in first out; SJF shortest job first; LJF longest
job first; MCTS-FJS Monte Carlo tree search for flexible job shop
make a final scheduling action. Whereas all the baseline algorithms just naively fol-
lowed their fixed rules.
The proposed scheduling method also reduced the AWT of all available machines for a complete schedule, achieving an AWT of only 3.5%, compared with the baseline scheduling methods, which all had an AWT of more than 9.3%, as shown in Fig. 6b. Due to the decline in the AWT, the number of incoming jobs
in the scheduling system also decreased, which resulted in earlier completion of the overall schedule. In contrast, MCTS-FJS obtained the highest ARUT of all machines by maximizing resource utilization, reaching a value of 96.5%, as shown in Fig. 6c. It increased the average resource utilization by up to 6.0%, 9.6%, and 6.0% compared with FIFO, SJF, and LJF, respectively, thereby improving the overall efficiency of the scheduling system.
Figure 7 presents a Gantt chart solution obtained with the proposed MCTS-FJS model on the defined problem of 50 jobs, 163 operations, and 5 machines, achieving a Makespan of 705 min. It follows all the FJSP constraints, as well as the additional sequential constraint of the level values discussed in Sect. 3.1. For better observation, all operations of a single job are drawn as rectangles of the same color. The blank white spaces between two operations show the idle or waiting time of the machine.
4.3 Discussion
Table 2 Comparison of Cmax, AWT, and ARUT of the applied algorithms for different numbers of jobs (n), machines (m), and operations (o); bold values indicate the best performance for each criterion
Fig. 8 Makespan comparison of all scheduling algorithms with increasing number of jobs
5 Conclusions
This paper proposed an effective scheduling algorithm based on the MCTS strategy for the flexible job-shop scheduling problem, which successfully improved the overall manufacturing performance. The MCTS-scheduling agent receives information about all scheduling rules and optimizes its policy autonomously by applying the MCTS algorithm. The proposed method also considers the additional sequential constraint, making the job-shop environment highly flexible and agile in dealing with random events. Our findings show that the proposed method outperformed all the baseline scheduling algorithms, namely FIFO, SJF, and LJF, and reduced the scheduling time of the 50-job problem by up to 3.8%, 7.2%, and 10.6%, respectively. Additionally, we found that the performance of the proposed method gradually improves as the number of input jobs increases. Thus, the proposed algorithm can be used to solve the complex FJSP in the manufacturing industries and improve their efficiency and productivity in the field.
A future extension of our study would be the use of deep reinforcement learning with a deep neural network to train the MCTS algorithm (Kartal et al. 2019), unlike the current study, in which we used simple reinforcement learning-based MCTS. Various hyper-parameter settings can play a vital role in improving the performance of MCTS (Orhean et al. 2017), so we will also work on hyper-parameter tuning methods to enhance the performance of the proposed MCTS-FJS algorithm.
Acknowledgements This work was done with the collaboration of the Singapore Institute of Manufactur-
ing Technology (SIMTech). We thank Dr. Byung Jun Joo for providing us with the real-time industrial dataset from SIMTech. Funding was provided by the Ministry of Trade, Industry and Energy (Grant No. N0002429)
and National Research Foundation of Korea (Grant No. 2017R1D1A1A02018718).
Funding This work was supported by the KIAT (Korea Institute for Advancement of Technology) grant
funded by the Korea Government (MOTIE: Ministry of Trade Industry and Energy) (No. N0002429). It
was also supported by the Basic Science Research Program through the National Research Foundation of
Korea (NRF) funded by the Ministry of Education (2017R1D1A1A02018718).
References
Ahire S, Greenwood G, Gupta A, Terwilliger M (2007) Workforce-constrained preventive mainte-
nance scheduling using evolution strategies. Decis Sci 31(4):833–859. https://doi.org/10.1111/j.
1540-5915.2000.tb00945.x
Asadzadeh L (2015) A local search genetic algorithm for the job shop scheduling problem with intel-
ligent agents. Comput Ind Eng 85:376–383. https://doi.org/10.1016/j.cie.2015.04.006
Baier H, Drake PD (2011) The power of forgetting: Improving the last-good-reply policy in Monte
Carlo Go. IEEE Trans Comput Intell AI Games 2(4):303–309. https://doi.org/10.1109/TCIAIG.
2010.2100396
Bierwirth C, Mattfeld DC (1999) Production scheduling and rescheduling with genetic algorithms.
Evol Comput 7(1):1–17. https://doi.org/10.1162/evco.1999.7.1.1
Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI
Gym. arXiv:1606.01540
Browne C, Powley E, Whitehouse D, Lucas S, Cowling PI, Rohlfshagen P, Tavener S, Perez D, Samothrakis
S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans Comput Intell AI Games
4(1):1–49. https://doi.org/10.1109/TCIAIG.2012.2186810
Campbell M, Hoane AJ, Hsu F-H (2002) Deep blue. Artif Intell 134(1–2):57–83. https://doi.org/10.1016/
S0004-3702(01)00129-1
Carballo L, Vakhania N, Werner F (2013) Reducing efficiently the search tree for multiprocessor job-shop
scheduling problems. Int J Prod Res 51(23–24):7105–7119. https://doi.org/10.1080/00207543.2013.
837226
Chaari T, Chaabane S, Aissani N, Trentesaux D (2014) Scheduling under uncertainty: Survey and research
directions. ICALT pp 229–234. https://doi.org/10.1109/ICAdLT.2014.6866316
Chaslot G, Bakkes S, Szita I, Spronck P (2008) Monte-Carlo tree search: a new framework for game AI. AIIDE pp 216–217
Cheng Y, Wu Z, Liu K, Wu Q, Wang Y (2019) Smart DAG tasks scheduling between trusted and untrusted
entities using the MCTS method. Sustainability 11(7):1826. https://doi.org/10.3390/su11071826
Chiang T-C, Lin H-J (2012) Flexible job shop scheduling using a multiobjective memetic algorithm. Adv
Intell Comput Theories Appl pp 49–56. https://doi.org/10.1007/978-3-642-25944-9_7
Chiang T-C, Lin H-J (2013) A simple and effective evolutionary algorithm for multiobjective flexible job
shop scheduling. Intern J Prod Econ 141(1):87–98. https://doi.org/10.1016/j.ijpe.2012.03.034
Coulom R (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. International confer-
ence on computers and games pp 72–83. https://doi.org/10.1007/978-3-540-75538-8_7
Dios M, Framinan JM (2016) A review and classification of computer-based manufacturing scheduling tools.
Comput Ind Eng 99:229–249. https://doi.org/10.1016/j.cie.2016.07.020
Fera M, Fruggiero F, Lambiase A, Martino G, Nenni ME (2013) Production scheduling approaches for operations management. https://doi.org/10.5772/55431
Floudas CA, Lin X (2005) Mixed integer linear programming in process scheduling: modeling, algorithms,
and applications. Ann Oper Res 139(1):131–162. https://doi.org/10.1007/s10479-005-3446-x
Gabel T, Riedmiller M (2008) Adaptive reactive job-shop scheduling with reinforcement learning agents. Int
J Inf Technol Intell Comput 24(4)
Gosavi A (2009) Reinforcement learning: A tutorial survey and recent advances. INFORMS J Comput
21(2):178–192. https://doi.org/10.1287/ijoc.1080.0305
Hou ESH, Ansari N, Ren H (1994) A genetic algorithm for multiprocessor scheduling. IEEE Trans Parallel
Distrib Syst 5(2):113–120. https://doi.org/10.1109/71.265940
Jain AS, Meeran S (1999) Deterministic job-shop scheduling: past, present and future. Eur J Oper Res
113(2):390–434. https://doi.org/10.1016/S0377-2217(98)00113-1
Joo BJ, Shim S-H, Chua TJ, Cai TX (2018) Multi-level job scheduling under processing time uncertainty.
Comput Ind Eng 120:480–487. https://doi.org/10.1016/j.cie.2018.02.003
Kartal B, Hernandez-Leal P, Taylor ME (2019) Action guidance with MCTS for deep reinforcement learning. arXiv:1907.11703v1
Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. ECML pp 282–293. https://doi.org/10.
1007/11871842_29
Kocsis L, Szepesvári C, Willemson J (2020) Improved Monte-Carlo search
Leung YTJ (2004) Handbook of scheduling: algorithms, models and performance analysis. Chapman &
Hall, London. https://doi.org/10.1201/9780203489802
Leusin ME, Frazzon EM, Maldonado MU, Kück M, Freitag M (2018) Solving the job-shop scheduling prob-
lem in the industry 4.0 era. Technologies 6:107. https://doi.org/10.3390/technologies6040107
Li M, Yao L, Yang J, Wang Z (2014) Due date assignment and dynamic scheduling of one-of-a-kind assem-
bly production with uncertain processing time. Int J Comput Integr Manuf 28(6):1–12. https://doi.org/
10.1080/0951192X.2014.900859
Loth M, Sebag M, Hamadi Y, Schoenauer M, Schulte C (2013) Hybridizing constraint programming and
Monte-Carlo tree search: application to the job shop problem. ICLIO. https://doi.org/10.1007/978-3-
642-44973-4_35
Lu L, Zhang W, Gu X, Ji X, Chen J (2020) HMCTS-OP: Hierarchical MCTS based online planning in the
asymmetric adversarial environment. Symmetry 12(5):1–17. https://doi.org/10.3390/sym12050719
Lubosch M, Kunath M, Winkler H (2018) Industrial scheduling with Monte Carlo tree search and machine learning. Procedia CIRP 72:1283–1287. https://doi.org/10.1016/j.procir.2018.03.171
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidje-
land AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra
D, Legg S, Hassabis D (2015) Human-level control through deep reinforcement learning. Nature
518(7540):529–533. https://doi.org/10.1038/nature14236
Moras R, Smith ML, Kumar KS, Azim MA (1997) Analysis of antithetic sequences in flowshop scheduling
to minimize makespan. Prod Plan Control 8(8):780–787. https://doi.org/10.1080/095372897234678
Orhean AI, Pop F, Raicu I (2017) New scheduling approach using reinforcement learning for heterogeneous
distributed systems. J Parallel Distrib Comput. https://doi.org/10.1016/j.jpdc.2017.05.001
Pinedo ML (2008) Scheduling: theory, algorithms, and systems. https://doi.org/10.1007/978-0-387-78935-4
Reyna YCF, Jiménez YM, Cabrera JMB, Hernández BMM (2015) A reinforcement learning approach for scheduling problems. Revista Investigacion Operacional 36(3):225–231
Runarsson TP, Schoenauer M, Sebag M (2012) Pilot, rollout and Monte Carlo tree search methods for job
shop scheduling. pp 160–174. https://doi.org/10.1007/978-3-642-34413-8_12
Sauvey C, Trabelsi W, Sauer N (2020) Mathematical model and evaluation function for conflict-free war-
ranted makespan minimization of mixed blocking constraint job-shop problems. Mathematics 8(1):121.
https://doi.org/10.3390/math8010121
Schaeffer J, Culberson J, Treloar N, Knight B, Lu P, Szafron D (1992) A world championship caliber check-
ers program. Artif Intell 53(2–3):273–289. https://doi.org/10.1016/0004-3702(92)90074-8
Segler MHS, Preuss M, Waller MP (2018) Planning chemical syntheses with deep neural networks and sym-
bolic AI. Nature 555(7698):604–610. https://doi.org/10.1038/nature25978
Shahrabi J, Adibi MA, Mahootchi M (2017) A reinforcement learning approach to parameter estimation in
dynamic job shop scheduling. Comput Ind Eng. https://doi.org/10.1016/j.cie.2017.05.026
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Driessche GVD, Schrittwieser J, Antonoglou I, Pan-
neershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T,
Leach M, Kavukcuoglu K, Graepel T, Hassabis D (2016) Mastering the game of Go with deep neural
networks and tree search. Nature 529(7585):484–489. https://doi.org/10.1038/nature16961
Sriboonchandr P, Kriengkorakot N, Kriengkorakot P (2019) Improved differential evolution algorithm for
flexible job shop scheduling problems. Math Comput Appl 24(3):80. https://doi.org/10.3390/mca24
030080
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. The MIT Press
Vakhania N, Shchepin E (2002) Concurrent operations can be parallelized in scheduling multiprocessor job
shop. J Sched 5(3):227–245. https://doi.org/10.1002/jos.101
Vinod V, Sridharan R (2008) Scheduling a dynamic job shop production system with sequence-dependent
setups: an experimental study. Robot Comput-Integrated Manuf 24(3):435–449. https://doi.org/10.
1016/j.rcim.2007.05.001
Walsh TJ, Goschin S, Littman ML (2010) Integrating sample-based planning and model-based reinforcement
learning. AAAI
Waschneck B, Reichstaller A, Belzner L, Altenmüller T, Bauernhansl T, Knapp A, Kyek A (2018) Optimiza-
tion of global production scheduling with deep reinforcement learning. Procedia CIRP 72:1264–1269.
https://doi.org/10.1016/j.procir.2018.03.212
Waschneck B, Altenmüller T, Bauernhansl T, Kyek A (2016) Production scheduling in complex job shops from an industrie 4.0 perspective: a review and challenges in the semiconductor industry. SAMI
Wu T-Y, Wu I-C, Liang C-C (2013) Multi-objective flexible job shop scheduling problem based on Monte-
Carlo tree search. Conference on technologies and applications of artificial intelligence, pp 73–78.
https://doi.org/10.1109/TAAI.2013.27
Zhang T, Xie S, Rose O (2017) Real-time job shop scheduling based on simulation and Markov decision
processes. WSC pp 3899–3907. https://doi.org/10.1109/WSC.2017.8248100
Zhang D, Dai D, He Y, Bao FS (2019) RLScheduler: learn to schedule HPC batch jobs using deep reinforce-
ment learning. arXiv:1910.08925v1
Zhang T, Rose O (2013) Intelligent dispatching in dynamic stochastic job shops. WSC. https://doi.org/10.
1109/WSC.2013.6721634
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
M. Saqlain received his B.S. degree in Software Engineering from Government College University
Faisalabad (GCUF), Pakistan in 2014, and M.S. degree in the same major from National University of
Science and Technology (NUST), Pakistan in 2016. He is now a Ph.D. candidate at the Department of Computer Science at Chungbuk National University, Republic of Korea. His research interests include
data mining, artificial intelligence, machine learning, deep learning, reinforcement learning, and smart
manufacturing.
S. Ali received his B.E. degree in Computer Engineering from Mehran University of Engineering & Tech-
nology, Pakistan in 2015. He is now an M.S. candidate at the Department of Computer Science at Chungbuk
National University, Republic of Korea. His research interests include bioinformatics, data mining, artifi-
cial intelligence, and cardiovascular disease.
J. Y. Lee received the B.E. and M.E. degrees in computer engineering and the Ph.D. degree in computer
science from Chungbuk National University, South Korea, in 1985, 1987, and 1999, respectively. He was
a Research/Project Leader with the Institute of Software Research and Development, Hyundai Electron-
ics Industrial Company Ltd., and Hyundai Information Technologies Company Ltd., South Korea, from
1990 to 1996. He was with BIT Computer Cooperation in 1989. He was an assistant professor with the
Department of Information and Communication Engineering, Kangwon National University at Samcheok
Campus, from 1999 to 2003. Since then, he has been a full professor with the Department of Software Engineering, Chungbuk National University, South Korea. He was the president of the Korea Convergence Society from January 2010 to December 2017 and has been the editor-in-chief of its journal since January 2020. His current research interests include medical databases, query processing and optimization techniques in databases, fault detection in semiconductor manufacturing, production scheduling in smart factories, and machine and reinforcement learning.