Flexible Services and Manufacturing Journal

https://doi.org/10.1007/s10696-021-09437-4

A Monte-Carlo tree search algorithm for the flexible job-shop scheduling in manufacturing systems

M. Saqlain1 · S. Ali1 · J. Y. Lee1

Accepted: 6 December 2021


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature
2021

Abstract
The flexible job-shop scheduling problem (FJSP) is an extension of the simple JSP
with the additional feature of routing flexibility. It is an essential class of sequencing
and planning problems that arises in many real-life applications, especially in the
field of manufacturing systems and production management. Finding a schedule for
the sequential operations of various jobs by processing them on a defined number of
machines under various constraints, with the goal of minimizing the completion time
of all operations, known as the Makespan, is a major challenge. To address this issue,
we propose a Monte Carlo Tree Search-based flexible job-shop scheduling algorithm
called MCTS-FJS for scheduling highly complex jobs in a real-time job-shop
environment. MCTS is a tree search technique aimed at making sequential decisions
under uncertainty, calculating reward values from sub-trees, and regularly exploring
the most promising sub-tree. Experimental results showed
that MCTS-scheduler outperformed various baseline scheduling algorithms and got
the best evaluation performance for our sample dataset. More importantly, results
showed that the performance of the proposed algorithm improved with increasing
the number of jobs. Hence, this novel approach can be used to solve the complex
FJSP in manufacturing systems.

Keywords Flexible job-shop scheduling · Production scheduling · Monte-Carlo tree search · Reinforcement learning · Smart manufacturing

* J. Y. Lee
[email protected]
M. Saqlain
[email protected]
S. Ali
[email protected]
1 Department of Computer Science, Chungbuk National University, Cheongju, Chungbuk 28644, Republic of Korea

Abbreviations
JSP Job-shop scheduling problem
FJSP Flexible job-shop scheduling problem
MCTS Monte Carlo Tree Search
RL Reinforcement learning
MDP Markov decision process
FIFO First in first out
SJF Shortest job first
LJF Longest job first
AWT Average waiting time
ARUT Average resource utilization time

1 Introduction

The modern manufacturing environments contain multiple elements such as
different customers, suppliers, huge numbers of products, numerous machines with
multiple operations, and complex shop configurations, making it highly complex,
dynamic, and uncertain in nature. Such an environment requires a flexible pro-
duction system to manage resource allocation, control, production planning and
scheduling, and optimization in real-time. The highly competitive nature of recent
manufacturing is forcing all the modern industries to execute high-tech and auto-
mated applications. The industrial operations can be optimized by exploiting and
managing the industrial data and so the manufacturing system can fulfill the goals
of production scheduling, self-adaptation, resilience, and intelligence. Thus, the job-
shop scheduling problem (JSP) for a dynamic industrial environment is an important
research issue in a flexible manufacturing system. The JSP is a crucial non-trivial
optimization issue in operations research and computer science. It is one of the
essential industrial activities and has been successfully applied in the fields of
distribution, transportation, manufacturing (resource assignment and task allocation), computer
design, and communication (Ahire et al. 2007). Moreover, scheduling is vital to
reduce time, energy, and overall cost and to improve performance and quality of the
manufacturing production.
An ideal JSP is a method of assigning several jobs n to the available resources m
(here, machines) at a specific time within the scope of the intelligent manufacturing
to optimize some objective function (Fera et al. 2013). Each job contains multiple
operations and a sequential order in which each operation must be scheduled. Each
operation can be processed only on one of the specified machines. Once an opera-
tion has been scheduled on a machine and processing has been started, the same
machine cannot process another operation until it finishes the current operation.
Each machine must process only one operation at a time. It is critical to decide how
to schedule all operations on the available machines so that the length of the sched-
ule is minimized. In JSP, the length of the schedule is called Makespan (Cmax), which
is the time taken since the first job started until the last operation of the last job is
processed (Chiang and Lin 2013). Makespan reduction increases overall efficiency
and resource utilization, decreases the processing time of operations, and maximizes
profitability (Vinod and Sridharan 2008). The planning of the JSP is one of the most
challenging optimization problems or NP-hard problems (Asadzadeh 2015).
Many ambiguities can occur in manufacturing systems, e.g., processing time vari-
ation, diversity of products, unpredictable events such as order change/cancel and
machine failure, complex priorities between various jobs, random process yield, or
rush order (Chaari et al. 2014). To overcome these issues, a flexible job-shop sched-
uling problem (FJSP) is required, which is an extended form of JSP and allows the
operations to be executed on different machines. There are three most important ele-
ments of FJSP: (1) services sequence, (2) start and end time of operations, and (3)
resources of all services. FJSP minimizes the effect of a sudden breakdown or over-
all disturbance in manufacturing production lines.
Many traditional methods for solving the FJSPs only focus on metaheuristics
approaches or static scheduling methods rather than dynamic scheduling methods.
Among these static scheduling methods, there are different baseline scheduling
techniques such as First In First Out (FIFO), Shortest Job First (SJF), and Long-
est Job First (LJF). Due to the simple structure and easy decision-making power,
these baseline techniques are widely used to solve the scheduling problems (Zhang
and Rose 2013; Pinedo 2008), but their strength is also their weakness, and they are
unable to adapt to the varying situation in the manufacturing processes (Floudas and
Lin 2005). These methods have the advantage of getting good results with compara-
tively low computational efforts. Some of these methods have been developed with
the inspiration of nature, thus named Evolutionary Algorithms (EAs) (Chiang and
Lin 2012). One of the best-known and most widely applied EAs for the FJSP is the
Genetic Algorithm (GA), inspired by the natural selection process (Hou et al. 1994).
An EA consists of static strategies and requires all constraints and rules to be
developed through a manual mathematical description. Adapting an EA to the FJSP
therefore demands considerable engineering effort, which makes it expensive for
industrial operations. Additionally, in a real-world industrial environment the number
of operations differs from job to job (Bierwirth and Mattfeld 1999). This shows that
the industrial system is stochastic, which means the state of the system keeps
changing with the passage of time. This is a significant bottleneck that prevents the
EAs from achieving optimal performance. Thus, there is a need for a flexible sched-
uling policy for an FJSP assumed as the Markov Decision Process (MDP), which
defines a rule to assign different operations of a job on available machines (Sutton
and Barto 2018).
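The three baseline rules named above can be illustrated with a short sketch. This is only an illustration under simple assumptions: the `QueuedJob` record, its fields, and the tie-breaking by arrival order are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedJob:
    name: str        # hypothetical job identifier
    arrival: int     # position in which the job entered the queue
    proc_time: int   # total processing time of the job


def dispatch_order(queue, rule):
    """Return the jobs in the order a static dispatching rule would pick them."""
    if rule == "FIFO":      # first in, first out
        key = lambda j: j.arrival
    elif rule == "SJF":     # shortest job first, ties broken by arrival (assumption)
        key = lambda j: (j.proc_time, j.arrival)
    elif rule == "LJF":     # longest job first, ties broken by arrival (assumption)
        key = lambda j: (-j.proc_time, j.arrival)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return sorted(queue, key=key)


queue = [QueuedJob("J0", 0, 12), QueuedJob("J1", 1, 5), QueuedJob("J2", 2, 20)]
fifo = [j.name for j in dispatch_order(queue, "FIFO")]
sjf = [j.name for j in dispatch_order(queue, "SJF")]
ljf = [j.name for j in dispatch_order(queue, "LJF")]
```

Each rule is a fixed sort key, which is what makes these baselines cheap to compute and also unable to react to changes in the shop.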
Machine learning (ML) is a prevalent domain of Artificial Intelligence (AI), which
has become very popular due to its state-of-the-art algorithms that have the ability to
learn behavior, functions, models, and patterns, and use that knowledge to make
intelligent decisions in the future. In recent years, reinforcement learning (RL) has gained
huge attention in the field of ML research (Gosavi 2009). In an RL model, an agent
interacts with the environment and receives rewards or penalties corresponding to its
decisions or actions while pursuing its goals (Shahrabi et al. 2017). RL algorithms
have many applications like robotics, intelligent assistants, industrial control,
and logistics, among others (Mnih et al. 2015). Monte-Carlo tree search (MCTS) is a
kind of RL algorithm that has accomplished remarkable success in the field of AI and
outperformed many human world champions of different classic games such as chess
(Campbell et al. 2002), Go (Silver et al. 2016), checkers (Schaeffer et al. 1992), and
poker (Segler et al. 2018). On the other hand, the FJSP requires dynamic scheduling
in which a real-time decision can be taken for the next operation at a specific machine
(Waschneck et al. 2016). Thus, we applied MCTS, a stochastic algorithm that obtains
accurate results through random sampling. For FJSP, the Monte-Carlo
evaluation accuracy can be improved by tree search. So, a highly effective scheduling
policy for FJSP can be found using MCTS because of its ability to solve any problem
that can be modeled as an MDP. The MDP is a model-based RL formulation and consists of two
elements: a state transition model that predicts the RL agent’s next state after it makes
an action, and a reward model, which predicts the expected calculated reward resulted
after the corresponding state transition. Once an MDP model has been constructed,
the optimal policy or optimal value is computed by applying MCTS (Coulom 2006) or
value iteration (Zhang et al. 2017).
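As a small illustration of the two MDP elements described above, the following sketch runs value iteration on a toy deterministic MDP. The state names, action names, and rewards (negative processing times, so that a larger value means a shorter schedule) are hypothetical and chosen only for this example; they do not come from the paper.

```python
def value_iteration(states, actions, transition, reward, gamma=1.0, sweeps=50):
    """Compute state values from a transition model and a reward model."""
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            options = [
                reward[(s, a)] + gamma * values[transition[(s, a)]]
                for a in actions
                if (s, a) in transition
            ]
            if options:  # terminal states have no outgoing actions
                values[s] = max(options)
    return values


# Hypothetical two-step problem: pick a fast or slow machine at each step.
states = ["start", "mid", "done"]
actions = ["fast", "slow"]
transition = {  # state transition model: (state, action) -> next state
    ("start", "fast"): "mid", ("start", "slow"): "mid",
    ("mid", "fast"): "done", ("mid", "slow"): "done",
}
reward = {  # reward model: negative processing time of the chosen machine
    ("start", "fast"): -7, ("start", "slow"): -12,
    ("mid", "fast"): -4, ("mid", "slow"): -9,
}
values = value_iteration(states, actions, transition, reward)
```

Value iteration sweeps over every state, which is practical only for small state spaces; MCTS instead samples the parts of the tree that look promising.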
The main contribution of our study can be summarized as follows:

1. An MCTS-based algorithm for flexible job-shop scheduling called MCTS-FJS is
proposed, which shows state-of-the-art performance in a real-life manufacturing
environment.
2. The proposed method incorporates three baseline scheduling techniques: FIFO,
SJF, and LJF. We merged these techniques with the MCTS algorithm to find the
best policy for scheduling the continuously arriving jobs on the available
machines. Thus, our method selects the most promising move from three possible
moves, i.e., FIFO, SJF, and LJF, for each job arriving in the job queue.
3. As a result of each move/action, we obtain three outcomes: (1) the selected
move, called the policy, (2) the value gained from the selected move, known as the
value function, and (3) the points scored by applying the selected move, called the
immediate reward. End-to-end training is applied so the algorithm can estimate the
outcomes accurately and find the improvements at each applied move.

The proposed scheduling algorithm successfully found an effective scheduling
policy for the FJSP and outperformed all the baseline scheduling techniques. The real-time
simulation results of the MCTS-FJS algorithm represent the potential of using the RL-
based MCTS algorithm to optimize scheduling problems.
This paper is organized as follows: in Sect. 2, we review the background and past
works related to JSP and MCTS and discuss the challenges and the motivations behind
this study; Sect. 3 presents our proposed MCTS-FJS algorithm with its design and
optimization; Sect. 4 shows the experimental results of MCTS-FJS and its comparison with
baseline scheduling algorithms; Sect. 5 describes the conclusion and future work.

2 Background and literature work

This section presents the previous work for various scheduling models and the
MCTS algorithm.

2.1 Job‑shop scheduling

There are numerous researchers who have contributed to the literature of classic
JSP. Jain and Meeran wrote a complete review paper investigating both approximate
algorithms and exact methods applied to JSP (Jain and Meeran 1999). Scheduling is
a decision-making approach that is used regularly in all situations where a specific
number of operations of different jobs must be processed on a particular number of
resources (Zhang et al. 2019). Computer-based manufacturing scheduling is a very
active area for optimization. It plays an important role in improving the productivity
of a company as it efficiently allocates the limited available resources to the con-
tinuously arriving numerous sequential industrial operations. The resource alloca-
tion must follow some set of conditions or constraints that reflect the relationships
between industrial operations and the limited capacities of machines.
The scheduling problems are classified according to various characteristics of
jobs and machines; for example, whether job pre-emption is allowed, whether all jobs
require equal processing time, whether scheduling involves a single machine or
multiple parallel machines, and so on. If a job has a specific number of operations
requiring various machines for their processing, this problem is called a shop problem. It
is known as one of the critical manufacturing problems because of its consequences
on the supply chain and the whole company’s performance. Depending on the con-
straints of the shop problem, it is classified as flow-shop, open-shop, and job-shop
(Dios and Framinan 2016). All these shop problems are NP-hard and solved by
metaheuristic or approximation methods.
A JSP is a fundamental type of manufacturing and combines various similar
production devices into closed units. In the JSP, each job may have multiple opera-
tions rather than a single operation (Leung 2004). It involves a set of machines and
a set of jobs while each job contains fixed number of operations, and each operation
should be processed by one of the defined machines from all available machines in
their corresponding time duration (Leusin et al. 2018). The main objective of JSP
is to minimize the Makespan, which can be achieved by allocating the
available machines to the operations in such a way that the processing of all jobs
finishes in minimum duration (Reyna et al. 2015). A better understanding of JSP with
classic examples can be found in Gabel and Riedmiller (2008). The multiprocessor
JSP (MJSP) is an extension of the classical JSP in which multiple parallel
machines replace each machine. Carballo et al. proposed a reduction algorithm to
generate a feasible solution for the MJSP (Carballo et al. 2013). Their method
successfully reduces the solution space of MJSP, which they demonstrated through
intensive computational experiments.
A JSP contains various constraints such as logistic constraints (e.g., sudden
machine failure, varying lot sizes), technological constraints (e.g., varying pro-
cessing time, time coupling), and production quantity constraints (Waschneck
et al. 2018). Under the above conditions and constraints, the JSP becomes a com-
plex JSP. Planning and dispatching of such complex scheduling problems are cru-
cial to improving a manufacturing system’s economic and logistic efficiency. Var-
ious dynamic scheduling problems such as two machines, limiting the number of
machines, minimum preemptive schedule, non-preemptive schedule, etc., and their
solutions have been discussed in Leung (2004). An evaluation function and
mathematical model were proposed to minimize the Makespan of the mixed blocking
constraint JSP (Sauvey et al. 2020). It solved the metaheuristics compatibility issues with two
evaluation functions such as particle swarm optimization and genetic algorithm. An
algorithm was proposed to solve a complex and generalized JSP, which considerably
reduces the solution space (Vakhania and Shchepin 2002). Their solution generates
far fewer feasible schedules than the total number of feasible active
schedules.

2.2 Monte Carlo tree search (MCTS)

MCTS is a reinforcement learning (RL) method used to solve sequential decision
problems by following the Monte-Carlo simulation results (Lu et al. 2020). It takes
advantage of the tree structure and obtains its results through random
sampling. It has the same setting as RL models and contains a learning
agent and an environment. In a typical RL model, an agent interacts with the
environment, selects an action by following a certain policy, receives a reward,
and moves from one state to the next (Walsh et al. 2010). The agent keeps
changing its state until it reaches the final goal state. In the MCTS-based method,
by contrast, there are predefined terminal states. Whenever the agent reaches one
of these states, it cannot take any further action and ends the interaction with the
environment. The MCTS finds an optimal policy for selecting a set of actions that
maximizes the total reward (Baier and Drake 2011). The MCTS algorithm simulates
the RL environment many times to update its policy and predicts the best possible
actions on the basis of the simulation results. An antithetical Monte-Carlo method was
proposed in Moras et al. (1997), which can be used to find an optimal Makespan
value and determine a threshold Makespan value in a simple flow-shop scheduling
problem.
The MCTS is one of the best-first search (BFS) algorithms that generate a
sequential tree, where the nodes of the tree represent the states and the edges
represent the possible actions. For many computer game environments, the Monte-Carlo
evaluation accuracy is improved through tree search. The main concept of MCTS is
a search that consists of a number of iterations down the tree, where each iteration
consists of four basic steps (Chaslot et al. 2008):

• Selection This step starts from the root node and recursively selects the best child
node based on an evaluation function until an expandable leaf node is reached.
This expandable leaf node has one or more unvisited child nodes, so it
is a non-terminal state.
• Expansion Adds the unvisited child nodes based on the available actions, unless no
more actions can be taken at the current node.
• Simulation A random downward path is followed according to the roll-out policy
function (default policy) until a terminal state is reached. Simulation is always
applied at unvisited nodes and results in an evaluation, such as whether the goal
was achieved (i.e., win/loss).
• Backpropagation Based on the simulation result, the statistics of all tree nodes
on the path from the node where the simulation started up to the root
node are updated. Two commonly used backpropagation statistics of MCTS are (1) the
simulation reward of each node and (2) the number of visits of each node.
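The four steps above can be sketched in a few lines of Python. This is a minimal, generic sketch and not the authors' implementation: the toy problem (choose three bits, reward is the fraction of ones), the `Node` class, the constant `c`, and the choice to add one child per iteration (a common variant of the expansion step) are all assumptions made for illustration.

```python
import math
import random


class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state          # tuple of actions taken so far
        self.parent = parent
        self.action = action        # action that led to this node
        self.children = []
        self.visits = 0             # N(s)
        self.total_reward = 0.0     # Q(s)


# Toy single-player problem (assumption): choose DEPTH bits.
DEPTH = 3
ACTIONS = (0, 1)


def is_terminal(state):
    return len(state) == DEPTH


def reward(state):
    return sum(state) / DEPTH       # fraction of bits equal to 1


def untried_actions(node):
    tried = {child.action for child in node.children}
    return [a for a in ACTIONS if a not in tried]


def uct(child, c):
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(2 * math.log(child.parent.visits) / child.visits)
    return exploit + explore


def mcts(iterations=300, c=0.7, seed=0):
    rng = random.Random(seed)
    root = Node(state=())
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not is_terminal(node.state) and not untried_actions(node):
            node = max(node.children, key=lambda ch: uct(ch, c))
        # 2. Expansion: add one unvisited child.
        if not is_terminal(node.state):
            action = rng.choice(untried_actions(node))
            child = Node(node.state + (action,), parent=node, action=action)
            node.children.append(child)
            node = child
        # 3. Simulation: random roll-out until a terminal state.
        state = node.state
        while not is_terminal(state):
            state = state + (rng.choice(ACTIONS),)
        value = reward(state)
        # 4. Backpropagation: update visits and rewards up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += value
            node = node.parent
    return root


root = mcts()
best = max(root.children, key=lambda ch: ch.visits)  # most visited root action
```

Selecting the most visited child of the root is one common way to pick the final action; choosing the child with the best average reward is another.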

One complete iteration of a single search of the MCTS algorithm with its four basic
steps is shown in Fig. 1, which is a redesign of the original model from Browne
et al. (2012). These four steps are repeated until either a predefined time budget runs
out or a specific number of iterations has been processed. Once the search terminates,
the best-performing action at the root is selected according to the gathered statistics.
The statistics of a visited node provide information for both exploration and
exploitation. For instance, a node with a low number of visits should be explored
more, whereas a node with a high reward value is promising and should be
exploited more.
Various evaluation functions are used to balance the exploration and exploitation
rates in MCTS (Kocsis et al. 2020). Upper Confidence Bound applied to trees
(UCT) is one of the most commonly used evaluation functions for handling the
exploration/exploitation dilemma during the Selection phase of MCTS (Kocsis and
Szepesvári 2006). This function helps to select the best child node among all child
nodes to traverse through. The UCT function is shown in Eq. 1, where s denotes a
specific node and s′ a child node of s. Q(s′) is the total simulation reward of s′
and N(s′) is the number of times s′ has been visited. The first part of the
function calculates the average reward obtained when node s′ is selected and is
therefore called the exploitation component. The second part, called the exploration
component, grows when N(s′) is small relative to N(s), where N(s) denotes the
number of visits of the predecessor node. Thus, if the sibling nodes have been visited
more often than s′, the value of the second part increases and the function favors
exploring s′. The constant c controls the trade-off between
exploration and exploitation in MCTS and is generally determined empirically.

Fig. 1  The four major phases in the MCTS method

$$\mathrm{UCT}(s') = \frac{Q(s')}{N(s')} + c\,\sqrt{\frac{2\log N(s)}{N(s')}} \tag{1}$$

Note that in this study, UCT is applied for reward minimization (i.e., Makespan
minimization) instead of reward maximization, which can be achieved by apply-
ing a negative sign between the first component and the second component of the
UCT-formula.
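A small sketch of the UCT score for both settings is given below. The function name `uct_score` and the `minimize` flag are hypothetical. Picking the child with the lowest score when minimizing is an assumption about how the sign change is used, since the text only states that the sign between the two components is negated.

```python
import math


def uct_score(q_child, n_child, n_parent, c=1.0, minimize=False):
    """UCT value of a child: average reward plus or minus the exploration term."""
    exploit = q_child / n_child                                   # Q(s') / N(s')
    explore = c * math.sqrt(2.0 * math.log(n_parent) / n_child)   # exploration term
    return exploit - explore if minimize else exploit + explore


# Two children with the same average Makespan of 20; the parent has 10 visits.
rarely_visited = uct_score(q_child=40.0, n_child=2, n_parent=10, minimize=True)
often_visited = uct_score(q_child=160.0, n_child=8, n_parent=10, minimize=True)
# Under minimization (assumption: the child with the lowest score is selected),
# the rarely visited child gets the lower score and is therefore explored first.
```

With equal averages, the subtraction plays the same role as the bonus in the maximization case: it favors the child that has been tried less often.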

2.3 MCTS for JSP

Initially, MCTS was applied to build game-playing algorithms that can defeat
human players and achieve master-level performance. In 2016, an MCTS-based
algorithm introduced by Google DeepMind, namely AlphaGo, defeated the
18-time world champion Lee Sedol of Korea in the game of Go by 4–1 (Silver et al.
2016). After its successful implementation in gaming, MCTS has recently been
applied to many other areas. Runarsson et al. (2012) used Rollout, Pilot, and MCTS
methods to solve a set of 300 JSP instances of various sizes. They found that MCTS
outperformed the other methods for small and medium scheduling problems (i.e.,
fewer than 14 jobs and 14 machines), whereas for more complex and large-size
problems MCTS was not a better choice. The ε-greedy policy was
used for balancing the exploitation and exploration rate in the selection phase of
MCTS. An MCTS algorithm combined with constraint programming (CP) was
introduced to solve JSP (Loth et al. 2013). The CP helps the MCTS to expand the
tree at the child node with the best possible solution. Different evaluation functions
were examined for the selection phase, but UCT was selected as the best choice for
their specific problem.
Wu et al. (2013) proposed an MCTS based multi-objective FJSP algorithm by
combining it with a variable neighborhood descent algorithm (VNDA). They
improved the performance of their proposed algorithm by applying various other
techniques such as RAVE, LSONE, prior knowledge, subtree pruning, and
transposition table. They evaluated their method using three evaluation functions, namely
Makespan, total workload, and max workload. Lubosch et al. (2018) combined
MCTS with a machine learning method, gradient boosted decision trees (GBDT), to
solve complex industrial scheduling problems. The GBDT was used to predict the
best possible value of parameter c in UCT. This approach helped to create a fully
automated job-shop scheduling system. However, most of the previous studies apply
the MCTS for simple JSP instead of FJSP. An FJSP makes the scheduling environ-
ment very complex and requires more efficient ways of using MCTS.

3 Research method and material

In this section, we focus on solving the FJSP with the MCTS-based algorithm. The
problem is described in detail in the following subsections.

3.1 Flexible job‑shop scheduling problem (FJSP)

In a JSP production system, if a specific operation of a job can be processed by
multiple parallel machines rather than a single machine, the problem is called the flexible
job-shop problem (FJSP). In such systems, each machine has numerous capabilities. The
focus of this study is to solve the FJSP, which is a more flexible and complicated
problem. Our FJSP is defined as follows:
1. Assume an $n \times m$ JSP contains a set of $n$ jobs $\{J_x\}_{1 \le x \le n}$ and a set of $m$ machines
$\{M_y\}_{1 \le y \le m}$.
2. Each job $J_x$ has its own set of $i$ operations $\{O_{x,z,y}\}_{1 \le z \le i}$, which should be processed on
multiple parallel machines.
3. A new operation of a job can only be started when the previous operation of the same
job is already finished, which means preemption is not allowed.
4. Each operation $O_{x,z,y}$ needs the exclusive use of one of the machines to complete
the process for an uninterrupted time period, which is called the processing time
$p_{O_{x,z,y}}$.
5. All operations should follow the level-value constraint to process each operation
in the priority sequence.
6. One machine should process only one operation at a given time period.
7. $C_{O_{x,z,y}}$ is the completion time of an operation $O_{x,z,y}$ at machine $M_y$, which is
achieved by scheduling the given operation according to the above constraints.
8. A job is completed when all of its operations have been processed.
9. The Makespan is the completion time of the last operation at the last machine, or
the time required to completely process all the jobs, which is defined as
$C_{\max} = \max C_{O_{x,z,y}}$.
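The completion time and Makespan definitions in the list above can be checked on a small hand-made schedule. The sketch below is illustrative only: the `ScheduledOp` record, the example start times, and the helper names are hypothetical and do not come from the paper's dataset.

```python
from collections import defaultdict
from typing import NamedTuple


class ScheduledOp(NamedTuple):
    job: int        # x
    op: int         # z
    machine: int    # y
    start: int      # start time chosen by the scheduler
    proc_time: int  # processing time p

    @property
    def completion(self):
        # Completion time C = start + processing time (no preemption).
        return self.start + self.proc_time


def makespan(schedule):
    """C_max: the largest completion time over all scheduled operations."""
    return max(s.completion for s in schedule)


def machines_conflict_free(schedule):
    """Constraint 6: a machine processes only one operation at a time."""
    by_machine = defaultdict(list)
    for s in schedule:
        by_machine[s.machine].append(s)
    for ops in by_machine.values():
        ops.sort(key=lambda s: s.start)
        for earlier, later in zip(ops, ops[1:]):
            if later.start < earlier.completion:
                return False
    return True


# Hypothetical schedule of two jobs with two operations each on two machines.
schedule = [
    ScheduledOp(job=0, op=0, machine=1, start=0, proc_time=7),
    ScheduledOp(job=0, op=1, machine=2, start=7, proc_time=4),
    ScheduledOp(job=1, op=0, machine=2, start=0, proc_time=5),
    ScheduledOp(job=1, op=1, machine=1, start=7, proc_time=6),
]
c_max = makespan(schedule)
feasible = machines_conflict_free(schedule)
```

The check covers only the machine-capacity constraint; job precedence and level values would need their own checks.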

Compared with most previous FJSP methods, our production environment
contains an additional sequential constraint called level-value. Under this constraint,
each operation is executed according to its specific level value. This additional
constraint makes our research problem more complex and novel. Level values
represent the sequence in which operations are processed according to their priorities.
Operations with a higher level value must be processed first, which means all
operations of a single job should be processed in descending order of level value.
This is because manufacturing systems are composed of numerous machines,
and each machine can produce different products. Each product belongs to one of
multiple bills of materials (BOM), which define the tree assembly structure used to
manufacture different end products (Joo et al. 2018).
Figure 2 shows an example of a tree assembly structure with seven different
products produced at three different levels. The only precedence constraint for these
products is i → j, which implies that product i must precede product j in the
production series, i.e., product i has a higher priority than product j. Thus, for
the completion of one order of all these products, the sequence of eight products is
E → B → F → G → C → D → D → A. In this sequence, the topmost product
A is produced last.

Fig. 2 An example of tree-like production sequence structure

To illustrate FJSP, we used a sample dataset that contains three jobs with ten
operations that can be processed by one or more machines from a given set of four
machines. Various constraints, such as time constraints and level values for all
operations, are given in Fig. 3a. For instance, Operation-0 of Job-0 can be executed
at Machine-1 and Machine-2 with corresponding processing times of
7 and 12 min, while Machine-0 and Machine-3 have "−", indicating that they cannot
process this operation. Its level value is Level-0, which shows
that this is the lowest-priority operation and will be executed last among all the
operations of Job-0.
Figure 3b shows a Gantt chart obtained by applying one of the simplest baseline
scheduling algorithms, First In, First Out (FIFO), to our sample dataset of three jobs
for FJSP. FIFO processes the operations on a first-come, first-served basis. This
chart follows all the constraints of FJSP and represents a solution. The horizontal
axis represents the processing time of the machines in minutes and the vertical
axis represents all available machines. Each rectangle shows a specific operation
labeled $O_x^y$, where x denotes the job number and y denotes the specific
operation of that job. All operations of a single job are shown in a unique color
for better observation.

Fig. 3 Sample problem of FJSP: a constraint table, and b corresponding Gantt chart with FIFO algorithm

It is clear that multiple operations of different jobs are processed at the same
time on parallel machines, while multiple operations of an individual job are
processed one after the other. The second task of a job is started once the first
task of the same job has been processed, and so on (i.e., operation preemption is not
allowed). Moreover, the schedule also fulfills the sequential condition by following the
level values. For instance, the processing sequence of all operations of Job-1
is $O_1^7 \to O_1^6 \to O_1^4 \to O_1^5 \to O_1^3$, following their corresponding priorities with
level values 3 → 2 → 1 → 1 → 0. Each machine processes only a single
operation in any given time slot. Operation $O_1^3$ is the last processed operation
among all jobs on all parallel machines, so its finishing time is the
Makespan ($C_{\max}$). Minimizing $C_{\max}$ is a well-known efficiency evaluation
method in production scheduling (Sriboonchandr et al. 2019).
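The level-value ordering of Job-1 described above can be reproduced with a stable sort. The operation numbers and level values are taken from the example in the text; keeping tied operations in their input order is an assumption, since the text does not state how ties between equal level values are broken.

```python
# Operations of Job-1 as (operation number, level value), from the example above.
job1_ops = [(3, 0), (4, 1), (5, 1), (6, 2), (7, 3)]


def level_order(ops):
    """Sort by descending level value; Python's sort is stable, so operations
    with equal level values keep their input order (tie-breaking assumption)."""
    return sorted(ops, key=lambda op: -op[1])


sequence = [op for op, _level in level_order(job1_ops)]
levels = [level for _op, level in level_order(job1_ops)]
```

Here `sequence` lists the operations in processing order and `levels` lists the matching level values.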

3.2 MCTS‑based method for FJSP

In this section, we design a single-player MCTS-based algorithm for FJSP. It has
four basic steps as described in Sect. 2.2. The overall architecture of our proposed
MCTS-based scheduler for FJSP is shown in Fig. 4. At a given time step, the
MCTS-scheduler agent observes an input State from the job-shop environment.
Depending on this observation, the agent determines the scheduling Action. After
the action is applied to the job-shop environment, the agent receives a Reward
and moves to a new state. The new state then becomes the current state for the
agent, and the next iteration with the same steps starts. The
agent continues this process until all jobs have been scheduled or it reaches the
terminal state. Meanwhile, the agent cumulatively acquires knowledge from its
applied actions and received rewards. Using this knowledge, the agent updates its
policy (πθ) by applying the MCTS algorithm. All of these components are
discussed in more detail below.

Fig. 4  MCTS-based scheduler architecture for flexible job-shop scheduling problems

3.2.1 Input state

The number of states of the scheduling environment can be huge due to complex and
flexible architecture. These states are the input of the MCTS-scheduling agent. Each
state contains the information of remaining jobs and available machines. For better
understanding, we use the notation [x, y, z, t, l] to represent each state as a job
vector. Each job vector consists of numerous operations, and each operation contains
five basic attributes: (1) job number x, (2) operation number y, (3) requested
machine z (i.e., the machine at which the operation should be processed), (4)
processing time t, and (5) level value l. Thus, the agent implicitly finds a
relationship between the requested machines and the available machines for each
operation. For instance, in Fig. 4, we have two job vectors, [2, 2, M0, 7, 2] and
[2, 3, M1, 10, 3], which show that two operations of Job – 2, O_2^2 and O_3^2, are
processed at machines M0 and M1, respectively, with execution times of 7 and 10
time units. The second operation has level value 3, so it is processed earlier, with
higher priority, at machine M1 than the first operation, which has level value 2.
Although the above two operations can be assigned to multiple machines with
different processing times, our input job vectors are generated after observing the
job-shop environment and selecting only the best possible operations according to
the availability of the machines.
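
As a minimal sketch, the five attributes of a job vector can be held in a small record type. The class name JobVector and its field names are illustrative assumptions; only the attribute order [x, y, z, t, l] and the two example vectors come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobVector:
    job: int        # x: job number
    operation: int  # y: operation number
    machine: str    # z: requested machine
    time: int       # t: processing time
    level: int      # l: level value (higher level is processed earlier)


# The two example vectors from the text: [2, 2, M0, 7, 2] and [2, 3, M1, 10, 3].
state = [JobVector(2, 2, "M0", 7, 2), JobVector(2, 3, "M1", 10, 3)]

# Order operations so that the higher level value comes first.
by_priority = sorted(state, key=lambda v: v.level, reverse=True)
```

Sorting by level in descending order reproduces the priority rule of the example, where the level-3 operation at M1 goes before the level-2 operation at M0.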

3.2.2 MCTS‑scheduling agent

The most essential component of RL is its agent, which learns from the input state
and applies actions accordingly. The scheduling agent implements the MCTS
algorithm to define the policy. It starts at the root node, which denotes the
beginning state of the tree and corresponds to the initial state of the first job to
be scheduled. The subsequent nodes indicate the possible states attained after the
agent chooses among the possible actions. The tree edges indicate the possible
scheduling actions that are applied to process certain operations of a job at
available machines. The algorithm takes the operations of the currently awaiting job
as input and provides an optimal schedule of those operations on the available
machines as output. Our modified MCTS algorithm for policy update is given in
Algorithm 1. During each iteration of the while loop, the algorithm selects the best
possible action (move) and adds it to the final schedule. Here, TreePolicy() selects
a leaf node from all visited nodes in the search tree during the selection and
expansion phases of each iteration, and DefaultPolicy() plays out the simulation from
a non-terminal state using the roll-out policy function to create a value estimate.
This process continues until all jobs Jn are processed and Sin becomes Scomplete.
Each step of our modified MCTS algorithm is explained as follows.
Selection Starting at the root node, the best-child selection policy is applied
recursively to descend the tree until an expandable leaf node is reached. An
expandable leaf node has an unvisited child node and represents a non-terminal
state (lines 1–5). In this paper, our goal is to minimize the Makespan while
updating the policy for the selection of the best child node, which means that the
agent should try to maximize –UCT(s′) (see Eq. 1).

Expansion There are three possible actions for the MCTS-scheduling agent at
each node: FIFO, SJF, and LJF. According to these actions, new child nodes are
created from the expandable leaf node (i.e., the new parent node). These actions
differ in how they allocate the operations of the same job to machines. The best
action is determined by comparing the overall processing time of all operations of
a single job (lines 10–12).
Simulation At the leaf node, a simulation (playout) step starts by following the
default policy to create successive scheduling actions. This step continues until
there are no more jobs to schedule or the agent reaches the terminal state. While
selecting an action, we choose a random job Jn and apply a heuristic to choose a
machine Mm to process that job. To enhance simulation quality, the method prefers
the available machine with the lowest Makespan value. Thus, this method is also
called the greedy simulation method (lines 6–8).
Backpropagation After the simulation step completes, the MCTS-scheduler obtains
the simulation result in the form of a Makespan value. This result is then
backpropagated to all the ancestor nodes in the tree by updating their statistics,
such as the visit count and the average Makespan value. The updated statistics of
the nodes are used to predict the possibility of selecting these nodes in the
future (line 9).
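
The four steps above can be sketched as a compact search loop. This is a hedged illustration, not Algorithm 1 itself: the Node class, the UCB1-style form of uct_score (negated mean Makespan plus an exploration bonus), the depth limit, and toy_rollout with its made-up Makespan values are all assumptions standing in for the paper's Eq. 1 and greedy simulation.

```python
import math
import random

ACTIONS = ("FIFO", "SJF", "LJF")  # the three dispatching rules used as actions


class Node:
    def __init__(self, parent=None, action=None):
        self.parent = parent
        self.action = action
        self.depth = 0 if parent is None else parent.depth + 1
        self.children = {}        # action -> child Node
        self.visits = 0
        self.mean_makespan = 0.0  # running average of simulated makespans

    def untried(self):
        return [a for a in ACTIONS if a not in self.children]


def uct_score(child, c):
    # Negated mean makespan (lower is better) plus a UCB1-style bonus (assumed form).
    bonus = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return -child.mean_makespan + bonus


def select(node, max_depth, c):
    # Selection: descend through fully expanded, non-terminal nodes.
    while node.depth < max_depth and not node.untried():
        node = max(node.children.values(), key=lambda ch: uct_score(ch, c))
    return node


def expand(node):
    # Expansion: add one child for an action not tried yet at this node.
    child = Node(parent=node, action=random.choice(node.untried()))
    node.children[child.action] = child
    return child


def backpropagate(node, makespan):
    # Backpropagation: update visit count and average makespan up to the root.
    while node is not None:
        node.visits += 1
        node.mean_makespan += (makespan - node.mean_makespan) / node.visits
        node = node.parent


def search(root, rollout, max_depth, iterations, c):
    for _ in range(iterations):
        leaf = select(root, max_depth, c)
        if leaf.depth < max_depth:
            leaf = expand(leaf)
        backpropagate(leaf, rollout(leaf))  # Simulation result flows back up
    return min(root.children.values(), key=lambda ch: ch.mean_makespan).action


def toy_rollout(node):
    # Hypothetical simulation: makespan depends only on the first action taken
    # on the path from the root (made-up numbers, not results from the paper).
    while node.parent.parent is not None:
        node = node.parent
    return {"FIFO": 30.0, "SJF": 20.0, "LJF": 40.0}[node.action]


root = Node()
best = search(root, toy_rollout, max_depth=2, iterations=50, c=0.05)
```

With the toy rollout, the root child with the lowest average Makespan is returned as the chosen action; a real rollout would schedule the remaining jobs greedily instead.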

The above method helps to select the best scheduling action among the three
actions for each job during the training process. As the number of input jobs
increases, the tree becomes bigger and thus requires a higher number of simulations
for each iteration, which results in a more accurate selection of possible
scheduling actions. While selecting an action, the performance of UCT is
significantly enhanced by incorporating domain knowledge into the default policy of
the tree. The main advantages of the proposed MCTS-based method are its effective
simulation technique and the ability to stop the simulation at any time according to
the available computation capacity. The disadvantage of this method is that it is
challenging to set the value of the constant c of the UCT function, which controls
the trade-off between exploration and exploitation in MCTS, under limited
computation capacity. So, we tuned the value of c in our experiment for each
explicit domain to get the best possible result.

3.2.3 Job‑shop environment

The MCTS-scheduling agent requires a job-shop environment where it can have
numerous interactions with available machines for training the algorithm. To do so,
we designed a simulated scheduling environment using the OpenAI Gym platform.
OpenAI Gym is a commonly used toolkit to design, implement, and compare various
RL algorithms (Brockman et al. 2016). Our environment simulates a scheduling
problem by virtually running the batch of operations of the job. All the batch
operations are generated from actual operations collected from the job vector and
have the same properties [x, y, z, t, l]. Whenever a new job arrives, the
environment asks the agent for a scheduling action; in reply, the agent selects the
best possible action, and the environment implements that action. In our designed
environment, the agent observes the job vectors as the input and receives an
immediate reward value separately through the MCTS algorithm; later, a cumulative
reward is computed and used to update its policy.

3.2.4 Scheduling action

An RL agent gets the state and observation from the environment and, in response,
takes the best action. In the scheduling problem, the action is simply selecting a
job from the job vector and scheduling it on the available machines. There are three
possible actions in our problem: FIFO (first in first out), SJF (shortest job first),
and LJF (longest job first). According to the availability of machines, the best of
these actions is selected to schedule the job.
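
The three dispatching rules can be illustrated as three orderings of the same waiting operations. This sketch assumes each operation is a hypothetical (arrival_index, processing_time) pair; the function name order_operations and the sample values are illustrative only.

```python
def order_operations(operations, rule):
    """Order (arrival_index, processing_time) pairs by a dispatching rule."""
    if rule == "FIFO":   # first in first out: earliest arrival first
        return sorted(operations, key=lambda op: op[0])
    if rule == "SJF":    # shortest job first: smallest processing time first
        return sorted(operations, key=lambda op: op[1])
    if rule == "LJF":    # longest job first: largest processing time first
        return sorted(operations, key=lambda op: op[1], reverse=True)
    raise ValueError(f"unknown rule: {rule}")


ops = [(0, 17), (1, 20), (2, 12)]  # made-up sample operations
fifo = order_operations(ops, "FIFO")
sjf = order_operations(ops, "SJF")
ljf = order_operations(ops, "LJF")
```

Each MCTS action then corresponds to picking one of these orderings for the job at hand.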

3.2.5 Reward function

In an RL-based job-shop environment, the reward function plays a vital role in
improving the overall performance of the algorithm. A reward is the feedback that
an RL agent gets from the environment after applying an action. It strongly
influences the scheduling policy and is essential for updating the final policy. In
every episode, the agent applies an action and receives a reward. This, however, is
an immediate reward, not the average reward. To calculate the average reward, the
whole sequence of jobs is scheduled until the terminal state is reached, and then
the agent receives the final reward as '–UCT(s′)avg'. The UCT calculates the final
reward score by combining all exploration terms that support sampling infrequently
chosen actions. A full T-length exploration is a sequence of state-action pairs
(s0, a0), (s1, a1), …, (sT−1, aT−1). Consequently, it computes the reward value of a
state, action, and policy parameter (s, a, θ) as the average return acquired after
experiencing various states.

3.3 Training MCTS‑algorithm

We train a scheduling agent with the MCTS algorithm. The training process relies
on policy gradient techniques with various Monte-Carlo simulation episodes.
The algorithm receives a set of input states, possible actions, and a random policy
(i.e., the default policy) as the input, which is defined as π(a | s, θ), where
a denotes the set of actions, s the set of states, and θ the random policy
parameters. While training, we ran various episodes to improve the overall policy.
Each episode represents a complete schedule of the different operations of a single
job, which begins from the initial state s0, action a0, and the corresponding reward
value r0, and runs to the last state sn, action an, and the final reward value rn.
After every step of each episode, the policy parameters θ are updated according to
Eq. 2 (Cheng et al. 2019).
$$\nabla \ln \pi(a_t \mid s_t, \theta) = \frac{\nabla \pi(a_t \mid s_t, \theta)}{\pi(a_t \mid s_t, \theta)} \qquad (2)$$

The agent gets the reward rt at each episode t between the initial and the final
state. This is the immediate reward, which is calculated after scheduling the job Jt
on the available machines. Thus, rt is generated dynamically according to the
scheduling actions taken. For each step of a training episode, the MCTS algorithm
defines the long-term reward value "R = –UCT(s′)avg" using a discount factor of
γ = 0.99 and constant parameter c = 0.05. This long-term reward R is used to
calculate the final optimal policy, as defined in Eq. 3.
$$\pi^{*}(a_t \mid s_t, \theta) = \alpha \gamma^{t} R \, \nabla \ln \pi(a_t \mid s_t, \theta) \qquad (3)$$

where α represents the size of each training set, which is always greater than zero.
The maximum tree length for each simulation is set to 163 episodes, equal to the
total number of operations, after which the tree is supposed to be a tie. The
processing speed of the MCTS-scheduling algorithm is 7 episodes per second, and it
takes an average of 23.3 s (163 episodes at 7 episodes per second) to entirely
schedule a problem of 50 jobs.
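
To make Eqs. 2 and 3 more concrete, the sketch below evaluates the log-derivative term for a small policy and applies one update step. Several things here are assumptions, not taken from the paper: the softmax parameterization over three actions, reading the right-hand side of Eq. 3 as an increment added to θ (the usual policy gradient form), and the sample values of α and R.

```python
import math


def softmax(theta):
    # Numerically stable softmax over action preferences theta.
    m = max(theta)
    exps = [math.exp(v - m) for v in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(theta, action):
    # For a softmax policy: d/d theta_k ln pi(action) = 1[k == action] - pi(k).
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def update(theta, action, t, R, alpha=0.01, gamma=0.99):
    # Assumed reading of Eq. 3: theta <- theta + alpha * gamma^t * R * grad ln pi.
    g = grad_log_pi(theta, action)
    return [th + alpha * (gamma ** t) * R * gk for th, gk in zip(theta, g)]


# One illustrative step with made-up values (three actions, action index 1 taken).
new_theta = update([0.0, 0.0, 0.0], action=1, t=0, R=1.0, alpha=0.1)
```

With a positive R the preference of the chosen action rises and the others fall; with a negative R, such as a negated Makespan-based score, the direction is reversed.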

3.4 Performance evaluation

The MCTS-scheduling agent was integrated with the job-shop environment through
the MDP interface and was implemented with the objective of exploring an opti-
mal scheduling policy. The simulation results were evaluated based on the following
three performance criteria:

(a) Makespan (Cmax) This criterion is applied to calculate the total length of the
schedule, which determines the overall performance of the model.
(b) Average waiting time (AWT) The waiting time of a job includes waiting and
frequencies, completion time, and response time of the job. The AWT criterion is
used to find whether the waiting time of the jobs is decreased or not.
(c) Average resource utilization time (ARUT) It represents the total utilization of
each machine, including the length of the queue for each machine. This criterion
is applied to find whether the applied scheduling methods improve the efficiency
and productivity of the scheduling system or not.
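
A rough sketch of how such criteria could be computed from a finished schedule is given below. The exact formulas for AWT and ARUT are not stated in this section, so this is an assumption: ARUT is taken as the busy share of total machine time up to the Makespan, and AWT as the remaining idle share, which matches the reported AWT and ARUT percentages summing to roughly 100%. The schedule format (machine, start, end) and the sample data are hypothetical.

```python
def evaluate(schedule, machines):
    """schedule: list of (machine, start, end) tuples for processed operations."""
    makespan = max(end for _, _, end in schedule)
    busy = {m: 0 for m in machines}
    for machine, start, end in schedule:
        busy[machine] += end - start
    # Assumed definitions: utilization is busy time over total machine time,
    # and the waiting (idle) share is its complement.
    arut = 100.0 * sum(busy.values()) / (len(machines) * makespan)
    awt = 100.0 - arut
    return makespan, awt, arut


sample = [("M0", 0, 10), ("M0", 10, 20), ("M1", 0, 15)]  # made-up schedule
cmax, awt, arut = evaluate(sample, ["M0", "M1"])
```

Here M0 is busy for the full 20 time units and M1 for 15, so 35 of 40 machine time units are used.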

4 Experimental results and discussion

In this section, we apply our proposed MCTS-based scheduling algorithm to a
real-time case study and compare its performance with three classic baseline
scheduling algorithms.

4.1 Case study

We used a case study from the Singapore Institute of Manufacturing Technology
(SIMTech) in Singapore for the evaluation of the proposed scheduling method.
Real-time simulation is applied to generate a training dataset for FJSP. Table 1
shows a stream of fifty jobs executing on five different machines with a maximum of
four level values. Due to space limitations, we only present the first three jobs
and the last two jobs of the dataset. Each job contains numerous operations, some of
which can be processed on more than one machine. In total, 163 sample operations
are generated to be executed on five different machines. The simulation model is
designed and executed using Visual Studio 2017 and the C# language, and it is run on
an Intel CPU E5-2696 v5 @ 4.40 GHz with 512 GB RAM and the Windows 10 operating
system.

Table 1  Simulation dataset of fifty jobs for FJSP

                              Machine
Job   Operation    Level   M0    M1    M2    M3    M4
J0    O_0^0        0       –     –     17    22    –
      O_1^0        1       –     –     –     –     20
      O_2^0        2       –     20    –     –     –
      O_3^0        2       –     –     –     23    18
J1    O_4^1        0       16    22    –     –     –
J2    O_5^1        0       –     –     –     18    22
J3    O_6^3        0       –     –     21    –     22
      O_7^3        1       –     –     17    –     –
      O_8^3        1       –     –     21    25    –
      O_9^3        2       –     –     –     –     19
      O_10^3       3       –     –     –     18    –
⋮     ⋮            ⋮       …
J48   O_157^48     0       20    –     23    –     –
      O_158^48     1       –     15    –     23    –
      O_159^48     1       –     23    –     –     –
      O_160^48     1       24    –     –     –     –
J49   O_161^49     0       –     –     22    –     20
      O_162^49     1       17    –     16    –     –

*Jx denotes the xth job; O_y^x the yth operation of the xth job; and Mm the mth
machine
Moreover, all operations are subject to one of several types of processing time
uncertainty (PTU): constant, uniform, or triangular. The processing time of constant
operations does not vary significantly and takes a fixed value. The uniform
operations have limited information, and their processing times lie between a lower
bound and an upper bound (e.g., 13–17). The triangular operations have little
information and are described by three processing time parameters: lower bound,
mode, and upper bound (e.g., 6–11–19). For simplicity, we took the average value of
each uniform and triangular operation. The processing times of all operations are
given in minutes under their corresponding machines in Table 1. Additionally, we
assume that all machines are failure-free and process the jobs continuously.
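
As a small worked example of the averaging step, the sketch below collapses the two example ranges to single values. It assumes that "average value" means the mean of the corresponding distribution, i.e. the midpoint for a uniform range and the mean of the three parameters for a triangular one; the function names are illustrative.

```python
def uniform_mean(low, high):
    # Mean of a uniform distribution on [low, high].
    return (low + high) / 2


def triangular_mean(low, mode, high):
    # Mean of a triangular distribution with the given bounds and mode.
    return (low + mode + high) / 3


uniform_time = uniform_mean(13, 17)            # example range 13–17
triangular_time = triangular_mean(6, 11, 19)   # example range 6–11–19
```

Under this assumption, the uniform example becomes a fixed processing time of 15 min and the triangular example a fixed time of 12 min.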

4.2 Experimental results

MCTS-FJS starts with the Selection phase. It selects a random job and applies all
three possible actions (FIFO, SJF, and LJF), thus visiting all child nodes at
least once. The best child node or action is selected by applying the UCT evaluation
function (see Sect. 2.2). When an action is applied and the agent reaches a new
child node, the Expansion phase is triggered by following the tree policy. All child
nodes that have not been visited so far are selected and added to the tree as new
nodes. Simulation is then applied, one node at a time, from each of these new nodes
down to the leaf nodes. This is done by following the default policy and randomly
selecting nodes until the agent reaches the terminal node. Each node visited using
the tree policy is updated with the simulation results in the final Backpropagation
phase. All four phases of our proposed MCTS-FJS algorithm are shown in Fig. 5, where
the value V shows the number of visits of each node and the final selected
actions/nodes are shown in blue. For instance, in our experiment SJF is selected as
the best action to schedule the first job and FIFO to schedule the second job, and
so on. This process continues until all jobs have been successfully scheduled.

Fig. 5  Simulation steps of proposed MCTS-FJS algorithm

Fig. 6  Performance comparison of different scheduling algorithms using various performance criteria: a
Makespan, b AWT, and c ARUT. Note FIFO denotes first in first out; SJF shortest job first; LJF longest
job first; MCTS-FJS Monte Carlo tree search for flexible job shop

The performance results, with a graphical comparison using the three performance
criteria, are shown in Fig. 6. The MCTS-FJS algorithm determined a schedule of fifty
jobs with a Makespan of 705 min, which is significantly lower than the Makespans of
FIFO, SJF, and LJF (733, 760, and 789 min, respectively), as shown in Fig. 6a. Our
method outperformed all the baseline scheduling methods and reduced the scheduling
time by 3.8%, 7.2%, and 10.6% relative to FIFO, SJF, and LJF, respectively. MCTS-FJS
achieved this improvement because it explored the early information on the
processing times of incoming operations and their corresponding available machines
by applying a simulation using the UCT evaluation function to make a final
scheduling action, whereas the baseline algorithms just naively followed their fixed
rules.
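
The reported percentage reductions can be reproduced from the Makespan values quoted above. The sketch assumes each reduction is measured relative to the corresponding baseline Makespan, which is the reading that matches the stated figures.

```python
baselines = {"FIFO": 733, "SJF": 760, "LJF": 789}  # baseline makespans in minutes
mcts_makespan = 705                                # MCTS-FJS makespan in minutes

# Relative reduction with respect to each baseline, in percent.
reduction = {
    name: round(100 * (value - mcts_makespan) / value, 1)
    for name, value in baselines.items()
}
print(reduction)
```

For example, the FIFO figure is 100 × (733 − 705) / 733, which rounds to 3.8%.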
The proposed scheduling method also led to a decline in the AWT of all available
machines for a complete schedule, with an AWT of only 3.5%, compared to the baseline
scheduling methods, all of which have an AWT of 9.3% or more, as shown in Fig. 6b.
Due to the decline in the AWT, the number of incoming jobs in the scheduling system
also decreased, which results in early completion of the overall schedule. In
contrast, MCTS-FJS achieved the highest ARUT of all machines by maximizing resource
utilization, with a value of 96.5%, as shown in Fig. 6c. It increased the average
resource utilization by 6.0%, 9.6%, and 6.0% relative to FIFO, SJF, and LJF,
respectively, which resulted from improving the overall efficiency of the scheduling
system.
Figure 7 presents a Gantt chart solution obtained with the proposed MCTS-FJS model
on a defined problem of 50 jobs, 163 operations, and 5 machines, with a Makespan
of 705 min. It follows all the FJSP constraints, as well as the additional sequential
constraint of level value discussed in Sect. 3.1. For better observation, all
operations of a single job are drawn as rectangles with the same color. The blank
white spaces between two operations show the idle or waiting time of the machine.

4.3 Discussion

A comparison of the proposed MCTS-FJS algorithm with the baseline scheduling
algorithms for three different scenarios is shown in Table 2. Each scenario
contains a sample problem with a different number of jobs and operations but the
same number of available machines. By doing so, we can check the effectiveness
of our model for different sizes of sample problems. The results show that the
MCTS-FJS algorithm outperforms all baseline scheduling algorithms in all three
scenarios and achieves the best results on all evaluation criteria.

Fig. 7  Gantt chart solution of a defined problem with MCTS-FJS model

Table 2  Performance comparison of different scheduling algorithms using three scenarios

                                          Evaluation method
Scenario   n × m × o       Algorithm   Cmax   AWT (%)   ARUT (%)
1          5 × 5 × 15      FIFO        126    20.7      79.3
                           SJF         113    19.9      80.1
                           LJF         132    19.6      80.4
                           MCTS-FJS    101    12.3      87.8
2          25 × 5 × 89     FIFO        438    15.8      84.2
                           SJF         425    16.5      83.5
                           LJF         491    15.5      84.6
                           MCTS-FJS    413    8.6       91.4
3          50 × 5 × 163    FIFO        733    9.3       90.7
                           SJF         760    12.8      87.2
                           LJF         789    9.3       90.7
                           MCTS-FJS    705    3.5       96.5

The best Cmax, AWT, and ARUT values among the applied algorithms (bold in the
original) are those of MCTS-FJS in every scenario
n denotes number of jobs; m number of machines; o number of operations; Cmax
Makespan; AWT average waiting time; and ARUT average resource utilization time

Figure 8 shows the learning patterns of all scheduling algorithms with an
increasing number of jobs. From the figure, we can see that at the start, while
the MCTS-FJS algorithm has explored only the first fifteen jobs, its performance is
worse than that of all baseline scheduling algorithms. This is due to the simple
structure and greedy approach of these baseline scheduling algorithms. However,
after that, with each additional job, the exploration strength of our proposed model
increases and its performance becomes better and better; finally, it outperforms all
baseline algorithms when it reaches job 44. This is because, with each additional
job, the MCTS algorithm applies more simulations and visits additional tree nodes
during each episode to improve its policy. Thus, the MCTS-FJS algorithm is more
suitable for scheduling problems with a large number of jobs than the other
scheduling methods. So, if we add more jobs to our experiment, the performance of
our model should improve gradually.

Fig. 8  Makespan comparison of all scheduling algorithms with increasing number of jobs

5 Conclusions

This paper proposed an effective scheduling algorithm based on the MCTS strategy
for flexible job shop scheduling problems, which successfully improved the
overall manufacturing performance. The MCTS-scheduling agent gets information
about all scheduling rules and optimizes its policy autonomously by applying the
MCTS algorithm. The proposed method also considers the additional sequential
constraint, making the job-shop environment highly flexible and agile in dealing
with random events. Our findings were that the proposed method outperformed all the
baseline scheduling algorithms (FIFO, SJF, and LJF) and reduced the scheduling time
of the 50-job problem by 3.8%, 7.2%, and 10.6%, respectively. Additionally, we found
that the performance of the proposed method gradually improves with an increasing
number of input jobs. Thus, the proposed algorithm can be used to solve the complex
FJSP in the manufacturing industries and improve their efficiency and productivity
in the field.
A future extension of our study would be the use of deep reinforcement learning
with a deep neural network to train the MCTS algorithm (Kartal et al. 2019),
unlike the current study, where we used simple reinforcement learning-based
MCTS. Hyper-parameter settings can play a vital role in improving the
performance of MCTS (Orhean et al. 2017). So, we will also work on
hyper-parameter tuning methods to enhance the performance of the proposed
MCTS-FJS algorithm.

Acknowledgements This work was done in collaboration with the Singapore Institute of
Manufacturing Technology (SIMTech). We thank Dr. Byung Jun Joo for providing us with
a real-time industrial dataset from SIMTech. Funding was provided by the Ministry of
Trade, Industry and Energy (Grant No. N0002429) and the National Research Foundation
of Korea (Grant No. 2017R1D1A1A02018718).

Funding This work was supported by the KIAT (Korea Institute for Advancement of Technology) grant
funded by the Korea Government (MOTIE: Ministry of Trade Industry and Energy) (No. N0002429). It
was also supported by the Basic Science Research Program through the National Research Foundation of
Korea (NRF) funded by the Ministry of Education (2017R1D1A1A02018718).

References
Ahire S, Greenwood G, Gupta A, Terwilliger M (2007) Workforce-constrained preventive maintenance
scheduling using evolution strategies. Decis Sci 31(4):833–859.
https://doi.org/10.1111/j.1540-5915.2000.tb00945.x
Asadzadeh L (2015) A local search genetic algorithm for the job shop scheduling problem with
intelligent agents. Comput Ind Eng 85:376–383. https://doi.org/10.1016/j.cie.2015.04.006
Baier H, Drake PD (2011) The power of forgetting: Improving the last-good-reply policy in Monte
Carlo Go. IEEE Trans Comput Intell AI Games 2(4):303–309.
https://doi.org/10.1109/TCIAIG.2010.2100396
Bierwirth C, Mattfeld DC (1999) Production scheduling and rescheduling with genetic algorithms.
Evol Comput 7(1):1–17. https://doi.org/10.1162/evco.1999.7.1.1
Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI
Gym. arXiv:1606.01540
Browne C, Powley E, Whitehouse D, Lucas S, Cowling PI, Rohlfshagen P, Tavener S, Perez D,
Samothrakis S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans Comput
Intell AI Games 4(1):1–49. https://doi.org/10.1109/TCIAIG.2012.2186810
Campbell M, Hoane AJ, Hsu F-H (2002) Deep blue. Artif Intell 134(1–2):57–83.
https://doi.org/10.1016/S0004-3702(01)00129-1
Carballo L, Vakhania N, Werner F (2013) Reducing efficiently the search tree for multiprocessor
job-shop scheduling problems. Int J Prod Res 51(23–24):7105–7119.
https://doi.org/10.1080/00207543.2013.837226
Chaari T, Chaabane S, Aissani N, Trentesaux D (2014) Scheduling under uncertainty: Survey and
research directions. ICALT pp 229–234. https://doi.org/10.1109/ICAdLT.2014.6866316
Chaslot G, Bakkes S, Szita I, Spronck P (2008) Monte-Carlo tree search: a new framework for game
AI. AIIDE pp 216–217
Cheng Y, Wu Z, Liu K, Wu Q, Wang Y (2019) Smart DAG tasks scheduling between trusted and
untrusted entities using the MCTS method. Sustainability 11(7):1826.
https://doi.org/10.3390/su11071826
Chiang T-C, Lin H-J (2012) Flexible job shop scheduling using a multiobjective memetic algorithm.
Adv Intell Comput Theories Appl pp 49–56. https://doi.org/10.1007/978-3-642-25944-9_7
Chiang T-C, Lin H-J (2013) A simple and effective evolutionary algorithm for multiobjective
flexible job shop scheduling. Intern J Prod Econ 141(1):87–98.
https://doi.org/10.1016/j.ijpe.2012.03.034
Coulom R (2006) Efficient selectivity and backup operators in Monte-Carlo tree search.
International conference on computers and games pp 72–83.
https://doi.org/10.1007/978-3-540-75538-8_7
Dios M, Framinan JM (2016) A review and classification of computer-based manufacturing scheduling
tools. Comput Ind Eng 99:229–249. https://doi.org/10.1016/j.cie.2016.07.020
Fera M, Fruggiero F, Lambiase A, Martino G, Nenni ME (2013) Production scheduling approaches for
operations management. https://doi.org/10.5772/55431
Floudas CA, Lin X (2005) Mixed integer linear programming in process scheduling: modeling,
algorithms, and applications. Ann Oper Res 139(1):131–162.
https://doi.org/10.1007/s10479-005-3446-x
Gabel T, Riedmiller M (2008) Adaptive reactive job-shop scheduling with reinforcement learning
agents. Int J Inf Technol Intell Comput 24(4)
Gosavi A (2009) Reinforcement learning: A tutorial survey and recent advances. INFORMS J Comput
21(2):178–192. https://doi.org/10.1287/ijoc.1080.0305
Hou ESH, Ansari N, Ren H (1994) A genetic algorithm for multiprocessor scheduling. IEEE Trans
Parallel Distrib Syst 5(2):113–120. https://doi.org/10.1109/71.265940
Jain AS, Meeran S (1999) Deterministic job-shop scheduling: past, present and future. Eur J Oper
Res 113(2):390–434. https://doi.org/10.1016/S0377-2217(98)00113-1
Joo BJ, Shim S-H, Chua TJ, Cai TX (2018) Multi-level job scheduling under processing time
uncertainty. Comput Ind Eng 120:480–487. https://doi.org/10.1016/j.cie.2018.02.003
Kartal B, Hernandez-Leal P, Taylor ME (2019) Action guidance with MCTS for deep reinforcement
learning. arXiv:1907.11703v1
Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. ECML pp 282–293.
https://doi.org/10.1007/11871842_29
Kocsis L, Szepesvári C, Willemson J (2020) Improved Monte-Carlo search
Leung YTJ (2004) Handbook of scheduling: algorithms, models and performance analysis. Chapman &
Hall, London. https://doi.org/10.1201/9780203489802
Leusin ME, Frazzon EM, Maldonado MU, Kück M, Freitag M (2018) Solving the job-shop scheduling
problem in the industry 4.0 era. Technologies 6:107.
https://doi.org/10.3390/technologies6040107
Li M, Yao L, Yang J, Wang Z (2014) Due date assignment and dynamic scheduling of one-of-a-kind
assembly production with uncertain processing time. Int J Comput Integr Manuf 28(6):1–12.
https://doi.org/10.1080/0951192X.2014.900859
Loth M, Sebag M, Hamadi Y, Schoenauer M, Schulte C (2013) Hybridizing constraint programming and
Monte-Carlo tree search: application to the job shop problem. ICLIO.
https://doi.org/10.1007/978-3-642-44973-4_35
Lu L, Zhang W, Gu X, Ji X, Chen J (2020) HMCTS-OP: Hierarchical MCTS based online planning in the
asymmetric adversarial environment. Symmetry 12(5):1–17. https://doi.org/10.3390/sym12050719
Lubosch M, Kunath M, Winkler H (2018) Industrial scheduling with Monte Carlo tree search and
machine learning. Procedia CIRP 72:1283–1287. https://doi.org/10.1016/j.procir.2018.03.171
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M,
Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D,
Wierstra D, Legg S, Hassabis D (2015) Human-level control through deep reinforcement learning.
Nature 518(7540):529–533. https://doi.org/10.1038/nature14236
Moras R, Smith ML, Kumar KS, Azim MA (1997) Analysis of antithetic sequences in flowshop scheduling
to minimize makespan. Prod Plan Control 8(8):780–787. https://​doi.​org/​10.​1080/​09537​28972​34678
Orhean AI, Pop F, Raicu I (2017) New scheduling approach using reinforcement learning for heterogeneous
distributed systems. J Parallel Distrib Comput. https://​doi.​org/​10.​1016/j.​jpdc.​2017.​05.​001
Pinedo ML (2008) Scheduling: theory, algorithms, and systems. https://​doi.​org/​10.​1007/​978-0-​387-​78935-4
Reyna YCF, Jiménez YM, Cabrera JMB, Hernández BMM (2015) A reinforcement learning approach for
schedulingproblems. Revista Investigacion Operacional 36(3):225–231
Runarsson TP, Schoenauer M, Sebag M (2012) Pilot, rollout and Monte Carlo tree search methods for job
shop scheduling. pp 160–174. https://​doi.​org/​10.​1007/​978-3-​642-​34413-8_​12
Sauvey C, Trabelsi W, Sauer N (2020) Mathematical model and evaluation function for conflict-free war-
ranted makespan minimization of mixed blocking constraint job-shop problems. Mathematics 8(1):121.
https://​doi.​org/​10.​3390/​math8​010121
Schaeffer J, Culberson J, Treloar N, Knight B, Lu P, Szafron D (1992) A world championship caliber check-
ers program. Artif Intell 53(2–3):273–289. https://​doi.​org/​10.​1016/​0004-​3702(92)​90074-8
Segler MHS, Preuss M, Waller MP (2018) Planning chemical syntheses with deep neural networks and sym-
bolic AI. Nature 555(7698):604–610. https://​doi.​org/​10.​1038/​natur​e25978
Shahrabi J, Adibi MA, Mahootchi M (2017) A reinforcement learning approach to parameter estimation in
dynamic job shop scheduling. Comput Ind Eng. https://​doi.​org/​10.​1016/j.​cie.​2017.​05.​026
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Driessche GVD, Schrittwieser J, Antonoglou I, Pan-
neershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T,
Leach M, Kavukcuoglu K, Graepel T, Hassabis D (2016) Mastering the game of Go with deep neural
networks and tree search. Nature 529(7585):484–489. https://​doi.​org/​10.​1038/​natur​e16961
Sriboonchandr P, Kriengkorakot N, Kriengkorakot P (2019) Improved differential evolution algorithm for
flexible job shop scheduling problems. Math Comput Appl 24(3):80. https://​doi.​org/​10.​3390/​mca24​
030080
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. The MIT Press
Vakhania N, Shchepin E (2002) Concurrent operations can be parallelized in scheduling multiprocessor job
shop. J Sched 5(3):227–245. https://​doi.​org/​10.​1002/​jos.​101
Vinod V, Sridharan R (2008) Scheduling a dynamic job shop production system with sequence-dependent
setups: an experimental study. Robot Comput-Integrated Manuf 24(3):435–449. https://​doi.​org/​10.​
1016/j.​rcim.​2007.​05.​001
Walsh TJ, Goschin S, Littman ML (2010) Integrating sample-based planning and model-based reinforcement
learning. AAAI
Waschneck B, Reichstaller A, Belzner L, Altenmüller T, Bauernhansl T, Knapp A, Kyek A (2018) Optimiza-
tion of global production scheduling with deep reinforcement learning. Procedia CIRP 72:1264–1269.
https://​doi.​org/​10.​1016/j.​procir.​2018.​03.​212
Waschneck B, Altenmüller T, Bauernhansl T, Kyek A (2016). Production scheduling in complex job shops
from an industrie 4.0 perspective: a review and challenges in the semiconductor industry. SAMI
Wu T-Y, Wu I-C, Liang C-C (2013) Multi-objective flexible job shop scheduling problem based on Monte-
Carlo tree search. Conference on technologies and applications of artificial intelligence, pp 73–78.
https://​doi.​org/​10.​1109/​TAAI.​2013.​27
Zhang T, Xie S, Rose O (2017) Real-time job shop scheduling based on simulation and Markov decision
processes. WSC pp 3899–3907. https://doi.org/10.1109/WSC.2017.8248100
Zhang D, Dai D, He Y, Bao FS (2019) RLScheduler: learn to schedule HPC batch jobs using deep
reinforcement learning. arXiv:1910.08925v1
Zhang T, Rose O (2013) Intelligent dispatching in dynamic stochastic job shops. WSC.
https://doi.org/10.1109/WSC.2013.6721634

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

M. Saqlain received his B.S. degree in Software Engineering from Government College University
Faisalabad (GCUF), Pakistan, in 2014, and his M.S. degree in the same major from the National University of
Science and Technology (NUST), Pakistan, in 2016. He is now a Ph.D. candidate at the Department of
Computer Science at Chungbuk National University, Republic of Korea. His research interests include
data mining, artificial intelligence, machine learning, deep learning, reinforcement learning, and smart
manufacturing.

S. Ali received his B.E. degree in Computer Engineering from Mehran University of Engineering &
Technology, Pakistan, in 2015. He is now an M.S. candidate at the Department of Computer Science at Chungbuk
National University, Republic of Korea. His research interests include bioinformatics, data mining,
artificial intelligence, and cardiovascular disease.

J. Y. Lee received the B.E. and M.E. degrees in computer engineering and the Ph.D. degree in computer
science from Chungbuk National University, South Korea, in 1985, 1987, and 1999, respectively. He was
a Research/Project Leader with the Institute of Software Research and Development, Hyundai Electronics
Industrial Company Ltd., and Hyundai Information Technologies Company Ltd., South Korea, from
1990 to 1996. He was with BIT Computer Cooperation in 1989. He was an assistant professor with the
Department of Information and Communication Engineering, Kangwon National University at Samcheok
Campus, from 1999 to 2003. He is now a full professor with the Department of Software Engineering,
Chungbuk National University, South Korea. He was president of the Korea Convergence Society
from January 2010 to December 2017 and has been editor-in-chief of the journal since January 2020. His
current research interests include medical databases, query processing and optimization techniques in
databases, fault detection in semiconductor manufacturing, production scheduling in smart factories,
and machine and reinforcement learning.
