Applications of Machine Learning and Data Analytics Models in Maritime Transportation
Applications of Machine Learning and Data Analytics Models in Maritime Transportation explores the fundamental principles of analysing practical maritime transportation problems using data-driven models, with a particular focus on machine learning and operations research models.

Data-enabled methodologies, technologies, and applications in maritime transportation are clearly and concisely explained, and case studies of typical maritime challenges and solutions are also included. The authors begin with an introduction to maritime transportation, followed by chapters providing an overview of ship inspection by port state control and the principles of data-driven models. Further chapters cover linear regression models, Bayesian networks, support vector machines, artificial neural networks, tree-based models, association rule learning, cluster analysis, classic and emerging approaches to solving practical problems in maritime transport, incorporating shipping domain knowledge into data-driven models, explanation of black-box machine learning models in maritime transport, linear optimization, advanced linear optimization, and integer optimization. A concluding chapter provides an overview of the coverage and explores future possibilities in the field.
Ran Yan and Shuaian Wang
The book will be especially useful to researchers and professionals with expertise in maritime
research who wish to learn how to apply data analytics and machine learning to their fields.
Shuaian Wang is a professor in the Department of Logistics and Maritime Studies at The
Hong Kong Polytechnic University, China.
Other related titles:
Volume 1 Clean Mobility and Intelligent Transport Systems M. Fiorini and J-C. Lin (Editors)
Volume 2 Energy Systems for Electric and Hybrid Vehicles K.T. Chau (Editor)
Volume 5 Sliding Mode Control of Vehicle Dynamics A. Ferrara (Editor)
Volume 6 Low Carbon Mobility for Future Cities: Principles and Applications H. Dia (Editor)
Volume 7 Evaluation of Intelligent Road Transportation Systems: Methods and Results M. Lu (Editor)
Volume 8 Road Pricing: Technologies, economics and acceptability J. Walker (Editor)
Volume 9 Autonomous Decentralized Systems and their Applications in Transport and Infrastructure K. Mori (Editor)
Volume 11 Navigation and Control of Autonomous Marine Vehicles S. Sharma and B. Subudhi (Editors)
Volume 12 EMC and Functional Safety of Automotive Electronics K. Borgeest
Volume 15 Cybersecurity in Transport Systems M. Hawley
Volume 16 ICT for Electric Vehicle Integration with the Smart Grid N. Kishor and J. Fraile-Ardanuy (Editors)
Volume 17 Smart Sensing for Traffic Monitoring Nobuyuki Ozaki (Editor)
Volume 18 Collection and Delivery of Traffic and Travel Information P. Burton and A. Stevens (Editors)
Volume 20 Shared Mobility and Automated Vehicles: Responding to socio-technical changes and pandemics Ata Khan and Susan Shaheen
Volume 23 Behavioural Modelling and Simulation of Bicycle Traffic L. Huang
Volume 24 Driving Simulators for the Evaluation of Human-Machine Interfaces in Assisted and Automated Vehicles T. Ito and T. Hirose (Editors)
Volume 25 Cooperative Intelligent Transport Systems: Towards high-level automated driving M. Lu (Editor)
Volume 26 Traffic Information and Control Ruimin Li and Zhengbing He (Editors)
Volume 30 ICT Solutions and Digitalisation in Ports and Shipping M. Fiorini and N. Gupta
Volume 32 Cable Based and Wireless Charging Systems for Electric Vehicles: Technology and control, management and grid integration R. Singh, S. Padmanaban, S. Dwivedi, M. Molinas and F. Blaabjerg (Editors)
Volume 34 ITS for Freight Logistics H. Kawashima (Editor)
Volume 36 Vehicular ad hoc Networks and Emerging Technologies for Road Vehicle Automation A.K. Tyagi and S. Malik
Volume 38 The Electric Car M.H. Westbrook
Volume 45 Propulsion Systems for Hybrid Vehicles J. Miller
Volume 79 Vehicle-to-Grid: Linking Electric Vehicles to the Smart Grid J. Lu and J. Hossain (Editors)
Applications of Machine Learning and Data Analytics Models in Maritime Transportation
Ran Yan and Shuaian Wang
This publication is copyright under the Berne Convention and the Universal Copyright
Convention. All rights reserved. Apart from any fair dealing for the purposes of research or
private study, or criticism or review, as permitted under the Copyright, Designs and Patents
Act 1988, this publication may be reproduced, stored or transmitted, in any form or by
any means, only with the prior permission in writing of the publishers, or in the case of
reprographic reproduction in accordance with the terms of licences issued by the Copyright
Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to
the publisher at the undermentioned address:
The Institution of Engineering and Technology
Futures Place
Kings Way, Stevenage
Herts, SG1 2UA, United Kingdom
www.theiet.org
While the authors and publisher believe that the information and guidance given in this work are correct, all parties must rely upon their own skill and judgement when making use of them. Neither the authors nor the publisher assumes any liability to anyone for any loss or damage caused by any error or omission in the work, whether such an error or omission is the result of negligence or any other cause. Any and all such liability is disclaimed.

The moral rights of the authors to be identified as authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
1 Introduction of maritime transportation 1
1.1 Overview of maritime transport 1
1.2 World fleet structure 1
1.2.1 Bulk carrier 1
1.2.2 Oil tanker 2
1.2.3 Container ship 2
1.3 Key roles in the shipping industry 3
1.3.1 Ship owner 3
1.3.2 Ship operator 3
1.3.3 Ship management company 3
1.3.4 Flag state 3
1.3.5 Classification society 4
1.3.6 Charterer 4
1.3.7 Freight forwarder 4
1.3.8 Ship broker 5
1.4 Container liner shipping 5
3 Introduction to data-driven models 21
3.1 Predictive problem and its application in maritime transport 21
3.1.1 Introduction of predictive problem 21
3.1.2 Examples of predictive problem in maritime transport 22
5 Linear regression models 51
5.1 Simple linear regression and the least squares 51
5.2 Multiple linear regression 53
5.3 Extensions of multiple linear regression 55
5.3.1 Polynomial regression 55
5.3.2 Logistic regression 56
5.4 Shrinkage linear regression models 59
5.4.1 Ridge regression 60
5.4.2 LASSO regression 61
6 Bayesian networks 63
6.1 Naive Bayes classifier 63
6.2 Semi-naive Bayes classifiers 68
6.3 BN classifiers 73
7 Support vector machine 79
7.1 Hard margin SVM 79
7.2 Soft margin SVM 83
7.3 Kernel trick 86
7.4 Support vector regression 90
8 Artificial neural network 93
8.1 The structure and basic concepts of an ANN 93
8.1.1 Training of an ANN model 97
8.1.2 Hyperparameters in an ANN model 100
8.2 Brief introduction of deep learning models 103
18 Conclusion 297
18.1 Summary of this book 297
18.2 Future research agenda 298
Index 301
About the Authors
TO MY DAUGHTER AMY
—Shuaian
Chapter 1
Introduction of maritime transportation
Maritime transportation is the transport of passengers and cargoes by sea, and it dates back to ancient Egyptian times. It is the backbone of global trade, manufacturing supply chains, and the world economy, as about 80% of world trade by volume is carried by ocean-going vessels [1]. Even during the hard times brought about by the COVID-19 pandemic, maritime transport continued to play a key role in moving goods and products, especially essential living and medical supplies, around the world. World economic development has a critical impact on maritime transport. According to the review of maritime transport produced by the United Nations Conference on Trade and Development (UNCTAD), the international maritime trade volumes from 1970 to 2020 are shown in Table 1.1 [1].
1.2 World fleet structure

The total number of ships of 100 gross tons and above reached 99 800 in 2020. Different types of ships are responsible for carrying different types of commodities. In 2021, the top three ship types with the largest market shares in the world fleet by deadweight tonnage (DWT) were bulk carriers (913 032 000 tons, 42.77%), oil tankers (619 148 000 tons, 29.00%), and container ships (281 784 000 tons, 13.20%), according to UNCTAD based on data from Clarksons Research [1]. A brief introduction to these three types of ships is given in the following subsections.
1.2.1 Bulk carrier
Bulk carriers mainly transport dry cargoes in bulk quantities in a loose manner, that is, the cargoes are free from any specific packaging. The major categories of bulk carriers by size include handysize carriers (usually between 25 000 and 40 000 DWT), handymax carriers (usually between 40 000 and 60 000 DWT), Panamax carriers (usually between 60 000 and 100 000 DWT), post-Panamax carriers (usually between 80 000 and 120 000 DWT), capesize carriers (usually between 100 000 and 200 000 DWT), and very large bulk carriers (usually over 200 000 DWT). Bulk carriers are mainly used to transport ores, coal, cement, etc.
Table 1.1 International maritime trade from 1970 to 2020 (millions of tons loaded)
1.2.2 Oil tanker

An oil tanker is a type of tanker ship designed for the particular purpose of transporting liquid cargoes, mainly crude oil and its by-products. According to their sizes, oil tankers can be divided into Panamax (usually between 50 000 and 75 000 DWT), Aframax (named after the Average Freight Rate Assessment system, usually between 75 000 and 120 000 DWT), Suezmax (usually between 120 000 and 180 000 DWT), very large crude carriers (usually between 200 000 and 320 000 DWT), and ultra large crude carriers (usually more than 320 000 DWT).
1.2.3 Container ship
Container ships have specially designed structures to hold a large quantity of
cargoes compacted in different types of containers. Container ship capacity is
measured in twenty-foot equivalent units (TEUs). They can be further divided
into feeder (1 001–2 000 TEU), feedermax (2 001–3 000 TEU), Panamax (3
001–5 100 TEU), post-Panamax (5 101–10 000 TEU), new Panamax (10 000–
14 500 TEU), and ultra large container vessel (over 14 500 TEU). Container
Introduction of maritime transportation 3
shipping is revolutionary in the form of cargoes that are ferried and transported
around the world, as it is the foundation of international multimodal transport
that connects ship with truck and rail. Globalization would not have been pos-
sible without containerization. Common cargoes transported by container ships
include manufactured goods, raw materials, food goods, agricultural products,
and seafood in refrigerated containers. More information on container liner
shipping is provided in section 1.4.
1.3 Key roles in the shipping industry

1.3.1 Ship owner

The owner of a merchant ship is called a ship owner, which can be a company, a person, or an investment fund. Ownership is obtained by building new ships or purchasing second-hand ships. Some ship owners provide technical or commercial operations themselves, while others prefer to outsource these services to specialized companies. The party that offers commercial management services is called a ship operator, and the party that offers technical management services is called a ship management company.
1.3.2 Ship operator

A ship operator is the company in charge of the daily matters of ships. It needs to tackle several tactical problems, such as determining cargo orders and freight rates, fleet deployment, ship routing and scheduling, and port service negotiation. In addition, a ship operator also needs to deal with operational problems, such as deciding ship loading and selecting sailing speeds, routes, and bunkering ports.
1.3.4 Flag state

The ship flag state is the nationality of a ship, under whose laws the ship operates on the open sea. A flag state is responsible for conducting regular inspections of its ships to ensure the safety of their cargo and crew (which is called flag state control), collecting taxes, and regulating pollution levels by imposing maritime policies and laws. In turn, ships receive protections and preferential treatments from their flags of registration, such as tax, certification, trade rights, and security. The flag state of a ship plays an important role in many aspects of ship management and operation, such as vessel leasing, sale and purchase, newbuilding deliveries, financing, and the different priorities of owners and mortgagees. The practice of registering a ship to a state different from that of the ship owner is known as flying a flag of convenience.
1.3.6 Charterer

A ship charterer is a company that hires ships from ship owners to transport its cargoes, and the contract between them is known as a charterparty. Time charter and voyage charter are the most popular charterparty agreements. In the former, the ship owner rents a ship to the charterer for a jointly determined period, with restrictions imposed on trading and cargoes; in the latter, a ship is hired to carry a particular cargo between specific places. Usually, under a time charterparty, the ship owner takes care of the crew, maintains ship stores and provisions, and conducts ship maintenance, while the time charterer is responsible for fuel arrangement, port and canal dues, wharfage, and many other types of operational costs. In a typical voyage charterparty, by contrast, most of the operational costs are covered by the ship owner, and the voyage charterer is responsible for the costs in the ports and terminals of the specific voyage. Another popular charterparty is the bareboat charter, where a ship charterer hires a ship without administration or technical maintenance provided and bears the full operating expenses. In all cases, both ship owners and charterers are free to negotiate their liabilities.
1.3.7 Freight forwarder

A freight forwarder is a person or an entity that manages shipments to transport goods from origin to destination. Its duties may include consultancy on export and/or import costs and regulations, arrangement of goods transportation, cargo insurance, document translation, and communication with clients.
1.3.8 Ship broker

Ships are chartered every day through ship brokers to carry cargoes worldwide. A ship owner's broker takes responsibility for arranging the most profitable employment of the vessels under the owner's control, while a ship charterer's broker helps the charterer find the most effective way to transport the cargoes with the right ship at the right price. Therefore, a ship broker needs to be professional, to be aware of the current freight market and able to forecast its future status, and to act as the bridge between the ship owner and the charterer.
1.4 Container liner shipping

Note that on a route, different ports of call may be the same physical port. For example, the Central China 1 (CC1) service of OOCL shown in Figure 1.2 has the port rotation below:
Shanghai (1) → Kwangyang (2) → Pusan (3) → Los Angeles (4) → Oakland (5) → Pusan (6) → Kwangyang (7) → Shanghai (1)
Both the second and the seventh ports of call are Kwangyang, and both the third and
the sixth ports of call are Pusan.
A leg is the voyage from one port of call to the next. Leg i is the voyage from the i-th port of call to port of call i + 1. The last leg is the voyage from the last port of call to the first port of call. On CC1, the second leg is the voyage from Kwangyang (the second port of call) to Pusan (the third), and the seventh leg is the voyage from Kwangyang (the seventh) to Shanghai (the first).
The rotation time of a route is the time required for a ship to start from the first port of call, visit all ports of call on the route, and return to the first port of call. As can be read from Figures 1.1 and 1.2, the rotation time of CC2 is 35 days*, and the rotation time of CC1 is 42 days. Each route provides a weekly frequency, which means that each port of call is visited on the same day every week. Therefore, a string of five ships is deployed on CC2, and the headway between two adjacent ships is 7 days. These five ships usually have the same TEU capacity.
* 13 days from the departure from Shanghai to the arrival at Los Angeles + 17 days from the departure from Los Angeles to the arrival at Shanghai + 2 days spent at the port of Shanghai (Wednesday to Friday) + 3 days spent at the port of Los Angeles (Thursday to Sunday).
Chapter 2

Ship inspection by port state control

The Maritime Safety Committee (MSC) is responsible for every aspect of maritime safety and security, and it is the highest technical body of the IMO.
2.1.3 Seafarers’ management

Seafarers are regarded as key workers by the IMO, as they are essential to shipping and the world, in particular during the COVID-19 pandemic, and they are at the core of shipping’s future. Basic requirements on the training, certification, and watchkeeping of seafarers at the international level are established in the International Convention on Standards of Training, Certification, and Watchkeeping for Seafarers by the IMO, which entered into force in 1984 [4]. In addition, the milestone in ensuring a decent living and working environment for seafarers is the Maritime Labour Convention, 2006 (MLC 2006), which was developed by the International Labour Organization and entered into force in 2013 [5]. Minimum requirements for seafarers regarding conditions of employment, work, accommodation, food and catering, health care and medical care onboard, as well as welfare and social security protection are specified in MLC 2006. Within the IMO, a Sub-Committee on Human Element, Training, and Watchkeeping deals with the human side of shipping.
A ship whose hull, machinery, equipment, or operational safety is substantially below the standards required by the relevant conventions, or whose crew is not in conformity with the safe manning document, is deemed a substandard ship [6]. The ship flag state is the first line of defense against substandard shipping. It is obliged to ensure that its ships are periodically surveyed and re-certified by carrying out inspections by its own surveyors or by an authorized party called a recognized organization (RO). However, due to the international nature of shipping activities, many ships do not regularly call at ports in their flag states, which restricts the role that the flag state plays in identifying substandard ships.
both rates will be lower (as only ships with relatively high risk are selected for inspection). Therefore, accurately identifying substandard ships with higher risk, i.e., with a larger number of deficiencies or a higher probability of detention, is of vital importance for enhancing the effectiveness of PSC.
To achieve efficient PSC inspection, each MoU adopts a uniform ship selection method, although the methods adopted by different MoUs might differ. One fundamental ship selection method is based on expert experience. For example, in the Mediterranean MoU, ships can be exempted from further inspection if they have been inspected within the last 6 months and found to comply with the regulations [10]. A more advanced ship selection method is called the new inspection regime (NIR), which is adopted by the Abuja MoU, the Black Sea MoU, the Paris MoU, and the Tokyo MoU. The NIR takes ship age and type, the performance of the ship’s flag/RO/company*, and the number of deficiencies and detentions in the last 3 years into account to calculate the ship risk profile (SRP). An example of the information sheet for ship risk calculation used in the Tokyo MoU is presented in Table 2.1. Under the NIR, ships are divided into three risk profiles in accordance with their calculated risk levels, namely high-risk ships (HRS), standard-risk ships (SRS), and low-risk ships (LRS). An inspection time window is then attached to each SRP, as shown in Table 2.2. It is also noted that the NIR for ship risk calculation can differ somewhat among MoUs, including in the information sheet used and the time windows attached.
* Here, company refers to the ISM company.
Table 2.1 Information sheet of NIR adopted by the Tokyo MoU
inspection(s). It can be carried out on board a ship or remotely (which is also called a remote follow-up inspection). A general PSC inspection is free of charge, while port state charges can be expected for follow-up inspections.
2.2.4 Inspection results
After a PSC inspection, the inspection results, i.e., the deficiency(ies) identified and the detention decision, are recorded in the relevant forms and reported to the central database of the corresponding MoU. To be more specific, deficiency codes are derived from major maritime conventions and regulations; they are specified and constantly updated by individual MoUs and might differ slightly among the MoUs. Deficiency codes adopted by the Paris MoU (effective from 1 July 2021) are shown in Table 2.3, and those adopted by the Tokyo MoU (effective from 2 December 2019) are shown in Table 2.4.

A ship is detained if the PSCO decides that it is unseaworthy due to the identification of serious deficiency(ies), while the captain has the right to appeal against a detention order. Although the general purpose of PSC is to prevent a substandard ship from proceeding to sea, the detention of a ship is a serious matter involving many issues, and thus PSCOs are cautious in making detention decisions to avoid undue delays.
Regarding the inspection information reported in central databases, some MoUs offer rough inspection results, such as the Caribbean MoU† and the Riyadh MoU‡. More detailed ship information together with the inspection results is provided by some other MoUs, such as the Paris MoU§, Tokyo MoU¶, Abuja MoU**, Black Sea MoU††, Indian Ocean MoU‡‡, and Mediterranean MoU§§. In particular, statistics on inspections and their results are offered by the Paris MoU in a graphical user interface, which is more convenient and intuitive for users¶¶.
The data used in the examples for algorithm illustration and code implementation are mainly from PSC inspection records at the Hong Kong port. Unless otherwise stated, the case data set used in this book consists of 1 991 initial inspection records from 2015 to 2017 at the Hong Kong port. The following fields are collected from the website operated by the Tokyo MoU: inspection date, IMO number (IMO no.), ship type, keel laid date, deadweight tonnage (DWT), gross tonnage (GT), SRP, classification society, flag state, flag/RO/company performance given by the Tokyo MoU (flag performance, RO performance, and company performance), last date of initial inspection (last inspection date), number of deficiencies in the last initial inspection (last deficiency number), detention in the last initial inspection (last detention), deficiency number of the current inspection (deficiency number), and detention of the current inspection (detention). In addition, the specific deficiency codes and detainable deficiency codes are also collected. Among them, “deficiency number” and “detention” are potential target variables, and the others are potential input features. Three typical samples from the initial data set directly retrieved from the database are given in Tables 2.5 and 2.6.
† http://www.caribbeanmou.org/content/inspection-detention-data
‡ https://www.riyadhmou.org/basicsearch.html?lang=en
§ https://www.parismou.org/inspection-search/inspection-search
¶ https://www.tokyo-mou.org/inspections_detentions/psc_database.php
** http://www.abujamou.org/index.php?pid=125disclaimer
†† http://www.bsmou.org/database/inspections/
‡‡ https://www.iomou.org/HOMEPAGE/search_insp.php?l1=4&l2=26
§§ http://www.medmouic.org/Advanced
¶¶ https://www.parismou.org/inspection-search/inspections-results-kpis
Table 2.5 Five samples in the PSC case data set (part 1)

No. | Inspection date | IMO no. | Ship type | Keel laid date | DWT | GT | SRP | Classification society | Flag state
1 | December 29, 2017 | 9332626 | Oil tanker | February 2, 2008 | 74997 | 42893 | SRP | DNV GL AS | Norway
2 | December 27, 2017 | 9254484 | Bulk carrier | May 24, 2002 | 52364 | 30011 | HRS | Indian Register of Shipping | India
3 | December 27, 2017 | 9760500 | Oil tanker | October 2, 2015 | 106359 | 57164 | SRS | Lloyd’s Register | Liberia
Table 2.6 Five samples in the PSC case data set (part 2)

No. | Flag performance | RO performance | Company performance | Last inspection date | Last deficiency number | Last detention | Deficiency number | Detention | Deficiency codes | Detainable deficiency codes
1 | White | High | High | September 30, 2013 | 0 | No | 4 | No | 01314/05118/10106/07110* | None
2 | White | High | Low | November 19, 2015 | 9 | No | 14 | Yes | 03102/10118/11101/03109/03107/11101/11101/11101/04102/04116/14402/03105/03108/03105/07113† | 11101
3 | White | High | Medium | None | None | None | 0 | No | None | None

*The nature of these codes is certificates and documentation - documents (SOPEP (The Shipboard Oil Pollution Emergency Plan)), radio communications (operation of GMDSS (Global Maritime Distress and Safety System) equipment), safety of navigation (compass correction log), and fire safety (fire fighting equipment and appliances), respectively.
†The nature of these codes is water/weathertight conditions (freeboard marks), safety of navigation (speed and distance indicator), water/weathertight conditions (machinery space openings), water/weathertight conditions (doors), life-saving appliances (lifeboats), emergency systems (emergency fire pump and its pipes), emergency systems (means of communication between safety centre and other control stations), pollution prevention - MARPOL Annex IV (sewage treatment plant), water/weathertight conditions (covers (hatchway, portable, tarpaulins, etc.)), water/weathertight conditions (ventilators, air pipes, and casings), and fire safety (fire pumps and its pipes), respectively.
The three records above have their own characteristics. Record no. 1 is a common PSC inspection in the data set: an un-detained inspection with a previous inspection available. Record no. 2 is a detained inspection record with a large number of deficiencies detected (14 in total), including detainable deficiencies. Record no. 3 is for a ship that had not previously been inspected by any port in the Tokyo MoU, and thus it has no historical inspection information in the database.
References
[9] Annual report [online]. Tokyo: Tokyo MoU. Available from http://www.tokyo-mou.org/doc/ANN19-f.pdf [Accessed 25 Aug 2022].
[10] Selection of ships for inspection [online]. London: Mediterranean MoU. 2014. Available from http://www.medmou.org/Basic_Principlse.aspx#3 [Accessed 10 Dec 2020].
[11] Information sheet on the new inspection regime (NIR) [online]. Tokyo: Tokyo MoU. 2014. Available from http://www.tokyo-mou.org/doc/NIR-information%20sheet-r.pdf [Accessed 3 Mar 2022].
[12] List of Paris MoU deficiency codes [online]. Paris: Paris MoU. 2021. Available from https://www.parismou.org/list-paris-mou-deficiency-codes [Accessed 15 Jan 2022].
[13] Tokyo MoU deficiency codes [online]. Tokyo: Tokyo MoU. 2019. Available from http://www.tokyo-mou.org/publications/tokyo_mou_deficiency_codes.php [Accessed 15 Jan 2022].
Chapter 3
Introduction to data-driven models
This chapter aims to clarify the basic issues of data-driven modeling and its application in maritime transport. The predictive problem is first introduced, and predictive analysis in the field of maritime transport is then discussed with typical examples. A comparison between classic methods and data-driven modeling for prediction tasks is also provided. Finally, typical data-driven modeling approaches are briefly discussed.
connections between the system state variables, i.e., the input, internal, and output
variables.
trajectory points from vessel automatic identification system (AIS) data. These time-series data are then mined by classic ML models, such as extreme learning machines, support vector machines (SVMs), and k-nearest neighbors (KNNs), as well as more advanced deep-learning models like recurrent neural networks, so as to discover the sailing patterns that can be further used for vessel trajectory prediction.
3.1.4.1 Statistical modeling

A statistical model is a mathematical model representing the data generation process based on a set of statistical assumptions. In statistical modeling, regression analysis is a set of statistical processes that aims to estimate the relationship between a dependent variable (i.e., the output) and a set of independent variables (i.e., the input) using a training set constituted by historical data, so as to predict the output of an unseen example (i.e., an example that is not used for constructing and tuning the model). After data acquisition, the statistical modeling process begins with data preprocessing and feature extraction. As statistical modeling relies on statistical assumptions, reasonable assumptions are made on the data generation process, and then suitable regression models are selected. Next, the parameters of the regression models are estimated using the historical data. Finally, the fit and generalization abilities of the models are tested. Popular statistical models for prediction tasks are linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, lasso regression, and elastic net regression. Given a dataset with $n$ samples and $m$ features denoted by $D = \{(x_i, y_i), i = 1, \ldots, n\}$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$, the basic forms and common parameter estimation methods of these popular statistical models are shown in Table 3.2.
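As a concrete illustration of this workflow, the following minimal sketch (our own, with synthetic placeholder data rather than the book's case data set) fits one of the models of Table 3.2, logistic regression, and assesses it on unseen examples:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data generation: two features and a binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Split the historical data into a training set and an unseen test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Estimate the parameters theta of P(y = 1 | x; theta) = 1 / (1 + exp(-theta^T x)).
model = LogisticRegression().fit(X_train, y_train)

# Assess generalization on the unseen examples.
print(model.score(X_test, y_test))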
3.1.4.2 ML modeling

ML modeling belongs to the area of artificial intelligence, which aims to enable a computer to think and act like humans and to solve complex tasks without, or with only little, human intervention. A commonly used definition is given by Tom Mitchell: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
Table 3.1 Comparison between theory-based modeling and data-driven modeling

Theory-based modeling
Advantages:
• System behavior and the prediction results are fully tractable and interpretable.
• No or few historical data of the system are needed, and thus it can be applied in the initial application stage with low data acquisition costs.
• It strictly complies with domain knowledge and thus can be more acceptable to practitioners.
Disadvantages:
• The prediction accuracy can be low, as it highly depends on the assumptions made.
• Much a priori knowledge is needed.
• As the model structure and parameters are fixed, its performance cannot be improved as data accumulate.

Data-driven modeling
Advantages:
• With proper data, features, and model, its prediction accuracy can be very high.
• Little a priori knowledge is needed.
• It can be constantly improved with the accumulation of data and the refinement of the model.
Disadvantages:
• It is usually black-box in nature, and thus the model-working mechanism and the predictions generated are highly likely to be unexplainable.
• Much historical data are needed for model construction, and the data acquisition costs can be very high.
• It can be very sensitive to data quality and quantity.
• The training data can be over-fitted, leading to poor generalization ability.
• The predictions might contradict domain knowledge.
Table 3.2 Basic forms and common parameter estimation methods of popular statistical models

Linear regression
Basic form: $y = \beta_1 x_1 + \ldots + \beta_m x_m + \epsilon$. If $m = 1$, it is simple linear regression; otherwise, it is multiple linear regression.
Common parameter estimation method: least squares.

Logistic regression (for binary classification problems)
Basic form: given input $x$ and model parameters $\theta$, the probability of $y = 1$ is calculated by $P(y = 1 | x; \theta) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$.
Common parameter estimation method: weighted least squares and maximum likelihood estimation.

Polynomial regression
Basic form: $y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \ldots + \beta_h X^h + \epsilon$, where $X$ is the feature matrix of the training set.
Common parameter estimation method: least squares.

Stepwise regression
Basic form: stepwise regression is used to address the multicollinearity of variables in linear regression models by automatically choosing predictive variables. There are three main approaches for stepwise regression.
• Forward selection: starting from no variable in the model and adding the variables from the largest to the least contribution evaluated by F-test until no more variable(s) can be added.
• Backward elimination: starting from all candidate variables in the model and deleting each variable whose loss gives the least statistically significant deterioration of the model fit evaluated by F-test until no more variable(s) can be deleted.
• Bidirectional elimination: a combination of forward selection and backward elimination to ensure adding the variable with the largest contribution in each step, while keeping all the added variables in all runs statistically significant (otherwise, statistically insignificant variable(s) will be deleted).
Common parameter estimation method: least squares combined with F-test and t-test.

Ridge regression (regularized by an L2 term)
Basic form: the regression function format is similar to linear regression models. An L2 regularization term is included in the cost function to address the over-fitting problem as $l = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{m} \|\beta_j\|_2^2$, $\lambda > 0$.
References
[1] Yan R., Wang S., Zhen L., Laporte G. ‘Emerging approaches applied to mari-
time transport research: past and future’. Communications in Transportation
Research. 2021;1:100011.
[2] Yan R., Wang S., Psaraftis H.N. ‘Data analytics for fuel consumption man-
agement in maritime transportation: status and perspectives’. Transportation
Research Part E. 2021;155:102489.
Chapter 4
Key elements of data-driven models
This chapter aims to thoroughly introduce and discuss the major issues of data-driven
models, with a focus on machine learning (ML) models and their construction procedure.
We first compare three popular data-driven models, namely, statistical model, ML model,
and deep learning (DL) model. Then, the whole procedure of developing a data-driven
model will be provided. Key elements from various aspects of the whole process will be
covered in detail.
We first compare the differences among the three popular types of data-driven models, namely the statistical model, the ML model, and the DL model, from various perspectives, as presented in Table 4.1. Note that although DL is a type of ML, as DL models are based on deep neural networks (DNNs) consisting of multiple hidden layers, we treat the two separately because they have several major differences. In the remainder of this book, our main focus will be on using ML models to address practical problems in maritime transport.
* Note that the basic sequence of the four steps in the rounded rectangle is data collection → feature engineering → model construction → model evaluation, selection, and refinement. In practice, some of them may be repeatedly operated to satisfy the requirements.
Although the sequence of these four steps (from data collection to model evaluation, selection, and refinement) is generally as described, the steps might also alternate and repeat, depending on the gap between the requirements and expected outcomes on the one hand and the ML models developed so far on the other. Finally, prediction results and model explanations are obtained, and managerial insights are derived for improving operational efficiency. In the next subsections, we will discuss the key elements of this procedure in detail.
We briefly introduce common data sources that are popular in maritime transport research as follows.
* https://ihsmarkit.com/products/maritime-ships-register.html
† https://www.clarksons.net/portal/
‡ https://www.marinetraffic.com/en/ais/home/centerx:-12.0/centery:25.0/zoom:4
§ https://ww2.eagle.org/en.html
¶ http://www.shipwreckregistry.com/
water depth sensor. Data are automatically collected and then stored in the vessel’s
central database and transmitted to port authorities.
** https://gisis.imo.org/Public/MCI/Default.aspx
†† https://www.mardep.gov.hk/en/publication/publications/reports/ereport.html
‡‡ https://www.mardep.gov.hk/en/fact/portstat.html
For example, if an inspection record says that the ship type is “handymax carrier” and the deadweight tonnage is 80,000, the record contains an outlier, as the deadweight tonnage of handymax carriers is between 40,000 and 60,000. We only discuss univariate outliers in this subsection. Common approaches to univariate outlier detection are listed in Table 4.4.
After an outlier is detected, whether to drop it needs careful consideration. Some outliers are detected due to inherent variations in feature values, and they are normal values instead of noise. For example, a ship age of 42 is detected as an outlier in the PSC data set used in Reference 3. Nevertheless, the oldest ship still afloat in the world is more than 200 years old, and thus the 42-year-old ship should not be regarded as an outlier. In fact, all other conditions being equal, older ships are usually associated with a larger number of deficiencies and a higher detention probability and are thus the focus of PSC. Therefore, this detected outlier should not be deleted or modified. The above example is just one case where the detected outliers should be kept. When there are many outliers (e.g., more than 30% of the samples) or the outliers can be rectified by capping, assigning new values, or transforming the data, they should not be dropped. In some other cases, the outliers should be dropped, e.g., when it is certain that they are wrong, when there are many data but few outliers, or when they come from exceptional samples that should not be included. A small sketch of a common detection rule is given below.
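One widely used univariate detection rule (shown here as our own sketch; whether it is among the approaches of Table 4.4 is not assumed) flags values lying more than 1.5 interquartile ranges beyond the quartiles:

import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask flagging values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

ages = np.array([3, 5, 7, 9, 10, 12, 14, 15, 18, 42])  # placeholder ship ages
print(iqr_outliers(ages))  # 42 is flagged; whether to drop it is a separate decision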
One-hot encoding
Idea: convert each category value into a new column and assign “1” or “0” to the new columns, where “1” is assigned to the column of the category the sample belongs to, and “0” otherwise.
Example: feature “ship type” has three values: “container,” “bulk carrier,” and “passenger ship.” We first convert this feature into three features: “is_container,” “is_bulk_carrier,” and “is_passenger_ship,” and encode “container” as “(1, 0, 0),” “bulk carrier” as “(0, 1, 0),” and “passenger ship” as “(0, 0, 1),” where each digit is for one new feature.
Application scenarios: nominal features. Note that although one-hot encoding can also be applied to ordinal features, the order cannot be retained.

Target encoding
Idea: convert each value in a column to a number considering the values’ inherent order.
Example: feature “ship age” has three values: “young,” “middle-aged,” and “old.” We encode “young,” “middle-aged,” and “old” as “1,” “2,” and “3.”
Application scenarios: ordinal features.
Both supervised and unsupervised methods can be used for feature selection. Supervised feature selection techniques use the prediction target and can be further divided into three categories: filter, wrapper, and embedded. Unsupervised feature selection techniques ignore the target. We present common approaches for supervised feature selection as follows. Readers are referred to [4] for more information on unsupervised feature selection techniques.
The idea is first to calculate the $\chi^2$ statistic between each of the categorical features and the categorical target. Then, a certain number of features with the highest $\chi^2$ statistics, or the features with $\chi^2$ statistics above a given threshold, are selected as the final feature subset. A brief sketch of this filter is given below.
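In scikit-learn, this filter can be realized with SelectKBest and the chi2 score function; the following sketch (placeholder data of our own) keeps the two highest-scoring features:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Placeholder data: chi2 requires non-negative features, e.g., counts or
# one-hot encoded categories.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 4))
y = rng.integers(0, 2, size=100)

selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # chi-square statistic per feature
print(X_selected.shape)   # (100, 2): the two best features are kept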
$\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$,

$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.
For the metrics mentioned above, MSE is the most widely used metric for regression problems, as it is long-established. It is also differentiable, and thus many good algorithms can be applied to optimize loss functions based on MSE. However, the unit of MSE is the square of the original data unit, which makes it hard to interpret. To remedy this issue, RMSE is proposed, whose unit is consistent with the original data unit. Both metrics can be very sensitive to outliers in the test set, especially MSE, where the square operation is applied. This problem can be largely addressed by MAE, whose unit also complies with the original data and which is not as sensitive to outliers as MSE. However, as MAE contains an absolute value function that is not differentiable everywhere, the optimization process is difficult. MAPE is a relative error rate that also considers the scale of the original output, but the original output values cannot be zero. $R^2$ is also widely used in statistical analytics; it represents the proportion of the variance in the output that can be explained by the features. It can be seen that the meanings as well as the pros and cons of the metrics for regression problems differ from each other.
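All of these metrics are available in the scikit-learn API; a short sketch (with made-up values) computes them side by side:

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # placeholder targets
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # placeholder predictions

mse = mean_squared_error(y_true, y_pred)
print(mse)                                   # MSE, in squared units
print(np.sqrt(mse))                          # RMSE, back in the original unit
print(mean_absolute_error(y_true, y_pred))   # MAE, less sensitive to outliers
print(mean_absolute_percentage_error(y_true, y_pred))  # MAPE, y_true must be nonzero
print(r2_score(y_true, y_pred))              # share of variance explained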
4.2.5.1.1 Accuracy
Accuracy is a classic and intuitive classification metric, which measures the ratio of the samples that are correctly classified to all the samples in the test set, calculated as follows:

$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

Accuracy is suitable for balanced classification problems, where the numbers of samples in the positive class and the negative class are similar. If the problem is imbalanced, where the positive class is very sparse, always predicting the target to be negative will yield a very high accuracy.
Example 1. When the ratio of ships that are detained (positive class) to ships that
are not detained (negative class) is 1:99, predicting all the ships to be not detained
could achieve a 99% accuracy rate, but the model is meaningless as the detained
ships are indeed our focus.
Therefore, metrics suitable for imbalanced problems are needed.
4.2.5.1.2 Precision
Precision aims to evaluate the proportion of predicted positive samples that are
indeed positive. It takes the following form:
$\text{Precision} = \frac{TP}{TP + FP}$
4.2.5.1.3 Recall

Precision is from the perspective of samples predicted to be positive. From the perspective of samples that are actually positive, the metric Recall aims to evaluate the proportion of actual positive samples that are correctly predicted. It takes the following form:

$\text{Recall} = \frac{TP}{TP + FN}$

In Example 1, $\text{Recall} = 0$ as $TP = 0$. Recall aims to capture as many positive samples as possible. It is also noted that if we predict all samples to be positive, we have $\text{Recall} = 1$.
Precision and Recall are somewhat contradictory, as they view the same problem from opposite angles: Precision only looks at the samples predicted to be positive, while Recall takes all the actual positive samples into account.
4.2.5.1.4 F1 score

To reach a trade-off between Precision and Recall, the metric F1 score, which is the harmonic mean of Precision and Recall, takes the following form:

$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

When $TP + FP = 0$, where Precision is undefined, and when $TP + FN = 0$, where Recall is undefined, $F_1 = 0$. The range of the F1 score is between 0 and 1, and a higher value indicates better performance of a classification model. The F1 score treats Precision and Recall equally. In case domain knowledge assigning different weights to Precision and Recall should be included, a more generic $F_\beta$ score can be applied, which takes the following form:

$F_\beta = \frac{(1 + \beta^2) \times \text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$

$\beta$ is chosen such that Recall is considered $\beta$ times as important as Precision. When $\beta = 1$, $F_\beta$ degenerates to $F_1$. When $\beta > 1$, Recall has a larger impact on $F_\beta$; when $\beta < 1$, Precision has a larger impact.
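These metrics can be computed directly with scikit-learn. The sketch below (a toy label vector of our own) revisits the imbalance scenario of Example 1 and shows why Accuracy alone is misleading:

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, fbeta_score)

# 1 detained ship (positive) among 100; the model predicts "not detained" for all.
y_true = np.array([1] + [0] * 99)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))                     # 0.99, yet useless
print(precision_score(y_true, y_pred, zero_division=0))   # 0: TP + FP = 0
print(recall_score(y_true, y_pred))                       # 0: the detained ship is missed
print(f1_score(y_true, y_pred, zero_division=0))          # 0
print(fbeta_score(y_true, y_pred, beta=2, zero_division=0))  # 0; beta > 1 favors Recall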
Given a data set (with limited samples) and a threshold, a pair $(FPR, TPR)$ can be obtained. By changing the threshold from the smallest to the largest predicted value/probability over all the samples, different $(FPR, TPR)$ pairs can be obtained, and linking all the consecutive pairs together with $(0, 0)$ and $(1, 1)$ produces the ROC curve, an illustration of which is given by the solid line in Figure 4.2. As the test set contains limited samples, the solid line is ragged. The dotted diagonal line corresponds to the ROC curve of random guessing.

Figure 4.2 shows that as the threshold decreases, more samples are predicted to be positive, and $FPR$ and $TPR$ increase simultaneously from $(0, 0)$, where the threshold is 1 with no sample predicted to be positive, to $(1, 1)$, where the threshold is 0 with no sample predicted to be negative. A perfect classifier has $(FPR, TPR) = (0, 1)$. Therefore, the closer an ROC curve is to $(0, 1)$, the better the classifier.
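The threshold sweep described above is implemented by roc_curve in scikit-learn; a compact sketch with made-up scores:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])                 # placeholder labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # predicted probabilities

# Each distinct score acts as a threshold, yielding one (FPR, TPR) pair.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(np.column_stack([thresholds, fpr, tpr]))

# A standard scalar summary of the curve is the area under it.
print(roc_auc_score(y_true, y_score))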
4.2.5.2 Model selection
An ideal ML model should have good generalization ability, or low generalization error, i.e., it should perform well on unseen data. Therefore, there is a trade-off between over-fitting and under-fitting when an ML model is constructed: if the model is too complex, the training error can be very low or even zero, but the possibility of over-fitting the training data is high, leading to poor generalization ability of the constructed ML model. In contrast, a too simple ML model might lead to under-fitting and likewise poor generalization ability. An ideal ML model can be
selected by the metrics introduced in section 4.2.5.1, considering different scenarios and requirements. Model selection aims to estimate the performance of different ML models using proper metrics in order to choose the best one. To achieve effective selection, one or several validation sets that are mutually exclusive with the training set and the test set should be formed and used for model evaluation and selection. The ML model with the lowest prediction error on the validation set(s) (where we expect that it also has the lowest prediction error on the unseen test set) should be selected. Common data set division methods used for model selection are given below.
A common approach is to randomly divide the whole data set into a training set,
a validation set, and a test set, where the training set is used for model construction,
the validation set is used for model selection, and the test set should not be used until
the final ML model is constructed and the generalization error of the final chosen
model is assessed. There is no general rule to decide the number or ratio of samples
in these three sets, as it depends on the size and features of the whole data set.
Common ratios of these sets are 50:25:25, 60:20:20, and 70:15:15 (in percent). To reduce the uncertainty brought by the division of the data set, the random partition should be conducted multiple times, and the generated results should be averaged.
Another popular approach is called cross validation. After splitting a test set from the whole data set $D$, the remaining samples in $D$, denoted by $D'$, are divided into $k$ ($k > 2$, with common values 3, 5, and 10) mutually exclusive subsets, i.e., $D' = D_1 \cup D_2 \cup \ldots \cup D_k$ and $D_i \cap D_j = \emptyset$, $i = 1, \ldots, k$, $j = 1, \ldots, k$, $i \neq j$. Cross validation in this form is also called $k$-fold cross validation. For each of the $k$ training runs, one subset is selected as the validation set, and the remaining subsets form the training set. Then, $k$ model performance results can be obtained, and the average performance can be calculated. An illustration of 10-fold cross validation is given in Figure 4.3. When $k = n'$, where $n'$ is the size of $D'$, each sample is given the opportunity to be used as the hold-out validation set. This approach is also called leave-one-out cross validation. In each training run, all the samples except the one in the validation set are used to train the ML model.
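A sketch of $k$-fold cross validation with scikit-learn on placeholder data (here $k = 5$; cross_val_score handles the fold bookkeeping):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # placeholder features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold CV: each fold serves once as the validation set.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores)         # one R^2 per fold
print(scores.mean())  # average performance across folds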
Hyperparameters control the complexity of the model, that is, how flexible the model is in fitting the data. Therefore, they can be used to control over-fitting and to avoid under-fitting. Although default values of the hyperparameters are given in ML models implemented by open-source packages, finding their best values for a specific data set is not a trivial task, especially in ML models that have multiple hyperparameters that interact in nonlinear ways.
Hyperparameter tuning aims to search for the best values for a set of hyperpa-
rameters that result in the best model performance on a given data set. Denote the
number of hyperparameters that need to be optimized by p. The tuning procedure
starts by defining a search space for the p hyperparameters to be tuned, and then
different hyperparameter settings constructed from the search space are evalu-
ated by a tuning algorithm. Finally, the hyperparameter setting leading to the best
model performance on the validation sets is output, which is used to train the final
model.
The most commonly used hyperparameter tuning algorithm is grid search, which
picks out a grid of values of the hyperparameters concerned and evaluates them one
by one. It is also noted that many hyperparameters take values in the range of real
numbers. However, in grid search, a range of their values and the searching interval
are first given, and thus finite candidate values are contained in the search space.
For example, for an ML model with $p$ hyperparameters $h_1$ to $h_p$, the search space is

$h_1 \in \{h_1^1, h_1^2, \ldots, h_1^{v_1}\}$,
$\vdots$
$h_p \in \{h_p^1, h_p^2, \ldots, h_p^{v_p}\}$.

Then, a grid of $v_1 \times \ldots \times v_p$ hyperparameter settings can be formulated, with each constituted by one possible value of $h_1$ to $h_p$. All hyperparameter settings are then tested, and the hyperparameter setting with the best performance on the validation set or in cross validation is returned.
Grid search is quite simple and intuitive. However, it is very expensive in terms of computation power. When the ML model is complex and the number of hyperparameters is large, the computation time might become unaffordable. Instead of evaluating the entire search space, random search has been proposed, which only evaluates a certain number of random hyperparameter settings (the number is much smaller than $v_1 \times \ldots \times v_p$) in the grid. Theoretically, the performance of the best hyperparameter setting found by random search is highly likely to be worse than that found by grid search. Nevertheless, on many public datasets, Bergstra and Bengio (2012) [5] found that random search can perform about as well as grid search with much less computation power needed.
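Both strategies are implemented in scikit-learn as GridSearchCV and RandomizedSearchCV; the following sketch (placeholder data and an arbitrary model choice of our own) tunes two hyperparameters:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # placeholder data
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Finite candidate values for p = 2 hyperparameters (a 3 x 3 grid).
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, 8]}

grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
grid.fit(X, y)                                 # evaluates all 9 settings
print(grid.best_params_)

# Random search samples only n_iter of the settings.
rand = RandomizedSearchCV(RandomForestRegressor(random_state=0), param_grid,
                          n_iter=4, cv=3, random_state=0)
rand.fit(X, y)
print(rand.best_params_)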
In recent years, smart hyperparameter tuning algorithms have been developed. Unlike the grid search and random search algorithms, which form the candidate hyperparameter settings in advance and then evaluate all or part of them one by one, smart hyperparameter tuning algorithms pick only a few hyperparameter settings, evaluate their quality, and then decide where to sample next. The drawback is that the former two tuning algorithms are parallelizable, while the latter are sequential. Popular algorithms are Bayesian optimization [6], derivative-free optimization [7], and random forest smart tuning [8].
References
[1] Tu E., Zhang G., Rachmawati L., Rajabally E., Huang G.B. ‘Exploiting AIS data
for intelligent maritime navigation: a comprehensive survey from data to meth-
odology’. IEEE Transactions on Intelligent Transportation Systems. 2017;19
(5):1559–82.
[2] Yang D., Wu L., Wang S., Jia H., Li K.X. ‘How big data enriches maritime
research—a critical review of automatic identification system (AIS) data ap-
plications’. Transport Reviews. 2019;39(6):755–73.
[3] Yan R., Wang S., Peng C. ‘An artificial intelligence model considering data imbalance for ship selection in port state control based on detention probabilities’. Journal of Computational Science. 2021;48:101257.
[4] Solorio-Fernández S., Carrasco-Ochoa J.A., Martínez-Trinidad J.F. ‘A review
of unsupervised feature selection methods’. Artificial Intelligence Review.
2020;53(2):907–48.
[5] Bergstra J., Bengio Y. ‘Random search for hyper-parameter optimization’.
Journal of Machine Learning Research. 2012;13(2):281–305.
[6] Snoek J., Larochelle H., Adams R.P. ‘Practical Bayesian optimization of machine learning algorithms’. Advances in Neural Information Processing Systems. 2012;25.
[7] Conn A.R., Scheinberg K., Vicente L.N. Introduction to derivative-free opti-
mization [online]. SIAM; 2009 Jan. Available from http://epubs.siam.org/doi/
book/10.1137/1.9780898718768
[8] Hutter F., Hoos H.H., Leyton-Brown K. ‘Sequential model-based optimization
for general algorithm configuration’. International Conference on Learning
and Intelligent Optimization; 2011. pp. 507–23.
[9] Molnar C. Interpretable machine learning: a guide for making black box models explainable [online]. 2022 Mar 29. Available from https://christophm.github.io/interpretable-ml-book [Accessed 31 Mar 2022].
Chapter 5
Linear regression models
Linear regression aims to learn a linear model that can predict the target using the features as accurately as possible. The assumption of linear regression models is that the target is linearly correlated with the features, i.e., the regression function $E(y|x)$ is linear in $x$, where $E(\cdot)$ denotes expectation. If the assumption can (almost) be satisfied, linear regression can be comparable to, or can even outperform, fancier non-linear models. The linear regression model is one of the most classic models for prediction tasks, and it is still widely used in the computer and big data era, thanks in particular to its intuitiveness and interpretability. In the following sections, we first introduce simple linear regression models (with a single feature) and the least squares method, which aims to find the optimal parameters of a linear regression model by minimizing the sum of squares of the residuals. Then, we discuss multiple linear regression (with more than one feature) and its extensions. Finally, we introduce shrinkage linear regression models.
5.1 Simple linear regression and the least squares

Simple linear regression uses only one feature to predict the target. For example, we use ship age to predict the number of deficiencies found in a PSC inspection. Denote the training set with $n$ samples by $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and the feature vector by $x$. Simple linear regression aims to develop a model of the following form:

$\hat{y}_i = w x_i + b$,

where $\hat{y}_i$ is the predicted target for sample $i$, $w$ is the parameter weight, and $b$ is the bias. $w$ and $b$ need to be learned from $D$. A natural question is then: what are good $w$ and $b$? In other words, how do we find the values of $w$ and $b$ such that the predicted target is as accurate as possible? The key point of developing a simple linear regression model is to evaluate the difference between $\hat{y}_i$ and $y_i$, $i = 1, \ldots, n$, using a loss function and to adopt the values of $w$ and $b$ that minimize the loss function. In a regression problem, the most commonly used loss function is the mean squared error (MSE), where $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Therefore, the learning objective of simple linear regression is to find the optimal $(w^*, b^*)$ such that the MSE is minimized.

The above idea can be presented by the following mathematical expressions:
P
n
(w , x ) = arg min (yi yOi )2
(w,b) i=1
Pn . (5.1)
= arg min (yi wxi b)2
(w,b) i=1
This idea is called the least squares method. The intuition behind it is to minimize the sum of squared vertical distances between all the samples and the regression line determined by $w$ and $b$. It can easily be shown that the MSE is convex in $w$ and $b$, and thus $(w^*, b^*)$ can be found by setting the partial derivatives to zero:

$\frac{\partial \text{MSE}}{\partial w} = 2\left(w\sum_{i=1}^{n}x_i^2 - \sum_{i=1}^{n}(y_i - b)x_i\right) = 0 \;\Rightarrow\; w^* = \frac{\sum_{i=1}^{n} y_i\left(x_i - \frac{1}{n}\sum_{i=1}^{n} x_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}.$  (5.2)
The optimal $w^*$ is first found by Equation (5.2), and it can then be used to calculate the optimal value of $b$, denoted by $b^*$, as follows:

$\frac{\partial \text{MSE}}{\partial b} = 2\sum_{i=1}^{n}(w^* x_i + b - y_i) = 0 \;\Rightarrow\; b^* = \frac{1}{n}\sum_{i=1}^{n}(y_i - w^* x_i).$  (5.3)
Simple linear regression can easily be realized with the scikit-learn API [1] in Python. Here is an example of using ship age to predict the ship deficiency number with simple linear regression.

Example 5.1: In this problem, the only feature is ship age and the target is the ship deficiency number. $(w^*, b^*)$ is estimated on the training set using the LinearRegression method in the scikit-learn API. The core code is as follows:
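The original listing is not reproduced here; the following is a minimal sketch of this step, in which the data file and column names (psc_inspections.csv, age, deficiency_no) are assumed for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("psc_inspections.csv")          # hypothetical file name
X = data[["age"]]                                  # single feature: ship age
y = data["deficiency_no"]                          # target: deficiency number
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # least squares fit
print(model.coef_, model.intercept_)               # learned (w*, b*)
```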
Therefore, the relationship between ship age and deficiency number learned by simple linear regression is:

deficiency number = 0.2648 × age + 1. (5.4)
Model performance is then validated on the hold-out test set, which can also easily be done with the scikit-learn API. Finally, we have MSE = 20.3471, RMSE = 4.5108, MAE = 2.9650, and $R^2$ = 0.0713. Recall that the definitions of MSE, RMSE, and MAE are given in section 4.2.4 of chapter 4. Note that in this example, $R^2$ is very small, indicating that the model is highly likely to be ineffective and that simple linear regression might not be suitable for this problem.
In more general cases, the feature dimension, denoted by $m$, is greater than one, i.e., two or more features are used to predict the target. For example, we use a linear regression model with five features: ship age, type, GT, last inspection time, and last deficiency number to predict the ship deficiency number of the current PSC inspection. This model is called multiple linear regression. Mathematically, multiple linear regression can be written as follows:

$\hat{y}_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_m x_{im} + b = \mathbf{x}_i \mathbf{w} + b,$  (5.5)

where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ and $\mathbf{w} = (w_1; w_2; \ldots; w_m)$.
To unify the parameters, Equation (5.5) can be written as

$\hat{y}_i = \tilde{\mathbf{x}}_i \tilde{\mathbf{w}}, \quad \tilde{\mathbf{x}}_i = (x_{i1}, x_{i2}, \ldots, x_{im}, 1), \quad \tilde{\mathbf{w}} = (w_1; w_2; \ldots; w_m; b).$  (5.6)
In a training set $D$ with $n$ samples, the matrix form of Equation (5.6) is

$\begin{pmatrix}\hat{y}_1\\\hat{y}_2\\\vdots\\\hat{y}_n\end{pmatrix} = \begin{pmatrix}x_{11} & x_{12} & \cdots & x_{1m} & 1\\x_{21} & x_{22} & \cdots & x_{2m} & 1\\\vdots & \vdots & \ddots & \vdots & \vdots\\x_{n1} & x_{n2} & \cdots & x_{nm} & 1\end{pmatrix}\begin{pmatrix}w_1\\w_2\\\vdots\\w_m\\b\end{pmatrix},$  (5.7)
or $\hat{\mathbf{y}} = \tilde{\mathbf{x}}\tilde{\mathbf{w}}$. The dimension of $\hat{\mathbf{y}}$ is $n \times 1$, the dimension of $\tilde{\mathbf{x}}$ is $n \times (m+1)$, and the dimension of $\tilde{\mathbf{w}}$ is $(m+1) \times 1$. A closed-form solution of the optimal $\tilde{\mathbf{w}}$ can be found by the least squares method, but the case is more complex than in simple linear regression. The MSE is calculated by $\text{MSE} = (\mathbf{y} - \tilde{\mathbf{x}}\tilde{\mathbf{w}})^T(\mathbf{y} - \tilde{\mathbf{x}}\tilde{\mathbf{w}})$, and its derivative with respect to $\tilde{\mathbf{w}}$ is

$\frac{\partial \text{MSE}}{\partial \tilde{\mathbf{w}}} = 2\tilde{\mathbf{x}}^T(\tilde{\mathbf{x}}\tilde{\mathbf{w}} - \mathbf{y}).$  (5.8)
If the matrix $\tilde{\mathbf{x}}^T\tilde{\mathbf{x}}$ is a full-rank matrix or a positive definite matrix, $\tilde{\mathbf{w}}^*$ can be calculated by setting Equation (5.8) to zero, which gives

$\tilde{\mathbf{w}}^* = (\tilde{\mathbf{x}}^T\tilde{\mathbf{x}})^{-1}\tilde{\mathbf{x}}^T\mathbf{y},$

where $(\tilde{\mathbf{x}}^T\tilde{\mathbf{x}})^{-1}$ is the inverse matrix of $\tilde{\mathbf{x}}^T\tilde{\mathbf{x}}$. The optimal multiple linear regression model solved by least squares is then

$\hat{y}_i = \tilde{\mathbf{x}}_i(\tilde{\mathbf{x}}^T\tilde{\mathbf{x}})^{-1}\tilde{\mathbf{x}}^T\mathbf{y}.$

If $\tilde{\mathbf{x}}^T\tilde{\mathbf{x}}$ is not a full-rank matrix, which occurs, for example, when the features outnumber the samples, then there will be multiple solutions $\tilde{\mathbf{w}}^*$, and the optimal one can be selected by introducing regularization, as discussed in section 5.4.
Multiple linear regression can also be realized with the scikit-learn API [1] in Python. Here is an example of using ship age, type, GT, last inspection time, and last deficiency number, whose details are provided in section 2.3 of chapter 2, to predict the ship deficiency number using multiple linear regression.
Example 5.2: Ship type is a nominal categorical feature. We first use one-hot encoding to encode this feature into six features: type_bulk_carrier, type_container_ship, type_general_cargo/multipurpose, type_passenger_ship, type_tanker, and type_other. For ships without a last inspection record, we set −1 for their features "last_inspection_time" and "last_deficiency_no." We then have a total of ten features, which are of varied ranges. For example, in the whole dataset, age ranges from 0 to 46, while GT ranges from 0 to 210,678. Therefore, min-max scaling (readers are referred to section 4.2.4 of chapter 4) is used for feature scaling. It should be noted that the min-max scaler should first be fit on the features in the training set and then applied to the features in both the training and test sets. The training set and test set after feature scaling are then used to train and test the multiple linear regression model. Similar to Example 5.1, $\tilde{\mathbf{w}}^*$ can be estimated on the training set using LinearRegression in the scikit-learn API. In this example, the ship deficiency number is predicted by Equation (5.11), where the features have been scaled by the min-max scaler:
deficiency number = 8.1786 × age + 0.1671 × type_bulk_carrier
− 0.4938 × type_container_ship + 3.9357 × type_general_cargo/multipurpose
+ 0.9053 × type_other − 2.9534 × type_passenger_ship
− 1.5609 × type_tanker − 2.9414 × GT
− 2.8232 × last_inspection_time + 13.6901 × last_deficiency_no + 2.3880. (5.11)
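The scaling-then-fitting procedure described above can be sketched as follows (a minimal sketch; the column names and the train/test split are assumed):

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

num_cols = ["age", "gt", "last_inspection_time", "last_deficiency_no"]  # assumed names
scaler = MinMaxScaler().fit(X_train[num_cols])          # fit on the training set only
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])   # then apply to the test set

mlr = LinearRegression().fit(X_train, y_train)          # X_train also holds the type dummies
```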
The weights of the features show their contributions to the number of ship deficiencies detected. For example, the ship deficiency number increases rapidly as ship age or the last deficiency number increases. In contrast, all other conditions being equal, a larger ship has fewer deficiencies. The longer the time since the last inspection, the fewer deficiencies are detected; this is because ships in worse conditions are more likely to be selected for inspection under the current SRP ship selection scheme, and thus they have shorter inspection intervals. Regarding ship type, being of type general_cargo/multipurpose, other, or bulk carrier increases the deficiency number, while being a passenger ship, tanker, or container ship decreases it. The above analysis shows that linear regression is fully interpretable, and its compliance with domain knowledge can easily be validated.
Model performance is then validated on the hold-out test set, and we have MSE = 15.9987, RMSE = 3.9998, MAE = 2.5948, and $R^2$ = 0.2698. The performance of the multiple linear regression model for ship deficiency number prediction is clearly much better than that of the simple linear regression model in Example 5.1, as many more features that can influence the target are considered.
5.3 Extensions of multiple linear regression

This section introduces two popular extensions of multiple linear regression: polynomial regression and logistic regression. As a relatively simple and basic ML model assuming a linear relationship between the features and the target, multiple linear regression is prone to the problem of underfitting. To address this problem, one viable approach is to incorporate non-linearity into the model. To retain the linear function form (and thus model interpretability), higher-order terms of the original features (e.g., quadratic, cubed, and product terms) are formed and included in the multiple linear regression model. The resulting model is called polynomial regression. In addition, multiple linear regression can only address regression problems. To extend multiple linear regression to classification problems, the continuous output given by multiple linear regression is mapped to classes using a surrogate function. For binary classification problems, the logistic function is a commonly used mapping function, and the resulting classification model is called logistic regression. Details of the two extensions are given in the following subsections.
5.3.1 Polynomial regression
As linear regression assumes a linear relationship between the original features and the target by constructing a straight line, underfitting occurs if the features and the target have a non-linear relationship, in which case the complexity of the model needs to be increased to improve model performance. One approach is first to add powers of the original features to form new (and more complex) features, and then to fit a linear function between the new features and the target. This approach is called polynomial regression, and the features constructed are called polynomial features. For example, given two features $(X_1, X_2)$ and power degree 2, each of them will be transformed into three features with orders from 0 to 2 (note that both features have the same value 1 when the order is 0), in addition to their product term. Therefore, the new features will be $(1, X_1, X_1^2, X_2, X_2^2, X_1X_2)$. Applying polynomial regression to Example 5.2 yields Example 5.3.
Example 5.3: The six binary features encoded from the nominal feature ship type are not processed further and are directly used in polynomial regression. Min-max scaling is applied to the numerical features age, GT, last inspection time, and last deficiency number, and their polynomial features of degree 2 are constructed with the scikit-learn API using the following code:
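A minimal sketch of this step (the variable names are assumed; X_train_num and X_test_num denote the four scaled numerical features from Example 5.2):

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)              # includes the order-0 constant term
X_train_poly = poly.fit_transform(X_train_num)   # 15 features from the 4 originals
X_test_poly = poly.transform(X_test_num)
```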
A total of 15 new features are thus constructed from the 4 original features, including 1 constant term, 4 linear terms, 4 quadratic terms, and 6 product terms. In total, there are 21 features (15 numerical features and 6 categorical features). Multiple linear regression is then applied to all 21 features. We omit the specific form as it is similar to that of Example 5.2. Model performance is then validated on the hold-out test set, and we have MSE = 15.4055, RMSE = 3.9250, MAE = 2.5140, and $R^2$ = 0.2969. The performance of polynomial regression is better than that of the multiple linear regression model in Example 5.2 and much better than that of the simple linear regression model in Example 5.1, as a more complex feature form is used.
5.3.2 Logistic regression
Suppose the prediction target is whether a ship is detained in an inspection, a binary variable with "1" indicating ship detention and "0" otherwise. Neither the simple nor the multiple linear regression model can be directly applied to this classification problem, as their output is continuous and unbounded, while we expect the output to be categorical and bounded. An intuitive method is to set a threshold to predict the probability of $y = 1$ given the input features $\mathbf{x}$, i.e., $P(y = 1|\mathbf{x})$. The unit-step function is a popular method to map a continuous output (denoted by $z$) to a probability (denoted by $\tilde{y}$), and takes the form shown in Figure 5.1.
However, Figure 5.1 shows that the output given by the unit-step function is discontinuous, making it hard to optimize. Therefore, a continuous, monotonic, and differentiable surrogate of the unit-step function called the logistic function is used, which takes the following form:

$\tilde{y} = \frac{1}{1 + e^{-z}},$  (5.12)

where $z = \tilde{\mathbf{x}}\tilde{\mathbf{w}}$ is the continuous output given by a multiple linear regression model. An illustration of the logistic function is shown in Figure 5.2.
Equation (5.12) can also be transformed as follows:
$\tilde{y} = \frac{1}{1 + e^{-z}} \;\Rightarrow\; z = \tilde{\mathbf{x}}\tilde{\mathbf{w}} = \ln\frac{\tilde{y}}{1 - \tilde{y}}.$  (5.13)

Regarding $\tilde{y}$ as the probability $P(y = 1|\mathbf{x})$ and $1 - \tilde{y}$ as $P(y = 0|\mathbf{x})$, we have

$z = \tilde{\mathbf{x}}\tilde{\mathbf{w}} = \ln\frac{\tilde{y}}{1 - \tilde{y}} \;\Rightarrow\; \tilde{\mathbf{x}}\tilde{\mathbf{w}} = \ln\frac{P(y = 1|\mathbf{x})}{P(y = 0|\mathbf{x})},$  (5.14)
and we can have

$P(y = 1|\mathbf{x}) = \frac{e^{\tilde{\mathbf{x}}\tilde{\mathbf{w}}}}{1 + e^{\tilde{\mathbf{x}}\tilde{\mathbf{w}}}}, \qquad P(y = 0|\mathbf{x}) = \frac{1}{1 + e^{\tilde{\mathbf{x}}\tilde{\mathbf{w}}}}.$  (5.15)
Then, the maximum likelihood method [2] can be used to estimate $\tilde{\mathbf{w}}$ so as to maximize the probability that the predicted target of each example equals its actual target. Numerical optimization algorithms, such as the gradient descent method and Newton's method, can be used to obtain the optimal $\tilde{\mathbf{w}}$, i.e., $\tilde{\mathbf{w}}^*$. We use the same features as Example 5.2 to predict ship detention. The procedure and results are shown in Example 5.4.
Example 5.4: Min-max scaling is again first applied to the numerical features age, GT, last inspection time, and last deficiency number. However, one critical issue in this example is that ship detention is a rare event: among all 1,592 samples in the training set, the target of only 85 of them is "1," a detention rate of about 5.34%. To overcome this data imbalance, we first use the synthetic minority oversampling technique (SMOTE) to oversample the minority class, with the aim of constructing a balanced dataset. Then, logistic regression is applied to predict ship detention. We use a pipeline combining SMOTE and logistic regression to first construct a balanced data set and then predict ship detention, based on the scikit-learn API, with the main code as follows:
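A minimal sketch of this step, assuming the imbalanced-learn (imblearn) package, which provides SMOTE and a pipeline compatible with scikit-learn:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),             # oversample the minority class
    ("logit", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)                        # SMOTE is applied only during fitting
y_pred = pipe.predict(X_test)
```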
The constructed model is then validated on the hold-out test set containing 399 examples (15 examples with detention), and the confusion matrix given in Table 5.1 is obtained. The average Precision, Recall, and F1 scores are 0.55, 0.79, and 0.54, respectively.
Table 5.1 Confusion matrix for ship detention prediction using logistic
regression
5.4 Shrinkage linear regression models

5.4.1 Ridge regression
Ridge regression imposes a penalty on the size of the regression coefficients using L2 regularization, where the loss function takes the following form:

$l = \sum_{i=1}^{n}\left(y_i - b - \sum_{j=1}^{m}x_{ij}w_j\right)^2 + \lambda\sum_{j=1}^{m}w_j^2, \quad \lambda > 0.$  (5.16)
$\lambda$ is a complexity parameter controlling the degree of shrinkage: a larger $\lambda$ means a greater amount of shrinkage. The objective of ridge regression is to find the optimal $\mathbf{w}^*$ such that

$\mathbf{w}^* = \arg\min_{\mathbf{w}}\left\{\sum_{i=1}^{n}\left(y_i - b - \sum_{j=1}^{m}x_{ij}w_j\right)^2 + \lambda\sum_{j=1}^{m}w_j^2\right\}.$  (5.17)
This is equivalent to solving the following optimization problem:

$\mathbf{w}^* = \arg\min_{\mathbf{w}}\sum_{i=1}^{n}\left(y_i - b - \sum_{j=1}^{m}x_{ij}w_j\right)^2, \quad \text{s.t.}\;\sum_{j=1}^{m}w_j^2 \le t,$  (5.18)
where there is a one-to-one relationship between $\lambda$ and $t$, and the size constraint on the parameters (i.e., the constraint on parameter values) is imposed explicitly in Equation (5.18). Ridge regression is effective in alleviating the problem of high variance brought about by correlated variables in multiple linear regression, by shrinking coefficients close to (but not exactly) zero. Note that the bias $b$, which is not directly related to the features, is excluded from the penalty term, as the penalty aims to regularize the coefficients of the features; the value of $b$ is still determined in Equation (5.18). An example of using ridge regression to predict ship deficiency number with the features of Example 5.2, based on the scikit-learn API, is as follows.
Example 5.5: Min-max scaling is again first applied to the numerical features age, GT, last inspection time, and last deficiency number. Ridge regression with hyperparameter tuning for $\lambda$ based on 5-fold cross-validation can easily be implemented with the RidgeCV method provided by the scikit-learn API. The main code is as follows:
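A minimal sketch of this step (the candidate grid of regularization strengths is assumed; scikit-learn calls the parameter alpha):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-3, 3, 50)                       # assumed candidate grid
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)
print(ridge.alpha_, ridge.coef_, ridge.intercept_)    # selected lambda and fitted weights
```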
The constructed ridge regression model for ship deficiency number prediction takes the following form:

deficiency number = 8.1786 × age + 0.1671 × type_bulk_carrier
− 0.4938 × type_container_ship + 3.9357 × type_general_cargo/multipurpose
+ 0.9053 × type_other − 2.9534 × type_passenger_ship
− 1.5609 × type_tanker − 2.9414 × GT
− 2.8232 × last_inspection_time + 13.6901 × last_deficiency_no + 2.3880. (5.19)
Compared to Equation (5.11) for ship deficiency number prediction using multiple linear regression, the absolute weight values associated with the features are slightly smaller in Equation (5.19) for ridge regression, while the relative importance of the features is similar in both models. Model performance is then validated on the hold-out test set, and we have MSE = 15.9981, RMSE = 3.9998, MAE = 2.5948, and $R^2$ = 0.2698, which is slightly better than the performance of the multiple linear regression model in Example 5.2.
5.4.2 LASSO regression
LASSO regression imposes L1 regularization on the size of the regression coefficients; its loss function takes a form similar to that of ridge regression except for the regularization term:

$l = \sum_{i=1}^{n}\left(y_i - b - \sum_{j=1}^{m}x_{ij}w_j\right)^2 + \lambda\sum_{j=1}^{m}|w_j|, \quad \lambda > 0.$  (5.20)
It is equivalent to solving the following optimization model to find the optimal $\mathbf{w}^*$:

$\mathbf{w}^* = \arg\min_{\mathbf{w}}\sum_{i=1}^{n}\left(y_i - b - \sum_{j=1}^{m}x_{ij}w_j\right)^2, \quad \text{s.t.}\;\sum_{j=1}^{m}|w_j| \le t.$  (5.21)
The optimal value of $b$ can also be determined by Equation (5.21). As the L1 regularization term involves the absolute value operation, LASSO regression can drive some coefficients exactly to zero, meaning that the corresponding features are entirely ignored by the regression model. The larger the value of $\lambda$, the fewer features are retained. Therefore, LASSO regression can also be used for feature selection in addition to overfitting control. An example of using LASSO regression to predict ship deficiency number with the features of Example 5.2, based on the scikit-learn API, is as follows.
Example 5.6: Min-max scaling is again first applied to the numerical features age, GT, last inspection time, and last deficiency number. Like ridge regression, LASSO regression with hyperparameter tuning for $\lambda$ based on 5-fold cross-validation can easily be implemented with the LassoCV method provided by the scikit-learn API. The main code is as follows:
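A minimal sketch of this step (by default, LassoCV selects its own grid of regularization strengths):

```python
from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
print(lasso.alpha_, lasso.coef_, lasso.intercept_)   # note the coefficients shrunk to zero
```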
The constructed LASSO regression model for ship deficiency number prediction takes the following form:
deficiency number = 8.0205 × age + 0 × type_bulk_carrier
− 0.6216 × type_container_ship + 3.7846 × type_general_cargo/multipurpose
+ 0.7083 × type_other − 2.7820 × type_passenger_ship
− 1.6691 × type_tanker − 2.9266 × GT
− 1.9954 × last_inspection_time + 13.6331 × last_deficiency_no + 2.5125. (5.22)
References
[1] Pedregosa F., Varoquaux G., Gramfort A., et al. ‘Scikit-learn: machine learning in Python’. Journal of Machine Learning Research. 2011;12:2825–30.
[2] Rossi R.J. Mathematical statistics [online]. Hoboken, NJ: John Wiley & Sons;
2018 Jul 18. Available from http://doi.wiley.com/10.1002/9781118771075
Chapter 6
Bayesian networks
This chapter introduces the basics of Bayesian network (BN) classifiers, which are used to address classification problems. The naive Bayes classifier is first presented, where a simplified (but unrealistic) assumption is made that the features are conditionally independent and of equal importance. To weaken this assumption and thereby improve classification accuracy, semi-naive Bayes classifiers are then presented, where part of the dependencies between the features is considered. Finally, BNs in more general form are introduced.
$P(c_k|\mathbf{x}_i) = \frac{P(\mathbf{x}_i, c_k)}{P(\mathbf{x}_i)} = \frac{P(c_k)}{P(\mathbf{x}_i)}\prod_{j=1}^{m}P(x_{ij}|c_k).$  (6.2)
The predicted target given the set of features $\mathbf{x}_i$ should be the class leading to the largest value of Equation (6.2). For a given $\mathbf{x}_i$, $P(\mathbf{x}_i)$ is fixed. The classification problem thus reduces to estimating the product of $P(c_k)$ and the class-conditional probability $P(\mathbf{x}_i|c_k)$, and finding the class $c_k$ leading to the largest product as the predicted target $\hat{y}_i$. Mathematically, this process can be presented as

$\hat{y}_i = \arg\max_{c_k \in C}\; P(c_k)\prod_{j=1}^{m}P(x_{ij}|c_k).$  (6.3)
$P(c_k)$ can be calculated by $P(c_k) = \frac{|D_{c_k}|}{|D|}$, where $|D_{c_k}|$ is the number of samples in class $c_k$ and $|D|$ is the total number of samples in the dataset, and $P(x_{ij}|c_k)$ is the conditional probability of each feature $j$, $j = 1, \ldots, m$, of sample $\mathbf{x}_i$. In particular, for categorical features, we denote by $|D_{c_k,j,j'}|$ the number of samples belonging to class $c_k$ and taking value $j'$ for feature $j$. Then, $P(x_{ij}|c_k)$ can be calculated by

$P(x_{ij}|c_k) = \frac{|D_{c_k,j,j'}|}{|D_{c_k}|}.$  (6.4)
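The prediction rule of Equation (6.3) is simple enough to sketch directly. The following is an illustrative implementation for categorical features, where the prior and conditional probability tables are assumed to have been estimated from the training set:

```python
import numpy as np

def predict_naive_bayes(x, classes, prior, cond_prob):
    """Return the class maximizing P(c) * prod_j P(x_j | c), per Equation (6.3).

    prior[c] = P(c); cond_prob[c][j][value] = P(x_j = value | c).
    """
    best_class, best_score = None, -np.inf
    for c in classes:
        score = prior[c]
        for j, value in enumerate(x):
            score *= cond_prob[c][j].get(value, 0.0)  # unseen values get probability 0
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```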
We use an example with five features, namely ship age (age), type, last inspection time (l_ins_time), last deficiency number (l_def_no), and last inspection state (whether a ship was detained in the last inspection; l_ins_state), to predict whether an arriving ship has six or more deficiencies. As the target variable, the deficiency number (denoted by def_no for short) is therefore discretized into two states: 0to5 and 6+. The training set contains 200 random samples from the whole port state control (PSC) dataset.
Example 6.1 Age, l_ins_time, and l_def_no are numerical features. For the sake of simplicity, we treat them as categorical features by discretizing their values. The values for age after discretization are 0to5, 6to10, 11to15, and 16+. The values for l_ins_time after discretization are no_inspection, 0to5, 6to10, 11to15, and 16+. The values for l_def_no after discretization are no_inspection, 0to5, and 6+. The two categorical features type and l_ins_state have five states (bulk_carrier, container_ship, general_cargo/multipurpose, tanker, and other) and three states (no_inspection, no_detention, and detention), respectively. For the three features related to the last inspection, namely l_ins_time, l_def_no, and l_ins_state, if a ship has no last inspection record in the database, its state is set to "no_inspection." According to Equation (6.3), we first calculate the prior probability $P(c_k)$ for each class $c_k \in C$. The frequency and distribution of the target are shown in Table 6.1.
Then, the conditional probability table, or the CPT, for each feature given the
target is calculated and presented as follows in Tables 6.2–6.6.
Suppose there arrives a 3-year-old (the value of age is 0to5) tanker (the value of type is tanker) whose last inspection was four months ago (the value of l_ins_time is 0to5) with eight deficiencies identified (the state of l_def_no is 6+) and no detention (the state of l_ins_state is no_detention). The product of the probability that the ship has 6+ deficiencies and the conditional probabilities of the feature values given that def_no is 6+ can then be calculated. Similarly, the product of the probability that the ship has 0 to 5 deficiencies and the conditional probabilities of the feature values given that def_no is 0to5 can be calculated.
Conditional probabilities of age given def_no:

State of def_no    age 0to5          age 6to10        age 11to15       age 16+
0to5               40 (25.949%)*     57 (36.709%)     32 (20.886%)     25 (16.456%)
6+                 3 (8%)            12 (26%)         9 (20%)          22 (46%)

*The format of this cell is "frequency (proportion)." This applies to the remaining cells of this table and the following tables.
Conditional probabilities of l_ins_time given def_no:

State of def_no    0to5             6to10            11to15          16+             No inspection
0to5               67 (42.767%)     55 (35.220%)     17 (11.321%)    10 (6.918%)     5 (3.774%)
6+                 24 (49.020%)     12 (25.490%)     2 (5.882%)      6 (13.725%)     2 (5.882%)
As $5.4995 \times 10^{-4}$ is larger than $2.0827 \times 10^{-4}$, it is predicted that the arriving tanker has 0 to 5 deficiencies.
The above classification process can be shown graphically in the form of a BN, which is a directed acyclic graph consisting of a set of nodes and a set of directed arcs. Nodes in a BN can be features, latent variables (i.e., variables that are not directly observed), and the target to be predicted.
Conditional probabilities of type given def_no:

State of def_no    Bulk carrier     Container ship    General_cargo/multipurpose    Tanker          Other
0to5               12 (8.176%)      110 (69.811%)     8 (5.660%)                    18 (11.950%)    6 (4.403%)
6+                 12 (25.490%)     13 (27.451%)      11 (23.529%)                  4 (9.804%)      6 (13.726%)
Figure 6.1 Construction of a naive Bayes classifier for ship deficiency condition prediction

Figure 6.2 The constructed naive Bayes classifier for ship deficiency condition prediction
The target is the class variable, and the features are the attribute variables. The directed arcs connecting two nodes represent conditional dependencies, where the node at the tail of an arc is the parent node and the node at the head is the child node. A parent node represents the condition, and its child node is the consequence of that condition. The BN model to address the ship deficiency number prediction problem in Example 6.1 can be constructed with the software Netica* and is shown in Figure 6.1. A box represents a node, and the title in the first line of a box is the name of a feature or the target. The following lines are their values (or states). The marginal distributions of the features or the target are given by the belief bars to the right of the states.
For the arriving tanker, clicking each of its feature values in Figure 6.1 yields the probability of the ship having 0 to 5 deficiencies or 6 or more deficiencies, as shown in Figure 6.2.

* https://www.norsys.com/netica.html
Step 1: Specify the prediction target as the class variable and the features as the attribute variables.
Step 2: Calculate the conditional mutual information between each pair of features (denoted by $A_i$ and $A_j$, $A_i \neq A_j$) given the class variable (denoted by $C$), denoted by $I(A_i; A_j|C)$, to identify the interdependence of the features.
Step 3: Build a complete undirected graph with the attribute variables as the nodes and the conditional mutual information $I(A_i; A_j|C)$ as the weight of the arc between nodes $A_i$ and $A_j$.
$I(A_i; A_j|C) = \sum_{s'=1}^{N_i}\sum_{s''=1}^{N_j}\sum_{s=1}^{N_c} P(a_{i,s'}, a_{j,s''}, C_s)\,\log_2\frac{P(a_{i,s'}, a_{j,s''}|C_s)}{P(a_{i,s'}|C_s)\,P(a_{j,s''}|C_s)},$  (6.7)
where $\log$ denotes the base-2 logarithm. $N_i$, $N_j$, and $N_c$ are the numbers of possible values of $A_i$, $A_j$, and $C$, respectively, and $a_{i,s'}$, $a_{j,s''}$, and $C_s$ are specific values of $A_i$, $A_j$, and $C$, respectively. $P(a_{i,s'}, a_{j,s''}, C_s)$ is the joint probability, and $P(a_{i,s'}, a_{j,s''}|C_s)$, $P(a_{i,s'}|C_s)$, and $P(a_{j,s''}|C_s)$ are the conditional probabilities. $P(a_{i,s'}, a_{j,s''}, C_s)$ should be understood as the proportion of samples in the whole training set whose states of attribute variables $A_i$ and $A_j$ and the class variable $C$ are $a_{i,s'}$, $a_{j,s''}$, and $C_s$, respectively. Similarly, $P(a_{i,s'}, a_{j,s''}|C_s)$ is the proportion of samples, among all training samples whose target is $C_s$, whose states of attribute variables $A_i$ and $A_j$ are $a_{i,s'}$ and $a_{j,s''}$, respectively. The range of $I(A_i; A_j|C)$ is $[0, +\infty)$, where a larger value of $I(A_i; A_j|C)$ indicates that attribute variables $A_i$ and $A_j$ are more strongly correlated.
After the construction of the qualitative part of the TAN classifier, its quantitative part is then addressed. The quantitative part has two components, the marginal distribution and the CPT of each variable, and both can be learned from the training set. The constructed TAN classifier can then be used to address classification tasks. An example of developing a TAN classifier to predict whether an arriving ship has six or more deficiencies is given in Example 6.2.
Example 6.2 The feature processing method is the same as that of Example 6.1. The Netica software is used to construct the TAN classifier. The qualitative part of the TAN classifier is shown in Figure 6.3, and the complete TAN classifier is shown in Figure 6.4. Figure 6.3 shows that ship type is selected as the root variable, which has only the class variable as its parent. In addition, the last inspection time is highly related to age and the last deficiency number, while the last deficiency number is highly related to the last inspection state. It is noted that the marginal distributions shown in Figure 6.4 are slightly different from those
Figure 6.3 The qualitative part of the TAN classifier for ship deficiency condition prediction
shown in Figure 6.2, except for the nodes deficiency_no and ship_type: node deficiency_no has no parent node, and ship_type has only deficiency_no as its parent node, in both figures. This is because, in the Netica software, the marginal distribution of a node is calculated from its CPT, i.e., from its conditional distributions, and is therefore influenced by its parent nodes. Hence, when the parent variable(s) of an attribute variable change(s), its marginal distribution is likely to change as well.
Figure 6.5 The TAN classifier for ship deficiency condition prediction
The naive Bayes classifier and semi-naive Bayes classifiers are special cases of BN classifiers, considering no interdependence and partial interdependence of the features, respectively. In general BN classifiers, the interdependence of each pair of features is represented by directed arcs connecting the attribute variables. An example of a BN classifier to predict ship deficiency conditions using the same features as Examples 6.1 and 6.2 is shown in Figure 6.6. The local Markov property [3] is satisfied in BN classifiers, where a node is conditionally independent of its non-descendants given its parents. Therefore, the calculation of the joint probability is simplified as shown in Equation (6.8), where only each node and its parent node(s) need to be considered. This property can largely reduce the computational power required to calculate CPTs and to predict the target, especially in large BNs.
$P(c_k, x_{i1}, \ldots, x_{im}) = P(c_k|\text{parent}(c_k))\prod_{j=1}^{m}P(x_{ij}|\text{parent}(x_{ij})),$  (6.8)

where $\text{parent}(\cdot)$ is the set of parent node(s) of the node in the parentheses. If $\text{parent}(x_{ij}) = \emptyset$, then $P(x_{ij}|\text{parent}(x_{ij})) = P(x_{ij})$.
There are three typical interdependence structures of three nodes in a BN classifier, which are also present in Figure 6.6: the common parent structure, where at least two child nodes have the same parent node, as shown by nodes ship_type, last_inspection_state, and last_deficiency_no; the V-shape structure, where one child node has at least two parent nodes, as shown by nodes last_inspection_state, last_deficiency_no, and deficiency_no; and the sequential structure, where three nodes are connected sequentially, as shown by nodes age, ship_type, and last_deficiency_no. The calculation of the joint probability, as well as the insights generated in each of the three typical structures, is shown in Figures 6.7–6.9.
which shows that nodes $a$ and $c$ are conditionally independent given the middle node $b$. The BN presented in Figure 6.6 can also be used to predict the ship deficiency condition. The BN classifier learned from the training set is shown in Figure 6.10. The prediction result for the arriving tanker is given in Figure 6.11, which also suggests that the ship is more likely to have 0 to 5 deficiencies.
References
[1] Zheng F., Webb G.I. ‘A comparative study of semi-naive Bayes methods in classification learning’. AUSDM05. 2005:141–55.
[2] Friedman N., Geiger D., Goldszmidt M. ‘Bayesian network classifiers’.
Machine Learning. 1997;29(2/3):131–63.
[3] Sebastiani P., Abad M.M., Ramoni M.F. ‘Bayesian networks’. Data Mining
and Knowledge Discovery Handbook. 2009:175–208.
Chapter 7
Support vector machine
This chapter first introduces one of the most popular machine learning models for classification tasks, the support vector machine (SVM). Then, the kernel trick, which improves prediction accuracy while reducing the computational burden, is discussed. The extension of SVM to regression tasks, called support vector regression, is then presented.
Consider a classification problem on data set $D = \{(\mathbf{x}_i, y_i), i = 1, \ldots, n\}$, where $\mathbf{x} \in \mathbb{R}^m$ is the vector of features and $y \in C$ is categorical and to be predicted. The basic idea of the SVM algorithm is to find a proper hyperplane that can accurately separate examples of different classes. Consider a binary classification problem, i.e., $y \in \{c_1, c_2\}$, where $\mathbf{x}_i$ is two-dimensional, i.e., $\mathbf{x} = (x_1, x_2)$. A hyperplane that can distinguish the two classes of examples, where squares represent examples in the positive class and circles represent examples in the negative class, is shown by the lines in Figure 7.1.
Figure 7.1 shows that hyperplanes separating the two classes of examples are not unique; indeed, there can be infinitely many. How, then, to choose the most proper one? Intuitively, a hyperplane that has the maximum distance from the examples in both classes is the best, as it is more tolerant of examples close to the hyperplane, which are more likely to be misclassified, and is thus more robust to interference or noise in the data. An SVM model aims to find such a hyperplane, as the hyperplane with the maximum margin can be expected to have the highest classification accuracy on unseen data, especially for examples close to it. Therefore, in Figure 7.1, hyperplane $h_0$ is the most suitable one. The dimension of the hyperplane depends on the number of features: if the number of features is $m$, the hyperplane has $(m - 1)$ dimensions. In what follows, we consider the case of binary classification (where $+1$ indicates a positive example and $-1$ indicates a negative example) with two features, where the hyperplane is a line.
We first consider the case where examples in the two classes are linearly sepa-
rable, that is, they can be perfectly separated by a straight line. The hyperplane
that can perfectly separate the two classes is a hard margin hyperplane, which is
presented by
$\mathbf{w}^T\mathbf{x} + b = 0,$  (7.1)

where $\mathbf{w} = (w_1, w_2)$ is the normal vector of the hyperplane, determining its direction, and $b$ is the offset, determining the distance between the hyperplane and the origin. A hyperplane is determined by the normal vector and the offset, and we denote it by $(\mathbf{w}, b)$ for simplicity. For an example $\mathbf{x}$, its distance to $(\mathbf{w}, b)$ is

$d(\mathbf{x}) = \frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|}.$  (7.2)
For the positive and negative examples in data set $D$, the following constraints must be satisfied to classify them correctly:

$\mathbf{w}^T\mathbf{x}_i + b \ge +1, \; y_i = +1; \qquad \mathbf{w}^T\mathbf{x}_i + b \le -1, \; y_i = -1.$  (7.3)
Visually, these hyperplanes are shown in Figure 7.2. All positive examples satisfy $\mathbf{w}^T\mathbf{x}_i + b \ge +1$, and the positive examples closest to hyperplane $h_0$ satisfy $\mathbf{w}^T\mathbf{x}_i + b = +1$. All negative examples satisfy $\mathbf{w}^T\mathbf{x}_i + b \le -1$, and the negative examples closest to hyperplane $h_0$ satisfy $\mathbf{w}^T\mathbf{x}_i + b = -1$. The examples in both classes that are closest to the hyperplane are marked in black (the other examples are in gray) in Figure 7.2. These examples are called support vectors. The margin is twice the distance from a support vector to the hyperplane, which can be calculated by

$\gamma = \frac{2}{\|\mathbf{w}\|}.$  (7.4)
According to the basic idea of SVM, where the maximum margin should be found, the problem turns into solving the following optimization model:
Figure 7.2 Hyperplane with the maximum margin and the support vectors
$\max \;\gamma = \frac{2}{\|\mathbf{w}\|}$  (7.5)

subject to

$\mathbf{w}^T\mathbf{x}_i + b \ge +1, \; y_i = +1, \; i = 1, \ldots, n; \qquad \mathbf{w}^T\mathbf{x}_i + b \le -1, \; y_i = -1, \; i = 1, \ldots, n.$  (7.6)
The optimization problem is equivalent to

$[M_0] \quad \min \; \frac{1}{2}\|\mathbf{w}\|^2$  (7.7)

subject to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \; i = 1, \ldots, n.$  (7.8)
The optimal hyperplane is denoted by $(\mathbf{w}^*, b^*)$. Optimization model $[M_0]$ is a convex quadratic programming problem that can be solved by off-the-shelf packages. Nevertheless, solving its dual problem, obtained by introducing Lagrange multipliers, can largely improve computational efficiency. By attaching a Lagrange multiplier $\alpha_i$, $i = 1, \ldots, n$, to each of the constraints in Equation (7.8), the Lagrangian function of optimization model $[M_0]$ can be written as
$L = \frac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_{i=1}^{n}\alpha_i\big(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\big).$  (7.9)

Setting the derivative of $L$ with respect to $b$ to zero gives

$\frac{\partial L}{\partial b} = -\sum_{i=1}^{n}\alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0.$  (7.11)
By incorporating Equation (7.11) into objective function (7.9), we have

$L = \frac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\alpha_i y_i \mathbf{w}^T\mathbf{x}_i - \sum_{i=1}^{n}\alpha_i y_i b = \frac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\alpha_i y_i \mathbf{w}^T\mathbf{x}_i.$  (7.12)
Then, the derivative of Equation (7.12) with respect to $\mathbf{w}$ is

$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n}\alpha_i \mathbf{x}_i y_i,$  (7.13)

and by setting it to zero, we have

$\mathbf{w} = \sum_{i=1}^{n}\alpha_i \mathbf{x}_i y_i.$  (7.14)
By introducing Equations (7.12) and (7.13) into optimization model $[M_1]$ so as to cancel out $\mathbf{w}$ and $b$, the following optimization problem can be obtained:

$[M_2] \quad \max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$  (7.15)

subject to

$\sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \alpha_i \ge 0, \; i = 1, \ldots, n.$  (7.16)
The Karush–Kuhn–Tucker (KKT) conditions are satisfied in $[M_2]$, as all of the constraints are linear. Therefore, the following set of conditions holds:

$\alpha_i \ge 0, \qquad y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0, \qquad \alpha_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\big) = 0.$
This means that for any example $(\mathbf{x}_i, y_i)$, we have $\alpha_i = 0$ or $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$. If $\alpha_i = 0$, the example does not appear in Equation (7.12) and thus does not influence the hyperplane. Otherwise, $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ must be satisfied, which means that the example lies on the maximum margin, i.e., it is a support vector. It can therefore be concluded that the hyperplane obtained by an SVM is determined only by the support vectors and has nothing to do with the other examples in the training set. After turning the objective function of optimization model $[M_2]$ into

$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j - \sum_{i=1}^{n}\alpha_i,$  (7.17)
several efficient algorithms, such as sequential minimal optimization, can be used to solve $[M_2]$ and obtain the optimal values of $\boldsymbol{\alpha}$ (denoted by $\boldsymbol{\alpha}^*$). Then, the optimal $\mathbf{w}$ is

$\mathbf{w}^* = \sum_{i=1}^{n}\alpha_i^* y_i \mathbf{x}_i.$  (7.18)
To calculate the optimal value of $b$, a support vector $(\mathbf{x}_s, y_s)$ is first found and combined with the optimal $\mathbf{w}^*$:

$y_s(\mathbf{w}^*\mathbf{x}_s + b^*) = 1 \;\Rightarrow\; y_s^2(\mathbf{w}^*\mathbf{x}_s + b^*) = y_s \;\Rightarrow\; b^* = y_s - \mathbf{w}^*\mathbf{x}_s,$  (7.19)

since $y_s^2 = 1$.
To improve the robustness of the obtained value of $b$, the set of all support vectors, denoted by $S$, can be used to calculate $b^*$:

$b^* = \frac{1}{|S|}\sum_{s \in S}(y_s - \mathbf{w}^*\mathbf{x}_s).$  (7.20)
In Figure 7.1, the examples of the two classes are linearly separable. However, in more general cases in practice, the two classes of examples are linearly inseparable. That is, they cannot be separated by a straight line, and thus the hard margin SVM cannot be applied to the classification task, as the constraints cannot all be satisfied. An example is shown in Figure 7.3, where the two classes cannot be separated by a line as they are somewhat "mixed." One viable approach is to allow the SVM to make some mistakes in classification while keeping the margin as wide as possible such that the other points can still be classified correctly. This is the basic idea of a
soft margin SVM. Then, hyperplane h0 can still be used to separate the two classes,
as presented in Figure 7.4.
Mathematically, the above approach can be realized by allowing some points (the set of which is denoted by $D'$) to violate the constraint, i.e.,

$y_{i'}(\mathbf{w}^T\mathbf{x}_{i'} + b) < 1, \quad i' \in D'.$  (7.21)

We further denote the violation degree of example $i'$ with respect to constraint (7.7) by a slack variable $\xi$, $\xi > 0$, and thus Equation (7.21) can be rewritten accordingly.
Then, the optimal value of $C$ is used to train and test the final soft margin SVM model using the following code:
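A minimal sketch of this step (C_opt denotes the penalty value selected in the preceding tuning step, which is assumed here):

```python
from sklearn.svm import SVC

clf = SVC(kernel="linear", C=C_opt).fit(X_train, y_train)  # soft margin linear SVM
y_pred = clf.predict(X_test)
```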
The Precision, Recall, and F1 score are 0.57, 0.53, and 0.51, respectively.
7.3 Kernel trick
Soft margin SVM is more suitable when the linear inseparability is caused by only a few examples. When the examples of different classes are interwoven with each other, soft margin SVM might be ineffective. Another approach to addressing linearly inseparable classification problems in SVM is the kernel trick. The basic idea of the kernel trick is to use a transform function to map the features from the original space to a higher-dimensional space where they can be linearly separated. The dot product used in the SVM models constructed so far (i.e., $\mathbf{w}^T\mathbf{x}$) on the original features of each pair of examples can be viewed as a linear kernel. Moreover, it has been proven that if the original feature space is of finite dimension, there must exist a feature space of higher dimension where the examples are linearly separable.
Mathematically, a kernel function can be written as

$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j).$  (7.27)

The kernel function in Equation (7.27) can be understood as taking the dot product (which measures the similarity of the two terms) of the input vectors transformed by the function $\phi(\cdot)$. The kernel function can easily be incorporated into the original hard margin SVM model by replacing $\mathbf{x}_i$ with $\phi(\mathbf{x}_i)$ in optimization models $[M_0]$, $[M_1]$, and $[M_2]$.
An intuitive example of using the kernel trick in SVM to tackle a linearly inseparable problem is as follows. Suppose we need to separate the two classes of examples presented in Figure 7.5. Obviously, a straight line cannot separate the two classes perfectly or nearly perfectly; instead, a circle can do the job, as shown in Figure 7.6.

Figure 7.6 Using a circle to separate examples in the linearly inseparable case
Then, a kernel function measuring the similarity of two terms based on whether the related points are within a circle can be considered. For a point $(x_{1'}, x_{2'})$ in Figure 7.6, we define a transform function:

$\phi(x_{1'}, x_{2'}) = \left(x_{1'}^2, \; x_{2'}^2, \; \sqrt{2}\,x_{1'}x_{2'}\right).$  (7.28)
It can be found that the optimal value of the penalty term $C$ in this example is much larger than in Example 7.1, where a linear SVM is used ($C = 10$ in this case). This is generally because the polynomial kernel maps the data to a higher dimension where they are more linearly separable; the tolerance for a wrongly classified example then becomes smaller, and thus the penalty becomes larger. The optimal value of $C$ is then used to train and test the final soft margin SVM model using the following code:
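A minimal sketch of this step (the kernel degree and C_opt are assumed to come from the preceding tuning step):

```python
from sklearn.svm import SVC

clf_poly = SVC(kernel="poly", degree=2, C=C_opt).fit(X_train, y_train)
y_pred = clf_poly.predict(X_test)
```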
The Precision, Recall, and F1 score are 0.57, 0.59, and 0.58, respectively. Recall that these three metrics for Example 7.1 are 0.57, 0.53, and 0.51, respectively. Therefore, introducing a polynomial kernel into the SVM increases the prediction accuracy.
The degree of a polynomial kernel can be any positive integer: the larger the degree, the more complex the resulting model. Selecting a proper kernel function is therefore of vital importance to guarantee the accuracy of an SVM model. This can be a tricky task, as how to accurately map the original data to a higher-dimensional space where they are linearly separable is unknown. In addition to the linear kernel and the polynomial kernel, some other common kernel functions are given in Table 7.1.
Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$. Parameters: none.

Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j)^d$. Parameters: $d \ge 1$, the degree of the polynomial; a larger $d$ leads to a more complex SVM model.

Gaussian kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}}$. Parameters: $\sigma > 0$, the width of the Gaussian kernel; a smaller $\sigma$ leads to an SVM model that is more sensitive to the distance between each pair of points, and thus more likely to overfit the data.

Laplacian kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|}{\sigma}}$. Parameters: the same as for the Gaussian kernel.

Sigmoid kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta\,\mathbf{x}_i^T\mathbf{x}_j + \theta)$, where tanh is the hyperbolic tangent function, with parameters $\beta$ and $\theta$.
Then, the optimal kernel, its degree, and the value of the penalty term $C$ are used to train and test the final SVR model using the following code:
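A minimal sketch of this step (the kernel choice, degree, and C_opt are assumed to come from the preceding tuning step):

```python
from sklearn.svm import SVR

reg = SVR(kernel="poly", degree=2, C=C_opt).fit(X_train, y_train)
y_pred = reg.predict(X_test)
```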
The MSE, MAE, and MAPE are 11.1886, 2.3153, and 0.7085, respectively.
Reference
[1] Pedregosa F., Varoquaux G., Gramfort A., et al. ‘Scikit-learn: machine learning in Python’. Journal of Machine Learning Research. 2011;12:2825–30.
Chapter 8
Artificial neural network
This chapter introduces a widely used machine learning model for both regression and classification tasks, the artificial neural network (ANN). Basic concepts of ANNs are first covered, and then typical algorithms and tricks for ANN training are introduced. Finally, deep learning models, as an important type of neural network, are briefly discussed.
Suppose we have a data set $D = \{(\mathbf{x}_i, y_i), i = 1, \ldots, n\}$, where $\mathbf{x}_i \in \mathbb{R}^m$ is the vector of features and $y_i \in \mathbb{R}$ is the output. The structure of a typical ANN model to predict $y_i$ (the prediction is denoted by $\hat{y}_i$, $i = 1, \ldots, n$) is presented in Figure 8.1. It contains three layers: an input layer, a hidden layer, and an output layer. Specifically, the input layer contains $m$ nodes to receive the $m$ feature values of an example, which can be either categorical or numerical, in addition to a bias node. The output layer gives the final predicted target $\hat{y}_i$. For a specific problem, the structure of the input and output layers is fixed. In contrast, the hidden layer can be much more flexible: an ANN can have one or more hidden layers, and the number of nodes in each hidden layer is also flexible. In particular, the number of hidden layers and the number of nodes in each of them depend on the specific problem and the data structure. In Figure 8.1, one hidden layer with $K + 1$ nodes is shown. A weight is attached to each arrow connecting two nodes in consecutive layers. The direction of the arrows shows the direction of data flow in the ANN model, which is opposite to the direction of the error flow, to be covered later.
A node in the neural network is also called a neuron. The details of a neuron in Figure 8.1 are shown in Figure 8.2. A neuron is an elementary unit of an ANN model, which can be viewed as a mathematical function receiving one or more inputs from an example or from the neurons in the preceding layer. The neuron shown in Figure 8.2 receives inputs $a_1$, $a_2$, and $a_3$ with weights $w_1$, $w_2$, and $w_3$ attached. The weighted sum of the inputs, denoted by $t$, is first calculated by $t = w_1a_1 + w_2a_2 + w_3a_3 + b$, where $b$ is the bias. Then, $t$ is passed to an activation function $f$ to generate the output of this neuron, denoted by $z$, i.e., $z = f(t)$. In particular, activation functions decide whether, and how strongly, a neuron is activated by its inputs.
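As an illustration, the computation performed by this single neuron can be sketched as follows (the input and weight values are made up for the example):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

a = np.array([0.5, 1.0, -0.2])   # inputs a1, a2, a3
w = np.array([0.4, -0.3, 0.8])   # weights w1, w2, w3
b = 0.1                          # bias
t = w @ a + b                    # weighted sum of the inputs
z = sigmoid(t)                   # neuron output z = f(t)
```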
Figure 8.3 An illustration of the Sigmoid activation function: $f(x) = \frac{1}{1 + e^{-x}}$

Figure 8.4 An illustration of the Tanh activation function: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
In regression tasks, where the output is numerical, the weighted sum of the outputs given by the hidden layer is used directly as the final prediction in the output layer, that is,

$\hat{y} = t_y.$  (8.4)

In classification tasks, where the output is categorical, the continuous weighted sum input to the output layer is further discretized. In binary classification problems, the sigmoid function is often used, where the predicted value given by the output layer is the probability of the positive class:

$\hat{y} = \frac{1}{1 + e^{-t_y}}.$  (8.5)

In multiclass classification problems, the output is multi-dimensional, and a Softmax function is often used to process the outputs, turning absolute values into probabilities; we do not discuss the details of this case here. The predicted output of the output layer can then be uniformly expressed as
Figure 8.5 An illustration of the ReLU (rectified linear unit) activation function: $f(x) = 0$ if $x < 0$ and $f(x) = x$ otherwise
$\hat{y} = g(t_y) = g\left(\sum_{k=1}^{K+1} v_k z_k\right) = g\left(\sum_{k=1}^{K} v_k f(t_k) + v_{K+1}\right) = g\left(\sum_{k=1}^{K} v_k f\left(\sum_{m'=1}^{m+1} w_{m'k}\,x_{m'}\right) + v_{K+1}\right).$  (8.7)
Figure 8.6 An illustration of the leaky ReLU activation function: $f(x) = 0.2x$ if $x < 0$ and $f(x) = x$ otherwise
$L(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2.$  (8.8)

If $y_i$ is binary, the cost function of this example can be calculated by

$L(y_i, \hat{y}_i) = -\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right].$  (8.9)
The training process of an ANN model in one round adjusts the weights connecting the neurons with the goal of minimizing the cost function: the smaller the value of the cost function, the closer the predicted and actual outputs. The most popular ANN learning algorithm is called backpropagation, which computes the gradients of expressions through recursive application of the chain rule, such that the current error in the output layer can be passed backwards to each layer and the weights adjusted accordingly to minimize the error. We use the regression case as an example to show the working mechanism of the backpropagation algorithm.

We first calculate the gradients of the weights connecting the hidden layer and the output layer. Assume the sigmoid function is used as the activation function, i.e., $f(x) = \frac{1}{1+e^{-x}}$; its derivative is $f'(x) = f(x)(1 - f(x))$. Denote the current training example by $(\mathbf{x}_i, y_i)$. For neuron $k$, $k = 1, \ldots, K$, in the hidden layer, whose weight is denoted by $v_k$, its connection with the output layer can be represented by
$L = \frac{1}{2}(y_i - \hat{y}_i)^2, \qquad \hat{y}_i = t_y, \qquad t_y = \sum_{k=1}^{K+1} v_k z_k.$  (8.10)
Then, the gradient of $L$ with respect to $v_k$, denoted by $\delta v_k$ for short, can be calculated by the following chain rule:

$\delta v_k = \frac{\partial L}{\partial v_k} = \frac{\partial L}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial t_y}\,\frac{\partial t_y}{\partial v_k} = (\hat{y}_i - y_i)\,z_k,$  (8.11)
where all the terms involved are known. The gradient of weight $w_{m'k}$, connecting neuron $m'$ in the input layer and neuron $k$ in the hidden layer, can be calculated in a similar way. The connection between neuron $m'$ and the prediction error can be represented by
$L = \frac{1}{2}(y_i - \hat{y}_i)^2, \qquad \hat{y}_i = t_y, \qquad t_y = \sum_{k=1}^{K+1} v_k z_k, \qquad z_k = f(t_k), \qquad t_k = \sum_{m'=1}^{m+1} w_{m'k}\,x_{m'}.$  (8.12)
Then, the gradient of $L$ with respect to $w_{m'k}$, denoted by $\delta w_{m'k}$ for short, can be calculated by the following chain rule:

$\delta w_{m'k} = \frac{\partial L}{\partial w_{m'k}} = \frac{\partial L}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial t_y}\,\frac{\partial t_y}{\partial z_k}\,\frac{\partial z_k}{\partial t_k}\,\frac{\partial t_k}{\partial w_{m'k}} = (\hat{y}_i - y_i)\,v_k\,f(t_k)\big(1 - f(t_k)\big)\,x_{m'},$  (8.13)
where all the terms involved are known. $\delta v_k$ and $\delta w_{m'k}$ are then used to update the weights connecting the hidden layer to the output layer and those connecting the input layer to the hidden layer, respectively. To control the speed of weight updating (and thus reduce over-fitting), a learning rate, denoted by $\eta$, is introduced. After this round of learning, the weight connecting neuron $k$, $k = 1, \ldots, K$, and the output layer is updated to $v_k - \eta\,\delta v_k$ (the weights $w_{m'k}$ are updated analogously).
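Putting Equations (8.11) and (8.13) together, one backpropagation update for this one-hidden-layer regression network can be sketched as follows (bias terms are folded into the inputs for brevity; all names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_step(x, y, W, v, eta=0.01):
    # forward pass
    t_hidden = W @ x                   # t_k for each hidden neuron
    z = sigmoid(t_hidden)              # hidden outputs z_k = f(t_k)
    y_hat = v @ z                      # linear output layer, Equation (8.4)
    # gradients per Equations (8.11) and (8.13)
    delta_v = (y_hat - y) * z
    delta_W = np.outer((y_hat - y) * v * z * (1 - z), x)
    # gradient step scaled by the learning rate eta
    v -= eta * delta_v
    W -= eta * delta_W
    return W, v
```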
If the learning rate $\eta$ is too large, the optimal values of the weights might be overshot in the updating process. If the value of epoch is too large, the ANN model may learn the training data too well, leading to over-fitting; in contrast, if epoch is too small, under-fitting may occur. To find a proper value for epoch, a validation set independent of the training set should be used to test the performance of the ANN model under construction: if the cost function on the validation set decreases only marginally or even increases, training should be stopped, as over-fitting is highly likely to occur. This trick is also called "early stopping." Finally, the batch size depends highly on the size of the whole data set and the network structure; common batch sizes are 1, 2, 4, 16, 32, 64, 128, and 256. In addition, to overcome over-fitting, regularization on the weights of the network can be introduced into the cost function shown in Equation (8.8):
$L_{\text{reg}} = \frac{1}{n}\sum_{i=1}^{n}L(y_i, \hat{y}_i) + \lambda\left(\sum_{m'=1}^{m}\sum_{k=1}^{K+1} w_{m'k}^2 + \sum_{k=1}^{K+1} v_k^2\right),$

where $\lambda$ controls the strength of the regularization.
The hyperparameters tuned include the learning rate (learning_rate_init in the scikit-learn API), the batch size (batch_size in the scikit-learn API), and the epoch (max_iter in the scikit-learn API). The core code is as follows:
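A minimal sketch of this tuning step (the candidate grids are assumed):

```python
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate_init": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
    "max_iter": [200, 500, 1000],
}
search = GridSearchCV(MLPRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)
```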
Then, the optimal learning_rate_init, batch_size, and max_iter are used to train and test the final ANN model using the following code:
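A minimal sketch of this step, reusing the best parameters found above:

```python
best_ann = MLPRegressor(random_state=0, **search.best_params_)
best_ann.fit(X_train, y_train)
y_pred = best_ann.predict(X_test)
```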
The MSE, MAE, and MAPE are 14.3317, 2.5591, and 0.6437, respectively.
With the improvement of computational resources, training more complex ANN models, which need adequate storage and computational power, becomes possible, and with the accessibility of big data, the problem of over-fitting in these complex models can be largely reduced. The most common way to increase the complexity of an ANN model is to increase the number of its hidden layers: the deeper the network, the more complex the model. The reasons are intuitive: as the number of hidden layers grows, more weights between consecutive layers are added, and a larger number of activation functions associated with the neurons are involved. Their joint effect significantly increases the complexity and capacity of the ANN model. An ANN model with a large number of hidden layers is called a deep neural network (DNN), or a deep learning model. The term "deep learning" describes a family of learning algorithms for developing DNN models rather than a single algorithm. Popular deep learning models are summarized in Table 8.1.
References
[1] Hornik K., Stinchcombe M., White H. ‘Multilayer feedforward networks are
universal approximators’. Neural Networks. 1989;2(5):359–66.
[2] Pedregosa F., Varoquaux G., Gramfort A., et al. ‘Scikit-learn: machine learning in Python’. Journal of Machine Learning Research. 2011;12:2825–30.
Chapter 9
Tree-based models
This chapter introduces several machine learning models based on a tree structure called the decision tree, which are widely believed to be among the most popular methods for both classification and regression tasks. The basic structure and concepts, as well as the tree-growing algorithms of a single decision tree, are first introduced. As a single decision tree is prone to over-fitting, ensemble models consisting of a number of decision trees have been developed. Random forests, based on bagging, and gradient boosting decision trees, based on boosting, will be introduced as representatives of ensemble models built on decision trees.
As the name suggests, a decision tree, which we call a DT for short, uses a tree-like model for decision making. A DT consists of nodes and directed edges, where a node contains a set of examples and a directed edge indicates the splitting of a node into consecutive nodes. An example of a DT is shown in Figure 9.1. The orange node at the top is the root node, which contains the whole data set used to construct the current DT. The green nodes are leaf nodes, which are not split any further and give the final prediction. The blue node is an internal node, which is to be further split. The directed edges in the DT show the process of node splitting: a parent node at the tail of a directed edge is split into two child nodes at the head following some criterion, usually a selected feature or a (feature, value) pair. "Splitting" here means that the examples contained in the parent node are separated into two complementary and disjoint child nodes, with the feature or (feature, value) pair as the splitting criterion. Examples with feature values less than or equal to the threshold value are sent to the left child node, and the others to the right child node. The aim of splitting is to make the two child nodes "purer," meaning that the examples contained in one node are more similar to each other: in classification problems, the examples in one node should ideally be of the same class; in regression problems, the targets of the examples in the same node should be as close as possible. Finally, the output of a leaf node is also determined by the problem: in classification problems, the class of the majority of the examples contained in it is generally used as the output; in regression problems, the average target of all the examples contained is the output. The depth of a tree is one more than the number of splits from the root node to the deepest leaf node; the depth of the tree in Figure 9.1 is 3.
$H(D') = -\sum_{k=1}^{K}\frac{|C'_k|}{|D'|}\log_2\frac{|C'_k|}{|D'|}.$  (9.6)
Then, the conditional entropy of feature $x$ over data set $D'$ is

$H(D'|x) = \sum_{j=1}^{J}\frac{|D'_j|}{|D'|}H(D'_j) = -\sum_{j=1}^{J}\frac{|D'_j|}{|D'|}\sum_{k=1}^{K}\frac{|D'_{jk}|}{|D'_j|}\log_2\frac{|D'_{jk}|}{|D'_j|}.$  (9.7)
Then, the information gain (IG) can be calculated by

$G(D', x) = H(D') - H(D'|x).$  (9.8)

The feature $x$ leading to the maximum value of $G(D', x)$ among all candidate features is used as the optimal feature to split the current node, where the examples with the same value for $x$ are sent to the same child node. This means that a total of $J$ child nodes are generated. The child nodes generated are then split following the above steps. A node is not split any further if its entropy is below a preset threshold.
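The entropy and information-gain calculations of Equations (9.6)-(9.8) can be sketched as follows (an illustrative implementation for categorical features):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(D') per Equation (9.6)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """G(D', x) = H(D') - H(D'|x) per Equations (9.7) and (9.8)."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature_values):
        subset = [y for fv, y in zip(feature_values, labels) if fv == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(labels) - h_cond
```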
ID3 is an intuitive algorithm based on information theory and is easy to understand. However, it has some obvious drawbacks, which are listed as follows:
9.2.2 C4.5
The above drawbacks of ID3 can be addressed by the C4.5 algorithm, developed by Quinlan [1], in the following ways:
• For a continuous feature $x$ with $J$ values (where $J$ can be close to $n$, especially for non-integer values), dichotomy is used to discretize the feature. First, rank the values of feature $x$ in ascending order, denoted by $\{x^1, \ldots, x^j, \ldots, x^J\}$. There are then a total of $J - 1$ candidate split points that can divide the values of feature $x$ into two categories. A candidate split point is denoted by $x^t$, with $x^t = \frac{x^j + x^{j+1}}{2}$, $j = 1, \ldots, J - 1$. Based on $x^t$, the current data set $D'$ can be divided into two categories: $D'_{x^t-}$ with examples whose feature $x$ is less than or equal to $x^t$, and $D'_{x^t+}$ with examples whose feature $x$ is larger than $x^t$. The continuous feature $x$ can then be treated as a binary feature.
• Mainly for a categorical feature $x$ with $J$ values, the information gain ratio (IGR) is used as the criterion to find the best split feature in the C4.5 algorithm. The IGR of feature $x$ on data set $D'$ can be calculated by

$G_R(D', x) = \frac{G(D', x)}{H_x(D')},$  (9.9)
where H_x(D') = -\sum_{j=1}^{J} \frac{|D'_j|}{|D'|} \log_2 \frac{|D'_j|}{|D'|} is the intrinsic value of feature x. The
larger the number of values J, the larger the value of H_x(D'). Hence, the IGR can be used to
reduce the influence of the number of feature values on the calculated IG value.
•• Two issues regarding missing values need to be addressed: one is how to find
the best split feature when some features have missing values, and the other is
how to deal with examples with missing feature values after the best split feature
has been selected. The basic idea to address both issues is to attach
different weights to the examples. Readers are referred to Reference 1 for more
detailed information.
•• Two tree pruning strategies can be adopted: prepruning and postpruning. In
prepruning, some limitations called tree growing termination conditions are set
to restrict tree growing, which aim to prevent the tree from learning the training
data too well by limiting its size. Common tree growing termination conditions
include: the maximum tree depth is reached, the number of examples contained
in a node falls below a minimum, or the improvement in the splitting criterion
brought by the best split falls below a threshold.
In postpruning, a complete tree is first constructed, and then two leaf nodes from
the same parent node are “drawn back” to their parent node, i.e., assume that the parent
node is not split. Then, the performance of the DT with leaf nodes drawn back is tested
on a hold-out validation set. If the prediction performance can be improved, the two
leaf nodes involved are to be pruned, and they are reduced to their parent node.
Then, the optimal max_depth and min_samples_leaf are used to train and test
the final classification DT model using the following code:
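The code itself is not reproduced here. A minimal sketch of what it may look like with the scikit-learn API, using synthetic stand-in data and illustrative tuned values in place of the book's PSC data set and grid-search results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic stand-in for the book's PSC inspection data set.
X, y = make_classification(n_samples=1500, n_features=10, weights=[0.75],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plug in the tuned max_depth and min_samples_leaf found by the grid
# search; the values 5 and 10 below are illustrative only.
clf = DecisionTreeClassifier(criterion="gini", max_depth=5,
                             min_samples_leaf=10, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred),
      f1_score(y_test, y_pred))
```

The reported metrics will of course differ from the book's values, since the data here is synthetic.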
The Precision, Recall, and F1 score are 0.66, 0.64, and 0.65, respectively.
After a regression tree based on CART is constructed, tree pruning can be conducted
in a similar way to the classification tree based on CART to reduce over-fitting. The overall
procedure of constructing a regression tree using CART is shown in Algorithm 2.
Then, the optimal max_depth and min_samples_leaf are used to train and test
the final regression DT model using the following code:
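Again, the original code is not reproduced here; a comparable sketch for the regression case, with synthetic stand-in data and illustrative tuned values:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

# Synthetic stand-in for the book's ship deficiency-number data set.
X, y = make_regression(n_samples=1500, n_features=10, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative tuned values in place of the grid-search results.
reg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print(mean_squared_error(y_test, y_pred),
      mean_absolute_error(y_test, y_pred),
      mean_absolute_percentage_error(y_test, y_pred))
```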
The MSE, MAE, and MAPE are 18.4986, 2.8218, and 0.6946, respectively.
DT is a popular machine learning method to address both classification and
regression problems. The advantages and disadvantages of DTs are summarized as
follows.
Advantages:
•• DTs are easy to understand and intuitive. They can also be visualized. Decision
rules can be summarized from a constructed DT.
•• Little data and feature preprocessing is needed. For example, feature
normalization and scaling are not needed, and some DT construction
algorithms can deal with missing values.
•• DTs can easily be extended to deal with multi-output problems.
Disadvantages:
•• DTs can easily overfit, especially when the tree constructed is very
complex.
•• DTs can be unstable, which means that they can easily be influenced by small
variations in the data, especially some extreme feature values.
•• The prediction given by DTs for regression problems is not smooth or
continuous.
•• Learning an optimal DT is known to be NP-hard. Therefore, all the popular
DT construction methods above are heuristic: they make locally optimal
decisions (i.e., at each node in the tree)*.
* Constructing globally optimal DTs is also studied by researchers; readers are referred to [4–6].
Table 9.1 Common DT construction algorithms

| Name of the algorithm | Output data | Input data | Splitting criterion | Tree structure | Processing of missing values | Pruning |
|---|---|---|---|---|---|---|
| ID3 | Categorical data | Categorical data | Information gain | Polytree (i.e., a tree node can have more than two child nodes) | No | None |
| C4.5 | Categorical data | Categorical and numerical data | Information gain ratio | Polytree | Yes, based on assigning weights to examples | Prepruning and postpruning based on nodes drawn back (i.e., pessimistic error pruning) |
| CART | Categorical and numerical data | Categorical and numerical data | Classification: Gini index; Regression: MSE (mean squared error) or MAE (mean absolute error) | Binary tree (i.e., a tree node only has two child nodes) | Yes, based on surrogate splits | Prepruning and postpruning which first generates all possible sub-trees and then finds the best one (i.e., cost-complexity pruning) |
in the latter case, they are called component learners. In this section, we mainly
focus on how to develop an ensemble model consisting of a certain number of
DTs constructed by the CART algorithm as base learners to address classification
or regression tasks. Based on the Hoeffding inequality, as the number of
base learners contained in the ensemble model, i.e., T, increases, the error rate
of the ensemble decreases rapidly and eventually approaches zero [7]. Meanwhile, it
should be noted that the above analysis is based on an important assumption:
the errors of the base learners should be independent of each other.
However, base learners that are both "good" and "independent" are hard to find, because
all the base learners are developed to address the same problem, and thus they
cannot be independent, especially when the data set size is limited while a
certain accuracy should be achieved. Therefore, the core steps of
developing a strong ensemble model are to develop "divergent and
good" base learners that capture different aspects of the data accurately,
and to produce a strong ensemble model based on these base learners. To
achieve this, two strategies are widely used, namely bagging and boosting.
9.3.1 Bagging
As mentioned in section 9.2.4, one disadvantage of DTs is that they are very sensitive
to the training set: a slightly different training set can result in a totally different
DT, especially for training sets with extreme points. Therefore, to produce divergent
base learners in an ensemble model, different training sets generated from the
original training set can be used. At the same time, as the size of the original
training set is fixed and limited, to ensure that the performance of a base learner is
not too bad, the number of examples contained in a training set for a base learner
should be the same as in the original training set. One viable approach is to draw
samples of the same size as the original training set from the whole training set and
construct machine learning models on these samples. Bootstrap aggregating
(bagging) is such a widely used approach to develop ensemble learning methods
in a parallel manner. Bagging is based on bootstrap sampling: for a training set D
containing n examples, to formulate a sample training set \hat{D}_t, a random example
is first selected and placed into \hat{D}_t, and then it is placed back into D. The above process
is repeated n times, and then the sample training set \hat{D}_t with n examples is
formulated. As the examples are selected with replacement, it is highly likely that
there are duplicate examples in \hat{D}_t while some examples in D are not contained
in \hat{D}_t. Actually, only about 63.2% of the examples in D are chosen in \hat{D}_t when
the number of examples contained in training set D is sufficiently large [8]. After
repeating the above process T times, T sample training sets, \hat{D}_1 to \hat{D}_T, can be
formulated and used to train T DTs. To ensemble the outputs of the T DTs in
the final model, a voting strategy is adopted in classification problems, i.e.,
the target predicted by the majority of the DTs is used as the final output; in case of a tie,
one of the tied outputs is chosen at random. In regression problems, the average of the outputs of all the
DTs is used as the final output.
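The 63.2% figure can be checked empirically. A minimal sketch (not from the book) simulating one bootstrap sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # size of the original training set D

# One bootstrap sample: draw n indices from D with replacement.
sample = rng.integers(0, n, size=n)

# Fraction of distinct examples of D that appear in the sample;
# this approaches 1 - 1/e ≈ 0.632 as n grows.
print(len(np.unique(sample)) / n)
```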
number in PSC. The features and target of Example 9.3 for classification are the
same as Example 9.1, and those of Example 9.4 for regression are the same as
Example 9.2.
Example 9.3. In addition to the hyperparameters in DT, RF has more hyperparameters
to control the whole structure of the model. The number of DTs contained,
i.e., T, is called n_estimators in scikit-learn, and the number of features to consider when
looking for the best split for a node is called max_features. Similar to the DT
constructed to predict ship detention, we also use the Gini index as the node splitting criterion
of each DT. We tune a total of four hyperparameters: two from the DT to control tree
complexity, max_depth and min_samples_leaf, and the other two from the ensemble
model, n_estimators and max_features. As the training set is unbalanced, where only
381 ships have more than 5 deficiencies among all the 1,500 ships contained, the F1 score
is used as the metric for hyperparameter tuning. The core code is as follows:
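The original code is not reproduced here. A rough sketch of such a grid search with the scikit-learn API, using synthetic stand-in data and an illustrative hyperparameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the unbalanced 1,500-ship training set.
X, y = make_classification(n_samples=1500, n_features=10,
                           weights=[1 - 381 / 1500], random_state=0)

# Illustrative grid over the four hyperparameters discussed above.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", 0.5, None],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(criterion="gini", random_state=0),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```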
The Precision, Recall, and F1 score are 0.68, 0.63, and 0.65, respectively.
Example 9.4. We use MSE as the node splitting criterion, and we tune the values of
four hyperparameters to control the structure of the RF and tree complexity: n_estimators,
max_features, max_depth, and min_samples_leaf, similar to the classification RF.
The core code is as follows:
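A comparable sketch for the regression RF, again with synthetic stand-in data and an illustrative grid:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the deficiency-number regression data.
X, y = make_regression(n_samples=1500, n_features=10, noise=10.0,
                       random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": [0.5, 1.0],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_)
```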
The MSE, MAE, and MAPE are 14.2134, 2.5567, and 0.6212, respectively.
9.3.2 Boosting
In bagging, the base learners are constructed independently and are ideally independent of each
other; therefore, they can be constructed in parallel. In contrast, in boosting, the base
learners, which are also called weak learners, are strongly dependent on each other.
The weak learners are constructed one by one, with the aim of constantly
improving (boosting) the performance of the model developed so far, by paying more
attention to the examples with larger prediction errors, either attaching higher weights
to them or changing the prediction target. Then, after developing T weak learners for
the boosting model, they are combined linearly to produce the final output. In boosting,
the number of weak learners T is also called the maximum number of iterations.
In this section, we introduce two boosting methods: adaptive boosting, or Adaboost,
and boosting tree, where the former is based on changing the examples' weights and
the latter learns from the errors of the base models constructed so far.
9.3.2.1 Adaboost
The weak learner of Adaboost is the decision stump, which is a DT with only
one split at the root node. In each round of weak learner construction, the weak
learner is trained on the training set weighted by the current distribution of
example weights.
•• In step 1, all the examples are assigned identical weights to train the first
(initial) weak learner. Then, the weights of the examples are updated according
to the performance of the current weak learner.
•• Examples that are wrongly predicted are assigned higher weights, as shown
by how w_{t+1,i} is calculated, so these examples receive more attention in
the next round of learning. It can thus be seen that in the learning process
of Adaboost, the examples themselves are not changed, while their weights, i.e., the
distribution of the examples, are changed.
•• From Equation (9.16), it can be seen that the error of round t is the sum of the
weights of the examples that are wrongly classified, which indicates the
relationship between distribution D_t of the data set and the weak learner
G_t(x).
•• The weight of weak learner G_t(x), i.e., α_t, indicates the importance of this weak
learner in the final classification model. It is shown in Equation (9.17)
that when e_t ≤ 1/2, we have α_t ≥ 0, and as e_t decreases, α_t increases. This shows that
the more accurate a weak learner is, the more important the role it plays in
the final prediction model. In contrast, when e_t > 1/2, we have α_t < 0. It is
not reasonable to have a weak learner whose weight is less than zero. To
address this issue, one approach is to change the predicted outputs to their
inverses, so that the error rate becomes smaller than 1/2. Another
approach is to discard the current weak learner, as its performance is
too bad.
•• The sign of f^{Adaboost}_{combined}(x) determines the predicted class of example x, and the
absolute value of f^{Adaboost}_{combined}(x) shows the reliability of the prediction.
•• Regularization can be applied in Adaboost in Equation (9.18) of Step 7 by introducing
a learning rate denoted by v, v ∈ (0, 1), which is used to control the model updating
speed. Then, Equation (9.19) is changed to f^{Adaboost}_t(x) = f^{Adaboost}_{t−1}(x) + v α_t G_t(x).
The larger the value of v, the faster the model learns.
The following Example 9.5 uses Adaboost for ship risk prediction
with the same features and target as those in Example 9.3.
Example 9.5. As the decision stump is used as the weak learner, there is no
hyperparameter to control the complexity of a DT contained in the Adaboost model. Only
two hyperparameters are tuned to control the learning performance of the whole
Adaboost model: n_estimators, i.e., T, the number of weak learners contained,
and learning_rate, i.e., v, to control the model learning speed. As the training
set is unbalanced, where only 381 ships have more than five deficiencies among all
the 1,500 ships contained, the F1 score is used as the metric for hyperparameter tuning.
The core code is as follows:
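The original code is not reproduced here; a rough sketch of the tuning step, with synthetic stand-in data and an illustrative grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the unbalanced 1,500-ship training set.
X, y = make_classification(n_samples=1500, n_features=10,
                           weights=[1 - 381 / 1500], random_state=0)

# The default base estimator in scikit-learn is a decision stump
# (a DT with max_depth=1), matching the description above.
param_grid = {"n_estimators": [50, 100, 200],
              "learning_rate": [0.1, 0.5, 1.0]}
search = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid,
                      scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```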
Then, the optimal n_estimators and learning_rate are used to train and test the
final classification Adaboost model using the following code:
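Continuing the sketch above, the tuned values could then be used as follows:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Continues from the sketch above: X, y and search.best_params_ exist.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = AdaBoostClassifier(random_state=0, **search.best_params_)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred),
      f1_score(y_test, y_pred))
```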
The Precision, Recall, and F1 score are 0.69, 0.66, and 0.67, respectively.
Adaboost is widely used for classification tasks, and it can also be used for
regression tasks. When used for regression, the main idea and procedure are identical
to those for classification, except for the equations used to calculate the performance
of a weak learner in Equation (9.16) of Algorithm 3 (which is determined by the
regression algorithm adopted), the weights of the examples in D, and the weight of
each weak learner in Equation (9.17). In addition, theoretically, any ML model can be
used as the weak learner under the framework of Adaboost, and the most popular ones
are DTs and ANNs. It should also be noted that Adaboost can be very sensitive to
outliers or extreme examples: as these examples are more likely to be badly predicted,
they will be assigned large weights during model construction, which degrades the
performance of Adaboost.
9.3.2.2 Boosting tree
Boosting tree is also called gradient boosting decision tree, or GBDT for short,
which can be used for classification and regression tasks. It is based on CARTs and
uses forward-stagewise additive modeling where the model is fit by minimizing
a loss function averaged over the training data, i.e., to use the prediction error in
the previous round of learning as the new target in the next round of weak learner
124 Machine learning and data analytics for maritime studies
learning. The above idea can be summarized as “learn from the prediction error so
far,” and it is achieved by changing the prediction target in the current round to the
negative gradient value of the loss function in the last round. The procedure of using
GBDT for a regression task is shown in Algorithm 4.
When using MSE as the loss function in this task, the negative gradient calculated
in Equation (9.21) is y_i − G_{t−1}(x_i), i = 1, ..., n, which is the difference between the
real and current predictions for example i in the training set. This difference is also called
the residual and is denoted by r_i. Therefore, the learning objective of each
round in GBDT for regression is to minimize the residual of the current prediction, and
the final output is calculated by summing the outputs of all rounds of learning. The
following Example 9.6 shows how to use the scikit-learn API [3] to develop a
GBDT model for regression to predict the number of deficiencies a ship has.
Example 9.6. In addition to the hyperparameters in DT (i.e., max_depth and
min_samples_leaf), GBDT for regression has more hyperparameters to control the
whole ensemble, such as the number of weak learners n_estimators and the learning
rate learning_rate. The core code is as follows:
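The original code is not reproduced here; a rough sketch with synthetic stand-in data and an illustrative grid:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the deficiency-number data.
X, y = make_regression(n_samples=1500, n_features=10, noise=10.0,
                       random_state=0)

param_grid = {"max_depth": [2, 3, 5],
              "min_samples_leaf": [1, 5, 10],
              "n_estimators": [100, 300],
              "learning_rate": [0.05, 0.1]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_)
```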
The MSE, MAE, and MAPE are 16.4603, 2.6489, and 0.6671, respectively.
The basic idea of using GBDT for classification is similar to applying GBDT for
regression. Consider a GBDT model for binary classification. The main difference
is that the output of GBDT for classification is discrete, and thus the prediction error
cannot be directly used as the target in the next round of learning as in GBDT for
regression. This problem can be solved by using the exponential loss, which is similar
to the case of Adaboost, or the log loss, which is similar to the case of logistic
regression. In this section, we discuss the case of using log loss for classification
tasks by GBDT. Log loss aims to minimize the difference between the predicted
probability and the real probability of the examples. The log loss function for
classification can be presented by

L(y, f(x)) = \log(1 + \exp(-y f(x))),

where y ∈ {−1, 1}. The procedure of using GBDT for a classification task is shown
in Algorithm 5.

Regularization can be applied to each weak learner, with the aim of reducing the
over-fitting of each weak learner. For the whole GBDT model, one possible way is to
introduce a learning rate v, v ∈ (0, 1), like that in Adaboost, to control the model
learning speed; then Equation (9.23) is changed to
G_t(x) = G_{t−1}(x) + v \sum_{j=1}^{J_t} c_{tj} I(x ∈ R_{tj}) in Algorithm 4, and Equation (9.29)
is changed to G_t(x) = G_{t−1}(x) + v \sum_{j=1}^{J_t} c_{tj} I(x ∈ R_{tj}) in Algorithm 5.
We compare boosting and bagging models based on DTs in Table 9.2.

Table 9.2 Comparison of bagging and boosting models based on DTs

| Name of algorithm | Representative models | Type of base learners | Relationship of base learners | Learning objective of each round | Combining weak learners | Learning rate |
|---|---|---|---|---|---|---|
| Bagging | RF | CART | Independent, and can be paralleled | Real target | Averaging in regression and voting in classification | No |
| Boosting | Adaboost and GBDT | Adaboost: decision stump; GBDT: CART | Dependent, and cannot be paralleled | Adaboost: real target; GBDT: negative gradient | Adaboost: weights; GBDT: additive manner | Yes |
References
[1] Quinlan J.R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers; 2014.
[2] Hastie T., Tibshirani R., Friedman J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. New York, NY: Springer; 2009.
[3] Pedregosa F., Varoquaux G., Gramfort A., et al. 'Scikit-learn: machine learning in Python'. Journal of Machine Learning Research. 2011;12:2825–2830.
[4] Bertsimas D., Dunn J. 'Optimal classification trees'. Machine Learning. 2017;106(7):1039–1082.
[5] Aghaei S., Gómez A., Vayanos P. 'Strong optimal classification trees'. arXiv preprint arXiv:2103.15965; 2021.
[6] Verwer S., Zhang Y. 'Learning optimal classification trees using a binary linear program formulation'. Proceedings of the AAAI Conference on Artificial Intelligence. 2019;33:1625–1632.
[7] Hoeffding W. 'Probability inequalities for sums of bounded random variables' in The Collected Works of Wassily Hoeffding. New York, NY: Springer; 1994. pp. 409–426.
[8] Breiman L. 'Bagging predictors'. Machine Learning. 1996;24(2):123–140.
[9] Breiman L. 'Random forests'. Machine Learning. 2001;45(1):5–32.
Chapter 10
Association rule learning
To introduce how association rules are generated from a given database, we first
introduce some basic concepts. Denote a set of N items by I = {i_1, ..., i_N}. An
item set is a subset of I, and an item set containing n′ items (where n′ ∈ {1, ..., N})
is also called an n′-item set and is denoted by I_{n′}. A database consisting of M records,
where each record is an item set, is denoted by D = {t_1, ..., t_M}. Define the event of
observing the occurrence of a particular item set I_{n′} by E(I_{n′}), which means that
all the items in I_{n′} are observed in one record. We further define P(E(I_{n′})) as the
proportion of the M records that contain all the items in I_{n′}, which can also be
interpreted as the probability of the occurrence of event E(I_{n′}). It should also be noted
that a record that includes all the items in item set I_{n′} can also include items not in
I_{n′} but in I. The probability of E(I_{n′}) is also called the Support of item set I_{n′}, that
is, P(E(I_{n′})) = Sup(I_{n′}), and P(E(I_{n′})) ∈ [0, 1]. A larger value of P(E(I_{n′})) indicates
that item set I_{n′} occurs more frequently in D. In order to define a large item set,
i.e., an item set I_{n′} that frequently occurs in the database, we define the minimum
threshold of Support for an item set to be a large item set by min_Sup. That is, an
item set I_{n′} ⊆ I with I_{n′} ≠ ∅ is a large item set if and only if Sup(I_{n′}) ≥ min_Sup.
A rule is generated by dividing a large n′-item set I_{n′}, n′ ≥ 2, into
two mutually exclusive and non-empty item sets I_j and I_k with I_j ∪ I_k = I_{n′}. A rule
can be generated from I_j to I_k in the form I_j → I_k. To determine whether rule I_j → I_k
is an association rule, denoted by I_j ⇒ I_k, two indicators are further introduced:
Confidence and Lift. Confidence of I_j → I_k is calculated by

Conf(I_j → I_k) = \frac{P(E(I_j) ∩ E(I_k))}{P(E(I_j))} = P(E(I_k)|E(I_j)), (10.1)

and Conf(I_j → I_k) ∈ [0, 1]. The larger the value of Confidence, the more likely it is that the
items in I_k appear given that the items in item set I_j appear. Lift of I_j → I_k is calculated by

Lift(I_j → I_k) = \frac{P(E(I_j) ∩ E(I_k))}{P(E(I_j)) P(E(I_k))} = \frac{P(E(I_k)|E(I_j))}{P(E(I_k))}, (10.2)
and Lift(I_j → I_k) ∈ [0, +∞) presents the influence of the occurrence of event E(I_j)
on event E(I_k): it is the ratio of the probability of the occurrence of event E(I_k)
under the condition that event E(I_j) occurs to the probability that event E(I_k) occurs
unconditionally in the database. This can be interpreted as how the occurrence of
event E(I_j) increases/decreases (i.e., lifts) the probability of occurrence of E(I_k). To be more
specific, if Lift(I_j → I_k) ∈ [0, 1), the occurrence of E(I_j) decreases the probability of the
occurrence of E(I_k). If Lift(I_j → I_k) ∈ (1, +∞), the occurrence of E(I_j) increases the
probability of the occurrence of E(I_k). If Lift(I_j → I_k) = 1, the occurrence of E(I_j)
has no influence on the occurrence of E(I_k), that is, E(I_j) and E(I_k) are independent.
It is also interesting to note that, as P(E(I_k)) acts as the denominator in calculating the
Lift of rule I_j → I_k, if P(E(I_k)) is large, meaning that the occurrence probability of
event E(I_k) is high, the value of Lift(I_j → I_k) is reduced. This shows that a
frequently occurring event contributes less to generating association
rules than rare events.
The concept of association rule is defined based on Confidence and Lift of a
rule as follows:
Definition 10.1. Given a large item set I_{n′} with n′ ≥ 2 and two non-empty, mutually
exclusive, and complementary item sets I_j and I_k, i.e., I_j ≠ ∅, I_k ≠ ∅, I_j ∩ I_k = ∅,
and I_j ∪ I_k = I_{n′}, and given the minimum thresholds of Confidence and Lift denoted
by min_Conf and min_Lift, respectively, the rule I_j → I_k is an association rule if and
only if Conf(I_j → I_k) ≥ min_Conf and Lift(I_j → I_k) ≥ min_Lift.
In association rule I_j ⇒ I_k, I_j is called the antecedent or left-hand side and I_k the
consequent or right-hand side, and the rule can be interpreted as "if I_j then I_k." The implication
of this association rule is that if the items contained in I_j occur, there is a high (which
is guaranteed by min_Conf) and also higher than unconditional (which is guaranteed by min_Lift)
probability that the items in I_k can be detected. The thresholds of Confidence and
Lift show the basic idea of association rule generation: the threshold of Confidence
guarantees that the ratio of Sup(I_j ∪ I_k) to Sup(I_j) is above a certain level, i.e., the
rule is meaningful, and the threshold of Lift guarantees that the association rule is
strong enough to be regarded as an effective "rule." In the following sections, the
most popular algorithm to generate association rules, namely Apriori, is introduced
in detail first. Then, as an improvement of Apriori, FP-growth, which uses a
more efficient data structure to store the data and realizes more efficient searching, is
briefly covered.
10.2 Apriori algorithm
Before presenting the details of the Apriori algorithm, we first present the following
two properties of large item sets:
Property 1: Any non-empty and strict subset of a large item set is large.
Property 2: Any super-set of a non-large item set cannot be large.
The above properties are intuitive. For Property 1, suppose we have a large item
set I, and it has two mutually exclusive and non-empty subsets I_j and I_k with
I_j ∪ I_k = I. The Support of I should be no less than min_Sup because it is a large item
set. As the number of occurrences of I_j in the database is no less than that of I, we have
Sup(I_j) ≥ Sup(I) ≥ min_Sup, and thus I_j is also a large item set. The same applies to
item set I_k. For Property 2, for a non-large item set I′, its Support is less than min_Sup,
and thus the Support of any super-set of I′ is also less than min_Sup. Therefore, it cannot
be a large item set.
The Apriori algorithm was proposed by Agrawal et al. in 1993 [1]. It is developed
based on the above properties to improve the computational efficiency of
large item set generation. Denote a large item set containing n′ items as a large
n′-item set, denote L_{n′} as the set of all large n′-item sets, and denote N(I_{n′}) as the
number of occurrences of item set I_{n′} (I_{n′} ∈ L_{n′}) in database D. The algorithm to
generate the set of all large n′-item sets, i.e., L_{n′}, is shown in Algorithm 1.
In Algorithm 1, Step 1 finds all large 1-item sets by scanning the whole
database D and examining each item it contains. When n′ ≥ 2, by iteration, candidate
large n′-item sets are first found by combining the large item sets in L_{n′−1} whose
first (n′ − 2) items are the same while the last item is different, assuming that the
items are ordered alphabetically. Then, the subsets containing (n′ − 1) items of each
combination are checked to ensure that all of these subsets are large item sets; otherwise,
the item set containing n′ items is removed from the set of candidate large
item sets. The detailed steps are shown in Algorithm 2. Then, the Support of each of the
candidate large n′-item sets generated by Algorithm 2 is checked, and the sets with
Support no less than min_Sup are finally retained as the large n′-item sets, i.e., L_{n′}.
The following toy example shows the procedure of using the Apriori
algorithm to generate large item sets from a given database D. We require that
min_Sup = 0.5. The set of records is shown in Table 10.1.
The candidate large 1-item sets and final large 1-item sets can be generated
by Step 1 of Algorithm 1. To be more specific, the candidate large 1-item sets are
shown in Table 10.2, and the large 1-item sets are shown in Table 10.3.
Then, all the large n′-item sets, where n′ ≥ 2, are generated by Step 2 of
Algorithm 1, which combines the identified large (n′−1)-item sets one by one if they
have the same first (n′ − 2) items. Then, subsets of each candidate large item set are
evaluated. The candidate large 2-item sets and the final large 2-item sets are shown
in Tables 10.4 and 10.5, respectively.
Based on the large 2-item sets generated, only one candidate large 3-item set,
{2, 3, 5}, can be formulated, which combines {2, 3} and {2, 5} as they have the
same first item 2. The Support of {2, 3, 5} is 2/4, which is equal to the threshold,
and thus it is a large 3-item set. The process of large item set generation then
terminates, as no more large item sets can be formulated.
Table 10.1 Records in the toy database

| Records | Items |
|---|---|
| t1 | {1, 3, 4} |
| t2 | {2, 3, 5} |
| t3 | {1, 2, 3, 5} |
| t4 | {2, 5} |
Table 10.2 Candidate large 1-item sets and their Support

| Item set | Support |
|---|---|
| {1} | 2/4 |
| {2} | 3/4 |
| {3} | 3/4 |
| {4} | 1/4 |
| {5} | 3/4 |
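The toy example can be verified with a few lines of Python. The following sketch (not from the book) brute-forces the Support computation rather than implementing Apriori's candidate generation, which is sufficient at this scale:

```python
from itertools import combinations

records = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 0.5

def support(itemset):
    """Fraction of records that contain every item in `itemset`."""
    return sum(itemset <= r for r in records) / len(records)

items = sorted(set().union(*records))
# Brute-force enumeration over all item sets of each size; Apriori
# would instead prune this search using Properties 1 and 2.
for size in range(1, len(items) + 1):
    large = [set(c) for c in combinations(items, size)
             if support(set(c)) >= min_sup]
    if not large:
        break
    print(size, large)
# Prints the large 1-, 2-, and 3-item sets, ending with {2, 3, 5}.
```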
After generating all large n′-item sets, association rules can be generated
from them. Thresholds for Confidence and Lift, denoted by min_Conf and
min_Lift, respectively, should be specified before association rule generation.
In the Apriori algorithm, Property 3 is used to guide the generation of association
rules:
Property 3: Partition a large item set I into two mutually exclusive and
complementary subsets I_j and I_k. The rule from I_j to I_k is denoted by I_j → I_k.
Assume that Conf(I_j → I_k) < min_Conf. Then, for any non-empty and strict subset of I_j,
denoted by I′_j, and the corresponding super-set of I_k, denoted by I′_k = I ∖ I′_j, the rule from I′_j to I′_k
is called a sub-rule of I_j → I_k, and Conf(I′_j → I′_k) < min_Conf. The proof of Property 3
is as follows.
Proof: Denote the events of observing I_j and I_k by E(I_j) and E(I_k), respectively,
and the events of observing I′_j and I′_k by E(I′_j) and E(I′_k), respectively. The
Confidence of rule I_j → I_k can be calculated by

Conf(I_j → I_k) = \frac{P(E(I_j) ∩ E(I_k))}{P(E(I_j))} = \frac{P(E(I))}{P(E(I_j))}, (10.3)
Table 10.3 Large 1-item sets and their Support

| Item set | Support |
|---|---|
| {1} | 2/4 |
| {2} | 3/4 |
| {3} | 3/4 |
| {5} | 3/4 |
Table 10.4 Candidate large 2-item sets and their Support

| Item set | Support |
|---|---|
| {1, 2} | 1/4 |
| {1, 3} | 2/4 |
| {1, 5} | 1/4 |
| {2, 3} | 2/4 |
| {2, 5} | 3/4 |
| {3, 5} | 2/4 |
Table 10.5 Large 2-item sets and their Support

| Item set | Support |
|---|---|
| {1, 3} | 2/4 |
| {2, 3} | 2/4 |
| {2, 5} | 3/4 |
| {3, 5} | 2/4 |
Figure 10.1 Generation of association rules from large 3-item set {2, 3, 5}
The procedure of generating association rules from the large 3-item set {2, 3, 5}
found in the toy example is shown in Figure 10.1. Suppose that we set min_Conf = 1
and min_Lift = 1.2.
As shown in Figure 10.1, in Step 1, three candidate rules can be generated:
2, 3 → 5; 2, 5 → 3; and 3, 5 → 2. Their Confidence values are first calculated, and
it is found that the Confidence of rule 2, 5 → 3 is less than min_Conf = 1. Hence, its Lift
value is not calculated, and its sub-rules are not considered, as shown by the
gray nodes. Then, the Lift values of the other two rules are checked; they are larger
than the threshold of Lift, and thus these two rules are association rules. Then, in Step
2, only one candidate rule needs to be further investigated, namely 3 → 2, 5. This is
because the other two candidate rules, 2 → 3, 5 and 5 → 2, 3, are sub-rules
of rule 2, 5 → 3, whose Confidence is smaller than min_Conf = 1. As 3 → 2, 5 has
Confidence = 2/3, which is less than min_Conf = 1, it cannot be an association rule.
To conclude, two association rules can be generated from the large 3-item set {2, 3, 5}
in the toy example: 2, 3 → 5 and 3, 5 → 2.
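These Confidence and Lift values can be verified numerically. A minimal sketch (not from the book) over the toy database:

```python
records = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset):
    return sum(itemset <= r for r in records) / len(records)

def conf(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return conf(antecedent, consequent) / support(consequent)

# The three candidate rules of Step 1 in Figure 10.1.
for a, c in [({2, 3}, {5}), ({2, 5}, {3}), ({3, 5}, {2})]:
    print(a, "->", c, conf(a, c), lift(a, c))
# conf(2,5 -> 3) = 2/3 < min_Conf = 1, so that rule (and its sub-rules)
# is discarded; the other two have conf = 1 and lift = 4/3 ≈ 1.33 > 1.2.
```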
containing the deficiency codes of each record, all non-zero feature values in the data
set are encoded as "T", and zero feature values are encoded as "?". Then, the initial
.csv file is converted to an .arff file. The thresholds of Support, Confidence, and Lift
are set to min_Sup = 0.1, min_Conf = 0.7, and min_Lift = 1.5. The large item sets
are shown in Tables 10.7–10.9, and the association rules mined from the large item
sets are shown in Table 10.10.
Table 10.7 shows that the most frequently detected deficiency items are D07:
fire safety, which is identified in more than half of the 200 inspections of concern,
followed by D10: safety of navigation, which is found in 43.5% of the inspections.
From Table 10.8, it can be seen that deficiency codes D07 and D10 are often detected
together, which can be found in nearly 25% of the inspection records. In addition,
D04 and D07 as well as D07 and D11 are often identified in one inspection, which
can be seen in more than 15% of the inspection records. Finally, Table 10.9 shows
that D07, D10, and D11 together with D04, D07, and D10 are detected together in
more than 10% of the inspection records. The association rules shown in Table 10.10
are all generated from the large 3-item sets and can be interpreted as follows. For
example, for the association rule D10, D11 ⇒ D07: if D10 and D11 are detected on
one ship, the probability that the ship has deficiency code D07 is 81%. Compared to
the fact that the probability that a ship has deficiency D07 is 51% if no prior information
is known, as shown in Table 10.7, the Lift value of this association rule is about
1.58, showing that identifying deficiency codes D10 and D11 increases the probability
of identifying deficiency code D07 to 1.58 times its unconditional value. Therefore, this
association rule can be used to guide onboard ship inspection: if deficiency codes D10
and D11 are detected on a ship, the PSC inspector is suggested to pay more attention
to deficiency items under deficiency code D07, as there is a high probability that
the ship has a deficiency/deficiencies under this code.
10.3 FP-growth algorithm
The main difference between the FP-growth algorithm and the Apriori algorithm lies
in the generation of all large item sets, that is, finding all item sets with
Sup(I_{n′}) ≥ min_Sup; the procedure of generating all association rules is the
same in both algorithms. The FP-growth algorithm was proposed by Han et al. in 2000 [4]
and is regarded as an improvement of Apriori as it is much faster. In Apriori, data
are stored in a set structure, and candidate large item sets must be generated, the number
of which can be large. Moreover, the whole database needs to be traversed multiple
times to check the Support values of the candidate item sets in a brute-force manner,
and traversing the whole data set heavily reduces the computational efficiency. In
the FP-growth algorithm, data are stored in a tree structure called a frequent pattern
tree, or FP tree. The FP tree allows for faster scanning, and no candidate large item sets
need to be generated. In an FP tree, each node represents a single item in the original
database, and the links between the nodes represent their co-occurrence in the database.
We do not show the detailed steps to construct an FP tree and the large item sets
in this section, and only present the rough steps as follows. Readers are referred
to Han et al. [4] and Borgelt [5] for more details.
1. Scan the database to calculate the frequency (i.e., Support) of individual items,
and remove the items with Support values less than min_Sup.
2. Start to construct the FP tree by creating the root of the tree, which represents null.
3. Scan the database again to examine each of the records and add them to the tree one
by one. In particular, each branch of the tree is constructed with items in descending
order of count.
4. After constructing the FP tree, the tree is mined starting from the lowest nodes and
the links to them. The lowest nodes represent frequent patterns of length 1. From
each of the lowest nodes, traverse the path in the FP tree; the path or paths found
are called a conditional pattern base.
5. Construct a conditional FP tree based on the conditional pattern bases of the
nodes.
6. Large item sets are generated from the conditional FP tree constructed (see the
sketch after this list).
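The book's experiments use Weka (hence the .arff files mentioned above). As a rough Python analogue, and assuming the third-party mlxtend library is available, the toy database could be mined with FP-growth as follows:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

records = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

# Encode the transactions as a boolean item matrix.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(records).transform(records), columns=te.columns_)

# Mine large item sets with FP-growth, then derive rules from them.
large = fpgrowth(df, min_support=0.5, use_colnames=True)
rules = association_rules(large, metric="confidence", min_threshold=1.0)
print(large)
print(rules[["antecedents", "consequents", "confidence", "lift"]])
```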
References
[1] Agrawal R., Imieliński T., Swami A. 'Mining association rules between sets of items in large databases'. The 1993 ACM SIGMOD International Conference [online]; Washington, DC. New York, NY; 1993. pp. 207–216. Available from http://portal.acm.org/citation.cfm?doid=170035
[2] Witten I.H., Frank E. 'Data mining'. ACM SIGMOD Record. 2002;31(1):76–77.
[3] The Tokyo MOU. List of Tokyo MOU deficiency codes [online]; 2017. Available from http://www.tokyo-mou.org/doc/Tokyo%20MOU%20deficiency%20codes%20(December%202017).pdf [Accessed 9 Aug 2021].
[4] Han J., Pei J., Yin Y. 'Mining frequent patterns without candidate generation'. ACM SIGMOD Record. 2000;29(2):1–12.
[5] Borgelt C. An implementation of the FP-growth algorithm [online]. 2005. pp. 1–5. Available from http://portal.acm.org/citation.cfm?doid=1133905
Chapter 11
Cluster analysis
Cluster analysis, or clustering, is a general task aiming to group a given set of examples
into several groups (i.e., clusters) following given criteria, such that the examples in
the same group are as close to each other as possible, and the examples in different groups are as
different from each other as possible. Generally, clustering works in the context of unsupervised
learning, where only the features of the examples are known while there is no target
defined or the targets are unknown. It aims to divide the whole data set into several mutually
exclusive and complementary clusters, so as to mine the properties of the clusters
formulated, where such properties are represented by cluster labels. Mathematically,
suppose that we have an unlabeled data set D = {x_i, i = 1, ..., n}, where x_i ∈ R^m is
the feature vector, represented by x_i = (x_{i1}, ..., x_{im}). A clustering algorithm
aims to divide D into K groups {C_k | k = 1, ..., K} such that C_{k′} ∩ C_{k″} = ∅ for k′ ≠ k″,
k′ ∈ {1, ..., K} and k″ ∈ {1, ..., K}, and D = ∪_{k=1}^{K} C_k. The cluster label of cluster C_k
is denoted by λ_k, which needs to be decided and summarized by the model user. Based
on the cluster labels, the label of each example, which we call the example cluster
label, can be obtained and is denoted by λ′_i for example x_i. One needs to distinguish between
clustering and classification, although both aim to separate the examples in the data set. In
classification, the labels of the examples are known, and the model training process aims
to separate the data according to the features while minimizing a given loss function
measured by the examples' labels, so as to obtain a classifier to predict the targets of
new examples. In contrast, labels are unavailable in clustering, and the model training
process aims to put examples that are similar to each other regarding their features in the
same cluster, and then to summarize the cluster labels of the clusters generated.
In the following sections, distance measures of two examples in the data set and
of two clusters are first introduced, as distance measure is the most important con-
cept in cluster analysis. Then, metrics to evaluate the performance of cluster analysis
are introduced. Finally, popular clustering algorithms are discussed with examples
of ship inspection by PSC provided.
of these two examples. Distance measure is an objective score used to measure the
relative difference/dissimilarity between two examples in the problem of concern.
The distance between two examples x_i and x_j is denoted by dist(x_i, x_j), which satisfies
the following properties: non-negativity, dist(x_i, x_j) ≥ 0; identity, dist(x_i, x_j) = 0 if and
only if x_i = x_j; symmetry, dist(x_i, x_j) = dist(x_j, x_i); and the triangle inequality,
dist(x_i, x_j) ≤ dist(x_i, x_k) + dist(x_k, x_j).
The Minkowski distance of two examples x_i and x_j is

dist_mk(x_i, x_j) = \left( \sum_{m′=1}^{m_1} |x_{im′} − x_{jm′}|^p \right)^{1/p}. (11.1)

When p = 1, Equation (11.1) is called the Manhattan distance:

dist_man(x_i, x_j) = ||x_i − x_j||_1 = \sum_{m′=1}^{m_1} |x_{im′} − x_{jm′}|. (11.2)

The Manhattan distance is the sum of the absolute differences between each pair of components
of the coordinates of two examples, and its calculation is related to the L1
vector norm. It was first proposed to calculate the driving distance from one
crossroad to another in Manhattan. Therefore, Equation (11.2) is also
referred to as the city block distance.
When p = 2, Equation (11.1) is called the Euclidean distance, and can be written as

dist_ed(x_i, x_j) = ||x_i − x_j||_2 = \sqrt{\sum_{m′=1}^{m_1} |x_{im′} − x_{jm′}|^2}. (11.3)
Euclidean distance is the most widely used distance measure of two examples whose
calculation is related to the L2 vector norm, and can be interpreted as the linear dis-
tance between two examples.
To measure the similarity of two examples regarding the angle between them,
one can use cosine similarity, which is calculated as follows:
cos(x_i, x_j) = \frac{x_i \cdot x_j}{|x_i| |x_j|} = \frac{\sum_{m′=1}^{m_1} x_{im′} x_{jm′}}{\sqrt{\sum_{m′=1}^{m_1} x_{im′}^2} \sqrt{\sum_{m′=1}^{m_1} x_{jm′}^2}}. (11.4)

The range of the cosine similarity of two examples is [−1, 1]. Then, the cosine distance
of two examples can be calculated from their cosine similarity by

dist_cos(x_i, x_j) = 1 − cos(x_i, x_j). (11.5)
One should also note that, for two vectors that are normalized by L2 normalization,
the squared Euclidean norm of their difference is equivalent to twice their cosine
distance. The proof is as follows:
Proof: For two vectors A = (x_{1A}, x_{2A}, ..., x_{m_1 A}) and B = (x_{1B}, x_{2B}, ..., x_{m_1 B}), they
are first normalized by L2 normalization, respectively:

A′ = \frac{A}{\sqrt{x_{1A}^2 + x_{2A}^2 + ... + x_{m_1 A}^2}} = (x_{1A′}, x_{2A′}, ..., x_{m_1 A′})

and

B′ = \frac{B}{\sqrt{x_{1B}^2 + x_{2B}^2 + ... + x_{m_1 B}^2}} = (x_{1B′}, x_{2B′}, ..., x_{m_1 B′}).

Then, the cosine similarity of A′ and B′ can be calculated by

cos(A′, B′) = \frac{\sum_{m′=1}^{m_1} x_{m′A′} x_{m′B′}}{\sqrt{\sum_{m′=1}^{m_1} x_{m′A′}^2} \sqrt{\sum_{m′=1}^{m_1} x_{m′B′}^2}} = \sum_{m′=1}^{m_1} x_{m′A′} x_{m′B′}. (11.6)

The squared Euclidean norm of the difference of A′ and B′ has the following form:

|A′ − B′|^2 = (x_{1A′} − x_{1B′})^2 + (x_{2A′} − x_{2B′})^2 + ... + (x_{m_1 A′} − x_{m_1 B′})^2
= 2 − 2 \sum_{m′=1}^{m_1} x_{m′A′} x_{m′B′}
= 2 − 2 cos(A′, B′)
= 2 dist_cos(A′, B′). (11.7)

Then, the equivalence of the squared Euclidean norm and twice the cosine distance
of two normalized vectors is proven.
The modified cosine similarity aims to reduce the influence of the absolute values
of the components of a feature vector. It is calculated by subtracting the mean of the
components of the vector of an example from each component of the vector. To be
more specific, the mean of x_i (which has m_1 dimensions) is calculated by

\bar{x}_i = \frac{\sum_{m′=1}^{m_1} x_{im′}}{m_1}, (11.8)

and the mean of x_j is calculated by

\bar{x}_j = \frac{\sum_{m′=1}^{m_1} x_{jm′}}{m_1}. (11.9)

The modified cosine similarity can then be calculated by

cos′(x_i, x_j) = \frac{(x_i − \bar{x}_i) \cdot (x_j − \bar{x}_j)}{|x_i − \bar{x}_i| |x_j − \bar{x}_j|} = \frac{\sum_{m′=1}^{m_1} (x_{im′} − \bar{x}_i)(x_{jm′} − \bar{x}_j)}{\sqrt{\sum_{m′=1}^{m_1} (x_{im′} − \bar{x}_i)^2} \sqrt{\sum_{m′=1}^{m_1} (x_{jm′} − \bar{x}_j)^2}}. (11.10)

Similarly, the modified cosine distance can be calculated by

dist′_cos(x_i, x_j) = 1 − cos′(x_i, x_j). (11.11)
Measuring the distance between examples containing nominal features can be much more
difficult, as their feature values are noncomparable. To measure the distance of two
examples containing nominal features, the cosine distance shown in Equation (11.5)
and the modified cosine distance shown in Equation (11.11) can be used, as they only
consider the angle between the feature vectors instead of the absolute values. If the
nominal features are encoded into binary features (e.g., the three values for feature
ship type, namely container ship, bulk carrier, and passenger ship, are encoded into
three new features: is_container_ship, is_bulk_carrier, and is_passenger_ship, and
100, 010, and 001 are used as the feature values of the three new features to indicate
the three types of ships), the Hamming distance can be used. The Hamming distance is a
measurement of the number of differing values in two strings of equal length, and it can
be calculated by

dist_ham(x_i, x_j) = \sum_{m′=1}^{m_2} 1(x_{im′} ≠ x_{jm′}), (11.12)

where 1(x) is the indicator function, which takes value 1 if x is true and 0 otherwise.
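A minimal sketch (not from the book) of Equations (11.2), (11.3), (11.5), and (11.12), with illustrative vectors:

```python
import numpy as np

def manhattan(xi, xj):           # Equation (11.2)
    return np.sum(np.abs(xi - xj))

def euclidean(xi, xj):           # Equation (11.3)
    return np.sqrt(np.sum((xi - xj) ** 2))

def cosine_distance(xi, xj):     # Equations (11.4) and (11.5)
    cos = xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))
    return 1.0 - cos

def hamming(xi, xj):             # Equation (11.12), for binary encodings
    return int(np.sum(xi != xj))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 5.0])
print(manhattan(a, b), euclidean(a, b), cosine_distance(a, b))
print(hamming(np.array([1, 0, 0]), np.array([0, 1, 0])))  # 2
```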
1. Single-linkage clustering: Dist_min(C_{k′}, C_{k″}) = min_{x_i ∈ C_{k′}, x_j ∈ C_{k″}} dist(x_i, x_j), which is
the minimum distance between two examples belonging to the two clusters, respectively.
2. Complete-linkage clustering: Dist_max(C_{k′}, C_{k″}) = max_{x_i ∈ C_{k′}, x_j ∈ C_{k″}} dist(x_i, x_j), which is
the maximum distance between two examples belonging to the two clusters, respectively.
3. Unweighted average-linkage clustering: Dist_uavg(C_{k′}, C_{k″}) = \frac{1}{|C_{k′}||C_{k″}|} \sum_{x_i ∈ C_{k′}, x_j ∈ C_{k″}} dist(x_i, x_j), which is
the average distance over all pairs of examples belonging to the two clusters, respectively.
4. Centroid-linkage clustering: Dist_cen(C_{k′}, C_{k″}) = dist(c_{k′}, c_{k″}), where c_{k′} and c_{k″} are
the centroids of clusters C_{k′} and C_{k″}, respectively, and can be calculated by
c_{k′} = \frac{1}{|C_{k′}|} \sum_{x_i ∈ C_{k′}} x_i and c_{k″} = \frac{1}{|C_{k″}|} \sum_{x_j ∈ C_{k″}} x_j. Therefore, centroid-linkage
clustering measures the distance between the centroids of the two clusters (all four
measures are illustrated in the sketch after this list).
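A minimal sketch (not from the book) of the four inter-cluster distance measures, using Euclidean distance as dist(·,·) and two illustrative clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkages(Ck1, Ck2):
    """The four inter-cluster distances above, with Euclidean dist()."""
    d = cdist(Ck1, Ck2)                      # all pairwise distances
    return {"single": d.min(),
            "complete": d.max(),
            "average": d.mean(),
            "centroid": np.linalg.norm(Ck1.mean(axis=0) - Ck2.mean(axis=0))}

C1 = np.array([[0.0, 0.0], [1.0, 0.0]])
C2 = np.array([[4.0, 0.0], [5.0, 1.0]])
print(linkages(C1, C2))
```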
a larger denominator indicates a larger inter-cluster distance). Then, the worst case for each
cluster is identified by the max operator, and the DBI calculates the sum of the
ratios between intra-cluster distance and inter-cluster distance in the worst case over all
the clusters generated. Therefore, the smaller the value of DBI, the better the performance
of a clustering model. Another popular metric is the Dunn index (DI),
which considers extreme intra-cluster and inter-cluster distances, as follows:

DI = \frac{\min_{1 ≤ k′ < k″ ≤ K} Dist_min(C_{k′}, C_{k″})}{\max_{1 ≤ k ≤ K} max(C_k)}, (11.16)

where max(C_k) is calculated in Equation (11.14). A larger value of DI indicates
that a clustering algorithm performs better, as a larger minimum inter-cluster distance
(i.e., Dist_min(C_{k′}, C_{k″})) or a smaller maximum intra-cluster distance (i.e.,
\max_{1 ≤ k ≤ K} max(C_k)) increases the value of DI.
The above two metrics are called internal indexes for clustering performance
evaluation, as they only consider the clusters generated by the clustering algorithm
itself. If there is a reference model that the clustering algorithm can be compared
with, external indexes can be used. For data set D containing n examples, denote
the clusters given by the reference model by {C*_l | l = 1, ..., S} with example cluster
labels {λ*_i | i = 1, ..., n}. Recall that the clusters given by the current clustering model
are {C_k | k = 1, ..., K} with example cluster labels {λ_i | i = 1, ..., n}. Note that S is not necessarily
equal to K. Examples in D are divided into clusters using the reference model and
the current clustering model to be evaluated. For each pair of examples x_i ∈ D and
x_j ∈ D with i < j, the cases of how they belong to the clusters of the two clustering models
are as follows:
1. SS = {(x_i, x_j) | λ_i = λ_j, λ*_i = λ*_j, i < j}, which means that x_i and x_j are divided into
the same cluster in both the clustering model to be evaluated and the reference
model. Denote the number of such pairs of examples by a = |SS|.
2. SD = {(x_i, x_j) | λ_i = λ_j, λ*_i ≠ λ*_j, i < j}, which means that x_i and x_j are divided into
the same cluster in the clustering model to be evaluated but different clusters in
the reference model. Denote the number of such pairs of examples by b = |SD|.
3. DS = {(x_i, x_j) | λ_i ≠ λ_j, λ*_i = λ*_j, i < j}, which means that x_i and x_j are divided into
the same cluster in the reference model but different clusters in the clustering
model to be evaluated. Denote the number of such pairs of examples by c = |DS|.
4. DD = {(x_i, x_j) | λ_i ≠ λ_j, λ*_i ≠ λ*_j, i < j}, which means that x_i and x_j are divided
into different clusters in both the clustering model to be evaluated and the reference
model. Denote the number of such pairs of examples by d = |DD|.
As we require that i < j, each pair of examples appears at most once in SS, SD, DS,
and DD, and we have a + b + c + d = n(n − 1)/2. Based on the above cases, three
external indexes, namely the Jaccard coefficient (JC), the Fowlkes and Mallows index
(FMI), and the Rand index (RI), are defined as follows:

JC = \frac{a}{a + b + c}, (11.17)
FMI = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}, (11.18)

and

RI = \frac{2(a + d)}{n(n − 1)}. (11.19)
The range of the above three indicators is [0, 1], and larger values indicate better
clustering algorithm performance.
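A minimal sketch (not from the book) computing JC, FMI, and RI from two illustrative label vectors by enumerating the pairs defined above:

```python
from itertools import combinations

def external_indexes(labels_model, labels_ref):
    """JC, FMI, and RI from the pair counts a, b, c, d defined above."""
    n = len(labels_model)
    a = b = c = d = 0
    for i, j in combinations(range(n), 2):
        same_model = labels_model[i] == labels_model[j]
        same_ref = labels_ref[i] == labels_ref[j]
        if same_model and same_ref:
            a += 1
        elif same_model:
            b += 1
        elif same_ref:
            c += 1
        else:
            d += 1
    jc = a / (a + b + c)
    fmi = (a / (a + b) * a / (a + c)) ** 0.5
    ri = 2 * (a + d) / (n * (n - 1))
    return jc, fmi, ri

print(external_indexes([0, 0, 1, 1], [0, 0, 1, 2]))
```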
11.2.1 Clustering algorithms
Classic clustering algorithms can roughly be divided into three types: partition-
based methods, density-based methods, and hierarchy-based methods. A summary
of these algorithms is presented in Table 11.1.
In the following subsections, a typical algorithm under each type of clustering
methods is introduced. In particular, we introduce K-means as the representative of
partition-based methods, density-based spatial clustering of applications with noise
(DBSCAN) as the representative of density-based methods, and agglomerative
algorithm as the representative of hierarchy-based methods.
Table 11.1 A summary of classic clustering algorithms
It is noted that the stopping criterion of K-means is that c_k is not updated, which
means that the centroids of the clusters no longer change. This can take
quite a long time or lead to too many iterations. To reduce the computational burden,
a maximum algorithm running time or a maximum number of updates can be set,
and the algorithm terminates when the threshold on the running time or the number
of updates is reached. The disadvantages of K-means are that:
1. The value of K should be preset in a manual manner, which can be prone to biased
subjective judgment.
2. As the initial centroids are generated randomly, the algorithm suffers from
uncertainties.
3. As the centroids are determined by their distance to examples in the data set,
clustering performance can be heavily influenced by outliers.
1. ε, which is a threshold on the distance between two examples for one to be considered
as in the neighborhood of the other. For an example x_i ∈ D, its ε-neighborhood,
denoted by N_ε(x_i), is constituted by the examples whose distance from x_i is no
more than ε, i.e., N_ε(x_i) = {x_j ∈ D | dist(x_i, x_j) ≤ ε}. This guarantees that examples
in the same cluster are close enough to each other.
2. Min_Pts, which is the minimum number of examples in the neighborhood of a certain
example (including the example itself) for this example to be considered a
core example. This guarantees that a cluster contains a certain number of examples.
1. Core example: for a given example x_i ∈ D, if the number of examples in its
ε-neighborhood is no less than Min_Pts, i.e., |N_ε(x_i)| ≥ Min_Pts, then x_i is a core
example.
2. Directly density reachable: if example x_i is a core example and x_j ∈ N_ε(x_i), then
example x_j is directly density reachable from example x_i. It should be mentioned
that x_i might not be directly density reachable from example x_j, unless x_j
is also a core example.
3. Density reachable: for examples x_i and x_j, there is a series of examples x_{k_1}, ..., x_{k_l}
such that x_{k_1} = x_i and x_j is directly density reachable from x_{k_l}. If x_{k_{l′}} and x_{k_{l′+1}},
where 1 ≤ l′ ≤ l − 1, are directly density reachable from each other, meaning that x_{k_{l′}} is within the
ε-neighborhood of x_{k_{l′+1}} and x_{k_{l′+1}} is also within the ε-neighborhood of x_{k_{l′}},
then x_j is density reachable from x_i. As the series of examples x_{k_1} (i.e., x_i), ..., x_{k_l}
are directly density reachable from each other, they must be core examples.
Meanwhile, x_j is not necessarily a core example.
4. Density connected: for examples x_i and x_j, if there is another example x_k such
that x_i is density reachable from x_k and x_j is density reachable from x_k, then x_i
and x_j are density connected. For these three examples x_i, x_j, and x_k, x_k must be
a core example, while x_i and x_j are not necessarily core examples.
Figure 11.1 is used to further illustrate the above concepts. Given the condition
that Min_Pts = 3, the examples in red are core examples, as no fewer than three examples
are in their ε-neighborhood, while those not in red are noncore examples. Examples
that are directly density reachable from a core example are in the blue circle with
the core example in red at the center. The green links connecting the core examples
show that these core examples are density reachable from each other, while
if a core example is within the ε-neighborhood of another core example, they are
directly density reachable from each other. As the examples in blue are within the
ε-neighborhood of the core examples that are density reachable, all the examples in
blue are density connected.
Based on the above concepts and explanation, a cluster in DBSCAN is defined
as a maximal set of examples that are density connected from a core example.
In other words, for a core example x_i, the cluster expanded from it is denoted by
C_k = {x ∈ D | x and x_i are density connected}. The procedure of the DBSCAN
algorithm is shown in Algorithm 2.
In lines 2 to 7 of Algorithm 2, the set of all core examples is first generated
by traversing the whole data set. Then, a random core example is selected,
and all the core examples that are density reachable from this core example are
found. In particular, to ensure that the cluster is expanded from the initial random core
example, a queue data structure (i.e., cur) is used to store the core examples density
reachable from this core example, so as to ensure that these core examples are visited
one by one from the nearest to the farthest from the initial core example. Then,
the examples that are in the ε-neighborhood of all these core examples are added to the
current cluster, as shown in lines 14 to 18. Therefore, the initial (starting)
core example is density connected with the other examples in this cluster. The above
process is repeated until all the core examples are processed (i.e., added to clusters),
as shown by the while loop from lines 10 to 22 of Algorithm 2.
Unlike K-means, where all the examples are divided into clusters, in DBSCAN there might be
some examples that do not belong to any cluster, which are referred to
as noises. The advantages of DBSCAN are that:
1. It is not sensitive to noises and can distinguish noise examples from normal
examples in the data set.
2. The number of clusters does not need to be preset.
3. It can generate clusters of arbitrary shapes, which means that the contour of the
examples contained in a cluster can be of any shape, as long as the examples in
one cluster are density connected from the initial core example.
The disadvantages of DBSCAN are that:
1. It is ineffective when dealing with data sets with uneven density, which means
that the distribution of the examples in the feature space is uneven.
2. When the size of the data set is large, DBSCAN can take a long time to
converge.
3. The values of parameters ε and Min_Pts can have a very large influence on the
clusters generated.
4. As a random unvisited core example is selected as the seed from which a cluster
is to be generated, the stability of the algorithm is adversely influenced.
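As an illustration (not from the book), DBSCAN as implemented in scikit-learn could be applied as follows; the data and the ε and Min_Pts values are purely illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered noise points.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               rng.uniform(-2, 7, (5, 2))])

# eps is the ε-neighborhood radius and min_samples is Min_Pts.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise examples
```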
We use Example 11.1 to show how to use cluster analysis to cluster ships in PSC
inspection while generating explanations and implications from the results, using
K-means as an example. To show the clustering process visually, we use a data set
containing 50 ships that are randomly chosen from the whole data set, with two
features: age and last_deficiency_no, as a toy example.
Example 11.1: The details of the 50 selected ships are shown in Table 11.2. The
distribution of the features of the ships is shown in Figure 11.2.
In particular, we choose K = 2, K = 3, and K = 4 in the K-means algorithm.
K-means is implemented via the scikit-learn API [1]. It can easily be called using the
following lines of code:
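The original code is not reproduced here. A minimal sketch of what it may look like; the five rows of feature values below are illustrative placeholders, not the actual Table 11.2 data:

```python
import pandas as pd
from sklearn.cluster import KMeans

# df stands in for the 50 ships of Table 11.2 with the two features
# used in this example; these values are illustrative only.
df = pd.DataFrame({"age": [3, 12, 25, 8, 17],
                   "last_deficiency_no": [0, 4, 9, 2, 6]})

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(df)
    print(k, km.labels_, km.cluster_centers_)
```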
Reference
[1] Pedregosa F., Varoquaux G., Gramfort A., et al. 'Scikit-learn: machine learning in Python'. Journal of Machine Learning Research. 2011;12:2825–2830.
Chapter 12
Classic and emerging approaches to solving
practical problems in maritime transport
This chapter discusses classic and emerging approaches adopted in the existing academic
literature to address practical problems in maritime transport. First, widely studied
practical problems in the maritime industry are summarized according to review
papers. Then, classic methods that are widely adopted by the related studies are introduced.
After that, data-driven methods, as a typical type of emerging approach, are
discussed. In particular, several examples of applying data-driven models to address
prediction tasks in maritime studies are given. Finally, several issues regarding the
application of data-driven models to address practical problems in maritime transport
are presented, with various examples in port state control (PSC) provided.
Maritime transport is the transportation of cargo over the global water transporta-
tion network, where seaports serve as the nodes and waterways are the links of the
network. Therefore, research topics in maritime transport research can be
divided into a "shipping part" and a "port part" according to Talley [1]. There are several
review papers discussing the research themes, topics, and trends of the academic
literature on maritime transportation, which we summarize as follows in terms of
the number of papers involved, the research areas and topics covered,
and the main conclusions.
1. A series of review studies was conducted by Woo et al. in 2011, 2012,
and 2013 on how seaports were studied from the 1980s to the 2000s from
three perspectives: research themes and trends [2], research methodology
[3], and research collaboration [4]. The authors found that in the 1980s, about
half of the papers in academic journals were published by Maritime Policy &
Management. In the following decades, more than one-third of the related papers were still published by Maritime Policy & Management, while more journals, some of them newly established, started to publish papers on maritime transport,
such as Maritime Economics & Logistics, Journal of Transport Geography,
Transportation Research Part E: Logistics and Transportation Review, Marine
Policy, and Transport Policy. The authors divide port research into two main categories.
12.2 Research methods and their specific applications to maritime transport research
According to Shi and Li [5] and Yan et al. [6], seven methods are the most popular
in academic research on maritime transportation, whose explanations are shown in
Table 12.1.
From Table 12.1, it can be seen that the majority of the classic research methods are qualitative, including SIQO, case study, CCCQ, and literature review. The quantitative methods mainly aim to construct models using mathematical modeling, economic and econometric theories, statistical modeling, and simulation to model or mimic the practical behavior of the system, with the aim of gaining operational and managerial insights that reduce costs or increase benefits. The common point of the above models is that they function well only when very specific instructions are given to them. After a long period of development and implementation, these classic methods, as well as the results obtained from them, are widely recognized and used by both academic researchers and industrial practitioners. However, it should also be mentioned that these methods are highly dependent on long-term practical experience and expert knowledge, some of which might be biased, and they might not be adapted and updated to the ever-changing environment.
A typical example of using classic methods to address a practical problem in PSC
is a ship selection scheme called the ship risk profile (SRP), introduced in Chapter 2,
which is widely used by port states around the world to select higher-risk ships for
inspection. It is heavily based on expert knowledge, as can be seen from the rough
processing of the risk factors considered (e.g., only two states are considered for the
feature ship age: ships older than 12 years and ships aged 12 years or less), the fixed
and subjective weighting points assigned to the states of the features (e.g., ships
older than 12 years are assigned 2 weighting points, and 0 otherwise), and the simple
weighted-sum manner of calculating a ship's overall risk. In addition, although ships
can be quite different from each other, the SRP divides them into only three risk
profiles, namely, HRS (high-risk ship), SRS (standard-risk ship), and LRS (low-risk
ship) from the highest risk to the lowest, and a fixed (while different) time window
is attached to each risk profile. The time window attached to HRS is shorter than
that attached to SRS, and is much shorter than that attached to LRS. A ship's inspection
priority then depends largely on the relationship between its last inspection time in the
related MoU (Memorandum of Understanding) and the time window attached to the
ship: if the period between the last inspection and the current time is longer than
the upper bound of the time window attached to the ship's risk profile, the ship
has high inspection priority; if the period is between the lower and upper bounds of the
attached time window, the ship has medium inspection priority; and if the period is
shorter than the lower bound, the ship has low inspection priority (see the sketch after
this paragraph).
Besides, even though ship flag and RO (recognized organization) performances are updated
annually and ship company performance is updated on a daily basis, the feature processing
methods, the weighting points attached to the feature states, the inspection time windows
attached to the risk profiles, and the ship inspection priority rules are not updated to reflect
changing conditions and factors, such as the evolution of ship conditions and the
influence of COVID-19 on ship behaviors and the PSC inspection mode.
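The inspection-priority rule described above can be summarized in a few lines of code. The following is a hypothetical sketch (the window bounds are inputs; they are not the official Tokyo MoU values):

    def inspection_priority(days_since_last_inspection, window_lower, window_upper):
        # window_lower and window_upper are the bounds (in days) of the
        # time window attached to the ship's risk profile (HRS, SRS, or LRS)
        if days_since_last_inspection > window_upper:
            return 'high'
        if days_since_last_inspection >= window_lower:
            return 'medium'
        return 'low'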
To reduce the adverse influence of expert knowledge and subjective judgment
on making assumptions and constructing models, and to take the ever-changing
environment into account, data-driven models have been rapidly developed as an
emerging approach to address practical problems in maritime transport research. As
the name suggests, data-driven models are highly dependent on the data collected,
from which they dynamically explore useful information. ML (machine learning)
models, which allow computers to learn from data facilitated by indirect programming,
are a typical type of data-driven model. In particular, an overview of data-driven
models is given in Chapter 3, and the key elements of data-driven models are
summarized in Chapter 1 of this book. There have been some initial trials of applying
data-driven models to deal with practical issues in maritime transportation, which
we briefly list as follows [6]:
1. Ship trajectory prediction (in the category of shipping research): it aims to pre-
dict the trajectory of a moving vessel in the near future by using historical time
series trajectory points of the vessel, which is the foundation of vessel naviga-
tion risk analysis. Ship sailing trajectories are constructed from vessel automatic
identification system (AIS) data, where one sailing record comprising the vessel’s
dynamic sailing information is broadcast every tens of seconds to a few min-
utes. Popular data-driven models in existing literature for ship trajectory predic-
tion include SVM (support vector machine) [7], extreme learning machine [8],
k-nearest neighbors [9], autoencoder [10], and long short-term memory net-
works [11, 12].
2. Ship risk prediction and safety management (in the category of shipping
research): two research topics are widely studied under this research stream:
prediction of the occurrence probability of vessel accidents, and the analysis of
regional and global ship accident statistics and accident reports. In the former
research topic, AIS is the main data resource used to generate ship historical
trajectory and predict ship trajectory in the near future for risk assessment and
analysis of ship collision [13–15], ship grounding accidents [16], fire accidents
[17], and seafarers’ nonfatal injuries [18]. In the latter research topic, maritime
accident statistics and reports published by the IMO (International Maritime
Organization) and local marine departments/institutions are the main data
resource. BNs are the most popular model to analyze the relations among the
risk factors and between the risk factors and the prediction target. The related
studies can be found in References [19–22]. In addition, K-means is also applied
to analyze maritime accidents [23, 24].
3. Ship inspection planning (in the category of shipping research): the related
research topics lie in four categories: ship risk prediction, onboard inspection
sequence optimization, influencing factors on the inspection results of PSC,
and the influence of PSC inspections. Research data are mainly from the inspection
and PSC inspections’ influence. Research data are mainly from the inspection
data published by individual MoUs and ship specification data provided by
online commercial databases. Widely used data-driven models in PSC-related
research include BNs [25, 26], tree-based models [27–29], SVMs [30, 31], and
association rule learning [32–34].
4. Ship energy efficiency prediction (in the category of shipping research): the
related research aims to predict ship fuel consumption rate (i.e., hourly or daily
fuel consumption amount) under various conditions using features of ship sail-
ing information and the surrounding environments. Such ship- and environment-
related features are collected from manually filled ship sailing records like noon
reports or automatically collected from onboard sensors like fuel consumption
sensors, global positioning system receivers, shaft power testers, wind speed
sensors, and water depth sonars. ANNs (artificial neural networks) are the most widely
used fuel consumption rate prediction models, which can be found in References [35–38].
Additionally, tree-based models [39, 40], LASSO (least absolute shrinkage
and selection operator) regression [41, 42], ridge regression [43, 44], and SVR
(support vector regression) [44] are also popular for ship energy efficiency pre-
diction. Then, the predicted ship energy efficiency indicators, with fuel con-
sumption rate as the representative, are used to guide voyage management and
optimization by means of speed optimization, trim optimization, weather rout-
ing, and their combinations.
5. Ocean freight market condition prediction (in the category of shipping
research): two indicators of shipping market condition are widely used as the
target of ocean freight market condition prediction, namely, the Baltic Dry Index
(BDI), which is regarded as the leading indicator of economic activity and the
barometer of the dry bulk shipping market, and sea freight rate, which is the
requested price to transport cargo from the origin to the destination by sea,
mainly determined by the weight and volume of the cargo and the distance
between the origin and destination. For the prediction of BDI, ANNs
and SVRs are the most popular methods, and the adoption of
ANNs can be seen in References [45–47], while the adoption of SVRs can be
seen in References [48, 49]. For the prediction of sea freight rate, ANNs are
also a popular method, which can be seen in References [50–52]. Other ML
models are also used and compared for sea freight rate prediction, as can be
seen in [53, 54].
6. Ship destination port and arrival time at port prediction (in the category of port
research): the related studies aim to predict the destination of a voyage and its
arrival time. As such prediction is highly dependent on ship historical voyage
information and the information of the current voyage, AIS is a widely used
data source. Ship arrival time to the destination is also influenced by the sea
and weather conditions along the way, as well as the port operating policies and
conditions (e.g., the just-in-time arrival policy for ships to save fuel consump-
tion and reduce greenhouse gas emissions). Therefore, marine weather forecast
and port statistics are also important data resources. It should be mentioned
that although the destination of the current voyage is required to be reported
via the AIS by vessel captains, it is widely believed that such reports are highly
unreliable, with complex causes as discussed by Yang et al. [55]. For ship des-
tination prediction, the methods can be either turning point based or trajectory
extraction based. As both tasks are nontrivial, especially when there are require-
ments on prediction accuracy, various ML-based models are developed, such as
a framework based on anomaly detection and route prediction [56], DBSCAN
(density-based spatial clustering of applications with noise) for turning region
clustering and identification combined with ANN for next turning point predic-
tion [57]. For the second topic, arrival time prediction, the prerequisite is to know
the destination. Therefore, there are several studies that
jointly predict vessel destination and the vessel arrival time, which can be found
in References [58–60].
7. Port condition prediction (in the category of port research): this stream of
research aims to predict port cargo volume and port traffic volume, where the
related data are from port statistics and bill of lading. Port cargo volume is
predicted by ANNs combined with time series methods [24] and by SVRs [61].
Port traffic volume at the Shanghai port is predicted by ANNs in Reference [62].
finally selected and handed over to practical users to deal with the current problem.
The users can be technicians who are familiar with the underlying mechanisms of the
model or practitioners who are unfamiliar with the theories behind the model. During the practical
usage of the model, ideally, its input, output, and settings should be constantly evalu-
ated and updated, so as to better handle the ever-changing environment.
In the above process, four layers are involved and interwoven with each other,
namely, data, model, user, and target. To develop proper and efficient data-driven
models, there are issues within these four layers that deserve close attention, as
discussed by Yan et al. [6]. We discuss these issues in detail, with several new points
added, in the following subsections.
12.3.1 Data
Regarding the data used for model construction, one critical issue is that the data
used, including the input features and output target(s), should be highly compat-
ible with the focus of the current problem. To be specific, the features used should
possess the properties of “relevance” and “comprehensiveness”: “relevance” means
that they are indeed related to/can contribute to the prediction of the target, and
“comprehensiveness” means that the features used cover most of the indicators that
affect the target.
Issues regarding data relevance can be assessed from both qualitative and
quantitative perspectives. Qualitatively, the features selected should be of interest to
model users, as demonstrated in the following Example 12.1 from PSC.
Example 12.1: To predict the risk of a ship in a PSC inspection, it is common
that ship characteristics and indicators of its historical inspection performance, such
as ship dimensions (i.e., length, depth, beam, and draft), ship’s flag, RO, and com-
pany performance, the date and performance of a ship’s last PSC inspection, etc.,
are used as the input to develop prediction models, as can be seen in References
[27, 28, 30, 31]. These features are relevant to ship’s structural and operational
conditions, navigation safety, and safe management, which are also key points in
PSC. Therefore, it is justifiable to consider these factors when developing ship risk
prediction models. In contrast, sailing information of the current voyage, such as
the sailing route, average speed, freight rate, and charter mode, deserves less
attention: obtaining such information is costly or even impossible, and the
port authorities cannot benefit much from it when evaluating ship risk levels.
Quantitatively, feature selection can be dealt with by relevance calculation and
evaluation model development. The relevance calculation approach aims to evaluate
the interrelationship between the features and the target. For example, in the feature
selection step of feature engineering, several approaches can be used to identify the
most relevant features from all candidate features by calculating the relation between
the features and the target(s), as introduced in section 1.2.4 of Chapter 4. Such
approaches include Pearson’s correlation coefficient, mutual information, and recursive
feature elimination. In addition, features can be selected based on regularization and
feature importance, and both can be integrated into an ML model’s training process,
as sketched below.
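As an illustration of these quantitative approaches, the following sketch computes Pearson's correlation coefficient and mutual information for each feature and runs recursive feature elimination with scikit-learn; the data and the number of features to keep are placeholders:

    import numpy as np
    from sklearn.feature_selection import RFE, mutual_info_regression
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.random((200, 10))  # placeholder feature matrix
    y = rng.random(200)        # placeholder target

    # Pearson's correlation coefficient between each feature and the target
    pearson = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]

    # Mutual information between each feature and the target
    mi = mutual_info_regression(X, y)

    # Recursive feature elimination keeping the five most relevant features
    rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
    print(pearson, mi, rfe.support_)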
12.3.2 Model
When developing data-driven models to solve a practical problem in maritime trans-
portation, it is of utmost importance to understand and clarify the properties of the
problem itself as well as the proper metrics to evaluate its performance. One basic
step is to identify the nature of the current problem, i.e., to decide whether it is a
classification, regression, clustering, or association rule mining task. Then, models
falling within the corresponding category should be tried and evaluated. In this process,
how to select an effective model as well as its evaluation metrics should be carefully
planned. We use the following Example 12.3 to show the related issues.
Example 12.3: Suppose that we need to address the problem of ship detention
prediction in PSC. The nature of this problem is classification, where a detained ship
is predicted to be “1” and a non-detained ship is predicted to be “0.” The main char-
acteristic of this problem is that the data set is highly imbalanced regarding the pre-
diction target: the number of non-detained ships is much larger than that of detained
ships. For example, the annual detention rate at the Hong Kong port ranged from
2.96% to 3.67% between 2015 and 2019, as shown in the 2020 annual report of the Tokyo
MoU [64]. This means that, on average, fewer than 4 of every 100 inspected foreign
ships visiting the Hong Kong port are actually detained, while more than 96 are not.
If standard classification models are directly applied, no matter how well they perform
on balanced data sets, it can be anticipated that their performance for ship detention
prediction will be relatively poor, or the prediction results will be meaningless (e.g.,
predicting all inspected ships to be non-detained yields a very high accuracy score,
but such a prediction is useless).
To take the imbalanced property of the data set into account, several strategies
can be adopted, as sketched below. For example, oversampling (i.e., randomly selecting
a certain number of detained ships from the original data set and adding the selected
ships back into it to form a balanced one) or subsampling (i.e., randomly selecting a
subset of non-detained ships from the original data set and forming a new data set
containing all detained ships and the subset of non-detained ships) can be applied
before model construction. Moreover, specialized ML models suitable for imbalanced
classification tasks can also be used, such as balanced random forest and anomaly
detection models.
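A minimal sketch of the subsampling strategy just described, assuming a binary label vector y in which 1 marks a detained ship and 0 a non-detained ship:

    import numpy as np

    def subsample_balanced(X, y, seed=0):
        rng = np.random.default_rng(seed)
        pos = np.flatnonzero(y == 1)   # detained ships
        neg = np.flatnonzero(y == 0)   # non-detained ships
        # randomly keep as many non-detained ships as there are detained ones
        neg_sub = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, neg_sub])
        return X[idx], y[idx]

Oversampling works analogously, except that detained ships are sampled with replacement until the two classes are of equal size.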
Another major issue regarding data-driven models is that most ML models are of
a black-box nature, meaning that the model’s working mechanism and the prediction
results cannot be understood by the developers and users. There are of course some
exceptions, such as LR (linear regression), DT (decision tree), and BN (Bayesian
network) models. Nevertheless, as these models are not sophisticated enough, they
may perform badly when addressing certain complex problems, and thus other, more
sophisticated ML models of a black-box nature have to be adopted. The black-box
property of data-driven models reduces practitioners’ acceptance of and trust in them,
and thus restricts their applicability. Furthermore, it is well known that when models
are not conceived with self-explanatory characteristics, they may engender pitfalls. In
recent years, the concept of “explainable artificial intelligence,” or XAI for short, has
been proposed and rapidly developed, which aims to open up the black boxes of ML
models and shed light on the models’ working mechanisms as well as the justification
of the results generated. Details of the concepts, techniques, and applications of XAI
will be covered in later sections of this part.
12.3.3 User
It is widely known that the maritime industry is relatively conservative and tradi-
tional, in which the decision support tools used by maritime practitioners are mainly
Classic and emerging approaches to solving practical problems 171
based on expert knowledge or classic theories for a long time, and the stakeholders
are reluctant to share the information and data with each other. Consequently, with-
out the support of sufficient data, especially high-quality data, the digitization pro-
cess of the maritime industry is relatively slow, where contemporary approaches,
especially data-driven models, are not that popular compared to other industries
such as financial technology and healthcare industry. Therefore, data-driven appli-
cations are still in their infancy, and it is also noted that applying data-driven mod-
els to maritime transport industry requires replacing the current (naive) decision
support tools, which the decision makers are quite familiar with more sophisticated
models that the decision makers may be unfamiliar with. Systems based on such
models are more like black boxes to the users, with brand new graphical user inter-
face (GUI) and obscure working mechanism. In this process, one guiding principle
is to avoid imposing any unnecessary burden or pressure on the users during their
usage.
However, this is not a trivial task, as the developers and users of a data-driven
model or system in the maritime industry usually have quite different backgrounds.
On the one hand, model and system developers are usually engineers or researchers
specializing in data analytics, with a weak shipping background. The data-driven
models they develop usually have multifarious inputs and outputs as well as complex
structures, which means that the data processing and model operation can be
sophisticated. On the other hand, model and system users are usually practitioners
of the maritime industry, such as government officials, managers and technicians in
shipping companies and classification societies, and crew members, who may lack
expertise in data analytics and system operation. It is therefore crucial to avoid
requiring model users to input extra data or too much information into the models,
or to conduct too many extra operations when using the systems. Ideally, a system
with a straightforward, concise, and friendly GUI as well as easy
human–computer interactions should be developed. The following is an example
of a high-risk ship selection decision support system in PSC developed for the
Marine Department of Hong Kong by researchers from the Hong Kong Polytechnic
University.
Example 12.4: This system is named AI for PSC at Hong Kong*. It automatically
downloads the information of in-port ships and due-to-arrive ships at the Hong Kong
port about every 20 minutes, and then applies AI (artificial intelligence)-based models
to predict ship risk, represented by the combination of the number of deficiencies
and the detention probability of each ship, to facilitate the Marine Department in
selecting the highest-risk ships for inspection. In particular, the displayed items as
well as their formats on the website, including not only the outputs of the AI models
but also the ships’ identity information, type, flag, last port of call, agent, current
location, etc., comply with the preferences of the decision makers. This means that
the system users do not
* https://sites.google.com/site/wangshuaian/research/
need to do anything extra to use the system, and they can acquire the information they
are interested in for high-risk ship selection directly from it. Therefore, the newly
developed ship selection system based on data-driven models is in compliance with
the users’ past working habits and preferences, and thus does not impose extra
operational burdens on them.
Another issue in this category regards data-driven models’ applicability: as
such models are developed using data collected from certain regions and entities,
while different regions and entities might differ considerably from each other, they
do not possess the property of universal applicability. For example, the ship risk
prediction models developed in this book all use PSC inspection records at the
Hong Kong port, and thus they may be hard to apply to ports outside of the Tokyo
MoU, where different ship selection methods are used, and even to other ports
within the Tokyo MoU, as the expertise and background of the inspectors at
different ports might vary a lot. Therefore, if one wishes to apply the ship risk
prediction models developed in this book to ports other than the Hong Kong port,
new prediction models using the corresponding data sets should be developed, and
more advanced methods, such as transfer learning, can also be used.
12.3.4 Target
To promote the acceptance and application of data-driven models, the predicted out-
put, i.e., the target values given by the data-driven models, should comply with ship-
ping domain knowledge. Otherwise, it is highly likely that the users would doubt the
accuracy and fairness of the prediction models developed. The following Example
12.5 shows this issue in high-risk ship selection in PSC.
Example 12.5: In many existing studies on ship risk prediction, ship flag,
RO, and company performances are considered as input features to the ship risk
prediction models, which can be found in References [27–29]. Roughly speaking,
these three factors are calculated based on the performance of the ships under their
management over the previous three years, and they thus influence the reputation of
the flag, RO, and company as well as the inspection priority of their ships in future
PSC inspections. Therefore, responsible ship flags, ROs, and companies put much
effort into maintaining their ships in satisfactory condition, and they hope that such
efforts will gain them a good reputation and thus a lower inspection priority for their
vessels. Therefore, it is justifiable to say that of two ships with all other conditions
being equal, the ship with better flag, RO, or company performance should be
predicted to be at a lower risk (i.e., to have fewer deficiencies or a lower detention
probability), and thus be less likely to be selected for inspection. If the predicted
targets given by a data-driven model conflict with such shipping domain knowledge,
the decision makers, as well as the ships’ flags, ROs, and companies, will hardly trust
the predictions. Therefore, developers of data-driven models should take such
shipping domain knowledge into account in model development to guarantee model
fairness. Details of this issue will be covered in the later chapters of this part.
It should be noted that there is still a long way to go to develop and apply
systems based on data-driven models in the traditional and conservative maritime
industry, and the issues covered here are only a small part of the whole picture
that needs more attention and effort.
References
[11] Huang Y., Chen L., Chen P., Negenborn R.R., van Gelder P.H.A.J.M. ‘Ship col-
lision avoidance methods: state-of-the-art’. Safety Science. 2020;121:451–73.
[12] Li W., Zhang C., Ma J., Jia C. ‘Long-term vessel motion predication by mod-
eling trajectory patterns with AIS data’. 2019 5th International Conference on
Transportation Information and Safety (ICTIS); Liverpool, IEEE, 2020. pp.
1389–94.
[13] Gang L., Ma J., Yao J. ‘Decision-making of vessel collision avoidance based
on support vector regression’. ICAIIS 2021; Chongqing, New York, NY,
2021. pp. 1–6. Available from https://dl.acm.org/doi/proceedings/10.1145/
3469213
[14] Gao M., Shi G.Y., Liu J. ‘Ship encounter azimuth MAP division based on
automatic identification system data and support vector classification’. Ocean
Engineering. 2020;213:107636.
[15] Abebe M., Noh Y., Seo C., Kim D., Lee I. ‘Developing a ship collision
risk index estimation model based on Dempster-Shafer theory’. Applied Ocean
Research. 2021;113:102735.
[16] Wu B., Yan X., Yip T.L., Wang Y. ‘A flexible decision-support solution
for intervention measures of grounded ships in the Yangtze River’. Ocean
Engineering. 2017;141:237–48.
[17] Wu B., Tang Y., Yan X., Guedes Soares C. ‘Bayesian network modelling for
safety management of electric vehicles transported in ropax ships’. Reliability
Engineering & System Safety. 2017;209:107466.
[18] Zhang G., Thai V.V., Law A.W.K., Yuen K.F., Loh H.S., Zhou Q.
‘Quantitative risk assessment of seafarers’ nonfatal injuries due to occu-
pational accidents based on Bayesian network modeling’. Risk Analysis.
2020;40(1):8–23.
[19] Wang L., Yang Z. ‘Bayesian network modelling and analysis of accident
severity in waterborne transportation: a case study in China’. Reliability
Engineering & System Safety. 2018;180:277–89.
[20] Fan S., Blanco-Davis E., Yang Z., Zhang J., Yan X. ‘Incorporation of hu-
man factors into maritime accident analysis using a data-driven Bayesian net-
work’. Reliability Engineering & System Safety. 2020;203:107070.
[21] Fan S., Yang Z., Blanco-Davis E., Zhang J., Yan X. ‘Analysis of maritime
transport accidents using Bayesian networks’. Proceedings of the Institution
of Mechanical Engineers, Part O. 2020;234(3):439–54.
[22] Li B., Lu J., Lu H., Li J. ‘Predicting maritime accident consequence scenarios
for emergency response decisions using optimization-based decision tree ap-
proach’. Maritime Policy & Management. 2020:1–23.
[23] Lema E., Papaioannou D., Vlachos G.P. ‘Investigation of coinciding shipping
accident factors with the use of partitional clustering methods’. PETRA ’14;
Rhodes Greece, New York, NY, 2014. pp. 1–4. Available from https://dl.acm.
org/doi/proceedings/10.1145/2674396
[24] Zhang Y., Sun X., Chen J., Cheng C. ‘Spatial patterns and characteristics
of global maritime accidents’. Reliability Engineering & System Safety.
2021;206:107310.
[25] Yang Z., Yang Z., Yin J. ‘Realising advanced risk-based port state control
inspection using data-driven Bayesian networks’. Transportation Research
Part A. 2018;110:38–56.
[26] Wang S., Yan R., Qu X. ‘Development of a non-parametric classifier: effec-
tive identification, algorithm, and applications in Port state control for mari-
time transportation’. Transportation Research Part B. 2019;128:129–57.
[27] Yan R., Wang S., Fagerholt K. ‘A semi-“smart predict then optimize” (semi-
SPO) method for efficient ship inspection’. Transportation Research Part B.
2020;142:100–25.
[28] Yan R., Wang S., Cao J., Sun D. ‘Shipping domain knowledge informed pre-
diction and optimization in Port state control’. Transportation Research Part
B. 2021;149:52–78.
[29] Yan R., Wang S., Peng C. ‘An artificial intelligence model considering data
imbalance for ship selection in port state control based on detention prob-
abilities’. Journal of Computational Science. 2021;48:101257.
[30] Xu R.-F., Lu Q., Li W.-J., Li K.X., Zheng H.-S. Presented at the 2007
International Conference on Machine Learning and Cybernetics; Hong
Kong, 2007.
[31] Wu S., Chen X., Shi C., Fu J., Yan Y., Wang S. ‘Ship detention prediction
via feature selection scheme and support vector machine (SVM)’. Maritime
Policy & Management. 2022;49(1):140–53.
[32] Tsou M.C. ‘Big data analysis of port state control ship detention database’.
Journal of Marine Engineering & Technology. 2019;18(3):113–21.
[33] Chung W.-H., Kao S.-L., Chang C.-M., Yuan C.-C. ‘Association rule learning
to improve deficiency inspection in Port state control’. Maritime Policy &
Management. 2020;47(3):332–51.
[34] Yan R., Zhuge D., Wang S. ‘Development of two highly-efficient and in-
novative inspection schemes for PSC inspection’. Asia-Pacific Journal of
Operational Research. 2021;38(3):2040013.
[35] Pedersen B.P., Larsen J. ‘Prediction of full-scale propulsion power using ar-
tificial neural networks’. Proceedings of the 8th International Conference
on Computer and IT Applications in the Maritime Industries (COMPIT’09);
Budapest: Hungary, 2009. pp. 10–12.
[36] Petersen J.P. Mining of ship operation data for energy conservation. DTU
Informatics; 2012.
[37] Rudzki K., Tarelko W. ‘A decision-making system supporting selection of
commanded outputs for A ship’s propulsion system with A controllable pitch
propeller’. Ocean Engineering. 2016;126:254–64.
[38] Du Y., Meng Q., Wang S., Kuang H. ‘Two-phase optimal solutions for
ship speed and trim optimization over a voyage using voyage report data’.
Transportation Research Part B. 2019;122:88–114.
[39] Peng Y., Liu H., Li X., Huang J., Wang W. ‘Machine learning method for
energy consumption prediction of ships in Port considering green ports’.
Journal of Cleaner Production. 2020;264:121564.
[56] Pallotta G., Vespe M., Bryan K. ‘Vessel pattern knowledge discovery from
AIS data: a framework for anomaly detection and route prediction’. Entropy.
2018;15(12):2218–45.
[57] Daranda A. ‘Neural network approach to predict marine traffic’. Baltic
Journal of Modern Computing. 2016;4(3):483.
[58] Amariei C., Diac P., Onica E., Roşca V. ‘Cell grid architecture for maritime
route prediction on AIS data streams’. DEBS ’18; Hamilton, New Zealand,
New York, NY, 2018. pp. 202–04. Available from https://dl.acm.org/doi/
proceedings/10.1145/3210284
[59] Bodunov O., Schmidt F., Martin A., Brito A., Fetzer C. ‘Real-time destination
and ETA prediction for maritime traffic’. DEBS ’18; Hamilton, New Zealand,
New York, NY, 2018. pp. 198. Available from https://dl.acm.org/doi/proceedings/10.1145/
3210284
[60] Nguyen D.D., Le Van C., Ali M.I. ‘Vessel destination and arrival time pre-
diction with sequence-to-sequence models over spatial grid’. DEBS ’18;
Hamilton, New York, NY, 2018. pp. 217–20. Available from https://dl.acm.
org/doi/proceedings/10.1145/3210284
[61] Ruiz-Aguilar J.J., Moscoso-López J.A., Urda D., González-Enrique J.,
Turias I. ‘A clustering-based hybrid support vector regression model to
predict container volume at seaport sanitary facilities’. Applied Sciences.
2020;10(23):8326.
[62] Wang S., Wang S., Gao S., Yang W. ‘Daily ship traffic volume statistics and
prediction based on automatic identification system data’. 9th International
Conference on Intelligent Human-Machine Systems and Cybernetics
(IHMSC); Hangzhou, IEEE, 2017. pp. 149–54.
[63] Cariou P., Wolff F.C. ‘Do Port state control inspections influence flag-and
class-hopping phenomena in shipping?’. Journal of Transport Economics and
Policy (JTEP). 2011;45(2):155–77.
[64] Annual report. Tokyo: Tokyo MOU. Available from http://www.tokyo-mou.
org/doc/ANN19-f.pdf
Chapter 13
Incorporating shipping domain knowledge into
data-driven models
As mentioned in section 12.3.4 of Chapter 12, one of the issues regarding the target of
applying data-driven models to solve practical problems in maritime transportation
is that the predicted target may not comply with shipping domain knowledge. The
term “domain knowledge” refers to rules and common sense widely held by the
practitioners. Domain knowledge is based on the practitioners’ understanding
of the disciplines and activities of the industry, and is gained from their long-time
experience in the industry as well as their own expertise, professions,
specializations, and judgment. For example, in the maritime industry, regarding the
activity of ship selection for inspection by PSC (port state control), it is generally
believed that given all other conditions being equal, an older ship would have a
larger number of deficiencies than a younger ship. Regarding ship fuel consumption
prediction, ship sailing speed is the most significant determinant, and it is widely
believed that a ship’s fuel consumption rate is proportional to its sailing speed to the
power of ˛ = 3, i.e., r / ˇ v˛, where r is the hourly or daily fuel consumption, ˇ
is a coefficient, and v is the average sailing speed. In practice, ˛can be higher than 3,
especially for large vessels like container ships where it can be 4, 5, or even higher.
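For illustration, with $\alpha = 3$, raising the sailing speed from 15 to 20 knots multiplies the hourly fuel consumption rate by $(20/15)^3 \approx 2.37$: a one-third increase in speed more than doubles the fuel burned per hour.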
In the former example, if a ship risk prediction model gives the opposite prediction,
i.e., a younger ship has more deficiencies than an older ship when the other features
of the two ships are identical, the prediction results as well as the prediction model
can hardly be expected to be accepted or used by the port state officers, because they
may conclude from such predictions that the proposed model is inaccurate and unfair.
Similarly, in the latter example, if the predicted ship fuel consumption rate does not
exhibit such a convex and increasing relationship with ship sailing speed, it may not
convince the users. Given that practitioners in the conservative yet classic maritime
industry might be reluctant to replace the current rule-based decision support systems
with data-driven ones, it is of vital importance to make sure that the prediction results
given by data-driven models comply with the corresponding shipping domain
knowledge. Otherwise, the practitioners are likely to be very skeptical of the models
and their results, and thus unwilling to use them to assist their decision-making.
One way to guarantee that the developed data-driven models constructed from
practical data are in compliance with shipping domain knowledge is to explicitly
impose corresponding constraints when developing the models. The following two
sections of this chapter introduce some initial thoughts on how to incorporate shipping
domain knowledge into the development of data-driven models used to deal with
specific problems in maritime transportation. To be specific, section 13.1 discusses
how to incorporate feature monotonicity into a tree-based model developed to predict
ship risk, and section 13.2 discusses how to jointly incorporate feature monotonicity
and convexity into ship fuel consumption rate prediction.
The content covered in this section is mainly from a paper published by Yan et al. in
2021 [1]. That paper aims to assist the Marine Department of Hong Kong to select
ships of higher risk among all the foreign visiting ships for inspection by using
state-of-the-art ML (machine learning) models with shipping domain knowledge
considered. An eXtreme Gradient Boosting (XGBoost) model is first developed to
predict the number of deficiencies each foreign visiting ship has. In particular, the
XGBoost model takes domain knowledge regarding ship flag performance, RO
(recognized organization) performance, and company performance in PSC into
account, i.e., given two ships with all other conditions equal (e.g., the same age,
type, last inspection time, and inspection results), the ship with worse flag, RO, or
company performance should be predicted to have a larger number of deficiencies
and thus should have a higher priority/probability of being inspected. The ship risk
prediction results are then input to a PSC officer scheduling model to help the
maritime authorities allocate scarce inspection resources such that as much ship
noncompliance as possible can be detected. As only the first part of that study is
related to ML model development, we only cover that part in this section. Our aim
is to discuss in plainer words how to incorporate domain knowledge into the
construction process of an XGBoost model, so as to make it easier for readers with
a weaker background in data analytics to understand.
Ship flag performance is determined by taking into account the performance of the
ships under a certain flag state regarding both inspection and detention conditions
over the preceding three calendar years.
The flag performance is given by a black-gray-white list published by the Tokyo
MoU (Memorandum of Understanding) each year in its annual report. That is, ship
flag performance from the best to the worst is white, gray, and black. If the ships
under a flag state are inspected less than 30 times by all the member states of the
Tokyo MoU, the flag’s performance is set to “undefined.” An RO is an organization
authorized (and recognized) by a flag administration to carry out inspections and
surveys on its behalf on the ships under its registration. The performance of an RO
can be high, medium, low, or very low from the best to the worst, which is determined
by its ships’ inspection and detention history over the preceding three calendar years
and is published by the Tokyo MoU in its annual report. Similar to ship flag per-
formance, if the ships in an RO are inspected fewer than 60 times
over the last three years, the performance of this RO is “undefined.” The calculation of
ship company performance is quite different from that of the above two indicators.
Ship company here refers to a ship’s international safety management company. Its
performance considers the detention and deficiency conditions of the ships in its fleet
and is updated on a daily basis over a running 36-month period. Unlike ship flag
and RO performance, there is no lower limit on the number of inspections required
to quantify the performance of a company over the last three years; a company with
no inspections in the previous three years is given two weighting points. The
company’s performance is determined by its detention index and deficiency index,
and it also has four states, namely high, medium, low, and very low.
The above three risk factors are considered in the NIR (New Inspection Regime)
in the following way. The criteria for a ship to be an LRS (low-risk ship) are that its
flag is white and has undergone the IMO (International Maritime Organization)
audit, its RO is recognized by at least one member authority of the Tokyo MoU and
has high performance, and its company has high performance. In addition, if a
ship’s flag is on the black list or its RO has low or very low performance, 1 weighting
point is attached to the ship in each case. If a ship’s company performance is low or
very low, or there has been no inspection of the ships belonging to the company in
the previous 36 months, 2 weighting points are attached to the ship. It should be
noted that these three indicators considered in the NIR are subjective, as they are
quantified using criteria established by expert knowledge regarding the division of
the states, the ways to generate the states, and the weighting points attached to the
states.
The common sense, or domain knowledge, regarding these three risk factors is that
for two ships with all other features identical, the ship with worse flag/RO/company
performance is expected to perform worse in the current PSC inspection, i.e., to have
a larger number of deficiencies and/or a higher probability of detention. This domain
knowledge should also be followed by data-driven models developed for ship risk
prediction for the following reasons. It is obvious that ship flag, RO, and company
play an important role in ship management, operation, and maintenance, and hence
their performance is taken into account by the widely adopted NIR for ship risk
calculation and selection for PSC inspection. In return, their ships’ performance in
PSC inspections would influence their performance evaluated by the corresponding
MoUs and thus their reputation. Therefore, it is justifiable to conclude that given all
other conditions being equal, a ship should be estimated to have worse performance
if its flag, RO, or company gets worse. Only by following such a rule can a ship risk
prediction model be regarded as being in compliance with shipping domain knowl-
edge, and thus achieve fair prediction. Theoretically, such domain knowledge can be
learned from practical inspection records. Nevertheless, as actual data feature noise
and errors, it cannot be guaranteed that such domain knowledge will be preserved in
the data-driven models constructed. In other words, if no constraint is imposed in the
model training process, this property may not be fully followed, leading to unexpected
prediction results.
In Yan et al. [1], the problem of ship deficiency number prediction in PSC to assist in
high-risk ship selection for PSC inspection is addressed. The dataset contains 1974 initial
inspection records at the Hong Kong port, and the input to the ship risk prediction model
includes several ship-related features as well as ship historical inspection features. To be
more specific, 14 features are considered, namely ship age, GT (gross tonnage), length,
depth, beam, type, flag performance, RO performance, company performance, the last
inspection date in the Tokyo MoU, the number of deficiencies in the last inspection in
the Tokyo MoU, the total number of detentions in all historical PSC inspections, ship flag
change times, and whether the ship has had a casualty in the last five years. Inspection records
with ship flag and RO performance undefined are first deleted from the initial dataset, and
1926 records remain in the dataset. Then, the states of ship flag performance are encoded
by setting white to 1, gray to 2, and black to 3, and the states of ship RO and company
performances are encoded by setting high to 1, medium to 2, low to 3, and very low to 4.
Therefore, the domain knowledge regarding ship flag/RO/company performance and the
ship deficiency number can be viewed as a monotonically increasing relationship between
them: as the value of ship flag/RO/company performance gets larger (i.e., from good to
bad), the ship deficiency number also gets larger. A sketch of this encoding follows below.
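A minimal sketch of this ordinal encoding with pandas (the column names and toy records are assumptions for illustration):

    import pandas as pd

    df = pd.DataFrame({'flag_perf': ['white', 'black', 'gray'],
                       'ro_perf': ['high', 'low', 'medium']})  # toy records

    flag_map = {'white': 1, 'gray': 2, 'black': 3}
    perf_map = {'high': 1, 'medium': 2, 'low': 3, 'very low': 4}

    df['flag_perf'] = df['flag_perf'].map(flag_map)
    df['ro_perf'] = df['ro_perf'].map(perf_map)  # same map applies to company performance
    print(df)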
used) and model complexity (evaluated by the number of leaf nodes contained in a
decision tree). Recall that, as introduced in Chapter 9, a classic DT (decision tree)
does not possess the monotonicity property. Therefore, we introduce in detail
how to enforce a monotonic constraint on one feature (which can be ship flag perfor-
mance, RO performance, or company performance) with respect to the number of
deficiencies predicted.
We first assume that feature monotonicity is imposed on one feature. As feature monotonicity works in the context that all the other features are equal except for the feature on which the monotonicity property is imposed, without loss of generality, we assume that this feature is ship flag performance. Suppose that we consider a total of $m$ features for ship deficiency number prediction, where the number of deficiencies of ship $s$ is denoted by $y_s$ and the feature vector is denoted by $\mathbf{x} = (x_1, \ldots, x_{\bar{m}}, \ldots, x_m)$. We denote ship flag performance, on which the monotonically increasing constraint is imposed, by $x_{\bar{m}}$. This means that for two ships $s_1$ and $s_2$, the monotonicity property of ship flag performance works when $x_{s_1}^{m'} = x_{s_2}^{m'}$, $m' = 1, \ldots, \bar{m}-1, \bar{m}+1, \ldots, m$: given $x_{s_1}^{\bar{m}} < x_{s_2}^{\bar{m}}$, we should have the predicted numbers of ship deficiencies $\hat{y}_{s_1} \leq \hat{y}_{s_2}$. As the property of feature monotonicity works on examples with all features being equal except for the feature enforced by the monotonicity property, the tree can be simplified to contain only the splits using the monotonic feature, as using any of the other features (which are identical across all the examples considered) for node splitting will always lead these examples to the same tree nodes and thus to the same output.
The structure of a simplified tree in an XGBoost model which only takes the
monotonic feature into account is shown in Figure 13.1. It should be emphasized
that only the values of the feature imposed by the monotonically increasing con-
straint will be used for node splitting, as this constraint works in the context that all
the other features are identical for the related ships. Similar to a traditional CART regression tree, the tree starts to split from the root node at the top, which contains the whole training set, and the output of this node is denoted by $O_T$. Suppose further that $a$ is used as the threshold of feature $x_{\bar{m}}$ to split the root node into the left child node $L$ and right child node $R$: examples with feature value no larger than $a$ are split to the left child node, whose output is $O_L$, and the other examples are split to the right child node, whose output is $O_R$. To guarantee the monotonically increasing constraint put on feature $x_{\bar{m}}$, the constraint $O_L \leq O_R$ is imposed on the outputs of the child nodes of the root node. If this constraint cannot be satisfied when using $a$ as the threshold for splitting, $a$ will not be used to split the root node. Furthermore, if no candidate value of feature $x_{\bar{m}}$ can satisfy the constraint, the node will not be split. The above two rules apply to all nodes in a tree. Therefore, it can be guaranteed that if the root node can be split, we have $O_L \leq O_R$. Similarly, the constraint $O_{LL} \leq O_{LR}$ is imposed on node $L$ and $O_{RL} \leq O_{RR}$ is imposed on node $R$, such that the monotonically increasing constraint can be guaranteed for nodes $L$ and $R$, and we can expect that node $L$ and node $R$ are either not split, or have $O_{LL} \leq O_{LR}$ and $O_{RL} \leq O_{RR}$ in their child nodes, respectively. Furthermore, as the output of node $LR$ should be no more than the output of node $R$, an upper bound is enforced on the output of node $LR$, i.e., $O_{LR} \leq \text{mean}(O_L, O_R) = \frac{O_L + O_R}{2} \leq O_R$. Similarly, a lower bound is enforced on the output of node $RL$, i.e., $O_{RL} \geq \text{mean}(O_L, O_R) = \frac{O_L + O_R}{2} \geq O_L$. Therefore, the outputs of the nodes in this layer satisfy $O_{LL} \leq O_{LR} \leq \text{mean}(O_L, O_R) \leq O_{RL} \leq O_{RR}$. As the tree is split in a recursive manner, the monotonically increasing property of the whole tree, and thus on feature $x_{\bar{m}}$, can be guaranteed. As the gradient boosting machine works in an additive manner, the monotonicity property of the whole XGBoost model can be guaranteed, and so can the final predicted ship deficiency numbers.
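In practice, such a constraint need not be implemented from scratch: the xgboost library exposes it through the monotone_constraints parameter, where 1 requests a monotonically increasing constraint on a feature and 0 leaves it unconstrained. A minimal sketch on placeholder data:

    import numpy as np
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = rng.random((500, 4))            # placeholder features; column 0 could be
                                        # the encoded flag performance
    y = 2 * X[:, 0] + rng.random(500)   # placeholder deficiency numbers

    # Increasing constraint on the first feature, none on the other three
    model = XGBRegressor(monotone_constraints=(1, 0, 0, 0))
    model.fit(X, y)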
An XGBoost model with a monotonic constraint imposed is denoted monotonic
XGBoost by Yan et al. [1]. Yan et al. further compare the monotonic XGBoost
model with other popular and state-of-the-art regression models, including standard
XGBoost, monotonic LightGBM, DT, RF (random forest), GBRT (gradient boosting
regression tree), LASSO (least absolute shrinkage and selection operator) regression,
ridge regression, and SVR (support vector regression), and conclude that the
prediction performance of monotonic XGBoost is the best among all the models
compared regarding both MSE and MAE. This complies with the observation that
model prediction performance can be improved if reasonable monotonic constraints
are imposed on certain features, showing that such constrained models can generalize
better [4–6].
This section aims to address the ship fuel consumption rate prediction problem
while taking shipping domain knowledge regarding ship sailing speed and fuel
consumption into account. As the shipping industry is mainly powered by heavy fuel
oil, a large environmental footprint has been created by shipping activities due to the
emissions of greenhouse gases (GHGs) and air pollutants. According to the fourth
IMO GHG study, shipping was responsible for 2.76% of global anthropogenic GHG
emissions in 2012, and the proportion increased to 2.89% in 2018 [7]. In recent years,
the sustainability of shipping has become a public concern, and various emission
control measures and regulations have been imposed on ships involved in both
national and international transportation activities to reduce the adverse impacts on
the environment. Meanwhile, the cyclical downturn of the world economy and high
bunker prices make it necessary and urgent for ship owners and managers to operate
their ships in a more cost-effective way while still meeting global trade demand. In
a ship's daily operations, bunker fuel costs account for a large proportion of vessel
operating costs. For large vessels such as container ships, when the bunker fuel price
is about 500 USD per ton, the bunker cost can contribute about 75% of the operating
cost [8, 9]. Under these conditions, shipping companies constantly strive to optimize
ship energy efficiency, i.e., to use as little fuel as possible to accomplish the required
transport tasks.
Various factors can influence ship energy efficiency due to the complexity of
vessel engine systems and the surrounding sea and weather conditions in a voyage.
According to Yan et al. [10], common features that can influence ship fuel con-
sumption rates can be divided into four categories: ship mechanical features, ship
operational features, ship maintenance features, and sea and weather conditions.
In particular, ship mechanical data include ship dimension features (e.g., length, beam,
and gross tonnage) and power system features (e.g. engine parameters and designed
speed). Ship operational features include ship voyage and sailing behavior informa-
tion like sailing speed over ground, sailing speed through water, type of fuel used,
fuel density, and temperature, etc., as well as ship mechanical conditions while oper-
ating, such as the conditions of propeller pitch, rudder angle, main engine load and
working hours, hull and propeller fouling conditions, and wetted surface area. Ship
maintenance features mainly include ship dry docking records. Sea and weather
condition data mainly refer to the sea states and weather information along the voy-
age, where the sea states include sea depth, sea water temperature and density, wave,
swell, and current conditions, etc., while weather information includes the direction
and value of wind, air density, and temperature, etc.
Among all the influencing factors on ship fuel consumption rates, ship sailing
speed is widely believed to be the most significant determinant of ship fuel con-
sumption. In many existing studies, a ship’s fuel consumption rate at sea is usually
treated as proportional to its sailing speed to a power of $\alpha$, where $\alpha$ is empirically
shown to range from 1.452 to 4.8, as summarized in Reference [10]. This shows that
ship hourly fuel consumption takes a monotonically increasing and convex relation-
ship with ship sailing speed. In other words, the fuel consumption rate is higher
when a ship’s speed is higher, all other things being equal, and the fuel consumption
rate increases faster as the sailing speed increases, since the fuel consumption rate
is approximately proportional to speed to a power larger than 1. The
domain knowledge regarding the relationship between ship sailing speed and ship
fuel consumption rate should be taken into account when developing data-driven
models for ship fuel consumption prediction. This section discusses the idea of
incorporating such domain knowledge into the development of ANN models.
Traditional ANNs are introduced in Chapter 8 of this book. Suppose that we have
the traditional ANN model shown in Figure 13.2 on hand, where the input layer with
$m+1$ neurons is used to receive the input of $m$ features, which can be the average
sailing speed, sailing condition, draft, air density, sea water temperature, and wave
and wind conditions, and the $(m+1)$th neuron is the bias term. The hidden layer has
$K+1$ neurons, where the $(K+1)$th neuron is the bias. The output layer gives the
predicted ship fuel consumption rate, denoted by $\hat{y}$. Without sacrificing generality,
we let node $x_1$ in the input layer represent ship sailing speed, on which both convex
and monotonically increasing constraints are imposed, while the other nodes are free
from constraints. To preserve such domain knowledge, we introduce a special type of
neuron in the hidden layer, named the convex and monotonic neuron, which has a
convex and monotonically increasing activation function and non-negative weights
connected to it. These weights include those from the input layer and those to the
output layer. In addition, it is also required that node $x_1$, representing the feature of
ship sailing speed in the input layer, connects only to convex and monotonic neurons
in the hidden layer. The ANN model with the above features imposed on Figure 13.2
is denoted ANN-DK and is shown in Figure 13.3.
Figure 13.3 shows that node $x_1$, representing ship sailing speed, is connected
with nodes $b_1$ to $b_{K_1}$ in the hidden layer (marked in orange), which have
monotonically increasing and convex activation functions, using non-negative
weights. These nodes are also connected to the output layer using non-negative
weights. It is also noted that node $x_1$ is connected only to the aforementioned nodes
with monotonically increasing and convex activation functions; there is no connection
between node $x_1$ and the other, unconstrained nodes in the hidden layer. Moreover,
to guarantee the non-negativity of the weights, if the weight values updated in each
round of training are positive or zero, no further action is taken; otherwise, their
values are set to zero. The other nodes in the input layer, i.e., nodes $x_2$ to $x_{m+1}$,
on which no constraints are imposed, are connected to nodes $b_1$ to $b_{K_1}$ as well as to
the other normal nodes in the hidden layer (marked in yellow). A sketch of such a
constrained network follows below. The following Theorems 13.1 and 13.2 guarantee
that the ANN-DK model is monotonically increasing and convex in $x_1$, and thus the
predicted output complies with shipping domain knowledge.
Theorem 13.1 The following relation holds, which guarantees that the predicted fuel consumption rate is monotonically increasing in x1:

∂ŷ/∂x1 ≥ 0. (13.1)
Proof:

∂ŷ/∂x1 = Σ_{k=1}^{K} (∂ŷ/∂zk)(∂zk/∂tk)(∂tk/∂x1)
       = Σ_{k=1}^{K} vk f′(tk) w1k
       = Σ_{k=1}^{K1} vk f′(tk) w1k + Σ_{k=K1+1}^{K} vk f′(tk) w1k
       = Σ_{k=1}^{K1} vk f′(tk) w1k ≥ 0, (13.2)
where tk is the weighted sum of the inputs to node bk in the hidden layer and zk = f(tk) is its output. As the activation functions of nodes b1 to bK1 in the hidden layer are monotonically increasing in their inputs, f′(tk) is positive for k = 1, ..., K1. Furthermore, considering that the weights vk and w1k, k = 1, ..., K1, are non-negative, it is guaranteed that Σ_{k=1}^{K1} vk f′(tk) w1k ≥ 0. Meanwhile, as there is no connection between node x1 and nodes bK1+1 to bK, we have Σ_{k=K1+1}^{K} vk f′(tk) w1k = 0. Therefore, ∂ŷ/∂x1 is non-negative, showing that the fuel consumption rate predicted by the ANN-DK model is monotonically increasing in x1.
Theorem 13.2 The following relation holds, which guarantees the convexity of the predicted fuel consumption rate in x1:

∂²ŷ/∂x1² ≥ 0. (13.3)
Proof:

∂²ŷ/∂x1² = ∂(Σ_{k=1}^{K} vk f′(tk) w1k)/∂x1
         = Σ_{k=1}^{K} vk w1k f″(tk)(∂tk/∂x1)
         = Σ_{k=1}^{K1} vk w1k f″(tk)(∂tk/∂x1) + Σ_{k=K1+1}^{K} vk w1k f″(tk)(∂tk/∂x1)
         = Σ_{k=1}^{K1} vk w1k² f″(tk) ≥ 0. (13.4)

The above inequality holds because the activation functions of nodes b1 to bK1 are convex (so f″(tk) ≥ 0), there are no connections between node x1 and nodes bK1+1 to bK in the hidden layer, and the weights vk and w1k are non-negative; note also that ∂tk/∂x1 = w1k.
Therefore, it can be safely concluded that as long as Theorems 13.1 and 13.2 are satisfied in an ANN model developed for ship fuel consumption prediction, the convex and monotonically increasing relationship between ship sailing speed and hourly fuel consumption rate is preserved. Furthermore, such constraints can be imposed on more features. For example, when the direction of the wind/swell is the same as the direction of a ship's heading, a larger wind force/swell would generally decrease the hourly fuel consumption rate. To capture such a monotonic relationship, another new type of neuron with a monotonic activation function (which is not necessarily convex) can be introduced in the hidden layer; it is likewise connected by non-negative weights to the corresponding constrained neurons in the input layer and to the neurons in the output layer. ANNs with the above properties can also be applied to other problems where similar domain knowledge should be preserved, both within and outside the area of maritime transportation. A minimal illustration follows.
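The following is a minimal sketch (our illustration rather than the book's code; the layer sizes, the softplus/tanh activations, and the projection step are assumptions consistent with the description above) of an ANN-DK forward pass with weight projection in Python:

import numpy as np

def softplus(t):
    # Convex, monotonically increasing activation for the constrained neurons.
    return np.log1p(np.exp(t))

def ann_dk_forward(x, W, bh, v, bo, K1):
    # x[0] is sailing speed; the first K1 hidden neurons are the convex and
    # monotonic neurons, and the remaining ones are ordinary tanh neurons.
    t = W @ x + bh
    h = np.concatenate([softplus(t[:K1]), np.tanh(t[K1:])])
    return v @ h + bo

def project(W, v, K1):
    # After each training update: clip constrained weights to be non-negative
    # and sever the connection from speed to the unconstrained neurons.
    W[:K1, 0] = np.maximum(W[:K1, 0], 0.0)   # speed -> constrained neurons
    v[:K1] = np.maximum(v[:K1], 0.0)         # constrained neurons -> output
    W[K1:, 0] = 0.0                          # no speed -> ordinary neurons
    return W, v

# Smoke test with hypothetical sizes: m = 4 features, K = 6, K1 = 3.
rng = np.random.default_rng(0)
W, v = rng.normal(size=(6, 4)), rng.normal(size=6)
W, v = project(W, v, K1=3)
preds = [ann_dk_forward(np.array([s, 0.5, -0.2, 1.0]), W, np.zeros(6), v, 0.0, 3)
         for s in np.linspace(10, 20, 5)]
print(np.round(preds, 3))   # non-decreasing in speed, as Theorem 13.1 guarantees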
References
[1] Yan R., Wang S., Cao J., Sun D. ‘Shipping domain knowledge informed pre-
diction and optimization in port state control’. Transportation Research Part
B. 2021;149:52–78.
[2] Information sheet on the new inspection regime (NIR) [online]. Tokyo:
Tokyo MoU. 2022. Available from http://www.tokyo-mou.org/doc/NIR-
information%20sheet-r.pdf
[3] Chen T., Guestrin C. ‘XGBoost: a scalable tree boosting system’. Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining; 2016. pp. 785–94.
[4] Sill J. ‘Monotonic networks’ in Advances in Neural Information Processing
Systems; 1997.
[5] Daniels H., Velikova M. ‘Monotone and partially monotone neural networks’.
IEEE Transactions on Neural Networks. 2010;21(6):906–17.
[6] Pei S., Hu Q., Chen C. ‘Multivariate decision trees with monotonicity con-
straints’. Knowledge-Based Systems. 2016;112:14–25.
[7] Fourth IMO GHG study 2020 – final report [online]. London: IMO
Publications. 2020 Aug 25. Available from https://www.imo.org/en/OurWork/
Environment/Pages/Fourth-IMO-Greenhouse-Gas-Study-2020.aspx
[8] Ronen D. ‘The effect of oil price on containership speed and fleet size’.
Journal of the Operational Research Society. 2011;62(1):211–16.
[9] Wang S., Meng Q., Liu Z. ‘Bunker consumption optimization methods in
shipping: a critical review and extensions’. Transportation Research Part E.
2013;53:49–62.
[10] Yan R., Wang S., Psaraftis H.N. ‘Data analytics for fuel consumption man-
agement in maritime transportation: status and perspectives’. Transportation
Research Part E. 2021;155:102489.
Chapter 14
Explanation of black-box ML models in
maritime transport
In the domain of ML, Doshi-Velez and Kim [1] define interpretability as the ability to explain or present in understandable terms to a human. Then, according to Arrieta et al. (2020) [2], for ML models, interpretability is "a passive characteristic of a model referring to the level at which a given model makes sense for a human observer." The authors further explain that this characteristic can be interpreted as model transparency. This shows that model interpretability is an inherent property of an ML model, indicating that the ML model itself possesses the ability to make its developers and users understand its reasoning process and the predictions generated. Typical ML models falling into this category include rule-based learning and reasoning, linear regression (LR), decision trees (DTs), and kNN (k-nearest neighbors). Details of these ML models will be covered in the next sections. In contrast, explainability is "an active characteristic of a model, denoting any action or procedure taken by a model with the intent of clarifying or detailing its internal functions." Therefore, model explainability can be understood as explaining an ML model of black-box nature using other, external techniques, where such techniques can be local (i.e., for the prediction of a single example) or global (i.e., for the overall performance of the whole model), and model-specific (limited to specific model classes) or model-agnostic (applicable to any ML model). A detailed introduction of the related methods will be given in the next sections.
We say that most ML models are of black-box nature because we do not know how the features fed into the model are processed and used to give the final output. According to Doshi-Velez and Kim [1], the black-box nature of ML models is not a major concern when (1) there are no significant consequences for unacceptable results given by an ML model; or (2) the problem is sufficiently well studied and validated in practice, and thus the users trust the decisions recommended by the system even if it is imperfect, i.e., the prediction results are not always 100% accurate. Otherwise, an ML model explanation should be given to address the incompleteness in the problem formalization, which means that, for some problems, knowing what the prediction is is not enough; it is also expected that why such a prediction is made should be explained [3]. Unfortunately, neither condition holds in the maritime industry. The main reason is that this industry involves several heterogeneous and conservative stakeholders and decision-makers, who make strategic, tactical, and operational decisions heavily dependent on long-term experience instead of emerging data-driven models, especially ML models. Consequently, recommendations generated by black-box ML models without convincing explanations are seldom acceptable to them, even if in many cases the recommendations given by these models can be much more efficient and reasonable than transparent but naive rules or expert systems.
One example in maritime transportation is the ship selection decisions in PSC made on each workday at port authorities: although various high-performing data-driven models have been developed for ship risk prediction or high-risk ship recommendation in existing studies, and have been empirically shown to be more efficient than the rule-based SRP (ship risk profile) ship selection method adopted by most MoUs (Memoranda of Understanding), these data-driven models are seldom adopted by port authorities at the moment. Meanwhile, for the ship risk prediction problem in PSC,
the first condition mentioned above is not satisfied, because ship selection decisions are vital for both port authorities and ship operators. For port authorities, the number of foreign visiting ships is large each day, while the available inspection resources can be scarce. Moreover, it can be seen from the annual reports of the MoUs that only some of the inspected ships have deficiency/deficiencies identified, and only a small proportion of them are actually detained, i.e., have serious deficiency/deficiencies found. Therefore, only a small proportion of all the foreign visiting ships can be and should be inspected. It is thus crucial to allocate the limited inspection resources to the ships with the highest risks (i.e., with the worst performance) in PSC inspection to improve the efficiency of inspection. From the perspective of ship operators, too frequent and unnecessary inspections would delay the shipping schedule and thus reduce the efficiency of the fast turnover of the shipping logistics system and cause financial loss. Besides, if too few substandard ships are inspected by the port states, ship operators may lack the motivation to intensively maintain their ships in good condition. On the contrary, if too many qualified ships are inspected by a certain port, ship operators may turn to other destinations with looser inspection strategies; such behavior is referred to as "port shopping," which may affect overall vessel quality and ports' reputations. Besides, the second condition mentioned above is also not satisfied in ship selection for inspection by PSC, because although this topic is widely studied in the existing literature, where several data-driven models have been developed to assist port authorities in high-risk ship selection, the proposed models are not applied by the ports, and thus the results have not yet been fully validated. Consequently, there is still a long way to go to make users trust the prediction results given by the data-driven models in PSC.
In addition to fulfilling the need for model explanation according to Doshi-Velez and Kim [1], providing explanations for black-box models developed to address practical problems in maritime transport can bring the following advantages:
shipping service providers, model fairness is key to guaranteeing that the predictions are reasonable and thus to verifying that the recommendations given by the black-box models are compliant with ethical standards and common beliefs.
4. Extensibility: explanations given for black-box data-driven models can help the
developers to further adjust the settings and hyperparameters of the ML models
more effectively, so as to improve their prediction accuracy. They can also help
the model users generate general rules and extract new knowledge from the
massive data for future decision-making.
prediction models are meaningful. This requires that the prediction models to be explained learn the underlying relationships in the data well and thus generalize well to unknown examples. In general, model predictive accuracy is evaluated by the accuracy on the test set using proper metrics. In the context of predictive accuracy evaluation for a model explanation, one should pay attention to the data used to check the predictive accuracy: the test set data must reflect the interest of the model users. For example, to evaluate a high-risk ship selection model developed for the Hong Kong port using inspection records there, it is not reasonable to evaluate the prediction model's performance using the inspection records at a port other than the Hong Kong port, say the Port of Singapore. Moreover, sometimes using average prediction accuracy might not be enough, as it is also expected that model performance should be stable when there are data and model perturbations within a reasonable range. This is because if the model changes dramatically when there are slight changes in the data or model, the explanations generated from the model might not be trustworthy.
One point that needs to be added is that there is a trade-off between predictive accuracy and descriptive accuracy: usually, a more complex black-box prediction model has higher predictive accuracy, but its descriptive accuracy might be low, as such a model might be hard to analyze. In contrast, a simple white-box prediction model can have high descriptive accuracy, as its behavior is easier to capture; however, its predictive accuracy might not be high enough.
2. Feature importance: this method aims to show how important a feature is to the prediction of the target, i.e., how much the feature contributes to the prediction of the target. It can be expressed as the importance value of a single feature, or in a more complex form such as pairwise feature interaction strengths.
3. Partial dependence: this method aims to show the marginal contribution of the values of a feature to the predicted output, often visualized as partial dependence plots. As the marginal contribution needs to be calculated, all features other than the feature being explored are held unchanged, while different values of the explored feature are traversed.
4. Data point: this method aims to use existing or newly created examples to show the decision process of an ML model. One typical method at the single-example level is the counterfactual explanation, which aims to find examples similar to the example being explained by changing some of its features and observing the changes in the output, so as to understand how decisions are made by the prediction model.
5. Surrogate model: this method aims to find a surrogate model that is inherently interpretable and can mimic the behavior of the black-box ML model to be explained.
There are several criteria that can be followed to classify the explanations given to black-box ML prediction models from different perspectives. One criterion is based on model property, as discussed above: either an ML model is intrinsically explainable, or external explanation methods should be applied to explain it. The first class of explanation is called intrinsic explanation and the second class is called post hoc explanation. Intrinsically explainable models are interpretable ML models with simple structures that are self-explainable. A typical example is LR, which takes an additive linear form: the coefficients of the terms (i.e., features) can be viewed as their weights (or as approximate importance scores if the feature values are normalized), while the final predicted target of a particular example is given by the weighted sum of its feature values. Another typical example is the DT, especially shallow DTs, in which the reasoning process is shown by the (feature, value) pairs selected for node splitting, and the predicted output of a particular example is determined by the average/majority output of the examples in the node into which this example falls. In addition, feature importance can be generated from a DT: a feature is regarded as more important than another if it can reduce node impurity to a larger extent. It should be noted that the explanations given by an intrinsic model can be from relatively limited aspects, and such explainability may sacrifice the accuracy of the prediction model. In contrast, post hoc explanation refers to the process of developing an ML model of black-box nature first and then using external explanation methods to explain the model. If necessary, it can also be applied to explain intrinsic models from wider perspectives.
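As a small illustration of how an intrinsic explanation is read off an LR model, the following sketch fits scikit-learn's LinearRegression on synthetic, hypothetical data and prints the coefficients that serve as approximate importance scores:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))     # two standardized, hypothetical features
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lr = LinearRegression().fit(X, y)
print(lr.coef_, lr.intercept_)    # coefficients near [3, -1]: the additive
                                  # weights of the features in the prediction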
Another two criteria consider the application scope of the explanation method. One is that if the explanation is given for a single example, it is local; alternatively, if the explanation is given for the entire model, it is global. The other is that if the explanation method can only be applied to a specific model, e.g., artificial neural networks or
The visualized DT is presented in Figure 14.1. The node at the top of the tree
shown in Figure 14.1 is the root node, which contains all the 1,500 ships, i.e., the
entire training set, used for tree training. The first line in the node shows the feature
and its value selected to split the node, i.e., feature gross tonnage (GT, whose unit
is 100 cubic feet) is selected to split the root node with a threshold value of 6,451.5.
This means that ships whose GT is no more than the threshold are split to the left child node, as shown by the "True" arrow under the root node, meaning that the splitting criterion "GT <= 6,451.5" is satisfied for these ships. Other ships, i.e., ships whose GT is more than 6,451.5, are split to the right child node, as shown by the "False" arrow under the root node. The second line in the root node shows the
current MSE (mean squared error), i.e., impurity, of the node. The MSE of the root
node is 27.732. The third line shows the number of ships contained by the node. The
fourth line is the predicted ship deficiency number of this node, which is the aver-
age deficiency number of all the ships contained in the node. This value is 4.309 in
the root node, indicating that the average number of deficiencies of the ships in the
entire training set is 4.309.
The tree is split in a recursive manner until the stopping criterion of max_depth = 3 is met. Therefore, we can see that the depth of the tree is 3. The total number of ships contained in all leaf nodes is equal to 1,500. It is also interesting to find that although splittings in DTs aim to reduce impurity in subsequent nodes, which is represented by MSE in this example, the MSE may not necessarily be reduced after splitting a node. An example is the split of the root node to its left child node, where the MSE increases from 27.732 to 81.853. This indicates that the left child node might contain examples with extreme output values (i.e., very large ship deficiency numbers) that have a big influence on the MSE of the node. This phenomenon is also shown by the outputs of the leaf nodes: for leaf nodes with predicted ship deficiency numbers close to the output of the root node, i.e., nodes 3, 5, and 6, their MSE values are less than or even much less than the MSE of the root node. The outputs of the other nodes are large, mainly because they contain ships with very large deficiency numbers. Consequently, their MSE values are higher than the MSE of the root node. This shows that extreme values are indeed contained in the training set, and they can have a very large impact on a DT's performance.
The above analysis shows the intrinsically explainable property of DT models: the inner working mechanism of a DT, i.e., how the DT is split from the root node to the leaf nodes, is shown by the (feature, value) pairs; the examples contained, the output, and the error can be obtained for each node; and if a new ship comes, how to classify the ship to a leaf node and decide its predicted deficiency number is completely clear. Model fairness can be verified based on these explanations. For example, from the root node, it can be seen that a smaller ship tends to have a larger deficiency number in PSC inspection, as it may not be managed as well as larger ships. Moreover, from the second node on the right at tree depth 2, it can be seen that if a ship had a larger deficiency number in its last initial inspection, it is predicted to have a larger deficiency number by the DT for the current PSC inspection, which complies with expert knowledge. Based on the splitting criteria and the MSE of
each node, feature importance can be generated. As only three features are used for node splitting in the tree in Figure 14.1, i.e., GT, company_performance, and last_deficiency_no, only these features have importance scores, which are 6.8689, 1.7838, and 1.4505, respectively, showing that ship GT is the most important feature determining ship deficiency number, followed by ship company performance and the number of deficiencies in the last PSC initial inspection. For the detailed methods to calculate feature importance scores in regression DTs, readers are referred to Reference [8] for more information. In addition, decision rules can be extracted from the DT in Figure 14.1. For example, it can be concluded that if a ship has a GT of no more than 6,451.5 and unknown company performance, its deficiency number is predicted to be 19.393, which is relatively high and deserves much attention from the port authority. Another rule is that if a ship has a GT larger than 6,451.5 but less than 24,294.5 and its last deficiency number is no larger than 3.5, its predicted deficiency number is 4.122.
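For readers who wish to reproduce a Figure 14.1-style tree, the following is a minimal sketch using scikit-learn; the synthetic data below merely stand in for the 1,500-ship training set, and note that scikit-learn normalizes its impurity-based importance scores to sum to one, unlike the raw scores quoted above:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Hypothetical stand-in for the 1,500-ship training set.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "gross_tonnage": rng.uniform(500, 150_000, 1500),
    "company_performance": rng.integers(0, 4, 1500),
    "last_deficiency_no": rng.integers(0, 20, 1500),
})
y_train = rng.poisson(4.3, 1500)          # stand-in deficiency numbers

tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
plot_tree(tree, feature_names=list(X_train.columns), filled=True)
plt.show()                                 # shows splits, MSE, samples, value
print(dict(zip(X_train.columns, tree.feature_importances_)))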
14.2.3 SHAP method

SHAP was proposed by Lundberg and Lee in 2017 [9] and is theoretically based on the Shapley value in game theory. SHAP is used to explain the output of the prediction of an individual example (i.e., local) of an arbitrary machine learning model (i.e., model-agnostic) after it is constructed (i.e., post hoc). A base value, which is the average output of the examples in the training set, is first calculated, and the contribution of each feature value of the example to the predicted target is then calculated. The final output of the example is the sum of the base value and the contributions of its feature values, where each contribution is represented by a marginal contribution; this can be viewed as similar to a linear model taking a sum form. SHAP is motivated by coalitional game theory, where the "game" refers to the prediction task of an example, the "players" are the features included in the prediction model, and the "gain" is the difference between the final predicted target and the base value, i.e., the sum of the contributions of the feature values. Given an example
xi with m features, the contribution of feature m′, m′ = 1, ..., m, of xi is denoted by φ^i_{m′}. Furthermore, given that the base value is ȳ, which is the average output in the training set, the predicted output of the example xi, denoted by f(xi), as explained by SHAP can be represented in the following additive linear form:

f(xi) = ȳ + Σ_{m′=1}^{m} φ^i_{m′}.
Table 14.2 Feature values and the corresponding SHAP values of the example
ship
Then, we explain the prediction of a random ship in the test set. Feature values
and their corresponding SHAP values of the selected example ship are shown in
Table 14.2.
The sum of the SHAP values of all features is −2.082140. Given the base value of 4.320877, the final output of the deficiency number of this ship is 2.238737. The real deficiency number of this ship is 3, and thus the absolute prediction error for this ship is 0.761263. The main contributors to the final predicted deficiency number can also
be shown in a visual manner in Figure 14.2.
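A Figure 14.2-style visualization can be produced with the shap package roughly as follows (a sketch assuming a trained tree-based regressor model and test features X_test in a pandas DataFrame; the index i of the example ship is arbitrary):

import shap

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X_test)
i = 0   # index of the example ship in the test set
shap.force_plot(explainer.expected_value, sv[i], X_test.iloc[i], matplotlib=True)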
As shown in Figure 14.2, the base value is 4.32 and the predicted value is 2.24, where both are shown directly on the axis. The main features of this ship that decrease the predicted deficiency number are shown in blue: having a GT of 131,332 (being a relatively large ship), being of ship age 5 (being a young ship), and not being of ship type general_cargo/multipurpose (not of a ship type associated with a larger number of deficiencies). Meanwhile, the major feature that increases the predicted ship deficiency number is having 5 deficiencies in the last initial inspection within the Tokyo MoU, which is larger than the average deficiency number over the training set of 4.32. The figure further shows that the overall effects of the values of all the features lead to the ship having 2.08 fewer deficiencies than the base value, and thus the final predicted deficiency number of the ship is 4.32 − 2.08 = 2.24 in the current PSC inspection. In this way, the decision process for this ship, i.e., why it is predicted to have 2.24 deficiencies and what the main contributors to the final target are, is clearly presented to the users.
We then use SHAP to generate more insights from the training set. We first use a beeswarm plot, with the y-axis representing each feature and the x-axis representing the features' SHAP values, and with each dot representing a single ship in the training set, as shown in Figure 14.3. Feature values from low to high are shown by gradient colors, as indicated on the right side of the figure. When multiple dots land at the same x position, they pile up to show the density. The core code to generate this figure is given as follows:
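A minimal version of this code (a sketch assuming a trained tree-based model model and the training features in a pandas DataFrame X_train) is:

import shap

explainer = shap.TreeExplainer(model)          # model: the trained predictor
shap_values = explainer.shap_values(X_train)   # one SHAP value per ship per feature
shap.summary_plot(shap_values, X_train)        # beeswarm plot as in Figure 14.3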
We then examine the effects of the feature values on the predicted deficiency number in more detail.
• GT: for feature values shown by blue dots, i.e., relatively small feature values, many of the SHAP values are larger than 0, indicating that a smaller ship is more likely to be in worse condition, which increases the predicted number of deficiencies. It is also noted that feature values shown in blue can have negative SHAP values as well, which reduce a ship's predicted deficiency number; this shows that the conditions of small ships can be varied. In contrast, when the feature values are shown in red, which refers to large ships, their SHAP values are always less than −1, showing that large ships are generally in good condition.
• Last_deficiency_number: having a larger number of deficiencies in the last initial PSC inspection in the Tokyo MoU would no doubt increase the predicted number of deficiencies in the current PSC inspection, as this indicates that the ship is in a relatively bad condition; these ships are shown by the red dots with positive SHAP values. Bad performance in the last PSC inspection can increase the final prediction by up to 6. Meanwhile, it is also interesting to find that even if a ship performed relatively well in the last PSC inspection, this does not necessarily mean that the ship will perform well in the current inspection; these ships are shown by the blue dots with positive SHAP values. This is mainly because ships' conditions can vary from time to time, especially when the last inspection was long ago. In addition, different port states may have different inspection strategies and different degrees of strictness. In general, it can be seen that having fewer deficiencies in the last PSC inspection can reduce the number of deficiencies identified in the current PSC inspection.
• Age: the general pattern for this feature is that older ships tend to have positive or slightly negative SHAP values, while younger ships tend to have negative SHAP values. This is intuitive and easy to understand: the manufacturing technology of older ships is not as exquisite as that of younger ships, and older ships may also have more wear and tear, as well as damage, after a long period of sailing.
fairness can be verified, so as to convince the users regarding the rationale behind
the black-box model to improve model acceptance and practical applicability.
References
[9] Lundberg S.M., Lee S.I. ‘A unified approach to interpreting model predic-
tions’. Advances in Neural Information Processing Systems. 2017;30.
[10] Lundberg S.M., Erion G.G., Lee S.I. 'Consistent individualized feature attribution for tree ensembles'. arXiv:1802.03888. 2018.
[11] Shap API reference [online] [Scott Lundberg Revision]. 2022 Aug 19.
Available from https://shap-lrjball.readthedocs.io/en/latest/api.html
Chapter 15
Linear optimization
15.1 Basics
200yMS + 300yMB.

We seek to find values for yMS and yMB that maximize the objective function.

Third, we need to define the constraints, because the decision variables cannot take all real values. For example, yMB cannot be 1 billion because of limited demand and limited ship capacity; yMB cannot be −4, either.
Solution. Let yMS and yMB be the decision variables representing the volumes of containers transported from Melbourne to Sydney and Melbourne to Brisbane, respectively. The model is:

max 200yMS + 300yMB (maximize the total profit)

subject to

yMS + yMB ≤ 1000
(ship capacity constraint on the leg from M to S)
yMB ≤ 1000
(cannot carry more containers than the ship capacity on the leg from S to B)
yMS ≤ 800
(cannot carry more containers for the OD pair (M, S) than the demand)
yMB ≤ 700
(demand constraint for the OD pair (M, B))
yMS ≥ 0
(cannot carry a negative number of containers for the OD pair (M, S))
yMB ≥ 0
(nonnegativity constraint on yMB).
We can see from the above solution that, in an optimization model we need to
(1) define the decision variables; (2) define the objective function and explain it;
and (3) define the constraints and explain each constraint. We can use any symbol
to represent a decision variable; however, we often use symbols that are easy to
remember. If the meaning of the objective function or a constraint is straightfor-
ward, it is acceptable not to explain it. In the above solution, “max” means we
want to maximize the objective function; we use “min” if we want to minimize the
objective function (e.g., when the objective function represents cost). The model
for Example 15.1 has two variables and six constraints.
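Besides Excel, such a model can be solved in a few lines of Python; below is a sketch of Example 15.1 with scipy.optimize.linprog (which minimizes, so the objective is negated):

from scipy.optimize import linprog

c = [-200, -300]            # maximize 200yMS + 300yMB via minimizing its negative
A_ub = [[1, 1],             # yMS + yMB <= 1000 (capacity on the leg from M to S)
        [0, 1]]             # yMB <= 1000 (capacity on the leg from S to B)
b_ub = [1000, 1000]
bounds = [(0, 800),         # 0 <= yMS <= 800 (demand for OD pair (M, S))
          (0, 700)]         # 0 <= yMB <= 700 (demand for OD pair (M, B))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)      # expected: yMS = 300, yMB = 700, profit 270,000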
Example 15.2. Reconsider the service in Example 15.1. Ships with a capacity of 1500
(TEUs) are deployed to provide a weekly frequency. The container shipment demand
is: Melbourne to Sydney qMS = 800, Melbourne to Brisbane qMB = 700, and Sydney
to Brisbane qSB = 900. The profit of transporting 1 TEU from Melbourne to Sydney
is $200, from Melbourne to Brisbane is $300, and from Sydney to Brisbane is $100.
Develop an optimization model to evaluate the maximum profit ($/week) the company
can make.
Solution. Let yMS, yMB, and ySB be the decision variables representing the vol-
umes of containers transported from M to S, M to B, and S to B, respectively.
The model is:

max 200yMS + 300yMB + 100ySB (maximize the total profit)

subject to

yMS + yMB ≤ 1500
(ship capacity constraint on the leg from M to S)
yMB + ySB ≤ 1500
(ship capacity constraint on the leg from S to B)
0 ≤ yMS ≤ 800, 0 ≤ yMB ≤ 700, 0 ≤ ySB ≤ 900
(demand and nonnegativity constraints).
Example 15.5. Consider the CC1 service in Figure 15.1 with the port rotation

Shanghai (S, 1) → Kwangyang (K, 2) → Pusan (P, 3) → Los Angeles (L, 4) → Oakland (O, 5) → Pusan (P, 6) → Kwangyang (K, 7) → Shanghai (S, 1).

Ships with a capacity of 8000 (TEUs) are deployed to provide a weekly frequency. The container shipment demand is: Shanghai to Los Angeles qSL = 4500, Kwangyang to Los Angeles qKL = 1000, Pusan to Los Angeles qPL = 1500, Oakland to Shanghai qOS = 3700, and Shanghai to Pusan qSP = 1900. The profit of transporting 1 TEU from Shanghai to Los Angeles is $1800, Kwangyang to Los Angeles is $1900, Pusan to Los Angeles is $1600, Oakland to Shanghai is $900, and Shanghai to Pusan is $500. Develop an optimization model to evaluate the maximum profit ($/week) the company can make.
Solution. Let ySL, yKL, yPL, yOS, and ySP be the decision variables representing the volumes of containers transported from S to L, K to L, P to L, O to S, and S to P, respectively. The model is:

max 1800ySL + 1900yKL + 1600yPL + 900yOS + 500ySP

subject to

ySL + ySP ≤ 8000
(ship capacity constraint on the leg from S to K)
ySL + yKL + ySP ≤ 8000
(ship capacity constraint on the leg from K to P)
ySL + yKL + yPL ≤ 8000
(ship capacity constraint on the leg from P to L)
yOS ≤ 8000
(ship capacity constraint on the leg from O to P)
yOS ≤ 8000
(ship capacity constraint on the leg from P to K)
yOS ≤ 8000
(ship capacity constraint on the leg from K to S)
0 ≤ ySL ≤ 4500
0 ≤ yKL ≤ 1000
0 ≤ yPL ≤ 1500
0 ≤ yOS ≤ 3700
0 ≤ ySP ≤ 1900.
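The same pattern scales directly to Example 15.5; the sketch below (with scipy.optimize.linprog) should find that every demand can be carried in full, for a weekly profit of $16,680,000:

from scipy.optimize import linprog

c = [-1800, -1900, -1600, -900, -500]   # order: ySL, yKL, yPL, yOS, ySP
A_ub = [[1, 0, 0, 0, 1],                # leg S -> K
        [1, 1, 0, 0, 1],                # leg K -> P
        [1, 1, 1, 0, 0],                # leg P -> L
        [0, 0, 0, 1, 0],                # leg O -> P
        [0, 0, 0, 1, 0],                # leg P -> K
        [0, 0, 0, 1, 0]]                # leg K -> S
b_ub = [8000] * 6
bounds = [(0, 4500), (0, 1000), (0, 1500), (0, 3700), (0, 1900)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)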
15.2 Classification of linear optimization models according to solutions
The examples in section 15.1 are all linear optimization models. A linear optimization
model has a linear objective function to be maximized or minimized, and a set of linear
constraints. The following functions are linear:
5x + 6y − 3.4z
100x1 − 0.001x2 + 1002.

The following functions are nonlinear:

5x + 6y² − 3.4z
100x1 − 0.001 sin x2 + 1002
3xy.

The following constraints are linear:

5x + 6y − 3.4z ≤ 23
100x1 − 0.001x2 + 1002 ≥ 100
100x1 − 0.001x2 + 1002 = 100.
It should be noted that we do not usually consider “<” or “>” constraints in linear
optimization, e.g., x1 + x2 < 5.
A feasible solution to a linear optimization model is a vector of values of the decision variables that satisfies all the constraints. For example, in the model for Example 15.1, (yMS, yMB) = (0, 1) is a feasible solution. The set of all feasible solutions is called the feasible set. We can calculate the objective function value for each feasible solution. For example, the objective function value of the solution (yMS, yMB) = (0, 1) is 300. An optimal solution is the "best" feasible solution, which has the largest objective function value for a maximization model, and the smallest objective function value for a minimization model. The objective function value of an optimal solution is called the optimal objective function value.

Example 15.6. Write down three distinct feasible solutions to the model in the solution to Example 15.1, and calculate their objective function values.

Solution. There are many feasible solutions. For example, the objective function value of the solution (yMS, yMB) = (0, 1) is 300; the objective function value of the solution (yMS, yMB) = (0, 0) is 0; the objective function value of the solution (yMS, yMB) = (10, 10) is 5000; the objective function value of the solution (yMS, yMB) = (100, 100) is 50,000.
A linear optimization model can be classified into three categories according to solutions: infeasible, unbounded, and having an optimal solution. A linear optimization model may have no feasible solution, i.e., the model is infeasible, e.g.:

max x + y

subject to

x + y ≤ −1
x ≥ 0
y ≥ 0.
A linear optimization model is infeasible if and only if its feasible set is empty, i.e., no
solution satisfies all the constraints simultaneously. Whether a linear optimization model
is infeasible or not has nothing to do with its objective function. A linear optimization
model either has 0, or 1, or an infinite number of feasible solutions. A linear optimization
model is feasible if it has at least one feasible solution.
A linear optimization model may be unbounded, i.e., for a maximization model, the objective function value can be infinitely large and, for a minimization model, the objective function value can be infinitely small, e.g.:

max x + y

subject to

x + y ≥ 1
x ≥ 0
y ≥ 0.
Note that if a linear optimization model is unbounded, its feasible set must be unbounded. If the feasible set of a linear optimization model with decision variables x1, x2, ..., xn is bounded, i.e., there exists a positive number M such that any feasible solution satisfies −M ≤ x1 ≤ M, −M ≤ x2 ≤ M, ..., −M ≤ xn ≤ M, then the model will not be unbounded. If the feasible set is unbounded, the model may be bounded (e.g., minimizing x subject to x ≥ 0) or unbounded (e.g., minimizing x subject to x ≤ 0). For most practical problems, we do not need to worry about whether they are unbounded, because in reality the absolute values of the decision variables cannot be infinitely large.
If a linear optimization model is feasible and not unbounded, then it has an optimal solution, e.g.:

max x + y

subject to

x + y ≥ 1
x ≤ 1
y ≤ 1.
We often add the superscript "∗" to a decision variable to represent its value in an optimal solution. For example, we often use (x∗, y∗) to represent the optimal values of the decision variables (x, y).
If a linear optimization model has an optimal solution, the optimal solution may not be unique, e.g.:

max x + y

subject to

x + y ≤ 1
x ≤ 1
y ≤ 1.
If a linear optimization model has more than one optimal solution, then it has an
infinite number of optimal solutions. In most cases, we are only interested in obtain-
ing one of them.
If an optimization model has an optimal solution, then there is no loss of generality in saying "suppose that it is a minimization model with a positive objective function." For example, if we aim to maximize 3x − 4y, then we can minimize −(3x − 4y) + z subject to z = 10^10 (i.e., we let the new decision variable z equal a very large positive number).
Example 15.7. Suppose that all linear optimization models in this question have optimal solutions. (1) Consider a minimization linear optimization model with several constraints, one of which is 3x − 4y ≤ 5. If the constraint is changed to 3x − 4y ≤ 6, how will the feasible set change? How will the optimal objective function value change? (2) Consider a linear optimization model with several constraints that minimizes 3x − 4y. If the objective function is changed to 6x − 8y, how will the feasible set change? How will the optimal solution change? How will the optimal objective function value change?
Solution. (1) The feasible set will be larger or not change. The optimal objective
function value will be smaller or not change. (2) The feasible set will not change.
The optimal solution will not change. The optimal objective function value will be
twice as large as before.
Example 15.8. Reconsider the model in the solution to Example 15.1. (1) If the ship capacity increases, how will the feasible set change? How will the optimal objective function value change? (2) If the demand from Melbourne to Brisbane decreases, how will the feasible set change? How will the optimal objective function value change? (3) If the profit of transporting 1 TEU from Melbourne to Sydney is $400 instead of $200, and from Melbourne to Brisbane is $600 instead of $300, how will the feasible set change? How will the optimal solution change? How will the optimal objective function value change?
Solution. (1) The feasible set will be larger or not change. The optimal objective
function value will be larger or not change. (2) The feasible set will be smaller or
not change. The optimal objective function value will be smaller or not change. (3)
The feasible set will not change. The optimal solution will not change. The optimal
objective function value will be twice as large as before.
It is helpful to understand that some forms of a linear optimization model are equivalent. When we say we transform Model A to Model B, we mean that Model B is equivalent to Model A in the sense that, given an optimal solution to Model B, we can easily derive an optimal solution to Model A.
Fifth, we can transform a model to one in which all decision variables are non-negative. For example, if we have a constraint x ≤ 0, we can let u = −x and replace x by −u in the model. If we have a constraint x ≥ 3, we can let u = x − 3 and replace x by u + 3 in the model. If we have a constraint x ≤ 3, we can let u = 3 − x and replace x by 3 − u in the model. If the model does not specify whether x is nonnegative or nonpositive, we can define u1 ≥ 0, u2 ≥ 0 and replace x by u1 − u2 in the model.
Example 15.11. Transform the model in Example 15.9 to one with only "=" constraints and nonnegative decision variables.

Solution. As x ≥ 1 and y ≤ 0, we let s = x − 1 and t = −y. Hence, x = s + 1, s ≥ 0 and y = −t, t ≥ 0. The model is

max 5s − 6t (note that the constant 5 does not affect the model)

subject to

3s + 4t − r = 31
45s + 98t = 36
r ≥ 0
s ≥ 0
t ≥ 0.
Sixth, we can transform the objective function to one that is equal to a decision variable. For example, maximizing 5x1 − 6x2 is equivalent to maximizing u, subject to u = 5x1 − 6x2. It is also equivalent to maximizing u, subject to u ≤ 5x1 − 6x2.
In plain words, this property means that the objective function can somehow be
considered as a constraint; it also means that if we can solve an optimization model
with complex constraints, we can also solve an optimization model with complex
constraints and a complex objective function.
Finally, in some problems there is no objective function, for instance, when
we are only interested in finding a feasible solution. In this case, we can aim to
“min 0.”
Two general forms of linear optimization models are frequently used in theoretical analysis: the canonical form that maximizes cᵀx subject to Ax ≤ b and x ≥ 0, and the standard form that maximizes cᵀx subject to Ax = b and x ≥ 0, where x is a column vector representing n decision variables, A is an m × n matrix, c is a column vector with n rows, b is a column vector with m rows, and A, b, and c are all parameters.
Example 15.12. Transform the model in Example 15.9 to the canonical form.

Solution. Letting x = u − v, u ≥ 0, v ≥ 0, and y = −w, w ≥ 0, the model is transformed to the canonical form.
Linear optimization models with two variables can be solved intuitively using
graphs. In practical applications, hardly any problem has only two variables.
However, learning the graphical method is helpful for appreciating the properties of
linear optimization models.
In the graphical method, we first draw the constraints that define an upper or lower bound for a decision variable (e.g., x ≥ 0, y ≤ 100), then draw the other constraints, and finally draw a line or a series of parallel lines that represent the objective function.
Example 15.13. Consider the model below:

max x1 + x2

subject to

x1 ≥ x2
x1 ≤ 1
x2 ≤ 1
x1 ≥ 0
x2 ≥ 0.
Use the graphical method to find the optimal solution.
Solution. See Figure 15.2. The optimal solution corresponds to the intersection of
the lines x1 = x2 and x1 = 1. Therefore, the optimal solution is x1 = 1, x2 = 1. The
optimal objective function value is 2.
Example 15.14. Consider the model below:

min x2 − x1

subject to

x1 ≥ x2
x1 ≤ 1
x2 ≤ 1
x1 ≥ 0
x2 ≥ 0.
Use the graphical method to find the optimal solution.
subject to
x1 + x2 ≥ 1
x1 ≤ 3
x2 ≤ 1
x1 ≥ 0
x2 ≥ 0.
Use the graphical method to find the optimal solution.
Solution. See Figure 15.5. The optimal solution corresponds to the intersection of
the lines x1 = 3 and x2 = 1. Therefore, the optimal solution is x1 = 3, x2 = 1. The
optimal objective function value is 7.
Example 15.17. Consider the model below:

max 2x1 + x2

subject to

x1 ≤ 1
x2 ≤ 1
x1 + x2 = 1
x1 ≥ 0
x2 ≥ 0.
Use the graphical method to find the optimal solution.
Solution. See Figure 15.6. Note that the feasible set is a line segment. The optimal
solution corresponds to the intersection of the lines x1 + x2 = 1and x1 = 1. Therefore,
the optimal solution is x1 = 1, x2 = 0. The optimal objective function value is 2.
Example 15.18. Consider the model below:

max x1

subject to

2x1 + x2 ≤ 1000
3x1 + 4x2 ≤ 2400
x1 + x2 ≤ 700
x1 − x2 ≤ 350
x1 ≥ 0
x2 ≥ 0.

2x1 + x2 ≤ 1000
3x1 + 4x2 ≤ 2400
x1 + x2 ≤ 700
x1 − x2 = 350
x1 ≥ 0
x2 ≥ 0.
We can see from the above examples that if a linear optimization model has an
optimal solution, then there exists an optimal solution that is at a “corner point” of
the feasible set*.
The graphical method can also be used to identify whether a linear opti-
mization model is infeasible, unbounded, or has an infinite number of optimal
solutions.
* There are some linear optimization models with no corner points, e.g., maximizing 0x + 2y subject to 0 ≤ y ≤ 1. We generally do not need to worry about them in practical applications.
† The exact time depends on the parameters of the model, the solver, and the computer.
subject to

x + y ≤ 1000
y ≤ 1000
x ≤ 800
y ≤ 700
x ≥ 0
y ≥ 0.

Solution. The optimal solution is x = 300, y = 700.
Example 15.21. Use a solver (e.g., Excel) to solve the following linear optimization
model:
max 200x + 300y

subject to

x + y ≥ 1000
y ≥ 1000
x ≥ 800
y ≥ 700
x ≥ 0
y ≥ 0.
Solution. If Excel is used, then the results are "The Objective Cell values do not con-
verge. Solver can make the Objective Cell as large (or as small when minimizing) as
it wants,” which indicates that the model is unbounded.
Example 15.22. Use a solver (e.g., Excel) to solve the following linear optimization
model:
max 200x + 300y

subject to

x + y ≤ 1000
y ≥ 1000
x ≥ 800
y ≤ 700
x ≥ 0
y ≥ 0.
Solution. If Excel is used, the results are “Solver could not find a feasible solution.
Solver cannot find a point for which all Constraints are satisfied,” which indicates
that the model is infeasible.
Example 15.23. Use a solver (e.g., Excel) to solve the following linear optimization
model:
max x + y

subject to

x + y ≤ 1
x ≥ 0
y ≥ 0.
Solution. Although there are an infinite number of optimal solutions, Excel only
provides one optimal solution, e.g., x = 1, y = 0.
The following questions are on linear optimization models. For each question, you
should either answer that such a linear optimization model does not exist (and you
do not need to provide the reason), or give an example of such a linear optimization
model.
Example 15.24. Give a model that is both infeasible and unbounded.
Solution. Such a model does not exist.
Example 15.25. Give a model whose optimal objective function value is 1, and after
removing one constraint, the optimal objective function value is 0.
Solution. Minimizing x subject to x ≥ 1, x ≥ 0. After removing the constraint x ≥ 1, the optimal objective function value is 0.
Example 15.26.
1. A model is infeasible, and after removing one constraint, it is feasible.
2. A model is infeasible, and after removing one constraint, it is unbounded.
3. A model is infeasible, and after removing one constraint, it has an optimal
solution.
4. A model is infeasible, and after removing one constraint, it has an infinite num-
ber of optimal solutions.
5. A model has an optimal solution, and after adding one constraint, it is
unbounded.
6. A model has an optimal solution, and after adding one constraint, it is
infeasible.
7. A model has exactly one optimal solution, and after adding one constraint, it
has an infinite number of optimal solutions.
8. A model has an infinite number of optimal solutions, and after adding one con-
straint, it has exactly one optimal solution.
9. A model is infeasible, and after changing its objective function, it has an opti-
mal solution.
10. A model has an optimal solution, and after changing its objective function, it is
infeasible.
11. A model has an optimal solution, and after changing its objective function, it is
unbounded.
12. A model is unbounded, and after changing its objective function, it has exactly
one optimal solution.
13. A model is unbounded, and after changing its objective function, it has an infi-
nite number of optimal solutions.
Solution.
Although a linear optimization solver only provides one optimal solution if there
are an infinite number of optimal solutions, we can take advantage of the solver in a
number of smart ways to address more problems.
Example 15.27. Given a linear optimization model with a non-empty and bounded feasible set defined by Ax ≤ b and x ≥ 0, where the vector of decision variables x has n elements and A and b are parameters of appropriate dimensions, how can we use a linear optimization solver to check whether it has only one feasible solution or an infinite number of feasible solutions?

Solution. The number of feasible solutions is independent of the objective function. We solve the following two models: maximizing x1 subject to Ax ≤ b and x ≥ 0, and minimizing x1 subject to Ax ≤ b and x ≥ 0. Here both models have an optimal solution. If their optimal objective function values are different, then there are an infinite number of feasible solutions to the original model. Otherwise we check models that maximize and minimize x2, x3, ..., xn. If all of the optimal objective function values of the n maximization models are the same as the corresponding values of the minimization models, there is only one feasible solution.
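A sketch of this check with scipy.optimize.linprog follows (the matrix A and vector b are hypothetical stand-ins; the feasible set is assumed non-empty and bounded):

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0], [1.0, -1.0]])   # hypothetical data
b = np.array([2.0, 0.0])
n = A.shape[1]

unique = True
for j in range(n):
    e = np.zeros(n)
    e[j] = 1.0
    lo = linprog(e, A_ub=A, b_ub=b, bounds=[(0, None)] * n).fun
    hi = -linprog(-e, A_ub=A, b_ub=b, bounds=[(0, None)] * n).fun
    if hi - lo > 1e-9:        # xj is not fixed over the feasible set
        unique = False
        break
print("one feasible solution" if unique else "infinitely many feasible solutions")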
Example 15.28. Given a linear optimization model that is feasible and not unbounded: maximizing cᵀx subject to Ax ≤ b and x ≥ 0, where the vector of decision variables x has n elements and A, b, and c are parameters of appropriate dimensions, how can we use a linear optimization solver to check whether it has only one optimal solution or an infinite number of optimal solutions?

Solution. Obtain an optimal solution, denoted by x∗. Now the problem becomes how to check whether there is one or an infinite number of feasible solutions to the set of constraints Ax ≤ b, x ≥ 0, and cᵀx = cᵀx∗, which is Example 15.27.
Example 15.29. Given a set of constraints Ax ≤ b and x ≥ 0, where the vector of decision variables x = [x y]ᵀ and A and b are parameters of appropriate dimensions. We want to find the optimal values of the decision variables satisfying two conditions. Condition 1: the expression 3x − 4y is maximized. Condition 2: among all the feasible solutions satisfying Condition 1, we want to find the one that minimizes 2x + 3y. How can we use a linear optimization solver to address this problem?‡

Solution. The first approach is to use the big-M method. Our first priority is to maximize 3x − 4y, and our second priority is to maximize −(2x + 3y). Therefore, we can solve the model maximizing 10^8(3x − 4y) − (2x + 3y) subject to Ax ≤ b and x ≥ 0. Of course, the solution may not be accurate.

The second approach involves sequential optimization. We first solve the model maximizing 3x − 4y subject to Ax ≤ b and x ≥ 0 to obtain an optimal solution (x̂, ŷ). We then solve the model minimizing 2x + 3y subject to Ax ≤ b, x ≥ 0, and 3x − 4y = 3x̂ − 4ŷ. Its optimal solution (x∗, y∗) is what we want.
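The sequential approach can be scripted directly; below is a sketch with scipy.optimize.linprog, where A and b are hypothetical stand-ins:

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 2.0], [3.0, 1.0]])    # hypothetical data
b = np.array([10.0, 15.0])
bounds = [(0, None), (0, None)]

# Step 1: maximize 3x - 4y (linprog minimizes, so negate).
r1 = linprog([-3.0, 4.0], A_ub=A, b_ub=b, bounds=bounds)
best = 3.0 * r1.x[0] - 4.0 * r1.x[1]

# Step 2: minimize 2x + 3y while fixing 3x - 4y at its optimal value.
r2 = linprog([2.0, 3.0], A_ub=A, b_ub=b,
             A_eq=[[3.0, -4.0]], b_eq=[best], bounds=bounds)
print(r2.x)    # the solution satisfying Condition 1 and then Condition 2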
Example 15.30. A linear optimization model has decision variables x1, x2, ..., xn and the following constraints:

a11 x1 + a12 x2 + ... + a1n xn ≥ b1
a21 x1 + a22 x2 + ... + a2n xn ≥ b2
...
am1 x1 + am2 x2 + ... + amn xn ≥ bm
0 ≤ x1 ≤ 1
0 ≤ x2 ≤ 1
...
0 ≤ xn ≤ 1

where aij and bi are all constants, i = 1, 2, ..., m, j = 1, 2, ..., n. We do not know its objective function yet. How can we use a linear optimization solver to check whether the first constraint a11 x1 + a12 x2 + ... + a1n xn ≥ b1 is redundant (i.e., removing the constraint does not affect the model)?

Solution. We actually need to check whether the feasible set defined by all the constraints is the same as the feasible set defined by the constraints excluding the first one. To this end, we solve the model minimizing a11 x1 + a12 x2 + ... + a1n xn subject to the second to the last constraints. If the optimal objective function value is not smaller than b1, which means that any point satisfying the second to the last constraints automatically satisfies the first one, then the first constraint is redundant; otherwise it is not redundant.
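This test is easy to script as well; the sketch below uses scipy.optimize.linprog with hypothetical data, rewriting each "≥" constraint in the "≤" form that linprog expects, and assumes the reduced feasible set is non-empty:

import numpy as np
from scipy.optimize import linprog

a = np.array([[1.0, 1.0], [2.0, 0.5]])    # hypothetical coefficients a_ij
b = np.array([3.0, 2.0])                  # hypothetical right-hand sides b_i
n = a.shape[1]

# Minimize the first constraint's left-hand side subject to the remaining
# constraints (a_i x >= b_i becomes -a_i x <= -b_i) and 0 <= x_j <= 1.
res = linprog(a[0], A_ub=-a[1:], b_ub=-b[1:], bounds=[(0, 1)] * n)
print("redundant" if res.fun >= b[0] - 1e-9 else "not redundant")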
‡ For example, in traffic control, Condition 1 might mean maximizing the survival rates in road accidents, and Condition 2 might mean minimizing the travel delay due to road accidents. The objective in Condition 1 is much more important than that in Condition 2.
Example 15.31. Suppose that Ω1 is a set of vectors (x1, x2, ..., xn) defined by the following m + 2n constraints:

a11 x1 + a12 x2 + ... + a1n xn ≥ b1
a21 x1 + a22 x2 + ... + a2n xn ≥ b2
...
am1 x1 + am2 x2 + ... + amn xn ≥ bm
0 ≤ x1 ≤ 1
0 ≤ x2 ≤ 1
...
0 ≤ xn ≤ 1

and Ω2 is a set of vectors (x1, x2, ..., xn) defined by the following m + 2n constraints:
c11 x1 + c12 x2 + ... + c1n xn ≥ d1
c21 x1 + c22 x2 + ... + c2n xn ≥ d2
...
cm1 x1 + cm2 x2 + ... + cmn xn ≥ dm
0 ≤ x1 ≤ 1
0 ≤ x2 ≤ 1
...
0 ≤ xn ≤ 1

where aij, bi, cij, di are all constants, i = 1, 2, ..., m, j = 1, 2, ..., n. How can we use a linear optimization solver to check whether the two sets Ω1 and Ω2 are the same?
Solution. We first need to check whether each of the two sets is empty (by minimizing 0 subject to the constraints that define each set). If Ω1 = ∅ and Ω2 = ∅, they are the same. If exactly one of them is empty, they are different.
If Ω1 ≠ ∅ and Ω2 ≠ ∅, we refer to Example 15.30: if each constraint that defines Ω1 is redundant for Ω2 (i.e., any point in Ω2 automatically satisfies all constraints in Ω1), and each constraint that defines Ω2 is redundant for Ω1, then the two sets are the same; otherwise they are different.
Chapter 16
Advanced linear optimization
Basic linear optimization models have been introduced in Chapter 15. In this chap-
ter, more advanced linear optimization models will be covered.
Many practical problems are associated with the optimization of flow in a network,
such as transportation, telecommunication, and power transmission networks. A net-
work has nodes and arcs (arcs can also be called links). For example, a city logistics
network has intersections (nodes) and roads (arcs). Arcs are directional, i.e., cargoes/
passengers can only be transported from the tail node to the head node of an arc.
Therefore, a two-way street is usually considered as two arcs.
Example 16.1: Walmart has three warehouses (W1, W2, and W3) that store
the same type of product and five supermarkets (S1–S5) that need the products
in a city. The number of products available at each warehouse, the number of products needed at each supermarket, and the transportation cost per unit product ($) from each warehouse to each supermarket are shown below. Develop a linear optimization model to help Walmart make the decision of how to transport the products.

          S1   S2   S3   S4   S5   Available
W1         1    2    4    3    6         100
W2         5    2    4    4    4         200
W3         5    1    1    3    2          50
Needed    80   90   70   60   50
Solution. Let fij be the decision variables representing the number of products
transported from warehouse i = 1, 2, 3 to supermarket j = 1, 2, 3, 4, 5. The model is
as follows:
min f11 + 2f12 + 4f13 + 3f14 + 6f15 + 5f21 + 2f22 +
4f23 + 4f24 + 4f25 + 5f31 + f32 + f33 + 3f34 + 2f35
subject to
f11 + f12 + f13 + f14 + f15 ≤ 100
f21 + f22 + f23 + f24 + f25 ≤ 200
f31 + f32 + f33 + f34 + f35 ≤ 50
f11 + f21 + f31 = 80
f12 + f22 + f32 = 90
f13 + f23 + f33 = 70
f14 + f24 + f34 = 60
f15 + f25 + f35 = 50
fij ≥ 0, i = 1, 2, 3, j = 1, 2, 3, 4, 5.
Sometimes, we may use symbols to simplify the notation. We can define sets I = {1, 2, 3} and J = {1, 2, 3, 4, 5}, represent by pi the number of products available in warehouse i ∈ I, denote by qj the number of products needed by supermarket j ∈ J, and let cij be the cost of transporting one product from warehouse i ∈ I to supermarket j ∈ J. Note that pi, qj, and cij are all known parameters. Let fij be the decision variables representing the number of products transported from warehouse i ∈ I to supermarket j ∈ J. The model is as follows:
min ∑_{i∈I} ∑_{j∈J} cij fij
subject to
∑_{j∈J} fij ≤ pi, i ∈ I
∑_{i∈I} fij = qj, j ∈ J
fij ≥ 0, i ∈ I, j ∈ J.
It can be seen that the above model is very compact. Moreover, it is very general: we might save the two sets I and J and the parameters pi, qj, and cij in files, and program the model using a linear optimization solver. The next time Walmart plans the transportation, it only needs to change the input files.
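For instance, a minimal PuLP sketch of the compact model (our illustration, not the book's code), with the sets and parameters of Example 16.1 hard-coded instead of read from files, is:

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

I = [1, 2, 3]                            # warehouses
J = [1, 2, 3, 4, 5]                      # supermarkets
p = {1: 100, 2: 200, 3: 50}              # products available at each warehouse
q = {1: 80, 2: 90, 3: 70, 4: 60, 5: 50}  # products needed at each supermarket
c = {1: {1: 1, 2: 2, 3: 4, 4: 3, 5: 6},  # unit transportation costs
     2: {1: 5, 2: 2, 3: 4, 4: 4, 5: 4},
     3: {1: 5, 2: 1, 3: 1, 4: 3, 5: 2}}

prob = LpProblem("transportation", LpMinimize)
f = LpVariable.dicts("f", (I, J), lowBound=0)
prob += lpSum(c[i][j] * f[i][j] for i in I for j in J)
for i in I:
    prob += lpSum(f[i][j] for j in J) <= p[i]  # supply limit of warehouse i
for j in J:
    prob += lpSum(f[i][j] for i in I) == q[j]  # demand of supermarket j
prob.solve()
print(value(prob.objective))

Replacing the hard-coded dictionaries with data read from files would give exactly the reusability described above.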
Solution. Let fij be the decision variables representing the number of desks transported from warehouse i = 1, 2 to store j = 1, 2, 3. The model is as follows:
min f11 + 2f12 + 4f13 + 5f21 + 2f22 + 4f23
subject to
f11 + f12 + f13 ≤ 20
f21 + f22 + f23 ≤ 25
f11 + f21 = 12
f12 + f22 = 13
f13 + f23 = 14
fij ≥ 0, i = 1, 2, j = 1, 2, 3.
Example 16.3: This is a maximum flow problem. We have a crude oil pipeline
network shown in Figure 16.1. The arrows are the pipelines: crude oil can only
flow in the direction of the arrows, and the numbers on the arrows are the capacities
(maximum flow rates of crude oil) of the pipelines (1 000 tons/h). Node A is an oil
field and node E is a refinery factory. Develop a linear optimization model to find the
maximum flow of crude oil per hour from A to E.
In network flow problems, we can use the amount of cargo flow on each arc as
the decision variables. Such a formulation often needs the flow conservation equa-
tions: if a node does not export or import cargoes, then the total cargo inflow must
be equal to the total outflow; if the node exports cargoes, then the difference between
the total outflow and the total inflow equals the number of exported cargoes; and if
the node imports cargoes, then the difference between the total inflow and the total
outflow equals the number of imported cargoes. We first present a link flow formula-
tion for the maximum flow problem:
Solution. Let N := {A, B, C, D, E} be the set of nodes and A be the set of arcs:
A := {(A, B), (A, D), (B, C), (B, D), (C, E), (D, C), (D, E)}.
Let fij be the decision variables representing the flow on arcs (i, j) ∈ A. The model is as follows:
max fAB + fAD (maximize the total net outflow from node A)
subject to
fAB = fBC + fBD (flow conservation at node B)
fBC + fDC = fCE (flow conservation at node C)
fAD + fBD = fDC + fDE (flow conservation at node D)
fAB ≤ 12, fAD ≤ 4, fBC ≤ 6, fBD ≤ 2
fCE ≤ 5, fDC ≤ 23, fDE ≤ 7
fij ≥ 0, ∀(i, j) ∈ A.
Note that in the above link flow formulation, the flow conservation equations at
nodes B, C, and D ensure that the total inflow to node E, fCE + fDE , equals the total
outflow from node A, fAB + fAD .
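To verify the model numerically, the link flow formulation can be fed to a solver directly; below is a minimal PuLP sketch (our illustration) with the capacities of Example 16.3.

from pulp import LpProblem, LpMaximize, LpVariable, value

cap = {("A", "B"): 12, ("A", "D"): 4, ("B", "C"): 6, ("B", "D"): 2,
       ("C", "E"): 5, ("D", "C"): 23, ("D", "E"): 7}
prob = LpProblem("max_flow_A_to_E", LpMaximize)
f = {a: LpVariable("f_" + a[0] + a[1], lowBound=0, upBound=cap[a]) for a in cap}
prob += f[("A", "B")] + f[("A", "D")]                    # total outflow from A
prob += f[("A", "B")] == f[("B", "C")] + f[("B", "D")]   # conservation at B
prob += f[("B", "C")] + f[("D", "C")] == f[("C", "E")]   # conservation at C
prob += f[("A", "D")] + f[("B", "D")] == f[("D", "C")] + f[("D", "E")]  # at D
prob.solve()
print(value(prob.objective))  # 11.0 for the capacities above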
Example 16.4: In Example 16.3, develop a link flow linear optimization model
to find the maximum flow of crude oil per hour from node B to node E.
Solution. Let N := {A, B, C, D, E} be the set of nodes and A be the set of arcs:
A := {(A, B), (A, D), (B, C), (B, D), (C, E), (D, C), (D, E)}.
Let fij be the decision variables representing the flow on arcs (i, j) ∈ A. The model is as follows:
max fBC + fBD − fAB (maximize the total net outflow from node B)
subject to
fAB + fAD = 0 (flow conservation at node A)
fBC + fDC = fCE (flow conservation at node C)
fAD + fBD = fDC + fDE (flow conservation at node D)
fAB ≤ 12, fAD ≤ 4, fBC ≤ 6, fBD ≤ 2
fCE ≤ 5, fDC ≤ 23, fDE ≤ 7
fij ≥ 0, ∀(i, j) ∈ A.
Note that as we seek the maximum flow from B to E, it does not make sense to have flow to B. Therefore, it is correct to say that fAB = 0 and hence fAB can be removed from the model.
We can also use path flows as the decision variables in the maximum flow prob-
lem. To this end, we first need to enumerate all paths from the origin node to the
destination node.
Example 16.5: In Example 16.3, develop a path flow linear optimization model
to find the maximum flow of crude oil per hour from node A to node E.
Solution. From node A to node E, we have the following paths:
Path 1: A → B → C → E
Path 2: A → B → D → E
Path 3: A → B → D → C → E
Path 4: A → D → E
Path 5: A → D → C → E.
Let fi be the decision variables representing the flow on path i ∈ {1, 2, 3, 4, 5}. The path flow formulation is as follows:
max ∑_{i=1}^{5} fi
subject to
f1 + f2 + f3 ≤ 12 (capacity on arc (A, B))
f4 + f5 ≤ 4 (capacity on arc (A, D))
f1 ≤ 6 (capacity on arc (B, C))
f2 + f3 ≤ 2 (capacity on arc (B, D))
f1 + f3 + f5 ≤ 5 (capacity on arc (C, E))
f3 + f5 ≤ 23 (capacity on arc (D, C))
f2 + f4 ≤ 7 (capacity on arc (D, E))
fi ≥ 0, i = 1, 2, 3, 4, 5.
Example 16.6: In Example 16.3, develop a path flow linear optimization model to find the maximum flow of crude oil per hour from node B to node E.
Solution. From node B to node E, we have the following paths:
Path 1: B → C → E
Path 2: B → D → E
Path 3: B → D → C → E.
Let fi be the decision variable representing the flow on path i ∈ {1, 2, 3}. The model is as follows:
max ∑_{i=1}^{3} fi
subject to
f1 ≤ 6 (capacity on arc (B, C))
f2 + f3 ≤ 2 (capacity on arc (B, D))
f1 + f3 ≤ 5 (capacity on arc (C, E))
f3 ≤ 23 (capacity on arc (D, C))
f2 ≤ 7 (capacity on arc (D, E))
fi ≥ 0, i = 1, 2, 3.
Example 16.7: Similar to Example 16.3, develop a link flow linear optimization
model to find the maximum flow of crude oil per hour from node A to node E in the
network shown in Figure 16.2.
Solution. Let N := {A, B, C, D, E} be the set of nodes and A be the set of arcs:
A := {(A, B), (A, D), (B, C), (B, D), (C, A), (C, E), (D, C), (D, E), (E, C)}.
Let fij be the decision variables representing the flow on arcs (i, j) ∈ A. The model is as follows:
max fAB + fAD − fCA
subject to
fAB = fBC + fBD (flow conservation at node B)
fBC + fDC + fEC = fCA + fCE (flow conservation at node C)
fAD + fBD = fDC + fDE (flow conservation at node D)
fAB ≤ 12, fAD ≤ 4, fBC ≤ 6, fBD ≤ 2
fCA ≤ 1, fCE ≤ 5, fDC ≤ 23, fDE ≤ 7, fEC ≤ 3
fij ≥ 0, ∀(i, j) ∈ A.
A := {(A, B), (A, D), (B, C), (B, D), (C, E), (D, C), (D, E)}.
Let fij be the decision variables representing the flow on arcs (i, j) ∈ A. The model is as follows:
min 2(6fAB + 5fCE + 8fDC) + 3(4fAD + 6fBC + 2fBD + 7fDE)
subject to
fAB + fAD = 0 (flow conservation at node A)
fBC + fBD − fAB = 12 (flow conservation at node B)
fBC + fDC = fCE (flow conservation at node C)
fAD + fBD = fDC + fDE (flow conservation at node D)
fCE + fDE = 12 (flow conservation at node E)
fAB ≤ 10, fCE ≤ 10, fDC ≤ 10
fAD ≤ 5, fBC ≤ 5, fBD ≤ 5, fDE ≤ 5
fij ≥ 0, ∀(i, j) ∈ A.
Example 16.10: In Example 16.8, the price of a unit of cargo in Los Angeles is
100, and 115 in Chicago (the price has the same unit as the transportation cost). A
company buys cargoes in Los Angeles, transports them to Chicago, and sells them
to make a profit. Develop a linear optimization model to find the optimal number of
cargoes to purchase in Los Angeles that maximizes the profit.
Solution. Let N := {A, B, C, D, E} be the set of nodes and A be the set of arcs:
A := {(A, B), (A, D), (B, C), (B, D), (C, E), (D, C), (D, E)}.
Let fij be the decision variables representing the flow on arcs (i, j) ∈ A. Denote by x the units of cargoes to purchase in Los Angeles. The model is as follows:
max (115 − 100)x − 2(6fAB + 5fCE + 8fDC) − 3(4fAD + 6fBC + 2fBD + 7fDE)
subject to
fAB + fAD = 0 (flow conservation at node A)
fBC + fBD − fAB = x (flow conservation at node B)
fBC + fDC = fCE (flow conservation at node C)
fAD + fBD = fDC + fDE (flow conservation at node D)
fCE + fDE = x (flow conservation at node E)
fAB ≤ 10, fCE ≤ 10, fDC ≤ 10
fAD ≤ 5, fBC ≤ 5, fBD ≤ 5, fDE ≤ 5
x ≥ 0, fij ≥ 0, ∀(i, j) ∈ A.
Figure 16.5 A road transportation network with more than one shortest path from A to E
Let fij^AE be the decision variables representing the flow of commodities with origin A and destination E (i.e., clothes) on arcs (i, j) ∈ A. Let fij^DE be the decision variables representing the flow of commodities with origin D and destination E (i.e., machines) on arcs (i, j) ∈ A. The model is as follows:
min 2[6(fAB^AE + fAB^DE) + 5(fCE^AE + fCE^DE) + 8(fDC^AE + fDC^DE)] + 3[4(fAD^AE + fAD^DE) + 6(fBC^AE + fBC^DE) + 2(fBD^AE + fBD^DE) + 7(fDE^AE + fDE^DE)]
subject to
Flow conservation equations:
fAB^AE + fAD^AE = 12
fAB^AE = fBC^AE + fBD^AE
fBC^AE + fDC^AE = fCE^AE
fAD^AE + fBD^AE = fDC^AE + fDE^AE
fCE^AE + fDE^AE = 12
fAB^DE + fAD^DE = 0
fAB^DE = fBC^DE + fBD^DE
fBC^DE + fDC^DE = fCE^DE
(fDC^DE + fDE^DE) − (fAD^DE + fBD^DE) = 2
fCE^DE + fDE^DE = 2
Capacity constraints:
fAB^AE + fAB^DE ≤ 10, fCE^AE + fCE^DE ≤ 10, fDC^AE + fDC^DE ≤ 10
fAD^AE + fAD^DE ≤ 5, fBC^AE + fBC^DE ≤ 5, fBD^AE + fBD^DE ≤ 5, fDE^AE + fDE^DE ≤ 5
Nonnegativity constraints:
fij^AE ≥ 0, fij^DE ≥ 0, ∀(i, j) ∈ A.
Example 16.13: Figure 16.6 shows a liner container shipping network. The circles
represent ports. There are three routes r1, r2, and r3. The capacity of a ship deployed
on route r1 is E1 (TEUs). E2 and E3 have similar meanings. The volumes of contain-
ers between different OD pairs are shown in the figure. For example, q12 is the demand (TEUs/week) from port 1 to port 2. The profit for transporting one TEU from port 1 to port 2 is g12 ($/TEU); g13 and g32 have similar meanings. Assume that container handling costs are 0. Develop a linear optimization model for finding the maximum profit that
can be gained from transporting containers.
It should be noted that container liner shipping networks are usually very sparse. For example, if we assume that there is at most one arc from one node to another, then a network with 100 nodes has at most 100 × 99 = 9 900 links. If the number of links is much smaller than this maximum number, then we say that the network is sparse. A sparse network does not have many paths from one node to another, and hence we can enumerate all paths.
Figure 16.6 A container liner shipping network with direct delivery and transshipment
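Path enumeration itself is easy to script. A minimal networkx sketch (our illustration), run on the pipeline network of Example 16.3, reproduces the five paths listed in Example 16.5:

import networkx as nx

G = nx.DiGraph([("A", "B"), ("A", "D"), ("B", "C"), ("B", "D"),
                ("C", "E"), ("D", "C"), ("D", "E")])
for path in nx.all_simple_paths(G, "A", "E"):
    print(" -> ".join(path))  # prints the five A-to-E paths of Example 16.5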
Solution. The set of OD pairs is W = {(1, 2), (1, 3), (3, 2)}. Let Hod be the set of itineraries (paths) for OD pair (o, d) ∈ W. To simplify the notation, we use <r, i> to represent leg i of route r. Then H12 consists of the following:
h1: <1, 1>
h2: <3, 1> → <2, 2>
H13 consists of:
h3: <3, 1>
h4: <1, 1> → <2, 1>
H32 consists of:
h5: <2, 2>
h6: <3, 2> → <1, 1>.
Define H := ∪_{(o,d)∈W} Hod. Let yh be the decision variables representing the flow on itinerary h ∈ H (TEUs/week). The model is as follows:
max ∑_{(o,d)∈W} god ∑_{h∈Hod} yh (maximize the total profit)
subject to
∑_{h∈Hod} yh ≤ qod, (o, d) ∈ W (limited demand)
y1 + y4 + y6 ≤ E1 (capacity on leg <1, 1>)
y4 ≤ E2 (capacity on leg <2, 1>)
y2 + y5 ≤ E2 (capacity on leg <2, 2>)
y2 + y3 ≤ E3 (capacity on leg <3, 1>)
y6 ≤ E3 (capacity on leg <3, 2>)
yh ≥ 0, h ∈ H.
We conclude this subsection by stating the problems discussed in mathematical language. We are given a network denoted by G = (N, A), where N is the set of nodes and A ⊆ N × N is the set of arcs. (i) In the maximum flow problem, each arc (i, j) ∈ A has a capacity denoted by sij ≥ 0, and the objective is to find the maximum flow from node nsource ∈ N to node nsink ∈ N \ {nsource}. (ii) In the shortest path problem, each arc (i, j) ∈ A has a unit cost denoted by cij ≥ 0, and the objective is to find the minimum cost path from node nsource ∈ N to node nsink ∈ N \ {nsource}. (iii) In the minimum cost flow problem, each arc (i, j) ∈ A has a unit cost denoted by cij ≥ 0 and a capacity denoted by sij ≥ 0, and the objective is to find the minimum total cost for transporting a given number of units of cargoes from node nsource ∈ N to node nsink ∈ N \ {nsource}. (iv) In the multi-commodity flow problem†, each arc (i, j) ∈ A has a unit cost denoted by cij ≥ 0 and a capacity denoted by sij ≥ 0. There is a set of OD pairs denoted by W ⊆ N × N. Cargoes between different OD pairs are different. Between OD pair (o, d) ∈ W, qod units of cargoes must be transported. The objective is to find the minimum total cost for transporting all cargoes between all the OD pairs in W‡.
† More linear optimization formulations for the multi-commodity flow problem can be found in Reference 2.
‡ In fact, the maximum flow problem and the shortest path problem are special cases of the minimum cost flow problem; the minimum cost flow problem is a special case of the multi-commodity flow problem.
Example 16.16: Reconsider Example 16.1. How do you add dummy§ nodes and
links to transform the problem to a minimum cost flow problem?
Solution. See Figure 16.9. The nodes W1–W3 represent the three warehouses,
and S1–S5 represent the five supermarkets. The links from a warehouse to a super-
market represent the transportation of products. The unit cost of such a link is equal
to the transportation cost, and the capacity is infinite. We further add a dummy
source node that is connected to each warehouse. A dummy arc from the dummy
source node to a warehouse has no cost and a capacity that is equal to the number of
products available in the warehouse. A dummy sink node is added and is connected
to each supermarket. A dummy arc from a supermarket to the dummy sink node has
no cost and a capacity that is equal to the number of products required by the super-
market. The problem now becomes: how to transport 80 + 90 + 70 + 60 + 50 = 350 products from the dummy source node to the dummy sink node at the lowest total cost.
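The construction can be checked with the networkx package; the sketch below (our illustration) builds exactly the dummy source, dummy sink, and dummy arcs described above, with the unit costs of Example 16.1.

import networkx as nx

supply = {"W1": 100, "W2": 200, "W3": 50}
demand = {"S1": 80, "S2": 90, "S3": 70, "S4": 60, "S5": 50}
cost = {"W1": [1, 2, 4, 3, 6], "W2": [5, 2, 4, 4, 4], "W3": [5, 1, 1, 3, 2]}

G = nx.DiGraph()
G.add_node("source", demand=-350)  # the dummy source sends out 350 products
G.add_node("sink", demand=350)     # the dummy sink receives 350 products
for w, s in supply.items():
    G.add_edge("source", w, capacity=s, weight=0)  # dummy arcs: no cost
for k, (m, d) in enumerate(demand.items()):
    for w in supply:
        G.add_edge(w, m, weight=cost[w][k])        # uncapacitated cost arcs
    G.add_edge(m, "sink", capacity=d, weight=0)    # dummy arcs: no cost
flow = nx.min_cost_flow(G)
print(nx.cost_of_flow(G, flow))  # minimum total transportation cost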
Example 16.17: Reconsider Example 16.12. How do you add dummy nodes
and links to transform the problem into a minimum cost flow problem?
Solution. See Figure 16.10. The transportation network is not changed: the unit
cost and capacity of rail and truck links are the same as their physical meanings. A
dummy source node is added and is connected to nodes A and D by two dummy arcs.
The dummy arc from the dummy source node to node A has no cost and a capacity of
12; the dummy arc from the dummy source node to node D has no cost and a capac-
ity of 2. A dummy sink node is added and is connected by a dummy arc from node E.
The dummy arc has no cost and infinite capacity. The problem now becomes: how to transport 12 + 2 = 14 units of cargoes from the dummy source node to the dummy sink node at minimum cost.
§ The word “dummy” means something that does not exist or is imaginary. For example, we can almost always add a dummy arc to a network while imposing that its capacity is 0.
It should be noted that this example is a special multi-commodity flow problem
in that all commodities have the same destination. Therefore, we can transform it
into a minimum cost flow problem. If all commodities have the same origin, we can
also transform the problem into a minimum cost flow problem. However, a general
multi-commodity flow problem cannot be transformed into a minimum cost flow
problem.
Because of trade imbalance (for example, China exports more cargoes to the United States than it imports from the United States), some ports (e.g., ports in the United States) have surplus empty containers, and some ports (e.g., ports in China) are short of empty containers (these ports are called deficit ports). Hence, empty containers have to be repositioned from surplus ports to deficit ports. Unlike laden containers, all empty containers can be considered identical.
Example 16.18: This is an empty container repositioning problem. Reconsider Example 16.13. Suppose that the remaining capacity of leg <r, i> after carrying laden containers is Ẽri. The remaining capacity is used for repositioning empty containers. The number of surplus empty containers at port p ∈ {1, 2, 3} is denoted by q̃p. Here q̃1 < 0, q̃2 > 0, and q̃3 > 0, which means port 1 is a deficit port and ports 2 and 3 are surplus ports. Moreover, ∑_{p=1}^{3} q̃p = 0. We need to check whether the network has sufficient capacity to reposition all the empty containers from ports 2 and 3 to port 1. How to transform the problem into a maximum flow problem?
Solution. Since all empty containers are identical, we can add a dummy source node, connected to each surplus port with an arc whose capacity equals the volume of the surplus empty containers at the port, and a dummy sink node, connected from each deficit port with an arc whose capacity equals the volume of the deficit empty containers at the port, as shown in Figure 16.11. Now the problem is transformed into a maximum flow problem: all empty containers can be repositioned if and only if the maximum flow from the source to the sink equals q̃2 + q̃3.
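A networkx sketch of this check follows (our illustration); since Figure 16.11 is not reproduced here, the legs between ports and the remaining capacities are hypothetical placeholders.

import networkx as nx

surplus = {2: 40, 3: 60}   # hypothetical surplus volumes at ports 2 and 3
deficit = {1: 100}         # hypothetical deficit volume at port 1

G = nx.DiGraph()
G.add_edge(2, 1, capacity=70)  # hypothetical remaining leg capacities
G.add_edge(3, 1, capacity=50)
G.add_edge(3, 2, capacity=30)
for p, s in surplus.items():
    G.add_edge("source", p, capacity=s)  # dummy arcs from the dummy source
for p, d in deficit.items():
    G.add_edge(p, "sink", capacity=d)    # dummy arcs to the dummy sink
flow_value, _ = nx.maximum_flow(G, "source", "sink")
print(flow_value == sum(surplus.values()))  # True iff all containers can be repositioned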
Example 16.19: Reconsider Example 16.18. Suppose that q̃1 < 0, q̃2 > 0, q̃3 < 0, and ∑_{p=1}^{3} q̃p = 0. How do you add dummy nodes and links to transform the empty container repositioning problem into a maximum flow problem?
Solution. See Figure 16.12. Now the problem is transformed into a maximum flow problem: all empty containers can be repositioned if and only if the maximum flow from the source to the sink equals q̃2.
Figure 16.11 A liner shipping network for empty container repositioning with
dummy nodes and links
Figure 16.12 A liner shipping network for empty container repositioning with
dummy nodes and links
Some seemingly nonlinear problems can be transformed into linear problems. For example, the constraint |x + 2y| ≤ 3 is equivalent to the combination of x + 2y ≤ 3 and x + 2y ≥ −3; the constraint max{x − 2y, 2x + z, x + y − z} ≤ z + 1 is equivalent to the combination of x − 2y ≤ z + 1, 2x + z ≤ z + 1, and x + y − z ≤ z + 1.¶
¶ The notation “max” followed by a set means the element in the set with the largest value.
function is no longer convex. However, we can easily check that the Hessian of x ln x + y ln y − (x + y) ln(x + y) is positive semi-definite. Hence, it is convex in x and y and can be linearized using tangent planes.
Finally, some nonlinear constraints cannot be approximated by linear constraints with high accuracy, e.g., x1 sin x2 and x1 x2^2, because they are non-convex.
16.4 Practice
Example 16.21: Consider the following service with the port rotation:
Shanghai (S, 1) → Hong Kong (H, 2) → Singapore (P, 3) → Rotterdam (R, 4) → Shanghai (S, 1)
Ships with a capacity of 6000 (TEUs) are deployed to provide a weekly frequency. The container shipment demand (TEUs/week) is as follows: Shanghai to Rotterdam qSR = 3500, Hong Kong to Rotterdam qHR = 2500, Singapore to Rotterdam qPR = 1500, and Rotterdam to Shanghai qRS = 2500. The profit of transporting one TEU ($/TEU) from Shanghai to Rotterdam is gSR = 2500, Hong Kong to Rotterdam gHR = 2300, Singapore to Rotterdam gPR = 2000, and Rotterdam to Shanghai gRS = 1500. Develop an optimization model to evaluate the maximum profit ($/week) the company can make.
Solution. Let ySR, yHR, yPR, and yRS be the decision variables representing the volumes of containers transported from S to R, H to R, P to R, and R to S, respectively. The model is as follows:
max 2500ySR + 2300yHR + 2000yPR + 1500yRS
subject to
ySR ≤ 6000 (ship capacity constraint on the leg from S to H)
ySR + yHR ≤ 6000 (ship capacity constraint on the leg from H to P)
ySR + yHR + yPR ≤ 6000 (ship capacity constraint on the leg from P to R)
yRS ≤ 6000 (ship capacity constraint on the leg from R to S)
0 ≤ ySR ≤ 3500
0 ≤ yHR ≤ 2500
0 ≤ yPR ≤ 1500
0 ≤ yRS ≤ 2500.
Note: It is also correct if you only have the following constraints because the other three constraints are implied by them:
ySR + yHR + yPR ≤ 6000 (ship capacity constraint on the leg from P to R)
0 ≤ ySR ≤ 3500
0 ≤ yHR ≤ 2500
0 ≤ yPR ≤ 1500
0 ≤ yRS ≤ 2500.
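A minimal PuLP sketch of this model (our illustration) is:

from pulp import LpProblem, LpMaximize, LpVariable, value

prob = LpProblem("liner_profit", LpMaximize)
ySR = LpVariable("ySR", lowBound=0, upBound=3500)
yHR = LpVariable("yHR", lowBound=0, upBound=2500)
yPR = LpVariable("yPR", lowBound=0, upBound=1500)
yRS = LpVariable("yRS", lowBound=0, upBound=2500)
prob += 2500 * ySR + 2300 * yHR + 2000 * yPR + 1500 * yRS
prob += ySR <= 6000              # leg from S to H
prob += ySR + yHR <= 6000        # leg from H to P
prob += ySR + yHR + yPR <= 6000  # leg from P to R
prob += yRS <= 6000              # leg from R to S
prob.solve()
print(value(prob.objective))  # 18250000.0 $/week for the data above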
Example 16.22: Consider the following service with the port rotation:
Shanghai (S, 1) → Hong Kong (H, 2) → Singapore (P, 3) → Rotterdam (R, 4) → Shanghai (S, 1)
Ships with a capacity of 6000 (TEUs) are deployed to provide a weekly frequency. The container shipment demand (TEUs/week) is as follows: Shanghai to Rotterdam qSR = 3500, Hong Kong to Rotterdam qHR = 2500, Singapore to Rotterdam qPR = 1500, Rotterdam to Shanghai qRS = 2500, Rotterdam to Hong Kong qRH = 1500, Rotterdam to Singapore qRP = 500, and Shanghai to Singapore qSP = 1400. The profit of transporting one TEU ($/TEU) from Shanghai to Rotterdam is gSR = 2500, Hong Kong to Rotterdam gHR = 2300, Singapore to Rotterdam gPR = 2000, Rotterdam to Shanghai gRS = 1500, Rotterdam to Hong Kong gRH = 1500, Rotterdam to Singapore gRP = 1500, and Shanghai to Singapore gSP = 1000. Develop an optimization model to evaluate the maximum profit ($/week) the company can make.
Solution. Let ySR, yHR, yPR, yRS, yRH, yRP, and ySP be the decision variables representing the volumes of containers transported from S to R, H to R, P to R, R to S, R to H, R to P, and S to P, respectively. The model is as follows:
max 2500ySR + 2300yHR + 2000yPR + 1500yRS + 1500yRH + 1500yRP + 1000ySP
subject to
ySR + yRH + yRP + ySP ≤ 6000 (ship capacity constraint on the leg from S to H)
ySR + yHR + yRP + ySP ≤ 6000 (ship capacity constraint on the leg from H to P)
ySR + yHR + yPR ≤ 6000 (ship capacity constraint on the leg from P to R)
yRS + yRH + yRP ≤ 6000 (ship capacity constraint on the leg from R to S)
0 ≤ ySR ≤ 3500
0 ≤ yHR ≤ 2500
0 ≤ yPR ≤ 1500
0 ≤ yRS ≤ 2500
0 ≤ yRH ≤ 1500
0 ≤ yRP ≤ 500
0 ≤ ySP ≤ 1400.
Example 16.23: Consider the following service with the port rotation:
Shanghai (S, 1) → Hong Kong (H, 2) → Singapore (P, 3) → Rotterdam (R, 4) → Hong Kong (H, 5) → Shanghai (S, 1)
Ships with a capacity of 6000 (TEUs) are deployed to provide a weekly frequency. The container shipment demand (TEUs/week) is as follows: Shanghai to Rotterdam qSR = 3500, Hong Kong to Rotterdam qHR = 2500, Singapore to Rotterdam qPR = 1500, Rotterdam to Shanghai qRS = 2500, Rotterdam to Hong Kong qRH = 1500, Rotterdam to Singapore qRP = 500, and Shanghai to Singapore qSP = 1400. The profit of transporting one TEU ($/TEU) from Shanghai to Rotterdam is gSR = 2500, Hong Kong to Rotterdam gHR = 2300, Singapore to Rotterdam gPR = 2000, Rotterdam to Shanghai gRS = 1500, Rotterdam to Hong Kong gRH = 1500, Rotterdam to Singapore gRP = 1500, and Shanghai to Singapore gSP = 1000.
(i) Develop an optimization model to evaluate the maximum profit ($/week) the
company can make.
(ii) Write down two distinct feasible solutions, and calculate their objective
function values.
(iii) Is this model infeasible? Why? Is this model unbounded? Why? Does this
model have an optimal solution? Why?
Solution. (i) Let ySR, yHR, yPR, yRS, yRH, yRP, and ySP be the decision variables representing the volumes of containers transported from S to R, H to R, P to R, R to S, R to H, R to P, and S to P, respectively. The model is as follows:
max 2500ySR + 2300yHR + 2000yPR + 1500yRS + 1500yRH + 1500yRP + 1000ySP
subject to
ySR + yRP + ySP ≤ 6000 (ship capacity constraint on the leg from S to H)
ySR + yHR + yRP + ySP ≤ 6000 (ship capacity constraint on the leg from H to P)
ySR + yHR + yPR ≤ 6000 (ship capacity constraint on the leg from P to R)
yRS + yRH + yRP ≤ 6000 (ship capacity constraint on the leg from R to H)
yRS + yRP ≤ 6000 (ship capacity constraint on the leg from H to S)
0 ≤ ySR ≤ 3500
0 ≤ yHR ≤ 2500
0 ≤ yPR ≤ 1500
0 ≤ yRS ≤ 2500
0 ≤ yRH ≤ 1500
0 ≤ yRP ≤ 500
0 ≤ ySP ≤ 1400.
(ii) One feasible solution is ySR = yHR = yPR = yRS = yRH = yRP = ySP = 0, and its objective function value is 0. Another feasible solution is ySR = 1 with all the other variables equal to 0, and its objective function value is 2500. (iii) The model is feasible, because we have found two feasible solutions in (ii). The feasible set of the model is bounded, because any feasible solution satisfies 0 ≤ yod ≤ qod for every OD pair, so no variable can grow without limit. Hence, the model is not unbounded. Since this is a linear optimization model that is feasible and not unbounded, it has an optimal solution.
Example 16.24: Reconsider the CC1 service with the port rotation:
Shanghai (S, 1) → Kwangyang (K, 2) → Pusan (P, 3) → Los Angeles (L, 4) → Oakland (O, 5) → Pusan (P, 6) → Kwangyang (K, 7) → Shanghai (S, 1)
Ships with a capacity of E (TEUs) are deployed to provide a weekly frequency. The container shipment demand (TEUs/week) is as follows: Shanghai to Los Angeles qSL, Pusan to Los Angeles qPL, Oakland to Shanghai qOS, and Shanghai to Oakland qSO. The profit of transporting one TEU ($/TEU) from Shanghai to Los Angeles is gSL, Pusan to Los Angeles gPL, Oakland to Shanghai gOS, and Shanghai to Oakland gSO. The values of E, the q's, and the g's are all given.
(i) Develop an optimization model to evaluate the maximum profit ($/week) the company can make.
(ii) If qSL is increased, how will the feasible set of the model in (i) change, and how will the optimal objective function value change?
(iii) If E is increased, how will the feasible set of the model in (i) change, and how will the optimal objective function value change?
(iv) If the values of gSL, gPL, gOS, and gSO are all doubled simultaneously, how will the feasible set of the model in (i) change, and how will the optimal objective function value change?
Solution. (i) Define the set of OD pairs W = {(S, L), (P, L), (O, S), (S, O)}. Let yod be the decision variables representing the volumes of containers transported between OD pairs (o, d) ∈ W. The model is as follows:
max ∑_{(o,d)∈W} god yod
subject to
ySL + ySO ≤ E (ship capacity constraint on the leg from S to K)
ySL + ySO ≤ E (ship capacity constraint on the leg from K to P)
ySL + yPL + ySO ≤ E (ship capacity constraint on the leg from P to L)
ySO ≤ E (ship capacity constraint on the leg from L to O)
yOS ≤ E (ship capacity constraint on the leg from O to P)
yOS ≤ E (ship capacity constraint on the leg from P to K)
yOS ≤ E (ship capacity constraint on the leg from K to S)
0 ≤ yod ≤ qod, (o, d) ∈ W.
Note: It is also correct if you only have the following constraints because the other five constraints are implied by them:
ySL + yPL + ySO ≤ E (ship capacity constraint on the leg from P to L)
yOS ≤ E (ship capacity constraint on the leg from O to P)
0 ≤ yod ≤ qod, (o, d) ∈ W.
(ii) The feasible set will be larger or unchanged; the optimal objective function value
will be larger or unchanged.
(iii) The feasible set will be larger or unchanged; the optimal objective function
value will be larger or unchanged.
(iv) The feasible set will be unchanged; the optimal objective function value
will be doubled.
Example 16.25: (i) Write down an example of a linear optimization model with
two variables and three constraints that is infeasible.
(ii) Write down an example of a linear optimization model with two variables
and three constraints that is unbounded.
(iii) Write down an example of a linear optimization model with two variables
and three constraints that has an infinite number of optimal solutions.
(iv) Write down an example of a linear optimization model with two variables
and three constraints whose optimal objective function value is 123.
Solution. Note: The answers are not unique.
(i) Minimize x + y subject to x ≤ 0, x ≥ 2, y ≥ 0.
(ii) Minimize x + y subject to x ≤ 0, x ≤ −2, y ≥ 0.
(iii) Minimize x + y subject to x + y ≥ 1, x ≥ 0, y ≥ 0.
(iv) Minimize x + y subject to x ≥ 123, x ≥ −200, y ≥ 0.
Example 16.26: We have the following linear optimization model:
min x − y
subject to
x + y ≥ 3
y ≤ 2
x − 2y = 1.
(i) Transform the model to the canonical form that maximizes c^T x subject to Ax ≤ b and x ≥ 0.
(ii) Transform the model to the standard form that maximizes c^T x subject to Ax = b and x ≥ 0.
Note: The answers are not unique.
Solution. Letting x = u − v, u ≥ 0, v ≥ 0 and w = 2 − y, w ≥ 0, the model is transformed to
min (u − v) − (2 − w)
subject to
(u − v) + (2 − w) ≥ 3
(u − v) − 2(2 − w) = 1
u ≥ 0
v ≥ 0
w ≥ 0.
(i) The model is equivalent to
max −(u − v) + (2 − w)
subject to
−(u − v) − (2 − w) ≤ −3
(u − v) − 2(2 − w) ≤ 1
−(u − v) + 2(2 − w) ≤ −1
u ≥ 0
v ≥ 0
w ≥ 0
which is equivalent to
max −u + v − w
subject to
−u + v + w ≤ −1
u − v + 2w ≤ 5
−u + v − 2w ≤ −5
u ≥ 0
v ≥ 0
w ≥ 0.
Hence, in the canonical form,
x = [u v w]^T, A = [[−1 1 1], [1 −1 2], [−1 1 −2]], b = [−1 5 −5]^T, and c = [−1 1 −1]^T,
with x = u − v and y = 2 − w.
(iv) Write down a linear optimization model satisfying: first, it has two vari-
ables; second, its objective function is not a constant; third, its feasible set is a pen-
tagon; and fourth, it has an infinite number of optimal solutions. Use the graphical
method to find an optimal solution.
Solution. (i) See Figure 16.15. The optimal solution corresponds to the intersection of the lines x = 0 and x + y = 2. Therefore, the optimal solution is x = 0, y = 2. The optimal objective function value is 4.
(ii) Any of the following solutions is correct: min x, min y, min x + y, or max x + y.
(iii) An example is
min x + y
subject to
x ≤ 2
y ≤ 2
x + y ≤ 3
x ≥ 0
y ≥ 0.
The graphical method is not provided here. The optimal solution is x = 0, y = 0, and the optimal objective function value is 0.
(iv) We can change the objective function in the model in (iii) to min x. The graphical method is not provided here. An optimal solution is x = 0, y = 0, and the optimal objective function value is 0.
Example 16.28: Use the graphical method to find the optimal solution to the
model below:
max x + 2y
subject to
x + y = 1
x + y ≤ 2
x ≥ 0
y ≥ 0.
4x + 5y + 6z ≤ 10000
7x + 8y + 9z ≤ 100
10x + 11y + 12z ≤ 200
13x + 14y + 15z ≤ 300
x ≥ 0
y ≥ 0
z ≥ 0.
max fAB + fAC + fAF (maximize the total outflow from node A)
subject to
fAB + fDB = fBC + fBE
(flow conservation at node B)
fAC + fBC = fCD + fCE
(flow conservation at node C)
fCD + fED = fDB + fDF
(flow conservation at node D)
fBE + fCE = fED + fEF
(flow conservation at node E)
fAB ≤ sAB, fAC ≤ sAC, fAF ≤ sAF, fBC ≤ sBC, fBE ≤ sBE
fCD ≤ sCD, fCE ≤ sCE, fDB ≤ sDB, fDF ≤ sDF, fED ≤ sED, fEF ≤ sEF
fij ≥ 0, ∀(i, j) ∈ A.
It is correct to say that there is no flow on (E,D), (A,F), (D,F), and (E,F). Therefore,
the following model is also correct:
min cAB fAB + cAC fAC + cBC fBC + cBE fBE + cCD fCD + cCE fCE + cDB fDB
subject to
fAB + fAC = 10
fAB + fDB = fBC + fBE
fAC + fBC = fCD + fCE
fCD = fDB
fBE + fCE = 10
fAB ≤ sAB, fAC ≤ sAC, fBC ≤ sBC, fBE ≤ sBE
fCD ≤ sCD, fCE ≤ sCE, fDB ≤ sDB
fij ≥ 0, ∀(i, j) ∈ A \ {(E, D), (A, F), (D, F), (E, F)}.
Example 16.32: Figure 16.18 shows a liner container shipping network. The circles represent ports. There are four routes r1, r2, r3, and r4 that provide weekly shipping services. The capacity of a ship deployed on route r1 is E1 (TEUs). E2, E3, and E4 have similar meanings. The volumes of containers between different OD pairs are shown in the figure. For example, q15 is the demand (TEUs/week) from port 1 to port 5. The profit for transporting one TEU from port 1 to port 5 is g15 ($/TEU). g17, g25, g36, and g46 have similar meanings. Assume that container handling costs are 0. Develop a path-flow linear optimization model to find the maximum profit that can be gained from transporting containers. You must write down the details of the objective function and constraints without using “∑” or “∀” except in the nonnegativity constraints.
Solution. The set of OD pairs is W = {(1, 5), (1, 7), (2, 5), (3, 6), (4, 6)}. Let Hod be the set of itineraries (paths) for OD pair (o, d) ∈ W. To simplify the notation, we use <r, i> to represent leg i of route r. Then H15 consists of the following:
h1: <1, 2> → <1, 3> → <3, 1>
H17 consists of the following:
max g15 y1 + g17 y2 + g25 y3 + g36 y4 + g46 y5 (maximize the total profit)
subject to
y1 ≤ q15
y2 ≤ q17
y3 ≤ q25
y4 ≤ q36
y5 ≤ q46
y1 + y2 ≤ E1 (capacity on leg <1, 2>)
y1 + y2 + y3 ≤ E1 (capacity on leg <1, 3>)
y4 ≤ E2 (capacity on leg <2, 1>)
y1 + y2 + y3 + y4 + y5 ≤ E3 (capacity on leg <3, 1>)
y2 + y4 + y5 ≤ E3 (capacity on leg <3, 2>)
y2 ≤ E4 (capacity on leg <4, 1>)
yh ≥ 0, h ∈ H.
Example 16.33: Propose a linear optimization model that satisfies all of the fol-
lowing requirements or answer “such a model does not exist”: (i) it is infeasible,
(ii) after removing the first constraint, it has exactly one optimal solution; (iii) after
removing the first two constraints, it has an infinite number of optimal solutions; and
(iv) after removing the first three constraints, it is unbounded.
Solution. An example can be
max x
subject to
x ≤ −1
x + y ≤ 1
x ≤ 1
x ≥ 0
y ≥ 0.
The feasible set is non-empty. Transform the model to a linear optimization model.
Solution. Define two auxiliary decision variables u and v. The model is equivalent to
min u − v
subject to
u ≥ x, u ≥ y, u ≥ z
v ≤ x, v ≤ y, v ≤ z
a11 x1 + a12 x2 ≤ b1
a21 x1 + a22 x2 ≤ b2
a31 x1 + a32 x2 ≤ b3
...
am1 x1 + am2 x2 ≤ bm
0 ≤ x1 ≤ 1
0 ≤ x2 ≤ 1.
Suppose that we know all points in the feasible set form a square. Given a linear optimization solver, how to calculate the area of the square?
Example 16.38: We have a linear optimization model with an infinite number
of optimal solutions. How to find two optimal solutions with the largest L1 distance?
Example 16.39: Given a linear optimization model, how to use a linear opti-
mization solver to check whether all points in the feasible set form a line segment?
Example 16.40: Walmart has three warehouses (W1, W2, and W3) that store
the same type of product and five supermarkets (S1–S5) that need the products in a
city. The number of products available at each warehouse, the number of products
needed at each supermarket, and the transportation cost per unit product from each
warehouse to each supermarket are shown below. Develop a linear optimization
model to help Walmart make the decision of how to transport the products. You must write down the details of the objective function and constraints without using “∑” or “∀” except in the nonnegativity constraints.
Solution. Let fij be the decision variables representing the number of products transported from warehouse i = 1, 2, 3 to supermarket j = 1, 2, 3, 4, 5. The model is as follows:
min f11 + 2f12 + 4f13 + 3f14 + 6f15 + 5f21 + 2f22 + 4f23 + 4f24 + 4f25 + 5f31 + f32 + f33 + 3f34 + 2f35
subject to
f11 + f12 + f13 + f14 + f15 ≤ 100
f21 + f22 + f23 + f24 + f25 ≤ 200
f31 + f32 + f33 + f34 + f35 ≤ 50
f11 + f21 + f31 = 80
f12 + f22 + f32 = 90
f13 + f23 + f33 = 70
f14 + f24 + f34 = 60
f15 + f25 + f35 = 50
fij ≥ 0, i = 1, 2, 3, j = 1, 2, 3, 4, 5.
Note: The following constraints are incorrect because evidently all fij will be 0 in the optimal solution:
f11 + f12 + f13 + f14 + f15 ≤ 100
f21 + f22 + f23 + f24 + f25 ≤ 200
f31 + f32 + f33 + f34 + f35 ≤ 50
f11 + f21 + f31 ≤ 80
f12 + f22 + f32 ≤ 90
f13 + f23 + f33 ≤ 70
f14 + f24 + f34 ≤ 60
f15 + f25 + f35 ≤ 50
fij ≥ 0, i = 1, 2, 3, j = 1, 2, 3, 4, 5.
subject to
(u − v) + (2 − w) − t = 3
(u − v) − 2(2 − w) = 1
u ≥ 0
v ≥ 0
w ≥ 0
t ≥ 0
which is equivalent to
max −u + v − w
subject to
u − v − w − t = 1
u − v + 2w = 5
u ≥ 0
v ≥ 0
w ≥ 0
t ≥ 0.
" #
h i 1 1 1 1
A=
x= u v w t T 1 1 2 0 ,
Hence,
h in
i the standard
h form, i ,
y = 2 w
b= 1 5 T , and c= 1 1 1 0 T , with x = u v and .
Note 1: The solution to this question is not unique.
Note 2: The following interesting solutions are also correct: Let u = x + y − 3 ≥ 0 and v = 2 − y ≥ 0. Solving this equation system, we have x = u + v + 1, y = 2 − v. Therefore, the model is equivalent to min u + 2v − 1 subject to u + 3v − 3 = 1, u ≥ 0, v ≥ 0, etc.
Note 3: It is incorrect to say that y ≤ 2 is equivalent to w = y, w ≤ 2, w ≥ 0.
Example 16.42: Use the graphical method to find the optimal solution to the model below:
max 2x + y
subject to
x + y ≥ 1
x ≤ 2
y ≤ 1
x ≥ 0
y ≥ 0.
Solution. See Figure 16.20. The optimal solution corresponds to the intersection of the lines x = 2 and y = 1. Therefore, the optimal solution is x = 2, y = 1. The optimal objective function value is 5.
Note: The slope of the line 2x + y = k is −2 rather than 2.
Example 16.44: Figure 16.22 shows a liner container shipping network. The circles represent ports. There are five routes r1, r2, r3, r4, and r5 that provide weekly shipping services. The capacity of a ship deployed on route r1 is E1 (TEUs). E2, E3, E4, and E5 have similar meanings. The volumes of containers between different OD pairs are shown in the figure. For example, q12 is the demand (TEUs/week) from port 1 to port 2. The profit for transporting one TEU from port 1 to port 2 is g12 ($/TEU). g14 and g15 have similar meanings. Assume that container handling costs are 0. Develop a path-flow linear optimization model to find the maximum profit that can be gained from transporting containers. You must write down the details of the objective function and constraints without using “∑” or “∀” except in the nonnegativity constraints.
Solution. The set of OD pairs is W = {(1, 2), (1, 4), (1, 5)}. Let Hod be the set of itineraries (paths) for OD pair (o, d) ∈ W. To simplify the notation, we use <r, i> to represent leg i of route r. Then H12 consists of the following:
h1: <1, 1>
y1 ≤ q12
y2 + y3 ≤ q14
y4 + y5 ≤ q15
y1 + y2 + y3 + y4 + y5 ≤ E1 (capacity on leg <1, 1>)
y3 + y4 ≤ E2 (capacity on leg <2, 1>)
y2 + y5 ≤ E3 (capacity on leg <3, 1>)
y3 + y4 ≤ E4 (capacity on leg <4, 1>)
y5 ≤ E5 (capacity on leg <5, 1>)
y3 ≤ E5 (capacity on leg <5, 2>)
yh ≥ 0, h ∈ H.
Note: It is wrong to say y4 ≤ q15, y5 ≤ q15.
Example 16.45: Propose a linear optimization model that satisfies all of
the following requirements or answer “such a model does not exist”: (i) it is
infeasible; (ii) after removing the first constraint, it has an infinite number of
optimal solutions; (iii) after removing the first two constraints, it has exactly
one optimal solution; and (iv) after removing the first three constraints, it is
unbounded.
Solution. Note: The answer is not unique. An example can be
max x
subject to
x ≤ −1
x ≤ 1
x + y ≤ 2
x ≥ 0
y ≥ 0.
Example 16.47: A linear optimization model has decision variables x1, x2, …, xn and constraints:
min z subject to all of the constraints. Then we solve the model min 0 subject to u = ū, u ≥ x, u ≥ y, u ≥ z, v = v̄, v ≤ x, v ≤ y, v ≤ z, and all of the other constraints, whose optimal solution is also optimal to the original model.
It is wrong because there may not exist a feasible solution (x̄, ȳ, z̄) such that max(x̄, ȳ, z̄) = ū and min(x̄, ȳ, z̄) = v̄. To appreciate this point, consider a model with only two decision variables x and y, whose feasible set is the triangle in Figure 16.24, and suppose we want to maximize max(x, y) − min(x, y). In plain words, we want to find a point that is as far away from the line y = x as possible. We can see that the optimal solution is the dot in the figure, and the optimal objective function value is very small (somewhere between 0.1 and 0.2). If we use the aforementioned wrong approach, we have ū = 1, v̄ = 0, and the optimal objective function value is 1.
A detailed explanation of the solution to the above example:
There are six possible cases: x ≥ y ≥ z, x ≥ z ≥ y, y ≥ x ≥ z, y ≥ z ≥ x, z ≥ x ≥ y, and z ≥ y ≥ x.
We solve Model 1: maximize x − z subject to all the given constraints and x ≥ y, y ≥ z, and its optimal solution is (x(1), y(1), z(1)).
References
[1] Wang S., Meng Q., Sun Z. ‘Container routing in liner shipping’. Transportation
Research Part E. 2013;49(1):1–7.
[2] Wang S. ‘A novel hybrid-link-based container routing model’. Transportation
Research Part E. 2014;61:165–75.
[3] Wang S., Meng Q. ‘Sailing speed optimization for container ships in a liner
shipping network’. Transportation Research Part E. 2012;48(3):701–14.
Chapter 17
Integer optimization
In linear optimization, we assume that all of the decision variables are continu-
ous. However, in reality, some are not. For example, the number of crude oil
tankers used to transport crude oil from the Middle East to the US is an inte-
ger, and rounding down 6.5 ships to 6 ships can lead to considerable errors.
Therefore, a natural extension to linear optimization models is integer linear
optimization models, which are the same as linear optimization models except
that the decision variables can only take integer values. We also have mixed-
integer linear optimization models, in which some decision variables can only
take integer values, and the others are continuous. We often use “integer opti-
mization models” to refer to integer linear optimization models or both inte-
ger linear optimization models and mixed-integer linear optimization models.
However, one should keep in mind that integer linear optimization models are
not linear; in other words, integer linear optimization models belong to the cat-
egory of nonlinear optimization models.
We use Z+ to represent the set of non-negative integers*. Hence, x ∈ Z+ means that x is non-negative and can only take integer values.
Example 17.1: Suppose that there is 1 million tons of coal to transport from
Indonesia to Japan. The following bulk carriers are available:
*
One often uses Z++ to represent the set of positive integers.
Each bulk carrier can only complete one trip from Indonesia to Japan due to the
deadline for delivering the coal. The company needs to determine how many ships
in each type to use. Formulate an integer optimization model that minimizes the total
cost of transporting the coal.
Solution: Let x, y, z, and w be the numbers of Capesize, Panamax, Handymax, and Handysize ships to use, respectively. The model is:
subject to
Example 17.2: Suppose that there is 2 million tons of crude oil to transport from
Saudi Arabia to the US. The following crude oil tankers are available (capacity in
terms of 1000 tons and cost per trip in terms of million $):
Each ship can only complete one trip from Saudi Arabia to the US due to the
deadline for delivering the crude oil. The company needs to determine how many
ships in each type to use. Formulate an integer optimization model that minimizes
the total cost of transporting the crude oil.
Solution: Let x, y, z, and w be the numbers of VLCC, Suezmax, Aframax, and Panamax ships to use, respectively. The model is
subject to
Location r1 r2 r3
1 11 12 13
2 25 28 37
3 0 50 40
4 20 20 0
5 31 14 2
xij ∈ Z+, i = 1, 2, 3, 4, 5, j = 1, 2, 3.
Some integer optimization models have only binary decision variables, i.e., decision
variables which can only take the value 0 or 1. Such models are called 0–1 integer
optimization models or binary integer optimization models. 0–1 integer optimiza-
tion models have wide applications for expressing logical constraints.
Example 17.5: Consider a teaching venue allocation problem. Suppose that
there are six courses (C1–C6) to be taught in three classrooms (R1–R3). Each class-
room can only be used to teach two courses due to reservations, e.g., meetings. Not
all classrooms are suitable for all courses because of available teaching equipment,
as shown in the table below where “Y” means the classroom is suitable for the
course and “N” means not suitable. Formulate an integer optimization model to find
a feasible classroom allocation plan for all the courses.
R1 R2 R3
C1 Y Y N
C2 Y N N
C3 N N Y
C4 Y Y Y
C5 N N Y
C6 Y Y N
Solution: Let xij be a binary variable that equals 1 if and only if course i = 1, …, 6 is taught in classroom j = 1, 2, 3. A feasible allocation can be found by solving:
min 0
subject to
x11 + x12 + x13 = 1
x21 + x22 + x23 = 1
x31 + x32 + x33 = 1
x41 + x42 + x43 = 1
x51 + x52 + x53 = 1
x61 + x62 + x63 = 1
x13 = 0, x22 = 0, x23 = 0, x31 = 0, x32 = 0, x51 = 0, x52 = 0, x63 = 0 (unsuitable classrooms)
x11 + x21 + x31 + x41 + x51 + x61 ≤ 2
x12 + x22 + x32 + x42 + x52 + x62 ≤ 2
x13 + x23 + x33 + x43 + x53 + x63 ≤ 2 (each classroom can teach at most two courses)
xij ∈ {0, 1}, i = 1, 2, 3, 4, 5, 6, j = 1, 2, 3.
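A PuLP sketch of this feasibility model (our illustration; the suitability sets are read off the table above) is:

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus

suitable = {1: [1, 2], 2: [1], 3: [3], 4: [1, 2, 3], 5: [3], 6: [1, 2]}
prob = LpProblem("classroom_allocation", LpMinimize)
x = {(i, j): LpVariable("x_%d_%d" % (i, j), cat="Binary")
     for i in range(1, 7) for j in range(1, 4)}
prob += 0 * x[(1, 1)]  # feasibility problem: the objective is "min 0"
for i in range(1, 7):
    prob += lpSum(x[(i, j)] for j in range(1, 4)) == 1  # each course gets a room
    for j in range(1, 4):
        if j not in suitable[i]:
            prob += x[(i, j)] == 0                      # unsuitable classrooms
for j in range(1, 4):
    prob += lpSum(x[(i, j)] for i in range(1, 7)) <= 2  # at most two courses per room
prob.solve()
print(LpStatus[prob.status])  # "Optimal" means a feasible allocation exists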
The container shipment demand is: S to P qSP = 800 and S to L qSL = 6000. The
revenue of transporting one TEU (twenty-foot equivalent unit) from S to P is $200
and from S to L is $1 300. Suppose that the company needs to determine which of
the following three types of container ship to use:
Exactly one type of ship will be used. The company aims to maximize its profit.
Formulate a mixed-integer linear optimization model to help the company make the
decision of which type of ship to use.
Solution: Let x1, x2, x3 be binary decision variables that equal 1 if and only if post-Panamax, Panamax type 2, and Panamax type 1 ships are used, respectively, and 0 otherwise. Let ySP and ySL be the decision variables representing the volumes of containers transported from S to P and S to L, respectively. The model is
max 200ySP + 1300ySL − (500,000x1 + 350,000x2 + 330,000x3)
subject to
ySP + ySL ≤ 8000x1 + 5100x2 + 4800x3
ySL ≤ 8000x1 + 5100x2 + 4800x3
ySP ≤ 800
ySL ≤ 6000
x1 + x2 + x3 = 1
x1 ∈ {0, 1}, x2 ∈ {0, 1}, x3 ∈ {0, 1}
ySP ≥ 0, ySL ≥ 0.
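A PuLP sketch of this mixed-integer model (our illustration, with the data of the model above) is:

from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

caps = [8000, 5100, 4800]         # post-Panamax, Panamax type 2, Panamax type 1
costs = [500000, 350000, 330000]  # weekly costs of the three ship types
prob = LpProblem("ship_type_selection", LpMaximize)
x = [LpVariable("x%d" % (k + 1), cat="Binary") for k in range(3)]
ySP = LpVariable("ySP", lowBound=0)
ySL = LpVariable("ySL", lowBound=0)
cap = lpSum(caps[k] * x[k] for k in range(3))
prob += 200 * ySP + 1300 * ySL - lpSum(costs[k] * x[k] for k in range(3))
prob += ySP + ySL <= cap
prob += ySL <= cap
prob += ySP <= 800
prob += ySL <= 6000
prob += lpSum(x) == 1
prob.solve()
print(value(prob.objective))  # 7460000.0: the post-Panamax ship (x1 = 1) is chosen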
h1: <1, 1>
h2: <3, 1> → <2, 2>
Figure 17.1 A container liner shipping network with direct delivery and transshipment
H32 consists of:
h5: <2, 2>
h6: <3, 2> → <1, 1>.
Define H := ∪_{(o,d)∈W} Hod. Let xr be a binary decision variable that equals 1 if and only if route r is operated, r = 1, 2, 3. Let yh be the decision variable representing the flow on itinerary h ∈ H (TEUs/week). The model is
max ∑_{(o,d)∈W} god ∑_{h∈Hod} yh − (c1 x1 + c2 x2 + c3 x3)
subject to
∑_{h∈Hod} yh ≤ qod, (o, d) ∈ W
Because of M, we often say we use “the big-M method” to formulate the constraints.
It might be tempting to set M at a very large value to be on the safe side. However, a larger M will make the model more time-consuming to solve. In reality, we should try to find the smallest possible M. For example, if the above model further has the constraints 0 ≤ x ≤ 2 and 1 ≤ y ≤ 3, then we know that
max{x − y − 2, 2x − y − 3, −3x + 4y − 4} = 8,
and we can hence set M = 8.
Example 17.8: We have a model. One of its decision variables x can either be
greater than or equal to 2, or less than or equal to 1. How to use the big-M method to
formulate this requirement as linear constraints with binary variables?
Solution: The requirement is that at least one of the following two constraints must hold: 2 − x ≤ 0 and x − 1 ≤ 0. Hence, it can be formulated as
2 − x ≤ Mz1
x − 1 ≤ Mz2
z1 + z2 ≤ 1
z1 ∈ {0, 1}, z2 ∈ {0, 1}.
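A PuLP sketch of these constraints (our illustration; the bound 0 ≤ x ≤ 10, hence M = 10, and the objective are hypothetical additions so that the model is concrete) is:

from pulp import LpProblem, LpMinimize, LpVariable, value

M = 10
prob = LpProblem("either_or", LpMinimize)
x = LpVariable("x", lowBound=0, upBound=10)
z1 = LpVariable("z1", cat="Binary")
z2 = LpVariable("z2", cat="Binary")
prob += x                 # hypothetical objective: minimize x
prob += 2 - x <= M * z1   # z1 = 0 forces x >= 2
prob += x - 1 <= M * z2   # z2 = 0 forces x <= 1
prob += z1 + z2 <= 1      # at least one of the two requirements must hold
prob.solve()
print(value(x))           # 0.0: the branch x <= 1 is selected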
Example 17.9: We have a model. One of its decision variables x can either be
greater than or equal to 2, or between 0 and 1 (inclusive of 0 and 1). How to use
the big-M method to formulate this requirement as linear constraints with binary
variables?
Solution: The requirement is: x ≥ 0 and at least one of the following two constraints must hold: 2 − x ≤ 0 and x − 1 ≤ 0. Hence, it can be formulated as
2 − x ≤ Mz1
x − 1 ≤ Mz2
z1 + z2 ≤ 1
z1 ∈ {0, 1}, z2 ∈ {0, 1}
x ≥ 0.
In contrast to the fact that linear optimization models are generally easy to formu-
late, (i) it may be difficult to formulate an integer optimization model, (ii) one may
often formulate an incorrect integer optimization model, and (iii) it is possible to
formulate a bad integer optimization model when there are good formulations that
can be solved efficiently.
Example 17.11: There is a model with the following constraints
x1 = x2
x1 = x3
x1 = x4
x1 ∈ {0, 1}, x2 ∈ {0, 1}, x3 ∈ {0, 1}, x4 ∈ {0, 1}.
Formulate one constraint to replace the first three constraints.
subject to
∑_{j∈N\{i}} xij = 1, i ∈ N
∑_{i∈N\{j}} xij = 1, j ∈ N
Example 17.14: In the traveling salesman problem, write down the model for the case of n = 4. You are not allowed to use “∑” or “∀” except in the non-negativity/binary constraints.
Solution: Let xij be a binary variable which equals 1 if and only if arc (i, j) is
traversed, and let ui represent the number of customers that have been served after
visiting i . The model is
min d12 x12 + d13 x13 + d14 x14 + d21 x21 + d23 x23 + d24 x24 +
d31 x31 + d32 x32 + d34 x34 + d41 x41 + d42 x42 + d43 x43
subject to
x12 + x13 + x14 = 1
x21 + x23 + x24 = 1
x31 + x32 + x34 = 1
x41 + x42 + x43 = 1
x21 + x31 + x41 = 1
x12 + x32 + x42 = 1
x13 + x23 + x43 = 1
x14 + x24 + x34 = 1
u2 ≥ 1
u3 ≥ 1
u4 ≥ 1
u1 = 0.
Example 17.15: Revisit the shortest path problem. A transportation network has a set of nodes N := {1, 2, …, n} and a set of arcs A := {(i, j), i ∈ N, j ∈ N, i ≠ j}. The distance of arc (i, j) is dij. Develop an integer optimization model to determine the shortest path from node 1 to node n. Note the differences between the shortest path problem and the TSP (travelling salesman problem); in the shortest path problem, not all nodes have to be visited, and the person does not return to node 1.
Solution: Let xij be a binary variable which equals 1 if and only if arc (i, j) is traversed. The model is
min ∑_{(i,j)∈A} dij xij
subject to
∑_{j∈N\{1}} x1j = 1
∑_{i∈N\{n}} xin = 1
∑_{j∈N\{i}} xij = ∑_{j∈N\{i}} xji, i ∈ N \ {1, n}
min d12 x12 + d13 x13 + d14 x14 + d23 x23 + d24 x24 + d32 x32 + d34 x34
subject to
x12 + x13 + x14 = 1
x14 + x24 + x34 = 1
x23 + x24 = x12 + x32
x32 + x34 = x13 + x23
xij ∈ {0, 1}, (i, j) ∈ A.
Example 17.17: Ships need to be deployed on three services r1, r2, and r3 to main-
tain a weekly frequency. The operating cost of a service given the number of ships
deployed is shown below.
Suppose that the company has ten ships and can charter in additional ships at
the cost of 0.1 million $/week per ship. Develop a model to determine how to deploy
ships on the three services at minimum cost.
Solution: Let xr and yr be the numbers of owned ships and chartered ships to deploy on route r = 1, 2, 3, respectively. Let zrs be a binary variable that equals 1 if and only if s = 5, 6, 7 ships are deployed on route r = 1, 2, 3. The model is
min 0.1(y1 + y2 + y3) + 1.5z15 + 1.3z16 + 1.1z17 + 2z25 + 1.9z26 + 1.6z27 + 2.3z35 + 2.2z36 + 2z37
subject to
xr + yr = ∑_{s=5}^{7} s zrs, r = 1, 2, 3
∑_{s=5}^{7} zrs = 1, r = 1, 2, 3
∑_{r=1}^{3} xr ≤ 10
xr ∈ Z+, r = 1, 2, 3
yr ∈ Z+, r = 1, 2, 3
zrs ∈ {0, 1}, r = 1, 2, 3, s = 5, 6, 7.
The problem with the above formulation is that it has a large number of symmetrical optimal solutions. For example, if (x1, y1, x2, y2, x3, y3) = (4, 1, 4, 2, 2, 5) is an optimal solution, then (x1, y1, x2, y2, x3, y3) = (3, 2, 4, 2, 3, 4) is also an optimal solution. In fact, there will be 35 optimal solutions. Generally speaking, having so many symmetrical feasible and optimal solutions will make the model very difficult to solve (the reason is beyond the scope of the subject). Readers can refer to Reference [1] to see a comparison of the efficiency of the two models. A better formulation is as follows.
Solution: Let mr be the number of ships (including owned and chartered ships) to deploy on route r = 1, 2, 3. Let x be the total number of owned ships deployed on the three routes and y be the total number of chartered ships deployed on the three routes. Let zrs be a binary variable that equals 1 if and only if s = 5, 6, 7 ships are deployed on route r = 1, 2, 3. The model is
min 0.1y + 1.5z15 + 1.3z16 + 1.1z17 + 2z25 + 1.9z26 + 1.6z27 + 2.3z35 + 2.2z36 + 2z37
subject to
mr = ∑_{s=5}^{7} s zrs, r = 1, 2, 3
∑_{s=5}^{7} zrs = 1, r = 1, 2, 3
∑_{r=1}^{3} mr = x + y
x ≤ 10
mr ∈ Z+, r = 1, 2, 3
x ∈ Z+
y ∈ Z+
zrs ∈ {0, 1}, r = 1, 2, 3, s = 5, 6, 7.
We can further prove that the integrality constraints on x and y can be removed.
Moreover, for some problems, natural integer optimization formulations have an exponential number of variables, while more clever formulations have only a polynomial number of variables.
Given z1 ∈ {0, 1} and z2 ∈ {0, 1}, the constraint z ≥ z1 z2 is equivalent to z ≥ z1 + z2 − 1, z ≥ 0; the constraint z ≤ z1 z2 is equivalent to z ≤ z1, z ≤ z2; and the constraint z = z1 z2 is equivalent to z ≤ z1, z ≤ z2, z ≥ z1 + z2 − 1, z ≥ 0.
Given z ∈ {0, 1} and 0 ≤ x ≤ M, the constraint y ≥ zx is equivalent to y ≥ 0, y ≥ x − M(1 − z); the constraint y ≤ zx is equivalent to y ≤ x, y ≤ Mz; and the constraint y = zx is equivalent to y ≥ 0, y ≥ x − M(1 − z), y ≤ x, y ≤ Mz.
Example 17.18: A model has decision variables z1 ∈ {0, 1}, z2 ∈ {0, 1}, and 0 ≤ x ≤ M, an objective function that minimizes 3z1 z2 − 4z1 x, and some linear constraints. How do you linearize the objective function?
Solution: We introduce two new variables u1 and u2. The model is
min 3u1 − 4u2
subject to
u1 = z1 z2,
u2 = z1 x,
and other relevant constraints. This model is equivalent to:
min 3u1 − 4u2
subject to
u1 ≥ z1 z2,
u2 ≤ z1 x,
and other relevant constraints. This model is equivalent to:
min 3u1 − 4u2
subject to
u1 ≥ z1 + z2 − 1,
u1 ≥ 0,
u2 ≤ x,
u2 ≤ Mz1,
and other relevant constraints.
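A PuLP sketch of the final linearized model (our illustration; the bound M = 10 is hypothetical and the “other relevant constraints” are omitted) is:

from pulp import LpProblem, LpMinimize, LpVariable, value

M = 10
prob = LpProblem("linearized_products", LpMinimize)
z1 = LpVariable("z1", cat="Binary")
z2 = LpVariable("z2", cat="Binary")
x = LpVariable("x", lowBound=0, upBound=M)
u1 = LpVariable("u1", lowBound=0)  # stands in for z1 * z2
u2 = LpVariable("u2")              # stands in for z1 * x
prob += 3 * u1 - 4 * u2
prob += u1 >= z1 + z2 - 1          # u1 >= z1 z2
prob += u2 <= x                    # u2 <= z1 x
prob += u2 <= M * z1
prob.solve()
print(value(z1), value(z2), value(x), value(prob.objective))  # 1.0 0.0 10.0 -40.0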
17.7 Practice
Example 17.19: A student has at most 20 days for preparing the final exams of four
subjects. Each subject is worth six credit points. The student needs to determine how
many days to spend on each subject. It is required that the number of days allocated
to each subject must be an integer.
Suppose that the final score of a subject is proportional to the number of days spent on it. Let ci be the score the student can get if he spends 1 day on subject i = 1, 2, 3, 4. For instance, if c1 = 11, then the student can get 99 marks if he spends 9 days on subject 1; as the student cannot get more than 100 marks, the student will not spend more than 9 days on the subject. We therefore let Ti, the maximum number of days the student spends on subject i = 1, 2, 3, 4, be the largest integer that does not exceed 100/ci. The student knows the values of ci and Ti, i = 1, 2, 3, 4.
The student wants to maximize his average mark. How should he allocate time
to the four subjects?
Solution: Let xi, i = 1, 2, 3, 4, be the number of days spent on subject i. The model is
max (c1 x1 + c2 x2 + c3 x3 + c4 x4) / 4
subject to
x1 + x2 + x3 + x4 ≤ 20
xi ≤ Ti, i = 1, 2, 3, 4
xi ∈ Z+, i = 1, 2, 3, 4.
Example 17.20: In the context of Example 17.19, the student must make sure that
he passes all the subjects (at least 50 marks). How should he allocate time to the four
subjects in order to maximize his average mark?
Solution: Let xi, i = 1, 2, 3, 4, be the number of days spent on subject i. The model is
max (c1 x1 + c2 x2 + c3 x3 + c4 x4) / 4
subject to
x1 + x2 + x3 + x4 ≤ 20
ci xi ≥ 50, i = 1, 2, 3, 4
xi ≤ Ti, i = 1, 2, 3, 4
xi ∈ Z+, i = 1, 2, 3, 4.
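A PuLP sketch of this model (our illustration; the values of ci are hypothetical, and Ti is the largest integer not exceeding 100/ci as in Example 17.19) is:

from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

c = [13, 12, 11, 10]         # hypothetical marks per day for the four subjects
T = [100 // ci for ci in c]  # T_i = largest integer not exceeding 100 / c_i
prob = LpProblem("study_plan", LpMaximize)
x = [LpVariable("x%d" % (i + 1), lowBound=0, upBound=T[i], cat="Integer")
     for i in range(4)]
prob += lpSum(c[i] * x[i] for i in range(4)) / 4  # average mark
prob += lpSum(x) <= 20
for i in range(4):
    prob += c[i] * x[i] >= 50  # pass every subject
prob.solve()
print([value(xi) for xi in x], value(prob.objective))  # e.g. [5.0, 5.0, 5.0, 5.0], 57.5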
Example 17.21: In the context of Example 17.19, the student must make sure that he passes all the subjects. How should he allocate time to the four subjects in order to get the largest number of “HD” (at least 85 marks)? (Hint: Let xi, i = 1, 2, 3, 4, be the number of days spent on subject i. Let zi, i = 1, 2, 3, 4, be a binary variable that equals 1 if and only if he gets HD for subject i. Then we maximize z1 + z2 + z3 + z4. We may have the constraint ci xi ≥ 85zi, so that zi can be 1 only if ci xi ≥ 85.)
Solution: Let xi, i = 1, 2, 3, 4, be the number of days spent on subject i. Let zi, i = 1, 2, 3, 4, be a binary variable that equals 1 if and only if he gets HD for subject i. The model is
max z1 + z2 + z3 + z4
subject to
x1 + x2 + x3 + x4 ≤ 20
ci xi ≥ 50, i = 1, 2, 3, 4
ci xi ≥ 85zi, i = 1, 2, 3, 4
xi ≤ Ti, i = 1, 2, 3, 4
xi ∈ Z+, i = 1, 2, 3, 4
zi ∈ {0, 1}, i = 1, 2, 3, 4.
Example 17.22: In the context of Example 17.19, develop a model to help the stu-
dent check whether he can get “D” (at least 75 marks) for all the subjects.
Solution: Let xi, i = 1, 2, 3, 4, be the number of days spent on subject i. The model is
min 0
subject to
x1 + x2 + x3 + x4 ≤ 20
ci xi ≥ 75, i = 1, 2, 3, 4
xi ≤ Ti, i = 1, 2, 3, 4
xi ∈ Z+, i = 1, 2, 3, 4.
If the above model has a feasible solution, he can get “D” for all the subjects; otherwise he cannot.
Example 17.23: In the context of Example 17.19, develop a model to help the student check whether he can get at least two “D” and two “HD” (i.e., two “D” and two “HD,” one “D” and three “HD,” or four “HD”).
Solution: Similar to Example 17.22, this question requires that the score for each subject be at least 75 and checks whether the maximum number of “HD” is at least two. Let xi, i = 1, 2, 3, 4, be the number of days spent on subject i. Let zi, i = 1, 2, 3, 4, be a binary variable that equals 1 if and only if he gets HD for subject i. The model is
max z1 + z2 + z3 + z4
subject to
x1 + x2 + x3 + x4 ≤ 20
ci xi ≥ 75, i = 1, 2, 3, 4
ci xi ≥ 85zi, i = 1, 2, 3, 4
xi ≤ Ti, i = 1, 2, 3, 4
xi ∈ Z+, i = 1, 2, 3, 4
zi ∈ {0, 1}, i = 1, 2, 3, 4.
If the optimal objective function value is greater than or equal to 2, the student can get at least two “D” and two “HD”; otherwise he cannot.
Example 17.24: In the context of Example 17.19, it may not be reasonable to
assume that the score of a subject is proportional to the number of days spent on it.
Therefore, we let cij be the total score of subject i = 1, 2, 3, 4 the student can get if
j = 0, 1, , 20days is spent on it. Of course, 0 ci0 ci1 ci2 ci,20 100.
The student knows the values of cij . How should he allocate time to the four subjects in
order to maximize his average mark? (Hint: Let yij , i = 1, 2, 3, 4and j = 0, 1, , 20,
be a binary variable that equals 1 if and only if j days is spent on subject i .)
Solution: Let $y_{ij}$, $i = 1, 2, 3, 4$ and $j = 0, 1, \dots, 20$, be a binary variable that equals 1 if and only if $j$ days are spent on subject $i$. The model is
$$
\begin{aligned}
\max\ \ & \frac{\sum_{i=1}^{4} \sum_{j=0}^{20} c_{ij} y_{ij}}{4} \\
\text{subject to}\ \ & \sum_{i=1}^{4} \sum_{j=0}^{20} j\, y_{ij} \le 20 \\
& \sum_{j=0}^{20} y_{ij} = 1, \quad i = 1, 2, 3, 4 \\
& y_{ij} \in \{0, 1\}, \quad i = 1, 2, 3, 4, \ j = 0, 1, \dots, 20.
\end{aligned}
$$
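This assignment-style formulation is also straightforward to implement. In the sketch below, PuLP and the nondecreasing score table $c_{ij}$ are illustrative assumptions, not data from the book.

```python
# Sketch of Example 17.24 in PuLP: y[i][j] = 1 iff j days are spent on
# subject i. The score table c[i][j] is a made-up nondecreasing example.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

DAYS = 20
# assumed scores: subject i starts at 30 + 5 i and gains 4 marks per day, capped at 100
c = [[min(100, 30 + 5 * i + 4 * j) for j in range(DAYS + 1)] for i in range(4)]

prob = LpProblem("nonproportional_scores", LpMaximize)
y = [[LpVariable(f"y_{i+1}_{j}", cat="Binary") for j in range(DAYS + 1)]
     for i in range(4)]

prob += 0.25 * lpSum(c[i][j] * y[i][j]
                     for i in range(4) for j in range(DAYS + 1))   # average mark
prob += lpSum(j * y[i][j]
              for i in range(4) for j in range(DAYS + 1)) <= DAYS  # day budget
for i in range(4):
    prob += lpSum(y[i]) == 1          # exactly one day count chosen per subject

prob.solve()
days = [next(j for j in range(DAYS + 1) if value(y[i][j]) > 0.5) for i in range(4)]
print(days, value(prob.objective))
```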
These constraints linearize the products $u_i = x z_i$: if $z_i = 0$, then $u_i \le M z_i = 0$ and $u_i \ge 0$ force $u_i = 0$; if $z_i = 1$, then $u_i \ge x - M(1 - z_i) = x$ and $u_i \le x$ force $u_i = x$:
$$
\begin{aligned}
& u_i \ge 0, \quad i = 0, 1, \dots, M \\
& u_i \ge x - M(1 - z_i), \quad i = 0, 1, \dots, M \\
& u_i \le x, \quad i = 0, 1, \dots, M \\
& u_i \le M z_i, \quad i = 0, 1, \dots, M \\
& \sum_{i=0}^{M} z_i = 1 \\
& z_i \in \{0, 1\}, \quad i = 0, 1, \dots, M \\
& 0 \le x \le M.
\end{aligned}
$$
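To see the linearization in action, here is a small self-contained PuLP check; the value $M = 5$ and the objective are arbitrary illustrative choices.

```python
# Toy check of the product linearization u_i = x * z_i above; M and the
# objective are arbitrary. Exactly one z_i is 1, and only that u_i equals x.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

M = 5
prob = LpProblem("product_linearization_demo", LpMaximize)
x = LpVariable("x", lowBound=0, upBound=M)
z = [LpVariable(f"z{i}", cat="Binary") for i in range(M + 1)]
u = [LpVariable(f"u{i}", lowBound=0) for i in range(M + 1)]

prob += lpSum(i * u[i] for i in range(M + 1))     # arbitrary objective
prob += lpSum(z) == 1                             # exactly one z_i equals 1
for i in range(M + 1):
    prob += u[i] >= x - M * (1 - z[i])            # z_i = 1 forces u_i >= x
    prob += u[i] <= x                             # u_i never exceeds x
    prob += u[i] <= M * z[i]                      # z_i = 0 forces u_i = 0

prob.solve()
print(value(x), [value(ui) for ui in u])          # u_i = x where z_i = 1, else 0
```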
Example 17.29: A student has at most 20 days for preparing the final exams of four
subjects. Each subject is worth six credit points. The student needs to determine how
many days to spend on each subject. It is required that the number of days allocated
to each subject must be an integer.
Suppose that the final mark of a subject is proportional to the number of days spent on it. Let $c_i$ be the mark the student can get if he spends 1 day on subject $i = 1, 2, 3, 4$. For instance, if $c_1 = 11$, then the student can get 99 marks if he spends 9 days on subject 1; as the student cannot get more than 100 marks, the student will not spend more than 9 days on the subject. We therefore let $T_i$, the maximum number of days the student spends on subject $i = 1, 2, 3, 4$, be the largest integer that does not exceed $100/c_i$. The student knows the values of $c_i$ and $T_i$, $i = 1, 2, 3, 4$. All $c_i$, $i = 1, 2, 3, 4$, are integers. Therefore, the final mark of a subject must also be an integer.
The student wants to maximize his average mark while (i) not failing any subject (i.e., the final mark of each subject is at least 50) and (ii) not getting a "credit" in any subject (i.e., the final mark of each subject is either less than or equal to 64 or greater than or equal to 75). In sum, the final mark of each subject is either between 50 and 64 (both inclusive) or at least 75. Formulate an integer linear optimization model to help the student allocate time to the four subjects to achieve his goal.
Solution: Let $x_i$, $i = 1, 2, 3, 4$, be the number of days spent on subject $i$. The model is
$$
\begin{aligned}
\max\ \ & \frac{c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4}{4} \\
\text{subject to}\ \ & x_1 + x_2 + x_3 + x_4 \le 20 \\
& c_i x_i \ge 50, \quad i = 1, 2, 3, 4 \\
& c_i x_i \le 64 \ \text{or} \ c_i x_i \ge 75, \quad i = 1, 2, 3, 4 \\
& x_i \le T_i, \quad i = 1, 2, 3, 4 \\
& x_i \in \mathbb{Z}_+, \quad i = 1, 2, 3, 4.
\end{aligned}
$$
To linearize the "or" constraints, we introduce new binary variables $y_i$ and $z_i$, $i = 1, 2, 3, 4$: if $y_i = 0$, the mark of subject $i$ is forced to be at most 64; if $z_i = 0$, it is forced to be at least 75; and $y_i + z_i \le 1$ ensures that at least one of the two bounds is enforced. The integer linear optimization model is
$$
\begin{aligned}
\max\ \ & \frac{c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4}{4} \\
\text{subject to}\ \ & x_1 + x_2 + x_3 + x_4 \le 20 \\
& c_i x_i \ge 50, \quad i = 1, 2, 3, 4 \\
& c_i x_i \le 64 + 100 y_i, \quad i = 1, 2, 3, 4 \\
& c_i x_i \ge 75 - 100 z_i, \quad i = 1, 2, 3, 4 \\
& y_i + z_i \le 1, \quad i = 1, 2, 3, 4 \\
& x_i \le T_i, \quad i = 1, 2, 3, 4 \\
& x_i \in \mathbb{Z}_+, \quad i = 1, 2, 3, 4 \\
& y_i \in \{0, 1\}, \quad i = 1, 2, 3, 4 \\
& z_i \in \{0, 1\}, \quad i = 1, 2, 3, 4.
\end{aligned}
$$
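The linearized model can be written in PuLP along the same lines; as before, PuLP and the values of $c_i$ are illustrative assumptions. With $c_4 = 10$, for instance, spending 7 days on subject 4 would give a "credit" mark of 70, so the "or" constraint is genuinely restrictive there.

```python
# Sketch of Example 17.29's linearized model in PuLP. With y_i + z_i <= 1,
# at least one of "mark <= 64" (y_i = 0) and "mark >= 75" (z_i = 0) holds.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

c = [25, 20, 15, 10]              # assumed integer marks per day (illustrative)
T = [100 // ci for ci in c]

prob = LpProblem("avoid_credit_band", LpMaximize)
x = [LpVariable(f"x{i+1}", lowBound=0, upBound=T[i], cat="Integer")
     for i in range(4)]
y = [LpVariable(f"y{i+1}", cat="Binary") for i in range(4)]
z = [LpVariable(f"z{i+1}", cat="Binary") for i in range(4)]

prob += 0.25 * lpSum(c[i] * x[i] for i in range(4))   # average mark
prob += lpSum(x) <= 20
for i in range(4):
    prob += c[i] * x[i] >= 50                  # pass every subject
    prob += c[i] * x[i] <= 64 + 100 * y[i]     # y_i = 0 forces mark <= 64
    prob += c[i] * x[i] >= 75 - 100 * z[i]     # z_i = 0 forces mark >= 75
    prob += y[i] + z[i] <= 1                   # at least one bound must hold

prob.solve()
print([int(value(xi)) for xi in x])
```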
In the following constraints, node 1 is left exactly once and entered exactly once, while node $i \in N \setminus \{1\}$ is left (and entered) exactly once if and only if $z_i = 1$:
$$
\begin{aligned}
\text{subject to}\ \ & \sum_{j \in N \setminus \{1\}} x_{1j} = 1 \\
& \sum_{i \in N \setminus \{1\}} x_{i1} = 1 \\
& \sum_{j \in N \setminus \{i\}} x_{ij} = z_i, \quad i \in N \setminus \{1\} \\
& \sum_{i \in N \setminus \{j\}} x_{ij} = z_j, \quad j \in N \setminus \{1\}.
\end{aligned}
$$
Chapter 18
Conclusion
Several future research directions are discussed in this section. It should be noted that these topics are just the tip of the iceberg: there is still a long way to go to realize the digital transformation of the traditional maritime industry.
1. Data play the key role and form the foundation for constructing data-driven models to solve practical problems in maritime transport. However, there are digital inefficiencies in maritime transport chains: public data sources are rare, while data provided by commercial companies can be very expensive. Efforts should be made to promote digital collaboration between ship and port, between ports, and between ship and hinterland for data sharing. Data-driven models developed to address practical problems in maritime transport should also comply with shipping domain knowledge and are expected to incorporate such knowledge explicitly into the model construction procedure.
2. Data-driven models are usually black boxes: how the prediction model works and why a certain prediction is made are opaque to model users, and sometimes even to model developers. This issue should be taken into account when developing and explaining data-driven prediction models so as to improve model transparency and acceptance.
3. In many practical problems, accurate prediction alone is far from enough. Predictions given by data-driven models should instead serve as inputs to downstream optimization problems that prescribe better decisions. Unfortunately, there is still a large gap between making a good prediction and making a good decision, and future research efforts should be directed at closing it.
4. Up to now, as covered in Chapter 1, only a few maritime transportation problems have been addressed by ML approaches, and most of these are prediction problems. Even where such prediction models exist, accurate and real-time prediction remains rare and difficult. Several important issues on both the shipping and port sides have not yet been addressed, such as vessel routing and scheduling, fleet planning and development, shipping network design, port congestion prediction and management, port resource allocation management, and port safety and security management. These issues are expected to be tackled in future research.