Smart Logistics Warehouse Moving-Object Tracking Based on
YOLOv5 and DeepSORT
Tingbo Xie and Xifan Yao *

School of Mechanical and Automotive Engineering, South China University of Technology,


Guangzhou 510640, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-20-8711-2381

Featured Application: An approach for object tracking in logistics warehouses based on YOLOv5
and DeepSORT is proposed, which distinguishes humans from goods, and an evaluation system
is established for object tracking in logistics warehouse scenarios.

Abstract: The future development of Industry 4.0 places paramount importance on human-centered/-centric factors in the production, design, and management of logistics systems, which has led to the emergence of Industry 5.0. However, effectively integrating human-centered/-centric factors in logistics scenarios has become a challenge. A pivotal technological solution for dealing with such a challenge is to distinguish and track moving objects such as humans and goods. Therefore, an algorithm model combining YOLOv5 and DeepSORT for logistics warehouse object tracking is designed, where YOLOv5 is selected as the object-detection algorithm and DeepSORT distinguishes humans from goods and environments. The evaluation metrics from the MOT Challenge affirm the algorithm's robustness and efficacy. Through rigorous experimental tests, the combined algorithm demonstrates rapid convergence (within 30 ms), which holds promising potential for applications in real-world logistics warehouses.

Keywords: deep learning; YOLOv5; DeepSORT; logistics warehouse

Citation: Xie, T.; Yao, X. Smart Logistics Warehouse Moving-Object Tracking Based on YOLOv5 and DeepSORT. Appl. Sci. 2023, 13, 9895. https://doi.org/10.3390/app13179895

Academic Editor: Jose Machado

Received: 29 July 2023; Revised: 18 August 2023; Accepted: 29 August 2023; Published: 1 September 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

With the emergence of smart manufacturing in conjunction with Industry 4.0, the demand for intelligent warehouses and intelligent logistics systems has been increasing. A well-functioning logistics warehouse system often enhances both the work efficiency of workers and the operational efficiency of goods [1]. The development of Industry 4.0 has prompted various applications of artificial intelligence and automation technology in intelligent logistics warehouses. However, it comes with the oversight of human-centered aspects in intelligent warehouse systems [2]. It has become a major challenge to comprehensively incorporate human factors into intelligent logistics warehouses.

Furthermore, the development of mobile networks has led to a boom in online shopping transactions, and subsequently, the logistics industry [3]. Therefore, logistics warehouses have to manage goods in a more expedient and efficient way. Within the realm of logistics warehouse management, managers mostly use surveillance cameras to manage the arrangement of goods and personnel movements. Nevertheless, with the increase in the number of surveillance cameras and the associated manual time expenses, the traditional manual handling of video surveillance has gradually been phased out. A more intelligent and automated video processing system is needed to oversee the movement of individuals and goods [4,5].

Object-tracking technology has found widespread utility within logistics warehouses. In the practical application of monitoring logistics warehouses, there is an increasing need for object-tracking algorithms that deliver real-time performance and robustness due to
changes in the environment and the tracked entities themselves. Given the substantial vol-
ume of goods and individuals requiring tracking within logistics warehouses, some of the
targets may move randomly, making it difficult for traditional machine vision algorithms to
identify targets accurately. The newly developed object-recognition algorithms, grounded
in deep learning, are able to automatically learn and extract features. Moreover, they can
be adapted to complex scenarios, rendering them particularly well-suited for intelligent
monitoring applications in logistics warehouses [6,7].
This study aims to provide an effective object-tracking approach within smart logistics
warehouses. Additionally, it strives to formulate an object-tracking system tailored to logis-
tics scenarios. The former provides auxiliary means for monitoring the operation status of
logistics warehouses, ensuring smooth functioning and safety management. This approach
aligns with the notion of the human-centric logistics production management concept. The
latter serves as an illustrative model for the management method of sustainable and flexible
industrial logistics systems. It also offers a pertinent method of object tracking to facilitate
human–machine interaction in virtual reality. In sum, this study resonates with the theme
of “Sustainable, Human-Centered/Centric and Resilient Production and Logistic Systems
Design and Management”.
The specific work is as follows.
1. Three prominent deep learning detection algorithms—Faster R-CNN, SSD, and
YOLO—are briefly introduced and compared. The YOLO algorithm is chosen to
analyze the logistics warehouse moving-object-tracking problem.
2. A dedicated dataset was meticulously crafted based on the logistics warehouse sce-
narios. The quality and quantity of the dataset were optimized, yielding enhanced
outcomes in object detection specific to logistics warehouses.
3. The principle and process of DeepSORT algorithms are introduced. Finally, an evalua-
tion metric for the effectiveness of logistics warehouse object tracking is established
based on the MOT Challenge multi-object-tracking competition. This metric facilitates
quantitative assessment, ultimately offering supplementary tools for diminishing the
necessity of manual video surveillance within logistics settings.

2. Related Work
The key to object tracking and providing auxiliary means to eliminate manual monitor-
ing in logistics warehouse scenarios hinges on the selection of object-tracking algorithms. In
pursuit of this goal, state-of-the-art object-detection algorithms are compared for selecting
the best performer among them.

2.1. Faster R-CNN (Faster Region-Based Convolutional Neural Network)


Based on Fast R-CNN, Shaoqing Ren et al. [8] proposed a new Faster R-CNN algorithm
in 2017, which takes advantage of CNNs to enhance the procedure of the candidate region
generation for detection targets. Notably, the architecture strategically integrates the feature
extraction network with the candidate frame-generation network, employing the same
convolutional layer network. This astute design choice greatly reduces the time of the
whole-image object-detection process. Such an idea of using CNNs for image feature
extraction is also reflected in other algorithms. The operational process of the Faster
R-CNN algorithm is illustrated in Figure 1.
The process of Faster R-CNN is as follows:
• The input image undergoes processing via R-CNN’s convolutional layers to extract
image features, resulting in a feature map comprised of feature vectors that capture
essential information from the image.
• The feature vectors are subsequently channeled into the Region Proposal Network
(RPN) to generate candidate boxes.
• The dimensions of candidate boxes are determined by the pooling layer (ROI pooling),
which enables the classification of candidate regions, using a support vector machine
(SVM), to identify the target category.
• A precise regression is performed on the classified target boxes.

Figure 1. Faster R-CNN algorithm process.
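To make the two-stage pipeline above concrete, the short Python sketch below runs an off-the-shelf Faster R-CNN from torchvision on a single image. It only illustrates the detection flow; the pretrained reference weights, the torchvision ≥ 0.13 weights API, the file name warehouse_frame.jpg, and the 0.5 score threshold are assumptions and are not the configuration used in this paper.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Reference two-stage detector: backbone features + RPN proposals + ROI heads.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Hypothetical warehouse frame; any RGB image works.
image = to_tensor(Image.open("warehouse_frame.jpg").convert("RGB"))

with torch.no_grad():
    # Feature extraction, region proposal, ROI pooling, classification and
    # box regression all happen inside this single call.
    pred = model([image])[0]

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score >= 0.5:  # keep only confident detections
        print(int(label), float(score), [round(v, 1) for v in box.tolist()])
```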
2.2. SSD (Single Shot MultiBox Detector)

Wei Liu [9] proposed a new object-detection algorithm known as SSD in 2015. Operating within the realm of single-stage detection algorithms, SSDs accomplish target localization as well as classification simultaneously.

SSD can be divided into SSD 300 and SSD 512, according to the size of the input image. The following is the basic operation process of the SSD algorithm, with SSD 300 serving as the exemplar:
• The input image (size of 300 pixels × 300 pixels) is fed into a pre-training network (VGG-16) to obtain different feature mappings.
• The feature vectors of Conv4_3, Conv7, and Conv8_2 are employed to establish object-detection boxes at different scales, concurrently performing their classification.
• The NMS (non-maximum suppression) algorithm is employed to eliminate redundant and inaccurate detections. The final detection outcome is then generated, consisting of the most pertinent and trustworthy bounding boxes for the targeted objects.

2.3. YOLO (You Only Look Once)

The YOLO algorithm proposed in 2016 was designed as a new network for object detection from end to end. Its unique design approach increases the speed of the entire network operation by casting the object detection entirely into a regression problem, spanning from input-image segmentation to the classification of targets in each grid [10].

As shown in Figure 2, the process of YOLO is as follows:
• The input image is divided into several grids, and each grid takes responsibility for detecting potential targets within its confines.
• The image of each grid is convolved to obtain the positions of the bounding boxes and the corresponding confidences for multiple targets.
• The final prediction box for each grid's multiple targets forecast is obtained using non-maximum suppression (NMS), which is the result of object detection.
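As a concrete counterpart to the single-stage process above, the sketch below loads a pretrained YOLOv5s model through PyTorch Hub and runs it on one frame. The ultralytics/yolov5 hub entry point refers to the public release, not the retrained model of this paper, and the file name warehouse_frame.jpg is only a placeholder.

```python
import torch

# Pull the small pretrained YOLOv5 variant from the public repository.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.5  # confidence threshold for kept detections

# Single forward pass: grid-based prediction plus NMS happen inside the model.
results = model("warehouse_frame.jpg")  # placeholder image path

# Each row: x1, y1, x2, y2, confidence, class index.
print(results.pandas().xyxy[0])
```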
Figure 2. YOLO algorithm process.
2.4. Comparison of the Three Detection Algorithms

Both YOLO and SSD algorithms are single-stage algorithms that are significantly faster than the Faster R-CNN, so they should be preferred in warehouse scenarios that require real-time detection.

The YOLO algorithm has progressed to its fifth version as of the time of writing. YOLOv5 has made further improvements on the previous version, with increased detection accuracy and speed. Among its variants, YOLOv5s maintains excellent detection capability within a compact size (7.3 MB), laying the foundation for prospective explorations into hardware platforms for cost-effective and capacity-efficient logistics warehouse object tracking [11].

The DSSD513 algorithm, an improved iteration derived from the SSD algorithm, exhibits a marked improvement in detection accuracy but diminished detection speed. The three algorithms—YOLOv3-320, YOLOv5, and DSSD513—are compared using the COCO dataset to determine the most suitable one for the logistics warehouse object-tracking scenario. The outcome of the comparison is shown in Table 1.
Table 1. Comparison of the three detection algorithms on the COCO.

Algorithms      mAP@0.5/percent      Delay Time/ms
YOLOv3-320      51.5                 22
YOLOv5s         55.4                 2
DSSD513         53.3                 156

In Table 1, mAP [12] refers to the mean Average Precision (mAP), which measures the accuracy of the object detection. The delay time refers to the time taken by the algorithm to execute the detection process, signifying the processing speed of the algorithm. The delay time is a critical performance metric in real-time applications.

As seen in Table 1, YOLOv5s combines both detection accuracy and speed compared to YOLOv3-320 and DSSD513. It has a compact volume size of only 7.3 MB, which ensures its applicability to real-world scenarios. Therefore, YOLOv5 is chosen as the algorithm for object detection in the logistics warehouse scenario.

2.5. Object-Tracking Algorithms

Object-tracking algorithms play an important role in warehouses. Object-tracking technology offers a valuable way for warehouse managers to ascertain the location and status of items in the warehouses more conveniently, improving operational efficiency. Object-tracking technology could be implemented in various ways. For instance, Zhan [13] combined the Industrial Internet of Things (IIoT) and Digital Twin (DT) for the unsupervised management of cold-chain logistics warehouses, ensuring occupational safety and health (OSH). These technologies provide higher efficiency and a foundation for the development of unmanned warehouses. However, a convenient and cost-effective means of object tracking is needed that also strikes a balance with practical applications. Therefore, object-tracking technology based on image object detection has gained attention [14]. DeepSORT is widely used for tracking people [15] and ordinary items [16], as a classic technology for target tracking. Previous studies have effectively harnessed the DeepSORT algorithm combined with other technologies to solve indoor object-tracking problems. For instance, Jang [17] designed a lightweight indoor object-tracking method by introducing DeepSORT in scenarios with partially overlapping fields of view (FOVs). Therefore, this article harnesses the potential of DeepSORT to design an object-tracking method for use within logistics warehouses.

3. Logistics Warehouse Object Detection

As stated above, YOLOv5 has been chosen as the detection algorithm for logistics warehouses. Its network structure is illustrated first.

3.1. YOLO Algorithm Network Structure

The YOLO algorithm, known as You Only Look Once, employs a single-stage detection approach inspired by the human visual system. The YOLO algorithm swiftly encompasses the whole-image data for prediction to avoid confusing the target with the background. In some cases, the YOLO detection model struggles with targets that occupy a small image size. Compared to offline detection algorithms, the YOLO algorithm might not exhibit the highest detection accuracy, but the detection accuracy and detection speed should be taken into account in practical applications [18].

As shown in Figure 3 [19], YOLOv5s consists of four main components: Input, Backbone, Neck, and Prediction, where the Backbone convolves and pools the input image to obtain the image feature vector; the Neck transfers the extracted image features to the Prediction part through sampling; and the Prediction part makes predictions based on the image features to obtain the target bounding boxes and prediction classification. There are some basic composite network structures in the network structure, as stated separately below [20].

Figure 3. YOLOv5s network structure.

3.1.1. CBL (Convolutional Operation, Batch Normalization Algorithm, and Leaky ReLU Activation Function)

The convolutional operation, the batch normalization algorithm, and the leaky ReLU activation function compose the CBL module, which mainly plays the role of data sampling. The structure is shown in Figure 4.

Figure 4. CBL structure.

3.1.2. Res Unit

The Res unit learns from the residual structure in the ResNet network, which incorporates the original input with the output after two CBL operations to form the residual network structure. This approach makes the structure deeper and avoids gradient vanishing [21]. The structure is shown schematically in Figure 5.

Figure 5. Res unit structure.

The residual structure means that the input x of layer i − 1 is superimposed directly on the output I(x) of layer i. The output of layer i is then f(x) = x + I(x). The residual structure can be interpreted as a distribution of inputs over different layers of the network. The distribution of input sources makes the whole structure more rational without invalid inputs. The emergence of the residual structure allows the neural network structure to become deeper, and the multi-layer input structure allows the gradient to be updated with the stable gradient [21].

3.1.3. CSP1_X

The CSP1_X structure shown in Figure 6 learns from the network structure of CSPNet. The CSPNet structure unites the features of all input layers to maximize the number of branch inputs used to enhance the learning capability of the network. The CSPNet splits the inputs, which also drives a reduction in computation.

Figure 6. CSP1_X structure.

CSP1_X divides the input into two parts. One part undergoes CBL, multiple residual operation structures, and convolutional operations; the other part directly undergoes convolutional operations. Finally, the two parts are concatenated, which refers to concatenation in neural networks. This approach generally fuses features extracted from multiple frames. After the concatenation, BN, the leaky ReLU activation function, and CBL are connected for sampling [22].

3.1.4. CSP2_X

CSP2_X, as shown in Figure 7, is similar to CSP1_X, using CBL to replace the Res unit.

Figure 7. CSP2_X structure.

3.1.5. SPP (Spatial Pyramidal Pooling)

SPP is known as Spatial Pyramidal Pooling, as shown in Figure 8. Its function is resizing the input feature maps to a fixed size. In CNN, SPP is commonly used to solve problems that include inappropriately sized images.

Figure 8. SPP structure.

SPP obtains features through Maxpool, and then merges the features with concatenation to obtain a fixed size feature. In addition, YOLOv5s adds a CBL module for data sampling enhancement before and after the pooling layer [23].

3.2. Dataset and Training

3.2.1. Dataset

The quality and quantity of the dataset is critical for deep learning detection models. In the context of the logistics warehouse scene under investigation, the first dataset included 273 randomly selected videos on websites, depicting workers and parcels. However, the object-detection performance of the trained model was poor. Therefore, a logistics warehouse operated by a company was used as the research scene, as shown in Figure 9. There, 790 photos were captured, which included goods and workers, encompassing diverse angles. Light blocking conditions were employed for obtaining complete data.

Figure 9. Photos in the logistics warehouse scene: (a) parcels; (b) worker.

The original image set was annotated using the software called Labelimg in txt format [24]. YOLOv5 changed the annotation of the dataset from xml format to txt format, with five parameters per line in the txt file, which is shown in Table 2.
Table 2. YOLOv5 dataset annotation format.

Object_Class x_Center y_Center Width Height


0 0.85 0.29 0.27 0.12

• To ensure the robustness of the model, 100 photos from the first dataset are intention-
ally incorporated as a supplement to training.
• object_class indicates the type of object detected, which is usually numbered by a non-negative integer starting from zero.
• x_center and y_center indicate the coordinates of the center of the target bounding box
(normalized, i.e., divided by the width and height of the whole image).
• width and height indicate the width and height of the target bounding box (normalized,
i.e., divided by the width and height of the whole image).
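A small helper clarifies how a pixel-space box maps to the five normalized values above. It is an illustrative sketch only; the corner-format input and the assumed 1000 × 1000 pixel image size are chosen so that the output reproduces the example row in Table 2.

```python
def to_yolo_line(object_class, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box into one line of YOLOv5 txt annotation."""
    x_center = (x_min + x_max) / 2.0 / img_w   # normalized box-center x
    y_center = (y_min + y_max) / 2.0 / img_h   # normalized box-center y
    width = (x_max - x_min) / img_w            # normalized box width
    height = (y_max - y_min) / img_h           # normalized box height
    return f"{object_class} {x_center:.2f} {y_center:.2f} {width:.2f} {height:.2f}"

# Hypothetical parcel box (class 0) in an assumed 1000 x 1000 pixel image:
print(to_yolo_line(0, 715, 230, 985, 350, 1000, 1000))
# -> "0 0.85 0.29 0.27 0.12", matching the example row in Table 2.
```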

3.2.2. Training Process


In order to improve the object-detection model’s performance for goods and workers
across various scales, the training process incorporates the multi-scale transformation
strategy that comes with YOLOv5. The training configuration environment for the YOLOv5-
based object-detection model is shown in Table 3.
Table 3. Training configuration environment.

CPU                    Graphic Card              Operating System    Software Platform       Training Framework
Intel i5-8300H/8 GB    NVIDIA GTX 1050Ti/4 GB    Windows             CUDA11.4, CUDNN8.2      Pytorch

Based on the training configuration environment in Table 3, the parameters of the YOLOv5 training function were set as shown in Table 4. The workers parameter refers to the number of CPU threads used for data loading during YOLOv5 model training. The workers parameter was set to 0 after several tries, as we had insufficient resources for pre-training.
Table 4. Training function parameter settings.

Batchsize Epochs Imgsz Label-Smoothing Workers


4 300 640 0.1 0

The training framework was Pytorch, a GPU-accelerated neural network framework that makes code easy to understand and debug.
The YOLOv5s.pt model was selected for training, and the training process lasted around 24 h. The training comprised 300 epochs, with each batch consisting of 4 randomly selected images from the training set.
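One plausible way to apply the settings of Table 4 with the public ultralytics/yolov5 repository's train.py entry point is sketched below. The local repository clone, the dataset file warehouse.yaml, and the flag spelling are assumptions; the authors' exact invocation is not given in the paper.

```python
import subprocess

# Launch YOLOv5s training with the Table 4 parameters (assumed CLI of the
# public ultralytics/yolov5 repository, run from a local clone).
subprocess.run(
    [
        "python", "train.py",
        "--weights", "yolov5s.pt",      # pretrained YOLOv5s checkpoint
        "--data", "warehouse.yaml",     # placeholder dataset definition
        "--img", "640",                 # imgsz
        "--batch-size", "4",            # batchsize
        "--epochs", "300",
        "--label-smoothing", "0.1",
        "--workers", "0",               # no extra data-loading threads
    ],
    check=True,
)
```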
The variation of the YOLOv5 loss function during training is shown in Table 5. The variation of train loss, validation loss and mAP@0.5 is shown in Figure 10. From Table 5 and Figure 10, it can be seen that the loss function loss, box_loss, obj_loss and cls_loss gradually became smaller as training progressed. The loss function loss tended to converge at the end of training.
Table 5. Loss function during training.

Batch 100 200 300


loss 0.031857 0.020654 0.016067
box_loss 0.016709 0.009924 0.008079
obj_loss 0.014739 0.010567 0.008079
cls_loss 0.000409 0.000163 0.000063

Figure 10. (a) Train/Validation loss–Epoch curve; (b) mAP@0.5–Epoch curve.

3.3. Object-Detection Test

3.3.1. First Training Detection Model

The first training dataset consists of 273 photographs of actual logistics warehouse scenarios, including 50 images of goods, 50 images of staff and 173 images of staff and goods. The 273 images were fed into the network for training to obtain the detection model. Three test videos were randomly searched for in a logistics scene. To validate the dataset's dependability, each video contained frames of staff, goods and the background. The results of the first logistics warehouse object-detection test are shown in Figure 11.

Figure 11. Test results for the three videos: (a) Video 1 test results; (b) Video 2 test results; (c) Video 3 test results.

The trained model encountered challenges when detecting a target that was a relatively small part of the whole-image bounding box. Along with a continuous video stream, it tended to lose the tracking frame when the target was obscured. There were also many cases of false positives—for example, the detection of non-goods objects as targets. In cases where goods and workers were obscured from each other, the detection bounding boxes often appeared for only one of the two elements.

In the labelling, the authentic-detection bounding box of the workers was defined as the smallest rectangular box that could enclose the worker's clothing, hair, etc. However, it was difficult for the detection model to enclose the entire area of the worker when parts of the worker were concealed. When it came to detecting goods, the model often struggled to detect the individual items based on their features, especially in scenarios where multiple goods were stacked on each other.

To address the issues above, another dataset was established. The second training incorporated 790 photos from a fixed scene (i.e., the logistics warehouses of Company S) as the dataset for the detection of parcels and workers. It changed the situation of the relatively small size of datasets and a large number of diverse scenes.

3.3.2. Second Training Detection Model

The second dataset used for training is a set of 790 photos taken from the same logistics warehouses (i.e., Company S's logistics warehouses). The 790 images consisted of 100 images of specific goods (parcels with adhesive tape), 100 images of workers and 590 images of spatial interference between workers and the goods. The test results of this object detection are shown in Figure 12.

Figure 12. Test result.

As can be seen from Figure 12, the detection of obscured targets was significantly improved, and mAP@0.5 reached 0.976. At the same time, the model was still able to detect targets with a relatively small size. Compared with the first training, the second training was improved dramatically, and the trained model detected the targets individually even when the goods blocked each other.

As for the proposed tracking task, a continuous-detection bounding box for the target, as stated above, was not enough; the tracking model needed to maintain the stable invariance of the tracking IDs. In the tracking task, each target was given a unique ID, and keeping the ID constant when the target reappeared was a vital metric for tracking performance reference [25]. To achieve the tracking task, we combined the detection model with DeepSORT to preserve the continuity of target IDs.

4. Logistics Warehouse Target Tracking

YOLOv5 and DeepSORT were combined to assign IDs to the tracked targets in the video stream and maintain the stability of the IDs. Originating from the SORT algorithm, DeepSORT has become a tracking algorithm combining Kalman filtering and Hungarian matching, as described below.

4.1. DeepSORT

DeepSORT differs from SORT mainly in defining a new state for target tracks and matching the original detection result with the prediction by judging the state of the tracks [26]. At the same time, tracks with distinct states can be judged by the quantities of matching, which either increase or decrease. The DeepSORT algorithm process is shown in Figure 13 [27,28].

Figure 13. DeepSORT process.

The DeepSORT algorithm process is introduced as follows.
• The existing tracks are divided into confirmed and unconfirmed states.
• The tracks with a confirmed state are cascade-matched with the detection result. The priority of tracks depends on the time that the tracks have existed in the video stream and the number of successful matching attempts. A track that has just been successfully matched has a higher priority for cascade matching [29].
• The process of cascade matching:
1. The cosine distance and cost matrix between the track feature and the depth feature of the detection are computed.
2. The Mahalanobis distance between the prediction bounding boxes and the detection bounding boxes is computed in the cost matrix.
3. The previously obtained input cost matrix is matched with the Hungarian algorithm to obtain the final pairing result.
• The unconfirmed state tracks are formed into a new set together with the tracks that were not paired in the second step. The set is matched with the detected bounding boxes that have not been paired, using Hungarian matching with a 1-IOU cost matrix.
• The paired tracks update the Kalman filter information based on the detection bounding boxes' information.
1. If the tracks are paired more than three times, they will change from unconfirmed to confirmed. Tracks lacking pairings will be removed from the set of unconfirmed tracks if the number of consecutive pairing failures exceeds the setting value.
2. Unpaired tracks will also be removed if they have not had a successful pairing previously.
3. If there is an unpaired detection bounding box, a new target track will be created for it.

4.2. Object-Tracking Test

DeepSORT requires the outcomes of other object-detection algorithms as its input, so YOLOv5 is employed as the input object-detection algorithm. The test video from the first dataset training for YOLOv5 was selected randomly. Figure 14 shows the comparison of DeepSORT and YOLOv5.
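The sketch below shows one common way to wire a YOLOv5 detector into a DeepSORT-style tracker using the third-party deep-sort-realtime package. The package, its DeepSort class, and the video path are assumptions for illustration; they are not the toolchain reported by the authors.

```python
import cv2
import torch
from deep_sort_realtime.deepsort_tracker import DeepSort  # assumed third-party tracker

detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
tracker = DeepSort(max_age=30)  # frames a lost track is kept before deletion

cap = cv2.VideoCapture("warehouse_test_video.mp4")  # placeholder test video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # YOLOv5 supplies per-frame detections: (x1, y1, x2, y2, conf, cls).
    det = detector(rgb).xyxy[0].tolist()
    # DeepSORT expects ([left, top, width, height], confidence, class) per detection.
    ds_input = [([x1, y1, x2 - x1, y2 - y1], conf, int(cls))
                for x1, y1, x2, y2, conf, cls in det]
    # Kalman prediction, cascade matching and ID management happen inside update_tracks.
    for trk in tracker.update_tracks(ds_input, frame=frame):
        if trk.is_confirmed():
            print(trk.track_id, trk.to_ltrb())  # stable ID plus current box
cap.release()
```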
Figure 14. Comparison of DeepSORT with YOLOv5: (a) DeepSORT test results; (b) YOLOv5 test results.

As can be seen from Figure 14, DeepSORT adds the IDs to the detection results of YOLOv5. The observation of DeepSORT throughout the test video showed that the tracking target ID remains relatively constant in the cases where the tracking target reappears. Notably, there is an absence of ID switching among multiple targets in the video stream, maintaining the stability of the tracking IDs (as stated by Lin et al. [30]). However, it is challenging to evaluate the superiority of the two algorithms' video tracking effects solely by the human eye. Therefore, the evaluation metrics of the logistics warehouse multi-object tracking were derived to provide a quantitative evaluation.

4.3. Object-Tracking Evaluation Metrics

The object-tracking evaluation metrics utilized in our study are derived from the MOT Challenge competition [31]. The main object-tracking evaluation indicators of the MOT Challenge competition are MOTA (Multi-Object Tracking Accuracy), MOTP (Multiple Object Tracking Precision), Rcll (Recall), MT (Mostly Tracked), ML (Mostly Lost), IDF1 (Identification F-Score), IDP (Identification Precision), IDR (Identification Recall), IDSW (ID Switch), etc., as shown in Figure 15. The MOT Challenge competition employs a set of comprehensive evaluation metrics to evaluate object-tracking performance. These metrics are as shown in Figure 15.
(1) MOTA
Figure 15. Evaluation indicators for the MOT Challenge competition.
The formula for calculating MOTA is as follows:

$MOTA = 1 - \frac{FN + FP + IDSW}{GT}$    (1)
GT (ground truth) signifies the accurate target bounding box information of the
tracking target, which is often obtained by manual annotation or automatic annotation
software in practical problems. MOTA quantifies the proportion of all ground-truth target information that remains after discounting false negatives (FN), false positives (FP), and ID switches (IDSW).
FN represents the true samples that are identified by the tracking algorithm as false
samples; FP represents the false samples that are identified by the tracking algorithm as
true samples [32]. The samples here can be interpreted as the tracking bounding boxes of
the target for each frame in the video stream. IDSW represents the number of times that
the same tracking target ID undergoes changes during the tracking process.
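To make Equation (1) concrete, the following minimal Python sketch computes MOTA from accumulated error counts; the counts in the example are hypothetical and simply illustrate the arithmetic.

```python
# Minimal sketch of Equation (1): MOTA from accumulated error counts.
# In practice, FN, FP, IDSW and GT are accumulated by matching the tracker's
# output against the annotated ground truth frame by frame.

def mota(fn: int, fp: int, idsw: int, gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (fn + fp + idsw) / gt

# Hypothetical counts for a short video clip.
print(f"MOTA = {mota(fn=120, fp=80, idsw=5, gt=1000):.3f}")  # MOTA = 0.795
```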
(2) MOTP
MOTP is a metric employed to evaluate the effectiveness of detection during tracking.
The calculation formula is as follows.
$MOTP = \frac{\sum_{t,j} D_{t,j}}{\sum_{t} c_{t}}$    (2)
where t signifies the frame index in the tracked video stream; the variable c_t corresponds to the count of successful pairings between the tracked tracks and the Ground Truth (GT) tracks at frame t. D_{t,j} denotes the distance between the two successfully paired tracks at frame t (generally calculated by Euclidean distance or IoU; Euclidean distance is employed in this paper). The better the detection effect, the shorter the Euclidean distance
between the paired tracks. An increased number of successful pairings results in a higher
MOTP value.
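A corresponding sketch for Equation (2) is given below; the per-frame distance lists are hypothetical and stand in for the Euclidean distances between matched track/GT pairs.

```python
# Minimal sketch of Equation (2): MOTP as the total pairing distance divided by
# the total number of successful pairings. distances[t] holds D_{t,j} for the
# pairs matched at frame t (hypothetical values).

distances = [
    [3.2, 4.1],        # frame 1: two matched track/GT pairs
    [2.8],             # frame 2: one matched pair
    [3.5, 2.9, 4.4],   # frame 3: three matched pairs
]

total_distance = sum(sum(frame) for frame in distances)  # sum over t and j of D_{t,j}
total_pairs = sum(len(frame) for frame in distances)     # sum over t of c_t
print(f"MOTP = {total_distance / total_pairs:.3f}")      # MOTP = 3.483
```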
(3) Rcll
Rcll gauges the proportion of successful target tracking, calculated by dividing the
number of successfully paired targets by the number of real targets.
(4) MT and ML
MT indicates the number of tracks that have been successfully paired for over 80%
of the total frames during the whole tracking process; ML indicates the number of tracks
that have been successfully paired for less than 20% of the total frames during the whole
tracking process.
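The two thresholds can be illustrated with a short sketch; the per-track coverage ratios below are hypothetical.

```python
# Minimal sketch of MT / ML: classify each track by the fraction of its
# ground-truth frames in which it was successfully paired (hypothetical ratios).

coverage = {1: 0.95, 2: 0.83, 3: 0.45, 4: 0.12}  # track ID -> paired-frame ratio

mt = sum(1 for r in coverage.values() if r > 0.8)  # Mostly Tracked: over 80% of frames
ml = sum(1 for r in coverage.values() if r < 0.2)  # Mostly Lost: under 20% of frames
print(f"MT = {mt}, ML = {ml}")                     # MT = 2, ML = 1
```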
(5) IDP
IDP is similar to the calculation of precision. Precision is defined as the proportion of true positive samples among all predicted positive samples, while IDP refers to the proportion of the tracker's ID assignments that are correct.
(6) IDR
IDR is akin to the calculation of recall. Recall refers to the proportion of ground-truth positive samples that receive a positive prediction, while IDR refers to the proportion of ground-truth detections to which the tracking algorithm has correctly assigned the target ID.
(7) IDF1
IDF1 is then the harmonic mean of IDP and IDR. The calculation formula is as follows.

$IDF1 = \frac{2}{\frac{1}{IDP} + \frac{1}{IDR}}$    (3)
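The three identification metrics can be related in a short sketch; the ID-level counts below (correct, spurious, and missed ID assignments) are hypothetical.

```python
# Minimal sketch of Equation (3): IDP, IDR and their harmonic mean IDF1.
# idtp/idfp/idfn are hypothetical ID-level counts: correctly assigned IDs,
# wrongly assigned IDs, and ground-truth detections left without a correct ID.

idtp, idfp, idfn = 820, 140, 180

idp = idtp / (idtp + idfp)        # identification precision
idr = idtp / (idtp + idfn)        # identification recall
idf1 = 2 / (1 / idp + 1 / idr)    # harmonic mean, Equation (3)
print(f"IDP = {idp:.3f}, IDR = {idr:.3f}, IDF1 = {idf1:.3f}")
```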

(8) FM
FM refers to the number of tracked target tracks that are mistakenly terminated by the tracker in the video stream.
(9) IDSW
IDSW serves as a distinctive form of FM, so the value of FM is generally larger than IDSW.
4.4. Object-Tracking Model Testing
4.4.1. Data Format
A gt.txt file is commonly used in MOT datasets to record information about each tracking frame in the video stream [33]. The Darklabel software is used to annotate the video from Company S's logistics warehouses to obtain a gt.txt file, as shown in Figure 16.

Figure 16. gt.txt file screenshot.
• The first bit of each line represents the current video frame.
• The second bit indicates the ID number of the target track.
• The third to sixth bits are four parameters: the third and fourth bits represent the horizontal and vertical coordinates of the upper-left corner of the object-tracking frame, and the fifth and sixth bits indicate the width and height of the tracking frame.
• The seventh bit indicates the current status of the target track, with 0 being inactive and 1 being active.
• The eighth bit indicates the target type of the track.
• The ninth bit indicates the visibility ratio of the object-tracking frame, which is related to the interference between this frame and other frames.

A det.txt file is a counterpart to a gt.txt file, which records the object-tracking frame information obtained using the tracking algorithm. The det.txt file has the format shown in Figure 17.

Figure 17. det.txt file screenshot.

The first to sixth parameters in each line of det.txt files have essentially the same meaning as those in gt.txt files. The seventh parameter indicates the confidence level of the tracked target, and the eighth to tenth parameters are related to 3D target tracking.
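To make the field layout concrete, the sketch below parses one comma-separated line of a gt.txt file according to the bit-by-bit description above; the dictionary keys are our own labels, and the example line is hypothetical.

```python
# Minimal sketch: parse one line of a MOT-style gt.txt file.
# The field order follows the description above; in det.txt the seventh field
# holds the detection confidence instead of the active flag.

def parse_gt_line(line: str) -> dict:
    parts = line.strip().split(",")
    return {
        "frame": int(parts[0]),         # 1st: current video frame
        "track_id": int(parts[1]),      # 2nd: ID number of the target track
        "left": float(parts[2]),        # 3rd-4th: upper-left corner of the tracking frame
        "top": float(parts[3]),
        "width": float(parts[4]),       # 5th-6th: width and height of the tracking frame
        "height": float(parts[5]),
        "active": int(parts[6]),        # 7th: 0 = inactive, 1 = active
        "class_id": int(parts[7]),      # 8th: target type of the track
        "visibility": float(parts[8]),  # 9th: visibility ratio
    }

# Hypothetical annotation line: frame 1, track 2, a 55x160 box at (912, 484), active, type 1, fully visible.
print(parse_gt_line("1,2,912,484,55,160,1,1,1.0"))
```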
4.4.2. Tracking Result
The test video was annotated using Darklabel to obtain a gt.txt file. The det.txt file was generated from the two datasets. The outcomes of the det.txt file were derived through the YOLOv5 detection model combined with DeepSORT. Table 6 shows the tracking outcomes of the first and second tests measured by MOT evaluation metrics, respectively.
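The metrics in Table 6 are obtained by comparing each det.txt file against the annotated gt.txt file. One common way to perform such a comparison is the open-source py-motmetrics package, sketched below under the assumption that both files follow the MOT Challenge text layout; this is an illustrative route, not necessarily the exact tooling used in this study.

```python
# Sketch: score a det.txt file against gt.txt with the py-motmetrics package (assumed available).
import motmetrics as mm

gt = mm.io.loadtxt("gt.txt", fmt="mot15-2D")    # ground-truth tracks; fmt may need to match the column layout
det = mm.io.loadtxt("det.txt", fmt="mot15-2D")  # tracker output from YOLOv5 + DeepSORT

# Match tracker boxes to ground-truth boxes frame by frame (IoU-based association here).
acc = mm.utils.compare_to_groundtruth(gt, det, "iou", distth=0.5)

mh = mm.metrics.create()
summary = mh.compute(
    acc,
    metrics=["mota", "motp", "recall", "mostly_tracked", "mostly_lost",
             "idf1", "idp", "idr", "num_switches", "num_false_positives",
             "num_misses", "num_fragmentations"],
    name="warehouse_test",
)
print(mm.io.render_summary(summary, formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```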
Table 6. The first and second tracking results.

Times    MOTA    MOTP    Rcll    MT   ML   IDF1    IDP     IDR     IDSW   FP     FN     FM
First    63.7%   0.131   86.6%   23   0    75.5%   72.4%   78.8%   56     1858   1120   129
Second   75.8%   0.118   87.8%   25   0    83.8%   84.0%   83.5%   26     973    1022   96

The change in the values of the aforementioned metrics demonstrates a notable improvement in the tracking results. Additionally, upon visual inspection of the tracking videos frame by frame, it becomes evident that the tracking target ID stability from the second dataset has also significantly improved.

4.5. Transfer Learning for DeepSORT
Trained object-tracking models often need to be implemented in different scenarios and environments in practical applications, which requires high transferability and generalization of the trained models. To this end, transfer learning could be employed to retrain the model on a new dataset, enhancing the object-tracking performance. Transfer learning [34] is a machine learning technique that facilitates the transfer of knowledge from a model trained on one task to a model for another, related task. The steps to perform transfer learning on the trained DeepSORT model are as follows:
• Prepare the dataset: The new dataset needs to be organized, ensuring it is representative and diverse.
• Transfer learning strategy:
  • Feature extraction: The majority of the pre-trained DeepSORT model's weights need to be retained while solely adjusting the output layer to suit the new task. It is necessary to freeze the weights of the initial layers to maintain low-level feature extraction capabilities.
  • Fine-tuning: In addition to modifying the output layer, fine-tuning some of the lower-level weights to adapt to the new task is considered, allowing certain lower-level weights to undergo slight adjustments.
• Model training: Training the adjusted model using the new dataset.
• Model evaluation: Evaluate the retrained model and compute evaluation metrics to assess the model's performance on the new task.
• Refinement and iteration: Further refining the model based on the evaluation results and adjusting hyperparameters or fine-tuning the model network.
The transfer learning process for DeepSORT is shown in Figure 18; a minimal fine-tuning sketch is given after the figure.

Figure 18. Transfer learning for DeepSORT.
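As a rough illustration of the feature-extraction and fine-tuning strategy outlined above, the PyTorch sketch below freezes the early layers of an appearance-embedding backbone and retrains the later layers together with a new output head. The ResNet-18 backbone, layer names, identity count, and dummy batch are stand-ins for illustration only; the actual DeepSORT Re-ID network and training loop depend on the implementation used.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in appearance (Re-ID) backbone; assumed here only for illustration.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature extraction: freeze the early blocks to preserve low-level features.
for name, param in model.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1", "layer2")):
        param.requires_grad = False

# Adjust the output layer to the number of identities in the new dataset (hypothetical count).
num_new_identities = 128
model.fc = nn.Linear(model.fc.in_features, num_new_identities)

# Fine-tune only the trainable parameters with a small learning rate.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative optimization step on a dummy batch; a real run would iterate
# over the organized new dataset and then re-evaluate with the MOT metrics.
images = torch.randn(8, 3, 128, 64)                 # pedestrian-style crops (assumption)
labels = torch.randint(0, num_new_identities, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"fine-tuning step loss: {loss.item():.3f}")
```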
5. Discussion
In the absence of relevant research and open-source datasets for logistics packages,
this study has constructed a logistics warehouse image dataset, which was applied to the
combined algorithm of YOLOv5 and DeepSORT to achieve object tracking within logistics
warehouse scenarios. The combined algorithm was numerically evaluated by the MOT
Challenge evaluation metrics. The study has achieved good tracking performance and pro-
vided a validation approach for the effectiveness of eliminating manual video surveillance
in logistics scenarios. This accomplishment signifies an effective smart monitoring ap-
proach for “Sustainable, Human-Centered/Centric and Resilient Logistic Systems Design
and Management”.
The combination of YOLOv5 and DeepSORT is specifically designed to cater to human-
centric logistics scenarios, where real-time object detection and multi-object tracking are
essential. The following are some applications:
• One primary application is the determination of warehouse zone occupancy. By accu-
rately tracking the movement of objects in different zones, real-time occupancy data
could be easily garnered, which optimizes inventory management and streamlines
logistics. Additionally, it facilitates the prediction of space utilization trends, enabling
the proactive allocation of resources.
• The method contributes to safety protocols within warehouses. By continually tracking
the trajectories of objects and employees, potential hazards and unsafe practices can
be identified promptly, minimizing the risk of accidents.
• The method’s versatility extends to inventory management, offering insights into
warehouse stock movement and depletion rates, and preventing overstock situations.
In conclusion, the presented object-tracking method holds immense potential for
revolutionizing warehouse operations. Its applications include occupancy determination,
safety monitoring, efficiency enhancement, and inventory management, contributing to
the optimization and development of future advancements in human-centric logistics
warehouse environments. Such a combination also provides value-added support by
enabling data-driven decision-making to optimize resource allocation for enterprises.
However, it is pertinent to acknowledge that the computational demands of the
proposed method could increase significantly with a larger number of objects and types
of employees. This might necessitate more distributed computing resources, which could
impact the feasibility of real-time tracking. The proposed algorithm may need to be adapted
to handle a larger volume of data while maintaining reasonable processing times. A
careful balance between algorithmic sophistication, computational efficiency, and practical
considerations may be required to address these inherent limitations and trade-offs.

6. Conclusions
This research focuses on the design of a moving-object-tracking scheme for logistics
warehouses. It primarily encompasses the design of a deep learning-based detection
algorithm for logistics warehouses and the design of experiments for tracking moving
targets in logistics warehouses. The main contributions of the dataset, algorithm, and
evaluation method are described below:
• In order to solve the problem of moving-object tracking in logistics warehouses, a
specific company was selected as the sampling object to make the dataset. The dataset
encapsulated tracking outcomes for both goods and workers in video streams. The
tracking model was further strengthened and applied to most logistics warehouse
scenarios through the optimization of the dataset.
• YOLOv5s was selected as the base model on account of both detection accuracy
and detection speed. Despite incorporating the DeepSORT network for multi-object
tracking, the tracking speed remained sufficiently fast (within 30 ms). In addition,
YOLOv5s boasted a compact size of 7.3 MB, thus holding potential for application in
the actual logistics scene.
• The multi-objective evaluation metrics from the MOT Challenge were employed as
the performance evaluation metrics of the logistics warehouse object-tracking system,
assessing the degree of improvement of the tracking effect.
• In the detection test, the mAP@0.5 achieved an impressive score of 0.976, demonstrat-
ing the high accuracy of the object-detection model. Moreover, the tracking test yielded
a commendable MOTA of 75.8% and a Rcll of 86.6%, validating the effectiveness of the
tracking algorithm. Moreover, the method of transfer learning for trained DeepSORT
models is introduced.

Author Contributions: Conceptualization, T.X. and X.Y.; methodology, T.X.; software, T.X.; validation,
T.X.; formal analysis, T.X.; investigation, T.X.; resources, T.X.; data curation, T.X.; writing—original
draft preparation, T.X.; writing—review and editing, X.Y.; visualization, T.X.; supervision, X.Y.; project
administration, X.Y.; funding acquisition, X.Y. All authors have read and agreed to the published
version of the manuscript.
Funding: This work was supported by Guangdong Basic and Applied Basic Research Foundation
(2022A1515010095, 2021A1515010506), the National Natural Science Foundation of China, and the
Royal Society of Edinburgh (51911530245).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data that support the findings of this study are available from the
corresponding author upon reasonable request.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Tubis, A.A.; Rohman, J. Intelligent Warehouse in Industry 4.0-Systematic Literature Review. Sensors 2023, 23, 4105.
2. Neumann, W.P.; Winkelhaus, S.; Grosse, E.H.; Glock, C.H. Industry 4.0 and the human factor—A systems framework and analysis
methodology for successful development. Int. J. Prod. Econ. 2021, 233, 107992. [CrossRef]
3. Chen, H.; Zhang, Y. Regional Logistics Industry High-Quality Development Level Measurement, Dynamic Evolution, and Its
Impact Path on Industrial Structure Optimization: Finding from China. Sustainability 2022, 14, 14038. [CrossRef]
4. Yan, B.-R.; Dong, Q.-L.; Li, Q.; Amin, F.U.; Wu, J.-N. A Study on the Coupling and Coordination between Logistics Industry and
Economy in the Background of High-Quality Development. Sustainability 2021, 13, 10360. [CrossRef]
5. Chen, Y.; Yang, B. Analysis on the evolution of shipping logistics service supply chain market structure under the application of
blockchain technology. Adv. Eng. Inform. 2022, 53, 13. [CrossRef]
6. Hong, T.; Liang, H.; Yang, Q.; Fang, L.; Kadoch, M.; Cheriet, M. A Real-Time Tracking Algorithm for Multi-Target UAV Based on
Deep Learning. Remote Sens. 2023, 15, 2. [CrossRef]
7. Li, J.; Zhi, J.; Hu, W.; Wang, L.; Yang, A. Research on the improvement of vision target tracking algorithm for Internet of things
technology and Simple extended application in pellet ore phase. Future Gener. Comput. Syst. 2020, 110, 233–242. [CrossRef]
8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
9. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of
the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
10. Dong, X.; Yan, S.; Duan, C. A lightweight vehicles detection network model based on YOLOv5. Eng. Appl. Artif. Intell. 2022,
113, 104914. [CrossRef]
11. Simeth, A.; Kumar, A.A.; Plapper, P. Using Artificial Intelligence to Facilitate Assembly Automation in High-Mix Low-Volume
Production Scenario. Procedia CIRP 2022, 107, 1029–1034. [CrossRef]
12. Qu, Z.; Gao, L.-y.; Wang, S.-y.; Yin, H.-n.; Yi, T.-m. An improved YOLOv5 method for large objects detection with multi-scale
feature cross-layer fusion network. Image Vis. Comput. 2022, 125, 104518. [CrossRef]
13. Zhan, X.; Wu, W.; Shen, L.; Liao, W.; Zhao, Z.; Xia, J. Industrial internet of things and unsupervised deep learning enabled
real-time occupational safety monitoring in cold storage warehouse. Saf. Sci. 2022, 152, 105766. [CrossRef]
14. Soleimanitaleb, Z.; Keyvanrad, M.A.; Jafari, A. Object Tracking Methods: A Review. In Proceedings of the 2019 9th International
Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 24–25 October 2019; pp. 282–288.
15. Abdulrahman, B.; Zhu, Z. Real-time pedestrian pose estimation, tracking and localization for social distancing. Mach. Vis. Appl.
2022, 34, 8. [CrossRef]
16. Zhou, P.; Liu, Y.; Meng, Z. PointSLOT: Real-Time Simultaneous Localization and Object Tracking for Dynamic Environment. IEEE
Robot. Autom. Lett. 2023, 8, 2645–2652. [CrossRef]
17. Jang, J.; Seon, M.; Choi, J. Lightweight Indoor Multi-Object Tracking in Overlapping FOV Multi-Camera Environments. Sensors
2022, 22, 5267. [CrossRef] [PubMed]
18. Kumar, A.; Kalia, A.; Verma, K.; Sharma, A.; Kaushal, M. Scaling up face masks detection with YOLO on a novel dataset. Optik
2021, 239, 166744. [CrossRef]
19. A Complete Explanation of the Core Basic Knowledge of Yolov5 in the Yolo Series. Available online: https://zhuanlan.zhihu.com/p/143747206 (accessed on 19 June 2023).
20. Jiang, C.; Ren, H.; Ye, X.; Zhu, J.; Zeng, H.; Nan, Y.; Sun, M.; Ren, X.; Huo, H. Object detection from UAV thermal infrared images
and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102912. [CrossRef]
21. Li, S.; Li, C.; Yang, Y.; Zhang, Q.; Wang, Y.; Guo, Z. Underwater scallop recognition algorithm using improved YOLOv5. Aquac.
Eng. 2022, 98, 102273. [CrossRef]
22. Wang, H.; Zhang, S.; Zhao, S.; Wang, Q.; Li, D.; Zhao, R. Real-time detection and tracking of fish abnormal behavior based on
improved YOLOV5 and SiamRPN++. Comput. Electron. Agric. 2022, 192, 10. [CrossRef]
23. Li, J.; Qiao, Y.; Liu, S.; Zhang, J.; Yang, Z.; Wang, M. An improved YOLOv5-based vegetable disease detection method. Comput.
Electron. Agric. 2022, 202, 107345. [CrossRef]
24. Kaufmane, E.; Sudars, K.; Namatēvs, I.; Kalniņa, I.; Judvaitis, J.; Balašs, R.; Strautiņa, S. QuinceSet: Dataset of annotated Japanese quince images for object detection. Data Brief 2022, 42, 108332. [CrossRef]
25. Chen, F.; Wang, X.; Zhao, Y.; Lv, S.; Niu, X. Visual object tracking: A survey. Comput. Vis. Image Underst. 2022, 222, 42. [CrossRef]
26. Lin, Y.; Chen, T.; Liu, S.; Cai, Y.; Shi, H.; Zheng, D.; Lan, Y.; Yue, X.; Zhang, L. Quick and accurate monitoring peanut seedlings
emergence rate through UAV video and deep learning. Comput. Electron. Agric. 2022, 197, 106938. [CrossRef]
27. Introduction to Multi Object Tracking (MOT). Available online: https://zhuanlan.zhihu.com/p/97449724 (accessed on 19 June 2023).
28. Wang, Z.; Zheng, L.; Liu, Y.; Wang, S. Towards Real-Time Multi-Object Tracking. arXiv 2019, arXiv:1909.12605.
29. Ma, L.; Liu, X.; Zhang, Y.; Jia, S. Visual target detection for energy consumption optimization of unmanned surface vehicle. Energy
Rep. 2022, 8, 363–369. [CrossRef]
30. Lin, X.; Li, C.-T.; Sanchez, V.; Maple, C. On the detection-to-track association for online multi-object tracking. Pattern Recognit.
Lett. 2021, 146, 200–207. [CrossRef]
31. Yang, F.; Wang, Z.; Wu, Y.; Sakti, S.; Nakamura, S. Tackling multiple object tracking with complicated motions—Re-designing the
integration of motion and appearance. Image Vis. Comput. 2022, 124, 104514. [CrossRef]
32. Wong, P.K.-Y.; Luo, H.; Wang, M.; Leung, P.H.; Cheng, J.C.P. Recognition of pedestrian trajectories and attributes with computer
vision and deep learning techniques. Adv. Eng. Inform. 2021, 49, 101356. [CrossRef]
33. Tan, C.; Li, C.; He, D.; Song, H. Towards real-time tracking and counting of seedlings with a one-stage detector and optical flow.
Comput. Electron. Agric. 2022, 193, 106683. [CrossRef]
34. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
