Smart Logistics Warehouse Moving-Object Tracking Based on
YOLOv5 and DeepSORT
Tingbo Xie and Xifan Yao *
Featured Application: An approach for object tracking in logistics warehouses based on YOLOv5
and DeepSORT is proposed, which distinguishes humans from goods, and an evaluation system
is established for object tracking in logistics warehouse scenarios.
Abstract: The future development of Industry 4.0 places paramount importance on human-centered/-centric
factors in the production, design, and management of logistic systems, which has led to the emergence
of Industry 5.0. However, effectively integrating human-centered/-centric factors in logistics scenarios
has become a challenge. A pivotal technological solution for dealing with such a challenge is to
distinguish and track moving objects such as humans and goods. Therefore, an algorithm model
combining YOLOv5 and DeepSORT for logistics warehouse object tracking is designed, where
YOLOv5 is selected as the object-detection algorithm and DeepSORT distinguishes humans from
goods and environments. The evaluation metrics from the MOT Challenge affirm the algorithm’s
robustness and efficacy. Through rigorous experimental tests, the combined algorithm demonstrates
a fast tracking speed (within 30 ms), which holds promising potential for applications in real-world
logistics warehouses.
changes in the environment and the tracked entities themselves. Given the substantial vol-
ume of goods and individuals requiring tracking within logistics warehouses, some of the
targets may move randomly, making it difficult for traditional machine vision algorithms to
identify targets accurately. The newly developed object-recognition algorithms, grounded
in deep learning, are able to automatically learn and extract features. Moreover, they can
be adapted to complex scenarios, rendering them particularly well-suited for intelligent
monitoring applications in logistics warehouses [6,7].
This study aims to provide an effective object-tracking approach within smart logistics
warehouses. Additionally, it strives to formulate an object-tracking system tailored to logis-
tics scenarios. The former provides auxiliary means for monitoring the operation status of
logistics warehouses, ensuring smooth functioning and safety management. This approach
aligns with the human-centric concept of logistics production management. The
latter serves as an illustrative model for the management method of sustainable and flexible
industrial logistics systems. It also offers a pertinent method of object tracking to facilitate
human–machine interaction in virtual reality. In sum, this study resonates with the theme
of “Sustainable, Human-Centered/Centric and Resilient Production and Logistic Systems
Design and Management”.
The specific work is as follows.
1. Three prominent deep learning detection algorithms—Faster R-CNN, SSD, and
YOLO—are briefly introduced and compared. The YOLO algorithm is chosen to
analyze the logistics warehouse moving-object-tracking problem.
2. A dedicated dataset was meticulously crafted based on the logistics warehouse sce-
narios. The quality and quantity of the dataset were optimized, yielding enhanced
outcomes in object detection specific to logistics warehouses.
3. The principle and process of DeepSORT algorithms are introduced. Finally, an evalua-
tion metric for the effectiveness of logistics warehouse object tracking is established
based on the MOT Challenge multi-object-tracking competition. This metric facilitates
quantitative assessment, ultimately offering supplementary tools for diminishing the
necessity of manual video surveillance within logistics settings.
2. Related Work
The key to object tracking, and to providing auxiliary means that eliminate manual monitoring in logistics warehouse scenarios, lies in the selection of the object-tracking algorithm. In pursuit of this goal, state-of-the-art object-detection algorithms are compared to select the best performer among them.
Figure 1. Faster R-CNN algorithm process.
2.2. SSD (Single Shot MultiBox Detector)
Wei Liu [9] proposed a new object-detection algorithm known as SSD in 2015. Operating within the realm of single-stage detection algorithms, SSD accomplishes target localization as well as classification simultaneously.
SSD can be divided into SSD 300 and SSD 512, according to the size of the input image. The following is the basic operation process of the SSD algorithm, with SSD 300 serving as the exemplar:
• The input image (size of 300 pixels × 300 pixels) is fed into a pre-training network (VGG-16) to obtain different feature mappings.
• The feature vectors of Conv4_3, Conv7, and Conv8_2 are employed to establish object-detection boxes at different scales, concurrently performing their classification.
• The NMS (non-maximum suppression) algorithm is employed to eliminate redundant and inaccurate detections. The final detection outcome is then generated, consisting of the most pertinent and trustworthy bounding boxes for the targeted objects.
2.3. YOLO (You Only Look Once)
The YOLO algorithm proposed in 2016 was designed as a new end-to-end network for object detection. Its unique design approach increases the speed of the entire network operation by casting object detection entirely into a regression problem, spanning from input-image segmentation to the classification of targets in each grid [10].
As shown in Figure 2, the process of YOLO is as follows:
• The input image is divided into several grids, and each grid takes responsibility for detecting potential targets within its confines.
• The image of each grid is convolved to obtain the positions of the bounding boxes and the corresponding confidences for multiple targets.
• The final prediction box for each grid's multiple targets is obtained using non-maximum suppression (NMS), which is the result of object detection; a minimal sketch of the NMS step is given below.
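To make the NMS step used by both pipelines concrete, the following minimal Python sketch greedily keeps the highest-scoring box and discards boxes that overlap it too strongly. The box data and the 0.5 IoU threshold are hypothetical illustrations, not values taken from SSD or YOLOv5.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the best-scoring box,
    then drop any remaining box that overlaps it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Hypothetical detections: two overlapping boxes for one object, one separate box.
boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.75, 0.8]
print(nms(boxes, scores))  # -> [0, 2]
```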
Figure 2. YOLO algorithm process.
Figure 3. YOLOv5s network structure.
3.1.1. CBL (Convolutional Operation, Batch Normalization Algorithm, and Leaky ReLU Activation Function)
The convolutional operation, the batch normalization algorithm, and the leaky ReLU activation function compose the CBL module, which mainly plays the role of data sampling. The structure is shown in Figure 4.
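As a concrete illustration of this composition, the sketch below defines a CBL-style block in PyTorch (the framework YOLOv5 is implemented in). The kernel size, stride, and 0.1 negative slope are illustrative defaults rather than values specified in the paper.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv2d + BatchNorm2d + LeakyReLU, the composition described above."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# A stride-2 CBL halves the spatial resolution, i.e., it performs down-sampling.
x = torch.randn(1, 3, 640, 640)
print(CBL(3, 32, stride=2)(x).shape)  # torch.Size([1, 32, 320, 320])
```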
3.1.4. CSP2_X
CSP2_X, as shown in Figure 7, is similar to CSP1_X, using CBL to replace the Res unit.
• To ensure the robustness of the model, 100 photos from the first dataset are intention-
ally incorporated as a supplement to training.
• object_class indicates the type of object detected, which is usually numbered by a
positive integer starting from zero.
• x_center and y_center indicate the coordinates of the center of the target bounding box
(normalized, i.e., divided by the width and height of the whole image).
• width and height indicate the width and height of the target bounding box (normalized,
i.e., divided by the width and height of the whole image).
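As an illustration of the label format listed above, the following sketch converts one normalized YOLO-style label line back into pixel coordinates. The label string and the image size are hypothetical examples, not entries from the paper's dataset.

```python
def yolo_label_to_pixels(line, img_w, img_h):
    """Convert 'object_class x_center y_center width height' (normalized) to a pixel box."""
    object_class, x_c, y_c, w, h = line.split()
    x_c, y_c = float(x_c) * img_w, float(y_c) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    # Return class index plus (top-left x, top-left y, width, height) in pixels.
    return int(object_class), (x_c - w / 2, y_c - h / 2, w, h)

# Hypothetical label line for a 1920x1080 frame: class 0 centered in the image.
print(yolo_label_to_pixels("0 0.5 0.5 0.25 0.4", 1920, 1080))
# -> (0, (720.0, 324.0, 480.0, 432.0))
```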
Figure 11. Test results for the three videos: (a) Video 1 test results; (b) Video 2 test results; (c) Video 3 test results.
The trained model encountered challenges when detecting a target that was a relatively small part of the whole-image bounding box. Along with a continuous video stream, it tended to lose the tracking frame when the target was obscured. There were also many cases of false positives, for example, the detection of non-goods objects as targets. In cases where goods and workers were obscured from each other, the detection bounding boxes often appeared for only one of the two elements.
In labelling, the authentic-detection bounding box of the workers was defined as the smallest rectangular box that could enclose the worker's clothing, hair, etc. However, it was difficult for the detection model to enclose the entire area of the worker when parts of the worker were concealed. When it came to detecting goods, the model often struggled to detect the individual items based on their features, especially in scenarios where multiple goods were stacked on each other.
To address the issues above, another dataset was established. The second training incorporated 790 photos from a fixed scene (i.e., the logistics warehouses of Company S) as the dataset for the detection of parcels and workers. This changed the earlier situation of a relatively small dataset spread over a large number of diverse scenes.
3.3.2. Second Training Detection Model
The second dataset used for training is a set of 790 photos taken from the same logistics warehouses (i.e., Company S's logistics warehouses). The 790 images consisted of 100 images of specific goods (parcels with adhesive tape), 100 images of workers, and 590 images of spatial interference between workers and the goods. The test results of this object detection are shown in Figure 12.
As can be seen from Figure 12, the detection of obscured targets was significantly improved, and [email protected] reached 0.976. At the same time, the model was still able to detect targets with a relatively small size. Compared with the first training, the second training was improved dramatically, and the trained model detected the targets individually even when the goods blocked each other.
As for the proposed tracking task, a continuous-detection bounding box for the target, as stated above, was not enough; the tracking model also needed to maintain the stable invariance of the tracking IDs. In the tracking task, each target was given a unique ID, and keeping the ID constant when the target reappeared was a vital metric for tracking performance [25]. To achieve the tracking task, we combined the detection model with DeepSORT to preserve the continuity of target IDs.
4. Logistics Warehouse Target Tracking
YOLOv5 and DeepSORT were combined to assign IDs to the tracked targets in the video stream and maintain the stability of the IDs. Originating from the SORT algorithm, DeepSORT has become a tracking algorithm combining Kalman filtering and Hungarian matching, as described below.
4.1. DeepSORT
DeepSORT differs from SORT mainly in defining a new state for target tracks and matching the original detection results with the predictions by judging the state of the tracks [26]. At the same time, tracks with distinct states can be judged by the quantities of matching, which either increase or decrease. The DeepSORT algorithm process is shown in Figure 13 [27,28].
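The association step at the heart of this process can be sketched as an assignment problem: Kalman-predicted track boxes are matched to new detections by minimizing a cost matrix with the Hungarian algorithm. The sketch below uses a simplified 1 − IoU cost and a SciPy solver; DeepSORT itself additionally combines Mahalanobis and appearance (re-identification) distances in a matching cascade, so this is an illustration of the matching idea, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def associate(track_boxes, det_boxes, max_cost=0.7):
    """Hungarian matching of predicted track boxes to detections on a 1 - IoU cost."""
    if not track_boxes or not det_boxes:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)  # minimizes total assignment cost
    matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_tracks = [i for i in range(len(track_boxes)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(det_boxes)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets

# Hypothetical frame: two existing tracks, two detections (one overlaps track 0).
tracks = [(10, 10, 60, 60), (200, 200, 260, 260)]
dets = [(12, 11, 62, 61), (400, 400, 450, 450)]
print(associate(tracks, dets))  # -> ([(0, 0)], [1], [1])
```

Unmatched tracks and detections then drive the track-state updates described above (e.g., tentative tracks are confirmed after repeated matches and deleted after repeated misses).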
Figure 15. Evaluation indicators for the MOT Challenge competition.
(1) MOTA
The formula for calculating MOTA is as follows:
\[ \mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDSW}}{\mathrm{GT}} \quad (1) \]
GT (ground truth) signifies the accurate target bounding box information of the
tracking target, which is often obtained by manual annotation or automatic annotation
(2) MOTP
The formula for calculating MOTP is as follows:
\[ \mathrm{MOTP} = \frac{\sum_{t,j} D_{t,j}}{\sum_{t} c_{t}} \quad (2) \]
where t signifies the current number of frames in the tracked video stream; the variable c_t corresponds to the count of successful pairings between the tracked tracks and the ground-truth (GT) tracks at frame t; and D_{t,j} denotes the distance between the two successfully paired tracks at frame t (generally calculated by Euclidean distance or IoU; Euclidean distance is employed in this paper). The better the detection effect, the shorter the Euclidean distance between the paired tracks. An increased number of successful pairings results in a higher MOTP value.
(3) Rcll
Rcll gauges the proportion of successful target tracking, calculated by dividing the
number of successfully paired targets by the number of real targets.
(4) MT and ML
MT indicates the number of tracks that have been successfully paired for over 80%
of the total frames during the whole tracking process; ML indicates the number of tracks
that have been successfully paired for less than 20% of the total frames during the whole
tracking process.
(5) IDP
IDP is similar to the calculation of precision. Precision is defined as the proportion of true positive samples among all samples predicted as positive, while IDP refers to the proportion of all ID-assigned samples whose target ID is assigned correctly.
(6) IDR
IDR is akin to the calculation of recall. Recall refers to the proportion of all actually positive samples for which the predictor offers a positive prediction, while IDR refers to the proportion of all ground-truth samples for which the tracking algorithm has correctly assigned the target ID.
(7) IDF1
IDF1 is then the harmonic mean of IDP and IDR. The calculation formula is as follows.
\[ \mathrm{IDF1} = \frac{2}{\frac{1}{\mathrm{IDP}} + \frac{1}{\mathrm{IDR}}} \quad (3) \]
(8) FM
FM refers to the number of tracked target tracks that are mistakenly terminated by the
tracker in the video stream.
(9) IDSW
IDSW serves as a distinctive form of FM, so the value of FM is generally larger than IDSW.
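A minimal sketch of how Equations (1) and (3) translate into code is given below; the counts and ratios are hypothetical and only demonstrate the arithmetic, not the values reported later in Table 6.

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, Equation (1)."""
    return 1.0 - (fn + fp + idsw) / gt

def idf1(idp, idr):
    """IDF1 = harmonic mean of IDP and IDR, Equation (3)."""
    return 2.0 / (1.0 / idp + 1.0 / idr)

# Hypothetical counts accumulated over a tracked video stream.
print(f"MOTA = {mota(fn=120, fp=80, idsw=10, gt=1000):.3f}")  # MOTA = 0.790
print(f"IDF1 = {idf1(idp=0.80, idr=0.75):.3f}")               # IDF1 = 0.774
```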
4.4. Object-Tracking Model Testing
4.4.1. Data Format
A gt.txt file is commonly used in MOT datasets to record information about each tracking frame in the video stream [33]. The Darklabel software is used to annotate the video from Company S's logistics warehouses to obtain a gt.txt file, as shown in Figure 16.
Figure 16. gt.txt file screenshot.
• The first bit of each line represents the current video frame.
• The second bit indicates the ID number of the target track.
• The third to sixth bits are four parameters: the third and fourth bits represent the horizontal and vertical coordinates of the upper-left corner of the object-tracking frame, and the fifth and sixth bits indicate the width and height of the tracking frame.
• The seventh bit indicates the current status of the target track, with 0 being inactive and 1 being active.
• The eighth bit indicates the target type of the track.
• The ninth bit indicates the visibility ratio of the object-tracking frame, which is related to the interference between this frame and other frames.
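As an illustration of the field layout listed above, the sketch below parses one gt.txt line into named fields; the example line is hypothetical.

```python
def parse_gt_line(line):
    """Split one MOT-style gt.txt line into named fields (see the list above)."""
    frame, track_id, x, y, w, h, active, obj_type, visibility = line.strip().split(",")
    return {
        "frame": int(frame),             # current video frame
        "track_id": int(track_id),       # ID number of the target track
        "bbox": (float(x), float(y), float(w), float(h)),  # top-left corner, width, height
        "active": int(active) == 1,      # 0 inactive, 1 active
        "type": int(obj_type),           # target type of the track
        "visibility": float(visibility)  # visibility ratio of the tracking frame
    }

# Hypothetical annotation: frame 1, track 3, a 50x120 box at (610, 240), active, type 1, 90% visible.
print(parse_gt_line("1,3,610,240,50,120,1,1,0.9"))
```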
A det.txt file is a counterpart to a gt.txt file, which records the object-tracking frame information obtained using the tracking algorithm. The det.txt file has the format shown in Figure 17.
Figure 17. det.txt file screenshot.
The first to sixth parameters in each line of det.txt files have essentially the same meaning as those in gt.txt files. The seventh parameter indicates the confidence level of the tracked target, and the eighth to tenth parameters are related to 3D target tracking.
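Conversely, tracker output can be written in the det.txt layout just described. The sketch below emits one row per tracked box and fills the three 3D-related fields with -1, a common convention for 2D tracking that is assumed here rather than stated in the paper.

```python
import csv

def write_det_file(path, rows):
    """rows: iterable of (frame, track_id, x, y, w, h, conf) tuples from the tracker."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for frame, track_id, x, y, w, h, conf in rows:
            # Last three fields relate to 3D tracking; -1 marks them as unused (assumption).
            writer.writerow([frame, track_id, x, y, w, h, conf, -1, -1, -1])

# Hypothetical output of one tracked frame.
write_det_file("det.txt", [(1, 3, 610.0, 240.0, 50.0, 120.0, 0.92)])
```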
4.4.2. Tracking Result
The test video was annotated using Darklabel to obtain a gt.txt file. The det.txt file was generated from the two datasets. The outcomes of the det.txt file were derived through the YOLOv5 detection model combined with DeepSORT. Table 6 shows the tracking outcomes of the first and second tests measured by MOT evaluation metrics, respectively.
The change in the values of the aforementioned metrics demonstrates a notable improvement in the tracking results. Additionally, upon visual inspection of the tracking videos frame by frame, it becomes evident that the tracking target ID stability from the second dataset has also significantly improved.
Table 6. The first and second tracking results.
Times    MOTA    MOTP    Rcll    MT    ML    IDF1    IDP     IDR     IDSW    FP      FN      FM
First    63.7%   0.131   86.6%   23    0     75.5%   72.4%   78.8%   56      1858    1120    129
Second   75.8%   0.118   87.8%   25    0     83.8%   84.0%   83.5%   26      973     1022    96
4.5. Transfer Learning for DeepSORT
Trained object-tracking models often need to be implemented in different scenarios and environments in practical applications, which requires high transferability and generalization of the trained models. To this end, transfer learning could be employed to retrain the model through a new dataset, enhancing the object-tracking performance. Transfer learning [34] is a machine learning technique that facilitates the transfer of trained model knowledge from one task to another related task. The steps to perform transfer learning on the trained DeepSORT model are as follows:
• Prepare the dataset: The new dataset needs to be organized, ensuring it is representative and diverse.
• Transfer learning strategy:
  • Feature extraction: The majority of the pre-trained DeepSORT model's weights need to be retained while solely adjusting the output layer to suit the new task. It is necessary to freeze the weights of the initial layers to maintain low-level feature extraction capabilities.
  • Fine-tuning: In addition to modifying the output layer, fine-tuning some of the lower-level weights to adapt to the new task is considered, allowing certain lower-level weights to undergo slight adjustments.
• Model training: Training the adjusted model using the new dataset.
• Model evaluation: Evaluate the retrained model and compute evaluation metrics to evaluate the model's performance on the new task.
• Refinement and iteration: Further refining the model based on the evaluation results and adjusting hyperparameters or fine-tuning the model network.
The transfer learning process for DeepSORT is shown in Figure 18; a minimal code sketch of the freezing and fine-tuning strategy is given below.
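The sketch below illustrates the feature-extraction/fine-tuning strategy in PyTorch. The layer prefixes, the `classifier` head name, and the number of new classes are placeholders for the appearance-embedding network actually used by DeepSORT, so treat it as a pattern under stated assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def prepare_for_transfer(model: nn.Module, num_new_classes: int,
                         freeze_prefixes=("conv1", "layer1")):
    """Freeze early feature-extraction layers and replace the classifier head."""
    for name, param in model.named_parameters():
        # Keep low-level feature extraction fixed (feature-extraction strategy).
        if name.startswith(freeze_prefixes):
            param.requires_grad = False
    # Replace the output layer to suit the new task (assumes a Linear head named `classifier`).
    in_features = model.classifier.in_features
    model.classifier = nn.Linear(in_features, num_new_classes)
    return model

# Only the un-frozen parameters are handed to the optimizer for fine-tuning, e.g.:
# model = prepare_for_transfer(pretrained_reid_model, num_new_classes=10)
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```

Leaving more layers un-frozen corresponds to the fine-tuning strategy in the list above; freezing everything except the new head corresponds to pure feature extraction.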
6. Conclusions
This research focuses on the design of a moving-object-tracking scheme for logistics
warehouses. It primarily encompasses the design of a deep learning-based detection
algorithm for logistics warehouses and the design of experiments for tracking moving
targets in logistics warehouses. The main contributions of the dataset, algorithm, and
evaluation method are described below:
• In order to solve the problem of moving-object tracking in logistics warehouses, a
specific company was selected as the sampling object to make the dataset. The dataset
encapsulated tracking outcomes for both goods and workers in video streams. The
tracking model was further strengthened and applied to most logistics warehouse
scenarios through the optimization of the dataset.
• YOLOv5s was selected as the base model on account of both detection accuracy
and detection speed. Despite incorporating the DeepSORT network for multi-object
tracking, the tracking speed remained sufficiently fast (within 30 ms). In addition,
YOLOv5s boasted a compact size of 7.3 MB, thus holding potential for application in
the actual logistics scene.
• The multi-objective evaluation metrics from the MOT Challenge were employed as
the performance evaluation metrics of the logistics warehouse object-tracking system,
assessing the degree of improvement of the tracking effect.
• In the detection test, the [email protected] achieved an impressive score of 0.976, demonstrat-
ing the high accuracy of the object-detection model. Moreover, the tracking test yielded
a commendable MOTA of 75.8% and a Rcll of 86.6%, validating the effectiveness of the
tracking algorithm. In addition, the method of transfer learning for trained DeepSORT models is introduced.
Author Contributions: Conceptualization, T.X. and X.Y.; methodology, T.X.; software, T.X.; validation,
T.X.; formal analysis, T.X.; investigation, T.X.; resources, T.X.; data curation, T.X.; writing—original
draft preparation, T.X.; writing—review and editing, X.Y.; visualization, T.X.; supervision, X.Y.; project
administration, X.Y.; funding acquisition, X.Y. All authors have read and agreed to the published
version of the manuscript.
Funding: This work was supported by Guangdong Basic and Applied Basic Research Foundation
(2022A1515010095, 2021A1515010506), the National Natural Science Foundation of China, and the
Royal Society of Edinburgh (51911530245).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data that support the findings of this study are available from the
corresponding author upon reasonable request.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Tubis, A.A.; Rohman, J. Intelligent Warehouse in Industry 4.0-Systematic Literature Review. Sensors 2023, 23, 4105.
2. Neumann, W.P.; Winkelhaus, S.; Grosse, E.H.; Glock, C.H. Industry 4.0 and the human factor—A systems framework and analysis
methodology for successful development. Int. J. Prod. Econ. 2021, 233, 107992. [CrossRef]
3. Chen, H.; Zhang, Y. Regional Logistics Industry High-Quality Development Level Measurement, Dynamic Evolution, and Its
Impact Path on Industrial Structure Optimization: Finding from China. Sustainability 2022, 14, 14038. [CrossRef]
4. Yan, B.-R.; Dong, Q.-L.; Li, Q.; Amin, F.U.; Wu, J.-N. A Study on the Coupling and Coordination between Logistics Industry and
Economy in the Background of High-Quality Development. Sustainability 2021, 13, 10360. [CrossRef]
5. Chen, Y.; Yang, B. Analysis on the evolution of shipping logistics service supply chain market structure under the application of
blockchain technology. Adv. Eng. Inform. 2022, 53, 13. [CrossRef]
6. Hong, T.; Liang, H.; Yang, Q.; Fang, L.; Kadoch, M.; Cheriet, M. A Real-Time Tracking Algorithm for Multi-Target UAV Based on
Deep Learning. Remote Sens. 2023, 15, 2. [CrossRef]
7. Li, J.; Zhi, J.; Hu, W.; Wang, L.; Yang, A. Research on the improvement of vision target tracking algorithm for Internet of things
technology and Simple extended application in pellet ore phase. Future Gener. Comput. Syst. 2020, 110, 233–242. [CrossRef]
8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
9. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of
the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
10. Dong, X.; Yan, S.; Duan, C. A lightweight vehicles detection network model based on YOLOv5. Eng. Appl. Artif. Intell. 2022,
113, 104914. [CrossRef]
11. Simeth, A.; Kumar, A.A.; Plapper, P. Using Artificial Intelligence to Facilitate Assembly Automation in High-Mix Low-Volume
Production Scenario. Procedia CIRP 2022, 107, 1029–1034. [CrossRef]
12. Qu, Z.; Gao, L.-y.; Wang, S.-y.; Yin, H.-n.; Yi, T.-m. An improved YOLOv5 method for large objects detection with multi-scale
feature cross-layer fusion network. Image Vis. Comput. 2022, 125, 104518. [CrossRef]
13. Zhan, X.; Wu, W.; Shen, L.; Liao, W.; Zhao, Z.; Xia, J. Industrial internet of things and unsupervised deep learning enabled
real-time occupational safety monitoring in cold storage warehouse. Saf. Sci. 2022, 152, 105766. [CrossRef]
14. Soleimanitaleb, Z.; Keyvanrad, M.A.; Jafari, A. Object Tracking Methods: A Review. In Proceedings of the 2019 9th International
Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 24–25 October 2019; pp. 282–288.
15. Abdulrahman, B.; Zhu, Z. Real-time pedestrian pose estimation, tracking and localization for social distancing. Mach. Vis. Appl.
2022, 34, 8. [CrossRef]
16. Zhou, P.; Liu, Y.; Meng, Z. PointSLOT: Real-Time Simultaneous Localization and Object Tracking for Dynamic Environment. IEEE
Robot. Autom. Lett. 2023, 8, 2645–2652. [CrossRef]
17. Jang, J.; Seon, M.; Choi, J. Lightweight Indoor Multi-Object Tracking in Overlapping FOV Multi-Camera Environments. Sensors
2022, 22, 5267. [CrossRef] [PubMed]
18. Kumar, A.; Kalia, A.; Verma, K.; Sharma, A.; Kaushal, M. Scaling up face masks detection with YOLO on a novel dataset. Optik
2021, 239, 166744. [CrossRef]
19. A Complete Explanation of the Core Basic Knowledge of Yolov5 in the Yolo Series. Available online: https://zhuanlan.zhihu.
com/p/143747206 (accessed on 19 June 2023).
20. Jiang, C.; Ren, H.; Ye, X.; Zhu, J.; Zeng, H.; Nan, Y.; Sun, M.; Ren, X.; Huo, H. Object detection from UAV thermal infrared images
and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102912. [CrossRef]
21. Li, S.; Li, C.; Yang, Y.; Zhang, Q.; Wang, Y.; Guo, Z. Underwater scallop recognition algorithm using improved YOLOv5. Aquac.
Eng. 2022, 98, 102273. [CrossRef]
22. Wang, H.; Zhang, S.; Zhao, S.; Wang, Q.; Li, D.; Zhao, R. Real-time detection and tracking of fish abnormal behavior based on
improved YOLOV5 and SiamRPN++. Comput. Electron. Agric. 2022, 192, 10. [CrossRef]
23. Li, J.; Qiao, Y.; Liu, S.; Zhang, J.; Yang, Z.; Wang, M. An improved YOLOv5-based vegetable disease detection method. Comput.
Electron. Agric. 2022, 202, 107345. [CrossRef]
24. Kaufmane, E.; Sudars, K.; Namatēvs, I.; Kalniņa, I.; Judvaitis, J.; Balašs, R.; Strautiņa, S. QuinceSet: Dataset of annotated Japanese
quince images for object detection. Data Brief 2022, 42, 108332. [CrossRef]
25. Chen, F.; Wang, X.; Zhao, Y.; Lv, S.; Niu, X. Visual object tracking: A survey. Comput. Vis. Image Underst. 2022, 222, 42. [CrossRef]
26. Lin, Y.; Chen, T.; Liu, S.; Cai, Y.; Shi, H.; Zheng, D.; Lan, Y.; Yue, X.; Zhang, L. Quick and accurate monitoring peanut seedlings
emergence rate through UAV video and deep learning. Comput. Electron. Agric. 2022, 197, 106938. [CrossRef]
27. Introduction to Multi Object Tracking (MOT). Available online: https://zhuanlan.zhihu.com/p/97449724 (accessed on 19
June 2023).
28. Wang, Z.; Zheng, L.; Liu, Y.; Wang, S. Towards Real-Time Multi-Object Tracking. arXiv 2019, arXiv:1909.12605.
29. Ma, L.; Liu, X.; Zhang, Y.; Jia, S. Visual target detection for energy consumption optimization of unmanned surface vehicle. Energy
Rep. 2022, 8, 363–369. [CrossRef]
30. Lin, X.; Li, C.-T.; Sanchez, V.; Maple, C. On the detection-to-track association for online multi-object tracking. Pattern Recognit.
Lett. 2021, 146, 200–207. [CrossRef]
31. Yang, F.; Wang, Z.; Wu, Y.; Sakti, S.; Nakamura, S. Tackling multiple object tracking with complicated motions—Re-designing the
integration of motion and appearance. Image Vis. Comput. 2022, 124, 104514. [CrossRef]
32. Wong, P.K.-Y.; Luo, H.; Wang, M.; Leung, P.H.; Cheng, J.C.P. Recognition of pedestrian trajectories and attributes with computer
vision and deep learning techniques. Adv. Eng. Inform. 2021, 49, 101356. [CrossRef]
33. Tan, C.; Li, C.; He, D.; Song, H. Towards real-time tracking and counting of seedlings with a one-stage detector and optical flow.
Comput. Electron. Agric. 2022, 193, 106683. [CrossRef]
34. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.