Object Tracking in Infrared Images Using A Deep Learning Model and A Target-Attention Mechanism

This document summarizes a research paper that proposes a deep learning model using an attention mechanism for object tracking in infrared images. The model takes both the original image and an encoded image created using a local directional number patterns algorithm to learn unique features for visual tracking. Experiments showed the proposed pipeline obtains competitive results for small object tracking in infrared images compared to recent methods, overcoming issues like low signal-to-noise ratios and objects being submerged in complex backgrounds.


Complex & Intelligent Systems (2023) 9:1495–1506

https://doi.org/10.1007/s40747-022-00872-w

ORIGINAL ARTICLE

Object tracking in infrared images using a deep learning model and a target-attention mechanism
Mahboub Parhizkar1 · Gholamreza Karamali2 · Bahram Abedi Ravan2

Received: 1 December 2021 / Accepted: 4 September 2022 / Published online: 17 September 2022
© The Author(s) 2022

Abstract
Small object tracking in infrared images is widely utilized in various fields, such as video surveillance, infrared guidance, and
unmanned aerial vehicle monitoring. The existing small target detection strategies in infrared images suffer from submerging
the target in heavy cluttered infrared (IR) maritime images. To overcome this issue, we use the original image and the
corresponding encoded image to apply our model. We use the local directional number patterns algorithm to encode the
original image to represent more unique details. Our model is able to learn more informative and unique features from the
original and encoded image for visual tracking. In this study, we explore the best convolutional filters to obtain the best possible
visual tracking results by finding those inactive to the backgrounds while active in the target region. To this end, the attention
mechanism for the feature extracting framework is investigated comprising a scale-sensitive feature generation component
and a discriminative feature generation module based on the gradients of regression and scoring losses. Comprehensive
experiments have demonstrated that our pipeline obtains competitive results compared to recently published papers.

Keywords Infrared images · Deep learning · Attention mechanism · Target tracking

Abbreviations

UAV       Unmanned aerial vehicle
IR        Infrared
IRST      Infrared search and track
SNR       Signal-to-noise ratio
SCR       Signal-to-clutter ratio
DL        Deep learning
ROI       Region of interest
DMF       Directional morphological filtering
MSEs      Multidirectional structuring elements
LBP       Local binary pattern
LDP       Local directional pattern
LTP       Local ternary pattern
LDN       Local directional number
LWDP      Local word directional pattern
CNN       Convolutional neural network
Grad-CAM  Gradient-weighted class activation mapping
FN        False negative
FP        False positive
TP        True positive
SSD       Single shot detector

✉ Mahboub Parhizkar
[email protected]

Gholamreza Karamali
[email protected]

Bahram Abedi Ravan
[email protected]

1 Department of Mathematics, Central Tehran Branch, Islamic Azad University, Tehran, Iran
2 Faculty of Basic Sciences, Shahid Sattari Aeronautical University of Science and Technology, South Mehrabad, Tehran, Iran

Introduction

Visual tracking can be considered as the ability to look at something and follow its movement. Visual tracking in videos, which learns to estimate the locations of a target object, has been broadly employed in several applications, such as infrared search and track (IRST) systems (also called infra-red sighting and tracking), video surveillance, autonomous driving, and human motion analysis [1, 2]. However, due to the long observation distance in the infrared image, the target has a


low signal-to-noise ratio (SNR) and a small size, which limits the information available for tracking [3, 4].

Moreover, a range of environmental conditions makes infrared small target tracking even more difficult. These mostly comprise concrete scenes with intense noise, low contrast, competing background clutter, camera ego-motion, and so on. For instance, camera ego-motion causes an impulsive motion of the target between two sequential frames, which can easily cause the tracker to miss the target. Also, as small objects in infrared images can easily be submerged in a complex background with a low signal-to-clutter ratio (SCR), the tracker tends to drift to the background. Intense noise and low contrast can lead to a drop in target SNR. Besides, because of the long imaging distance, small targets have no concrete texture or shape. Hence, robust and accurate small object tracking in infrared images remains a challenging task in crowded scenes [5, 6]. Several models that try to track small targets in infrared images efficiently have been implemented in the literature. Although much research has been conducted using visible cameras, these cameras depend heavily on the illumination conditions and are not good options for night-time environments. So, to overcome this issue, we employ an infrared imaging system that is more robust to illumination variations and works well in both night-time and day-time [7].

In the last few years, deep learning (DL) pipelines have achieved state-of-the-art classification and prediction results in different fields of computer vision [8–13]. However, there are only a few DL strategies for tracking objects in infrared images, and their efficiency is not as competitive as that of algorithms based on hand-crafted features. Moreover, they are unable to effectively detect and track objects that vary in both size and shape. So, in this paper, we suggest a small object tracking approach for infrared images that uses a deep learning model able to track even small objects in the presence of size and shape variations. As each texture carries a great deal of textural information that is crucial when dealing with a real scene, we apply a textural descriptor to explore key features. The employed textural descriptor is an illumination-invariant technique, which is very beneficial for tracking tasks. Moreover, we propose a deep learning model that accepts both the original image and the image obtained by the textural descriptor (the encoded image). Our DL model includes a target attention mechanism and a size attention mechanism. An attention mechanism assumes that one or more features are more important than others and that more attention should be paid to them.

The remaining parts of this paper are organized as follows. Firstly, related works are discussed in “Literature review”. The characteristics and architecture of the suggested model are presented in “Materials and methods”. “Experiments” describes the implementation details of the suggested model. “Conclusions” provides conclusions.

Literature review

A small target tracking technique for infrared images based on the Kalman filter and multiple cues fusion is proposed in [1] to overcome the problem of complex environmental conditions. In the first step, the Kalman filter is employed to estimate the preliminary target position, which is taken as the center of the region of interest (ROI). Next, the motion, contrast, and grey color cues in the ROI are investigated to produce the confidence map that locates the small target. Finally, the target models and the fusion weights are updated, and the predicted target position is used as a measurement for the Kalman filter. A robust maritime dim small target detection scheme that overcomes the problem of weak targets being submerged in heavily cluttered infrared (IR) maritime images is introduced in [14]. The authors enhanced the quality of the employed images with a multidirectional improved top-hat filter. Also, they established directional morphological filtering (DMF) by incorporating morphological operations, and constructed multidirectional structuring elements (MSEs) to explore the multidirectional differences between the target area and nearby objects.

A discriminative prediction learning model that is capable of fully investigating the background and target appearance information was proposed in [15]. Firstly, a steepest descent-based technique is employed that calculates an optimal step length in each iteration. Then, a module that efficiently initializes the target model is integrated. Zhang et al. [16] suggested an RGB-infrared fusion tracking strategy using visible and infrared images. To this end, a fully convolutional network based on Siamese networks (SiamFT) was suggested. In the first step, infrared and visible images are processed by an infrared network and a visible network. Next, to form the fused template image, the convolutional features of the infrared and visible images extracted by the two Siamese networks are merged. A modality weight calculation technique using the response value of the Siamese network is employed for estimating the reliability of the different images. Finally, a cross-correlation approach is used to create the final response map.

An improvement of a fully convolutional neural network (FCNN) to estimate object location was proposed in [17]. This strategy uses a comprehensive sampling technique as well as a better scoring scheme. The possible object positions are computed using a two-stage sampling that combines clustered foreground contour information and stochastically distributed samples. The best sample is chosen based on a combined score of model reliability, predicted location, and


appearance similarity. Yang et al. [18] proposed a tracking system using a correlation filter (CF) tracker strategy. Moreover, a Gabor filter (GF) feature extractor is used in the frequency domain (GF-KCF). By constructing a set of frequency-domain GFs, the suggested method tries to suppress background noise effectively and highlight target texture information. Yao et al. [19] suggested a Siamese network for the tracking task that uses a dilated convolution module to enhance the scale adaptability of the network. To diminish the dependence of the model on the initially given exemplar, they used a target template library update technique based on the tracking outcomes of historical frames.

Materials and methods

As infrared images are taken from a long distance, the target signals include insufficient texture information. Besides, complex background clutter such as sea clutter, the sea-sky line, forest, mountains, islands, and cloud clutter is usually changeable, which diminishes the efficiency of a tracking model. So, in this section, we describe the importance of the textural features when dealing with a complex background in which a small object needs to be tracked. Moreover, we propose a new deep learning pipeline that uses two attention mechanisms for a size-invariant target tracking model. The proposed strategy to detect and estimate an object in infrared images is displayed in Fig. 4.

Texture descriptor

Textural analysis of any kind of image endeavors to explore key informative details and characteristics of a surface texture such as entropy, contrast, shape, correlation, smoothness, energy, homogeneity, roughness, and color [13, 20, 21]. As introduced in many works [22, 23], several kinds of local descriptors are employed to convert an image into an encoded image based on a code-book of visual patterns or some pre-defined coding rules.

These strategies are widely used in many fields of research such as object tracking [24], image segmentation [13, 25–27], and aerial image analysis [28, 29]. Generally, in texture segmentation and classification, the main aim is to split the image into a set of homogeneous textured segments [30].

The local binary pattern (LBP), local directional pattern (LDP), and local ternary pattern (LTP) feature descriptors are easy to implement. They compare the pixel intensities of the nearest neighbors (in a rectangular, circular, or other neighborhood) in clockwise or counter-clockwise order to encode the low-level information of curves, lines, edges, and spots inside an image, and output the result as a binary value [31–33].

In encoding applications, the gradient value is more robust than the gray-level intensity. Some strategies based on the gradient value, such as local directional number patterns (LDN) and the local word directional pattern (LWDP), have attracted much attention [34]. The LDN is used in the gradient domain to generate an illumination-invariant representation of the image. The LDN uses directional information to locate edges whose magnitudes are insensitive to lighting variations.

This is implemented by applying the eight Kirsch kernels (filters), which are rotated in 45° steps through the 8 main compass directions (Fig. 1). Each kernel generates a feature map, and only the maximum value at each location is kept to obtain the final edge map [35, 36]. An example of applying the non-linear Kirsch kernels to an infrared image is shown in Fig. 2.

Fig. 1 Non-linear Kirsch kernels in 8 rotations [37]

The LDN algorithm is defined by

$$\mathrm{ldn}(cp_x, cp_y) = 8 \times pr_{cp_x, cp_y} + nr_{cp_x, cp_y} \tag{1}$$

In Eq. 1, the pixel $(cp_x, cp_y)$ is the central pixel of a neighborhood, $nr$ denotes the minimum negative response, and $pr$ denotes the maximum positive response [38]. The result of applying the LDN approach to some images is shown in Fig. 3.

Our deep learning model

In this part, we explain how our model is able to learn more informative and unique features from the original and encoded images for visual tracking. In the first step, the gap


Fig. 2 The result of applying Kirsch kernels to an image

Fig. 3 The result of applying the LDN approach to some images
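
The encoding of Eq. 1 can be illustrated with a short sketch. The Python code below is only a minimal illustration under stated assumptions, not the implementation used in this work: it builds the eight Kirsch masks by rotating the border of a base mask in 45° steps, keeps the per-pixel maximum response as the edge map, and forms the code 8 × pr + nr. Here pr and nr are taken to be the direction numbers (0 to 7) of the maximum positive and minimum negative responses, as in the usual LDN formulation. The function names, the edge-replicated padding, and the numbering of the directions (index 0 is the base mask) are assumptions made for the example.

```python
import numpy as np

# Border positions of a 3x3 mask, listed clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def kirsch_kernels():
    """Return the 8 Kirsch compass masks; each is a 45-degree rotation of the previous one."""
    base = np.array([[-3.0, -3.0, 5.0],
                     [-3.0,  0.0, 5.0],
                     [-3.0, -3.0, 5.0]])
    ring_values = [base[r, c] for r, c in RING]
    kernels = []
    for shift in range(8):
        mask = np.zeros((3, 3))
        for idx, (r, c) in enumerate(RING):
            mask[r, c] = ring_values[(idx + shift) % 8]
        kernels.append(mask)
    return kernels


def filter3x3(image, kernel):
    """Correlate a 2-D image with a 3x3 mask, using edge-replicated padding."""
    height, width = image.shape
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    out = np.zeros((height, width))
    for dr in range(3):
        for dc in range(3):
            out += kernel[dr, dc] * padded[dr:dr + height, dc:dc + width]
    return out


def ldn_encode(image):
    """Return (LDN code image, Kirsch edge map) for a 2-D grayscale image."""
    responses = np.stack([filter3x3(image, k) for k in kirsch_kernels()])
    edge_map = responses.max(axis=0)   # maximum response at each location
    pr = responses.argmax(axis=0)      # direction number of the maximum positive response
    nr = responses.argmin(axis=0)      # direction number of the minimum negative response
    return 8 * pr + nr, edge_map


if __name__ == "__main__":
    frame = np.zeros((6, 6))
    frame[:, 3:] = 1.0                 # hypothetical frame with a vertical step edge
    codes, edges = ldn_encode(frame)
    print(codes.min(), codes.max())    # every code lies between 0 and 63
```

Because the code depends only on which directions give the extreme responses, and not on their magnitudes, a uniform rescaling of the image intensities leaves the encoded image unchanged.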


Fig. 4 Schematic of our framework for target detection and tracking in infrared images
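
The kernel-selection step of the target attention mechanism (Eqs. 2–4) can be sketched in a few lines. The Python code below is a simplified illustration, not the implementation used in this work: it assumes that the gradients of the two losses with respect to every feature channel are already available as arrays of shape (channels, height, width), scores each channel by the global mean pooling of its gradients, and keeps the k highest-scoring channels. The Gaussian label map of Eq. 4 is included as the regression target. All function names and the toy array sizes are hypothetical.

```python
import numpy as np


def gaussian_label(height, width, sigma):
    """Eq. 4: Y(i, j) = exp(-(i^2 + j^2) / (2 sigma^2)), with (i, j) the offset from the patch centre."""
    rows = np.arange(height)[:, None] - (height - 1) / 2.0
    cols = np.arange(width)[None, :] - (width - 1) / 2.0
    return np.exp(-(rows ** 2 + cols ** 2) / (2.0 * sigma ** 2))


def channel_importance(*loss_gradients):
    """Eq. 3: add up, over the losses, the global mean pooling of each channel's gradient."""
    scores = np.zeros(loss_gradients[0].shape[0])
    for grad in loss_gradients:
        scores += grad.mean(axis=(1, 2))
    return scores


def select_channels(features, importance, k):
    """Selection function of Eq. 2: keep the k channels with the highest importance."""
    top_k = np.argsort(importance)[::-1][:k]
    keep = np.sort(top_k)              # preserve the original channel order
    return features[keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(6, 9, 9))         # toy feature map with 6 channels
    grad_regression = rng.normal(size=(6, 9, 9))  # stand-in for dLoss_1 / dFilter_i
    grad_scoring = rng.normal(size=(6, 9, 9))     # stand-in for dLoss_2 / dFilter_i
    delta = channel_importance(grad_regression, grad_scoring)
    selected, kept = select_channels(features, delta, k=3)
    print(kept, selected.shape)
```

In practice the gradient arrays would come from backpropagating the regression and scoring losses through the network; random arrays are used here only to keep the example self-contained.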

between the features obtained from a pre-trained convolutional neural network (CNN) and efficient representations of the best features for visual tracking is introduced. Next, the attention mechanism for the feature extracting framework is investigated, comprising a scale-sensitive feature generation component and a discriminative feature generation module based on the gradients of regression and scoring losses. Our pipeline is displayed in Fig. 4.

Target attention mechanism

The features needed to track a given target differ in several ways from the features used for the visual recognition of general objects. Firstly, most of the features extracted by a pre-trained CNN are uninformative and do not cover all key details about general objects. In a task with predefined object classes, the class labels of the testing and training samples are consistent and pre-defined, whereas a general-purpose online object tracking system faces a countless number of classes. Secondly, the trained weights and biases of a pre-trained CNN model aim to increase inter-class differences and cannot deal properly with intra-class variation. As a result, many of the features are insignificant for predicting scale variations and for distinguishing the target from very similar objects. Lastly, as inter-class differences are principally related to only some feature maps, the features extracted by a trained deep learning model are sparsely activated by each class annotation. Moreover, a significant share of the convolution kernels (filters) detects uninformative details and redundancy, leading to overfitting


and a high computational load. Accordingly, only some convolutional kernels are able to detect patterns related to the target object.

Many image processing strategies rely on convolutional kernels to recognize hidden patterns inside the image. This group-level object information is calculated through the corresponding gradients [2, 39–44].

Recently, a gradient-weighted class activation mapping (Grad-CAM) model was proposed in [2] to produce a highlighted feature map by calculating a weighted sum of neurons along the feature channels. This strategy calculates the gradient at each input pixel, which indicates the importance of that pixel for a given class annotation. In other words, the weight of a feature channel is produced by mean pooling all the gradients in that channel. Different from the gradient-based models employing classification losses, a ranking loss and a regression loss are used in our study. Our strategy is specifically designed for the tracking task: it recognizes the convolutional kernels that contribute most to detecting the pattern of the target and that are sensitive to scale variations.

Using the gradient-based strategy, a target attention mechanism with losses designed for visual tracking has been implemented in this study. Given a CNN feature extractor with output feature map $\Upsilon$, a subspace $\zeta$ is computed using the channel importance $\Delta$ as

$$\zeta = \psi_1(\Upsilon_1 \mid \Delta_1) + \psi_2(\Upsilon_2 \mid \Delta_2) \tag{2}$$

where $\psi_1$ and $\psi_2$ are selecting functions that choose the key channels for image1 and image2, respectively. The score of the i-th channel, combining $\Delta_{1i}$ and $\Delta_{2i}$, can be calculated by

$$\Delta_i = \mathrm{mean\,pooling}_{\mathrm{global}}\left(\frac{\partial \mathrm{Loss}_1}{\partial \mathrm{Filter}_i}\right) + \mathrm{mean\,pooling}_{\mathrm{global}}\left(\frac{\partial \mathrm{Loss}_2}{\partial \mathrm{Filter}_i}\right) \tag{3}$$

where $\mathrm{Filter}_i$ denotes the output of the i-th filter and Loss indicates the loss function.

In this study, we explore the best convolutional filters to obtain the best possible visual tracking results by finding those that are inactive on the background while active on the target region. That is, the training process uses a loss function to find the best possible values for the weights and biases, which thereby learn how to respond to the background and the target region. So, a regression approach is employed to map all the $\mathrm{pixels}_{i,j}$ inside the image patch aligned with the target center to a Gaussian label map

$$Y(i, j) = e^{-\frac{i^2 + j^2}{2\sigma^2}} \tag{4}$$

where $(i, j)$ is the offset from the target and $\sigma$ is the kernel width. Moreover, to limit the computing time, the problem is formulated with a ridge regression loss

$$\mathrm{Loss}_{\mathrm{regression}} = \left\| Y_1(i, j) - W_1 \ast \mathrm{pixels}_{i,j} \right\|^2 + \gamma \left\| W_1 \right\|^2 + \left\| Y_2(i, j) - W_2 \ast \mathrm{pixels}_{i,j} \right\|^2 + \gamma \left\| W_2 \right\|^2 \tag{5}$$

where $W$ indicates the regressor weights, $\gamma$ is the regularization weight, and $\ast$ denotes the convolution operation. The importance of each kernel is calculated from the derivative of $\mathrm{Loss}_{\mathrm{regression}}$ with respect to the input feature $\mathrm{pixels}_{\mathrm{input}}$, that is, from its contribution to fitting the label map. By considering Eq. 5 and the chain rule, the gradient of $\mathrm{Loss}_{\mathrm{regression}}$ can be calculated by

$$\frac{\partial \mathrm{Loss}_{\mathrm{regression}}}{\partial \mathrm{pixels}_{\mathrm{input}}} = \sum_{i,j} \left( \frac{\partial \mathrm{Loss}_{\mathrm{regression}}}{\partial \mathrm{pixels}_{\mathrm{predicted}}(i, j)} \times \frac{\partial \mathrm{pixels}_{\mathrm{predicted}}(i, j)}{\partial \mathrm{pixels}_{\mathrm{input}}(i, j)} \right) = \sum_{i,j} 2 \left( Y(i, j) - \mathrm{pixels}_{\mathrm{predicted}}(i, j) \right) \times W \tag{6}$$

According to Eq. 3 and the gradient of the regression loss, the target-active kernels, which are able to distinguish between the background and the target, can be identified. Selecting only these kernels with the gradient strategy produces more discriminative deep features that focus on the specified target. This strategy eliminates many uninformative parts of the image and mitigates over-fitting. In other words, when we remove the uninformative parts of the image, we eliminate many uninformative features from the training feature vectors. So, the degree of data imbalance decreases dramatically, which helps to overcome the problem of over-fitting.

Size attention mechanism

To increase the robustness of target detection against strong occlusion and noise, it is essential to find robust kernels that are able to detect variations in the size of the target. Because the target's size changes at a non-continuous rate, it is not easy to find the size of the object precisely in each frame. However, by using the proposed network to find a paired sample, we can estimate the closest size variation. So, by formulating the issue as a scoring model and scoring all possible target sizes, we are able to select the


higher score as the target size. The gradients obtained from the score loss indicate which kernels are more sensitive to size variations.

Inspired by [45], we investigate a smooth approximation of the scoring loss function,

$$\mathrm{Loss}_{\mathrm{size\,score}} = \sum_{(\mathrm{sample}_i,\, \mathrm{sample}_j) \in \varphi} \log\left(1 + e^{-\left(f(\mathrm{sample}_i) - f(\mathrm{sample}_j)\right)}\right) \tag{7}$$

where $(\mathrm{sample}_i, \mathrm{sample}_j)$ are pair-wise samples for the training phase and $\varphi$ is the set of training pairs. As suggested in [45], we compute the derivative of $\mathrm{Loss}_{\mathrm{size\,score}}$ with respect to $f(\mathrm{sample})$ by

$$\frac{\partial \mathrm{Loss}_{\mathrm{size\,score}}}{\partial f(\mathrm{sample})} = -\frac{1}{\mathrm{Loss}_{\mathrm{size\,score}}} \sum_{\varphi} h_{i,j}\, e^{-f(\mathrm{sample})\, h_{i,j}} \tag{8}$$

where $h_{i,j} = h_i - h_j$ and $h_i$ is a one-hot vector whose i-th entry is 1 and whose other entries are 0. By backpropagation, the gradient of the scoring loss can be calculated as

$$\frac{\partial \mathrm{Loss}_{\mathrm{size\,score}}}{\partial\, \mathrm{sample}_i} = \frac{\partial \mathrm{Loss}_{\mathrm{size\,score}}}{\partial f(\mathrm{sample}_{\mathrm{predicted}})} \times \frac{\partial f(\mathrm{sample}_{\mathrm{predicted}})}{\partial\, \mathrm{sample}_i} = \frac{\partial \mathrm{Loss}_{\mathrm{size\,score}}}{\partial f(\mathrm{sample}_i)} \times W \tag{9}$$

where $\mathrm{sample}_{\mathrm{predicted}}$ indicates the output estimation and $W$ denotes the filter weights of a Conv layer. According to Eq. 3 and the gradient of the scoring loss, the size-sensitive kernels can be identified. By combining the scoring and regression losses, we are able to detect the kernels that are both sensitive to size variation and active on the target.

Tracking process

The overall pipeline of our suggested tracker is shown in Fig. 4. There are two main reasons for integrating the target attention mechanism and the feature extraction routes. Firstly, the feature extraction routes consider significant features extracted from both the original and the encoded image, which significantly highlights the key details of the target. Secondly, by decreasing the search area inside the image, the proposed model is able to perform the tracking task efficiently.

Our tracking pipeline includes a target attention mechanism, a pre-trained feature extractor, and a matching block. The feature extractor is only pre-trained offline on the classification task. Moreover, the target attention mechanism can be employed in the training process in the first frame.

In the initial training (offline step), the scoring and regression loss functions are trained independently. Next, once the models have converged, the gradients from each loss are computed. Using these gradients from the pre-trained networks, only the kernels with the highest importance scores are chosen to obtain the best possible outcomes.

When dealing with an input video (sequential frames) during online target finding, the likelihood scores between the initial target and the search area in the current frame are computed directly with the target attention mechanism. This step can be conducted by applying a convolution layer to the extracted output to obtain a response map. Each value in the response map indicates how likely it is that the real target lies at that position. Given the search area in the current frame $h_t$ and the initial target $\mathrm{sample}_1$, we can predict the position of the target in frame $t$ as

$$\hat{P} = \arg\max\; \Upsilon'(\mathrm{sample}_1) \ast \Upsilon'(h_t) \tag{10}$$

where $\ast$ denotes the convolution operation.

Experiments

Dataset and implementation details

In this study, training, validation, and testing of the suggested strategy have been carried out on the Dim-small target detection and tracking dataset [46]. This dataset, made by the ATR laboratory of the National University of Defense Technology, comprises 22 image sequences, 30 trajectories, 16,944 targets and 16,177 frames. The aim of this dataset is the detection and tracking of low-altitude flying targets, and the data acquisition scenarios cover complex field backgrounds and sky backgrounds. Figure 5 shows some samples from the dataset. Our pipeline is implemented in Matlab with the MatConvNet toolbox [47] on a PC with a GTX-1080 GPU, a Core i7 3.6 GHz CPU, CUDA 9.0, CuDNN 5.1, and 16 GB of memory.

Assessment metrics

The effectiveness of the suggested pipeline is evaluated using three criteria, namely Sensitivity, Accuracy, and Specificity. Specificity is the proportion of non-targets that have been estimated appropriately (true negative rate). Sensitivity is


Fig. 5 Some examples from the employed dataset

the proportion of targets that have been appropriately recognized (true positive rate or recall). Accuracy is employed as the assessment metric for computing the overlap between the ground truths and the estimated targets [9, 13, 48]. These three criteria are computed by:

$$\mathrm{Sensitivity} = 100 \times \frac{TP}{TP + FN} \tag{11}$$

$$\mathrm{Accuracy} = 100 \times \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$\mathrm{Specificity} = 100 \times \frac{TN}{TN + FP} \tag{13}$$

where False Negative (FN) counts the targets that our suggested tracking pipeline misses, that is, classifies as non-target, while False Positive (FP) counts the objects that do not cover the target but are classified as the target. True Negative (TN) counts the non-target regions that are correctly rejected. Lastly, True Positive (TP) represents the number of targets over all frames that are correctly classified as targets by the proposed technique. In many cases, higher sensitivity values come with lower specificity values. The higher the values of specificity and sensitivity, the better the performance of the pipeline [12, 49, 50].

Experimental results and discussions

We use the VGG-16 model [51] as the base network, since increasing the number of layers with smaller kernels increases the non-linearity (a positive in deep learning). Adaptive Moment Estimation (Adam) is utilized to train the model, with an initial learning rate of $10^{-4}$, a batch size of 70, a maximum of 70 iterations, and a weight decay of $10^{-5}$. To obtain more robust and accurate spatial information, the outputs of the Conv4-1 and Conv4-3 layers are employed as the base deep features. Also, the top 80 significant kernels from the Conv4-1 layer are selected to learn the score-sensitive features, and the top 250 significant kernels from the Conv4-3 layer are selected to learn the target-active features.

To have a clear understanding and for qualitative and quantitative comparison purposes, we also implemented eight other pipelines (Single Shot MultiBox Detector (SSD) [52], Target-aware [3], Discriminative Model [15], Directional morphological filtering (DMF) [14], Kalman filter [1], GF-KCF [18], Siamese network [19] and Grad-CAM [2]) for evaluating the infrared searching and tracking performance of the suggested method. The SSD [52] strategy uses an Adaptive Pipeline Filter (APF) using the motion information and


Table 1 Comparison between our pipeline and other baseline models on the Dim-small target detection and tracking dataset

Method                                          Accuracy (mean)  Sensitivity (mean)  Specificity (mean)  Testing time of videos (mean)/min
SSD [52]                                        92.48            90.70               91.40               9.3
DMF [14]                                        90.12            86.27               88.36               8.7
Kalman filter [1]                               89.79            84.55               87.61               8.9
Discriminative model [15]                       88.45            81.37               84.13               9.2
Target-aware [3]                                90.06            87.15               85.29               8.7
Grad-CAM [2]                                    86.54            80.06               83.73               9.1
GF-KCF [18]                                     88.63            82.04               83.49               9.2
Siamese network [19]                            89.92            88.81               89.37               8.7
Our approach (original image)                   89.26            83.17               82.04               8.1
Our approach (encoded image)                    59.31            63.47               60.81               8.1
Our approach (original image + encoded image)   94.37            91.98               91.62               8.5
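
The three criteria reported in Table 1 follow Eqs. 11–13. As a minimal sketch, they can be computed from the four confusion counts as shown below. The function name and the example counts are hypothetical and are not taken from the experiments; a ratio with a zero denominator is returned as NaN, which is a design choice of this example.

```python
def tracking_metrics(tp, tn, fp, fn):
    """Return Sensitivity, Accuracy and Specificity in percent (Eqs. 11-13)."""
    def percent(numerator, denominator):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": percent(tp, tp + fn),
        "accuracy": percent(tp + tn, tp + tn + fp + fn),
        "specificity": percent(tn, tn + fp),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(tracking_metrics(tp=90, tn=80, fp=20, fn=10))
# {'sensitivity': 90.0, 'accuracy': 85.0, 'specificity': 80.0}
```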

temporal correlation. The DMF [14] algorithm is based on multidirectional morphological filtering and spatiotemporal cues. The Kalman filter [1] strategy is employed to estimate the preliminary target position, which is taken as the center of the region of interest (ROI). The Discriminative Model [15] is capable of fully investigating the background and target appearance information. This model employs a steepest descent-based technique that calculates an optimal step length in each iteration. Then, a module that efficiently initializes the target model is integrated.

The Accuracy, Specificity, and Sensitivity values over all frames for all mentioned frameworks are reported in Table 1. For each index in Table 1, the highest Accuracy, Specificity, and Sensitivity values are highlighted in bold. Notice that DMF [14] and Target-aware [3] reach higher accuracy than most of the other baseline networks, but the sensitivity values of the Siamese network [19] and SSD [52] are still higher. Moreover, there is only a small difference between the Specificity values of DMF [14] and the Kalman filter [1], and between the Sensitivity values of DMF [14] and Target-aware [3]. Among the eight baselines, Grad-CAM [2] has the lowest accuracy and sensitivity, and its specificity (83.73) is only slightly above the lowest value (83.49, GF-KCF [18]).

Moreover, it is clear that the DMF [14], SSD [52], GF-KCF [18], Siamese network [19] and Target-aware [3] models are more stable than Grad-CAM [2], the Discriminative Model [15], and the Kalman filter [1]. Grad-CAM [2] scores below most other models on all metrics and suffers from overfitting. The accuracies of DMF [14] and Target-aware [3] are nearly identical (90.12 and 90.06), a much smaller gap than that between Grad-CAM [2] and the Discriminative Model [15] (86.54 and 88.45). The specificity of SSD [52] (91.40) is the highest among the baseline techniques. Also, using only the original image with our model gives an unacceptable result, although its accuracy and sensitivity are still higher than those of the Discriminative Model [15]. Moreover, using only an encoded image as the network input gives the worst results among all compared methods.

From Table 1, it can be seen that the suggested pipeline obtained higher values on all three criteria for recognizing and tracking targets than all eight other models. This enhancement has several reasons. Firstly, the suggested pipeline pays special attention to finding the important parts of the image rather than investigating all areas inside the image. Secondly, our framework explores the changing size of the target before it happens in the next frame. Lastly, by encoding the original image into a new image, we can find more informative details. Moreover, our strategy can analyze all frames more rapidly than the other approaches. Also, there is only a small difference between the evaluation times of the videos for Grad-CAM [2], SSD [52], and the Discriminative Model [15].

Figure 6 gives a visual demonstration of the good outcomes attained by the proposed framework on the Dim-small target detection and tracking dataset. As indicated, owing to the target attention mechanism, the difference between the target and background values inside the images is increased and the border between them is recognized with high accuracy. Also, the size attention mechanism makes our pipeline more robust in tracking the target when its size varies. However, this does not hold when several targets vary in size at the same time. Moreover, owing to the use of the LDN encoding approach, the suggested tracking framework can explore more unique contextual information from the target and background, which leads to better tracking outcomes. The analysis of our attention-based mechanism CNN model is indicated

123
1504 Complex & Intelligent Systems (2023) 9:1495–1506

Fig. 6 The results of target tracking in infrared images using our framework in different frames in two streams

Fig. 7 The analysis of the suggested network using epoch versus loss

employing epoch versus loss in Fig. 7. Although our tech- that each image has unique and informative characteristics to
nique provides outstanding outcomes compared to the other aid the framework efficiently even if varying size effects are
recently published frameworks, the suggested strategy still present. We introduced a target attention mechanism which is
has limitations when encountering changing size of multi- able to highlight only significant part of the image to work on
targets at the same time. This is due to an increase in the size it. Moreover, we have described that working only on a part of
of the target’s expected region which leads to decreasing per- the image including potential target area allows our network
formance in the feature exploration. to reach performance close to human observers. This leads to
decreasing computational burden of the model and capability
Conclusions to make estimations faster as it eliminates some uninforma-
tive parts of the image. Comprehensive experiments have
In this paper, a novel target detection and tracking in infrared been conducted, which indicate the effectiveness of the sug-
images has been developed that benefits from the character- gested framework by the comparison with the state-of-the-art
ization of an original image and an encoded image. It means models.
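
As a reading aid for the Accuracy, Specificity, and Sensitivity values discussed for Table 1, the following minimal sketch shows how such measures are commonly computed from confusion counts. It assumes the standard definitions (the exact evaluation protocol of the paper is not restated here), and the names `FrameCounts`, `accuracy`, `sensitivity`, and `specificity` are hypothetical helpers introduced only for illustration. The example counts are made up and do not come from Table 1.

```python
from dataclasses import dataclass


@dataclass
class FrameCounts:
    """Confusion counts accumulated over the evaluated frames (hypothetical container)."""
    tp: int  # target present and detected
    tn: int  # background correctly rejected
    fp: int  # background reported as target
    fn: int  # target missed


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a class is absent from the data.
    return numerator / denominator if denominator else 0.0


def accuracy(c: FrameCounts) -> float:
    # Fraction of all decisions that are correct.
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


def sensitivity(c: FrameCounts) -> float:
    # True-positive rate: fraction of real targets that were found.
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: FrameCounts) -> float:
    # True-negative rate: fraction of background that was correctly rejected.
    return _ratio(c.tn, c.tn + c.fp)


if __name__ == "__main__":
    # Illustrative counts only; not results from the paper.
    counts = FrameCounts(tp=40, tn=50, fp=5, fn=5)
    print(f"accuracy={accuracy(counts):.3f}")
    print(f"sensitivity={sensitivity(counts):.3f}")
    print(f"specificity={specificity(counts):.3f}")
```

Under these definitions a tracker can score high on Specificity while scoring lower on Sensitivity, which is one way the differing rankings of the compared models across the three measures can arise.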


Author contributions The specific contributions made by each author are as follows: MP: conceptualization, methodology, implementation, writing-original draft, writing-review and editing. GK: conceptualization, methodology, implementation, writing-original draft, writing-review and editing. BAR: conceptualization, methodology, implementation, writing-original draft, writing-review and editing.

Funding None.

Availability of data and materials The dataset used in this study can be obtained from the corresponding author on reasonable request.

Declarations

Conflict of interest The authors declare that they have no competing interests.

Financial interests The authors declare they have no financial interests.

Non-financial interests The authors declare they have no non-financial interests.

Ethics approval and consent to participate Not applicable.

Consent for publication Not applicable.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Xiao S, Ma Y, Fan F, Huang J, Wu M (2020) Tracking small targets in infrared image sequences under complex environmental conditions. Infrared Phys Technol 104:103102. https://doi.org/10.1016/J.INFRARED.2019.103102
2. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization, pp 618–626 [Online]. http://gradcam.cloudcv.org. Accessed 22 Oct 2021
3. Li X, Ma C, Wu B, He Z, Yang M-H (2019) Target-aware deep tracking. Proc IEEE/CVF Conf, Computer vision and pattern recognition (CVPR), pp 1369–1378. https://doi.org/10.48550/arXiv.1904.01772, arXiv:1904.01772
4. Sun Y, Yang J, An W (2021) Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model. IEEE Trans Geosci Remote Sens 59(5):3737–3752. https://doi.org/10.1109/TGRS.2020.3022069
5. Zhao J, Zhang X, Zhang P (2021) A unified approach for tracking UAVs in infrared, pp 1213–1222. [Online]. https://anti-uav.github.io/. Accessed 05 Nov 2021
6. Zhang X, Ye P, Leung H, Gong K, Xiao G (2020) Object fusion tracking based on visible and infrared images: a comprehensive review. Inf Fusion 63:166–187. https://doi.org/10.1016/J.INFFUS.2020.05.002
7. Wan M et al (2018) Total variation regularization term-based low-rank and sparse matrix representation model for infrared moving target tracking. Remote Sens 10(4):510. https://doi.org/10.3390/RS10040510
8. Saadi SB et al (2021) Osteolysis: a literature review of basic science and potential computer-based image processing detection methods. Comput Intell Neurosci. https://doi.org/10.1155/2021/4196241
9. Xu Z, Sheykhahmad FR, Ghadimi N, Razmjooy N (2020) Computer-aided diagnosis of skin cancer based on soft computing techniques. Open Med 15(1):860–871. https://doi.org/10.1515/med-2020-0131
10. Yao H, Zhang X, Zhou X, Liu S (2019) Parallel structure deep neural network using CNN and RNN with an attention mechanism for breast cancer histology image classification. Cancers (Basel) 11(12):1901. https://doi.org/10.3390/cancers11121901
11. Aleem S, Kumar T, Little S, Bendechache M, Brennan R, McGuinness K (2021) Random data augmentation based enhancement: a generalized enhancement approach for medical datasets. In: 24th Irish machine vision and image processing conference (IMVIP), pp 153–160. https://doi.org/10.56541/FUMF3414
12. Valizadeh A, Jafarzadeh Ghoushchi S, Ranjbarzadeh R, Pourasad Y (2021) Presentation of a segmentation method for a diabetic retinopathy patient’s fundus region detection using a convolutional neural network. Comput Intell Neurosci 2021:1–14. https://doi.org/10.1155/2021/7714351
13. Mousavi SM, Asgharzadeh-Bonab A, Ranjbarzadeh R (2021) Time-frequency analysis of EEG signals and GLCM features for depth of anesthesia monitoring. Comput Intell Neurosci 2021:1–14. https://doi.org/10.1155/2021/8430565
14. Li Y et al (2021) Infrared maritime dim small target detection based on spatiotemporal cues and directional morphological filtering. Infrared Phys Technol 115:103657. https://doi.org/10.1016/J.INFRARED.2021.103657
15. Bhat G, Danelljan M, Van Gool L, Timofte R (2019) Learning discriminative model prediction for tracking, pp 6182–6191 [Online]. https://github.com/visionml/pytracking. Accessed 28 Oct 2021
16. Zhang X, Ye P, Peng S, Liu J, Gong K, Xiao G (2019) SiamFT: an RGB-infrared fusion tracking method via fully convolutional Siamese networks. IEEE Access 7:122122–122133. https://doi.org/10.1109/ACCESS.2019.2936914
17. Zulkifley MA, Trigoni N (2018) Multiple-model fully convolutional neural networks for single object tracking on thermal infrared video. IEEE Access 6:42790–42799. https://doi.org/10.1109/ACCESS.2018.2859595
18. Yang X, Li S, Yu J, Zhang K, Yang J, Yan J (2021) GF-KCF: aerial infrared target tracking algorithm based on kernel correlation filters under complex interference environment. Infrared Phys Technol 119:103958. https://doi.org/10.1016/J.INFRARED.2021.103958
19. Yao T, Hu J, Zhang B, Gao Y, Li P, Hu Q (2021) Scale and appearance variation enhanced Siamese network for thermal infrared target tracking. Infrared Phys Technol 117:103825. https://doi.org/10.1016/J.INFRARED.2021.103825
20. Parhizkar M, Amirfakhrian M (2022) Car detection and damage segmentation in the real scene using a deep learning approach. Int J Intell Robot Appl 2022:1–15. https://doi.org/10.1007/S41315-022-00231-5
21. Karimi N, Ranjbarzadeh Kondrood R, Alizadeh T (2017) An intelligent system for quality measurement of Golden Bleached raisins using two comparative machine learning algorithms. Meas J Int Meas Confed 107:68–76. https://doi.org/10.1016/j.measurement.2017.05.009


22. Ranjbarzadeh R, Bagherian Kasgari A, Jafarzadeh Ghoushchi S, Anari S, Naseri M, Bendechache M (2021) Brain tumor segmentation based on deep learning and an attention mechanism using MRI multi-modalities brain images. Sci Rep 11(1):10930. https://doi.org/10.1038/s41598-021-90428-8
23. Aghamohammadi A, Ranjbarzadeh R, Naiemi F, Mogharrebi M, Dorosti S, Bendechache M (2021) TPCNN: two-path convolutional neural network for tumor and liver segmentation in CT images using a novel encoding approach. Expert Syst Appl 183:115406. https://doi.org/10.1016/J.ESWA.2021.115406
24. Abbasi S, Rezaeian M (2021) Visual object tracking using similarity transformation and adaptive optical flow. Multimed Tools Appl 80(24):33455–33473. https://doi.org/10.1007/S11042-021-11344-7
25. Mamli S, Kalbkhani H (2019) Gray-level co-occurrence matrix of Fourier synchro-squeezed transform for epileptic seizure detection. Biocybern Biomed Eng 39(1):87–99. https://doi.org/10.1016/j.bbe.2018.10.006
26. Tuncer T, Dogan S, Ozyurt F (2020) An automated residual exemplar local binary pattern and iterative ReliefF based corona detection method using lung X-ray image. Chemom Intell Lab Syst 203:104054. https://doi.org/10.1016/j.chemolab.2020.104054
27. Amirfakhrian M, Parhizkar M (2021) Integration of image segmentation and fuzzy theory to improve the accuracy of damage detection areas in traffic accidents. J Big Data. https://doi.org/10.1186/s40537-021-00539-2
28. Hojatimalekshah A, Uhlmann Z, Glenn NF, Hiemstra CA, Tennant CJ, Graham JD, Spaete L, Gelvin A, Marshall HP, McNamara JP, Enterkine J (2021) Tree canopy and snow depth relationships at fine scales with terrestrial laser scanning. Cryosphere 15(5):2187–2209. https://doi.org/10.5194/TC-15-2187-2021
29. Ranjbarzadeh R, Saadi SB, Amirabadi A (2020) LNPSS: SAR image despeckling based on local and non-local features using patch shape selection and edges linking. Meas J Int Meas Confed. https://doi.org/10.1016/j.measurement.2020.107989
30. El Khadiri I et al (2021) Petersen graph multi-orientation based multi-scale ternary pattern (PGMO-MSTP): an efficient descriptor for texture and material recognition. IEEE Trans Image Process 30:4571–4586. https://doi.org/10.1109/TIP.2021.3070188
31. Liu L, Lao S, Fieguth PW, Guo Y, Wang X, Pietikäinen M (2016) Median robust extended local binary pattern for texture classification. IEEE Trans Image Process 25(3):1368–1381. https://doi.org/10.1109/TIP.2016.2522378
32. Ali H, Sharif M, Yasmin M, Rehmani MH (2017) Computer-based classification of chromoendoscopy images using homogeneous texture descriptors. Comput Biol Med 88:84–92. https://doi.org/10.1016/J.COMPBIOMED.2017.07.002
33. Ilie M (2015) A content-based image retrieval approach based on document queries. Emerg Trends Image Process Comput Vis Pattern Recognit. https://doi.org/10.1016/B978-0-12-802045-6.00020-X
34. Naiemi F, Ghods V, Khalesi H (2021) A novel pipeline framework for multi oriented scene text image detection and recognition. Expert Syst Appl 170:114549. https://doi.org/10.1016/j.eswa.2020.114549
35. Uddin MZ, Hassan MM, Almogren A, Zuair M, Fortino G, Torresen J (2017) A facial expression recognition system using robust face features from depth videos and deep learning. Comput Electr Eng 63:114–125. https://doi.org/10.1016/j.compeleceng.2017.04.019
36. Luo YT et al (2016) Local line directional pattern for palmprint recognition. Pattern Recognit 50:26–44. https://doi.org/10.1016/j.patcog.2015.08.025
37. Ranjbarzadeh R, Saadi SB (2020) Automated liver and tumor segmentation based on concave and convex points using fuzzy c-means and mean shift clustering. Meas J Int Meas Confed. https://doi.org/10.1016/j.measurement.2019.107086
38. Michael Revina I, Sam Emmanuel WR (2018) Face expression recognition using LDN and dominant gradient local ternary pattern descriptors. J King Saud Univ Comput Inf Sci. https://doi.org/10.1016/j.jksuci.2018.03.015
39. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015) Learning deep features for discriminative localization. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit 2016-December:2921–2929. https://arxiv.org/abs/1512.04150v1. Accessed 22 Oct 2021 [Online]
40. Goyal B, Dawa, Lepcha C, Dogra A, Wang S-H, Lepcha DC (2021) A weighted least squares optimisation strategy for medical image super resolution via multiscale convolutional neural networks for healthcare applications. Complex Intell Syst 1:1–16. https://doi.org/10.1007/S40747-021-00465-Z
41. Ilesanmi AE, Ilesanmi TO (2021) Methods for image denoising using convolutional neural network: a review. Complex Intell Syst 7(5):2179–2198. https://doi.org/10.1007/S40747-021-00428-4
42. Haq EU, Jianjun H, Huarong X, Li K (2021) Block-based compressed sensing of MR images using multi-rate deep learning approach. Complex Intell Syst 7(5):2437–2451. https://doi.org/10.1007/S40747-021-00426-6
43. Kumar T, Park J, Bae S-H (2020) Search for optimal data augmentation policy for environmental sound classification with deep neural networks. J Broadcast Eng 25(6):854–860. https://doi.org/10.5909/JBE.2020.25.6.854
44. Baseri Saadi S, Tataei Sarshar N, Sadeghi S, Ranjbarzadeh R, Kooshki Forooshani M, Bendechache M (2022) Investigation of effectiveness of shuffled frog-leaping optimizer in training a convolution neural network. J Healthc Eng 2022:1–11. https://doi.org/10.1155/2022/4703682
45. Li Y, Song Y, Luo J (2017) Improving pairwise ranking for multi-label image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3617–3625. https://doi.org/10.48550/arXiv.1704.03135
46. Hui B et al (2019) A dataset for infrared image dim-small aircraft target detection and tracking under ground/air background. https://www.scidb.cn/en/detail?dataSetId=720626420933459968. Accessed 27 Oct 2021
47. Vedaldi A, Lenc K (2015) MatConvNet: convolutional neural networks for MATLAB. In: MM ’15: Proceedings of the 23rd ACM international conference on multimedia, pp 689–692. https://doi.org/10.1145/2733373.2807412
48. Liu Q, Liu Z, Yong S, Jia K, Razmjooy N (2020) Computer-aided breast cancer diagnosis based on image segmentation and interval analysis. Automatika 61(3):496–506. https://doi.org/10.1080/00051144.2020.1785784
49. Ghoushchi SJ, Ranjbarzadeh R, Najafabadi SA, Osgooei E, Tirkolaee EB (2021) An extended approach to the diagnosis of tumour location in breast cancer using deep learning. J Ambient Intell Humaniz Comput. https://doi.org/10.1007/S12652-021-03613-Y
50. Ranjbarzadeh R et al (2021) Lung infection segmentation for COVID-19 pneumonia based on a cascade convolutional network from CT images. Biomed Res Int 2021:1–16. https://doi.org/10.1155/2021/5544742
51. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. http://www.robots.ox.ac.uk/. Accessed 11 Jun 2021 [Online]
52. Ding L, Xu X, Cao Y, Zhai G, Yang F, Qian L (2021) Detection and tracking of infrared small target by jointly using SSD and pipeline filter. Digit Signal Process 110:102949. https://doi.org/10.1016/J.DSP.2020.102949

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
