
Pattern Recognition 158 (2025) 111028

TBConvL-Net: A hybrid deep learning architecture for robust medical image segmentation

Shahzaib Iqbal a,b,∗, Tariq M. Khan c, Syed S. Naqvi b, Asim Naveed b,d, Erik Meijering c

a Department of Electrical Engineering, Abasyn University, Islamabad, Pakistan
b Department of Electrical and Computer Engineering, COMSATS University Islamabad, Islamabad, Pakistan
c School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
d Department of Computer Science and Engineering, University of Engineering and Technology Lahore, Narowal Campus, Pakistan

ARTICLE INFO

Keywords: Medical image segmentation, CNN, LSTM, Vision transformers

ABSTRACT

Deep learning has shown great potential for automated medical image segmentation to improve the precision and speed of disease diagnostics. However, the task presents significant difficulties due to variations in the scale, shape, texture, and contrast of the pathologies. Traditional convolutional neural network (CNN) models have certain limitations when it comes to effectively modelling multiscale context information and facilitating information interaction between skip connections across levels. To overcome these limitations, a novel deep learning architecture is introduced for medical image segmentation, which takes advantage of CNNs and vision transformers. Our proposed model, named TBConvL-Net, involves a hybrid network that combines the local features of a CNN encoder–decoder architecture with long-range and temporal dependencies using bidirectional convolutional long short-term memory (LSTM) networks and vision transformers (ViT). This enables the model to capture contextual channel relationships in the data and account for the uncertainty of the segmentation over time. Additionally, we introduce a novel composite loss function that considers both segmentation robustness and boundary agreement of the predicted output with the gold standard. Our proposed model shows consistent improvement over the state of the art on ten publicly available datasets of seven different medical imaging modalities.

1. Introduction

Accurate segmentation of lesions and other pathologies from medical images poses a significant challenge, but remains a crucial task in the field of medical image analysis [1]. Relying solely on expert opinions for diagnosis can be time-consuming and subject to bias from clinical experience. Hence, automated medical image segmentation (MIS) can be greatly valuable for medical professionals and can offer substantial advantages for disease diagnosis and treatment planning. In the field of computer vision, convolutional neural networks (CNNs) have gained prominence as the prevailing segmentation method.

Most segmentation methods commonly use U-Net [2] or its variants [3–5]. However, the localised nature of convolutional operations in CNNs [6] imposes constraints on their capacity to capture long-range dependencies, which can lead to less-than-optimal segmentation outcomes. This leads to two notable drawbacks. First, the utilisation of small convolutional kernels mainly focuses on local features, neglecting the importance of global features. Global features are crucial for reliably segmenting medical images with varying lesion shapes and sizes [7]. Also, once they have been trained, the convolutional kernels cannot change based on the content of the input image. This makes the network less adaptable to different input features.

Self-attention-based transformers have gained prominence in natural language processing, and their application to computer vision has attracted interest [8,9]. Vision Transformers (ViT) [10] emerged as the pioneering approach that used transformer encoders for image classification. ViT did as well or better than CNN-based models, showing that self-attention mechanisms could be useful for computer vision. Transformers have also been used for other visual tasks, such as object detection [11] and semantic segmentation [12], with state-of-the-art (SOTA) performance. In MIS, TransUNet [13] was the pioneer model to incorporate a hybrid architecture consisting of CNN and transformers. Since then, transformer encoder–decoder models that are entirely based on transformers, such as Swin-UNet [14] and nnFormer [15], have been suggested to segment volumetric medical images. These approaches have shown strong performance because of their ability to capture interactions over long distances and dynamically encode features.

∗ Corresponding author at: Department of Electrical Engineering, Abasyn University, Islamabad, Pakistan.
E-mail address: [email protected] (S. Iqbal).

https://doi.org/10.1016/j.patcog.2024.111028
Received 15 February 2024; Received in revised form 7 August 2024; Accepted 15 September 2024
Available online 18 September 2024
0031-3203/© 2024 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

Although transformers have shown great success in modelling long-range dependencies and have been applied to MIS, they still have limitations. One of the drawbacks is that transformers tend to ignore crucial spatial and local feature information. Another drawback is that they require large datasets for training, which limits their ability to model local visual cues. It should also be noted that transformers, despite their strengths, are limited in their ability to learn features using a token-wise attention mechanism on a single scale. Because of this limitation, transformers cannot easily record feature dependencies between channels at different scales, which can be a problem when working with pathologies that are different in size and shape. Consequently, there exists a potential for hybrid CNN-transformer architectures for MIS, as CNNs and transformers have complementary strengths. CNNs are data efficient and suitable for preserving local spatial information. Transformers, on the other hand, can model long-range dependencies and perform dynamic attention, making them useful for segmenting large-scale lesions. Previous work [16] has attempted to combine these two types of model for feature encoding, but with high computational complexity and reliance on large-scale datasets such as ImageNet. Moreover, current hybrid techniques employ only a single token granularity in each attention layer, disregarding the channel relationships of transformers and their importance in feature extraction.

To overcome these limitations, it is necessary to explore effective ways to integrate the strengths of CNNs and transformers for MIS while maintaining computational efficiency and avoiding their respective drawbacks. Here, we introduce TBConvL-Net, a novel architecture that combines the strengths of CNNs and transformers with bidirectional long-short-term memory (LSTM) models specifically designed for MIS. The encoder part consists of several hierarchical separable convolutional layers of CNNs, responsible for capturing the spatial information in the input image. In the decoder section of the architecture, multiple separable convolutional layers and upsampling layers are used to facilitate the reconstruction process. Bridging the semantic and resolution gap between the encoder and decoder features is crucial to capture the global multiscale context in MIS. Specifically, encoder features have higher resolution, enabling them to capture more fine-grained information, while decoder features possess higher semantic information and contextual understanding. Therefore, the objective is to learn the transfer of contextual information on a multiscale while preserving the integrity of semantic information. We aim to achieve this without adversely impacting the richness and accuracy of contextual understanding embedded within the decoder features by using a combination of bidirectional ConvLSTM (BConvLSTM) and transformers within the skip connections of the proposed TBConvL-Net. This enables robust feature fusion in the encoder–decoder architecture, marking the first instance of such an application. The key idea is to leverage the power of vision transformers for contextual processing in the spatial domain while considering the temporal interactions between features for semantically aware feature fusion. However, the challenge in employing traditional transformers in dense prediction tasks is the quadratic computational complexity of self-attention. To reduce complexity and improve the modelling of long-range dependencies and robustness to image variations, we introduce a lightweight Swin Transformer Block (STB) in skip connections for semantically aware feature fusion [17]. The shifted windowing-based self-attention layers model long-range dependencies and dynamic attention across the image, allowing the model to capture features at different scales and encode channel relationships. BConvLSTM complements the transformer in learning the forward and backward temporal dependencies and patterns between the encoder and the decoder features.

Compared to existing methods, the proposed TBConvL-Net features the following innovations, facilitating the MIS task. First, the design of a hybrid network that takes into account the local features of a CNN encoder–decoder architecture, as well as temporal and long-range dependencies through BConvLSTM and Swin Transformer, allows one to account for segmentation uncertainties over time and captures contextual channel relationships in the data. Second, the composite loss function considers both the robustness of the segmentation and the boundary agreement of the predicted output with the gold standard. Third, the use of depth-wise separable convolutions instead of traditional convolutions minimises computational burden and improves feature learning by exploiting filter redundancy. Utilising the optimal number of filters prevents filter overlap and promotes convergence to globally optimal minima. The proposed method is evaluated for seven medical image segmentation applications using ten public datasets. Tasks include thyroid nodule segmentation, breast cancer lesion segmentation, optic disc segmentation, chest radiograph segmentation, nuclei cell segmentation, fluorescent neuronal cell segmentation, and skin lesion segmentation. Our experimental results show that our method consistently outperforms current SOTA methods while also requiring fewer computational resources. Therefore, the method offers great benefits for segmenting medical images with limited resources.

2. Proposed method

TBConvL-Net consists of multiple key components (Fig. 1). Here, we describe the encoder–decoder architecture, the BConvLSTM and transformer block, and the loss function of the proposed network.

2.1. Encoder–decoder architecture

The encoder component of TBConvL-Net consists of four stages, with each stage comprising two separable convolutional layers using 3 × 3 filters. This is followed by a 2 × 2 max pooling layer and the application of a Rectified Linear Unit (ReLU) activation function. With each subsequent stage, the number of filters doubles compared to the previous stage. By progressively increasing the layer dimensions, the TBConvL-Net encoder path gradually extracts visual features, culminating in the final layer generating high-level semantic information based on high-dimensional image representations. Unlike legacy feature learning in CNNs, where independent feature learning is promoted in different layers, densely connected convolutions are proposed [18]. The concept of ''collective knowledge'' is used to improve network performance by reusing feature maps throughout the network. In this approach, the feature maps generated from earlier convolutional layers are integrated with those from the existing layer. This combined output is then fed into the subsequent convolutional layer. Densely connected convolutions have notable benefits over traditional convolutions [18]. First, they help the network learn a broad range of feature maps rather than redundant features. Additionally, feature reuse and information sharing throughout the network enhance the network's ability to represent complex features. Finally, as densely connected convolutions can benefit from every feature that has been formed before them, the network is able to avoid the danger of gradients bursting or vanishing.

Let 𝑙∗×∗ denote the depth-wise separable convolution (𝑓𝑠∗×∗) operation of any given kernel size (∗ × ∗) followed by the ReLU activation function (ℜ) and batch normalisation (𝛽𝑁) operation on any given input (𝐼):

𝑙∗×∗ = 𝛽𝑁(ℜ(𝑓𝑠∗×∗(𝐼))). (1)

Furthermore, let 𝐵𝑖enc be the output of the 𝑖th encoder block, where 𝑖 = 1, 2, 3, computed by applying two consecutive 𝑙3×3 operations followed by the 𝑖th (2 × 2) max-pooling operation (𝑀𝑃𝑖) on the encoded features (𝜒in):

𝐵𝑖enc = 𝑀𝑃𝑖(𝑙3×3(𝑙3×3(𝜒in))). (2)

In TBConvL-Net, three encoding blocks are used with progressively smaller spatial input dimensions, namely 𝑊 × 𝐻 × 𝐶 for 𝐵1enc, ½𝑊 × ½𝐻 × 2𝐶 for 𝐵2enc, and ¼𝑊 × ¼𝐻 × 4𝐶 for 𝐵3enc. After the encoding blocks, three densely connected depth-wise separable convolution blocks 𝐵𝑖den are used with spatial input dimensions ⅛𝑊 × ⅛𝐻 × 8𝐶.
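To make the encoder construction concrete, the following is a minimal Keras sketch of the 𝑙3×3 operation in Eq. (1) and one encoder block as in Eq. (2). It is an illustrative reconstruction under stated assumptions (filter counts, input size), not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def l_conv(x, filters, kernel_size=3):
    # Eq. (1): depth-wise separable convolution, then ReLU, then batch normalisation.
    x = layers.SeparableConv2D(filters, kernel_size, padding="same")(x)
    x = layers.ReLU()(x)
    return layers.BatchNormalization()(x)

def encoder_block(x, filters):
    # Eq. (2): two consecutive l_{3x3} operations followed by 2 x 2 max pooling.
    x = l_conv(x, filters)
    x = l_conv(x, filters)
    skip = x                              # kept for the skip connection to the decoder
    x = layers.MaxPooling2D(pool_size=2)(x)
    return x, skip

# Example: three encoder blocks with the filter count doubling at each stage
# (C, 2C, 4C) as described above; C = 32 and a 256 x 256 input are assumed values.
inputs = layers.Input(shape=(256, 256, 3))
x, s1 = encoder_block(inputs, 32)         # W x H x C
x, s2 = encoder_block(x, 64)              # W/2 x H/2 x 2C
x, s3 = encoder_block(x, 128)             # W/4 x H/4 x 4C
```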


Fig. 1. Block diagram of the TBConvL-Net architecture, showing its key components: encoder, decoder, and skip connections with BConvLSTM and Transformer layers.

The output of the first dense block is calculated by applying the two consecutive operations 𝑙3×3 and ReLU activation on the last encoding block 𝐵3enc and then processed by the STB (𝑆ViT) and concatenated with itself to obtain 𝐵1den (Fig. 1):

𝐵1den = 𝑆ViT(ℜ(𝑙3×3(𝑙3×3(𝐵3enc)))) © ℜ(𝑙3×3(𝑙3×3(𝐵3enc))), (3)

where © denotes the concatenation operation. The output of the second dense block is calculated by applying two consecutive 𝑙3×3 operations and ReLU activation on 𝐵1den and concatenated with 𝐵1den and the two consecutive operations 𝑙3×3 followed by ReLU activation on 𝐵3enc to obtain 𝐵2den (Fig. 1):

𝐵2den = ℜ(𝑙3×3(𝑙3×3(𝐵1den))) © 𝐵1den © ℜ(𝑙3×3(𝑙3×3(𝐵3enc))). (4)

Similarly, the output of the last dense block is calculated by applying two consecutive 𝑙3×3 operations and ReLU activation on 𝐵2den and concatenated with 𝐵2den and the two consecutive operations 𝑙3×3 followed by ReLU activation on 𝐵3enc to obtain 𝐵3den (Fig. 1):

𝐵3den = ℜ(𝑙3×3(𝑙3×3(𝐵2den))) © 𝐵2den © ℜ(𝑙3×3(𝑙3×3(𝐵3enc))). (5)

This approach of merging the outputs of earlier blocks with the output of the current block enhances the ability of the network to learn more complex and high-level features. The outputs of the decoder blocks are computed as:

𝐵1dec = 𝑇𝑐(𝑙3×3(𝑙3×3(𝐵3den))) © 𝑆ViT(▴⇋lstm(𝐵3enc)),
𝐵2dec = 𝑇𝑐(𝑙3×3(𝑙3×3(𝐵1dec))) © 𝑆ViT(▴⇋lstm(𝐵2enc)), (6)
𝐵3dec = 𝑇𝑐(𝑙3×3(𝑙3×3(𝐵2dec))) © 𝑆ViT(▴⇋lstm(𝐵1enc)),

where 𝑇𝑐 is the transposed convolution operation of the 𝑖th decoder block, 𝑆ViT is the STB, and ▴⇋lstm denotes BConvLSTM. The final output of TBConvL-Net, 𝜒out, is calculated by applying two consecutive 𝑙3×3 operations followed by the sigmoid function 𝜚 on the output of the last decoder block 𝐵3dec:

𝜒out = 𝜚(𝑙3×3(𝑙3×3(𝐵3dec))). (7)

2.2. Bidirectional ConvLSTM and transformer block

By modelling long-range dependencies in both directions, bidirectional LSTMs can capture contextual information from past and future steps in the sequence. This can enhance the network's ability to learn complex patterns and relationships in the data. On the other hand, Swin Transformers [14] use a hierarchical approach to process non-overlapping local image patches, allowing them to learn features at various scales. This significantly enhances the ability of the network to model complex structures and relationships in the data. In the Swin Transformer architecture, the attention mechanism is employed both across patches and within them. This enables capturing global relationships among different parts of the input data. By considering both local and global dependencies, the Swin Transformer effectively learns the contextual information necessary for various tasks.

In our proposed network, the output of the batch normalisation step, 𝛽𝑁out, is fed into a ConvLSTM layer (Fig. 2). This layer comprises a memory cell (𝑀𝑐𝑡), an output gate (𝜙𝑡), an input gate (𝑖𝑡) and a forget gate (𝑓𝑡). These gates serve as control mechanisms for the ConvLSTM layer, with the input, output, and forget gates specifically controlling the access, updating, and clearing of the memory cells, respectively. The structure and operation of ConvLSTM can be formalised as follows:

𝑀𝑐𝑡 = 𝑓𝑡 ⊗ 𝑀𝑐𝑡−1 + 𝑖𝑡 tanh(𝜛(ℑ,𝑀𝑐) ⊛ ℑ𝑡 + 𝜛(ℎ,𝑀𝑐) ⊛ ℘𝑡−1 + 𝛽𝑀𝑐), (8)
𝜙𝑡 = 𝜚(𝜛(ℑ,𝜙) ⊛ ℑ𝑡 + 𝜛(ℎ,𝜙) ⊛ ℘𝑡−1 + 𝜛(𝑀𝑐,𝜙) ⊗ 𝑀𝑐𝑡 + 𝛽𝜙), (9)
𝑖𝑡 = 𝜚(𝜛(ℑ,𝑖) ⊛ ℑ𝑡 + 𝜛(℘,𝑖) ⊛ ℘𝑡−1 + 𝜛(𝑀𝑐,𝑖) ⊛ 𝑀𝑐𝑡−1 + 𝛽𝑖), (10)
𝑓𝑡 = 𝜚(𝜛(ℑ,𝑓) ⊛ ℑ𝑡 + 𝜛(ℎ,𝑓) ⊛ ℘𝑡−1 + 𝜛(𝑀𝑐,𝑓) ⊛ 𝑀𝑐𝑡−1 + 𝛽𝑓), (11)
℘𝑡 = 𝜙𝑡 ⊗ tanh(𝑀𝑐𝑡), (12)

where ⊗ and ⊛ stand for Hadamard and convolutional operations, respectively. The input and hidden tensors are denoted by ℑ𝑡 and ℘𝑡, respectively. 2D convolution masks of the hidden and input states are denoted by 𝜛(ℑ,∗) and 𝜛(ℎ,∗).
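As a rough illustration of the dense concatenation pattern in Eqs. (3)–(5), the sketch below wires the three bottleneck blocks with Keras Concatenate layers. The Swin Transformer block 𝑆ViT of Eq. (3) is represented by a placeholder function, and all filter counts are assumed; this is a structural sketch, not the authors' implementation.

```python
from tensorflow.keras import layers

def double_l_conv(x, filters):
    # Two consecutive separable convolutions with ReLU (the l_{3x3} pairs in Eqs. (3)-(5)).
    x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def swin_block_placeholder(x):
    # Stand-in for S_ViT; a real implementation would apply (shifted) window attention here.
    return x

def dense_bottleneck(b3_enc, filters=256):
    f_enc = double_l_conv(b3_enc, filters)
    # Eq. (3): STB output concatenated with the convolved encoder features.
    b1_den = layers.Concatenate()([swin_block_placeholder(f_enc), f_enc])
    # Eq. (4): new features concatenated with B1_den and the convolved encoder features.
    b2_den = layers.Concatenate()([double_l_conv(b1_den, filters), b1_den,
                                   double_l_conv(b3_enc, filters)])
    # Eq. (5): the same pattern applied once more to obtain B3_den.
    b3_den = layers.Concatenate()([double_l_conv(b2_den, filters), b2_den,
                                   double_l_conv(b3_enc, filters)])
    return b3_den
```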


Fig. 2. Design of the ConvLSTM block, a solution to the spatial correlation shortcomings of traditional LSTM models, achieved by the incorporation of convolutional operations
in the input-to-state and state-to-state transitions. The architecture includes a memory cell (𝑀𝑐 ), an output gate (𝜙), an input gate (𝑖) and a forget gate (𝑓 ), with these gates
serving as control mechanisms to access, update and erase the content of the memory cells. For both the hidden and the input states in the block, 2D convolution masks are used,
with Hadamard and convolutional operations symbolised by ⊗ and ⊛, respectively. The input and hidden state tensors are indicated by ℑ𝑡 and ℘𝑡 , respectively, while the biases
associated with the memory cell, the output gate, the input gate and the forget gate are denoted as 𝛽𝑀𝑐 , 𝛽𝜙 , 𝛽𝑖 , and 𝛽𝑓 , respectively.
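For readers who want to map Eqs. (8)–(12) to code, the following is a simplified single-step ConvLSTM cell in Keras. The peephole terms 𝜛(𝑀𝑐,·) of the original equations are omitted for brevity, and the layer is a sketch rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ConvLSTMCellSketch(layers.Layer):
    """Single ConvLSTM step following Eqs. (8)-(12): input, forget and output
    gates plus a memory cell, all computed with 2D convolutions (peepholes omitted)."""
    def __init__(self, filters, kernel_size=3, **kwargs):
        super().__init__(**kwargs)
        # One convolution per gate, applied to the concatenated [input, hidden] tensor.
        self.conv_i = layers.Conv2D(filters, kernel_size, padding="same")
        self.conv_f = layers.Conv2D(filters, kernel_size, padding="same")
        self.conv_o = layers.Conv2D(filters, kernel_size, padding="same")
        self.conv_c = layers.Conv2D(filters, kernel_size, padding="same")

    def call(self, x_t, h_prev, c_prev):
        z = tf.concat([x_t, h_prev], axis=-1)                 # stack input and hidden state
        i_t = tf.sigmoid(self.conv_i(z))                      # input gate, Eq. (10)
        f_t = tf.sigmoid(self.conv_f(z))                      # forget gate, Eq. (11)
        o_t = tf.sigmoid(self.conv_o(z))                      # output gate, Eq. (9)
        c_t = f_t * c_prev + i_t * tf.tanh(self.conv_c(z))    # memory cell, Eq. (8)
        h_t = o_t * tf.tanh(c_t)                              # hidden state, Eq. (12)
        return h_t, c_t
```

In practice, the built-in Keras ConvLSTM2D layer provides the same gating over a sequence dimension.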

Fig. 3. Lightweight Swin Transformer architecture. The input RGB images are divided into non-overlapping patches, transformed into tokens, and projected into an arbitrary
dimension. Transformer blocks with modified self-attention computations process these tokens, creating a hierarchical representation. The lightweight version replaces the
conventional multihead self-attention (MSA) module with a shifted window-based MSA (SW-MSA) module to reduce computational complexity while preserving core functionality.
Efficiency is further improved by computing self-attention within local windows, scaling linearly with a fixed size of 𝑁.

The bias terms of the memory cell, output, input, and forget gates are denoted by 𝛽𝑀𝑐, 𝛽𝜙, 𝛽𝑖, and 𝛽𝑓, respectively.

In TBConvL-Net, we utilise BConvLSTM [19], which extends traditional ConvLSTM by capturing forward and backward temporal dependencies. This is useful when understanding the past and future context is crucial to interpreting current input features. In a BConvLSTM, input data is processed in two separate paths: a forward and a backward direction, each with its own ConvLSTM layers that process data sequentially. The forward path processes the input data in its natural order, from the first to the last image. The backward path processes the data in reverse, from the last image to the first image. This facilitates the capture of information from both the preceding and subsequent frames with respect to the current input. Studies have shown that considering both forward and backward views improves prediction performance [20]. The output of the BConvLSTM is computed as:

ℑout = tanh((𝜛℘⃗ ⊛ ℘⃗𝑡) + (𝜛℘⃖ ⊛ ℘⃖𝑡) + 𝛽), (13)

where ℘⃗ and ℘⃖ represent the hidden state tensors for the forward and backward states, respectively, and 𝛽 is the bias component. The hyperbolic tangent function (tanh) is used to nonlinearly combine the output of the forward and backward states in BConvLSTM. This ensures effective integration of information from both directions and helps capture complex relationships between the forward and backward dependencies in the input data.

The Swin Transformer blocks (Fig. 3) used in our proposed network partition the input into non-overlapping patches using ViT. Each patch, which encapsulates a 4 × 4 pixel area in our implementation, is treated as a ''token'', its associated features being a combination of the RGB pixel values. These raw-value features are then projected onto a chosen dimension, denoted by 𝑑, using a linear embedding layer (LE). Subsequently, a sequence of transformer blocks, equipped with modified self-attention computations, is applied to these patch tokens. This allows the model to learn more complex relationships between input features, which leads to better performance on various tasks. Block 1 consists of transformer blocks and LE, preserving the token count of (ℎ∕4 × 𝑤∕4). As the network progresses, the layers combine to create a hierarchical representation by reducing the token count. The initial patch merging layer (PML) consolidates features from clusters of adjacent 2 × 2 patches, after which a linear layer is applied to the 4𝑑-dimensional combined features. This procedure results in a quartering of the token count, equivalent to a 2× downsampling of the resolution, with the output dimension set to 2𝑑. Subsequently, the transformers are deployed to alter the features while preserving the resolution at (ℎ∕8 × 𝑤∕8). This first stage of patch merging and feature conversion is designated as Block 2.

To improve modelling efficiency, self-attention is applied within local windows [17].
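A compact way to realise the bidirectional ConvLSTM fusion of Eq. (13) with off-the-shelf Keras layers is sketched below, treating the encoder and decoder feature maps as a two-step sequence. This is one plausible reading of the skip-connection design, with all shapes assumed; it is not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bconvlstm_fusion(encoder_feat, decoder_feat, filters):
    """Fuse encoder and decoder feature maps with a bidirectional ConvLSTM.

    Both inputs are assumed to have shape (batch, H, W, C). They are stacked
    along a new time axis and processed forwards and backwards, so the output
    combines information from both directions as in Eq. (13).
    """
    seq = tf.stack([encoder_feat, decoder_feat], axis=1)      # (batch, 2, H, W, C)
    fused = layers.Bidirectional(
        layers.ConvLSTM2D(filters, kernel_size=3, padding="same",
                          return_sequences=False),
        merge_mode="concat")(seq)
    return fused                                              # (batch, H, W, 2 * filters)
```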


Given that each window is made up of 𝑁 × 𝑁 patches, the computational complexities of a global multihead self-attention (MSA) module and a shifted window-based MSA (SW-MSA) module for an image of ℎ × 𝑤 patches are

𝐶MSA = 4(ℎ × 𝑤)𝑑² + 2(ℎ × 𝑤)²𝑑 (14)

and

𝐶SW-MSA = 4(ℎ × 𝑤)𝑑² + 2𝑁²(ℎ × 𝑤)𝑑, (15)

respectively, where the former is quadratic in relation to the number of patches and the latter is linear for a fixed 𝑁. Global self-attention computation is often prohibitively expensive for large ℎ × 𝑤, whereas shifted window-based self-attention is scalable. In creating a streamlined version of the Swin Transformers, we substituted the MSA module with a SW-MSA module in each transformer block, retaining the configurations of the remaining layers (see the magnified portion of Fig. 3). This lighter version preserves the essential features of the Swin Transformers while decreasing computational complexity. The overall process of two consecutive Swin Transformer blocks (STBs) is as follows. Let 𝜏in be the input to the first STB and 𝑧1 be the result of concatenating 𝜏in with the output of the first MSA module after applying a layer norm (𝐿𝑁) operation:

𝑧1 = 𝜏in © 𝐿𝑁(MSA(𝜏in)). (16)

Next, the input 𝑧2 to the second STB is calculated as the concatenation of 𝑧1 and the result of processing 𝑧1 by 𝐿𝑁 and a multilayer perceptron (MLP):

𝑧2 = 𝑧1 © MLP(𝐿𝑁(𝑧1)). (17)

In the second STB, 𝑧3 is calculated by concatenating 𝑧2 with the output of the SW-MSA module after applying 𝐿𝑁:

𝑧3 = 𝑧2 © 𝐿𝑁(SW-MSA(𝑧2)). (18)

Finally, the output 𝜏out of the second STB is calculated as the concatenation of 𝑧3 and the result of processing 𝑧3 by 𝐿𝑁 and MLP:

𝜏out = 𝑧3 © MLP(𝐿𝑁(𝑧3)). (19)

2.3. Loss function

TBConvL-Net uses ground truth (GT) to supervise the complete segmentation method. The network is trained using a linear combination of Dice loss (𝜁𝑑), Jaccard loss (𝜁𝑗), and surface boundary loss (𝜁𝑏). One of the main reasons for combining the Dice loss (𝜁𝑑) and Jaccard loss (𝜁𝑗) is that the former ensures that predictions capture the pathology's overall size and shape, even if it is slightly shifted, while the latter ensures that predictions closely match the shape and location of the pathology. When combining both losses, TBConvL-Net learns to be accurate in terms of both region similarity (𝜁𝑑) and placement (𝜁𝑗), leading to more precise and robust medical image segmentation.

The Dice loss evaluates the amount of overlap between the segmented image 𝑆 and the GT image 𝐺:

𝜁𝑑(𝑆, 𝐺) = 1 − ∑𝑐𝑘=1 [2𝑤𝑘 ∑𝑛𝑗=1 𝑆(𝑘, 𝑗) × 𝐺(𝑘, 𝑗)] / [∑𝑛𝑗=1 𝑆(𝑘, 𝑗)² + ∑𝑛𝑗=1 𝐺(𝑘, 𝑗)²] + 𝜉, (20)

where 𝑤𝑘 denotes the 𝑘th class weight, 𝑐 is the number of classes, 𝑛 the number of pixels, and 𝜉 is a smoothing constant. The Jaccard loss is calculated as:

𝜁𝑗(𝑆, 𝐺) = 1 − IoU(𝑆, 𝐺) − |𝐵 − (𝑆 ∪ 𝐺)| / |𝐵| + 𝜉, (21)

where IoU denotes the intersection over union of the segmented image 𝑆 and the GT image 𝐺, and 𝐵 is the bounding box covering 𝑆 and 𝐺.

The main purpose of MIS is to accurately identify the edges or boundaries of a lesion. To achieve this, we use a special boundary loss [21]. It computes the distance dist(𝜕𝑆, 𝜕𝐺) between the boundary 𝜕𝑆 of the segmentation mask and the boundary 𝜕𝐺 of the GT mask by integration over the interface where the regions of the two boundaries do not align:

dist(𝜕𝑆, 𝜕𝐺) = ∫𝜕𝐺 ‖𝑝𝜕𝑆(𝐵𝑝) − 𝐵𝑝‖² d𝐵𝑝 (22)
= 2 ∫𝛥𝑆 𝐷𝐺(𝐵𝑝) d𝐵𝑝 (23)
= 2 (∫𝛺 𝜗𝐺(𝐵𝑝) 𝑠(𝐵𝑝) d𝐵𝑝 − ∫𝛺 𝜗𝐺(𝐵𝑝) 𝑔(𝐵𝑝) d𝐵𝑝), (24)

where 𝐵𝑝 is a point on the boundary 𝜕𝐺 and 𝑝𝜕𝑆(𝐵𝑝) is the corresponding point on the boundary 𝜕𝑆, 𝐷𝐺(𝐵𝑝) is the distance map of point 𝑝 with respect to the boundary 𝜕𝐺, 𝛥𝑆 represents the region between the two boundaries, while 𝛺 denotes the region covered by 𝑆, and 𝜗𝐺 is the level-set representation of 𝜕𝐺, calculated as 𝜗𝐺(𝑝) = −𝐷𝐺(𝑝) if 𝑝 ∈ 𝐺 and 𝜗𝐺(𝑝) = +𝐷𝐺(𝑝) otherwise. When 𝑆 = 𝑆𝜃, the binary variables 𝑠(⋅) in (24) can be substituted with the softmax probability outputs of the network, 𝑆𝜃(𝑝). This leads to the formulation of the boundary loss, which approximates the boundary distance dist(𝜕𝑆, 𝜕𝐺), subject to a constant that is independent of 𝜃:

𝜁𝑏(𝑆, 𝐺) = ∫𝛺 𝜗𝐺(𝑝) 𝑆𝜃(𝑝) d𝑝. (25)

The total loss used to train TBConvL-Net is a linear combination of the Dice, Jaccard, and boundary loss functions:

𝜁 = 𝜆𝑑 𝜁𝑑(𝑆, 𝐺) + 𝜆𝑗 𝜁𝑗(𝑆, 𝐺) + 𝜆𝑏 𝜁𝑏(𝑆, 𝐺), (26)

where 𝜆𝑑, 𝜆𝑗, and 𝜆𝑏 are the weights (hyperparameters) of the respective loss functions. In our experimentation, we set 𝜆𝑑 and 𝜆𝑗 to 1. 𝜆𝑏 was initialised at 1, then gradually reduced by 0.01 per epoch until it converged at 0.01. This is done to moderate the influence of 𝜆𝑏 on the boundary constraints, ensuring that its effect remains substantial without excessively dominating the optimisation process.

3. Experiments and results

3.1. Datasets

The proposed TBConvL-Net model was evaluated on ten challenging benchmark datasets of seven different medical imaging modalities (Table 1), namely ISIC 2016,¹ ISIC 2017,² and ISIC 2018³ for the segmentation of skin lesions in optical images, DDTI⁴ for the segmentation of thyroid nodules and BUSI⁵ for the segmentation of breast cancer in ultrasound images, MoNuSeg⁶ for the segmentation of cell nuclei in histopathological whole-slide images, MC⁷ for the segmentation of chest X-ray images, IDRiD⁸ for the segmentation of the optic disc in fundus images, Fluorescent Neuronal Cells (FluoCells)⁹ for the segmentation of cells in microscopy images, and TCIA¹⁰ for the segmentation of brain tumours in magnetic resonance images. All datasets are publicly available and provide GT masks for the evaluation of image segmentation methods. Performance evaluation on the DDTI and BUSI datasets was performed using a five-fold cross-validation method due to the unavailability of a separate test set.

¹ https://challenge.isic-archive.com/data/#2016.
² https://challenge.isic-archive.com/data/#2017.
³ https://challenge.isic-archive.com/data/#2018.
⁴ https://www.kaggle.com/datasets/dasmehdixtr/ddti-thyroid-ultrasound-images.
⁵ https://www.kaggle.com/datasets/sabahesaraki/breast-ultrasound-images-dataset.
⁶ https://monuseg.grand-challenge.org/Data/.
⁷ https://www.kaggle.com/datasets/raddar/tuberculosis-chest-xrays-montgomery.
⁸ https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid.
⁹ https://www.kaggle.com/datasets/nbroad/fluorescent-neuronal-cells.
¹⁰ https://www.cancerimagingarchive.net/collection/brain-tumor-progression/.
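The two-STB composition of Eqs. (16)–(19) in Section 2.2 can be prototyped with standard Keras layers as shown below. MultiHeadAttention is used here as a stand-in for the window-based and shifted-window attention of the actual lightweight Swin block, so the sketch only reflects the concatenation structure, not the windowing; head counts and MLP width are assumptions.

```python
from tensorflow.keras import layers

def two_stb_sketch(tau_in, num_heads=4, key_dim=32, mlp_dim=128):
    d = tau_in.shape[-1]
    # Eq. (16): z1 = tau_in (c) LN(MSA(tau_in))
    attn1 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(tau_in, tau_in)
    z1 = layers.Concatenate()([tau_in, layers.LayerNormalization()(attn1)])
    # Eq. (17): z2 = z1 (c) MLP(LN(z1))
    mlp1 = layers.Dense(mlp_dim, activation="gelu")(layers.LayerNormalization()(z1))
    z2 = layers.Concatenate()([z1, layers.Dense(d)(mlp1)])
    # Eq. (18): z3 = z2 (c) LN(SW-MSA(z2)); plain attention used as a placeholder for SW-MSA.
    attn2 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(z2, z2)
    z3 = layers.Concatenate()([z2, layers.LayerNormalization()(attn2)])
    # Eq. (19): tau_out = z3 (c) MLP(LN(z3))
    mlp2 = layers.Dense(mlp_dim, activation="gelu")(layers.LayerNormalization()(z3))
    tau_out = layers.Concatenate()([z3, layers.Dense(d)(mlp2)])
    return tau_out
```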


Table 1
Details of the medical image datasets used for evaluation.

ISIC 2016 (optical imaging): 900 training / 379 testing images (1279 total); resolution 679 × 453 to 6748 × 4499; JPEG; resized to 256 × 256; skin lesion segmentation.
ISIC 2017 (optical imaging): 2000 training / 150 validation / 600 testing images (2750 total); resolution 679 × 453 to 6748 × 4499; JPEG; resized to 256 × 256; skin lesion segmentation.
ISIC 2018 (optical imaging): 2594 training / 1000 testing images (3594 total); resolution 679 × 453 to 6748 × 4499; JPEG; resized to 256 × 256; skin lesion segmentation.
DDTI (ultrasound imaging): 637 images in total; resolution 245 × 360 to 560 × 360; PNG; resized to 256 × 256; 80%:10%:10% data split; thyroid nodule segmentation.
BUSI (ultrasound imaging): 780 images in total; resolution 319 × 473 to 583 × 1010; PNG; resized to 256 × 256; 80%:10%:10% data split; breast ultrasound segmentation (ages 25–75).
MoNuSeg (WSI imaging): 30 training / 14 testing images (44 total); resolution 1000 × 1000; PNG; resized to 512 × 512; nuclei segmentation.
MC (X-ray imaging): 100 training / 38 testing images (138 total); resolution 4892 × 4020 to 4020 × 4892; TIF; resized to 512 × 512; chest X-ray segmentation.
IDRiD (fundus imaging): 54 training / 27 testing images (81 total); resolution 4288 × 2848; JPEG; resized to 512 × 512; optic disc segmentation.
FluoCells (microscopic imaging): 283 training / 70 testing images (353 total); resolution 1600 × 1200; PNG; resized to 512 × 512; fluorescent neuronal cell segmentation.
TCIA (MRI imaging): 1084 training / 285 testing images (1369 total); resolution 256 × 256; TIF; resized to 256 × 256; brain tumour segmentation.
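The resizing listed in Table 1 can be reproduced with a few lines of preprocessing. The snippet below is a generic sketch for the JPEG/PNG datasets (file paths and target size are assumptions; the TIF datasets would need a different reader), not the authors' pipeline.

```python
import tensorflow as tf

def load_image_and_mask(image_path, mask_path, target_size=(256, 256)):
    # Read, decode, and resize an image/mask pair to the network input size.
    image = tf.io.decode_image(tf.io.read_file(image_path), channels=3, expand_animations=False)
    mask = tf.io.decode_image(tf.io.read_file(mask_path), channels=1, expand_animations=False)
    image = tf.image.resize(image, target_size) / 255.0             # normalise to [0, 1]
    mask = tf.image.resize(mask, target_size, method="nearest")     # keep labels crisp
    return image, tf.cast(mask > 127, tf.float32)                   # binarise the GT mask
```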

Table 2
Results of the ablation study of the different locations of the transformer in the baseline network on the ISIC 2017 dataset.
Method Transformer location in the network Performance (%)
𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝
Baseline Model (BM) Not applicable 79.20 78.11 91.63 76.46 97.09
BM with SCa (SC-BM) Not applicable 80.14 87.60 94.33 88.87 94.60
SC-BM + Swin Transformer Between dense layer of the network 78.61 86.45 93.78 87.07 95.11
SC-BM + Swin Transformer After every pooling layer in the decoder 81.53 88.88 94.73 89.28 94.83
SC-BM + Swin Transformer Between the skip connections of the network 81.70 88.79 94.47 90.20 94.01
SC-BM + Swin Transformer Between the skip connections and dense layers of the network 82.78 89.66 95.07 90.21 95.18
a SC is a depth-wise separable convolution.

3.2. Evaluation criteria

The segmentation performance of TBConvL-Net was evaluated and compared with SOTA methods using several metrics, including the Jaccard index (𝐽, equal to IoU), Dice similarity coefficient (𝐷), accuracy (𝐴𝑐𝑐), sensitivity (𝑆𝑛), and specificity (𝑆𝑝). We also evaluated our method in terms of 𝑆𝑛 (also known as the recall or true-positive rate) as a function of 1 − 𝑆𝑝 (the false-positive rate) using the receiver operating characteristic (ROC) and the area under the ROC (AUROC). For definitions and a comprehensive discussion of these metrics we refer to the report of the Metrics Reloaded consortium [22].

3.3. Training details

For model training, the images (Table 1) were augmented using contrast adjustments (with factors of [×0.9, ×1.1]) and flipping operations (both in horizontal and vertical directions), which increased the size of the datasets by a factor of 5. Segmentation models were trained with various mixtures of loss functions and training methodologies. The Adam optimiser was used with a maximum of 60 iterations and an initial learning rate of 0.001. In the absence of performance improvement on the validation set after five epochs, the learning rate was reduced by a quarter. To prevent overfitting, an early stopping strategy was implemented. The models were implemented in Keras using TensorFlow as the back-end and trained on an NVIDIA K80 GPU.

3.4. Ablation experiments

To evaluate the impact of the main components, loss functions, and training strategies used in TBConvL-Net, three ablation experiments were carried out.

The first ablation experiment was conducted using the ISIC 2017 dataset, as it is one of the more challenging datasets. The experiment began with a simple bidirectional ConvLSTM U-Net (BCDU-Net [23]) as the baseline model (BM), and then traditional convolutional layers were replaced with depth-wise separable convolutions, resulting in substantial reductions in computational costs. The filters were then optimised. The Swin Transformer was used at various locations within the network. We found that performance improved substantially when the Swin Transformer was used between the skip connections and the densely connected, separable convolutional layers of the network at depth (Table 2). All results of this first ablation were computed using only the Dice loss.

The second ablation experiment was conducted to understand the influence of different loss functions and was performed on the DDTI dataset. A variety of loss functions, such as Dice loss, Jaccard loss, and boundary loss, were examined individually and in various combinations. Given that the DDTI dataset comprises images with irregular shapes and boundaries, we discovered that the linear combination of Dice loss, Jaccard loss, and boundary loss produced the best performance, both quantitatively (Table 3) and qualitatively (Fig. 4). Thus, we used this combined loss in all subsequent experiments.

In the third ablation experiment, we investigated the potential of transfer learning to further boost the performance of the proposed TBConvL-Net. The rationale of this experiment was to capitalise on the principle that transferring domain knowledge from other modalities, meaning incorporating preexisting feature representations learnt from these other sources, may be beneficial to the segmentation process and yield better results for any given dataset. For each dataset (Table 1) we experimented with transfer learning from other selected datasets (Table 4) to facilitate feature learning. For some datasets, this involved sequential learning, as in the case of the ISIC 2016 dataset, where weights were first learnt from the ISIC 2017 dataset, and then the model was further trained on the ISIC 2016 dataset. For other datasets, specifically MC and IDRiD, we found that our model already attained state-of-the-art performance without employing transfer learning. We observed that transfer learning generally improves segmentation performance or does not decrease performance (Table 5). Thus, for comparisons with other state-of-the-art methods presented next, we used this transfer learning strategy and the combined loss function.
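The optimiser and scheduling choices described in Section 3.3 translate directly into standard Keras callbacks. The following sketch shows one way to set them up; the monitored metric, early-stopping patience, batch size, and the reading of "reduced by a quarter" as a factor of 0.25 are assumptions.

```python
import tensorflow as tf

# Adam with an initial learning rate of 0.001, as described in Section 3.3.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

callbacks = [
    # Reduce the learning rate when the validation loss has not improved for 5 epochs
    # (factor=0.25 assumes the rate is scaled to one quarter of its current value).
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.25, patience=5),
    # Early stopping to prevent overfitting (patience value assumed).
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
]

# model.compile(optimizer=optimizer, loss=total_loss)   # total_loss: composite loss of Eq. (26)
# model.fit(train_ds, validation_data=val_ds, epochs=60, callbacks=callbacks)
```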


Fig. 4. Visual results of the proposed TBConvL-Net using the different loss functions for thyroid nodule segmentation in the DDTI dataset.

Table 3
Results of the ablation study of the different loss functions in TBConvL-Net on the ISIC 2017 and DDTI datasets.
Loss Performance measures (%)
function ISIC 2017 DDTI
𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝 𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝
𝜁𝑑 78.72 87.26 94.27 86.44 95.00 76.88 86.17 97.14 84.05 97.07
𝜁𝑗 74.22 82.36 92.39 84.00 95.66 78.72 87.26 94.27 86.44 95.00
𝜁𝑏 63.06 73.21 90.24 75.91 96.25 74.22 82.36 92.39 84.00 95.66
𝜁𝑏 + 𝜁𝑑 67.26 77.72 91.27 87.59 92.98 77.79 81.99 95.22 82.83 95.22
𝜁𝑑 + 𝜁𝑗 82.83 89.71 95.07 90.08 95.41 81.22 82.88 94.85 82.91 96.99
𝜁𝑏 + 𝜁𝑗 75.03 83.94 92.96 90.92 93.11 79.36 80.54 95.88 85.85 96.46
𝜁𝑑 + 𝜁𝑗 + 𝜁𝑏 83.91 90.57 95.64 92.68 96.80 86.06 89.90 95.45 90.26 97.71
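A minimal TensorFlow sketch of the composite loss in Eq. (26), with the Dice and Jaccard terms written for the binary case and the boundary term taking a precomputed level-set map 𝜗𝐺 as input, is given below. It is an interpretation of the equations (class weights and the bounding-box term of Eq. (21) are omitted), not the authors' code.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    # Binary form of Eq. (20): 1 - 2|S.G| / (|S|^2 + |G|^2).
    inter = tf.reduce_sum(y_true * y_pred)
    denom = tf.reduce_sum(tf.square(y_true)) + tf.reduce_sum(tf.square(y_pred))
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def jaccard_loss(y_true, y_pred, eps=1e-6):
    # 1 - IoU, the core of Eq. (21) (the bounding-box term is omitted in this sketch).
    inter = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - inter
    return 1.0 - (inter + eps) / (union + eps)

def boundary_loss(level_set_gt, y_pred):
    # Eq. (25): integral of the ground-truth level-set map times the network output.
    return tf.reduce_mean(level_set_gt * y_pred)

def total_loss(y_true, y_pred, level_set_gt, lam_d=1.0, lam_j=1.0, lam_b=1.0):
    # Eq. (26): weighted sum; lam_b is decayed towards 0.01 during training as described above.
    return (lam_d * dice_loss(y_true, y_pred)
            + lam_j * jaccard_loss(y_true, y_pred)
            + lam_b * boundary_loss(level_set_gt, y_pred))
```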

Table 4
Transfer learning of learnt weights in TBConvL-Net on the different MIS datasets.

Dataset Transfer learning
ISIC 2016 Learnt weights of ISIC 2017
ISIC 2017 Learnt weights of ISIC 2018
ISIC 2018 Learnt weights of ISIC 2017
DDTI Learnt weights of BUSI
BUSI Learnt weights of DDTI
MoNuSeg Learnt weights of FluoCells
MC Without transfer learning
IDRiD Without transfer learning
FluoCells Learnt weights of MoNuSeg
TCIA Learnt weights of MoNuSeg

Table 5
Performance enhancement achieved by TBConvL-Net by using the transfer learning strategy on different datasets of MIS.

Dataset Transfer learning 𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝
ISIC 2016 No 86.10 91.76 96.32 93.57 94.88
ISIC 2016 Yes 89.47 95.45 97.05 94.02 97.68
ISIC 2017 No 81.78 89.66 95.07 90.21 95.18
ISIC 2017 Yes 84.80 90.89 96.07 91.19 97.61
ISIC 2018 No 87.31 92.54 96.04 91.94 97.61
ISIC 2018 Yes 91.65 95.47 97.60 95.29 98.55
DDTI No 86.06 89.90 95.45 90.26 97.71
DDTI Yes 88.70 93.56 98.62 94.02 99.09
BUSI No 85.95 91.42 96.92 92.82 95.24
BUSI Yes 91.97 95.72 99.50 95.85 99.69
MoNuSeg No 70.59 81.34 94.22 85.38 95.24
MoNuSeg Yes 76.07 85.16 93.62 88.04 95.53
MC No 97.88 98.70 99.50 98.40 99.05
MC Yes 97.90 98.86 99.50 97.69 99.04
IDRiD No 95.65 96.73 99.94 97.62 99.95
IDRiD Yes 95.67 96.73 99.93 97.68 99.97
FluoCells No 88.11 93.54 98.24 95.32 99.19
FluoCells Yes 92.84 96.23 99.90 97.01 99.94
TCIA No 88.71 94.55 97.15 94.22 97.86
TCIA Yes 92.93 95.47 99.34 95.63 99.79

Fig. 5. ROC curves of the proposed TBConvL-Net for the considered medical imaging datasets. TPR: true-positive rate (sensitivity). FPR: false-positive rate (one minus the specificity). The legend also reports the corresponding AUROC values.

Finally, to assess the sensitivity-specificity behaviour of TBConvL-Net using the best approach suggested by the ablation experiments, we plotted the ROC curve and calculated the AUROC for each dataset. The results (Fig. 5) clearly show that TBConvL-Net consistently performs very well across all datasets.
very well across all datasets.


Table 6
Performance of TBConvL-Net with and without transfer learning (TL) compared to various SOTA methods on the skin lesion segmentation datasets ISIC 2018, ISIC 2017, and ISIC
2016.
Method Performance (%)
ISIC 2018 ISIC 2017 ISIC 2016
𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝 𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝 𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝
U-Net [2] 80.09 86.64 92.52 85.22 92.09 75.69 84.12 93.29 84.30 93.41 81.38 88.24 93.31 87.28 92.88
UNet++ [24] 81.62 87.32 93.72 88.70 93.96 78.58 86.35 93.73 87.13 94.41 82.81 89.19 93.88 88.78 93.52
BCDU-Net [23] 81.10 85.10 93.70 78.50 98.20 79.20 78.11 91.63 76.46 97.09 83.43 80.95 91.78 78.11 96.20
DAGAN [25] 81.13 88.07 93.24 90.72 95.88 75.94 84.25 93.26 83.63 97.25 84.42 90.85 95.82 92.28 95.68
FAT-Net [26] 82.02 89.03 95.78 91.00 96.99 76.53 85.00 93.26 83.92 97.25 85.30 91.59 96.04 92.59 96.02
Ms RED [27] 83.86 90.33 96.45 91.10 – 78.55 86.48 94.10 – – 87.03 92.66 96.42 – –
ARU-GD [28] 84.55 89.16 94.23 91.42 96.81 80.77 87.89 93.88 88.31 96.31 85.12 90.83 94.38 89.86 94.65
Swin-UNet [14] 82.79 88.98 96.83 90.10 97.16 80.89 81.99 94.76 88.06 96.05 87.60 88.94 96.00 92.27 95.79
TBConvL-Net w/o TL 87.31 92.54 96.04 91.94 97.61 81.78 89.66 95.07 90.21 95.18 86.10 91.76 96.32 93.57 94.88
TBConvL-Net with TL 91.65 95.47 97.60 95.29 98.55 84.80 90.89 96.07 91.19 97.61 89.47 95.45 97.05 94.02 97.68

Fig. 6. Example segmentation results of TBConvL-Net on the skin lesions dataset ISIC 2017. From left to right, the columns show the input images, the ground-truth masks, the
segmentation results of TBConvL-Net, and the results of ARU-GD [28], UNet++ [24], U-Net [2], BCDU-Net [23], and Swin-UNet [14], respectively. True-positive pixels are depicted
in green, false-positive pixels in red, and false-negative pixels in blue. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version
of this article.)
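The colour coding used in Figs. 6–9 (true positives in green, false positives in red, false negatives in blue) can be generated from a binary prediction and ground-truth mask as follows. This is a small utility sketch added for illustration, not code from the paper.

```python
import numpy as np

def overlay_errors(pred, gt):
    """Colour-code segmentation agreement: TP green, FP red, FN blue.

    pred and gt are binary arrays of shape (H, W); returns an (H, W, 3) uint8 image.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    out = np.zeros(pred.shape + (3,), dtype=np.uint8)
    out[pred & gt] = (0, 255, 0)        # true positives
    out[pred & ~gt] = (255, 0, 0)       # false positives
    out[~pred & gt] = (0, 0, 255)       # false negatives
    return out
```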

Table 7
Performance of TBConvL-Net with and without transfer learning (TL) compared to various SOTA methods on the thyroid nodule
segmentation dataset DDTI and the breast lesion segmentation dataset BUSI (see [29,30]).
Method Performance (%)
BUSI DDTI
𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝 𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝
U-Net [2] 67.77 76.96 95.48 78.33 96.13 74.76 84.08 96.55 85.50 97.57
UNet++ [24] 76.85 76.22 97.97 78.61 98.86 76.16 84.18 96.82 85.59 97.71
BCDU-Net [23] 74.49 66.75 94.82 86.85 95.57 57.79 69.49 93.22 78.31 94.34
nnUnet [29] – – – – – 80.76 88.59 – 85.23 –
BGM-Net [30] 75.97 83.97 – 83.45 – – – – – –
ARU-GD [28] 77.07 83.64 97.94 83.80 98.78 77.07 83.64 97.94 83.80 98.78
Swin-UNet [14] 74.16 79.45 96.55 83.16 97.34 75.44 84.86 96.93 86.42 97.98
TBConvL-Net w/o TL 85.95 91.42 96.92 92.82 95.24 86.06 89.90 95.45 90.26 97.71
TBConvL-Net with TL 91.97 95.72 99.50 95.85 99.69 88.70 93.56 98.62 94.02 99.09

3.5. Comparisons with state-of-the-art methods

For each of the datasets (Table 1) we compared TBConvL-Net with various SOTA methods (briefly described in Supplementary Section 1). Given the large number of methods and the unavailability of many, it was not feasible to reimplement, retrain, rerun, and/or re-evaluate them. Instead, we copied the performance scores reported by the original developers in their papers, as cited throughout this section. This also means that the lists of SOTA methods may be different for each dataset, since in the literature not all methods with which we compared were evaluated on all datasets. If scores were not reported for certain metrics, we indicate this with a dash (–) in our tables.

Comparison of TBConvL-Net for skin lesion segmentation in the ISIC 2016, 2017, and 2018 datasets (Table 6) shows that our method performed better in terms of virtually all metrics. For example, compared to the listed SOTA methods, TBConvL-Net scored 1.87%–8.09%, 3.91%–9.11%, and 7.07%–20.14% better in terms of the Jaccard index on ISIC 2016, 2017, and 2018, respectively. Furthermore, we observed that TBConvL-Net shows better performance in images of skin lesions with various challenges, such as irregular shapes, varying sizes, and the presence of hair, artefacts, and multiple lesions (Fig. 6).

Next, the comparison of TBConvL-Net for thyroid nodule segmentation on the DDTI dataset and breast cancer lesion segmentation on the BUSI dataset (Table 7) shows that our method performed superiorly in terms of all metrics. For example, compared to the listed SOTA methods, TBConvL-Net scored 0.24%–30.91% and 14.81%–24.2% better in terms of the Jaccard index on the DDTI and BUSI datasets, respectively.


Fig. 7. Example segmentation results of TBConvL-Net on the thyroid nodule dataset DDTI (top row) and the breast lesion dataset BUSI (bottom row). From left to right, the
columns show the input images, the ground-truth masks, the segmentation results of TBConvL-Net, and the results of ARU-GD [28], UNet++ [24], U-Net [2], BCDU-Net [23], and
Swin-UNet [14], respectively. True-positive pixels are in green, false-positive pixels in red, and false-negative pixels in blue. (For interpretation of the references to colour in this
figure legend, the reader is referred to the web version of this article.)

Table 8
Performance of TBConvL-Net with and without transfer learning (TL) compared to various SOTA methods on the cell nuclei segmentation
dataset MoNuSeg and the fluorescent neuronal cell segmentation dataset FluoCells.
Method Performance (%)
MoNuSeg FluoCells
𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝 𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝
U-Net [2] 62.16 75.48 90.24 81.17 92.02 74.48 84.63 99.53 81.57 99.82
UNet++ [24] 62.05 75.30 89.79 81.32 91.49 70.99 81.97 99.47 78.75 99.81
BCDUnet [23] 66.61 79.82 92.05 82.48 94.12 74.80 84.79 99.53 82.34 99.83
ARU-GD [28] 60.76 73.89 90.97 75.54 93.70 66.27 78.74 99.35 75.24 99.77
Swin-UNet [14] 65.51 78.90 90.29 83.29 93.09 80.39 85.78 99.64 82.15 99.21
c-ResUnet [31] 66.61 79.82 92.05 82.48 94.12 82.03 89.97 99.69 82.46 99.99
TSCA-Net [32] 67.13 80.23 90.69 86.97 – – – – – –
TBConvL-Net w/o TL 70.59 81.34 94.22 85.38 95.24 88.11 93.54 98.24 95.32 99.19
TBConvL-Net with TL 76.07 85.16 93.62 88.04 95.53 92.84 96.23 99.90 97.01 99.94

Fig. 8. Example segmentation results of TBConvL-Net on the MoNuSeg dataset (top row) and the FluoCells dataset (bottom row). From left to right, the columns show the
input images, the corresponding ground-truth mask, the segmentation result of TBConvL-Net, and the results of U-Net [2], UNet++ [24], BCDU-Net [23], ARU-GD [28], and
Swin-UNet [14]. True-positive pixels are depicted in green, false-positive pixels in red, and false-negative pixels in blue. (For interpretation of the references to colour in this
figure legend, the reader is referred to the web version of this article.)

Furthermore, we observed that TBConvL-Net shows better performance on thyroid nodule images and breast lesions (Fig. 7) with various challenges, such as irregular shapes, varying sizes, and the presence of artefacts and multiple lesions.

Similarly, the comparison of TBConvL-Net for cell nuclei segmentation on the MoNuSeg dataset and fluorescent neuronal cell segmentation on the FluoCells dataset (Table 8) shows that our method performed superiorly in terms of all metrics. For example, compared to the listed SOTA methods, TBConvL-Net scored 9.46%–15.31% and 10.81%–26.57% better in terms of the Jaccard index on the two datasets, respectively. Visual results for some example cell nuclei and neuronal cells (Fig. 8) confirm the quantitative results and show that the segmentations closely resemble the GT data, even for images with varying object sizes, irregular shapes, and low contrast.

Finally, the comparison of TBConvL-Net for optic disc segmentation on the IDRiD dataset, chest X-ray image segmentation on the MC dataset, and brain tumour segmentation on the TCIA dataset (Table 9) shows that our method performed superiorly in terms of all metrics also for these tasks. For example, compared to the listed SOTA methods, TBConvL-Net scored 4.06%–7.78%, 1.41%–2.24%, and 6.78%–14.49% better in terms of the Jaccard index on the IDRiD, MC, and TCIA datasets, respectively. This is confirmed by visual examination (Fig. 9), which shows that the output of TBConvL-Net closely resembles the GT data, even for images with varying sizes and low contrast.

To assess the statistical significance of the segmentation improvements of TBConvL-Net, we performed a paired t-test comparing with the results of the SOTA methods common to all presented experiments (Supplementary Section 2). Also, while we have focused our experiments on binary-class segmentation problems, we have performed a preliminary experiment on multiclass segmentation using the ACDC dataset (Supplementary Section 3). The results demonstrate the further potential of TBConvL-Net to outperform SOTA methods.

3.6. Model complexity analysis

We also compared the complexity of our proposed TBConvL-Net with other methods in terms of the number of parameters, floating-point operations (FLOPs), and inference time (Table 10). TBConvL-Net has 9.6 million (M) parameters, 15.5 billion (G) FLOPs, and an inference time of 19.1 milliseconds (ms). This outperforms all other methods used for visual performance comparisons in all three aspects.


Table 9
Performance of TBConvL-Net with and without transfer learning (TL) compared to various SOTA methods on the optic disc segmentation dataset IDRiD, the chest X-ray segmentation
dataset MC, and brain tumour segmentation using the TCIA dataset.
Method Performance (%)
IDRiD MC TCIA
𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝 𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝 𝐽 𝐷 𝐴𝑐𝑐 𝑆𝑛 𝑆𝑝
U-Net [2] 90.22 94.65 99.81 94.07 99.93 96.47 98.20 99.14 97.91 99.51 86.15 90.28 98.67 89.26 99.27
UNet++ [24] 87.87 92.87 99.71 94.62 99.80 95.64 97.77 98.94 97.56 99.34 78.44 83.42 98.22 85.69 98.88
BCDU-Net [23] 88.74 87.02 99.35 79.84 99.88 96.39 98.16 99.11 97.82 99.50 84.18 87.97 98.45 88.16 99.10
ARU-GD [28] 91.59 95.57 99.85 95.30 99.93 96.14 98.03 99.00 97.98 99.32 81.68 86.73 98.67 85.81 99.35
Swin-UNet [14] 90.82 94.05 99.67 95.05 99.90 96.55 98.02 99.10 97.19 99.27 85.55 88.82 98.57 86.96 99.30
TBConvL-Net w/o TL 95.65 96.73 99.94 97.62 99.95 97.88 98.70 99.50 98.40 99.05 88.71 94.55 97.15 94.22 97.86
TBConvL-Net with TL 95.67 96.73 99.94 97.68 99.97 97.88 98.86 99.50 97.69 99.04 92.93 95.47 99.34 95.63 99.79

Fig. 9. Example segmentation results of TBConvL-Net on the MC dataset (top row), IDRiD dataset (middle row), and TCIA dataset (bottom row). From left to right, the columns show
the input images, the ground-truth masks, the segmentation results of TBConvL-Net, and the results of U-Net [2], UNet++ [24], BCDU-Net [23], ARU-GD [28], and Swin-UNet [14].
True-positive pixels are depicted in green, false-positive pixels in red, and false-negative pixels in blue. (For interpretation of the references to colour in this figure legend, the
reader is referred to the web version of this article.)

Swin-UNet [14] adopts global self-attention with a transformer structure, leading to high computational costs of 27.3M parameters, 37.0G FLOPs, and an inference time of 34.8 ms, which are 2.84, 2.39, and 1.82 times greater than those of the proposed TBConvL-Net. Even with its reduced complexity, TBConvL-Net achieves superior segmentation performance compared to Swin-UNet [14]. The results suggest that our proposed model achieves the best balance between model complexity and segmentation performance.

Table 10
Comparison of TBConvL-Net with other SOTA methods in terms of their numbers of parameters, floating-point operations (FLOPs), and inference times.

Method Parameters (M) ↓ FLOPs (G) ↓ Inference time (ms) ↓
U-Net [2] 23.6 33.4 28.9
UNet++ [24] 24.4 35.6 31.3
BCDU-Net [23] 20.7 112.0 28.1
ARU-GD [28] 23.7 33.9 29.5
Swin-UNet [14] 27.3 37.0 34.8
TBConvL-Net 9.6 15.5 19.1

4. Conclusions

This article introduces TBConvL-Net, a novel hybrid deep neural network architecture for medical image segmentation (MIS) tasks. TBConvL-Net effectively combines the advantages of CNNs and vision transformers, overcoming the limitations of each technique. The proposed encoder–decoder architecture features depth-wise separable and densely connected convolutions for robust and unique feature learning, network optimisation, and improved generalisation. Additionally, the integration of bidirectional ConvLSTM and Swin Transformer modules within the skip connections enhances the feature extraction process. Our experimental results demonstrate that TBConvL-Net surpasses prior CNNs, transformer-based models, and other hybrid approaches in several MIS tasks by capturing multiscale, long-range dependencies and local spatial information. The proposed model strikes a good balance between complexity and segmentation performance, showing promising results in various medical imaging applications.

Despite these advancements, TBConvL-Net has several limitations. The extensive training time required due to the intricate architecture poses challenges for rapid model deployment. This is particularly evident when employing transfer learning techniques, which, while beneficial for leveraging pretrained models and improving performance, significantly increase the training duration. Another limitation is the reliance on large amounts of labelled data for training, which can be difficult to obtain in the medical field due to the need for expert annotations and the inherent variability in medical imaging data.

Future research will focus on several key areas to address these limitations and further enhance the capabilities of TBConvL-Net. Incorporating adaptive mechanisms that dynamically adjust model complexity based on input characteristics and computational constraints could improve efficiency and real-time applicability. Additionally, further testing and refinement of TBConvL-Net on 3D datasets, based on our initial experiments, will fully exploit its potential in volumetric medical image segmentation. Exploring semi-supervised or unsupervised learning techniques to reduce dependence on large labelled datasets can also be a valuable direction for future research.

CRediT authorship contribution statement

Shahzaib Iqbal: Writing – original draft, Validation, Methodology, Investigation, Conceptualization. Tariq M. Khan: Writing – original draft, Validation, Methodology, Conceptualization. Syed S. Naqvi: Writing – original draft, Validation, Methodology. Asim Naveed: Writing – original draft. Erik Meijering: Writing – review & editing, Supervision.


Declaration of competing interest [22] L. Maier-Hein, et al., Metrics reloaded: recommendations for image analysis
validation, Nat. Methods 21 (2) (2024) 195–212.
[23] R. Azad, M. Asadi-Aghbolaghi, M. Fathy, S. Escalera, Bi-directional ConvLSTM U-
The authors declare that they have no known competing finan-
net with densley connected convolutions, in: IEEE/CVF International Conference
cial interests or personal relationships that could have appeared to on Computer Vision (ICCV) Workshops, 2019, pp. 1–10.
influence the work reported in this paper. [24] Z. Zhou, M.M. Rahman Siddiquee, N. Tajbakhsh, J. Liang, Unet++: A nested U-
net architecture for medical image segmentation, in: Deep Learning in Medical
Image Analysis (DLMIA) & Multimodal Learning for Clinical Decision Support
Data availability
(ML-CDS) Held in Conjunction with MICCAI, 2018, pp. 3–11.
[25] B. Lei, Z. Xia, F. Jiang, X. Jiang, Z. Ge, Y. Xu, J. Qin, S. Chen, T. Wang, S.
Data will be made available on request. Wang, Skin lesion segmentation via generative adversarial networks with dual
discriminators, Med. Image Anal. 64 (2020) 101716.
[26] H. Wu, S. Chen, G. Chen, W. Wang, B. Lei, Z. Wen, FAT-net: Feature adaptive
Appendix A. Supplementary data
transformers for automated skin lesion segmentation, Med. Image Anal. 76
(2022) 102327.
Supplementary material related to this article can be found online [27] D. Dai, C. Dong, S. Xu, Q. Yan, Z. Li, C. Zhang, N. Luo, Ms RED: A novel multi-
at https://doi.org/10.1016/j.patcog.2024.111028. scale residual encoding and decoding network for skin lesion segmentation, Med.
Image Anal. 75 (2022) 102293.
[28] D. Maji, P. Sigedar, M. Singh, Attention Res-UNet with guided decoder for
References semantic segmentation of brain tumors, Biomed. Signal Process. Control 71
(2022) 103077.
[1] S. Iqbal, T.M. Khan, K. Naveed, S.S. Naqvi, S.J. Nawaz, Recent trends and advances in fundus image analysis: A review, Comput. Biol. Med. (2022) 106277.
[2] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, 2015, pp. 234–241.
[3] Y. Ding, D. Mu, J. Zhang, Z. Qin, L. You, Z. Qin, Y. Guo, A cascaded framework with cross-modality transfer learning for whole heart segmentation, Pattern Recognit. 147 (2024) 110088.
[4] Y. Fu, G. Zhang, X. Lu, H. Wu, D. Zhang, RMCA U-net: Hard exudates segmentation for retinal fundus images, Expert Syst. Appl. 234 (2023) 120987.
[5] Y. Fu, J. Chen, J. Li, D. Pan, X. Yue, Y. Zhu, Optic disc segmentation by U-net and probability bubble in abnormal fundus images, Pattern Recognit. 117 (2021) 107971.
[6] Y. Fu, G. Zhang, J. Li, D. Pan, Y. Wang, D. Zhang, Fovea localization by blood vessel vector in abnormal fundus images, Pattern Recognit. 129 (2022) 108711.
[7] Y. Sun, D. Dai, Q. Zhang, Y. Wang, S. Xu, C. Lian, MSCA-Net: Multi-scale contextual attention network for skin lesion segmentation, Pattern Recognit. 139 (2023) 109524.
[8] A. Naveed, S.S. Naqvi, S. Iqbal, I. Razzak, H.A. Khan, T.M. Khan, RA-Net: Region-aware attention network for skin lesion segmentation, Cogn. Comput. (2024) 1–18.
[9] W. Huang, Y. Deng, S. Hui, Y. Wu, S. Zhou, J. Wang, Sparse self-attention transformer for image inpainting, Pattern Recognit. 145 (2024) 109897.
[10] M. Ding, A. Qu, H. Zhong, Z. Lai, S. Xiao, P. He, An enhanced vision transformer with wavelet position embedding for histopathological image classification, Pattern Recognit. 140 (2023) 109532.
[11] L. Gao, B. Liu, P. Fu, M. Xu, TSVT: Token sparsification vision transformer for robust RGB-D salient object detection, Pattern Recognit. 148 (2024) 110190.
[12] L. Yu, W. Xiang, J. Fang, Y.-P.P. Chen, L. Chi, eX-ViT: A novel explainable vision transformer for weakly supervised semantic segmentation, Pattern Recognit. 142 (2023) 109666.
[13] J. Chen, J. Mei, X. Li, Y. Lu, Q. Yu, Q. Wei, X. Luo, Y. Xie, E. Adeli, Y. Wang, M.P. Lungren, S. Zhang, L. Xing, L. Lu, A. Yuille, Y. Zhou, TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers, Med. Image Anal. 97 (2024) 103280.
[14] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-Unet: Unet-like pure transformer for medical image segmentation, in: European Conference on Computer Vision (ECCV) Workshops, 2023, pp. 205–218.
[15] H.-Y. Zhou, J. Guo, Y. Zhang, X. Han, L. Yu, L. Wang, Y. Yu, nnFormer: Volumetric medical image segmentation via a 3D transformer, IEEE Trans. Image Process. 32 (2023) 4036–4045.
[16] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, L. Zhang, CvT: Introducing convolutions to vision transformers, in: IEEE/CVF International Conference on Computer Vision, ICCV, 2021, pp. 22–31.
[17] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, in: IEEE/CVF International Conference on Computer Vision, ICCV, 2021, pp. 10012–10022.
[18] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 4700–4708.
[19] H. Song, W. Wang, S. Zhao, J. Shen, K.-M. Lam, Pyramid dilated deeper ConvLSTM for video salient object detection, in: European Conference on Computer Vision, ECCV, 2018, pp. 715–731.
[20] Z. Cui, R. Ke, Z. Pu, Y. Wang, Stacked bidirectional and unidirectional LSTM recurrent neural network for forecasting network-wide traffic state with missing values, Transp. Res. C 118 (2020) 102674.
[21] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, I.B. Ayed, Boundary loss for highly unbalanced segmentation, in: International Conference on Medical Imaging with Deep Learning, MIDL, 2019, pp. 285–296.
[23] R. Azad, M. Asadi-Aghbolaghi, M. Fathy, S. Escalera, Bi-directional ConvLSTM U-net with densely connected convolutions, in: IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2019, pp. 1–10.
[24] Z. Zhou, M.M. Rahman Siddiquee, N. Tajbakhsh, J. Liang, Unet++: A nested U-net architecture for medical image segmentation, in: Deep Learning in Medical Image Analysis (DLMIA) & Multimodal Learning for Clinical Decision Support (ML-CDS) Held in Conjunction with MICCAI, 2018, pp. 3–11.
[25] B. Lei, Z. Xia, F. Jiang, X. Jiang, Z. Ge, Y. Xu, J. Qin, S. Chen, T. Wang, S. Wang, Skin lesion segmentation via generative adversarial networks with dual discriminators, Med. Image Anal. 64 (2020) 101716.
[26] H. Wu, S. Chen, G. Chen, W. Wang, B. Lei, Z. Wen, FAT-net: Feature adaptive transformers for automated skin lesion segmentation, Med. Image Anal. 76 (2022) 102327.
[27] D. Dai, C. Dong, S. Xu, Q. Yan, Z. Li, C. Zhang, N. Luo, Ms RED: A novel multi-scale residual encoding and decoding network for skin lesion segmentation, Med. Image Anal. 75 (2022) 102293.
[28] D. Maji, P. Sigedar, M. Singh, Attention Res-UNet with guided decoder for semantic segmentation of brain tumors, Biomed. Signal Process. Control 71 (2022) 103077.
[29] F. Isensee, P.F. Jaeger, S.A. Kohl, J. Petersen, K.H. Maier-Hein, nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation, Nat. Methods 18 (2) (2021) 203–211.
[30] Y. Wu, R. Zhang, L. Zhu, W. Wang, S. Wang, H. Xie, G. Cheng, F.L. Wang, X. He, H. Zhang, BGM-Net: Boundary-guided multiscale network for breast lesion segmentation in ultrasound, Front. Mol. Biosci. 8 (2021) 698334.
[31] R. Morelli, L. Clissa, R. Amici, M. Cerri, T. Hitrec, M. Luppi, L. Rinaldi, F. Squarcio, A. Zoccoli, Automating cell counting in fluorescent microscopy through deep learning with c-ResUnet, Sci. Rep. 11 (1) (2021) 22920.
[32] Y. Fu, J. Liu, J. Shi, TSCA-Net: Transformer based spatial-channel attention segmentation network for medical images, Comput. Biol. Med. 170 (2024) 107938.

Shahzaib Iqbal received his Ph.D. degree from COMSATS University Islamabad, Pakistan. He is also working as an Assistant Professor in the Computing Department at Abasyn University, Islamabad Campus. He has more than seven years of experience in computer vision and machine learning as a researcher. His research interests include Medical Image Analysis, Image Segmentation, Deep Neural Networks, and Machine Learning. His current research focuses on the development of lightweight deep neural networks for automated analysis of medical images.

Tariq M. Khan is a Senior Research Associate in the School of Computer Science and Engineering (CSE) at UNSW Sydney, Australia. Prior to joining UNSW in 2021, he studied and worked at various universities, including Macquarie University (PhD), COMSATS University Islamabad, Pakistan (Assistant Professor), and Deakin University (Research Fellow). His research focuses on computer vision, machine learning, and edge computing, with an emphasis on resource-constrained neural networks and efficient hardware solutions for real-time data processing in embedded systems. Currently, he is developing new computer vision and machine learning methods, particularly deep learning, for automated quantitative analysis of biomedical imaging data and other industrial applications. His work extends to designing optimized neural network architectures for edge platforms, including NVIDIA Jetson, Raspberry Pi, and FPGAs, ensuring high performance with low power consumption. He is also exploring applications of AI and machine learning in anomaly detection and biometric systems, contributing to real-world challenges in security, surveillance, and industrial automation. With over 100 peer-reviewed journal and conference papers (total citations 2750+, H-index of 34, and i10-index of 64), his research has advanced the fields of computer vision, machine learning, and AI, particularly in challenging and resource-constrained environments. He is an active member of the ARC Research Hub for Connected Sensors for Health, working on pioneering techniques in edge computing and intelligent sensors, positioning Australia at the forefront of connected technologies.

Syed S. Naqvi received the B.Sc. degree in computer engineering from the COMSATS Institute of Information Technology, Islamabad, Pakistan, in 2005, the M.Sc. degree in electronic engineering from The University of Sheffield, U.K., in 2007, and the Ph.D. degree from the School of Engineering and Computer Science, Victoria University of Wellington, New Zealand, in 2016. He is currently working as an Assistant Professor with the COMSATS Institute of Information Technology. His research interests include saliency modeling, medical image analysis, scene understanding, and deep learning methods for image analysis.

Asim Naveed received his MS degree in computer engineering from the University of Engineering and Technology (UET) Lahore, Pakistan, in 2015. He is currently pursuing a Ph.D. degree at COMSATS University Islamabad, Pakistan. He is also a Lecturer at UET Lahore, Narowal Campus, Pakistan. He has more than four years of experience in computer vision and machine learning as a researcher. His research interests include Medical Image Analysis, Image Classification, Image Segmentation, Deep Neural Networks, and Machine Learning.

Erik Meijering is a Professor of Biomedical Image Computing at the University of New South Wales (UNSW) in Sydney, Australia, specializing in computer vision and deep learning methods for analyzing biomedical imaging data. He earned his MSc in Electrical Engineering from Delft University of Technology in 1996 and a PhD in Medical Image Analysis from Utrecht University in 2000. Before joining UNSW in 2019, he held positions at the Swiss Federal Institute of Technology and Erasmus University Medical Center, and he became an IEEE Fellow in 2019. He has served in various editorial roles and organized multiple IEEE conferences, notably the IEEE International Symposium on Biomedical Imaging (ISBI), while contributing to several IEEE societies and technical committees.
