0% found this document useful (0 votes)

11 views26 pages

Final Version

Uploaded by

princess.histerical

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views26 pages

Final Version

Uploaded by

princess.histerical

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

1

Current and emerging trends in medical image

segmentation with deep learning
Pierre-Henri Conze, Gustavo Andrade-Miranda, Vivek Kumar Singh, Vincent Jaouen and Dimitris Visvikis
Invited Paper

Abstract—In recent years, the segmentation of anatomical not applicable to large volumes of data typically produced
or pathological structures using deep learning has experienced in clinical routine or research studies. Given the potential
a widespread interest in medical image analysis. Remarkably fatigue of human experts and the wide variations in expertise,
successful performance has been reported in many imaging
modalities and for a variety of clinical contexts to support manual segmentation is prone to strong intra- and inter-expert
clinicians in computer-assisted diagnosis, therapy or surgical variability [2]. Irregularities of the targeted structures, morpho-
planning purposes. However, despite the increasing amount logical variations or pathological deformities between patients
of medical image segmentation challenges, there remains little as well as the potential lack of clearly visible boundaries
consensus on which methodology perform best. Therefore, we with the surrounding anatomy further affect the non-agreement
examine in this paper the numerous developments and break-
throughs brought since the rise of U-Net inspired architectures. between operators. To ease the process, intra-subject semi-
Especially, we focus on the technical challenges and emerging automatic techniques consisting of ascending and descending
trends that the community is now focusing on, including con- non-linear registration steps applied to manually-drawn masks
ditional generative adversarial and cascaded networks, medical can be applied to obtain volumetric results [3, 4]. Although
Transformers, contrastive learning, knowledge distillation, active more affordable than 3D volume annotations, such propagation
learning, prior knowledge embedding, cross-modality learning,
multi-structure analysis, federated learning or semi-supervised schemes from sparse manual delineations to remaining slices
and self-supervised paradigms. We also suggest possible avenues still need interactions and may require manual refinements.
to be further investigated in future research efforts. Mathematical models and low-level image processing have
Index Terms—artificial intelligence, semantic segmentation, been extensively exploited for segmentation purposes before
deep neural networks, vision Transformers, medical imaging the rise of learning techniques. In particular, model-based
segmentation incorporating statistical shape models has been
followed in various clinical contexts [5]. These models have
I. I NTRODUCTION been further improved by exploiting prior knowledge of shape
information, for instance by relying on internal shape fitting
T HE increased volume of medical data to be interpreted by
clinicians for diagnosis, therapeutic or surgical planning
purposes has encouraged the development of computer-aided
and auto-correction to guide the delineation process [6]. Con-
versely, aligning and merging manually segmented images
image analysis tools to benefit from precise, fast, repeatable into a specific atlas coordinate space have been developed
and objective measurements made by computational resources. as a reliable alternative to statistical shape models. In this
Among existing analysis tasks, medical image segmentation context, various single- and multi-atlas methods have been
whose goal is to extract the boundaries of anatomical or proposed relying on non-linear registration [7]. Some hybrid
pathological structures from medical images remains crucial. methods relying on statistical shape models constrained with
Also commonly used in computed vision [1], semantic seg- probabilistic atlases have also been investigated. Medical im-
mentation is a key step for many medical imaging workflows age segmentation has been also performed through Bayesian
since the information arising from the resulting voxel-wise approaches using expectation-maximization [8], possibilistic
localization can greatly help clinicians to diagnose disorders, clustering [9], histogram-based thresholding followed by re-
assess disease progression, plan therapeutic interventions or gion growing [10], active contours [11] and more recently
monitor treatment effects. Core feature of many computer- machine learning (ML) [12] techniques.
aided detection and diagnosis schemes, image segmentation is However, the previously described methodologies are not
involved in the analysis of many imaging modalities includ- perfectly suited for high inter-subject shape variability, weak
ing computed tomography (CT), magnetic resonance (MR), boundaries and significant differences in tissue appearance. In
positron emission tomography (PET) or ultrasound (US). most cases, their robustness is not up to the inherent limitations
Delineating anatomical or pathological structures from med- of medical images such as noise, non-uniform contrasts or
ical images is traditionally performed manually. This task motion artifacts. Moreover, many of these methods are semi-
is exceedingly time-consuming and requires suitable clinical automatic and hence require prior knowledge, associated with
expertise to get clinically-relevant contours. This is therefore high computational costs. This has strongly motivated the
development of deep learning (DL) techniques to exploit im-
This work did not involve human subjects or animals in its research. age characteristics (e.g. contrast variation, orientation, shape,
P.-H. Conze and V. Jaouen are with IMT Atlantique, LaTIM UMR 1101, texture patterns) in a more efficient data-driven manner.
Inserm, Brest, France, [email protected]. V. K. Singh is
with Queen’s University Belfast, Belfast, United Kingdom. G. Andrade- In recent years, artificial intelligence (AI) and more par-
Miranda and D. Visvikis are with Inserm, LaTIM UMR 1101, Brest, France. ticularly DL models have reached impressive performance in
2

medical image segmentation, becoming the new state-of-the- TABLE I

art [13]. The transition between systems that use handcrafted C URRENT AND EMERGING TRENDS IN MEDICAL IMAGE SEGMENTATION
WITH DEEP LEARNING , PROVIDED WITH A REPRESENTATIVE REFERENCE .
features (i.e. ML) to systems that learn features directly from
data (i.e. DL) is now acted. The most widely-used models Topic Sect. Description Ref.
to date are based on convolutional neural networks (CNN). Current trends
Conditional generative Discriminator assesses if seg-
Especially, the U-Net convolutional architecture [14] has been adversarial networks
IV-A
mentation is synthetic or real
[22]
widely adopted in the community thank to its ability to output Series of convolutional
Cascaded networks IV-B [23]
detailed delineations using a relatively low amount of training encoder-decoders
Prior knowledge Regularization term added
data. Nevertheless, new paradigms recently appear based on embedding
IV-C
into the loss function
[24]
attention mechanisms, accompanying the emerging interest Objective functions
Deep supervision IV-D [25]
devoted to pure and hybrid Transformers-based models [15]. at some hidden layers
Despite the availability of review articles summarizing exist- Attention Channel attention, spatial
IV-E [26]
mechanisms attention, branch channel...
ing DL approaches in medical image segmentation [13, 16], Incorporating inter-
Multi-structure analysis IV-F [27]
a more exhaustive, updated and comprehensive panorama is structure relationships
needed to allow a wide audience (e.g. researchers, clinicians, Unified frameworks, neural
Learning frameworks IV-G [28]
architecture search
radiologists) to benefit from the latest and emerging trends Emerging trends
in the field. Some of these existing reviews only focus on Medical Hybrid CNN-Transformers,
V-A [29]
specific aspects such as anatomy-aided techniques [17], multi- Transformers full Transformers...
Taking advantage of infor-
organ segmentation [18], 3D CNN models [19] or U-Net and Multi-task learning V-B
mation shared among tasks
[30]
its variants [20] whereas others [15, 16, 21] address a broader Segmentation Uncertainty modelling to
V-C [31]
overview of medical image analysis with DL by describing uncertainty improve performance
various tasks including classification, detection, segmentation Learning a disentangled
Constrative learning V-D [32]
feature representation
or registration. In addition, despite the increasing amount of Distilling information
Knowledge distillation V-E [33]
medical image segmentation challenges, there remains little between models
consensus on which methodology performs the best. In this Emerging applications
Cross-modality Exploiting complementary
context, this paper aims at examining the numerous devel- segmentation
VI-A
between modalities
[34]
opments and breakthroughs brought since the rise of U-Net Multi-domain Dealing with multiple
VI-B [35]
[14] inspired architectures with a novel and specific focus on segmentation intensity domains
Self-supervised Self-supervised contrastive
the technical challenges and emerging scenarios that the com- learning
VI-C
learning, pre-text tasks
[36]
munity is now focusing on, including knowledge distillation, Semi-supervised Mean teacher, pseudo
VI-D [37]
contrastive learning, medical Transformers, prior knowledge learning labeling, auxiliary task...
embedding, cross-modality analysis, federated, active, self- Distributed training
Federated learning VI-E [38]
between clinical institutions
or semi-supervised learning. The recurring motivation for Assisting annotators
Active learning VI-F [39]
these new approaches lies in the difficulty to obtain large in the annotation process
annotated datasets, as compared to other imaging tasks (e.g. Trade-off between trainable
Lightweight networks VI-G [40]
parameters and performance
classification) or fields (e.g. computer vision).
This paper starts with a description of both background
(Sect.II) and clinical needs (Sect.III) before providing an seminal works (Sect.II-B) until the development of U-Net
in-depth overview of current trends (e.g. prior knowledge (Sect.II-C). The last part on data augmentation (Sect.II-D),
embedding) in medical image segmentation with DL (Sect.IV). especially useful to address data scarcity issues frequently
By bringing the light to medical Transformers, multi-task encountered in the field, completes the panorama.
learning, segmentation uncertainty, contrastive learning and
knowledge distillation, emerging trends are then motivated and A. Problem formulation
explained in Sect.V. Sect.VI focuses on emerging applications
including cross-modality learning, multi-domain segmentation, Let X be a set of images x ∈ RH×W ×D where H, W
semi-supervised, active and federated learning. To help readers and D are the image dimensions in x-, y- and z- dimension
navigate through the paper, an overview of the targeted current respectively while the annotation set Y ⊂ [0, 1]H×W ×D×C
and emerging trends is provided in Tab.I. Sect.VII finally con- contains for each x ∈ X a map y of H × W × D one-hot
cludes this paper by summarizing and discussing the possible vectors indicating the ground truth classes for all voxels. In a
avenues to be further investigated in the future. fully-supervised setting, a deep segmentation network φ aims
at approximating a mapping function φ : x → φ(x x; Θ φ ) = ŷy
II. BACKGROUND between intensity x and class labels y images from N train-
Over the last few years, CNN models have become state-of- ing samples {x xn , y n }1≤n<N by optimizing a loss function
N
the-art in medical image segmentation due to their ability to 1 X
Lφ (yy , ŷy ) = `φ (yy n , ŷy n ) with ŷy n = φ(x
xn ) through a
learn hierarchical representations of image features in a data- N i=0
driven fashion [16]. Before going into current and emerging stochastic optimizer. The parameters of φ, namely Θ φ , are
trends, this section aims at explicitly formulating what is med- optimized during the training process. A stochastic gradient
ical image segmentation using DL (Sect.II-A) and describing descent scheme, from standard gradient descent to Adam [41]
3

or more elaborated optimizers, aims at finding the optimal

weights Θ ∗φ such that Θ ∗φ = arg min Lφ (yy , ŷy ). In practice, the
Θφ
network weights are iteratively updated in the direction of the
steepest descent to reach the local minimum, following:

Θ φ ← Θ φ − α∇Θφ Lφ (1)
where the learning rate α is a hyper-parameter controlling
the step size at each iteration. Tuning α is of paramount
importance to find a good trade-off between convergence speed
and stable optimization. Back-propagation deals with gradient
computation, while the gradient descent algorithm, based on
this gradient, aims at performing the learning procedure. `φ is a
per-image loss function which is usually the cross-entropy loss Fig. 1. Residual V-Net inspired convolutional encoder-decoder architecture for
medical image segmentation purposes. Refer to Sect.II-C for further details.
defined, in a multi-class scenario with C classes, as follows:
details and initial resolution as well as skip-connections (i.e.
1 XX
`CE (yy n , ŷy n ) = −yy n,c,u log(ŷy n,c,u ) (2) long-range shortcuts) which concatenate features between con-
|C ||Ω| tracting and expanding paths to help in improving localization
c∈C u∈Ω

where Ω is the image grid and c ∈ C a given class with accuracy and convergence speed. The contracting path encoder
C = {0, ..., C} indexing the different structures of interest as of a standard U-Net (resp. V-Net) architecture consists of
well as the background. As reviewed in [42], a wide variety sequential layers including 3×3 (resp. 3×3×3) convolutional
of loss functions exist including distribution-based (e.g. cross- layers followed by batch normalization (BN) and rectified
entropy), region-based (e.g. Dice), compound (e.g. DiceCE) linear unit (ReLU) activations (Fig.1). Spatial size is reduced
and boundary-based (e.g. Hausdorff distance) losses. using 2 × 2 (resp. 2 × 2 × 2) max-pooling layers. The first
convolutional layer typically generates 32 or 64 channels and
this number doubles after each pooling as the network deepens.
B. Seminal works
The encoder finally projects each input greyscale image x n to
The simplest and early attempts to perform segmentation a latent representation (denoted as z n in Fig.1). On its turn,
using CNN consisted in classifying each pixel individually the decoder branch is built symmetrically with respect to the
in a patch-based manner [43]. Since input patches from encoder, except that max-pooling layers are replaced by up-
neighboring pixels have large overlaps, the same convolutions sampling operations (e.g. bi/tri-linear interpolation, transpose
were computed many times. By replacing fully-connected convolution). Depending on the binary or multi-class nature of
layers with convolutional layers, fully convolutional networks the segmentation issue at hand, a final 1×1 (resp. 1×1×1) con-
(FCN) gave the opportunity to take entire images as inputs volutional layer with sigmoid or softmax activation achieves
and produce likelihood maps instead of single-pixel outputs. pixel-wise segmentation ŷy n = φ(xxn ), at native resolution. V-
It removed the need to select representative patches and Net-inspired models may more suffer from high computational
eliminated redundant calculations due to patch overlaps. In cost and GPU memory usage than their 2D counterparts.
order to avoid outputs with far lower resolution than input Numerous refinements to the U-Net encoder-decoder archi-
shapes, FCN were applied to shifted versions of input images tecture have been proposed including, to name a few, models
[44]. Multiple resulting outputs were thus stitched together which embed encoders pre-trained on large non-medical imag-
to get results at full resolution. Further improvements were ing databases (e.g. ImageNet) to leverage low-level features
then proposed with architectures comprising a regular FCN to typically shared between different image types [46], sequen-
extract features, followed by an up-sampling part that enables tial models exploiting residual convolutions [45] (Fig.1) or
the recover the input resolution using up-convolutions [16]. pyramidal atrous convolutions (instead of pooling operations)
Compared to patch-based or shift-and-stitch methods, precise [47] as well as alternative attention models (Sect.IV-E) such
localization was possible in a single pass while taking into as attention U-Net [26] which integrates attention gates on
account the full image context. This motivated the strong in- skip-connections to focus on salient features. As an extension
terest devoted to convolutional encoder-decoders among which to vanilla U-Net, U-Net++ [48] relied on re-designed skip-
U-Net (Sect.II-C) is the most commonly used representative. connections through intermediate convolution layers as well
as deep supervision (Sect.IV-D). By aggregating features of
C. U-Net varying semantic scales at the decoder branch, nested and
Among existing convolutional encoder-decoders architec- dense skip-connections act as a flexible feature fusion scheme.
ture, most DL-based medical image segmentation models are
based on U-Net [14] and its 3D counterpart V-Net [45]. U- D. Data augmentation
Net and V-Net consist of symmetrical architectures comprising Deep segmentation models are most often trained with
an encoder that gradually reduces the spatial dimension us- extensive on-the-fly data augmentation, towards improved
ing pooling layers, a decoder progressively recovering object generalization properties. By comprising random geometric
4

transformations (e.g. translation, rotation, scaling, shear, flip- assessment of the total kidney volume (TKV) from MR images
ping) and random intensity modifications (e.g. normalization, in patients with polycystic kidney disease since TKV is the
blurring, contrast adjustment), data augmentation can be seen main image-based biomarker to follow PKD progression [58].
as a clever way to artificially increase the amount of available Segmenting healthy organs (e.g. liver, kidneys) to obtain a
data, with slightly modified copies of already existing images. measurement of volume, size or shape is also a relevant use-
In practice, geometric transformations are applied to both case in the context of transplant surgery planning [59, 60].
greyscale images and ground truth masks whereas intensity In other works, whole or sub-structure organ segmentation
transformations only modify source images. Data augmenta- is managed as a first step toward lesion detection and de-
tion enables to teach DL models desired invariance, covariance lineation. The main related challenge in this context deals
and robustness properties and to strongly reduce over-fitting. with class imbalance as most voxels usually belong to the
More recently, it was shown that DL models could further non-diseased class. In particular, there are numerous research
benefit from more elaborated data augmentation techniques works aimed at delineating skin lesions from dermatological
such as MixUp which exploits convex combinations of pairs images [30, 61], brain tumors from MR images [47, 62, 63],
of samples and associated labels to train neural networks liver lesions from CT scans [57], head and neck primary tu-
[49]. MixUp regularizes the neural network to favor simple mors, lymph nodes and organs at risk from radiotherapy com-
linear behavior in-between training examples. Originally pro- puted tomography (RT-CT) or combined PET and CT images
posed for image classification tasks, its extension to image [64–66], breast masses in mammograms [67] or ultrasound
segmentation is straightforward and efficient, as proven in images [68], cystic kidney tissues in MR modality [69] or
[50] where improved generalization with MixUp as a data lesions in whole-body images [53]. In oncology, PET and CT
augmentation technique is reached for knee MR segmentation imaging held a special place for disease characterization since
purposes. This success at the input data space further inspired they contain complementary information about the metabolic
the use of MixUp in the latent feature space [51], in a setting or biochemical function of tissues and organs as well as
referred to manifold MixUp. As reviewed in [52], synthetic the anatomy of cancer [70]. Inner-lesion tissue segmentation
augmentation based on image synthesis methods, for instance is also increasingly targeted as in [71] where both active
exploiting generative adversarial networks (GAN), is another and necrotic tissues are identified inside liver tumors for
powerful alternative to standard data augmentation since it patients with hepatocellular carcinoma in dynamic contrast-
samples the manifold on which the original training set resides. enhanced CT or in [72] where low and high-grade gliomas
Although effective, especially in extreme data scarcity scenar- are decomposed into several tissue types comprising necrotic
ios, synthetic augmentation is more demanding to implement. and enhancing cores, non-enhancing tumor and oedema.
Standard and more sophisticated data augmentation systems Another emerging application deals with automatically ex-
are obviously not mutually exclusive and can be used together. tracting blood vessels (e.g. retinal, brain, liver vessels) from
While data augmentation is typically employed during train- medical images [73, 74]. Apart from class imbalance and
ing, using it at test time recently started to get special attention. appearance similarity with non-vascular tissues, vascular seg-
Strongly linked to the way model uncertainty can be quan- mentation brings additional limitations: complex multi-scale
tified [31] (Sect.V-C), test-time data augmentation consists geometry with decreasing diameters and contrast along tree-
in performing the inference both on original and augmented like networks, inter-patient variability in branching patterns...
versions of images, followed by a merging procedure. Gains Medical image segmentation is also involved in plenty of
in performance are reported in various clinical contexts such radiomics pipelines where it has been for a very long time
as lesion segmentation from whole-body PET-CT images [53]. the bottleneck in both time and automation. Thus, extracting
radiomic features in an automated and high-throughput way
III. C LINICAL NEEDS AND APPLICATIONS from relevant lesion areas is requested to quantify the char-
An ever-increasing number of research studies have illus- acteristics of medical images, comprehensively characterize
trated the numerous applications of medical image analysis objects (e.g. tumors, organs, tissues) and finally provide useful
with DL, targeting a large number of pathologies and imaging guidance for clinicians. Although initially designed to process
modalities [16, 21]. On its side, medical image segmentation CT and functional PET images, the radiomics approach can
plays a key role in many medical imaging workflows tai- be applied to any imaging modality or radio-tracer [75]. This
lored for diagnosis, disease progression assessment, surgery includes works involving automated DL-based segmentation
or therapeutic planning, follow-up, survival analysis, treatment towards patient outcome prediction such as survival analysis
response evaluation, dosimetry and many other applications. [70] or chemotherapy response assessment and prediction [76].
Clinical needs deal, first of all, with organ delineation from More marginal applications can also be mentioned, as for
anatomical CT or MR imaging given that clinical parame- the management of musculoskeletal diseases where patient-
ters (e.g. volume, shape, inner textures) can be exploited as specific information related to the degree of muscle atrophy
biomarkers to diagnose or quantify disease progression, as in across joints is needed to plan interventions and predict inter-
cardiac [54], brain [55] or prostate [56] disorders. Hepatic ventional outcomes. In particular, DL-based shoulder muscle
pathologies with primary or secondary liver lesions are also segmentation from MR images [46] can be employed to
concerned, thus making fully-automatic liver segmentation analyze the shoulder strength balance, which is particularly
[57] particularly useful and requested in clinical routine. Re- important given that a clear relationship between muscle
garding pure organ volumetry, a good example is an automated atrophy and strength loss [77] has been established.
5

IV. C URRENT TRENDS of conditional GAN architectures to extract the contours of

A. Conditional generative adversarial networks the targeted anatomy or abnormalities, investigations on more
robust generators than traditional U-Net [14] is an interesting
A network design based on conditional generative adversar- research avenue [60]. Condition GAN can also be employed
ial networks (GAN) has been proposed as a general-purpose as a tool combined with other constraints (Sect.IV-C) such as
solution for image-to-image translation [78]. The goal is not anatomical shape priors, as proposed in [24].
only to learn the mapping from input to output images but also
to learn a loss function to train this mapping. This kind of strat-
B. Cascaded networks
egy is obviously suitable for segmentation purposes and among
the possible applications, its feasibility for medical image Managing long-range spatial context from medical images
segmentation has been demonstrated in several recent works is an important feature to improve the automatic delineation
[22, 60, 61, 79]. In practice, conditional GAN architectures process. However, increasing the network depth over and over
comprise a generator aiming at providing segmentation masks to exploit larger receptive fields is not suitable for memory
through encoding and decoding layers as well as a discrimina- and computational reasons, especially given the volumetric
tor which assesses if a given segmentation mask is synthetic or nature of most medical imaging data. In addition, too many
real. The adversarial network learns to discriminate real from high-resolution details are discarded when the number of
synthetic delineations, i.e. ground truth masks versus those down-sampling operations in the encoder branch is signifi-
arising from the generator. This enforces the generative part to cant. To address these challenging limitations, standard scale-
create increasingly plausible segmentation masks. During the space pyramid [82] and auto-context [83] ideas influenced the
training process, the generated delineations are gradually close development of cascaded strategies exploiting series of convo-
to the ground truth, to the point of being able to deceive the lutional encoder-decoders [23, 60]. Instead of simultaneously
discriminator. Unlike standard post-processing schemes, such exploiting several pathways working at various scales [84],
iterative refinement performed through adversarial learning cascaded approaches consist in using a scale-space pyramid
[79] is conducted in an end-to-end manner. to perform segmentation at a higher resolution while also
As generator φ, conditional GAN pipelines may use any considering contextual information from lower resolutions.
type of U-Net [14] inspired model, from simple [22] to By considering two convolutional encoder-decoders in cas-
extended (using dense dilated convolution) [61] and cascaded cade, the most common setup could consist in training a low-
(Sect.IV-B) [60] ones. The inputs of the discriminator D resolution model and using its weights as initialization of
are the concatenation of source images and ground truth or a high-resolution model through transfer learning and fine-
predicted masks to be evaluated. Defined between 0 (fake) and tuning. Although this strategy can significantly speed up
1 (plausible or real), the output of D is an array where each convergence, the ability of the high-resolution segmentation
value corresponds to the degree of segmentation likelihood for model to extract long-range contextual features remains lim-
a given image crop and its associated segmentation mask. Let ited. The idea of stacking (at least) two convolutional encoder-
φ(xx) and D(x x, φ(xx)) be the outputs of φ and D respectively. decoders to integrate multi-level information more directly
The loss function Lφ (Θ Θφ , Θ D ) for the generator φ can be came naturally [23] and made use of auto-context [83] such
defined as the following combination: that posterior probabilities resulting from a given model can
be used as features for the following one [55]. The models
N
1 X can be trained separately [55] but this prevents refining low-
Lφ (Θ
Θφ , Θ D ) = `CE (φ(xxn ), y n ) + λ`adv (φ(xxn ), y n ) resolution models from the high-resolution ones during back-
N n=1
(3) propagation. An end-to-end training is better suited to exploit
where λ is an empirically set weighting factor and simultaneous multi-level segmentation refinement [23, 60]. A
Θ D the trainable parameters of D. The adversarial term cascade of deep modules exploiting tissue-specific geodesic
xn ), y n ) equals to − log(D(φ(x
`adv (φ(x xn ), x n )). Minimizing distance maps as contextual information was employed in [85]
`CE tends to provide rough predictions whereas maximizing to gradually improve the delineation accuracy. Combined with
log D(φ(x xn ), x n ) aims at improving contour delineations. top-down reasoning, such bottom-up strategies could better
Conversely, the optimizer typically fits D through cross- handle texture information and region discontinuities.
entropy using both estimated and ground truth masks. The
loss function LD (Θ Θφ , Θ D ) for D is therefore defined as: C. Prior knowledge embedding
N Regularization plays a key role in DL since it tends to
1 X
LD (Θ
Θφ , Θ D ) = − log(D(yy n , x n )) increase both robustness and generalizability of a deep model
N n=1 when applied to unseen data. One common strategy consists in
− log(1 − D(φ(x
xn ), x n )) (4) adding a regularization term into the loss function to get more
accurate and plausible results [86]. The regularization term
The above equation maximizes the loss values for ground Rφ deals with adding some prior knowledge to the model φ
x, y ))) and minimizes loss values for gener-
truth (i.e. log(D(x and its regularization effect is achieved by adding the scaled
ated masks (i.e. − log(1−D(x x, φ(x
x)))). The optimization pro- regularizer λRφ to the loss function Lφ to ensure further
cess is performed sequentially by alternating gradient descents consistency between both predictions φ(x x) and targets y . The
on φ and D at each batch [81]. To further improve the ability resulting loss function can be expressed as follows:
6

projection

ce
spa
nt
late

Fig. 2. Integration of anatomical shape priors into a deep segmentation pipeline [24, 26, 35, 80]. Shape priors-based regularization is performed using a shape
encoder arising from a convolutional auto-encoder previously optimized on ground truth segmentation masks. Refer to Sect.IV-C for further details.

Nevertheless, a cosine distance between predicted and ground

N
1 X truth masks in low-dimensional space is also of interest [74].
Lφ (Θ
Θφ ) = xn ), y n ) + λRφ
`φ (φ(x (5)
N n=1 N
1 X
Many different types of information can be incorporated as Rφ (ŷy , y ) = 1 − cos(F (yy n ; Θ ∗F ), F (φ(x
xn ); Θ ∗F )) (8)
N n=1
prior knowledge into DL frameworks such as shape constraints
[26], topology specifications [73], edge polarity [87] or ad- As reported in [80], shape is just one of the geometric
jacency rules between regions [88]. Nevertheless, integrating attributes of anatomical or pathological structures one can
shape priors remains one of the most commonly used strategies exploit. Much more meaningful priors such as texture, topol-
toward anatomically meaningful predictions. In particular, the ogy or size can be embedded into training objectives towards
use of convolutional auto-encoder (AE) to learn anatomical stronger robustness and stability of DL segmentation networks.
shape variations from medical images has been demonstrated
in multiple applications [24, 26, 35, 80]. Specifically, a con-
D. Deep supervision
volutional AE is a deep network made of an encoder F : y 7→
F (yy ; Θ F ) and a decoder G : F (yy ; Θ F ) 7→ G(F (yy ; Θ F ); Θ G ) Introduced in the context of holistically-nested edge detec-
where Θ F and Θ G correspond to the learnable parameters of tion [89], additional convolutional operations can be applied
F and G respectively. F maps the input to a low-dimensional at different levels of the decoder branch in order to exploit
feature space whereas G reconstructs the original input from a deep supervision mechanism (Fig.3) able to boost the seg-
the compact representation. To avoid the AE to copy the input, mentation performance [15]. Companion objective functions
F is usually designed to be undercomplete such that the latent are estimated at some hidden layers of the network and added
space is much smaller than the input dimension. By penalizing to the output loss. In practice, feature maps as outputs of each
the reconstruction G ◦ F (yy ), the cross-entropy loss function intermediate decoder blocks can be up-sampled to the size of
can be employed to optimize the AE following: the input image using bi- or tri-linear interpolation, depending
N
of the 2D or 3D nature of the segmentation problem. Similarly
1 X to [25], a convolutional operation (e.g. with 3 × 3 kernel) can
Θ ∗F , Θ ∗G = arg min `CE ((G ◦ F )(yy n ), y n ) (6)
ΘG
Θ F ,Θ N n=1 be applied to create feature maps at each level of the decoder
(16 in Fig.3). These maps then go through deep supervision
where Θ ∗F (resp. Θ ∗G ) are the optimal weights of the encoder
modules to improve the gradient flow and encourage learning
(resp. decoder). After having trained the AE, its encoder
more useful representations [25]. After having performed the
component acts as a non-linear shape model and can project
concatenation of these intermediate outputs, two convolutional
any predicted or ground truth segmentation masks to a shape
layers including a final 3×3 one with softmax activation finally
manifold space [80]. The encoder produces a feature map F (yy )
achieve pixel-wise segmentation (multi-label in Fig.3). In this
which compactly encodes the most salient characteristics of y .
context, the overall loss function Lφ can be defined as the
Once the AE trained, its encoder component can be inte-
weighted sum of the cross-entropy losses (or any other losses)
grated into the segmentation pipeline (Fig.2). A regularizer Rφ
estimated at different decoder levels involving supervision:
that penalizes the deviation between predicted and ground truth
segmentation masks fed as inputs of the learned shape model M
X
F is included in the global loss (Eq.5). A Euclidean distance Lφ (Θ
Θφ ) = wj LjCE + wf LfCE (9)
between both latent shape representations [26] is usually used: j=1

where wj and Ljce denote the weight and loss for the points
N
1 X of supervision at level j of the decoder, wf and LfCE the
Rφ (ŷy , y ) = xn ); Θ ∗F )k22
kF (yy n ; Θ ∗F ) − F (φ(x (7)
N n=1 weight and loss computed at the final network output (where
7

3 64 64
64 128 128
128 256 256
256 512 512
512 512 512

32
128
128
128
256

256

256
512

512

512
5 64 64
128 128 128
256 256 256
512 512 512

64
128
128
128
256

256

256
512
C 64 16 16 16 16 conv + BN + reLU

max-pooling

skip connexion
512

512
512

512

512
up-sampling

concatenate
C C C C
conv + interpolate

conv + softmax

512

512
512

512

Fig. 3. Convolutional encoder-decoder architecture with deep supervision. The overall loss function is the weighted sum of losses estimated at different
decoder levels [25]. C is the number of classes. Refer to Sect.IV-D for further details.

f stands for final). When following a VGG-13 architecture output

alignment attention
[90] as depicted in Fig.3, M = 4 intermediate decoder levels function weights
can be considered. Following [25], one can use w1 = 0.8,
w2 = 0.7, w3 = 0.6, w4 = 0.5 and wf = 1 where level
j = 1 is closer to the network ending part than level j > 1.
However, how to balance the hyper-parameter setting among
different loss components remains a matter of concern. Instead
of empirically defining weights, a relative weighting can be
learned from the data using homoscedastic uncertainty [91].

E. Attention mechanisms
The human visual system can concentrate and focus actively
on a tiny portion of highly relevant perceptible information
while disregarding other irrelevant perceivable stimuli. At-
tention mechanisms were introduced in DL frameworks to
imitate this aspect of how the human visual system processes
Fig. 4. Flow-chart diagram of a general attention mechanism function. Refer
information. In general, they can be regarded as a dynamic to Sect.IV-E for further details.
selection process where the features extracted from images are
adaptively weighted to pay attention to the more salient ones, we have a query q ∈ Rdq and M pairs of key k ∈ Rdk
i.e. the feature needed for accurately solving a specific image and value v ∈ Rdv vectors {(k1 , v1 ), . . . (kM , vM )}, where
analysis task. Attention mechanisms, particularly in image all of them can be obtained from intermediate CNN features
segmentation, can suppress feature responses in irrelevant or embedding patches of an input image x . The attention is
background regions, hence reducing the rate of false-positive computed step-by-step following three operations: alignment,
predictions. This is particularly true for the challenging in- weighting and contextualization. In the alignment step E(·),
stances of small objects with high shape variability. each query is matched against the M keys to compute a
The attention problem is usually formulated using three score value. Several commonly used alignment functions are
vectors: query, key and value. Conceptually, we can think of further described in Tab.II where additive [92] and dot-product
key and value as a look-up table in which the query is matched [93, 94] functions are the most widely used. In the next step,
to a key, and the value associated with that key is returned. In the alignment scores are passed through a function H(·) (e.g.
image segmentation, it is equivalent to mapping the features softmax, sigmoid) to generate the final attention weights by
of the structure to segment (query) against a collection of normalizing all the scores to a probability distribution (Eq.10).
plausible target features (keys), then presenting the best- A contextualization vector (Eq.11) is then instantiated for each
matched regions (values). Mathematically, let us consider that q as a weighted sum of the M values vi by the set of weights
8

TABLE II GAP
S UMMARY OF SEVERAL POPULAR ALIGNMENT FUNCTIONS USED TO
COMPUTE THE MATCHING SCORE BETWEEN A QUERY AND KEYS . W AND
V ARE TRAINABLE WEIGHT MATRICES , MEANWHILE dk STANDS FOR THE convolution
X
DIMENSION OF A VECTOR k. [.; .] STANDS FOR CONCATENATION .

Name Alignment function

Additive or concatenation E(q, k) = V T tanh(W W [q; k]) [92] Fig. 5. Flow-chart diagram of a squeeze-and-excitation block [96]. GAP
Content-based attention E(q, k) = cos(q, k) [95] stands for global average pooling.
Location-based attention E(q) = W q [93]
General E(q, k) = qW
W k [93]
Dot-product E(q, k) = q T k [93]
chooses where to focus attention through an adaptive spatial
qT k area selection procedure. One may integrate an attention block
Scaled dot-product E(q, k) = √ [94]
dk into the U-Net architecture to learn semantic representations
that prioritize spatial regions with high saliency levels for the
A. This computation enables to define how much attention task of tumor segmentation [68]. In the same direction, the
should be paid to each feature vi . Fig.4 depicts the procedure addition of a spatial-channel squeeze-and-excitation block can
followed to get the context vector for a particular query q. improve the performance of various convolutional architec-
tures dedicated to medical image segmentation [102].
A(q, ki ) = H(E(q, ki )) (10) Branch attention mechanisms, on its side, separate the
attention problem into several sub-modules (branches), each
M
X of which focuses on a certain aspect (e.g. channel, spatial, res-
C(q, {(k1 , v1 ), . . . (kM , vM )}) = A(q, ki ) × vi (11) olution, degradation) while exchanging effective information.
i=1 Then, the role of the attention mechanism is to adaptively
calibrate the weight of each branch, acting as a dynamic
Four main categories of attention techniques can be found
branch selection procedure, choosing which to focus on. To
in the literature dedicated to medical image segmentation:
segment scleral blood vessels, Yao et al. [103] developed a
channel attention (what to attend), spatial attention (where to
U-Net-inspired architecture with deep feature concatenation
attend), branch channel (which to attend) as well as hybrid
and an attention mechanism branching into numerous attention
(e.g. channel and spatial attention) methods.
gates. This enables the network to focus on the border seg-
Channel attention is based on the idea that, in deep CNN
mentation of tiny blood veins. Other works such as [56] and
models, each channel represents a different feature map that
[104] exploited the multi-scale nature of CNN models. They
typically denotes distinct objects. As a result, the role of
generated series of multi-scale attention modules at different
channel attention is to adaptively calibrate the weight of each
resolutions and integrated local deep attention features with a
channel, serving as an object selector of the entities that
global context. Lastly, channel and spatial attention incorpo-
should deserve more attention. Channel attention, particularly,
rated the benefits of both attention mechanisms. This system
squeeze-and-excitation (SE) block [96], has proved to be ex-
acts as a dynamic spatial area and object selection mechanism,
tremely effective in tasks such as head and neck primary tumor
deciding what and where to focus attention [105, 106].
segmentation [97], prostate zonal segmentation [98], brain
structure segmentation [99] and micro-vessel segmentation
[100]. The SE block, depicted in Fig.5, passes an intermediate F. Multi-structure analysis
features map through a squeeze operation (i.e. global average The multi-structure analysis aims at incorporating inter-
pooling) that captures global spatial information. Then, an structure relations into a given DL-based segmentation model,
excitation module (alignment function) captures channel-wise thus leading to a more accurate representation of complex
relationships and outputs an attention vector using fully- anatomies. A typical application in this direction deals with
connected and non-linear layers (e.g. ReLU), followed by multi-organ segmentation from CT [23, 40, 107, 108], PET
a sigmoid function (weighting). Lastly, each channel of the [109] or MR [60, 107, 109] images. Although computationally
input features map is scaled by multiplying the corresponding effective, global approaches which consider one single model
element in the attention vector (contextualization). Others for all structures of interest are computationally effective,
works including [26] extended skip-connections between both extracting high-level relationship patterns between multiple
encoder and decoder branches through channel attention gates. structures but cannot fully exploit the local characteristics of
By scaling the encoder features by importance, the network each different component to delineate. Conversely, modeling
may concentrate on a specific aspect of the input to generate multiple structures by a set of structure-specific models en-
the segmentation. A similar approach was presented in [101] ables generating local structure-specific features but sacrifices
for lesion segmentation. This model also used the extracted the ability to take advantage of inter-structure relationships.
feature maps from the encoder path for the computation of In their survey, Cerrolaza et al. reported that multi-level
attention weights which are afterward merged with the up- (e.g. nested, multi-resolution) or sequential models are robust
sampled feature maps in the decoder branch. alternatives to global or individual models as they combine
Spatial attention and channel attention have relatively sim- the robustness and specificity of global approaches with the
ilar functioning. Spatial attention consists in adaptively cali- flexibility of structure-specific models [27]. Multi-level nested
brating the weight of each part of the image. This mechanism models decompose the data into different levels of detail
9

fusion
input

input
CNN CNN CNN
encoder decoder decoder

(a)

Input
input

ViT ViT ViT

encoder CNN encoder decoder
decoder
(b) (c)
Fig. 6. Typical Transformers-based model architectures for medical image segmentation: (a) hybrid Transformers-CNN encoder, (b) pure Transformers-based
encoder, (c) full Transformers-based network. Refer to Sect.V-A for further details.

according to coarse-to-fine analysis rules. Conversely, in multi- In the same scope, neural architecture search (NAS) is
level multi-resolution models, global inter-organ relations are another direction under investigation with the goal to automate
modeled at coarser resolutions while local organ-specific vari- the iterative network design process usually handled manually
ations are extracted from higher resolutions. On its side, se- by researchers. Among the existing research works in this
quential modeling deals with the analysis of multiple structures area, NAS-UNet [111] is based on the design of three types
following a pre-defined order of increasing complexity. In the of primitive operation set on search space to automatically
same vein, it is also relevant to mention the application related find two cell architectures (DownSC, UpSC) for semantic
to lesion segmentation for which delineating the organ of inter- segmentation purposes. Promising results were reported for
est as a first stage before localizing its inner lesions is usually various imaging modalities including CT, MR and ultrasound.
followed to promote the extraction of organ-specific features
and to narrow the search area (Sect.III). As an example, such V. E MERGING TRENDS
two-step sequential approach has been successfully applied in
[57] for liver lesion segmentation from CT scans. A. Medical Transformers
Transformers, as attention-based structures [94], have first
demonstrated their tremendous force in natural language pro-
G. Learning frameworks cessing (NLP) [112, 113] and have gradually gain attraction
The success of DL in medical image segmentation not only on different computer vision tasks such as image classification,
originates from the development of novel learning paradigms detection, segmentation and video analysis. Their popularity
but also from the network architecture design itself and the is now also rapidly growing in medical image analysis [114],
focus given to data management and optimization processes. especially for medical image segmentation with an exponential
Especially, a trend can be noticed towards the development of growth of related publications in the last year [15, 115].
unified frameworks such as NiftyNet [110] or nnU-Net [28]. The pioneering work of vision Transformers [116] was an
Since the design choices towards an optimal framework are interesting and meaningful attempt to replace convolutional
usually dedicated to a specific segmentation task (i.e. a given backbones with convolution-free models. In contrast to CNN,
tissue type for a given modality) and cannot easily be trans- vision Transformers (ViT) offer parallel processing and a
ferred to another application, pipelines that can configure their complete field of view in a single layer.
sub-components in an automated fashion are highly requested. ViT has a columnar structure where the 3D input volume
In particular, nnU-Net [28] which has been designed to deal x ∈ RH×W ×D×C is split into np 3D non-overlapping patches
with the dataset diversity found in the domain has proven {xx1 , x 2 , x 3 , . . . , x np } with x i ∈ RP ×P ×P ×C , C represents
its efficiency by winning many challenges. It condenses and the number of modalities, (P, P, P ) is the resolution of
automates the key decisions for designing a successful pipeline each patch and np = HW D/P 3 is the resulting number
for any given dataset. Thus, nnU-Net has become one of the of patches, which is also the effective input length of the
reference frameworks when targeting a new segmentation task. Transformer. Since Transformer layers operate over fixed size
10

1D set of vectors, the np patches are flattened and mapped 1) Pure Transformers-based encoder: The first category
to a d-dimensional embedding space through a trainable lin- exploits the global context modeling capability of
3
ear projection matrix E ∈ R(P ·C)×d . To preserve spatial Transformers to effectively encode the relationships between
information, a 1D positional embedding E pos ∈ Rnp ×d is spatially distant voxels. A convolution-free encoder is
added to each of the np patches, and the resulting sequence introduced by forwarding flattened image representations to
of embedding is used as input to the Transformer encoder: Transformers, whose outputs are then reorganized into 3D
tensors followed by CNN up-sampling blocks with multi-level
x1E ; x 2E ; x 3E ; . . . ; x np E ] + E pos
z 0 = [x (12) feature aggregation. For instance, [118, 119] employed a 3D
ViT as an encoder and connected it to the CNN decoder via
The Transformers-based encoder consists of alternating L skip-connections. At the bottleneck of the encoder, the feature
layers of multi-head self-attention (MSA) and multilayer per- map was reshaped and up-sampled by a factor of 2. Then, the
ceptron (MLP) blocks. A layer normalization (LN) is applied previous Transformer layer was used as a skip connection and
before each block and a residual connection after each block. concatenated with the resized feature to be later up-sampled
One layer of a Transformer block can be formulated as: through convolution, normalization and linear activation. This
process was repeated until the initial resolution was reached.
z 0l = MSA(LN(zz l−1 )) + z l−1 However, as anatomical structures can substantially vary in
(13) scale, they cannot be properly modelled using a set of fixed
z l = MLP(LN(zz 0l )) + z 0l
sub-regions of the image. Recently, hierarchical ViT such as
with l = {1, . . . , L}. The MSA block consists of h parallel Swin [120] or PVT [121] Transformers have been introduced
self-attention (SA) heads, where each SA head attends to to overcome these challenges by extracting features at
bring information from different representation sub-spaces at different resolutions. This improves the performance of
different positions through a scoring function A. To achieve Transformers in dense prediction tasks while saving the
this goal, an input sequence z ∈ Rnp ×d is mapped into query linear computational complexity with respect to the image
(Q ∈ Rnp ×dk ), key (K ∈ Rnp ×dk ) and value (V ∈ Rnp ×dv ) size. Hierarchical ViT architectures introduce CNN-like
matrices using three learnable parameters: W q ∈ Rd×dk , properties into the Transformers as they compute local
W k ∈ Rd×dk and W v ∈ Rd×dv . attention with shifted windows, starting from small-sized
patches and gradually merging neighboring patches in the
Q = zW q subsequent layers. To reduce the design complexity of
traditional hierarchical ViT, a 3D U-shape model inspired
K = zW k (14)
by nested hierarchical Transformers [122] exploited the
V = zW v idea of global self-attention within smaller non-overlapping
Then, the attention distribution function is computed following 3D blocks [123]. Cross-block self-attention communication
Eq.15 and the resulting attention weights are applied to the V was achieved by hierarchically nesting these Transformers
matrix obtaining the SA maps, as described in Eq.16. and connecting them with a specific aggregation function.
Valanarasu et al. proposed in [124] a gated axial-attention

Q · KT
model that extended previous designs by incorporating
A(Q, K) = softmax √ (15) a new control mechanism in the self-attention module.
dk
Furthermore, the model operated on the whole image and
patches to simultaneously learn both global and local features.
SA(Q, K, V ) = A(Q, K) × V (16)
2) Hybrid CNN-Transformers encoder: This second
For MSA, Q, K, and V are computed once for each head category integrates the global context modeling ability of
W q,k,v
using h different learnable parameters (W i ), and the final Transformers with the CNN inductive bias [29, 125, 126].
attention map results from the concatenation of the h heads CNN layers capture the multi-scale context feature maps
multiplied by a learnable aggregation matrix W o ∈ Rhdv ×d . by stacking convolution blocks. Meanwhile, Transformers
The computational cost of single head attention with full d- capture the long-term dependencies among the features that
dimension is maintained by setting dk and dv equal to d/h. would be potentially lost with purely convolutional models.
Lastly, a CNN-based decoder gradually up-samples the
Wo
MSA(Q, K, V ) = [head1 ; . . . ; headh ]W (17) Transformers output into a 4D feature map to recover the
full segmentation mask (Fig.7). Others approaches modified
with headi = SA(Qi , Ki , Vi ) = SA(zzW qi , z W ki , z W vi ). In the the traditional SA blocks using deformable Transformers
context of medical Transformers, U-Net-shaped architecture [127] or squeeze-and-expansion Transformers layers [128].
remains the preferred choice to build the Transformer In [129], Chen et al. proposed a 2D hybrid network that
segmentation models. From them and as illustrated in Fig.6, combines two independent self-attention blocks to model the
three categories can be identified [117]: pure Transformers- long-range interactions and global spatial relationships. In
based encoder, hybrid Transformers-CNN encoder as well as addition, a multi-scale skip-connection scheme aggregated
full Transformers-based network. multiple features in the decoder at a different scale to generate
more discriminative representations. PC-SwinMorph [130]
11

with multiple tasks. Multi-task learning aims at taking advan-

tage of the information shared among two or more connected
(16,H,W) or auxiliary tasks while better handling each task individually
1/2
[52]. As investigated in [140], multi-task learning can be seen
CNN as a form of inductive transfer where the introduced inductive
1/4 (64,H/2,W/2)
bias allows to prefer some hypotheses over others (i.e. the
LN
hidden features ones that explain more than one task), towards solutions that
MHSA
linear projection generalize better than their individual counterparts. One of
1/8
(128,H/4,W/4) the first multi-task learning frameworks dedicated to medical
LN Transformer layer
conv3x3, ReLU image segmentation was developed in [141] which investigated
(n=12) up-sampling
MLP
segmentation head
whether a single CNN can be trained to perform different
Transformer layer (256,H/8,W/8)
down-sampling
feature concatenation
segmentation tasks, i.e. in a multi-domain fashion (Sect. VI-B).
reshape
hidden features
(n_patch,D) (D,H/16,W/16) (512,H/16,W/16)
The combined training procedure was therefore suited for
identifying anatomical structures, tissue classes as well as
Fig. 7. Overview of the TransUNet architecture proposed in [29] including imaging modalities at once. More globally, multi-task learning
a schematic illustration of a Transformer layer. can leverage various and heterogeneous forms of annotations,
from global images to finer-grained and pixel-level labels.
performed registration-based segmentation using both CNN Based on the finding that several segmentation, classification,
and patch-based contrastive strategies followed by a Swin regression or detection issues can be effectively solved at once
Transformer which enforced the capture both global and using a single network, multi-tasking strategies have emerged
local anatomical representations. Another type of hybrid through the design of a cascade of task-specific sub-networks
Transformers-CNN encoders deal with multi-branch fusion [30] or the development of networks with shared encoder
schemes like TransFuse [131], HybridCTrM [132] and and task-specific decoders [142–145] to benefit from partial
CrossTeaching [133]. Commonly, the two parallel branches, parameters sharing between tasks or sub-tasks.
one for CNN and the other for Transformers, are fused to In particular, the encoder-decoder architecture designed in
benefit from the two learning paradigms. The CNN branch [142] featured a single encoding path and multiple decoding
provides the ability to focus on local information while branches for concurrent segmentation tasks. The encoding
Transformers learn long-range voxel dependencies. module used a generic set of parameters shared by multiple
tasks whereas the decoding branches were task-specific. An
3) Full Transformers network: Full Transformers architec- auxiliary cost was also added at the end of the encoding
tures are built in an end-to-end Transformer-based fashion module to predict the presence or absence of lesions. Sup-
[134, 135]. Cao et al. proposed in [136] a Swin 2D U-Net plementary sub-tasks including contour detection and distance
network which includes a patch expanding layer to up-sample map estimation were incorporated in [143] to refine coarse
the feature maps of the decoder. This architecture showed and discontinuous segmentation predictions from convolu-
superior performance in capturing fine details compared to tional models. In [144], a single multi-task network was pro-
decoders based on bi-linear up-sampling. Lin et al. went a posed to simultaneously address gastric tumor segmentation
step further in [137] by first adopting a dual-scale encoder and lymph node classification from CT scans. An attention-
based on 2D Swin-Transformer to extract both coarse and based reconstruction task was integrated into the segmentation
fine-grained feature representations at different scales. They pipeline of [145] to leverage unlabeled medical images in a
also included an interactive fusion module to effectively semi-supervised segmentation framework (Sect.VI-D). In [73],
establish global dependencies between features of different Keshwani et al. improved vessel CT segmentation not only
scales through the self-attention mechanism. To better leverage by considering a single decoding branch dedicated to vessel
multi-scale feature hierarchies, Huang et al. proposed in [138] segmentation but also by involving two additional decoders: a
a 2D hierarchical encoder-decoder architecture whose main centerness decoder whose task was formulated as a regression
contribution was the inclusion of an enhanced Transformer problem and a topology distance decoder aiming at enhancing
context bridge to capture the correlation and local context of the vessel connectivity which is key regarding clinical needs.
multi-scale features generated by the hierarchical Transformer By enabling fruitful cooperation between related tasks,
encoder. Peiris et al. developed in [139] VTU-Net that works these multi-task approaches have been shown to outperform
directly in 3D using Swin Transformers. The Swin decoder traditional independent or segmentation-only models. As for
introduces parallel cross-attention and self-attention, which deep supervision (Sect.IV-D), the relative weighting between
creates a bridge between queries from the decoder, keys, each task’s loss can be tuned by hand or automatically [146].
and values from the encoder. Such parallelization enables to
preserve the full global context during the decoding process,
which is key towards a robust delineation of medical images. C. Segmentation uncertainty
In DL-based image segmentation, the forward pass is a
B. Multi-task learning deterministic process that maps a voxel to a unique label. This
Another way to improve image segmentation consists in apparent determinism however fails to take into account the
exploiting the ability of deep models to simultaneously deal various sources of uncertainty that affect a neural network pre-
12

diction. Such awareness may be crucial to detect potential seg- Data or aleatoric uncertainty, on the other hand, can
mentation errors. There is a clear need to understand the lim- be assessed through test-time data augmentation (TTA), in
itations of segmentation models via the assessment of voxel- which multiple forward passes are performed to inputs altered
wise confidence measures, which is the purpose of uncertainty through basic data augmentations (e.g. flips, rotations, scaling)
quantification applied to segmentation [147, 148]. Uncertainty [31]. The resulting outputs are then aggregated with similar
modelling may also importantly be used to directly improve methods to TTD (i.e. averaging, entropy). TTA is easier to
segmentation performance, as in [149] where the distinction implement than TTD as it does not require any modification to
between quality control and model improvement techniques the network architecture and can readily be achieved through
is highlighted. This is typically done by averaging out voxel- off-the-shelf data augmentation and segmentation frameworks.
level errors using multiple stochastic forward passes in models Qualitative results seem to suggest that aleatoric uncertainty
that can sample across the uncertainty space [31, 147]. estimates provide more expressive qualitative maps for medi-
According to Bayesian terminology, uncertainty can be cal image segmentation uncertainty assessment [31].
divided into epistemic and aleatoric uncertainty [146, 150]. Regarding the improvement of performance through uncer-
Epistemic (or model) uncertainty relates to the lack of ac- tainty sampling, using epistemic uncertainty modelling with
curacy in the model parameters due to insufficient training. TTD or MC dropout generally yields moderate but consistent
It is a kind of uncertainty that can be reduced by providing improvement of segmentation results [31, 156]. For instance,
more training time and/or data. Aleatoric uncertainty, on the epistemic uncertainty-aware networks achieved state-of-the-art
other hand, relates to the inherent uncertainty of the data itself, performance on the medical image segmentation decathlon
which can further be divided into homoscedastic uncertainty challenge [153, 158]. On the other hand, TTA seems to be
(constant for all inputs) and heteroscedastic uncertainty (vari- more effective than TTD for improving medical image seg-
able between inputs). In image segmentation, both image and mentation results, with performance enhanced by up to several
label spaces are affected by aleatoric uncertainty. Example Dice points [31]. TTA is therefore a generally recommended
causes of homoscedastic uncertainty in radiation imaging are step if inference cost is a secondary concern. The question
physical processes such as Compton scattering or positron as to what is the optimal way to pool TTA predictions and
range. Image-space heteroscedastic uncertainty can be, for in- what augmentation to select is an open research subject [159].
stance, due to dataset shifts in multi-center studies, while label- Applications of deep uncertainty modelling to PET-CT are
space heteroscedastic uncertainty may be due to heterogeneity less popular than in MR modality, with few related works
in annotation quality [151]. Ideally, uncertainty assessment in segmentation [65, 160]. Sudarshan et al. leverage physics-
should be calibrated. Calibration means that prediction con- based heteroscedastic uncertainty modelling for low-dose PET-
fidence c should equal its likelihood of being correct, i.e. a MR image denoising [161]. This relative lack is arguably due
value of c ∈ [0, 100] should indeed translate to a model being to the novelty of the topic in medical imaging. Uncertainty
accurate c% of the time over multiple instances, which is an quantification being an emerging trend, more contributions are
open research subject in medical imaging [147, 148, 152]. expected in the future, especially in radiation imaging.
Several techniques may be employed to produce voxel-level
confidence maps for both epistemic and aleatoric uncertainty
quantification of segmentation predictions. The most popu- D. Contrastive learning
lar epistemic uncertainty measurement is Monte-Carlo (MC) Whatever the image analysis task involving representation
dropout, also known as test-time dropout (TTD) [150], where learning, extracting robust features means reaching distinct
many (e.g. from 10 to 50) stochastic dropout forward passes clusters reflecting the different classes involved. In this
of a model equipped with dropout weights are performed direction, contrastive learning tends to enforce the model
during training [153–155]. The dissimilarity in the predictions, to learn an efficient and disentangled feature representation
assessed through variance, mutual information or entropy by comparing the input image with comparing images (i.e.
[154] can then be assimilated to a voxel-wise epistemic anchors). The comparison is performed between positive pairs
uncertainty map. An obvious drawback of MC dropout is the of similar inputs (e.g. generated through data augmentation)
requirement for dropout during training. Dropout may indeed and negative pairs of dissimilar inputs (e.g. other image
be detrimental to segmentation performance, and a number samples used for training). The original contrastive loss
of state-of-the-art segmentation solutions including nnU-Net was initially defined for classification purposes in computer
[28] do not include dropout. Alternatives to MC dropout vision [162] and its adoption in the medical image processing
for epistemic uncertainty quantification include performing community has been relatively late [163]. The underlying idea
forward passes at various training checkpoints of the opti- was to pull together data points from the same class while
mization [156], following the empirical observation that less pushing apart negative samples in embedded space [162],
certain predictions are less stably predicted along training. A thus imposing intra-class cohesion and inter-class separation.
more computationally demanding method is deep ensembling,
whereby independently trained networks are averaged together 1) Global contrastive learning: Most existing contrastive
to get uncertainty maps [157]. Albeit more demanding, deep learning methods deal with a global contrastive loss and target
ensembling is a common practice due to its consistency in image classification (Fig.8a). For image segmentation, a global
improving segmentation results [28]. Thus, uncertainty maps contrastive loss can still be used by projecting the data through
may be derived freely as a by-product of this main objective. the encoder path to the latent space [32, 164]. Let us describe
13

to be equivalent underneath various transformations and also

focuses on the dissimilarity with other regions from the same
image. Validated on three MR datasets, this method lead to a
substantial increment in delineation accuracy. Since generating
pairs of data for the use of contrastive learning is challenging
in medical image segmentation due to the potential presence
of similar tissue or anatomical structure across the dataset,
Zeng et al. explored in [165] a novel method called positional
contrastive learning. The method dealed with generating con-
trastive data pairs based on the position of a slice within 3D
volumes. Based on the slice distance, more closest slices were
Head
referred to as positive pairs, and far slices were considered
Head
negative. This enabled the reduction of false-negative image
global average
pooling pairs and improved the segmentation results against state-of-
Head
label1
the-art. Self-supervised local contrastive learning is only one
label2 of the strategies followed in self-supervised learning whose
perimeter is further explained in Sect.VI-C.
a) global contrastive b) self-supervised local contrastive c) supervised local contrastive Conversely, a supervised local contrastive loss (Fig.8c) that
Fig. 8. Self-supervised global (a), self-supervised (b) and supervised (c) local leverages limited pixel-wise annotation to force pixels with
contrastive learning [164]. Refer to Sect.V-D for further details. the same label to gather around in the embedding space was
proposed in [164]. Some papers combine global and self-
how a global contrastive loss can be defined in this context. supervised [32] or supervised [164] local contrastive learning.
We consider an input batch B = {x x1 , x 2 , . . . , x b } where each Contrastive learning was also adopted in a federated learn-
x i corresponds to a 3D medical image or one of its 2D slice. ing context (Sect.VI-E). In [38], Wu et al. explored contrastive
Using data augmentation, one can transform twice each x i to learning for volumetric medical image segmentation in the
form a pair of augmented images. These new images form a presence of limited labeled data. Following a similar approach
set of augmented images A = {a a1 , a 2 , . . . , a 2×b }. For a given to the above-discussed ones, clients first learn a shared encoder
image x i , A contains two related augmented versions, referred on unlabeled data from various sites. Then, a network is fine-
to a j and a k . A global contrastive loss can therefore be used: tuned on a labeled dataset. The mixed contrastive data are
supplied to each medical location, enabling the use of data
1 X exp(zz j .zz k /τ ) variousness for contrastive learning. This enables performing
Lcon = − log P (18) global structural matching to learn an encoder with suitable
|A| l6=j exp(z z j .zz l /τ )
j∈{1,...,2×b} representations among clients.

where z j and z k are the normalized features obtained after

applying a header function h(.) (e.g. multi-layer perceptron) E. Knowledge distillation
at the output of the encoder E such that z j = |h(E(aaj ))|. τ Compared to the computationally expensive scenario con-
is a temperature scaling parameter. As opposed to standard sisting of training many different models on the same data
approaches that operate on image classes, a dataset label and then averaging their predictions, compressing the knowl-
information was leveraged in [35] to enhance intra-domain edge into a single model through knowledge distillation is
similarity and impose inter-domain margins, in a multi-task much easier to deploy [166]. The knowledge distillation (KD)
multi-domain segmentation scenario. mechanism tends to distill (i.e. transfer) information from a
well-trained cumbersome teacher network to a lightweight and
2) Local contrastive learning: More suited for medical compact student network with the final goal of improving the
image segmentation, a local contrastive learning approach can performance of this student model. In a standard KD setting,
be followed by designing a local version of the contrastive loss the teacher model generates soft predictions which are used
(Eq.18) able to learn distinctive representations of local regions to supervise the student model by calculating the difference
instead of relying on global representations. Only teaching the of their final layer with some measurement functions such as
encoder to extract image-level disentangled features may not cross-entropy or Kullback-Leibler divergence (KL). The soft
be sufficient since segmentation requires a class prediction predictions are obtained after the last convolution layer by
for all voxels [164]. Two main techniques can be followed mean of a parameterized softmax function following:
to train the decoder at extracting distinctive local represen-
tations through contrastive learning: self-supervised (Fig.8b) exp( zτc )
qc = ! (19)
and supervised (Fig.8c) local contrastive learning. X zc
For the first category, Chaitanya et al. introduced in [32] a exp
τ
c∈C
local extension of the contrastive loss that is helpful for per-
pixel segmentation as it learns distinctive representations of where the logits z c associated to each class c are converted
local regions. Thus, it compares the local features of the image into probabilities q c . τ is a parameter called temperature which
14

controls the softness of the output probabilities. Since tuning Paired

data
τ is a difficult process and considering τ as a constant can
lead to sub-optimal results, feature normalized knowledge
distillation (FNKD) was proposed in [167] to exploit reliable
soft predictions irregardless of the feature scale by rather Fusion Early Mid Late

considering the L2 norm instead of a scalar value:

zc
exp kz zk Unpaired
data
qc = ! (20)
X zc
exp
kzz k
c∈C Image-to-image Knowledge
UDA translation distillation
with k.k corresponding to the L2 norm. Instead of focusing
on the final layer only from both teacher and student models,
guiding the compact network to mimic intermediate features Fig. 9. Categorization of cross-modal segmentation frameworks. UDA stands
from the teacher network was adopted in [168] by means of for unsupervised domain adaptation. Refer to Sect.VI-A for further details.
attention maps. Initially proposed in the context of image
classification, KD mechanisms were rapidly extended for student, i.e. the multi-organ segmentation model. Further, the
semantic segmentation purposes in computer vision [169, 170] integration of a co-training strategy and weight-averaged mod-
and medical image analysis [33] fields. Thus, a prediction map els unified multi-organ segmentation from few-organ datasets
distillation module was used in [33] to enable the student [176]. Self-distilling a Transformer-based U-Net by simultane-
network to learn predictive capability from the output seg- ously learning global semantic information and local spatial-
mentation maps provided by the teacher. Let φs be the student detailed features was also investigated in [177].
model, q sn and q tn the soft probability maps respectively from
the student and teacher networks. The loss function for training
the student φs with standard KD can be defined as: VI. E MERGING APPLICATIONS
A. Cross-modality segmentation
N Most of the approaches presented in the literature do not
1 X
Lφs = λ KL(qq sn ||qq tn )+(1−λ) `CE (φs (x
xn ), y n ) (21) consider the multi-modal nature of medical imaging data,
N n=1 leaving aside potentially valuable cross-modal information
unused. However, exploiting complementary and redundancy
with λ ∈ [0, 1] a scalar value adjusting the contributions of both
information across modalities can possibly improve overall
terms. Note that cross-entropy can easily replace KL for the
segmentation performance, making better use of the scarcity
knowledge distillation sub-loss. The standard supervised loss
of annotated medical imaging data. The research field of
(cross-entropy in Eq.21) can be any dedicated loss function
multi-modal image segmentation brings different technical
(e.g. DiceCE), as mentioned in Sect.II-A. Such loss function
challenges and open questions to solve [178], including:
making the student mimicking the ability of the teacher to
generate soft prediction maps is however not enough to really • Is the data available for training pairwise-aligned or

boost the performance of the student since only pixel-level comes from different patients?
information is considered. In this direction, more context • How to fuse different modalities to simultaneously reduce

and class-related information are needed. As for classification the heterogeneity gap and enable the transfer knowledge?
[168], constraints on intermediate multi-scale features arising • How to map data from one modality to another?

from both teacher and student were therefore integrated [33]

with importance maps distillation modules able to encode fea- Considering these questions, we introduce in what follows
ture maps into a transformable form to deal with the diversity the multi-modal works depending on the type of data
of feature sizes between teacher and student models. Other available during training (paired, unpaired), the followed
constraints were proposed in the specific context of knowledge fusion strategy (early, mid or late) [34] as well as the adopted
distillation such as boundary-guided [171], region affinity [33], translation approach. Fig.9 depicts a generic categorization
class-similarity [172], anatomical knowledge [173] or holistic of the cross-modal segmentation framework whereas Fig.10
distillation [174] in order to align high-order relations between illustrates a cross-modal pipeline when managing paired data.
what both teacher and student generated.
To go further, a novel KD based framework called mul- 1) Paired data and early fusion strategy: When multi-
tiple teachers single student (MTSS) was developed as a modal paired data is available, one may carry out any of the
new privacy multi-organ segmentation setting learning from three fusion strategies. The most straightforward is the early
multiple pre-trained single-organ segmentation models [175]. (also known as input-level) fusion strategy (Fig.10a) which
Formulated into a special unsupervised ensemble distillation integrates at the input level of the deep network the different
problem, multiple single-organ models served as teachers m-modalities. Therefore, the final segmentation is defined as
from different specialties and collaboratively teach one general y = φ(x1 , x2 , . . . , xm ) where φ represents the segmentation
15

Modality 1 Modality 2 Modality m Modality 1 Modality 2 Modality m Modality 1 Modality 2 Modality m

Fusion

(a) (b) (c)

Fig. 10. Pipeline of cross-modal frameworks for paired data. (a) Early or input-level fusion, (b) mid or layer-level fusion, (c) late or decision-level fusion.

network. Early fusion has the advantage of its simplicity, first task learns to differentiate between tumors and normal
allowing more complex segmentation strategies such as multi- tissue until the loss curve tends to flatten. The second task is
task, multi-view, multi-scale or GAN-based approaches. One then added and split the complete tumor into intra-tumoral
of the first works devoted to solving multi-modal image classes. This task continues until its loss curve displays a
segmentation using CNN can be found in [179]. They explored flattening trend. Lastly, the third task is introduced and trained
two-pathway cascaded architectures using different receptive simultaneously with the previous ones to precisely segment
field sizes to capture both local and global context information. the enhancing tumor. In this way, the model parameters
Qin et al. proposed in [180] an adaptive convolutional layer and the training data are transferred from an easier to a
named autofocus to effectively change the size of the receptive more difficult task. Unfortunately, early fusion makes it
field to perform multi-modal brain tumor segmentation. The hard to discover highly non-linear relationships between the
autofocus layer captured the multi-scale information by paral- low-level features from different modalities, especially when
lelizing multiple convolutional layers with different dilatation the modalities have significantly different statistical properties.
rates that are later merged using a weighted soft-attention
mechanism to choose the optimal scales. 2) Paired data and mid-fusion strategy: Mid or layer-level
fusion separately processes the multi-modal data in different
The above-mentioned methods and others [181, 182] paths (Fig.10b). For each modality, m, the input xm is encoded
did not make dense predictions and are therefore slow in in each branch fm to learn the modality-specific representation
the inference stage. To promote efficiency, encoder-decoder zm . Then, each representation zm is mapped into a common la-
architecture derived from U-Net [14] has been widely tent space via a fusion operation Λ and use this as input of the
adopted. For instance, Shapey et al. used in [183] a 2.5D decoding transformations g(Λ(z1 , z2 , . . . , zm )). The main goal
U-Net to segment the vestibular schwannoma in contrast- of this strategy is to learn an optimal joint representation that
enhanced T1-weighted and high-resolution T2-weighted MR emphasizes the most informative features across modalities.
imaging. Spatial attention modules were added to each level In mid-fusion, we can distinguish two types of multi-pathway
of the decoder to deal with small target regions, giving network architectures based on the following fusion strategy:
more attention to them and penalizing voxels belonging single-layer or multi-layer fusion.
to the background. In [184], the modalities were fused Multi-modal segmentation networks based on single-layer
as multi-channel inputs and passed through an adversarial fusion generally employ encoder-decoder architectures where
network (Sect.IV-A). The generator is a 3D residual U-Net each modality has its own encoder, with no interactions
that performs the segmentation while the discriminator between them, and a single decoder. They mainly differ on
distinguishes between generated segmentation and ground the conducted fusion operation and are typically carried out
truth masks. An extra constraint was added via active contour via concatenation, addition, averaging or convolution. For
modeling by measuring the dissimilarity between ground instance, Havaei et al. used in [186] modality-specific convolu-
truth and prediction contours. To handle the class imbalance tional layers to later compute for each feature map the first and
problem, Zhou et al. carried out in [185] a coarse-to-fine second moments. Then, the moments were concatenated and
segmentation inspired on model cascades for brain tumor processed by further convolutional layers, yielding the final
segmentation. The main difference with previous works lies segmentation. For their part, Tseng et al. took in [187] the
in applying only a one-pass multi-task network (OM-Net) that encoded representation from each modality and performed a
performs three tasks that are gradually introduced in an order cross-modal convolution to combine the spatial information of
of increasing difficulty based on curriculum learning. The each feature map, modeling the correlations among them. In-
16

spired by the success of the attention mechanism (Sect.IV-E), computed by the branches are mapped into a common feature
recent fusion strategies incorporated spatial and channel-wise space via fusion operations, (e.g. concatenation, averaging,
attention to learn more informative features among modal- weighted voting), followed by series of convolutional layers
ities [188, 189]. To name a few, Zhou et al. proposed in [193]. The final output of late fusion can be formulated as
[188] a three-stage segmentation network. In the first stage, y = Λ(φ1 (x1 ), φ2 (x2 ), . . . , φm (xm )) where Λ is the fusion op-
a 3D U-Net architecture got rough mask predictions. Then, eration. Thus, common features learned by the transformation
binarization and erosion operations were used to obtain the network are considered as a further refinement of decoding
context constraints for the following stage. The second stage and prediction. Some conventional layer-level methods as
consisted of a multi-encoder-based framework where each [194] are thus categorized into late fusion strategy. Many
encoder produces a modality-specific latent representation that late fusion strategies have been proposed. Most of them are
is further fused with the assistance of attention mechanisms. based on averaging or majority voting. For averaging strategy,
This process was repeated for each structure to be segmented. Kamnitsas et al. trained in [72] three networks separately and
In the third stage, a two-encoder-based 3D U-Net segmen- then averaged the confidence of the individual networks. The
tation network was applied to combine and refine the three final segmentation was obtained by assigning each voxel with
single prediction results. A correlation block to discover the the highest confidence. For the majority voting strategy, the
latent correlation between modalities was introduced in [189], final label of a voxel depends on the majority of the labels
followed by a dual attention block that consists of a modality of the individual networks. The statistical properties of the
attention module and a spatial attention module. In this way, different modalities are different, which makes it difficult for
the network is encouraged to learn the most correlated features a single model to directly find correlations across modalities.
across modalities and more useful spatial information to boost Therefore, in a decision-level fusion scheme, the multiple
segmentation results. Despite the great results of single-layer segmentation networks can be trained to fully exploit multi-
fusion schemes, there is no complete freedom to learn within modal features. On the other hand, Zhang et al. proposed
and in-between modalities due to its single level of abstraction. in [194] a modality-aware module that fused the modality-
Regarding multi-layer fusion, it extends the idea of specific models at a high semantic level. Specifically, each
residual learning in multi-modal frameworks allowing skip- modality was embedded by a different modality-specific FCN.
connections that by-pass spatial features between modalities Then, the outputs of FCN models were fused and passed to
[190–192]. Therefore, low-level and high-level features an attention module to generate a modality-specific attention
are fused at different levels of abstraction, increasing the map to adaptively measure the contributions of each modality.
learning capabilities of the network to capture complex cues Moreover, they designed a mutual learning strategy to enable
across modalities. Li et al. proposed in [190] four dilated interactive knowledge transfer, where the modalities interact
Inception blocks consisting of three dilated convolutional as teacher and student simultaneously. In the same line,
layers for each modality. In this way, the receptive field of Zhang et al. employed in [192] a transfer knowledge strategy
the network was expanded without losing resolution, while across modalities that differs from previous works in the use
multi-scale features were also learned. In order to obtain of GAN. The authors applied cycleGAN [195] to capture
the final segmentation, the features at different levels were the knowledge transition across modalities. Each generator
concatenated and up-sampled. On the other hand, Dolz et al. represented a single-modality feature learning branch. Then,
proposed in [191] HyperDenseNet, a 3D model where each they were merged by extra convolution layers followed by an
modality has its own path. Dense connections not only occur attention block to learn powerful fusion features. The intuition
between the layers within the same stream but also across behind the use of GAN is that GAN models can learn the
modalities. Thus, the network can learn more powerful feature modality patterns of each modality and their content patterns.
representations at all levels of abstraction. To encode more Mid and late-fusion can achieve better performance
rich contextual information across modalities, Zhang et al. because each modality is employed as an input of one
developed in [192] a cross-modal self-attention distillation network that can learn complex and complementary feature
network. The model extracted attention maps of intermediate information compared to an input-level fusion network.
layers to further perform layer-wise attention distillation However, they require more memory due to the use of
among modalities. Significant spatial information can be multiple networks. Therefore, the trade-off between accuracy
distilled from an attention map of one modality and then and execution time should be carefully considered. Despite
used to ease attention learning of the other modalities. Fusing the impressive advances reached in the field of multi-modal
multi-modal contextual information at multi-layer stages image segmentation, collecting large sets of paired images
represents the current trend. Moreover, semantic guiding is often either prohibitively expensive or not possible. As a
across modalities by attention mechanisms can be used to result, techniques that make use of unpaired datasets have
bridge early feature extraction and late decision-making. attracted increasing attention in cross-modal segmentation.

3) Paired data and late fusion: Similar to mid- 4) Unpaired data and domain adaptation: When only
fusion, late fusion separately processes multi-modal data unpaired datasets are available, cross-modal segmentation is
(x1 , x2 , . . . , xm ) with the difference that the segmentation commonly managed by domain adaptation (DA) techniques.
branches (φ1 , φ2 , . . . , φm ) are integrated at the decision level. Let Xm × Ym represent the joint feature space and the corre-
More precisely, during the decoding stage, all feature maps sponding label space for a specific modality m. A domain can
17

be formulated as Dm = {Xm , P (xm )} where Xm is the feature

space and P (xm ) the marginal probability distribution of the
data xm . Let us assume that we have two domains (e.g. from
two different modalities): a source domain DS = {XS , P (xS )}
and a target domain DT = {XT , P (xT )}. The DA problem
is formulated as a sub-class of transfer learning where the
label spaces in the source and target domains are the same
YS = YT = Y but where the domains are different DS 6= DT .
In cross-modal image segmentation, the feature spaces be-
Fig. 11. Multi-domain segmentation with shared convolutional kernels and
tween source and target domains are identical (i.e. XS = XT ), domain-specific feature normalization, as employed in [35, 200].
differing only in data distribution (i.e. P (xS ) 6= P (xT )).
Hence, the goal is to learn a segmentation function f (·) that
or acquisition protocols. It is worth mentioning that variations
performs well in both source and target domains by finding
can even happen in a single center since both clinical practices
a transformation T (·) such that P (T (xS )) = P (T (xT )). The
and imaging systems may significantly evolve over time. To
previous single-source single-target definition can be extended
deal with this diversity, segmentation pipelines can integrate
to multi-source single-target or single-source multi-target DA.
an adversarial network to learn domain-invariant features [62],
Three groups of DA techniques are used for cross-modal im-
exploit transfer learning and fine-tuning between domains
age segmentation: supervised domain adaption in which both
[199], share a common decoder (resp. encoder) while using
labeled source and target data are available {XS , YS , XT , YT },
domain-specific encoders (resp. decoders) [107] or share their
semi-supervised domain adaptation in which labeled source
latent space only [107, 200]. In particular, the very different
data in addition to some labeled target data are available,
statistical distributions between unpaired multi-modal images
and unsupervised domain adaptation (UDA) in which both
making the task of learning shared representations challeng-
labeled source data and unlabeled target data are available
ing has motivated the design of a single encoder-decoder
{XS , YS , XT }. Most related works in cross-modal image seg-
segmentation network in [200] through shared convolutional
mentation are based on single-source single-target UDA [196]
kernels but domain-specific feature normalization (Fig.11).
using reconstruction-based methods, domain-invariant feature
Indeed, modality-agnostic kernels can extract expressive uni-
learning [197] and more largely GAN models [198].
versal representations across domains only if the features
The main drawbacks of existing approaches deal with their
are well-calibrated upstream. Let us consider D domains
limited scalability and robustness in handling more than two
{d1 , d2 , . . . , dD }. Let v ji,l,m be the mth feature map from the
domains since they rely on pairwise alignment using GAN
lth layer produced by the ith image arising from the intensity
(cycleGAN or similar [198]). Hence, different models should
domain dj . The calibration can be performed through domain-
be built independently for every pair of image domains which
specific batch normalization (DSBN) [35, 200], following:
demands high computational resources. Lastly, either paired
or unpaired methods are application-dependent which limits
their transferability between different clinical settings. vj − µjl,m
DSBNαj j (vv ji,l,m ) = αl,m
j
qi,l,m j
+ βl,m (22)
l,m ,βl,m j 2
(σl,m ) +
B. Multi-domain segmentation
where µjl,m and σl,m
j
are respectively the domain-specific mean
A strong assumption in the way deep segmentation pipelines
and standard deviation computed for images from domain
are usually designed and evaluated is that both training and
dj belonging to a given batch. is a scalar value that is
test data arise from the same probability distribution. Their
used to reach numerical stability. DSBN layers are therefore
accuracy usually degrades when applied to new (i.e. unseen)
defined by trainable domain-specific shift and scale weights
data that differ from the training data [62]. Instead of designing j j
{αl,m , βl,m }l,m set for each feature map of each layer.
pipelines specific to a given intensity domain, an emerging ap-
Rather than focusing on a given anatomical target across
plication consists in training a deep segmentation model over
various intensity domains, developing a single multi-task
multiple intensity domains [35, 107, 199, 200]. The underlying
multi-domain network can enable simultaneously segmenting
assumption is that exploiting the redundancy between multiple
multiple anatomies while leveraging shared features between
intensity domains can enable the extraction of robust domain-
various domains and datasets [141]. In this direction, shared
invariant feature representations to finally achieve better per-
convolutional kernels and domain-specific feature normaliza-
formance than domain-specific (i.e. marginal) computational
tion from [200] were combined in [35] with both contrastive
models. Managing various domains can partially solve the
(Sect.V-D) and shape regularizations (Sect.IV-C) to segment
issue of dealing with the scarcity of imaging resources [199].
bone structures from multiple scarce pediatric datasets.
The improved generalization abilities of the resulting models
are a further step to facilitate their integration into routine.
In practice, intensity domains can be very different in C. Self-supervised learning
nature: multi-center, multi-scanner, multi-modal (Sect.VI-A), The need for a large amount of annotated training data is
or multi-protocol. Reasons explaining the acquisition shift a strong constraint given the complexity of reaching a sig-
include differences in imaging systems, reconstruction settings nificantly well-annotated medical imaging dataset. Generative
18

models were therefore used to increase the number of training labeled training dataset to re-train the model. This updated
samples through image synthesis [201]. As an alternative, the model was used to generate pseudo-labels for another batch
availability of a reduced dataset only motivated researchers to of unlabeled images and so on. This process was repeated up
exploit the power of unlabeled images which may be easier to obtain efficient performance in COVID-19 lung infection
to collect. Unsupervised learning approaches assuming that a CT segmentation. However, the created pseudo-labels usually
related but unlabeled large dataset is available aim at learning do not have the same quality as ground truth labels, which may
transferable feature representations from unlabeled images. limit their potential for improvements from unlabeled data.
In particular, self-supervised learning can consist of pre- As a powerful alternative, one may adopt an auxiliary task
training the model (or any of its constituents) by means of on unlabeled data to facilitate performing image segmentation
pretext tasks to finally be more able to efficiently delineate with limited labeled data. In this direction, Chen et al. pro-
the targeted structures [36]. With the hypothesis of good posed in [145] a semi-supervised image segmentation method
generalization ability of self-learned features, Taleb et al. that simultaneously optimizes both supervised segmentation
investigated the effectiveness of several pretext tasks for 3D and unsupervised reconstruction objectives. The reconstruction
medical image segmentation purposes: rotation prediction, task had the particularity to exploit an attention mechanism
jigsaw puzzles, relative patch location... More complex pretext that separated the reconstruction of image regions correspond-
tasks such as semantic inpainting revealed their effectiveness ing to different classes. Such a simple yet effective multi-task
for better solving downstream segmentation tasks [202]. learning scheme (Sect.V-B) achieved strong improvements for
Leaving aside self-supervised pre-text tasks, self-supervised brain tumor and white matter hyper-intensities segmentation.
contrastive learning (Sect.V-D) is another manner adopted in
the medical image analysis area to learn expressive feature
E. Federated Learning
representations from unlabeled images [203]. In this direction,
Chaitanya et al. proposed in [32] a local contrastive loss able to Collecting large medical image datasets is a difficult and
capture local features to provide complementary information time-consuming task for research needs. Accurate labeling of
and therefore boost the segmentation accuracy. these images requires clinical experience and is challenging to
obtain. Many imaging centers have large image datasets, but
many of them are unorganized or poorly annotated in spite
D. Semi-supervised learning of their richness regarding deep model training [209, 210]. In
Given the usual scarcity of many existing annotated medical addition, medical images are usually linked to personal health
dataset and apart from transfer learning [204] whose goal is information related to the patient. Data protection to prevent
to learn from related learning problems, researchers have also sharing sensitive patient data is essential when working with
explored semi-supervised learning approaches to exploit the multiple medical institutions, in a collaborative manner.
availability of unlabeled datasets. Among the existing semi- To solve this issue, federated learning (FL) enables dis-
supervised strategies [37], semi-supervised consistency regu- tributed training of DL models without really sharing data
larization is commonly employed through the use of a mean between multiple clinical institutions. Fig.12 shows the general
teacher model [205]. In [206], a novel uncertainty-aware semi- framework of federated learning. To work in a collaborative
supervised learning framework was proposed and evaluated fashion, FL allows various clinical institutes or hospitals to
for left atrium segmentation from MR images. Teacher and work in coordination by using a central server. Each hospital
student models were built in such a way that the student keeps an individual model which focuses on the local data
model learned from the teacher model by minimizing the only. Before the training process, each institution submits a
segmentation loss on the labeled data as well as a consistency request to download the global model from a central server.
loss with respect to the targets from the teacher model on all The requested query is then approved by the central server
data (i.e. labeled and unlabeled). The predicted targets from and the global model weights can be downloaded. Once the
the teacher model being potentially unreliable and noisy on training process is executed, the local client model weights
unlabeled data, Yu et al. designed an uncertainty-aware mean are sent to the central server for updating purposes. The
teacher framework [206], where the student model gradually central server aggregates the feedback received from individual
learned from the meaningful and reliable targets by exploiting institutions and updates the global model weights based on
the uncertainty information arising from the teacher model pre-defined rules. These rules permit the model to measure
(Sect.V-C). To better deal with noisy labels for COVID-19 the quality of the feedback obtained from the client servers.
pneumonia lesion segmentation, the main novelty in [207] More and more research works in medical image segmenta-
was to propose two mechanisms: an adaptive teacher that tion involve a FL scheme [211]. Recently, Xu et al. introduced
suppresses the contribution of the student when the latter has in [212] a new federated cross-learning segmentation approach
a large training loss and an adaptive student that learns from that handles data that are not independently and identically
the teacher only when the teacher outperforms the student. distributed. Unlike the conventional FL methods that combine
As followed in [208], semi-supervised pseudo labeling is multiple individually trained local models on a server node,
another strategy to deal with both annotated and unlabeled the proposed method named FedCross consecutively trained
data. Thus, Fan et al. generated pseudo-labels by relying the global model across multiple clients in a round-robin
on a first training with 50 labeled images only. The newly fashion. The authors also suggested a new federated cross-
pseudo-labeled examples were then included in the original ensemble learning technique that together trains and sets up
19

Client model Client model

Hospital I Hospital III

+ Central server +

Data
+ Data

downloading

uploading
+ +

Data Data

Client model Client model

Global model
Hospital II Hospital IV

Fig. 12. General framework of federated learning. Refer to Sect.VI-E for further details.

various models. Wicaksana et al. [213] proposed FedMix, an annotation process to select the most useful samples to train a
FL strategy that employed mixed image labels specifically to DL-based model. This is particularly useful in medical image
segment anatomical region-of-interest from medical images. segmentation, as manually labeling large amounts of medical
These labels incorporated substantial pixel-wise annotations, images is time-consuming and costly [52]. By using AL,
weak bounding boxes and image-wise class annotations. The the model can learn to accurately segment images with less
authors initially created the pseudo annotations through clients human inputs, making the process more efficient and cost-
and employed refinement under supervision to enrich pseudo- effective. Additionally, AL allows the model to focus on the
annotations. Later, each client contained high-quality data most difficult and important samples, resulting in improved
which was determined using active sample picking for local delineation performance. Nevertheless, choosing the best data
model updates. Based on the quantity and quality of data, enabling the improvement of the model learning capability
the approach provided updates through dynamic aggregation remains challenging. In particular, there are multiple methods
techniques which allow for modification of each local client’s to measure informativeness which mainly involves uncertainty
weight. FedMix was validated on breast tumor segmentation and representativeness criteria [39]. DL-based segmentation
from ultrasound and skin lesion segmentation. methods are able to measure uncertainty (Sect.V-C) to some
Wu et al. [38] introduced a new federated contrastive learn- extent. Computing for each voxel the sum of the lowest class
ing (FCL) framework for 3D volumetric image segmentation probability is one of the simplest manners. If the prediction is
that requires limited annotations only. The local clients first uncertain, an increased number of annotated data is required
started with learning a shared encoder to spread unlabeled to exploit richer feature information. On the contrary, rep-
images. Later, annotated images were incorporated to fine- resentativeness deals more with choosing the samples from
tune the model. Through feature exchange in which each distinct regions of the data distribution such that the variability
client exchanges the features (i.e. low-dimensional vectors) of among the whole dataset is taken into account. In this context,
its local data with other clients, the approach enables better a good balance between exploration and exploitation among
local contrastive learning while avoiding raw data sharing. A the distribution is highly desirable.
global structural matching technique was developed to learn
To name a few, a cascaded 3D U-Net with CNN-correction
the structural similarity of encoded features with suitable
label curation was employed in [214] for kidney segmentation
representations to be shared with other remote clients.
from abdominal CT images in order to save the annotation
More globally, FL has shown potential for improving the
efforts and improve the segmentation outcomes. AL was
accuracy of medical image segmentation while protecting the
concluded to be able to reduce labeling efforts through CNN-
privacy of individual patient data. By offering scalability, flex-
corrected segmentation and increase training efficiency by
ible training scheduling and large training datasets via multi-
iterative learning with limited data. Shen et al. presented in
site collaborations [210], FL combines essential conditions to
[215] an AL approach able to alleviate the image annota-
consider increasing its deployment in various clinical settings.
tion issues towards brain tumor segmentation. The authors
combined both uncertainty and representativeness information
F. Active learning to ensure that AL selects enough informative and diverse
Active learning (AL) is a learning technique that involves data. Contrary to existing studies based on uncertainty or
training a model on a small, initial set of labeled data and representativeness estimated at the scale of a single image,
then iteratively selecting new data to be labeled and added Yan et al. scored in [216] dual-view mammograms according
to the training set (Fig.13). Thus, it assists annotators in the to their prediction consistency, towards better breast mass
20

Sample selection using active learning

tackling various diseases, anatomies and imaging modali-
ties. The availability of large datasets and open-source DL
frameworks has facilitated the development and deployment
annotate for of DL-based segmentation algorithms, making them more
training Intermediate accessible to researchers and clinicians. The success of DL
network
in medical image segmentation has been thus demonstrated
Annotators in a variety of studies including whole or sub-structure or-
Unlabeled data gan segmentation, abnormality extraction or vascular system
delineation. Especially, lesion segmentation is increasingly
predict Trained cut-off benefiting from the availability of combined anatomical and
Predictions
network nuclear imaging. More globally, medical image segmentation
with DL stars to have a concrete impact and to play a key
Fig. 13. General framework of active learning. See Sect.VI-F for more details. role in diagnosis, surgery or therapeutic planning, follow-up,
prognostic, dosimetric or radiomics applications at large.
delineation. A future possible extension is to integrate multiple Since the introduction of U-Net and U-shaped convolutional
single and dual-view criteria to reach a unified AL system. encoder-decoder derivatives with data augmentation and en-
More globally, combining the strengths of AL and human-in- coder pre-training, various developments and methodological
the-loop computing [39] into end-to-end systems should play breakthrough have emerged in the medical image analysis
an increasingly important role in the upcoming years. community. Among current trends, the relevancy of condi-
tional generative adversarial networks, cascaded networks,
G. Lightweight networks deep supervision and attention mechanisms have been proven
to enable the improvement of segmentation accuracy for both
Despite high performance, applying CNN or Transformers-
large and small anatomical or pathological structures. Regular-
based networks for 3D medical image segmentation is compu-
ization techniques embedding prior knowledge such as shape,
tationally expensive. For instance, 3D convolution layers made
topological or adjacency constraints tend to be democratized
of many filters involve a large set of parameters to train as well
towards greater robustness and generalizability of deep seg-
as a huge amount of floating point operations (FLOP) [40].
mentation models. Additionally to novel architecture designs
Medical images themselves obviously require a large compu-
and learning paradigms, a significantly strong focus has been
tational storage cost. However, the device on which a given
recently devoted to both data management and optimization
3D model is deployed may have limited computational power,
processes through the development of unified frameworks
making the deployment process hard in clinical routine. In this
such as nnU-Net whose popularity is steadily growing. Given
context, the development of lightweight deep segmentation
the substantial progress made in recent years, considering a
models with smaller model sizes, lower computational cost
standard U-Net as the sole baseline no longer seems relevant.
and inference time has recently attracted increasing attention.
Despite these successes, there are still challenges to the
First attempts investigated depth-separable convolutions [66]
use of DL for medical image segmentation. These challenges
consisting in replacing 3D convolution kernel 3 × 3 × 3
include the need for large amounts of labeled data for training,
with 1 × 3 × 3 intra-slice and 3 × 1 × 1 inter-slice con-
the sensitivity of DL models to noise, non-uniform contrast
volutions, combinations between point-wise, group-wise and
and artifacts in medical images, the needed incorporation of
dilated convolutions [63] or the reduction of channels [217].
local and global context to benefit from both short- and long-
However, obtaining the same performance as computationally
range spatial dependencies, the management of small struc-
expensive heavy-weight models is tedious and resolving the
tures and weak boundaries, the robustness to inter-subject vari-
trade-off between trainable parameters and performance re-
ability and various multi-center, multi-scanner, multi-modal or
mains a challenge. Towards a better trade-off, Zhao et al.
multi-protocol intensity domains as well as the risks for biased
managed in [218] the limited feature learning ability of spatio-
or non-generalizable results.
temporal separable convolutions via an attention-based feature
Given these challenges, the use of DL has shown great
calibration mechanism providing more contextual information
promise in line with the emergence of vision Transformers
with a larger receptive field.
whose ability to model long-range dependencies from 3D
The knowledge distillation mechanism fully described in
medical images appears better than standard convolutional
Sect.V-E which tends to distill information from a well-
only architecutres. Either hybrid when used in conjunction
trained cumbersome teacher network to a lightweight and
with convolutional layers or purely Transformers-based, these
compact student network [33] could be also seen as a powerful
approaches are still at an early stage. More works in this
alternative towards lightweight segmentation networks.
direction are expected, especially in the context of low-
data regimes and cross-modal analysis. The great potential
VII. D ISCUSSION of multi-task learning, constrastive learning and knowledge
Deep learning (DL) has proven to be a powerful tool for distillation which are likely to emerge can be also emphasize
medical image segmentation. Its ability to automatically learn as powerful trends to follow. Multi-task learning enables to
complex and hierarchical representations from data enabled share information and exploit fruitful cooperation between
to achieve a high level of robustness in segmentation tasks connected or auxiliary tasks while contrastive learning tends
21

to strengthen the extraction of distinctive and disentangled R EFERENCES

representations. Against computationally expensive scenarios,
[1] A. Saporta, T.-H. Vu, M. Cord, and P. Pérez, “Multi-target adversar-
knowledge distillation techniques have proven great skills to ial frameworks for domain adaptation in semantic segmentation,” in
distill information from a cumbersome teacher network to a IEEE/CVF International Conference on Computer Vision, 2021, pp.
lightweight student network. Uncertainty modelling is another 9072–9081.
[2] B. E. Nelms, W. A. Tomé, G. Robinson, and J. Wheeler, “Variations
important path to study as it may improve the learning pro- in the contouring of organs at risk: test case from a patient with
cess and provide clinicians with locally-estimated confidence oropharyngeal cancer,” International Journal of Radiation Oncology,
information. Further research and development beyond the Biology, Physics, vol. 82, no. 1, pp. 368–378, 2012.
[3] X. Zhuang, K. S. Rhode, R. S. Razavi, D. J. Hawkes, and S. Ourselin,
application of off-the-shelf DL solutions are needed to address “A registration-based propagation framework for automatic whole heart
the above-mentioned challenges and enable a wider adoption segmentation of cardiac MRI,” IEEE Transactions on Medical Imaging,
of image segmentation with DL into clinical routine. vol. 29, no. 9, pp. 1612–1625, 2010.
[4] N. Decaux, P.-H. Conze, J. Ropars, X. He, F. T. Sheehan, C. Pons,
To bridge the gap between DL paradigms and clinical D. B. Salem, S. Brochard, and F. Rousseau, “Semi-automatic muscle
needs, recent investigations have struggle with novel and segmentation in MR images using deep registration-based label prop-
concrete emerging applications. Among these applications, agation,” Pattern Recognition, p. 109529, 2023.
[5] T. Heimann and H.-P. Meinzer, “Statistical shape models for 3D
cross-modality segmentation has gained in popularity in order medical image segmentation: A review,” Medical Image Analysis,
to fully exploit both complementary and redundancy across vol. 13, no. 4, pp. 543–563, 2009.
modalities when managing paired or unpaired multi-modal [6] S. Kim, D. Lee, S. Park, K.-S. Oh, S. W. Chung, and Y. Kim, “Auto-
matic segmentation of supraspinatus from MRI by internal shape fitting
datasets. More globally, a special attention has been paid in and autocorrection,” Computer Methods and Programs in Biomedicine,
recent years to multi-domain segmentation strategies which are vol. 140, pp. 165–174, 2017.
far more relevant than focusing on multiple intensity domains [7] J. E. Iglesias and M. R. Sabuncu, “Multi-atlas segmentation of biomed-
ical images: a survey,” Medical Image Analysis, vol. 24, no. 1, pp.
separately. In this direction, multi-task and multi-domain 205–219, 2015.
techniques with multiple anatomies as targets should deserve [8] K. Van Leemput, F. Maes, D. Vandermeulen, and P. Suetens, “Auto-
further investigation in the near future. Given the complexity mated model-based tissue classification of MR images of the brain,”
IEEE Transactions on Medical Imaging, vol. 18, no. 10, pp. 897–908,
of collecting and annotating a large amount of medical images, 1999.
self-supervised, semi-supervised and active learning are sub- [9] V. Barra and J.-Y. Boire, “Segmentation of fat and muscle from MR
fields of clear progress. However, more research efforts are images of the thigh by a possibilistic clustering algorithm,” Computer
Methods and Programs in Biomedicine, vol. 68, no. 3, pp. 185–193,
needed to maximize or avoid the time-consuming and costly 2002.
manual efforts made by clinical experts. Since medical data is [10] S. Purushwalkam, B. Li, Q. Meng, and J. McPhee, “Automatic seg-
often sensitive and subject to strict regulations on sharing, fed- mentation of adipose tissue from thigh magnetic resonance images,” in
International Conference Image Analysis and Recognition, 2013, pp.
erated learning now offers the possibility for multiple hospitals 451–458.
and research institutions to collaborate by training a shared [11] S. Zhou, J. Wang, S. Zhang, Y. Liang, and Y. Gong, “Active contour
model on their own local data while keeping the data private model based on local and global intensity information for medical
image segmentation,” Neurocomputing, vol. 186, pp. 107–118, 2016.
and secure. Thus, federated learning enables the use of larger [12] P.-H. Conze, V. Noblet, F. Rousseau, F. Heitz, V. de Blasi, R. Memeo,
and more diverse datasets, resulting in improved segmentation and P. Pessaux, “Scale-adaptive supervoxel-based random forests for
performance. Finally, the development of lightweight models liver tumor segmentation in dynamic contrast-enhanced CT scans,”
International Journal of Computer Assisted Radiology and Surgery,
with few memory and computational resource requirements vol. 12, no. 2, pp. 223–233, 2017.
will for sure be beneficial to ease the deployment of DL-based [13] M. H. Hesamian, W. Jia, X. He, and P. Kennedy, “Deep learning tech-
solutions on computationally-limited platforms. niques for medical image segmentation: achievements and challenges,”
Journal of Digital Imaging, vol. 32, no. 4, pp. 582–596, 2019.
Overall, the potential for bias in DL approaches is a com- [14] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional net-
mon concern across medical image analysis tasks including works for biomedical image segmentation,” in International Conference
segmentation. In this context, encouraging the collection of on Medical Image Computing and Computer-Assisted Intervention,
2015, pp. 234–241.
large and diverse datasets through collective work with various [15] F. Shamshad, S. Khan, S. W. Zamir, M. H. Khan, M. Hayat, F. S.
experts is highly recommended. The design of novel evaluation Khan, and H. Fu, “Transformers in medical imaging: A survey,” arXiv
metrics reflecting the clinical applicability is also an area for preprint arXiv:2201.09873, 2022.
[16] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi,
improvement. Finally, demonstrating a better reproducibility M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I.
when designing DL pipelines could increase the trust and Sánchez, “A survey on deep learning in medical image analysis,”
confidence of researchers and clinicians and make them more Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[17] L. Liu, J. M. Wolterink, C. Brune, and R. N. Veldhuis, “Anatomy-aided
suitable for large-scale clinical applications. deep learning for medical image segmentation: a review,” Physics in
Medicine & Biology, vol. 66, no. 11, 2021.
ACKNOWLEDGMENTS [18] H. Kaur, N. Kaur, and N. Neeru, “Evolution of multiorgan segmentation
techniques from traditional to deep learning in abdominal CT images:
This work benefited from state aid managed by the National A systematic review,” Displays, 2022.
Research Agency under the Future Investment Program bear- [19] S. Niyas, S. Pawan, M. A. Kumar, and J. Rajan, “Medical image
ing the reference ANR-17-RHUS-0005 (FollowKnee project). segmentation with 3D convolutional neural networks: A survey,” Neu-
rocomputing, vol. 493, pp. 397–413, 2022.
This work was also partially funded by France Life Imaging [20] N. Siddique, S. Paheding, C. P. Elkin, and V. Devabhaktuni, “U-Net
(grant ANR-11-INBS-0006). All authors declare that they have and its variants for medical image segmentation: A review of theory
no known conflicts of interest in terms of competing financial and applications,” IEEE Access, vol. 9, pp. 82 031–82 057, 2021.
[21] X. Chen, X. Wang, K. Zhang, K.-M. Fung, T. C. Thai, K. Moore, R. S.
interests or personal relationships that could have an influence Mannel, H. Liu, B. Zheng, and Y. Qiu, “Recent advances and clinical
or are relevant to the work reported in this paper. applications of deep learning in medical image analysis,” Medical
22

Image Analysis, 2022. organ segmentation: The FLARE challenge,” Medical Image Analysis,
[22] V. K. Singh, H. A. Rashwan, S. Romani, F. Akram, N. Pandey, p. 102616, 2022.
M. M. K. Sarker, A. Saleh, M. Arenas, M. Arquez, D. Puig et al., [41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
“Breast tumor segmentation and shape classification in mammograms arXiv preprint arXiv:1412.6980, 2014.
using generative adversarial and convolutional neural network,” Expert [42] J. Ma, J. Chen, M. Ng, R. Huang, Y. Li, C. Li, X. Yang, and A. L.
Systems with Applications, vol. 139, p. 112855, 2020. Martel, “Loss odyssey in medical image segmentation,” Medical Image
[23] H. R. Roth, C. Shen, H. Oda, T. Sugino, M. Oda, Y. Hayashi, K. Mi- Analysis, vol. 71, p. 102035, 2021.
sawa, and K. Mori, “A multi-scale pyramid of 3D fully convolutional [43] D. Ciresan, A. Giusti, L. Gambardella, and J. Schmidhuber, “Deep
networks for abdominal multi-organ segmentation,” in International neural networks segment neuronal membranes in electron microscopy
Conference on Medical Image Computing and Computer-Assisted images,” Advances in Neural Information Processing Systems, vol. 25,
Intervention, 2018, pp. 417–425. 2012.
[24] A. Boutillon, B. Borotikar, V. Burdin, and P.-H. Conze, “Multi-structure [44] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
bone segmentation in pediatric mr images with combined regularization for semantic segmentation,” in IEEE Conference on Computer Vision
from shape priors and adversarial network,” Artificial Intelligence in and Pattern Recognition, 2015, pp. 3431–3440.
Medicine, vol. 132, p. 102364, 2022. [45] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional
[25] H. Dou, D. Karimi, C. K. Rollins, C. M. Ortinau, L. Vasung, neural networks for volumetric medical image segmentation,” in Inter-
C. Velasco-Annis, A. Ouaalam, X. Yang, D. Ni, and A. Gholipour, “A national Conference on 3D Vision, 2016, pp. 565–571.
deep attentive convolutional neural network for automatic cortical plate [46] P.-H. Conze, S. Brochard, V. Burdin, F. T. Sheehan, and C. Pons,
segmentation in fetal MRI,” IEEE Transactions on Medical Imaging, “Healthy versus pathological learning transferability in shoulder muscle
vol. 40, no. 4, pp. 1123–1133, 2020. MRI segmentation using deep convolutional encoder-decoders,” Com-
[26] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, puterized Medical Imaging and Graphics, vol. 83, p. 101733, 2020.
K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention [47] Z. Zhou, Z. He, and Y. Jia, “AFPNet: A 3D fully convolutional
U-Net: Learning where to look for the pancreas,” arXiv preprint neural network with atrous-convolution feature pyramid for brain tumor
arXiv:1804.03999, 2018. segmentation via MRI images,” Neurocomputing, vol. 402, pp. 235–
[27] J. J. Cerrolaza, M. L. Picazo, L. Humbert, Y. Sato, D. Rueckert, 244, 2020.
M. Á. G. Ballester, and M. G. Linguraru, “Computational anatomy for [48] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++:
multi-organ analysis in medical imaging: A review,” Medical Image Redesigning skip connections to exploit multiscale features in image
Analysis, vol. 56, pp. 44–67, 2019. segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6,
[28] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier- pp. 1856–1867, 2019.
Hein, “nnU-Net: a self-configuring method for deep learning-based [49] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “MixUp:
biomedical image segmentation,” Nature Methods, vol. 18, no. 2, pp. Beyond empirical risk minimization,” in International Conference on
203–211, 2021. Learning Representations, 2018.
[29] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. [50] E. Panfilov, A. Tiulpin, S. Klein, M. T. Nieminen, and S. Saarakkala,
Yuille, and Y. Zhou, “TransUNet: Transformers make strong encoders “Improving robustness of deep learning based knee MRI segmentation:
for medical image segmentation,” arXiv preprint arXiv:2102.04306, MixUp and adversarial domain adaptation,” in IEEE/CVF International
2021. Conference on Computer Vision Workshops, 2019.
[30] L. Song, J. Lin, Z. J. Wang, and H. Wang, “An end-to-end multi-task [51] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-
deep learning framework for skin lesion analysis,” IEEE Journal of Paz, and Y. Bengio, “Manifold MixUp: Better representations by
Biomedical and Health Informatics, vol. 24, no. 10, pp. 2912–2921, interpolating hidden states,” in International Conference on Machine
2020. Learning, 2019, pp. 6438–6447.
[31] G. Wang, W. Li, M. Aertsen, J. Deprest, S. Ourselin, and T. Ver- [52] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding,
cauteren, “Aleatoric uncertainty estimation with test-time augmentation “Embracing imperfect datasets: A review of deep learning solutions
for medical image segmentation with convolutional neural networks,” for medical image segmentation,” Medical Image Analysis, vol. 63, p.
Neurocomputing, vol. 338, pp. 34–45, 2019. 101693, 2020.
[32] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu, “Contrastive [53] S. Amiri and B. Ibragimov, “Improved automated lesion segmenta-
learning of global and local features for medical image segmentation tion in whole-body FDG/PET-CT via test-time augmentation,” arXiv
with limited annotations,” Advances in Neural Information Processing preprint arXiv:2210.07761, 2022.
Systems, vol. 33, pp. 12 546–12 558, 2020. [54] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A.
[33] D. Qin, J.-J. Bu, Z. Liu, X. Shen, S. Zhou, J.-J. Gu, Z.-H. Wang, Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester et al.,
L. Wu, and H.-F. Dai, “Efficient medical image segmentation based “Deep learning techniques for automatic mri cardiac multi-structures
on knowledge distillation,” IEEE Transactions on Medical Imaging, segmentation and diagnosis: is the problem solved?” IEEE Transactions
vol. 40, no. 12, pp. 3820–3831, 2021. on Medical Imaging, vol. 37, no. 11, pp. 2514–2525, 2018.
[34] T. Zhou, S. Ruan, and S. Canu, “A review: Deep learning for medical [55] S. S. M. Salehi, D. Erdogmus, and A. Gholipour, “Auto-context con-
image segmentation using multi-modality fusion,” Array, vol. 3, p. volutional neural network (Auto-Net) for brain extraction in magnetic
100004, 2019. resonance imaging,” IEEE Transactions on Medical Imaging, vol. 36,
[35] A. Boutillon, P.-H. Conze, C. Pons, V. Burdin, and B. Borotikar, no. 11, pp. 2319–2330, 2017.
“Generalizable multi-task, multi-domain deep segmentation of sparse [56] Y. Wang, Z. Deng, X. Hu, L. Zhu, X. Yang, X. Xu, P.-A. Heng,
pediatric imaging datasets via multi-scale contrastive regularization and and D. Ni, “Deep attentional features for prostate segmentation in
multi-joint anatomical priors,” Medical Image Analysis, p. 102556, ultrasound,” in International Conference on Medical Image Computing
2022. and Computer-Assisted Intervention, 2018, pp. 523–530.
[36] A. Taleb, W. Loetzsch, N. Danz, J. Severin, T. Gaertner, B. Bergner, [57] P. F. Christ, M. E. A. Elshaer, F. Ettlinger, S. Tatavarty, M. Bickel,
and C. Lippert, “3D self-supervised methods for medical imaging,” P. Bilic, M. Rempfler, M. Armbruster, F. Hofmann, M. D’Anastasi
Advances in Neural Information Processing Systems, vol. 33, pp. et al., “Automatic liver and lesion segmentation in CT using cas-
18 158–18 172, 2020. caded fully convolutional neural networks and 3D conditional random
[37] V. Cheplygina, M. de Bruijne, and J. P. Pluim, “Not-so-supervised: fields,” in International Conference on Medical Image Computing and
a survey of semi-supervised, multi-instance, and transfer learning in Computer-Assisted Intervention, 2016, pp. 415–423.
medical image analysis,” Medical Image Analysis, vol. 54, pp. 280– [58] T. L. Kline, P. Korfiatis, M. E. Edwards, J. D. Blais, F. S. Czerwiec,
296, 2019. P. C. Harris, B. F. King, V. E. Torres, and B. J. Erickson, “Performance
[38] Y. Wu, D. Zeng, Z. Wang, Y. Shi, and J. Hu, “Federated contrastive of an artificial multi-observer deep neural network for fully automated
learning for volumetric medical image segmentation,” in International segmentation of polycystic kidneys,” Journal of Digital Imaging,
Conference on Medical Image Computing and Computer-Assisted vol. 30, no. 4, pp. 442–448, 2017.
Intervention, 2021, pp. 367–377. [59] A. E. Kavur, N. S. Gezer, M. Barış, S. Aslan, P.-H. Conze, V. Groza,
[39] S. Budd, E. C. Robinson, and B. Kainz, “A survey on active learning D. D. Pham, S. Chatterjee, P. Ernst, S. Özkan et al., “CHAOS
and human-in-the-loop deep learning for medical image analysis,” challenge-combined (CT-MR) healthy abdominal organ segmentation,”
Medical Image Analysis, vol. 71, p. 102062, 2021. Medical Image Analysis, vol. 69, p. 101950, 2021.
[40] J. Ma, Y. Zhang, S. Gu, X. An, Z. Wang, C. Ge, C. Wang, F. Zhang, [60] P.-H. Conze, A. E. Kavur, E. Cornec-Le Gall, N. S. Gezer, Y. Le Meur,
Y. Wang, Y. Xu et al., “Fast and low-GPU-memory abdomen CT M. A. Selver, and F. Rousseau, “Abdominal multi-organ segmentation
23

with cascaded convolutional and adversarial deep networks,” Artificial on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
Intelligence in Medicine, vol. 117, p. 102109, 2021. [79] Y. Huo, Z. Xu, S. Bao, C. Bermudez, H. Moon, P. Parvathaneni, T. K.
[61] B. Lei, Z. Xia, F. Jiang, X. Jiang, Z. Ge, Y. Xu, J. Qin, S. Chen, Moyo, M. R. Savona, A. Assad, R. G. Abramson et al., “Splenomegaly
T. Wang, and S. Wang, “Skin lesion segmentation via generative ad- segmentation on multi-modal MRI using deep convolutional networks,”
versarial networks with dual discriminators,” Medical Image Analysis, IEEE Transactions on Medical Imaging, vol. 38, no. 5, pp. 1185–1196,
vol. 64, p. 101716, 2020. 2018.
[62] K. Kamnitsas, C. Baumgartner, C. Ledig, V. Newcombe, J. Simpson, [80] H. Ravishankar, R. Venkataramani, S. Thiruvenkadam, P. Sudhakar, and
A. Kane, D. Menon, A. Nori, A. Criminisi, D. Rueckert et al., V. Vaidya, “Learning and incorporating shape models for semantic seg-
“Unsupervised domain adaptation in brain lesion segmentation with mentation,” in International Conference on Medical Image Computing
adversarial networks,” in International Conference on Information and Computer-Assisted Intervention, 2017, pp. 203–211.
Processing in Medical Imaging, 2017, pp. 597–609. [81] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
[63] C. Chen, X. Liu, M. Ding, J. Zheng, and J. Li, “3D dilated multi- S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
fiber network for real-time brain tumor segmentation in MRI,” in Advances in Neural Information Processing Systems, vol. 27, 2014.
International Conference on Medical Image Computing and Computer- [82] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Og-
Assisted Intervention, 2019, pp. 184–192. den, “Pyramid methods in image processing,” RCA Engineer, vol. 29,
[64] D. Jin, D. Guo, T.-Y. Ho, A. P. Harrison, J. Xiao, C.-k. Tseng, and no. 6, pp. 33–41, 1984.
L. Lu, “Deep esophageal clinical target volume delineation using [83] Z. Tu and X. Bai, “Auto-context and its application to high-level vision
encoded 3D spatial context of tumors, lymph nodes, and organs at tasks and 3D brain image segmentation,” IEEE Transactions on Pattern
risk,” in International Conference on Medical Image Computing and Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1744–1757,
Computer-Assisted Intervention, 2019, pp. 603–612. 2010.
[65] J. Ma and X. Yang, “Combining CNN and hybrid active contours for [84] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to
head and neck tumor segmentation in CT and PET images,” in 3D scale: Scale-aware semantic image segmentation,” in IEEE Conference
Head and Neck Tumor Segmentation in PET/CT Challenge, 2020, pp. on Computer Vision and Pattern Recognition, 2016, pp. 3640–3649.
59–64. [85] J. Wei, Z. Wu, L. Wang, T. D. Bui, L. Qu, P.-T. Yap, Y. Xia, G. Li,
[66] W. Lei, H. Mei, Z. Sun, S. Ye, R. Gu, H. Wang, R. Huang, S. Zhang, and D. Shen, “A cascaded nested network for 3T brain MR image
S. Zhang, and G. Wang, “Automatic segmentation of organs-at-risk segmentation guided by 7T labeling,” Pattern Recognition, vol. 124, p.
from head-and-neck CT using separable convolutional neural network 108420, 2022.
with hard-region-weighted loss,” Neurocomputing, vol. 442, pp. 184– [86] M. S. Nosrati and G. Hamarneh, “Incorporating prior knowl-
199, 2021. edge in medical image segmentation: a survey,” arXiv preprint
[67] Y. Yan, P.-H. Conze, G. Quellec, M. Lamard, B. Cochener, and arXiv:1607.01092, 2016.
G. Coatrieux, “Two-stage multi-scale breast mass segmentation for full [87] H. Chen, X. Qi, L. Yu, Q. Dou, J. Qin, and P.-A. Heng, “DCAN:
mammogram analysis without user intervention,” Biocybernetics and Deep contour-aware networks for object instance segmentation from
Biomedical Engineering, vol. 41, no. 2, pp. 746–757, 2021. histology images,” Medical Image Analysis, vol. 36, pp. 135–146, 2017.
[68] A. Vakanski, M. Xian, and P. E. Freer, “Attention-enriched deep [88] P.-A. Ganaye, M. Sdika, B. Triggs, and H. Benoit-Cattin, “Remov-
learning model for breast tumor segmentation in ultrasound images,” ing segmentation inconsistencies with semi-supervised non-adjacency
Ultrasound in Medicine & Biology, vol. 46, no. 10, pp. 2819–2833, constraint,” Medical Image Analysis, vol. 58, p. 101551, 2019.
2020. [89] S. Xie and Z. Tu, “Holistically-nested edge detection,” in IEEE
[69] T. L. Kline, M. E. Edwards, J. Fetzer, A. V. Gregory, D. Anaam, A. J. International Conference on Computer Vision, 2015, pp. 1395–1403.
Metzger, and B. J. Erickson, “Automatic semantic segmentation of [90] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
kidney cysts in MR images of patients affected by autosomal-dominant large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
polycystic kidney disease,” Abdominal Radiology, vol. 46, no. 3, pp. [91] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncer-
1053–1061, 2021. tainty to weigh losses for scene geometry and semantics,” in IEEE
[70] V. Oreiller, V. Andrearczyk, M. Jreige, S. Boughdad, H. Elhalawani, Conference on Computer Vision and Pattern Recognition, 2018, pp.
J. Castelli, M. Vallières, S. Zhu, J. Xie, Y. Peng et al., “Head and neck 7482–7491.
tumor segmentation in PET/CT: the HECKTOR challenge,” Medical [92] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,”
Image Analysis, vol. 77, p. 102336, 2022. arXiv preprint arXiv:1410.5401, 2014.
[71] F. Ouhmich, V. Agnus, V. Noblet, F. Heitz, and P. Pessaux, “Liver tissue [93] M.-T. Luong, H. Pham, and C. D. Manning, “Effective ap-
segmentation in multiphase ct scans using cascaded convolutional neu- proaches to attention-based neural machine translation,” arXiv preprint
ral networks,” International Journal of Computer Assisted Radiology arXiv:1508.04025, 2015.
and Surgery, vol. 14, no. 8, pp. 1275–1284, 2019. [94] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
[72] K. Kamnitsas, W. Bai, E. Ferrante, S. McDonagh, M. Sinclair, Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
N. Pawlowski, M. Rajchl, M. Lee, B. Kainz, D. Rueckert et al., “En- Advances in Neural Information Processing Systems, vol. 30, 2017.
sembles of multiple models and architectures for robust brain tumour [95] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
segmentation,” in International MICCAI BrainLesion Workshop, 2017, jointly learning to align and translate,” arXiv preprint arXiv:1409.0473,
pp. 450–462. 2014.
[73] D. Keshwani, Y. Kitamura, S. Ihara, S. Iizuka, and E. Simo-Serra, [96] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
“TopNet: Topology preserving metric learning for vessel tree recon- IEEE Conference on Computer Vision and Pattern Recognition, 2018,
struction and labelling,” in International Conference on Medical Image pp. 7132–7141.
Computing and Computer-Assisted Intervention, 2020, pp. 14–23. [97] A. Iantsen, D. Visvikis, and M. Hatt, “Squeeze-and-excitation normal-
[74] A. Sadikine, B. Badic, J.-P. Tasu, V. Noblet, D. Visvikis, and P.- ization for automated delineation of head and neck primary tumors in
H. Conze, “Semi-overcomplete convolutional auto-encoder embedding combined PET and CT images,” in Head and Neck Tumor Segmentation
as shape priors for deep vessel segmentation,” in IEEE International Challenge, vol. 12603, 2020, pp. 37–43.
Conference on Image Processing, 2022. [98] L. Rundo, C. Han, Y. Nagano, J. Zhang, R. Hataya, C. Militello,
[75] M. Hatt, C. Parmar, J. Qi, and I. El Naqa, “Machine (deep) learning A. Tangherloni, M. S. Nobile, C. Ferretti, D. Besozzi et al., “USE-Net:
methods for image processing and radiomics,” IEEE Transactions on Incorporating squeeze-and-excitation blocks into U-Net for prostate
Radiation and Plasma Medical Sciences, vol. 3, no. 2, pp. 104–108, zonal segmentation of multi-institutional MRI datasets,” Neurocomput-
2019. ing, vol. 365, pp. 31–43, 2019.
[76] M. M. Islam, B. Badic, T. Aparicio, D. Tougeron, J.-P. Tasu, [99] X. Li, Y. Wei, L. Wang, S. Fu, and C. Wang, “MSGSE-Net: Multi-scale
D. Visvikis, and P.-H. Conze, “Deep treatment response assessment guided squeeze-and-excitation network for subcortical brain structure
and prediction of colorectal cancer liver metastases,” in International segmentation,” Neurocomputing, vol. 461, pp. 228–243, 2021.
Conference on Medical Image Computing and Computer-Assisted [100] X. Shen, J. Xu, H. Jia, P. Fan, F. Dong, B. Yu, and S. Ren, “Self-
Intervention, 2022, pp. 482–491. attentional microvessel segmentation via squeeze-excitation trans-
[77] C. Pons, F. T. Sheehan, H. S. Im, S. Brochard, and K. E. Alter, former unet,” Computerized Medical Imaging and Graphics, vol. 97,
“Shoulder muscle atrophy and its relation to strength loss in obstetrical p. 102055, 2022.
brachial plexus palsy,” Clinical Biomechanics, vol. 48, pp. 80–87, 2017. [101] N. Abraham and N. M. Khan, “A novel focal tversky loss function
[78] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image with improved attention U-Net for lesion segmentation,” in IEEE
translation with conditional adversarial networks,” in IEEE Conference International Symposium on Biomedical Imaging, 2019, pp. 683–687.
24

[102] A. G. Roy, N. Navab, and C. Wachinger, “Recalibrating fully con- [124] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, and V. M. Patel, “Medical
volutional networks with spatial and channel squeeze and excitation Transformer: Gated axial-attention for medical image segmentation,” in
blocks,” IEEE Transactions on Medical Imaging, vol. 38, no. 2, pp. International Conference on Medical Image Computing and Computer-
540–549, 2018. Assisted Intervention, 2021, pp. 36–46.
[103] C. Yao, J. Tang, M. Hu, Y. Wu, W. Guo, Q. Li, and X.-P. Zhang, [125] W. Wang, C. Chen, M. Ding, H. Yu, S. Zha, and J. Li, “TransBTS:
“Claw U-Net: A U-Net variant network with deep feature concatenation Multimodal brain tumor segmentation using Transformer,” in Inter-
for scleral blood vessel segmentation,” in International Conference on national Conference on Medical Image Computing and Computer-
Artificial Intelligence, 2021, pp. 67–78. Assisted Intervention, 2021, pp. 109–119.
[104] A. Sinha and J. Dolz, “Multi-scale self-guided attention for medical [126] J. Li, W. Wang, C. Chen, T. Zhang, S. Zha, H. Yu, and J. Wang,
image segmentation.” IEEE Journal of Biomedical and Health Infor- “TransBTSv2: Wider instead of deeper transformer for medical image
matics, 2020. segmentation,” arXiv preprint arXiv:2201.12785, 2022.
[105] A. G. Roy, N. Navab, and C. Wachinger, “Concurrent spatial and [127] Y. Xie, J. Zhang, C. Shen, and Y. Xia, “CoTr: Efficiently bridging CNN
channel squeeze & excitation in fully convolutional networks,” ArXiv, and Transformer for 3D medical image segmentation,” in International
2018. Conference on Medical Image Computing and Computer-Assisted
[106] W. Fang and X.-h. Han, “Spatial and channel attention modulated Intervention, 2021, pp. 171–180.
network for medical image segmentation,” in Asian Conference on [128] S. Li, X. Sui, X. Luo, X. Xu, Y. Liu, and R. Goh, “Medical image seg-
Computer Vision, 2020. mentation using squeeze-and-expansion Transformers,” arXiv preprint
[107] V. V. Valindria, N. Pawlowski, M. Rajchl, I. Lavdas, E. O. Aboagye, arXiv:2105.09511, 2021.
A. G. Rockall, D. Rueckert, and B. Glocker, “Multi-modal learning [129] B. Chen, Y. Liu, Z. Zhang, G. Lu, and D. Zhang, “TransAttUNet:
from unpaired images: Application to multi-organ segmentation in CT Multi-level attention-guided U-Net with Transformer for medical image
and MRI,” in IEEE Winter Conference on Applications of Computer segmentation,” arXiv preprint arXiv:2107.05274, 2021.
Vision, 2018, pp. 547–556. [130] L. Liu, Z. Huang, P. Liò, C.-B. Schönlieb, and A. I. Aviles-Rivero,
[108] X. Zhou, “Automatic segmentation of multiple organs on 3D CT images “PC-SwinMorph: Patch representation for unsupervised medical image
by using deep learning approaches,” Deep Learning in Medical Image registration and segmentation,” arXiv preprint arXiv:2203.05684, 2022.
Analysis, pp. 135–147, 2020. [131] Y. Zhang, H. Liu, and Q. Hu, “Transfuse: Fusing Transform-
[109] S. A. Taghanaki, Y. Zheng, S. K. Zhou, B. Georgescu, P. Sharma, ers and CNNs for medical image segmentation,” arXiv preprint
D. Xu, D. Comaniciu, and G. Hamarneh, “Combo loss: Handling input arXiv:2102.08005, 2021.
and output imbalance in multi-organ segmentation,” Computerized [132] Q. Sun, N. Fang, Z. Liu, L. Zhao, Y. Wen, and H. Lin, “HybridCTrm:
Medical Imaging and Graphics, vol. 75, pp. 24–33, 2019. Bridging CNN and Transformer for multimodal brain image segmen-
[110] E. Gibson, W. Li, C. Sudre, L. Fidon, D. I. Shakir, G. Wang, Z. Eaton- tation,” Journal of Healthcare Engineering, vol. 2021, 2021.
Rosen, R. Gray, T. Doel, Y. Hu et al., “NiftyNet: a deep-learning [133] X. Luo, M. Hu, T. Song, G. Wang, and S. Zhang, “Semi-supervised
platform for medical imaging,” Computer Methods and Programs in medical image segmentation via cross teaching between CNN and
Biomedicine, vol. 158, pp. 113–122, 2018. Transformer,” in Medical Imaging with Deep Learning, 2022.
[111] Y. Weng, T. Zhou, Y. Li, and X. Qiu, “NAS-UNet: Neural architecture [134] H.-Y. Zhou, J. Guo, Y. Zhang, L. Yu, L. Wang, and Y. Yu, “nnFormer:
search for medical image segmentation,” IEEE Access, vol. 7, pp. Interleaved transformer for volumetric segmentation,” arXiv preprint
44 247–44 257, 2019. arXiv:2109.03201, 2021.
[112] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- [135] Y. Wu, K. Liao, J. Chen, D. Z. Chen, J. Wang, H. Gao, and J. Wu,
training of deep bidirectional transformers for language understanding,” “D-former: A U-shaped dilated Transformer for 3D medical image
arXiv preprint arXiv:1810.04805, 2018. segmentation,” arXiv preprint arXiv:2201.00462, 2022.
[113] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving [136] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang,
language understanding by generative pre-training,” 2018. “Swin-UNet: UNet-like pure Transformer for medical image segmen-
[114] E. Jun, S. Jeong, D.-W. Heo, and H.-I. Suk, “Medical Trans- tation,” arXiv preprint arXiv:2105.05537, 2021.
former: Universal brain encoder for 3D MRI analysis,” arXiv preprint [137] A. Lin, B. Chen, J. Xu, Z. Zhang, G. Lu, and D. Zhang, “DS-
arXiv:2104.13633, 2021. TransUNet: Dual Swin Transformer U-Net for medical image segmen-
[115] J. Li, J. Chen, Y. Tang, B. A. Landman, and S. K. Zhou, “Transforming tation,” IEEE Transactions on Instrumentation and Measurement, 2022.
medical imaging with Transformers? a comparative review of key [138] X. Huang, Z. Deng, D. Li, and X. Yuan, “MISSFormer: An ef-
properties, current progresses, and future perspectives,” arXiv preprint fective medical image segmentation transformer,” arXiv preprint
arXiv:2206.01136, 2022. arXiv:2109.07162, 2021.
[116] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, [139] H. Peiris, M. Hayat, Z. Chen, G. Egan, and M. Harandi, “A volumetric
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., Transformer for accurate 3D tumor segmentation,” arXiv preprint
“An image is worth 16x16 words: Transformers for image recognition arXiv:2111.13300, 2021.
at scale,” arXiv preprint arXiv:2010.11929, 2020. [140] S. Ruder, “An overview of multi-task learning in deep neural networks,”
[117] G. Andrade-Miranda, V. Jaouen, V. Bourbonne, F. Lucia, D. Visvikis, arXiv preprint arXiv:1706.05098, 2017.
and P.-H. Conze, “Pure versus hybrid Transformers for multi-modal [141] P. Moeskops, J. M. Wolterink, B. H. van der Velden, K. G. Gilhuijs,
brain tumor segmentation: a comparative study,” in IEEE International T. Leiner, M. A. Viergever, and I. Išgum, “Deep learning for multi-task
Conference on Image Processing, 2022. medical image segmentation in multiple modalities,” in International
[118] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Land- Conference on Medical Image Computing and Computer-Assisted
man, H. R. Roth, and D. Xu, “UNETR: Transformers for 3D medical Intervention, 2016, pp. 478–486.
image segmentation,” in IEEE/CVF Conference on Applications of [142] C. Playout, R. Duval, and F. Cheriet, “A multitask learning architecture
Computer Vision, 2022, pp. 574–584. for simultaneous segmentation of bright and red lesions in fundus
[119] D. Karimi, S. D. Vasylechko, and A. Gholipour, “Convolution-free images,” in International Conference on Medical Image Computing
medical image segmentation using Transformers,” in International and Computer-Assisted Intervention, 2018, pp. 101–108.
Conference on Medical Image Computing and Computer-Assisted [143] B. Murugesan, K. Sarveswaran, S. M. Shankaranarayana, K. Ram,
Intervention, 2021, pp. 78–88. J. Joseph, and M. Sivaprakasam, “Psi-Net: Shape and boundary aware
[120] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. Roth, and D. Xu, “Swin joint multi-task deep network for medical image segmentation,” in
UNETR: Swin Transformers for semantic segmentation of brain tumors International Conference of the IEEE Engineering in Medicine and
in MR images,” arXiv preprint arXiv:2201.01266, 2022. Biology Society, 2019, pp. 7223–7226.
[121] B. Dong, W. Wang, D.-P. Fan, J. Li, H. Fu, and L. Shao, “Polyp-PVT: [144] Y. Zhang, H. Li, J. Du, J. Qin, T. Wang, Y. Chen, B. Liu, W. Gao,
Polyp segmentation with pyramid vision transformers,” arXiv preprint G. Ma, and B. Lei, “3D multi-attention guided multi-task learning
arXiv:2108.06932, 2021. network for automatic gastric tumor segmentation and lymph node
[122] Z. Zhang, H. Zhang, L. Zhao, T. Chen, S. Ö. Arik, and T. Pfister, classification,” IEEE Transactions on Medical Imaging, vol. 40, no. 6,
“Nested hierarchical transformer: Towards accurate, data-efficient and pp. 1618–1631, 2021.
interpretable visual understanding,” in Conference on Artificial Intelli- [145] S. Chen, G. Bortsova, A. Garcı́a-Uceda Juárez, G. v. Tulder, and
gence, vol. 36, no. 3, 2022, pp. 3417–3425. M. d. Bruijne, “Multi-task attention-based semi-supervised learning for
[123] X. Yu, Y. Tang, Y. Zhou, R. Gao, Q. Yang, H. H. Lee, T. Li, S. Bao, medical image segmentation,” in International Conference on Medical
Y. Huo, Z. Xu et al., “Characterizing renal structures with 3D block Image Computing and Computer-Assisted Intervention, 2019, pp. 457–
aggregate Transformers,” arXiv preprint arXiv:2203.02430, 2022. 465.
25

[146] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian [168] S. Zagoruyko and N. Komodakis, “Paying more attention to attention:
deep learning for computer vision?” Advances in Neural Information Improving the performance of convolutional neural networks via atten-
Processing Systems, vol. 30, 2017. tion transfer,” arXiv preprint arXiv:1612.03928, 2016.
[147] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, [169] T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan, “Knowledge
M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya adaptation for efficient semantic segmentation,” in IEEE/CVF Confer-
et al., “A review of uncertainty quantification in deep learning: Tech- ence on Computer Vision and Pattern Recognition, 2019, pp. 578–587.
niques, applications and challenges,” Information Fusion, vol. 76, pp. [170] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, “Structured
243–297, 2021. knowledge distillation for semantic segmentation,” in IEEE/CVF Con-
[148] A. Jungo and M. Reyes, “Assessing reliability and challenges of ference on Computer Vision and Pattern Recognition, 2019, pp. 2604–
uncertainty estimations for medical image segmentation,” in Inter- 2613.
national Conference on Medical Image Computing and Computer- [171] Y. Wen, L. Chen, S. Xi, Y. Deng, X. Tang, and C. Zhou, “Towards
Assisted Intervention, 2019, pp. 48–56. efficient medical image segmentation via boundary-guided knowledge
[149] F. Galati, S. Ourselin, and M. A. Zuluaga, “From accuracy to reliability distillation,” in IEEE International Conference on Multimedia and
and robustness in cardiac magnetic resonance image segmentation: a Expo, 2021.
review,” Applied Sciences, vol. 12, no. 8, p. 3936, 2022. [172] F. Tung and G. Mori, “Similarity-preserving knowledge distillation,”
[150] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: in IEEE/CVF International Conference on Computer Vision, 2019, pp.
Representing model uncertainty in deep learning,” in International 1365–1374.
Conference on Machine Learning, 2016, pp. 1050–1059. [173] E. Kats, J. Goldberger, and H. Greenspan, “Soft labeling by distilling
[151] D. C. Castro, I. Walker, and B. Glocker, “Causality matters in medical anatomical knowledge for improved MS lesion segmentation,” in IEEE
imaging,” Nature Communications, vol. 11, no. 1, pp. 1–10, 2020. International Symposium on Biomedical Imaging, 2019, pp. 1563–
[152] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration 1566.
of modern neural networks,” in International Conference on Machine [174] Y. Liu, C. Shu, J. Wang, and C. Shen, “Structured knowledge distilla-
Learning, 2017, pp. 1321–1330. tion for dense prediction,” IEEE Transactions on Pattern Analysis and
[153] Y. Xia, D. Yang, Z. Yu, F. Liu, J. Cai, L. Yu, Z. Zhu, D. Xu, Machine Intelligence, 2020.
A. Yuille, and H. Roth, “Uncertainty-aware multi-view co-training for [175] L. Zhang, S. Feng, Y. Wang, Y. Wang, Y. Zhang, X. Chen, and Q. Tian,
semi-supervised medical image segmentation and domain adaptation,” “Unsupervised ensemble distillation for multi-organ segmentation,” in
Medical Image Analysis, vol. 65, p. 101766, 2020. IEEE International Symposium on Biomedical Imaging, 2022.
[154] T. Nair, D. Precup, D. L. Arnold, and T. Arbel, “Exploring uncertainty [176] R. Huang, Y. Zheng, Z. Hu, S. Zhang, and H. Li, “Multi-organ
measures in deep networks for multiple sclerosis lesion detection and segmentation via co-training weight-averaged models from few-organ
segmentation,” Medical image analysis, vol. 59, p. 101557, 2020. datasets,” in International Conference on Medical Image Computing
[155] K. Wickstrøm, M. Kampffmeyer, and R. Jenssen, “Uncertainty and and Computer-Assisted Intervention, 2020, pp. 146–155.
interpretability in convolutional neural networks for semantic seg- [177] N. Wang, S. Lin, X. Li, K. Li, Y. Shen, Y. Gao, and L. Ma, “MISSU:
mentation of colorectal polyps,” Medical Image Analysis, vol. 60, p. 3D medical image segmentation via self-distilling TransUNet,” arXiv
101619, 2020. preprint arXiv:2206.00902, 2022.
[156] Y. Zhao, C. Yang, A. Schweidtmann, and Q. Tao, “Efficient bayesian [178] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine
uncertainty estimation for nnu-net,” in International Conference on learning: A survey and taxonomy,” IEEE Transactions on Pattern
Medical Image Computing and Computer-Assisted Intervention, 2022, Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2018.
pp. 535–544. [179] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Ben-
[157] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable gio, C. Pal, P.-M. Jodoin, and H. Larochelle, “Brain tumor segmentation
predictive uncertainty estimation using deep ensembles,” Advances in with deep neural networks,” Medical Image Analysis, vol. 35, pp. 18–
neural information processing systems, vol. 30, 2017. 31, 2017.
[158] M. Antonelli, A. Reinke, S. Bakas, K. Farahani, A. Kopp-Schneider, [180] Y. Qin, K. Kamnitsas, S. Ancha, J. Nanavati, G. Cottrell, A. Crim-
B. A. Landman, G. Litjens, B. Menze, O. Ronneberger, R. M. Summers inisi, and A. Nori, “Autofocus layer for semantic segmentation,” in
et al., “The medical segmentation decathlon,” Nature Communications, International Conference on Medical Image Computing and Computer-
vol. 13, no. 1, pp. 1–13, 2022. Assisted Intervention, 2018, pp. 603–611.
[159] D. Shanmugam, D. Blalock, G. Balakrishnan, and J. Guttag, “Better [181] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane,
aggregation in test-time augmentation,” in IEEE/CVF International D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3D
Conference on Computer Vision, 2021, pp. 1214–1223. CNN with fully connected CRF for accurate brain lesion segmentation,”
[160] X. Hu, R. Guo, J. Chen, H. Li, D. Waldmannstetter, Y. Zhao, B. Li, Medical Image Analysis, vol. 36, pp. 61–78, 2017.
K. Shi, and B. Menze, “Coarse-to-fine adversarial networks and zone- [182] S. Pereira, A. Pinto, V. Alves, and C. A. Silva, “Brain tumor seg-
based uncertainty analysis for NK/T-cell lymphoma segmentation in mentation using convolutional neural networks in MRI images,” IEEE
CT/PET images,” IEEE Journal of Biomedical and Health Informatics, Transactions on Medical Imaging, vol. 35, no. 5, pp. 1240–1251, 2016.
vol. 24, no. 9, pp. 2599–2608, 2020. [183] J. Shapey, G. Wang, R. Dorent, A. Dimitriadis, W. Li, I. Paddick,
[161] V. P. Sudarshan, U. Upadhyay, G. F. Egan, Z. Chen, and S. P. N. Kitchen, S. Bisdas, S. R. Saeed, S. Ourselin et al., “An artificial
Awate, “Towards lower-dose PET using physics-based uncertainty- intelligence framework for automatic segmentation and volumetry
aware multimodal learning with robustness to out-of-distribution data,” of vestibular schwannomas from contrast-enhanced T1-weighted and
Medical Image Analysis, vol. 73, p. 102187, 2021. high-resolution T2-weighted MRI,” Journal of Neurosurgery, vol. 134,
[162] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric no. 1, pp. 171–179, 2019.
discriminatively, with application to face verification,” in IEEE Con- [184] H.-Y. Yang, “Volumetric adversarial training for ischemic stroke lesion
ference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. segmentation,” in MICCAI BrainLesion Workshop, 2018, pp. 343–351.
539–546. [185] C. Zhou, C. Ding, Z. Lu, X. Wang, and D. Tao, “One-pass multi-
[163] A. Jamaludin, T. Kadir, and A. Zisserman, “Self-supervised learning task convolutional neural networks for efficient brain tumor segmenta-
for spinal MRIs,” in Deep Learning in Medical Image Analysis and tion,” in International Conference on Medical Image Computing and
Multimodal Learning for Clinical Decision Support, 2017, pp. 294– Computer-Assisted Intervention, 2018, pp. 637–645.
302. [186] M. Havaei, N. Guizard, N. Chapados, and Y. Bengio, “HeMIS: Hetero-
[164] X. Hu, D. Zeng, X. Xu, and Y. Shi, “Semi-supervised contrastive modal image segmentation,” in International Conference on Medical
learning for label-efficient medical image segmentation,” in Inter- Image Computing and Computer-Assisted Intervention, 2016, pp. 469–
national Conference on Medical Image Computing and Computer- 477.
Assisted Intervention, 2021, pp. 481–490. [187] K.-L. Tseng, Y.-L. Lin, W. Hsu, and C.-Y. Huang, “Joint sequence
[165] D. Zeng, Y. Wu, X. Hu, X. Xu, H. Yuan, M. Huang, J. Zhuang, J. Hu, learning and cross-modality convolution for 3D biomedical segmenta-
and Y. Shi, “Positional contrastive learning for volumetric medical tion,” in IEEE Conference on Computer Vision and Pattern Recogni-
image segmentation,” arXiv preprint arXiv:2106.09157, 2021. tion, 2017, pp. 6393–6400.
[166] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a [188] T. Zhou, S. Canu, and S. Ruan, “Fusion based on attention mechanism
neural network,” in Neural Information Processing Systems, 2014. and context constraint for multi-modal brain tumor segmentation,”
[167] K. Xu, L. Rui, Y. Li, and L. Gu, “Feature normalized knowledge distil- Computerized Medical Imaging and Graphics, vol. 86, p. 101811,
lation for image classification,” in European Conference on Computer 2020.
Vision, 2020, pp. 664–680. [189] T. Zhou, S. Canu, P. Vera, and S. Ruan, “Latent correlation repre-
26

sentation learning for brain tumor segmentation with missing MRI learning: a collaborative effort to achieve better medical imaging mod-
modalities,” IEEE Transactions on Image Processing, vol. 30, pp. els for individual sites that have small labelled datasets,” Quantitative
4263–4274, 2021. Imaging in Medicine and Surgery, vol. 11, no. 2, p. 852, 2021.
[190] C. Li, H. Sun, Z. Liu, M. Wang, H. Zheng, and S. Wang, “Learning [211] A. Chowdhury, H. Kassem, N. Padoy, R. Umeton, and A. Karargyris,
cross-modal deep representations for multi-modal MR image segmen- “A review of medical federated learning: Applications in oncology
tation,” in International Conference on Medical Image Computing and and cancer research,” in International MICCAI Brainlesion Workshop,
Computer-Assisted Intervention, 2019, pp. 57–65. 2022, pp. 3–24.
[191] J. Dolz, K. Gopinath, J. Yuan, H. Lombaert, C. Desrosiers, and I. B. [212] X. Xu, T. Chen, H. Deng, T. Kuang, J. C. Barber, D. Kim, J. Gateno,
Ayed, “HyperDense-Net: a hyper-densely connected CNN for multi- P. Yan, and J. J. Xia, “Federated cross learning for medical image
modal image segmentation,” IEEE Transactions on Medical Imaging, segmentation,” arXiv preprint arXiv:2204.02450, 2022.
vol. 38, no. 5, pp. 1116–1126, 2018. [213] J. Wicaksana, Z. Yan, D. Zhang, X. Huang, H. Wu, X. Yang, and K.-
[192] D. Zhang, G. Huang, Q. Zhang, J. Han, J. Han, and Y. Yu, “Cross- T. Cheng, “FedMix: Mixed supervised federated learning for medical
modality deep feature learning for brain tumor segmentation,” Pattern image segmentation,” arXiv preprint arXiv:2205.01840, 2022.
Recognition, vol. 110, p. 107562, 2021. [214] T. Kim, K. Lee, S. Ham, B. Park, S. Lee, D. Hong, G. B. Kim,
[193] Y. Zhang, D. Sidibé, O. Morel, and F. Mériaudeau, “Deep multimodal Y. S. Kyung, C.-S. Kim, and N. Kim, “Active learning for accuracy
fusion for semantic image segmentation: A survey,” Image and Vision enhancement of semantic segmentation with CNN-corrected label cura-
Computing, vol. 105, p. 104042, 2021. tions: Evaluation on kidney segmentation in abdominal CT,” Scientific
[194] Y. Zhang, J. Yang, J. Tian, Z. Shi, C. Zhong, Y. Zhang, and Z. He, Reports, vol. 10, no. 1, pp. 1–7, 2020.
“Modality-aware mutual learning for multi-modal medical image seg- [215] M. Shen, J. Y. Zhang, L. Chen, W. Yan, N. Jani, B. Sutton, and
mentation,” in International Conference on Medical Image Computing O. Koyejo, “Labeling cost sensitive batch active learning for brain
and Computer-Assisted Intervention, 2021, pp. 589–599. tumor segmentation,” in IEEE International Symposium on Biomedical
[195] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to- Imaging, 2021, pp. 1269–1273.
image translation using cycle-consistent adversarial networks,” in IEEE [216] Y. Yan, P.-H. Conze, M. Lamard, H. Zhang, G. Quellec, B. Cochener,
International Conference on Computer Vision, 2017, pp. 2223–2232. and G. Coatrieux, “Deep active learning for dual-view mammogram
[196] R. Dorent, A. Kujawa, M. Ivory, S. Bakas, N. Rieke, S. Joutard, analysis,” in International Workshop on Machine Learning in Medical
B. Glocker, J. Cardoso, M. Modat, K. Batmanghelich et al., “Cross- Imaging, 2021, pp. 180–189.
MoDA 2021 challenge: Benchmark of cross-modality domain adapta- [217] O. Yaniv, O. Portnoy, A. Talmon, N. Kiryati, E. Konen, and A. Mayer,
tion techniques for vestibular schwannoma and cochlea segmentation,” “V-Net light-parameter-efficient 3D convolutional neural network for
Medical Image Analysis, vol. 83, p. 102628, 2023. prostate MRI segmentation,” in IEEE International Symposium on
[197] G. Wilson and D. J. Cook, “A survey of unsupervised deep domain Biomedical Imaging, 2020, pp. 442–445.
adaptation,” ACM Transactions on Intelligent Systems and Technology, [218] Q. Zhao, H. Wang, and G. Wang, “LCOV-NET: A lightweight neural
vol. 11, no. 5, pp. 1–46, 2020. network for Covid-19 pneumonia lesion segmentation from 3D CT
[198] C. Ouyang, K. Kamnitsas, C. Biffi, J. Duan, and D. Rueckert, “Data images,” in IEEE International Symposium on Biomedical Imaging,
efficient unsupervised domain adaptation for cross-modality image seg- 2021, pp. 42–45.
mentation,” in International Conference on Medical Image Computing
and Computer-Assisted Intervention, 2019, pp. 669–677.
[199] N. Karani, K. Chaitanya, C. Baumgartner, and E. Konukoglu, “A life-
long learning approach to brain MR segmentation across scanners and
protocols,” in International Conference on Medical Image Computing
and Computer-Assisted Intervention, 2018, pp. 476–484.
[200] Q. Dou, Q. Liu, P. A. Heng, and B. Glocker, “Unpaired multi-
modal segmentation via knowledge distillation,” IEEE Transactions on
Medical Imaging, vol. 39, no. 7, pp. 2415–2425, 2020.
[201] Z. Zhang, L. Yang, and Y. Zheng, “Translating and segmenting mul-
timodal medical volumes with cycle-and shape-consistency generative
adversarial network,” in IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 9242–9251.
[202] S.-Y. Hu, S. Wang, W.-H. Weng, J. Wang, X. Wang, A. Ozturk, Q. Li,
V. Kumar, and A. E. Samir, “Self-supervised pretraining with DICOM
metadata in ultrasound imaging,” in Machine Learning for Healthcare
Conference, 2020, pp. 732–749.
[203] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with
contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[204] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transac-
tions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–
1359, 2010.
[205] A. Tarvainen and H. Valpola, “Mean teachers are better role models:
Weight-averaged consistency targets improve semi-supervised deep
learning results,” Advances in neural information processing systems,
vol. 30, 2017.
[206] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, “Uncertainty-aware
self-ensembling model for semi-supervised 3D left atrium segmenta-
tion,” in International Conference on Medical Image Computing and
Computer-Assisted Intervention, 2019, pp. 605–613.
[207] G. Wang, X. Liu, C. Li, Z. Xu, J. Ruan, H. Zhu, T. Meng, K. Li,
N. Huang, and S. Zhang, “A noise-robust framework for automatic
segmentation of COVID-19 pneumonia lesions from CT images,” IEEE
Transactions on Medical Imaging, vol. 39, no. 8, pp. 2653–2663, 2020.
[208] D.-P. Fan, T. Zhou, G.-P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and
L. Shao, “Inf-Net: Automatic COVID-19 lung infection segmentation
from ct images,” IEEE Transactions on Medical Imaging, vol. 39, no. 8,
pp. 2626–2637, 2020.
[209] Y.-X. Zhao, Y.-M. Zhang, M. Song, and C.-L. Liu, “Multi-view
semi-supervised 3D whole brain segmentation with a self-ensemble
network,” in International Conference on Medical Image Computing
and Computer-Assisted Intervention, 2019, pp. 256–265.
[210] D. Ng, X. Lan, M. M.-S. Yao, W. P. Chan, and M. Feng, “Federated