Final Version
Abstract—In recent years, the segmentation of anatomical or pathological structures using deep learning has attracted widespread interest in medical image analysis. Remarkably successful performance has been reported in many imaging modalities and for a variety of clinical contexts to support clinicians in computer-assisted diagnosis, therapy or surgical planning purposes. However, despite the increasing number of medical image segmentation challenges, there remains little consensus on which methodology performs best. Therefore, we examine in this paper the numerous developments and breakthroughs brought since the rise of U-Net inspired architectures. In particular, we focus on the technical challenges and emerging trends that the community is now focusing on, including conditional generative adversarial and cascaded networks, medical Transformers, contrastive learning, knowledge distillation, active learning, prior knowledge embedding, cross-modality learning, multi-structure analysis, federated learning as well as semi-supervised and self-supervised paradigms. We also suggest possible avenues to be further investigated in future research efforts.

Index Terms—artificial intelligence, semantic segmentation, deep neural networks, vision Transformers, medical imaging

This work did not involve human subjects or animals in its research. P.-H. Conze and V. Jaouen are with IMT Atlantique, LaTIM UMR 1101, Inserm, Brest, France, [email protected]. V. K. Singh is with Queen's University Belfast, Belfast, United Kingdom. G. Andrade-Miranda and D. Visvikis are with Inserm, LaTIM UMR 1101, Brest, France.

I. INTRODUCTION

The increased volume of medical data to be interpreted by clinicians for diagnosis, therapeutic or surgical planning purposes has encouraged the development of computer-aided image analysis tools to benefit from precise, fast, repeatable and objective measurements made by computational resources. Among existing analysis tasks, medical image segmentation, whose goal is to extract the boundaries of anatomical or pathological structures from medical images, remains crucial. Also commonly used in computer vision [1], semantic segmentation is a key step for many medical imaging workflows since the information arising from the resulting voxel-wise localization can greatly help clinicians to diagnose disorders, assess disease progression, plan therapeutic interventions or monitor treatment effects. As a core feature of many computer-aided detection and diagnosis schemes, image segmentation is involved in the analysis of many imaging modalities including computed tomography (CT), magnetic resonance (MR), positron emission tomography (PET) or ultrasound (US).

Delineating anatomical or pathological structures from medical images is traditionally performed manually. This task is exceedingly time-consuming and requires suitable clinical expertise to obtain clinically-relevant contours. It is therefore not applicable to the large volumes of data typically produced in clinical routine or research studies. Given the potential fatigue of human experts and the wide variations in expertise, manual segmentation is prone to strong intra- and inter-expert variability [2]. Irregularities of the targeted structures, morphological variations or pathological deformities between patients, as well as the potential lack of clearly visible boundaries with the surrounding anatomy, further increase the disagreement between operators. To ease the process, intra-subject semi-automatic techniques consisting of ascending and descending non-linear registration steps applied to manually-drawn masks can be applied to obtain volumetric results [3, 4]. Although more affordable than full 3D volume annotations, such propagation schemes from sparse manual delineations to the remaining slices still need interactions and may require manual refinements.

Mathematical models and low-level image processing were extensively exploited for segmentation purposes before the rise of learning techniques. In particular, model-based segmentation incorporating statistical shape models has been followed in various clinical contexts [5]. These models have been further improved by exploiting prior knowledge of shape information, for instance by relying on internal shape fitting and auto-correction to guide the delineation process [6]. Conversely, aligning and merging manually segmented images into a specific atlas coordinate space has been developed as a reliable alternative to statistical shape models. In this context, various single- and multi-atlas methods relying on non-linear registration have been proposed [7]. Some hybrid methods relying on statistical shape models constrained with probabilistic atlases have also been investigated. Medical image segmentation has also been performed through Bayesian approaches using expectation-maximization [8], possibilistic clustering [9], histogram-based thresholding followed by region growing [10], active contours [11] and, more recently, machine learning (ML) [12] techniques.

However, the previously described methodologies are not perfectly suited to high inter-subject shape variability, weak boundaries and significant differences in tissue appearance. In most cases, their robustness is not up to the inherent limitations of medical images such as noise, non-uniform contrasts or motion artifacts. Moreover, many of these methods are semi-automatic and hence require prior knowledge, associated with high computational costs. This has strongly motivated the development of deep learning (DL) techniques to exploit image characteristics (e.g. contrast variation, orientation, shape, texture patterns) in a more efficient data-driven manner.

In recent years, artificial intelligence (AI) and more particularly DL models have reached impressive performance in
$$ \Theta_\phi \leftarrow \Theta_\phi - \alpha \nabla_{\Theta_\phi} \mathcal{L}_\phi \qquad (1) $$

where the learning rate α is a hyper-parameter controlling the step size at each iteration. Tuning α is of paramount importance to find a good trade-off between convergence speed and stable optimization. Back-propagation deals with gradient computation, while the gradient descent algorithm, based on this gradient, aims at performing the learning procedure. ℓ_φ is a per-image loss function which is usually the cross-entropy loss defined, in a multi-class scenario with C classes, as follows:

$$ \ell_{CE}(\mathbf{y}_n, \hat{\mathbf{y}}_n) = -\frac{1}{|\mathcal{C}||\Omega|} \sum_{c \in \mathcal{C}} \sum_{u \in \Omega} y_{n,c,u}\,\log(\hat{y}_{n,c,u}) \qquad (2) $$

where Ω is the image grid and c ∈ C a given class, with C = {0, ..., C} indexing the different structures of interest as well as the background. As reviewed in [42], a wide variety of loss functions exist including distribution-based (e.g. cross-entropy), region-based (e.g. Dice), compound (e.g. DiceCE) and boundary-based (e.g. Hausdorff distance) losses.

B. Seminal works

The simplest and earliest attempts to perform segmentation using CNN consisted in classifying each pixel individually in a patch-based manner [43]. Since input patches from neighboring pixels have large overlaps, the same convolutions were computed many times. By replacing fully-connected layers with convolutional layers, fully convolutional networks (FCN) gave the opportunity to take entire images as inputs and produce likelihood maps instead of single-pixel outputs. This removed the need to select representative patches and eliminated redundant calculations due to patch overlaps. In order to avoid outputs with far lower resolution than input shapes, FCN were applied to shifted versions of input images [44]. Multiple resulting outputs were then stitched together to get results at full resolution. Further improvements were then proposed with architectures comprising a regular FCN to extract features, followed by an up-sampling part that enables the recovery of the input resolution using up-convolutions [16]. Compared to patch-based or shift-and-stitch methods, precise localization was possible in a single pass while taking into account the full image context. This motivated the strong interest devoted to convolutional encoder-decoders, among which U-Net (Sect.II-C) is the most commonly used representative.

C. U-Net

Among existing convolutional encoder-decoder architectures, most DL-based medical image segmentation models are based on U-Net [14] and its 3D counterpart V-Net [45]. U-Net and V-Net consist of symmetrical architectures comprising an encoder that gradually reduces the spatial dimension using pooling layers, a decoder progressively recovering object details and initial resolution, as well as skip-connections (i.e. long-range shortcuts) which concatenate features between contracting and expanding paths to help in improving localization accuracy and convergence speed. The contracting path encoder of a standard U-Net (resp. V-Net) architecture consists of sequential layers including 3×3 (resp. 3×3×3) convolutional layers followed by batch normalization (BN) and rectified linear unit (ReLU) activations (Fig.1). Spatial size is reduced using 2×2 (resp. 2×2×2) max-pooling layers. The first convolutional layer typically generates 32 or 64 channels and this number doubles after each pooling as the network deepens. The encoder finally projects each input greyscale image x_n to a latent representation (denoted as z_n in Fig.1). In turn, the decoder branch is built symmetrically with respect to the encoder, except that max-pooling layers are replaced by up-sampling operations (e.g. bi/tri-linear interpolation, transpose convolution). Depending on the binary or multi-class nature of the segmentation issue at hand, a final 1×1 (resp. 1×1×1) convolutional layer with sigmoid or softmax activation achieves pixel-wise segmentation ŷ_n = φ(x_n) at native resolution. V-Net-inspired models may suffer more from high computational cost and GPU memory usage than their 2D counterparts.

Numerous refinements to the U-Net encoder-decoder architecture have been proposed including, to name a few, models which embed encoders pre-trained on large non-medical imaging databases (e.g. ImageNet) to leverage low-level features typically shared between different image types [46], sequential models exploiting residual convolutions [45] (Fig.1) or pyramidal atrous convolutions (instead of pooling operations) [47], as well as alternative attention models (Sect.IV-E) such as attention U-Net [26] which integrates attention gates on skip-connections to focus on salient features. As an extension to vanilla U-Net, U-Net++ [48] relied on re-designed skip-connections through intermediate convolution layers as well as deep supervision (Sect.IV-D). By aggregating features of varying semantic scales at the decoder branch, nested and dense skip-connections act as a flexible feature fusion scheme.

Fig. 1. Residual V-Net inspired convolutional encoder-decoder architecture for medical image segmentation purposes. Refer to Sect.II-C for further details.

D. Data augmentation

Deep segmentation models are most often trained with extensive on-the-fly data augmentation, towards improved generalization properties. By comprising random geometric
transformations (e.g. translation, rotation, scaling, shear, flipping) and random intensity modifications (e.g. normalization, blurring, contrast adjustment), data augmentation can be seen as a clever way to artificially increase the amount of available data, with slightly modified copies of already existing images. In practice, geometric transformations are applied to both greyscale images and ground truth masks whereas intensity transformations only modify source images. Data augmentation enables teaching DL models desired invariance, covariance and robustness properties and strongly reduces over-fitting.

More recently, it was shown that DL models could further benefit from more elaborate data augmentation techniques such as MixUp, which exploits convex combinations of pairs of samples and associated labels to train neural networks [49]. MixUp regularizes the neural network to favor simple linear behavior in-between training examples. Originally proposed for image classification tasks, its extension to image segmentation is straightforward and efficient, as proven in [50] where improved generalization with MixUp as a data augmentation technique is reached for knee MR segmentation purposes. This success in the input data space further inspired the use of MixUp in the latent feature space [51], in a setting referred to as manifold MixUp. As reviewed in [52], synthetic augmentation based on image synthesis methods, for instance exploiting generative adversarial networks (GAN), is another powerful alternative to standard data augmentation since it samples the manifold on which the original training set resides. Although effective, especially in extreme data scarcity scenarios, synthetic augmentation is more demanding to implement. Standard and more sophisticated data augmentation schemes are obviously not mutually exclusive and can be used together.

While data augmentation is typically employed during training, using it at test time has recently started to get special attention. Strongly linked to the way model uncertainty can be quantified [31] (Sect.V-C), test-time data augmentation consists in performing the inference both on original and augmented versions of images, followed by a merging procedure. Gains in performance are reported in various clinical contexts such as lesion segmentation from whole-body PET-CT images [53].

III. CLINICAL NEEDS AND APPLICATIONS

An ever-increasing number of research studies have illustrated the numerous applications of medical image analysis with DL, targeting a large number of pathologies and imaging modalities [16, 21]. On its side, medical image segmentation plays a key role in many medical imaging workflows tailored for diagnosis, disease progression assessment, surgery or therapeutic planning, follow-up, survival analysis, treatment response evaluation, dosimetry and many other applications.

Clinical needs deal, first of all, with organ delineation from anatomical CT or MR imaging, given that clinical parameters (e.g. volume, shape, inner textures) can be exploited as biomarkers to diagnose or quantify disease progression, as in cardiac [54], brain [55] or prostate [56] disorders. Hepatic pathologies with primary or secondary liver lesions are also concerned, thus making fully-automatic liver segmentation [57] particularly useful and requested in clinical routine. Regarding pure organ volumetry, a good example is the automated assessment of the total kidney volume (TKV) from MR images in patients with polycystic kidney disease (PKD), since TKV is the main image-based biomarker to follow PKD progression [58]. Segmenting healthy organs (e.g. liver, kidneys) to obtain a measurement of volume, size or shape is also a relevant use-case in the context of transplant surgery planning [59, 60].

In other works, whole or sub-structure organ segmentation is managed as a first step toward lesion detection and delineation. The main related challenge in this context deals with class imbalance, as most voxels usually belong to the non-diseased class. In particular, there are numerous research works aimed at delineating skin lesions from dermatological images [30, 61], brain tumors from MR images [47, 62, 63], liver lesions from CT scans [57], head and neck primary tumors, lymph nodes and organs at risk from radiotherapy computed tomography (RT-CT) or combined PET and CT images [64–66], breast masses in mammograms [67] or ultrasound images [68], cystic kidney tissues in MR modality [69] or lesions in whole-body images [53]. In oncology, PET and CT imaging hold a special place for disease characterization since they contain complementary information about the metabolic or biochemical function of tissues and organs as well as the anatomy of cancer [70]. Inner-lesion tissue segmentation is also increasingly targeted, as in [71] where both active and necrotic tissues are identified inside liver tumors for patients with hepatocellular carcinoma in dynamic contrast-enhanced CT, or in [72] where low and high-grade gliomas are decomposed into several tissue types comprising necrotic and enhancing cores, non-enhancing tumor and oedema.

Another emerging application deals with automatically extracting blood vessels (e.g. retinal, brain, liver vessels) from medical images [73, 74]. Apart from class imbalance and appearance similarity with non-vascular tissues, vascular segmentation brings additional limitations: complex multi-scale geometry with decreasing diameters and contrast along tree-like networks, inter-patient variability in branching patterns...

Medical image segmentation is also involved in plenty of radiomics pipelines, where it has long been the bottleneck in both time and automation. Thus, extracting radiomic features in an automated and high-throughput way from relevant lesion areas is requested to quantify the characteristics of medical images, comprehensively characterize objects (e.g. tumors, organs, tissues) and finally provide useful guidance for clinicians. Although initially designed to process CT and functional PET images, the radiomics approach can be applied to any imaging modality or radio-tracer [75]. This includes works involving automated DL-based segmentation towards patient outcome prediction such as survival analysis [70] or chemotherapy response assessment and prediction [76].

More marginal applications can also be mentioned, as for the management of musculoskeletal diseases where patient-specific information related to the degree of muscle atrophy across joints is needed to plan interventions and predict interventional outcomes. In particular, DL-based shoulder muscle segmentation from MR images [46] can be employed to analyze the shoulder strength balance, which is particularly important given that a clear relationship between muscle atrophy and strength loss [77] has been established.
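To make the MixUp-style augmentation discussed in Sect. II-D more concrete, the following PyTorch-style sketch mixes pairs of images and one-hot segmentation masks within a batch. The helper name mixup_segmentation and the Beta parameter value are illustrative assumptions rather than the exact implementations of [49, 50].

```python
import torch

def mixup_segmentation(images, masks_onehot, alpha=0.4):
    """Convex combination of image/label pairs (MixUp adapted to segmentation).

    images:       (B, C, H, W) float tensor
    masks_onehot: (B, K, H, W) one-hot ground truth masks with K classes
    alpha:        Beta distribution parameter controlling interpolation strength
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))               # random pairing within the batch
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_masks = lam * masks_onehot + (1.0 - lam) * masks_onehot[perm]
    return mixed_images, mixed_masks                    # train with soft (mixed) targets
```

The soft masks obtained this way can then be fed to the same cross-entropy or Dice loss used for standard training.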
Fig. 2. Integration of anatomical shape priors into a deep segmentation pipeline [24, 26, 35, 80]. Shape priors-based regularization is performed using a shape
encoder arising from a convolutional auto-encoder previously optimized on ground truth segmentation masks. Refer to Sect.IV-C for further details.
$$ R_\phi(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{N} \sum_{n=1}^{N} \left\| F(\mathbf{y}_n; \Theta_F^*) - F(\phi(\mathbf{x}_n); \Theta_F^*) \right\|_2^2 \qquad (7) $$

where w_j and L_CE^j denote the weight and loss for the points of supervision at level j of the decoder, w_f and L_CE^f the weight and loss computed at the final network output (where
Fig. 3. Convolutional encoder-decoder architecture with deep supervision. The overall loss function is the weighted sum of losses estimated at different
decoder levels [25]. C is the number of classes. Refer to Sect.IV-D for further details.
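As a concrete illustration of the deep supervision scheme of Fig. 3, the sketch below forms the overall objective as a weighted sum of cross-entropy losses computed at several decoder levels plus the final output. This is a simplified PyTorch-style example; the helper name deep_supervision_loss, the nearest-neighbour target down-sampling and the weighting scheme are illustrative assumptions, not the exact setup of [25].

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(side_outputs, final_output, target, side_weights, final_weight=1.0):
    """Weighted sum of per-level losses, following the deep supervision idea of Fig. 3.

    side_outputs: list of logits predicted at intermediate decoder levels, (B, C, h_j, w_j)
    final_output: logits at full resolution, (B, C, H, W)
    target:       ground truth label map, (B, H, W) with class indices
    side_weights: one scalar weight w_j per intermediate level
    """
    loss = final_weight * F.cross_entropy(final_output, target)
    for w_j, logits_j in zip(side_weights, side_outputs):
        # down-sample the target (nearest neighbour) to match the level-j resolution
        target_j = F.interpolate(target[:, None].float(), size=logits_j.shape[-2:],
                                 mode="nearest").squeeze(1).long()
        loss = loss + w_j * F.cross_entropy(logits_j, target_j)
    return loss
```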
E. Attention mechanisms

The human visual system can concentrate and focus actively on a tiny portion of highly relevant perceptible information while disregarding other irrelevant perceivable stimuli. Attention mechanisms were introduced in DL frameworks to imitate this aspect of how the human visual system processes information. In general, they can be regarded as a dynamic selection process where the features extracted from images are adaptively weighted to pay attention to the more salient ones, i.e. the features needed for accurately solving a specific image analysis task. Attention mechanisms, particularly in image segmentation, can suppress feature responses in irrelevant background regions, hence reducing the rate of false-positive predictions. This is particularly true for the challenging instances of small objects with high shape variability.

Fig. 4. Flow-chart diagram of a general attention mechanism function. Refer to Sect.IV-E for further details.

The attention problem is usually formulated using three vectors: query, key and value. Conceptually, we can think of key and value as a look-up table in which the query is matched to a key, and the value associated with that key is returned. In image segmentation, it is equivalent to mapping the features of the structure to segment (query) against a collection of plausible target features (keys), then presenting the best-matched regions (values). Mathematically, let us consider that we have a query q ∈ R^{d_q} and M pairs of key k ∈ R^{d_k} and value v ∈ R^{d_v} vectors {(k_1, v_1), ..., (k_M, v_M)}, where all of them can be obtained from intermediate CNN features or embedded patches of an input image x. The attention is computed step-by-step following three operations: alignment, weighting and contextualization. In the alignment step E(·), each query is matched against the M keys to compute a score value. Several commonly used alignment functions are further described in Tab.II, where additive [92] and dot-product [93, 94] functions are the most widely used. In the next step, the alignment scores are passed through a function H(·) (e.g. softmax, sigmoid) to generate the final attention weights by normalizing all the scores to a probability distribution (Eq.10). A contextualization vector (Eq.11) is then instantiated for each q as a weighted sum of the M values v_i by the set of weights
TABLE II. Summary of several popular alignment functions used to compute the matching score between a query and keys. W and V are trainable weight matrices, meanwhile d_k stands for the dimension of a vector k. [.;.] stands for concatenation.
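To illustrate the alignment, weighting and contextualization steps described above with the dot-product alignment of Tab. II, here is a minimal PyTorch-style sketch; the function name scaled_dot_product_attention and the tensor shapes are illustrative assumptions.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Dot-product alignment, softmax weighting, then contextualization (weighted sum of values).

    q: (M_q, d_k) queries    k: (M, d_k) keys    v: (M, d_v) values
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # alignment E(q, k)
    weights = torch.softmax(scores, dim=-1)              # weighting H(.) -> probability distribution
    context = weights @ v                                 # contextualization: weighted sum of the values
    return context, weights
```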
according to coarse-to-fine analysis rules. Conversely, in multi-level multi-resolution models, global inter-organ relations are modeled at coarser resolutions while local organ-specific variations are extracted from higher resolutions. On its side, sequential modeling deals with the analysis of multiple structures following a pre-defined order of increasing complexity. In the same vein, it is also relevant to mention the application related to lesion segmentation, for which delineating the organ of interest as a first stage before localizing its inner lesions is usually followed to promote the extraction of organ-specific features and to narrow the search area (Sect.III). As an example, such a two-step sequential approach has been successfully applied in [57] for liver lesion segmentation from CT scans.

G. Learning frameworks

The success of DL in medical image segmentation not only originates from the development of novel learning paradigms but also from the network architecture design itself and the focus given to data management and optimization processes. Especially, a trend can be noticed towards the development of unified frameworks such as NiftyNet [110] or nnU-Net [28]. Since the design choices towards an optimal framework are usually dedicated to a specific segmentation task (i.e. a given tissue type for a given modality) and cannot easily be transferred to another application, pipelines that can configure their sub-components in an automated fashion are highly requested. In particular, nnU-Net [28], which has been designed to deal with the dataset diversity found in the domain, has proven its efficiency by winning many challenges. It condenses and automates the key decisions for designing a successful pipeline for any given dataset. Thus, nnU-Net has become one of the reference frameworks when targeting a new segmentation task.

In the same scope, neural architecture search (NAS) is another direction under investigation, with the goal to automate the iterative network design process usually handled manually by researchers. Among the existing research works in this area, NAS-UNet [111] is based on the design of three types of primitive operation sets on the search space to automatically find two cell architectures (DownSC, UpSC) for semantic segmentation purposes. Promising results were reported for various imaging modalities including CT, MR and ultrasound.

V. EMERGING TRENDS

A. Medical Transformers

Transformers, as attention-based structures [94], have first demonstrated their tremendous power in natural language processing (NLP) [112, 113] and have gradually gained traction on different computer vision tasks such as image classification, detection, segmentation and video analysis. Their popularity is now also rapidly growing in medical image analysis [114], especially for medical image segmentation, with an exponential growth of related publications in the last year [15, 115]. The pioneering work of vision Transformers [116] was an interesting and meaningful attempt to replace convolutional backbones with convolution-free models. In contrast to CNN, vision Transformers (ViT) offer parallel processing and a complete field of view in a single layer.

ViT has a columnar structure where the 3D input volume x ∈ R^{H×W×D×C} is split into n_p 3D non-overlapping patches {x_1, x_2, x_3, ..., x_{n_p}} with x_i ∈ R^{P×P×P×C}, where C represents the number of modalities, (P, P, P) is the resolution of each patch and n_p = HWD/P^3 is the resulting number of patches, which is also the effective input length of the Transformer.
Since Transformer layers operate on fixed-size 1D sequences of vectors, the n_p patches are flattened and mapped to a d-dimensional embedding space through a trainable linear projection matrix E ∈ R^{(P^3·C)×d}. To preserve spatial information, a 1D positional embedding E_pos ∈ R^{n_p×d} is added to each of the n_p patches, and the resulting sequence of embeddings is used as input to the Transformer encoder:

$$ \mathbf{z}_0 = [\mathbf{x}_1 E;\, \mathbf{x}_2 E;\, \mathbf{x}_3 E;\, \ldots;\, \mathbf{x}_{n_p} E] + E_{pos} \qquad (12) $$

The Transformers-based encoder consists of alternating L layers of multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks. A layer normalization (LN) is applied before each block and a residual connection after each block. One layer of a Transformer block can be formulated as:

$$ \mathbf{z}'_l = \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1}, \qquad \mathbf{z}_l = \mathrm{MLP}(\mathrm{LN}(\mathbf{z}'_l)) + \mathbf{z}'_l \qquad (13) $$

with l = {1, ..., L}. The MSA block consists of h parallel self-attention (SA) heads, where each SA head attends to bring information from different representation sub-spaces at different positions through a scoring function A. To achieve this goal, an input sequence z ∈ R^{n_p×d} is mapped into query (Q ∈ R^{n_p×d_k}), key (K ∈ R^{n_p×d_k}) and value (V ∈ R^{n_p×d_v}) matrices using three learnable parameters: W^q ∈ R^{d×d_k}, W^k ∈ R^{d×d_k} and W^v ∈ R^{d×d_v}.

$$ Q = \mathbf{z}W^q, \qquad K = \mathbf{z}W^k, \qquad V = \mathbf{z}W^v \qquad (14) $$

Then, the attention distribution function is computed following Eq.15 and the resulting attention weights are applied to the V matrix, obtaining the SA maps as described in Eq.16.

$$ A(Q, K) = \mathrm{softmax}\!\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \qquad (15) $$

$$ \mathrm{SA}(Q, K, V) = A(Q, K) \times V \qquad (16) $$

For MSA, Q, K and V are computed once for each head using h different learnable parameters (W^{q,k,v}_i), and the final attention map results from the concatenation of the h heads multiplied by a learnable aggregation matrix W^o ∈ R^{h d_v × d}. The computational cost of single-head attention with full d-dimension is maintained by setting d_k and d_v equal to d/h.

$$ \mathrm{MSA}(Q, K, V) = [\mathrm{head}_1;\, \ldots;\, \mathrm{head}_h]\, W^o \qquad (17) $$

with head_i = SA(Q_i, K_i, V_i) = SA(zW^q_i, zW^k_i, zW^v_i). In the context of medical Transformers, the U-Net-shaped architecture remains the preferred choice to build Transformer segmentation models. From these, and as illustrated in Fig.6, three categories can be identified [117]: pure Transformers-based encoder, hybrid Transformers-CNN encoder, as well as full Transformers-based network.

1) Pure Transformers-based encoder: The first category exploits the global context modeling capability of Transformers to effectively encode the relationships between spatially distant voxels. A convolution-free encoder is introduced by forwarding flattened image representations to Transformers, whose outputs are then reorganized into 3D tensors followed by CNN up-sampling blocks with multi-level feature aggregation. For instance, [118, 119] employed a 3D ViT as an encoder and connected it to the CNN decoder via skip-connections. At the bottleneck of the encoder, the feature map was reshaped and up-sampled by a factor of 2. Then, the previous Transformer layer was used as a skip connection and concatenated with the resized feature to be later up-sampled through convolution, normalization and linear activation. This process was repeated until the initial resolution was reached. However, as anatomical structures can substantially vary in scale, they cannot be properly modelled using a set of fixed sub-regions of the image. Recently, hierarchical ViT such as Swin [120] or PVT [121] Transformers have been introduced to overcome these challenges by extracting features at different resolutions. This improves the performance of Transformers in dense prediction tasks while preserving linear computational complexity with respect to the image size. Hierarchical ViT architectures introduce CNN-like properties into Transformers as they compute local attention with shifted windows, starting from small-sized patches and gradually merging neighboring patches in the subsequent layers. To reduce the design complexity of traditional hierarchical ViT, a 3D U-shape model inspired by nested hierarchical Transformers [122] exploited the idea of global self-attention within smaller non-overlapping 3D blocks [123]. Cross-block self-attention communication was achieved by hierarchically nesting these Transformers and connecting them with a specific aggregation function. Valanarasu et al. proposed in [124] a gated axial-attention model that extended previous designs by incorporating a new control mechanism in the self-attention module. Furthermore, the model operated on the whole image and patches to simultaneously learn both global and local features.

2) Hybrid CNN-Transformers encoder: This second category integrates the global context modeling ability of Transformers with the CNN inductive bias [29, 125, 126]. CNN layers capture the multi-scale context feature maps by stacking convolution blocks. Meanwhile, Transformers capture the long-term dependencies among the features that would be potentially lost with purely convolutional models. Lastly, a CNN-based decoder gradually up-samples the Transformers output into a 4D feature map to recover the full segmentation mask (Fig.7). Other approaches modified the traditional SA blocks using deformable Transformers [127] or squeeze-and-expansion Transformers layers [128]. In [129], Chen et al. proposed a 2D hybrid network that combines two independent self-attention blocks to model the long-range interactions and global spatial relationships. In addition, a multi-scale skip-connection scheme aggregated multiple features in the decoder at different scales to generate more discriminative representations. PC-SwinMorph [130]
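As a compact illustration of Eqs. 12–17, the following PyTorch-style sketch builds the patch-embedding sequence and one pre-norm Transformer layer using the library's built-in multi-head attention. The module name TinyViTBlock and the chosen dimensions are illustrative assumptions, not the architecture of any specific paper cited above.

```python
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    """Patch embedding (Eq. 12) followed by one pre-norm Transformer layer (Eq. 13)."""
    def __init__(self, patch_voxels, d=256, heads=8, n_patches=512):
        super().__init__()
        self.proj = nn.Linear(patch_voxels, d)                  # trainable projection E
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d))   # positional embedding E_pos
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)  # Eqs. 14-17
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, patches):             # patches: (B, n_p, P*P*P*C) flattened 3D patches
        z0 = self.proj(patches) + self.pos  # Eq. 12
        h = self.ln1(z0)
        z = z0 + self.msa(h, h, h, need_weights=False)[0]   # MSA + residual
        return z + self.mlp(self.ln2(z))                     # MLP + residual (Eq. 13)
```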
diction. Such awareness may be crucial to detect potential segmentation errors. There is a clear need to understand the limitations of segmentation models via the assessment of voxel-wise confidence measures, which is the purpose of uncertainty quantification applied to segmentation [147, 148]. Uncertainty modelling may also importantly be used to directly improve segmentation performance, as in [149] where the distinction between quality control and model improvement techniques is highlighted. This is typically done by averaging out voxel-level errors using multiple stochastic forward passes in models that can sample across the uncertainty space [31, 147].

According to Bayesian terminology, uncertainty can be divided into epistemic and aleatoric uncertainty [146, 150]. Epistemic (or model) uncertainty relates to the lack of accuracy in the model parameters due to insufficient training. It is a kind of uncertainty that can be reduced by providing more training time and/or data. Aleatoric uncertainty, on the other hand, relates to the inherent uncertainty of the data itself, which can further be divided into homoscedastic uncertainty (constant for all inputs) and heteroscedastic uncertainty (variable between inputs). In image segmentation, both image and label spaces are affected by aleatoric uncertainty. Example causes of homoscedastic uncertainty in radiation imaging are physical processes such as Compton scattering or positron range. Image-space heteroscedastic uncertainty can be, for instance, due to dataset shifts in multi-center studies, while label-space heteroscedastic uncertainty may be due to heterogeneity in annotation quality [151]. Ideally, uncertainty assessment should be calibrated. Calibration means that prediction confidence c should equal its likelihood of being correct, i.e. a value of c ∈ [0, 100] should indeed translate to a model being accurate c% of the time over multiple instances, which is an open research subject in medical imaging [147, 148, 152].

Several techniques may be employed to produce voxel-level confidence maps for both epistemic and aleatoric uncertainty quantification of segmentation predictions. The most popular epistemic uncertainty measurement is Monte-Carlo (MC) dropout, also known as test-time dropout (TTD) [150], where many (e.g. from 10 to 50) stochastic forward passes of a model equipped with dropout layers are performed at inference time [153–155]. The dissimilarity in the predictions, assessed through variance, mutual information or entropy [154], can then be assimilated to a voxel-wise epistemic uncertainty map. An obvious drawback of MC dropout is the requirement for dropout during training. Dropout may indeed be detrimental to segmentation performance, and a number of state-of-the-art segmentation solutions including nnU-Net [28] do not include dropout. Alternatives to MC dropout for epistemic uncertainty quantification include performing forward passes at various training checkpoints of the optimization [156], following the empirical observation that less certain predictions are less stably predicted along training. A more computationally demanding method is deep ensembling, whereby independently trained networks are averaged together to get uncertainty maps [157]. Albeit more demanding, deep ensembling is a common practice due to its consistency in improving segmentation results [28]. Thus, uncertainty maps may be derived freely as a by-product of this main objective.

Data or aleatoric uncertainty, on the other hand, can be assessed through test-time data augmentation (TTA), in which multiple forward passes are performed on inputs altered through basic data augmentations (e.g. flips, rotations, scaling) [31]. The resulting outputs are then aggregated with methods similar to TTD (i.e. averaging, entropy). TTA is easier to implement than TTD as it does not require any modification to the network architecture and can readily be achieved through off-the-shelf data augmentation and segmentation frameworks. Qualitative results seem to suggest that aleatoric uncertainty estimates provide more expressive qualitative maps for medical image segmentation uncertainty assessment [31].

Regarding the improvement of performance through uncertainty sampling, using epistemic uncertainty modelling with TTD or MC dropout generally yields moderate but consistent improvement of segmentation results [31, 156]. For instance, epistemic uncertainty-aware networks achieved state-of-the-art performance on the medical image segmentation decathlon challenge [153, 158]. On the other hand, TTA seems to be more effective than TTD for improving medical image segmentation results, with performance enhanced by up to several Dice points [31]. TTA is therefore a generally recommended step if inference cost is a secondary concern. The question as to what is the optimal way to pool TTA predictions and which augmentations to select is an open research subject [159]. Applications of deep uncertainty modelling to PET-CT are less popular than in MR modality, with few related works in segmentation [65, 160]. Sudarshan et al. leverage physics-based heteroscedastic uncertainty modelling for low-dose PET-MR image denoising [161]. This relative lack is arguably due to the novelty of the topic in medical imaging. Uncertainty quantification being an emerging trend, more contributions are expected in the future, especially in radiation imaging.

D. Contrastive learning

Whatever the image analysis task involving representation learning, extracting robust features means reaching distinct clusters reflecting the different classes involved. In this direction, contrastive learning tends to enforce the model to learn an efficient and disentangled feature representation by comparing the input image with comparison images (i.e. anchors). The comparison is performed between positive pairs of similar inputs (e.g. generated through data augmentation) and negative pairs of dissimilar inputs (e.g. other image samples used for training). The original contrastive loss was initially defined for classification purposes in computer vision [162] and its adoption in the medical image processing community has been relatively late [163]. The underlying idea was to pull together data points from the same class while pushing apart negative samples in the embedded space [162], thus imposing intra-class cohesion and inter-class separation.

1) Global contrastive learning: Most existing contrastive learning methods deal with a global contrastive loss and target image classification (Fig.8a). For image segmentation, a global contrastive loss can still be used by projecting the data through the encoder path to the latent space [32, 164]. Let us describe
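To ground the TTD and TTA procedures described in Sect. V-C, here is a minimal PyTorch-style sketch that derives a voxel-wise uncertainty map from repeated stochastic forward passes. The helper names (mc_dropout_uncertainty, enable_dropout) and the choice of variance as the dissimilarity measure are illustrative assumptions.

```python
import torch
import torch.nn as nn

def enable_dropout(model):
    """Keep dropout layers active at inference, as required for test-time dropout (TTD)."""
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()

@torch.no_grad()
def mc_dropout_uncertainty(model, image, n_passes=20):
    """Mean prediction and voxel-wise variance over stochastic forward passes."""
    model.eval()
    enable_dropout(model)
    probs = torch.stack([torch.softmax(model(image), dim=1) for _ in range(n_passes)])
    return probs.mean(0), probs.var(0)   # mean probability map and epistemic uncertainty map
```

A TTA variant follows the same pattern, replacing the stochastic passes by passes over flipped or rotated copies of the image whose predictions are mapped back to the original geometry before aggregation.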
boost the performance of the student since only pixel-level information is considered. In this direction, more context and class-related information are needed. As for classification [168], constraints on intermediate multi-scale features arising

comes from different patients?
• How to fuse different modalities to simultaneously reduce the heterogeneity gap and enable knowledge transfer?
• How to map data from one modality to another?

Fig. 10. Pipeline of cross-modal frameworks for paired data. (a) Early or input-level fusion, (b) mid or layer-level fusion, (c) late or decision-level fusion.

network. Early fusion has the advantage of its simplicity, allowing more complex segmentation strategies such as multi-task, multi-view, multi-scale or GAN-based approaches. One of the first works devoted to solving multi-modal image segmentation using CNN can be found in [179]. They explored two-pathway cascaded architectures using different receptive field sizes to capture both local and global context information. Qin et al. proposed in [180] an adaptive convolutional layer named autofocus to effectively change the size of the receptive field to perform multi-modal brain tumor segmentation. The autofocus layer captured multi-scale information by parallelizing multiple convolutional layers with different dilation rates that are later merged using a weighted soft-attention mechanism to choose the optimal scales.

The above-mentioned methods and others [181, 182] did not make dense predictions and are therefore slow in the inference stage. To promote efficiency, encoder-decoder architectures derived from U-Net [14] have been widely adopted. For instance, Shapey et al. used in [183] a 2.5D U-Net to segment the vestibular schwannoma in contrast-enhanced T1-weighted and high-resolution T2-weighted MR imaging. Spatial attention modules were added to each level of the decoder to deal with small target regions, giving more attention to them and penalizing voxels belonging to the background. In [184], the modalities were fused as multi-channel inputs and passed through an adversarial network (Sect.IV-A). The generator is a 3D residual U-Net that performs the segmentation while the discriminator distinguishes between generated segmentation and ground truth masks. An extra constraint was added via active contour modeling by measuring the dissimilarity between ground truth and prediction contours. To handle the class imbalance problem, Zhou et al. carried out in [185] a coarse-to-fine segmentation inspired by model cascades for brain tumor segmentation. The main difference with previous works lies in applying only a one-pass multi-task network (OM-Net) that performs three tasks that are gradually introduced in an order of increasing difficulty based on curriculum learning. The first task learns to differentiate between tumors and normal tissue until the loss curve tends to flatten. The second task is then added and splits the complete tumor into intra-tumoral classes. This task continues until its loss curve displays a flattening trend. Lastly, the third task is introduced and trained simultaneously with the previous ones to precisely segment the enhancing tumor. In this way, the model parameters and the training data are transferred from an easier to a more difficult task. Unfortunately, early fusion makes it hard to discover highly non-linear relationships between the low-level features from different modalities, especially when the modalities have significantly different statistical properties.

2) Paired data and mid-fusion strategy: Mid or layer-level fusion separately processes the multi-modal data in different paths (Fig.10b). For each modality m, the input x_m is encoded in each branch f_m to learn the modality-specific representation z_m. Then, each representation z_m is mapped into a common latent space via a fusion operation Λ, and this is used as input of the decoding transformations g(Λ(z_1, z_2, ..., z_m)). The main goal of this strategy is to learn an optimal joint representation that emphasizes the most informative features across modalities. In mid-fusion, we can distinguish two types of multi-pathway network architectures based on the fusion strategy: single-layer or multi-layer fusion.

Multi-modal segmentation networks based on single-layer fusion generally employ encoder-decoder architectures where each modality has its own encoder, with no interactions between them, and a single decoder. They mainly differ in the conducted fusion operation, which is typically carried out via concatenation, addition, averaging or convolution. For instance, Havaei et al. used in [186] modality-specific convolutional layers to later compute for each feature map the first and second moments. Then, the moments were concatenated and processed by further convolutional layers, yielding the final segmentation. For their part, Tseng et al. took in [187] the encoded representation from each modality and performed a cross-modal convolution to combine the spatial information of each feature map, modeling the correlations among them.
Inspired by the success of the attention mechanism (Sect.IV-E), recent fusion strategies incorporated spatial and channel-wise attention to learn more informative features among modalities [188, 189]. To name a few, Zhou et al. proposed in [188] a three-stage segmentation network. In the first stage, a 3D U-Net architecture produced rough mask predictions. Then, binarization and erosion operations were used to obtain the context constraints for the following stage. The second stage consisted of a multi-encoder-based framework where each encoder produces a modality-specific latent representation that is further fused with the assistance of attention mechanisms. This process was repeated for each structure to be segmented. In the third stage, a two-encoder-based 3D U-Net segmentation network was applied to combine and refine the three single prediction results. A correlation block to discover the latent correlation between modalities was introduced in [189], followed by a dual attention block that consists of a modality attention module and a spatial attention module. In this way, the network is encouraged to learn the most correlated features across modalities and more useful spatial information to boost segmentation results. Despite the great results of single-layer fusion schemes, there is no complete freedom to learn within and in-between modalities due to their single level of abstraction.

Regarding multi-layer fusion, it extends the idea of residual learning to multi-modal frameworks, allowing skip-connections that by-pass spatial features between modalities [190–192]. Therefore, low-level and high-level features are fused at different levels of abstraction, increasing the learning capabilities of the network to capture complex cues across modalities. Li et al. proposed in [190] four dilated Inception blocks consisting of three dilated convolutional layers for each modality. In this way, the receptive field of the network was expanded without losing resolution, while multi-scale features were also learned. In order to obtain the final segmentation, the features at different levels were concatenated and up-sampled. On the other hand, Dolz et al. proposed in [191] HyperDenseNet, a 3D model where each modality has its own path. Dense connections not only occur between the layers within the same stream but also across modalities. Thus, the network can learn more powerful feature representations at all levels of abstraction. To encode richer contextual information across modalities, Zhang et al. developed in [192] a cross-modal self-attention distillation network. The model extracted attention maps of intermediate layers to further perform layer-wise attention distillation among modalities. Significant spatial information can be distilled from an attention map of one modality and then used to ease attention learning of the other modalities. Fusing multi-modal contextual information at multi-layer stages represents the current trend. Moreover, semantic guiding across modalities by attention mechanisms can be used to bridge early feature extraction and late decision-making.

3) Paired data and late fusion: Similar to mid-fusion, late fusion separately processes the multi-modal data (x_1, x_2, ..., x_m) with the difference that the segmentation branches (φ_1, φ_2, ..., φ_m) are integrated at the decision level. More precisely, during the decoding stage, all feature maps computed by the branches are mapped into a common feature space via fusion operations (e.g. concatenation, averaging, weighted voting), followed by a series of convolutional layers [193]. The final output of late fusion can be formulated as y = Λ(φ_1(x_1), φ_2(x_2), ..., φ_m(x_m)) where Λ is the fusion operation. Thus, common features learned by the transformation network are considered as a further refinement of decoding and prediction. Some conventional layer-level methods such as [194] are thus categorized into the late fusion strategy. Many late fusion strategies have been proposed; most of them are based on averaging or majority voting. For the averaging strategy, Kamnitsas et al. trained in [72] three networks separately and then averaged the confidences of the individual networks. The final segmentation was obtained by assigning to each voxel the label with the highest confidence. For the majority voting strategy, the final label of a voxel depends on the majority of the labels of the individual networks. The statistical properties of the different modalities are different, which makes it difficult for a single model to directly find correlations across modalities. Therefore, in a decision-level fusion scheme, the multiple segmentation networks can be trained to fully exploit multi-modal features. On the other hand, Zhang et al. proposed in [194] a modality-aware module that fused the modality-specific models at a high semantic level. Specifically, each modality was embedded by a different modality-specific FCN. Then, the outputs of the FCN models were fused and passed to an attention module to generate a modality-specific attention map to adaptively measure the contributions of each modality. Moreover, they designed a mutual learning strategy to enable interactive knowledge transfer, where the modalities interact as teacher and student simultaneously. In the same line, Zhang et al. employed in [192] a knowledge transfer strategy across modalities that differs from previous works in the use of GAN. The authors applied cycleGAN [195] to capture the knowledge transition across modalities. Each generator represented a single-modality feature learning branch. Then, they were merged by extra convolution layers followed by an attention block to learn powerful fusion features. The intuition behind the use of GAN is that GAN models can learn the modality patterns of each modality and their content patterns.

Mid and late fusion can achieve better performance because each modality is employed as an input to one network that can learn complex and complementary feature information compared to an input-level fusion network. However, they require more memory due to the use of multiple networks. Therefore, the trade-off between accuracy and execution time should be carefully considered. Despite the impressive advances reached in the field of multi-modal image segmentation, collecting large sets of paired images is often either prohibitively expensive or not possible. As a result, techniques that make use of unpaired datasets have attracted increasing attention in cross-modal segmentation.

4) Unpaired data and domain adaptation: When only unpaired datasets are available, cross-modal segmentation is commonly managed by domain adaptation (DA) techniques. Let X_m × Y_m represent the joint feature space and the corresponding label space for a specific modality m. A domain can
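As a minimal illustration of the decision-level fusion y = Λ(φ_1(x_1), ..., φ_m(x_m)) described in the late-fusion paragraph above, the PyTorch-style sketch below averages the softmax outputs of one independently trained network per modality; the function name late_fusion_average and the averaging rule are illustrative assumptions.

```python
import torch

@torch.no_grad()
def late_fusion_average(models, inputs):
    """Decision-level fusion: average per-modality probability maps, then take the arg-max label.

    models: list of trained segmentation networks, one per modality (phi_1, ..., phi_m)
    inputs: list of corresponding modality images x_1, ..., x_m, each (B, C_m, H, W)
    """
    probs = [torch.softmax(model(x), dim=1) for model, x in zip(models, inputs)]
    fused = torch.stack(probs).mean(0)      # fusion operation Lambda (here: averaging)
    return fused.argmax(dim=1)              # final label map (B, H, W)
```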
models were therefore used to increase the number of training samples through image synthesis [201]. As an alternative, the availability of only a reduced dataset motivated researchers to exploit the power of unlabeled images, which may be easier to collect. Unsupervised learning approaches assuming that a related but unlabeled large dataset is available aim at learning transferable feature representations from unlabeled images. In particular, self-supervised learning can consist of pre-training the model (or any of its constituents) by means of pretext tasks to finally be more able to efficiently delineate the targeted structures [36]. With the hypothesis of good generalization ability of self-learned features, Taleb et al. investigated the effectiveness of several pretext tasks for 3D medical image segmentation purposes: rotation prediction, jigsaw puzzles, relative patch location... More complex pretext tasks such as semantic inpainting revealed their effectiveness for better solving downstream segmentation tasks [202]. Leaving aside self-supervised pretext tasks, self-supervised contrastive learning (Sect.V-D) is another manner adopted in the medical image analysis area to learn expressive feature representations from unlabeled images [203]. In this direction, Chaitanya et al. proposed in [32] a local contrastive loss able to capture local features to provide complementary information and therefore boost the segmentation accuracy.

D. Semi-supervised learning

Given the usual scarcity of many existing annotated medical datasets, and apart from transfer learning [204] whose goal is to learn from related learning problems, researchers have also explored semi-supervised learning approaches to exploit the availability of unlabeled datasets. Among the existing semi-supervised strategies [37], semi-supervised consistency regularization is commonly employed through the use of a mean teacher model [205]. In [206], a novel uncertainty-aware semi-supervised learning framework was proposed and evaluated for left atrium segmentation from MR images. Teacher and student models were built in such a way that the student model learned from the teacher model by minimizing the segmentation loss on the labeled data as well as a consistency loss with respect to the targets from the teacher model on all data (i.e. labeled and unlabeled). The predicted targets from the teacher model being potentially unreliable and noisy on unlabeled data, Yu et al. designed an uncertainty-aware mean teacher framework [206], where the student model gradually learned from meaningful and reliable targets by exploiting the uncertainty information arising from the teacher model (Sect.V-C). To better deal with noisy labels for COVID-19 pneumonia lesion segmentation, the main novelty in [207] was to propose two mechanisms: an adaptive teacher that suppresses the contribution of the student when the latter has a large training loss, and an adaptive student that learns from the teacher only when the teacher outperforms the student.

As followed in [208], semi-supervised pseudo-labeling is another strategy to deal with both annotated and unlabeled data. Thus, Fan et al. generated pseudo-labels by relying on a first training with 50 labeled images only. The newly pseudo-labeled examples were then included in the original labeled training dataset to re-train the model. This updated model was used to generate pseudo-labels for another batch of unlabeled images, and so on. This process was repeated until efficient performance was obtained in COVID-19 lung infection CT segmentation. However, the created pseudo-labels usually do not have the same quality as ground truth labels, which may limit their potential for improvements from unlabeled data.

As a powerful alternative, one may adopt an auxiliary task on unlabeled data to facilitate performing image segmentation with limited labeled data. In this direction, Chen et al. proposed in [145] a semi-supervised image segmentation method that simultaneously optimizes both supervised segmentation and unsupervised reconstruction objectives. The reconstruction task had the particularity to exploit an attention mechanism that separated the reconstruction of image regions corresponding to different classes. Such a simple yet effective multi-task learning scheme (Sect.V-B) achieved strong improvements for brain tumor and white matter hyper-intensities segmentation.

E. Federated Learning

Collecting large medical image datasets is a difficult and time-consuming task for research needs. Accurate labeling of these images requires clinical experience and is challenging to obtain. Many imaging centers have large image datasets, but many of them are unorganized or poorly annotated in spite of their richness regarding deep model training [209, 210]. In addition, medical images are usually linked to personal health information related to the patient. Data protection to prevent sharing sensitive patient data is essential when working with multiple medical institutions in a collaborative manner.

To solve this issue, federated learning (FL) enables distributed training of DL models without actually sharing data between multiple clinical institutions. Fig.12 shows the general framework of federated learning. To work in a collaborative fashion, FL allows various clinical institutes or hospitals to work in coordination by using a central server. Each hospital keeps an individual model which focuses on the local data only. Before the training process, each institution submits a request to download the global model from a central server. The requested query is then approved by the central server and the global model weights can be downloaded. Once the training process is executed, the local client model weights are sent to the central server for updating purposes. The central server aggregates the feedback received from individual institutions and updates the global model weights based on pre-defined rules. These rules permit the model to measure the quality of the feedback obtained from the client servers.

More and more research works in medical image segmentation involve a FL scheme [211]. Recently, Xu et al. introduced in [212] a new federated cross-learning segmentation approach that handles data that are not independently and identically distributed. Unlike conventional FL methods that combine multiple individually trained local models on a server node, the proposed method, named FedCross, consecutively trained the global model across multiple clients in a round-robin fashion. The authors also suggested a new federated cross-ensemble learning technique that jointly trains and sets up
19
+ Central server +
Data
+ Data
downloading
uploading
+ +
Data Data
Fig. 12. General framework of federated learning. Refer to Sect.VI-E for further details.
Privacy regulations typically prevent imaging data from being gathered and shared across multiple medical institutions in a collaborative manner. To solve this issue, federated learning (FL) enables the distributed training of DL models without directly sharing data between clinical institutions. Fig. 12 shows the general framework of federated learning. To work in a collaborative fashion, FL allows various clinical institutions or hospitals to work in coordination through a central server, while each hospital keeps an individual model which focuses on its local data only. Before training, each institution submits a request to the central server; once the request is approved, the global model weights are downloaded. After local training has been performed, the updated client model weights are sent back to the central server. The server aggregates the feedback received from the individual institutions and updates the global model weights according to pre-defined rules, which allow the quality of the feedback obtained from the client sites to be taken into account.
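As an illustration of the aggregation rules mentioned above, the following sketch performs a weighted averaging of client model weights in the spirit of FedAvg, with weights proportional to the local sample counts. This is only one plausible instance of such pre-defined rules and not the specific scheme used in the works discussed below.

```python
from typing import Dict, List
import torch

def aggregate_client_weights(client_states: List[Dict[str, torch.Tensor]],
                             client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Server-side aggregation: weighted average of client state_dicts.

    Each client contribution is weighted by its number of local training
    samples (a FedAvg-like rule); quality-aware weights could be used instead.
    """
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]
    global_state = {}
    for key in client_states[0]:
        global_state[key] = sum(w * state[key].float()
                                for w, state in zip(weights, client_states))
    return global_state

# One communication round (sketch): clients download the global weights,
# train locally, upload their updated weights, and the server aggregates them:
# global_model.load_state_dict(aggregate_client_weights(states, sizes))
```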
More and more research works in medical image segmentation involve an FL scheme [211]. Recently, Xu et al. introduced in [212] a new federated cross-learning segmentation approach that handles data that are not independently and identically distributed. Unlike conventional FL methods that combine multiple individually trained local models on a server node, the proposed method, named FedCross, consecutively trained a single global model across multiple clients in a round-robin fashion. The authors also suggested a federated cross-ensemble learning technique that jointly trains and maintains multiple models.
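To contrast cross-learning with server-side averaging, the snippet below outlines how a single global model can be trained sequentially across clients in a round-robin fashion, loosely following the FedCross idea [212]; the local update routine and client ordering are placeholders rather than the authors' exact procedure.

```python
import copy
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, lr=1e-3):
    # Placeholder local update: one epoch of supervised training on a client.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    return model

def round_robin_training(global_model, client_loaders, num_rounds=10):
    # A single model is passed from client to client instead of averaging
    # independently trained local models on the server.
    model = copy.deepcopy(global_model)
    for _ in range(num_rounds):
        for loader in client_loaders:  # round-robin over participating sites
            model = train_one_epoch(model, loader)
    return model
```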
Wicaksana et al. [213] proposed FedMix, an FL strategy that employed mixed image labels to segment anatomical regions-of-interest from medical images. These labels ranged from strong pixel-wise annotations to weak bounding boxes and image-wise class annotations. Pseudo-annotations were first created on each client and refined under the available supervision. Each client then selected its highest-quality data through active sample picking for local model updates. Based on the quantity and quality of the local data, a dynamic aggregation scheme adjusted the weight given to each client during global updates. FedMix was validated on breast tumor segmentation from ultrasound images and on skin lesion segmentation.
Wu et al. [38] introduced a federated contrastive learning (FCL) framework for 3D volumetric image segmentation that requires limited annotations only. The local clients first learned a shared encoder from unlabeled images before annotated images were incorporated to fine-tune the model. Through feature exchange, in which each client exchanges the features (i.e. low-dimensional vectors) of its local data with other clients, the approach enables better local contrastive learning while avoiding raw data sharing. A global structural matching technique was further developed so that the locally encoded features share a similar structure with the representations exchanged by remote clients.
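A minimal sketch of such feature exchange is given below, assuming that each client receives low-dimensional feature vectors from remote clients and uses them as additional negatives in an InfoNCE-style contrastive loss. The loss form and sampling strategy are simplified with respect to [38].

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(anchor, positive, local_negatives, remote_negatives, tau=0.1):
    # InfoNCE-style loss in which feature vectors exchanged by other clients act as
    # extra negatives, enlarging the negative set without sharing raw images.
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(torch.cat([local_negatives, remote_negatives], dim=0), dim=1)
    pos_sim = (anchor * positive).sum(dim=1, keepdim=True) / tau   # (B, 1)
    neg_sim = anchor @ negatives.t() / tau                         # (B, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```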
More globally, FL has shown potential for improving the accuracy of medical image segmentation while protecting the privacy of individual patient data. By offering scalability, flexible training scheduling and large training datasets through multi-site collaborations [210], FL combines the essential conditions for a wider deployment in various clinical settings.
F. Active learning

Active learning (AL) is a learning technique that involves training a model on a small, initial set of labeled data and then iteratively selecting new data to be labeled and added to the training set (Fig. 13). Thus, it assists annotators in the annotation process by selecting the most useful samples to train a DL-based model. This is particularly useful in medical image segmentation, as manually labeling large amounts of medical images is time-consuming and costly [52]. By using AL, the model can learn to accurately segment images with less human input, making the process more efficient and cost-effective. Additionally, AL allows the model to focus on the most difficult and important samples, resulting in improved delineation performance. Nevertheless, choosing the data that best improves the model learning capability remains challenging. In particular, there are multiple ways to measure informativeness, which mainly involve uncertainty and representativeness criteria [39]. DL-based segmentation methods are able to measure uncertainty (Sect.V-C) to some extent: summing the lowest class probability computed for each voxel is one of the simplest manners. If the prediction is uncertain, additional annotated data are required to exploit richer feature information. Conversely, representativeness deals with choosing samples from distinct regions of the data distribution such that the variability of the whole dataset is taken into account. In this context, a good balance between exploration and exploitation of the data distribution is highly desirable.

To name a few examples, a cascaded 3D U-Net with CNN-corrected label curation was employed in [214] for kidney segmentation from abdominal CT images in order to save annotation efforts and improve segmentation outcomes. AL was concluded to reduce labeling efforts through CNN-corrected segmentation and to increase training efficiency through iterative learning with limited data. Shen et al. presented in [215] an AL approach able to alleviate the image annotation burden for brain tumor segmentation. The authors combined both uncertainty and representativeness information to ensure that AL selects enough informative and diverse data. Contrary to existing studies based on uncertainty or representativeness estimated at the scale of a single image, Yan et al. scored in [216] dual-view mammograms according to their prediction consistency, towards better breast mass segmentation.
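As a simple illustration of the least-confidence criterion mentioned above, the sketch below ranks unlabeled volumes by summing the per-voxel lowest class probabilities predicted by the current model. It is a generic example rather than the selection strategy of [214], [215] or [216], and the representativeness term needed for a good exploration-exploitation balance is deliberately left out for brevity.

```python
from typing import List, Tuple
import torch

@torch.no_grad()
def rank_by_uncertainty(model, unlabeled_volumes: List[torch.Tensor],
                        top_k: int = 10) -> List[int]:
    """Return the indices of the top_k most uncertain unlabeled volumes.

    Uncertainty score: per-voxel lowest class probability, summed over the
    volume (a higher score means a less confident, hence more informative, case).
    """
    model.eval()
    scores: List[Tuple[float, int]] = []
    for idx, volume in enumerate(unlabeled_volumes):
        probs = torch.softmax(model(volume.unsqueeze(0)), dim=1)  # (1, C, D, H, W)
        lowest_class_prob = probs.min(dim=1).values               # (1, D, H, W)
        scores.append((lowest_class_prob.sum().item(), idx))
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:top_k]]

# The selected volumes are sent to annotators, labeled, added to the training
# set, and the model is retrained before the next active learning round.
```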
Image Analysis, 2022. organ segmentation: The FLARE challenge,” Medical Image Analysis,
[22] V. K. Singh, H. A. Rashwan, S. Romani, F. Akram, N. Pandey, p. 102616, 2022.
M. M. K. Sarker, A. Saleh, M. Arenas, M. Arquez, D. Puig et al., [41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
“Breast tumor segmentation and shape classification in mammograms arXiv preprint arXiv:1412.6980, 2014.
using generative adversarial and convolutional neural network,” Expert [42] J. Ma, J. Chen, M. Ng, R. Huang, Y. Li, C. Li, X. Yang, and A. L.
Systems with Applications, vol. 139, p. 112855, 2020. Martel, “Loss odyssey in medical image segmentation,” Medical Image
[23] H. R. Roth, C. Shen, H. Oda, T. Sugino, M. Oda, Y. Hayashi, K. Mi- Analysis, vol. 71, p. 102035, 2021.
sawa, and K. Mori, “A multi-scale pyramid of 3D fully convolutional [43] D. Ciresan, A. Giusti, L. Gambardella, and J. Schmidhuber, “Deep
networks for abdominal multi-organ segmentation,” in International neural networks segment neuronal membranes in electron microscopy
Conference on Medical Image Computing and Computer-Assisted images,” Advances in Neural Information Processing Systems, vol. 25,
Intervention, 2018, pp. 417–425. 2012.
[24] A. Boutillon, B. Borotikar, V. Burdin, and P.-H. Conze, “Multi-structure [44] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
bone segmentation in pediatric mr images with combined regularization for semantic segmentation,” in IEEE Conference on Computer Vision
from shape priors and adversarial network,” Artificial Intelligence in and Pattern Recognition, 2015, pp. 3431–3440.
Medicine, vol. 132, p. 102364, 2022. [45] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional
[25] H. Dou, D. Karimi, C. K. Rollins, C. M. Ortinau, L. Vasung, neural networks for volumetric medical image segmentation,” in Inter-
C. Velasco-Annis, A. Ouaalam, X. Yang, D. Ni, and A. Gholipour, “A national Conference on 3D Vision, 2016, pp. 565–571.
deep attentive convolutional neural network for automatic cortical plate [46] P.-H. Conze, S. Brochard, V. Burdin, F. T. Sheehan, and C. Pons,
segmentation in fetal MRI,” IEEE Transactions on Medical Imaging, “Healthy versus pathological learning transferability in shoulder muscle
vol. 40, no. 4, pp. 1123–1133, 2020. MRI segmentation using deep convolutional encoder-decoders,” Com-
[26] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, puterized Medical Imaging and Graphics, vol. 83, p. 101733, 2020.
K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention [47] Z. Zhou, Z. He, and Y. Jia, “AFPNet: A 3D fully convolutional
U-Net: Learning where to look for the pancreas,” arXiv preprint neural network with atrous-convolution feature pyramid for brain tumor
arXiv:1804.03999, 2018. segmentation via MRI images,” Neurocomputing, vol. 402, pp. 235–
[27] J. J. Cerrolaza, M. L. Picazo, L. Humbert, Y. Sato, D. Rueckert, 244, 2020.
M. Á. G. Ballester, and M. G. Linguraru, “Computational anatomy for [48] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++:
multi-organ analysis in medical imaging: A review,” Medical Image Redesigning skip connections to exploit multiscale features in image
Analysis, vol. 56, pp. 44–67, 2019. segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6,
[28] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier- pp. 1856–1867, 2019.
Hein, “nnU-Net: a self-configuring method for deep learning-based [49] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “MixUp:
biomedical image segmentation,” Nature Methods, vol. 18, no. 2, pp. Beyond empirical risk minimization,” in International Conference on
203–211, 2021. Learning Representations, 2018.
[29] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. [50] E. Panfilov, A. Tiulpin, S. Klein, M. T. Nieminen, and S. Saarakkala,
Yuille, and Y. Zhou, “TransUNet: Transformers make strong encoders “Improving robustness of deep learning based knee MRI segmentation:
for medical image segmentation,” arXiv preprint arXiv:2102.04306, MixUp and adversarial domain adaptation,” in IEEE/CVF International
2021. Conference on Computer Vision Workshops, 2019.
[30] L. Song, J. Lin, Z. J. Wang, and H. Wang, “An end-to-end multi-task [51] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-
deep learning framework for skin lesion analysis,” IEEE Journal of Paz, and Y. Bengio, “Manifold MixUp: Better representations by
Biomedical and Health Informatics, vol. 24, no. 10, pp. 2912–2921, interpolating hidden states,” in International Conference on Machine
2020. Learning, 2019, pp. 6438–6447.
[31] G. Wang, W. Li, M. Aertsen, J. Deprest, S. Ourselin, and T. Ver- [52] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding,
cauteren, “Aleatoric uncertainty estimation with test-time augmentation “Embracing imperfect datasets: A review of deep learning solutions
for medical image segmentation with convolutional neural networks,” for medical image segmentation,” Medical Image Analysis, vol. 63, p.
Neurocomputing, vol. 338, pp. 34–45, 2019. 101693, 2020.
[32] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu, “Contrastive [53] S. Amiri and B. Ibragimov, “Improved automated lesion segmenta-
learning of global and local features for medical image segmentation tion in whole-body FDG/PET-CT via test-time augmentation,” arXiv
with limited annotations,” Advances in Neural Information Processing preprint arXiv:2210.07761, 2022.
Systems, vol. 33, pp. 12 546–12 558, 2020. [54] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A.
[33] D. Qin, J.-J. Bu, Z. Liu, X. Shen, S. Zhou, J.-J. Gu, Z.-H. Wang, Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester et al.,
L. Wu, and H.-F. Dai, “Efficient medical image segmentation based “Deep learning techniques for automatic mri cardiac multi-structures
on knowledge distillation,” IEEE Transactions on Medical Imaging, segmentation and diagnosis: is the problem solved?” IEEE Transactions
vol. 40, no. 12, pp. 3820–3831, 2021. on Medical Imaging, vol. 37, no. 11, pp. 2514–2525, 2018.
[34] T. Zhou, S. Ruan, and S. Canu, “A review: Deep learning for medical [55] S. S. M. Salehi, D. Erdogmus, and A. Gholipour, “Auto-context con-
image segmentation using multi-modality fusion,” Array, vol. 3, p. volutional neural network (Auto-Net) for brain extraction in magnetic
100004, 2019. resonance imaging,” IEEE Transactions on Medical Imaging, vol. 36,
[35] A. Boutillon, P.-H. Conze, C. Pons, V. Burdin, and B. Borotikar, no. 11, pp. 2319–2330, 2017.
“Generalizable multi-task, multi-domain deep segmentation of sparse [56] Y. Wang, Z. Deng, X. Hu, L. Zhu, X. Yang, X. Xu, P.-A. Heng,
pediatric imaging datasets via multi-scale contrastive regularization and and D. Ni, “Deep attentional features for prostate segmentation in
multi-joint anatomical priors,” Medical Image Analysis, p. 102556, ultrasound,” in International Conference on Medical Image Computing
2022. and Computer-Assisted Intervention, 2018, pp. 523–530.
[36] A. Taleb, W. Loetzsch, N. Danz, J. Severin, T. Gaertner, B. Bergner, [57] P. F. Christ, M. E. A. Elshaer, F. Ettlinger, S. Tatavarty, M. Bickel,
and C. Lippert, “3D self-supervised methods for medical imaging,” P. Bilic, M. Rempfler, M. Armbruster, F. Hofmann, M. D’Anastasi
Advances in Neural Information Processing Systems, vol. 33, pp. et al., “Automatic liver and lesion segmentation in CT using cas-
18 158–18 172, 2020. caded fully convolutional neural networks and 3D conditional random
[37] V. Cheplygina, M. de Bruijne, and J. P. Pluim, “Not-so-supervised: fields,” in International Conference on Medical Image Computing and
a survey of semi-supervised, multi-instance, and transfer learning in Computer-Assisted Intervention, 2016, pp. 415–423.
medical image analysis,” Medical Image Analysis, vol. 54, pp. 280– [58] T. L. Kline, P. Korfiatis, M. E. Edwards, J. D. Blais, F. S. Czerwiec,
296, 2019. P. C. Harris, B. F. King, V. E. Torres, and B. J. Erickson, “Performance
[38] Y. Wu, D. Zeng, Z. Wang, Y. Shi, and J. Hu, “Federated contrastive of an artificial multi-observer deep neural network for fully automated
learning for volumetric medical image segmentation,” in International segmentation of polycystic kidneys,” Journal of Digital Imaging,
Conference on Medical Image Computing and Computer-Assisted vol. 30, no. 4, pp. 442–448, 2017.
Intervention, 2021, pp. 367–377. [59] A. E. Kavur, N. S. Gezer, M. Barış, S. Aslan, P.-H. Conze, V. Groza,
[39] S. Budd, E. C. Robinson, and B. Kainz, “A survey on active learning D. D. Pham, S. Chatterjee, P. Ernst, S. Özkan et al., “CHAOS
and human-in-the-loop deep learning for medical image analysis,” challenge-combined (CT-MR) healthy abdominal organ segmentation,”
Medical Image Analysis, vol. 71, p. 102062, 2021. Medical Image Analysis, vol. 69, p. 101950, 2021.
[40] J. Ma, Y. Zhang, S. Gu, X. An, Z. Wang, C. Ge, C. Wang, F. Zhang, [60] P.-H. Conze, A. E. Kavur, E. Cornec-Le Gall, N. S. Gezer, Y. Le Meur,
Y. Wang, Y. Xu et al., “Fast and low-GPU-memory abdomen CT M. A. Selver, and F. Rousseau, “Abdominal multi-organ segmentation
with cascaded convolutional and adversarial deep networks,” Artificial on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
Intelligence in Medicine, vol. 117, p. 102109, 2021. [79] Y. Huo, Z. Xu, S. Bao, C. Bermudez, H. Moon, P. Parvathaneni, T. K.
[61] B. Lei, Z. Xia, F. Jiang, X. Jiang, Z. Ge, Y. Xu, J. Qin, S. Chen, Moyo, M. R. Savona, A. Assad, R. G. Abramson et al., “Splenomegaly
T. Wang, and S. Wang, “Skin lesion segmentation via generative ad- segmentation on multi-modal MRI using deep convolutional networks,”
versarial networks with dual discriminators,” Medical Image Analysis, IEEE Transactions on Medical Imaging, vol. 38, no. 5, pp. 1185–1196,
vol. 64, p. 101716, 2020. 2018.
[62] K. Kamnitsas, C. Baumgartner, C. Ledig, V. Newcombe, J. Simpson, [80] H. Ravishankar, R. Venkataramani, S. Thiruvenkadam, P. Sudhakar, and
A. Kane, D. Menon, A. Nori, A. Criminisi, D. Rueckert et al., V. Vaidya, “Learning and incorporating shape models for semantic seg-
“Unsupervised domain adaptation in brain lesion segmentation with mentation,” in International Conference on Medical Image Computing
adversarial networks,” in International Conference on Information and Computer-Assisted Intervention, 2017, pp. 203–211.
Processing in Medical Imaging, 2017, pp. 597–609. [81] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
[63] C. Chen, X. Liu, M. Ding, J. Zheng, and J. Li, “3D dilated multi- S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
fiber network for real-time brain tumor segmentation in MRI,” in Advances in Neural Information Processing Systems, vol. 27, 2014.
International Conference on Medical Image Computing and Computer- [82] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Og-
Assisted Intervention, 2019, pp. 184–192. den, “Pyramid methods in image processing,” RCA Engineer, vol. 29,
[64] D. Jin, D. Guo, T.-Y. Ho, A. P. Harrison, J. Xiao, C.-k. Tseng, and no. 6, pp. 33–41, 1984.
L. Lu, “Deep esophageal clinical target volume delineation using [83] Z. Tu and X. Bai, “Auto-context and its application to high-level vision
encoded 3D spatial context of tumors, lymph nodes, and organs at tasks and 3D brain image segmentation,” IEEE Transactions on Pattern
risk,” in International Conference on Medical Image Computing and Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1744–1757,
Computer-Assisted Intervention, 2019, pp. 603–612. 2010.
[65] J. Ma and X. Yang, “Combining CNN and hybrid active contours for [84] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to
head and neck tumor segmentation in CT and PET images,” in 3D scale: Scale-aware semantic image segmentation,” in IEEE Conference
Head and Neck Tumor Segmentation in PET/CT Challenge, 2020, pp. on Computer Vision and Pattern Recognition, 2016, pp. 3640–3649.
59–64. [85] J. Wei, Z. Wu, L. Wang, T. D. Bui, L. Qu, P.-T. Yap, Y. Xia, G. Li,
[66] W. Lei, H. Mei, Z. Sun, S. Ye, R. Gu, H. Wang, R. Huang, S. Zhang, and D. Shen, “A cascaded nested network for 3T brain MR image
S. Zhang, and G. Wang, “Automatic segmentation of organs-at-risk segmentation guided by 7T labeling,” Pattern Recognition, vol. 124, p.
from head-and-neck CT using separable convolutional neural network 108420, 2022.
with hard-region-weighted loss,” Neurocomputing, vol. 442, pp. 184– [86] M. S. Nosrati and G. Hamarneh, “Incorporating prior knowl-
199, 2021. edge in medical image segmentation: a survey,” arXiv preprint
[67] Y. Yan, P.-H. Conze, G. Quellec, M. Lamard, B. Cochener, and arXiv:1607.01092, 2016.
G. Coatrieux, “Two-stage multi-scale breast mass segmentation for full [87] H. Chen, X. Qi, L. Yu, Q. Dou, J. Qin, and P.-A. Heng, “DCAN:
mammogram analysis without user intervention,” Biocybernetics and Deep contour-aware networks for object instance segmentation from
Biomedical Engineering, vol. 41, no. 2, pp. 746–757, 2021. histology images,” Medical Image Analysis, vol. 36, pp. 135–146, 2017.
[68] A. Vakanski, M. Xian, and P. E. Freer, “Attention-enriched deep [88] P.-A. Ganaye, M. Sdika, B. Triggs, and H. Benoit-Cattin, “Remov-
learning model for breast tumor segmentation in ultrasound images,” ing segmentation inconsistencies with semi-supervised non-adjacency
Ultrasound in Medicine & Biology, vol. 46, no. 10, pp. 2819–2833, constraint,” Medical Image Analysis, vol. 58, p. 101551, 2019.
2020. [89] S. Xie and Z. Tu, “Holistically-nested edge detection,” in IEEE
[69] T. L. Kline, M. E. Edwards, J. Fetzer, A. V. Gregory, D. Anaam, A. J. International Conference on Computer Vision, 2015, pp. 1395–1403.
Metzger, and B. J. Erickson, “Automatic semantic segmentation of [90] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
kidney cysts in MR images of patients affected by autosomal-dominant large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
polycystic kidney disease,” Abdominal Radiology, vol. 46, no. 3, pp. [91] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncer-
1053–1061, 2021. tainty to weigh losses for scene geometry and semantics,” in IEEE
[70] V. Oreiller, V. Andrearczyk, M. Jreige, S. Boughdad, H. Elhalawani, Conference on Computer Vision and Pattern Recognition, 2018, pp.
J. Castelli, M. Vallières, S. Zhu, J. Xie, Y. Peng et al., “Head and neck 7482–7491.
tumor segmentation in PET/CT: the HECKTOR challenge,” Medical [92] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,”
Image Analysis, vol. 77, p. 102336, 2022. arXiv preprint arXiv:1410.5401, 2014.
[71] F. Ouhmich, V. Agnus, V. Noblet, F. Heitz, and P. Pessaux, “Liver tissue [93] M.-T. Luong, H. Pham, and C. D. Manning, “Effective ap-
segmentation in multiphase ct scans using cascaded convolutional neu- proaches to attention-based neural machine translation,” arXiv preprint
ral networks,” International Journal of Computer Assisted Radiology arXiv:1508.04025, 2015.
and Surgery, vol. 14, no. 8, pp. 1275–1284, 2019. [94] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
[72] K. Kamnitsas, W. Bai, E. Ferrante, S. McDonagh, M. Sinclair, Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
N. Pawlowski, M. Rajchl, M. Lee, B. Kainz, D. Rueckert et al., “En- Advances in Neural Information Processing Systems, vol. 30, 2017.
sembles of multiple models and architectures for robust brain tumour [95] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
segmentation,” in International MICCAI BrainLesion Workshop, 2017, jointly learning to align and translate,” arXiv preprint arXiv:1409.0473,
pp. 450–462. 2014.
[73] D. Keshwani, Y. Kitamura, S. Ihara, S. Iizuka, and E. Simo-Serra, [96] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
“TopNet: Topology preserving metric learning for vessel tree recon- IEEE Conference on Computer Vision and Pattern Recognition, 2018,
struction and labelling,” in International Conference on Medical Image pp. 7132–7141.
Computing and Computer-Assisted Intervention, 2020, pp. 14–23. [97] A. Iantsen, D. Visvikis, and M. Hatt, “Squeeze-and-excitation normal-
[74] A. Sadikine, B. Badic, J.-P. Tasu, V. Noblet, D. Visvikis, and P.- ization for automated delineation of head and neck primary tumors in
H. Conze, “Semi-overcomplete convolutional auto-encoder embedding combined PET and CT images,” in Head and Neck Tumor Segmentation
as shape priors for deep vessel segmentation,” in IEEE International Challenge, vol. 12603, 2020, pp. 37–43.
Conference on Image Processing, 2022. [98] L. Rundo, C. Han, Y. Nagano, J. Zhang, R. Hataya, C. Militello,
[75] M. Hatt, C. Parmar, J. Qi, and I. El Naqa, “Machine (deep) learning A. Tangherloni, M. S. Nobile, C. Ferretti, D. Besozzi et al., “USE-Net:
methods for image processing and radiomics,” IEEE Transactions on Incorporating squeeze-and-excitation blocks into U-Net for prostate
Radiation and Plasma Medical Sciences, vol. 3, no. 2, pp. 104–108, zonal segmentation of multi-institutional MRI datasets,” Neurocomput-
2019. ing, vol. 365, pp. 31–43, 2019.
[76] M. M. Islam, B. Badic, T. Aparicio, D. Tougeron, J.-P. Tasu, [99] X. Li, Y. Wei, L. Wang, S. Fu, and C. Wang, “MSGSE-Net: Multi-scale
D. Visvikis, and P.-H. Conze, “Deep treatment response assessment guided squeeze-and-excitation network for subcortical brain structure
and prediction of colorectal cancer liver metastases,” in International segmentation,” Neurocomputing, vol. 461, pp. 228–243, 2021.
Conference on Medical Image Computing and Computer-Assisted [100] X. Shen, J. Xu, H. Jia, P. Fan, F. Dong, B. Yu, and S. Ren, “Self-
Intervention, 2022, pp. 482–491. attentional microvessel segmentation via squeeze-excitation trans-
[77] C. Pons, F. T. Sheehan, H. S. Im, S. Brochard, and K. E. Alter, former unet,” Computerized Medical Imaging and Graphics, vol. 97,
“Shoulder muscle atrophy and its relation to strength loss in obstetrical p. 102055, 2022.
brachial plexus palsy,” Clinical Biomechanics, vol. 48, pp. 80–87, 2017. [101] N. Abraham and N. M. Khan, “A novel focal tversky loss function
[78] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image with improved attention U-Net for lesion segmentation,” in IEEE
translation with conditional adversarial networks,” in IEEE Conference International Symposium on Biomedical Imaging, 2019, pp. 683–687.
[102] A. G. Roy, N. Navab, and C. Wachinger, “Recalibrating fully con- [124] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, and V. M. Patel, “Medical
volutional networks with spatial and channel squeeze and excitation Transformer: Gated axial-attention for medical image segmentation,” in
blocks,” IEEE Transactions on Medical Imaging, vol. 38, no. 2, pp. International Conference on Medical Image Computing and Computer-
540–549, 2018. Assisted Intervention, 2021, pp. 36–46.
[103] C. Yao, J. Tang, M. Hu, Y. Wu, W. Guo, Q. Li, and X.-P. Zhang, [125] W. Wang, C. Chen, M. Ding, H. Yu, S. Zha, and J. Li, “TransBTS:
“Claw U-Net: A U-Net variant network with deep feature concatenation Multimodal brain tumor segmentation using Transformer,” in Inter-
for scleral blood vessel segmentation,” in International Conference on national Conference on Medical Image Computing and Computer-
Artificial Intelligence, 2021, pp. 67–78. Assisted Intervention, 2021, pp. 109–119.
[104] A. Sinha and J. Dolz, “Multi-scale self-guided attention for medical [126] J. Li, W. Wang, C. Chen, T. Zhang, S. Zha, H. Yu, and J. Wang,
image segmentation.” IEEE Journal of Biomedical and Health Infor- “TransBTSv2: Wider instead of deeper transformer for medical image
matics, 2020. segmentation,” arXiv preprint arXiv:2201.12785, 2022.
[105] A. G. Roy, N. Navab, and C. Wachinger, “Concurrent spatial and [127] Y. Xie, J. Zhang, C. Shen, and Y. Xia, “CoTr: Efficiently bridging CNN
channel squeeze & excitation in fully convolutional networks,” ArXiv, and Transformer for 3D medical image segmentation,” in International
2018. Conference on Medical Image Computing and Computer-Assisted
[106] W. Fang and X.-h. Han, “Spatial and channel attention modulated Intervention, 2021, pp. 171–180.
network for medical image segmentation,” in Asian Conference on [128] S. Li, X. Sui, X. Luo, X. Xu, Y. Liu, and R. Goh, “Medical image seg-
Computer Vision, 2020. mentation using squeeze-and-expansion Transformers,” arXiv preprint
[107] V. V. Valindria, N. Pawlowski, M. Rajchl, I. Lavdas, E. O. Aboagye, arXiv:2105.09511, 2021.
A. G. Rockall, D. Rueckert, and B. Glocker, “Multi-modal learning [129] B. Chen, Y. Liu, Z. Zhang, G. Lu, and D. Zhang, “TransAttUNet:
from unpaired images: Application to multi-organ segmentation in CT Multi-level attention-guided U-Net with Transformer for medical image
and MRI,” in IEEE Winter Conference on Applications of Computer segmentation,” arXiv preprint arXiv:2107.05274, 2021.
Vision, 2018, pp. 547–556. [130] L. Liu, Z. Huang, P. Liò, C.-B. Schönlieb, and A. I. Aviles-Rivero,
[108] X. Zhou, “Automatic segmentation of multiple organs on 3D CT images “PC-SwinMorph: Patch representation for unsupervised medical image
by using deep learning approaches,” Deep Learning in Medical Image registration and segmentation,” arXiv preprint arXiv:2203.05684, 2022.
Analysis, pp. 135–147, 2020. [131] Y. Zhang, H. Liu, and Q. Hu, “Transfuse: Fusing Transform-
[109] S. A. Taghanaki, Y. Zheng, S. K. Zhou, B. Georgescu, P. Sharma, ers and CNNs for medical image segmentation,” arXiv preprint
D. Xu, D. Comaniciu, and G. Hamarneh, “Combo loss: Handling input arXiv:2102.08005, 2021.
and output imbalance in multi-organ segmentation,” Computerized [132] Q. Sun, N. Fang, Z. Liu, L. Zhao, Y. Wen, and H. Lin, “HybridCTrm:
Medical Imaging and Graphics, vol. 75, pp. 24–33, 2019. Bridging CNN and Transformer for multimodal brain image segmen-
[110] E. Gibson, W. Li, C. Sudre, L. Fidon, D. I. Shakir, G. Wang, Z. Eaton- tation,” Journal of Healthcare Engineering, vol. 2021, 2021.
Rosen, R. Gray, T. Doel, Y. Hu et al., “NiftyNet: a deep-learning [133] X. Luo, M. Hu, T. Song, G. Wang, and S. Zhang, “Semi-supervised
platform for medical imaging,” Computer Methods and Programs in medical image segmentation via cross teaching between CNN and
Biomedicine, vol. 158, pp. 113–122, 2018. Transformer,” in Medical Imaging with Deep Learning, 2022.
[111] Y. Weng, T. Zhou, Y. Li, and X. Qiu, “NAS-UNet: Neural architecture [134] H.-Y. Zhou, J. Guo, Y. Zhang, L. Yu, L. Wang, and Y. Yu, “nnFormer:
search for medical image segmentation,” IEEE Access, vol. 7, pp. Interleaved transformer for volumetric segmentation,” arXiv preprint
44 247–44 257, 2019. arXiv:2109.03201, 2021.
[112] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- [135] Y. Wu, K. Liao, J. Chen, D. Z. Chen, J. Wang, H. Gao, and J. Wu,
training of deep bidirectional transformers for language understanding,” “D-former: A U-shaped dilated Transformer for 3D medical image
arXiv preprint arXiv:1810.04805, 2018. segmentation,” arXiv preprint arXiv:2201.00462, 2022.
[113] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving [136] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang,
language understanding by generative pre-training,” 2018. “Swin-UNet: UNet-like pure Transformer for medical image segmen-
[114] E. Jun, S. Jeong, D.-W. Heo, and H.-I. Suk, “Medical Trans- tation,” arXiv preprint arXiv:2105.05537, 2021.
former: Universal brain encoder for 3D MRI analysis,” arXiv preprint [137] A. Lin, B. Chen, J. Xu, Z. Zhang, G. Lu, and D. Zhang, “DS-
arXiv:2104.13633, 2021. TransUNet: Dual Swin Transformer U-Net for medical image segmen-
[115] J. Li, J. Chen, Y. Tang, B. A. Landman, and S. K. Zhou, “Transforming tation,” IEEE Transactions on Instrumentation and Measurement, 2022.
medical imaging with Transformers? a comparative review of key [138] X. Huang, Z. Deng, D. Li, and X. Yuan, “MISSFormer: An ef-
properties, current progresses, and future perspectives,” arXiv preprint fective medical image segmentation transformer,” arXiv preprint
arXiv:2206.01136, 2022. arXiv:2109.07162, 2021.
[116] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, [139] H. Peiris, M. Hayat, Z. Chen, G. Egan, and M. Harandi, “A volumetric
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., Transformer for accurate 3D tumor segmentation,” arXiv preprint
“An image is worth 16x16 words: Transformers for image recognition arXiv:2111.13300, 2021.
at scale,” arXiv preprint arXiv:2010.11929, 2020. [140] S. Ruder, “An overview of multi-task learning in deep neural networks,”
[117] G. Andrade-Miranda, V. Jaouen, V. Bourbonne, F. Lucia, D. Visvikis, arXiv preprint arXiv:1706.05098, 2017.
and P.-H. Conze, “Pure versus hybrid Transformers for multi-modal [141] P. Moeskops, J. M. Wolterink, B. H. van der Velden, K. G. Gilhuijs,
brain tumor segmentation: a comparative study,” in IEEE International T. Leiner, M. A. Viergever, and I. Išgum, “Deep learning for multi-task
Conference on Image Processing, 2022. medical image segmentation in multiple modalities,” in International
[118] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Land- Conference on Medical Image Computing and Computer-Assisted
man, H. R. Roth, and D. Xu, “UNETR: Transformers for 3D medical Intervention, 2016, pp. 478–486.
image segmentation,” in IEEE/CVF Conference on Applications of [142] C. Playout, R. Duval, and F. Cheriet, “A multitask learning architecture
Computer Vision, 2022, pp. 574–584. for simultaneous segmentation of bright and red lesions in fundus
[119] D. Karimi, S. D. Vasylechko, and A. Gholipour, “Convolution-free images,” in International Conference on Medical Image Computing
medical image segmentation using Transformers,” in International and Computer-Assisted Intervention, 2018, pp. 101–108.
Conference on Medical Image Computing and Computer-Assisted [143] B. Murugesan, K. Sarveswaran, S. M. Shankaranarayana, K. Ram,
Intervention, 2021, pp. 78–88. J. Joseph, and M. Sivaprakasam, “Psi-Net: Shape and boundary aware
[120] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. Roth, and D. Xu, “Swin joint multi-task deep network for medical image segmentation,” in
UNETR: Swin Transformers for semantic segmentation of brain tumors International Conference of the IEEE Engineering in Medicine and
in MR images,” arXiv preprint arXiv:2201.01266, 2022. Biology Society, 2019, pp. 7223–7226.
[121] B. Dong, W. Wang, D.-P. Fan, J. Li, H. Fu, and L. Shao, “Polyp-PVT: [144] Y. Zhang, H. Li, J. Du, J. Qin, T. Wang, Y. Chen, B. Liu, W. Gao,
Polyp segmentation with pyramid vision transformers,” arXiv preprint G. Ma, and B. Lei, “3D multi-attention guided multi-task learning
arXiv:2108.06932, 2021. network for automatic gastric tumor segmentation and lymph node
[122] Z. Zhang, H. Zhang, L. Zhao, T. Chen, S. Ö. Arik, and T. Pfister, classification,” IEEE Transactions on Medical Imaging, vol. 40, no. 6,
“Nested hierarchical transformer: Towards accurate, data-efficient and pp. 1618–1631, 2021.
interpretable visual understanding,” in Conference on Artificial Intelli- [145] S. Chen, G. Bortsova, A. Garcı́a-Uceda Juárez, G. v. Tulder, and
gence, vol. 36, no. 3, 2022, pp. 3417–3425. M. d. Bruijne, “Multi-task attention-based semi-supervised learning for
[123] X. Yu, Y. Tang, Y. Zhou, R. Gao, Q. Yang, H. H. Lee, T. Li, S. Bao, medical image segmentation,” in International Conference on Medical
Y. Huo, Z. Xu et al., “Characterizing renal structures with 3D block Image Computing and Computer-Assisted Intervention, 2019, pp. 457–
aggregate Transformers,” arXiv preprint arXiv:2203.02430, 2022. 465.
[146] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian [168] S. Zagoruyko and N. Komodakis, “Paying more attention to attention:
deep learning for computer vision?” Advances in Neural Information Improving the performance of convolutional neural networks via atten-
Processing Systems, vol. 30, 2017. tion transfer,” arXiv preprint arXiv:1612.03928, 2016.
[147] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, [169] T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan, “Knowledge
M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya adaptation for efficient semantic segmentation,” in IEEE/CVF Confer-
et al., “A review of uncertainty quantification in deep learning: Tech- ence on Computer Vision and Pattern Recognition, 2019, pp. 578–587.
niques, applications and challenges,” Information Fusion, vol. 76, pp. [170] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, “Structured
243–297, 2021. knowledge distillation for semantic segmentation,” in IEEE/CVF Con-
[148] A. Jungo and M. Reyes, “Assessing reliability and challenges of ference on Computer Vision and Pattern Recognition, 2019, pp. 2604–
uncertainty estimations for medical image segmentation,” in Inter- 2613.
national Conference on Medical Image Computing and Computer- [171] Y. Wen, L. Chen, S. Xi, Y. Deng, X. Tang, and C. Zhou, “Towards
Assisted Intervention, 2019, pp. 48–56. efficient medical image segmentation via boundary-guided knowledge
[149] F. Galati, S. Ourselin, and M. A. Zuluaga, “From accuracy to reliability distillation,” in IEEE International Conference on Multimedia and
and robustness in cardiac magnetic resonance image segmentation: a Expo, 2021.
review,” Applied Sciences, vol. 12, no. 8, p. 3936, 2022. [172] F. Tung and G. Mori, “Similarity-preserving knowledge distillation,”
[150] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: in IEEE/CVF International Conference on Computer Vision, 2019, pp.
Representing model uncertainty in deep learning,” in International 1365–1374.
Conference on Machine Learning, 2016, pp. 1050–1059. [173] E. Kats, J. Goldberger, and H. Greenspan, “Soft labeling by distilling
[151] D. C. Castro, I. Walker, and B. Glocker, “Causality matters in medical anatomical knowledge for improved MS lesion segmentation,” in IEEE
imaging,” Nature Communications, vol. 11, no. 1, pp. 1–10, 2020. International Symposium on Biomedical Imaging, 2019, pp. 1563–
[152] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration 1566.
of modern neural networks,” in International Conference on Machine [174] Y. Liu, C. Shu, J. Wang, and C. Shen, “Structured knowledge distilla-
Learning, 2017, pp. 1321–1330. tion for dense prediction,” IEEE Transactions on Pattern Analysis and
[153] Y. Xia, D. Yang, Z. Yu, F. Liu, J. Cai, L. Yu, Z. Zhu, D. Xu, Machine Intelligence, 2020.
A. Yuille, and H. Roth, “Uncertainty-aware multi-view co-training for [175] L. Zhang, S. Feng, Y. Wang, Y. Wang, Y. Zhang, X. Chen, and Q. Tian,
semi-supervised medical image segmentation and domain adaptation,” “Unsupervised ensemble distillation for multi-organ segmentation,” in
Medical Image Analysis, vol. 65, p. 101766, 2020. IEEE International Symposium on Biomedical Imaging, 2022.
[154] T. Nair, D. Precup, D. L. Arnold, and T. Arbel, “Exploring uncertainty [176] R. Huang, Y. Zheng, Z. Hu, S. Zhang, and H. Li, “Multi-organ
measures in deep networks for multiple sclerosis lesion detection and segmentation via co-training weight-averaged models from few-organ
segmentation,” Medical image analysis, vol. 59, p. 101557, 2020. datasets,” in International Conference on Medical Image Computing
[155] K. Wickstrøm, M. Kampffmeyer, and R. Jenssen, “Uncertainty and and Computer-Assisted Intervention, 2020, pp. 146–155.
interpretability in convolutional neural networks for semantic seg- [177] N. Wang, S. Lin, X. Li, K. Li, Y. Shen, Y. Gao, and L. Ma, “MISSU:
mentation of colorectal polyps,” Medical Image Analysis, vol. 60, p. 3D medical image segmentation via self-distilling TransUNet,” arXiv
101619, 2020. preprint arXiv:2206.00902, 2022.
[156] Y. Zhao, C. Yang, A. Schweidtmann, and Q. Tao, “Efficient bayesian [178] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine
uncertainty estimation for nnu-net,” in International Conference on learning: A survey and taxonomy,” IEEE Transactions on Pattern
Medical Image Computing and Computer-Assisted Intervention, 2022, Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2018.
pp. 535–544. [179] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Ben-
[157] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable gio, C. Pal, P.-M. Jodoin, and H. Larochelle, “Brain tumor segmentation
predictive uncertainty estimation using deep ensembles,” Advances in with deep neural networks,” Medical Image Analysis, vol. 35, pp. 18–
neural information processing systems, vol. 30, 2017. 31, 2017.
[158] M. Antonelli, A. Reinke, S. Bakas, K. Farahani, A. Kopp-Schneider, [180] Y. Qin, K. Kamnitsas, S. Ancha, J. Nanavati, G. Cottrell, A. Crim-
B. A. Landman, G. Litjens, B. Menze, O. Ronneberger, R. M. Summers inisi, and A. Nori, “Autofocus layer for semantic segmentation,” in
et al., “The medical segmentation decathlon,” Nature Communications, International Conference on Medical Image Computing and Computer-
vol. 13, no. 1, pp. 1–13, 2022. Assisted Intervention, 2018, pp. 603–611.
[159] D. Shanmugam, D. Blalock, G. Balakrishnan, and J. Guttag, “Better [181] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane,
aggregation in test-time augmentation,” in IEEE/CVF International D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3D
Conference on Computer Vision, 2021, pp. 1214–1223. CNN with fully connected CRF for accurate brain lesion segmentation,”
[160] X. Hu, R. Guo, J. Chen, H. Li, D. Waldmannstetter, Y. Zhao, B. Li, Medical Image Analysis, vol. 36, pp. 61–78, 2017.
K. Shi, and B. Menze, “Coarse-to-fine adversarial networks and zone- [182] S. Pereira, A. Pinto, V. Alves, and C. A. Silva, “Brain tumor seg-
based uncertainty analysis for NK/T-cell lymphoma segmentation in mentation using convolutional neural networks in MRI images,” IEEE
CT/PET images,” IEEE Journal of Biomedical and Health Informatics, Transactions on Medical Imaging, vol. 35, no. 5, pp. 1240–1251, 2016.
vol. 24, no. 9, pp. 2599–2608, 2020. [183] J. Shapey, G. Wang, R. Dorent, A. Dimitriadis, W. Li, I. Paddick,
[161] V. P. Sudarshan, U. Upadhyay, G. F. Egan, Z. Chen, and S. P. N. Kitchen, S. Bisdas, S. R. Saeed, S. Ourselin et al., “An artificial
Awate, “Towards lower-dose PET using physics-based uncertainty- intelligence framework for automatic segmentation and volumetry
aware multimodal learning with robustness to out-of-distribution data,” of vestibular schwannomas from contrast-enhanced T1-weighted and
Medical Image Analysis, vol. 73, p. 102187, 2021. high-resolution T2-weighted MRI,” Journal of Neurosurgery, vol. 134,
[162] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric no. 1, pp. 171–179, 2019.
discriminatively, with application to face verification,” in IEEE Con- [184] H.-Y. Yang, “Volumetric adversarial training for ischemic stroke lesion
ference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. segmentation,” in MICCAI BrainLesion Workshop, 2018, pp. 343–351.
539–546. [185] C. Zhou, C. Ding, Z. Lu, X. Wang, and D. Tao, “One-pass multi-
[163] A. Jamaludin, T. Kadir, and A. Zisserman, “Self-supervised learning task convolutional neural networks for efficient brain tumor segmenta-
for spinal MRIs,” in Deep Learning in Medical Image Analysis and tion,” in International Conference on Medical Image Computing and
Multimodal Learning for Clinical Decision Support, 2017, pp. 294– Computer-Assisted Intervention, 2018, pp. 637–645.
302. [186] M. Havaei, N. Guizard, N. Chapados, and Y. Bengio, “HeMIS: Hetero-
[164] X. Hu, D. Zeng, X. Xu, and Y. Shi, “Semi-supervised contrastive modal image segmentation,” in International Conference on Medical
learning for label-efficient medical image segmentation,” in Inter- Image Computing and Computer-Assisted Intervention, 2016, pp. 469–
national Conference on Medical Image Computing and Computer- 477.
Assisted Intervention, 2021, pp. 481–490. [187] K.-L. Tseng, Y.-L. Lin, W. Hsu, and C.-Y. Huang, “Joint sequence
[165] D. Zeng, Y. Wu, X. Hu, X. Xu, H. Yuan, M. Huang, J. Zhuang, J. Hu, learning and cross-modality convolution for 3D biomedical segmenta-
and Y. Shi, “Positional contrastive learning for volumetric medical tion,” in IEEE Conference on Computer Vision and Pattern Recogni-
image segmentation,” arXiv preprint arXiv:2106.09157, 2021. tion, 2017, pp. 6393–6400.
[166] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a [188] T. Zhou, S. Canu, and S. Ruan, “Fusion based on attention mechanism
neural network,” in Neural Information Processing Systems, 2014. and context constraint for multi-modal brain tumor segmentation,”
[167] K. Xu, L. Rui, Y. Li, and L. Gu, “Feature normalized knowledge distil- Computerized Medical Imaging and Graphics, vol. 86, p. 101811,
lation for image classification,” in European Conference on Computer 2020.
Vision, 2020, pp. 664–680. [189] T. Zhou, S. Canu, P. Vera, and S. Ruan, “Latent correlation repre-
sentation learning for brain tumor segmentation with missing MRI learning: a collaborative effort to achieve better medical imaging mod-
modalities,” IEEE Transactions on Image Processing, vol. 30, pp. els for individual sites that have small labelled datasets,” Quantitative
4263–4274, 2021. Imaging in Medicine and Surgery, vol. 11, no. 2, p. 852, 2021.
[190] C. Li, H. Sun, Z. Liu, M. Wang, H. Zheng, and S. Wang, “Learning [211] A. Chowdhury, H. Kassem, N. Padoy, R. Umeton, and A. Karargyris,
cross-modal deep representations for multi-modal MR image segmen- “A review of medical federated learning: Applications in oncology
tation,” in International Conference on Medical Image Computing and and cancer research,” in International MICCAI Brainlesion Workshop,
Computer-Assisted Intervention, 2019, pp. 57–65. 2022, pp. 3–24.
[191] J. Dolz, K. Gopinath, J. Yuan, H. Lombaert, C. Desrosiers, and I. B. [212] X. Xu, T. Chen, H. Deng, T. Kuang, J. C. Barber, D. Kim, J. Gateno,
Ayed, “HyperDense-Net: a hyper-densely connected CNN for multi- P. Yan, and J. J. Xia, “Federated cross learning for medical image
modal image segmentation,” IEEE Transactions on Medical Imaging, segmentation,” arXiv preprint arXiv:2204.02450, 2022.
vol. 38, no. 5, pp. 1116–1126, 2018. [213] J. Wicaksana, Z. Yan, D. Zhang, X. Huang, H. Wu, X. Yang, and K.-
[192] D. Zhang, G. Huang, Q. Zhang, J. Han, J. Han, and Y. Yu, “Cross- T. Cheng, “FedMix: Mixed supervised federated learning for medical
modality deep feature learning for brain tumor segmentation,” Pattern image segmentation,” arXiv preprint arXiv:2205.01840, 2022.
Recognition, vol. 110, p. 107562, 2021. [214] T. Kim, K. Lee, S. Ham, B. Park, S. Lee, D. Hong, G. B. Kim,
[193] Y. Zhang, D. Sidibé, O. Morel, and F. Mériaudeau, “Deep multimodal Y. S. Kyung, C.-S. Kim, and N. Kim, “Active learning for accuracy
fusion for semantic image segmentation: A survey,” Image and Vision enhancement of semantic segmentation with CNN-corrected label cura-
Computing, vol. 105, p. 104042, 2021. tions: Evaluation on kidney segmentation in abdominal CT,” Scientific
[194] Y. Zhang, J. Yang, J. Tian, Z. Shi, C. Zhong, Y. Zhang, and Z. He, Reports, vol. 10, no. 1, pp. 1–7, 2020.
“Modality-aware mutual learning for multi-modal medical image seg- [215] M. Shen, J. Y. Zhang, L. Chen, W. Yan, N. Jani, B. Sutton, and
mentation,” in International Conference on Medical Image Computing O. Koyejo, “Labeling cost sensitive batch active learning for brain
and Computer-Assisted Intervention, 2021, pp. 589–599. tumor segmentation,” in IEEE International Symposium on Biomedical
[195] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to- Imaging, 2021, pp. 1269–1273.
image translation using cycle-consistent adversarial networks,” in IEEE [216] Y. Yan, P.-H. Conze, M. Lamard, H. Zhang, G. Quellec, B. Cochener,
International Conference on Computer Vision, 2017, pp. 2223–2232. and G. Coatrieux, “Deep active learning for dual-view mammogram
[196] R. Dorent, A. Kujawa, M. Ivory, S. Bakas, N. Rieke, S. Joutard, analysis,” in International Workshop on Machine Learning in Medical
B. Glocker, J. Cardoso, M. Modat, K. Batmanghelich et al., “Cross- Imaging, 2021, pp. 180–189.
MoDA 2021 challenge: Benchmark of cross-modality domain adapta- [217] O. Yaniv, O. Portnoy, A. Talmon, N. Kiryati, E. Konen, and A. Mayer,
tion techniques for vestibular schwannoma and cochlea segmentation,” “V-Net light-parameter-efficient 3D convolutional neural network for
Medical Image Analysis, vol. 83, p. 102628, 2023. prostate MRI segmentation,” in IEEE International Symposium on
[197] G. Wilson and D. J. Cook, “A survey of unsupervised deep domain Biomedical Imaging, 2020, pp. 442–445.
adaptation,” ACM Transactions on Intelligent Systems and Technology, [218] Q. Zhao, H. Wang, and G. Wang, “LCOV-NET: A lightweight neural
vol. 11, no. 5, pp. 1–46, 2020. network for Covid-19 pneumonia lesion segmentation from 3D CT
[198] C. Ouyang, K. Kamnitsas, C. Biffi, J. Duan, and D. Rueckert, “Data images,” in IEEE International Symposium on Biomedical Imaging,
efficient unsupervised domain adaptation for cross-modality image seg- 2021, pp. 42–45.
mentation,” in International Conference on Medical Image Computing
and Computer-Assisted Intervention, 2019, pp. 669–677.
[199] N. Karani, K. Chaitanya, C. Baumgartner, and E. Konukoglu, “A life-
long learning approach to brain MR segmentation across scanners and
protocols,” in International Conference on Medical Image Computing
and Computer-Assisted Intervention, 2018, pp. 476–484.
[200] Q. Dou, Q. Liu, P. A. Heng, and B. Glocker, “Unpaired multi-
modal segmentation via knowledge distillation,” IEEE Transactions on
Medical Imaging, vol. 39, no. 7, pp. 2415–2425, 2020.
[201] Z. Zhang, L. Yang, and Y. Zheng, “Translating and segmenting mul-
timodal medical volumes with cycle-and shape-consistency generative
adversarial network,” in IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 9242–9251.
[202] S.-Y. Hu, S. Wang, W.-H. Weng, J. Wang, X. Wang, A. Ozturk, Q. Li,
V. Kumar, and A. E. Samir, “Self-supervised pretraining with DICOM
metadata in ultrasound imaging,” in Machine Learning for Healthcare
Conference, 2020, pp. 732–749.
[203] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with
contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[204] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transac-
tions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–
1359, 2010.
[205] A. Tarvainen and H. Valpola, “Mean teachers are better role models:
Weight-averaged consistency targets improve semi-supervised deep
learning results,” Advances in neural information processing systems,
vol. 30, 2017.
[206] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, “Uncertainty-aware
self-ensembling model for semi-supervised 3D left atrium segmenta-
tion,” in International Conference on Medical Image Computing and
Computer-Assisted Intervention, 2019, pp. 605–613.
[207] G. Wang, X. Liu, C. Li, Z. Xu, J. Ruan, H. Zhu, T. Meng, K. Li,
N. Huang, and S. Zhang, “A noise-robust framework for automatic
segmentation of COVID-19 pneumonia lesions from CT images,” IEEE
Transactions on Medical Imaging, vol. 39, no. 8, pp. 2653–2663, 2020.
[208] D.-P. Fan, T. Zhou, G.-P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and
L. Shao, “Inf-Net: Automatic COVID-19 lung infection segmentation
from ct images,” IEEE Transactions on Medical Imaging, vol. 39, no. 8,
pp. 2626–2637, 2020.
[209] Y.-X. Zhao, Y.-M. Zhang, M. Song, and C.-L. Liu, “Multi-view
semi-supervised 3D whole brain segmentation with a self-ensemble
network,” in International Conference on Medical Image Computing
and Computer-Assisted Intervention, 2019, pp. 256–265.
[210] D. Ng, X. Lan, M. M.-S. Yao, W. P. Chan, and M. Feng, “Federated