MFCNet: A Multi-Modal Fusion and Calibration Networks for 3D Pancreas Tumor Segmentation on PET-CT Images
Abstract

In clinical diagnosis, positron emission tomography and computed tomography (PET-CT) images containing complementary information are fused. Tumor segmentation based on multi-modal PET-CT images is an important part of clinical diagnosis and treatment. However, the existing PET-CT tumor segmentation methods mainly focus on positron emission tomography (PET) and computed tomography (CT) feature fusion, which weakens the specificity of the modality. In addition, the information interaction between different modal images is usually completed by simple addition or concatenation operations, which has the disadvantage of introducing irrelevant information during the multi-modal semantic feature fusion, so effective features cannot be highlighted. To overcome this problem, this paper proposes a novel Multi-modal Fusion and Calibration Network (MFCNet) for tumor segmentation based on three-dimensional PET-CT images. First, a Multi-modal Fusion Down-sampling Block (MFDB) with a residual structure is developed. The proposed MFDB can fuse the complementary features of multi-modal images while retaining the unique features of the different modal images. Second, a Multi-modal Mutual Calibration Block (MMCB) based on the inception structure is designed. The MMCB can guide the network to focus on a tumor region by combining different branch decoding features using the attention mechanism and by extracting multi-scale pathological features using convolution kernels of different sizes. The proposed MFCNet is verified on both a public dataset (Head and Neck cancer) and an in-house dataset (pancreatic cancer). The experimental results indicate that on the public and in-house datasets, the average Dice values of the proposed multi-modal segmentation network are 74.14% and 76.20%, while the average Hausdorff distances are 6.41 and 6.84, respectively. In addition, the experimental results show that the proposed MFCNet outperforms the state-of-the-art methods on the two datasets.

Keywords: Multi-modal tumor segmentation; PET-CT; Pancreas; 3D convolutional neural network
* Corresponding author.
** Corresponding author.
E-mail addresses: [email protected] (Z. Yan), [email protected] (Z. Liu).
1 Zhuangzhi Yan and Zhaobang Liu contributed equally as co-corresponding authors.
https://doi.org/10.1016/j.compbiomed.2023.106657
Received 24 May 2022; Received in revised form 29 January 2023; Accepted 9 February 2023
Available online 10 February 2023
In PET-CT images, a large number of tumors show obvious lesion areas only in the PET images. Namely, a tumor area usually shows high 18F-FDG uptake, a "hot spot," with a high SUV value. Therefore, some studies have applied SUV-based threshold methods [6,7] to perform tumor segmentation on PET images. However, the performance of single-modality image segmentation algorithms is affected [8,9] by the low resolution of PET images (i.e., the partial volume effect), tumor heterogeneity, and the non-specificity of 18F-FDG. In computed tomography (CT) images, it is challenging to distinguish tumor tissue from normal tissue based on the pixel value, but CT images can provide the anatomical location of abnormal FDG uptake in a PET-CT image and improve image interpretation accuracy. Multi-modal data can provide different perspectives and rich information for solving clinical problems [10,11]. Accordingly, many tumor segmentation methods for multi-modal PET-CT images have been proposed [14,15]. For instance, Han et al. [12] defined the PET-CT image pair segmentation problem as a Markov random field optimization problem, and then used the segmentation differences between the PET and CT images during parameter optimization to obtain heterogeneous segmentation results. Based on Han's method, Song et al. [13] proposed an adaptive context loss to improve the segmentation results and tried to obtain two different tumor contours using PET and CT. The above-mentioned studies have shown that combining the information of PET and CT image data can help to improve tumor segmentation accuracy. However, although these methods have demonstrated encouraging estimation performance, they cannot meet the real-time requirements of image segmentation.

In recent years, deep convolutional neural networks (CNNs) have been widely used in the field of medical image analysis, for tasks such as lesion detection [16], image registration [17,18], and lesion segmentation [19,20]. Due to the excellent feature extraction ability of convolutional neural networks, a number of studies have applied them to tumor segmentation of PET-CT images. For instance, Kumar et al. [21] used a multi-modal feature co-learning and fusion module to learn the relative importance of each modality feature at different spatial locations. The obtained fusion maps were then multiplied with modality-specific feature maps to generate complementary features at different locations. Finally, a late fusion approach was employed to make each stage go through independent streams of the encoder-decoder network, and the learned features were fused at the end of each stream. Further, Zhong et al. [22] generated tumor probability maps from the PET and CT images separately using two separate 3D U-Net networks and then employed the graph-cut algorithm to optimize the segmentation results. Similarly, Zhao et al. [23] extracted semantic features of the PET and CT images with two V-Net networks and then used multiple cascaded convolutions to fuse the PET and CT feature maps to perform tumor segmentation. In the EF-Net [24], an evidence loss function was adopted to extract the uncertainty in the segmentation results, and evidence fusion was used to reduce the uncertainty of the segmentation results of each modality. In addition, some related studies have used a mixture of fusion strategies. The above-mentioned PET-CT multi-modal image segmentation networks can be categorized into two groups: networks that fuse high-level semantic features of different modal images [21,25], and networks that fuse the segmentation results of different modal images [22,23]. However, the aforementioned methods have focused only on the differences in the segmentation results of the different modal images, while the mutual cooperation between the different modal images has been ignored during the segmentation process, so the different modal images cannot directly benefit from each other in the feature extraction process. In particular, the semantic features extracted by the CT encoding blocks are obtained only from the CT images, while the PET images make no contribution to them, and vice versa. However, in clinical practice, the tumor segmentation of PET-CT images is a multi-modal interactive process, where the PET images are used to determine the tumor location, and the CT images are used to determine the tumor boundary. Xue et al. [26] proposed a Shared Down-sampling Block (SDB) module to realize the interaction between PET-CT images by using shared parameters. However, it should be noted that the SDB module does not consider the specificity of the single-modal (PET or CT) features. To address this shortcoming, this study designs a Multi-modal Fusion Down-sampling Block (MFDB), which can process multi-modal information while preserving single-modal image features. Namely, in the MFDB, each encoded feature contains semantic information from the CT (or PET) images as well as fused semantic information obtained from the PET-CT images.

In addition, achieving an efficient feature interaction between the PET and CT branches is crucial in multi-modal image fusion network design. In general, the existing literature has mostly achieved multi-modal feature fusion by directly combining the features of each modality [22,25], which could ignore important information in some modalities. To overcome this problem in multi-modal feature fusion, Fu et al. [27] proposed the MSAM net, which learns tumor-related attention maps from the PET images and fuses them with the decoded features obtained from the CT images to realize tumor segmentation. Considering the false positives caused by the partial volume effect of PET images and the low discrimination between tumor and non-tumor regions in CT images, and inspired by the attention mechanism, this paper proposes a Multi-modal Mutual Calibration Block (MMCB) to calibrate different-scale semantics of the PET (CT) images guided by attention maps from the CT (PET) images.

In this study, the elaborated MFDB and MMCB modules are introduced into a 3D U-shaped network to segment tumor areas in PET-CT images. Based on the proposed multi-modal feature fusion strategy, the Multi-modal Fusion and Calibration Network (MFCNet) is developed. The proposed MFCNet is applied to two challenging medical image segmentation tasks: head and neck tumor segmentation [28,29] and pancreatic tumor segmentation in PET-CT images. The first task is performed on a public dataset and the second task is conducted on a private clinical dataset. The average Dice values of the proposed multi-modal segmentation network are 74.14% and 76.20%, outperforming the other state-of-the-art methods on both datasets. The main contributions of this work can be summarized as follows.

(1) Based on the U-Net structure, a multi-modal fusion and calibration network is proposed for tumor segmentation in PET-CT images.
(2) The MFDB and MMCB modules are designed to solve the information fusion problem in multi-modal medical image segmentation tasks. The MFDB module aims to improve the effectiveness of feature extraction in a multi-modal image encoder, and the MMCB module realizes mutual calibration of multi-modal semantic features through the attention mechanism.
(3) The effectiveness and robustness of the MFCNet are verified on public and private datasets. The experimental results demonstrate that the proposed MFCNet achieves state-of-the-art performance on the two datasets.

The rest of this paper is organized as follows. In Section 2, the proposed MFCNet method is described in detail. In Section 3, the experimental data and evaluation indicators are introduced. In Section 4, the experimental results are presented and analyzed. The results are discussed in Section 5. Finally, in Section 6, the main conclusions are drawn.

2. Methods

2.1. Overview

Based on the U-Net architecture [30], this paper proposes a 3D dual-branch segmentation network. The architecture of the Multi-modal Fusion and Calibration Network (MFCNet) is shown in Fig. 1(a), from which it can be seen that it consists of the encoder, the decoder, the Multi-modal Fusion Down-sampling Block (MFDB), and the Multi-modal Mutual Calibration Block (MMCB).
(1) The encoder contains five encoding blocks, of which the first two consist of two convolutional blocks, as shown in Fig. 1(c). The remaining encoding blocks are designed as residual blocks [31], which contain three convolutional blocks, as shown in Fig. 1(d). The convolutional block structure is shown in Fig. 1(b), from which it can be seen that it contains a convolution with a kernel size of 3*3*3 and a stride of one, GroupNorm [32], and LeakyReLU [33]. (2) The decoder is composed of four decoding blocks, each of which contains a deconvolution layer (see Fig. 1(e)) and two convolutional blocks (see Fig. 1(f)). The final segmentation score map is obtained by a 1*1*1 convolution layer and a Sigmoid function. (3) The MFDB module is used to replace the pooling layer, and it is shared by the CT and PET encoding blocks of the same depth within the network. The details of the MFDB are shown in Fig. 2(a). (4) The MMCB structure is presented in Fig. 3(a); it is placed after each pair of decoding blocks in the decoder, which facilitates mutual calibration of the reconstructed features between images.
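To make the basic building blocks concrete, the following is a minimal PyTorch sketch of the convolutional block (3*3*3 convolution with stride one, GroupNorm, and LeakyReLU, Fig. 1(b)) and of a residual encoding block built from three such blocks (Fig. 1(d)). The class names, channel sizes, number of groups, and the 1*1*1 skip projection are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class ConvBlock3d(nn.Module):
    """3*3*3 convolution (stride 1) + GroupNorm + LeakyReLU, as in Fig. 1(b)."""
    def __init__(self, in_ch, out_ch, groups=8):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.norm = nn.GroupNorm(groups, out_ch)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class ResEncodingBlock(nn.Module):
    """Residual encoding block built from three convolutional blocks (Fig. 1(d))."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.blocks = nn.Sequential(
            ConvBlock3d(in_ch, out_ch),
            ConvBlock3d(out_ch, out_ch),
            ConvBlock3d(out_ch, out_ch),
        )
        # 1*1*1 projection so the residual connection matches the channel count
        self.skip = nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.blocks(x) + self.skip(x)

# One encoding step on a toy PET (or CT) feature map
x = torch.randn(1, 16, 48, 48, 48)
print(ResEncodingBlock(16, 32)(x).shape)  # torch.Size([1, 32, 48, 48, 48])
```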
2.2. Multi-modal Fusion Down-sampling Block

In the multi-modal tumor segmentation task on PET-CT images, on the one hand, the CT images contain rich anatomical information, which can help the network to identify differences between tissue and non-tissue regions and those between different organs. On the other hand, PET images can provide adequate functional information, such as global high-uptake region information, to indicate candidate tumor regions. Intuitively, a complementary fusion of the multi-modal data obtained from the PET-CT images can improve segmentation performance. However, commonly used multi-modal segmentation models tend to use multiple encoder branches to extract the information of each modality image independently, which makes it difficult for the multi-modal images to benefit from each other in the feature extraction process. Therefore, how to exploit the respective advantages of the CT and PET images so that they benefit from each other is still a challenging problem. To address this problem, this paper proposes a Multi-modal Fusion Down-sampling Block (MFDB), as shown in Fig. 2(a).

In the MFDB module, a series of convolutions is used to fuse the functional information (F_PET) obtained from a PET image and the anatomical information (F_CT) obtained from a CT image, thus helping the encoder to focus on the information fusion regions. Specifically, a convolution with a kernel size of 1*1*1 is used to reduce the number of channels in the fused feature maps. Next, a convolution with a kernel size of 2*2*2 is adopted to reduce the size of the feature map. Then, the fused feature map is obtained by two additional convolutional blocks to achieve cross-channel interaction and information integration, as shown in Fig. 2(b). Further, to preserve the specific features of the two modal images (i.e., the structural information of the CT images and the functional information of the PET images), a MaxPool block is used to down-sample the encoded features (i.e., with non-shared parameters) of the PET and CT images, respectively. Finally, these features are added to the convolution down-sampling result to obtain the final feature maps. Each MFDB module can be summarized as:

F_out = D_Conv(C(F_PET, F_CT)) + F_MaxPool    (1)

where F_out denotes the MFDB module output, D_Conv represents a series of convolutions, C is the concatenation operation, and F_MaxPool denotes the output of the pooling layer.

Fig. 2. The architecture of the down-sampling block. (a) The MFDB structure diagram. (b) The Conv-Block is a convolutional operation. The '2*2*2' and '2' denote the convolution kernel size and stride, respectively.
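As a concrete reading of Eq. (1), the sketch below shows how an MFDB-style module could be written in PyTorch: the concatenated PET and CT features pass through a 1*1*1 channel-reduction convolution, a 2*2*2 stride-2 down-sampling convolution, and two 3*3*3 convolutional blocks (D_Conv), and the max-pooled single-modal features are added back. Whether the fused map is shared by both branches and the exact channel sizes are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3*3*3 convolution + GroupNorm + LeakyReLU (the Conv-Block of Fig. 2(b))
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(8, out_ch),
        nn.LeakyReLU(inplace=True),
    )

class MFDB(nn.Module):
    """Sketch of the Multi-modal Fusion Down-sampling Block, Eq. (1)."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=1),        # channel reduction
            nn.Conv3d(channels, channels, kernel_size=2, stride=2),  # 2*2*2 down-sampling
            conv_block(channels, channels),
            conv_block(channels, channels),
        )
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, f_pet, f_ct):
        fused = self.fuse(torch.cat([f_pet, f_ct], dim=1))  # D_Conv(C(F_PET, F_CT))
        out_pet = fused + self.pool(f_pet)                   # Eq. (1), PET branch
        out_ct = fused + self.pool(f_ct)                     # Eq. (1), CT branch
        return out_pet, out_ct

f_pet = torch.randn(1, 16, 32, 32, 32)
f_ct = torch.randn(1, 16, 32, 32, 32)
out_pet, out_ct = MFDB(16)(f_pet, f_ct)
print(out_pet.shape)  # torch.Size([1, 16, 16, 16, 16])
```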
2.3. Multi-modal Mutual Calibration Block

Theoretically, when the input multi-modal PET-CT images are registered, the lesion is at the same location in the different modal images. However, due to the partial volume effect of PET images, the obtained lesion area is often incomplete. Therefore, in the manual segmentation process, radiologists first determine the extent of the lesion according to the high-uptake area in the PET image and then adjust the edge details of the lesion according to the anatomical structure of the CT image. However, this is not a one-way operation; lesion segmentation is a process of mutual calibration using the CT and PET images. Inspired by this, this study designs a plug-and-play Multi-modal Mutual Calibration Block (MMCB) to achieve mutual calibration of multi-modal semantic features, which can help single-modal feature extraction and highlight salient information. Specifically, one single-modality image can provide the current prediction results, while the other single-modality image can provide attention maps that help enhance attention to the regions of interest (tumors) while suppressing irrelevant features.

The structure is shown in Fig. 3(a). In the MMCB, the Sigmoid function is used to activate the semantic features obtained from the CT images (F_CT) into an attention map. Then, the attention map combined with the PET semantic features (F_PET) is fed into the Multi-scale Feature Extraction Block (MFEB), which is presented in Fig. 3(b), to extract multi-scale semantic features from the PET image; this process can be expressed by:

F_fusion = C(Sigmoid(F_CT), F_PET)    (2)

Finally, the rectified features are fused with the original features using residual connections to obtain the updated features, which can be expressed as follows:

F_cal = D_par(F_fusion) + F_PET    (3)

where F_cal represents the MMCB module output, and D_par denotes three sets of parallel convolutions with kernel sizes of 1*1*1, 3*3*3, and 5*5*5, which are used to capture different-scale information. Similarly, the same operation is performed on the feature maps obtained from the CT images to update the CT feature map.

Fig. 3. The calibration block structure. (a) The MMCB structure. (b) The MFEB structure. The '1*1*1' denotes the kernel size, and '1' denotes the stride.
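A minimal PyTorch sketch of this calibration step, following Eqs. (2) and (3), is given below. The three parallel convolutions (1*1*1, 3*3*3, and 5*5*5) stand in for the MFEB; how their outputs are merged (a simple sum here) and the channel sizes are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class MMCB(nn.Module):
    """Sketch of the Multi-modal Mutual Calibration Block, Eqs. (2)-(3)."""
    def __init__(self, channels):
        super().__init__()
        # D_par: three parallel convolutions with kernel sizes 1, 3, and 5 (MFEB, Fig. 3(b))
        self.branches = nn.ModuleList([
            nn.Conv3d(2 * channels, channels, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5)
        ])

    def forward(self, f_pet, f_ct):
        attn = torch.sigmoid(f_ct)                      # attention map from the other modality
        f_fusion = torch.cat([attn, f_pet], dim=1)      # Eq. (2): C(Sigmoid(F_CT), F_PET)
        multi_scale = sum(b(f_fusion) for b in self.branches)
        return multi_scale + f_pet                      # Eq. (3): residual connection

f_pet = torch.randn(1, 32, 24, 24, 24)
f_ct = torch.randn(1, 32, 24, 24, 24)
mmcb = MMCB(32)
pet_cal = mmcb(f_pet, f_ct)   # PET features calibrated by the CT attention map
ct_cal = mmcb(f_ct, f_pet)    # the symmetric CT calibration, as described above
print(pet_cal.shape)          # torch.Size([1, 32, 24, 24, 24])
```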
2.4. Implementation details

The experiments were performed on a PC with the Windows 10 operating system, an 11-GB NVIDIA GTX 2080Ti GPU, and 64 GB of memory. All networks were developed using PyTorch 1.5 with CUDA 10.1. Although various computational intelligence algorithms have been presented [34,35], the adaptive moment estimation method (Adam) [36] is still the most commonly used optimizer in network training. In this work, the Adam optimizer with an initial learning rate of 0.0001, which was divided by two every 20 epochs, was adopted. Each network was trained in an end-to-end manner for 100 epochs with a batch size of one. Dropout [37] was used to prevent the neural network from over-fitting during the training process, and the dropout rate was set to 0.5. For network initialization, the method proposed by He et al. [38] was used to initialize the convolutional filter weights. During the network training phase, the Dice loss was used as the loss function. For data augmentation, flipping (Z-axis), rotation (90°), and random Gaussian noise (mean = 0, std = 0.1) were randomly applied to each input training triplet (PET, CT, and gold standard) for online (dynamic) image data augmentation.
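The training configuration described above can be reproduced with standard PyTorch components, as in the hedged sketch below. A single 3D convolution stands in for the full MFCNet, and the soft Dice loss is one common formulation assumed here; only the optimizer, learning-rate schedule, and initialization follow the settings stated in this section.

```python
import torch
import torch.nn as nn

def dice_loss(pred, target, eps=1e-5):
    # Soft Dice loss on sigmoid-activated predictions (one common formulation)
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def he_init(m):
    # He et al. [38] (Kaiming) initialization of the convolutional filter weights
    if isinstance(m, (nn.Conv3d, nn.ConvTranspose3d)):
        nn.init.kaiming_normal_(m.weight, nonlinearity='leaky_relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Stand-in network: a single 3D convolution instead of the full MFCNet
model = nn.Conv3d(2, 1, kernel_size=3, padding=1)
model.apply(he_init)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# learning rate divided by two every 20 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

# One illustrative optimization step on random data (batch size of one)
x = torch.randn(1, 2, 32, 32, 32)                 # stacked PET-CT input
y = (torch.rand(1, 1, 32, 32, 32) > 0.5).float()  # toy gold-standard mask
optimizer.zero_grad()
loss = dice_loss(torch.sigmoid(model(x)), y)
loss.backward()
optimizer.step()
scheduler.step()
print(float(loss))
```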
3. Materials

3.1. Imaging data

A public and a private dataset were used to evaluate the proposed method. The public dataset was the Head and Neck (H&N) dataset [28,29], which contained data on 201 patients from four medical centers. All patients had histologically confirmed H&N carcinoma. The image resolution in the x and y directions was approximately 1.0 mm, while in the z-direction it varied from 1.5 mm to 3.0 mm, depending on the acquisition device. The private dataset was a PET-CT dataset consisting of data on 93 pancreatic cancer patients (Pancreas dataset) provided by the Radiology Department of Shanghai Changhai Hospital. The resolution of the CT images was 512 × 512 pixels at 0.98 mm × 0.98 mm (x-y axes), and that of the PET images was 168 × 168 pixels at 4.07 mm × 4.07 mm (x-y axes). In this dataset, the PET and CT images have the same resolution of 3 mm in the z-direction. Two radiologists with more than ten years of clinical experience outlined the tumor contours.

3.2. Data preprocessing

The PET-CT images were preprocessed before being fed into the network. First, considering the effect of breathing motion, the 3D CT images were registered with the FDG-PET images [39,40]. The images fed to the network were composed of numerically transformed 3D CT images (in Hounsfield units, HU) and FDG-PET images (in standardized uptake values, SUV) [41]. Second, trilinear interpolation was used to resample the data to a uniform resolution of 1 mm × 1 mm × 1 mm, and the images were cropped to a fixed size of 144 × 144 × 144. Finally, for image normalization, different normalization methods were employed according to the image type. For the CT images, a global normalization approach was used: all foreground voxels of the CT images in the training data were collected, the CT intensity values were clipped to the 0.5% and 99.5% percentiles of these values, and all images were then normalized using the global foreground mean and standard deviation. For the PET images, each image was normalized independently using the z-score method during the training and inference stages.
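The resampling and normalization steps described above can be illustrated with the toy PyTorch sketch below. The volume sizes, the foreground threshold, and the use of torch.quantile for the 0.5%/99.5% clipping are illustrative assumptions rather than the authors' actual pipeline; the subsequent cropping to 144 × 144 × 144 is omitted.

```python
import torch
import torch.nn.functional as F

def resample_trilinear(volume, spacing, new_spacing=(1.0, 1.0, 1.0)):
    # Trilinear resampling of a (D, H, W) volume to the target voxel spacing
    size = [int(round(s * sp / nsp)) for s, sp, nsp in zip(volume.shape, spacing, new_spacing)]
    return F.interpolate(volume[None, None], size=size, mode='trilinear',
                         align_corners=False)[0, 0]

def normalize_ct(ct, fg_values):
    # Global CT normalization: clip to the 0.5%/99.5% percentiles of the training
    # foreground intensities, then use the global foreground mean and std
    lo = float(torch.quantile(fg_values, 0.005))
    hi = float(torch.quantile(fg_values, 0.995))
    ct = ct.clamp(lo, hi)
    return (ct - fg_values.mean()) / (fg_values.std() + 1e-8)

def normalize_pet(pet):
    # Per-image z-score normalization for the PET (SUV) volumes
    return (pet - pet.mean()) / (pet.std() + 1e-8)

ct = torch.randn(40, 128, 128) * 300                       # toy CT volume in HU
pet = torch.rand(40, 64, 64) * 10                          # toy PET volume in SUV
ct_iso = resample_trilinear(ct, spacing=(3.0, 0.98, 0.98))
pet_iso = resample_trilinear(pet, spacing=(3.0, 4.07, 4.07))
ct_norm = normalize_ct(ct_iso, fg_values=ct_iso[ct_iso > -200])  # illustrative foreground mask
pet_norm = normalize_pet(pet_iso)
print(ct_iso.shape, pet_iso.shape)
```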
4. Results

4.1. Baseline and metrics

Since this study addresses 3D pancreatic tumor segmentation in multi-modal PET-CT images for the first time, there is no direct off-the-shelf baseline method that can be used for comparison. In theory, a V-Net with two encoder and decoder branches can be regarded as a baseline model. Aiming to limit the number of variable changes, the same modules as those in the proposed method were used in the baseline network (see Fig. 1(c), (d), and (f)). Then, three-fold cross-validation was performed for the performance evaluation and comparison of the proposed MFCNet on the H&N and Pancreas datasets. The proposed network's performance was evaluated from three aspects: Dice [42], Jaccard, and Hausdorff distance (HD) [43].

The Dice coefficient indicates the overlap between the model predictions and the ground-truth labels, and it is defined as follows:

Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|)    (4)

where X represents the predicted result output by the network, and Y represents the true label.

The Jaccard coefficient is calculated by:

Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|    (5)

where the numerator represents the intersection between the ground truth and the predicted result, while the denominator is the union of the two.

The HD has been widely used to evaluate the performance of medical image segmentation models, and it is defined as:

HD(X, Y) = max( max_{x∈X} min_{y∈Y} ‖x − y‖_2 , max_{y∈Y} min_{x∈X} ‖x − y‖_2 )    (6)

where ‖·‖_2 represents the Euclidean distance in 3D space.

The sensitivity (Sens) measures the proportion of positives that are correctly identified, and it is expressed by:

Sens = TP / (TP + FN)    (7)

Further, the precision (Pre) measures the proportion of truly positive voxels in the predictions, and it is calculated by:

Pre = TP / (TP + FP)    (8)

where TP (true positives) represents the number of positives that were correctly classified as positives, FP (false positives) is the number of negatives that were misclassified as positives, and FN (false negatives) refers to the number of positives that were misclassified as negatives.
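For reference, the five evaluation metrics of Eqs. (4)-(8) can be computed directly from binary prediction and ground-truth volumes, as in the sketch below. The Hausdorff distance is computed by brute force over voxel coordinates with torch.cdist, which is only practical for small masks; this is an illustrative implementation, not the evaluation code used by the authors.

```python
import torch

def dice(pred, gt):
    inter = (pred & gt).sum().float()
    return (2 * inter / (pred.sum() + gt.sum())).item()            # Eq. (4)

def jaccard(pred, gt):
    inter = (pred & gt).sum().float()
    return (inter / (pred | gt).sum().float()).item()              # Eq. (5)

def hausdorff(pred, gt):
    # Symmetric Hausdorff distance (Eq. (6)) over foreground voxel coordinates
    x = pred.nonzero().float()
    y = gt.nonzero().float()
    d = torch.cdist(x, y)                                           # pairwise Euclidean distances
    return torch.maximum(d.min(dim=1).values.max(),
                         d.min(dim=0).values.max()).item()

def sens_pre(pred, gt):
    tp = (pred & gt).sum().float()
    fn = (~pred & gt).sum().float()
    fp = (pred & ~gt).sum().float()
    return (tp / (tp + fn)).item(), (tp / (tp + fp)).item()         # Eqs. (7)-(8)

pred = torch.zeros(32, 32, 32, dtype=torch.bool)
gt = torch.zeros_like(pred)
pred[10:20, 10:20, 10:20] = True
gt[12:22, 12:22, 12:22] = True
print(dice(pred, gt), jaccard(pred, gt), hausdorff(pred, gt), sens_pre(pred, gt))
```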
4.2. Comparison with the state-of-the-art methods

To evaluate the performance of the multi-modal image segmentation methods quantitatively, the mean and standard deviation values of five metrics, including Dice, Jaccard, Sens, Pre, and HD, were calculated for the different methods, as shown in Table 1.

Table 1. Comparison of segmentation performance with the state-of-the-art methods on the Pancreas and H&N datasets. A ± B = Mean Value ± Standard Deviation.

Dataset    Method              Dice (%)       Jaccard (%)    Sens (%)       Pre (%)        HD (mm)
Pancreas   UNet (CT)           47.58 ± 4.07   33.11 ± 3.35   66.59 ± 6.36   46.80 ± 3.35   9.84 ± 0.99
           UNet (PET)          71.27 ± 2.41   58.18 ± 2.37   82.26 ± 3.24   70.47 ± 4.41   8.78 ± 3.36
           Xue et al. [26]     69.31 ± 2.32   55.05 ± 2.63   77.70 ± 4.92   70.20 ± 5.36   23.56 ± 7.13
           Zhong et al. [25]   72.46 ± 0.25   59.58 ± 0.80   78.78 ± 1.17   76.46 ± 0.99   7.63 ± 2.24
           Fu et al. [27]      73.79 ± 1.88   60.53 ± 1.64   78.74 ± 1.79   78.27 ± 0.55   8.98 ± 4.48
           MFCNet (ours)       76.20 ± 0.53   63.08 ± 0.70   84.26 ± 2.98   75.96 ± 2.13   6.84 ± 3.27
H&N        UNet (CT)           46.96 ± 6.15   33.81 ± 5.24   58.72 ± 3.86   46.52 ± 7.29   14.54 ± 3.14
           UNet (PET)          71.36 ± 3.51   59.38 ± 3.23   77.96 ± 3.94   70.87 ± 3.14   9.20 ± 1.60
           Xue et al. [26]     63.88 ± 5.38   51.54 ± 5.27   82.77 ± 6.29   57.14 ± 4.92   24.55 ± 7.45
           Zhong et al. [25]   71.10 ± 5.71   59.71 ± 4.98   74.98 ± 2.61   74.98 ± 2.64   7.86 ± 1.80
           Fu et al. [27]      70.64 ± 5.03   59.28 ± 4.42   76.21 ± 7.14   71.92 ± 2.20   8.24 ± 1.31
           MFCNet (ours)       74.14 ± 2.77   62.96 ± 2.24   77.20 ± 3.71   76.12 ± 2.88   6.41 ± 1.01

It can be seen from Table 1 that the proposed MFCNet achieves the best overall segmentation performance on both the Pancreas and H&N cancer datasets. The segmentation performance of UNet (CT) was the worst among all methods on both datasets, especially for Dice and Jaccard, achieving values of 47.58% and 33.11% on the Pancreas dataset and 46.96% and 33.81% on the H&N dataset, respectively. Compared to the CT images, the PET images, with their high sensitivity and specificity, improved the segmentation accuracy significantly. However, tumor heterogeneity and the partial volume effect of the PET images still caused a relatively low segmentation accuracy.

On the Pancreas dataset, the sensitivity of tumor segmentation in the PET images was almost consistent with that of the proposed MFCNet (82.26% and 82.39%). However, the U-Net (PET) segmentation achieved poor results in precision, indicating that there were many false-positive regions in the segmentation results. Compared with the U-Net (PET), the DFCN-CoSeg network proposed by Zhong et al. [25] mapped the multi-modal features to the same space by channel splicing, and the integration of multi-modal data improved the model precision by approximately 6%. The MSAM network proposed by Fu et al. [27] improved the precision by 7.8%, to 78.27%, by helping the network to focus on the important areas of the input image. The shared down-sampling module proposed by Xue et al. [26] realized the interaction between the multi-modal image information in the encoding stage, but its Dice coefficient decreased by 1.96% compared to UNet (PET). It is worth noting that the proposed MFCNet achieved the best performance among all methods, with average Dice, Jaccard, Sens, Pre, and HD values of 76.20%, 63.08%, 84.26%, 75.96%, and 6.84, respectively. A qualitative comparison of MFCNet and the other models is presented in Fig. 4, where it can be seen that MFCNet achieved the most accurate results among all models.

Generally, tumor segmentation was significantly more challenging on the H&N dataset than on the Pancreas dataset, as shown in Table 1. This is because head and neck tumors are more variable in shape and size and can occur in different locations, thus increasing the difficulty of tumor segmentation. As shown in Table 1, the comprehensive performance of the segmentation method proposed by Xue et al. [26] was the worst, with a Dice value of 63.88%, a precision of 57.14%, and an HD of 24.55 mm. Compared to the other methods, this network achieved a significant improvement in sensitivity (82.77%). In addition, the segmentation performance of the DFCN-CoSeg [25] and
MSAM [27] networks was comparable to that of UNet (PET), but they degraded significantly in terms of HD (7.86 and 8.24). Note that the performance of the MFCNet in Dice, Jaccard, Pre, and HD was significantly better than that of the other segmentation networks, reaching 74.14%, 62.96%, 76.12%, and 6.41, respectively. These results demonstrate that the proposed MFCNet is effective and can accurately segment head and neck tumors in PET-CT images. The visual performance comparison of the MFCNet and the state-of-the-art multi-modal PET-CT segmentation methods is shown in Fig. 5, where it can be seen that the proposed MFCNet obtains more accurate segmentation masks than the other methods.

Fig. 4. An example of the pancreas tumor segmentation results. The areas outlined with black and blue lines represent the ground truth and the segmented results, respectively. From left to right: ground truth (GT), the prediction results of the methods proposed by Xue et al. [26], Zhong et al. [25], Fu et al. [27], and the MFCNet method.

Fig. 5. An example of the head and neck tumor segmentation results. The areas outlined with black and blue lines represent the ground truth and the segmented results, respectively. From left to right: ground truth (GT), the prediction results of the methods proposed by Xue et al. [26], Zhong et al. [25], Fu et al. [27], and the MFCNet method.
This proved that the proposed MFCNet was more effective in the integration of multi-modal information than the other methods.

4.3. Ablation study on the Pancreas dataset

To validate the contributions of the MFDB and MMCB modules to the MFCNet, ablation experiments were conducted on the Pancreas dataset. Compared with the baseline, it can be seen from Table 2 that the "Baseline + MFDB" network achieves a substantial improvement in all evaluation metrics: the average Dice value increased by 2.27%, the average Jaccard value increased by 2.52%, and the average HD improved by 3.00. As also shown in Table 2, the combination of the MMCB module and the baseline (Baseline + MMCB) improved the segmentation performance. Compared to the baseline, the Dice value of the "Baseline + MMCB" method increased by 1.36%, reaching 74.65%, while the Jaccard and Hausdorff distance values changed from 59.53% and 10.50 to 61.35% and 8.65, respectively, implying that the MMCB module can improve the segmentation accuracy through mutual calibration of the multi-modal semantic features. Furthermore, when both the MFDB and MMCB modules were inserted into the baseline, the Dice coefficient increased by nearly 3.3%, reaching 76.20%, the Jaccard increased by 3.55%, and the average HD decreased from 10.50 to 6.84.

4.4. Ablation study on the H&N dataset

Next, the ablation experiments were also conducted on the H&N dataset to validate the proposed MFDB and MMCB modules. The experimental results are presented in Table 2. First, compared with the baseline, the "Baseline + MFDB" network improved the average Dice from 71.88% to 73.48% and the Jaccard from 60.34% to 62.34%, while the HD dropped from 8.44 to 7.43. This indicates that the MFDB module can improve the efficiency of feature extraction and is suitable for different data. Second, the MMCB module improved the average Dice by 1.49% (73.37% vs. 71.88%), the Jaccard by 1.65% (61.99% vs. 60.34%), and the HD by 0.92 (7.52 vs. 8.44). Finally, when the MFDB and MMCB modules were both used in model training, more accurate segmentation results were achieved. Compared to the baseline, the Dice, Jaccard, and HD of the MFCNet method were improved by 2.26%, 2.62%, and 2.03, respectively.

Table 2. Ablation experiment results of the MFDB and MMCB on the Pancreas and H&N datasets. A ± B = Mean Value ± Standard Deviation.

Dataset    Method            Dice (%)       Jaccard (%)    Sens (%)       Pre (%)        HD (mm)
Pancreas   Baseline          72.92 ± 3.27   59.53 ± 3.13   80.27 ± 3.90   75.65 ± 3.26   10.50 ± 4.88
           Baseline + MFDB   75.19 ± 1.86   62.05 ± 1.67   78.20 ± 2.47   80.29 ± 0.51   7.50 ± 3.05
           Baseline + MMCB   74.65 ± 2.11   61.53 ± 1.30   81.66 ± 1.06   77.19 ± 1.94   8.65 ± 3.77
           MFCNet (ours)     76.20 ± 0.53   63.08 ± 0.70   84.26 ± 2.98   75.96 ± 2.13   6.84 ± 3.27
H&N        Baseline          71.88 ± 5.31   60.34 ± 5.21   76.66 ± 3.81   73.75 ± 5.11   8.44 ± 1.70
           Baseline + MFDB   73.48 ± 3.55   62.34 ± 3.54   77.63 ± 3.28   74.51 ± 4.30   7.43 ± 1.93
           Baseline + MMCB   73.37 ± 3.12   61.99 ± 2.77   74.01 ± 3.62   74.01 ± 3.62   7.52 ± 2.05
           MFCNet (ours)     74.14 ± 2.77   62.96 ± 2.24   77.20 ± 3.71   76.12 ± 2.88   6.41 ± 1.01

5. Discussion

Accurate tumor segmentation is essential for automatic cancer diagnosis, screening, and radiotherapy. This study aims to design, develop, and evaluate a multi-modal deep learning method for tumor segmentation in PET-CT images. However, the limitations of PET-CT images make automatic segmentation challenging. Specifically, due to the partial volume effect and false positives in PET images, non-tissue areas and non-tumor areas can also show a high-uptake state in PET images, thus affecting the precision of the UNet (Pre: 70.47% and 70.87%). For CT images, the network can distinguish between tissue and non-tissue regions by learning semantic information, but it is still challenging to distinguish lesion areas from non-lesion areas within a tissue. A single-modal image provides limited information, which affects the network segmentation performance significantly. Joint multi-modal image segmentation has been shown to be effective in improving segmentation accuracy. The effectiveness of the feature fusion is crucial to the multi-modal segmentation model design, which means that a network should pay more attention to the valuable information and reduce the influence of interfering information. In this study, the MFCNet is proposed to retain the advantages of PET and CT images while overcoming their respective shortcomings through semantic feature fusion. The MFDB module is used to fuse multi-modal image information to help the network locate a tumor region more accurately in the feature extraction stage. Specifically, the semantic information obtained from the CT images can help the network locate the tissue correctly, while the PET images can help to locate the lesion area. In addition, the MMCB module is added after each pair of decoders of the same depth within the network, thus helping to optimize the segmentation results during the image reconstruction stage. Specifically, the attention mechanism highlights the relevant features among all candidate features. Then, the multi-scale feature extraction module extracts semantic information at different scales, expands the receptive field, and integrates the multi-modal raw semantics obtained by the decoder.

As shown in Table 1, the model proposed by Xue et al. [26] had the worst overall performance on the Pancreas and H&N datasets. Although this model achieved good results in the liver tumor segmentation task, because that segmentation task was based on liver masks, its segmentation accuracy was low when faced with more complex and variable tumors, especially the head and neck cancer data. In contrast, the proposed MFCNet exhibits the best segmentation performance on both datasets, which proves its robustness. Some non-tissue areas and non-tumor areas showed high uptake in PET images, making the network unable to identify the tumor location precisely. The MSAM [27] could not directly solve this problem, resulting in poor network performance on complex datasets. In contrast, the proposed method could effectively alleviate this problem by utilizing the CT image information. Further, the DFCN-CoSeg network [25] directly fused all modal semantic features through a concatenation operation. Fusing multi-modal information through a concatenation operation tends to introduce irrelevant information, which reduces the segmentation accuracy of the network; however, the multi-modal data fusion still enabled the network to achieve better segmentation performance on complex data.

Although the proposed method achieved good segmentation results on the Pancreas and H&N datasets, the proposed network still could not accurately segment tumors in extreme cases. This could be due to the lack of data with pixel-level annotations, which prevents the network from learning the relevant knowledge. To address this issue, future work will consider using semi-supervised methods or GANs to improve segmentation performance. In terms of semantic feature extraction, the attention network (Transformer) could be an effective approach. Finally, in terms of parameter optimization, metaheuristics [44–46] could be combined with convolutional neural networks to find optimal hyper-parameters.

6. Conclusion

In this study, a novel and effective deep learning-based method named MFCNet is designed for tumor segmentation in 3D multi-modal PET-CT images. A dual-branch U-Net is used as a segmentation
backbone network to extract and restore the multi-modal image features. In addition, the MFDB is proposed for fusing the complementary features of PET-CT images while preserving the specific features of the single-modal images, which improves the feature extraction effectiveness. Further, in the feature reconstruction stage, the MMCB is used to achieve calibration between the multi-modal data, prompting the network to focus on tumor regions. Finally, the proposed MFCNet is validated on the H&N and Pancreas datasets. The experimental results show that the proposed MFCNet exceeds other state-of-the-art methods, which proves the effectiveness of the proposed method.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Authors' contributions

Fei Wang: Writing - Original Draft, Code, and Experimentation. Chao Cheng: Data annotation. Weiwei Cao: Writing - Review & Editing. Zhongyi Wu and Zhaobang Liu: Funding acquisition. Heng Wang and Wenting Wei: Code. Zhuangzhi Yan and Zhaobang Liu: Writing - Review & Editing, Supervision, and Conceptualization.

Funding

This study was supported by the National Natural Science Foundation of China (Grant Nos. 62101551 and 62001471).

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declaration of competing interest

We confirm that neither the manuscript nor any parts of its content are currently under consideration or published in another journal. All authors listed have contributed to this manuscript and agreed to its submission. The authors declare that there is no conflict of interest regarding the publication of this paper.

References

[1] S. Kligerman, S. Digumarthy, Staging of non-small cell lung cancer using integrated PET/CT, Am. J. Roentgenol. 193 (5) (2009) 1203–1211.
[2] D.V. Sahani, P.A. Bonaffini, O.A. Catalano, A.R. Guimaraes, M.A. Blake, State-of-the-art PET/CT of the pancreas: current role and emerging indications, Radiographics 32 (4) (2012) 1133–1158.
[3] Y. Zhang, C. Cheng, Z. Liu, L. Wang, G. Pan, G. Sun, X. Yang, Radiomics analysis for the differentiation of autoimmune pancreatitis and pancreatic ductal adenocarcinoma in 18F-FDG PET/CT, Med. Phys. 46 (10) (2019) 4520–4530.
[4] Z. Liu, M. Li, C. Zuo, Z. Yang, X. Yang, S. Ren, Y. Peng, G. Sun, J. Shen, C. Cheng, X. Yang, Radiomics model of dual-time 2-[18F]FDG PET/CT imaging to distinguish between pancreatic ductal adenocarcinoma and autoimmune pancreatitis, Eur. Radiol. 31 (9) (2021) 6983–6991.
[5] Y. Cui, J. Song, E. Pollom, M. Alagappan, H. Shirato, D.T. Chang, A.C. Koong, R. Li, Quantitative analysis of 18F-fluorodeoxyglucose positron emission tomography identifies novel prognostic imaging biomarkers in locally advanced pancreatic cancer patients treated with stereotactic body radiation therapy, Int. J. Radiat. Oncol. Biol. Phys. 96 (1) (2016) 102–109.
[6] R. Hong, J. Halama, D. Bova, A. Sethi, B. Emami, Correlation of PET standard uptake value and CT window-level thresholds for target delineation in CT-based radiation treatment planning, Int. J. Radiat. Oncol. Biol. Phys. 67 (3) (2007) 720–726.
[7] Y.E. Erdi, O. Mawlawi, S.M. Larson, M. Imbriaco, H. Yeung, R. Finn, J.L. Humm, Segmentation of lung lesion volume by adaptive positron emission tomography image thresholding, Cancer 80 (S12) (1997) 2505–2509.
[8] C. Hu, C.P. Liu, J.S. Cheng, Y.L. Chiu, H.P. Chan, N.J. Peng, Application of whole-body FDG-PET for cancer screening in a cohort of hospital employees, Medicine 95 (44) (2016).
[9] B. Foster, U. Bagci, A. Mansoor, Z. Xu, D.J. Mollura, A review on segmentation of positron emission tomography images, Comput. Biol. Med. 50 (2014) 76–96.
[10] C. Xu, D. Tao, C. Xu, Large-margin multi-view information bottleneck, IEEE Trans. Pattern Anal. Mach. Intell. 36 (8) (2014) 1559–1572.
[11] P.H. Ahn, M.K. Garg, Positron emission tomography/computed tomography for target delineation in head and neck cancers, Semin. Nucl. Med. 38 (2) (2008) 141–148.
[12] D. Han, J. Bayouth, Q. Song, A. Taurani, M. Sonka, J. Buatti, X. Wu, Globally optimal tumor segmentation in PET-CT images: a graph-based co-segmentation method, in: Inf. Process. Med. Imaging, 2011, pp. 245–256.
[13] Q. Song, J. Bai, D. Han, S. Bhatia, W. Sun, W. Rockey, J.E. Bayouth, J.M. Buatti, X. Wu, Optimal co-segmentation of tumor in PET-CT images with context information, IEEE Trans. Med. Imag. 32 (9) (2013) 1685–1697.
[14] W. Ju, D. Xiang, B. Zhang, L. Wang, I. Kopriva, X. Chen, Random walk and graph cut for co-segmentation of lung tumor on PET-CT images, IEEE Trans. Image Process. 24 (12) (2015) 5854–5867.
[15] K. Yu, X. Chen, F. Shi, W. Zhu, B. Zhang, D. Xiang, A novel 3D graph cut based co-segmentation of lung tumor on PET-CT images with Gaussian mixture models, in: Proc. SPIE Med. Imaging, 2016, vol. 9784.
[16] S. Zheng, J. Guo, X. Cui, R.N.J. Veldhuis, M. Oudkerk, P.M.A. van Ooijen, Automatic pulmonary nodule detection in CT scans using convolutional neural networks based on maximum intensity projection, IEEE Trans. Med. Imag. 39 (3) (2020) 797–805, https://doi.org/10.1109/TMI.2019.2935553.
[17] L. Duan, X. Ni, Q. Liu, L. Gong, G. Yuan, M. Li, X. Yang, T. Fu, J. Zheng, Unsupervised learning for deformable registration of thoracic CT and cone-beam CT based on multiscale features matching with spatially adaptive weighting, Med. Phys. 47 (11) (2020) 5632–5647.
[18] Y. Fu, Y. Lei, T. Wang, K. Higgins, J.D. Bradley, W.J. Curran, T. Liu, X. Yang, LungRegNet: an unsupervised deformable image registration method for 4D-CT lung, Med. Phys. 47 (4) (2020) 1763–1774.
[19] W.W. Cao, J. Zheng, D.H. Xiang, S.S. Ding, H.T. Sun, X.D. Yang, Z.B. Liu, Y.K. Dai, Edge and neighborhood guidance network for 2D medical image segmentation, Biomed. Signal Process. Control 69 (2021) 102856.
[20] J. Dolz, K. Gopinath, J. Yuan, H. Lombaert, C. Desrosiers, I.B. Ayed, HyperDense-Net: a hyper-densely connected CNN for multi-modal image segmentation, IEEE Trans. Med. Imag. 38 (5) (2018) 1116–1126.
[21] A. Kumar, M. Fulham, D. Feng, J. Kim, Co-learning feature fusion maps from PET-CT images of lung cancer, IEEE Trans. Med. Imag. 39 (1) (2019) 204–217.
[22] Z. Zhong, Y. Kim, L. Zhou, K. Plichta, B. Allen, J. Buatti, X. Wu, 3D fully convolutional networks for co-segmentation of tumors on PET-CT images, in: Proc. IEEE 15th Int. Symp. Biomed. Imaging (ISBI), 2018, pp. 228–231, https://doi.org/10.1109/ISBI.2018.8363561.
[23] X. Zhao, L. Li, W. Lu, S. Tan, Tumor co-segmentation in PET/CT using multi-modality fully convolutional neural network, Phys. Med. Biol. 64 (1) (2018) 015011.
[24] Z. Diao, H. Jiang, X.H. Han, Y.D. Yao, T. Shi, EFNet: evidence fusion network for tumor segmentation from PET-CT volumes, Phys. Med. Biol. 66 (20) (2021) 205005.
[25] Z. Zhong, Y. Kim, K. Plichta, B.G. Allen, L. Zhou, J. Buatti, X. Wu, Simultaneous cosegmentation of tumors in PET-CT images using deep fully convolutional networks, Med. Phys. 46 (2) (2019) 619–633.
[26] Z. Xue, P. Li, L. Zhang, X. Lu, G. Zhu, P. Shen, S.A.A. Shah, M. Bennamoun, Multi-modal co-learning for liver lesion segmentation on PET-CT images, IEEE Trans. Med. Imag. 40 (12) (2021) 3531–3542.
[27] X. Fu, L. Bi, A. Kumar, M. Fulham, J. Kim, Multimodal spatial attention module for targeting multimodal PET-CT lung tumor segmentation, IEEE J. Biomed. Health Inform. 25 (9) (2021) 3507–3516.
[28] K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt, M. Pringle, L. Tarbox, The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository, J. Digit. Imag. 26 (6) (2013) 1045–1057.
[29] M. Vallieres, E. Kay-Rivest, L.J. Perrin, X. Liem, C. Furstoss, H.J. Aerts, N. Khaouam, P.F. Nguyen-Tan, C.S. Wang, K. Sultanem, J. Seuntjens, Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer, Sci. Rep. 7 (1) (2017) 1–14.
[30] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Proc. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241.
[31] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778.
[32] Y. Wu, K. He, Group normalization, in: Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[33] A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proc. Int. Conf. Mach. Learn., vol. 30, 2013.
[34] G.G. Wang, Y. Tan, Improving metaheuristic algorithms with information feedback models, IEEE Trans. Cybern. 49 (2) (2019) 542–555, https://doi.org/10.1109/TCYB.2017.2780274.
[35] G.G. Wang, D. Gao, W. Pedrycz, Solving multiobjective fuzzy job-shop scheduling problem by a hybrid adaptive differential evolution algorithm, IEEE Trans. Ind. Inf. 18 (12) (2022) 8519–8528, https://doi.org/10.1109/TII.2022.3165636.
[36] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[38] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, in: Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 1026–1034.
[39] D. Mattes, D.R. Haynor, H. Vesselle, T.K. Lewellen, W. Eubank, PET-CT image registration in the chest using free-form deformations, IEEE Trans. Med. Imag. 22 (1) (2003) 120–128.
[40] S. Pieper, M. Halle, R. Kikinis, 3D Slicer, in: Proc. IEEE Int. Symp. Biomed. Imaging: Nano to Macro, vol. 1, 2004, pp. 632–635.
[41] P. Masa-Ah, S. Soongsathitanon, A novel standardized uptake value (SUV) calculation of PET DICOM files using MATLAB, in: Proc. 10th WSEAS Int. Conf. Appl. Inform. Commun. and 3rd WSEAS Int. Conf. Biomed. Electron. Biomed. Inform., 2010, pp. 413–416.
[42] A.A. Taha, A. Hanbury, Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool, BMC Med. Imag. 15 (29) (2015).
[43] M.S. Choi, B.S. Choi, S.Y. Chung, N. Kim, J. Chun, Y.B. Kim, Clinical evaluation of atlas- and deep learning-based automatic segmentation of multiple organs and clinical target volumes for breast cancer, Radiother. Oncol. 153 (2020) 139–145.
[44] N. Bacanin, T. Bezdan, E. Tuba, I. Strumberger, M. Tuba, Monarch butterfly optimization based convolutional neural network design, Mathematics 8 (6) (2020) 936.
[45] T. Bezdan, S. Milosevic, K. Venkatachalam, M. Zivkovic, N. Bacanin, I. Strumberger, Optimizing convolutional neural network by hybridized elephant herding optimization algorithm for magnetic resonance image classification of glioma brain tumor grade, in: 2021 Zooming Innovation in Consumer Technologies Conference (ZINC), IEEE, 2021, pp. 171–176.
[46] K. Shankar, E. Perumal, R.M. Vidhyavathi, Deep neural network with moth search optimization algorithm based detection and classification of diabetic retinopathy images, SN Appl. Sci. 2 (748) (2020) 1–10.

Abbreviations

MFDB: Multi-modal Fusion Down-sampling Block
MMCB: Multi-modal Mutual Calibration Block
Sens: Sensitivity
Pre: Precision
HD: Hausdorff Distance