MRSCFusion: Joint Residual Swin Transformer and Multiscale CNN for Unsupervised Multimodal Medical Image Fusion
Xie et al., IEEE Transactions on Instrumentation and Measurement, vol. 72, 2023, Art. no. 5026917
which leads to the distortion of brightness, color, and other information.

The SR-based methods [12] combine the local and single-component fusion methods, such as orthogonal matching pursuit (OMP), simultaneous OMP (SOMP), and dictionary-based adaptive SR [13], [14], for image fusion. Although these methods have the advantage of easy solutions, their performance is constrained by the limited flexibility in designing fusion strategies, poor detail preservation ability, and low robustness to misregistration and noise.

The PCNN can extract image features and reduce the differences between the source images. PCNN-based image fusion methods are generally divided into two categories [15], [16], [17], [18]. The first is the combination of PCNN and multiscale transforms, such as PCNN and NSST [15], PCNN and wavelet transform [16], and PCNN and NSCT [17]. The other is to reduce the computational complexity of PCNN [18]. Although PCNN performs well in image fusion, the parameters of the PCNN model rely on experience and manual tuning. PCNN has recently seen more advanced adaptive developments; for example, the gradient descent method was employed to adapt the parameters and optimize the decay coefficient. However, these PCNN-based methods still have a high computational cost and require a difficult design of the fusion rule [19].

B. Deep-Learning Image Fusion

1) Non-End-to-End Image Fusion: Recently, deep neural networks have been introduced for image fusion to overcome the shortcomings of traditional approaches. Early on, deep neural networks were employed to extract image features [21], [23]. For example, Li et al. [21] designed a fusion network in which the images were decomposed into basic parts and salient parts. Then, the decision maps were calculated from the features and fused by average feature fusion. After that, the same authors proposed a deep fusion framework that used the pretrained ResNet-50 to extract deep features [23]. In [24], the feature extraction and the fusion were conducted by a single network. In addition, a CNN was trained on multiple blurred image patches to generate a decision map, from which the fused images could easily be reconstructed [25]. Inspired by DeepFuse [26], an autoencoder fusion network was developed in [27], which consisted of an encoder, a fusion layer, and a decoder. In this framework, the features of the multimodal images were extracted by Dense blocks [28] and CNN layers and then fused by a fusion strategy. Although it mitigated the degradation problem of traditional CNN-based networks, the network heavily relied on the fusion rules. In [29], NestFuse was designed to enhance the salient/intensity features, as well as preserve more details; besides, a novel attention model was proposed to fuse the images.

The deep-learning-based fusion models have achieved excellent performance; however, the above-mentioned methods generally employ handcrafted fusion strategies, such as element-wise addition and element-wise maximum, to combine the deep features, which hinders the optimal performance of the fusion model. It is worth noting that the parameters in these fusion strategies cannot be changed, and hence, they are unlearnable.

2) End-to-End CNN-Based Image Fusion: Several CNN-based end-to-end fusion methods have been suggested in [30], [31], [32], [33], [34], [35], and [36]. In [30], a generative adversarial network (GAN) was introduced, where the generator computed the fused images and the discriminator constrained the details. Later, DDcGAN [31] introduced a dual-discriminator architecture to enhance the prominence of thermal targets; nevertheless, although the fused images tend to be similar to the source images, they fail to preserve the details and also suffer content loss. It is difficult to obtain stable fusion images due to the use of a discriminator. To solve this problem, GANMcC [32] proposed a GAN-based method with multiclassification constraints to obtain better fused results with more significant contrast and abundant texture information. Zhang et al. [33] developed an end-to-end image fusion network (IFCNN) that was a simple and effective fusion network. IFCNN used two convolution layers to extract the deep features and an element-wise fusion rule to fuse them. Although IFCNN was a state-of-the-art image fusion model, its traditional fusion strategy was still not optimal. Xu et al. [34] proposed an unsupervised end-to-end network that tackled the fusion task by combining it with continual learning. Li et al. [35] designed a novel residual fusion network with a two-stage training strategy to supersede handcrafted fusion strategies; furthermore, they proposed a new loss to retain more details and salient information. Xu and Ma [36] proposed an unsupervised medical image fusion model whose network aimed at enhancing the chrominance information through surface- and deep-level constraints. However, these methods focused on learning spatially local features and did not take into account the long-range dependencies in images.

C. Vision Transformer

Although CNNs provide highly efficient and generalizable solutions for image fusion, their local feature extraction limits the capture of long-range dependencies. The Transformer model [43], which originated in natural language processing (NLP), can capture long-range dependencies and has become popular. The success of the Transformer stems from its self-attention (SA) mechanism, which can model long-range relationships. SA aims to capture the relationship between all features by transforming the inputs into query, key, and value using three learnable weight matrices.

The Vision Transformer (ViT) [44] was first proposed to perform image recognition tasks and has achieved excellent performance; moreover, according to most recent studies, the prediction errors of ViT are closer to those of humans than those of CNNs. These desirable properties of Transformers attract great interest in the medical community, and their applications alleviate the inherent inductive bias of CNNs. The Transformer can effectively encode organs distributed in large receptive fields by spatially modeling the relationship between distant pixels. Most Transformer-based models have shown superior performance over CNN-based methods. For image fusion, IFT [45] developed a Transformer-based multiscale fusion strategy.
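To make the self-attention mechanism described above concrete, the sketch below implements single-head scaled dot-product self-attention with the three learnable weight matrices (query, key, and value). It is a minimal, generic PyTorch illustration, not code from MRSCFusion or from the cited Transformer models.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention over a token sequence."""
    def __init__(self, dim: int):
        super().__init__()
        # Three learnable weight matrices map the input to query, key, and value.
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), e.g., flattened image patches.
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Every token attends to every other token, which is what captures
        # long-range dependencies beyond a convolution's local receptive field.
        return attn @ v

# Example: 64 patch tokens of dimension 96 from one image.
tokens = torch.randn(1, 64, 96)
out = SelfAttention(96)(tokens)
print(out.shape)  # torch.Size([1, 64, 96])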
The parameter α controls the balance between L_ssim and L_grad; here, we set α = 0.01 via the ablation study (Section IV-D).

a) Structural similarity loss: The structural similarity (SSIM) index has been widely used in the image processing community to characterize the distortion of image structures. We employ a structural similarity loss to constrain the structural similarity between the source images and the fused image; the loss is defined in terms of the SSIM between each source image and the fused image.

The parameters λ1 and λ2 control the trade-off between L_content and L_int; here, we set λ1 = 1000 and λ2 = 10 via the ablation study (Section IV-D).

IV. EXPERIMENTS AND DISCUSSIONS

In this section, we perform extensive experiments to demonstrate the effectiveness of the proposed image fusion strategy.
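The displayed equations for these losses are not reproduced in this excerpt. As a hedged sketch only, the structure implied by the surrounding fragments is consistent with the following, where the algebraic forms are assumptions and only the coefficient values α = 0.01, λ1 = 1000, and λ2 = 10 come from the text:

\begin{aligned}
L_{\mathrm{ssim}}    &= \bigl(1 - \mathrm{SSIM}(I_f, I_1)\bigr) + \bigl(1 - \mathrm{SSIM}(I_f, I_2)\bigr),\\
L_{\mathrm{content}} &= L_{\mathrm{ssim}} + \alpha\, L_{\mathrm{grad}}, \quad \alpha = 0.01,\\
L_{\mathrm{total}}   &= \lambda_1\, L_{\mathrm{content}} + \lambda_2\, L_{\mathrm{int}}, \quad \lambda_1 = 1000,\ \lambda_2 = 10,
\end{aligned}

with I_1 and I_2 the source images, I_f the fused image, and L_grad and L_int the gradient and intensity losses named in the text.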
TABLE I
EVALUATION METRICS FOR QUANTITATIVE COMPARISONS
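Table I lists the six quantitative metrics used in the comparisons (SCD, MS-SSIM, Q^{AB/F}, FMI, SF, and VIF). As a concrete illustration of one of them, the sketch below computes the spatial frequency (SF) of a fused image from its standard row/column-frequency definition; this is a minimal NumPy example for reference, not the evaluation code used in the paper.

import numpy as np

def spatial_frequency(img: np.ndarray) -> float:
    """Spatial frequency of a grayscale image (standard definition):
    SF = sqrt(RF^2 + CF^2), where RF and CF are the root-mean-square
    horizontal and vertical first-order differences, respectively."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))

# A larger SF indicates richer texture and edge detail in the fused image.
fused = np.random.rand(256, 256)  # placeholder for a fused image
print(spatial_frequency(fused))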
Fig. 9. Qualitative comparisons of MRSCFusion with other methods on three pairs of representative CT and MRI images. From left to right: CT images, MRI images, fusion results of IFCNN, FusionGAN, U2Fusion, DenseFuse, EMFusion, RFN-Nest, IFT, and MRSCFusion (ours).
A common practice for the batch size is to choose powers of 2, as that uses memory most efficiently. The number of epochs is the number of times that the network works through the training dataset; it depends on the type of network and the dataset. In this work, the batch size and the number of epochs are set to 4 and 2, respectively, since the training data are sufficient and the loss converges under the limited computational memory. In addition, the effectiveness of these parameters has been verified in previous studies [27], [29], [35], [45].

B. Comparison Methods and Evaluation Metrics

To demonstrate the effectiveness of the proposed MRSCFusion, we conduct comprehensive comparisons with seven representative fusion methods, including DenseFuse [27], FusionGAN [30], IFCNN [33], U2Fusion [34], RFN-Nest [35], EMFusion [36], and IFT [45]. We run these methods using their publicly available codes with the corresponding parameter settings.

Qualitative and quantitative comparisons of the fused results are implemented. The qualitative results intuitively evaluate the fusion performance based on visual perception. For multimodal medical image fusion, we expect the fused results to contain abundant texture information and appropriate intensity information. The quantitative results indicate the objective assessment of the fusion performance. We choose six popular quantitative evaluation metrics, including the sum of correlation differences (SCD) [55], the multiscale structural similarity index measure (MS-SSIM) [53], the edge information measurement Q^{AB/F} [56], feature mutual information (FMI) [57], spatial frequency (SF) [59], and visual information fidelity (VIF) [58]. These evaluation metrics and their descriptions are presented in Table I.

C. Experiment Results and Discussions

1) CT-MRI: To intuitively display the fusion performance, qualitative fusion comparisons on three pairs of representative CT and MRI images are illustrated in Fig. 9. From left to right are the CT images, the MRI images, and the fusion images with IFCNN, FusionGAN, U2Fusion, DenseFuse, EMFusion, RFN-Nest, IFT, and our MRSCFusion. In CT and MRI image fusion, we expect that dense structures (e.g., bone structures) in CT images and soft tissues in MRI images can be simultaneously retained in the fused images. To better demonstrate the fusion effect, we zoom in on an area with more dense structures and texture details, marked by a red box. As shown in Fig. 9, the brightness and sharpness of the fused images by FusionGAN, U2Fusion, DenseFuse, RFN-Nest, and IFT are unsatisfactory. With some comparison methods, the gray matter in the CT images blurs the details of the MRI images. Specifically, FusionGAN brings in much redundant information like
TABLE II
QUANTITATIVE COMPARISONS OF MRSCFUSION WITH SEVEN COMPETITORS ON CT-MRI FUSION. SIX METRICS ARE SHOWN BELOW (AVERAGE VALUE)
Fig. 10. Quantitative comparisons of MRSCFusion with other methods on 21 pairs of CT and MRI images. Six evaluation metrics are used, and average
values with different methods are marked in legends.
artifacts. In addition, for IFCNN, the fused images retain the structure details of the MRI images well, but the dense structures in the CT images are weakened. EMFusion can retain dense structures well but loses a little edge detail. By contrast, our MRSCFusion can preserve more edge details, as well as retain the dense structures, in the fused images.

The quantitative fusion comparisons with six evaluation metrics on 21 test CT and MRI image pairs are reported in Table II and Fig. 10. In Table II, we can find that MRSCFusion achieves the optimal results (average values) on the metrics SCD, MS-SSIM, Q^{AB/F}, FMI, and VIF, and ranks second on the metric SF. In Fig. 10, the evaluation scores of each metric on the 21 fused images are connected by a line, and the average value of each method is marked in the legend. Obviously, for MS-SSIM, Q^{AB/F}, and VIF, MRSCFusion achieves the highest scores on all 21 fused images. For SCD and FMI, it achieves higher scores than the competitors for most individual fused images, with the highest average values. The best results on SCD and MS-SSIM indicate that our fused images contain abundant information with less distortion and achieve higher structural similarities. The highest FMI indicates that MRSCFusion transfers more dense structures from the CT images to the fused images; moreover, the highest Q^{AB/F} reflects that more edge information from the MRI images is preserved, which benefits from the strong extraction ability of fine-grained details with the introduced GRDB. The highest VIF demonstrates that our fused images have a better visual effect, which is consistent with the qualitative results. In addition, for SF, MRSCFusion merely follows behind IFCNN, which implies that our fused images also contain rich texture details.

2) PET-MRI: The qualitative fusion comparisons on three pairs of representative PET and MRI images are illustrated in Fig. 11. From left to right are the PET images, the MRI images, and the fused images with different methods. We zoom in on an area with more functional (color) information and texture information, marked by a red box, to better exhibit the fusion effect. As shown in Fig. 11, MRSCFusion shows several superiorities. We find that most competitors can retain the functional information in the PET images well but lose some texture details in the MRI images. Specifically, the fused images by FusionGAN contain some redundant information, and texture details are blurred. IFCNN can extract the texture information well from the MRI images, but the fused images are a little dark, indicating that it cannot retain the color (functional) information well from the PET images. In addition, U2Fusion, EMFusion, RFN-Nest, IFT, and DenseFuse can achieve relatively satisfactory fusion results, while there is some loss of texture details. In contrast, our MRSCFusion well
Fig. 11. Qualitative comparisons of MRSCFusion with other methods on three pairs of representative PET and MRI images. From left to right: PET images, MRI images, fusion results of IFCNN, FusionGAN, U2Fusion, DenseFuse, EMFusion, RFN-Nest, IFT, and MRSCFusion (ours).
Fig. 12. Quantitative comparisons of MRSCFusion with other methods on 21 pairs of PET and MRI images. Six evaluation metrics are used, and average
values with different methods are marked in legends.
preserves texture details and meanwhile retains the functional information; moreover, there are fewer mosaics in our fusion images, which are more consistent with human vision.

The quantitative fusion comparisons of the six evaluation metrics on 21 test PET and MRI image pairs are reported in Table III and Fig. 12. In Table III, we can find that MRSCFusion achieves the optimal results on the metrics SCD, Q^{AB/F}, FMI, SF, and VIF, and ranks second on the metric MS-SSIM, following IFCNN by a narrow margin. In Fig. 12, the evaluation scores of each metric on the 21 fused images are connected by a line, and the average value of each method is marked in the legend. We can find that, for SCD, Q^{AB/F}, FMI, and SF, MRSCFusion achieves the highest scores on all 21 fused images. In addition, for VIF, it achieves higher scores than the competitors for most individual fused images, with the highest average values. The highest Q^{AB/F} and SF indicate that our fused images contain more edge information and texture details. The superior SCD, FMI, and MS-SSIM reflect that our MRSCFusion preserves more meaningful information and structural similarities from the source images; moreover, the best VIF shows that our fused images have a better visual effect, which is also consistent with the qualitative results. Overall, based on visual perception and the objective assessment, our MRSCFusion achieves better performance.

3) SPECT-MRI: The qualitative fusion comparisons on three pairs of representative SPECT and MRI images are illustrated in Fig. 13. From left to right are the SPECT images, the MRI images, and the fused images with different methods. To intuitively display the qualitative comparisons, we zoom in on a local area marked by a red box. Similar to PET and MRI image fusion, SPECT and MRI image fusion with MRSCFusion also has advantages. We can find that there are some artifacts or intensity distortions in the fused images of the competitors. Specifically, FusionGAN and DenseFuse over-preserve the functional information of the SPECT images, leading to artifacts. IFCNN can extract the texture information of the MRI images well but suffers from capturing the metabolism information of the SPECT images. EMFusion, U2Fusion, RFN-Nest, and IFT can also achieve satisfactory fusion results, while they cause a slight loss of details. In contrast, our MRSCFusion can retain the functional information of the SPECT images well and meanwhile preserve the structure details of the MRI images;
Fig. 13. Qualitative comparisons of MRSCFusion with other methods on three pairs of representative SPECT and MRI images. From left to right: SPECT images, MRI images, fusion results of IFCNN, FusionGAN, U2Fusion, DenseFuse, EMFusion, RFN-Nest, IFT, and MRSCFusion (ours).
Fig. 14. Quantitative comparisons of MRSCFusion with other methods on 21 pairs of SPECT and MRI images. Six evaluation metrics are used, and average
values with different methods are marked in legends.
TABLE III
QUANTITATIVE COMPARISONS OF MRSCFUSION WITH SEVEN COMPETITORS ON PET-MRI FUSION. SIX METRICS ARE SHOWN BELOW (MEAN VALUE)
furthermore, it preserves more edge information and has a better visual effect.

The quantitative fusion comparisons of the six evaluation metrics on 21 test SPECT and MRI image pairs are reported in Table IV and Fig. 14. In Table IV, we can find that MRSCFusion achieves the optimal results (average values) on most metrics, except for MS-SSIM, on which it ranks second, following IFCNN by a narrow margin. In Fig. 14, the evaluation scores of each metric on the 21 fused images are connected by a line, and the average value of each method is marked in the legend. We can find that, for SCD, Q^{AB/F}, FMI, SF, and VIF, MRSCFusion achieves higher scores than the other methods for most individual fused images, with the highest average values. With the best SCD and FMI, MRSCFusion can preserve more meaningful information from the source images. In addition, the optimal Q^{AB/F}, SF, and VIF indicate that our fused
TABLE IV
QUANTITATIVE COMPARISONS OF MRSCFUSION WITH SEVEN COMPETITORS ON SPECT-MRI FUSION. SIX METRICS ARE SHOWN BELOW (MEAN VALUE)
Fig. 16. Convergence of the losses without the weight block (left plot) and
with the weight block (right plot).
Fig. 15. Qualitative comparisons on RSCF module analysis. From left to right: MRI images, CT/PET/SPECT images, fusion images without RSCF, and fusion images with RSCF.

images have clearer quality and better visual effects. Overall, MRSCFusion achieves better performance, with the fused images containing more morphological information (texture and edge information) and functional information.

D. Ablation Study

1) Analysis of RSCF Fusion Module: The proposed RSCF fusion module boosts the comprehensive information with the guidance of RSCF. We implement ablation studies on the fusion strategy to verify its special role. Specifically, we train a model using only an autoencoder and employ "add" to replace RSCF as the fusion strategy, which means that the source features are added to generate the fused features. A typical example with/without the RSCF fusion module is displayed in Fig. 15. It can be noticed that the image edges obtained without RSCF are not clear enough. In the fusion images of CT-MRI and PET-MRI fusion, the information is lost seriously. The fusion results of SPECT-MRI fusion are also unsatisfactory, and the edge information is still blurred. Since the structure of the RSCF fusion module captures both local features and long-range dependencies, it can simultaneously enhance the description of texture details and retain the intensity distribution of prominent structures.

The quantitative fusion comparisons of the six evaluation metrics for CT-MRI, PET-MRI, and SPECT-MRI fusion are reported in Table V. We can find that the fusion results with the RSCF module surpass those without RSCF on the whole. It is worth noting that, for SPECT-MRI fusion, the fused result with RSCF has a lower MS-SSIM than that without RSCF. This is because the GRDB in the RSCF module refines the edges of color, which results in slightly lower similarities; however, it effectively improves the Q^{AB/F} and SF metrics.

2) Analysis of AWB: In MRSCFusion, we employ the AWB to assign learnable weights w_a and w_b for controlling the information preservation degree of the source images. To validate its effectiveness, we perform comparison experiments without the AWB, where the weights w_a and w_b are both fixed to 0.5. The experiments are carried out with 3000 pairs of images. The analysis covers two aspects: 1) the loss analysis and 2) the qualitative and quantitative results analysis.

The convergence of the losses in both cases is shown in Fig. 16. The left plot shows the loss on each fusion task without applying the AWB, while the right plot shows that with the weight block. Here, we perform 3000 iterations. It is worth noting that 300 points are drawn on the horizontal axis for exhibition. We can find that the losses with the weight block on the different tasks are lower than those without the weight block, which indicates that the weight block can reduce the loss of information.

The qualitative comparison results with/without the weight block on the different tasks are depicted in Fig. 17. The qualitative results show that our model has two advantages. First, we can find that the fusion images with the weight block have more details. The second is that the AWB can highlight the dense structure or color information in the source images. This advantage makes the fused images maintain high contrast. From Table VI, we can also find that most evaluation metrics with the weight block are higher than those without the weight block. The adaptive weights can assist the model in better
TABLE V
QUANTITATIVE COMPARISON RESULTS BETWEEN WITH/WITHOUT RSCF MODULE
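For context on the ablation above: the "add" baseline replaces the learned fusion module with an element-wise sum of the encoder features. The sketch below contrasts that unlearnable rule with a generic learnable fusion layer (a plain 1 × 1 convolution, used here purely as an illustration; it is not the RSCF module, whose global and local branches are described in the paper).

import torch
import torch.nn as nn

class AddFusion(nn.Module):
    """Handcrafted, parameter-free fusion: element-wise addition of source features."""
    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return feat_a + feat_b

class LearnableFusion(nn.Module):
    """Generic learnable fusion: concatenate features and mix them with a 1x1 conv.
    Illustrative only; the paper's RSCF module is a far richer design."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([feat_a, feat_b], dim=1))

# Example with dummy encoder features (batch=1, 64 channels, 64x64 maps).
fa = torch.randn(1, 64, 64, 64)
fb = torch.randn(1, 64, 64, 64)
print(AddFusion()(fa, fb).shape)          # torch.Size([1, 64, 64, 64])
print(LearnableFusion(64)(fa, fb).shape)  # torch.Size([1, 64, 64, 64])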
Fig. 17. Qualitative comparisons on analysis of the AWB. The first two
columns are MRI and CT/PET/SPECT images, and the last two columns are
the fused images without/with the adaptive weight block.
TABLE VI
QUANTITATIVE COMPARISONS ON ANALYSIS OF THE AWB
TABLE VII
RUNNING TIME OF DIFFERENT METHODS FOR FUSING TWO IMAGES OF SIZE 256 × 256 (UNIT: SECONDS)
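Table VII reports the per-pair fusion time. As a reference for how such a number is typically measured (a hedged sketch with a placeholder network, not the authors' benchmarking script), one can time a single forward pass over a pair of 256 × 256 inputs:

import time
import torch
import torch.nn as nn

# Placeholder network standing in for a fusion model; MRSCFusion itself is not reproduced here.
model = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
model.eval()

pair = torch.rand(1, 2, 256, 256)  # two source images stacked as channels

with torch.no_grad():
    start = time.perf_counter()
    fused = model(pair)
    elapsed = time.perf_counter() - start  # CPU timing; GPU timing would need synchronization

print(f"fused shape: {tuple(fused.shape)}, running time: {elapsed:.4f} s")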
than the convolutional operations [47]. It is worth noting, however, that our MRSCFusion can run on common hardware devices and achieves better performance with an acceptable computational cost.

V. CONCLUSION AND FUTURE WORK

In this work, an end-to-end unsupervised fusion model is proposed to tackle multimodal medical image fusion. We design a novel RSCF module that can effectively mine and fuse multiscale deep features. The proposed RSCF fusion module includes a global branch based on the RSTB for capturing the global contextual information, as well as a local branch based on the GRDB for capturing the local fine-grained information. To further effectively integrate more meaningful information from the source images and ensure the visual quality of the fused images, we define a joint loss function, including a content loss and an intensity loss, to constrain the RSCF fusion module; moreover, we introduce adaptive weights to control the information preservation degree of the source images. The proposed model follows a two-stage training strategy, where an autoencoder is trained to extract multiple deep features and reconstruct fused images in the first stage. Then, the RSCF fusion module is trained to fuse the multiscale features in the second stage. The proposed model is evaluated on multiple medical fusion tasks, where we have achieved better results in both qualitative and quantitative evaluations compared with other state-of-the-art fusion methods. In future work, we will further improve the performance of the proposed model and reduce its computational cost. In addition, carrying out a comprehensive clinical evaluation in medical applications is valuable future work, which will help the adoption of our strategy.

REFERENCES

[1] K. Padmavathi, C. S. Asha, and V. K. Maya, "A novel medical image fusion by combining TV-L1 decomposed textures based on adaptive weighting scheme," Eng. Sci. Technol., Int. J., vol. 23, no. 1, pp. 225–239, Feb. 2020.
[2] P. Ganasala and V. Kumar, "Multimodality medical image fusion based on new features in NSST domain," Biomed. Eng. Lett., vol. 4, no. 4, pp. 414–424, Dec. 2014.
[3] G. Wang, W. Li, X. Gao, B. Xiao, and J. Du, "Functional and anatomical image fusion based on gradient enhanced decomposition model," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–14, 2022.
[4] M. Yin, X. Liu, Y. Liu, and X. Chen, "Medical image fusion with parameter-adaptive pulse coupled neural network in nonsubsampled shearlet transform domain," IEEE Trans. Instrum. Meas., vol. 68, no. 1, pp. 49–64, Jan. 2019.
[5] S. Li, B. Yang, and J. Hu, "Performance comparison of different multi-resolution transforms for image fusion," Inf. Fusion, vol. 12, no. 2, pp. 74–84, Apr. 2011.
[6] J. Jinju, N. Santhi, K. Ramar, and B. S. Bama, "Spatial frequency discrete wavelet transform image fusion technique for remote sensing applications," Eng. Sci. Technol., Int. J., vol. 22, no. 3, pp. 715–726, Jun. 2019.
[7] R. Singh, R. Srivastava, O. Prakash, and A. Khare, "Multimodal medical image fusion in dual tree complex wavelet transform domain using maximum and average fusion rules," J. Med. Imag. Health Informat., vol. 2, no. 2, pp. 168–173, Jun. 2012.
[8] J. Chen, X. Li, L. Luo, X. Mei, and J. Ma, "Infrared and visible image fusion based on target-enhanced multiscale transform decomposition," Inf. Sci., vol. 508, pp. 64–78, Jan. 2020.
[9] L. Yang, B. L. Guo, and W. Ni, "Multimodality medical image fusion based on multiscale geometric analysis of contourlet transform," Neurocomputing, vol. 72, nos. 1–3, pp. 203–211, Dec. 2008.
[10] G. Bhatnagar, Q. M. J. Wu, and Z. Liu, "Directive contrast based multimodal medical image fusion in NSCT domain," IEEE Trans. Multimedia, vol. 15, no. 5, pp. 1014–1024, Aug. 2013.
[11] J. Jose et al., "An image quality enhancement scheme employing adolescent identity search algorithm in the NSST domain for multimodal medical image fusion," Biomed. Signal Process. Control, vol. 66, Apr. 2021, Art. no. 102480.
[12] J. Wang, C. Lu, M. Wang, P. Li, S. Yan, and X. Hu, "Robust face recognition via adaptive sparse representation," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2368–2378, Dec. 2014.
[13] X. Lu, B. Zhang, Y. Zhao, H. Liu, and H. Pei, "The infrared and visible image fusion algorithm based on target separation and sparse representation," Infr. Phys. Technol., vol. 67, pp. 397–407, Nov. 2014.
[14] M. Yin, P. Duan, W. Liu, and X. Liang, "A novel infrared and visible image fusion algorithm based on shift-invariant dual-tree complex shearlet transform and sparse representation," Neurocomputing, vol. 226, pp. 182–191, Feb. 2017.
[15] K. Zhang, Y. Huang, and C. Zhao, "Remote sensing image fusion via RPCA and adaptive PCNN in NSST domain," Int. J. Wavelets, Multiresolution Inf. Process., vol. 16, no. 5, Sep. 2018, Art. no. 1850037.
[16] Z. Wang and C. Gong, "A multi-faceted adaptive image fusion algorithm using a multi-wavelet-based matching measure in the PCNN domain," Appl. Soft Comput., vol. 61, pp. 1113–1124, Dec. 2017.
[17] S. Singh and D. Gupta, "Detail enhanced feature-level medical image fusion in decorrelating decomposition domain," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–9, 2021.
[18] C. Panigrahy, A. Seal, and N. K. Mahato, "MRI and SPECT image fusion using a weighted parameter adaptive dual channel PCNN," IEEE Signal Process. Lett., vol. 27, pp. 690–694, 2020.
[19] Y. Li, J. Zhao, Z. Lv, and J. Li, "Medical image fusion method by deep learning," Int. J. Cognit. Comput. Eng., vol. 2, pp. 21–29, Jun. 2021.
[20] H. Zhang, H. Xu, X. Tian, J. Jiang, and J. Ma, "Image fusion meets deep learning: A survey and perspective," Inf. Fusion, vol. 76, pp. 323–336, Dec. 2021.
[21] H. Li, X.-J. Wu, and J. Kittler, "Infrared and visible image fusion using a deep learning framework," in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 2705–2710.
[22] Z. Wang, Y. Wu, J. Wang, J. Xu, and W. Shao, "Res2Fusion: Infrared and visible image fusion based on dense Res2Net and double nonlocal attention models," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–12, 2022.
[23] H. Li, X.-J. Wu, and T. S. Durrani, "Infrared and visible image fusion with ResNet and zero-phase component analysis," Infr. Phys. Technol., vol. 102, Nov. 2019, Art. no. 103039.
[24] Y. Liu, X. Chen, H. Peng, and Z. Wang, "Multi-focus image fusion with a deep convolutional neural network," Inf. Fusion, vol. 36, pp. 191–207, Jul. 2017.
[25] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, "Image fusion with convolutional sparse representation," IEEE Signal Process. Lett., vol. 23, no. 12, pp. 1882–1886, Dec. 2016.
[26] K. R. Prabhakar, V. S. Srikar, and R. V. Babu, "DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4724–4732.
[27] H. Li and X.-J. Wu, "DenseFuse: A fusion approach to infrared and visible images," IEEE Trans. Image Process., vol. 28, no. 5, pp. 2614–2623, May 2019.
[28] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269.
[29] H. Li, X.-J. Wu, and T. Durrani, "NestFuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models," IEEE Trans. Instrum. Meas., vol. 69, no. 12, pp. 9645–9656, Dec. 2020.
[30] J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang, "FusionGAN: A generative adversarial network for infrared and visible image fusion," Inf. Fusion, vol. 48, pp. 11–26, Aug. 2019.
[31] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, "DDcGAN: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion," IEEE Trans. Image Process., vol. 29, pp. 4980–4995, 2020.
[32] J. Ma, H. Zhang, Z. Shao, P. Liang, and H. Xu, "GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–14, 2021.
[33] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, "IFCNN: A general image fusion framework based on convolutional neural network," Inf. Fusion, vol. 54, pp. 99–118, Feb. 2020.
[34] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, "U2Fusion: A unified unsupervised image fusion network," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 502–518, Jan. 2022.
[35] H. Li, X. Wu, and J. Kittler, "RFN-Nest: An end-to-end residual fusion network for infrared and visible images," Inf. Fusion, vol. 73, pp. 72–86, Sep. 2021.
[36] H. Xu and J. Ma, "EMFusion: An unsupervised enhanced medical image fusion network," Inf. Fusion, vol. 76, pp. 177–186, Dec. 2021.
[37] H. Xu, J. Ma, Z. Le, J. Jiang, and X. Guo, "FusionDN: A unified densely connected network for image fusion," in Proc. AAAI Conf. Artif. Intell. (AAAI), Apr. 2020, pp. 12484–12491.
[38] C. Liu, B. Yang, Y. Li, X. Zhang, and L. Pang, "An information retention and feature transmission network for infrared and visible image fusion," IEEE Sensors J., vol. 21, no. 13, pp. 14950–14959, Jul. 2021.
[39] L. Tang, J. Yuan, and J. Ma, "Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network," Inf. Fusion, vol. 82, pp. 28–42, Jun. 2022.
[40] Y. Long, H. Jia, Y. Zhong, Y. Jiang, and Y. Jia, "RXDNFuse: An aggregated residual dense network for infrared and visible image fusion," Inf. Fusion, vol. 69, pp. 128–141, May 2021.
[41] H. Zhang, H. Xu, Y. Xiao, X. Guo, and J. Ma, "Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity," in Proc. AAAI Conf. Artif. Intell. (AAAI), Apr. 2020, pp. 12797–12804.
[42] Y. Liu, X. Chen, J. Cheng, and H. Peng, "A medical image fusion method based on convolutional neural networks," in Proc. 20th Int. Conf. Inf. Fusion (Fusion), Jul. 2017, pp. 1–7.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, and L. Jones, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–13.
[44] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[45] V. Vs, J. M. J. Valanarasu, P. Oza, and V. M. Patel, "Image fusion transformer," in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2022, pp. 3566–3570.
[46] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[47] A. Lin, B. Chen, J. Xu, Z. Zhang, G. Lu, and D. Zhang, "DS-TransUNet: Dual Swin transformer U-Net for medical image segmentation," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–15, 2022.
[48] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, "SwinIR: Image restoration using Swin transformer," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 1833–1844.
[49] C.-M. Feng, Y. Yan, G. Chen, Y. Xu, L. Shao, and H. Fu, "Multi-modal transformer for accelerated MR imaging," 2021, arXiv:2106.14248.
[50] C. Feng, Y. Yan, H. Fu, L. Chen, and Y. Xu, "Task transformer network for joint MRI reconstruction and super-resolution," in Proc. Int. Conf. Med. Image Comput. Comput. Assist. Intervent. (MICCAI), 2021, pp. 307–317.
[51] X. Li, H. Chen, Y. Li, and Y. Peng, "MAFusion: Multiscale attention network for infrared and visible image fusion," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–16, 2022.
[52] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in Proc. Int. Workshop Multimodal Learn. Clinical Decis. Support, 2018, pp. 3–11.
[53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[54] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[55] V. Aslantas and E. Bendes, "A new image quality metric for image fusion: The sum of the correlations of differences," AEU-Int. J. Electron. Commun., vol. 69, no. 12, pp. 1890–1896, Dec. 2015.
[56] C. S. Xydeas and V. Petrović, "Objective image fusion performance measure," Electron. Lett., vol. 36, no. 4, pp. 308–309, 2000.
[57] M. B. A. Haghighat, A. Aghagolzadeh, and H. Seyedarabi, "A non-reference image fusion metric based on mutual information of image features," Comput. Electr. Eng., vol. 37, no. 5, pp. 744–756, Sep. 2011.
[58] Y. Han, Y. Cai, Y. Cao, and X. Xu, "A new image fusion performance metric based on visual information fidelity," Inf. Fusion, vol. 14, no. 2, pp. 127–135, Apr. 2013.
[59] K. Guo, X. Li, X. Hu, J. Liu, and T. Fan, "Hahn-PCNN-CNN: An end-to-end multi-modal brain medical image fusion framework useful for clinical diagnosis," BMC Med. Imag., vol. 21, no. 1, pp. 1–22, Jul. 2021.
[60] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," J. Big Data, vol. 6, no. 1, pp. 1–48, Jul. 2019.
[61] R. Livni et al., "On the computational efficiency of training neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2014, pp. 1–12.
[62] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, "Medical image fusion via convolutional sparsity based morphological component analysis," IEEE Signal Process. Lett., vol. 26, no. 3, pp. 485–489, Mar. 2019.

Xinyu Xie received the B.E. degree in communication engineering from the University of South China, Hengyang, China, in 2020, where she is currently pursuing the M.S. degree with the School of Electrical Engineering. Her research interests include computer vision, information fusion, and medical image processing.

Xiaozhi Zhang received the Ph.D. degree from the School of Information Engineering, Guangdong University of Technology, Guangzhou, China, in 2018. From 2017 to 2018, he was a joint Ph.D. student and a Research Assistant with the Department of Mathematics and Statistics, Curtin University, Perth, WA, Australia. He is an Associate Professor with the School of Electrical Engineering, University of South China, Hengyang, China. His main research interests include optimization algorithms and applications, machine learning, medical image processing, and time-frequency analysis. Dr. Zhang is a regular reviewer for many conferences and journals.
Shengcheng Ye received the B.E. degree in software engineering from the University of South China, Hengyang, China, in 2023. His research interests include computer vision and medical image processing.

Bin Yang (Member, IEEE) received the B.S. degree from the Zhengzhou University of Light Industry, Zhengzhou, China, in 2005, and the Ph.D. degree in electrical engineering from Hunan University, Changsha, China, in 2010. He joined the School of Electrical Engineering, University of South China, Hengyang, China, in 2010, where he is a Full Professor. His professional interests are information fusion, pattern recognition, and image processing. Dr. Yang won a Second-Grade National Award in Science of China in 2019.