Transformer-CNN Mixture Architecture

JST PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama, Japan; Shanghai Jiao Tong University, Shanghai, China
{jmliu@toki., hemingsun@aoni., katto@}waseda.jp
Abstract

Learned image compression (LIC) methods have exhibited promising progress and superior rate-distortion performance compared with classical image compression standards. Most existing LIC methods are Convolutional Neural Networks-based (CNN-based) or Transformer-based, which have different advantages. Exploiting both advantages is a point worth exploring, which has two challenges: 1) how to effectively fuse the two methods? 2) how to achieve higher performance with a suitable complexity? In this paper, we propose an efficient parallel Transformer-CNN Mixture (TCM) block that incorporates the local modeling ability of CNN and the non-local modeling ability of transformers, together with a channel-wise entropy model equipped with a parameter-efficient swin-transformer-based attention (SWAtten) module. Our method achieves state-of-the-art rate-distortion performance on three datasets of different resolutions (Kodak, Tecnick, and CLIC Professional Validation).

[Figure 1: visual comparison on kodim23 (bpp | PSNR | MS-SSIM in dB). Ground Truth; Ours [MSE]: 0.127bpp | 35.78dB | 16.25dB; VVC (VTM-12.1): 0.131bpp | 34.69dB | 15.04dB; WebP: 0.180bpp | 30.77dB | 12.47dB; Ours [MS-SSIM]: 0.114bpp | 30.81dB | 17.07dB.]
[Figure 2 diagram: Stage I/Stage II internals of the TCM block (1×1 convs, Leaky ReLU, 3×3 convs, split/concatenate, SwinT blocks with W-MSA and SW-MSA), and the overall model with a main path, a hyper-prior path, RBS ↓2 / RBU ↑2 blocks, subpixel upsampling, TCM blocks, and the channel-wise entropy model.]

Figure 2. The overall framework of our method (left: the designed architecture of the image compression model) and the proposed Transformer-CNN Mixture block (TCM block, right). "enc" and "dec" contain the processes of quantizing and range coding. ↑2 or ↓2 means that the feature size is enlarged/reduced by a factor of two compared to the previous layer.
[Figure 3 (ERF analysis): effective receptive field comparison of our model, the Transformer-based model, and the CNN-based model around a marked analysis point, visualized at thresholds t = 0.01 and t = 0.0001.]

The outputs of the CNN branch and the transformer branch are concatenated into a tensor of size C × H_F × W_F. Then, the concatenated tensor is input to another 1 × 1 convolutional layer to fuse local and non-local features. At last, a skip connection between F and the output is built to get F_out. In order to combine the residual network with the Swin-Transformer more effectively, we divide TCM into two stages with similar processes. In the transformer block of stage I, we use window-based multi-head self-attention (W-MSA). In stage II, we use shifted window-based multi-head self-attention (SW-MSA) in the transformer block. The benefit of this is that the residual networks are inserted between the common two consecutive Swin-transformer blocks, which is more effective for feature fusion. Both stages can be formulated as follows:

$$
\begin{aligned}
F_{cnn}, F_{trans} &= \mathrm{Split}(\mathrm{Conv}_{1\times 1}(F)) \\
F_{cnn}, F_{trans} &= \mathrm{Res}(F_{cnn}),\ \mathrm{SwinT}(F_{trans}) \\
F_{out} &= F + \mathrm{Conv}_{1\times 1}(\mathrm{Cat}(F_{cnn}, F_{trans}))
\end{aligned} \tag{4}
$$
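To make the two-stage structure concrete, the following is a minimal PyTorch-style sketch of one TCM stage following Eq. (4). The class and argument names, the even channel split, and the identity stand-ins for the residual and Swin blocks are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TCMStage(nn.Module):
    """One stage of the TCM block, following Eq. (4): a 1x1 conv, a channel
    split into a CNN branch (local features) and a transformer branch
    (non-local features), concatenation, a fusing 1x1 conv, and a skip
    connection. `res_block` and `swin_block` must map channels//2 ->
    channels//2; stage I would use a W-MSA Swin block, stage II an SW-MSA
    (shifted-window) one."""

    def __init__(self, channels: int, res_block: nn.Module, swin_block: nn.Module):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, 1)
        self.conv_out = nn.Conv2d(channels, channels, 1)
        self.res = res_block    # CNN branch
        self.swin = swin_block  # transformer branch

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f_cnn, f_trans = self.conv_in(f).chunk(2, dim=1)  # Split(Conv1x1(F))
        f_cnn, f_trans = self.res(f_cnn), self.swin(f_trans)
        return f + self.conv_out(torch.cat([f_cnn, f_trans], dim=1))

# Shape check with identity branches standing in for Res/SwinT:
stage = TCMStage(128, res_block=nn.Identity(), swin_block=nn.Identity())
print(stage(torch.randn(1, 128, 32, 32)).shape)  # torch.Size([1, 128, 32, 32])
```

Chaining two such stages places the residual branches between two consecutive W-MSA/SW-MSA blocks, which is exactly the fusion pattern described above.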
[Figure 4 diagram: slice networks e_i built from 3×3 convs with GELU, RBS ↓2 / RBU ↑2 blocks, subpixel upsampling, TCM blocks, an SWAtten module, a parameters net producing the scale entropy parameters, and latent residual prediction (LRP).]

Figure 4. The proposed channel-wise entropy model. The encoded slice y_{<i} can assist the encoding of the subsequent slice y_i.

[Figure 5 panels: Raw Image; y (w/o SWAtten); y (w/ SWAtten); scaled deviation map (w/o SWAtten); scaled deviation map (w/ SWAtten).]

Figure 5. The scaled deviation map and the channel with the largest entropy of the latent representation y, from the model without/with SWAtten. For kodim01 (top), the [PSNR@bit-rate@s] of the model without and with SWAtten are [[email protected]@0.451] and [[email protected]@0.389]. For kodim05 (bottom), they are [[email protected]@0.422] and [[email protected]@0.365].
return at a distance far from point p). This means that our model also has the long-distance modeling ability of transformers. It is also worth noting that at t = 0.0001, the CNN-based model exhibits a circular ERF, while the ERF of our model exhibits a shape closer to the context (a long strip shape like the background scarf). This shows that our model has better modeling ability compared to the CNN-based model.

3.3. Proposed Entropy Model

Motivated by [13, 27], we propose a channel-wise auto-regressive entropy model with a parameter-efficient swin-transformer-based attention module (SWAtten) that uses channel squeezing. The framework is shown in Fig. 4.

3.3.1 SWAtten Module

Past works on attention have demonstrated its effectiveness for image compression. However, many such modules are time-consuming or can only capture local information. Different from these modules, which are placed on both the main path (g_a and g_s) and the hyper-prior path (h_a and h_s) of image compression, we design an attention module for the entropy model, which has 1/16 the input size of the main path and therefore greatly reduces complexity. The designed parameter-efficient swin-transformer-based attention (SWAtten) module is shown in Fig. 6. A swin-transformer block, which can capture non-local information, is added to the architecture, while the residual blocks (RB) capture local information. Since the number of channels of the features input to e_i accumulates as the slice index i increases, the number of input channels of e_i can be expressed as:

$$M + i \times (M / s) \tag{5}$$

where M is the number of channels of the latent variable y and s is the total number of slices. The input channel number of e_9 can reach 608 when s is set to 10 and M is set to 320, which makes the model parameter-consuming. To balance complexity and RD-performance, the total number of slices s is reduced from 10, the common setting in [27], to 5. At the same time, a channel squeeze operation is used to squeeze the input channels. In this paper, we squeeze the input channels of all slices to 128, i.e., we let the output channels of the first 1 × 1 convolutional layer be 128. At last, an unsqueeze operation restores the output channels to the original number, i.e., the output channels of the last 1 × 1 convolutional layer are M + i × (M/s).
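As a sanity check on Eq. (5) and the channel-squeeze design, the short Python sketch below tabulates the per-slice input channel counts; the values M = 320, s ∈ {5, 10}, and the 128-channel squeeze width come from the text, while the helper function itself is only illustrative.

```python
def slice_input_channels(M: int, s: int) -> list[int]:
    """Input channels of each slice network e_i, per Eq. (5): M + i * (M // s)."""
    return [M + i * (M // s) for i in range(s)]

M = 320  # channels of the latent variable y

# Common setting in [27]: s = 10 slices. The last slice e_9 already receives
# 320 + 9 * 32 = 608 input channels, which is parameter-consuming.
print(slice_input_channels(M, s=10)[-1])  # 608

# Our setting: s = 5 slices; every slice is squeezed to 128 channels by the
# first 1x1 conv of SWAtten and unsqueezed back to M + i * (M // s) by the last.
for i, c in enumerate(slice_input_channels(M, s=5)):
    print(f"e_{i}: {c} -> squeeze to 128 -> unsqueeze to {c}")
```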
As shown in Fig. 5, the scaled deviation decreases from 0.451 and 0.422 for the model without SWAtten to 0.389 and 0.365 by using the model with SWAtten. This suggests that the model with SWAtten can have less information loss and get a higher-quality decompressed image.

[Figure 6 diagram: the SWAtten module, built from channel squeeze and unsqueeze 1×1 convolutions, residual blocks (RB ×3 per branch), a SwinT block, 3×3 and 1×1 convolutions, and a sigmoid gate.]
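Reading the components of Fig. 6 together with the gated attention modules of [6, 37], one plausible wiring of SWAtten is sketched below. Only the building blocks (channel squeeze/unsqueeze 1×1 convolutions, residual blocks, a swin-transformer block, and a sigmoid gate) are given by the figure; the exact branch topology here is an assumption.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block (3x3 conv -> LeakyReLU -> 3x3 conv + skip)."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

SwinBlock = ResBlock  # stand-in: a real SWAtten uses a (shifted-)window Swin-transformer block here

class SWAtten(nn.Module):
    """Hypothetical SWAtten wiring: channel squeeze -> trunk (RB x3) modulated
    by an attention branch (SwinT block + RB x3 + 1x1 conv + sigmoid) with a
    skip connection -> channel unsqueeze. The branch topology is assumed from
    Fig. 6 and the attention modules of [6, 37], not taken from released code."""
    def __init__(self, in_ch: int, squeezed_ch: int = 128):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeezed_ch, 1)    # squeeze to 128 channels
        self.trunk = nn.Sequential(*[ResBlock(squeezed_ch) for _ in range(3)])
        self.attn = nn.Sequential(
            SwinBlock(squeezed_ch),                        # non-local information
            *[ResBlock(squeezed_ch) for _ in range(3)],
            nn.Conv2d(squeezed_ch, squeezed_ch, 1),
            nn.Sigmoid(),                                  # gate in (0, 1)
        )
        self.unsqueeze = nn.Conv2d(squeezed_ch, in_ch, 1)  # restore M + i*(M/s) channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.squeeze(x)
        f = f + self.trunk(f) * self.attn(f)               # gated residual attention
        return self.unsqueeze(f)
```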
4. Experiments

4.1. Experimental Setup

4.1.1 Training Details
Figure 7. Performance evaluation on the Kodak dataset.
Figure 8. Performance evaluation on the CLIC Professional Validation dataset.

Figure 9. Performance evaluation on the Tecnick dataset.
4.2. Rate-Distortion Performance

We compare our large model with state-of-the-art (SOTA) learned end-to-end image compression algorithms, including [3], [6], [34], [5], [15], [13], and [37]. The classical image compression codec VVC [29] is also tested, using VTM-12.1. The rate-distortion performance on the Kodak dataset is shown in Fig. 7. Both PSNR and MS-SSIM are tested on Kodak to demonstrate the robustness of our method. Here, we convert MS-SSIM to −10 log10(1 − MS-SSIM) for clearer comparison. As we can see, at the same bitrate, we improve PSNR by up to about 0.4dB and MS-SSIM by up to about 0.5dB compared with SOTA methods. The results on the CLIC dataset and the Tecnick dataset are shown in Fig. 8 and Fig. 9, respectively. We achieve similarly good results on these two datasets. These results suggest that our method is robust and achieves SOTA performance on all three datasets, which have different resolutions. For quantitative results, we present the BD-rate [4] computed from PSNR-bpp curves as the quantitative metric. The anchor RD-performance is set as the results of VVC on the different datasets (BD-rate = 0%). Our method outperforms VVC (VTM-12.1) by 12.30%, 13.71%, and 11.85% in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively. Table 1 shows partial results on Kodak. More comparisons are reported in the Supplementary.
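Since BD-rate [4] is the headline metric of this comparison, here is a self-contained sketch of the standard Bjontegaard computation (a cubic fit of log-rate as a function of PSNR, integrated over the overlapping quality range), together with the MS-SSIM-to-dB conversion used in Fig. 7. The four-point RD curves in the example are made-up numbers for illustration only.

```python
import numpy as np

def msssim_db(msssim: np.ndarray) -> np.ndarray:
    """Convert MS-SSIM to decibels, as plotted in Fig. 7: -10 * log10(1 - MS-SSIM)."""
    return -10.0 * np.log10(1.0 - msssim)

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """Bjontegaard delta rate [4]: average bitrate change (%) of the test codec
    vs. the anchor at equal quality, from (bpp, PSNR) points. Fits a cubic of
    log-rate over PSNR and integrates the gap on the overlapping PSNR range.
    Negative values mean bitrate savings relative to the anchor."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return float((10.0 ** avg_log_diff - 1.0) * 100.0)

# Hypothetical four-point RD curves (bpp, PSNR in dB) for anchor and test:
vvc  = ([0.20, 0.35, 0.55, 0.85], [31.5, 33.5, 35.3, 37.2])
ours = ([0.18, 0.32, 0.50, 0.80], [31.7, 33.8, 35.6, 37.5])
print(bd_rate(*vvc, *ours))  # < 0: bitrate saved relative to the VVC anchor
```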
4.3. Ablation Studies

4.3.1 Comparison with Transformer-only/CNN-only Based Models

To show the effectiveness of our proposed Transformer-CNN Mixture (TCM) blocks, we compare our medium model without SWAtten modules to the Transformer-only based model and the CNN-only based model in [36]. The results are shown in Fig. 10a. "Conv ChARM" and "SwinT ChARM" are a CNN-only based model and a Transformer-only based model, respectively. They have architectures similar to our method without SWAtten modules; the difference is that "SwinT ChARM" uses Swin-transformer blocks, "Conv ChARM" uses convolutional neural networks, and we use the proposed TCM block. By using the advantages of both transformer and CNN, our method surpasses the Transformer-only based and CNN-only based models.

4.3.2 SWAtten Module

In Fig. 10b, we compare the cases with and without SWAtten modules. It can be observed that SWAtten modules bring a significant gain in RD-performance. Meanwhile, by using the channel squeeze operation in SWAtten modules, we get performance comparable to the situation without that operation while saving many parameters. Section 4.5 gives more information on the parameter-efficiency gain of this operation.

Figure 10. Experiments on the Kodak dataset. (a) Comparison with Transformer-only/CNN-only based models. (b) The ablation studies on the SWAtten module ("w/o squeeze" means that we do not add the channel squeeze/unsqueeze operation to SWAtten). (c) The RD-performance of models using different attention modules.

4.4. Various Attention Modules

In Fig. 10c, we compare our proposed SWAtten with previous attention modules, including the non-local attention (NonlocalAtten) module [21], the local attention (LocalAtten) module [6], and the window-based attention (WAtten) module [37]. Compared with these attention modules, SWAtten obtains the best RD-performance because it captures non-local information while also paying enough attention to local information.

4.5. Complexity and Qualitative Results

We test the complexity and qualitative results of different methods on Kodak. Two other SOTA works [34, 36] are also tested, as Table 1 shows. The results suggest that both the efficiency and the RD-performance of our method outperform these two methods. Meanwhile, after using channel squeeze in SWAtten, we save many parameters and FLOPs while keeping a comparable BD-rate. It should also be noted that all of our small, medium, and large models achieve SOTA RD-performance. Meanwhile, the performance further improves as the complexity increases, which shows that our model has a lot of potential.

Table 1. The RD-performance and complexity of learned image compression models on Kodak using a GPU (RTX 3090). A lower BD-rate indicates higher RD-performance.

Methods                   | Encoding Time (ms) | Decoding Time (ms) | Parameters (M) | FLOPs (G) | BD-rate (%)
Xie et al. [34]           | 2346               | 5212               | 47.55          | 408.21    | -1.65
SwinT ChARM [36]          | 132                | 84                 | 60.55          | 230.37    | -4.02
Ours (Large, w/o squeeze) | 151                | 141                | 160.45         | 830.90    | -12.54
Ours (Large)              | 150                | 140                | 75.89          | 700.96    | -12.30
Ours (Medium)             | 130                | 122                | 58.72          | 415.20    | -9.65
Ours (Small)              | 109                | 102                | 44.96          | 211.54    | -7.39

4.6. Visualization

Fig. 1 shows a visualization example of decompressed images (kodim23) from the Kodak dataset produced by our methods and the classical compression standards WebP and VVC (VTM-12.1) [29]. In some complex texture parts, our methods keep more details (e.g., a clearer feather outline).

5. Conclusion

In this paper, we incorporate transformers and CNNs to propose an efficient parallel transformer-CNN mixture block that utilizes the local modeling ability of CNNs and the non-local modeling ability of transformers. Then, a new image compression architecture is designed based on the TCM block. Besides, we present a swin-transformer-based attention module to improve channel-wise entropy models. Experimental results show that the image compression model with TCM blocks outperforms CNN-only/Transformer-only based models at a suitable complexity. Furthermore, the performance of SWAtten surpasses previous attention modules designed for image compression. At last, our method achieves state-of-the-art performance on three datasets of different resolutions (i.e., Kodak, Tecnick, CLIC Professional Validation) and is superior to existing image compression methods.

6. Acknowledgment

This paper is supported by Japan Science and Technology Agency (JST), under Grant JPMJPR19M5; Japan Society for the Promotion of Science (JSPS), under Grant 21K17770; Kenjiro Takayanagi Foundation; the Foundation of Ando Laboratory; and NICT, Grant Number 03801, Japan.
References

[1] Nicola Asuni and Andrea Giachetti. TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms. In STAG, pages 63–70, 2014.
[2] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. In International Conference on Learning Representations, 2016.
[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018.
[4] Gisle Bjontegaard. Calculation of average PSNR differences between RD-curves. VCEG-M33, 2001.
[5] Fangdong Chen, Yumeng Xu, and Li Wang. Two-stage octave residual network for end-to-end image compression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[6] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Learned image compression with discretized Gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7939–7948, 2020.
[7] CLIC. Workshop and challenge on learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[8] Jia Deng. A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
[10] Haisheng Fu, Feng Liang, Jianping Lin, Bing Li, Mohammad Akbari, Jie Liang, Guohe Zhang, Dong Liu, Chengjie Tu, and Jingning Han. Learned image compression with discretized Gaussian-Laplacian-logistic mixture model and concatenated residual modules. arXiv preprint arXiv:2107.06463, 2021.
[11] Google. Web picture format, 2010.
[12] Zongyu Guo, Zhizheng Zhang, Runsen Feng, and Zhibo Chen. Causal contextual prediction for learned image compression. IEEE Transactions on Circuits and Systems for Video Technology, 32(4):2329–2341, 2021.
[13] Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang. ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[14] Dailan He, Yaoyan Zheng, Baocheng Sun, Yan Wang, and Hongwei Qin. Checkerboard context model for efficient learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14771–14780, 2021.
[15] Jun-Hyuk Kim, Byeongho Heo, and Jong-Seok Lee. Joint global and local hierarchical priors for learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5992–6001, 2022.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Eastman Kodak. Kodak lossless true color image suite (PhotoCD PCD0992), 1993.
[18] A. Burakhan Koyuncu, Han Gao, and Eckehard Steinbach. Contextformer: A transformer with spatio-channel attention for context modeling in learned image compression. In European Conference on Computer Vision, 2022.
[19] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using Swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
[20] Fangzheng Lin, Heming Sun, Jinming Liu, and Jiro Katto. Multistage spatial context models for learned image compression. arXiv preprint arXiv:2302.09263, 2023.
[21] Haojie Liu, Tong Chen, Peiyao Guo, Qiu Shen, Xun Cao, Yao Wang, and Zhan Ma. Non-local attention optimized deep image compression. arXiv preprint arXiv:1904.09757, 2019.
[22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[23] Ming Lu, Peiyao Guo, Huiqing Shi, Chuntong Cao, and Zhan Ma. Transformer-based image compression. arXiv preprint arXiv:2111.06707, 2021.
[24] Ao Luo, Heming Sun, Jinming Liu, and Jiro Katto. Memory-efficient learned image compression with pruned hyperprior module. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3061–3065, 2022.
[25] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. Advances in Neural Information Processing Systems, 29, 2016.
[26] David Minnen, Johannes Ballé, and George D. Toderici. Joint autoregressive and hierarchical priors for learned image compression. Advances in Neural Information Processing Systems, 31, 2018.
[27] David Minnen and Saurabh Singh. Channel-wise autoregressive entropy models for learned image compression. In IEEE International Conference on Image Processing (ICIP), pages 3339–3343, 2020.
[28] Yichen Qian, Xiuyu Sun, Ming Lin, Zhiyu Tan, and Rong Jin. Entroformer: A transformer-based entropy model for learned image compression. In International Conference on Learning Representations, 2021.
[29] Joint Video Experts Team. VVC official test model VTM, 2021.
[30] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[31] Gregory K. Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992.
[32] Dezhao Wang, Wenhan Yang, Yueyu Hu, and Jiaying Liu. Neural data-dependent transform for learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17379–17388, 2022.
[33] Yaojun Wu, Xin Li, Zhizheng Zhang, Xin Jin, and Zhibo Chen. Learned block-based hybrid image compression. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
[34] Yueqi Xie, Ka Leong Cheng, and Qifeng Chen. Enhanced invertible encoding for learned image compression. In Proceedings of the 29th ACM International Conference on Multimedia, pages 162–170, 2021.
[35] Kai Zhang, Yawei Li, Jingyun Liang, Jiezhang Cao, Yulun Zhang, Hao Tang, Radu Timofte, and Luc Van Gool. Practical blind denoising via Swin-Conv-UNet and data synthesis. arXiv preprint arXiv:2203.13278, 2022.
[36] Yinhao Zhu, Yang Yang, and Taco Cohen. Transformer-based transform coding. In International Conference on Learning Representations, 2022.
[37] Renjie Zou, Chunfeng Song, and Zhaoxiang Zhang. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.