
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Learned Image Compression with Mixed Transformer-CNN Architectures

Jinming Liu 1,4   Heming Sun 2,3,*   Jiro Katto 1,2

1 Department of Computer Science and Communication Engineering, Waseda University, Tokyo, Japan
2 Waseda Research Institute for Science and Engineering, Waseda University, Tokyo, Japan
3 JST PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama, Japan
4 Shanghai Jiao Tong University, Shanghai, China
{jmliu@toki., hemingsun@aoni., katto@}waseda.jp

* Heming Sun is the corresponding author.

DOI: 10.1109/CVPR52729.2023.01383

Abstract

Learned image compression (LIC) methods have exhibited promising progress and superior rate-distortion performance compared with classical image compression standards. Most existing LIC methods are Convolutional Neural Network-based (CNN-based) or Transformer-based, and the two have different advantages. Exploiting both advantages is a point worth exploring, which raises two challenges: 1) how to effectively fuse the two methods, and 2) how to achieve higher performance with a suitable complexity. In this paper, we propose an efficient parallel Transformer-CNN Mixture (TCM) block with controllable complexity to incorporate the local modeling ability of CNN and the non-local modeling ability of transformers, improving the overall architecture of image compression models. Besides, inspired by recent progress in entropy estimation models and attention modules, we propose a channel-wise entropy model with parameter-efficient swin-transformer-based attention (SWAtten) modules by using channel squeezing. Experimental results demonstrate that our proposed method achieves state-of-the-art rate-distortion performance on three datasets of different resolutions (i.e., Kodak, Tecnick, and CLIC Professional Validation) compared to existing LIC methods. The code is at https://github.com/jmliu206/LIC_TCM.

Figure 1. Visualization of decompressed images of kodim23 from the Kodak dataset produced by different methods (each subfigure is titled "Method | Bit rate | PSNR | MS-SSIM"): Ground Truth; Ours [MSE] 0.127bpp | 35.78dB | 16.25dB; VVC (VTM-12.1) 0.131bpp | 34.69dB | 15.04dB; WebP 0.180bpp | 30.77dB | 12.47dB; Ours [MS-SSIM] 0.114bpp | 30.81dB | 17.07dB.

1. Introduction

Image compression is a crucial topic in the field of image processing. With rapidly increasing image data, lossy image compression plays an important role in efficient storage and transmission. In the past decades, many classical standards, including JPEG [31], WebP [11], and VVC [29], which consist of three steps (transform, quantization, and entropy coding), have achieved impressive rate-distortion (RD) performance. On the other hand, different from the classical standards, end-to-end learned image compression (LIC) is optimized as a whole. Some very recent LIC works [5, 13, 32, 34, 36, 37] have outperformed VVC, currently the best classical image and video coding standard, on both Peak Signal-to-Noise Ratio (PSNR) and Multi-Scale Structural Similarity (MS-SSIM). This suggests that LIC has great potential for next-generation image compression techniques.

Most LIC methods are CNN-based [6, 10, 20, 21, 33], using the variational auto-encoder (VAE) proposed by Ballé et al. [3]. With the recent development of vision transformers [9, 22], some vision-transformer-based LIC methods [23, 36, 37] have also been investigated. As a CNN-based example, Cheng et al. [6] proposed a residual-block-based image compression model; as a transformer-based example, Zou et al. [37] tried a swin-transformer-based image compression model. These two kinds of methods have different advantages: CNN has the ability of local modeling, while transformers can model non-local information. It is still worth exploring whether the advantages of these two methods can be effectively combined with a suitable complexity. In our method, we try to efficiently incorporate both advantages of CNN and transformers by proposing an efficient parallel Transformer-CNN Mixture (TCM) block under a controllable complexity to improve the RD performance of LIC.
In addition to the type of the neural network, the design of the entropy model is also an important technique in LIC. The most common way is to introduce extra latent variables as a hyper-prior to convert the probability model of compact coding-symbols into a joint model [3]. On that basis, many methods have sprung up. Minnen et al. [26] utilized a masked convolutional layer to capture context information. Furthermore, they [27] proposed a parallel channel-wise auto-regressive entropy model by splitting the latent into 10 slices; the results of encoded slices can assist the encoding of the remaining slices in a pipelined manner.

Recently, many different attention modules [6, 21, 37] have been designed to improve image compression. Attention modules can help the learned model pay more attention to complex regions. However, many of them are time-consuming, or can only capture local information [37]. At the same time, these attention modules are usually placed in both the main and the hyper-prior paths of the image compression network, which further introduces large complexity because of the large input size of the main path. To overcome that problem, we move attention modules to the channel-wise entropy model, which has 1/16 the input size of the main path, to reduce complexity. Nevertheless, if the above attention modules are directly added to the entropy model, a large number of parameters will be introduced. Therefore, we propose a parameter-efficient swin-transformer-based attention module (SWAtten) with channel squeezing for the channel-wise entropy model. At the same time, to avoid the latency caused by too many slices, we reduce the number of slices from 10 to 5 to balance running speed and RD-performance. As Fig. 1 shows, our method obtains pleasant results compared with other methods.

The contributions of this paper can be summarized as follows:

• We propose a LIC framework with parallel transformer-CNN mixture (TCM) blocks that efficiently incorporate the local modeling ability of CNN and the non-local modeling ability of transformers, while maintaining controllable complexity.

• We design a channel-wise auto-regressive entropy model by proposing a parameter-efficient swin-transformer-based attention (SWAtten) module with channel squeezing.

• Extensive experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance on three datasets (i.e., Kodak, Tecnick, and CLIC) with different resolutions. The method outperforms VVC (VTM-12.1) by 12.30%, 13.71%, and 11.85% in Bjøntegaard-delta-rate (BD-rate) [4] on the Kodak, Tecnick, and CLIC datasets, respectively.

2. Related Work

2.1. Learned End-to-end Image Compression

2.1.1 CNN-based Models

In the past decade, learned image compression has made significant progress and demonstrated impressive performance. Ballé et al. [2] first proposed an end-to-end learned CNN-based image compression model; they then proposed a VAE architecture and introduced a hyper-prior to improve image compression in [3]. Furthermore, a local context model was utilized to improve the entropy model of image compression in [26]. In addition, a causal context using global context information was proposed in [12]. Since the context model is time-consuming, He et al. [14] designed a checkerboard context model to achieve parallel computing, while Minnen et al. [27] used channel-wise context to accelerate the computing. Apart from improving the entropy model, some works adopt different types of convolutional neural networks to enhance image compression: Cheng et al. [6] developed a residual network, Chen et al. [5] introduced octave residual networks into image compression models, and Xie et al. [34] used invertible neural networks (INNs) to improve performance.

2.1.2 Transformer-based Models

With the rapid development of vision transformers, transformers show impressive performance not only on high-level vision tasks such as image classification [22, 30], but also on some low-level vision tasks such as image restoration [19] and image denoising [35]. Motivated by those works, some transformer-based LIC models have been proposed recently. Some works [36, 37] tried to construct a swin-transformer-based LIC model. Qian et al. [28] used a ViT [9] to help the entropy model capture global context information. Koyuncu et al. [18] utilized a sliding window to reduce the complexity of ViT in entropy models. Kim et al. [15] proposed an Information Transformer to get both global and local dependencies.

2.2. Attention Modules

Attention modules try to help the learned models focus on important regions to obtain more details. Many attention modules designed for image compression significantly improve the RD-performance. Liu et al. [21] first introduced a non-local attention module into image compression; because of the non-local block, this attention module is time-consuming. Therefore, Cheng et al. [6] removed the non-local block and proposed a local attention module to accelerate the computing. Furthermore, Zou et al. [37] adopted a window-based attention module to improve image compression.

[Figure 2 shows two panels: "The designed architecture of the image compression model" and "Transformer-CNN Mixture block (TCM block)". The main path stacks RBS↓2 and RBU↑2 layers interleaved with TCM blocks; each TCM block splits features with a 1×1 conv into a residual-CNN branch and a SwinT-block branch (W-MSA in Stage I, SW-MSA in Stage II), then concatenates and fuses them with another 1×1 conv.]

Figure 2. The overall framework of our method (left) and the proposed TCM block (right); enc and dec contain the processes of quantization and range coding. ↑2 or ↓2 means that the feature size is enlarged/reduced by a factor of two compared to the previous layer.

3. Proposed Method

3.1. Problem Formulation

As Fig. 2 shows, LIC models with a channel-wise entropy model [27] can be formulated as:

$$y = g_a(x; \phi), \qquad \hat{y} = Q(y - \mu) + \mu, \qquad \hat{x} = g_s(\hat{y}; \theta) \tag{1}$$

where x and x̂ represent the raw image and the decompressed image. By inputting x to the encoder g_a with learned parameters φ, we get the latent representation y, which is estimated to have mean μ. To encode it, y is quantized to ŷ by the quantization operator Q. Following previous works [13, 26] and discussion¹, we round and encode each y − μ to the bitstream instead of y, and restore the coding-symbol ŷ as ⌈y − μ⌋ + μ, which can further benefit entropy models. Then we use a range coder to losslessly encode ⌈y − μ⌋, modeled as a single Gaussian distribution with variance σ, to the bitstream, and transmit it to the decoder g_s. In this overall pipeline, Φ = (μ, σ) is derived by a channel-wise entropy model, as Equation 2, Fig. 2, and Fig. 4 show. The entropy model of [27] divides y into s even slices {y⁰, y¹, ..., y^{s−1}} so that encoded slices can help improve the encoding of subsequent slices. We formulate it as:

$$z = h_a(y; \phi_h), \qquad \hat{z} = Q(z)$$
$$F_{mean}, F_{scale} = h_s(\hat{z}; \theta_h) \tag{2}$$
$$r^i, \Phi^i = e_i(F_{mean}, F_{scale}, \bar{y}^{<i}, \hat{y}^i), \qquad 0 \le i < s$$
$$\bar{y}^i = r^i + \hat{y}^i$$

where h_a denotes the hyper-prior encoder with parameters φ_h; it is used to get the side information z that captures spatial dependencies among the elements of y. A factorized density model ψ is used to encode the quantized ẑ as

$$p_{\hat{z}|\psi}(\hat{z} \mid \psi) = \prod_j \left( p_{z_j|\psi}(\psi) * \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\hat{z}_j)$$

where j specifies the position of each element. ẑ is then fed to the hyper-prior decoder h_s with parameters θ_h to obtain two latent features, F_mean and F_scale, which are input to each following slice network e_i. After that, each slice y^i is sequentially processed to get ȳ^i. During this process, the already-encoded slices ȳ^{<i} = {ȳ⁰, ȳ¹, ..., ȳ^{i−1}} and the current slice ŷ^i are input to the slice network e_i to get the estimated distribution parameters Φ^i = (μ^i, σ^i), which help generate the bitstream. Therefore, we can assume p_{ŷ|ẑ}(ŷ | ẑ) ∼ N(μ, σ²). At the same time, the residual r^i is used to reduce the quantization error (y − ŷ) introduced by quantization; therefore, ȳ, which carries less error, is fed to the decoder g_s with learned parameters θ instead of ŷ in Equation 1. At last, we get the decompressed image x̂. Fig. 4 illustrates the detailed process of this channel-wise entropy model.

In order to train the overall learned image compression model, we consider the problem as a Lagrangian-multiplier-based rate-distortion optimization. The loss is defined as:

$$\mathcal{L} = R(\hat{y}) + R(\hat{z}) + \lambda \cdot D(x, \hat{x}) = \mathbb{E}\!\left[-\log_2 p_{\hat{y}|\hat{z}}(\hat{y} \mid \hat{z})\right] + \mathbb{E}\!\left[-\log_2 p_{\hat{z}|\psi}(\hat{z} \mid \psi)\right] + \lambda \cdot D(x, \hat{x}) \tag{3}$$

where λ controls the rate-distortion tradeoff; different λ values correspond to different bit rates. D(x, x̂) denotes the distortion term, calculated by mean squared error (MSE) loss. R(ŷ) and R(ẑ) denote the bit rates of latents ŷ and ẑ.

¹https://groups.google.com/g/tensorflow-compression/c/LQtTAo6l26U/m/mxP-VWPdAgAJ
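To make the slice-by-slice flow of Equation 2 concrete, the following is a minimal PyTorch-style sketch of the channel-wise conditioning loop. It is not the released implementation (see the GitHub link above): the `hyper_enc`/`hyper_dec` callables and the `params`/`lrp` sub-modules of each slice network are hypothetical interfaces, and the hard rounding is shown without the straight-through trick needed for training.

```python
import torch

def entropy_model_forward(y, hyper_enc, hyper_dec, slice_nets, s=5):
    """Sketch of the slice loop in Eq. 2 (hypothetical module interfaces)."""
    z = hyper_enc(y)                      # side information z = h_a(y)
    z_hat = torch.round(z)                # Q(z)
    f_mean, f_scale = hyper_dec(z_hat)    # F_mean, F_scale = h_s(z_hat)

    decoded = []                          # already processed slices y_bar^{<i}
    total_bits = 0.0
    for i, y_i in enumerate(y.chunk(s, dim=1)):   # s even channel slices
        # Phi^i = (mu^i, sigma^i) from hyper features and previous slices
        mu_i, sigma_i = slice_nets[i].params(f_mean, f_scale, decoded)
        y_hat_i = torch.round(y_i - mu_i) + mu_i  # round y - mu, restore + mu
        r_i = slice_nets[i].lrp(decoded, y_hat_i) # latent residual prediction
        decoded.append(y_hat_i + r_i)             # y_bar^i = r^i + y_hat^i
        # rate proxy: -log2 p(y_hat | z_hat) under N(mu, sigma^2)
        g = torch.distributions.Normal(mu_i, sigma_i)
        p = g.cdf(y_hat_i + 0.5) - g.cdf(y_hat_i - 0.5)
        total_bits = total_bits + (-torch.log2(p.clamp_min(1e-9))).sum()
    return torch.cat(decoded, dim=1), total_bits  # y_bar is fed to g_s
```

Training then plugs `total_bits` (plus the corresponding rate term for ẑ) and the distortion into the Lagrangian loss of Equation 3.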
3.2. Transformer-CNN Mixture Blocks

Image compression models based on CNN [6] have achieved excellent RD-performance. Besides, with the rapid development of vision transformers, some methods based on vision transformers [36] have also been proposed and outperform CNN-based methods because transformers can capture non-local information. However, according to previous works [14, 37], even though non-local information can improve image compression, local information still has a significant impact on performance: CNN can pay more attention to local patterns, while transformers have the ability to model non-local information. Therefore, we incorporate residual networks and Swin-Transformer (SwinT) blocks [22] to utilize the advantages of both kinds of models. There are two challenges in this combination: the first is how to fuse the two different features effectively, and the other is how to reduce the required complexity.

Here, an efficient parallel Transformer-CNN Mixture (TCM) block is proposed, as Fig. 2 shows. We assume the input tensor is F with size C × H_F × W_F. It is first input into a 1 × 1 convolutional layer whose number of output channels is also C. Then we evenly split the tensor into two tensors, F_cnn and F_trans, each with size C/2 × H_F × W_F. This operation has two benefits. Firstly, it reduces the number of feature channels fed to the subsequent CNN and transformers, decreasing the model complexity. Secondly, local and non-local features can be processed independently and in parallel, which facilitates better feature extraction. After that, the tensor F_cnn is sent to the residual network (Res) to get F′_cnn, while F_trans is sent to the SwinT blocks to get F′_trans. Then we concatenate F′_trans and F′_cnn to get a tensor with size C × H_F × W_F, which is input to another 1 × 1 convolutional layer to fuse local and non-local features. At last, a skip connection between F and the output is built to get F_out. To combine the residual network with the Swin-Transformer more effectively, we divide the TCM block into two stages with similar processes: in the transformer block of Stage I we use window-based multi-head self-attention (W-MSA), while in Stage II we use shifted window-based multi-head self-attention (SW-MSA). The benefit is that the residual networks are inserted between the common two consecutive Swin-transformer blocks, which is more effective for feature fusion. Both stages can be formulated as follows:

$$F_{cnn}, F_{trans} = \mathrm{Split}(\mathrm{Conv}1{\times}1(F))$$
$$F'_{cnn}, F'_{trans} = \mathrm{Res}(F_{cnn}), \; \mathrm{SwinT}(F_{trans}) \tag{4}$$
$$F_{out} = F + \mathrm{Conv}1{\times}1(\mathrm{Cat}(F'_{cnn}, F'_{trans}))$$
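For reference, Equation 4 translates almost line-for-line into a module like the following sketch; `res_block` and `swin_block` stand in for the paper's residual network and SwinT block and are assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class TCMStage(nn.Module):
    """One stage of the TCM block (Eq. 4): split -> Res/SwinT -> fuse."""

    def __init__(self, dim, res_block, swin_block):
        super().__init__()
        self.conv_in = nn.Conv2d(dim, dim, kernel_size=1)
        self.conv_out = nn.Conv2d(dim, dim, kernel_size=1)
        self.res = res_block    # local branch, operates on C/2 channels
        self.swin = swin_block  # non-local branch (W-MSA or SW-MSA), C/2

    def forward(self, f):
        f_cnn, f_trans = self.conv_in(f).chunk(2, dim=1)  # split C -> 2 x C/2
        f_cnn = self.res(f_cnn)           # local features F'_cnn
        f_trans = self.swin(f_trans)      # non-local features F'_trans
        out = self.conv_out(torch.cat([f_cnn, f_trans], dim=1))
        return f + out                    # skip connection: F_out = F + ...
```

Stage I would receive a W-MSA-based SwinT block here and Stage II an SW-MSA-based one, so chaining two such stages places the residual branches between the two consecutive Swin-transformer blocks, as described above.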

Based on the proposed TCM block, the main path (g_a and g_s) is designed as Fig. 2 shows. The Residual Block with Stride (RBS), Residual Block Upsampling (RBU), and subpixel conv3×3 are proposed by [6]; the detailed architectures of these three modules are reported in the Supplementary. In our framework, except for the last layer of g_a and g_s, we attach a TCM block after each RBS/RBU to obtain non-local and local information. We also add TCM blocks into the hyper-prior path by re-designing the hyper-prior encoder h_a and hyper-prior decoder h_s, as Fig. 4 shows.

[Figure 3 shows three columns of ERF maps (ours, transformer-based, CNN-based) at two thresholds, t = 0.01 and t = 0.0001, around an analysis point.]

Figure 3. The effective receptive fields (ERF) calculated by different methods and clipped by different thresholds t. The two rows correspond to different thresholds, while the three columns correspond to different methods: our proposal, the transformer-based [37], and the CNN-based [27] method.

To explore how the TCM block aggregates local and non-local information, we present the effective receptive fields (ERF) [25] of our model, together with those of the transformer-based model [37] and the CNN-based model [27] for comparison. The ERF is defined as the absolute gradient of a pixel in the output with respect to the input (i.e., |dx̂/dx_p|). Here, we calculate gradients at the analysis point p = (70, 700) of kodim04 in the Kodak dataset. To estimate the importance of information at different distances, we clip the gradients with two thresholds t (i.e., 0.01 and 0.0001) before visualizing them. Note that the clip operation reduces gradient values larger than the threshold to the threshold, while gradient values smaller than the threshold remain unchanged; this gives the visualization higher precision. The visualization is shown in Fig. 3. As we can see, when t = 0.01, the red regions (high gradient values) of our TCM-based model and the CNN-based model are smaller than those of the transformer-based model. This suggests that our model pays more attention to neighboring regions and has a local modeling ability similar to CNN. Meanwhile, when t = 0.0001, the ERF shows that our model and the transformer-based model can capture information at a long distance (as shown by the two small boxes in yellow and dark blue in Fig. 3, there is still a gradient return at a distance far from point p). This means that our model also has the long-distance modeling ability of transformers. It is also worth noting that at t = 0.0001, the CNN-based model exhibits a circular ERF, while the ERF of our model has a shape closer to the image content (a long strip shape like the background scarf). This shows that our model has better modeling ability compared to the CNN-based model.
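The ERF maps of Fig. 3 can in principle be reproduced with a few lines of autograd. This is an illustrative sketch under the description above; it assumes `model` maps an image batch to its reconstruction, and the channel aggregation is our choice, not specified by the paper.

```python
import torch

def effective_receptive_field(model, x, p=(70, 700), t=0.01):
    """|d x_hat / d x| at output pixel p, clipped at threshold t (Fig. 3)."""
    x = x.clone().requires_grad_(True)        # (1, 3, H, W) input image
    x_hat = model(x)                          # decompressed image
    # scalar at the analysis point: sum over channels of output pixel p
    x_hat[0, :, p[0], p[1]].sum().backward()
    erf = x.grad.abs().sum(dim=1)[0]          # aggregate over input channels
    return erf.clamp(max=t)                   # clip values above threshold
```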
[Figure 4 shows the slice network: for each slice, F_scale and F_mean are concatenated with the decoded slices, passed through SWAtten modules and parameter nets (conv3×3 + GELU stacks) to produce the scale and mean, followed by a latent residual prediction (LRP) net; the hyper-prior path uses RBS↓2/RBU↑2, subpixel convs, and TCM blocks.]

Figure 4. The proposed channel-wise entropy model. The encoded slices ȳ^{<i} can assist the encoding of the subsequent slice y^i.

[Figure 5 shows five panels per image: Raw Image, y (w/o SWAtten), y (w/ SWAtten), scaled deviation map (w/o SWAtten), scaled deviation map (w/ SWAtten).]

Figure 5. The scaled deviation map and the channel with the largest entropy of the latent representation y, from the models without/with SWAtten. For kodim01 (top), the [PSNR@bitrate@ε_s] of the models without and with SWAtten are [[email protected]@0.451] and [[email protected]@0.389]. For kodim05 (bottom), they are [[email protected]@0.422] and [[email protected]@0.365].

3.3. Proposed Entropy Model

Motivated by [13, 27], we propose a channel-wise auto-regressive entropy model with a parameter-efficient swin-transformer-based attention module (SWAtten) using channel squeezing. The framework is shown in Fig. 4.

3.3.1 SWAtten Module

Past works on attention have demonstrated its effectiveness for image compression. However, many attention modules are time-consuming, or only have the ability to capture local information. Different from these modules, which are placed on both the main path (g_a and g_s) and the hyper-prior path (h_a and h_s) of image compression, we design an attention module for the entropy model, which has 1/16 the input size of the main path and can thus reduce much complexity. The designed parameter-efficient swin-transformer-based attention (SWAtten) module is shown in Fig. 6. A swin-transformer block, which can capture non-local information, is added into the architecture, while the other residual blocks (RB) extract local information. Since the number of channels of the features input to e_i accumulates as the slice index i increases, the input channels of e_i can be expressed as:

$$M + i \times (M /\!/ s) \tag{5}$$

where M is the number of channels of the latent variable y. The input channel number of e_9 can reach 608 when s is set to 10 and M is set to 320, which makes the model parameter-consuming. To achieve a balance between complexity and RD-performance, the total number of slices s is reduced from 10, the common setting in [27], to 5. At the same time, a channel squeeze operation is used to squeeze the input channels. In this paper, we squeeze the input channels of all slices to 128, i.e., we let the output channels of the first 1×1 convolutional layer be 128. At last, an unsqueeze operation restores the output channels to the original number, i.e., the output channels of the last 1×1 convolutional layer are M + i × (M/s).
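Combining Fig. 6 with the channel arithmetic of Equation 5, a sketch of SWAtten might look as follows. The branch composition (three residual blocks per branch, a SwinT block plus a sigmoid-gated attention map, channel squeeze/unsqueeze) mirrors the figure, while the `res_block`/`swin_block` factories and their internals are assumptions.

```python
import torch
import torch.nn as nn

class SWAtten(nn.Module):
    """Sketch of SWAtten (Fig. 6): squeeze -> trunk/attention -> unsqueeze."""

    def __init__(self, in_ch, res_block, swin_block, squeezed_ch=128):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeezed_ch, 1)    # channel squeeze
        self.trunk = nn.Sequential(*[res_block(squeezed_ch) for _ in range(3)])
        self.attn = nn.Sequential(swin_block(squeezed_ch), # non-local branch
                                  *[res_block(squeezed_ch) for _ in range(3)],
                                  nn.Conv2d(squeezed_ch, squeezed_ch, 1))
        self.unsqueeze = nn.Conv2d(squeezed_ch, in_ch, 1)  # channel unsqueeze

    def forward(self, x):
        f = self.squeeze(x)
        mask = torch.sigmoid(self.attn(f))   # attention map in [0, 1]
        out = f + self.trunk(f) * mask       # gated trunk + skip connection
        return self.unsqueeze(out)
```

With M = 320 and s = 5, slice network e_i would instantiate this module with in_ch = 320 + i × 64 per Equation 5; the squeeze to 128 channels is what keeps the parameter count flat as i grows (without it, e_9 at s = 10 would see 608 input channels).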
[Figure 6 shows the SWAtten architecture: a channel-squeeze 1×1 conv, a trunk branch of three residual blocks (RB), and an attention branch of a SwinT block, three RBs, a 1×1 conv, and a Sigmoid whose output gates the trunk before a skip connection and a channel-unsqueeze 1×1 conv.]

Figure 6. The proposed SWAtten module.

It should be noted that although both SWAtten and TCM are composed of transformers and CNNs, there are some differences between them. Firstly, according to [24], the hyper-prior path of the image compression network contains numerous redundant parameters, making it possible to use channel squeezing to greatly reduce the parameters without compromising performance; the main path, however, is sensitive to its parameter count, which means we cannot use such operations there. Secondly, the CNN in SWAtten is used not only for extracting local features but also for extracting attention maps, which differs from TCM. Lastly, the receptive field of the entropy model in SWAtten is large enough, eliminating the need for a two-stage fusion framework like TCM.

According to [34], the deviation between ŷ and y, with size M × H_y × W_y, can be used to analyze the information loss in the process of compression. We formulate the mean absolute pixel deviation ε as:

$$\varepsilon = \sum_{h=1}^{H_y} \sum_{w=1}^{W_y} \sum_{m=1}^{M} |\hat{y}_{h,w,m} - y_{h,w,m}| = \sum_{h=1}^{H_y} \sum_{w=1}^{W_y} \sum_{m=1}^{M} |Q(y_{h,w,m} - \mu_{h,w,m}) - (y_{h,w,m} - \mu_{h,w,m})| \tag{6}$$

To compare the deviation among different models, it is unfair to directly compare the values ε, because ŷ and y in different models have different ranges. Since the deviation is relative to y, a scaling factor $\gamma = \sum_{h=1}^{H_y} \sum_{w=1}^{W_y} \sum_{m=1}^{M} |y_{h,w,m}|$ is introduced to define a scaled mean absolute pixel deviation ε_s that better evaluates the information loss:

$$\varepsilon_s = \varepsilon / \gamma \tag{7}$$
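Equations 6 and 7 reduce to a few tensor operations; a minimal sketch, assuming y and μ as defined in Section 3.1:

```python
import torch

def scaled_deviation(y, mu):
    """eps_s from Eqs. 6-7: quantization deviation of y, scaled by |y|."""
    diff = y - mu
    # y_hat - y = Q(y - mu) + mu - y = round(y - mu) - (y - mu)
    eps = (torch.round(diff) - diff).abs().sum()  # Eq. 6
    gamma = y.abs().sum()                         # scaling factor gamma
    return (eps / gamma).item()                   # Eq. 7: eps_s = eps / gamma
```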
Fig. 5 shows the scaled deviation map and the channel with the largest entropy of y for kodim01 and kodim05 in the Kodak dataset, using the models with and without SWAtten. Note that the deviation map has shape (H_y, W_y), and each pixel is the mean of the absolute deviation along the channel dimension after scaling with γ. The ε_s for kodim01 and kodim05 is reduced from 0.451 and 0.422 with the model without SWAtten to 0.389 and 0.365 with the model with SWAtten. This suggests that the model with SWAtten has less information loss and produces a higher-quality decompressed image.

4. Experiments

4.1. Experimental Setup

4.1.1 Training Details

For training, we randomly choose 300k images of size larger than 256 × 256 from ImageNet [8] and randomly crop them to 256 × 256 during the training process. We adopt Adam [16] with a batch size of 8 to optimize the network. The initial learning rate is set to 1 × 10⁻⁴; after 1.8M steps, it is reduced to 1 × 10⁻⁵ for the last 0.2M steps.

The model is optimized by the RD formula in Equation 3. Two quality metrics, i.e., mean squared error (MSE) and MS-SSIM, are used to represent the distortion D. When the model is optimized for MSE, λ belongs to {0.0025, 0.0035, 0.0067, 0.0130, 0.0250, 0.0500}; when optimized for MS-SSIM, λ belongs to {3, 5, 8, 16, 36, 64}.

For swin-transformer blocks, the window sizes are set to 8 in the main path (g_a and g_s) and 4 in the hyper-prior path (h_a and h_s). The channel number M of the latent y is set to 320, while that of z is set to 192. Other hyper-parameters in the entropy model follow the settings in [27]. We use an RTX 3090 GPU and an Intel i9-10900K CPU for the following experiments.
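For reference, the schedule above corresponds roughly to the following training skeleton. It is a sketch under stated assumptions: the model is assumed to return the reconstruction and the two rate terms, one model is trained per λ, and the scaling convention between the MSE term and λ follows each codebase and is omitted here.

```python
import torch

# lambda values for the MSE-optimized models (Sec. 4.1.1); one model per value
LAMBDAS_MSE = [0.0025, 0.0035, 0.0067, 0.0130, 0.0250, 0.0500]

def train(model, loader, lam, total_steps=2_000_000, drop_at=1_800_000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for step, x in enumerate(loader):          # 256x256 crops, batch size 8
        if step == drop_at:                    # 1e-4 -> 1e-5 for last 0.2M steps
            for group in opt.param_groups:
                group["lr"] = 1e-5
        x_hat, bits_y, bits_z = model(x)       # assumed model interface
        num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
        rate = (bits_y + bits_z) / num_pixels  # R(y_hat) + R(z_hat) in bpp
        dist = torch.mean((x - x_hat) ** 2)    # D = MSE (Eq. 3)
        loss = rate + lam * dist
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step + 1 >= total_steps:
            break
```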
4.1.2 Evaluation

We test our method on three datasets: the Kodak image set [17] with image size 768 × 512, the old Tecnick test set² [1] with image size 1200 × 1200, and the CLIC Professional Validation dataset³ [7] with 2K resolution. Both PSNR and MS-SSIM are used to measure distortion, while bits per pixel (bpp) is used to evaluate bitrates.

4.1.3 Definition of Various Models

In the experiments, to explore the performance of our model under different complexities, we test three models (small, medium, and large) by setting the channel number in the middle layers to C = 128, 192, or 256; the location of C is shown in Fig. 2. The number of slices s of the entropy model for all models with attention modules is reduced from 10, the common setting in [27], to 5. More details are reported in the Supplementary.

²https://sourceforge.net/projects/testimages/files/OLD/
³http://clic.compression.cc/2021/tasks/index.html
Figure 7. Performance evaluation on the Kodak dataset.

Figure 8. Performance evaluation on the CLIC Professional Validation dataset.

Figure 9. Performance evaluation on the Tecnick dataset.

4.2. Rate-Distortion Performance

We compare our large model with state-of-the-art (SOTA) learned end-to-end image compression algorithms, including [3], [6], [34], [5], [15], [13], and [37]. The classical image compression codec VVC [29] is also tested, using VTM 12.1. The rate-distortion performance on the Kodak dataset is shown in Fig. 7. Both PSNR and MS-SSIM are tested on Kodak to demonstrate the robustness of our method; here, we convert MS-SSIM to −10·log₁₀(1 − MS-SSIM) for clearer comparison. As we can see, at the same bitrate we improve up to about 0.4 dB in PSNR and 0.5 dB in MS-SSIM compared with SOTA methods. The results on the CLIC and Tecnick datasets are shown in Fig. 8 and Fig. 9, respectively, where we achieve similarly good results. These results suggest that our method is robust and achieves SOTA performance on all three datasets with different resolutions. For quantitative results, we present the BD-rate [4] computed from PSNR-bpp curves as the quantitative metric, with the anchor RD-performance set to the results of VVC on the different datasets (BD-rate = 0%). Our method outperforms VVC (VTM-12.1) by 12.30%, 13.71%, and 11.85% in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively. Table 1 shows partial results on Kodak; more comparisons are reported in the Supplementary.
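The BD-rate [4] used as the quantitative metric in Sec. 4.2 can be computed from matched (bpp, PSNR) points of two codecs. This is a sketch of the standard Bjøntegaard procedure (cubic fit of log-rate against PSNR, integrated over the overlapping quality range), not code from the paper:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average rate difference (%) of test vs. anchor at equal PSNR [4]."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)   # cubic fit: log-rate = f(PSNR)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)   # mean log-rate gap
    return (np.exp(avg_diff) - 1) * 100      # negative = bitrate savings
```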
4.3. Ablation Studies

4.3.1 Comparison with Transformer-only/CNN-only based Models

To show the effectiveness of our proposed Transformer-CNN Mixture (TCM) blocks, we compare our medium model without SWAtten modules to the Transformer-only based model and the CNN-only based model in [36]. The results are shown in Fig. 10a. "Conv ChARM" and "SwinT ChARM" are a CNN-only based model and a Transformer-only based model, respectively; they have architectures similar to our method without SWAtten modules. The difference is that "SwinT ChARM" uses Swin-Transformer blocks, "Conv ChARM" uses convolutional neural networks, and we use the proposed TCM block. By using the advantages of both transformer and CNN, our method surpasses the Transformer-only based model and the CNN-only based model.

4.3.2 SWAtten Module

In Fig. 10b, we compare the cases with and without SWAtten modules. It can be observed that SWAtten modules bring a significant gain in RD-performance. Meanwhile, by using the channel squeeze operation in SWAtten modules, we get a performance comparable to the situation without that operation while saving many parameters; Section 4.5 gives more information on the parameter-efficiency gain of this operation.

Figure 10. Experiments on the Kodak dataset. (a) Comparison with Transformer-only/CNN-only based models. (b) Ablation studies on the SWAtten module ("w/o squeeze" means the channel squeeze/unsqueeze operation is not added to SWAtten). (c) RD-performance of models using different attention modules.

4.4. Various Attention Modules

In Fig. 10c, we compare our proposed SWAtten with previous attention modules, including the non-local attention (NonLocalAtten) module [21], the local attention (LocalAtten) module [6], and the window-based attention (WAtten) module [37]. Compared with these attention modules, SWAtten gets the best RD-performance because it can capture non-local information while also paying enough attention to local information.

4.5. Complexity and Qualitative Results

We test the complexity and qualitative results of different methods on Kodak. Two other SOTA works [34, 36] are also tested, as Table 1 shows. The results suggest that both the efficiency and the RD-performance of our method outperform these two methods. Meanwhile, after using channel squeeze in SWAtten, we save a lot of parameters and FLOPs while keeping a comparable BD-rate. It should also be noted that all of our small, medium, and large models achieve SOTA RD-performance; moreover, the performance further improves as the complexity increases, which shows that our model has a lot of potential.

Table 1. The RD-performance and complexity of learned image compression models on Kodak using a GPU (RTX 3090). A lower BD-rate indicates higher RD-performance.

Methods                   | Encoding Time (ms) | Decoding Time (ms) | Parameters (M) | FLOPs (G) | BD-rate (%)
Xie et al. [34]           | 2346               | 5212               | 47.55          | 408.21    | -1.65
SwinT ChARM [36]          | 132                | 84                 | 60.55          | 230.37    | -4.02
Ours (Large, w/o squeeze) | 151                | 141                | 160.45         | 830.90    | -12.54
Ours (Large)              | 150                | 140                | 75.89          | 700.96    | -12.30
Ours (Medium)             | 130                | 122                | 58.72          | 415.20    | -9.65
Ours (Small)              | 109                | 102                | 44.96          | 211.54    | -7.39

4.6. Visualization

Fig. 1 shows a visualization example of decompressed images (kodim23) from the Kodak dataset produced by our methods and the classical compression standards WebP and VVC (VTM 12.1) [29]. In some complex texture parts, our methods keep more details (e.g., a clearer feather outline).

5. Conclusion

In this paper, we incorporate transformers and CNN to propose an efficient parallel transformer-CNN mixture (TCM) block that utilizes the local modeling ability of CNN and the non-local modeling ability of transformers. A new image compression architecture is then designed based on the TCM block. Besides, we present a swin-transformer-based attention module to improve channel-wise entropy models. Experimental results show that the image compression model with TCM blocks outperforms CNN-only/Transformer-only based models under a suitable complexity. Furthermore, the performance of SWAtten surpasses previous attention modules designed for image compression. Finally, our method achieves state-of-the-art results on three datasets of different resolutions (i.e., Kodak, Tecnick, and CLIC Professional Validation) and is superior to existing image compression methods.

6. Acknowledgment

This paper is supported by Japan Science and Technology Agency (JST), under Grant JPMJPR19M5; Japan Society for the Promotion of Science (JSPS), under Grant 21K17770; Kenjiro Takayanagi Foundation; the Foundation of Ando Laboratory; and NICT, Grant Number 03801, Japan.
References

[1] Nicola Asuni and Andrea Giachetti. TESTIMAGES: A large-scale archive for testing visual devices and basic image processing algorithms. In STAG, pages 63–70, 2014.
[2] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. In International Conference on Learning Representations, 2016.
[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018.
[4] Gisle Bjøntegaard. Calculation of average PSNR differences between RD-curves. VCEG-M33, 2001.
[5] Fangdong Chen, Yumeng Xu, and Li Wang. Two-stage octave residual network for end-to-end image compression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[6] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Learned image compression with discretized Gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7939–7948, 2020.
[7] CLIC. Workshop and challenge on learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[8] Jia Deng et al. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
[10] Haisheng Fu, Feng Liang, Jianping Lin, Bing Li, Mohammad Akbari, Jie Liang, Guohe Zhang, Dong Liu, Chengjie Tu, and Jingning Han. Learned image compression with discretized Gaussian-Laplacian-logistic mixture model and concatenated residual modules. arXiv preprint arXiv:2107.06463, 2021.
[11] Google. Web picture format. 2010.
[12] Zongyu Guo, Zhizheng Zhang, Runsen Feng, and Zhibo Chen. Causal contextual prediction for learned image compression. IEEE Transactions on Circuits and Systems for Video Technology, 32(4):2329–2341, 2021.
[13] Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang. ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[14] Dailan He, Yaoyan Zheng, Baocheng Sun, Yan Wang, and Hongwei Qin. Checkerboard context model for efficient learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14771–14780, 2021.
[15] Jun-Hyuk Kim, Byeongho Heo, and Jong-Seok Lee. Joint global and local hierarchical priors for learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5992–6001, 2022.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Eastman Kodak. Kodak lossless true color image suite (PhotoCD PCD0992). 1993.
[18] A. Burakhan Koyuncu, Han Gao, and Eckehard Steinbach. Contextformer: A transformer with spatio-channel attention for context modeling in learned image compression. In European Conference on Computer Vision, 2022.
[19] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
[20] Fangzheng Lin, Heming Sun, Jinming Liu, and Jiro Katto. Multistage spatial context models for learned image compression. arXiv preprint arXiv:2302.09263, 2023.
[21] Haojie Liu, Tong Chen, Peiyao Guo, Qiu Shen, Xun Cao, Yao Wang, and Zhan Ma. Non-local attention optimized deep image compression. arXiv preprint arXiv:1904.09757, 2019.
[22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[23] Ming Lu, Peiyao Guo, Huiqing Shi, Chuntong Cao, and Zhan Ma. Transformer-based image compression. arXiv preprint arXiv:2111.06707, 2021.
[24] Ao Luo, Heming Sun, Jinming Liu, and Jiro Katto. Memory-efficient learned image compression with pruned hyperprior module. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3061–3065, 2022.
[25] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. Advances in Neural Information Processing Systems, 29, 2016.
[26] David Minnen, Johannes Ballé, and George D. Toderici. Joint autoregressive and hierarchical priors for learned image compression. Advances in Neural Information Processing Systems, 31, 2018.
[27] David Minnen and Saurabh Singh. Channel-wise autoregressive entropy models for learned image compression. In IEEE International Conference on Image Processing (ICIP), pages 3339–3343. IEEE, 2020.
[28] Yichen Qian, Xiuyu Sun, Ming Lin, Zhiyu Tan, and Rong Jin. Entroformer: A transformer-based entropy model for learned image compression. In International Conference on Learning Representations, 2021.
[29] Joint Video Experts Team. VVC official test model VTM. 2021.
[30] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[31] Gregory K. Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992.
[32] Dezhao Wang, Wenhan Yang, Yueyu Hu, and Jiaying Liu. Neural data-dependent transform for learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17379–17388, 2022.
[33] Yaojun Wu, Xin Li, Zhizheng Zhang, Xin Jin, and Zhibo Chen. Learned block-based hybrid image compression. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
[34] Yueqi Xie, Ka Leong Cheng, and Qifeng Chen. Enhanced invertible encoding for learned image compression. In Proceedings of the 29th ACM International Conference on Multimedia, pages 162–170, 2021.
[35] Kai Zhang, Yawei Li, Jingyun Liang, Jiezhang Cao, Yulun Zhang, Hao Tang, Radu Timofte, and Luc Van Gool. Practical blind denoising via Swin-Conv-UNet and data synthesis. arXiv preprint arXiv:2203.13278, 2022.
[36] Yinhao Zhu, Yang Yang, and Taco Cohen. Transformer-based transform coding. In International Conference on Learning Representations, 2022.
[37] Renjie Zou, Chunfeng Song, and Zhaoxiang Zhang. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
