0% found this document useful (0 votes)
43 views36 pages

A Survey On Perceptually Optimized Video Coding

Uploaded by

jinxiake1998
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
43 views36 pages

A Survey On Perceptually Optimized Video Coding

Uploaded by

jinxiake1998
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

A Survey on Perceptually Optimized Video Coding

YUN ZHANG, School of Electronics and Communication Engineering, Sun Yat-Sen University, China
arXiv:2112.12284v2 [cs.MM] 16 Nov 2022

LINWEI ZHU, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
GANGYI JIANG, Faculty of Information and Engineering, Ningbo University, China
SAM KWONG, Department of Computer Science, City University of Hong Kong, China
C. -C. JAY KUO, University of Southern California, USA
To provide users with more realistic visual experiences, videos are developing in the trends of Ultra High
Definition (UHD), High Frame Rate (HFR), High Dynamic Range (HDR), Wide Color Gammut (WCG) and
high clarity. However, the data amount of videos increases exponentially, which requires high efficiency video
compression for storage and network transmission. Perceptually optimized video coding aims to maximize
compression efficiency by exploiting visual redundancies. In this paper, we present a broad and systematic 11
survey on perceptually optimized video coding. Firstly, we present problem formulation and framework of the
perceptually optimized video coding, which includes visual perception modelling, visual quality assessment and
perceptual video coding optimization. Secondly, recent advances on visual factors, computational perceptual
models and quality assessment models are presented. Thirdly, we review perceptual video coding optimizations
from four key aspects, including perceptually optimized bit allocation, rate-distortion optimization, transform
and quantization, filtering and enhancement. In each part, problem formulation, working flow, recent advances,
advantages and challenges are presented. Fourthly, perceptual coding performances of the latest coding
standards and tools are experimentally analyzed. Finally, challenging issues and future opportunities are
identified.
CCS Concepts: • Computing methodologies → Image compression; • General and reference → Sur-
veys and overviews.
Additional Key Words and Phrases: Perceptual video coding, quality assessment, human visual system, visual
attention, just noticeable difference, rate-distortion optimization
ACM Reference Format:
Yun Zhang, Linwei Zhu, Gangyi Jiang, Sam Kwong, and C. -C. Jay Kuo. 2022. A Survey on Perceptually
Optimized Video Coding. ACM Comput. Surv. 00, 0, Article 11 (August 2022), 37 pages. https://doi.org/01.1234/
1234567.1234567
This work was supported in part by the National Natural Science Foundation of China under Grants 62172400, 61901459
and 62271276, in part by the Shenzhen Science and Technology Program under Grant JCYJ20200109110410133, in part by
the Guangdong Natural Science Foundation under Grant 2022A1515011351, in part by the CAS President’s International
Fellowship Initiative (PIFI) under Grant 2022VTA0005, in part by the Hong Kong Innovation and Technology Commission
(InnoHK Project CIMDA), in part by the Hong Kong GRF-RGC General Research Fund under Grants 11209819 (CityU
9042816) and 11203820 (CityU 9042598).
Authors’ addresses: Yun Zhang (corresponding author), [email protected], School of Electronics and Communi-
cation Engineering, Sun Yat-Sen University, Shenzhen, Guangdong, China, 518107; Linwei Zhu, [email protected], Shenzhen
Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China, 518055; Gangyi Jiang,
Faculty of Information and Engineering, Ningbo University, Ningbo, Zhejiang, China, 315211, [email protected];
Sam Kwong, Department of Computer Science, City University of Hong Kong, Hong Kong, China, [email protected];
C.C. Jay Kuo, University of Southern California, Los Angeles, California, USA, [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
0360-0300/2022/8-ART11 $15.00
https://doi.org/01.1234/1234567.1234567

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
11:2 Y. Zhang et al.

1 INTRODUCTION
With the development of capturing, display and computing technology, video applications are
developing in the trends of providing more realistic and immersive visual experiences. For example,
they develop from Standard Definition (SD) to High Definition (HD) and 4K/8K Ultra-HD (UHD) for
higher spatial resolution, from low dynamic range, color gammut and frame rate to High Dynamic
Range (HDR), Wide Color Gammut (WCG), and High Frame Rate (HFR) for higher color and
temporal fidelity, from 2D to stereo, multiview and 3D for providing depth perception, and from low
clarity to high clarity. In addition, Virtual Reality (VR) and Augmented Reality (AR) applications
based on volumetric video are also boosting as they provide more immersive visual experiences
and interactive functionalities between real and virtual objects. However, data volume of these
videos increases to dozens or even hundreds of times, which has been a critical challenge for video
transmission/streaming [53], storage and computing. To tackle this problem, developing highly
effective video coding algorithms that compress videos into smaller ones is highly desired.
The worldwide researchers and organizations have made significant contributions to the de-
velopment of video coding technologies. In the past three decades, video coding standards have
been developed for four generations, where leading standards in each generation are MPEG-1/2,
H.264/Advanced Video Coding (AVC), High Efficiency Video Coding (HEVC) [89] and Versatile
Video Coding (VVC) [10], respectively. In 2003, experts from ITU-T Video Coding Experts Group
(VCEG) and ISO/IEC Moving Picture Experts Group (MPEG) established Joint Video Team (JVT)
to develop the H.264/AVC, which has been widely used in SD/HD video applications nowadays.
Since 2013, VCEG and MPEG gathered together again and founded Joint Collaborative Team on
Video Coding (JCT-VC) to standardize HEVC for UHD video, which doubled the compression ratio
as compared with H.264/AVC. Meanwhile, 3D/multiview, scalable and screen content extensions
of HEVC, called 3D/MV-HEVC[90], Scalable HEVC (SHVC) and Screen Content Coding (SCC),
respectively, were investigated for specific video applications. Beyond HEVC, Joint Video Experts
Team (JVET) was established to standardize VVC [9], which targeted to encode 8K UHD video and
beyond. To include a variety of video sources and applications, JVET also launched standardiza-
tion activities for representation and coding of immersive media, including 360◦ omnidirectional
video coding and point cloud compression [105]. Since 2003, Audio Visual coding Standard (AVS)
workgroup was established in China to develop video coding standards, called AVS-1/2/3 [123]. In
the past few decades, many advanced coding technologies have been developed to improve the
compression efficiency further and further, which becomes saturated nowadays. A little further
improvement may require extremely high cost, which is more challenging than ever.
To compress videos more effectively, perceptual video coding is one of the most promising
research directions, which tends to minimize visual redundancies in videos by exploiting properties
of Human Visual System (HVS). This is a systematic research not only includes image/video
signal processing, but also involves video representation, psychovisual and neurophysiological
researches. World-wide researchers have devoted their efforts on this perceptual video coding
field. Wu et al. [108] presented a survey on visual Just Noticeable Difference (JND) estimation
that affected by luminance adaptation, contrast masking, pattern masking, and visual sensitivity.
Lin et al. [60] reviewed handcrafted modeling and machine learning approaches for JND. Lin et
al. [61] presented an overview on Perceptual Visual Quality Metrics (PVQM), which included
basic computational models for key factors of human perception, model based and signal driven
PVQM models. Athar et al. [1] presented a survey on Image Quality Assessment (IQA) metrics
for 2D images and reported their performances over different datasets. Min et al. [70] analyzed
characteristics of screen content images from human, system, and context perspectives, and then
reviewed quality assessment methods and datasets for them. However, these overviews focus on

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
A Survey on Perceptually Optimized Video Coding 11:3

perceptual models for image/video quality assessment, but video coding optimizations have not
been considered. Zhang et al. [130] reviewed machine learning based video coding optimizations
from three perspectives, including low complexity, high compression ratio and high visual quality.
To further improve video compression efficiency, Liu et al. [62] reviewed representative Deep
Learning based Video Coding (DLVC) schemes for key coding modules, including deep intra/inter
prediction, cross channel prediction, transform, in/post-loop filtering, probability prediction and
up/down-sampling, an so on. They utilized learning algorithms, especially deep learning, to promote
video coding performances. However, visual redundancies have not been well considered. Yuan et
al. [120] reviewed recent advances on visual JND models and their applications to Rate-Distortion
Optimization (RDO) and quantization in video coding, where machine learning algorithms were used
to improve JND prediction. Lee et al. [52] reviewed perceptual video compression from perceptual
model, coding implementation and performance evaluation. Chen et al. [16] briefly reviewed visual
attention and visual sensitivity factors, which were used to guide perceptual quality allocation for
constrained video coding. However, these works mainly focused on H.264/AVC. Perceptual factors
on UHD/HDR/WCG video and coding algorithms on more recent HEVC/VVC/AVS standards have
not been addressed. Also, learning based optimizations have not been addressed.
Therefore, an in-depth survey on recent advances of perceptual video coding is highly desirable
at this time due to the following five reasons: 1) Since video representation develops in the trends of
providing more realistic visual experiences, such as HD/UHD and 3D, visual properties are different.
In-depth analyses are required for possible extensions of existing perceptual models in improving
their applicabilities. 2) More advanced perceptual factors and mechanism have been revealed with
the developments of neural and computer sciences. Consequently, the perceptual models and quality
assessment methods have been significantly advanced. In addition, the concept of perceptual quality
has been extended from clarity to a wider Quality-of-Experience (QoE). 3) Computational models in
modelling visual factors and assessing visual quality have been significantly improved, which can
be used to exploit visual redundancies in videos. 4) In the latest coding standards, such as AVS-3
and HEVC/VVC, many novel coding algorithms have been proposed. It becomes more challenging
to further improve the latest coding technologies in exploiting the visual redundancies. 5) Machine
learning algorithms, especially the deep learning, have been explored in enhancing video coding
algorithms[130][62], which bring new opportunities to boost perceptual coding optimizations with
learning techniques. By taking these opportunities, an in-depth review on perceptual video coding
will provide helpful guidelines for future researches in promoting video coding.
In this survey, we firstly analyze visual properties of HVS from neurophysiological perspectives,
and review computational perceptual models and visual quality assessment algorithms. Then, we
review on recent advances of perceptually optimized video coding algorithms, where problem
formulation, key features, performances, advantages and disadvantages are analyzed. Thirdly,
coding performance of latest coding standards are experimentally analyzed. Finally, challenges and
opportunities in perceptual coding are identified.

2 PROBLEM FORMULATION AND FRAMEWORK OF PERCEPTUAL VIDEO CODING


2.1 Problem Formulation
Fig.1 shows an end-to-end chain of a visual communication system, which consists of five key
components, light and scene P, imaging and representation (𝐹𝑅 ) [21], visual processing and com-
munication (𝐹𝐸 ), display and viewing condition (𝐹𝐷 ), and HVS (𝐹𝐻𝑉 𝑆 ). Dynamic scene (P) in
3D world is firstly captured by cameras with different view angles or positions. Then, captured
successive images are processed and organized to be an effective video representation, denoted as
I = 𝐹𝑅 (P). 𝐹𝑅 is an Opto-Electronic Transfer Function (OETF) [27] that converts captured lights

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
11:4 Y. Zhang et al.

Fig. 1. End-to-end chain of a visual communication system.

Fig. 2. Luminance dynamic range mappings among nature light, digital signal, display and HVS.

into digital signal, e.g., 8-bit depth RGB color space based on BT.709 [13]. Perceptual Quantization
(PQ) and Hybrid Log-Gamma (HLG) transfer functions [27] were developed to represent higher
dynamic range light with 10∼12-bit depth signal according to BT.2100 [11]. Then, digital videos are
encoded to bitstreams and then transmitted to client through network. At remote clients, bitstreams
are decoded and reconstructed as ^I = 𝐹𝐸 (I). Then, ^I is shown with display 𝐹𝐷 , which transfers
electronic signal back to visual light P ^ with Electro-Optical Transfer Function (EOTF). Finally,
visual light P representing image I is shown to human eyes. As shown in Fig.2, there are mismatches
^ ^
among the numbers of levels in representing natural scene (10−6 ∼108 𝑛𝑖𝑡𝑠), digital pixel (0∼28 for
SDR and 0∼210 for HDR) and display light (10−1 ∼102 𝑛𝑖𝑡𝑠 for SDR, 5 × 10−4 ∼104 𝑛𝑖𝑡𝑠 for HDR).
In HVS 𝐹𝐻𝑉 𝑆 , visible light P ^ is perceived by retina and then converted to neural visual signal
with rods and cones cells, known as photoreceptive cells. These neural impulses are transmitted to
visual cortex through optic nerve and Lateral Geniculate Nucleus (LGN), which generates primary
visual perceptual image (V), denoted as V = 𝐹𝐻𝑉 𝑆 ( P). ^ Then, V goes through two functional
pathways for further perception and recognition tasks, which are a ventral pathway to V3 and V5
for “Where/How” information and a dorsal pathway to V2 and V4 for “What” information. There is
a mismatch between the number of levels for vision range (10−5 ∼106 𝑛𝑖𝑡𝑠) and perceivable scales
(about 102 ) at one time, as shown in the bottom part of Fig.2. Due to the visibility threshold and
properties of HVS, not every distortion is perceivable, i.e., visual redundancy.
Conventionally, error based visual metric, such as Mean-Squared-Error (MSE) and Peak Signal-
to-Noise Ratio (PSNR), measures the signal distortion D as D = I − ^I, which may overestimate or
underestimate visual responses. While taking the whole end-to-end visual system into account,

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
A Survey on Perceptually Optimized Video Coding 11:5

Fig. 3. Framework of the perceptually optimized video coding.

perceptual distortion (D𝑉 ) is expressed as


 D = V − V𝑂𝑟𝑔
 𝑉


(1)

V = 𝐹𝐻𝑉 𝑆 (𝐹𝐷 (𝐹𝐸 (𝐹𝑅 (P)))) ,

 V𝑂𝑟𝑔 = 𝐹𝐻𝑉 𝑆 (𝐹𝐷 (𝐹𝑅 (P)))


where V and V𝑂𝑟𝑔 are perceived images with or without the coding distortion D in visual commu-
nication. It has been widely acknowledged that D ≠ D𝑉 due to visual redundancies [1, 61, 108] and
non-linear properties in HVS. Based on Eq.1, we shall not only consider processing and communica-
tion (𝐹𝐸 ), but also jointly consider imaging (𝐹𝑅 ), display (𝐹𝐷 ) and HVS (𝐹𝐻𝑉 𝑆 ), while measuring the
perceptual distortion D𝑉 . Due to visibility threshold in HVS, not every distortion D is perceivable.
In addition, people is more interested in semantic contents, such as human face, sign and characters.
Distortion that causes misunderstanding and false recognition will severely degrade QoE. In this
work, we mainly analyze the perceptual distortion D𝑉 by jointly considering P,𝐹𝑅 ,𝐹𝐸 ,𝐹𝐷 , and 𝐹𝐻𝑉 𝑆 .
Then, based on findings in D𝑉 and 𝐹𝐻𝑉 𝑆 , perceptual redundancies are exploited to optimize video
compression in 𝐹𝐸 , called perceptually optimized video coding.

2.2 Framework of the Perceptually Optimized Video Coding


Generally, a framework of perceptually optimized video coding can be divided into four levels, which
are principal factors, computational models, video coding optimization and video applications, as
shown in Fig.3. There are four categories of key factors in the first level, which are HVS properties
(𝐹𝐻𝑉 𝑆 ), imaging (𝐹𝑅 ) and display (𝐹𝐷 ), and visual processing and communication system (𝐹𝐸 )
including compression, transmission and rendering. The second level includes computational visual
perception models to model visual properties and responses of HVS, and computational visual
quality assessment model to evaluate visual quality. Above this level is video coding optimization
by exploiting the visual redundancies. Then, qualities of compressed videos are evaluated by the
quality assessment models. Finally, video coding algorithms are applied to various applications
to minimize bit rate while maintaining the visual quality. In this paper, we firstly review the

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
11:6 Y. Zhang et al.

Fig. 4. The neural pathway and key perceptual factors of HVS, 𝐹𝐻𝑉 𝑆 .

physiological visual factors, the computational visual perception models and the visual quality
assessment. Then, perceptually optimized video coding algorithms are reviewed.

3 PHYSIOLOGICAL VISUAL FACTORS AND COMPUTATIONAL PERCEPTUAL


MODELS OF HVS
In HVS, visual light perceived by photoreceptors will be converted to neuron-electronic visual signal
and then transmitted to LGN via nervous system. Then, these visual signals go to visual cortex for
cognitions via visual channels, as shown in the left column of Fig. 4. Based on these bio-mechanism
of HVS, we present visual properties of monocular vision into six key categories, including fovea and
eccentricity, angular resolution, spatial and temporal contrast sensitivity, spectral sensitivity and
masking effects, as shown in the middle column of Fig. 4. Then, three higher-level computational
models, including visual sensitivity, JND and visual attention, are reviewed based on the visual
properties. Finally, visual quality assessment models are reviewed.

3.1 Visual Properties and Models


3.1.1 Retinal Fovea and Visual Acuity. Field of View (FoV) of human eyes covers 200◦ in width and
135◦ in height. However, visual acuity is not evenly distributed in FoV. Since the photoreceptors
and ganglion cells distributed extremely dense at the retinal fovea, whose radius is about 1.5 𝑚𝑚
covering 1% of the retina, the fovea becomes the most sensitive visual area. As the densities of
the photoreceptors and ganglion cells decrease rapidly from the fovea to the peripheral, visual
acuity progressively decreases as the eccentricity increases [87]. In other word, visibility threshold
increases as the eccentricity increases [81]. Recently, it was found the visual acuity at isoeccentric
locations was asymmetric, which was better along horizontal meridian than along the vertical [17].
The visual acuity of the fovea can be modelled as 𝜃 =1.22(𝜆/𝐷) rad, where 𝜆 is wavelength of light
and 𝐷 is pupil diameter, which varies from 3 𝑚𝑚 at day time to 9 𝑚𝑚 at night vision. Accordingly,
the visual acuity decreases from day to night. The optimal acuity is about 0.0128◦ for 0.55 𝜇𝑚
V-band at day time according to the Rayleigh’s criterion. When we watch a 4K@3840×1920 video
on a 75 inch UHDTV with 2 𝑚 distance, the fidelity provided by each pixel is about 0.0104◦ in
horizonal and 0.0139◦ in vertical, which close to the optimal acuity of the retinal fovea.

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
A Survey on Perceptually Optimized Video Coding 11:7

3.1.2 Spectral Sensitivity. There are two kinds of photoreceptors: rods and cones. The rods are
the basis for scotopic vision sensing monocular luminance and shape stimuli, and the cones are
the basis for photopic vision sensing color stimuli. Because the photoreceptrtors are selectively
sensitive to wavelengths and have non-linear responses, there exists visual sensitivities to spectrum,
i.e., luminance and chromatic sensitivities.
• Luminance Sensitivity: Natural luminance is continuous ranging from 10−6 ∼108 𝑛𝑖𝑡𝑠. Human
eye can perceive a high dynamic range of luminance that changes from scotopic (10−5 ∼10
𝑛𝑖𝑡𝑠) to photopic (10∼106 𝑛𝑖𝑡𝑠), as shown in Fig.2. However, it is much narrower when human
eyes distinguish the relative brightness differences at one time, which is about 102 scales.
HVS requires to adjust the iris, photoreceptors and neurons to adapt a wide range luminance.
Thus, only 256 scales with 8-bit depth are used to representing luminance of each pixel in
conventional video. Meanwhile, SDR display provides 0.1 to 300 𝑛𝑖𝑡𝑠 brightness with 3000
contrast ratio. Recently, more bits (10∼12 bits) are used in representing HDR video and HDR
display provides a wider luminance range from 0.05 to 1000 𝑛𝑖𝑡𝑠 with 20000 contrast ratio.
The mismatch between vision range and the narrower discrimination scales of HVS motivates
visibility threshold and JND in 𝐹𝐻𝑉 𝑆 , which depends on image background, surround, peak
and dynamic range of luminance, as well as image contents [47].
• Chromatic Sensitivity: The visual sensitivity of human eye varies strongly for the light with
wavelengths between 380 𝑛𝑚 and 800 𝑛𝑚. Three kinds of cones (S,M,L-cone) are sensitive
to blue (437 𝑛𝑚), green (533 𝑛𝑚) and red (564 𝑛𝑚) lights, respectively [86], which motivates
the RGB primary representation. Meanwhile, the photoreceptors have a relatively higher
sensitivity in green than in the other two primary colors. Since the number of rods (1.3×108 )
is much larger than that of cones (6.5×106 ), HVS is less sensitive to color than luminance
and scotopic vision is colorless. Also, color components are usually downsampled in video
representation, such as YUV422 and YUV420.

3.1.3 Spatial Contrast Sensitivity. Spatial contrast sensitivity describes the luminance discriminabil-
ity between brightness and darkness regions in spatial domain. Given an image with background
luminance 𝐼 and a centered brightness spot with luminance 𝐼 + Δ𝐼 , there is a ratio defined as
𝐾 = Δ𝐼 /𝐼 , called Weber ratio [94]. The minimum distinguishable contrast threshold between the
spot and background is found when 𝐾 is 0.02. Also, a larger 𝐾 enables a stronger visual response.
The spatial contrast sensitivity provides a characterization of frequency response in HVS, which
is a bandpass filter in low spatial frequency and a lowpass filter in high spatial frequency [94], as
shown in Fig. 5(a). It reduces at extremely low or high spatial frequency and reaches its peak around
10 𝑐𝑝𝑑. In addition to the frequency, the spatial contrast sensitivity is affected by many other factors,
including background luminance, human age, eccentricity, spectral, patterns and types of stimuli
[4]. It was found that the contrast sensitivities for chromatic (blue-yellow, red-green) are low-pass
filters [49] and their peaks are relatively lower than those of achromatic [71]. Generally, the spatial
contrast sensitivity increases with the background luminance increases[5], and reduces as age and
distance from eccentric [81] increase. Moreover, the sensitivity falls for test pattern whose spatial
frequency is similar to adaptation patterns and orientations, and also falls for sinewave stimuli as
compared with squarewave stimuli.

3.1.4 Temporal Contrast Sensitivity. Temporal contrast sensitivity measures the discriminability of
the luminance difference over time, which relates to temporal frequency, level of contrast, patterns
and so on [45]. It was found that this sensitivity increased as input contrast stimulus increased.
In this case, the photoreceptors (rods and cones) are easier to reach their perceptual threshold
and critical duration is shorter. The temporal contrast sensitivity is a bandpass filter and reaches

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
11:8 Y. Zhang et al.

(a) (b)

Fig. 5. Spatial and temporal CSFs with luminance adaptation. (a) Spatial [94], (b) Temporal [45].

its peak around 5Hz [45], as shown in Fig. 5(b), which can be modelled as an addition of two
log-scaled Gaussian filters [79]. Also, it was found that complex cells in V1 and V2 were usually
more sensitive in spatial and temporal contrasts as compared with simple cells [92]. The temporal
contrast sensitivity is higher than the spatial contrast sensitivity [50], as shown in Fig. 5, which
indicates that the temporal distortion is more perceivable.
When the temporal frequency reaches 50Hz, the temporal contrast sensitivity is extremely low, as
shown in Fig.5(b). So, the refresh rate of a display is higher than 60Hz to avoid perceiving flickering.
Also, the temporal contrast sensitivity degrades when background luminance is extremely bright or
dark [45]. Since HDR display provides a higher dynamic range of brightness, HDR video requires a
higher frame rate, e.g., 60 to 120 fps, to reduce motion judder and smooth motion transition.
3.1.5 Visual Masking Effects. The LGN and primary visual cortex have different sensitivity re-
sponses when receiving neuro-electronic signals from retina over multiple channels, such as
frequency, luminance, color, motion and so on. Their responses are bandpass filters whose peaks
and bands vary with stimuli. Meanwhile, interferences among multiple visual stimuli will be caused
due to their co-existence. When these stimuli present simultaneously, the presence of one stimulus
may weaken or enhance the responses of other excitations, called visual masking effects. The
masking effects can be categorized as Luminance Masking (LM), spatial and temporal contrast
masking [47], binocular masking [39, 138] and pattern masking [107] depending on the input stim-
ulus. Response from the strongest stimulus weakens other responses. For example, the perceived
brightness of an object not only depends on brightness intensity of the object, but also depends on
its surrounding background. Moreover, distortion in regions with regular pattern, such as parallel
lines, is more perceivable than that in chaos textural regions, such as grasses [36, 107].

3.2 Computational Perceptual Models


Computational models were proposed to model the perceptual factors of HVS, i.e., 𝐹𝐻𝑉 𝑆 . Three
kinds of high-level models, including visual sensitivity, JND, and visual attention, are reviewed.
3.2.1 Visual Sensitivity Models. Visual sensitivity is affected by many factors including spatial
contrast [4, 5, 8, 36], temporal contrast [37, 50, 79], input stimuli pattern [107], brightness [49, 94],
color sensitivity [71, 83], type of visual cells, fovea and eccentricities [17], binocular masking [39]
and so on. Barten et al. [4] built a spatial Contrast Sensitivity Function (CSF), called Barten’s CSF,
as an isotropic bandpass shaped function of image size, luminance level and pupil size, which is [4]

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
A Survey on Perceptually Optimized Video Coding 11:9

𝑀𝑂 (𝑢)
𝑆 (𝑢) =

 √︂
Φ0
 2 1 1 𝑢2 1
( + + 2 ) ) ( 𝜂𝑝𝐸 +
(2)
 𝑘 𝑇 𝑋2 𝑋2
𝑚𝑎𝑥
2
,
𝑁𝑚𝑎𝑥 1−𝑒 −(𝑢/𝑢 0 )
𝑂
2 2 2 2 2
𝑀𝑂 (𝑢) = 𝑒 −2𝜋 (𝜎0 +𝐶𝑎𝑏 𝑑 )𝑢



where 𝑆 (𝑢) is the sensitivity, 𝑀𝑂 (𝑢) is the optical Modulation Transfer Function (MTF) of eye, 𝐸 is
retinal luminance, 𝑑 is pupil diameter, 𝑘, 𝑝, 𝑇 , 𝜂, 𝜎0 , 𝑋𝑚𝑎𝑥 , Φ0 , 𝐶𝑎𝑏 , 𝑁𝑚𝑎𝑥 and 𝑢 0 are parameters,
whose typical values and simplified form of Eq.2 can be found in [4]. Kim et al.[49] modelled spatio-
chromatic contrast sensitivities as low-pass filters, whose peaks increased with the background
luminance in HDR display. Lambrecht and Kent [50] modelled a joint spatial and temporal CSF in a
3D non-separable form, which took the difference between excitatory and inhibitory mechanism.
Hu et al. modelled the visual sensitivity with spatial randomness [36] and temporal randomness [37].
Kim et al. [47] modelled the visual sensitivity by jointly exploiting temporal masking from motion
vector, contrast masking from edge, and luminance masking from mean pixel values in images.
Furthermore, a stereo visual sensitivity model [39] was proposed by considering the binocular
masking effect. Since HVS is complicated and many visual factors are not fully understood yet, it is
challenging to model 𝐹𝐻𝑉 𝑆 mathematically. Data-driven sensitivity models [8, 35] were developed to
explore visual responses to image distortion. Bosse et al. [8] found that the spatial visual sensitivity
to MSE distortion decreased as the image texture increased. Then, a neural network was trained to
predict the perceptual weights for each MSE based distortion. Hosseini et al. [35] tried to synthesize
a convolutional filter to mimic the falloff sensitivity response of HVS for image sharpness, i.e.,
I𝑂 = I ∗ 𝐹𝐻𝑉 𝑆 , which was approximated as an inverse generalized Gaussian distribution and
implemented with MaxPol filter library.
3.2.2 JND Models. Due to visual sensitivity and masking effects in HVS, not every distortion
is perceivable. The minimum visibility threshold of pixel intensity change is denoted as JND,
such as the Weber’s law in CSF. Many pixel-wise JND and sub-band domain JND models for 2D
images/videos have been developed. In [2, 47], video JND was modelled by jointly exploiting spatial
CSF, temporal masking from motion vectors, contrast masking from texture edges and luminance
masking. Based on the finding that random pattern region has higher spatial masking, this pattern
masking effect was modelled and incorporated in modelling JND [107]. Jiang et al. [40] predicted a
Critical Perceptual Lossless (CPL) threshold of an image with Karhunen-Loeve Transform (KLT),
and calculated the difference between the input image and its CPL image as a JND map. In [98],
JND prediction was modelled as maximizing the difference between the original and distorted
image subject the same perceptual quality. Then, a JND estimation scheme was proposed based
on the hierarchical predictive coding model of visual cortex. More JND models can be referred in
[1, 60, 108]. Most of the existing JND models are perceptual model based and process the pixel
or block as basic unit. However, HVS perceives image/video entirely rather than pixel or block
individually, which is more complicated.
Picture and Video Wise JND (PWJND/VWJND) models were studied. Although there are 100
quality scales or more for image compression with JPEG/JPEG2000 and 52 quality scales for video
compression with H.264/HEVC, Jin and Wang et al. found that HVS can only distinguish 5 to 7
quality scales [41, 96] and built a PWJND dataset, called MCL_JCI, for JPEG compressed images.
As shown in Fig.6, the real perceptual distortion in compressed image/video does not decrease
continuously as bit rate increases, but decreases as a discrete staircase function. The jump point
between two quality scales is the visibility threshold that subjects can just distinguish the difference
between distorted image/video and its reference, called PWJND/VWJND, which slightly varies
with image/video contents, resolutions and displays. To analyze the VWJND in compressed videos,
Wang et al. [96] built a large-scale compressed video quality dataset based on JND, called VideoSet,

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
11:10 Y. Zhang et al.

Fig. 6. PWJND and VWJND for compressed images/videos.[63, 96]

where the first three perceptual JND points are labeled among 52 quality levels of H.264 compressed
videos. It has 220 source videos with four resolutions. Liu et al. [63] developed a deep learning based
JND model to predict the perceptual difference, where the JND prediction was modelled as a binary
classification problem and then solved. Shen et al. [85] proposed a deep learning-based structural
degradation estimation model to predict structural visibility at patch-level, and then predicted the
PWJND. Fan et al. [25] modelled the PWJND for symmetrically and asymmetrically compressed
stereo images with H.265 Intra coding and JPEG2000, where Scale-Invariant Feature Transform
(SIFT), cyclopean image quality, rivalry quality were additionally introduced, and gradient boosting
decision tree was used for feature selection and fusion. Then, based on VMAF features [58] and
Support Vector Regression (SVR), Zhang et al. [129] developed a Satisfied User Ratio (SUR) prediction
method for compressed videos, which indicated the ratio of subjects who cannot perceive the visual
difference between the distorted video and its reference. JND point was derived at 75% SUR. In
[131], a deep learning based Video Wise Spatial-Temporal SUR (VW-STSUR) model was proposed to
predict the SUR and VWJND using two-stream CNN, where spatio-temporal features were fused at
score and feature levels. These PWJND/VWJND and SUR models were developed in a data-driven
way, and high dimensional handcrafted features or deep learning features were used to improve
the JND prediction accuracy. Compared with conventional pixel-wise JND models in [108], more
distortion is allowed in PWJND/VWJND due to compound masking effects in entire image/video
[91]. However, large scale JND datasets are required, which are time-consuming and laborious to
create. Further subjective studies and JND modelling related to human, content and system factors
will be interesting.
3.2.3 Visual Attention Models. Visual Attention (VA) or saliency is a high-level cognitive mecha-
nism that drives the retinal fovea in eye to attentional contents for higher fidelity, which is also
noted as Region-Of-Interest (ROI). Usually, HVS is easier to be attracted by regions with high
contrast, such as luminance, texture, orientation, temporal motion and color contrasts. To mimic
the decomposition happened in the visual cortex, multi-scale image decomposition was usually
employed. So, these inputs were represented with multi-scale center-surround contrasts of texture,
optical flow or (blue-yellow, red-green) chromatic maps [38]. At higher level, cognitive features,
such as shape, sign, faces, skin, and characters, draw attention. However, these high-level fea-
tures vary from person to person due to their different knowledge backgrounds and preferences,
while low-level features are fundamental and stable among people. In [126], Zhang et al. explored
stereoscopic video saliency with deep learning, where 3D Convolutional Neural Network (CNN)
was used to extract spatio-temporal saliency feature and Convolutional Long Short-Term Memory

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
A Survey on Perceptually Optimized Video Coding 11:11

(Conv-LSTM) was used to fuse spatio-temporal and depth attributes. More recent advances on VA
models can be referred in [18, 26].

3.3 Discussions
With the advances of visual-psychological and physiological research, more perceptual mechanism
and models have been revealed. Conventional visual-psychological studies usually analyzed a
perceptual factor and its visual response by adjusting one stimulus and fixing the rest. However,
multiple stimuli co-exist and mutually interact, such as binocular fusion and rivalry in LGN. HVS
is a compound non-linear binocular vision system, which is more complicated than a simple
combination of multiple models. Modeling the joint effects among multiple stimuli and binocular
vision is more challenging. Thanks to learning algorithms, especially the deep learning, which are
able to discover statistical relationships in massive data, data-driven approach to computational
perceptual model is a promising direction for scenarios with sufficient labels. On the other hand,
the found perceptual mechanisms, such as multi-scale and visual attention, are also able to improve
the design of learning models.

4 COMPUTATIONAL MODELS FOR VISUAL QUALITY ASSESSMENT


Subjective test [12, 74] is the ultimate way to evaluate image or video quality. The main subjective
quality methods include Degradation Category Rating (DCR), Pair Comparison (PC) and Absolute
Category Rating (ACR)[12]. The reference and distorted sequences are shown to a number of
human subjects, 16 or more, under controlled viewing conditions. Then, the subjects are asked to
assess the overall quality of the distorted sequences with respect to the reference, and score it on a
five or nine-grade scale corresponding to their perceived quality in mind. Finally, Mean Opinion
Score (MOS) is calculated based on the average rating scores of the subjects, while Differential MOS
(DMOS) calculates the average difference between scores of the distorted and reference sequences.
However, subjective test is expensive, time-consuming, and often impractical. Computational
models for visual quality assessment that measure the quality of video are required.
To evaluate the perceptual quality of visual signal ^I, visual quality assessment is to model the
relationship between V and ^I, which is mathematically expressed as
V = 𝐹𝑉 𝑄𝐴 (^I), (3)
where 𝐹𝑉 𝑄𝐴 () is a relationship function defining non-reference visual quality assessment. According
to Eq.1, visual quality V = 𝐹𝐻𝑉 𝑆 (𝐹𝐷 (^I)). Thus, 𝐹𝑉 𝑄𝐴 () = 𝐹𝐻𝑉 𝑆 (𝐹𝐷 ()). In video coding, reference
signal I is usually available and visual distortion D𝑉 caused by signal distortion D really matters.
So, quality assessment with the reference I is
D𝑉 = 𝐹𝑉 𝑄𝐴 (D). (4)
where D𝑉 is the visual quality degradation of ^I whose groudtruth is obtained through subjective
experiments, signal distortion D = I − ^I. Based on whether the temporal information is exploited
or not, the full-reference visual quality assessment 𝐹𝑉 𝑄𝐴 () is categorized as IQA and Video Quality
Assessment (VQA) models.

4.1 IQA Models


PSNR and MSE measure image quality with mean squared difference between the source and
distorted images, i.e., D𝑉 is approximated with D, which has been widely used in video coding
due to its simplicity. However, PSNR only measures the signal difference without considering
visual properties, which cannot truly reflect the perceived quality in HVS. To handle this problem,

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
11:12 Y. Zhang et al.

Table 1. Key Features and Fusion Schemes of the Featured IQA/VQA Schemes for video compression.

Methods Types Key Features Fusion Schemes


PSNR/PSNR-HVS [34] IQA MSE between the reference and distorted images /
CSPSNR [83] IQA MSEs of Y, Cb and Cr weighted summation with empirical weights
HDR-VDP-2 [69] IQA threshold-normalized difference for each band weighted logarithmical summation
SSIM/MS-SSIM [104] IQA Similarity for luminance, contrast and structure multiplication
Visual saliency, contrast sensitivity and chroma saturation
VSI[125] IQA /
are used as perceptual weight
logarithmical summation of MSE modulated with exponen-
PWMSE[36] IQA Masking from spatial randomness,low-pass filtered MSE
tially tuned randomness
Divisively normalized sparse-domain similarity for lumi-
SDS[128] IQA multiplication
nance energy and sparse coefficient
DeepQA[48] IQA CNN for visual sensitivity and fully connected layers for quality regression
Fan[24] IQA One CNN to classify the distortion types and multiple CNNs to predict the quality of each distortion types
spatial information loss, spread of chroma, spatial gain,
VQM [77] VQA weighted summation with empirical weights
temporal impairments, local color impairments, edge shift
Difference between spatial and temporal coefficients from
MOVIE [82] VQA multiplication
Gabor filter banks
HDR-VQM[73] VQA Subband error from log-Gabor filters spatial pooling, short and long temporal error pooling
Contrast sensitivity from spatial, temporal and foveated
PWMSE-V[37] VQA perceptually weighted MSE
low-pass filters, spatial and temporal randomness
VMAF [3] VQA AN-SNR,DLM,VIF,MCPD SVR
ST-VMAF [58] VQA 12 features from VMAF (5), T-SpEED (3) and T-VIF (4) SVR and hysteresis temporal pooling
E-VMAF [58] VQA ST-VMAF, VMAF ensembled with SVM, hysteresis pooling over time
DeepVQA[106] VQA CNN extracts spatiotemporal sensitivity and CNAN pools qualities in temporal
C3DVQA[113] VQA CNN extracts spatial features and 3D CNN extracts spatiotemporal features, regression layers to predict quality

developing a evaluator to qualitatively measure the perceptual quality of visual contents has been
a key topic for visual signal processing.
A large number of IQAs have been proposed in the past few decades. Image artifacts, including
blocking, blurring, ringing, and color bleeding, will be caused in compression. Based on the PSNR,
an extension named PSNR-HVS [34] was developed with weighted summation of three PSNRs,
which were error sensitivity from RGB channels, structural distortion from mean, max and min
values, and edge distortion from edge, texture and flat regions. For compressed color images,
Shang et al. [83] proposed a Color-Sensitivity-based PSNR (CSPSNR) where MSEs of Y, Cb and Cr
components were combined with weights 0.695, 0.130 and 0.175 from subjective test. It had a better
consistency than the empirical 6:1:1 combination in PSNR [88] and 4:1:1 combination in MSE in
the HEVC reference software. Moreover, to access HDR images with a broader luminance range,
distortion of each frequency and orientation band was threshold-normalized and algorithmically
summed [69]. Since HVS is more sensitive to structural distortion, Structural SIMilarity (SSIM)
[103] and its variants, such as Gradient based SSIM (GSSIM), Multi-scale SSIM (MS-SSIM) [104],
have been developed by measuring the mean and variance similarities between the reference
and distorted images. Zhang et al. [125] proposed a Visual Saliency-Induced (VSI) Index for IQA,
where visual saliency, contrast sensitivity and chroma saturation features were exploited and
multiplicatively fused. Hu et al. [36] found that HVS was more sensitive to the distortion in regular
texture than that in disordered texture. They proposed a Perceptually Weighted MSE (PWMSE) by
exploiting the masking effects in spatial and pattern. There are many classical IQA metrics that
exploit the perceptual factors mentioned in Section 3. Their general framework includes feature
representation to extract effective visual features and fusion model to fit the quality score, as shown
in Fig.7(a), where feature representation is motivated from the perceptual factors of HVS. Finally,
consistency indices, such as Pearson Linear Correlation Coefficient (PLCC), Spearman Rank-Order
Correlation Coefficient (SROCC) and MSE, were measured between the predicted scores from IQA
and MOS/DMOS from subjective tests, to validate the effectiveness of an IQA model.
To improve the accuracy of IQAs, learning algorithms have been exploited to discover statistical
knowledge in massive data. Fig.7(b) shows a framework of the learning based IQAs, where feature

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
A Survey on Perceptually Optimized Video Coding 11:13

(a)

(b)

Fig. 7. Framework of the IQA/VQA computational models.(a) Perception based IQA/VQA. (b) Learning based
IQA/VQA.

extraction and fusion models are learned. To improve the feature representation of visual quality,
Zhang et al. [128] developed a Sparse-Domain Similarity (SDS) index, where sparse representation
was used to learn more effective quality features. Meanwhile, Divisive Normalization Transform
(DNT) was used to remove statistical and perceptual redundancies. Due to the powerful capability of
deep neural networks in feature representation and non-linear data fitting, Kim et al. [48] proposed
a deep learning based full-reference IQA model without considering prior knowledge of HVS, but
exploiting data statistics in databases. Fan et al. [24] utilized multi-expert CNNs to classify distortion
types and predicted the quality in each distortion type. Deep IQA models usually outperform in
quality prediction. However, they are highly data-dependent and their accuracy may degrade in
cross-database validation. Moreover, large dataset with quality labels is required in training general
and stable deep IQAs. These metrics are IQAs and temporal information is not considered.

4.2 VQA Models


In addition to the spatial artifacts, temporal artifacts introduced in video compression, such as
flickering, jerkiness and floating, shall be considered in VQA. VQM and MOVIE are two classical
VQAs. In VQM [77], seven key features, including spatial information loss, spread of chroma,
spatial gain, temporal impairments, and local color impairments and edge shift, were calculated and
fused with empirically weighted summation. In MOVIE [82], reference and distorted videos were
decomposed with multiple Gabor filters and quality degradations were measured with the Gabor
coefficient differences in spatial and temporal domains. Finally, the two degradations were multi-
plicatively fused. In [73], quality features of the sub-band distortion in HDR videos were extracted
with log-Gabor filters. In [37], a perceptually weighted MSE metric was developed by considering

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
11:14 Y. Zhang et al.

low-pass filter based contrast sensitivity, visual attention and spatiotemporal randomness based
masking effects. Li et al. [58] proposed a practical perceptual VQA for video streaming, named
Video Multi-method Assessment Fusion (VMAF). By learning a Support Vector Machine (SVM)
model, it fused the scores from four existing metrics, including Anti-noise SNR (AN-SNR), Detail
Loss Measure (DLM), Visual Information Fidelity (VIF), and Mean Co-Located Pixel Difference
(MCPD). However, VMAF used the average frame difference as the only temporal feature, which
highly related to video content rather than the temporal distortion. To handle this problem, Bampis
et al. [3] proposed a Spatio-Temporal VMAF (ST-VMAF) to measure the temporal distortion by
considering temporal masking. It modelled bandpass-filtered maps of frames differences and used
entropy differences to predict video quality. In addition, an Ensemble VMAF (E-VMAF) scheme
was proposed by aggregating the conventional VMAF [58] and the enhanced ST-VMAF. Kim et al.
[106] used a CNN to extract spatio-temporal visual features for each frame, which were temporally
pooled with a Convolutional Neural Aggregation Network (CNAN). Xu et al. [113] proposed a
3D-CNN based VQA model, called C3DVQA, which was composed of two-stream 2D convolutional
layers to learn spatial features from the distorted and residual frames, 3D convolutional layers to
learn spatiotemporal features and regression layers to predict the visual quality.

4.3 Discussions on Quality Assessment


Although the VQA is an open problem investigated for many years, there are still many challenging
issues and unsolved problems. Firstly, many IQA/VQA models were developed based on some visual
properties and characteristics in specific applications. Although SSIM and VMAF are often used
in measuring quality of compressed video, there is still no perceptual visual quality model that is
commonly recognized and widely used as the PSNR.
Secondly, visual quality databases shall be built based on subjective tests under a controllable
environment, and it is very laborious to score large amount of distorted visual contents with dozens
of human subjects. Thus, the available datasets are with small scale. As they cannot be fused to
form a large one, it causes different properties of the datasets and inconsistences in evaluating the
existing VQA models. A large dataset for compression distortions is highly demanded.
Thirdly, most of previous VQAs were motivated from modelling the perceptual factors, noted
as perception based VQA schemes, such as SSIM and MOVIE, as shown in Fig. 7(a). However,
effective feature extraction and feature fusion are the keys to model accuracy. Recently, many
learning based VQA schemes were developed as the learning schemes were applied to improve
the features extraction and fusion models, as shown in Fig.7(b). Since it is challenging to extract
effective features for IQA, learning based representations, such as sparse representation [128, 134],
Singular Vector Decomposition (SVD)[68] and tucker decomposition [141], were used to represent
quality features more effectively. In addition, end-to-end deep IQAs/VQAs [24, 48, 106, 113] were
capable of learning feature representation and fusion simultaneously. However, they require a large
scale dataset with labeled subjective quality in training, which is laborious and expensive to collect.
Generality and interpretability of the deep IQA/VQA models shall be improved.
Finally, perception V is actually affected by factors 𝐹𝐻𝑉 𝑆 , 𝐹𝐷 , 𝐹𝐸 , 𝐹𝑅 and P according to Eq. 1.
Thus, the VQA model is a conditional function of 𝐹𝐷 , 𝐹𝐸 , 𝐹𝑅 and P. It is important to improve the
adaptability of models and transfer the models from one condition to another.

5 PERCEPTUAL VIDEO CODING OPTIMIZATION


The optimization objective of video coding is to maximize the visual quality of compressed videos
V at a given target bit rate 𝑅𝑇 , which can be formulated as
max(V), 𝑠.𝑡 .𝑅 ≤ 𝑅𝑇 , (5)

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
A Survey on Perceptually Optimized Video Coding 11:15

Fig. 8. Flowchart of the perceptually optimized bit allocation and rate control.

where 𝑅 and 𝑅𝑇 are coding bit rates. Suppose the original visual signal I has the highest visual
quality. The coding optimization problem in Eq. 5 is equivalent to minimizing visual distortion at
the given bit rate, which can be presented as
min(D𝑉 ), 𝑠.𝑡 .𝑅 ≤ 𝑅𝑇 , (6)
where D𝑉 is the visual quality degradation defined in Eq. 1. However, in current video coding
standards, the distortion D𝑉 is mainly measured with MSE/PSNR, i.e., D𝑉 ≈ D, which cannot truly
reflect the perceptual quality.

5.1 Perceptually Optimized Bit Allocation and Rate Control


Video transmission is to transmit bitstream under the conditions of limited bandwidth and trans-
mission delay to ensure the playback quality of video services, which has two key objectives. One
is to accurately control the amount of encoded bit rate below the target bandwidth of networks to
avoid buffering. The other is to maximize the visual quality of video services at the target bit rate,
which requires a reasonable bit allocation. The optimization objective is formulated as
𝑛
∑︁ 𝑛
∑︁
{𝑄𝑃𝑖∗ } = arg min ^ V (𝑄𝑃𝑖 )), 𝑠.𝑡 .
(D 𝑅𝑖 ≤ 𝑅𝑇 , (7)
{𝑄𝑃𝑖 } 𝑖=1 𝑖=1
where 𝑄𝑃𝑖 and 𝑅𝑖 are Quantization Parameter (QP) and bit rate for unit 𝑖, 𝑛 is the number of coding
units, D
^ V (𝑄𝑃𝑖 ) is the visual quality degradation at 𝑄𝑃𝑖 .
Table 2 illustrates representative works on perceptually optimized bit allocation and rate control
algorithms, which includes key features, quality metric, Bjøntegaard Delta Bit Rate (BDBR) [7],
rate accuracy and complexity overhead (Δ𝑇 ). These works can be divided into two types: quality
metric-based and perceptual factor-based. In the quality metric-based category, Gao et al. [29]

ACM Comput. Surv., Vol. 00, No. 0, Article 11. Publication date: August 2022.
11:16 Y. Zhang et al.

Table 2. Perceptually Optimized Bit Allocation and Rate Control Schemes.

BDBR(𝑄 )[%] Bit Error Δ𝑇


Author/Year Visual Features and Coding Optimizations VQA, 𝑄
AI LD RA [%] [%]
1
Gao’16[29] 𝐷 𝑀𝑆𝐸 was replaced with 𝑆𝑆𝐼 𝑀 in game theory based bit allocation. SSIM -2.74 / / 0.1 1.0
Image structural similarity from mean, variance, and variance covariance. SSIM / -14.0 /
Zhou’19[140] 𝐷𝑆𝑆𝐼 𝑀 was approximated as
𝐷 𝑀𝑆𝐸
for CTU-level rate control. PSNR / -3.1 / 0.08 2.7
𝑆2
MS-SSIM / -12.8 /
Structural similarity 𝐷𝑆𝑆𝐼 𝑀 was approximated as 𝛼Θ𝐷 𝑀𝑆𝐸 + 𝛽 for CTU SSIM -5.2 -12.3 -5.3
Li’21 [57] 1.9 2.2
level bit allocation and RDO with joint 𝑅 -𝐷𝑆𝑆𝐼 𝑀 -𝜆𝑆𝑆𝐼 𝑀 model PSNR 2.7 2.3 -2.2
^ V = 𝐾𝑝 × 𝐷 𝑀𝑆𝐸 , where weight 𝐾𝑝 considered the strengths of
D
Xu’16 [112] 𝑘 ×𝑘 VQM / -3.41 / / /
motion (𝑘𝑀𝑆 ), structure (𝑘𝑆𝑆 ) and texture (𝑘𝑇 𝑆 ) as 𝐾𝑝 = 𝑀𝑆
𝑘𝑇 𝑆
𝑆𝑆 .

D^ V = 1 + 𝐾𝑝 × 𝐷 𝑀𝑆𝐸 , where weight 𝐾𝑝 to MSE was calculated


Zeng’16 [121] SSIM / -6.94 -5.71 0.8 /
as spatial texture complexity multiplied by temporal motion activity.
PWMSE -6.27 / /
PWMSE exploiting spatial and pattern masking effects, PWMSE-V -4.04 / /
Liu’19 [64] polynomial PWMSE-𝑄 and 𝑅 -𝑄 models were used for bit SSIM -8.0 / / 0.1 /
allocation and rate control. VIF -1.57 / /
MS-SSIM -7.85 / /
Gao’22 [28] Consistent quality based rate control considering SSIM and MSE variations. SSIM / -8.12 / 0.5 /
PSNR -3.6 / /
Xiang’22[111] 𝐷 𝑀𝑆𝐸 was suppressed by a masking effect or JND threshold. 0.1 5.05
PSPNR -6.4 / /
Li’22 [54] ROI and JND for region-level Intra-period determination. PSNR / -5.63 / / /
Block level JND factor based on masking effect and brightness contrast. PSNR / -3.30 /
Zhou’20[139] 0.08 15.10
was used as perceptual weight in RD model for rate control. SSIM / -6.50 /
Lim’20[59] QP determination considering luminance adaptation effect based JND. MOS / / / 0.22 0.36
Yang’16[116] CTU-level bit allocation considering spatial GM and temporal GMSD. GMSD / -7.43 -10.4 0.3 /
Masking effects based rate control considering GMSD, PSNR / -1.29 /
Wang’18[97] motion information and texture complexity. SSIM / -3.81 / 0.19 1.2
MOS / -4.67 /
Chao 16’[15] CU level bit allocation based on local JND. DMOS -14.0 / /

proposed a SSIM-based game theory approach for Coding Tree Unit (CTU)-level bit allocation
in intra frames, where MSE was replaced with 𝑆𝑆𝐼1𝑀 in measuring the visual distortion D ^ V , i.e.,
DV = 𝐷𝑆𝑆𝐼 𝑀 = 𝑆𝑆𝐼 𝑀 . Then, SSIM-𝑄𝑃 relationship was studied for bit allocation. However, coding
^ 1

modules in the original video encoder were implemented with the target of optimizing MSE/Mean
Absolute Difference (MAD), it was complicated to implement all their optimization criteria to
be SSIM. To handle this problem, Zhou et al. [140] extended the divisive normalized SSIM-based
RD model [102] from Discrete Cosine Transform (DCT)-domain, and the relationship between
1-SSIM and MSE was built as 𝐷𝑆𝑆𝐼 𝑀 ≈ 𝐷𝑆𝑀𝑆𝐸 2 , where 𝑆
2
is a ratio of DCT energies between the
current block and entire frame. Such that optimizing SSIM can be approximately achieved by
using MSE as D ^ V = 𝐷𝑆𝑆𝐼 𝑀 ≈ 𝐷𝑀𝑆𝐸
𝑆 2 , which was then applied to CTU-level rate control and global
𝑄𝑃 determination. Li et al. [57] proposed a more accurate relationship between SSIM and MSE
as 𝐷𝑆𝑆𝐼 𝑀 = 𝛼Θ𝐷 𝑀𝑆𝐸 + 𝛽, where 𝛼 and 𝛽 were linear model parameters updated every block, Θ
related to the texture variance of image contents. Consequently, 𝐷𝑆𝑆𝐼 𝑀 − 𝜆 and 𝐷𝑆𝑆𝐼 𝑀 − 𝑄𝑃 models
were derived for bit allocation. Fig.9 shows relationship analyses between 𝐷𝑆𝑆𝐼 𝑀 and four 𝐷 𝑀𝑆𝐸
based approximation models, i.e., 𝐷 𝑀𝑆𝐸 ,[102, 140],[119], and [57]. The dot data was collected from
encoding four sequences with HEVC at four 𝑄𝑃 ∈ {22, 27, 32, 37} and solid line was linear fit of
the data. 𝑅 2 of the fitted 𝐷𝑆𝑆𝐼 𝑀 and 𝐷 𝑀𝑆𝐸 are 0.482 and 0.458 for All Intra (AI) and Low Delay
(LD) configurations, respectively, which are low. The models in [102, 119, 140] improved the 𝑅 2 to
(0.791,0.787) and (0.694, 0.686), respectively. Li et al [57] further improved the fitting accuracy (𝑅 2 )
to 0.987 and 0.953, respectively, which was highly accurate. A more accurate relation between SSIM
and MSE lead a better consistency between optimization objective and codec implementation.
In addition to SSIM, Xu et al. [112] proposed a novel Free-energy Principle inspired Video Quality
metric (FePVQ), in which the strengths of motion, structure and texture were considered as a perceptual
weight K_p for the local MSE, i.e., D̂_V = K_p × D_MSE. Then, this FePVQ was used in bit allocation and RDO
to allocate more bits to quality-sensitive regions.

Fig. 9. Relationships between D_SSIM and four D_MSE based approximation models, i.e., D_MSE, [102, 140], [119],
and [57], respectively. (a) Intra frames, R² of the four linearly fitted models are (0.482, 0.791, 0.787, 0.987). (b)
Inter frames, R² are (0.458, 0.694, 0.686, 0.953).

Similarly, the perceptual weight to MSE was modelled
with spatial texture complexity and temporal motion activity [121], which guided the frame- and
CTU-level bit allocations. Motivated by the PWMSE metric [36], Liu et al. [64] applied PWMSE as
the visual objective (D̂_V = D_PWMSE) and proposed a perceptual CTU-level bit allocation algorithm
for HEVC intra coding, thereby minimizing the perceptual distortion of each CTU under a given
bit rate constraint. It achieved 6.27% bit rate reduction and 0.38 dB Bjøntegaard Delta Perceptually
Weighted Peak Signal to Noise Ratio (BD-PWPSNR) gain on average. Gao et al. [28] proposed a
rate control scheme to maintain consistent image quality among adjacent frames, where the PSNR
variation among frames was used as an additional key objective in D̂_V. Xiang et al. [111] modeled
the perceptual distortion D̂_V as the signal distortion D_MSE minus a masking effect or JND
threshold M, i.e., D̂_V = D_MSE − M, which was then applied to CTU-level λ-domain rate control.
It achieved 6.4% BDBR gain when the quality was measured with a JND based Peak Signal to
Perceptual Noise Ratio (PSPNR). Meanwhile, the quality consistency measured by PSNR deviation
among frames was also improved. In these schemes, more visually plausible quality metrics, such
as SSIM, PWMSE, or PSNR variation, were directly used to replace the MSE in the bit allocation
objective. There are two key problems to be solved. One is the accuracy of the used visual quality
metric, i.e., D̂_V is approximately measured with an IQA, and the other is the approximation accuracy
between the IQA and MSE for easy codec implementation. The two approximations lead to losses of
accuracy and generality. Meanwhile, the quality metrics used in these coding schemes are IQAs; extending
them to VQAs that consider temporal distortion will be even more complicated.
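For illustration, the two direct modifications of the distortion term discussed above can be sketched as follows (a minimal Python sketch on toy data; the weight K_p and the JND map are hypothetical stand-ins for the models in [111, 112, 121]):

import numpy as np

def weighted_mse(ref, rec, k_p):
    # D_V = K_p * D_MSE: a perceptual weight scales the local MSE (cf. [112, 121]).
    return k_p * np.mean((ref - rec) ** 2)

def jnd_suppressed_mse(ref, rec, jnd):
    # One common realization of D_V = D_MSE - M: errors below the pixel-wise JND
    # threshold are not counted (PSPNR-style); the exact form in [111] may differ.
    visible = np.maximum(np.abs(ref - rec) - jnd, 0.0)
    return np.mean(visible ** 2)

rng = np.random.default_rng(1)
ref = rng.integers(0, 256, (64, 64)).astype(float)            # a toy CTU
rec = np.clip(ref + rng.normal(0.0, 2.0, ref.shape), 0, 255)  # its reconstruction
print(weighted_mse(ref, rec, k_p=0.7))
print(jnd_suppressed_mse(ref, rec, jnd=np.full(ref.shape, 3.0)))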
In the other category, perceptual factors were exploited in optimizing video compression, which
were regarded as indirect approaches. Li et al.[54] proposed a ROI based perceptual video coding
scheme, where a region-level Intra-period was determined based on ROI and JND to reduce error
propagation. Also, a ROI based bit allocation was used to adjust bit budgets for Intra blocks. Since
HVS is more sensitive to textural and motion regions, Yang et al. [116] proposed to allocate more
bits to perceptually sensitive regions with larger spatial Gradient Magnitude (GM) and temporal
Gradient Magnitude Similarity Deviation (GMSD). In [97], a masking effect based perceptual model
was proposed by considering texture complexity and motion. This model was then used to allocate
fewer bits to CTUs in the insensitive masking regions. Lim et al.[59] proposed a perceptual rate


control algorithm that assigned CTU-level bits by considering luminance adaptation effect based
JND. The subjective MOS of the decoded sequences was improved by about 0.19. Based on the pixel-level
JND of masking effect and brightness contrast, Zhou et al.[139] derived a block-level JND factor,
which was used as a perceptual weight in the RD model for frame and CTU-level rate control. Chao
et al. [15] proposed a local JND model by considering contrast sensitivity, transparent masking,
temporal masking, residual correlation adjustment and quantization distortion. Then, 𝑄𝑃 was
gradually increased to allocate fewer bits when the visual distortion was within the JND threshold.
In this category, these approaches were not specifically designed for an IQA/VQA metric. Instead,
perceptual factors, such as masking effect, visual sensitivity, JND or visual attention, were exploited
for better generality since they were fundamental features in visual perception. However, the
disadvantage is that only one or two key properties were exploited, which removes only part of the
visual redundancies. Moreover, inaccurate computational perceptual models and the indirect objective
may degrade the coding gain for a target IQA/VQA.
According to Table 2, the average bit error of these perceptually optimized schemes ranged from 0.08%
to 1.9%, indicating highly accurate rate control. In addition, their complexity overhead ranged from
0.36% to 15.10%, which mainly came from calculating the perceptual models. Figure 8 shows a general
flowchart of perceptually optimized bit allocation and rate control, which includes visual perception
models and enhanced bit allocation. Different from the pixel-by-pixel MSE, visual perception models,
such as SSIM, ROI, JND and visual sensitivity, indicate the relative visual importance of Group-of-
Picture (GOP), frame or region in videos. So, the relationship between perceptual distortion and rate
(D𝑉 -𝑅) shall be established, which is used to guide GOP/frame/CTU-level bit allocation algorithms
by assigning more bits to perceptually sensitive regions and fewer bits to perceptually insensitive
regions. Then, based on the assigned bits 𝑅, 𝑄𝑃 is determined using 𝑅-𝑄 or 𝑅-𝜆-𝑄 models. Finally,
parameters of these relationship models will be updated in encoding each unit. Consequently,
the overall quality of compressed video becomes more visually plausible. The advantage of these
works is that the number of bits can be more reasonably allocated with the guidance of visual
perception. However, there are some aspects to be further improved: 1) HVS is a complicated non-
linear system that has multiple visual properties with compound effects. Existing computational
perceptual models usually modelled one or two key visual properties, which can hardly model the
HVS accurately. Thus, developing more complete and accurate perceptual models will be helpful
in exploiting visual redundancies further. 2) Although the perceptual factors and quality models
optimize the bit allocation, other coding modules, such as block matching in predictive coding,
mode decision, loop filtering and RDO, still use MSE based distortion criteria, which does not
conform with the perceptual bit allocation. 3) Different quality assessment metrics may not be
consistent with each other. Also, they measure quality at image or video-level, which is different
from coding unit processed at block-level. 4) It is difficult to determine the optimal bits and QPs for
some perceptual models, such as learning based schemes that cannot be mathematically expressed
and piecewise JND functions that are non-differentiable. In these cases, convexity is not guaranteed and
the Lagrange multiplier method is not applicable in solving their RD minimization problem. Further
adaptations or approximations are required.
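To make the general workflow above concrete, the sketch below (Python) allocates a frame bit budget to CTUs in proportion to hypothetical perceptual weights and maps the allocated bits to QPs using the widely used R-λ and λ-QP relations of λ-domain rate control; the model parameters are illustrative, not tuned values.

import numpy as np

def allocate_ctu_bits(frame_bits, weights):
    # Assign more bits to perceptually sensitive CTUs: allocation proportional to the weights.
    w = np.asarray(weights, dtype=float)
    return frame_bits * w / w.sum()

def bits_to_qp(bits, pixels, alpha=3.2, beta=-1.367):
    # R-lambda model (lambda = alpha * bpp^beta) and lambda-QP model used in HEVC rate control.
    bpp = bits / pixels
    lam = alpha * bpp ** beta
    return int(np.clip(round(4.2005 * np.log(lam) + 13.7122), 0, 51))

weights = [1.6, 1.0, 0.6, 0.8]   # hypothetical JND/saliency-based weights of four CTUs
ctu_bits = allocate_ctu_bits(frame_bits=20000, weights=weights)
qps = [bits_to_qp(b, pixels=64 * 64) for b in ctu_bits]
print([(int(round(b)), qp) for b, qp in zip(ctu_bits, qps)])

In this sketch, perceptually sensitive CTUs receive more bits and hence smaller QPs, while insensitive CTUs are coded more coarsely; in a real rate controller the model parameters would be updated after encoding each unit.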

5.2 Perceptual Rate Distortion Optimization (PRDO)


Mainstream video coding is a hybrid framework, which includes predictive coding, transform
coding, entropy coding and filtering/enhancement processing. Excluding the entropy coding, RDO
[89] is to select the optimal mode or parameter among a set of candidates by minimizing the
Rate-Distortion (RD) cost, which consists of the distortion D and the coding bit rate R. The RDO has been used in
many coding modules such as variable-size Coding Unit (CU)/Prediction Unit (PU) modes in the
predictive coding, angular intra modes in intra prediction, variable Transform Unit (TU) sizes and


Table 3. Representative PRDO Schemes.

Author/Year  Visual Features  Codec  VQA: BDBR(Q)[%] (AI, LD, RA)  ΔT[%]
Wang'12 [100]  D_MSE was replaced with 1-SSIM  H.264/JM15.1  SSIM: / -16.28 -7.74  6.18
Yeo'13 [119]  Approximated D_SSIM with D_MSE/(2σ²+c)  H.264/JM17.2  SSIM: -8.1 -9.8 -14.0;  PSNR: 3.9 6.8 6.0  1.0
Wang'13 [101]  Masking effect, SSIM-based divisive normalization  H.264/JM15.1  SSIM: / -15.8 /  /
Lee'18 [51]  D_SSIM ≈ D_MSE/(2σ²+c), empirical Lagrange multiplier clip  HEVC/HM16.0  SSIM: / -8.1 -4.0  7.1
Wang'19 [99]  Weighted combination of D_MSE and D_SSIM  HEVC/HM16.14  SSIM: / -4.65 /;  PSNR: / 3.27 /  /
Wu'20 [109]  MSE was replaced with the PWMSE considering spatial and temporal masking effects  HEVC/HM16.0  PWMSE: / -14.92* -27.12*;  MOS: / -17.55  /
Luo'21 [67]  Built a linear relationship between VMAF difference and MSE difference with pre-analysis; block-level Lagrange multiplier is adjusted based on the correlation  HEVC/HM16.20: VMAF / -3.61 -2.67, PSNR / 4.72 3.60, SSIM / 3.66 2.06;  VVC/VTM10.0: VMAF / -1.92 -1.20  /
Jung'15 [44]  FEJND from spatial and temporal masking effects  H.264/JM17.2  PSNR: / / -8.46  /
Bae'16 [2]  DCT domain JND directed suppression for SSE  HEVC/HM11.0  MOS: / -12.10 -9.90  /
Yang'17 [115]  GMR for spatial feature and GMSDR for temporal feature  HEVC/HM10.0  SSIM: / -6.18 -5.95;  GMSD: / -19.80 -16.24  /
Rouis'18 [80]  Temporal correlation, spectral saliency and visual stationary tuned the Lagrange multiplier  HEVC/HM16.12  PWMSE: / -6.14 -4.41;  SSIM: / -6.95 -9.86  3 to 5
Cui'21 [20]  JND of UHD/HDR based on saliency, CSF, LM and GDE  HEVC/HM16.20  DMOS: / -35.93 -24.93  19.26
Liu'18 [65]  Binocular combination distortion from left and right view  MV-HEVC  Bino-PSNR: / / -5.93  1.0
Zhang'16 [133]  Flickering distortion in 3D synthesized video  3DVC/3D-HTM  SVQM: / / -15.27;  SIAT-VQA: / / -14.58  /
* BDBR(Q) subject to quality degradation.

transform kernels (e.g., Discrete Sine Transform (DST) and DCT kernels) in transform coding, and
parameter determination in in-loop filter. Traditionally, the distortion D is measured by MSE or
MAD. Since the ultimate objective of video coding is to minimize visual distortion DV at a given bit
rate according to Eq. 6, in PRDO, the optimal parameter 𝛼 ∗ is selected by minimizing a perceptual
RD cost J(B, B̂) as

    α* = arg min_{α∈A} J(B, B̂(α)),   J(B, B̂(α)) = D̂_V(B, B̂(α)) + λR,        (8)

where B and B̂ are the reference and reconstructed blocks, α is a parameter from the candidate set A, D̂_V is the
visual distortion between B and B̂, R is the total number of encoding bits of block B, which includes the bits
of the residual, mode index and other parameters, and λ is a Lagrange multiplier adjusting the relative
importance between distortion and rate, which can be derived from ∂D̂_V/∂R with convex optimization.
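A schematic realization of Eq. 8 for a single block is sketched below (Python); the candidate modes, their reconstructions and bit counts are placeholders standing in for a real encoder, and the perceptual distortion is a simple variance-normalized MSE used only for illustration:

import numpy as np

def visual_distortion(ref, rec):
    # Placeholder D_V: variance-normalized MSE (cf. the SSIM approximation in [119]).
    return np.mean((ref - rec) ** 2) / (2 * ref.var() + 1e-3)

def perceptual_rdo(block, candidates, lam):
    # Select the coding parameter minimizing the perceptual RD cost J = D_V + lambda * R.
    best, best_cost = None, np.inf
    for mode, (recon, bits) in candidates.items():
        cost = visual_distortion(block, recon) + lam * bits
        if cost < best_cost:
            best, best_cost = mode, cost
    return best, best_cost

rng = np.random.default_rng(2)
blk = rng.integers(0, 256, (8, 8)).astype(float)
candidates = {"planar": (blk + rng.normal(0, 2, blk.shape), 120),   # hypothetical mode results
              "dc":     (blk + rng.normal(0, 4, blk.shape), 70)}
print(perceptual_rdo(blk, candidates, lam=0.002))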
A number of perceptual quality models [51, 57, 99–101, 109, 119] and perceptual factors [2,
44, 65, 80, 133] have been investigated to improve the PRDO in Eq.8. Table 3 shows their visual
features and coding gains of the featured PRDO schemes. To improve the RDO with perceptual
quality models, Wang et al. [100] proposed SSIM-motivated RDO for block mode decision in video
coding, where the MSE based distortion term in the cost function was replaced with 1-SSIM, i.e.,
D̂_V = 1 − SSIM. Then, frame-level Lagrange multiplier adaptation was developed based on a
reduced-reference SSIM estimation, since the whole distorted frame is not available during the
coding process. Meanwhile, macroblock-level Lagrange multiplier adaptation was developed by
considering moving content and motion perception. Yeo et al. [119] approximated SSIM between
the reference and reconstructed blocks with their MSE D_MSE divided by the variance of the current
block (σ²), i.e., D_SSIM ≈ D_MSE/(2σ² + c), where c is a constant. Then, the MSE based RDO was modified into an
SSIM based RDO by scaling the Lagrange multiplier with a weighting factor calculated
as the local variance normalized by the mean variance of the entire frame, i.e., σ²/σ̄². Meanwhile, QP adaptation
was used for each macroblock. In [101], based on a DCT domain SSIM index, DCT coefficients were


divisively normalized to a perceptually uniform space, which determined the relative perceptual
importance of each macroblock by exploiting the masking effect of the HVS. Then, the distortion in
the RDO was measured with the MSE between the normalized DCT coefficients of the reference and
distorted blocks. These works were proposed for H.264/AVC. Lee et al. [51] applied Yeo’s 𝐷𝑆𝑆𝐼 𝑀 -
𝐷 𝑀𝑆𝐸 model [119] to HEVC and an empirical clip was designed to constrain Lagrange multiplier
variations among frames. Similarly, approximated 𝐷𝑆𝑆𝐼 𝑀 -𝐷 𝑀𝑆𝐸 models in [29, 57, 140] can also
be applicable to the PRDO. In [99], input video was divided into two classes, i.e., simple/regular
textural regions with high sensitivity and complex textural regions with low sensitivity, based on
the free-energy principle. Then, a weighted combination of fidelity (𝐷𝑆𝑆𝐸 ) and perceptual distortion
(D_SSIM), i.e., D̂_V = ωD_SSIM + D_MSE, was used as the final distortion metric in the PRDO, where a
larger weight 𝜔 was given to the distortion in simple/regular texture regions. Similarly, in [109],
PWMSE-V [37] that considered spatial and temporal masking modulation was applied to the
distortion term in RDO, and then Lagrange multiplier adaptation was derived accordingly. Luo et
al. [67] built a linear relationship between block-level VMAF degradation and MSE difference from
multi-pass pre-coding, which derived a block-level Lagrange multiplier adaptation. However, when
measured with VMAF, only 3.61% and 2.67% BDBR gains were achieved for LD and Random Access
(RA) configurations, respectively. Meanwhile, BDBR losses were observed when measured with PSNR and SSIM.
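The block-level Lagrange multiplier adaptation shared by several of these schemes can be sketched as follows (Python; following the variance-based weighting idea of [119], with illustrative parameter values only):

import numpy as np

def ssim_adapted_lambdas(frame, base_lambda, block=16, c=(0.03 * 255) ** 2):
    # lambda_b = base_lambda * (2*sigma_b^2 + c) / (2*mean(sigma^2) + c): textured blocks
    # (strong masking) get a larger multiplier, i.e., fewer bits; smooth blocks get more bits.
    h, w = frame.shape
    variances = np.array([[frame[i:i + block, j:j + block].var()
                           for j in range(0, w, block)]
                          for i in range(0, h, block)])
    return base_lambda * (2 * variances + c) / (2 * variances.mean() + c)

rng = np.random.default_rng(3)
frame = rng.integers(0, 256, (64, 64)).astype(float)
print(ssim_adapted_lambdas(frame, base_lambda=40.0).round(1))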
The second category is to improve the RDO by exploiting the perceptual factors, such as JND
and masking effects. Based on the spatiotemporal masking effects, Jung et al. [44] proposed a Free-
Energy based JND (FEJND) model to identify the disorderliness of video content, which was applied
to adjust the Lagrange multiplier and QP values in RDO process. Bae et al. [2] proposed a generalized
JND (GJND) model in DCT domain which bridged the gap between 8×8 DCT transform in JND
calculation and variable-size transform in video coding. The GJND adapted transformed coefficients
to different DCT kernel sizes. Then, GJND directed suppression was applied in calculating MSE
based distortion at different 𝑄𝑃s. It was reported that larger coding gains were mainly achieved
from small 𝑄𝑃 settings. Yang et al. [115] adopted Gradient Magnitude Ratio (GMR) for spatial
feature and GMSD Ratio (GMSDR) for temporal feature, which were divisively combined as a
perceptual weight for adjusting Lagrange multiplier in RDO. Rouis et al. [80] adjusted CTU-level
Lagrange multiplier in the RDO by jointly considering temporal correlation, visual stationary and
spectral saliency. Similarly, Cui et al. [20] built a JND model for UHD and HDR video based on
visual saliency, CSF, LM effect, and Gaussian Differential Entropy (GDE), which was incorporated
into a weighting factor to scale the Lagrange multiplier in RDO. Liu et al. [65] developed a binocular
combination-oriented distortion model by binocularly weighting the distortions from left and
right views based on stereoscopic visual perception of two eyes in HVS. Then, this binocular
distortion was applied to optimize symmetrical and asymmetrical stereoscopic video coding, in
which the Lagrange multiplier in encoding the right view was scaled. Zhang et al. [133] proposed a
full reference Synthesized Video Quality Metric (SVQM) to measure the perceptual quality of the
synthesized video. Then, the synthesized video distortion and depth distortion were combined as a
distortion term in 3D RDO for depth coding with the target of minimizing the perceptual distortion
of synthesized view at the given bit rate.
In addition to the RD performance, computational complexity is another important aspect of
the PRDO. As shown in Table 3, the reported complexity overhead (ΔT) of these
PRDOs ranges from 1.00% to 19.26% on average [20, 51, 65, 80, 100, 119]. For most of these
PRDO works, they built mapping relations between D̂_V and D_MSE with perceptual weights. The
computational complexity mainly comes from computing the perceptual weights of D̂_V or the Lagrange
multiplier adjustment for each block. Instead of being repeatedly calculated like 𝐷 𝑀𝑆𝐸 in the coding


loop, the perceptual weights were usually calculated only once or reused in CTU for the PRDO of
each coding block, which reduced the complexity overhead.
The general workflow of the PRDO is adding perceptual quality metrics or perceptual factors,
such as JND [2, 44], masking effect [20] and saliency [80], to the MSE based distortion term of the
RD cost function. Then, Lagrange multiplier is adjusted correspondingly based on the perceptual
distortion analysis. For some methods, such as [80, 99], QP is adjusted accordingly to allocate
bits more reasonably. By using these PRDO criteria, smaller partitions, refined modes or motion
parameters are allocated to perceptually significant regions, and coarse modes are used in
perceptually insensitive regions. However, there are two key problems to be addressed. Firstly,
since the HVS is complex and not fully understood, the currently available models only exploit
partial properties of the HVS. It is challenging to build a perceptual model that can simulate the HVS
accurately. Besides, the available quality assessment models were usually proposed and validated on
datasets with coarse-grain quality scores, e.g., five quality scales for all distortion levels. A fine-grain
distortion measurement specified for compression distortion is required in coding optimization.
Secondly, a video encoder is a complex system consisting of hundreds of coding algorithms, which
have been tuned to be optimal based on MSE. When the distortion metric in the objective function
is changed, it requires an unaffordable workload to modify the coding modules one-by-one to be
optimal again. For example, MAD based criteria are used in motion estimation, which has scarcely been
considered in the PRDO. In fact, one shortcut is to establish a relationship between perceptual
models and the conventional MSE/MAD, such as [65, 67, 109, 133]; then, the Lagrange multiplier is
adjusted in the practical implementation. However, perceptual models can hardly be represented by
MSE/MAD, and an inaccurate approximation may degrade the compression efficiency.

5.3 Perceptually Optimized Transform and Quantization


Let X be input signal, A and A𝑖𝑛𝑣 be kernels for forward and inverse transform. Forward transform
and quantization, and their inverse operations can be presented as


    Y = A × X,
    Z = Y ./ (Q × q_s),                                   (9)
    Ŷ = Z .× Q × q_s,
    X̂ = A_inv × Ŷ,

where ./ and .× denote element-wise division and multiplication, respectively. X̂ and Ŷ are the reconstructed
X and Y, Z is the matrix of transformed coefficients input to entropy coding, Q is a quantization
matrix, and q_s is a scalar quantization step. The reconstructed coefficients after inverse
quantization are no longer the same as the input ones, i.e., X̂ ≠ X (lossy coding). When using a
larger QP, i.e., a larger q_s, more coding bits can be saved (a higher compression ratio), but larger quality
degradation is caused in the meantime, as shown in Fig. 10. Perceptually optimized transform
and quantization are to minimize the perceptual loss at a given bit rate, which can be formulated as
    {A*, A_inv*, Q*} = arg min_{A, A_inv, Q} Σ_{i=1}^{n} D̂_V(X_i, X̂_i),   s.t.   Σ_{i=1}^{n} R(Z_i) ≤ R_T,        (10)

where X_i, X̂_i and Z_i are the X, X̂ and Z of coding unit i, respectively; D̂_V(·) is a visual quality
measurement and R(·) counts the coding bits. The framework of perceptually optimized transform and
quantization and the related works are shown in Fig. 10.
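The forward path of Eq. 9 with a frequency-dependent quantization matrix can be sketched as follows (Python; the separable 2-D orthonormal DCT stands in for the kernel A, and the quantization matrix is a hypothetical CSF-like weighting, not a standardized table):

import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II kernel A (rows are basis vectors).
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    a = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    a[0, :] = np.sqrt(1.0 / n)
    return a

def code_block(x, q_matrix, q_step):
    a = dct_matrix(x.shape[0])
    y = a @ x @ a.T                        # forward transform      Y = A X A^T
    z = np.round(y / (q_matrix * q_step))  # quantization           Z = Y ./ (Q * q_s)
    y_hat = z * q_matrix * q_step          # inverse quantization   Y_hat = Z .x Q * q_s
    x_hat = a.T @ y_hat @ a                # inverse transform      X_hat = A_inv Y_hat A_inv^T
    return z, x_hat

n = 8
q_matrix = 1.0 + 0.5 * np.add.outer(np.arange(n), np.arange(n))  # larger steps at high frequencies
rng = np.random.default_rng(4)
x = rng.integers(-64, 64, (n, n)).astype(float)                  # e.g., a prediction residual block
z, x_hat = code_block(x, q_matrix, q_step=4.0)
print("nonzero coefficients:", np.count_nonzero(z), " reconstruction MSE:", round(np.mean((x - x_hat) ** 2), 2))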
Transform coding is to transform the input residue from predictive coding to de-correlate or
compact energy of coefficients. It transforms the input residue from spatial to frequency domain,


Fig. 10. Framework of perceptually optimized transform and quantization.

which can be regarded as a dimension reduction. Usually, the transform and inverse transform pair is
lossless, e.g., the wavelet transform, DCT and KLT.
The DCT or its variants, such as the Integer Cosine Transform (ICT), DCT type II and DST, have been
widely used in video coding standards such as H.264/AVC, HEVC and VVC. The ICT is the
integer version of the DCT and is composed of an orthogonal integer transform (W = C_f X C_f^T) and a
scaling (Y = W .× E), where C_f is a forward transform kernel, E is a scaling matrix, and .× denotes
element-wise multiplication. The scaling operation (.× E) is incorporated
into the quantization Q, where multiplication is replaced with additions and shifts for low complexity.
Based on the directional pattern of image blocks, Zhao et al. [137] proposed an Enhanced Multiple
Transform (EMT) by selecting the optimal from a number of sinusoidal transform kernels. To
further improve the energy compaction, a Non-Separable Secondary Transform (NSST) [136] was
proposed to transform the coefficients output from the primary EMT, which has been adopted in
the latest VVC. Moreover, data-driven transform was further investigated to exploit data statistics.
Saab transform [56] learned multistage transform kernels by exploiting the directional properties of
intra predicted residual data, while deep learning-based transform [117] took advantage of the non-
linear representation ability of CNNs to improve the energy compaction. In summary, current
research has mainly focused on developing transform kernels or multistage transforms to improve
the capability of energy compaction or de-correlation. Since the transform and inverse transform are
lossless, perceptual distortion is scarcely a concern at this stage. In fact, since the transform is followed by lossy
quantization, a lossy transform would be acceptable.
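The separation of the integer kernel and the scaling can be checked numerically, e.g., with the 4×4 H.264/AVC core transform; the small Python sketch below verifies that folding the scaling matrix E into the quantizer yields the same quantized indices (the folding is shown in a simplified scalar form, not the standard's shift/add implementation):

import numpy as np

CF = np.array([[1, 1, 1, 1],
               [2, 1, -1, -2],
               [1, -1, -1, 1],
               [1, -2, 2, -1]], dtype=float)       # H.264/AVC 4x4 integer core transform kernel

# C_f C_f^T = diag(4, 10, 4, 10), so the scaling matrix E that restores orthonormality is the
# outer product of 1/sqrt([4, 10, 4, 10]) with itself.
s = 1.0 / np.sqrt(np.diag(CF @ CF.T))
E = np.outer(s, s)

rng = np.random.default_rng(5)
x = rng.integers(-32, 32, (4, 4)).astype(float)
w = CF @ x @ CF.T                                  # integer core transform W = C_f X C_f^T
y = w * E                                          # scaled (orthonormal) coefficients Y = W .x E
q_step = 2.5
z_folded = np.round(w / (q_step / E))              # scaling folded into the quantizer
print(np.allclose(z_folded, np.round(y / q_step))) # True: identical quantized indices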
In terms of perceptual quantization, in JPEG and MPEG-1/2, non-uniform quantization was
developed to give high frequency AC coefficients larger quantization intervals and DC
and low frequency AC coefficients smaller quantization intervals, because the HVS has a low-pass
masking effect and is more sensitive to low frequency distortions than to high frequency ones.
In H.264/AVC, HEVC and VVC, uniform quantization was adopted, and meanwhile perceptual
quantization matrices/tables [10, 89] were developed to give smaller quantization intervals to sensitive
coefficients. In [76], Papadopoulos et al. adjusted quantization step in assigning macroblock bits
so as to reflect visual importance of each macroblock. Similarly, Zhang et al. [122] established a
relationship between masking features in spatiotemporal domain and QPs, then selected a local QP
according to the visual characteristics of video contents. In PROVISION project [22], areas prone
to contouring were identified, where smaller QPs were assigned to prevent contouring artifacts.


Table 4. Visual Perception Guided Transform and Quantization.

Works  Visual Features  Codec  VQA (Q): BDBR(Q)[%] (LD, RA)  ΔT[%] (LD, RA)
Papa.'17 [76]  Textural masking and fovea to adjust QP  HEVC/HM16.2  MOS: / -9.00  ΔT: / /
Xiang'14 [110]  Adding a QP offset based on block-level JND  AVS baseline  MS-SSIM: / -24.50*  ΔT: / /
Yan'20 [114]  Adding a CTU level QP offset based on the spatial and temporal masking effects  AVS-P2/RD17.0  SSIM: -8.60 /  ΔT: / /
Kim'15 [47]  Transform domain JND considering masking effects  HEVC/HM11.0  MS-SSIM: -16.10 -11.11  ΔT: 11.25 22.78
Cui'19 [19]  Transform coefficients are suppressed with the JND  HEVC/HM16.9  MOS: -32.98 -28.61  ΔT: 12.94 22.45
Ki'17 [46]  Learning-based JNQD models for preprocessing  HEVC/HM16.17  DMOS: -10.38* -24.91*  ΔT: 34.65 /
Nami'22 [72]  CNN-based JND prediction and saliency for QP determination  HEVC/HM16.2  DMOS: -25.44* /  ΔT: / /
Zhang'17 [124]  Exploit spatial contrast sensitivity in quantization matrix  H.264/Wyner-Ziv  MOS: -11.96 / -9.92#  ΔT: / /
Luo'13 [66]  Adjust quantization matrices by reflecting JND values  H.264/JM14.2  MOS: / -28.32  ΔT: / /
Prang.'16 [78]  Quantization matrix considering CSF of HD/UHD display  SHVC/SHM9.0  PSNR: -2.50 -3.00  ΔT: -0.75
Grois'20 [31]  Develop variable-size quantization matrices by considering the CSF of 4K UHD HDR videos  X265  SSIMPlus: -11.30 /;  PSNR_YUV: -2.40 /  ΔT: / /
Shang'19 [84]  Quantization matrices based on CSF of RGB videos  HEVC/HM16.0  PSNR_GBR: -20.51 /  ΔT: /
Valin'15 [93]  Luminance contrast masking used in vector quantization  Daala Codec  SSIM: -13.70 /  ΔT: /
* BDBR(Q) subject to quality degradation.
# The benchmark is a distributed coding scheme.

Table 4 depicts representative schemes and performances of the perceptually optimized quantiza-
tion. Since not every distortion can be perceived by the HVS, the visibility threshold between perceivable
and non-perceivable distortion is regarded as the JND, which can be exploited in quantization. Based
on the JND suppression, 𝑄𝑃 was enlarged in encoding the prediction residue while the distortion
was within the JND range [110]. It improved compression ratio while maintaining the same visual
quality. Meanwhile, spatial and temporal masking effects [110, 114] were jointly considered in
JND modelling and 𝑄𝑃 adaptation. Kim et al. [47] proposed a transform domain JND model to
improve the transform, quantization and RDO modules. Firstly, different JND scaling factors were
developed to scale JND for variable-size transform and quantization. Then, transform coefficients
were suppressed with scaled JND by subtracting the scaled JND value from the residual value.
Finally, the distortion in RDO cost function for PU and TU mode decision was compensated by the
JND-suppression, which was done by subtraction operation between transform coefficients and JND
values. JND driven Rate-Distortion Optimized Quantization (RDOQ) accounting for the noticeable
distortion was also investigated for mode decision in [19, 22]. Furthermore, Cui et al. [19] exploited
spatial JND characteristics of the HDR video considering contrast sensitivity, luminance adaptation
and saliency, which then suppressed the transformed coefficients if the distortion was smaller than
the JND. Ki et al. [46] proposed learning-based Just-Noticeable-Quantization-Distortion (JNQD)
models, LR-JNQD and CNN-JNQD, for perceptual video coding, which were able to adjust JND
levels according to quantization steps for preprocessing the input to video encoders. JNQD models
caused about 34.65% complexity overhead, which can be reduced with code optimization and GPU
acceleration. Nami et al.[72] proposed a JND-based perceptual coding scheme, named BL-JUNIPER,
where block-Level CNN based JND prediction and visual importance from visual attention models
were used to adjust QP for each block. In these JND based perceptual coding schemes, more coding
gains (up to 50%) were achieved at high bit rate (low QP) since compression distortion is less
perceivable when using small 𝑄𝑃 and more visual redundancies can be exploited as compared with
that of low bit rate. In these works, the 𝑄𝑃 value was determined at block or frame level.
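The coefficient-level suppression used in these schemes can be illustrated with a few lines of Python (the JND map below is a hypothetical frequency-dependent threshold; real models, e.g., [19, 47], derive it from masking, luminance adaptation and the transform size):

import numpy as np

def jnd_suppress(coeffs, jnd):
    # Shrink transform coefficients toward zero by the (scaled) JND threshold so that
    # bits are not spent on imperceptible residual energy.
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - jnd, 0.0)

rng = np.random.default_rng(6)
coeffs = rng.normal(0.0, 8.0, (8, 8))            # hypothetical transformed residual
jnd = np.full((8, 8), 3.0)
jnd[4:, 4:] = 6.0                                # larger visibility thresholds at high frequencies
suppressed = jnd_suppress(coeffs, jnd)
print("nonzero coefficients before/after suppression:",
      np.count_nonzero(coeffs), np.count_nonzero(suppressed))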
To exploit the visual redundancies within a block, a number of perceptually optimized quantization
matrices were proposed. Based on the spatial contrast sensitivity and transform domain correlation
noise, Zhang et al. [124] proposed a perceptually optimized Adaptive Quantization Matrix (AQM),
which was learned online from key frames and applied to Wyner-Ziv frame coding. In [66], the
element values of quantization matrices for 4 × 4 and 8 × 8 transforms were enlarged by considering


the JND. Consequently, transformed coefficients were suppressed by the JND values after quanti-
zation. Prangnell et al. [78] improved CSF model by considering the enlarged resolutions of HD
and UHD displays and developed an AQM for SHVC, where weighting values for high frequency
coefficients were reduced. Grois et al. [31] developed luma and chroma CSF tuned Frequency
Weighting Matrices (FWMs) for 8×8 TU by extending Barten’s CSF model to HDR and UHD video.
Then, 4×4, 16×16 and 32×32 FWMs were derived with up and down-samplings from the 8×8 FWMs.
Shang et al. [84] proposed Weighting Quantization Matrices (WQMs) by considering the CSF of
DCT subbands for RGB videos, where G channel was given higher priority and PSNRs of G, B, and
R channels were combined with a ratio of 4:1:1. In addition to the quantization matrix for luma
coding[78, 124], quantization matrix for chroma coding was investigated in [84]. However, since
VQA for chroma has been scarcely addressed, it is difficult to measure the coding gains achieved by
the chroma quantization matrices. These schemes are perception based scalar quantization, where
each transform coefficient is separately quantized to an integer index causing blocking and ringing
artifacts. Perceptual vector quantization [93] was developed by considering the CSF, where a visual
codebook or perceptual vector was indexed in the quantization. However, it is challenging to build
widely applicable visual codebooks.
The transform exploits frequency or pattern similarity in video representation by mapping spatial
coefficients into a more compact representation. Since the transform and the inverse transform are lossless,
the perceptual redundancies are exploited when combining the transform with quantization. Most
existing works are model based schemes, which optimize the QPs and quantization matrix based on
visual models, such as the visual sensitivity and JND. Due to the strong non-linear representation
and learning abilities of learning algorithms, it will be a promising direction to investigate learning
based transform [56, 117] and quantization [46] by considering perceptual loss.

5.4 Perception based Filtering and Visual Enhancement


Due to the block based prediction and quantization, compression artifacts, including blurring,
blocking and ringing, are introduced in reconstructed frames, which significantly reduce the
perceptual quality. To solve this problem, filtering and visual enhancement are proposed, whose
optimization objective can be formulated as
    H* = arg min_{H} Σ_{i=1}^{n} D̂_V(X̂_i, X_i),   with   D̂_V(X̂_i, X_i) = F_HVS(F_D(H(X̂_i))) − F_HVS(F_D(X_i)),        (11)

where H(·) is a visual enhancement operator, X_i and X̂_i are the i-th reference and distorted visual
signals, and n is the number of visual blocks. The related works can be divided into two categories, i.e.,
in-loop filtering and out-loop post-/pre-processing.
The first category is block-level visual enhancement in the coding loop. Because of independent
quantization of DCT coefficients in each block, block based DCT usually gives rise to visually annoy-
ing blocking, ringing and blurring compression artifacts, especially at low bit rates. Deblocking filter
and Sample Adaptive Offset filter (SAO) were proposed to selectively smooth the discontinuities at
the block boundaries, which have been adopted in HEVC. To further reduce the blocking artifacts,
Adaptive Loop Filter (ALF) and cross-component ALF [10], which were improved spatial domain
Wiener filters, were adopted after the deblocking filter in VVC. These schemes considered the
discontinuities at the block boundaries as artifacts and selectively smoothed them. However, blur was
introduced at textural boundaries. To enhance the visual quality, Zhao et al. [135] proposed an
image deblocking algorithm by using Structural Sparse Representation (SSR) prior and Quantization
Constraint (QC) prior. Owning to the superior performance of deep learning in image enhancement,
Zhang et al. [132] proposed a Residual Highway Convolutional Neural Network (RHCNN) consisting


of residual highway units and convolutional layers for in-loop filtering in HEVC, where a skip
shortcut from the beginning to the end was introduced to reduce the complexity. Pan et al. [75]
proposed an Enhanced Deep CNNs (EDCNN) for in-loop filtering in HEVC, which included a
weighted normalization replacing batch normalization, a feature fusion sub-network and a joint
loss function combining MSE and MAD. It improved PSNR, PSNR smoothness and subjective visual
quality. The CNN based in-loop filtering further achieved about 7% BDBR gain and improved the
visual quality. However, the visual quality of these schemes was mainly measured with PSNR. They also
incur extremely high computational complexity at both the encoder and the decoder, which is increased by
about 247 [75] and 267 [132] times, respectively.
The second category is frame-level visual enhancement out of the coding loop. Jin et al. [43]
proposed a dual stream Multi-Path Recursive Residual Network (MPRRN) to reduce the compression
artifacts, where the MPRRN model was applied three times to enhance low frequency image content,
high frequency residual map and aggregated images, respectively. Li et al.[55] proposed a single
CNN model for handling a wide range of qualities, where quantization tables were used as a partial
of input data in training network. In addition, the network had two parallel branches, where the
restoration branch dealt with local ringing artifacts and the global branch dealt with global blocking
and color shifting. Yang et al. [118] proposed Quality Enhancement CNN (QE-CNN) networks for
coded I and B/P frames, respectively, which were extended from a four-layer Artifacts Reduction
CNN (AR-CNN) [23] by considering frame types and QPs. An average of 8.31% BDBR gain was
reported. Meanwhile, time-constrained optimization was proposed to have a good trade-off between
complexity and quality. However, these schemes still adopted L1 or L2 norm as the loss function, i.e.,
D̂_V(X̂_i, X_i) ≈ D(X̂_i, X_i) = ||H(X̂_i) − X_i||_k, k = {1, 2}, which aims to improve the PSNR between
the reconstructed and the reference videos.
To tackle this problem, Guo et al. [33] proposed a one-to-many neural network training scheme
by jointly considering a perceptual loss from high-level feature differences, a naturalness loss from
a discriminative network and a JPEG quantization loss. Jin et al. [42] proposed a post-processing
for intra frame coding based on a Deep Convolutional Generative Adversarial Network (DCGAN),
where a progressive refinement strategy was used to refine the residue prediction and a perceptual discriminative
network was used to differentiate between the refined image and the ground truth. In
addition, feature map differences in the discriminator were used as perceptual loss. However,
the perceptual factors in HVS were actually not well considered. These schemes were proposed
for out-loop coding, which can also be used as post-processing to enhance the visual quality of
reconstructed images. Guan et al. [32] proposed a video quality enhancement method for compressed
videos. This method included a high quality frame determination subnetwork and a multi-frame
enhancement CNN, which referred to multiple temporally neighboring frames with higher quality.
Although video quality was improved, it was still measured with PSNR that cannot truly reflect
human perception. Vidal et al. [95] proposed a perceptual filter (called BilAWA and TBil) based
on bilateral and Adaptive Weighting Average (AWA) filters, where JND model was introduced to
control filtering strength. Similarly, Bhat et al. [6] proposed a perceptual pre-processing scheme for
luma and chroma components based on a multi-scale CSF [4] and masking effects. Chadha et al.
[14] proposed a CNN based Deep Perceptual Preprocessing (DPP) scheme to enhance the visual
quality of the frames input to the video encoder, where the perceptual loss included a no-reference
IQA, a full-reference L1 and structural similarity. DPP plus encoder were able to further reduce bit
rate by up to 11% for AVC, AV1 and VVC encoders. These filters were used as pre-processing prior to
encoding, to smooth imperceptible visual details and thus reduce the bit rate.
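A minimal form of such a perceptual training objective is sketched below (Python/NumPy for clarity; in an actual training pipeline the terms would be implemented with differentiable tensor operations, and the weights are hypothetical):

import numpy as np

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    # Single-window SSIM over the whole image (illustrative, not a windowed implementation).
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2))

def perceptual_loss(restored, reference, w_fid=0.5, w_struct=0.5):
    # Fidelity (L1) term plus a structural (1-SSIM) term; feature-based or adversarial
    # terms, as used in [33, 42], could be added in the same additive fashion.
    fidelity = np.mean(np.abs(restored - reference))
    structural = 1.0 - ssim_global(restored, reference)
    return w_fid * fidelity + w_struct * structural

rng = np.random.default_rng(7)
ref = rng.integers(0, 256, (64, 64)).astype(float)
out = np.clip(ref + rng.normal(0.0, 6.0, ref.shape), 0, 255)   # a hypothetical network output
print(round(perceptual_loss(out, ref), 3))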
The video quality enhancement in Eq. 11 is an ill-posed problem that requires prior information
to reconstruct high quality images, such as spatial, temporal correlation and visual pattern. Deep


learning based schemes are more effective solutions to learn the priors as compared with the
handcrafted and empirical filters. In training a deep learning based visual quality enhancement
model, the source video can be regarded as the ground truth and there will be sufficient labels
for learning the priors. In addition, applying a visual quality metric or perceptual model that is
more consistent with the HVS to form a perceptual loss, and learning deep models for temporal features,
become two important issues for future video quality enhancement. Using no-reference quality
metrics will be an interesting exploration. However, in deep learning based visual enhancement, the
computational complexity increases significantly and may not be affordable when such models are applied
in the coding loop.

5.5 Discussions
There are two main ways to exploit the visual redundancies in video coding. One is based on the
computational perceptual models, such as the CSF, JND, ROI and so on, which are regarded as
common visual features to guide the perceptual coding, such as bit allocation and mode decision.
Then, the quality of the compressed videos is evaluated with a VQA metric. The other way is
applying a VQA model or its approximation to the distortion term of the RD cost function, which
is used as the criterion in RDO when selecting the optimal coding modes or parameters. When
applying the VQA metrics to optimize the coding process, there are several major difficulties to
be aware of. 1) The quality assessment models are generated from coarse-grain MOS, which ranges from 1 to 5 with
five scales. They are not accurate enough to distinguish the quality difference when the differences in coding bit
rate or matching modes are small, e.g., when the bit rate difference is less than 1% [127]. 2) Images and
videos are usually assessed at the image or video level. However, when the objective metric is applied
to video coding, the basic unit is processed at the block level [91], so adaptation is required.
3) When the quality term is used in the RDO of video coding, convexity may not be guaranteed for
some perceptual or VQA models, such as non-monotonic learning based models or piecewise JND
models. Thus, convexity proofs and further adaptations are required to achieve optimality.
Generally, there are three challenging issues in perceptually optimized video coding. Firstly, since
the HVS is complicated, it is challenging to build accurate perceptual models and a VQA that adapts
to various video applications. Although many VQAs are superior in some aspects, it is challenging
to improve the generality of the VQAs. Secondly, there are various coding modules in video coding,
which are developed based on PSNR/MSE. Adaptations to the video coding modules are challenging.
VQA approximations based on MSE may facilitate the adaptation, but reduce the accuracy. Thirdly,
video coding requires a large amount of optimal parameter determination processes using RD cost
comparison. The perceptual models and VQA models are much more complex than MSE/MAD,
especially when using deep learning based VQAs. Computational complexity will be an important
issue to be addressed in perceptual coding.

6 EXPERIMENTAL VALIDATIONS FOR CODING STANDARDS AND TOOLS


6.1 Performances for the Video Coding Standards
Coding experiments were performed to validate the compression efficiency of the latest coding
standards, including VVC [10], HEVC [89], AVS3 [123], AVS2, X265 and AV1, and their reference
models were VTM-13.0, HM-16.20, HPM-12.0, RD17.0, X265 and AV1 v3.1.2, respectively. Encoders
were configured with RA. Fourteen sequences from JVET were encoded, which were BasketballPass,
BlowingBubbles, BQSquare, RaceHorses, BasketballDrill, BQMall, PartyScene, RaceHorsesC,
FourPeople, Johnny, KristenAndSara, BasketballDrive, BQTerrace and Cactus. The visual quality of
the reconstructed videos was measured with PSNR, MS-SSIM, VQM [77], VMAF [58], MOVIE [82]
and C3DVQA [113].



Fig. 11. Perceptual coding performance evaluations for the latest coding standards. (a) PSNR (b) MS-SSIM (c)
VQM (d) VMAF (e) MOVIE (f) C3DVQA.

Table 5. Perceptual Coding Performance of Coding Tools in VVC [9, 10]

Abbr.  VVC Tool Descriptions  BDBR gain with different quality metrics [%]: PSNR  MS-SSIM  VMAF  VQM  MOVIE  C3DVQA
isp Intra sub-partition mode 0.57 0.60 0.59 0.64 0.69 1.47
mrl Multiple reference lines 0.17 0.27 0.07 0.50 -0.72 -0.42
mip Matrix-based intra-picture prediction 0.33 0.28 0.95 0.37 -0.23 0.94
lmchroma Cross component linear model 0.50 1.17 0.79 0.74 0.31 0.95
mts Multiple transform selection 0.93 0.81 1.26 1.12 0.50 2.17
sbt Subblock transform mode 0.28 0.19 0.66 0.14 0.18 0.11
lfnst Low frequency non-separable transform 1.17 1.56 1.69 1.60 0.74 1.59
mmvd Merge with MVD 0.29 0.23 0.55 0.38 -0.06 0.26
affine Affine motion prediction 1.64 1.56 1.66 1.12 0.32 1.54
sbtmvp Subblock-based temporal MV prediction 0.35 1.02 0.55 0.08 0.51 0.13
depquant Dependent quantization 1.71 1.91 1.69 2.17 1.44 1.69
alf+ccalf ALF and Cross-component ALF 2.23 -0.37 3.67 3.27 -1.17 0.93
bio Bi-directional optical flow 0.77 0.92 1.42 1.09 1.56 0.58
ciip Combined intra/inter-picture prediction 0.21 0.29 0.72 0.08 -0.11 -0.08
geo Geometric partitioning mode 0.85 0.95 1.25 0.75 1.28 0.66
dmvr Decoder MV refinement 0.48 0.66 0.41 0.35 0.09 0.28
smvd Symmetric MVD 0.13 0.12 0.08 -0.15 -0.39 0.13
jointcbcr Joint coding of chroma residuals 0.88 0.90 0.95 4.77 -0.04 1.37
prof Prediction refinement with optical flow 0.23 0.19 0.28 3.26 0.16 0.35
All new tools 19.54 16.92 23.18 15.37 16.88 19.86

Fig. 11 shows the average RD performances of the six tested codecs. We have the following four
key observations: 1) AVS-3 and VVC are the state-of-the-art standards. They achieve comparable


Table 6. Perceptual Coding Performance of Coding Tools in AVS3 [123]

Abbr.  AVS3 Tool Descriptions  BDBR gain with different quality metrics [%]: PSNR  MS-SSIM  VMAF  VQM  MOVIE  C3DVQA
ipf Intra prediction filter 0.64 0.88 0.92 1.01 0.96 0.43
iip Improved intra prediction 0.08 -0.01 0.09 0.34 0.03 -0.51
sawp Spatial angle weighted prediction 0.39 0.42 0.53 0.40 0.60 -0.03
ipc Inter prediction correction 0.71 0.72 0.57 1.12 1.16 1.01
affine Affine motion prediction 2.64 2.64 1.99 2.31 0.96 2.16
smvd Symmetric MV Difference 0.03 0.05 0.12 0.05 0.06 0.25
dmvr Decoder side MV Refinement 0.92 1.26 1.08 1.29 1.04 0.69
bio Bi-directional optical flow -0.19 -0.10 0.08 0.01 -0.13 -0.26
interpf Inter prediction filtering 0.38 0.29 0.57 0.04 0.71 0.22
pmc Prediction from multiple cross-components 0.01 0.05 0.04 0.25 -0.15 -0.52
esao Enhanced sample adaptive offset 0.75 0.32 1.25 0.17 0.20 0.15
dbr Deblocking refinement 0.06 0.06 0.17 0.38 -0.26 -0.23
alf Adaptive loop filter 4.34 0.99 4.63 0.97 3.98 5.12
tscpm Two-step cross component prediction mode 0.64 0.88 0.92 1.01 -0.12 -0.46
amvr Adaptive motion vector resolution 0.83 0.76 0.84 0.97 0.48 0.93
emvr Extended AMVR 0.12 0.11 0.19 -0.05 -0.16 -0.12
umve Ultra MV expression 1.00 0.80 1.13 0.65 0.18 0.95
All new tools 14.82 11.25 17.01 6.21 11.71 14.81

coding efficiency and are the top two in terms of the six metrics. 2) AV1 has coding efficiency comparable
to AVS-3 and VVC at high rates for all the metrics, and is a little inferior to them when
evaluated with C3DVQA. 3) AVS-2 is inferior to HEVC in this coding experiment, since the
video resolutions are small and the test sequences are from JVET; X265 is a fast version of HEVC at the
cost of a large efficiency loss. 4) Overall, the coding efficiency rankings of the tested standards are
similar when measured with PSNR, MS-SSIM, VQM, VMAF, MOVIE and C3DVQA.

6.2 Performances for Coding Tools in VVC and AVS-3


Coding experiments were performed to validate the effectiveness of the coding tools in the latest
VVC and AVS-3, where different coding tools were disabled one-by-one. The reference software
of VVC and AVS-3 are VTM-13.0 and HPM-12.0, respectively, which were configured with the default RA.
The coding QP values were set as {22, 27, 32, 37} for VTM-13.0 and {27, 32, 38, 45}
for HPM-12.0. According to the common test conditions, the sequences for VVC were BasketballPass,
BlowingBubbles, BQSquare, RaceHorses, BasketballDrill, BQMall, PartyScene, RaceHorsesC,
FourPeople, Johnny, KristenAndSara, BasketballDrive, BQTerrace, and Cactus. The test sequences
for AVS3 were Crew, City, Vidyo1, Vidyo3, BasketballDrive, and Cactus. Besides the PSNR, another
five image/video metrics were used to measure the visual quality of reconstructed videos, which
included MS-SSIM, VMAF, VQM, MOVIE and C3DVQA. The compression efficiency was measured
by BDBR [7] under different quality metrics, and the reference software (VTM-13.0 and HPM-12.0)
were the anchors for comparison.
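For reference, the BDBR values reported below follow the usual Bjøntegaard procedure [7]; a compact re-implementation (not the reference script) is sketched here in Python with hypothetical rate/quality points, where the quality axis can be PSNR, MS-SSIM, VMAF or any of the other metrics:

import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test):
    # Fit log-rate vs. quality with cubic polynomials, integrate over the overlapping
    # quality range, and return the average bit-rate difference in percent (negative = saving).
    p_a = np.polyfit(q_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(q_test, np.log(rate_test), 3)
    lo = max(min(q_anchor), min(q_test))
    hi = min(max(q_anchor), max(q_test))
    avg_a = (np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)) / (hi - lo)
    avg_t = (np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0

# Hypothetical rate (kbps) / quality points at four QPs for an anchor and a test configuration.
r_anchor, q_anchor = [6000, 3200, 1800, 1000], [40.4, 38.2, 36.0, 34.1]
r_test,   q_test   = [5700, 3000, 1650, 900],  [40.5, 38.3, 36.1, 34.2]
print(f"BDBR = {bd_rate(r_anchor, q_anchor, r_test, q_test):.2f} %")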
Tables 5 and 6 show the average perceptual coding gains of coding tools in VVC and AVS-3 for
the test sequences, where a positive value indicates a BDBR gain. We have the following
three key observations: 1) the VVC coding tools are able to obtain BDBR gains from 0.13% to 2.23%,
and AVS-3 tools are able to achieve gains from -0.19% to 4.34%. The negative gain is achieved for
the bio in AVS-3 mainly because the test sequences are a subset of the sequences in the Common



Fig. 12. Perceptual coding performance comparison among the coding tools in VVC and AVS-3. (a) VVC [10]
(b) AVS-3 [123].

Test Conditions (CTC), which means the bio is more content-aware. 2) alf+ccalf and alf achieve
2.23% and 4.34% BDBR gain, which are the best among the tested tools in the VVC and AVS-3. 3) In
terms of different VQA metrics, BDBR gains vary significantly for some coding tools. For example,
the alf+ccalf in VVC is able to achieve 2.23%, 3.67% and 3.27% in terms of PSNR, VMAF and VQM.
However, a negative BDBR gain is obtained when the reconstructed videos are measured with MS-SSIM.
A similar situation can be found for the AVS-3 tools, such as alf and esao. Fig. 12 shows radar charts of
the BDBR gains of the coding tools in VVC and AVS-3. We can observe that mrl and alf+ccalf in
VVC, and alf and affine in AVS3, cover larger areas than the others. Since the current coding tools are
developed with the target of one specific metric, e.g., MSE/MAD, the results of different visual quality
metrics may contradict each other. Overall, since these coding tools in VVC and AVS-3 are mainly
developed for PSNR, there is still large room to remove visual redundancies by considering
human perceptual properties.

7 CONCLUSIONS AND FUTURE DIRECTIONS


Perceptually optimized video coding aims to further improve coding efficiency by exploiting visual
redundancies. In this paper, we have presented a systematic survey on the recent advances and challenges
associated with the perceptually optimized video coding, including visual perception modelling,
visual quality assessment and visual perception guided video coding optimization. In each part,
problem formulation, workflows, recent advances, advantages and challenging issues are presented.
As more psychological and physiological visual factors from HVS are revealed and understood, more
accurate perceptual models are built. Meanwhile, visual signal representation, coding distortion and
display shall be jointly considered to build an accurate perceptual model. Since the conventional
PSNR cannot truly reflect the visual quality of HVS, developing VQA model to evaluate the visual
quality of video becomes one of the most crucial parts. However, the perceptual factors of HVS are
not fully revealed and the visual response correlates with multiple factors, including input P, display
𝐹𝐷 , representation 𝐹𝑅 , and video processing 𝐹𝐸 . Learning based VQA is a possible solution that
transfers the VQA prediction to data fitting problem. However, to learn a stable VQA model, building
a large scale video quality dataset is required, which is very laborious and time-consuming. Finally,
due to different mechanisms of the video coding models, problems and solutions are identified
while applying the perceptual models or VQA models to exploit the visual redundancies. Further
investigations and adaptations are required to maximize their performance. In summary, perceptual


video coding does have many advantages and potentials in improving the compression efficiency,
which will be promising directions for visual signal communication.
Based on this review, a number of promising research directions arise for future work:

1) Deep Learning Optimized Perceptual Coding. Deep learning is a powerful tool in solving the
prediction, classification, and regression problems in promoting video coding [62] since it is
able to discover the hidden statistics and knowledge in massive data. More investigations
on deep learning based perceptual modelling, visual quality assessment, coding module
optimizations and end-to-end video compression are highly desired. However, there are two
critical issues to be solved. One is data labelling for deep model training: collecting a
sufficient number of visual labels is critical, especially for visual modelling
and visual quality assessment. The other is the high complexity of using deep models,
especially when they are applied to in-loop coding modules and temporally successive frames.
2) QoE Modelling. In recent works, the perceptual quality of video usually refers to the clarity.
However, in addition to the clarity, some other factors, such as depth quality, visual comfort,
interactivity, immersion, naturalness and low delay, are highly correlated with the perceptual
quality. Therefore, QoE related visual factors and their perceptual models shall be investigated,
which will be helpful for perceptually optimized visual signal processing. On the other hand,
videos are developing in the trends of representing more realistic visual contents, which
increases the spatial, temporal resolutions, luma and chroma dynamics. QoE models for
higher resolution, fidelity and bit depth shall be investigated. Learning based approaches
could be an important direction to solve the feature representation and fusion problems in
the QoE modelling. Moreover, QoE factors have mutual effects and may contradict each
other; how to apply the QoE model to video coding and tackle multi-objective problems
will be challenging.
3) Video Coding for Machine (VCM) [30]. Conventional digital video was mainly developed
for human vision. However, with the increasing demand of intelligent video applications,
such as smart video surveillance, visual search, autonomous vehicles and robotics, videos are
used not only for human vision but also for machine vision and intelligent visual analysis.
Therefore, visual features and quality of machine vision or related learning algorithm shall
be exploited, where 𝐹𝐷 () and 𝐹𝐻𝑉 𝑆 () relate to machine vision. Conventional video coding
is based on the hybrid coding framework exploiting the statistical and visual redundancies.
With the development of deep learning and different coding objectives in VCM, deep feature
analysis, quality evaluation methods, coding framework, and feature compression algorithms
for VCM are of high interests in academic and industrial communities.
4) Perceptual Coding for Hyper-realistic and High Dimensional Videos. Hyper-realistic videos
extend the existing visual media in resolution, frame rate, color gamut, dynamic
range and so on, which provide users with a more realistic visual experience. The perceptual
factors and response of HVS 𝐹𝐻𝑉 𝑆 may vary with the visual representation 𝐹𝑅 , display 𝐹𝐷 , and
processing distortion 𝐹𝐸 . The current perceptual models, such as masking, JND, sensitivity
and IQA models, and coding algorithms, including RDO, bit allocation, transform, and visual
enhancements, deserve further adaptations for higher bit depth. High dimensional videos,
such as multiview video, 3D video, point cloud and light field, extend the video from 2D to 3D
space by representing 3D scene with more viewpoints and depth information. There are more
visual redundancies and inter-correlations to be explored. The visual properties of the high
dimensional videos are more complicated and significantly different from those of the 2D
video, as they are able to provide users with depth, view and angle interactions, immersion


and so on. Further investigations on 3D visual perception, 3D VQA and 3D perceptual coding
for high dimensional videos will be interesting.
5) Perceptually Optimized Deep Visual Model. Deep learning demonstrates powerful capa-
bilities in visual feature representation and data fitting. However, it requires big visual data
with labels and causes extremely high computational complexity as the feature dimension
increases and the model goes deeper. With the development of visual psychology and neuro-
science, the psycho-visual findings in HVS would be possible to improve the design of deep
learning based visual models, including computational perceptual model, visual quality as-
sessment, perceptually optimized deep architecture for visual representation, reconstruction
and recognition.

REFERENCES
[1] Shahrukh Athar and Zhou Wang. 2019. A comprehensive performance evaluation of image quality assessment
algorithms. IEEE Access 7 (2019), 140030–140070.
[2] Sung-Ho Bae, Jaeil Kim, and Munchurl Kim. 2016. HEVC-based perceptually adaptive video coding using a DCT-based
local distortion detection probability model. IEEE Transactions on Image Processing 25, 7 (2016), 3343–3357.
[3] Christos G. Bampis, Zhi Li, and Alan C. Bovik. 2019. Spatiotemporal feature integration and model fusion for full
reference video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology 29, 8 (2019),
2256–2270.
[4] Peter G. J. Barten. 2003. Formula for the contrast sensitivity of the human eye. In Image Quality and System Performance,
Vol. 5294. SPIE, 231 – 238.
[5] Marcelo Bertalmio. 2020. Chapter 5 - Brightness perception and encoding curves. In Vision Models for High Dynamic
Range and Wide Colour Gamut Imaging, Marcelo Bertalmio (Ed.). Academic Press, 95–129.
[6] Madhukar Bhat, Jean-Marc Thiesse, and Patrick Le Callet. 2019. HVS based perceptual pre-processing for video
coding. In 2019 27th European Signal Processing Conference (EUSIPCO). 1–5.
[7] Gisle Bjøntegaard. 2001. Calculation of average PSNR differences between RD-curves. In ITU-T Video Coding Experts
Group, VCEG-M33.
[8] Sebastian Bosse, Sören Becker, Klaus-Robert Müller, Wojciech Samek, and Thomas Wiegand. 2019. Estimation of
distortion sensitivity for visual quality prediction using a convolutional neural network. Digital Signal Processing 91
(2019), 54–65.
[9] Benjamin Bross, Jianle Chen, Jens-Rainer Ohm, Gary J. Sullivan, and Ye-Kui Wang. 2021. Developments in international
video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 109, 9 (2021),
1463–1493.
[10] Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer Ohm. 2021. Overview
of the versatile video coding (VVC) standard and its applications. IEEE Transactions on Circuits and Systems for Video
Technology 31, 10 (2021), 3736–3764.
[11] BT.2100-2. 2018. Image parameter values for high dynamic range television for use in production and international
programme exchange. ITU-R Recommendations (2018).
[12] BT.500-14. 2015. Methodologies for the subjective assessment of the quality of television images. ITU-R Recommenda-
tions (2015).
[13] BT.709-6. 2015. Parameter values for the HDTV standards for production and international programme exchange.
ITU-R Recommendations (2015).
[14] Aaron Chadha and Yiannis Andreopoulos. 2021. Deep perceptual preprocessing for video coding. In 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR). 14847–14856.
[15] Wen-Wei Chao, Yen-Yu Chen, and Shao-Yi Chien. 2016. Perceptual HEVC/H.265 system with local just-noticeable-
difference model. In 2016 IEEE International Symposium on Circuits and Systems (ISCAS). 2679–2682.
[16] Zhenzhong Chen, Weisi Lin, and King Ngi Ngan. 2010. Perceptual video coding: challenges and approaches. In 2010
IEEE International Conference on Multimedia and Expo. 784–789.
[17] Zhenzhong Chen and Wei Wu. 2020. Asymmetric foveated just-noticeable-difference model for images with visual
field inhomogeneities. IEEE Transactions on Circuits and Systems for Video Technology 30, 11 (2020), 4064–4074.
[18] Runmin Cong, Jianjun Lei, Huazhu Fu, Ming-Ming Cheng, Weisi Lin, and Qingming Huang. 2019. Review of visual
saliency detection with comprehensive information. IEEE Transactions on Circuits and Systems for Video Technology
29, 10 (2019), 2941–2959.
[19] Xin Cui, Zongju Peng, Gangyi Jiang, Fen Chen, and Mei Yu. 2019. Perceptual video coding scheme using just noticeable
distortion model based on entropy filter. Entropy 21 (11 2019), 1095.
[20] Xin Cui, Zongju Peng, Gangyi Jiang, Fen Chen, Mei Yu, and Dongrong Jiang. 2021. Perceptual coding scheme for
ultra-high definition video based on perceptual noise channel model. Digital Signal Processing 108 (01 2021), 102903.
[21] Qionghai Dai, Jiamin Wu, Jingtao Fan, Feng Xu, and Xun Cao. 2019. Recent advances in computational photography.
IEEE Journal of Selected Topics in Signal Processing 28, 1 (2019), 1–5.
[22] Andre Seixas Dias, Sebastian Schwarz, Mischa Siekmann, Sebastian Bosse, Heiko Schwarz, Detlev Marpe, John
Zubrzycki, and Marta Mrak. 2015. Perceptually optimised video compression. In 2015 IEEE International Conference
on Multimedia Expo Workshops (ICMEW). 1–4.
[23] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. 2015. Compression Artifacts Reduction by a Deep
Convolutional Network. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 576–584.
[24] Chunling Fan, Yun Zhang, Liangbing Feng, and Qingshan Jiang. 2018. No reference image quality assessment based
on multi-expert convolutional neural networks. IEEE Access 6 (2018), 8934–8943.
[25] Chunling Fan, Yun Zhang, Raouf Hamzaoui, Djemel Ziou, and Qingshan Jiang. 2021. Learning-based satisfied user
ratio prediction for symmetrically and asymmetrically compressed stereoscopic images. IEEE MultiMedia 28, 3 (2021),
8–20.
[26] Yuming Fang, Chi Zhang, Jing Li, Jianjun Lei, Matthieu Perreira Da Silva, and Patrick Le Callet. 2017. Visual attention
modeling for stereoscopic video: a benchmark and computational model. IEEE Transactions on Image Processing 26,
10 (2017), 4684–4696.
[27] Edouard Francois, C. Andrew Segall, Alexis M. Tourapis, P. Yin, and D. Rusanovskyy. 2020. High dynamic range
video coding technology in responses to the joint call for proposals on video compression with capability beyond
HEVC. IEEE Transactions on Circuits and Systems for Video Technology 30, 5 (2020), 1253–1266.
[28] Wei Gao, Qiuping Jiang, Ronggang Wang, Siwei Ma, Ge Li, and Sam Kwong. 2022. Consistent quality oriented rate
control in HEVC via balancing Intra and Inter frame coding. IEEE Transactions on Industrial Informatics 18, 3 (2022),
1594–1604.
[29] Wei Gao, Sam Kwong, Yu Zhou, and Hui Yuan. 2016. SSIM-based game theory approach for rate-distortion optimized
intra frame CTU-level bit allocation. IEEE Transactions on Multimedia 18, 6 (2016), 988–999.
[30] Wen Gao, Siwei Ma, Lingyu Duan, Yonghong Tian, Peiyin Xing, Yaowei Wang, Shanshe Wang, Huizhu Jia, and Tiejun
Huang. 2021. Digital retina: A way to make the city brain more efficient by visual coding. IEEE Transactions on
Circuits and Systems for Video Technology 31, 11 (2021), 4147–4161.
[31] Dan Grois and Alex Giladi. 2020. Perceptual quantization matrices for high dynamic range H.265/MPEG-HEVC video
coding. In Applications of Digital Image Processing XLII, Vol. 11137. SPIE, 164 – 177.
[32] Zhenyu Guan, Qunliang Xing, Mai Xu, Ren Yang, Tie Liu, and Zulin Wang. 2021. MFQE 2.0: A new approach for
multi-frame quality enhancement on compressed video. IEEE Transactions on Pattern Analysis and Machine Intelligence
43, 3 (2021), 949–963.
[33] Jun Guo and Hongyang Chao. 2017. One-to-many network for visually pleasing compression artifacts reduction. In
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4867–4876.
[34] Prateek Gupta, Priyanka Srivastava, Satyam Bhardwaj, and Vikrant Bhateja. 2011. A modified PSNR metric based
on HVS for quality assessment of color images. In 2011 International Conference on Communication and Industrial
Application. 1–4.
[35] Mahdi S. Hosseini, Yueyang Zhang, and Konstantinos N. Plataniotis. 2019. Encoding visual sensitivity by maxpol
convolution filters for image sharpness assessment. IEEE Transactions on Image Processing 28, 9 (2019), 4510–4525.
[36] Sudeng Hu, Lina Jin, Hanli Wang, Yun Zhang, Sam Kwong, and C.-C. Jay Kuo. 2015. Compressed image quality metric
based on perceptually weighted distortion. IEEE Transactions on Image Processing 24, 12 (2015), 5594–5608.
[37] Sudeng Hu, Lina Jin, Hanli Wang, Yun Zhang, Sam Kwong, and C.-C. Jay Kuo. 2017. Objective video quality assessment
based on perceptually weighted mean squared error. IEEE Transactions on Circuits and Systems for Video Technology
27, 9 (2017), 1844–1855.
[38] L. Itti, C. Koch, and E. Niebur. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20, 11 (1998), 1254–1259.
[39] Sami Jaballah, Mohamed-Chaker Larabi, and Jamel Belhadj Tahar. 2018. Asymmetric DCT-JND for luminance
adaptation effects: an application to perceptual video coding in MV-HEVC. In 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). 1797–1801.
[40] Qiuping Jiang, Zhentao Liu, Shiqi Wang, Feng Shao, and Weisi Lin. 2022. Toward Top-Down Just Noticeable Difference
Estimation of Natural Images. IEEE Transactions on Image Processing 31 (2022), 3697–3712.
[41] Lina Jin, Joe Yu-chieh Lin, Sudeng Hu, Haiqiang Wang, Ping Wang, Ioannis Katsavounidis, Anne Aaron, and C.-C. Jay
Kuo. 2016. Statistical study on perceived JPEG image quality via MCL-JCI dataset construction and analysis. Electronic
Imaging 2016 (02 2016), 1–9.
[42] Zhipeng Jin, Ping An, Chao Yang, and Liquan Shen. 2020. Post-processing for intra coding through perceptual
adversarial learning and progressive refinement. Neurocomputing 394 (2020), 158–167.
[43] Zhi Jin, Muhammad Zafar Iqbal, Wenbin Zou, Xia Li, and Eckehard Steinbach. 2021. Dual-stream multi-path recursive
residual network for JPEG image compression artifacts reduction. IEEE Transactions on Circuits and Systems for Video
Technology 31, 2 (2021), 467–479.
[44] Cheolkon Jung and Yao Chen. 2015. Perceptual rate distortion optimisation for video coding using free-energy
principle. Electronics Letters 51, 21 (10 2015), 1656–1658.
[45] D. H. Kelly. 1961. Visual responses to time-dependent stimuli. I. Amplitude sensitivity measurements. Journal of the
Optical Society of America 51, 4 (Apr 1961), 422–429.
[46] Sehwan Ki, Sung-Ho Bae, Munchurl Kim, and Hyunsuk Ko. 2018. Learning-based just-noticeable-quantization-
distortion modeling for perceptual video coding. IEEE Transactions on Image Processing 27, 7 (2018), 3178–3193.
[47] Jaeil Kim, Sung-Ho Bae, and Munchurl Kim. 2015. An HEVC-compliant perceptual video coding scheme based on
JND models for variable block-sized transform kernels. IEEE Transactions on Circuits and Systems for Video Technology
25, 11 (2015), 1786–1800.
[48] Jongyoo Kim and Sanghoon Lee. 2017. Deep learning of human visual sensitivity in image quality assessment
framework. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1969–1977.
[49] Minjung Kim, Maliha Ashraf, María Pérez-Ortiz, Jasna Martinovic, Sophie Wuerger, and Rafal Mantiuk. 2020. Contrast
sensitivity functions for HDR displays. London Imaging Meeting 2020 (09 2020), 44–48.
[50] Christian J. Van Den Branden Lambrecht and Murat Kunt. 1998. Characterization of human visual sensitivity for
video imaging applications. Signal Processing 67 (1998), 255–269.
[51] Bumshik Lee and Jae Young Choi. 2018. A rate perceptual-distortion optimized video coding HEVC. IEICE Transactions
on Information and Systems 101, 12 (Dec 2018), 3158–3169.
[52] Jong-Seok Lee and Touradj Ebrahimi. 2012. Perceptual video compression: a survey. IEEE Journal of Selected Topics in
Signal Processing 6, 6 (2012), 684–697.
[53] Royson Lee, Stylianos I. Venieris, and Nicholas D. Lane. 2021. Deep neural network-based enhancement for image
and video streaming systems: a survey and future directions. ACM Computing Surveys 54, 8, Article 169 (Oct 2021),
30 pages.
[54] Hao Li, Weimin Lei, and Wei Zhang. 2022. Perceptual video coding based on adaptive region-level intra-period. In
7th International Conference on Computer and Communication Systems (ICCCS). 387–392.
[55] Jianwei Li, Yongtao Wang, Haihua Xie, and Kai-Kuang Ma. 2020. Learning a single model with a wide range of quality
factors for JPEG image artifacts removal. IEEE Transactions on Image Processing 29 (2020), 8842–8854.
[56] Na Li, Yun Zhang, and C.-C. Jay Kuo. 2022. High efficiency intra video coding based on data-driven transform. IEEE
Transactions on Broadcasting 68, 2 (2022), 383–396.
[57] Yang Li and Xuanqin Mou. 2021. Joint optimization for SSIM-based CTU-level bit allocation and rate distortion
optimization. IEEE Transactions on Broadcasting 67, 2 (2021), 500–511.
[58] Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy, and Megha Manohara. 2016. Toward a practical perceptual
video quality metric. In Netflix TechBlog.
[59] Woong Lim and Donggyu Sim. 2020. A perceptual rate control algorithm based on luminance adaptation for HEVC
encoders. Signal, Image and Video Processing 14 (2020), 887–895.
[60] Weisi Lin and Gheorghita Ghinea. 2022. Progress and opportunities in modelling just-noticeable difference (JND) for
multimedia. IEEE Transactions on Multimedia 24 (2022), 3706–3721.
[61] Weisi Lin and C.-C. Jay Kuo. 2011. Perceptual visual quality metrics: A survey. Journal of Visual Communication and
Image Representation 22, 4 (2011), 297–312.
[62] Dong Liu, Yue Li, Jianping Lin, Houqiang Li, and Feng Wu. 2020. Deep learning-based video coding: a review and a
case study. ACM Computing Surveys 53, 1, Article 11 (Feb 2020), 35 pages.
[63] Huanhua Liu, Yun Zhang, Huan Zhang, Chunling Fan, Sam Kwong, C.-C. Jay Kuo, and Xiaoping Fan. 2020. Deep
learning-based picture-wise just noticeable distortion prediction model for image compression. IEEE Transactions on
Image Processing 29 (2020), 641–656.
[64] Xiaoyan Liu, Yun Zhang, Linwei Zhu, and Huanhua Liu. 2019. Perception-based CTU level bit allocation for Intra
high efficiency video coding. IEEE Access 7 (2019), 154959–154970.
[65] Yanwei Liu, Jinxia Liu, Antonios Argyriou, and Song Ci. 2018. Binocular-combination-oriented perceptual rate-
distortion optimization for stereoscopic video coding. IEEE Transactions on Circuits and Systems for Video Technology
28, 8 (2018), 1949–1959.
[66] Zhengyi Luo, Li Song, Shibao Zheng, and Nam Ling. 2013. H.264/Advanced video control perceptual optimization
coding based on JND-directed coefficient suppression. IEEE Transactions on Circuits and Systems for Video Technology
23, 6 (2013), 935–948.
[67] Zhengyi Luo, Chen Zhu, Yan Huang, Rong Xie, Li Song, and C.-C. Jay Kuo. 2021. VMAF oriented perceptual coding
based on piecewise metric coupling. IEEE Transactions on Image Processing 30 (2021), 5109–5121.
[68] Azadeh Mansouri and Ahmad Mahmoudi-Aznaveh. 2019. SSVD: Structural SVD-based image quality assessment.
Signal Processing: Image Communication 74 (2019), 54–63.
[69] Rafał Mantiuk, Kil Joong Kim, Allan G. Rempel, and Wolfgang Heidrich. 2011. HDR-VDP-2: A calibrated visual metric
for visibility and quality predictions in all luminance conditions. ACM Transactions on Graphics 30, 4, Article 40
(2011), 14 pages.
[70] Xiongkuo Min, Ke Gu, Guangtao Zhai, Xiaokang Yang, Wenjun Zhang, Patrick Le Callet, and Chang Wen Chen. 2021.
Screen content quality assessment: overview, benchmark, and beyond. ACM Computing Surveys 54, 9, Article 187
(Oct 2021), 36 pages.
[71] Kathy Mullen. 1985. The contrast sensitivity of human color vision to red-green and blue-yellow chromatic gratings.
The Journal of Physiology 359 (03 1985), 381–400.
[72] Sanaz Nami, Farhad Pakdaman, Mahmoud Reza Hashemi, and Shervin Shirmohammadi. 2022. BL-JUNIPER: A
CNN-Assisted Framework for Perceptual Video Coding Leveraging Block-Level JND. IEEE Transactions on Multimedia
(2022), 1–16.
[73] Manish Narwaria, Matthieu Perreira Da Silva, and Patrick Le Callet. 2015. HDR-VQM: An objective quality measure
for high dynamic range video. Signal Processing: Image Communication 35 (05 2015).
[74] P.910. 2022. Subjective video quality assessment methods for multimedia applications. ITU-T Recommendations
(2022).
[75] Zhaoqing Pan, Xiaokai Yi, Yun Zhang, Byeungwoo Jeon, and Sam Kwong. 2020. Efficient in-loop filtering based on
enhanced deep convolutional neural networks for HEVC. IEEE Transactions on Image Processing 29 (2020), 5352–5366.
[76] M. A. Papadopoulos, Y. Rai, A. V. Katsenou, D. Agrafiotis, P. Le Callet, and D. R. Bull. 2017. Video quality enhancement
via QP adaptation based on perceptual coding maps. In 2017 IEEE International Conference on Image Processing (ICIP).
2741–2745.
[77] M.H. Pinson and S. Wolf. 2004. A new standardized method for objectively measuring video quality. IEEE Transactions
on Broadcasting 50, 3 (2004), 312–322.
[78] Lee Prangnell and Victor Sanchez. 2016. Adaptive quantization matrices for HD and UHD resolutions in scalable
HEVC. In 2016 Data Compression Conference (DCC). 626–626.
[79] J. G. Robson. 1966. Spatial and temporal contrast-sensitivity functions of the visual system. Journal of the Optical
Society of America 56, 8 (Aug 1966), 1141–1142.
[80] Kais Rouis, Mohamed-Chaker Larabi, and Jamel Belhadj Tahar. 2018. Perceptually adaptive Lagrangian multiplier for
HEVC guided rate-distortion optimization. IEEE Access 6 (2018), 33589–33603.
[81] Jyrki Rovamo, Veijo Virsu, and Risto Näsänen. 1978. Cortical magnification factor predicts the photopic contrast
sensitivity of peripheral vision. Nature 271 (1978), 54–56.
[82] Kalpana Seshadrinathan and Alan Conrad Bovik. 2010. Motion tuned spatio-temporal quality assessment of natural
videos. IEEE Transactions on Image Processing 19, 2 (2010), 335–350.
[83] Xiwu Shang, Jie Liang, Guozhong Wang, Haiwu Zhao, Chengjia Wu, and Chang Lin. 2019. Color-sensitivity-based
combined PSNR for objective video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology
29, 5 (2019), 1239–1250.
[84] Xiwu Shang, Guozhong Wang, Xiaoli Zhao, Yifan Zuo, Jie Liang, and Ivan V. Bajic. 2019. Weighting quantization
matrices for HEVC/H.265-coded RGB videos. IEEE Access 7 (2019), 36019–36032.
[85] Xuelin Shen, Zhangkai Ni, Wenhan Yang, Xinfeng Zhang, Shiqi Wang, and Sam Kwong. 2021. Just noticeable distortion
profile inference: a patch-level structural visibility learning approach. IEEE Transactions on Image Processing 30 (2021),
26–38.
[86] Andrew Stockman and Lindsay T. Sharpe. 1998. Human cone spectral sensitivities: a progress report. Vision Research
38, 21 (1998), 3193–3206.
[87] Stuart Anstis. 1998. Picturing peripheral acuity. Perception 27 (1998), 817–825.
[88] Gary Sullivan and Koohyar Minoo. 2012. JCT-VC AHG report: objective quality metric and alternative methods for
measuring coding efficiency (AHG12). In doc. JCT-VC-H0012, ITU-T/ISO/IEC JCT-VC.
[89] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. Overview of the high efficiency video
coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22, 12 (2012), 1649–1668.
[90] Gerhard Tech, Ying Chen, Karsten Müller, Jens-Rainer Ohm, Anthony Vetro, and Ye-Kui Wang. 2016. Overview of
the multiview and 3D extensions of high efficiency video coding. IEEE Transactions on Circuits and Systems for Video
Technology 26, 1 (2016), 35–49.
[91] Tao Tian, Hanli Wang, Sam Kwong, and C.-C. Jay Kuo. 2021. Perceptual image compression with block-level just
noticeable difference prediction. ACM Transactions on Multimedia Computing, Communications, and Applications 16,
4, Article 126 (2021), 15 pages.
[92] David J. Tolhurst and J. Anthony Movshon. 1975. Spatial and temporal contrast sensitivity of striate cortical neurones.
Nature 257 (11 1975), 674–675.
[93] Jean-Marc Valin and Timothy B. Terriberry. 2015. Perceptual vector quantization for video coding. In Visual Information
Processing and Communication VI, Vol. 9410. SPIE, 65 – 75.
[94] Floris L. van Nes and Maarten A. Bouman. 1967. Spatial modulation transfer in the human eye. Journal of the Optical
Society of America 57 (1967), 401–406.
[95] Eloïse Vidal, Nicolas Sturmel, Christine Guillemot, Patrick Corlay, and Francois-Xavier Coudoux. 2017. New adaptive
filters as perceptual preprocessing for rate-quality performance optimization of video coding. Signal Processing: Image
Communication 52 (2017), 124–137.
[96] Haiqiang Wang, Ioannis Katsavounidis, Jiantong Zhou, Jeonghoon Park, Shawmin Lei, Xin Zhou, Man-On Pun,
Xin Jin, Ronggang Wang, Xu Wang, Yun Zhang, Jiwu Huang, Sam Kwong, and C.-C. Jay Kuo. 2017. VideoSet: a
large-scale compressed video quality dataset based on JND measurement. Journal of Visual Communication and Image
Representation 46 (01 2017).
[97] Hao Wang, Li Song, Rong Xie, Zhengyi Luo, and Xiangwen Wang. 2018. Masking effects based rate control scheme
for high efficiency video coding. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS). 1–5.
[98] Hongkui Wang, Li Yu, Junhui Liang, Haibing Yin, Tiansong Li, and Shengwei Wang. 2021. Hierarchical Predictive
Coding-Based JND Estimation for Image Compression. IEEE Transactions on Image Processing 30 (2021), 487–500.
[99] Qun Wang, Hui Yuan, Junyan Huo, and Peng Li. 2019. A fidelity-assured rate distortion optimization method for
perceptual-based video coding. In 2019 IEEE International Conference on Image Processing (ICIP). 4135–4139.
[100] Shiqi Wang, Abdul Rehman, Zhou Wang, Siwei Ma, and Wen Gao. 2012. SSIM-motivated rate-distortion optimization
for video coding. IEEE Transactions on Circuits and Systems for Video Technology 22, 4 (2012), 516–529.
[101] Shiqi Wang, Abdul Rehman, Zhou Wang, Siwei Ma, and Wen Gao. 2013. Perceptual video coding based on SSIM-
inspired divisive normalization. IEEE Transactions on Image Processing 22, 4 (2013), 1418–1429.
[102] Shiqi Wang, Abdul Rehman, Kai Zeng, Jiheng Wang, and Zhou Wang. 2017. SSIM-motivated two-pass VBR coding
for HEVC. IEEE Transactions on Circuits and Systems for Video Technology 27, 10 (2017), 2189–2203.
[103] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. 2004. Image quality assessment: from error visibility to
structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
[104] Zhou Wang, E.P. Simoncelli, and A.C. Bovik. 2003. Multiscale structural similarity for image quality assessment. In
The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, Vol. 2. 1398–1402.
[105] Mathias Wien, Jill M. Boyce, Thomas Stockhammer, and Wen-Hsiao Peng. 2019. Standardization status of immersive
video coding. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 5–17.
[106] Woojae Kim, Jongyoo Kim, Sewoong Ahn, Jinwoo Kim, and Sanghoon Lee. 2018. Deep video quality assessor: From
spatio-temporal visual sensitivity to a convolutional neural aggregation network. In 15th European Conference on
Computer Vision. 224–241.
[107] Jinjian Wu, Leida Li, Weisheng Dong, Guangming Shi, Weisi Lin, and C.-C. Jay Kuo. 2017. Enhanced just noticeable
difference model for images with pattern complexity. IEEE Transactions on Image Processing 26, 6 (2017), 2682–2693.
[108] Jinjian Wu, Guangming Shi, and Weisi Lin. 2019. Survey of visual just noticeable difference estimation. Frontiers of
Computer Science 13, 1 (2019), 4–15.
[109] Xiuzhe Wu, Hanli Wang, Sudeng Hu, Sam Kwong, and C.-C. Jay Kuo. 2020. Perceptually weighted mean squared
error based rate-distortion optimization for HEVC. IEEE Transactions on Broadcasting 66, 4 (2020), 824–834.
[110] Guoqing Xiang, Xiaodong Xie, Huizhu Jia, Xiaofeng Huang, Janny Liu, Kaijin Wei, Yuanchao Bai, Pei Liao, and Wen
Gao. 2014. An adaptive perceptual quantization algorithm based on block-level JND for video coding. In Pacific Rim
Conference on Multimedia (PCM): Advances in Multimedia Information Processing. 54–63.
[111] Guoqing Xiang, Xinfeng Zhang, Xiaofeng Huang, Fan Yang, Chuang Zhu, Huizhu Jia, and Xiaodong Xie. 2022. Per-
ceptual quality consistency oriented CTU level rate control for HEVC Intra coding. IEEE Transactions on Broadcasting
68, 1 (2022), 69–82.
[112] Long Xu, Weisi Lin, Lin Ma, Yongbing Zhang, Yuming Fang, King Ngi Ngan, Songnan Li, and Yihua Yan. 2016.
Free-energy principle inspired video quality metric and its use in video coding. IEEE Transactions on Multimedia 18, 4
(2016), 590 – 602.
[113] Munan Xu, Junming Chen, Haiqiang Wang, Shan Liu, Ge Li, and Zhiqiang Bai. 2020. C3DVQA: Full-reference video
quality assessment with 3D convolutional neural network. In IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). 4447–4451.
[114] Yunyao Yan, Guoqing Xiang, Yuan Li, Xiaodong Xie, Wei Yan, and Yungang Bao. 2020. Spatiotemporal perception
aware quantization algorithm for video coding. In IEEE International Conference on Multimedia and Expo (ICME). 1–6.
[115] Aisheng Yang, Huanqiang Zeng, Jing Chen, Jianqing Zhu, and Canhui Cai. 2017. Perceptual feature guided rate
distortion optimization for high efficiency video coding. Multidimensional Systems and Signal Processing 28, 4 (2017),
1249–1266.
[116] Aisheng Yang, Huanqiang Zeng, Lin Ma, Jing Chen, Canhui Cai, and Kai-Kuang Ma. 2016. A perceptual-based rate
control for HEVC. In 2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA). 1–5.
[117] Kun Yang, Dong Liu, and Feng Wu. 2020. Deep learning-based nonlinear transform for HEVC intra coding. In 2020
IEEE International Conference on Visual Communications and Image Processing (VCIP). 387–390.
[118] Ren Yang, Mai Xu, Tie Liu, Zulin Wang, and Zhenyu Guan. 2019. Enhancing quality for HEVC compressed videos.
IEEE Transactions on Circuits and Systems for Video Technology 29, 7 (2019), 2039–2054.
[119] Chuohao Yeo, Hui Li Tan, and Yih Han Tan. 2013. On rate distortion optimization using SSIM. IEEE Transactions on
Circuits and Systems for Video Technology 23, 7 (2013), 1170–1181.
[120] Di Yuan, Tiesong Zhao, Yiwen Xu, Hong Xue, and Liqun Lin. 2019. Visual JND: a perceptual measurement in video
coding. IEEE Access 7 (2019), 29014–29022.
[121] Huanqiang Zeng, Aisheng Yang, King Ngi Ngan, and Miaohui Wang. 2016. Perceptual sensitivity-based rate control
method for high efficiency video coding. Multimedia Tools and Applications 75, 17 (2016), 10383 – 10396.
[122] Fan Zhang and David R. Bull. 2016. HEVC enhancement using content-based local QP selection. In 2016 IEEE
International Conference on Image Processing (ICIP). 4215–4219.
[123] Jiaqi Zhang, Chuanmin Jia, Meng Lei, Shanshe Wang, Siwei Ma, and Wen Gao. 2019. Recent development of AVS
video coding standard: AVS3. In 2019 Picture Coding Symposium (PCS). 1–5.
[124] Lei Zhang, Qiang Peng, and Xiao Wu. 2017. Perception-based adaptive quantization for transform-domain Wyner-Ziv
video coding. Multimedia Tools and Applications 76 (08 2017), 16699–16725.
[125] Lin Zhang, Ying Shen, and Hongyu Li. 2014. VSI: A visual saliency-induced index for perceptual image quality
assessment. IEEE Transactions on Image Processing 23, 10 (2014), 4270–4281.
[126] Qiudan Zhang, Xu Wang, Shiqi Wang, Shikai Li, Sam Kwong, and Jianmin Jiang. 2019. Learning to explore intrinsic
saliency for stereoscopic video. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
9741–9750.
[127] Xinfeng Zhang, Weisi Lin, Shiqi Wang, Jiaying Liu, Siwei Ma, and Wen Gao. 2019. Fine-grained quality assessment
for compressed images. IEEE Transactions on Image Processing 28, 3 (2019), 1163–1175.
[128] Xiang Zhang, Siwei Ma, Shiqi Wang, Jian Zhang, Huifang Sun, and Wen Gao. 2021. Divisively normalized sparse
coding: toward perceptual visual signal representation. IEEE Transactions on Cybernetics 51, 8 (2021), 4237–4250.
[129] Xinfeng Zhang, Chao Yang, Haiqiang Wang, Wei Xu, and C.-C. Jay Kuo. 2020. Satisfied-user-ratio modeling for
compressed video. IEEE Transactions on Image Processing 29 (2020), 3777–3789.
[130] Yun Zhang, Sam Kwong, and Shiqi Wang. 2020. Machine learning based video coding optimizations: a survey.
Information Sciences 506 (2020), 395–423.
[131] Yun Zhang, Huanhua Liu, You Yang, Xiaoping Fan, Sam Kwong, and C. C. Jay Kuo. 2021. Deep learning based just
noticeable difference and perceptual quality prediction models for compressed video. IEEE Transactions on Circuits
and Systems for Video Technology (2021).
[132] Yongbing Zhang, Tao Shen, Xiangyang Ji, Yun Zhang, Ruiqin Xiong, and Qionghai Dai. 2018. Residual highway
convolutional neural networks for in-loop filtering in HEVC. IEEE Transactions on Image Processing 27, 8 (2018),
3827–3841.
[133] Yun Zhang, Xiaoxiang Yang, Xiangkai Liu, Yongbing Zhang, Gangyi Jiang, and Sam Kwong. 2016. High-efficiency 3D
depth coding based on perceptual quality of synthesized video. IEEE Transactions on Image Processing 25, 12 (2016),
5877–5891.
[134] Yun Zhang, Huan Zhang, Mei Yu, Sam Kwong, and Yo-Sung Ho. 2020. Sparse representation-based video quality
assessment for synthesized 3D videos. IEEE Transactions on Image Processing 29 (2020), 509–524.
[135] Chen Zhao, Jian Zhang, Siwei Ma, Xiaopeng Fan, Yongbing Zhang, and Wen Gao. 2017. Reducing image compression
artifacts by structural sparse representation and quantization constraint prior. IEEE Transactions on Circuits and
Systems for Video Technology 27, 10 (2017), 2057–2071.
[136] Xin Zhao, Jianle Chen, Marta Karczewicz, Amir Said, and Vadim Seregin. 2018. Joint Separable and Non-Separable
Transforms for Next-Generation Video Coding. IEEE Transactions on Image Processing 27, 5 (2018), 2514–2525.
[137] Xin Zhao, Jianle Chen, Marta Karczewicz, Li Zhang, Xiang Li, and Wei-Jung Chien. 2016. Enhanced Multiple Transform
for Video Coding. In 2016 Data Compression Conference (DCC). 73–82.
[138] Yin Zhao, Zhenzhong Chen, Ce Zhu, Yap-Peng Tan, and Lu Yu. 2011. Binocular just-noticeable-difference model for
stereoscopic images. IEEE Signal Processing Letters 18, 1 (2011), 19–22.
[139] Mingliang Zhou, Xuekai Wei, Sam Kwong, Weijia Jia, and Bin Fang. 2020. Just noticeable distortion-based perceptual
rate control in HEVC. IEEE Transactions on Image Processing 29 (2020), 7603–7614.
[140] Mingliang Zhou, Xuekai Wei, Shiqi Wang, Sam Kwong, Chi-Keung Fong, Peter H. W. Wong, Wilson Y. F. Yuen, and
Wei Gao. 2019. SSIM-based global optimization for CTU-level rate control in HEVC. IEEE Transactions on Multimedia
21, 8 (2019), 1921–1933.
[141] Wei Zhou, Likun Shi, Zhibo Chen, and Jinglin Zhang. 2020. Tensor Oriented No-Reference Light Field Image Quality
Assessment. IEEE Transactions on Image Processing 29 (2020), 4070–4084.
