
ARTICLE IN PRESS

Signal Processing: Image Communication 24 (2009) 525–547

Contents lists available at ScienceDirect

Signal Processing: Image Communication


journal homepage: www.elsevier.com/locate/image

Reduced-reference metric design for objective perceptual quality assessment in wireless imaging
Ulrich Engelke a,*, Maulana Kusuma b, Hans-Jürgen Zepernick a, Manora Caldera c
a Blekinge Institute of Technology, P.O. Box 520, SE-372 25 Ronneby, Sweden
b Universitas Gunadarma, Jakarta 12540, Indonesia
c Gibson Quai-AAS, 30 Richardson Street, Perth, WA 6005, Australia

Article info

Article history:
Received 2 June 2008
Received in revised form 12 June 2009
Accepted 16 June 2009

Keywords:
Objective perceptual image quality
Normalized hybrid image quality metric
Perceptual relevance weighted Lp-norm
Reduced-reference
Wireless imaging

Abstract

The rapid growth of third generation and the development of future generation mobile systems have led to an increase in the demand for image and video services. However, the hostile nature of the wireless channel makes the deployment of such services much more challenging than in the case of a wireline system. In this context, the importance of taking care of user satisfaction with service provisioning as a whole has been recognized. The related user-oriented quality concepts cover end-to-end quality of service and subjective factors such as experiences with the service. To monitor quality and adapt system resources, performance indicators that represent service integrity have to be selected and related to objective measures that correlate well with the quality as perceived by humans. Such objective perceptual quality metrics can then be utilized to optimize quality perception associated with applications in technical systems.

In this paper, we focus on the design of reduced-reference objective perceptual image quality metrics for use in wireless imaging. Specifically, the normalized hybrid image quality metric (NHIQM) and a perceptual relevance weighted Lp-norm are designed. The main idea behind both feature-based metrics relates to the fact that the human visual system (HVS) is trained to extract structural information from the viewing area. Accordingly, NHIQM and the Lp-norm are designed to account for different structural artifacts that have been observed in our distortion model of a wireless link. The extent to which individual artifacts are present in a given image is obtained by measuring related image features. The overall quality measure is then computed as a weighted sum of the features with the respective perceptual relevance weights obtained from subjective experiments. The proposed metrics differ mainly in the pooling of the features and the amount of reduced-reference produced. While NHIQM performs the pooling at the transmitter of the system to produce a single value as reduced-reference, the Lp-norm requires all involved feature values from the transmitted and received image to perform the pooling on the feature differences at the receiver. In addition, non-linear mapping functions are developed that relate the metric values to predicted mean opinion scores (MOS) and account for saturations in the HVS. The evaluation of the prediction performance of NHIQM and the Lp-norm reveals their excellent correlation with human perception in terms of accuracy, monotonicity, and consistency. This holds not only for the prediction performance on images taken for the training of the metrics but also for the generalization to unknown images. In addition, it is shown that the NHIQM approach and the perceptual relevance weighted Lp-norm outperform other prominent objective quality metrics in prediction performance.

© 2009 Elsevier B.V. All rights reserved.

* Corresponding author.
E-mail addresses: [email protected] (U. Engelke), [email protected] (M. Kusuma), [email protected] (H.-J. Zepernick), [email protected] (M. Caldera).

0923-5965/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.image.2009.06.005

1. Introduction

The development of advanced transmission techniques for third generation mobile communication systems and their long-term evolution has paved the way for the delivery of mobile multimedia services. Wireless imaging applications are among those services that are offered on modern mobile devices to support communication options beyond the traditional voice services. As the bandwidth resources allocated to mobile communication systems are scarce and expensive, digital images and videos are compressed prior to their transmission. In addition, the time-varying nature of the wireless channel caused by multipath propagation, the changing interference conditions within the system, and other factors cause the channel to be relatively unreliable. As a consequence, the quality of wireless imaging services is impaired not only by the lossy compression technique adopted but also by the burst error mechanisms induced by the wireless channel.

The performance evaluation of mobile multimedia systems has conventionally been based on link layer metrics such as the signal-to-noise ratio (SNR) and the bit error rate (BER) [24]. Similarly, the performance of image compression techniques is often quantified by fidelity metrics such as the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR) [44]. In the case of communicating visual content, however, it has been shown that these metrics do not necessarily correlate well with the quality as perceived by the human observer [12,43]. As a result, user-oriented assessment methods that can measure the overall perceived quality have gained increased interest in recent years. It is expected that these methods will facilitate more efficient designs of mobile multimedia systems by establishing trade-offs between the allocation of system resources and quality of service (QoS) [26,32]. In other words, not only metrics associated with the underlying technical system are considered but also quality indicators that can accurately predict the visual quality as perceived by human observers.

1.1. Visual quality assessment

A wide range of approaches has been followed in the design of such visual quality metrics, ranging from simple numerical measures [8] to highly complex models incorporating those characteristics of the human visual system (HVS) that are considered crucial for visual quality perception [21,29,36]. Specifically, the phenomenon that the HVS is adapted to the extraction of structural information has received strong attention in metric design [1,3,39]. These psychophysical approaches, which are based on modeling various aspects of the HVS, correlate well with human visual perception and are usable over a wide range of applications. However, these benefits often come at the expense of high computational complexity. In contrast, methods following an engineering-inspired approach are mainly based on image or video analysis and feature extraction, which does not exclude that certain aspects of the HVS are considered in the metric design.

Most of the proposed HVS-based metrics follow the full-reference (FR) approach [6,19,33,42], meaning that they rely on the reference image being available for the quality assessment. Clearly, this limits their applicability to wireless imaging, as a reference image would generally not be available at the receiver where quality assessment takes place. Thus, a no-reference (NR) metric may be more appropriate since it measures the quality solely based on the received image. Although it is easy for humans to judge the quality of an image without any reference, this is an extremely difficult task for an automated algorithm.

As a consequence, metrics following the NR approach such as [11,23,35] usually provide inferior quality prediction performance compared to metrics that take into account some amount of reference information from the transmitted image, or process the whole original image itself as in the case of FR metrics. Furthermore, as NR metrics provide an absolute measure of the quality of a received image, it may be difficult to distinguish quality degradations that have been induced during image transmission from those that have already been present in the image prior to transmission. Hence, there would be strong limitations on executing link adaptation and resource management procedures based upon this type of metric.

In this respect, a good compromise between the FR and NR methods are the reduced-reference (RR) metrics. These metrics rely only on a set of image features, the reduced-reference, instead of the entire reference image. These features are simply extracted from an image prior to its transmission and used at the receiver for detecting quality degradations. The reduced-reference may then be transmitted over an ancillary channel, piggybacked with the image, or embedded into the image using data hiding techniques [41].

Wang et al. [40] have proposed an RR metric based on a natural image statistic model in the wavelet domain, and Carnec et al. [2] define the C4 criterion, which is an RR metric based on an elaborate model of the HVS. Both metrics have been shown to correlate well with human perception, which comes at the cost of a high computational complexity. This may restrict their application in the context of wireless imaging, where computational resources are very limited, in particular in the mobile device. Yamada et al. [45] and Chono et al. [5] propose RR metrics that can accurately predict PSNR. The former metric is based on a selection of representative luminance values, whereas the latter metric utilizes distributed

source coding to communicate the RR signal. These metrics may be applicable for usage in an image communication context due to their low computational complexity. However, the ability of these metrics to accurately predict perceived visual quality is doubtful due to the poor quality prediction performance of PSNR.

1.2. Overview of the proposed metric design

In view of the above, this paper focuses on the development of RR objective perceptual image quality metrics that are applicable in a wireless imaging context. As such, image impairments representative of a wireless imaging system are produced to constitute the basis of the design framework. In addition, particular care has been taken to limit the overhead needed for communicating reduced-reference information and hence conserve the scarce bandwidth resources allocated to wireless systems. Furthermore, feature extraction algorithms are selected to have low computational complexity in order not to drain battery power at the wireless handheld device and in turn support longer service time.

Specifically, images in the widely adopted Joint Photographic Experts Group (JPEG) format are examined with typical impacts of a mobile communication system included through a simulated wireless link. This system under test enabled us to produce artifacts beyond those inflicted purely by lossy source encoding and to account also for end-to-end degradations caused by a transmission system. In particular, the artifacts of blocking, blur, ringing, masking, and lost blocks have been observed, ranging from extreme to almost invisible presence.

The information about the individual artifacts in an image can be deduced from related image features such as edges, image activity (IA), and histogram statistics. The extent to which the considered artifacts exist in a given image can therefore be quantified by using selected image feature extraction algorithms. As some artifacts influence the perceived quality more strongly than others, perceptual relevance weights are given to the associated image features. Clearly, subjective experiments and their analysis are not only instrumental but critical in the process of revealing the specific values of the perceptual relevance weights. For this reason, we conducted subjective image quality experiments in two independent laboratories. The particular values of the weights were deduced as Pearson linear correlation coefficients between the related features and the mean opinion scores (MOS) from the subjective experiments. In this respect, the perceptual relevance weights obtained from analyzing the subjective data constitute a key component in the transition from subjective quality prediction methods to an automated quality assessment that would be suitable for real-time applications. Given these perceptual relevance weights, an objective perceptual image quality metric may then be designed to exploit image feature values and their weights within a suitable pooling process. In this paper, we consider two feature-based objective perceptual quality metrics that mainly differ in the pooling process and the amount of reduced-reference, as follows.

Firstly, the normalized hybrid image quality metric (NHIQM) is designed. It operates on extreme value normalized image features from which it produces a weighted sum with respect to the relevance of the involved features. The result is a single value that can be communicated from transmitter to receiver, where it is utilized as reduced-reference information. The same processing is performed on the received image, resulting in the related NHIQM value. The absolute difference between the NHIQM values of the transmitted and received image constitutes the objective perceptual quality metric and is used to detect distortions.

Secondly, we consider a perceptual relevance weighted Lp-norm as a means of pooling the image features. Specifically, the Lp-norm is applied here to detect differences between features [7,10]. In this case, the pooling at the transmitter is omitted, but the features have to be transmitted over the channel to the receiver. At the receiver, the differences between the transmitted and received features are combined into an overall quality metric. This approach allows degradations to be tracked for each of the involved features. On the other hand, the amount of reduced-reference overhead is increased compared to the NHIQM-based approach.

The design of both feature-based RR metrics, NHIQM and Lp-norm, follows the same methodology. It comprises the selection of suitable feature extraction algorithms, the feature extraction for image samples of a training set, normalization of the calculated feature values, and the acquisition of the perceptual relevance weights from the subjective experiments. A non-linear mapping function is derived in a final step that relates the objective perceptual quality metric to predicted MOS. In this way, non-linearities in the HVS with respect to the processing of quality degradations can be accounted for. The non-linear mapping function is derived using curve fitting methods where, again, the MOS from the subjective experiments are essential in deriving the parameters of the mapping functions.

A comprehensive evaluation of the prediction performance of NHIQM and the Lp-norm is provided in terms of accuracy, monotonicity, and consistency [34]. These performance measures are given for the metric design on a training set of images and the generalization to unknown images. It turns out that the proposed feature-based metrics outperform other considered RR and FR metrics in the context of wireless imaging distortions and with respect to the above prediction performance measures.

1.3. Contributions of this work

Considering the above, this paper contributes a framework for image quality metric design in a wireless communication system. As such, the metrics proposed in this paper have been designed to be able to measure quality degradation during image transmission using an RR approach. Unlike other RR metrics from the literature, the metrics in this paper are designed based on a set of test images that take into account the complex nature of a

wireless communication system, rather than just accounting for source coding artifacts or additional noise. Furthermore, low computational complexity and low overhead in terms of reduced-reference have been major design issues in order to put a low burden on the communication system.

A statistical analysis of experiments that we conducted in two independent laboratories reveals insight into the subjectively perceived quality of wireless imaging distortions. In addition, a statistical and correlation analysis of objective feature metrics provides further insight into the artifacts observed in wireless imaging and the performance of the feature metrics that were used to quantify the related artifacts. Comparison of the proposed RR quality metrics to other contemporary quality metrics reveals the ability of the proposed metrics to predict perceived quality in the context of wireless imaging.

This paper is organized as follows. Section 2 provides an overview of RR objective quality assessment in wireless imaging and the particular system under test as considered in this paper. A detailed description of the conducted subjective quality experiments is contained in Section 3 along with a statistical analysis of the experiment outcomes. The objective feature extraction metrics, which build the very basis of the metric design, are discussed in Section 4. An additional analysis of the feature metrics provides insight into their ability to measure artifacts in the images. On the basis of both the subjective and objective data, the RR metric design for objective perceptual quality assessment is then described in detail in Section 5. In Section 6, the prediction performance of NHIQM and the Lp-norm is evaluated and compared to other prominent objective quality metrics. Finally, conclusions are drawn in Section 7.

2. Reduced-reference objective perceptual quality assessment in wireless imaging

A typical link layer of a wireless communication system is shown in Fig. 1. Here, the functional blocks in shaded boxes relate to the components that would need to be included for performing the operations related to RR objective perceptual quality assessment. As such, the system is able to monitor quality degradations that are incurred during transmission, unlike in the case of deploying an NR quality assessment method, where an absolute quality of the received image would be obtained.

[Fig. 1 is a block diagram: at the transmitter, the image It is source encoded while the reduced-reference (RR) is produced by feature extraction, pooling, and RR embedding, followed by channel encoding and modulation; after the wireless channel, the receiver performs demodulation, channel decoding, RR recovery, and source decoding to obtain Ir, whose features are extracted and pooled for the quality assessment that outputs the quality metric.]

Fig. 1. Overview of reduced-reference objective perceptual quality assessment deployed in a wireless imaging system.

Given the strict limitations on system resources such as bandwidth, the overhead induced by the reduced-reference becomes a critical metric design issue. It is therefore beneficial to extract and pool representative features of an image It at the transmitter (t) in order to condense the image content and structure to a few numerical values. The transmission of the source encoded image may then be accompanied by the reduced-reference, which could be communicated either in-band as an additional header or in a dedicated control channel. Subsequently, channel encoding, modulation, and other wireless transmission functions are performed on the source encoded image and the reduced-reference. At the receiving side, the inverse functions are performed, including demodulation, channel decoding, and source decoding. The reduced-reference features are recovered from the received data, and the related features of the reconstructed image Ir at the receiver (r) are extracted and pooled to produce the related metric value. The difference between the metric values for the images It and Ir can then be explored for end-to-end image quality assessment. The outcome of the RR quality assessment may drive, for instance, link adaptation techniques such as adaptive coding and modulation, power control, or automatic repeat request strategies, provided a feedback link is available.

2.1. System under test

In the scope of this paper, we consider a particular setup of the wireless link model as shown in Fig. 1, which turned out to result in a set of test images covering a broad range of artifact types and severities. In particular, the JPEG format has been chosen to source encode the

images prior to transmission. It is noted that JPEG is a lossy image coding technique using a block discrete cosine transform (DCT)-based algorithm, thus facilitating an easy transition to state-of-the-art DCT-based video codecs, such as H.264. Due to the quantization of DCT coefficients, artifacts may already be introduced during source encoding. A (31,21) Bose–Chaudhuri–Hocquenghem (BCH) code was then used for error protection purposes and binary phase shift keying (BPSK) for modulation. An uncorrelated Rayleigh flat fading channel in the presence of additive white Gaussian noise (AWGN) was implemented as a simple model of the wireless channel. Severe fading conditions may cause bit errors or burst errors in the transmitted signal which are beyond the correction capabilities of the channel decoder; as a result, artifacts may be induced in the decoded image in addition to the ones purely caused by the source encoding. To produce severe transmission conditions, the average bit energy to noise power spectral density ratio Eb/N0 was chosen as 5 dB.

It should be noted that the RR objective quality metric design is based upon this particular setup. However, the proposed metric design framework can be easily adapted to other specific system components, given that the objective data (test images) and subjective data (MOS) sets are available that are crucial for the metric design. This may for instance include an extension from JPEG to JPEG2000 or to measuring spatial artifacts in video, such as H.264.

2.2. Artifacts in wireless imaging

The system under test as outlined in Section 2.1 turned out to be beneficial with respect to generating impaired images ranging from extreme artifacts to images with almost invisible artifacts. Specifically, the range of artifacts spanned beyond those typically induced by source encoding, such as blocking and blur, and also comprised ringing, intensity masking, lost blocks, and combinations thereof. These artifacts will be briefly discussed in the following sections. In addition, some example images are shown in Fig. 2 to illustrate the observed artifacts.

Fig. 2. Distorted image samples showing different artifacts: "Lena" with blocking, "Goldhill" with blur in 8×8 blocks (top); "Pepper" with ringing and intensity masking, "Barbara" with extreme artifacts (bottom).

2.2.1. Blocking
Blocking artifacts are inherent to block-based image compression techniques such as JPEG or H.264. Blocking, or blockiness, can be observed as surface discontinuity at block boundaries and is a direct consequence of the independent quantization of the individual blocks of pixels. In particular, in JPEG compressed images blocking is present on the 8×8 block borders due to independent quantization of the DCT coefficients.

2.2.2. Blur
Blur relates to the loss of spatial detail and is observed as texture blur. In addition, blur may be observed due to a loss of semantic information that is carried by the shapes of objects in an image. In this case, edge smoothness relates to a reduction of edge sharpness and contributes to blur. In relation to compression, blur is a consequence of the coarse quantization of frequency components and the associated suppression of high-frequency coefficients. In the case of JPEG compression, blur is usually observed within the 8×8 blocks rather than on a global scale.

2.2.3. Ringing
The artifact of ringing appears to the human observer as periodic pseudo-edges around the original edges of the objects in an image. Ringing is caused by improper truncation of high-frequency components, which in turn can be noticed as high-frequency irregularities in the reconstruction. Ringing is usually more evident along high contrast edges, especially if these edges are located in areas of smooth textures.

2.2.4. Intensity masking and lost blocks
In general, masking occurs when the visibility of a stimulus is reduced due to the presence of another stimulus [43]. In this context, intensity shifts in parts of an image, or the whole image, may result in either a darker or brighter appearance of the area as compared to the original image and thus cause such a reduction in visibility. This phenomenon, which we refer to as intensity masking, is a typical artifact in wireless image communication, appearing in the presence of strong multipath fading. In the worst case, entire image blocks are lost, resulting in parts of the image being black.

3. Subjective image quality experiments

The methodology used for the subjective assessment of image quality is described hereafter. In particular, the laboratory environment, the test material, the panels of viewers, and the test procedure adopted in the subjective

experiments are given in detail. According to the guidelines outlined in Recommendation BT.500-11 [16] of the Radiocommunication Sector of the International Telecommunication Union (ITU-R), subjective experiments were conducted in two independent laboratories. The first subjective experiment (SE 1) took place at the Western Australian Telecommunications Research Institute (WATRI) in Perth, Australia, and the second subjective experiment (SE 2) was conducted at the Blekinge Institute of Technology (BIT) in Ronneby, Sweden.

3.1. Laboratory environment

The general viewing conditions were arranged as specified in the ITU-R Recommendation BT.500-11 [16] for a laboratory environment. The subjective experiments were conducted in a room equipped with two 17 in cathode ray tube (CRT) monitors of type Sony CPD-E200 (SE 1) and a pair of 17 in CRT monitors of type DELL and Samtron 75E (SE 2). The ratio of the luminance of an inactive screen to the peak luminance was kept below a value of 0.02. The ratio of the luminance of the screen, when displaying only black level in a dark room, to the luminance when displaying peak white was approximately 0.01. The display brightness and contrast were set up with picture line-up generation equipment (PLUGE) according to Recommendations ITU-R BT.814 [13] and ITU-R BT.815 [14]. The calibration of the screens was performed with the calibration equipment ColorCAL from Cambridge Research System Ltd., England, while the DisplayMate software was used as pattern generator.

Due to its large impact on artifact perceivability, the viewing distance must be taken into consideration when conducting a subjective experiment. The viewing distance is in the range of four times (4H) to six times (6H) the height H of the CRT monitors, as stated in Recommendation ITU-R BT.1129-2 [15]. The distance of 4H was selected here in order to provide better image details to the viewers.

3.2. Test material

Seven reference images of dimension 512×512 pixels and represented in grey scale have been chosen to cover a variety of textures, complexities, and arrangements. The images are shown in Figs. 3 and 4, where the images in Fig. 3 represent humans and human faces and the images in Fig. 4 represent more complex structures and natural scenes. The wireless link simulation model as explained in Section 2.1 has then been utilized to create test images that exhibit the wide variety of distortions discussed in Section 2.2. In particular, two sets of 40 images each, I1 and I2, were created to be used in the two subjective experiments SE 1 and SE 2, respectively. The images were chosen so as to cover a wide variety of artifacts and also a broad range of severities for each of the artifacts, from almost invisible to highly distorted. Thus, the metric design is based on a set of test images that incorporates distortions from near the just noticeable differences regime to artifacts widely covering the suprathreshold regime.

Fig. 3. Reference images showing low texture human faces: "Lena", "Elaine" (top); "Tiffany", "Barbara" (bottom).

Fig. 4. Reference images showing complex textures: "Goldhill", "Pepper", and "Mandrill" (left to right).

3.3. Viewers

The viewers are the respondents in the experiment. Experienced viewers, i.e. individuals that are professionally involved in image quality evaluation/assessment at their work, are not eligible to participate in the subjective experiments. As such, only inexperienced (or non-expert)

viewers were allowed to take part in the conducted subjective experiments. In order to support generalization of results and statistical significance of the collected subjective data, the experiments were conducted in two different laboratories, involving 30 non-expert viewers in each experiment. Thus, the minimum requirement of at least 15 viewers, as recommended in [16], is well satisfied. In order to support consistency and eliminate systematic differences among results at the different testing laboratories, similar panels of test subjects in terms of occupational category, gender, and age were established.

In particular, 25 males and five females participated in SE 1. They were all university staff and students, with ages distributed in the range of 21-39 years and an average age of 27 years. In the second experiment, SE 2, 24 males and six females participated. Again, they were all university staff and students, with ages distributed in the range of 20-53 years and an average age of 27 years.

3.4. Test procedure

3.4.1. Selection of test method
Different test methodologies are provided in detail in [16] to best match the objectives and circumstances of the assessment problem. The methodologies are mainly classified into two categories, double-stimulus and single-stimulus. In double-stimulus, the reference image is presented to the viewer along with the test image. In single-stimulus, on the other hand, the reference image is not explicitly presented, although it may be shown transparently to the subject for the purpose of observing judgement consistency. As we consider RR metric design in this paper, where partial information related to the reference image is available, we chose to deploy a double-stimulus method, the double-stimulus continuous quality scale (DSCQS). Moreover, DSCQS has been shown to have low sensitivity to contextual effects [16,34]. Contextual effects occur when the subjective rating of an image is influenced by the presentation order and severity of impairments. This relates to the phenomenon that test subjects may tend to give an image a lower score than it would normally have been given if its presentation is scheduled after a less distorted image.

3.4.2. Presentation of test material
The test sessions were divided into two sections. Each section lasted up to 30 min and consisted of a stabilization and a test trial. The stabilization trials were used as a warm-up to the actual test trial in each section. In addition, one training trial was conducted at the very beginning of the test session to demonstrate the test procedure to the viewers and allow them to familiarize themselves with the test mechanism. Clearly, the scores obtained during the training and stabilization trials are not processed; only the scores given during the test trials are analyzed. In order to reduce viewer fatigue, a 15 min break was given between sections.

Given the DSCQS method, pairs of images A and B are presented in alternating order to the viewers for assessment, with one image being the original, undistorted image and the other being the distorted test image. As the DSCQS method is quite sensitive to small quality differences, it is well suited not only to cope with highly distorted test images but also with cases where the quality of the original and distorted image is very similar.

3.4.3. Grading scale
The grading is performed with reference to a five-point quality scale (excellent, good, fair, poor, bad), which is used to divide the continuous grading scale into five partitions of equal length. Given the pair of images A and B, the viewer is requested to assess their quality by placing a mark on each quality scale. As the reference and distorted image appear in pseudo-random order, A and B may refer to either the reference image or the distorted image, depending on the actual arrangement of images in an assessment pair.

3.5. Subjective data analysis

The outcomes of the subjective experiments are discussed in the following by means of a statistical analysis. In this respect, a concise representation of the subjective data can be achieved by calculating conventional statistics such as the mean, variance, skewness, and kurtosis of the related distribution of opinion scores. The statistical analysis of these data reflects the fact that perceived quality is a subjective measure and hence may be described statistically.

3.5.1. Statistical measures
Let the MOS value for the kth image in a set K of size K be denoted here as m_k. Then, we have

  m_k = \frac{1}{N} \sum_{j=1}^{N} u_{j,k}    (1)

where u_{j,k} denotes the opinion score given by the jth viewer to the kth image and N is the number of viewers. The confidence interval associated with the MOS of each examined image is given by

  [m_k - \epsilon_k, m_k + \epsilon_k]    (2)

The deviation term \epsilon_k can be derived from the standard deviation s_k and the number N of viewers, and is given for a 95% confidence interval according to [16] by

  \epsilon_k = 1.96 \frac{s_k}{\sqrt{N}}    (3)

where the standard deviation s_k for the kth image is defined as the square root of the variance

  s_k^2 = \frac{1}{N-1} \sum_{j=1}^{N} (u_{j,k} - m_k)^2    (4)

The skewness measures the degree of asymmetry of data around the mean value of a distribution of samples and is defined by the second and third central moments
m_2 and m_3, respectively, as

  \beta = \frac{m_3}{m_2^{3/2}}    (5)

where the lth central moment m_l is defined as

  m_l = \frac{1}{N} \sum_{j=1}^{N} (u_j - m)^l    (6)

The peakedness of a distribution can be quantified by the kurtosis, which measures how outlier-prone a distribution is. The kurtosis is defined by the second and fourth central moments m_2 and m_4, respectively, as

  \gamma = \frac{m_4}{m_2^2}    (7)

It should be mentioned that the kurtosis of the normal distribution is obtained as 3. If the considered distribution is more outlier-prone than the normal distribution, it results in a kurtosis greater than 3. On the other hand, if it is less outlier-prone than the normal distribution, it gives a kurtosis less than 3. A distribution of scores is usually considered as normal if the kurtosis is between 2 and 4.
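As a concrete illustration of Eqs. (1)-(7), the statistics for one image's column of opinion scores can be computed with a short Python sketch (numpy assumed; the function and variable names are ours, not the paper's):

```python
import numpy as np

def mos_statistics(scores):
    """Statistics of one image's opinion scores, following Eqs. (1)-(7).

    scores: opinion scores u_{j,k} given by the N viewers to image k.
    Returns (MOS, 95% confidence half-width, skewness, kurtosis).
    """
    u = np.asarray(scores, dtype=float)
    n = u.size
    m = u.mean()                                 # Eq. (1): MOS m_k
    s = np.sqrt(((u - m) ** 2).sum() / (n - 1))  # Eq. (4): sample std s_k
    eps = 1.96 * s / np.sqrt(n)                  # Eq. (3): 95% CI half-width
    mu2 = ((u - m) ** 2).mean()                  # Eq. (6): central moments
    mu3 = ((u - m) ** 3).mean()
    mu4 = ((u - m) ** 4).mean()
    beta = mu3 / mu2 ** 1.5                      # Eq. (5): skewness
    gamma = mu4 / mu2 ** 2                       # Eq. (7): kurtosis (normal: 3)
    return m, eps, beta, gamma
```

Note that the central moments in Eqs. (5)-(7) are the biased (divide-by-N) moments, while the confidence interval uses the unbiased standard deviation of Eq. (4).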
3.5.2. Statistical analysis
Figs. 5(a) and (b) show the scatter plots of MOS for SE 1 and SE 2, respectively. The 40 images in each experiment are ordered with respect to decreasing subjective ratings in MOS. It can be seen from the figures that the material presented to the viewers indeed resulted in a wide range of perceptual quality ratings for both subjective experiments. As such, both experiments contained the extreme cases of excellent and bad image quality, while the intermediate quality decreases approximately linearly in between. It is also observed that the spread of ratings around the MOS in terms of the 95% confidence interval is generally narrower for the images at the upper and lower ends of the perceptual quality scale. Thus, the viewers seemed to be more confident in giving their quality ratings in cases where the presented images were of either very high or very low quality. In the middle ranges of quality, on the other hand, the viewers' confidence about the quality of an image was significantly lower.

Fig. 5. Perceived quality ordered according to decreasing MOS with error bars indicating the 95% confidence intervals: (a) SE 1, (b) SE 2.

Figs. 6(a)-(d) show the MOS, variance, skewness, and kurtosis, respectively, for each image sample that was rated in the two subjective experiments. The image samples in all four figures are, as in Fig. 5, ordered with respect to decreasing MOS. In addition to the image samples, the figures depict the related fits to these statistics, which reveal good agreement among the data for the two subjective experiments, as the fits progress closely in the same manner over the ordered image samples. This indicates that the two experiments have been very well aligned with each other and also that the two viewer panels, even though originating from different countries, seem to have given similar quality scores for the test images they have been shown.

Fig. 6(a) depicts the impaired image samples with respect to decreasing MOS along with the linear fit through these data. It can be seen from the figure that the linear fits for both experiments are very close, indicating that the sets of image samples used in the two independent experiments at WATRI and BIT comprised a similar range of quality impairments.

Fig. 6(b) shows the variance of all opinion scores for each image sample. The variance can be regarded as a measure of how much the viewers agree on the perceived quality of a certain image sample. In other words, the smaller the variance, the more pronounced the agreement between all viewers. It can clearly be seen from the figure that the variance is relatively small for images that obtained either excellent or bad subjective quality ratings. In contrast, in the region where the perceptual quality of the impaired images ranges between good and poor, the variance tends to be larger, with the peak at about the middle of the quality range. This is an interesting result since it indicates that the viewers appear to be rather sure whether an image sample is of excellent or bad quality, while opinions about images of average quality differ to a
wider extent. These conclusions are supported by the confidence intervals shown in Figs. 5(a) and (b), which are narrower for images rated as being excellent and bad.

Fig. 6. Statistics of opinion scores for the impaired image samples: (a) MOS, (b) variance, (c) skewness, and (d) kurtosis.

Fig. 6(c) shows the skewness of the opinion score distribution for each image sample. In the context of subjective ratings of image quality, a negative or positive skewness translates to the subjective scores being more spread towards lower or higher values than the MOS, respectively. For the images that were perceived as being of high quality, the negative skewness indicates that subjective scores tend to be asymmetrically spread around the MOS towards lower opinion scores, and thus that a number of viewers gave significantly lower quality scores as compared to the MOS. In the other extreme, of image quality being perceived as bad, the positive skewness points to an asymmetric spread around the MOS towards higher opinion scores. However, the positive skewness is not as distinct as the negative skewness at the high quality end, indicating that the agreement about low quality was higher as compared to the agreement about high quality. The asymmetry in subjective scores for the extreme cases of excellent and bad quality is thought to be due to the rating scale being limited to 100 and 0, respectively. As such, subjective scores have to approach the maximal and minimal possible rating from below or above, respectively. The skewness of around zero for the middle range of qualities reveals that the subjective scores seem to be symmetrically distributed with respect to MOS, even though the variance for images of average quality is larger.

Fig. 6(d) provides the kurtosis for each impaired image sample. It can be seen from the figure that the distributions of subjective scores for some of the images scoring high MOS values in both experiments give kurtosis values much greater than that of a normal distribution. This is a strong indication of outliers, meaning that a few of the
viewers gave the image quality a low rating, whereas the majority of viewers agreed on a high image quality. With the progression of images towards decreasing MOS, the associated kurtosis fits quickly level out around the value 3, pointing to a normal distribution of the opinion scores around the MOS. It is interesting to point out that the high kurtosis at the high quality end does not occur at the bad quality end. This means that the entire viewer panel agreed on the bad quality images, with no outlier scores being present. This result is also evident in the skewness distribution, where the decline towards lower values at the high quality end is much more pronounced as compared to the incline of the skewness at the low quality end.

4. Objective structural degradation metrics

The design of the RR metrics proposed in this paper is based on the extraction of structural information from the images. In this section we discuss the objective feature metrics that were deployed to measure the artifacts as observed in the test images (see Section 2.2). An analysis of the objective measures provides further insight into the performance of the feature metrics in quantifying the artifacts.

4.1. Feature metrics

Given the set of artifacts as observed in the test images, algorithms for feature extraction can be deployed to capture the amount by which each of the artifacts is present in the images. The selection of the algorithms to be used is driven by three constraints, namely, a reasonable accuracy in capturing the characteristics of the associated artifact, a representation of the feature that incurs low overhead in terms of reduced-reference (to conserve bandwidth), and computational inexpensiveness (to conserve battery power). The features and feature extraction algorithms deployed here to measure and quantify the presence of the related artifacts are listed in Table 1 and will be described in the following sections.

Table 1
Image features, feature extraction algorithms, and related artifacts.

Feature  Description                    Algorithm               Related artifact
f̃_1      Block boundary differences     Wang et al. [38]        Blocking
f̃_2      Edge smoothness                Marziliano et al. [22]  Blur
f̃_3      Edge-based image activity      Saha and Vemuri [30]    Ringing
f̃_4      Gradient-based image activity  Saha and Vemuri [30]    Ringing
f̃_5      Image histogram statistics     Kusuma et al. [18]      Intensity masking, lost blocks

4.1.1. Feature f̃_1: block boundary differences
The first feature metric f̃_1 is based on the algorithm by Wang et al. [38] and comprises three measures. The first measure, B, estimates blocking as average differences between block boundaries. Two image activity measures (IAM), A and Z, are applied as indirect means of quantifying blur. The former IAM computes absolute differences between in-block image samples and the latter IAM computes a zero-crossing rate. All three measures are computed in both horizontal and vertical direction and combined in a pooling stage as follows:

  \tilde{f}_1 = \alpha + \beta B^{\gamma_1} A^{\gamma_2} Z^{\gamma_3}    (8)

where the parameters \alpha, \beta, \gamma_1, \gamma_2, and \gamma_3 were estimated in [38] using MOS from subjective experiments. Despite the two IAM incorporated in f̃_1, we found that this metric accounts particularly well for blocking artifacts in JPEG compressed images. This might be due to the magnitude of \gamma_1, reported in [38] as relatively large compared to \gamma_2 and \gamma_3, giving the blocking measure a higher impact on the metric f̃_1.

4.1.2. Feature f̃_2: edge smoothness
The extraction of feature metric f̃_2 relates purely to measuring blur artifacts and follows the work of Marziliano et al. [22]. It accounts for the smoothing effect of blur by measuring the distance between edges. It was found that it is sufficient to measure the blur along vertical edges, which allows for saving computational complexity as compared to computation on all edges. Therefore, a Sobel filter is applied to detect vertical edges in the image. The edge image is then scanned horizontally. For pixels that correspond to an edge point, the local extrema in the corresponding image are used to compute the edge width. The edge width then defines a local measure of blur. Finally, a global blur measure is obtained by averaging the local blur values over all edge locations. This metric was chosen to complement the IAM in f̃_1 since it does not just account for in-block blur but rather contributes a global blur measure.

4.1.3. Features f̃_3 and f̃_4: image activity
Ringing artifacts are observed as periodic pseudo-edges around original edges, thus increasing the activity within an image. The feature metrics f̃_3 and f̃_4 provide an indirect means of measuring ringing artifacts and are based on two IAM by Saha and Vemuri [30].

Here, f̃_3 quantifies image activity based on normalized magnitudes of edges in an edge image B(i) as follows:

  \tilde{f}_3 = \left( \frac{1}{M N} \sum_{i=1}^{M N} B(i) \right) \times 100    (9)

where M and N denote the image dimensions. Since f̃_3 does not depend on the direction of the edges, it also complements very well the blocking measure in f̃_1, which is purely designed to measure on the 8x8 block boundaries in JPEG coded images.

On the other hand, f̃_4 measures IA in an image I(i,j) based on local gradients in both vertical and horizontal
direction as follows:

  \tilde{f}_4 = \frac{1}{M N} \left( \sum_{i=1}^{M-1} \sum_{j=1}^{N} |I(i,j) - I(i+1,j)| + \sum_{i=1}^{M} \sum_{j=1}^{N-1} |I(i,j) - I(i,j+1)| \right)    (10)

In [30], the IAM were evaluated and in particular f̃_4 was found to quantify IA very accurately. We have further identified that both f̃_3 and f̃_4 account well for measuring ringing artifacts and also other high-frequency changes within the image.

4.1.4. Feature f̃_5: image histogram statistics
Finally, feature metric f̃_5 accounts for intensity masking and lost blocks using an original algorithm [18]. Both of these artifacts cause an intensity shift in parts of an image or in the whole image, which may result in either a darker or brighter appearance of the area as compared to the original image. As such, we found that a simple computation of the standard deviation of the first-order image histogram provides an adequate measure of both intensity masking and lost blocks. We have thus adapted feature metric f̃_5 as follows:

  \tilde{f}_5 = \sqrt{ \frac{1}{L} \sum_{i=0}^{L} (h_i - \bar{h})^2 }    (11)

where h_i denotes the number of pixels at grey level i, \bar{h} denotes the mean grey level, and L is the maximum grey level of 255 when using 8 bits per pixel.

4.2. Feature normalization

The proposed NHIQM follows the design philosophy of our previous work that resulted in the hybrid image quality metric (HIQM) [17,18]. Although HIQM inherently uses feature relevance weights, the actual feature values f̃_i have generally different meanings and different value ranges. As a consequence, it may be difficult to explore the resulting feature space for classification purposes and quality assessment if only relevance weighting were used, as with HIQM. It is therefore suggested here to also perform an extreme value normalization of the features. This allows for a more convenient and meaningful comparison of the contribution of each normalized feature f_i to the overall metric, as they are then taken from the same value range

  0 \le f_i \le 1    (12)

Specifically, let us distinguish among I different image features. The related feature values f̃_i, i = 1, ..., I, shall be normalized as follows [25]:

  f_i = \frac{\tilde{f}_i - \min_{k=1,...,K}(\tilde{f}_{i,k})}{\delta_i}, \quad i = 1, ..., I    (13)

where the feature values f̃_{i,k}, k = 1, ..., K, are taken from a set K of size K. In our case, these features were extracted from the images used in the subjective experiments, including all reference images and test images. Furthermore, the normalization factor \delta_i in (13) is given by

  \delta_i = \max_{k=1,...,K}(\tilde{f}_{i,k}) - \min_{k=1,...,K}(\tilde{f}_{i,k})    (14)

As far as the extreme value normalized features defined by (13) are concerned, it should be mentioned that the boundary conditions apply to those normalized feature values f_{i,k} which are associated with the feature values f̃_{i,k} of the images used in the experiments. In a practical system, it may also be beneficial to clip the normalized feature values that are actually calculated in a real-time wireless imaging application to fall into the interval [0, 1] as well. For instance, severe signal fading in a wireless channel can result in significant image impairments at particular times, causing the user-perceived quality to fall into a region where the HVS is saturated and does not notice further degradation.

4.3. Feature metrics performance analysis

In order to gain deeper knowledge and understanding of the feature extraction, it is of interest to examine the extent to which different features are present in the stimuli and to quantify the relationship between the feature metrics and MOS. Given the context of RR metric design in wireless imaging, where we are interested in the difference between the quality of the received image as compared to the quality of the transmitted image, let us in the following consider the magnitudes of the normalized feature differences

  \Delta f_i = |f_{t,i} - f_{r,i}|, \quad i = 1, ..., 5    (15)

where f_{t,i} and f_{r,i} denote the ith feature value of the transmitted and received image, respectively.

4.3.1. Feature magnitudes over MOS
Figs. 7(a) and (b) show the magnitudes of the normalized feature differences Δf_i for the image samples that were presented in SE 1 and SE 2, respectively. For each experiment, the related 40 feature differences are ranked with respect to image samples of decreasing MOS. It can be seen from these figures that the wireless link scenario indeed inflicted all five features, but with different degrees of severity. While feature differences are almost absent for the image samples with high perceptual quality ratings, they tend to increase with decreasing MOS. In particular, the level of Δf_1, relating to blocking, shows the widest spread among the image samples and becomes more pronounced when progressing from images of excellent to bad perceptual quality. A similar behavior is observed for the edge-based image activity Δf_3, although not as pronounced as for Δf_1. The remaining three features are less prevalent for most of the images but large for some of the stimuli. In particular, gradient-based image activity Δf_4 and intensity masking Δf_5 occur very distinctively for selected image samples while being almost absent from the majority of image samples.
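The feature measures and normalized differences of Eqs. (10), (11), and (13)-(15) can be sketched in numpy as follows (a minimal sketch; the function names, the clipping to [0, 1], the histogram binning, and the reading of h̄ as the mean histogram count are our assumptions, not specifications from the paper):

```python
import numpy as np

def f4_gradient_activity(img):
    """Gradient-based image activity, Eq. (10)."""
    img = np.asarray(img, dtype=float)
    dv = np.abs(np.diff(img, axis=0)).sum()  # sum of |I(i,j) - I(i+1,j)|
    dh = np.abs(np.diff(img, axis=1)).sum()  # sum of |I(i,j) - I(i,j+1)|
    return (dv + dh) / img.size              # scaled by 1/(M*N)

def f5_histogram_std(img, levels=256):
    """Standard deviation of the first-order histogram, Eq. (11).

    h-bar is taken here as the mean histogram count, which is one possible
    reading of the paper's definition.
    """
    h, _ = np.histogram(np.asarray(img).ravel(), bins=levels, range=(0, levels))
    return np.sqrt(((h - h.mean()) ** 2).sum() / (levels - 1))  # 1/L, L = 255

def normalize(raw, fmin, delta):
    """Extreme value normalization, Eqs. (13)-(14); fmin and delta are the
    per-feature minima and ranges observed over the K training images.
    Values for new images are clipped to [0, 1], as suggested in the text."""
    return np.clip((np.asarray(raw, float) - fmin) / delta, 0.0, 1.0)

def feature_differences(f_t, f_r):
    """Magnitudes of normalized feature differences, Eq. (15)."""
    return np.abs(np.asarray(f_t, float) - np.asarray(f_r, float))
```

For example, `feature_differences` applied to the normalized feature vectors of a transmitted and a received image yields the five values Δf_1, ..., Δf_5 used throughout this section.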
Fig. 7. Magnitude of differences between normalized feature values for the considered image samples ranked according to decreasing MOS: (a) SE 1, (b) SE 2.

4.3.2. Feature statistics
As with the MOS gathered from the subjective experiments, the statistical analysis may be extended to the actual feature differences in order to obtain a better understanding of the underlying objective quality degradations. However, overall statistics for the whole set of data, instead of image-dedicated statistics, shall be presented hereafter. Accordingly, for all five feature differences Δf_i, the mean, variance, skewness, and kurtosis have been computed over all images that were shown in experiments SE 1 and SE 2 (see Fig. 7). The results of all statistics are presented for both experiments in Tables 2 and 3.
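Statistics of this kind can be computed per feature column from a K x 5 matrix of feature differences, for example with the following numpy sketch (biased central moments, mirroring Eqs. (5)-(7); whether the paper uses biased or unbiased variance in the tables is not stated, so this is an assumption):

```python
import numpy as np

def feature_difference_stats(dF):
    """Mean, variance, skewness, and kurtosis per feature column.

    dF: K x 5 array holding Delta f_i for the K test images.
    Returns a dict of length-5 arrays, one entry per row of Tables 2/3.
    """
    dF = np.asarray(dF, dtype=float)
    m = dF.mean(axis=0)
    d = dF - m
    mu2 = (d ** 2).mean(axis=0)  # second central moment (biased variance)
    mu3 = (d ** 3).mean(axis=0)
    mu4 = (d ** 4).mean(axis=0)
    return {"mean": m,
            "variance": mu2,
            "skewness": mu3 / mu2 ** 1.5,
            "kurtosis": mu4 / mu2 ** 2}
```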
Table 2
Statistics of magnitudes of feature differences Δf_i for SE 1.

          Δf_1    Δf_2    Δf_3    Δf_4     Δf_5
Mean      0.253   0.120   0.102   0.053    0.022
Variance  0.043   0.017   0.014   0.015    0.009
Skewness  0.627   1.425   1.124   3.518    6.015
Kurtosis  2.082   4.120   3.241   15.010   37.466

Table 3
Statistics of magnitudes of feature differences Δf_i for SE 2.

          Δf_1    Δf_2    Δf_3    Δf_4     Δf_5
Mean      0.263   0.094   0.115   0.049    0.061
Variance  0.029   0.013   0.010   0.021    0.035
Skewness  1.066   2.495   1.072   5.461    3.785
Kurtosis  4.056   9.531   3.843   32.434   17.063

Table 4
Correlations between feature differences for SE 1.

       Δf_1    Δf_2    Δf_3    Δf_4    Δf_5
Δf_1   1.000   0.625   0.821   0.016   0.027
Δf_2           1.000   0.440   0.649   0.112
Δf_3                   1.000   0.056   0.061
Δf_4                           1.000   0.000
Δf_5                                   1.000

Table 5
Correlations between feature differences for SE 2.

       Δf_1    Δf_2    Δf_3    Δf_4    Δf_5
Δf_1   1.000   0.376   0.640   0.014   0.115
Δf_2           1.000   0.486   0.753   0.316
Δf_3                   1.000   0.323   0.272
Δf_4                           1.000   0.170
Δf_5                                   1.000

From a comparison of Tables 2 and 3, one can observe that for all four statistics and for all five feature differences, the magnitudes of the values are very much in alignment between the two experiments SE 1 and SE 2. This indicates that the stimuli, in terms of the distorted test images, had similar characteristics in both experiments. Thus, not only are the subjective data in alignment but also the composition of objective features among the test material. In particular, it can be seen from both tables that the mean of the blocking differences dominates over the other features. This is a direct result of the JPEG source encoding, for which it is well known that blocking artifacts are dominant over other artifacts such as blur. The mean values of feature differences Δf_4 and Δf_5 are particularly small; however, these features instead exhibit a very high skewness and kurtosis as compared to the other features. Clearly, this quantifies the progression of feature differences in the stimuli as shown in Figs. 7(a) and (b), with Δf_4 and Δf_5 being either negligibly small or distinctively developed.

4.3.3. Feature cross-correlations
Even though the feature metrics were selected to each account for a particular artifact, one may expect some overlap in quantifying the different artifacts. To further understand the performance of the feature metrics in comparison to each other, Tables 4 and 5 show the Pearson linear correlation coefficient between each pair of feature metrics for both SE 1 and SE 2. In this context, the cross-correlation measures the degree to which two features are simultaneously affected by a certain type and severity of an artifact. As expected, the correlation of a feature with itself exhibits the maximum magnitude of 1.

It can be seen from the tables that the cross-correlations between the features vary strongly in their magnitudes. A particularly pronounced cross-correlation can be observed between feature metrics Δf_1 (block boundary differences) and Δf_3 (edge-based IA) for both SE 1 and SE 2. This is thought to be due to both metrics being based on measuring edges of an image. However, it should be noted again that feature metric Δf_1 only considers the 8x8 block borders of the JPEG encoding, whereas feature metric Δf_3 quantifies image activity based on edges in all spatial locations and directions. Furthermore, feature metrics Δf_2 (edge smoothness) and Δf_4 (gradient-based IA) exhibit pronounced cross-correlations in the test sets of both experiments, which may be a result of both metrics being designed to quantify smoothness in images based on gradient information. As for feature metric Δf_5 (image histogram statistics), it can be seen that this metric is only negligibly correlated with any of the other feature metrics. This is a highly desired result since the feature metrics other than Δf_5 should be widely unaffected by intensity shifts.

5. Objective perceptual metric design

In this section we describe the RR objective quality metric design in detail. In this respect, the quality ratings obtained in the subjective experiments are instrumental for the transition from subjective to objective quality assessment.

5.1. Metric training and validation

As the foundation of the metric design, the 80 images in I1 (SE 1) and I2 (SE 2) from the two experiments were organized into a training set IT containing 60 images and a validation set IV containing 20 images. For this purpose, 30 images were taken from I1 and 30 images from I2 to form IT, while the remaining 10 images of each set compose IV. Accordingly, a training set and a validation set were established with the corresponding MOS, here referred to as MOST and MOSV. The training sets, IT and MOST, are then used for the actual metric design.
ARTICLE IN PRESS
538 U. Engelke et al. / Signal Processing: Image Communication 24 (2009) 525–547

MOST Weights
Aquisition
Δfi MOST Curve
wi Fitting
Difference
Features
Metric Computation MOSx
ft,i
It Mapping
Feature Feature ΔNHIQM LP-norm x
Ir Extraction Normalization
fr,i

Fig. 8. Framework for designing feature-based objective perceptual image quality metrics.

The validation sets, IV and MOSV , are used to evaluate the Table 6
Perceptual relevance weights of feature differences Df i for the images in
metrics ability to generalize to unknown images.
the training set.

Metric Weight Value


5.2. Metric design framework
Df 1 w1 0.819
A block diagram of the framework used in this paper Df 2 w2 0.413
Df 3 w3 0.751
to design RR objective perceptual image quality metrics is
Df 4 w4 0.182
shown in Fig. 8. A brief overview of the design process is Df 5 w5 0.385
given in the sequel with reference to this figure.
The first key operation in the transition from subjective
to objective perceptual image quality assessment is
executed within the process of feature weights acquisi- 5.3. Perceptual relevance of features
tion. As a prerequisite of weights acquisition, the different
features of the transmitted and received image are The Pearson linear correlation coefficient r P has been
extreme value normalized to allow for a meaningful chosen to reveal the extent by which the individual
weight association. As the RR design is focused on feature differences contribute to the overall perception of
detecting distortions between related features, the image quality. In this sense, it captures prediction
weights acquisition is performed with respect to feature accuracy referring here to the ability of a feature
differences Df i , i ¼ 1; . . . ; 5. Given the MOS values MOST difference to predict the subjective ratings with minimum
for the images in the training set IT and the related average error. Given a set of K data pairs ðuk ; vk Þ, this
feature differences Df i for each image, correlation coeffi- ability can be quantified by
cients between subjective ratings and feature differences are computed as weights w_i, i = 1, ..., 5 to reveal the feature relevance to the subjectively perceived quality:

    r_P = [ Σ_{k=1}^{K} (u_k − ū)(v_k − v̄) ] / [ √(Σ_{k=1}^{K} (u_k − ū)²) · √(Σ_{k=1}^{K} (v_k − v̄)²) ]   (16)

where u_k and v_k are the feature difference and the subjective rating related to the kth image, respectively, and ū and v̄ are the means of the respective data sets. This choice is motivated by the fact that the correlation coefficient explicitly characterizes the association between two variables, which are given here by pairs of ratings and difference feature metrics. The sign of the correlation value may be neglected as it only represents the direction (increase/decrease) in which one variable changes with the change of the other variable. In view of the above, the absolute values of the Pearson linear correlation coefficients r_P are computed as the perceptual weights w_i of the related features. A higher correlation coefficient then corresponds to a feature that contributes more significantly to the overall quality as perceived by the viewer, while a lower correlation coefficient means less perceptual significance. Also, if the correlation coefficient approaches zero, the relationship between the perceptual quality and the examined feature is not strongly developed.

It is then straightforward to compute a feature-based objective quality metric by applying a pooling function to condense the information to a single value x. Here, two metrics are proposed, namely ΔNHIQM and the relevance weighted L_p-norm.

The second essential component in moving from subjective to objective quality assessment relates to the curve fitting block as shown in Fig. 8. Its inputs are the MOS values MOS_T for the images in the training set I_T and the values of the objective perceptual quality metric x for each of these images. The relationship between subjective quality given by MOS_T and objective quality represented by x is then modeled by a suitable mapping function. The parameters of potential mapping functions can be obtained by using standard curve fitting techniques. The selection of suitable mapping functions is typically based on both goodness of fit measures and visual inspection of the fitted curve. The obtained mapping function f(x) can then be used to calculate predicted MOS values, MOS_x, for given values of the quality metric x.
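The relevance weighting based on Eq. (16) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code; the function names and example data are our own.

```python
# Perceptual relevance weighting (Section 5.3): each feature weight w_i is
# the absolute Pearson linear correlation of Eq. (16) between the series of
# feature differences Delta f_i over the training images and the MOS ratings.
import math

def pearson(u, v):
    """Pearson linear correlation coefficient r_P of Eq. (16)."""
    k = len(u)
    u_bar = sum(u) / k
    v_bar = sum(v) / k
    num = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    den = math.sqrt(sum((ui - u_bar) ** 2 for ui in u)) * \
          math.sqrt(sum((vi - v_bar) ** 2 for vi in v))
    return num / den

def relevance_weights(feature_diffs, mos):
    """w_i = |r_P| between the i-th feature difference series and the MOS."""
    return [abs(pearson(fd, mos)) for fd in feature_diffs]
```

The same Pearson coefficient is reused later, in Section 6.3, as the prediction accuracy measure.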
ARTICLE IN PRESS
U. Engelke et al. / Signal Processing: Image Communication 24 (2009) 525–547 539
Table 6 shows the values of the Pearson linear correlation coefficients, or feature weights, that were obtained for each of the five feature differences Δf_i, i = 1, ..., 5 for the training set when correlated to the associated MOS_T values. Accordingly, block boundary differences (Δf_1) appear to be the most relevant feature, followed by edge-based image activity (Δf_3), edge smoothness (Δf_2), image histogram statistics (Δf_5), and gradient-based image activity (Δf_4). This relates to blocking being the most annoying artifact, followed by ringing due to edge-based image activity, blur, intensity masking, and ringing due to gradient-based image activity. Similar findings have also been made by Farias et al. [9], who observed that blocking is more annoying than blur. The same group also found [10] that ringing is the least annoying artifact. This agrees with our feature metric Δf_4, which also received the smallest weight. On the other hand, the feature metric Δf_3 deployed in our paper measures ringing as well but received a higher weight. We believe that this outcome can be related to Δf_3 having a strong correlation with Δf_1 (see Tables 4 and 5), thus not only accounting for ringing but also for blocking artifacts.

It should be noted here that the relevance weights in Table 6 were obtained for the particular case of JPEG source encoding, where blocking artifacts are predominant over other artifacts such as blur. This may also contribute to the higher correlation weights for the edge-based features Δf_1 and Δf_3 as compared to the gradient-based features Δf_2 and Δf_4. Hence, the relevance weights may not be purely related to the perceptual relevance but also to the particular artifacts that are observed in the visual content. As such, one may obtain different relevance weights in case of other source encoders, such as JPEG2000.

5.4. RR objective metric computation

In the following two sections, we will consider two different pooling functions that are based on weighted combinations of the feature metrics. Firstly, we introduce NHIQM, which linearly combines extreme value normalized image features into a single quality value. Secondly, a perceptual relevance weighted version of the L_p-norm is proposed, which calculates a weighted sum of image feature differences between original and impaired image. In both cases, the respective image features are extracted with the metrics as summarized in Section 4.1, while the actual weights used for feature combination have been deduced as discussed in Section 5.3.

5.4.1. Normalized hybrid image quality metric

The proposed NHIQM is defined as a weighted sum of the extreme value normalized features as

    NHIQM = Σ_{i=1}^{I} w_i f_i   (17)

where w_i denotes the relevance weight of the associated feature f_i. Clearly, this RR metric is particularly beneficial for objective perceptual quality assessment in wireless imaging, as the reduced-reference is represented by only one single value for a given image. Accordingly, NHIQM can be communicated from the transmitter to the receiver whilst imposing very little stress on the bandwidth resources.

Regarding applications in wireless imaging, NHIQM can be calculated for the transmitted image I_t and received image I_r, resulting in the corresponding values NHIQM_t and NHIQM_r at the transmitter and receiver, respectively. Provided that the NHIQM_t value is communicated to the receiver, structural differences between the images at both ends may simply be represented by the absolute difference

    ΔNHIQM = |NHIQM_t − NHIQM_r|   (18)

5.4.2. Perceptual relevance weighted L_p-norm

The L_p-norm, also referred to as the Minkowski metric, is a distance measure commonly used to quantify similarity between two signals or vectors. In image processing it has been applied, for instance, with the percentage scaling method [28] and the combining of impairments in digital image coding [27].

In this paper, we incorporate the relevance weighting for the extreme value normalized features into the calculation of the L_p-norm. This modification of the L_p-norm shall be defined as follows:

    L_p = [ Σ_{i=1}^{I} w_i^p |f_{t,i} − f_{r,i}|^p ]^{1/p}   (19)

where f_{t,i} and f_{r,i} denote the ith feature value of the transmitted and the received image, respectively.

The Minkowski exponent p may be determined experimentally [28]. Alternatively, the Minkowski exponent p may be assigned a fixed value. In both cases, a higher value of p increases the impact of the dominant features on the overall metric. In the limit of p approaching infinity, we obtain

    L_∞ = max_{i=1,...,I} |f_{t,i} − f_{r,i}|   (20)

meaning that the largest absolute feature value difference solely dominates the norm. We have found [7] that values beyond p = 2 do not improve the quality prediction performance of the modified L_p-norm given in (19). We believe that this characteristic is due to the perceptual relevance weights obtained for each feature inherently accounting for the dominance of the particular features. In the sequel, we therefore consider the modified L_p-norm for Minkowski exponents of p = 1 and 2 only.

Although the L_p-norm belongs to the class of RR metrics, it requires more transmission resources compared to ΔNHIQM, as all feature values need to be communicated from the transmitter to the receiver. On the other hand, the information about each of the feature degradations may provide further insights into the channel induced distortions. Hence, overhead may be traded off at the expense of a reduction of structural degradation information by neglecting feature metrics that received low perceptual relevance weights.
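The pooling functions of Eqs. (17)–(19) can be sketched as follows. This is a minimal Python sketch under the paper's definitions; the function names and the placeholder weights and feature values in any example usage are our own.

```python
# Pooling functions of Section 5.4 (illustrative sketch):
# - NHIQM (Eq. (17)): relevance-weighted sum of extreme value normalized features.
# - Delta-NHIQM (Eq. (18)): absolute difference of transmitter- and
#   receiver-side NHIQM values.
# - Weighted Lp-norm (Eq. (19)): pools per-feature differences directly.

def nhiqm(weights, features):
    """NHIQM = sum_i w_i * f_i (Eq. (17))."""
    return sum(w * f for w, f in zip(weights, features))

def delta_nhiqm(f_tx, f_rx, weights):
    """Delta-NHIQM = |NHIQM_t - NHIQM_r| (Eq. (18))."""
    return abs(nhiqm(weights, f_tx) - nhiqm(weights, f_rx))

def weighted_lp_norm(f_tx, f_rx, weights, p=2):
    """L_p = [sum_i w_i^p * |f_t,i - f_r,i|^p]^(1/p) (Eq. (19))."""
    return sum((w * abs(ft - fr)) ** p
               for w, ft, fr in zip(weights, f_tx, f_rx)) ** (1.0 / p)
```

Note that ΔNHIQM pools the features before differencing (one value to transmit), while the L_p-norm differences the features before pooling (all feature values to transmit), which is exactly the overhead trade-off discussed above.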
5.5. Mapping functions

Due to non-linear quality processing in the HVS, artifacts and quality do not follow a linear relationship. To account for this phenomenon, a mapping function is applied to the quality metrics. In general, an objective quality metric x may be mapped using a non-linear mapping function f(x). The mapping function may then be used to determine the predicted mean opinion score MOS_x for a given x as

    MOS_x = f(x)   (21)

Specifically, we will consider the following three classes of mapping functions:

    MOS_x ≅ { Σ_{j=0}^{m} p_j x^j               (Polynomial)
            { Σ_{j=1}^{m} a_j e^{b_j x}         (Exponential)   (22)
            { 100 / [1 + e^{−λ_1 (x − λ_2)}]    (Logistic)

where the coefficients p_0, ..., p_m of the polynomial function of degree m, the initial values a_1, ..., a_m and growth/decay parameters b_1, ..., b_m of the exponential function of order m, and the parameters λ_1, λ_2 of the logistic function are to be determined through curve fitting based on the given experimental data from the training set.

These three classes of mapping functions have been chosen as candidates for quality prediction due to the following reasons:

• Polynomial functions provide sufficient flexibility to support simple empirical prediction.
• Exponential functions are imposed to enable a good fit to experimental data over the middle-to-upper range of the quality impairment measure [20] and may be less prone to overfitting compared to functions with many parameters.
• Logistic functions facilitate the mapping of quality impairment measures into a finite interval. They produce scale compressions at the high and low extremes of quality while progressing approximately linearly in the range between these extremes.

Standard curve fitting techniques have been used to deduce the parameters of the mapping functions that best describe the relationship between subjective ratings and the objective perceptual quality metric with respect to a given goodness of fit measure. A mapping function obtained in this way translates an objective perceptual quality metric x into predicted MOS, MOS_x. The goodness of fit between MOS and predicted MOS can be specified by either of the following statistics:

• R² captures the degree by which variations in the MOS values are accounted for by the fit. It can assume any value in the interval [0,1] with a good fit being close to 1.
• Root mean squared error (RMSE) is referred to as the standard error of the fit with a good fit indicated by an RMSE value close to 0.
• Sum of squares due to error (SSE) represents the total deviation between predicted MOS and MOS from the experiments. The smaller the SSE value, the better the fit.

The Matlab Curve Fitting Toolbox was used to find the parameters of the considered mapping functions.

Table 7
Mapping functions f(x) = MOS_NHIQM, x = ΔNHIQM, and their goodness of fit.

  Function                            Parameters                       R²     RMSE    SSE

  Polynomial
  p_1 x + p_0                         p_1 = −97.8, p_0 = 77.45         0.71   12.78   9472
  p_2 x² + p_1 x + p_0                p_2 = 149.5, p_1 = −199.4,       0.79   11.07   6982
                                      p_0 = 87.88
  p_3 x³ + p_2 x² + p_1 x + p_0       p_3 = −493.9, p_2 = 672.2,       0.82   10.17   5792
                                      p_1 = −338.3, p_0 = 94.87

  Exponential
  a_1 e^{b_1 x}                       a_1 = 88.79, b_1 = −2.484        0.79   10.76   6714
  a_1 e^{b_1 x} + a_2 e^{b_2 x}       a_1 = 69.76, b_1 = −1.719,       0.83   10.01   5612
                                      a_2 = 32.05, b_2 = −17.39
  a_1 e^{b_1 x} + a_2 e^{b_2 x}       a_1 = 63.18, b_1 = −3.056,       0.80   11.12   6678
    + a_3 e^{b_3 x}                   a_2 = −175.0, b_2 = 0.1434,
                                      a_3 = 198.2, b_3 = −0.041

  Logistic
  100 / [1 + e^{−λ_1 (x − λ_2)}]      λ_1 = −4.613, λ_2 = 0.262        0.72   12.63   9263

The mapping functions have been derived for both ΔNHIQM and the relevance weighted L_p-norm; however, only the results for ΔNHIQM will be presented in the following. The results are provided in Table 7 along with the different goodness of fit measures. A visual examination of the fitted mapping functions is supported by Figs. 9–11, which also show the 95% confidence interval for each fit.

As far as the polynomial functions are concerned, it could be concluded at first sight from looking only at the goodness of fit statistics that the cubic polynomial would perform similarly favorably in perceptual quality prediction as the exponential functions. However, visual inspection of Fig. 9 suggests the opposite, as the good fit applies only for the given data range and tends to diverge outside this range. For example, an increase of the objective perceptual quality metric beyond the value of 0.8 would actually predict "negative" MOS values (see Fig. 9(c)).
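As an illustration of the mapping step in (21) and (22), the following Python sketch evaluates the three candidate function classes for a given metric value x. The names are our own, and the example exponential parameters in the docstring are the single-exponential fit reported in Table 7; any other parameter values would be obtained by curve fitting in the same way.

```python
# Illustrative evaluation of the three mapping-function classes of Eq. (22).
import math

def map_polynomial(x, coeffs):
    """MOS_x = sum_j p_j * x^j, with coeffs = [p_0, p_1, ..., p_m]."""
    return sum(p * x ** j for j, p in enumerate(coeffs))

def map_exponential(x, a, b):
    """MOS_x = sum_j a_j * exp(b_j * x), e.g. a = [88.79], b = [-2.484]."""
    return sum(aj * math.exp(bj * x) for aj, bj in zip(a, b))

def map_logistic(x, lam1, lam2):
    """MOS_x = 100 / (1 + exp(-lam1 * (x - lam2)))."""
    return 100.0 / (1.0 + math.exp(-lam1 * (x - lam2)))
```

Note that the logistic mapping always returns a value inside (0, 100), which is exactly the finite-interval property named in the third bullet above.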
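The goodness-of-fit statistics reported in Table 7 can be computed as in the following sketch. The function name is our own; note that RMSE is computed here with a plain 1/n normalization, whereas curve-fitting tools may normalize by the residual degrees of freedom instead.

```python
# Goodness-of-fit statistics between observed MOS and predicted MOS:
# R^2, RMSE and SSE, as used to compare the fits in Table 7.
import math

def goodness_of_fit(mos, mos_pred):
    """Return (R^2, RMSE, SSE) for a set of MOS/predicted-MOS pairs."""
    n = len(mos)
    mean = sum(mos) / n
    sse = sum((m - p) ** 2 for m, p in zip(mos, mos_pred))  # sum of squared errors
    sst = sum((m - mean) ** 2 for m in mos)                 # total sum of squares
    r2 = 1.0 - sse / sst
    rmse = math.sqrt(sse / n)
    return r2, rmse, sse
```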
Fig. 9. Polynomial mapping functions: (a) linear, (b) quadratic, (c) cubic. Each panel plots MOS versus ΔNHIQM for the image samples, together with the fitted curve and its 95% confidence interval.

Fig. 10. Exponential mapping functions: (a) exponential, (b) double exponential, (c) triple exponential. Each panel plots MOS versus ΔNHIQM for the image samples, together with the fitted curve and its 95% confidence interval.

As higher-degree polynomials may even result in more severe overfitting, the class of polynomials has little to offer for use in objective perceptual quality assessment.

In contrast to the polynomial functions, favorable fitting has been obtained for all three considered exponential mapping functions, not only in terms of
goodness of fit measures but also confirmed by visual inspection (see Fig. 10). However, it can be observed that the triple exponential function performs similarly to the exponential function but at the price of a larger computational complexity due to its more involved analytical expression. As such, the triple exponential function may not be considered further.

As for the logistic mapping function, the goodness of fit measures indicate a rather poor fit to the data from the subjective experiments. Especially, the compression at the high end of the quality scale produces disagreement with MOS (see Fig. 11).

In light of the above findings, both the exponential and double exponential mapping are selected for further consideration in the metric design.

6. Evaluation of image quality metrics

With the exponential and double exponential mapping being identified as suitable models for objectively predicting perceptual image quality, an evaluation of the prediction performance of ΔNHIQM and the L_p-norm on the training set (60 images in I_T and related MOS_T) and its generalization to the validation set (20 images in I_V and related MOS_V) is given in this section.

6.1. Other objective quality metrics for comparison

We have selected contemporary quality metrics that have been proposed in recent years to allow for a performance comparison with the proposed feature-based ΔNHIQM and the L_p-norm. Specifically, the reduced-reference image quality assessment (RRIQA) technique proposed in [40] is chosen as a prominent member of the class of RR metrics. In addition, the structural similarity (SSIM) index [39], the visual information fidelity (VIF) criterion [31], the visual signal-to-noise ratio (VSNR) [4], and the peak signal-to-noise ratio (PSNR) [25] are chosen as the FR metrics. It is noted that FR metrics would not be suitable for the considered wireless imaging scenario but rather serve to benchmark prediction performance, which can be expected to be high due to the utilization of the reference image.

• RRIQA: This metric [40] is based on a natural image statistic model in the wavelet domain. The image distortion measure is obtained from the estimation of the Kullback–Leibler distance between the marginal probability densities of wavelet coefficients in the subbands of the reference and distorted images as follows:

    D = log₂( 1 + (1/D₀) Σ_{k=1}^{K} |d̂^k(p^k ‖ q^k)| )   (23)

where the constant D₀ is used as a scaler of the distortion measure, d̂^k(p^k ‖ q^k) denotes the estimation of the Kullback–Leibler distance between the probability density functions p^k and q^k of the kth subband in the transmitted and the received image, and K is the number of subbands. The overhead needed to represent the reduced-reference is given as 162 bits [40].

• SSIM: The SSIM index [39] is based on the assumption that the HVS is highly adapted to the extraction of structural information from the visual scene. As such, SSIM predicts structural degradations between two images based on simple intensity and contrast measures. The final SSIM index is given by

    SSIM(x, y) = [(2 μ_x μ_y + C₁)(2 σ_xy + C₂)] / [(μ_x² + μ_y² + C₁)(σ_x² + σ_y² + C₂)]   (24)

where μ_x, μ_y and σ_x, σ_y denote the mean intensity and contrast of image signals x and y, respectively. The constants C₁ and C₂ are used to avoid instabilities in the structural similarity comparison that may occur for certain mean intensity and contrast combinations (μ_x² + μ_y² = 0, σ_x² + σ_y² = 0).

• VIF: The VIF criterion [31] approaches the image quality assessment problem from an information theoretical point of view. In particular, the degradation of visual quality due to a distortion process is measured by quantifying the information available in a reference image and the amount of this reference information that can still be extracted from the test image. As such, the VIF criterion measures the loss of information between two images. For this purpose, natural scene statistics and, in particular, Gaussian scale mixtures (GSM) in the wavelet domain are used to model the images. The proposed VIF metric is given by

    VIF = [ Σ_{j∈subbands} I(C⃗^{N,j}; F⃗^{N,j} | s^{N,j}) ] / [ Σ_{j∈subbands} I(C⃗^{N,j}; E⃗^{N,j} | s^{N,j}) ]   (25)

where C⃗ denotes the GSM, N denotes the number of GSM used, and E⃗ and F⃗ denote the visual output of an HVS model, respectively, for the reference and test image.

• VSNR: The VSNR [4] metric deploys a two-stage approach based on near-threshold and suprathreshold properties of the HVS to quantify image fidelity. The first stage determines whether distortions are visible in an image. For this purpose, contrast thresholds for distortion detection are determined using wavelet-based models of visual masking. If the distortions are below the threshold, the quality of the image is assumed to be perfect and the algorithm is terminated. If the distortions are visible, a second stage implements perceived contrast and global precedence properties of the HVS to determine the impact of the distortions on perceived quality. The final VSNR metric is then given as

    VSNR = 20 log₁₀( C(I) / [α d_pc + (1 − α) d_gp/√2] )   (26)

where C(I) denotes the root-mean-squared contrast of the original image I, d_pc and d_gp are, respectively, measures of perceived contrast and global precedence
Fig. 11. Logistic mapping function. The panel plots MOS versus ΔNHIQM for the image samples, together with the logistic fit and its 95% confidence interval.

disruption, and α is a weight regulating the relative contributions of d_pc and d_gp.

• PSNR: Image fidelity is an indication of the similarity between the reference and distorted images and measures pixel-by-pixel closeness between those pairs. The PSNR [25] is the most commonly used fidelity metric. It measures the fidelity difference of two image signals I_R(x, y) and I_D(x, y) on a pixel-by-pixel basis as

    PSNR = 10 log₁₀(Z² / MSE)   (27)

where Z is the maximum pixel value, here 255. The mean square error is given as

    MSE = (1/(XY)) Σ_{x=1}^{X} Σ_{y=1}^{Y} [I_R(x, y) − I_D(x, y)]²   (28)

where X and Y denote horizontal and vertical image dimensions, respectively. Despite being an FR metric, PSNR usually does not correlate well with the visual quality as perceived by a human observer [37].

6.2. Computational complexity and amount of reference information

In the following, we will discuss the computational complexity of the considered metrics and the amount of reference information that is needed in order to assess the quality of a test image. The details are summarized in Table 8.

The computational complexity is measured in terms of the time that each of the metrics needs to assess the quality of a single image in our sets I₁ and I₂. Here, we have computed each metric over all 80 images and then determined the average time. The metrics were run on a laptop computer containing an Intel T2600 Dual Core processor with 2.16 GHz and 4 GB of RAM. In order to allow for a fair comparison, the publicly available Matlab implementation of each metric was used even though there may be other implementations available for some of the metrics.

Table 8
Computational complexity of the metrics and amount of reference information needed.

  Type   Name       Computation time/image (s)   Reference information

  RR     ΔNHIQM     1.55                         17 bits
         L_p-norm   1.55                         85 bits
         RRIQA      7.12                         162 bits

  FR     SSIM       0.37                         Full image
         VIF        0.92                         Full image
         VSNR       0.33                         Full image
         PSNR       0.05                         Full image

It can be seen from Table 8 that the computational complexity of all FR metrics is lower as compared to the RR metrics. Amongst the FR metrics, PSNR outperforms by far the other considered metrics in terms of computational complexity. Regarding the RR metrics, it is observed that both ΔNHIQM and the L_p-norm are significantly less complex than RRIQA.

In the context of wireless imaging, the amount of reference information needed for quality assessment determines the overhead of data that needs to be transmitted over the channel along with the actual image. From Table 8 one can see that the reference information is significantly lower for both ΔNHIQM and the L_p-norm as compared to RRIQA. The particularly small reference information for ΔNHIQM results from the fact that only a single value NHIQM_t needs to be transmitted. On the other hand, with the L_p-norm five features need to be transmitted, resulting in a five times higher overhead. However, as discussed in Section 5.4.2, the number of features used may be traded off with the transmission overhead by neglecting features of low perceptual relevance. As for the FR metrics, the reference image is needed for the quality assessment and as such, the size of the image determines the amount of reference information. Independent of the image size, however, the amount of reference information would be magnitudes higher as compared to the RR metrics.

6.3. Prediction performance measures

The quality prediction performance of the considered objective metrics will be quantified in terms of accuracy, monotonicity, and consistency as recommended by the Video Quality Experts Group (VQEG) [34].

The prediction accuracy of each objective quality metric will be quantified using the Pearson linear correlation coefficient as defined in (16). The prediction monotonicity will be measured by the non-parametric Spearman rank order coefficient

    r_S = [ Σ_{k=1}^{K} (w_k − w̄)(g_k − ḡ) ] / [ √(Σ_{k=1}^{K} (w_k − w̄)²) · √(Σ_{k=1}^{K} (g_k − ḡ)²) ]   (29)

where w_k and g_k denote the ranks of the predicted scores and the subjective scores, respectively, and w̄ and ḡ are the midranks of the respective data sets. This measure is used
to quantify if changes (increase or decrease) in one variable are followed by changes (increase or decrease) in another variable, irrespective of the magnitude of the changes.

The prediction consistency is identified by the outlier ratio. A data pair (u_k, v_k) may be declared an outlier when the absolute difference between u_k and v_k is greater than a certain threshold. As suggested in [34], the threshold shall be chosen at least twice as large as the MOS standard deviation σ_{v_k} such that

    |u_k − v_k| > 2 σ_{v_k}   (30)

Then, the outlier ratio can be calculated as

    r_O = R_O / R   (31)

where R_O denotes the total number of outliers and R is the size of the data set.

Fig. 12. MOS versus predicted MOS, MOS_NHIQM: (a) exponential mapping, (b) double exponential mapping. Each panel plots MOS versus MOS_NHIQM for the image samples, together with a linear fitting curve and its 95% confidence interval.

6.4. Linear regression

Prior to the evaluation of prediction performance for the considered objective image quality metrics, the favorable mapping functions will be used to relate the predicted MOS values to the actual MOS values from the subjective experiments. The predicted scores, MOS_T and MOS_V, respectively, are calculated for each image in the training set I_T and the validation set I_V.

As an example, Fig. 12 shows the result for ΔNHIQM using the exponential and double exponential mapping functions. Here, the MOS values from the subjective experiments are plotted versus the predicted MOS values, MOS_NHIQM, for the images in the training set I_T. In addition, a linear function has been fitted to the data set and is presented along with the 95% confidence interval. It should be mentioned that the fitting curves for both exponential and double exponential mapping produce the desired linear relationship between predicted MOS and MOS. Specifically, the range between 0 and 100 is nicely captured for predicted MOS and MOS. The prediction performance measures will be calculated for these post-mapped relationships in addition to the actual metric values.

6.5. Analysis of mapping parameters

The evaluation of the prediction performance of ΔNHIQM, the L₁-norm, and the L₂-norm will be presented here and compared to RRIQA, SSIM, VIF, VSNR, and PSNR. For this purpose, the parameters of the exponential and double exponential mapping functions have been derived for all of these metrics, following the methodology as outlined in Section 5.5.

Table 9
Parameters of prediction functions for objective quality metrics using exponential and double exponential mapping.

                     Exponential        Double exponential
  Type   Name        a₁      b₁         a₁      b₁       a₂            b₂

  RR     ΔNHIQM      88.79   −2.484     69.76   −1.720   32.04         −17.39
         L₁-norm     87.63   −1.840     68.15   −1.251   34.31         −13.46
         L₂-norm     90.20   −2.820     69.83   −1.950   32.28         −16.24
         RRIQA       102.1   −0.160     101.3   −0.157   3.486·10⁻¹⁴   3.701

  FR     SSIM        13.91   1.715      31.34   0.446    3.964·10⁻⁷    18.58
         VIF         4.291   2.886      14.06   1.366    7.721·10⁻¹⁴   33.98
         VSNR        25.662  0.033      21.964  0.041    1.388·10⁻⁶    0.387
         PSNR        18.33   0.036      22.68   0.029    0             0.029

Table 9 presents the parameters of the prediction functions deduced from curve fitting of the considered quality metrics to the MOS values in the training set of images using the exponential and double exponential mapping. It can be seen from the numerical values of the parameters that distinct exponential mapping functions are produced in terms of growth and decay. The negative
decay parameters for the feature-based objective perceptual quality metrics ΔNHIQM, L₁-norm, and L₂-norm, as well as RRIQA, relate to the fact that these RR metrics represent image degradation. Thus, larger values of these metrics correspond to lower perceptual quality. In contrast, the FR metrics SSIM, VIF, VSNR, and PSNR measure image similarity of some sort, which is represented by the positive decay parameter in their exponential mapping functions. In these cases, a larger metric value corresponds to higher perceptual quality. As for the double exponential mapping functions, these are pronounced only for the feature-based objective perceptual quality metrics ΔNHIQM and the L₁- and L₂-norm. Specifically, the growth and decay parameters for both involved exponential functions are substantially different from zero. This is not the case for the other considered quality metrics, RRIQA, SSIM, VIF, and VSNR, with the double exponential mapping functions degenerating to an exponential function for small metric values. Due to the initial value a₂ being close to zero, the second exponential function can contribute to the prediction only for extremely large metric values, although this contribution may still be insignificant compared to the first exponential function involved. In the case of PSNR, the initial value of the second exponential function is actually obtained from the curve fitting as being zero. Accordingly, the double exponential mapping function in fact degenerates to an exponential mapping function.

6.6. Evaluation of quality prediction performance

Given the parameters of the prediction functions for the examined quality metrics, the prediction performance of these metrics is presented in Table 10. In particular, the prediction accuracy is quantified by the Pearson linear correlation coefficient. It has been calculated on the basis of the 60 images in the training set I_T and the 20 images of the validation set I_V. Moreover, prediction accuracy has been calculated for the relationship between MOS and the pure metric as well as for the relationship between MOS and predicted MOS using exponential mapping and double exponential mapping.

Table 10
Prediction performance of objective quality metrics, predicted MOS using exponential mapping, and predicted MOS using double exponential mapping.

                     Accuracy                                          Monotonicity      Consistency
                     Metric          Exponential     2-Exponential     Metric, Mapping   Exponential     2-Exponential
  Type   Name        r_P,T   r_P,V   r_P,T   r_P,V   r_P,T   r_P,V     r_S,T   r_S,V     r_O,T   r_O,V   r_O,T   r_O,V

  RR     ΔNHIQM      0.843   0.840   0.892   0.888   0.910   0.860     0.867   0.892     0.017   0       0       0.050
         L₁-norm     0.833   0.841   0.873   0.897   0.895   0.893     0.854   0.901     0.017   0       0.017   0
         L₂-norm     0.845   0.846   0.888   0.884   0.903   0.878     0.875   0.890     0.017   0       0       0
         RRIQA       0.821   0.772   0.829   0.749   0.831   0.752     0.786   0.758     0.050   0.050   0.050   0.050

  FR     SSIM        0.582   0.434   0.632   0.511   0.701   0.605     0.558   0.347     0.117   0.050   0.100   0.050
         VIF         0.713   0.737   0.789   0.788   0.877   0.795     0.813   0.729     0.083   0       0.033   0
         VSNR        0.766   0.696   0.758   0.686   0.783   0.686     0.686   0.510     0.083   0       0.050   0.050
         PSNR        0.742   0.712   0.738   0.709   0.741   0.711     0.638   0.615     0.100   0       0.150   0

As can be seen from the numerical results in Table 10 for the metric training, the prediction accuracy of the feature-based metrics ΔNHIQM, L₁-norm, and L₂-norm outperforms that of the other considered metrics, RRIQA, SSIM, VIF, VSNR, and PSNR. This applies for the training with respect to all three cases, i.e. the pure metric prior to mapping and after mapping with exponential and double exponential functions. The comparison between the feature-based quality metrics indicates the comparable performance of ΔNHIQM and the L_p-norms.

Similar observations about accuracy can be made for metric validation. In terms of metric generalization to the unknown images from the validation set, the feature-based quality metrics significantly outperform the other considered metrics in accuracy. While ΔNHIQM and the L_p-norms provide an accuracy over 80% and in some cases close to 90%, all other considered metrics fall below the 80% threshold of generalization accuracy. It is also observed that the largest accuracy, being r_P = 0.91 for ΔNHIQM on the training set using double exponential mapping, does not generalize as well as for the pure metric or exponential mapping. This indicates that fitting ΔNHIQM to a double exponential mapping may already produce some degree of overfitting. Similar trends to overfitting using double exponential mapping can be observed with the L₂-norm and VIF. In view of this and the degeneration of double exponential mapping to exponential mapping with some metrics, the exponential function may in fact constitute the most preferred mapping in the considered context of wireless imaging.

Let us now compare the prediction monotonicity of the proposed image quality metrics with the other state of the art image quality metrics. As all relationships follow strictly decreasing or increasing functions, differentiation between metric, exponential, and double exponential mapping is not required, as ranks are kept the same for all three cases. The results shown in Table 10 reveal that the feature-based ΔNHIQM approach and the L_p-norms perform favorably over the remaining four metrics, with prediction monotonicity well above 80% for both metric training and validation. From the other metrics, only VIF shows a satisfactory prediction monotonicity of r_S = 0.813
for the training but does not generalize well to the unknown images.

Finally, the prediction consistency for the training of both feature-based metrics, ΔNHIQM and the Lp-norms, is superior compared to the other four metrics. It is also observed that the prediction consistency for the validation of ΔNHIQM is better when using the exponential mapping compared to the double exponential mapping.

7. Conclusions

In this paper, the design of RR objective perceptual image quality metrics for wireless imaging has been presented. Instead of focusing only on artifacts due to source encoding, the design follows an end-to-end quality approach that accounts for the complex nature of artifacts that may be induced by a wireless communication system. As such, the proposed image quality metrics constitute alternatives to traditional link layer metrics and may readily be utilized for in-service quality monitoring and resource management purposes. Specifically, both ΔNHIQM and the perceptual relevance weighted Lp-norm are designed with respect to low computational complexity and low overhead, to measure quality degradations in a wireless communication system, and to account for the different structural artifacts that have been observed in our distortion model of a wireless link. Here, structural artifacts are detected by related feature metrics.

The general framework for the design of RR objective perceptual image quality metrics is outlined. It comprises feature extraction, feature normalization, calculation of difference features, relevance weight acquisition, and feature pooling. In addition, curve fitting techniques are used to find the parameters of suitable mapping functions that can translate objective quality metrics into predicted MOS. The transition from subjective to objective perceptual quality is executed in the process of relevance weight acquisition and the derivation of the mapping functions. In both of these parts of the design framework, the results of subjective experiments are engaged to train our feature-based quality metrics. Moreover, a detailed description and statistical analysis of the subjective data gathered in these experiments and of the related objective feature data is provided.

The evaluation of the quality prediction performance reveals that ΔNHIQM and the perceptual relevance weighted Lp-norm both correlate similarly well with human perception of images. This holds not only for the training of the metrics but also for the generalization to unknown images. Furthermore, the numerical results show that both feature-based RR metrics outperform even the considered state-of-the-art FR metrics in prediction performance. As the reduced-reference overhead associated with the calculation of ΔNHIQM is condensed to only a single number, this approach may be the more favorable choice for use in wireless imaging applications compared to the perceptual relevance weighted Lp-norm, which requires all involved features to be communicated from the transmitter to the receiver.

Acknowledgments

The authors wish to thank the staff and students of the Western Australian Telecommunications Research Institute, Perth, Australia, and the School of Engineering at the Blekinge Institute of Technology, Ronneby, Sweden, for participating in the subjective experiments. We would also like to thank the anonymous reviewers for their highly valuable comments, which helped to significantly improve the quality of this article.
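To make the overhead trade-off discussed in the conclusions concrete, the sketch below illustrates the two pooling strategies and an exponential mapping to predicted MOS. The relevance weights, feature values, mapping form a·exp(b·x), and its constants are illustrative assumptions only; the paper derives its actual weights and mapping parameters from the subjective experiments.

```python
import numpy as np

# Hypothetical relevance weights for five normalized artifact features
# (the paper measures artifact-related features such as blocking and blur);
# these values are made up, not the experimentally derived weights.
W = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

def nhiqm(features, weights=W):
    """Relevance-weighted sum of normalized feature values (pooling step)."""
    return float(np.dot(weights, features))

def delta_nhiqm(f_tx, f_rx, weights=W):
    """Reduced-reference metric: only nhiqm(f_tx), a single number, has to
    be conveyed from the transmitter to the receiver."""
    return abs(nhiqm(f_tx, weights) - nhiqm(f_rx, weights))

def weighted_lp_norm(f_tx, f_rx, p=2, weights=W):
    """Perceptual relevance weighted Lp-norm of the feature differences;
    here all transmitter-side feature values must be communicated."""
    diff = np.abs(np.asarray(f_tx) - np.asarray(f_rx))
    return float(np.sum(weights * diff ** p) ** (1.0 / p))

def predicted_mos(metric_value, a=4.5, b=-2.0):
    """Map a metric value to predicted MOS via an exponential function;
    a and b would be obtained by curve fitting against subjective MOS."""
    return a * np.exp(b * metric_value)
```

Note how delta_nhiqm condenses the side information to one scalar, whereas weighted_lp_norm requires all transmitter-side feature values, which is exactly the distinction drawn in the conclusions.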

