Article history:
Received 2 June 2008
Received in revised form 12 June 2009
Accepted 16 June 2009

Keywords:
Objective perceptual image quality
Normalized hybrid image quality metric
Perceptual relevance weighted Lp-norm
Reduced-reference
Wireless imaging

Abstract

The rapid growth of third generation and the development of future generation mobile systems have led to an increase in the demand for image and video services. However, the hostile nature of the wireless channel makes the deployment of such services much more challenging than in the case of a wireline system. In this context, the importance of taking care of user satisfaction with service provisioning as a whole has been recognized. The related user-oriented quality concepts cover end-to-end quality of service and subjective factors such as experiences with the service. To monitor quality and adapt system resources, performance indicators that represent service integrity have to be selected and related to objective measures that correlate well with the quality as perceived by humans. Such objective perceptual quality metrics can then be utilized to optimize quality perception associated with applications in technical systems.
In this paper, we focus on the design of reduced-reference objective perceptual
image quality metrics for use in wireless imaging. Specifically, the normalized hybrid
image quality metric (NHIQM) and a perceptual relevance weighted Lp -norm are
designed. The main idea behind both feature-based metrics relates to the fact that the
human visual system (HVS) is trained to extract structural information from the viewing
area. Accordingly, NHIQM and Lp -norm are designed to account for different structural
artifacts that have been observed in our distortion model of a wireless link. The extent
to which individual artifacts are present in a given image is obtained by measuring
related image features. The overall quality measure is then computed as a weighted
sum of the features with the respective perceptual relevance weights obtained from
subjective experiments. The proposed metrics differ mainly in the pooling of the
features and amount of reduced-reference produced. While NHIQM performs the
pooling at the transmitter of the system to produce a single value as reduced-reference,
the Lp -norm requires all involved feature values from the transmitted and received
image to perform the pooling on the feature differences at the receiver. In addition, non-
linear mapping functions are developed that relate the metric values to predicted mean
opinion scores (MOS) and account for saturations in the HVS. The evaluation of
prediction performance of NHIQM and the Lp -norm reveals their excellent correlation
with human perception in terms of accuracy, monotonicity, and consistency. This holds
not only for the prediction performance on images taken for the training of the metrics
but also for the generalization to unknown images. In addition, it is shown that the NHIQM approach and the perceptual relevance weighted Lp-norm outperform other prominent objective quality metrics in prediction performance.
© 2009 Elsevier B.V. All rights reserved.

Corresponding author.
E-mail addresses: [email protected] (U. Engelke), [email protected] (M. Kusuma), [email protected] (H.-J. Zepernick), [email protected] (M. Caldera).
doi:10.1016/j.image.2009.06.005
U. Engelke et al. / Signal Processing: Image Communication 24 (2009) 525–547
source coding to communicate the RR signal. These metrics may be applicable in an image communication context due to their low computational complexity. However, the ability of these metrics to accurately predict perceived visual quality is doubtful due to the poor quality prediction performance of PSNR.

1.2. Overview of the proposed metric design

In view of the above, this paper focuses on the development of RR objective perceptual image quality metrics that are applicable in a wireless imaging context. As such, image impairments representative of a wireless imaging system are produced to constitute the basis of the design framework. In addition, particular care has been taken to limit the overhead needed for communicating reduced-reference information and hence conserve the scarce bandwidth resources allocated to wireless systems. Furthermore, feature extraction algorithms are selected to have small computational complexity in order not to drain battery power at the wireless handheld device and in turn support longer service time.

Specifically, images in the widely adopted Joint Photographic Experts Group (JPEG) format are examined, with typical impacts of a mobile communication system included through a simulated wireless link. This system under test enabled us to produce artifacts beyond those inflicted purely by lossy source encoding and to account also for end-to-end degradations caused by a transmission system. In particular, the artifacts of blocking, blur, ringing, masking, and lost blocks have been observed, ranging from extreme to almost invisible presence.

The information about the individual artifacts in an image can be deduced from related image features such as edges, image activity (IA), and histogram statistics. The extent to which the considered artifacts exist in a given image can therefore be quantified by using selected image feature extraction algorithms. As some artifacts influence the perceived quality more strongly than others, perceptual relevance weights are given to the associated image features. Clearly, subjective experiments and their analysis are not only instrumental but critical in the process of revealing the specific values of the perceptual relevance weights. For this reason, we conducted subjective image quality experiments in two independent laboratories. The particular values of the weights were deduced as Pearson linear correlation coefficients between the related features and the mean opinion scores (MOS) from the subjective experiments. In this respect, the perceptual relevance weights obtained from analyzing the subjective data constitute a key component in the transition from subjective quality prediction methods to an automated quality assessment that would be suitable for real-time applications. Given these perceptual relevance weights, an objective perceptual image quality metric may then be designed to exploit image feature values and their weights within a suitable pooling process. In this paper, we consider two feature-based objective perceptual quality metrics that mainly differ in the pooling process and the amount of reduced-reference, as follows.

Firstly, the normalized hybrid image quality metric (NHIQM) is designed. It operates on extreme-value-normalized image features from which it produces a weighted sum with respect to the relevance of the involved features. The result is a single value that can be communicated from transmitter to receiver, where it is utilized as reduced-reference information. The same processing is performed on the received image, resulting in the related NHIQM value. The absolute difference between the NHIQM values of the transmitted and received image constitutes the objective perceptual quality metric and is used to detect distortions.

Secondly, we consider a perceptual relevance weighted Lp-norm as a means of pooling the image features. Specifically, the Lp-norm is applied here to detect differences between features [7,10]. In this case, the pooling at the transmitter is omitted, but this requires the features to be transmitted over the channel to the receiver. At the receiver, the differences between the transmitted and received features are combined into an overall quality metric. This approach allows tracking of degradations for each of the involved features. On the other hand, the amount of reduced-reference overhead is increased compared to the NHIQM-based approach.

The design of both feature-based RR metrics, NHIQM and Lp-norm, follows the same methodology. It comprises the selection of suitable feature extraction algorithms, the feature extraction for image samples of a training set, normalization of the calculated feature values, and the acquisition of the perceptual relevance weights from the subjective experiments. A non-linear mapping function is derived in a final step that relates the objective perceptual quality metric to predicted MOS. In this way, non-linearities in the HVS with respect to the processing of quality degradations can be accounted for. The non-linear mapping function is derived using curve fitting methods where, again, the MOS from the subjective experiments are essential in deriving the parameters of the mapping functions.

A comprehensive evaluation of the prediction performance of NHIQM and the Lp-norm is provided in terms of accuracy, monotonicity, and consistency [34]. These performance measures are given for the metric design on a training set of images and for the generalization to unknown images. It turns out that the proposed feature-based metrics outperform other considered RR and FR metrics in the context of wireless imaging distortions and with respect to the above prediction performance measures.

1.3. Contributions of this work

Considering the above, this paper contributes a framework for image quality metric design in a wireless communication system. As such, the metrics proposed in this paper have been designed to be able to measure quality degradation during image transmission using an RR approach. Unlike other RR metrics from the literature, the metrics in this paper are designed based on a set of test images that take into account the complex nature of a
wireless communication system, rather than just accounting for source coding artifacts or additional noise. Furthermore, low computational complexity and low overhead in terms of reduced-reference have been major design issues in order to put low burdens on the communication system.

A statistical analysis of the experiments that we conducted in two independent laboratories reveals insight into the subjectively perceived quality of wireless imaging distortions. In addition, a statistical and correlation analysis of objective feature metrics provides further insight into the artifacts observed in wireless imaging and the performance of the feature metrics that were used to quantify the related artifacts. Comparison of the proposed RR quality metrics to other contemporary quality metrics reveals the ability of the proposed metrics to predict perceived quality in the context of wireless imaging.

This paper is organized as follows. Section 2 provides an overview of RR objective quality assessment in wireless imaging and the particular system under test as considered in this paper. A detailed description of the conducted subjective quality experiments is contained in Section 3, along with a statistical analysis of the experiment outcomes. The objective feature extraction metrics, which build the very basis of the metric design, are discussed in Section 4. An additional analysis of the feature metrics provides insight into their performance in measuring artifacts in the images. On the basis of both the subjective and objective data, the RR metric design for objective perceptual quality assessment is then described in detail in Section 5. In Section 6, the prediction performance of NHIQM and the Lp-norm is evaluated and compared to other prominent objective quality metrics. Finally, conclusions are drawn in Section 7.

2. Reduced-reference objective perceptual quality assessment in wireless imaging

A typical link layer of a wireless communication system is shown in Fig. 1. Here, the functional blocks in shaded boxes relate to the components that would need to be included for performing the operations related to RR objective perceptual quality assessment. As such, the system is able to monitor quality degradations that are incurred during transmission, unlike in the case of deploying an NR quality assessment method, where an absolute quality of the received image would be obtained. Given the strict limitations on system resources such as bandwidth, the overhead induced by the reduced-reference becomes a critical metric design issue. It is therefore beneficial to extract and pool representative features of an image I_t at the transmitter (t) in order to condense the image content and structure to a few numerical values. The transmission of the source encoded image may then be accompanied by the reduced-reference, which could be communicated either in-band as an additional header or in a dedicated control channel. Subsequently, channel encoding, modulation, and other wireless transmission functions are performed on the source encoded image and the reduced-reference. At the receiving side, the inverse functions are performed, including demodulation, channel decoding, and source decoding. The reduced-reference features are recovered from the received data, and the related features of the reconstructed image I_r at the receiver (r) are extracted and pooled to produce the related metric value. The difference between the metric values for the images I_t and I_r can then be explored for end-to-end image quality assessment. The outcome of the RR quality assessment may drive, for instance, link adaptation techniques such as adaptive coding and modulation, power control, or automatic repeat request strategies, provided a feedback link is available.

2.1. System under test

In the scope of this paper, we consider a particular setup of the wireless link model as shown in Fig. 1, which turned out to result in a set of test images covering a broad range of artifact types and severities. In particular, the JPEG format has been chosen to source encode the
[Fig. 1 block diagram: Source Encoding, Feature Extraction and Pooling, RR Embedding, Channel Encoding, and Modulation at the transmitter, followed by the Wireless Channel and the Receiver.]
Fig. 1. Overview of reduced-reference objective perceptual quality assessment deployed in a wireless imaging system.
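To make the two pooling alternatives of the proposed design concrete, the sketch below contrasts NHIQM-style transmitter-side pooling (a single relevance-weighted sum sent as reduced-reference) with a perceptual relevance weighted Lp-norm pooled at the receiver. All feature values and weights are illustrative placeholders rather than the trained values from the paper, and applying the weights inside the Lp sum is one plausible reading of the weighting convention.

```python
import numpy as np

def nhiqm(features, weights):
    """Transmitter-side pooling: relevance-weighted sum of
    extreme-value-normalized features (a single RR value)."""
    return float(np.dot(np.asarray(weights, float), np.asarray(features, float)))

def nhiqm_difference(feat_tx, feat_rx, weights):
    """RR metric: absolute difference of the NHIQM values computed
    independently for the transmitted and received image."""
    return abs(nhiqm(feat_tx, weights) - nhiqm(feat_rx, weights))

def weighted_lp_norm(feat_tx, feat_rx, weights, p=2):
    """Receiver-side pooling: weighted Lp-norm of per-feature
    differences (all feature values sent as reduced-reference)."""
    d = np.abs(np.asarray(feat_tx, float) - np.asarray(feat_rx, float))
    return float(np.sum(np.asarray(weights, float) * d ** p) ** (1.0 / p))

# Illustrative numbers only: five normalized features in [0, 1].
w = [0.9, 0.6, 0.5, 0.3, 0.1]          # hypothetical relevance weights
f_tx = [0.10, 0.05, 0.20, 0.00, 0.02]  # features of the transmitted image
f_rx = [0.55, 0.25, 0.35, 0.10, 0.04]  # features of the received image
print(nhiqm_difference(f_tx, f_rx, w))
print(weighted_lp_norm(f_tx, f_rx, w))
```

Note the trade-off described in the text: `nhiqm_difference` needs only one scalar of overhead, while `weighted_lp_norm` needs all features transmitted but can attribute the degradation to individual artifacts.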
experiments are given in detail. According to the guidelines outlined in Recommendation BT.500-11 [16] of the radiocommunication sector of the International Telecommunication Union (ITU-R), subjective experiments were conducted in two independent laboratories. The first subjective experiment (SE 1) took place at the Western Australian Telecommunications Research Institute (WATRI) in Perth, Australia, and the second subjective experiment (SE 2) was conducted at the Blekinge Institute of Technology (BIT) in Ronneby, Sweden.

3.1. Laboratory environment

The general viewing conditions were arranged as specified in ITU-R Recommendation BT.500-11 [16] for a laboratory environment. The subjective experiments were conducted in a room equipped with two 17 in cathode ray tube (CRT) monitors of type Sony CPD-E200 (SE 1) and a pair of 17 in CRT monitors of type DELL and Samtron 75E (SE 2). The ratio of luminance of inactive screen to peak luminance was kept below a value of 0.02. The ratio of the luminance of the screen when displaying only black level in a dark room to the luminance when displaying peak white was approximately 0.01. The display brightness and contrast were set up with picture line-up generation equipment (PLUGE) according to Recommendations ITU-R BT.814 [13] and ITU-R BT.815 [14]. The calibration of the screens was performed with the calibration equipment ColorCAL from Cambridge Research System Ltd., England, while the DisplayMate software was used as pattern generator.

Due to its large impact on the perceivability of artifacts, the viewing distance must be taken into consideration when conducting a subjective experiment. The viewing distance is in the range of four times (4H) to six times (6H) the height H of the CRT monitors, as stated in Recommendation ITU-R BT.1129-2 [15]. The distance of 4H was selected here in order to provide better image details to the viewers.

3.2. Test material

Seven reference images of dimension 512 × 512 pixels and represented in grey scale have been chosen to cover a variety of textures, complexities, and arrangements.
The images are shown in Figs. 3 and 4 where the images
in Fig. 3 represent humans and human faces and the
images in Fig. 4 represent more complex structures and
natural scenes. The wireless link simulation model as
explained in Section 2.1 has then been utilized to create
test images that exhibit the wide variety of distortions as
discussed in Section 2.2. In particular, two sets of
40 images each, I1 and I2 , were created to be used in
the two subjective experiments SE 1 and SE 2, respectively. The images were chosen so as to cover a wide variety of artifacts and also a broad range of severities for each of the artifacts, from almost invisible to highly distorted. Thus, the metric design is based on a set of test images that incorporates distortions ranging from near the just noticeable difference regime to artifacts widely covering the suprathreshold regime.
Fig. 4. Reference images showing complex textures: "Goldhill", "Pepper", and "Mandrill" (left to right).

3.3. Viewers
viewers were allowed to take part in the conducted subjective experiments. In order to support generalization of results and statistical significance of the collected subjective data, the experiments were conducted in two different laboratories involving 30 non-expert viewers in each experiment. Thus, the minimum requirement of at least 15 viewers, as recommended in [16], is well satisfied. In order to support consistency and eliminate systematic differences among results at the different testing laboratories, similar panels of test subjects in terms of occupational category, gender, and age were established.

In particular, 25 males and five females participated in SE 1. They were all university staff and students, and their ages were distributed in the range of 21–39 years, with the average age being 27 years. In the second experiment, SE 2, 24 males and six females participated. Again, they were all university staff and students, and their ages were distributed in the range of 20–53 years, with the average age being 27 years.

3.4.1. Selection of test method

Different test methodologies are provided in detail in [16] to best match the objectives and circumstances of the assessment problem. The methodologies are mainly classified into two categories, double-stimulus and single-stimulus. In double-stimulus, the reference image is presented to the viewer along with the test image. On the other hand, in single-stimulus, the reference image is not explicitly presented but may be shown transparently to the subject for judgement consistency observation purposes. As we consider RR metric design in this paper, where partial information related to the reference image is available, we chose to deploy a double-stimulus method, the double-stimulus continuous quality scale (DSCQS). Moreover, DSCQS has been shown to have low sensitivity to contextual effects [16,34]. Contextual effects occur when the subjective rating of an image is influenced by presentation order and severity of impairments. This relates to the phenomenon that test subjects may tend to give an image a lower score than it would normally have been given if its presentation is scheduled after a less distorted image.

3.4.2. Presentation of test material

The test sessions were divided into two sections. Each section lasted up to 30 min and consisted of a stabilization and a test trial. The stabilization trials were used as a warm-up for the actual test trial in each section. In addition, one training trial was conducted at the very beginning of the test session to demonstrate the test procedure to the viewers and allow them to familiarize themselves with the test mechanism. Clearly, the scores obtained during the training and stabilization trials are not processed; only the scores given during the test trials are analyzed. In order to reduce the viewers' fatigue, a 15 min break was given between sections.

Given the DSCQS method, pairs of images A and B are presented in alternating order to the viewers for assessment, with one image being the original, undistorted image and the other being the distorted test image. As the DSCQS method is quite sensitive to small quality differences, it is well suited not just to cope with highly distorted test images but also with cases where the quality of original and distorted image is very similar.

3.4.3. Grading scale

The grading is performed with reference to a five-point quality scale (excellent, good, fair, poor, bad), which is used to divide the continuous grading scale into five partitions of equal length. Given the pair of images A and B, the viewer is requested to assess their quality by placing a mark on each quality scale. As the reference and distorted image appear in pseudo-random order, A and B may refer to either the reference image or the distorted image, depending on the actual arrangement of images in an assessment pair.

The outcomes of the subjective experiments are discussed in the following by means of a statistical analysis. In this respect, a concise representation of the subjective data can be achieved by calculating conventional statistics such as the mean, variance, skewness, and kurtosis of the related distribution of opinion scores. The statistical analysis of these data reflects the fact that perceived quality is a subjective measure and hence may be described statistically.

3.5.1. Statistical measures

Let the MOS value for the kth image in a set of K images be denoted here as m_k. Then, we have

    m_k = \frac{1}{N} \sum_{j=1}^{N} u_{j,k}    (1)

where u_{j,k} denotes the opinion score given by the jth viewer to the kth image and N is the number of viewers. The confidence interval associated with the MOS of each examined image is given by

    [m_k - \epsilon_k, \; m_k + \epsilon_k]    (2)

The deviation term \epsilon_k can be derived from the standard deviation s_k and the number N of viewers and is given for a 95% confidence interval according to [16] by

    \epsilon_k = 1.96 \, \frac{s_k}{\sqrt{N}}    (3)

where the standard deviation s_k for the kth image is defined as the square root of the variance

    s_k^2 = \frac{1}{N-1} \sum_{j=1}^{N} (u_{j,k} - m_k)^2    (4)

The skewness measures the degree of asymmetry of data around the mean value of a distribution of samples and is defined by the second and third central moments m_2 and m_3, respectively, as

    \beta = \frac{m_3}{m_2^{3/2}}    (5)

The peakedness of a distribution can be quantified by the kurtosis, which measures how outlier-prone a distribution is. The kurtosis is defined by the second and fourth central moments m_2 and m_4, respectively, as

    \gamma = \frac{m_4}{m_2^2}    (7)

It should be mentioned that the kurtosis of the normal distribution is obtained as 3. If the considered distribution is more outlier-prone than the normal distribution, it results in a kurtosis greater than 3. On the other hand, if it is less outlier-prone than the normal distribution, it gives a kurtosis less than 3. A distribution of scores is usually considered as normal if the kurtosis is between 2 and 4.
3.5.2. Statistical analysis

Figs. 5(a) and (b) show the scatter plots of MOS for SE 1
[Fig. 6 panels: MOS with linear fits, variance with quadratic fits, skewness with cubic fits, and kurtosis with power fits, each plotted against image number for SE 1 and SE 2.]
Fig. 6. Statistics of opinion scores for the impaired image samples: (a) MOS, (b) variance, (c) skewness, and (d) kurtosis.
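The per-image statistics plotted in Fig. 6 follow directly from Eqs. (1)-(5) and (7); a minimal sketch of the computation (function and variable names are our own, and non-constant scores are assumed so the central moments are non-zero):

```python
import numpy as np

def mos_statistics(scores):
    """Statistics of the opinion scores u_{j,k} of N viewers for one image:
    MOS (Eq. (1)), 95% confidence interval (Eqs. (2)-(4)), skewness
    (Eq. (5)), and kurtosis (Eq. (7))."""
    u = np.asarray(scores, dtype=float)
    N = u.size
    mos = u.mean()                                 # Eq. (1)
    s = np.sqrt(np.sum((u - mos) ** 2) / (N - 1))  # Eq. (4), sample std dev
    eps = 1.96 * s / np.sqrt(N)                    # Eq. (3), 95% CI half-width
    m2 = np.mean((u - mos) ** 2)                   # central moments
    m3 = np.mean((u - mos) ** 3)
    m4 = np.mean((u - mos) ** 4)
    return {"mos": mos,
            "ci": (mos - eps, mos + eps),          # Eq. (2)
            "skewness": m3 / m2 ** 1.5,            # Eq. (5)
            "kurtosis": m4 / m2 ** 2}              # Eq. (7); normal dist.: 3
```

By the rule of thumb stated in Section 3.5.1, a score distribution whose kurtosis falls between 2 and 4 would be treated as approximately normal.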
wider extent. These conclusions are supported by the confidence intervals shown in Figs. 5(a) and (b), which are narrower for images rated as being excellent and bad.

Fig. 6(c) shows the skewness of the opinion score distribution for each image sample. In the context of the subjective ratings of image quality, a negative or positive skewness translates to the subjective scores being more spread towards lower or higher values than the MOS, respectively. For the images that were perceived as being of high quality, the negative skewness indicates that subjective scores tend to be asymmetrically spread around the MOS towards lower opinion scores and thus, that a number of viewers gave significantly lower quality scores as compared to the MOS. In the other extreme of image quality being perceived as bad, the positive skewness points to an asymmetric spread around the MOS towards higher opinion scores. However, the positive skewness is not as distinct as the negative skewness at the high quality end, indicating that the agreement on low quality was higher as compared to the agreement on high quality. The asymmetry in subjective scores for the extreme cases of excellent and bad quality is thought to be due to the rating scale being limited to 100 and 0, respectively. As such, subjective scores have to approach the maximal and minimal possible rating from below or above, respectively. The skewness of around zero for the middle range of qualities reveals that the subjective scores seem to be symmetrically distributed with respect to MOS, even though the variance for images of average quality is larger.

Fig. 6(d) provides the kurtosis for each impaired image sample. It can be seen from the figure that the distributions of subjective scores for some of the images scoring high MOS values in both experiments give kurtosis values much greater than that of a normal distribution. This is a strong indication for outliers, meaning that a few of the viewers gave the image quality a low rating, whereas the majority of viewers agreed on a high image quality. With the progression of images towards decreasing MOS, the associated kurtosis fits quickly level out around the value 3, pointing to a normal distribution of the opinion scores around MOS. It is interesting to point out that the high kurtosis at the high quality end does not occur at the bad quality end. This means that the entire viewer panel agreed on the bad quality images with no outlier scores being present. This result is also evident in the skewness distribution, where the decline towards lower values at the high quality end is much more pronounced as compared to the incline of the skewness at the low quality end.

4. Objective structural degradation metrics

The design of the RR metrics proposed in this paper is based on the extraction of structural information from the images. In this section we will discuss the objective feature metrics that were deployed to measure the artifacts as observed in the test images (see Section 2.2). An analysis of the objective measures provides further insight into the performance of the feature metrics in quantifying the artifacts.

4.1. Feature metrics

Given the set of artifacts as observed in the test images, algorithms for feature extraction can be deployed to capture the amount by which each of the artifacts is present in the images. The selection of the algorithms to be used is driven by three constraints, namely, a reasonable accuracy in capturing the characteristics of the associated artifact, a representation of the feature that incurs low overhead in terms of reduced-reference (conserve bandwidth), and computational inexpensiveness (conserve battery power). The features and feature extraction algorithms deployed here to measure and quantify the presence of the related artifacts are listed in Table 1 and will be described in the following sections.

4.1.1. Feature f̃1: block boundary differences

The first feature metric f̃1 is based on the algorithm by Wang et al. [38] and comprises three measures. The first measure, B, estimates blocking as average differences between block boundaries. Two image activity measures (IAM), A and Z, are applied as indirect means of quantifying blur. The former IAM computes absolute differences between in-block image samples and the latter IAM computes a zero-crossing rate. All three measures are computed in both horizontal and vertical direction and combined in a pooling stage as follows:

    \tilde{f}_1 = \alpha + \beta B^{\gamma_1} A^{\gamma_2} Z^{\gamma_3}    (8)

where the parameters \alpha, \beta, \gamma_1, \gamma_2, and \gamma_3 were estimated in [38] using MOS from subjective experiments. Despite the two IAM incorporated in f̃1, we found that this metric accounts particularly well for blocking artifacts in JPEG compressed images. This might be due to the magnitude of \gamma_1 being reported in [38] as relatively large compared to \gamma_2 and \gamma_3, giving the blocking measure a higher impact on the metric f̃1.

4.1.2. Feature f̃2: edge smoothness

The extraction of feature metric f̃2 relates purely to measuring blur artifacts and follows the work of Marziliano et al. [22]. It accounts for the smoothing effect of blur by measuring the distance between edges. It was found that it is sufficient to measure the blur along vertical edges, which allows for saving computational complexity as compared to computation on all edges. Therefore, a Sobel filter is applied to detect vertical edges in the image. The edge image is then horizontally scanned. For pixels that correspond to an edge point, the local extrema in the corresponding image are used to compute the edge width. The edge width then defines a local measure of blur. Finally, a global blur measure is obtained by averaging the local blur values over all edge locations. This metric was chosen to complement the IAM in f̃1 since it does not just account for in-block blur but rather contributes a global blur measure.

4.1.3. Features f̃3 and f̃4: image activity

Ringing artifacts are observed as periodic pseudo-edges around original edges, thus increasing the activity within an image. The feature metrics f̃3 and f̃4 provide an indirect means of measuring ringing artifacts and are based on two IAM by Saha and Vemuri [30]. Here, f̃3 quantifies image activity based on normalized magnitudes of edges in an edge image B(i) as follows:

    \tilde{f}_3 = \left( \frac{1}{MN} \sum_{i=1}^{MN} B(i) \right) \cdot 100    (9)

where M and N denote the image dimensions. Since f̃3 does not depend on the direction of the edge, it also very well complements the blocking measure in f̃1, which is purely designed to measure on the 8 × 8 block boundaries in JPEG coded images.

On the other hand, f̃4 measures IA in an image I(i, j) based on local gradients in both vertical and horizontal
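As a rough illustration of the edge-width blur measurement described for f̃2 in Section 4.1.2, the sketch below detects vertical edges with a horizontal Sobel response and averages the distance between the local extrema enclosing each edge. The edge threshold and the monotone-walk rule for locating the extrema are simplifying assumptions of ours; the exact procedure in [22] differs in detail.

```python
import numpy as np

def edge_width(row, x):
    """Distance between the local extrema enclosing the monotone
    intensity ramp through the edge pixel at column x."""
    w = row.size
    rising = row[min(x + 1, w - 1)] >= row[max(x - 1, 0)]
    l = r = x
    if rising:  # ramp goes up: local min on the left, local max on the right
        while l > 0 and row[l - 1] < row[l]:
            l -= 1
        while r < w - 1 and row[r + 1] > row[r]:
            r += 1
    else:       # falling edge: mirror case
        while l > 0 and row[l - 1] > row[l]:
            l -= 1
        while r < w - 1 and row[r + 1] < row[r]:
            r += 1
    return r - l

def blur_metric(img, edge_thresh=0.2):
    """Global blur estimate: mean edge width over vertical-edge pixels
    found by thresholding a horizontal 3x3 Sobel response."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    k = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    gx = np.zeros((h, w))
    for dy in range(3):  # correlate interior pixels with the Sobel kernel
        for dx in range(3):
            gx[1:h - 1, 1:w - 1] += k[dy, dx] * img[dy:h - 2 + dy, dx:w - 2 + dx]
    mag = np.abs(gx)
    if mag.max() == 0.0:  # flat image: no edges to measure
        return 0.0
    ys, xs = np.nonzero(mag > edge_thresh * mag.max())
    widths = [edge_width(img[y], x) for y, x in zip(ys, xs)]
    return float(np.mean(widths)) if widths else 0.0
```

A sharp intensity step yields a small mean edge width, while the same step smoothed into a ramp yields a width close to the ramp length, matching the intuition that blur spreads edges.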
Table 1
Image features, feature extraction algorithms, and related artifacts.
Fig. 7. Magnitude of differences between normalized feature values for the considered image samples ranked according to decreasing MOS: (a) SE 1,
(b) SE 2.
Table 2
Statistics of magnitudes of feature differences Δf_i for SE 1.

           Δf_1    Δf_2    Δf_3    Δf_4     Δf_5
Mean       0.253   0.120   0.102   0.053    0.022
Variance   0.043   0.017   0.014   0.015    0.009
Skewness   0.627   1.425   1.124   3.518    6.015
Kurtosis   2.082   4.120   3.241   15.010   37.466

Table 3
Statistics of magnitudes of feature differences Δf_i for SE 2.

           Δf_1    Δf_2    Δf_3    Δf_4     Δf_5
Mean       0.263   0.094   0.115   0.049    0.061
Variance   0.029   0.013   0.010   0.021    0.035
Skewness   1.066   2.495   1.072   5.461    3.785
Kurtosis   4.056   9.531   3.843   32.434   17.063

Table 4
Correlations between feature differences for SE 1.

        Δf_1    Δf_2    Δf_3    Δf_4    Δf_5
Δf_1    1.000   0.625   0.821   0.016   0.027
Δf_2            1.000   0.440   0.649   0.112
Δf_3                    1.000   0.056   0.061
Δf_4                            1.000   0.000
Δf_5                                    1.000

Table 5
Correlations between feature differences for SE 2.

        Δf_1    Δf_2    Δf_3    Δf_4    Δf_5
Δf_1    1.000   0.376   0.640   0.014   0.115
Δf_2            1.000   0.486   0.753   0.316
Δf_3                    1.000   0.323   0.272
Δf_4                            1.000   0.170
Δf_5                                    1.000
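The four statistics reported in Tables 2 and 3 can be reproduced with a short routine. Population (biased) moments and non-excess kurtosis are assumed here, since the paper does not state which estimator was used; the function name is illustrative.

```python
import numpy as np

def feature_diff_stats(delta_f):
    """Mean, variance, skewness, and kurtosis of the magnitudes of
    feature differences, as in Tables 2 and 3 (population moments
    and non-excess kurtosis assumed)."""
    x = np.abs(np.asarray(delta_f, dtype=float))
    mu = x.mean()
    var = x.var()                      # population variance
    sd = np.sqrt(var)
    skew = ((x - mu) ** 3).mean() / sd ** 3
    kurt = ((x - mu) ** 4).mean() / sd ** 4
    return mu, var, skew, kurt
```

For a symmetric two-valued sample such as [0, 1, 0, 1], this returns mean 0.5, variance 0.25, skewness 0, and kurtosis 1.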
From comparison of the two tables one can observe that, for all four statistics and for all five feature differences, the magnitudes of the values are very much in alignment between the two experiments SE 1 and SE 2. This indicates that the stimuli, in terms of the distorted test images, had similar characteristics in both experiments. Thus, not only the subjective data are in alignment but also the composition of objective features among the test material. In particular, it can be seen from both tables that the mean of the blocking differences dominates over the other features. This is a direct result of the JPEG source encoding, for which it is well known that blocking artifacts are dominant over other artifacts such as blur. The mean values of the feature differences Δf_4 and Δf_5 are particularly small; however, these features instead exhibit a very high skewness and kurtosis as compared to the other features. Clearly, this quantifies the progression of feature differences in the stimuli as shown in Figs. 7(a) and (b), with Δf_4 and Δf_5 being either negligibly small or distinctively developed.

4.3.3. Feature cross-correlations

Even though the feature metrics were selected to account for a particular artifact, one may expect some overlap in quantifying the different artifacts. To further understand the performance of the feature metrics in comparison to each other, Tables 4 and 5 show the Pearson linear correlation coefficient between each of the feature metrics for both SE 1 and SE 2. In this context, the cross-correlation measures the degree to which two features are simultaneously affected by a certain type and severity of an artifact. As expected, the correlation of a feature with itself exhibits the maximum magnitude of 1.

It can be seen from the tables that the cross-correlations between the features vary strongly in their magnitudes. A particularly pronounced cross-correlation can be observed between feature metrics Δf_1 (block boundary differences) and Δf_3 (edge-based IA) for both SE 1 and SE 2. This is thought to be due to both metrics being based on measuring edges of an image. However, it should be noted again that feature metric Δf_1 only considers the 8×8 block borders of the JPEG encoding, whereas feature metric Δf_3 quantifies image activity based on edges in all spatial locations and directions. Furthermore, feature metrics Δf_2 (edge smoothness) and Δf_4 (gradient-based IA) exhibit pronounced cross-correlations in the test sets of both experiments, which may be a result of both metrics being designed to quantify smoothness in images based on gradient information. As for feature metric Δf_5 (image histogram statistics), it can be seen that this metric is only negligibly correlated to any of the other feature metrics. This is a highly desired result, since the feature metrics other than Δf_5 should be widely unaffected by intensity shifts.

5. Objective perceptual metric design

In this section we describe the RR objective quality metric design in detail. In this respect, the quality ratings obtained in the subjective experiments are instrumental for the transition from subjective to objective quality assessment.

5.1. Metric training and validation

As the foundation of the metric design, the 80 images in I_1 (SE 1) and I_2 (SE 2) from the two experiments were organized into a training set I_T containing 60 images and a validation set I_V containing 20 images. For this purpose, 30 images were taken from I_1 and 30 images from I_2 to form I_T, while the remaining 10 images of each set compose I_V. Accordingly, a training set and a validation set were established with the corresponding MOS, here referred to as MOS_T and MOS_V. The training sets, I_T and MOS_T, are then used for the actual metric design.
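The Pearson cross-correlations of Tables 4 and 5 follow directly from the per-image feature difference vectors; a minimal sketch, assuming the differences are arranged as a 2-D array with one row per feature (the function name is illustrative):

```python
import numpy as np

def feature_cross_correlations(delta_f):
    """Pearson linear correlation coefficients between feature
    difference metrics, as in Tables 4 and 5. Rows of delta_f are
    features, columns are image samples."""
    return np.corrcoef(np.asarray(delta_f, dtype=float))
```

Two linearly related features yield a coefficient of 1, two inversely related features a coefficient of -1, and the diagonal is always 1.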
ARTICLE IN PRESS
538 U. Engelke et al. / Signal Processing: Image Communication 24 (2009) 525–547
[Fig. 8 block diagram: features f_{t,i} and f_{r,i} are obtained from I_t and I_r by feature extraction and feature normalization; the difference features Δf_i, together with weights w_i acquired against MOS_T, feed the metric computation (ΔNHIQM, L_p-norm), followed by a curve-fitted mapping to the predicted score MOS_x.]
Fig. 8. Framework for designing feature-based objective perceptual image quality metrics.
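The feature normalization stage in Fig. 8 maps raw feature values onto a common scale before pooling. A minimal sketch, assuming a min-max (extreme value) normalization with the extremes taken over the training material; the normalization constants are not given in this excerpt, so the values used below are purely illustrative:

```python
import numpy as np

def extreme_value_normalize(f, f_min, f_max):
    """Extreme value (min-max) normalization of a raw feature value
    onto [0, 1]; f_min and f_max are assumed to be the feature
    extremes observed over the training set."""
    return (np.asarray(f, dtype=float) - f_min) / (f_max - f_min)
```

For example, a raw feature value halfway between its extremes maps to 0.5.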
The validation sets, I_V and MOS_V, are used to evaluate the metric's ability to generalize to unknown images.

Table 6
Perceptual relevance weights of feature differences Δf_i for the images in the training set.

Table 6 shows the values of the Pearson linear correlation coefficients, or feature weights, that were obtained for each of the five feature differences Δf_i, i = 1, ..., 5, for the training set when correlated to the associated MOS_T values. Accordingly, block boundary differences (Δf_1) appear to be the most relevant feature, followed by edge-based image activity (Δf_3), edge smoothness (Δf_2), image histogram statistics (Δf_5), and gradient-based image activity (Δf_4). This relates to blocking being the most annoying artifact, followed by ringing due to edge-based image activity, blur, intensity masking, and ringing due to gradient-based image activity. Similar findings have also been made by Farias et al. [9], who observed that blocking is more annoying than blur. The same group also found [10] that ringing is the least annoying artifact. This agrees with our feature metric Δf_4, which also received the smallest weight. On the other hand, the feature metric Δf_3 deployed in our paper measures ringing as well but received a higher weight. We believe that this outcome can be related to Δf_3 having a strong correlation with Δf_1 (see Tables 4 and 5), thus not only accounting for ringing but also for blocking artifacts.

It should be noted here that the relevance weights in Table 6 were obtained for the particular case of JPEG source encoding, where blocking artifacts are predominant over other artifacts such as blur. This may also contribute to the higher correlation weights for the edge-based features Δf_1 and Δf_3 as compared to the gradient-based features Δf_2 and Δf_4. Hence, the relevance weights may not be purely related to the perceptual relevance but also to the particular artifacts that are observed in the visual content. As such, one may obtain different relevance weights in case of other source encoders, such as JPEG2000.

5.4. RR objective metric computation

In the following two sections, we will consider two different pooling functions that are based on weighted combinations of the feature metrics. Firstly, we introduce NHIQM, which linearly combines extreme value normalized image features into a single quality value. Secondly, a perceptual relevance weighted version of the L_p-norm is proposed, which calculates a weighted sum of image feature differences between the original and impaired image. In both cases, the respective image features are extracted with the metrics summarized in Section 4.1, while the actual weights used for feature combination have been deduced as discussed in Section 5.3.

5.4.1. Normalized hybrid image quality metric

The proposed NHIQM is defined as a weighted sum of the extreme value normalized features as

NHIQM = Σ_{i=1}^{I} w_i f_i    (17)

where w_i denotes the relevance weight of the associated feature f_i. Clearly, this RR metric is particularly beneficial for objective perceptual quality assessment in wireless imaging, as the reduced-reference is represented by only one single value for a given image. Accordingly, NHIQM can be communicated from the transmitter to the receiver whilst imposing very little stress on the bandwidth resources.

Regarding applications in wireless imaging, NHIQM can be calculated for the transmitted image I_t and received image I_r, resulting in the corresponding values NHIQM_t and NHIQM_r at the transmitter and receiver, respectively. Provided that the NHIQM_t value is communicated to the receiver, structural differences between the images at both ends may simply be represented by the absolute difference

ΔNHIQM = |NHIQM_t − NHIQM_r|    (18)

5.4.2. Perceptual relevance weighted L_p-norm

The L_p-norm, also referred to as the Minkowski metric, is a distance measure commonly used to quantify similarity between two signals or vectors. In image processing, it has been applied, for instance, with the percentage scaling method [28] and the combining of impairments in digital image coding [27].

In this paper, we incorporate the relevance weighting for the extreme value normalized features into the calculation of the L_p-norm. This modification of the L_p-norm shall be defined as follows:

L_p = [ Σ_{i=1}^{I} w_i^p |f_{t,i} − f_{r,i}|^p ]^{1/p}    (19)

where f_{t,i} and f_{r,i} denote the ith feature value of the transmitted and the received image, respectively.

The Minkowski exponent p may be determined experimentally [28]. Alternatively, the Minkowski exponent p may be assigned a fixed value. In both cases, a higher value of p increases the impact of the dominant features on the overall metric. In the limit of p approaching infinity, we obtain

L_∞ = max_{i=1,...,I} |f_{t,i} − f_{r,i}|    (20)

meaning that the largest absolute feature value difference solely dominates the norm. We have found [7] that values beyond p = 2 do not improve the quality prediction performance of the modified L_p-norm given in (19). We believe that this characteristic arises because the perceptual relevance weights obtained for each feature inherently account for the dominance of the particular features. In the sequel, we therefore consider the modified L_p-norm for Minkowski exponents of p = 1 and 2 only.

Although the L_p-norm belongs to the class of RR metrics, it requires more transmission resources compared to ΔNHIQM, as all feature values need to be communicated from the transmitter to the receiver. On the other hand, the information about each of the feature degradations may provide further insights into the channel induced distortions. Hence, overhead may be traded off against structural degradation information by neglecting feature metrics that received low perceptual relevance weights.
[Figs. 9 and 10: scatter plots of MOS versus ΔNHIQM for the image samples, each with a fitted mapping function and 95% confidence interval.]

Fig. 9. Polynomial mapping functions: (a) linear, (b) quadratic, (c) cubic.

Fig. 10. Exponential mapping functions: (a) exponential, (b) double exponential, (c) triple exponential.
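The single exponential mapping of Fig. 10(a) has the form MOS = a_1·exp(b_1·x) and can be fitted by simple least squares in the log domain, a plain stand-in for the nonlinear curve fitting described in the text; the sample data below are made up for illustration and assume positive MOS values.

```python
import numpy as np

def fit_exponential_mapping(metric, mos):
    """Fit MOS ~ a1 * exp(b1 * metric) by linear least squares on
    log(MOS); assumes strictly positive MOS values."""
    b1, log_a1 = np.polyfit(metric, np.log(mos), 1)
    return np.exp(log_a1), b1

def exp_map(x, a1, b1):
    """Exponential mapping from a metric value to predicted MOS."""
    return a1 * np.exp(b1 * x)

# illustrative, made-up data following a noiseless exponential decay
x = np.linspace(0.0, 0.8, 40)
mos = 85.0 * np.exp(-2.5 * x)
a1, b1 = fit_exponential_mapping(x, mos)
```

On noiseless exponential data the fit recovers the generating parameters exactly (up to floating point precision).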
As higher-degree polynomials may even result in more severe overfitting, the class of polynomials has little to offer for use in objective perceptual quality assessment.

In contrast to the polynomial functions, favorable fitting has been obtained for all three considered exponential mapping functions, not only in terms of
goodness of fit measures but also as confirmed by visual inspection (see Fig. 10). However, it can be observed that the triple exponential function performs similarly to the exponential function but at the price of a larger computational complexity due to its more involved analytical expression. As such, the triple exponential function may not be considered further.

As for the logistic mapping function, the goodness of fit measures indicate a rather poor fit to the data from the subjective experiments. Especially, the compression at the high end of the quality scale produces disagreement with MOS (see Fig. 11).

In light of the above findings, both the exponential and double exponential mapping are selected for further consideration in the metric design.

6. Evaluation of image quality metrics

With the exponential and double exponential mapping being identified as suitable models for objectively predicting perceptual image quality, an evaluation of the prediction performance of ΔNHIQM and the L_p-norm on the training set (60 images in I_T and related MOS_T) and its generalization to the validation set (20 images in I_V and related MOS_V) is given in this section.

6.1. Other objective quality metrics for comparison

We have selected contemporary quality metrics that have been proposed in recent years to allow for a performance comparison with the proposed feature-based ΔNHIQM and the L_p-norm. Specifically, the reduced-reference image quality assessment (RRIQA) technique proposed in [40] is chosen as a prominent member of the class of RR metrics. In addition, the structural similarity (SSIM) index [39], the visual information fidelity (VIF) criterion [31], the visual signal-to-noise ratio (VSNR) [4], and the peak signal-to-noise ratio (PSNR) [25] are chosen as the FR metrics. It is noted that FR metrics would not be suitable for the considered wireless imaging scenario but rather serve to benchmark prediction performance, which can be expected to be high due to the utilization of the reference image.

RRIQA: This metric [40] is based on a natural image statistic model in the wavelet domain. The image distortion measure is obtained from the estimation of the Kullback-Leibler distance between the marginal probability densities of wavelet coefficients in the subbands of the reference and distorted images as follows:

D = log_2 ( 1 + (1/D_0) Σ_{k=1}^{K} |d̂^k(p_k ∥ q_k)| )    (23)

where the constant D_0 is used as a scaler of the distortion measure, d̂^k(p_k ∥ q_k) denotes the estimation of the Kullback-Leibler distance between the probability density functions p_k and q_k of the kth subband in the transmitted and the received image, and K is the number of subbands. The overhead needed to represent the reduced-reference is given as 162 bits [40].

SSIM: The SSIM index [39] is based on the assumption that the HVS is highly adapted to the extraction of structural information from the visual scene. As such, SSIM predicts structural degradations between two images based on simple intensity and contrast measures. The final SSIM index is given by

SSIM(x, y) = [(2 μ_x μ_y + C_1)(2 σ_xy + C_2)] / [(μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)]    (24)

where μ_x, μ_y and σ_x, σ_y denote the mean intensity and contrast of image signals x and y, respectively. The constants C_1 and C_2 are used to avoid instabilities in the structural similarity comparison that may occur for certain mean intensity and contrast combinations (μ_x² + μ_y² ≈ 0, σ_x² + σ_y² ≈ 0).

VIF: The VIF criterion [31] approaches the image quality assessment problem from an information theoretical point of view. In particular, the degradation of visual quality due to a distortion process is measured by quantifying the information available in a reference image and the amount of this reference information that can still be extracted from the test image. As such, the VIF criterion measures the loss of information between two images. For this purpose, natural scene statistics and, in particular, Gaussian scale mixtures (GSM) in the wavelet domain are used to model the images. The proposed VIF metric is given by

VIF = [ Σ_{j∈subbands} I(C^{N,j}; F^{N,j} | s^{N,j}) ] / [ Σ_{j∈subbands} I(C^{N,j}; E^{N,j} | s^{N,j}) ]    (25)

where C denotes the GSM, N denotes the number of GSM used, and E and F denote the visual output of an HVS model for the reference and test image, respectively.

VSNR: The VSNR [4] metric deploys a two-stage approach based on near-threshold and suprathreshold properties of the HVS to quantify image fidelity. The first stage determines whether distortions are visible in an image. For this purpose, contrast thresholds for distortion detection are determined using wavelet-based models of visual masking. If the distortions are below the threshold, the quality of the image is assumed to be perfect and the algorithm is terminated. If the distortions are visible, a second stage implements perceived contrast and global precedence properties of the HVS to determine the impact of the distortions on perceived quality. The final VSNR metric is then given as

VSNR = 20 log_10 [ C(I) / (α d_pc + (1 − α) d_gp / √2) ]    (26)

where C(I) denotes the root-mean-squared contrast of the original image I, d_pc and d_gp are, respectively, measures of perceived contrast and global precedence
[Fig. 11: scatter plot of MOS versus ΔNHIQM with logistic fit and 95% confidence interval.]

Fig. 11. Logistic mapping function.

disruption, and α is a weight regulating the relative contributions of d_pc and d_gp.

PSNR: Image fidelity is an indication of the similarity between the reference and distorted images and measures pixel-by-pixel closeness between those pairs. The PSNR [25] is the most commonly used fidelity metric. It measures the fidelity difference of two image signals I_R(x, y) and I_D(x, y) on a pixel-by-pixel basis as

PSNR = 10 log_10 (Z² / MSE)    (27)

where Z is the maximum pixel value, here 255. The mean square error is given as

MSE = (1/(XY)) Σ_{x=1}^{X} Σ_{y=1}^{Y} [I_R(x, y) − I_D(x, y)]²    (28)

where X and Y denote the horizontal and vertical image dimensions, respectively. Despite being an FR metric, PSNR usually does not correlate well with the visual quality as perceived by a human observer [37].

6.2. Computational complexity and amount of reference information

In the following, we will discuss the computational complexity of the considered metrics and the amount of reference information that is needed in order to assess the quality of a test image. The details are summarized in Table 8.

The computational complexity is measured in terms of the time that each of the metrics needs to assess the quality of a single image in our sets I_1 and I_2. Here, we have computed each metric over all 80 images and then determined the average time. The metrics were run on a laptop computer containing an Intel T2600 Dual Core processor with 2.16 GHz and 4 GB of RAM. In order to allow for a fair comparison, the publicly available Matlab implementation of each metric was used even though there may be other implementations available for some of the metrics.

Table 8
Computational complexity of the metrics and amount of reference information needed.

Type   Name    Computation time/image (s)   Reference information
FR     SSIM    0.37                         Full image
       VIF     0.92                         Full image
       VSNR    0.33                         Full image
       PSNR    0.05                         Full image
[Rows for the RR metrics not recovered.]

It can be seen from Table 8 that the computational complexity of all FR metrics is lower as compared to the RR metrics. Amongst the FR metrics, PSNR outperforms by far the other considered metrics in terms of computational complexity. Regarding the RR metrics, it is observed that both ΔNHIQM and the L_p-norm are significantly less complex than RRIQA.

In the context of wireless imaging, the amount of reference information needed for quality assessment determines the overhead of data that needs to be transmitted over the channel along with the actual image. From Table 8 one can see that the reference information is significantly lower for both ΔNHIQM and the L_p-norm as compared to RRIQA. The particularly small reference information for ΔNHIQM results from the fact that only a single value NHIQM_t needs to be transmitted. On the other hand, with the L_p-norm five features need to be transmitted, resulting in a five times higher overhead. However, as discussed in Section 5.4.2, the number of features used may be traded off with the transmission overhead by neglecting features of low perceptual relevance. As for the FR metrics, the reference image is needed for the quality assessment and, as such, the size of the image determines the amount of reference information. Independent of the image size, however, the amount of reference information would be orders of magnitude higher as compared to the RR metrics.

6.3. Prediction performance measures

The quality prediction performance of the considered objective metrics will be quantified in terms of accuracy, monotonicity, and consistency as recommended by the Video Quality Experts Group (VQEG) [34].

The prediction accuracy of each objective quality metric will be quantified using the Pearson linear correlation coefficient as defined in (16). The prediction monotonicity will be measured by the non-parametric Spearman rank order coefficient

r_S = [ Σ_{k=1}^{K} (w_k − w̄)(g_k − ḡ) ] / [ √(Σ_{k=1}^{K} (w_k − w̄)²) · √(Σ_{k=1}^{K} (g_k − ḡ)²) ]    (29)

where w_k and g_k denote the ranks of the predicted scores and the subjective scores, respectively, and w̄ and ḡ are the midranks of the respective data sets. This measure is used
[Fig. 12: scatter plots of MOS versus predicted MOS, MOS_NHIQM, with the image samples and fitted mappings.]

Table 9 (mapping function parameters)

                 Exponential        Double exponential
Type   Name      a_1      b_1       a_1      b_1      a_2           b_2
RR     ΔNHIQM    88.79    2.484     69.76    1.720    32.04         17.39
       L_1-norm  87.63    1.840     68.15    1.251    34.31         13.46
       L_2-norm  90.20    2.820     69.83    1.950    32.28         16.24
       RRIQA     102.1    0.160     101.3    0.157    3.486×10⁻¹⁴   3.701
FR     SSIM      13.91    1.715     31.34    0.446    3.964×10⁻⁷    18.58
       VIF       4.291    2.886     14.06    1.366    7.721×10⁻¹⁴   33.98
       VSNR      25.662   0.033     21.964   0.041    1.388×10⁻⁶    0.387
       PSNR      18.33    0.036     22.68    0.029    0             0.029

Fig. 12. MOS versus predicted MOS, MOS_NHIQM: (a) exponential mapping, (b) double exponential mapping.
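The double exponential mapping used in Table 9 sums two exponential terms. A minimal sketch; the negative decay signs in the example call are assumptions (the minus signs of the RR metric parameters are discussed in the text but are not recoverable from the extracted table), so the numbers are illustrative only.

```python
import numpy as np

def double_exp_map(x, a1, b1, a2, b2):
    """Double exponential mapping from a metric value x to
    predicted MOS: a1*exp(b1*x) + a2*exp(b2*x)."""
    return a1 * np.exp(b1 * x) + a2 * np.exp(b2 * x)

# illustrative ΔNHIQM-style parameters; decay signs assumed negative
mos_at_zero = double_exp_map(0.0, 69.76, -1.720, 32.04, -17.39)
```

At x = 0 the mapping simply returns a_1 + a_2, and with negative decay parameters the predicted MOS decreases monotonically with increasing metric value.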
decay parameter for the feature-based objective perceptual quality metrics ΔNHIQM, L_1-norm, L_2-norm, as well as RRIQA relate to the fact that these RR metrics represent image degradation. Thus, larger values of these metrics correspond to lower perceptual quality. In contrast, the FR metrics SSIM, VIF, VSNR, and PSNR measure image similarity of some sort, which is represented by the positive decay parameter in their exponential mapping functions. In these cases, a larger metric value corresponds to higher perceptual quality. As for the double exponential mapping functions, these are pronounced only for the feature-based objective perceptual quality metrics ΔNHIQM and the L_1- and L_2-norm. Specifically, the growth and decay parameters for both involved exponential functions are substantially different from zero. This is not the case for the other considered quality metrics, RRIQA, SSIM, VIF, and VSNR, with the double exponential mapping functions degenerating to an exponential function for small metric values. Due to the initial value a_2 being close to zero, the second exponential function can contribute to the prediction only for extremely large metric values, although this contribution may still be insignificant compared to the first exponential function involved. In the case of PSNR, the initial value of the second exponential function is actually obtained from the curve fitting as being zero. Accordingly, the double exponential mapping function in fact degenerates to an exponential mapping function.

6.6. Evaluation of quality prediction performance

Given the parameters of the prediction functions for the examined quality metrics, the prediction performance of these metrics is presented in Table 10. In particular, the prediction accuracy is quantified by the Pearson linear correlation coefficient. It has been calculated on the basis of the 60 images in the training set I_T and the 20 images of the validation set I_V. Moreover, prediction accuracy has been calculated for the relationship between MOS and the pure metric as well as for the relationship between MOS and predicted MOS using exponential mapping and double exponential mapping.

As can be seen from the numerical results in Table 10 for the metric training, the prediction accuracy of the feature-based metrics, ΔNHIQM, L_1-norm, and L_2-norm, outperforms that of the other considered metrics, RRIQA, SSIM, VIF, VSNR, and PSNR. This applies for the training with respect to all three cases, i.e. the pure metric prior to mapping and after mapping with exponential and double exponential functions. The comparison between the feature-based quality metrics indicates the comparable performance of ΔNHIQM and the L_p-norms.

Similar observations about accuracy can be made for metric validation. In terms of metric generalization to the unknown images from the validation set, the feature-based quality metrics significantly outperform the other considered metrics in accuracy. While ΔNHIQM and the L_p-norms provide an accuracy over 80% and in some cases close to 90%, all other considered metrics fall below the 80% threshold of generalization accuracy. It is also observed that the largest accuracy, being r_P = 0.91 for ΔNHIQM on the training set using double exponential mapping, does not generalize as well as for the pure metric or exponential mapping. This indicates that fitting ΔNHIQM to a double exponential mapping may already produce some degree of overfitting. Similar trends toward overfitting using double exponential mapping can be observed with the L_2-norm and VIF. In view of this and the degeneration of double exponential mapping to exponential mapping with some metrics, the exponential function may in fact constitute the most preferred mapping in the considered context of wireless imaging.

Let us now compare the prediction monotonicity of the proposed image quality metrics with the other state of the art image quality metrics. As all relationships follow strictly decreasing or increasing functions, differentiation between metric, exponential, and double exponential mapping is not required, as ranks are kept the same for all three cases. The results shown in Table 10 reveal that the feature-based ΔNHIQM approach and the L_p-norms perform favorably over the remaining four metrics, with prediction monotonicity well above 80% for both metric training and validation. From the other metrics, only VIF shows a satisfactory prediction monotonicity of r_S = 0.813
Table 10
Prediction performance of objective quality metrics, predicted MOS using exponential mapping, and predicted MOS using double exponential mapping.

                  Pure metric     Exponential     Double exp.     Spearman        Outliers (exp.)  Outliers (d. exp.)
Type  Name        r_P,T   r_P,V   r_P,T   r_P,V   r_P,T   r_P,V   r_S,T   r_S,V   r_O,T   r_O,V    r_O,T   r_O,V
RR    ΔNHIQM      0.843   0.840   0.892   0.888   0.910   0.860   0.867   0.892   0.017   0        0       0.050
      L_1-norm    0.833   0.841   0.873   0.897   0.895   0.893   0.854   0.901   0.017   0        0.017   0
      L_2-norm    0.845   0.846   0.888   0.884   0.903   0.878   0.875   0.890   0.017   0        0       0
      RRIQA       0.821   0.772   0.829   0.749   0.831   0.752   0.786   0.758   0.050   0.050    0.050   0.050
FR    SSIM        0.582   0.434   0.632   0.511   0.701   0.605   0.558   0.347   0.117   0.050    0.100   0.050
      VIF         0.713   0.737   0.789   0.788   0.877   0.795   0.813   0.729   0.083   0        0.033   0
      VSNR        0.766   0.696   0.758   0.686   0.783   0.686   0.686   0.510   0.083   0        0.050   0.050
      PSNR        0.742   0.712   0.738   0.709   0.741   0.711   0.638   0.615   0.100   0        0.150   0
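The rank-based monotonicity measure in (29) can be sketched as follows. For simplicity, plain ranks are used rather than the midranks mentioned in the text, so ties are not handled; the function name is illustrative.

```python
import numpy as np

def spearman_rank_correlation(pred, subj):
    """Sketch of Eq. (29): Pearson correlation between the ranks of
    the predicted and subjective scores (no midrank tie handling)."""
    # rank of each element = its position in the sorted order
    w = np.argsort(np.argsort(np.asarray(pred, dtype=float))).astype(float)
    g = np.argsort(np.argsort(np.asarray(subj, dtype=float))).astype(float)
    wd = w - w.mean()
    gd = g - g.mean()
    return float((wd * gd).sum() /
                 np.sqrt((wd ** 2).sum() * (gd ** 2).sum()))
```

Any strictly monotone mapping between predicted and subjective scores yields a coefficient of exactly 1 (or -1 for a decreasing mapping), which is why the mapping stage does not affect monotonicity.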
for the training but does not generalize well to the Acknowledgments
unknown images.
Finally, the prediction consistency for the training of The authors wish to thank staff and students of the
both feature-based metrics, DNHIQM and Lp -norms, is Western Australian Telecommunications Research Insti-
superior compared to the other four metrics. It is also tute, Perth, Australia and the School of Engineering at the
observed that the prediction consistency for the validation Blekinge Institute of Technology, Ronneby, Sweden for
of DNHIQM is better when using the exponential mapping participating in the subjective experiments. We would
compared to the double exponential mapping. also like to thank the anonymous reviewers for their
highly valuable comments which helped to significantly
improve the quality of this article.
7. Conclusions
References
In this paper, the design of RR objective perceptual
image quality metrics for wireless imaging has been [1] A.C. Brooks, T.N. Pappas, Structural similarity quality metrics in a
presented. Instead of focusing only on artifacts due to coding context: exploring the space of realistic distortions, in: IS&T/
source encoding, the design follows an end-to-end quality SPIE Human Vision and Electronic Imaging XI, vol. 6057, January
2006, pp. 60570U-1–12.
approach that accounts for the complex nature of artifacts [2] M. Carnec, P. Le Callet, D. Barba, Objective quality assessment of
that may be induced by a wireless communication system. color images based on a generic perceptual reduced reference,
As such, the proposed image quality metrics constitute Image Commun. 23 (4) (April 2008) 239–256.
[3] D.M. Chandler, K.H. Lim, S.S. Hemami, Effects of spatial correlations
alternatives to traditional link layer metrics and and global precedence on the visual fidelity of distorted images, in:
may readily be utilized for in-service quality monitoring IS&T/SPIE Human Vision and Electronic Imaging XI, vol. 6057,
and resource management purposes. Specifically, January 2006, pp. 131–145.
[4] D.M. Chandler, S.S. Hemami, VSNR: a wavelet-based visual signal-
both DNHIQM and the perceptual relevance weighted
to-noise ratio for natural images, IEEE Trans. Image Process. 16 (9)
Lp -norm are designed with respect to low computational (September 2007) 2284–2298.
complexity and low overhead, to measure quality [5] K. Chono, Y.-C. Lin, D. Varodayan, Y. Miyamoto, B. Girod, Reduced-
degradations in a wireless communication system, and reference image quality assessment using distributed source
coding, in: IEEE International Conference on Multimedia and Expo,
to account for different structural artifacts that have June 2008, pp. 609–612
been observed in our distortion model of a wireless link. [6] S. Daly, Visible differences predictor: an algorithm for the assess-
Here, structural artifacts are detected by related feature ment of image fidelity, in: IS&T/SPIE Human Vision, Visual
Processing, and Digital Display III, vol. 1666, August 1992,
metrics. pp. 2–15.
The general framework for the design of RR objective perceptual image quality metrics is outlined. It comprises feature extraction, feature normalization, calculation of difference features, relevance weight acquisition, and feature pooling. In addition, curve fitting techniques are used to find the parameters of suitable mapping functions that translate the objective quality metrics into predicted MOS. The transition from subjective to objective perceptual quality takes place in the relevance weight acquisition and in the derivation of the mapping functions. In both of these parts of the design framework, the results of subjective experiments are used to train our feature-based quality metrics. Moreover, a detailed description and statistical analysis of the subjective data gathered in these experiments and of the related objective feature data is provided.
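The pooling and mapping steps of the framework can be sketched as follows. The feature values, relevance weights, and mapping parameters below are placeholders, not the values obtained from the subjective experiments:

```python
import numpy as np

# Hypothetical normalized feature values (e.g. blocking, blur, ringing,
# intensity masking) and perceptual relevance weights.
weights = np.array([0.4, 0.3, 0.2, 0.1])
features = np.array([0.5, 0.6, 0.3, 0.2])

# NHIQM-style feature pooling: weighted sum of the feature values.
nhiqm = float(np.dot(weights, features))

# Perceptual relevance weighted Lp-norm pooling of difference features
# between a reference image and the image under test (here p = 2).
f_ref = np.array([0.1, 0.2, 0.0, 0.1])
p = 2.0
lp_norm = float(np.sum(weights * np.abs(features - f_ref) ** p) ** (1.0 / p))

def predicted_mos(metric, a=4.5, b=5.0, c=0.5):
    """Logistic mapping from an objective metric to predicted MOS.

    The parameters a, b, c are placeholders for values that would be
    obtained by curve fitting against subjective test data.
    """
    return a / (1.0 + np.exp(b * (metric - c)))
```

The logistic form is monotonically decreasing in the distortion metric, so larger measured degradations map to lower predicted MOS.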
The evaluation of the quality prediction performance reveals that ΔNHIQM and the perceptual relevance weighted Lp-norm both correlate similarly well with human perception of image quality. This holds not only for the training of the metrics but also for the generalization to unknown images. Furthermore, the numerical results show that both feature-based RR metrics outperform even the considered state-of-the-art FR metrics in prediction performance. As the reduced-reference overhead associated with the calculation of ΔNHIQM is condensed to only a single number, this approach may be the more favorable choice for wireless imaging applications compared to the perceptual relevance weighted Lp-norm, which requires all involved features to be communicated from the transmitter to the receiver.
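The overhead difference can be made concrete: for ΔNHIQM only the single pooled value of the transmitted image is sent as side information, whereas the weighted Lp-norm needs every individual feature at the receiver. A minimal sketch with hypothetical feature values and weights:

```python
import numpy as np

weights = np.array([0.4, 0.3, 0.2, 0.1])
f_tx = np.array([0.1, 0.2, 0.0, 0.1])   # features at the transmitter
f_rx = np.array([0.5, 0.6, 0.3, 0.2])   # features at the receiver

# Delta-NHIQM: the transmitter sends one number, its pooled NHIQM value.
side_info = float(np.dot(weights, f_tx))            # 1 value over the channel
delta_nhiqm = abs(float(np.dot(weights, f_rx)) - side_info)

# Weighted Lp-norm: all transmitter-side features must be communicated.
lp_norm = float(np.sum(weights * np.abs(f_rx - f_tx) ** 2.0) ** 0.5)

overhead_delta = 1          # values sent for Delta-NHIQM
overhead_lp = f_tx.size     # values sent for the weighted Lp-norm
```

For N extracted features the side information thus shrinks from N values to one, at a comparable level of prediction performance.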
U. Engelke et al. / Signal Processing: Image Communication 24 (2009) 525–547