Mathematical Problems and Solutions On Information Theory

Information theory deals with the quantification, storage, and communication of information. It encompasses concepts such as entropy, which measures the uncertainty or randomness of a system, information content, and mutual information, which quantifies the amount of information obtained about one random variable through another. Problems in this field often involve optimizing data encoding and transmission to reduce loss and enhance efficiency, addressing challenges like signal degradation and noise.


University of Naples Federico II, Course: Information Theory

2023

Solutions to problems 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, 1.14 and 1.15

Alvi Rownok, P37000166


January 3, 2024

1 Problem 1.1
A fair coin is flipped until the first head occurs. Let X denote the number of
flips required.
1. Find the entropy H(X) in bits.
2. A random variable X is drawn according to this distribution. Find an "efficient" sequence of yes-no questions of the form, "Is X contained in the set S?". Compare H(X) to the expected number of questions required to determine X.
Solution:
The probability of getting a head on the first flip is 1/2; the probability of getting a tail on the first flip and a head on the second is (1/2)^2; the probability of getting tails on the first two flips and a head on the third is (1/2)^3; and so on. In general, the probability that the first head occurs on the n-th flip, i.e. that X = n, is (1/2)^n.
The entropy of a discrete random variable X is defined as

H(X) = − Σ_x p(x) log2 p(x),

where p(x) is the probability of X taking on the value x.
In this case, the entropy of X is

H(X) = − Σ_{n=1}^{∞} (1/2)^n log2 (1/2)^n = Σ_{n=1}^{∞} n (1/2)^n.

The remaining sum can be evaluated with the identity Σ_{n=1}^{∞} n x^n = x/(1 − x)^2 for |x| < 1; with x = 1/2 it equals (1/2)/(1/4) = 2. Therefore

H(X) = 2 bits.

Efficient Sequence of Yes-No Questions


One efficient sequence of yes-no questions to determine the value of X is:

1. Is X = 1? (equivalently: was the first flip a head?)
2. Is X = 2? (was the second flip a head?)
3. Is X = 3?
4. And so on.

We stop as soon as an answer is yes. Each question has the required form "Is X contained in the set S?" with S = {k}, and if X = n the first n − 1 answers are no and the n-th answer is yes, so exactly n questions are needed.
The expected number of questions required to determine the value of X is

E(N) = Σ_n n · p(n),

where p(n) is the probability that n questions are required. With the scheme above, n questions are required exactly when X = n, so p(n) = (1/2)^n and

E(N) = Σ_{n=1}^{∞} n (1/2)^n = 2.

Therefore, the expected number of questions required to determine the value of X is 2.

Comparison of H(X) and E(N)

The entropy of X is 2 bits, and the expected number of questions required to determine the value of X is also 2. The two quantities coincide because the sequence of answers forms a binary (prefix-free) description of X, so for any questioning scheme E(N) ≥ H(X); here the distribution is dyadic (all probabilities are powers of 1/2), so the lower bound is achieved exactly and the question scheme above is optimal.
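The calculation above is easy to check numerically. The following Python sketch (a minimal illustration, not part of the original solution) truncates the geometric distribution at a large N and computes both H(X) and the expected number of questions for the scheme described above; both come out as 2.

```python
import math

# P(X = n) = (1/2)^n for n = 1, 2, ...; truncate the support at N for a numerical check.
N = 60
p = [0.5 ** n for n in range(1, N + 1)]

# Entropy H(X) = -sum p(n) log2 p(n)  (should be ~2 bits).
H = -sum(pn * math.log2(pn) for pn in p)

# Expected number of questions for the scheme "Is X = 1?", "Is X = 2?", ...:
# determining X = n takes exactly n questions.
EN = sum(n * pn for n, pn in zip(range(1, N + 1), p))

print(f"H(X) ≈ {H:.6f} bits")        # ≈ 2.0
print(f"E[N] ≈ {EN:.6f} questions")  # ≈ 2.0
```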

2 Problem 1.2
Let X be a random variable taking on a finite number of values. What is the
(general) inequality relationship of H(X) and H(Y ) if
(a) Y = 2X
(b) Y = cos X
Solution:
Entropy of a function of a random variable
Let X be a random variable taking on a finite number of values, and Y = g(X)
be a function of X. Then, the entropy of Y is related to the entropy of X by
the following inequality:

H(Y ) ≤ H(X)
where H(X) and H(Y ) are the entropies of X and Y , respectively.

Proof
Since Y = g(X) is completely determined by X, we have H(Y|X) = 0. Expanding the joint entropy H(X, Y) by the chain rule in two different orders,

H(X, Y) = H(X) + H(Y|X) = H(X),
H(X, Y) = H(Y) + H(X|Y) ≥ H(Y),

because H(X|Y) ≥ 0. Comparing the two expansions gives

H(Y) ≤ H(X),

with equality if and only if H(X|Y) = 0, i.e. if and only if g is one-to-one on the support of X (so that X can be recovered from Y). This completes the proof.

Examples
(a) Y = 2X
In this case, the function g(x) = 2x is one-to-one: distinct values of X are mapped to distinct values of Y. Therefore

p(Y = 2x) = p(X = x) for every x,

so Y takes its values with exactly the same set of probabilities as X, and

H(Y) = H(X).

Doubling a random variable relabels its values but adds no uncertainty and removes none.

(b) Y = cos X
In this case, the function g(x) = cos x is in general not one-to-one: for example, cos(−x) = cos(x), so distinct values of X can be mapped to the same value of Y and their probabilities are merged. By the inequality proved above,

H(Y) ≤ H(X),

with equality if and only if cos is one-to-one on the support of X (no two values in the support share the same cosine). When several values of X do share a cosine, the inequality is strict.
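Both examples can be checked numerically for any finite distribution. The sketch below (illustrative only; the support {−1, 1, 2} and its probabilities are arbitrary choices) computes H(X), H(2X) and H(cos X) and confirms H(2X) = H(X) while H(cos X) ≤ H(X), strictly here because cos(−1) = cos(1).

```python
import math
from collections import defaultdict

def entropy(dist):
    """Entropy in bits of a dict mapping value -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def push_forward(dist, g):
    """Distribution of g(X) when X has distribution `dist`."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[g(x)] += p
    return dict(out)

# Arbitrary example distribution; -1 and 1 have the same cosine.
pX = {-1: 0.25, 1: 0.25, 2: 0.5}

H_X    = entropy(pX)
H_2X   = entropy(push_forward(pX, lambda x: 2 * x))                  # one-to-one map
H_cosX = entropy(push_forward(pX, lambda x: round(math.cos(x), 12))) # merges -1 and 1

print(H_X, H_2X, H_cosX)   # 1.5, 1.5, 1.0  ->  H(2X) = H(X), H(cos X) < H(X)
```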

3 Problem 1.3
What is the minimum value of H(p1 , . . . , pn ) = H(p) as p ranges over the set of
n-dimensional probability vectors? Find all p’s which achieve this minimum.

Solution: The minimum value of H(p_1, . . . , p_n) = H(p) as p ranges over the set of n-dimensional probability vectors is 0. This minimum is achieved exactly when p is a degenerate (point-mass) vector, i.e. p_i = 1 for some i and p_j = 0 for all j ≠ i.

To prove this, write

H(p) = − Σ_{i=1}^{n} p_i log p_i,

with the usual convention 0 log 0 = 0. For every p_i ∈ [0, 1] we have −p_i log p_i ≥ 0, so H(p) ≥ 0, and H(p) = 0 if and only if every term vanishes, i.e. each p_i is either 0 or 1. Since the components sum to 1, exactly one component equals 1 and all the others are 0.

Therefore the minimum value of H(p) is 0, and it is achieved precisely by the point-mass vectors p = e_i = (0, . . . , 0, 1, 0, . . . , 0), i = 1, . . . , n. (By contrast, the uniform vector p_1 = · · · = p_n = 1/n maximizes the entropy, with H(p) = log n.)
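A quick numerical illustration (not part of the original solution): the entropy of a point mass, of an intermediate vector, and of the uniform vector, showing the minimum 0 at the point mass and the maximum log2 n at the uniform vector.

```python
import math

def entropy(p):
    # 0 log 0 is taken as 0.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

n = 4
print(entropy([1.0, 0.0, 0.0, 0.0]))   # 0.0              -> the minimum
print(entropy([0.7, 0.1, 0.1, 0.1]))   # ~1.357
print(entropy([1 / n] * n))            # 2.0 = log2(4)    -> the maximum
```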

4 Problem 1.4
An urn contains r red, w white, and b black balls. Which has higher entropy,
drawing k ≥ 2 balls from the urn with replacement or without replacement?
Set it up and show why. (There is both a hard way and a relatively simple way
to do this.)

Solution:

The relatively simple way is to compare the joint entropies of the whole sequence of draws. Let X_i denote the colour of the i-th ball drawn, i = 1, . . . , k.

With Replacement:
The draws are i.i.d., each with distribution (r/(r+w+b), w/(r+w+b), b/(r+w+b)), so by independence

H(X_1, . . . , X_k) = Σ_{i=1}^{k} H(X_i) = k H(X_1).

Without Replacement:
The draws are dependent, but by symmetry each draw considered on its own has the same marginal distribution as the first: the unconditional probability that the i-th ball is red is still r/(r + w + b), and likewise for white and black. By the chain rule and the fact that conditioning cannot increase entropy,

H(X_1, . . . , X_k) = Σ_{i=1}^{k} H(X_i | X_1, . . . , X_{i−1}) ≤ Σ_{i=1}^{k} H(X_i) = k H(X_1),

and for k ≥ 2 the inequality is strict whenever at least two colours are present in the urn, because knowing the earlier draws changes the composition of the urn and hence the conditional distribution of the later draws.

Comparison:
The right-hand side k H(X_1) is exactly the joint entropy of the draws with replacement. Therefore

S_without < S_with,

and drawing k ≥ 2 balls from the urn with replacement has higher entropy than drawing k ≥ 2 balls from the urn without replacement. (For a single draw the two schemes have the same entropy, since the marginal distributions coincide; the difference appears only in the joint entropy of the whole sequence.)
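For a small urn the two joint entropies can be compared exactly by enumerating all colour sequences. The sketch below (an illustration with arbitrarily chosen counts r = 2, w = 1, b = 1 and k = 2 draws) computes H(X_1, ..., X_k) with and without replacement and confirms that sampling with replacement gives the larger value.

```python
import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

r, w, b, k = 2, 1, 1, 2                      # small urn, 2 draws
balls = ["r"] * r + ["w"] * w + ["b"] * b
n = len(balls)

# With replacement: i.i.d. draws, each ball equally likely every time.
with_rep = {}
for seq in product(range(n), repeat=k):
    colours = tuple(balls[i] for i in seq)
    with_rep[colours] = with_rep.get(colours, 0.0) + (1 / n) ** k

# Without replacement: ordered draws of distinct balls, all equally likely.
without_rep = {}
for seq in product(range(n), repeat=k):
    if len(set(seq)) < k:
        continue
    colours = tuple(balls[i] for i in seq)
    prob = 1.0
    for j in range(k):
        prob /= (n - j)
    without_rep[colours] = without_rep.get(colours, 0.0) + prob

print(entropy(with_rep.values()))      # joint entropy, with replacement (3.0 bits here)
print(entropy(without_rep.values()))   # joint entropy, without replacement (~2.75 bits)
```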

5 Problem 1.5
Let p(x, y) be given as in the table shown in Figure 1.

Figure 1: Probability distribution table for p(x, y); the entries are p(0, 0) = 1/3, p(0, 1) = 1/3, p(1, 0) = 0, p(1, 1) = 1/3.

• (a) Entropies: H(X), H(Y )


– Calculate the entropy H(X).
– Calculate the entropy H(Y ).

• (b) Conditional Entropies: H(X|Y ), H(Y |X)


– Calculate the conditional entropy H(X|Y ).
– Calculate the conditional entropy H(Y |X).
• (c) Joint Entropy: H(X, Y )

– Calculate the joint entropy H(X, Y ).


• (d) Conditional Difference: H(Y ) − H(X|Y )
– Calculate H(Y ) − H(X|Y ).

• (e) Mutual Information: I(X; Y )


– Calculate the mutual information I(X; Y ).
• Venn Diagram
– Draw a Venn diagram to represent the quantities in (a) through (e).

Solution:

a) H(X) and H(Y) are given by:

H(X) = − Σ_{x∈X} p_X(x) log2 p_X(x),
H(Y) = − Σ_{y∈Y} p_Y(y) log2 p_Y(y),

where p_X and p_Y are the marginal probabilities of X and Y, obtained by summing the joint probabilities over each row and column, respectively:

p_X(0) = p(0, 0) + p(0, 1) = 1/3 + 1/3 = 2/3,
p_X(1) = p(1, 0) + p(1, 1) = 0 + 1/3 = 1/3,
p_Y(0) = p(0, 0) + p(1, 0) = 1/3 + 0 = 1/3,
p_Y(1) = p(0, 1) + p(1, 1) = 1/3 + 1/3 = 2/3.

Therefore,

H(X) = −(2/3) log2(2/3) − (1/3) log2(1/3) ≈ 0.9183 bits,
H(Y) = −(1/3) log2(1/3) − (2/3) log2(2/3) ≈ 0.9183 bits.
b) H(X|Y) and H(Y|X) are given by:

H(X|Y) = − Σ_{y∈Y} p_Y(y) Σ_{x∈X} p(x|y) log2 p(x|y),
H(Y|X) = − Σ_{x∈X} p_X(x) Σ_{y∈Y} p(y|x) log2 p(y|x),

where p(x|y) = p(x, y)/p_Y(y) and p(y|x) = p(x, y)/p_X(x) are the conditional probabilities of X given Y and of Y given X, respectively. From the table,

p(x|y = 0): p(0|0) = (1/3)/(1/3) = 1, p(1|0) = 0,
p(x|y = 1): p(0|1) = (1/3)/(2/3) = 1/2, p(1|1) = 1/2,
p(y|x = 0): p(0|0) = (1/3)/(2/3) = 1/2, p(1|0) = 1/2,
p(y|x = 1): p(0|1) = 0, p(1|1) = 1.

Therefore,

H(X|Y) = p_Y(0) H(X|Y = 0) + p_Y(1) H(X|Y = 1) = (1/3)·0 + (2/3)·1 = 2/3 ≈ 0.6667 bits,
H(Y|X) = p_X(0) H(Y|X = 0) + p_X(1) H(Y|X = 1) = (2/3)·1 + (1/3)·0 = 2/3 ≈ 0.6667 bits.
c) H(X, Y) is given by:

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(x, y) = −3 · (1/3) log2(1/3) = log2 3 ≈ 1.585 bits,

since three of the four entries equal 1/3 and the entry p(1, 0) = 0 contributes nothing (0 log 0 = 0).

d) H(Y) − H(X|Y) is given by:

H(Y) − H(X|Y) = 0.9183 − 0.6667 ≈ 0.2516 bits,

which is exactly the mutual information I(X; Y).

e) I(X; Y) is given by:

I(X; Y) = H(X) + H(Y) − H(X, Y) = 0.9183 + 0.9183 − 1.585 ≈ 0.2516 bits.

Venn diagram for the quantities in (a) through (e): the two overlapping circles represent H(X) ≈ 0.9183 bits and H(Y) ≈ 0.9183 bits. Their intersection is the mutual information I(X; Y) ≈ 0.2516 bits, the part of the H(X) circle outside the overlap is H(X|Y) ≈ 0.6667 bits, the part of the H(Y) circle outside the overlap is H(Y|X) ≈ 0.6667 bits, and the area of the union is H(X, Y) ≈ 1.585 bits.
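All of the quantities in (a) through (e) can be computed directly from the joint table. A minimal Python sketch (using the table values stated above) follows.

```python
import math

log2 = math.log2
# Joint distribution p(x, y) as used above.
p = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 0.0, (1, 1): 1/3}

pX = {x: sum(v for (a, b), v in p.items() if a == x) for x in (0, 1)}
pY = {y: sum(v for (a, b), v in p.items() if b == y) for y in (0, 1)}

H = lambda dist: -sum(v * log2(v) for v in dist.values() if v > 0)

H_X, H_Y = H(pX), H(pY)
H_XY = -sum(v * log2(v) for v in p.values() if v > 0)   # joint entropy
H_X_given_Y = H_XY - H_Y                                # chain rule
H_Y_given_X = H_XY - H_X
I_XY = H_X + H_Y - H_XY                                 # mutual information

print(H_X, H_Y)                   # 0.9183, 0.9183
print(H_X_given_Y, H_Y_given_X)   # 0.6667, 0.6667
print(H_XY)                       # 1.585
print(H_Y - H_X_given_Y, I_XY)    # 0.2516, 0.2516
```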

6 Problem 1.6
Let X and Y be random variables that take on values x1 , x2 , . . . , xr and y1 , y2 , . . . , ys ,
respectively. Let Z = X + Y .

a) Show that H(Z|X) = H(Y |X). Argue that if X and Y are independent,
then H(Y ) ≤ H(Z) and H(X) ≤ H(Z). Thus, the addition of indepen-
dent random variables adds uncertainty.
b) Give an example of (necessarily dependent) random variables in which
H(Y ) ≥ H(Z) and H(X) ≥ H(Z).
c) Under what conditions does H(Z) = H(X) + H(Y )?

Solution:

a) For each fixed value x of X we have Z = x + Y, and the map y ↦ x + y is one-to-one. Hence the conditional distribution of Z given X = x is just a relabelling of the conditional distribution of Y given X = x:

P(Z = z | X = x) = P(Y = z − x | X = x).

Since entropy depends only on the set of probabilities and not on the labels of the values,

H(Z | X = x) = H(Y | X = x) for every x,

and therefore

H(Z|X) = Σ_{x∈X} p(x) H(Z | X = x) = Σ_{x∈X} p(x) H(Y | X = x) = H(Y |X)

If X and Y are independent, then H(Y|X) = H(Y). Since conditioning cannot increase entropy,

H(Z) ≥ H(Z|X) = H(Y|X) = H(Y).

By the symmetric argument, conditioning on Y gives H(Z|Y) = H(X|Y) and hence

H(Z) ≥ H(Z|Y) = H(X|Y) = H(X).

Thus the addition of independent random variables adds uncertainty.

b) An example of (necessarily dependent) random variables in which H(Y ) ≥
H(Z) and H(X) ≥ H(Z) is given by X = Y = 0 with probability 1/2
and X = 1, Y = −1 with probability 1/2. Then, Z = X + Y = 0 with
probability 1. Therefore, H(Z) = 0. However, H(Y ) = H(X) = 1.
c) Since Z is a function of the pair (X, Y),

H(Z) ≤ H(X, Y) ≤ H(X) + H(Y).

The first inequality is an equality if and only if (X, Y) can be recovered from Z, i.e. the map (x, y) ↦ x + y is one-to-one on the support of (X, Y); the second is an equality if and only if X and Y are independent. Therefore H(Z) = H(X) + H(Y) if and only if X and Y are independent and all sums x_i + y_j occurring with positive probability are distinct, so that the pair (X, Y) is determined by Z.
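The example in part (b) is easy to verify numerically; the sketch below (illustrative only) computes H(X), H(Y) and H(Z) for that joint distribution.

```python
import math
from collections import defaultdict

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Part (b) example: (X, Y) = (0, 0) w.p. 1/2 and (1, -1) w.p. 1/2, so Z = X + Y = 0 always.
joint = {(0, 0): 0.5, (1, -1): 0.5}

pX, pY, pZ = defaultdict(float), defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    pX[x] += p
    pY[y] += p
    pZ[x + y] += p

print(entropy(pX), entropy(pY), entropy(pZ))   # 1.0, 1.0, 0.0
```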

7 Problem 1.7
Given the following joint distribution on (X, Y ) as shown in the image:

Let X̂(Y ) be an estimator for X based on Y , and let Pe = Pr (X̂(Y ) ̸= X).

• Find the minimum probability error estimator X̂(Y ) and the associated
Pe .
• Evaluate the Fano inequality for this problem and compare.

Solution:

Minimum Probability of Error Estimator

For each observed value y, an estimator that guesses x̂ makes an error with probability

P(X ≠ x̂ | Y = y) = 1 − P(X = x̂ | Y = y),

which is minimized by guessing the most likely value of X given Y = y. Hence the minimum probability of error estimator is the MAP (maximum a posteriori) rule

X̂(y) = arg max_x P(X = x | Y = y) = arg max_x p(x, y),

and its probability of error is

Pe = Σ_y P(Y = y) (1 − max_x P(X = x | Y = y)) = 1 − Σ_y max_x p(x, y).

The numerical value of Pe is obtained by reading off the largest entry of each column (each value of Y) of the joint distribution table in the figure and subtracting their sum from 1.

Fano Inequality

Fano's inequality lower-bounds the error probability of any estimator X̂(Y) in terms of the conditional entropy:

H(Pe) + Pe log2(|X| − 1) ≥ H(X | Y),

where H(Pe) is the binary entropy of Pe and |X| is the size of the alphabet of X. Using H(Pe) ≤ 1 gives the weaker but more convenient form

Pe ≥ (H(X | Y) − 1) / log2(|X| − 1).

To compare, one evaluates H(X | Y) = H(X, Y) − H(Y) from the joint table and checks that the Pe of the MAP estimator found above satisfies this bound; the bound holds for every estimator but is in general not tight, so the error probability of even the optimal estimator typically exceeds the Fano lower bound.
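Since the joint table from the figure is not reproduced in this text, the sketch below uses a hypothetical joint distribution purely to illustrate how the MAP estimator, its error probability, and the Fano bound would be computed; the numbers are not those of the original problem.

```python
import math

# Hypothetical joint distribution p(x, y); not the table from the original figure.
p = {
    (1, 'a'): 1/6, (1, 'b'): 1/12, (1, 'c'): 1/12,
    (2, 'a'): 1/12, (2, 'b'): 1/6, (2, 'c'): 1/12,
    (3, 'a'): 1/12, (3, 'b'): 1/12, (3, 'c'): 1/6,
}
xs = sorted({x for x, _ in p})
ys = sorted({y for _, y in p})

# MAP estimator: for each y pick the x maximizing p(x, y); Pe = 1 - sum_y max_x p(x, y).
x_hat = {y: max(xs, key=lambda x: p[(x, y)]) for y in ys}
Pe = 1 - sum(p[(x_hat[y], y)] for y in ys)

# Conditional entropy H(X|Y) = H(X, Y) - H(Y), in bits.
pY = {y: sum(p[(x, y)] for x in xs) for y in ys}
H_XY = -sum(v * math.log2(v) for v in p.values() if v > 0)
H_Y = -sum(v * math.log2(v) for v in pY.values() if v > 0)
H_X_given_Y = H_XY - H_Y

# Weakened Fano bound: Pe >= (H(X|Y) - 1) / log2(|X| - 1).
fano_lower = (H_X_given_Y - 1) / math.log2(len(xs) - 1)

print(x_hat, Pe)                 # MAP rule and its error probability (0.5 for this table)
print(H_X_given_Y, fano_lower)   # H(X|Y) and the Fano lower bound on Pe
```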

8 Problem 1.8
Let X1 , X2 , . . . be an i.i.d. sequence of discrete random variables with entropy
H(X). Define C_n(t) = {x^n ∈ X^n : p(x^n) ≥ 2^{−nt}}, the subset of n-sequences with probability at least 2^{−nt}.

1. Show that |C_n(t)| ≤ 2^{nt}.
2. For what values of t does P(X^n ∈ C_n(t)) → 1?
3. Denoting by p(x) the probability mass function of X, compute the limit

lim_{n→∞} (p(X_1, X_2, . . . , X_n))^{1/n}.

Solution:

1. Proof of |C_n(t)| ≤ 2^{nt}

Since p(x^n) ≥ 2^{−nt} for all x^n ∈ C_n(t), we have

1 ≥ Σ_{x^n ∈ C_n(t)} p(x^n) ≥ Σ_{x^n ∈ C_n(t)} 2^{−nt} = 2^{−nt} |C_n(t)|,

where the first inequality holds because the probabilities of the sequences in C_n(t) sum to at most 1. Multiplying both sides by 2^{nt} gives

|C_n(t)| ≤ 2^{nt}.


2. Values of t for which P(X^n ∈ C_n(t)) → 1

The event X^n ∈ C_n(t) is exactly the event p(X^n) ≥ 2^{−nt}, i.e.

−(1/n) log2 p(X_1, . . . , X_n) ≤ t.

Since the X_i are i.i.d., −(1/n) log2 p(X_1, . . . , X_n) = −(1/n) Σ_{i=1}^{n} log2 p(X_i) converges in probability to E[−log2 p(X)] = H(X) by the weak law of large numbers (this is the AEP). Therefore

P(X^n ∈ C_n(t)) → 1 for every t > H(X),

and P(X^n ∈ C_n(t)) → 0 for every t < H(X).
3. Limit of (p(X_1, X_2, . . . , X_n))^{1/n}

Since X_1, X_2, . . . , X_n are i.i.d.,

(p(X_1, . . . , X_n))^{1/n} = (Π_{i=1}^{n} p(X_i))^{1/n} = 2^{(1/n) Σ_{i=1}^{n} log2 p(X_i)}.

By the law of large numbers, (1/n) Σ_{i=1}^{n} log2 p(X_i) → E[log2 p(X)] = −H(X), so

lim_{n→∞} (p(X_1, X_2, . . . , X_n))^{1/n} = 2^{−H(X)},

where the convergence is in probability (and with probability 1, by the strong law).
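The AEP statement used in parts 2 and 3 can be illustrated by simulation. The sketch below (illustrative; the Bernoulli(0.3) source is an arbitrary choice) shows −(1/n) log2 p(X^n) concentrating around H(X), which is equivalent to p(X^n)^{1/n} → 2^{−H(X)}.

```python
import math
import random

random.seed(0)
q = 0.3                                                  # P(X = 1) for an i.i.d. Bernoulli source
H = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))     # entropy of one symbol

def empirical_rate(n):
    """Return -(1/n) log2 p(X_1, ..., X_n) for one sampled sequence."""
    logp = 0.0
    for _ in range(n):
        x = 1 if random.random() < q else 0
        logp += math.log2(q if x == 1 else 1 - q)
    return -logp / n

for n in (10, 100, 1000, 10000):
    print(n, empirical_rate(n))                          # approaches H ≈ 0.8813 as n grows
print("H(X) =", H)
```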

9 Problem 1.9
Huffman Codes for Random Variable X
Consider the random variable X defined by the following table (symbols x1, . . . , x7 with probabilities 0.49, 0.26, 0.12, 0.04, 0.04, 0.03, 0.02, as used in the solution below):

Binary Huffman Code


To find a binary Huffman code for X, you can follow the standard Huffman
coding algorithm.
Expected Codelength
To find the expected codelength for this encoding, you can use the formula:

Expected Codelength = Σ_i p_i · ℓ(x_i),

where p_i is the probability of symbol x_i and ℓ(x_i) is the length of its codeword.
Ternary Huffman Code
Similarly, find a ternary Huffman code for X using the Huffman coding algo-
rithm.

Solution:
[Binary Huffman Code]
The Huffman algorithm works by repeatedly merging the two nodes (symbols or previously merged groups) with the lowest probabilities until a single tree is formed. In this case, the symbols are x1, x2, · · · , x7 with the probabilities given in the table below.

Symbol Probability
x1 0.49
x2 0.26
x3 0.12
x4 0.04
x5 0.04
x6 0.03
x7 0.02

The merging steps are:
1. Merge x6 and x7: 0.03 + 0.02 = 0.05.
2. Merge x4 and x5: 0.04 + 0.04 = 0.08.
3. The two smallest remaining weights are now 0.05 and 0.08; merge them: 0.13.
4. Merge x3 (0.12) with the 0.13 node: 0.25.
5. Merge x2 (0.26) with the 0.25 node: 0.51.
6. Merge x1 (0.49) with the 0.51 node to form the root.

Assigning 0 to one branch and 1 to the other at every internal node, the codeword of each symbol is the sequence of branch labels on the path from the root to its leaf. One resulting code is:

Symbol Code
x1 0
x2 10
x3 110
x4 11100
x5 11101
x6 11110
x7 11111

The expected codelength for this encoding is the sum over symbols of the probability multiplied by the length of the code:

E(L) = Σ_{i=1}^{7} p_i ℓ_i = 0.49·1 + 0.26·2 + 0.12·3 + 0.04·5 + 0.04·5 + 0.03·5 + 0.02·5 = 2.02

Therefore, the expected codelength for this encoding is 2.02 bits, which is close to the entropy H(X) ≈ 2.01 bits, as it must be for a Huffman code.

[Ternary Huffman Code]

For a ternary code (code alphabet {0, 1, 2}, D = 3) the Huffman algorithm merges the three smallest probabilities at each step. Since the number of symbols satisfies 7 ≡ 1 (mod D − 1), no dummy symbols are needed. With the same probabilities as above:

1. Merge x5, x6 and x7: 0.04 + 0.03 + 0.02 = 0.09 (either of the two 0.04 symbols may be used here).
2. The three smallest remaining weights are 0.04 (x4), 0.09 and 0.12 (x3); merge them: 0.25.
3. Merge 0.25, 0.26 (x2) and 0.49 (x1) to form the root.

Assigning the digits 0, 1 and 2 to the three branches at each internal node gives, for example:

Symbol Code
x1 0
x2 1
x3 20
x4 21
x5 220
x6 221
x7 222

The expected codelength for this encoding is

E(L) = 0.49·1 + 0.26·1 + 0.12·2 + 0.04·2 + 0.04·3 + 0.03·3 + 0.02·3 = 1.34

Therefore, the expected codelength for this encoding is 1.34 ternary symbols.
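The binary construction above can be reproduced with a few lines of Python (a standard heap-based Huffman sketch, not part of the original solution); it returns codeword lengths (1, 2, 3, 5, 5, 5, 5) and expected length 2.02 bits for the given probabilities.

```python
import heapq

def huffman_lengths(probs):
    """Binary Huffman codeword lengths for the given probabilities."""
    heap = [(p, [i]) for i, p in enumerate(probs)]   # (weight, symbols under this node)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for s in s1 + s2:             # every symbol under a merge gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

probs = [0.49, 0.26, 0.12, 0.04, 0.04, 0.03, 0.02]
lengths = huffman_lengths(probs)
print(lengths)                                      # [1, 2, 3, 5, 5, 5, 5]
print(sum(p * l for p, l in zip(probs, lengths)))   # ≈ 2.02
```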

10 Problem 1.10
Consider a source with probabilities 1/3, 1/5, 1/5, 2/15, 2/15. Find the binary Huffman code for this source. Argue that this code is also optimal for the source with probabilities 1/5, 1/5, 1/5, 1/5, 1/5.


Solution:
Huffman Code for the Source with Probabilities (1/3, 1/5, 1/5, 2/15, 2/15)

1. Construct the Huffman Tree. Writing all probabilities with denominator 15, the source is (5/15, 3/15, 3/15, 2/15, 2/15).

• Merge the two smallest probabilities: 2/15 + 2/15 = 4/15. Remaining weights: 5/15, 3/15, 3/15, 4/15.
• Merge the two smallest: 3/15 + 3/15 = 6/15. Remaining weights: 5/15, 4/15, 6/15.
• Merge the two smallest: 5/15 + 4/15 = 9/15. Remaining weights: 6/15, 9/15.
• Merge the last two weights to form the root.

2. Assign Huffman Codes. Assign a 0 to one branch and a 1 to the other at each internal node and read the codewords from the root to the leaves. One resulting code is:

Probability Codeword
1/3   00
2/15  010
2/15  011
1/5   10
1/5   11

The codeword lengths are (2, 3, 3, 2, 2) and the expected length is

L = (1/3)·2 + (2/15)·3 + (2/15)·3 + (1/5)·2 + (1/5)·2 = 34/15 ≈ 2.27 bits.

Optimality for the Source with Probabilities (1/5, 1/5, 1/5, 1/5, 1/5)

Running the Huffman algorithm on the uniform source produces exactly the same multiset of codeword lengths: merge 1/5 + 1/5 = 2/5, then 1/5 + 1/5 = 2/5, then 1/5 + 2/5 = 3/5, then form the root, giving lengths (2, 2, 2, 3, 3). Since the Huffman procedure always yields an optimal (minimum expected length) code, and the code constructed above has the same codeword lengths (2, 2, 2, 3, 3), it achieves the same expected length (2 + 2 + 2 + 3 + 3)/5 = 12/5 = 2.4 bits on the uniform source as the uniform source's own Huffman code. A code is optimal precisely when its length assignment matches one produced by the Huffman construction for the source, so the code above is also optimal for the uniform source. (Note that the two sources do not have the same entropy — about 2.23 bits for the first source and log2 5 ≈ 2.32 bits for the uniform one; the argument rests on the codeword lengths, not on the entropies.)

11 Problem 1.11
• Huffman Code for Random Variable X: Construct a binary Huffman code for a random variable X taking four values with probabilities 1/3, 1/3, 1/4, 1/12 (the distribution used in the solution below).
• Optimal Code Length Assignments: Show that there exist two different sets of optimal lengths for the codewords; namely, show that the codeword length assignments (1, 2, 3, 3) and (2, 2, 2, 2) are both optimal.

• Conclusion about Optimal Codes: Conclude that there are optimal codes with codeword lengths for some symbols that exceed the Shannon code length ⌈log2(1/p(x))⌉.

Solution:

Constructing the Huffman Code

• The probabilities, in descending order, are 1/3, 1/3, 1/4, 1/12.
• Merge the two smallest probabilities: 1/4 + 1/12 = 1/3. The list becomes (1/3, 1/3, 1/3).
• Merge two of the three equal weights: 1/3 + 1/3 = 2/3. The list becomes (2/3, 1/3).
• Merge the last two weights to form the root.

If in the second merge the combined 1/4 + 1/12 node is merged with one of the original 1/3 symbols, the resulting codeword lengths are (1, 2, 3, 3) and one corresponding Huffman code is:

Symbol 1 (p = 1/3): 0
Symbol 2 (p = 1/3): 10
Symbol 3 (p = 1/4): 110
Symbol 4 (p = 1/12): 111

If instead the two original 1/3 symbols are merged with each other, the resulting codeword lengths are (2, 2, 2, 2).

Showing the Existence of Two Optimal Sets of Codeword Lengths

• Set 1: (1, 2, 3, 3). Average codeword length: 1·(1/3) + 2·(1/3) + 3·(1/4) + 3·(1/12) = 1/3 + 2/3 + 3/4 + 1/4 = 2 bits.
• Set 2: (2, 2, 2, 2). Average codeword length: 2·(1/3) + 2·(1/3) + 2·(1/4) + 2·(1/12) = 2 bits.

Both assignments arise from valid Huffman merges (they differ only in how the tie between the three equal weights 1/3 is broken), and both give the same average length of 2 bits, so both are optimal.

Conclusion
The existence of two different sets of optimal codeword lengths shows that the optimal code for a given random variable need not be unique. Moreover, in Set 1 the symbol with probability 1/4 receives a codeword of length 3, which exceeds its Shannon code length ⌈log2(1/(1/4))⌉ = 2. Hence there exist optimal codes in which the codeword lengths of some symbols exceed the Shannon code lengths ⌈log2(1/p(x))⌉: optimality is a property of the length assignment as a whole, not of each codeword individually.

12 Problem 1.12
Consider the following method for generating a code for a random variable X which takes on m values {1, 2, . . . , m} with probabilities p1, p2, . . . , pm. Assume that the probabilities are ordered so that p1 ≥ p2 ≥ . . . ≥ pm. Define

F_i = Σ_{k=1}^{i−1} p_k,

the cumulative probability of all symbols preceding i. Then the codeword for i is the number F_i ∈ [0, 1) rounded off to ℓ_i bits (i.e. the first ℓ_i bits of the binary expansion of F_i), where ℓ_i = ⌈log2(1/p_i)⌉.

• Show that the code constructed by this process is prefix-free, and the
average length satisfies H(X) ≤ L < H(X) + 1.

• Construct the code for the probability distribution (0.5, 0.25, 0.125, 0.125).

Solution:

For the prefix-free property, note that the codeword of symbol i consists of the first ℓ_i bits of the binary expansion of F_i; these bits identify an interval of length 2^{−ℓ_i} that contains F_i. For any j > i,

F_j − F_i = Σ_{k=i}^{j−1} p_k ≥ p_i ≥ 2^{−ℓ_i},

since ℓ_i ≥ log2(1/p_i). Hence F_j lies beyond the interval of length 2^{−ℓ_i} determined by the codeword of i, so the first ℓ_i bits of F_j differ from the codeword of i, and the codeword of i is not a prefix of the codeword of j. In the other direction, since the probabilities are non-increasing we have ℓ_j ≥ ℓ_i for j > i, so the codeword of j cannot be a proper prefix of the (shorter or equally long) codeword of i either; if the lengths are equal, the two codewords differ by the argument just given. Therefore the code is prefix-free.

For the average length, since ℓ_i = ⌈log2(1/p_i)⌉ we have

log2(1/p_i) ≤ ℓ_i < log2(1/p_i) + 1.

Multiplying by p_i and summing over i gives

H(X) ≤ L = Σ_{i=1}^{m} p_i ℓ_i < H(X) + 1.

Construction for the distribution (0.5, 0.25, 0.125, 0.125). The probabilities are already in decreasing order, so

p_i    F_i (binary)   ℓ_i = ⌈log2(1/p_i)⌉   Codeword
0.5    0.000          1                      0
0.25   0.100          2                      10
0.125  0.110          3                      110
0.125  0.111          3                      111

The average length is L = 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3 = 1.75 bits, which equals H(X) because all the probabilities are dyadic; in general H(X) ≤ L < H(X) + 1.
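The construction can be implemented directly. The following sketch (illustrative only) builds the code for (0.5, 0.25, 0.125, 0.125) as described, checks the prefix-free property, and prints the average length.

```python
import math

def shannon_code(probs):
    """Codewords: first ceil(log2(1/p_i)) bits of F_i, probabilities sorted in decreasing order."""
    probs = sorted(probs, reverse=True)
    codewords, F = [], 0.0
    for p in probs:
        l = math.ceil(-math.log2(p))
        # take the first l bits of the binary expansion of F
        bits = "".join(str(int(F * 2 ** (k + 1)) % 2) for k in range(l))
        codewords.append(bits)
        F += p
    return probs, codewords

def is_prefix_free(codes):
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

probs, codes = shannon_code([0.5, 0.25, 0.125, 0.125])
print(list(zip(probs, codes)))   # [(0.5, '0'), (0.25, '10'), (0.125, '110'), (0.125, '111')]
print(is_prefix_free(codes))     # True
print(sum(p * len(c) for p, c in zip(probs, codes)))   # 1.75 = H(X)
```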

13 Problem 1.13
Assume that a communication channel with transition probabilities p(y|x) and channel capacity C = max_{p(x)} I(X; Y) is given. A helpful statistician preprocesses the output by forming Y′ = g(Y). He claims that this will strictly improve the capacity.

1. Show that he is wrong.


2. Under what conditions does he not strictly decrease the capacity?

Solution:

1. Proof that the statistician is wrong:

Let X be the input, Y the channel output, and Y′ = g(Y) the preprocessed output. Since Y′ is a function of Y alone, X → Y → Y′ forms a Markov chain, and the data-processing inequality gives

I(X; Y′) ≤ I(X; Y)

for every input distribution p(x). Taking the maximum over p(x) on both sides shows that the capacity of the cascade channel from X to Y′ is at most C = max_{p(x)} I(X; Y). Hence preprocessing the output can never strictly improve the capacity, and the statistician's claim is false.

2. Conditions under which the statistician does not strictly decrease the capacity:

The capacity is left unchanged if and only if there exists a capacity-achieving input distribution for which I(X; Y′) = I(X; Y), i.e., for which X → Y′ → Y is also a Markov chain, so that Y′ = g(Y) is a sufficient statistic for X. In particular, if g is one-to-one (the preprocessing is invertible), then Y can be recovered from Y′, no information about X is lost, and the capacity is exactly preserved.

14 Problem 1.14
Consider the Z-channel with binary input and output alphabets. The transition
probabilities p(y|x) are given by the following matrix:

( p(0|0)  p(1|0) )
( p(0|1)  p(1|1) )
Find the capacity of the Z-channel and determine the maximizing input proba-
bility distribution.

Solution:
The capacity of the channel is

C = max_{p(x)} I(X; Y),  with I(X; Y) = H(Y) − H(Y|X).

For the Z-channel the input 0 is received without error, while the input 1 is received correctly only with probability 1/2 (the usual form of this exercise, assumed here since the numerical entries of the matrix are not reproduced in this text):

p(0|0) = 1, p(1|0) = 0, p(0|1) = 1/2, p(1|1) = 1/2.

Let α = P(X = 1). Then P(Y = 1) = α/2, so

H(Y) = H_b(α/2),

where H_b(p) = −p log2 p − (1 − p) log2(1 − p) is the binary entropy function, and

H(Y|X) = (1 − α)·0 + α·H_b(1/2) = α.

Hence

I(X; Y) = H_b(α/2) − α.

Setting the derivative with respect to α equal to zero,

d/dα [H_b(α/2) − α] = (1/2) log2((1 − α/2)/(α/2)) − 1 = 0  ⟹  (2 − α)/α = 4  ⟹  α = 2/5.

Therefore the maximizing input probability distribution is P(X = 1) = 2/5, P(X = 0) = 3/5, and the capacity of the Z-channel is

C = H_b(1/5) − 2/5 = log2 5 − 2 ≈ 0.322 bits per channel use.
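The analytical answer can be confirmed by a brute-force search over the input distribution. The sketch below (assuming, as above, the Z-channel with crossover probability 1/2) evaluates I(X;Y) on a grid of α = P(X = 1) and reports the maximum.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(alpha):
    # Z-channel assumed here: p(0|0) = 1, p(0|1) = p(1|1) = 1/2.
    # P(Y = 1) = alpha / 2, and H(Y|X) = alpha * h2(1/2) = alpha.
    return h2(alpha / 2) - alpha

best_alpha = max((a / 10000 for a in range(10001)), key=mutual_information)
print(best_alpha, mutual_information(best_alpha))   # ≈ 0.4, ≈ 0.3219 = log2(5) - 2
```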

15 Problem 1.15
Consider a 26-key typewriter.
1. If pushing a key results in printing the associated letter, what is the ca-
pacity C in bits?

2. Now suppose that pushing a key results in printing that letter or the next
(with equal probability). Thus, A → (A or B), . . . , Z → (Z or A). What
is the capacity?
3. What is the highest rate code with block length one that you can find that
achieves zero probability of error for the channel in the item above?

Solution:
1. Capacity C in bits:
If pushing a key always prints the associated letter, the channel is noiseless with 26 inputs and 26 outputs. The capacity, in bits per channel use (per keystroke), is

C = max_{p(x)} I(X; Y) = log2 26 ≈ 4.7 bits,

achieved by a uniform distribution over the keys.

2. Capacity when a key prints its own letter or the next with equal probability:
Now every input produces one of two equally likely outputs, so H(Y|X) = 1 bit regardless of the input distribution. Hence

I(X; Y) = H(Y) − H(Y|X) = H(Y) − 1 ≤ log2 26 − 1 = log2 13 ≈ 3.7 bits,

and the bound is attained by the uniform input distribution, which makes Y uniform over the 26 letters. Therefore C = log2 13 ≈ 3.7 bits.

3. Highest rate code with block length one that achieves zero probability of error:
Use only every other key, i.e. the 13 letters A, C, E, . . . , Y. The output sets of distinct codewords never overlap (A produces A or B, C produces C or D, and so on), so the receiver can always decode without error. This code has rate log2 13 ≈ 3.7 bits per keystroke, which equals the capacity found in part 2, so it is the highest rate achievable with zero probability of error and block length one for the given channel.
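The capacity claim in part 2 can be checked numerically: for the noisy typewriter a uniform input already achieves I(X;Y) = log2 13. A minimal sketch follows (illustrative only).

```python
import math

n = 26                              # letters A..Z; key i prints i or (i + 1) mod 26
pX = [1.0 / n] * n                  # uniform input distribution

# Output distribution and conditional entropy H(Y|X).
pY = [0.0] * n
for i, px in enumerate(pX):
    pY[i] += px / 2
    pY[(i + 1) % n] += px / 2
H_Y = -sum(p * math.log2(p) for p in pY if p > 0)
H_Y_given_X = 1.0                   # two equally likely outputs for every input

print(H_Y - H_Y_given_X, math.log2(13))   # both ≈ 3.7004
```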
