Mathematical Problems and Solutions On Information Theory
2023
1 Problem 1.1
A fair coin is flipped until the first head occurs. Let X denote the number of
flips required.
1. Find the entropy H(X) in bits.
2. A random variable X is drawn according to this distribution. Find an
"efficient" sequence of yes-no questions of the form, "Is X contained in
the set S?". Compare H(X) to the expected number of questions required
to determine X.
Solution:
The probability of getting a head on the first flip is 1/2, the probability of getting
a tail on the first flip and a head on the second flip is (1/2)^2, the probability of
getting tails on the first two flips and a head on the third flip is (1/2)^3, and so
on. In general, the probability that the first head occurs on the n-th flip is (1/2)^n.
The entropy of a random variable X is defined as:
H(X) = −∑_x p(x) log2 p(x),
where p(x) is the probability of X taking on the value x.
In this case, X takes the value n with probability (1/2)^n for n = 1, 2, 3, . . ., so

H(X) = −∑_{n=1}^∞ (1/2)^n log2 (1/2)^n = ∑_{n=1}^∞ n · (1/2)^n.

The sum in the last line is an arithmetico-geometric series, whose value is

∑_{n=1}^∞ n · (1/2)^n = (1/2) / (1 − 1/2)^2 = 2.

Therefore, the entropy of X is

H(X) = 2 bits.
An efficient sequence of questions is to ask, in order, "Is X = 1?", "Is X = 2?",
"Is X = 3?", and so on, stopping at the first "yes". Each question splits the remaining
probability in half: given that X ≥ n, the event X = n has conditional probability 1/2.
If X = n, exactly n questions are asked, which happens with probability (1/2)^n.
The expected number of questions required to determine the value of X is

E[N] = ∑_n n · p(n),

where p(n) is the probability that n questions are required. With this scheme, n
questions are required exactly when X = n, so p(n) = (1/2)^n. Therefore,

E[N] = ∑_{n=1}^∞ n · (1/2)^n = 2.

The expected number of questions is 2, which equals the entropy H(X) = 2 bits.
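A minimal numerical sketch of the two results above: the entropy of the geometric(1/2)
distribution and the expected number of questions both come out to 2, truncating the
infinite sums at a point where the tail contribution is negligible.

from math import log2

N = 200  # truncation point; terms beyond this are vanishingly small
p = [(0.5) ** n for n in range(1, N + 1)]            # P(X = n) = (1/2)^n
H = -sum(pn * log2(pn) for pn in p)                  # entropy in bits
EN = sum(n * pn for n, pn in enumerate(p, start=1))  # expected number of questions

print(f"H(X) ~ {H:.6f} bits")   # ~ 2.0
print(f"E[N] ~ {EN:.6f}")       # ~ 2.0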
2 Problem 1.2
Let X be a random variable taking on a finite number of values. What is the
(general) inequality relationship of H(X) and H(Y ) if
(a) Y = 2X
(b) Y = cos X
Solution:
This is a consequence of the fact that a deterministic function of a random variable
cannot increase entropy (a special case of the data processing inequality).
Let X be a random variable taking on a finite number of values, and Y = g(X)
be a function of X. Then, the entropy of Y is related to the entropy of X by
the following inequality:
H(Y ) ≤ H(X)
where H(X) and H(Y ) are the entropies of X and Y , respectively.
Proof
Since Y = g(X) is a deterministic function of X, knowing X determines Y, so
H(Y | X) = 0. Expanding the joint entropy with the chain rule in both orders gives

H(X, Y) = H(X) + H(Y | X) = H(X)

and

H(X, Y) = H(Y) + H(X | Y) ≥ H(Y),

since conditional entropy is nonnegative. Combining the two expansions,

H(Y) ≤ H(X),

with equality if and only if g is one-to-one on the support of X (so that H(X | Y) = 0).
This completes the proof.
Examples
(a) Y = 2X
In this case, the function g(x) = 2x is one-to-one, so the distribution of Y is simply
a relabeling of the distribution of X: p(Y = 2x) = p(X = x) for every x. Therefore

H(Y) = −∑_x p(x) log p(x) = H(X),

and the inequality holds with equality.
(b) Y = cos X
In this case, the function g(x) = cos x need not be one-to-one: distinct values of X
(for example x and −x) can map to the same value of Y, in which case their probabilities
are merged in the distribution of Y. Therefore

H(Y) ≤ H(X),

with equality if and only if cos x takes distinct values on the distinct values of X
that have positive probability.
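A minimal sketch of the comparison: for an arbitrary finite distribution, H(2X) equals
H(X) while H(cos X) can be strictly smaller. The support {0, pi, 2*pi} and the
probabilities used here are an illustrative assumption, not taken from the problem.

from collections import defaultdict
from math import log2, cos, pi

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

px = {0.0: 0.5, pi: 0.3, 2 * pi: 0.2}   # assumed example distribution for X

def push_forward(px, g):
    py = defaultdict(float)
    for x, p in px.items():
        py[round(g(x), 12)] += p        # values of X mapping to the same y merge
    return py

print("H(X)     =", entropy(px))                                   # ~ 1.485
print("H(2X)    =", entropy(push_forward(px, lambda x: 2 * x)))    # = H(X)
print("H(cos X) =", entropy(push_forward(px, cos)))                # < H(X): cos 0 = cos 2*pi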
3 Problem 1.3
What is the minimum value of H(p1 , . . . , pn ) = H(p) as p ranges over the set of
n-dimensional probability vectors? Find all p’s which achieve this minimum.
Solution: The minimum value of H(p1 , . . . , pn ) = H(p) as p ranges over the set
of n-dimensional probability vectors is 0. This minimum is achieved when the
distribution is degenerate, i.e., pi = 1 for some i and pj = 0 for all j ≠ i.
To prove this, write the entropy as

H(p) = −∑_{i=1}^n pi log pi,

with the convention 0 log 0 = 0. Since 0 ≤ pi ≤ 1, each term −pi log pi is nonnegative,
so H(p) ≥ 0. Moreover, H(p) = 0 if and only if every term vanishes, i.e., each pi is
either 0 or 1. Because the pi sum to 1, this happens exactly when one pi equals 1 and
all the others are 0. Hence the minimum value 0 is attained precisely at the degenerate
(point-mass) distributions.
4 Problem 1.4
An urn contains r red, w white, and b black balls. Which has higher entropy,
drawing k ≥ 2 balls from the urn with replacement or without replacement?
Set it up and show why. (There is both a hard way and a relatively simple way
to do this.)
Solution:
Let X1, X2, . . . , Xk denote the colors of the k balls drawn, and let n = r + w + b.

Without Replacement:
The draws are dependent, but by symmetry the i-th ball drawn is equally likely to be
any of the n balls, so each Xi has marginal distribution (r/n, w/n, b/n). By the chain
rule, the joint entropy is

H(X1, . . . , Xk) = ∑_{i=1}^k H(Xi | X1, . . . , Xi−1).

With Replacement:
The draws are independent and identically distributed with the same marginal
distribution (r/n, w/n, b/n), so

H(X1, . . . , Xk) = ∑_{i=1}^k H(Xi) = k · H(r/n, w/n, b/n).

Comparison:
Since conditioning reduces entropy, H(Xi | X1, . . . , Xi−1) ≤ H(Xi) for every i, and the
marginal distributions are identical in the two schemes. Summing over i,

H_without(X1, . . . , Xk) = ∑_i H(Xi | X1, . . . , Xi−1) ≤ ∑_i H(Xi) = H_with(X1, . . . , Xk),

with equality only if the draws without replacement were independent, which they are
not for k ≥ 2. Therefore drawing with replacement has the higher entropy.
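A small exact check of the comparison above: enumerate all ordered k-draw sequences
from an urn and compute the joint entropy with and without replacement. The counts
r = 2, w = 1, b = 1 and k = 2 are an illustrative assumption.

from itertools import product
from math import log2
from fractions import Fraction

def joint_entropy(probs):
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)

def draw_probs(counts, k, replace):
    balls = {"r": counts[0], "w": counts[1], "b": counts[2]}
    seqs = []
    for seq in product(balls, repeat=k):
        p = Fraction(1)
        left = dict(balls)
        n_left = sum(left.values())
        for c in seq:
            if left[c] <= 0:          # impossible sequence without replacement
                p = Fraction(0)
                break
            p *= Fraction(left[c], n_left)
            if not replace:
                left[c] -= 1
                n_left -= 1
        seqs.append(p)
    return seqs

counts, k = (2, 1, 1), 2
H_with = joint_entropy(draw_probs(counts, k, replace=True))
H_without = joint_entropy(draw_probs(counts, k, replace=False))
print(f"H with replacement    = {H_with:.4f} bits")    # larger
print(f"H without replacement = {H_without:.4f} bits")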
5 Problem 1.5
Let p(x, y) be given as in the table shown in Figure 1.
Solution:
From the joint distribution in Figure 1, p(0, 0) = 1/3, p(0, 1) = 1/3, p(1, 0) = 0, and
p(1, 1) = 1/3.

a) H(X) and H(Y): the marginal probabilities p(x) and p(y) are obtained by summing
the joint probabilities over the other variable.
p(x = 0) = ∑_y p(0, y) = p(0, 0) + p(0, 1) = 1/3 + 1/3 = 2/3

p(x = 1) = ∑_y p(1, y) = p(1, 0) + p(1, 1) = 0 + 1/3 = 1/3

p(y = 0) = ∑_x p(x, 0) = p(0, 0) + p(1, 0) = 1/3 + 0 = 1/3

p(y = 1) = ∑_x p(x, 1) = p(0, 1) + p(1, 1) = 1/3 + 1/3 = 2/3
Therefore,

H(X) = −p(x = 0) log p(x = 0) − p(x = 1) log p(x = 1) = −(2/3) log (2/3) − (1/3) log (1/3) ≈ 0.9183 bits

H(Y) = −p(y = 0) log p(y = 0) − p(y = 1) log p(y = 1) = −(1/3) log (1/3) − (2/3) log (2/3) ≈ 0.9183 bits
b) H(X|Y ) and H(Y |X) are given by:
H(X|Y) = −∑_y p(y) ∑_x p(x|y) log p(x|y)

H(Y|X) = −∑_x p(x) ∑_y p(y|x) log p(y|x)
Where p(x|y) and p(y|x) are the conditional probabilities of X given Y and Y
given X, respectively.
p(x | y = 0) = p(x, 0) / p(y = 0) = p(x, 0) / (1/3), so p(x = 0 | y = 0) = 1 and p(x = 1 | y = 0) = 0;

p(x | y = 1) = p(x, 1) / p(y = 1) = p(x, 1) / (2/3), so p(x = 0 | y = 1) = 1/2 and p(x = 1 | y = 1) = 1/2;

p(y | x = 0) = p(0, y) / p(x = 0) = p(0, y) / (2/3), so p(y = 0 | x = 0) = 1/2 and p(y = 1 | x = 0) = 1/2;

p(y | x = 1) = p(1, y) / p(x = 1) = p(1, y) / (1/3), so p(y = 0 | x = 1) = 0 and p(y = 1 | x = 1) = 1.
Therefore,

H(X|Y) = −∑_y p(y) ∑_x p(x|y) log p(x|y)
       = −p(y = 0) [1 log 1 + 0 log 0] − p(y = 1) [(1/2) log (1/2) + (1/2) log (1/2)]
       = −(1/3)(0) − (2/3)(−1)
       = 2/3 ≈ 0.6667 bits

H(Y|X) = −∑_x p(x) ∑_y p(y|x) log p(y|x)
       = −p(x = 0) [(1/2) log (1/2) + (1/2) log (1/2)] − p(x = 1) [0 log 0 + 1 log 1]
       = −(2/3)(−1) − (1/3)(0)
       = 2/3 ≈ 0.6667 bits
c) H(X, Y) is given by:

H(X, Y) = −∑_x ∑_y p(x, y) log p(x, y)
        = −(1/3) log (1/3) − (1/3) log (1/3) − 0 log 0 − (1/3) log (1/3)
        = log 3 ≈ 1.585 bits
d) H(Y) − H(X|Y) is given by:

H(Y) − H(X|Y) = 0.9183 − 0.6667 ≈ 0.2516 bits,

which is the mutual information I(X; Y) = H(X) + H(Y) − H(X, Y) ≈ 0.9183 + 0.9183 − 1.585 ≈ 0.2516 bits.
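A sketch recomputing the quantities above from the joint distribution used in this
solution, p(0,0) = p(0,1) = p(1,1) = 1/3 and p(1,0) = 0.

from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 0.0, (1, 1): 1/3}
px = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}   # marginal of X
py = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}   # marginal of Y

HX, HY = H(px.values()), H(py.values())
HXY = H(p.values())
H_X_given_Y = HXY - HY        # chain rule: H(X,Y) = H(Y) + H(X|Y)
H_Y_given_X = HXY - HX
I = HY - H_X_given_Y          # mutual information

print(f"H(X) = {HX:.4f}, H(Y) = {HY:.4f}")                         # 0.9183, 0.9183
print(f"H(X|Y) = {H_X_given_Y:.4f}, H(Y|X) = {H_Y_given_X:.4f}")   # 0.6667 each
print(f"H(X,Y) = {HXY:.4f}")                                       # 1.5850
print(f"H(Y) - H(X|Y) = I(X;Y) = {I:.4f}")                         # 0.2516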
6 Problem 1.6
Let X and Y be random variables that take on values x1 , x2 , . . . , xr and y1 , y2 , . . . , ys ,
respectively. Let Z = X + Y .
a) Show that H(Z|X) = H(Y |X). Argue that if X and Y are independent,
then H(Y ) ≤ H(Z) and H(X) ≤ H(Z). Thus, the addition of indepen-
dent random variables adds uncertainty.
b) Give an example of (necessarily dependent) random variables in which
H(Y ) ≥ H(Z) and H(X) ≥ H(Z).
c) Under what conditions does H(Z) = H(X) + H(Y )?
Solution:
a) Conditioned on X = x, the random variable Z = X + Y is a deterministic, invertible
function of Y (namely Z = x + Y), so p(Z = z | X = x) = p(Y = z − x | X = x). Hence

H(Z | X) = ∑_x p(x) H(Z | X = x) = ∑_x p(x) H(Y | X = x) = H(Y | X).

If X and Y are independent, then H(Y | X) = H(Y), and since conditioning reduces
entropy,

H(Z) ≥ H(Z | X) = H(Y | X) = H(Y).

By the symmetric argument with the roles of X and Y exchanged, H(Z) ≥ H(X). Thus
the addition of independent random variables adds uncertainty.

b) Take X uniform on {0, 1} and Y = −X. Then H(X) = H(Y) = 1 bit, but Z = X + Y = 0
with probability 1, so H(Z) = 0, and H(X) ≥ H(Z) and H(Y) ≥ H(Z).

c) H(Z) ≤ H(X, Y) ≤ H(X) + H(Y). Equality throughout requires that X and Y be
independent (for the second inequality) and that the map (x, y) → x + y be one-to-one
on the support of (X, Y) (for the first), so that Z determines the pair (X, Y).
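A quick numerical check of part a): for an assumed pair of independent distributions,
H(Z|X) equals H(Y|X) = H(Y), and H(Z) is at least max(H(X), H(Y)). The two
distributions below are illustrative assumptions.

from collections import defaultdict
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

px = {0: 0.5, 1: 0.5}                 # assumed example distribution for X
py = {0: 0.7, 1: 0.2, 2: 0.1}         # assumed example distribution for Y

pz = defaultdict(float)               # Z = X + Y with X, Y independent
pxz = defaultdict(float)
for x, a in px.items():
    for y, b in py.items():
        pz[x + y] += a * b
        pxz[(x, x + y)] += a * b

HZ = H(pz.values())
HZ_given_X = H(pxz.values()) - H(px.values())   # H(Z|X) = H(X,Z) - H(X)
print(f"H(X)   = {H(px.values()):.4f}")
print(f"H(Y)   = {H(py.values()):.4f}")
print(f"H(Z|X) = {HZ_given_X:.4f}  (equals H(Y|X) = H(Y) by independence)")
print(f"H(Z)   = {HZ:.4f} >= max(H(X), H(Y))")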
7 Problem 1.7
Given the following joint distribution on (X, Y ) as shown in the image:
• Find the minimum probability error estimator X̂(Y ) and the associated
Pe .
• Evaluate the Fano inequality for this problem and compare.
Solution:
Pe = P (X̂(Y ) ̸= X).
To find the minimum probability error estimator, we need the function X̂(Y) that
minimizes the probability of error. For any estimator X̂(Y), the probability of error
can be written as

Pe = ∑_y P(X̂(Y) ̸= X | Y = y) P(Y = y),

where Y is the set of all possible values of Y. The overall error probability is
minimized by minimizing each conditional term separately. For a given value y,

P(X̂(y) ̸= X | Y = y) = 1 − P(X = X̂(y) | Y = y),

which is smallest when X̂(y) is a value of x maximizing the posterior probability
P(X = x | Y = y). Hence the minimum probability of error estimator is the maximum a
posteriori (MAP) rule

X̂(y) = arg max_x P(X = x | Y = y).
Probability of Error
The probability of error of the minimum probability error estimator is

Pe = ∑_y P(X̂(Y) ̸= X | Y = y) P(Y = y) = ∑_y [1 − max_x P(X = x | Y = y)] P(Y = y)
   = 1 − E[max_x P(X = x | Y)].

Evaluating this expression on the joint distribution in the figure gives the numerical
value of Pe.
Fano Inequality
Fano's inequality relates the probability of error of any estimator X̂(Y) to the
conditional entropy of X given Y:

H(Pe) + Pe log(|X| − 1) ≥ H(X | Y),

where H(Pe) is the binary entropy of Pe and |X| is the size of the alphabet of X.
Solving this inequality for Pe gives a lower bound on the probability of error of
every estimator, including the MAP estimator found above. Evaluating H(X | Y) from the
joint distribution in the figure and comparing the resulting bound with the exact Pe
shows how tight the bound is for this problem; in general it is not met with equality.
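The joint distribution from the original figure is not reproduced here, so the sketch
below uses a hypothetical table purely to illustrate the procedure: build the MAP
estimator, compute Pe exactly, and compare it with the Fano lower bound.

from math import log2

p = {  # hypothetical joint p(x, y); replace with the table from the figure
    (1, 1): 0.30, (1, 2): 0.10,
    (2, 1): 0.05, (2, 2): 0.25,
    (3, 1): 0.15, (3, 2): 0.15,
}
xs = sorted({x for x, _ in p})
ys = sorted({y for _, y in p})

def Hb(q):  # binary entropy
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

# MAP estimator and exact probability of error
Pe = 0.0
for y in ys:
    col = {x: p[(x, y)] for x in xs}           # joint column, proportional to p(x|y)
    x_hat = max(col, key=col.get)              # MAP decision for this y
    Pe += sum(v for x, v in col.items() if x != x_hat)

# Fano bound: H(Pe) + Pe*log2(|X|-1) >= H(X|Y); solve numerically for the smallest Pe
HXY = -sum(v * log2(v) for v in p.values() if v > 0)
HY = -sum(q * log2(q) for q in (sum(p[(x, y)] for x in xs) for y in ys))
H_X_given_Y = HXY - HY
fano_lhs = lambda e: Hb(e) + e * log2(len(xs) - 1)
lo, hi = 0.0, 1 - 1 / len(xs)
for _ in range(60):                            # bisection on the increasing branch
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if fano_lhs(mid) < H_X_given_Y else (lo, mid)
print(f"exact Pe      = {Pe:.4f}")
print(f"Fano bound Pe >= {lo:.4f}  (using H(X|Y) = {H_X_given_Y:.4f})")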
8 Problem 1.8
Let X1 , X2 , . . . be an i.i.d. sequence of discrete random variables with entropy
H(X). Define C_n(t) = {x^n ∈ X^n : p(x^n) ≥ 2^{−nt}}, which denotes the subset of
n-sequences with probabilities at least 2^{−nt}.
Solution:
Since the probabilities of all n-sequences sum to 1,

∑_{x^n ∈ C_n(t)} p(x^n) + ∑_{x^n ∉ C_n(t)} p(x^n) = 1.

Therefore ∑_{x^n ∈ C_n(t)} p(x^n) ≤ 1, and since every sequence in C_n(t) has
probability at least 2^{−nt},

∑_{x^n ∈ C_n(t)} p(x^n) ≥ 2^{−nt} |C_n(t)|,

so |C_n(t)| ≤ 2^{nt}. Moreover, by the AEP, −(1/n) log2 p(X^n) → H(X) in probability,
so p(X^n) ≥ 2^{−nt} with probability tending to 1 whenever t > H(X). Hence

P(X^n ∈ C_n(t)) → 1

as n → ∞ for any t > H(X).
3. Limit of (p(X1, X2, . . . , Xn))^{1/n}
Since X1, X2, . . . , Xn are i.i.d., we have

p(X1, X2, . . . , Xn) = ∏_{i=1}^n p(Xi).

Therefore,

(p(X1, X2, . . . , Xn))^{1/n} = 2^{(1/n) ∑_{i=1}^n log2 p(Xi)},

and by the weak law of large numbers, (1/n) ∑_{i=1}^n log2 p(Xi) → E[log2 p(X)] = −H(X)
in probability, where H(X) is the entropy of X. Therefore,

lim_{n→∞} (p(X1, X2, . . . , Xn))^{1/n} = 2^{−H(X)}   (in probability).
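A simulation sketch of the statement above: for i.i.d. draws from an assumed
distribution, p(X1, ..., Xn)^(1/n) concentrates around 2^(-H(X)) as n grows.

import random
from math import log2

random.seed(0)
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # assumed example distribution
H = -sum(q * log2(q) for q in p.values())            # H(X) = 1.75 bits

symbols, weights = zip(*p.items())
for n in (10, 100, 1000, 10000):
    xs = random.choices(symbols, weights=weights, k=n)
    log_p = sum(log2(p[x]) for x in xs)              # log2 p(X_1, ..., X_n)
    print(f"n={n:6d}  p(X^n)^(1/n) = {2 ** (log_p / n):.4f}   2^-H = {2 ** -H:.4f}")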
9 Problem 1.9
Huffman Codes for Random Variable X
Consider the random variable defined by the following table:
Find a binary Huffman code for X and compute the expected codelength

Expected Codelength = ∑_i pi · Codelength(xi),

where pi is the probability of symbol xi.
Ternary Huffman Code
Similarly, find a ternary Huffman code for X using the Huffman coding algo-
rithm.
Solution:
[Binary Huffman Code]
The Huffman algorithm works by recursively merging the two symbols with the
lowest probabilities until a single tree is formed. In this case, the symbols are
x1 , x2 , · · · , x7 . The probabilities are given in the table below.
Symbol Probability
x1 0.49
x2 0.26
x3 0.12
x4 0.04
x5 0.04
x6 0.03
x7 0.02
The first step is to merge the two symbols with the lowest probabilities, x6 and x7.
The new symbol is x67, with probability 0.03 + 0.02 = 0.05.
The next step merges the two lowest remaining probabilities, x4 and x5, giving x45
with probability 0.04 + 0.04 = 0.08.
The next step merges the two lowest remaining probabilities, x67 (0.05) and x45 (0.08),
giving x4567 with probability 0.13.
The next step merges x3 (0.12) and x4567 (0.13), giving x34567 with probability 0.25.
The next step merges x2 (0.26) and x34567 (0.25), giving a node with probability 0.51,
and the final step merges this node with x1 (0.49) to form the root with probability 1.
The resulting codeword lengths are therefore 1 for x1, 2 for x2, 3 for x3, and 5 for
each of x4, x5, x6, x7. One corresponding prefix-free code, obtained by reading the
0/1 branch labels from the root to each leaf, is shown in the table below.
Symbol Code
x1 0
x2 10
x3 110
x4 11100
x5 11101
x6 11110
x7 11111
The expected codelength for this encoding is the sum of the probabilities of the
symbols multiplied by the lengths of their codewords:

E[L] = ∑_{i=1}^7 pi ℓ(xi)
     = 0.49(1) + 0.26(2) + 0.12(3) + 0.04(5) + 0.04(5) + 0.03(5) + 0.02(5)
     = 2.02

Therefore, the expected codelength for this encoding is 2.02 bits, just above the
entropy H(X) ≈ 2.01 bits.
[Ternary Huffman Code]
For a ternary code, the Huffman algorithm repeatedly merges the three symbols with the
lowest probabilities. Since the number of symbols is 7 and 7 − 1 = 6 is divisible by
D − 1 = 2, no dummy symbols are needed.
The first step merges the three symbols with the lowest probabilities, x5, x6 and x7,
giving x567 with probability 0.04 + 0.03 + 0.02 = 0.09.
The next step merges the three lowest remaining probabilities, x3 (0.12), x567 (0.09)
and x4 (0.04), giving a node with probability 0.25.
The final step merges x1 (0.49), x2 (0.26) and this node (0.25) to form the root with
probability 1.
The resulting codeword lengths are 1 for x1 and x2, 2 for x3 and x4, and 3 for x5, x6,
x7. One corresponding ternary prefix-free code, over the code alphabet {0, 1, 2}, is
shown in the table below.
Symbol Code
x1 0
x2 1
x3 20
x4 21
x5 220
x6 221
x7 222
The expected codelength for this encoding is

E[L] = ∑_{i=1}^7 pi ℓ(xi)
     = 0.49(1) + 0.26(1) + 0.12(2) + 0.04(2) + 0.04(3) + 0.03(3) + 0.02(3)
     = 1.34

Therefore, the expected codelength for this encoding is 1.34 ternary digits.
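A sketch that recomputes the binary Huffman codeword lengths for the table above with
a standard heap-based merge, to verify the expected length of 2.02 bits against the
entropy H(X) ≈ 2.01 bits.

import heapq
from math import log2

probs = {"x1": 0.49, "x2": 0.26, "x3": 0.12, "x4": 0.04,
         "x5": 0.04, "x6": 0.03, "x7": 0.02}

# Each heap entry: (probability, tie-breaker, {symbol: depth so far}).
heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p1, _, d1 = heapq.heappop(heap)     # two least-probable nodes
    p2, _, d2 = heapq.heappop(heap)
    merged = {s: d + 1 for s, d in {**d1, **d2}.items()}  # every symbol one level deeper
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1

lengths = heap[0][2]
E_L = sum(probs[s] * lengths[s] for s in probs)
H = -sum(p * log2(p) for p in probs.values())
print("codeword lengths:", dict(sorted(lengths.items())))  # x1:1, x2:2, x3:3, x4..x7:5
print(f"expected length = {E_L:.2f} bits, entropy = {H:.2f} bits")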
10 Problem 1.10
Consider a source with probabilities 1/3, 1/5, 1/5, 2/15, 2/15. Find the binary Huffman
code for this source. Argue that this code is also optimal for the source with
probabilities 1/5, 1/5, 1/5, 1/5, 1/5.
Solution:
Huffman code for the source with probabilities 1/3, 1/5, 1/5, 2/15, 2/15:
Label the symbols a, b, c, d, e with probabilities 1/3, 1/5, 1/5, 2/15, 2/15. The
Huffman procedure merges 2/15 + 2/15 = 4/15, then 1/5 + 1/5 = 6/15, then
1/3 + 4/15 = 9/15, and finally 9/15 + 6/15 = 1. One resulting code is
a : 00
b : 10
c : 11
d : 010
e : 011
with codeword lengths (2, 2, 2, 3, 3) and expected length

L = 2(1/3) + 2(1/5) + 2(1/5) + 3(2/15) + 3(2/15) = 34/15 ≈ 2.27 bits.

Proof of optimality for the source with probabilities 1/5, 1/5, 1/5, 1/5, 1/5:
Running the Huffman algorithm on the uniform source produces the same set of codeword
lengths (2, 2, 2, 3, 3): merge 1/5 + 1/5 = 2/5, merge another 1/5 + 1/5 = 2/5, merge
the remaining 1/5 with one of the 2/5 nodes to get 3/5, and finally merge
3/5 + 2/5 = 1. Since the Huffman algorithm yields an optimal code, any prefix code
with lengths (2, 2, 2, 3, 3), in particular the code constructed above, is also
optimal for the uniform source. Its expected length for the uniform source is
(2 + 2 + 2 + 3 + 3)/5 = 2.4 bits.
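A short check of the argument above: the codeword lengths (2, 2, 2, 3, 3) evaluated
under both sources, compared against each source's entropy.

from fractions import Fraction as F
from math import log2

lengths = [2, 2, 2, 3, 3]
source1 = [F(1, 3), F(1, 5), F(1, 5), F(2, 15), F(2, 15)]
source2 = [F(1, 5)] * 5

for name, src in (("1/3,1/5,1/5,2/15,2/15", source1), ("uniform 1/5", source2)):
    L = sum(p * l for p, l in zip(src, lengths))      # expected codeword length
    H = -sum(float(p) * log2(float(p)) for p in src)  # source entropy in bits
    print(f"{name:>24}:  E[L] = {float(L):.4f} bits,  H = {H:.4f} bits")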
11 Problem 1.11
• Huffman Code for Random Variable X:
• Optimal Code Length Assignments: Show that there exist two differ-
ent sets of optimal lengths for the codewords, namely, show that codeword
length assignments (1, 2, 3, 3) and (2, 2, 2, 2) are both optimal.
Solution:
The Huffman tree has root node (1); node (1) has children node (2) with probability
7/12 and node (3) with probability 1/3, and node (2) has children node (4) with
probability 1/4 and node (5) with probability 1/3.
Symbol 1: 0
Symbol 2: 10
Symbol 3: 110
Symbol 4: 111
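The probability table for this problem is not reproduced above, so the distribution
p = (1/3, 1/3, 1/4, 1/12) used in the sketch below is an assumption, included only to
illustrate the second bullet: both length assignments satisfy the Kraft inequality
with equality and give the same expected length, so both are optimal.

from fractions import Fraction as F

p = [F(1, 3), F(1, 3), F(1, 4), F(1, 12)]   # assumed distribution
for lengths in ((1, 2, 3, 3), (2, 2, 2, 2)):
    kraft = sum(F(1, 2) ** l for l in lengths)          # Kraft sum, should be <= 1
    E_L = sum(pi * l for pi, l in zip(p, lengths))      # expected codeword length
    print(f"lengths {lengths}: Kraft sum = {kraft}, expected length = {E_L} bits")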
12 Problem 1.12
Consider the following method for generating a code for a random variable X
which takes on m values {1, 2, . . . , m} with probabilities p1 , p2 , . . . , pm . Assume
that the probabilities are ordered so that p1 ≥ p2 ≥ · · · ≥ pm. Define:
F_i = ∑_{k=1}^{i−1} p_k.

Then the codeword for i is the number F_i ∈ [0, 1) rounded off (truncated) to l_i bits,
where l_i = ⌈log2 (1/p_i)⌉.
• Show that the code constructed by this process is prefix-free, and the
average length satisfies H(X) ≤ L < H(X) + 1.
• Construct the code for the probability distribution (0.5, 0.25, 0.125, 0.125).
Solution:
Prefix-free property: the codeword for i consists of the first l_i bits of the binary
expansion of F_i. Since l_i = ⌈log2 (1/p_i)⌉, we have 2^{−l_i} ≤ p_i. For any j > i we
have p_j ≤ p_i, so l_j ≥ l_i, and

F_j − F_i = p_i + p_{i+1} + · · · + p_{j−1} ≥ p_i ≥ 2^{−l_i}.

Thus F_j lies outside the dyadic interval of length 2^{−l_i} that contains F_i and
begins at its truncation, so the first l_i bits of F_j cannot coincide with the
codeword for i. Hence no codeword is a prefix of another, and the code is prefix-free.

Average length: since log2 (1/p_i) ≤ ⌈log2 (1/p_i)⌉ < log2 (1/p_i) + 1,

H(X) = ∑_i p_i log2 (1/p_i) ≤ L = ∑_i p_i l_i < ∑_i p_i (log2 (1/p_i) + 1) = H(X) + 1.

Construction for (0.5, 0.25, 0.125, 0.125): the cumulative sums are
F = (0, 0.5, 0.75, 0.875) and the lengths are l = (1, 2, 3, 3), so the codewords are
1 : 0
2 : 10
3 : 110
4 : 111,
and the average length is L = 1(0.5) + 2(0.25) + 3(0.125) + 3(0.125) = 1.75 bits = H(X).
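A sketch of the construction above, applied to the distribution
(0.5, 0.25, 0.125, 0.125): cumulative sums F_i truncated to ceil(log2(1/p_i)) bits.

from math import ceil, log2
from fractions import Fraction as F

def shannon_code(probs):
    codes = []
    Fi = F(0)
    for p in probs:                      # assumes probs sorted in decreasing order
        li = ceil(log2(1 / p))
        bits, frac = "", Fi
        for _ in range(li):              # binary expansion of F_i, truncated to l_i bits
            frac *= 2
            bits += "1" if frac >= 1 else "0"
            frac -= int(frac)
        codes.append(bits)
        Fi += p
    return codes

probs = [F(1, 2), F(1, 4), F(1, 8), F(1, 8)]
print(shannon_code(probs))   # ['0', '10', '110', '111']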
13 Problem 1.13
Assume that a communication channel with transition probabilities p(y|x) and
channel capacity C = max p(x)I(X; Y ) is given. A helpful statistician prepro-
cesses the output by forming Ỹ = g(Y). He claims that this will strictly improve
the capacity.
Solution:
The statistician is mistaken: preprocessing the output cannot strictly improve
capacity. Since X → Y → Ỹ = g(Y) forms a Markov chain, the data processing inequality
gives I(X; Ỹ) ≤ I(X; Y) for every input distribution p(x). Taking the maximum over
p(x) on both sides shows that the capacity of the channel with preprocessed output is
at most C. Preprocessing can at best preserve the capacity (for example, when g is
invertible); it can never increase it.
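A numerical illustration with an assumed channel and an assumed preprocessing function
(neither is from the problem): for a ternary-output channel and a deterministic g that
merges two outputs, I(X; g(Y)) never exceeds I(X; Y) for any input distribution.

from math import log2

W = {0: {0: 0.8, 1: 0.1, 2: 0.1},    # assumed transition probabilities p(y|x)
     1: {0: 0.1, 1: 0.6, 2: 0.3}}
g = {0: 0, 1: 1, 2: 1}               # assumed preprocessing: merge outputs 1 and 2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_info(pi, channel):        # pi = P(X = 0)
    px = {0: pi, 1: 1 - pi}
    py = {}
    for x, pxv in px.items():
        for y, w in channel[x].items():
            py[y] = py.get(y, 0.0) + pxv * w
    HY_given_X = sum(pxv * H(channel[x].values()) for x, pxv in px.items())
    return H(py.values()) - HY_given_X

def processed(channel, g):           # channel followed by the deterministic map g
    out = {}
    for x, row in channel.items():
        out[x] = {}
        for y, w in row.items():
            out[x][g[y]] = out[x].get(g[y], 0.0) + w
    return out

Wg = processed(W, g)
for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p(x=0)={pi}:  I(X;Y) = {mutual_info(pi, W):.4f} >= "
          f"I(X;g(Y)) = {mutual_info(pi, Wg):.4f}")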
14 Problem 1.14
Consider the Z-channel with binary input and output alphabets. The transition
probabilities p(y|x) are given by the following matrix:
p(0|0) p(1|0)
p(0|1) p(1|1)
Find the capacity of the Z-channel and determine the maximizing input proba-
bility distribution.
Solution:
The capacity of the channel is

C = max_{p(x)} I(X; Y),

where p(x) is the input probability distribution and I(X; Y) is the mutual information
between X and Y.

For the Z-channel, the input 0 is received without error, p(0|0) = 1 and p(1|0) = 0,
while the input 1 is received as 0 with some crossover probability α, i.e.,
p(0|1) = α and p(1|1) = 1 − α. Let π = P(X = 1). Then P(Y = 1) = π(1 − α), and

I(X; Y) = H(Y) − H(Y|X) = H_b(π(1 − α)) − π H_b(α),

where H_b(·) denotes the binary entropy function. Setting the derivative with respect
to π to zero,

dI(X; Y)/dπ = (1 − α) log2 [(1 − π(1 − α)) / (π(1 − α))] − H_b(α) = 0,

gives the optimal output probability

P(Y = 1) = 1 / (1 + 2^{H_b(α)/(1−α)}),

and hence the maximizing input probability distribution

π* = P(X = 1) = 1 / [(1 − α)(1 + 2^{H_b(α)/(1−α)})],   P(X = 0) = 1 − π*.

The capacity is

C = H_b(π*(1 − α)) − π* H_b(α) = log2 (1 + (1 − α) α^{α/(1−α)}) bits per channel use.

For example, for α = 1/2 the maximizing input distribution is P(X = 1) = 2/5,
P(X = 0) = 3/5, and C = log2 (5/4) ≈ 0.322 bits per channel use. Unlike a symmetric
channel, the Z-channel is not maximized by the uniform input distribution (except in
the noiseless case α = 0).
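A sketch that maximizes I(X; Y) numerically for the Z-channel with an example crossover
probability alpha (the value 0.5 is an illustrative assumption) and compares the result
with the closed-form capacity derived above.

from math import log2

def Hb(q):  # binary entropy
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

alpha = 0.5                                  # example P(Y = 0 | X = 1)

def I(pi):                                   # pi = P(X = 1)
    return Hb(pi * (1 - alpha)) - pi * Hb(alpha)

grid = [i / 100000 for i in range(100001)]
pi_star = max(grid, key=I)
C_closed = log2(1 + (1 - alpha) * alpha ** (alpha / (1 - alpha)))
print(f"numerical:   pi* ~ {pi_star:.4f},  C ~ {I(pi_star):.4f} bits/use")
print(f"closed form: C = {C_closed:.4f} bits/use")   # 0.3219 for alpha = 0.5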
15 Problem 1.15
Consider a 26-key typewriter.
1. If pushing a key results in printing the associated letter, what is the ca-
pacity C in bits?
2. Now suppose that pushing a key results in printing that letter or the next
(with equal probability). Thus, A → (A or B), . . . , Z → (Z or A). What
is the capacity?
3. What is the highest rate code with block length one that you can find that
achieves zero probability of error for the channel in the item above?
Solution:
1. Capacity C in bits:
For a 26-key typewriter in which each key prints its own letter, the channel is
noiseless with 26 inputs and 26 outputs. The capacity is

C = max_{p(x)} I(X; Y) = max_{p(x)} H(X) = log2 26 ≈ 4.70 bits per keystroke,

achieved by the uniform input distribution.

2. Noisy typewriter:
Now each key prints its own letter or the next one, each with probability 1/2, so
H(Y|X) = 1 bit regardless of the input distribution. Therefore

I(X; Y) = H(Y) − H(Y|X) = H(Y) − 1 ≤ log2 26 − 1 = log2 13.

A uniform input distribution makes Y uniform over the 26 letters, so the bound is
achieved and

C = log2 13 ≈ 3.70 bits per keystroke.

3. Zero-error code with block length one:
Use only every other key, A, C, E, . . . , Y (13 keys in all). The possible outputs of
these inputs are the disjoint pairs {A, B}, {C, D}, . . . , {Y, Z}, so the transmitted
letter can always be recovered without error. This block-length-one code has rate
log2 13 bits per keystroke, which equals the capacity found in part 2, so no higher
rate is achievable with zero probability of error.
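A small check of the three answers: log2(26) for the noiseless typewriter, log2(13)
for the noisy one, and a simulation showing that the 13 alternate keys A, C, E, ..., Y
are always decoded without error.

import random
from math import log2
from string import ascii_uppercase

print(f"part 1: C = log2(26) = {log2(26):.4f} bits/keystroke")
print(f"part 2: C = log2(26) - 1 = log2(13) = {log2(13):.4f} bits/keystroke")

random.seed(1)
codebook = ascii_uppercase[::2]            # A, C, E, ..., Y  (13 inputs)
decode = {}                                # each possible output maps to a unique input
for x in codebook:
    for y in (x, ascii_uppercase[(ascii_uppercase.index(x) + 1) % 26]):
        decode[y] = x

errors = 0
for _ in range(10000):
    x = random.choice(codebook)            # transmit a codeword
    y = random.choice((x, ascii_uppercase[(ascii_uppercase.index(x) + 1) % 26]))
    errors += (decode[y] != x)
print(f"part 3: rate = log2(13) = {log2(13):.4f} bits/keystroke, errors = {errors}")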