0% found this document useful (0 votes)

66 views52 pages

lời giải

1) The entropy H(X) of a fair coin flipped until the first head is 1 bit. An efficient sequence of yes/no questions to determine X is: "Is X = 1?", "If not, is X = 2?", etc. The expected number of questions equals H(X). 2) For a random variable Y that is a function of X, H(X) ≥ H(Y) in general, with equality if the function is one-to-one. Specifically, H(X) = H(Y) if Y = 2X, since this is one-to-one, but H(X) may be greater than H(Y) if Y = cos(X

Uploaded by

Phúc Hữu Hoàng

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

66 views52 pages

lời giải

Uploaded by

Phúc Hữu Hoàng

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 52

10 Entropy, Relative Entropy and Mutual Information

(b) Intuitively, it seems clear that the best questions are those that have equally likely
chances of receiving a yes or a no answer. Consequently, one possible guess is
that the most “efficient” series of questions is: Is X = 1 ? If not, is X = 2 ?
If not, is X = 3 ? . . . with a resulting expected number of questions equal to
$∞ n
n=1 n(1/2 ) = 2. This should reinforce the intuition that H(X) is a mea-
sure of the uncertainty of X . Indeed in this case, the entropy is exactly the
Chapter 2 same as the average number of questions needed to define X , and in general
E(# of questions) ≥ H(X) . This problem has an interpretation as a source cod-
ing problem. Let 0 = no, 1 = yes, X = Source, and Y = Encoded Source. Then
the set of questions in the above procedure can be written as a collection of (X, Y )
Entropy, Relative Entropy and pairs: (1, 1) , (2, 01) , (3, 001) , etc. . In fact, this intuitively derived code is the
optimal (Huffman) code minimizing the expected number of questions.
Mutual Information 2. Entropy of functions. Let X be a random variable taking on a finite number of
values. What is the (general) inequality relationship of H(X) and H(Y ) if

(a) Y = 2X ?
1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number (b) Y = cos X ?
of flips required.
Solution: Let y = g(x) . Then
(a) Find the entropy H(X) in bits. The following expressions may be useful: !
∞ ∞
p(y) = p(x).
!
n1 !
n r x: y=g(x)
r = , nr = .
n=0
1−r n=0
(1 − r)2
Consider any set of x ’s that map onto a single y . For this set
(b) A random variable X is drawn according to this distribution. Find an “efficient” ! !
p(x) log p(x) ≤ p(x) log p(y) = p(y) log p(y),
sequence of yes-no questions of the form, “Is X contained in the set S ?” Compare
x: y=g(x) x: y=g(x)
H(X) to the expected number of questions required to determine X .
$
since log is a monotone increasing function and p(x) ≤ x: y=g(x) p(x) = p(y) . Ex-
Solution: tending this argument to the entire range of X (and Y ), we obtain
(a) The number X of tosses till the first head appears has the geometric distribution H(X) = −
!
p(x) log p(x)
with parameter p = 1/2 , where P (X = n) = pq n−1 , n ∈ {1, 2, . . .} . Hence the x
entropy of X is = −
! !
p(x) log p(x)
∞
! y x: y=g(x)
H(X) = − pq n−1 log(pq n−1 ) !
n=1
≥ − p(y) log p(y)
"∞ ∞
# y
! !
= − pq n log p + npq n log q = H(Y ),
n=0 n=0
−p log p pq log q with equality iff g is one-to-one with probability one.
= −
1−q p2
(a) Y = 2X is one-to-one and hence the entropy, which is just a function of the
−p log p − q log q
= probabilities (and not the values of a random variable) does not change, i.e.,
p H(X) = H(Y ) .
= H(p)/p bits.
(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that
If p = 1/2 , then H(X) = 2 bits. H(X) ≥ H(Y ) , with equality if cosine is one-to-one on the range of X .
9

Entropy, Relative Entropy and Mutual Information 11 12 Entropy, Relative Entropy and Mutual Information

3. Minimum entropy. What is the minimum value of H(p 1 , ..., pn ) = H(p) as p ranges since −t log t ≥ 0 for 0 ≤ t ≤ 1 , and is strictly positive for t not equal to 0 or 1.
over the set of n -dimensional probability vectors? Find all p ’s which achieve this Therefore the conditional entropy H(Y |X) is 0 if and only if Y is a function of X .
minimum.
6. Conditional mutual information vs. unconditional mutual information. Give
Solution: We wish to find all probability vectors p = (p 1 , p2 , . . . , pn ) which minimize examples of joint random variables X , Y and Z such that
!
H(p) = − pi log pi .
(a) I(X; Y | Z) < I(X; Y ) ,
i

Now −pi log pi ≥ 0 , with equality iff pi = 0 or 1 . Hence the only possible probability (b) I(X; Y | Z) > I(X; Y ) .
vectors which minimize H(p) are those with p i = 1 for some i and pj = 0, j %= i . Solution: Conditional mutual information vs. unconditional mutual information.
There are n such vectors, i.e., (1, 0, . . . , 0) , (0, 1, 0, . . . , 0) , . . . , (0, . . . , 0, 1) , and the
minimum value of H(p) is 0. (a) The last corollary to Theorem 2.8.1 in the text states that if X → Y → Z that
is, if p(x, y | z) = p(x | z)p(y | z) then, I(X; Y ) ≥ I(X; Y | Z) . Equality holds if
4. Entropy of functions of a random variable. Let X be a discrete random variable.
and only if I(X; Z) = 0 or X and Z are independent.
Show that the entropy of a function of X is less than or equal to the entropy of X by
justifying the following steps: A simple example of random variables satisfying the inequality conditions above
is, X is a fair binary random variable and Y = X and Z = Y . In this case,
(a)
H(X, g(X)) = H(X) + H(g(X) | X) (2.1)
(b)
I(X; Y ) = H(X) − H(X | Y ) = H(X) = 1
= H(X); (2.2)
(c) and,
H(X, g(X)) = H(g(X)) + H(X | g(X)) (2.3) I(X; Y | Z) = H(X | Z) − H(X | Y, Z) = 0.
(d)
≥ H(g(X)). (2.4) So that I(X; Y ) > I(X; Y | Z) .
Thus H(g(X)) ≤ H(X). (b) This example is also given in the text. Let X, Y be independent fair binary
random variables and let Z = X + Y . In this case we have that,
Solution: Entropy of functions of a random variable.
(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies. I(X; Y ) = 0
(b) H(g(X)|X) = 0 since for any particular value of X, g(X) is fixed, and hence and,
$ $
H(g(X)|X) = x p(x)H(g(X)|X = x) = x 0 = 0 . I(X; Y | Z) = H(X | Z) = 1/2.
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
So I(X; Y ) < I(X; Y | Z) . Note that in this case X, Y, Z are not markov.
(d) H(X|g(X)) ≥ 0 , with equality iff X is a function of g(X) , i.e., g(.) is one-to-one.
Hence H(X, g(X)) ≥ H(g(X)) . 7. Coin weighing. Suppose one has n coins, among which there may or may not be one
Combining parts (b) and (d), we obtain H(X) ≥ H(g(X)) . counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than
the other coins. The coins are to be weighed by a balance.
5. Zero conditional entropy. Show that if H(Y |X) = 0 , then Y is a function of X ,
i.e., for all x with p(x) > 0 , there is only one possible value of y with p(x, y) > 0 . (a) Find an upper bound on the number of coins n so that k weighings will find the
counterfeit coin (if any) and correctly declare it to be heavier or lighter.
Solution: Zero Conditional Entropy. Assume that there exists an x , say x 0 and two
different values of y , say y1 and y2 such that p(x0 , y1 ) > 0 and p(x0 , y2 ) > 0 . Then (b) (Difficult) What is the coin weighing strategy for k = 3 weighings and 12 coins?
p(x0 ) ≥ p(x0 , y1 ) + p(x0 , y2 ) > 0 , and p(y1 |x0 ) and p(y2 |x0 ) are not equal to 0 or 1.
Solution: Coin weighing.
Thus
! ! (a) For n coins, there are 2n + 1 possible situations or “states”.
H(Y |X) = − p(x) p(y|x) log p(y|x) (2.5)
x y • One of the n coins is heavier.
≥ p(x0 )(−p(y1 |x0 ) log p(y1 |x0 ) − p(y2 |x0 ) log p(y2 |x0 )) (2.6) • One of the n coins is lighter.
> > 0, (2.7) • They are all of equal weight.
Entropy, Relative Entropy and Mutual Information 13 14 Entropy, Relative Entropy and Mutual Information

Each weighing has three possible outcomes - equal, left pan heavier or right pan is heavier, it will produce the sequence of weighings that matches its column in
heavier. Hence with k weighings, there are 3 k possible outcomes and hence we the matrix. If it is lighter, it produces the negative of its column as a sequence
can distinguish between at most 3k different “states”. Hence 2n + 1 ≤ 3k or of weighings. Combining all these facts, we can see that any single odd coin will
n ≤ (3k − 1)/2 . produce a unique sequence of weighings, and that the coin can be determined from
Looking at it from an information theoretic viewpoint, each weighing gives at most the sequence.
log2 3 bits of information. There are 2n + 1 possible “states”, with a maximum One of the questions that many of you had whether the bound derived in part (a)
entropy of log2 (2n + 1) bits. Hence in this situation, one would require at least was actually achievable. For example, can one distinguish 13 coins in 3 weighings?
log2 (2n + 1)/ log 2 3 weighings to extract enough information for determination of No, not with a scheme like the one above. Yes, under the assumptions under
the odd coin, which gives the same result as above. which the bound was derived. The bound did not prohibit the division of coins
(b) There are many solutions to this problem. We will give one which is based on the into halves, neither did it disallow the existence of another coin known to be
ternary number system. normal. Under both these conditions, it is possible to find the odd coin of 13 coins
in 3 weighings. You could try modifying the above scheme to these cases.
We may express the numbers {−12, −11, . . . , −1, 0, 1, . . . , 12} in a ternary number
system with alphabet {−1, 0, 1} . For example, the number 8 is (-1,0,1) where 8. Drawing with and without replacement. An urn contains r red, w white, and
−1 × 30 + 0 × 31 + 1 × 32 = 8 . We form the matrix with the representation of the b black balls. Which has higher entropy, drawing k ≥ 2 balls from the urn with
positive numbers as its columns. replacement or without replacement? Set it up and show why. (There is both a hard
1 2 3 4 5 6 7 8 9 10 11 12 way and a relatively simple way to do this.)
30 1 -1 0 1 -1 0 1 -1 0 1 -1 0 Σ1 = 0
1 Solution: Drawing with and without replacement. Intuitively, it is clear that if the
3 0 1 1 1 -1 -1 -1 0 0 0 1 1 Σ2 = 2
balls are drawn with replacement, the number of possible choices for the i -th ball is
32 0 0 0 0 1 1 1 1 1 1 1 1 Σ3 = 8
larger, and therefore the conditional entropy is larger. But computing the conditional
Note that the row sums are not all zero. We can negate some columns to make distributions is slightly involved. It is easier to compute the unconditional entropy.
the row sums zero. For example, negating columns 7,9,11 and 12, we obtain
1 2 3 4 5 6 7 8 9 10 11 12 • With replacement. In this case the conditional distribution of each draw is the
30 1 -1 0 1 -1 0 -1 -1 0 1 1 0 Σ1 = 0 same for every draw. Thus
3 1 0 1 1 1 -1 -1 1 0 0 0 -1 -1 Σ2 = 0  r
32 0 0 0 0 1 1 -1 1 -1 1 -1 -1 Σ3 = 0  red
 with prob. r+w+b
w
Now place the coins on the balance according to the following rule: For weighing Xi = white with prob. r+w+b (2.8)

 black b
#i , place coin n with prob. r+w+b
• On left pan, if ni = −1 . and therefore
• Aside, if ni = 0 .
• On right pan, if ni = 1 . H(Xi |Xi−1 , . . . , X1 ) = H(Xi ) (2.9)
The outcome of the three weighings will find the odd coin if any and tell whether r w b
= log(r + w + b) − log r − log w − log(2.10)
b.
it is heavy or light. The result of each weighing is 0 if both pans are equal, -1 if r+w+b r+w+b r+w+b
the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings
• Without replacement. The unconditional probability of the i -th ball being red is
give the ternary expansion of the index of the odd coin. If the expansion is the
still r/(r + w + b) , etc. Thus the unconditional entropy H(X i ) is still the same as
same as the expansion in the matrix, it indicates that the coin is heavier. If
with replacement. The conditional entropy H(X i |Xi−1 , . . . , X1 ) is less than the
the expansion is of the opposite sign, the coin is lighter. For example, (0,-1,-1)
unconditional entropy, and therefore the entropy of drawing without replacement
indicates (0)30 +(−1)3+(−1)32 = −12 , hence coin #12 is heavy, (1,0,-1) indicates
is lower.
#8 is light, (0,0,0) indicates no odd coin.
Why does this scheme work? It is a single error correcting Hamming code for the 9. A metric. A function ρ(x, y) is a metric if for all x, y ,
ternary alphabet (discussed in Section 8.11 in the book). Here are some details.
First note a few properties of the matrix above that was used for the scheme. • ρ(x, y) ≥ 0
All the columns are distinct and no two columns add to (0,0,0). Also if any coin • ρ(x, y) = ρ(y, x)

Entropy, Relative Entropy and Mutual Information 15 16 Entropy, Relative Entropy and Mutual Information

• ρ(x, y) = 0 if and only if x = y (a) Find H(X) in terms of H(X1 ) and H(X2 ) and α.
• ρ(x, y) + ρ(y, z) ≥ ρ(x, z) . (b) Maximize over α to show that 2H(X) ≤ 2H(X1 ) + 2H(X2 ) and interpret using the
notion that 2H(X) is the effective alphabet size.
(a) Show that ρ(X, Y ) = H(X|Y ) + H(Y |X) satisfies the first, second and fourth
properties above. If we say that X = Y if there is a one-to-one function mapping Solution: Entropy. We can do this problem by writing down the definition of entropy
from X to Y , then the third property is also satisfied, and ρ(X, Y ) is a metric. and expanding the various terms. Instead, we will use the algebra of entropies for a
(b) Verify that ρ(X, Y ) can also be expressed as simpler proof.
Since X1 and X2 have disjoint support sets, we can write
ρ(X, Y ) = H(X) + H(Y ) − 2I(X; Y ) (2.11) )
= H(X, Y ) − I(X; Y ) (2.12) X1 with probability α
X=
X2 with probability 1 − α
= 2H(X, Y ) − H(X) − H(Y ). (2.13)
Define a function of X ,
Solution: A metric )
1 when X = X1
(a) Let θ = f (X) =
2 when X = X2
ρ(X, Y ) = H(X|Y ) + H(Y |X). (2.14)
Then Then as in problem 1, we have
• Since conditional entropy is always ≥ 0 , ρ(X, Y ) ≥ 0 . H(X) = H(X, f (X)) = H(θ) + H(X|θ)
• The symmetry of the definition implies that ρ(X, Y ) = ρ(Y, X) . = H(θ) + p(θ = 1)H(X|θ = 1) + p(θ = 2)H(X|θ = 2)
• By problem 2.6, it follows that H(Y |X) is 0 iff Y is a function of X and = H(α) + αH(X1 ) + (1 − α)H(X2 )
H(X|Y ) is 0 iff X is a function of Y . Thus ρ(X, Y ) is 0 iff X and Y
are functions of each other - and therefore are equivalent up to a reversible where H(α) = −α log α − (1 − α) log(1 − α) .
transformation.
11. A measure of correlation. Let X1 and X2 be identically distributed, but not
• Consider three random variables X , Y and Z . Then necessarily independent. Let
H(X|Y ) + H(Y |Z) ≥ H(X|Y, Z) + H(Y |Z) (2.15) H(X2 | X1 )
ρ=1− .
= H(X, Y |Z) (2.16) H(X1 )
= H(X|Z) + H(Y |X, Z) (2.17) I(X1 ;X2 )
(a) Show ρ = H(X1 ) .
≥ H(X|Z), (2.18)
(b) Show 0 ≤ ρ ≤ 1.
from which it follows that (c) When is ρ = 0 ?
ρ(X, Y ) + ρ(Y, Z) ≥ ρ(X, Z). (2.19) (d) When is ρ = 1 ?
Note that the inequality is strict unless X → Y → Z forms a Markov Chain Solution: A measure of correlation. X1 and X2 are identically distributed and
and Y is a function of X and Z . H(X2 |X1 )
(b) Since H(X|Y ) = H(X) − I(X; Y ) , the first equation follows. The second relation ρ=1−
H(X1 )
follows from the first equation and the fact that H(X, Y ) = H(X) + H(Y ) −
I(X; Y ) . The third follows on substituting I(X; Y ) = H(X) + H(Y ) − H(X, Y ) . (a)
H(X1 ) − H(X2 |X1 )
10. Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn ρ =
according to probability mass functions p 1 (·) and p2 (·) over the respective alphabets H(X1 )
X1 = {1, 2, . . . , m} and X2 = {m + 1, . . . , n}. Let H(X2 ) − H(X2 |X1 )
= (since H(X1 ) = H(X2 ))
) H(X1 )
X1 , with probability α, I(X1 ; X2 )
X= = .
X2 , with probability 1 − α. H(X ) 1
Entropy, Relative Entropy and Mutual Information 17 18 Entropy, Relative Entropy and Mutual Information

(b) Since 0 ≤ H(X2 |X1 ) ≤ H(X2 ) = H(X1 ) , we have

Figure 2.1: Venn diagram to illustrate the relationships of entropy and relative entropy
H(X2 |X1 )
0≤ ≤1
H(X1 )
0 ≤ ρ ≤ 1. H(Y)
(c) ρ = 0 iff I(X1 ; X2 ) = 0 iff X1 and X2 are independent. H(X)
(d) ρ = 1 iff H(X2 |X1 ) = 0 iff X2 is a function of X1 . By symmetry, X1 is a H(X|Y) I(X;Y)
function of X2 , i.e., X1 and X2 have a one-to-one relationship. H(Y|X)

12. Example of joint entropy. Let p(x, y) be given by

! Y
! 0 1
X !
since the second term is always negative. Hence letting y = 1/x , we obtain
1 1
0 3 3 1
− ln y ≤ −1
1 0 1 y
3
or
1
Find ln y ≥ 1 −
y
(a) H(X), H(Y ). with equality iff y = 1 .
(b) H(X | Y ), H(Y | X).
14. Entropy of a sum. Let X and Y be random variables that take on values x 1 , x2 , . . . , xr
(c) H(X, Y ). and y1 , y2 , . . . , ys , respectively. Let Z = X + Y.
(d) H(Y ) − H(Y | X).
(e) I(X; Y ) . (a) Show that H(Z|X) = H(Y |X). Argue that if X, Y are independent, then H(Y ) ≤
H(Z) and H(X) ≤ H(Z). Thus the addition of independent random variables
(f) Draw a Venn diagram for the quantities in (a) through (e).
adds uncertainty.
Solution: Example of joint entropy (b) Give an example of (necessarily dependent) random variables in which H(X) >
2 3 1
H(Z) and H(Y ) > H(Z).
(a) H(X) = 3 log 2 + 3 log 3 = 0.918 bits = H(Y ) .
(c) Under what conditions does H(Z) = H(X) + H(Y ) ?
(b) H(X|Y ) = 31 H(X|Y = 0) + 32 H(X|Y = 1) = 0.667 bits = H(Y |X) .
1 Solution: Entropy of a sum.
(c) H(X, Y ) = 3 × 3 log 3 = 1.585 bits.
(d) H(Y ) − H(Y |X) = 0.251 bits. (a) Z = X + Y . Hence p(Z = z|X = x) = p(Y = z − x|X = x) .
(e) I(X; Y ) = H(Y ) − H(Y |X) = 0.251 bits. !
(f) See Figure 1. H(Z|X) = p(x)H(Z|X = x)
! !
1 = − p(x) p(Z = z|X = x) log p(Z = z|X = x)
13. Inequality. Show ln x ≥ 1 − x for x > 0. x z
! !
Solution: Inequality. Using the Remainder form of the Taylor expansion of ln(x) = p(x) p(Y = z − x|X = x) log p(Y = z − x|X = x)
about x = 1 , we have for some c between 1 and x x y
!
* +
1
*
−1
+
(x − 1)2 = p(x)H(Y |X = x)
ln(x) = ln(1) + (x − 1) + ≤ x−1
t t=1 t2 t=c 2 = H(Y |X).

Entropy, Relative Entropy and Mutual Information 19 20 Entropy, Relative Entropy and Mutual Information

If X and Y are independent, then H(Y |X) = H(Y ) . Since I(X; Z) ≥ 0 , (a) From the data processing inequality, and the fact that entropy is maximum for a
we have H(Z) ≥ H(Z|X) = H(Y |X) = H(Y ) . Similarly we can show that uniform distribution, we get
H(Z) ≥ H(X) .
I(X1 ; X3 ) ≤ I(X1 ; X2 )
(b) Consider the following joint distribution for X and Y Let
= H(X2 ) − H(X2 | X1 )
)
1 with probability 1/2 ≤ H(X2 )
X = −Y =
0 with probability 1/2 ≤ log k.

Then H(X) = H(Y ) = 1 , but Z = 0 with prob. 1 and hence H(Z) = 0 . Thus, the dependence between X1 and X3 is limited by the size of the bottleneck.
That is I(X1 ; X3 ) ≤ log k .
(c) We have
(b) For k = 1 , I(X1 ; X3 ) ≤ log 1 = 0 and since I(X1 , X3 ) ≥ 0 , I(X1 , X3 ) = 0 .
H(Z) ≤ H(X, Y ) ≤ H(X) + H(Y ) Thus, for k = 1 , X1 and X3 are independent.
because Z is a function of (X, Y ) and H(X, Y ) = H(X) + H(Y |X) ≤ H(X) + 17. Pure randomness and bent coins. Let X1 , X2 , . . . , Xn denote the outcomes of
H(Y ) . We have equality iff (X, Y ) is a function of Z and H(Y ) = H(Y |X) , i.e., independent flips of a bent coin. Thus Pr {X i = 1} = p, Pr {Xi = 0} = 1 − p ,
X and Y are independent. where p is unknown. We wish to obtain a sequence Z 1 , Z2 , . . . , ZK of fair coin flips
from X1 , X2 , . . . , Xn . Toward this end let f : X n → {0, 1}∗ , (where {0, 1}∗ =
15. Data processing. Let X1 → X2 → X3 → · · · → Xn form a Markov chain in this
{Λ, 0, 1, 00, 01, . . .} is the set of all finite length binary sequences), be a mapping
order; i.e., let
f (X1 , X2 , . . . , Xn ) = (Z1 , Z2 , . . . , ZK ) , where Zi ∼ Bernoulli ( 21 ) , and K may depend
p(x1 , x2 , . . . , xn ) = p(x1 )p(x2 |x1 ) · · · p(xn |xn−1 ). on (X1 , . . . , Xn ) . In order that the sequence Z1 , Z2 , . . . appear to be fair coin flips, the
Reduce I(X1 ; X2 , . . . , Xn ) to its simplest form. map f from bent coin flips to fair flips must have the property that all 2 k sequences
(Z1 , Z2 , . . . , Zk ) of a given length k have equal probability (possibly 0), for k = 1, 2, . . . .
Solution: Data Processing. By the chain rule for mutual information, For example, for n = 2 , the map f (01) = 0 , f (10) = 1 , f (00) = f (11) = Λ (the null
string), has the property that Pr{Z1 = 1|K = 1} = Pr{Z1 = 0|K = 1} = 12 .
I(X1 ; X2 , . . . , Xn ) = I(X1 ; X2 ) + I(X1 ; X3 |X2 ) + · · · + I(X1 ; Xn |X2 , . . . , Xn−2 ). (2.20)
Give reasons for the following inequalities:
By the Markov property, the past and the future are conditionally independent given
the present and hence all terms except the first are zero. Therefore (a)
nH(p) = H(X1 , . . . , Xn )
I(X1 ; X2 , . . . , Xn ) = I(X1 ; X2 ). (2.21) (b)
≥ H(Z1 , Z2 , . . . , ZK , K)
(c)
16. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks = H(K) + H(Z1 , . . . , ZK |K)
(d)
down to k < n states, and then fans back to m > k states. Thus X 1 → X2 → X3 , = H(K) + E(K)
i.e., p(x1 , x2 , x3 ) = p(x1 )p(x2 |x1 )p(x3 |x2 ) , for all x1 ∈ {1, 2, . . . , n} , x2 ∈ {1, 2, . . . , k} , (e)
x3 ∈ {1, 2, . . . , m} . ≥ EK.

(a) Show that the dependence of X1 and X3 is limited by the bottleneck by proving Thus no more than nH(p) fair coin tosses can be derived from (X 1 , . . . , Xn ) , on the
that I(X1 ; X3 ) ≤ log k. average. Exhibit a good map f on sequences of length 4.

(b) Evaluate I(X1 ; X3 ) for k = 1 , and conclude that no dependence can survive such Solution: Pure randomness and bent coins.
a bottleneck.
(a)
Solution: nH(p) = H(X1 , . . . , Xn )
(b)
Bottleneck. ≥ H(Z1 , Z2 , . . . , ZK )
Entropy, Relative Entropy and Mutual Information 21 22 Entropy, Relative Entropy and Mutual Information
(c)
= H(Z1 , Z2 , . . . , ZK , K) For example, for p ≈ 12 , the expected number of pure random bits is close to 1.625.
(d) This is substantially less then the 4 pure random bits that could be generated if
= H(K) + H(Z1 , . . . , ZK |K) p were exactly 12 .
(e)
= H(K) + E(K) We will now analyze the efficiency of this scheme of generating random bits for long
(f ) sequences of bent coin flips. Let n be the number of bent coin flips. The algorithm
≥ EK . that we will use is the obvious extension of the above method of generating pure
bits using the fact that all sequences with the same number of ones are equally
(a) Since X1 , X2 , . . . , Xn are i.i.d. with probability of Xi = 1 being p , the entropy likely.
H(X1 , X2 , . . . , Xn ) is nH(p) . Consider all sequences
, -
with k ones. There are nk such sequences, which are
, - , -
(b) Z1 , . . . , ZK is a function of X1 , X2 , . . . , Xn , and since the entropy of a function of a all equally likely. If nk were a power of 2, then we could generate
, -
log nk pure
random variable is less than the entropy of the random variable, H(Z 1 , . . . , ZK ) ≤ random bits from such a set. However, in the general
,n-
case, nk is not a power of
H(X1 , X2 , . . . , Xn ) . 2 and the best we can to is the divide the set of k elements into subset of sizes
n
(c) K is a function of Z1 , Z2 , . . . , ZK , so its conditional entropy given Z1 , Z2 , . . . , ZK which are powers of 2. The largest set would have a size 2 $log (k )% and could be
, -
is 0. Hence H(Z1 , Z2 , . . . , ZK , K) = H(Z1 , . . . , ZK ) + H(K|Z1 , Z2 , . . . , ZK ) = used to generate ⌊log nk ⌋ random bits. We could divide the remaining elements
H(Z1 , Z2 , . . . , ZK ). into the largest set which is a power of 2, etc. The worst case would occur when
,n- l+1 − 1 , in which case the subsets would be of sizes 2 l , 2l−1 , 2l−2 , . . . , 1 .
(d) Follows from the chain rule for entropy. k =2
Instead of analyzing the scheme exactly, we will just find a lower bound on number
(e) By assumption, Z1 , Z2 , . . . , ZK are pure random bits (given K ), with entropy 1 , - , -
of random bits generated from a set of size nk . Let l = ⌊log nk ⌋ . Then at least
bit per symbol. Hence
half of the elements belong to a set of size 2 l and would generate l random bits,
! at least 14 th belong to a set of size 2l−1 and generate l − 1 random bits, etc. On
H(Z1 , Z2 , . . . , ZK |K) = p(K = k)H(Z1 , Z2 , . . . , Zk |K = k) (2.22)
k
the average, the number of bits generated is
!
= p(k)k (2.23) 1 1 1
k E[K|k 1’s in sequence] ≥ l + (l − 1) + · · · + l 1 (2.28)
2 4* 2
= EK. (2.24) 1 1 2 3 l−1
+
= l− 1 + + + + · · · + l−2 (2.29)
4 2 4 8 2
(f) Follows from the non-negativity of discrete entropy. ≥ l − 1, (2.30)
(g) Since we do not know p , the only way to generate pure random bits is to use
the fact that all sequences with the same number of ones are equally likely. For since the infinite series sums to 1.
, -
example, the sequences 0001,0010,0100 and 1000 are equally likely and can be used Hence the fact that nk is not a power of 2 will cost at most 1 bit on the average
to generate 2 pure random bits. An example of a mapping to generate random in the number of random bits that are produced.
bits is
Hence, the expected number of pure random bits produced by this algorithm is
0000 → Λ n
. / . /
! n k n−k n
0001 → 00 0010 → 01 0100 → 10 1000 → 11 EK ≥ p q ⌊log − 1⌋ (2.31)
k k
0011 → 00 0110 → 01 1100 → 10 1001 → 11 k=0
(2.25) n
. / . . / /
1010 → 0 0101 → 1 ! n k n−k n
≥ p q log −2 (2.32)
1110 → 11 1101 → 10 1011 → 01 0111 → 00 k k
k=0
1111 → Λ n
. / . /
! n k n−k n
The resulting expected number of bits is = p q log −2 (2.33)
k=0
k k
. / . /
3 2 2 2 2
EK = 4pq × 2 + 4p q × 2 + 2p q × 1 + 4p q × 2 3
(2.26) ! n k n−k n
≥ p q log − 2. (2.34)
= 8pq 3 + 10p2 q 2 + 8p3 q. (2.27) n(p−ǫ)≤k≤n(p+ǫ)
k k

Entropy, Relative Entropy and Mutual Information 23 24 Entropy, Relative Entropy and Mutual Information

Now for sufficiently large n , the probability that the number of 1’s in the sequence Y is a deterministic function of X, so if you know X there is no randomness in Y. Or,
is close to np is near 1 (by the weak law of large numbers). For such sequences, H(Y |X) = 0 .
k
n is close to p and hence there exists a δ such that Since H(X) + H(Y |X) = H(X, Y ) = H(Y ) + H(X|Y ) , it is easy to determine
. /
n k H(X|Y ) = H(X) + H(Y |X) − H(Y ) = 3.889
≥ 2n(H( n )−δ) ≥ 2n(H(p)−2δ) (2.35)
k
19. Infinite entropy. This problem shows that the entropy of a discrete random variable
using Stirling’s approximation for the binomial coefficients and the continuity of $
can be infinite. Let A = ∞ 2 −1 . (It is easy to show that A is finite by
n=2 (n log n)
the entropy function. If we assume that n is large enough so that the probability bounding the infinite sum by the integral of (x log 2 x)−1 .) Show that the integer-
that n(p − ǫ) ≤ k ≤ n(p + ǫ) is greater than 1 − ǫ , then we see that EK ≥ valued random variable X defined by Pr(X = n) = (An log 2 n)−1 for n = 2, 3, . . . ,
(1 − ǫ)n(H(p) − 2δ) − 2 , which is very good since nH(p) is an upper bound on the has H(X) = +∞ .
number of pure random bits that can be produced from the bent coin sequence.
Solution: Infinite entropy. By definition, p n = Pr(X = n) = 1/An log 2 n for n ≥ 2 .
18. World Series. The World Series is a seven-game series that terminates as soon as Therefore
either team wins four games. Let X be the random variable that represents the outcome
∞
of a World Series between teams A and B; possible values of X are AAAA, BABABAB, !
H(X) = − p(n) log p(n)
and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7. n=2
Assuming that A and B are equally matched and that the games are independent, !∞ 0 1 0 1
calculate H(X) , H(Y ) , H(Y |X) , and H(X|Y ) . = − 1/An log 2 n log 1/An log 2 n
n=2
Solution: ∞
! log(An log 2 n)
World Series. Two teams play until one of them has won 4 games. =
n=2 An log2 n
There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability ∞
(1/2)4 .
! log A + log n + 2 log log n
=
,4- An log2 n
There are 8 = 2 3 World Series with 5 games. Each happens with probability (1/2) 5 . n=2
∞ ∞
There are 20 = 2
,5-
World Series with 6 games. Each happens with probability (1/2) 6 .
! 1 ! 2 log log n
3 = log A + + .
,6- n=2
An log n n=2 An log2 n
There are 40 = 2 3 World Series with 7 games. Each happens with probability (1/2) 7 .
The first term is finite. For base 2 logarithms, all the elements in the sum in the last
The probability of a 4 game series ( Y = 4 ) is 2(1/2) 4 = 1/8 . term are nonnegative. (For any other base, the terms of the last sum eventually all
The probability of a 5 game series ( Y = 5 ) is 8(1/2) 5 = 1/4 . become positive.) So all we have to do is bound the middle sum, which we do by
comparing with an integral.
The probability of a 6 game series ( Y = 6 ) is 20(1/2) 6 = 5/16 .
The probability of a 7 game series ( Y = 7 ) is 40(1/2) 7 = 5/16 . ∞
! 1
2 ∞ 1 3∞
3
> dx = K ln ln x 3 = +∞ .
n=2
An log n 2 Ax log x 2
! 1
H(X) = p(x)log
p(x) We conclude that H(X) = +∞ .
= 2(1/16) log 16 + 8(1/32) log 32 + 20(1/64) log 64 + 40(1/128) log 128
= 5.8125 20. Run length coding. Let X1 , X2 , . . . , Xn be (possibly dependent) binary random
variables. Suppose one calculates the run lengths R = (R 1 , R2 , . . .) of this sequence
(in order as they occur). For example, the sequence X = 0001100100 yields run
! 1 lengths R = (3, 2, 2, 1, 2) . Compare H(X1 , X2 , . . . , Xn ) , H(R) and H(Xn , R) . Show
H(Y ) = p(y)log
p(y) all equalities and inequalities, and bound all the differences.
= 1/8 log 8 + 1/4 log 4 + 5/16 log(16/5) + 5/16 log(16/5) Solution: Run length coding. Since the run lengths are a function of X 1 , X2 , . . . , Xn ,
= 1.924 H(R) ≤ H(X) . Any Xi together with the run lengths determine the entire sequence
Entropy, Relative Entropy and Mutual Information 25 26 Entropy, Relative Entropy and Mutual Information

X1 , X2 , . . . , Xn . Hence 23. Conditional mutual information. Consider a sequence of n binary random vari-
ables X1 , X2 , . . . , Xn . Each sequence with an even number of 1’s has probability
H(X1 , X2 , . . . , Xn ) = H(Xi , R) (2.36) 2−(n−1) and each sequence with an odd number of 1’s has probability 0. Find the
= H(R) + H(Xi |R) (2.37) mutual informations
≤ H(R) + H(Xi ) (2.38)
I(X1 ; X2 ), I(X2 ; X3 |X1 ), . . . , I(Xn−1 ; Xn |X1 , . . . , Xn−2 ).
≤ H(R) + 1. (2.39)

Solution: Conditional mutual information.

21. Markov’s inequality for probabilities. Let p(x) be a probability mass function.
Prove, for all d ≥ 0 , Consider a sequence of n binary random variables X 1 , X2 , . . . , Xn . Each sequence of
* +
1 length n with an even number of 1’s is equally likely and has probability 2 −(n−1) .
Pr {p(X) ≤ d} log ≤ H(X). (2.40)
d Any n − 1 or fewer of these are independent. Thus, for k ≤ n − 1 ,

Solution: Markov inequality applied to entropy. I(Xk−1 ; Xk |X1 , X2 , . . . , Xk−2 ) = 0.

1 ! 1
P (p(X) < d) log = p(x) log (2.41) However, given X1 , X2 , . . . , Xn−2 , we know that once we know either Xn−1 or Xn we
d d
x:p(x)<d know the other.
! 1
≤ p(x) log (2.42)
p(x) I(Xn−1 ; Xn |X1 , X2 , . . . , Xn−2 ) = H(Xn |X1 , X2 , . . . , Xn−2 ) − H(Xn |X1 , X2 , . . . , Xn−1 )
x:p(x)<d
! 1 = 1 − 0 = 1 bit.
≤ p(x) log (2.43)
x p(x)
= H(X) (2.44) 24. Average entropy. Let H(p) = −p log 2 p − (1 − p) log2 (1 − p) be the binary entropy
function.
22. Logical order of ideas. Ideas have been developed in order of need, and then gener- (a) Evaluate H(1/4) using the fact that log 2 3 ≈ 1.584 . Hint: You may wish to
alized if necessary. Reorder the following ideas, strongest first, implications following: consider an experiment with four equally likely outcomes, one of which is more
interesting than the others.
(a) Chain rule for I(X1 , . . . , Xn ; Y ) , chain rule for D(p(x1 , . . . , xn )||q(x1 , x2 , . . . , xn )) ,
(b) Calculate the average entropy H(p) when the probability p is chosen uniformly
and chain rule for H(X1 , X2 , . . . , Xn ) .
in the range 0 ≤ p ≤ 1 .
(b) D(f ||g) ≥ 0 , Jensen’s inequality, I(X; Y ) ≥ 0 .
(c) (Optional) Calculate the average entropy H(p 1 , p2 , p3 ) where (p1 , p2 , p3 ) is a uni-
formly distributed probability vector. Generalize to dimension n .
Solution: Logical ordering of ideas.
Solution: Average Entropy.
(a) The following orderings are subjective. Since I(X; Y ) = D(p(x, y)||p(x)p(y)) is a
special case of relative entropy, it is possible to derive the chain rule for I from (a) We can generate two bits of information by picking one of four equally likely
the chain rule for D . alternatives. This selection can be made in two steps. First we decide whether the
Since H(X) = I(X; X) , it is possible to derive the chain rule for H from the first outcome occurs. Since this has probability 1/4 , the information generated
chain rule for I . is H(1/4) . If not the first outcome, then we select one of the three remaining
It is also possible to derive the chain rule for I from the chain rule for H as was outcomes; with probability 3/4 , this produces log 2 3 bits of information. Thus
done in the notes.
H(1/4) + (3/4) log 2 3 = 2
(b) In class, Jensen’s inequality was used to prove the non-negativity of D . The
inequality I(X; Y ) ≥ 0 followed as a special case of the non-negativity of D . and so H(1/4) = 2 − (3/4) log 2 3 = 2 − (.75)(1.585) = 0.811 bits.

Entropy, Relative Entropy and Mutual Information 27 28 Entropy, Relative Entropy and Mutual Information

(b) If p is chosen uniformly in the range 0 ≤ p ≤ 1 , then the average entropy (in 26. Another proof of non-negativity of relative entropy. In view of the fundamental
nats) is nature of the result D(p||q) ≥ 0 , we will give another proof.
. /
x2 x2
2 1 2 1 31
− p ln p + (1 − p) ln(1 − p)dp = −2 x ln x dx = −2 ln x +
3
3 = 1
. (a) Show that ln x ≤ x − 1 for 0 < x < ∞ .
2 4 0 2
0 0
(b) Justify the following steps:
Therefore the average entropy is 21 log2 e = 1/(2 ln 2) = .721 bits.
! q(x)
(c) Choosing a uniformly distributed probability vector (p 1 , p2 , p3 ) is equivalent to −D(p||q) = p(x) ln (2.45)
choosing a point (p1 , p2 ) uniformly from the triangle 0 ≤ p1 ≤ 1 , p1 ≤ p2 ≤ 1 . x p(x)
* +
The probability density function has the constant value 2 because the area of the ! q(x)
≤ p(x) −1 (2.46)
triangle is 1/2. So the average entropy H(p 1 , p2 , p3 ) is x p(x)
2 12 1 ≤ 0 (2.47)
−2 p1 ln p1 + p2 ln p2 + (1 − p1 − p2 ) ln(1 − p1 − p2 )dp2 dp1 .
0 p1
(c) What are the conditions for equality?
After some enjoyable calculus, we obtain the final result 5/(6 ln 2) = 1.202 bits.
25. Venn diagrams. There isn’t realy a notion of mutual information common to three Solution: Another proof of non-negativity of relative entropy. In view of the funda-
random variables. Here is one attempt at a definition: Using Venn diagrams, we can mental nature of the result D(p||q) ≥ 0 , we will give another proof.
see that the mutual information common to three random variables X , Y and Z can
(a) Show that ln x ≤ x − 1 for 0 < x < ∞ .
be defined by
I(X; Y ; Z) = I(X; Y ) − I(X; Y |Z) . There are many ways to prove this. The easiest is using calculus. Let
This quantity is symmetric in X , Y and Z , despite the preceding asymmetric defi- f (x) = x − 1 − ln x (2.48)
nition. Unfortunately, I(X; Y ; Z) is not necessarily nonnegative. Find X , Y and Z
such that I(X; Y ; Z) < 0 , and prove the following two identities: for 0 < x < ∞ . Then f ' (x) = 1 − x1 and f '' (x) = x12 > 0 , and therefore f (x)
(a) I(X; Y ; Z) = H(X, Y, Z) − H(X) − H(Y ) − H(Z) + I(X; Y ) + I(Y ; Z) + I(Z; X) is strictly convex. Therefore a local minimum of the function is also a global
minimum. The function has a local minimum at the point where f ' (x) = 0 , i.e.,
(b) I(X; Y ; Z) = H(X, Y, Z)− H(X, Y )− H(Y, Z)− H(Z, X)+ H(X)+ H(Y )+ H(Z)
when x = 1 . Therefore f (x) ≥ f (1) , i.e.,
The first identity can be understood using the Venn diagram analogy for entropy and
mutual information. The second identity follows easily from the first. x − 1 − ln x ≥ 1 − 1 − ln 1 = 0 (2.49)
Solution: Venn Diagrams. To show the first identity,
which gives us the desired inequality. Equality occurs only if x = 1 .
I(X; Y ; Z) = I(X; Y ) − I(X; Y |Z) by definition (b) We let A be the set of x such that p(x) > 0 .
= I(X; Y ) − (I(X; Y, Z) − I(X; Z)) by chain rule
= I(X; Y ) + I(X; Z) − I(X; Y, Z)
! q(x)
−De (p||q) = p(x)ln (2.50)
= I(X; Y ) + I(X; Z) − (H(X) + H(Y, Z) − H(X, Y, Z)) x∈A
p(x)
* +
= I(X; Y ) + I(X; Z) − H(X) + H(X, Y, Z) − H(Y, Z)
! q(x)
≤ p(x) −1 (2.51)
= I(X; Y ) + I(X; Z) − H(X) + H(X, Y, Z) − (H(Y ) + H(Z) − I(Y ; Z)) x∈A
p(x)
! !
= I(X; Y ) + I(X; Z) + I(Y ; Z) + H(X, Y, Z) − H(X) − H(Y ) − H(Z). = q(x) − p(x) (2.52)
x∈A x∈A
To show the second identity, simply substitute for I(X; Y ) , I(X; Z) , and I(Y ; Z)
≤ 0 (2.53)
using equations like
I(X; Y ) = H(X) + H(Y ) − H(X, Y ) . The first step follows from the definition of D , the second step follows from the
These two identities show that I(X; Y ; Z) is a symmetric (but not necessarily nonneg- inequality ln t ≤ t − 1 , the third step from expanding the sum, and the last step
ative) function of three random variables. from the fact that the q(A) ≤ 1 and p(A) = 1 .
Entropy, Relative Entropy and Mutual Information 29 30 Entropy, Relative Entropy and Mutual Information

(c) What are the conditions for equality? Then, by the log sum inequality,
We have equality in the inequality ln t ≤ t − 1 if and only if t = 1 . Therefore we pi + p j pi + p j
have equality in step 2 of the chain iff q(x)/p(x) = 1 for all x ∈ A . This implies H(P2 ) − H(P1 ) = −2( ) log( ) + pi log pi + pj log pj
2 2
that p(x) = q(x) for all x , and we have equality in the last step as well. Thus pi + p j
the condition for equality is that p(x) = q(x) for all x . = −(pi + pj ) log( ) + pi log pi + pj log pj
2
≥ 0.
27. Grouping rule for entropy: Let p = (p 1 , p2 , . . . , pm ) be a probability distribution
$
on m elements, i.e, pi ≥ 0 , and m i=1 pi = 1 . Define a new distribution q on m − 1 Thus,
elements as q1 = p1 , q2 = p2 ,. . . , qm−2 = pm−2 , and qm−1 = pm−1 + pm , i.e., the H(P2 ) ≥ H(P1 ).
distribution q is the same as p on {1, 2, . . . , m − 2} , and the probability of the last
element in q is the sum of the last two probabilities of p . Show that 29. Inequalities. Let X , Y and Z be joint random variables. Prove the following
* + inequalities and find conditions for equality.
pm−1 pm
H(p) = H(q) + (pm−1 + pm )H , . (2.54)
pm−1 + pm pm−1 + pm (a) H(X, Y |Z) ≥ H(X|Z) .
(b) I(X, Y ; Z) ≥ I(X; Z) .
Solution:
(c) H(X, Y, Z) − H(X, Y ) ≤ H(X, Z) − H(X) .
m
!
H(p) = − pi log pi (2.55) (d) I(X; Z|Y ) ≥ I(Z; Y |X) − I(Z; Y ) + I(X; Z) .
i=1
m−2
Solution: Inequalities.
!
= − pi log pi − pm−1 log pm−1 − pm log pm (2.56) (a) Using the chain rule for conditional entropy,
i=1
m−2
! pm−1 pm H(X, Y |Z) = H(X|Z) + H(Y |X, Z) ≥ H(X|Z),
= − pi log pi − pm−1 log − pm log (2.57)
i=1
pm−1 + pm pm−1 + pm
with equality iff H(Y |X, Z) = 0 , that is, when Y is a function of X and Z .
−(pm−1 + pm ) log(pm−1 + pm ) (2.58)
pm−1 pm (b) Using the chain rule for mutual information,
= H(q) − pm−1 log − pm log (2.59)
pm−1 + pm pm−1 + pm I(X, Y ; Z) = I(X; Z) + I(Y ; Z|X) ≥ I(X; Z),
* +
pm−1 pm−1 pm pm
= H(q) − (pm−1 + pm ) log − log (2.60)
pm−1 + pm pm−1 + pm pm−1 + pm pm−1 + pm with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally inde-
* +
pm−1 pm pendent given X .
= H(q) + (pm−1 + pm )H2 , , (2.61)
pm−1 + pm pm−1 + pm (c) Using first the chain rule for entropy and then the definition of conditional mutual
information,
where H2 (a, b) = −a log a − b log b .
H(X, Y, Z) − H(X, Y ) = H(Z|X, Y ) = H(Z|X) − I(Y ; Z|X)
28. Mixing increases entropy. Show that the entropy of the probability distribution,
(p1 , . . . , pi , . . . , pj , . . . , pm ) , is less than the entropy of the distribution ≤ H(Z|X) = H(X, Z) − H(X) ,
p +p p +p
(p1 , . . . , i 2 j , . . . , i 2 j , . . . , pm ) . Show that in general any transfer of probability that with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally inde-
makes the distribution more uniform increases the entropy. pendent given X .
Solution: (d) Using the chain rule for mutual information,
Mixing increases entropy.
I(X; Z|Y ) + I(Z; Y ) = I(X, Y ; Z) = I(Z; Y |X) + I(X; Z) ,
This problem depends on the convexity of the log function. Let
and therefore
P1 = (p1 , . . . , pi , . . . , pj , . . . , pm ) I(X; Z|Y ) = I(Z; Y |X) − I(Z; Y ) + I(X; Z) .
pi + p j pj + p i
P2 = (p1 , . . . , ,..., , . . . , pm ) We see that this inequality is actually an equality in all cases.
2 2

Entropy, Relative Entropy and Mutual Information 31 32 Entropy, Relative Entropy and Mutual Information

30. Maximum entropy. Find the probability mass function p(x) that maximizes the which implies that,
entropy H(X) of a non-negative integer-valued random variable X subject to the A
constraint β =
∞
! A+1
EX = np(n) = A 1
α = .
n=0 A+1
for a fixed value A > 0 . Evaluate this maximum H(X) . So the entropy maximizing distribution is,
Solution: Maximum entropy 1
*
A
+i
pi = .
Recall that, A+1 A+1
∞ ∞
! ! Plugging these values into the expression for the maximum entropy,
− pi log pi ≤ − pi log qi .
i=0 i=0 − log α − A log β = (A + 1) log(A + 1) − A log A.

Let qi = α(β)i . Then we have that, The general form of the distribution,
∞
! ∞
! pi = αβ i
− pi log pi ≤ − pi log qi
i=0 i=0 can be obtained either by guessing or by Lagrange multipliers where,
. ∞ ∞
/
! ! ∞
! ∞
! ∞
!
= − log(α) pi + log(β) ipi F (pi , λ1 , λ2 ) = − pi log pi + λ1 ( pi − 1) + λ2 ( ipi − A)
i=0 i=0 i=0 i=0 i=0
= − log α − A log β is the function whose gradient we set to 0.
To complete the argument with Lagrange multipliers, it is necessary to show that the
Notice that the final right hand side expression is independent of {p i } , and that the local maximum is the global maximum. One possible argument is based on the fact
inequality, that −H(p) is convex, it has only one local minima, no local maxima and therefore
∞
!
− pi log pi ≤ − log α − A log β Lagrange multiplier actually gives the global maximum for H(p) .
i=0
31. Conditional entropy. Under what conditions does H(X | g(Y )) = H(X | Y ) ?
holds for all α, β such that, Solution: (Conditional Entropy). If H(X|g(Y )) = H(X|Y ) , then H(X)−H(X|g(Y )) =
∞ H(X) − H(X|Y ) , i.e., I(X; g(Y )) = I(X; Y ) . This is the condition for equality in
! 1
αβ i = 1 = α . the data processing inequality. From the derivation of the inequality, we have equal-
1−β
i=0 ity iff X → g(Y ) → Y forms a Markov chain. Hence H(X|g(Y )) = H(X|Y ) iff
X → g(Y ) → Y . This condition includes many special cases, such as g being one-
The constraint on the expected value also requires that, to-one, and X and Y being independent. However, these two special cases do not
∞
exhaust all the possibilities.
! β
iαβ i = A = α . 32. Fano. We are given the following joint distribution on (X, Y )
i=0
(1 − β)2
Y
Combining the two constraints we have, X a b c
1 1 1
β
*
α
+*
β
+ 1 6 12 12
α = 2 1 1 1
(1 − β)2 1−β 1−β 12 6 12
1 1 1
β 3 12 12 6
=
1−β
= A, Let X̂(Y ) be an estimator for X (based on Y) and let P e = Pr{X̂(Y ) %= X}.
Entropy, Relative Entropy and Mutual Information 33 34 Entropy, Relative Entropy and Mutual Information

(a) Find the minimum probability of error estimator X̂(Y ) and the associated Pe . to find a bound on Pe in terms of H . This is Fano’s inequality in the absence of
(b) Evaluate Fano’s inequality for this problem and compare. conditioning.
Solution: (Fano’s Inequality.) The minimal probability of error predictor when there
Solution: is no information is X̂ = 1 , the most probable value of X . The probability of error in
this case is Pe = 1 − p1 . Hence if we fix Pe , we fix p1 . We maximize the entropy of X
(a) From inspection we see that
for a given Pe to obtain an upper bound on the entropy for a given P e . The entropy,

 1 y=a
 m
!
X̂(y) = 2 y=b H(p) = −p1 log p1 − pi log pi (2.62)

 3 i=2
y=c
m
pi
! pi
Hence the associated Pe is the sum of P (1, b), P (1, c), P (2, a), P (2, c), P (3, a) = −p1 log p1 − Pe log − Pe log Pe (2.63)
i=2
Pe Pe
and P (3, b). Therefore, Pe = 1/2. * +
p2 p3 pm
(b) From Fano’s inequality we know = H(Pe ) + Pe H , ,..., (2.64)
Pe Pe Pe
H(X|Y ) − 1 ≤ H(Pe ) + Pe log(m − 1), (2.65)
Pe ≥ . 0 1
log |X | since the maximum of H Pp2e , Pp3e , . . . , pPme is attained by an uniform distribution. Hence
Here, any X that can be predicted with a probability of error P e must satisfy

H(X|Y ) = H(X|Y = a) Pr{y = a} + H(X|Y = b) Pr{y = b} + H(X|Y = c) Pr{y = c} H(X) ≤ H(Pe ) + Pe log(m − 1), (2.66)
* + * + * +
1 1 1 1 1 1 1 1 1
= H , , Pr{y = a} + H , , Pr{y = b} + H , , Pr{y = c} which is the unconditional form of Fano’s inequality. We can weaken this inequality to
2 4 4 2 4 4 2 4 4 obtain an explicit lower bound for Pe ,
* +
1 1 1
= H , , (Pr{y = a} + Pr{y = b} + Pr{y = c})
2 4 4 H(X) − 1
* + Pe ≥ . (2.67)
1 1 1 log(m − 1)
= H , ,
2 4 4
= 1.5 bits. 34. Entropy of initial conditions. Prove that H(X 0 |Xn ) is non-decreasing with n for
any Markov chain.
Hence Solution: Entropy of initial conditions. For a Markov chain, by the data processing
1.5 − 1
Pe ≥ = .316. theorem, we have
log 3
I(X0 ; Xn−1 ) ≥ I(X0 ; Xn ). (2.68)
Hence our estimator X̂(Y ) is not very close to Fano’s bound in this form. If
X̂ ∈ X , as it does here, we can use the stronger form of Fano’s inequality to get Therefore
H(X0 ) − H(X0 |Xn−1 ) ≥ H(X0 ) − H(X0 |Xn ) (2.69)
H(X|Y ) − 1
Pe ≥ . or H(X0 |Xn ) increases with n .
log(|X |-1)
35. Relative entropy is not symmetric: Let the random variable X have three possible
and outcomes {a, b, c} . Consider two distributions on this random variable
1.5 − 1 1
Pe ≥ = .
log 2 2 Symbol p(x) q(x)
a 1/2 1/3
Therefore our estimator X̂(Y ) is actually quite good.
b 1/4 1/3
33. Fano’s inequality. Let Pr(X = i) = p i , i = 1, 2, . . . , m and let p1 ≥ p2 ≥ p3 ≥ c 1/4 1/3
· · · ≥ pm . The minimal probability of error predictor of X is X̂ = 1 , with resulting Calculate H(p) , H(q) , D(p||q) and D(q||p) . Verify that in this case D(p||q) %=
probability of error Pe = 1 − p1 . Maximize H(p) subject to the constraint 1 − p 1 = Pe D(q||p) .

Entropy, Relative Entropy and Mutual Information 35 36 Entropy, Relative Entropy and Mutual Information

Solution: Apparently any set S with a given α is as good as any other.

1 1 1
H(p) =
log 2 + log 4 + log 4 = 1.5 bits. (2.70) Solution: The value of a question.
2 4 4
1 1 1
H(q) = log 3 + log 3 + log 3 = log 3 = 1.58496 bits. (2.71)
3 3 3
1 3 1 3 1 3
D(p||q) = log + log + log = log(3) − 1.5 = 1.58496 − 1.5 = 0.08496 (2.72) H(X) − H(X|Y ) = I(X; Y )
2 2 4 4 4 4
1 2 1 4 1 4 5 = H(Y ) − H(Y |X)
D(q||p) = log + log + log = −log(3) = 1.66666−1.58496 = 0.08170 (2.73) = H(α) − H(Y |X)
3 3 3 3 3 3 3
= H(α)
36. Symmetric relative entropy: Though, as the previous example shows, D(p||q) %=
D(q||p) in general, there could be distributions for which equality holds. Give an since H(Y |X) = 0 .
example of two distributions p and q on a binary alphabet such that D(p||q) = D(q||p)
(other than the trivial case p = q ). 39. Entropy and pairwise independence.
Let X, Y, Z be three binary Bernoulli ( 21 ) random variables that are pairwise indepen-
Solution:
dent, that is, I(X; Y ) = I(X; Z) = I(Y ; Z) = 0 .
A simple case for D((p, 1 − p)||(q, 1 − q)) = D((q, 1 − q)||(p, 1 − p)) , i.e., for
(a) Under this constraint, what is the minimum value for H(X, Y, Z) ?
p 1−p q 1−q
p log + (1 − p) log = q log + (1 − q) log (2.74) (b) Give an example achieving this minimum.
q 1−q p 1−p
is when q = 1 − p . Solution:
37. Relative entropy: Let X, Y, Z be three random variables with a joint probability (a)
mass function p(x, y, z) . The relative entropy between the joint distribution and the
product of the marginals is H(X, Y, Z) = H(X, Y ) + H(Z|X, Y ) (2.79)
4
p(x, y, z)
5 ≥ H(X, Y ) (2.80)
D(p(x, y, z)||p(x)p(y)p(z)) = E log (2.75) = 2. (2.81)
p(x)p(y)p(z)
Expand this in terms of entropies. When is this quantity zero? So the minimum value for H(X, Y, Z) is at least 2. To show that is is actually
Solution: equal to 2, we show in part (b) that this bound is attainable.
4 5 (b) Let X and Y be iid Bernoulli( 12 ) and let Z = X ⊕ Y , where ⊕ denotes addition
p(x, y, z)
D(p(x, y, z)||p(x)p(y)p(z)) = E log (2.76) mod 2 (xor).
p(x)p(y)p(z)
= E[log p(x, y, z)] − E[log p(x)] − E[log p(y)] − E[log(2.77)
p(z)] 40. Discrete entropies
= −H(X, Y, Z) + H(X) + H(Y ) + H(Z) (2.78) Let X and Y be two independent integer-valued random variables. Let X be uniformly
distributed over {1, 2, . . . , 8} , and let Pr{Y = k} = 2 −k , k = 1, 2, 3, . . .
We have D(p(x, y, z)||p(x)p(y)p(z)) = 0 if and only p(x, y, z) = p(x)p(y)p(z) for all
(x, y, z) , i.e., if X and Y and Z are independent. (a) Find H(X)
38. The value of a question Let X ∼ p(x) , x = 1, 2, . . . , m . We are given a set (b) Find H(Y )
S ⊆ {1, 2, . . . , m} . We ask whether X ∈ S and receive the answer (c) Find H(X + Y, X − Y ) .
)
1, if X ∈ S Solution:
Y =
0, if X ∈
% S.
(a) For a uniform distribution, H(X) = log m = log 8 = 3 .
$
Suppose Pr{X ∈ S} = α . Find the decrease in uncertainty H(X) − H(X|Y ) . (b) For a geometric distribution, H(Y ) = k k2−k = 2 . (See solution to problem 2.1
Entropy, Relative Entropy and Mutual Information 37 38 Entropy, Relative Entropy and Mutual Information

(c) Since (X, Y ) → (X +Y, X −Y ) is a one to one transformation, H(X +Y, X −Y ) = (c) Q2 are independent of X , Q1 , and A1 .
H(X, Y ) = H(X) + H(Y ) = 3 + 2 = 5 . (d) A2 is completely determined given Q2 and X .
(e) Conditioning decreases entropy.
41. Random questions (f) Result from part a.
One wishes to identify a random object X ∼ p(x) . A question Q ∼ r(q) is asked
at random according to r(q) . This results in a deterministic answer A = A(x, q) ∈
42. Inequalities. Which of the following inequalities are generally ≥, =, ≤ ? Label each
{a1 , a2 , . . .} . Suppose X and Q are independent. Then I(X; Q, A) is the uncertainty
with ≥, =, or ≤ .
in X removed by the question-answer (Q, A) .

(a) Show I(X; Q, A) = H(A|Q) . Interpret. (a) H(5X) vs. H(X)

(b) I(g(X); Y ) vs. I(X; Y )
(b) Now suppose that two i.i.d. questions Q 1 , Q2 , ∼ r(q) are asked, eliciting answers
A1 and A2 . Show that two questions are less valuable than twice a single question (c) H(X0 |X−1 ) vs. H(X0 |X−1 , X1 )
in the sense that I(X; Q1 , A1 , Q2 , A2 ) ≤ 2I(X; Q1 , A1 ) . (d) H(X1 , X2 , . . . , Xn ) vs. H(c(X1 , X2 , . . . , Xn )) , where c(x1 , x2 , . . . , xn ) is the Huff-
man codeword assigned to (x1 , x2 , . . . , xn ) .
Solution: Random questions.
(e) H(X, Y )/(H(X) + H(Y )) vs. 1

Solution:
(a)
(a) X → 5X is a one to one mapping, and hence H(X) = H(5X) .
I(X; Q, A) = H(Q, A) − H(Q, A, |X)
(b) By data processing inequality, I(g(X); Y ) ≤ I(X; Y ) .
= H(Q) + H(A|Q) − H(Q|X) − H(A|Q, X)
(c) Because conditioning reduces entropy, H(X 0 |X−1 ) ≥ H(X0 |X−1 , X1 ) .
= H(Q) + H(A|Q) − H(Q)
(d) H(X, Y ) ≤ H(X) + H(Y ) , so H(X, Y )/(H(X) + H(Y )) ≤ 1 .
= H(A|Q)
43. Mutual information of heads and tails.
(a) Consider a fair coin flip. What is the mutual information between the top side
The interpretation is as follows. The uncertainty removed in X by (Q, A) is the and the bottom side of the coin?
same as the uncertainty in the answer given the question.
(b) A 6-sided fair die is rolled. What is the mutual information between the top side
(b) Using the result from part a and the fact that questions are independent, we can and the front face (the side most facing you)?
easily obtain the desired relationship.
Solution:
(a)
I(X; Q1 , A1 , Q2 , A2 ) = I(X; Q1 ) + I(X; A1 |Q1 ) + I(X; Q2 |A1 , Q1 ) + I(X; A2 |A1 , Q1 , Q2 ) Mutual information of heads and tails.
(b)
= I(X; A1 |Q1 ) + H(Q2 |A1 , Q1 ) − H(Q2 |X, A1 , Q1 ) + I(X; A2 |A1 , Q1 , Q2 ) To prove (a) observe that
(c) I(T ; B) = H(B) − H(B|T )
= I(X; A1 |Q1 ) + I(X; A2 |A1 , Q1 , Q2 )
= I(X; A1 |Q1 ) + H(A2 |A1 , Q1 , Q2 ) − H(A2 |X, A1 , Q1 , Q2 ) = log 2 = 1
(d) since B ∼ Ber(1/2) , and B = f (T ) . Here B, T stand for Bottom and Top respectively.
= I(X; A1 |Q1 ) + H(A2 |A1 , Q1 , Q2 )
(e) To prove (b) note that having observed a side of the cube facing us F , there are four
≤ I(X; A1 |Q1 ) + H(A2 |Q2 ) possibilities for the top T , which are equally probable. Thus,
(f )
= 2I(X; A1 |Q1 ) I(T ; F ) = H(T ) − H(T |F )
= log 6 − log 4
= log 3 − 1
(a) Chain rule.
(b) X and Q1 are independent. since T has uniform distribution on {1, 2, . . . , 6} .

Entropy, Relative Entropy and Mutual Information 39 40 Entropy, Relative Entropy and Mutual Information

44. Pure randomness 46. Axiomatic definition of entropy. If we assume certain axioms for our measure of
We wish to use a 3-sided coin to generate a fair coin toss. Let the coin X have information, then we will be forced to use a logarithmic measure like entropy. Shannon
probability mass function used this to justify his initial definition of entropy. In this book, we will rely more on
 the other properties of entropy rather than its axiomatic derivation to justify its use.
 A,
 pA
The following problem is considerably more difficult than the other problems in this
X= B, pB

 C, section.
pC
If a sequence of symmetric functions H m (p1 , p2 , . . . , pm ) satisfies the following proper-
where pA , pB , pC are unknown. ties,
1 0 1
(a) How would you use 2 independent flips X 1 , X2 to generate (if possible) a Bernoulli( ) • Normalization: H2 1 1
2 2, 2 = 1,
random variable Z ?
• Continuity: H2 (p, 1 − p) is a continuous function of p ,
(b) What is the resulting maximum expected number of fair bits generated? 0
p1 p2
1
• Grouping: Hm (p1 , p2 , . . . , pm ) = Hm−1 (p1 +p2 , p3 , . . . , pm )+(p1 +p2 )H2 p1 +p2 , p1 +p2 ,
Solution:
prove that Hm must be of the form
(a) The trick here is to notice that for any two letters Y and Z produced by two m
!
independent tosses of our bent three-sided coin, Y Z has the same probability as Hm (p1 , p2 , . . . , pm ) = − pi log pi , m = 2, 3, . . . . (2.86)
ZY . So we can produce B ∼ Bernoulli( 21 ) coin flips by letting B = 0 when we i=1

get AB , BC or AC , and B = 1 when we get BA , CB or CA (if we get AA , There are various other axiomatic formulations which also result in the same definition
BB or CC we don’t assign a value to B .) of entropy. See, for example, the book by Csiszár and Körner[4].
(b) The expected number of bits generated by the above scheme is as follows. We get Solution: Axiomatic definition of entropy. This is a long solution, so we will first
one bit, except when the two flips of the 3-sided coin produce the same symbol. outline what we plan to do. First we will extend the grouping axiom by induction and
So the expected number of fair bits generated is prove that

0 ∗ [P (AA) + P (BB) + P (CC)] + 1 ∗ [1 − P (AA) − P (BB) − P (CC)], (2.82) Hm (p1 , p2 , . . . , pm ) = Hm−k (p1 + p2 + · · · + pk , pk+1 , . . . , pm )
* +
p1 pk
+(p1 + p2 + · · · + pk )Hk ,..., (. 2.87)
or, 1 − p2A − p2B − p2C . (2.83) p1 + p 2 + · · · + p k p1 + p 2 + · · · + p k

45. Finite entropy. Show that for a discrete random variable X ∈ {1, 2, . . .} , if E log X < Let f (m) be the entropy of a uniform distribution on m symbols, i.e.,
* +
∞ , then H(X) < ∞ . 1 1 1
$ f (m) = Hm , ,..., . (2.88)
Solution: Let the distribution on the integers be p 1 , p2 , . . . . Then H(p) = − pi logpi m m m
$
and E log X = pi logi = c < ∞ . We will then show that for any two integers r and s , that f (rs) = f (r) + f (s) .
We will now find the maximum entropy distribution subject to the constraint on the We use this to show that f (m) = log m . We then show for rational p = r/s , that
expected logarithm. Using Lagrange multipliers or the results of Chapter 12, we have H2 (p, 1 − p) = −p log p − (1 − p) log(1 − p) . By continuity, we will extend it to irrational
the following functional to optimize p and finally by induction and grouping, we will extend the result to H m for m ≥ 2 .
! ! ! To begin, we extend the grouping axiom. For convenience in notation, we will let
J(p) = − pi log pi − λ1 p i − λ2 pi log i (2.84)
k
!
Differentiating with respect to p i and setting to zero, we find that the p i that maximizes Sk = pi (2.89)
$
the entropy set pi = aiλ , where a = 1/( iλ ) and λ chosed to meet the expected log i=1

constraint, i.e. ! ! and we will denote H2 (q, 1 − q) as h(q) . Then we can write the grouping axiom as
iλ log i = c iλ (2.85) * +
p2
Using this value of pi , we can see that the entropy is finite. Hm (p1 , . . . , pm ) = Hm−1 (S2 , p3 , . . . , pm ) + S2 h . (2.90)
S2
Entropy, Relative Entropy and Mutual Information 41 42 Entropy, Relative Entropy and Mutual Information

Applying the grouping axiom again, we have We can immediately use this to conclude that f (m k ) = kf (m) .
p2
* + Now, we will argue that H2 (1, 0) = h(1) = 0 . We do this by expanding H 3 (p1 , p2 , 0)
Hm (p1 , . . . , pm ) = Hm−1 (S2 , p3 , . . . , pm ) + S2 h (2.91) ( p1 + p2 = 1 ) in two different ways using the grouping axiom
S2
+ * * +
p3 p2
= Hm−2 (S3 , p4 , . . . , pm ) + S3 h + S2 h (2.92) H3 (p1 , p2 , 0) = H2 (p1 , p2 ) + p2 H2 (1, 0) (2.105)
S3 S2
.. = H2 (1, 0) + (p1 + p2 )H2 (p1 , p2 ) (2.106)
. (2.93)
k * + Thus p2 H2 (1, 0) = H2 (1, 0) for all p2 , and therefore H(1, 0) = 0 .
pi !
= Hm−(k−1) (Sk , pk+1 , . . . , pm ) + Si h . (2.94)
i=2
Si We will also need to show that f (m + 1) − f (m) → 0 as m → ∞ . To prove this, we
use the extended grouping axiom and write
Now, we apply the same grouping axiom repeatedly to H k (p1 /Sk , . . . , pk /Sk ) , to obtain
1 1
f (m + 1) = Hm+1 ( ,..., ) (2.107)
* + * + k−1 * + m+1 m+1
p1 pk Sk−1 pk ! Si pi /Sk
Hk ,..., = H2 , + h (2.95) 1 m 1 1
Sk Sk Sk Sk Sk Si /Sk = h( )+ Hm ( , . . . , ) (2.108)
i=2 m+1 m+1 m m
1 ! k * +
pi 1 m
= Si h . (2.96) = h( )+ f (m) (2.109)
Sk i=2 Si m+1 m+1

and therefore
From (2.94) and (2.96), it follows that m 1
f (m + 1) − f (m) = h( ). (2.110)
* + m+1 m+1
p1 pk m 1
Hm (p1 , . . . , pm ) = Hm−k (Sk , pk+1 , . . . , pm ) + Sk Hk ,..., , (2.97) Thus lim f (m + 1) − m+1 f (m)= lim h( m+1 ).
But by the continuity of H2 , it follows
Sk Sk 1
that the limit on the right is h(0) = 0 . Thus lim h( m+1 ) = 0.
which is the extended grouping axiom.
Let us define
Now we need to use an axiom that is not explicitly stated in the text, namely that the an+1 = f (n + 1) − f (n) (2.111)
function Hm is symmetric with respect to its arguments. Using this, we can combine
any set of arguments of Hm using the extended grouping axiom. and
1
1 1
Let f (m) denote Hm ( m 1
, m, . . . , m ). bn = h( ). (2.112)
n
Consider Then
1 1 1
f (mn) = Hmn ( , ,..., ). (2.98)
mn mn mn 1
an+1 = − f (n) + bn+1 (2.113)
By repeatedly applying the extended grouping axiom, we have n+1
n
1 !
1 1 1 = − ai + bn+1 (2.114)
f (mn) = Hmn ( , ,..., ) (2.99) n + 1 i=2
mn mn mn
1 1 1 1 1 1
= Hmn−n ( , ,..., ) + Hn ( , . . . , ) (2.100) and therefore
m mn mn m n n n
!
1 1 1 1 2 1 1 (n + 1)bn+1 = (n + 1)an+1 + ai . (2.115)
= Hmn−2n ( , , ,..., ) + Hn ( , . . . , ) (2.101)
m m mn mn m n n i=2
.. Therefore summing over n , we have
. (2.102)
1 1 1 1
= Hm ( , . . . . ) + H( , . . . , ) (2.103) N
! N
! N
!
m m n n nbn = (nan + an−1 + . . . + a2 ) = N ai . (2.116)
= f (m) + f (n). (2.104) n=2 n=2 n=2

Entropy, Relative Entropy and Mutual Information 43 44 Entropy, Relative Entropy and Mutual Information
$N
Dividing both sides by n=1 n = N (N + 1)/2 , we obtain where 0 ≤ l < P . From the fact that g(P ) = 0 , it follows that g(P n (1) ) = g(n(1) ) ,
and
N N $ n−1
2 ! nbn !
an = $n=2
N
(2.117) g(n) = g(n(1) ) + g(n) − g(P n(1) ) = g(n(1) ) + αi (2.124)
N + 1 n=2 n=2 n i=P n(1)

Now by continuity of H2 and the definition of bn , it follows that bn → 0 as n → ∞ . Just as we have defined n(1) from n , we can define n(2) from n(1) . Continuing this
Since the right hand side is essentially an average of the b n ’s, it also goes to 0 (This process, we can then write
can be proved more precisely using ǫ ’s and δ ’s). Thus the left hand side goes to 0. We  
k (i−1)
n!
can then see that !
1 ! N g(n) = g(n(k) ) +  αi  . (2.125)
aN +1 = bN +1 − an (2.118) j=1 i=P n(i)
N + 1 n=2
also goes to 0 as N → ∞ . Thus Since n(k) ≤ n/P k , after
6 7
log n
k= +1 (2.126)
f (n + 1) − f (n) → 0 asn → ∞. (2.119) log P
terms, we have n(k) = 0 , and g(0) = 0 (this follows directly from the additive property
We will now prove the following lemma of g ). Thus we can write
tn
!
Lemma 2.0.1 Let the function f (m) satisfy the following assumptions: g(n) = αi (2.127)
i=1
• f (mn) = f (m) + f (n) for all integers m , n . the sum of bn terms, where * +
• limn→∞ (f (n + 1) − f (n)) = 0 log n
bn ≤ P +1 . (2.128)
log P
• f (2) = 1 ,
g(n)
Since αn → 0 , it follows that log2 n → 0 , since g(n) has at most o(log 2 n) terms αi .
then the function f (m) = log 2 m .
Thus it follows that
f (n) f (P )
lim = (2.129)
Proof of the lemma: Let P be an arbitrary prime number and let n→∞ log2 n log2 P

f (P ) log2 n Since P was arbitrary, it follows that f (P )/ log 2 P = c for every prime number P .
g(n) = f (n) − (2.120) Applying the third axiom in the lemma, it follows that the constant is 1, and f (P ) =
log2 P
log2 P .
Then g(n) satisfies the first assumption of the lemma. Also g(P ) = 0 . For composite numbers N = P1 P2 . . . Pl , we can apply the first property of f and the
Also if we let prime number factorization of N to show that
! !
f (P ) n f (N ) = f (Pi ) = log2 Pi = log2 N. (2.130)
αn = g(n + 1) − g(n) = f (n + 1) − f (n) + log2 (2.121)
log2 P n+1

then the second assumption in the lemma implies that lim α n = 0 . Thus the lemma is proved.
For an integer n , define The lemma can be simplified considerably, if instead of the second assumption, we
6 7
n replace it by the assumption that f (n) is monotone in n . We will now argue that the
n(1) = . (2.122)
P only function f (m) such that f (mn) = f (m) + f (n) for all integers m, n is of the form
f (m) = log a m for some base a .
Then it follows that n(1) < n/P , and
Let c = f (2) . Now f (4) = f (2 × 2) = f (2) + f (2) = 2c . Similarly, it is easy to see
n = n(1) P + l (2.123) that f (2k ) = kc = c log 2 2k . We will extend this to integers that are not powers of 2.
Entropy, Relative Entropy and Mutual Information 45 46 Entropy, Relative Entropy and Mutual Information

For any integer m , let r > 0 , be another integer and let 2 k ≤ mr < 2k+1 . Then by for all m . This is a straightforward induction. We have just shown that this is true for
the monotonicity assumption on f , we have m = 2 . Now assume that it is true for m = n − 1 . By the grouping axiom,

kc ≤ rf (m) < (k + 1)c (2.131) Hn (p1 , . . . , pn ) = Hn−1 (p1 + p2 , p3 , . . . , pn ) (2.141)

* +
p1 p2
or +(p1 + p2 )H2 , (2.142)
k k+1 p1 + p 2 p1 + p 2
≤ f (m) < c
c (2.132) n
!
r r = −(p1 + p2 ) log(p1 + p2 ) − pi log pi (2.143)
Now by the monotonicity of log , we have i=3
p1 p1 p2 p2
k k+1 − log − log (2.144)
≤ log2 m < (2.133) p1 + p 2 p1 + p 2 p1 + p 2 p1 + p 2
r r n
!
= − pi log pi . (2.145)
Combining these two equations, we obtain
i=1
3 3
3f (m) − log 2 m 3 < 1 Thus the statement is true for m = n , and by induction, it is true for all m . Thus we
3 3
(2.134)
3 c 3 r have finally proved that the only symmetric function that satisfies the axioms is
Since r was arbitrary, we must have m
!
Hm (p1 , . . . , pm ) = − pi log pi . (2.146)
log2 m i=1
f (m) = (2.135)
c
The proof above is due to Rényi[11]
and we can identify c = 1 from the last assumption of the lemma.
47. The entropy of a missorted file.
Now we are almost done. We have shown that for any uniform distribution on m
A deck of n cards in order 1, 2, . . . , n is provided. One card is removed at random
outcomes, f (m) = Hm (1/m, . . . , 1/m) = log 2 m .
then replaced at random. What is the entropy of the resulting deck?
We will now show that
Solution: The entropy of a missorted file.
H2 (p, 1 − p) = −p log p − (1 − p) log(1 − p). (2.136)
The heart of this problem is simply carefully counting the possible outcome states.
To begin, let p be a rational number, r/s , say. Consider the extended grouping axiom There are n ways to choose which card gets mis-sorted, and, once the card is chosen,
for Hs there are again n ways to choose where the card is replaced in the deck. Each of these
1 1 1 1 s−r s−r shuffling actions has probability 1/n 2 . Unfortunately, not all of these n 2 actions results
f (s) = Hs ( , . . . , ) = H( , . . . , , )+ f (s − r) (2.137) in a unique mis-sorted file. So we need to carefully count the number of distinguishable
s s <s => s? s s
r
outcome states. The resulting deck can only take on one of the following three cases.
r s−r s s−r
= H2 ( , ) + f (s) + f (s − r) (2.138) • The selected card is at its original location after a replacement.
s s r s
• The selected card is at most one location away from its original location after a
Substituting f (s) = log 2 s , etc, we obtain replacement.
* + * +
r s−r r r s−r s−r • The selected card is at least two locations away from its original location after a
H2 ( , ) = − log2 − 1 − log2 1 − . (2.139)
s s s s s s replacement.

Thus (2.136) is true for rational p . By the continuity assumption, (2.136) is also true To compute the entropy of the resulting deck, we need to know the probability of each
at irrational p . case.
To complete the proof, we have to extend the definition from H 2 to Hm , i.e., we have Case 1 (resulting deck is the same as the original): There are n ways to achieve this
to show that ! outcome state, one for each of the n cards in the deck. Thus, the probability associated
Hm (p1 , . . . , pm ) = − pi log pi (2.140) with case 1 is n/n2 = 1/n .

Entropy, Relative Entropy and Mutual Information 47 48 Entropy, Relative Entropy and Mutual Information

Case 2 (adjacent pair swapping): There are n − 1 adjacent pairs, each of which will (a)
= E(N )
have a probability of 2/n2 , since for each pair, there are two ways to achieve the swap,
I(X N ; N ) = 2
either by selecting the left-hand card and moving it one to the right, or by selecting the
right-hand card and moving it one to the left. where (a) comes from the fact that the entropy of a geometric random variable is
Case 3 (typical situation): None of the remaining actions “collapses”. They all result just the mean.
in unique outcome states, each with probability 1/n 2 . Of the n2 possible shuffling
actions, n2 − n − 2(n − 1) of them result in this third case (we’ve simply subtracted (b) Since given N we know that Xi = 0 for all i < N and XN = 1,
the case 1 and case 2 situations above).
The entropy of the resulting deck can be computed as follows. H(X N |N ) = 0.

1 2 n2 1
H(X) = log(n) + (n − 1) 2 log( ) + (n2 − 3n + 2) 2 log(n2 )
n n 2 n (c)
2n − 1 2(n − 1)
= log(n) − H(X N ) = I(X N ; N ) + H(X N |N )
n n2
= I(X N ; N ) + 0
48. Sequence length.
H(X N ) = 2.
How much information does the length of a sequence give about the content of a se-
quence? Suppose we consider a Bernoulli (1/2) process {X i }.
Stop the process when the first 1 appears. Let N designate this stopping time. (d)
Thus X N is an element of the set of all finite length binary sequences {0, 1} ∗ =
{0, 1, 00, 01, 10, 11, 000, . . .}. I(X N ; N ) = H(N ) − H(N |X N )
= H(N ) − 0
(a) Find I(N ; X N ).
I(X N ; N ) = HB (1/3)

(b) Find H(X N |N ).

(e)
(c) Find H(X N ).
1 2
H(X N |N ) = H(X 6 |N = 6) + H(X 12 |N = 12)
Let’s now consider a different stopping time. For this part, again assume X i ∼ Bernoulli (1/2) 3 3
but stop at time N = 6 , with probability 1/3 and stop at time N = 12 with probability 1 2
= H(X 6 ) + H(X 12 )
2/3. Let this stopping time be independent of the sequence X 1 X2 . . . X12 . 3 3
1 2
= 6 + 12
(d) Find I(N ; X N ). 3 3
H(X N |N ) = 10.
(e) Find H(X N |N ).

(f)
(f) Find H(X N ).
H(X N ) = I(X N ; N ) + H(X N |N )
Solution:
= I(X N ; N ) + 10
(a) H(X N ) = H(1/3) + 10.
I(X N ; N ) = H(N ) − H(N |X N )
= H(N ) − 0
50 The Asymptotic Equipartition Property
2 ∞
≥ xdF
2δ ∞
≥ δdF
δ
= δ Pr{X ≥ δ}.

Chapter 3 Rearranging sides and dividing by δ we get,

EX
Pr{X ≥ δ} ≤ . (3.4)
δ

The Asymptotic Equipartition One student gave a proof based on conditional expectations. It goes like

Property EX = E(X|X ≤ δ) Pr{X ≥ δ} + E(X|X < δ) Pr{X < δ}

≥ E(X|X ≤ δ) Pr{X ≥ δ}
≥ δ Pr{X ≥ δ},

which leads to (3.4) as well.

1. Markov’s inequality and Chebyshev’s inequality.
Given δ , the distribution achieving
(a) (Markov’s inequality.) For any non-negative random variable X and any t > 0 ,
EX
show that Pr{X ≥ δ} = ,
EX δ
Pr {X ≥ t} ≤ . (3.1)
t
is )
Exhibit a random variable that achieves this inequality with equality. δ with probability µδ
X=
(b) (Chebyshev’s inequality.) Let Y be a random variable with mean µ and variance 0 with probability 1 − µδ ,
σ 2 . By letting X = (Y − µ)2 , show that for any ǫ > 0 ,
where µ ≤ δ .
σ2 (b) Letting X = (Y − µ)2 in Markov’s inequality,
Pr {|Y − µ| > ǫ} ≤ . (3.2)
ǫ2
Pr{(Y − µ)2 > ǫ2 } ≤ Pr{(Y − µ)2 ≥ ǫ2 }
(c) (The weak law of large numbers.) Let Z 1 , Z2 , . . . , Zn be a sequence of i.i.d. random
$ E(Y − µ)2
variables with mean µ and variance σ 2 . Let Z n = n1 ni=1 Zi be the sample mean. ≤
Show that ǫ2
@3 3 A σ2 σ2
3 3
Pr 3Z n − µ3 > ǫ ≤ 2 . (3.3) = 2
,
nǫ ǫ
@3 3 A
3 3
Thus Pr 3Z n − µ3 > ǫ → 0 as n → ∞ . This is known as the weak law of large and noticing that Pr{(Y − µ)2 > ǫ2 } = Pr{|Y − µ| > ǫ} , we get,
numbers.
σ2
Solution: Markov’s inequality and Chebyshev’s inequality. Pr{|Y − µ| > ǫ} ≤ .
ǫ2

(a) If X has distribution F (x) , (c) Letting Y in Chebyshev’s inequality from part (b) equal Z¯n , and noticing that
2
2 ∞ E Z¯n = µ and Var(Z¯n ) = σn (ie. Z¯n is the sum of n iid r.v.’s, Zni , each with
EX = xdF 2
variance σn2 ), we have,
0
2 δ 2 ∞ σ2
= xdF + xdF Pr{|Z¯n − µ| > ǫ} ≤ 2 .
0 δ nǫ
49

The Asymptotic Equipartition Property 51 52 The Asymptotic Equipartition Property

2. AEP and mutual information. Let (Xi , Yi ) be i.i.d. ∼ p(x, y) . We form the log 4. AEP
$
likelihood ratio of the hypothesis that X and Y are independent vs. the hypothesis Let Xi be iid ∼ p(x), x ∈ {1, 2, . . . , m} . Let µ = EX, and H = − p(x) log p(x). Let
$
that X and Y are dependent. What is the limit of An = {xn ∈ X n : | − n1 log p(xn ) − H| ≤ ǫ} . Let B n = {xn ∈ X n : | n1 ni=1 Xi − µ| ≤ ǫ} .
1 p(X n )p(Y n ) (a) Does Pr{X n ∈ An } −→ 1 ?
log ?
n p(X n , Y n )
(b) Does Pr{X n ∈ An ∩ B n } −→ 1 ?
(c) Show |An ∩ B n | ≤ 2n(H+ǫ) , for all n .
(d) Show |An ∩ B n | ≥ ( 21 )2n(H−ǫ) , for n sufficiently large.

Solution:
Solution:
n
1 p(X n )p(Y n ) 1 B p(Xi )p(Yi ) (a) Yes, by the AEP for discrete random variables the probability X n is typical goes
log = log to 1.
n p(X n , Y n ) n i=1
p(Xi , Yi )
1! n
p(Xi )p(Yi ) (b) Yes, by the Strong Law of Large Numbers P r(X n ∈ B n ) → 1 . So there exists
= log ǫ > 0 and N1 such that P r(X n ∈ An ) > 1 − 2ǫ for all n > N1 , and there exists
n i=i p(Xi , Yi )
N2 such that P r(X n ∈ B n ) > 1 − 2ǫ for all n > N2 . So for all n > max(N1 , N2 ) :
p(Xi )p(Yi )
→ E(log )
p(Xi , Yi ) P r(X n ∈ An ∩ B n ) = P r(X n ∈ An ) + P r(X n ∈ B n ) − P r(X n ∈ An ∪ B n )
= −I(X; Y ) ǫ ǫ
> 1− +1− −1
2 2
= 1−ǫ
n n
Thus, p(X )p(Y )
p(X n ,Y n ) → 2
−nI(X;Y ) , which will converge to 1 if X and Y are indeed

independent. So for any ǫ > 0 there exists N = max(N1 , N2 ) such that P r(X n ∈ An ∩ B n ) >
1 − ǫ for all n > N , therefore P r(X n ∈ An ∩ B n ) → 1 .
3. Piece of cake $
(c) By the law of total probability xn ∈An ∩B n p(xn ) ≤ 1 . Also, for xn ∈ An , from
A cake is sliced roughly in half, the largest piece being chosen each time, the other
Theorem 3.1.2 in the text, p(xn ) ≥ 2−n(H+ǫ) . Combining these two equations gives
pieces discarded. We will assume that a random cut creates pieces of proportions: $ $
) 1 ≥ xn ∈An ∩B n p(xn ) ≥ xn ∈An ∩B n 2−n(H+ǫ) = |An ∩ B n |2−n(H+ǫ) . Multiplying
( 32 , 13 ) w.p. 3
4 through by 2n(H+ǫ) gives the result |An ∩ B n | ≤ 2n(H+ǫ) .
P =
( 25 , 35 ) w.p. 1
4 (d) Since from (b) P r{X n ∈ An ∩ B n } → 1 , there exists N such that P r{X n ∈
An ∩ B n } ≥ 21 for all n > N . From Theorem 3.1.2 in the text, for x n ∈ An ,
Thus, for example, the first cut (and choice of largest piece) may result in a piece of $
p(xn ) ≤ 2−n(H−ǫ) . So combining these two gives 21 ≤ n
xn ∈An ∩B n p(x ) ≤
size 35 . Cutting and choosing from this piece might reduce it to size ( 35 )( 23 ) at time 2, $ −n(H−ǫ) = |An ∩ B n |2−n(H−ǫ) . Multiplying through by 2n(H−ǫ) gives
and so on. xn ∈An ∩B n 2
the result |An ∩ B n | ≥ ( 21 )2n(H−ǫ) for n sufficiently large.

How large, to first order in the exponent, is the piece of cake after n cuts? 5. Sets defined by probabilities.
Solution: Let Ci be the fraction of the piece of cake that is cut at the i th cut, and let Let X1 , X2 , . . . be an i.i.d. sequence of discrete random variables with entropy H(X).
C
Tn be the fraction of cake left after n cuts. Then we have T n = C1 C2 . . . Cn = ni=1 Ci . Let
Hence, as in Question 2 of Homework Set #3, Cn (t) = {xn ∈ X n : p(xn ) ≥ 2−nt }

1 1! n denote the subset of n -sequences with probabilities ≥ 2 −nt .

lim log Tn = lim log Ci
n n i=1
(a) Show |Cn (t)| ≤ 2nt .
= E[log C1 ]
(b) For what values of t does P ({X n ∈ Cn (t)}) → 1?
3 2 1 3
= log + log .
4 3 4 5 Solution:
The Asymptotic Equipartition Property 53 54 The Asymptotic Equipartition Property

(a) Since the total probability of all sequences is less than 1, |C n (t)| minxn ∈Cn (t) p(xn ) ≤ Thus the probability that the sequence that is generated cannot be encoded is
1 , and hence |Cn (t)| ≤ 2nt . 1 − 0.99833 = 0.00167 .
(b) Since − n1 log p(xn ) → H , if t < H , the probability that p(x n ) > 2−nt goes to 0, (c) In the case of a random variable S n that is the sum of n i.i.d. random variables
and if t > H , the probability goes to 1. X1 , X2 , . . . , Xn , Chebyshev’s inequality states that
6. An AEP-like limit. Let X1 , X2 , . . . be i.i.d. drawn according to probability mass nσ 2
function p(x). Find Pr(|Sn − nµ| ≥ ǫ) ≤ ,
1
ǫ2
lim [p(X1 , X2 , . . . , Xn )] n .
n→∞ where µ and σ 2 are the mean and variance of Xi . (Therefore nµ and nσ 2
Solution: An AEP-like limit. X1 , X2 , . . . , i.i.d. ∼ p(x) . Hence log(Xi ) are also i.i.d. are the mean and variance of Sn .) In this problem, n = 100 , µ = 0.005 , and
and σ 2 = (0.005)(0.995) . Note that S100 ≥ 4 if and only if |S100 − 100(0.005)| ≥ 3.5 ,
1
so we should choose ǫ = 3.5 . Then
1
lim(p(X1 , X2 , . . . , Xn )) n = lim 2log(p(X1 ,X2 ,...,Xn )) n 100(0.005)(0.995)
$
1 Pr(S100 ≥ 4) ≤ ≈ 0.04061 .
= 2lim n log p(Xi )
a.e. (3.5)2
= 2E(log(p(X))) a.e.
This bound is much larger than the actual probability 0.00167.
= 2−H(X) a.e.
8. Products. Let 
by the strong law of large numbers (assuming of course that H(X) exists). 1
 1,
 2
1
X= 2, 4
7. The AEP and source coding. A discrete memoryless source emits a sequence of 
 3, 1
statistically independent binary digits with probabilities p(1) = 0.005 and p(0) = 4

0.995 . The digits are taken 100 at a time and a binary codeword is provided for every Let X1 , X2 , . . . be drawn i.i.d. according to this distribution. Find the limiting behavior
sequence of 100 digits containing three or fewer ones. of the product
1
(X1 X2 · · · Xn ) n .
(a) Assuming that all codewords are the same length, find the minimum length re-
quired to provide codewords for all sequences with three or fewer ones. Solution: Products. Let 1
(b) Calculate the probability of observing a source sequence for which no codeword Pn = (X1 X2 . . . Xn ) n . (3.5)
has been assigned.
Then
n
(c) Use Chebyshev’s inequality to bound the probability of observing a source sequence 1!
log Pn = log Xi → E log X, (3.6)
for which no codeword has been assigned. Compare this bound with the actual n i=1
probability computed in part (b).
with probability 1, by the strong law of large numbers. Thus P n → 2E log X
with prob.
Solution: The AEP and source coding. 1. We can easily calculate E log X = 21 log 1 + 14 log 2 + 14 log 3 = 14 log 6 , and therefore
1
log 6
(a) The number of 100-bit binary sequences with three or fewer ones is Pn → 2 4 = 1.565 .
. / . / . / . /
100 100 100 100 9. AEP. Let X1 , X2 , . . . be independent identically distributed random variables drawn
+ + + = 1 + 100 + 4950 + 161700 = 166751 . according to the probability mass function p(x), x ∈ {1, 2, . . . , m} . Thus p(x 1 , x2 , . . . , xn ) =
0 1 2 3 Cn
i=1 p(xi ) . We know that − n1 log p(X1 , X2 , . . . , Xn ) → H(X) in probability. Let
C
The required codeword length is ⌈log 2 166751⌉ = 18 . (Note that H(0.005) = q(x1 , x2 , . . . , xn ) = ni=1 q(xi ), where q is another probability mass function on {1, 2, . . . , m} .
0.0454 , so 18 is quite a bit larger than the 4.5 bits of entropy.)
(a) Evaluate lim − n1 log q(X1 , X2 , . . . , Xn ) , where X1 , X2 , . . . are i.i.d. ∼ p(x) .
(b) The probability that a 100-bit sequence has three or fewer ones is
q(X1 ,...,Xn )
. / (b) Now evaluate the limit of the log likelihood ratio n1 log p(X1 ,...,Xn )
when X1 , X2 , . . .
3
! 100 are i.i.d. ∼ p(x) . Thus the odds favoring q are exponentially small when p is
(0.005)i (0.995)100−i = 0.60577 + 0.30441 + 0.7572 + 0.01243 = 0.99833
i=0
i true.

The Asymptotic Equipartition Property 55 56 The Asymptotic Equipartition Property

Solution: (AEP). Thus the “effective” edge length of this solid is e −1 . Note that since the Xi ’s are
C
independent, E(Vn ) = E(Xi ) = ( 12 )n . Also 12 is the arithmetic mean of the random
(a) Since the X1 , X2 , . . . , Xn are i.i.d., so are q(X1 ), q(X2 ), . . . , q(Xn ) , and hence we variable, and 1e is the geometric mean.
can apply the strong law of large numbers to obtain
11. Proof of Theorem 3.3.1. This problem shows that the size of the smallest “probable”
1 1! (n)
lim − log q(X1 , X2 , . . . , Xn ) = lim − log q(Xi ) (3.7) set is about 2nH . Let X1 , X2 , . . . , Xn be i.i.d. ∼ p(x) . Let Bδ ⊂ X n such that
n n (n)
= −E(log q(X)) w.p. 1 (3.8) Pr(Bδ ) > 1 − δ . Fix ǫ < 21 .
!
= − p(x) log q(x) (3.9) (a) Given any two sets A , B such that Pr(A) > 1 − ǫ 1 and Pr(B) > 1 − ǫ2 , show
(n) (n)
p(x)
! ! that Pr(A ∩ B) > 1 − ǫ1 − ǫ2 . Hence Pr(Aǫ ∩ Bδ ) ≥ 1 − ǫ − δ.
= −
p(x) log p(x) log p(x) (3.10)
q(x) (b) Justify the steps in the chain of inequalities
= D(p||q) + H(p). (3.11) (n)
1 − ǫ − δ ≤ Pr(A(n)
ǫ ∩ Bδ ) (3.17)
!
(b) Again, by the strong law of large numbers, = p(xn ) (3.18)
(n) (n)
1 q(X1 , X2 , . . . , Xn ) 1! q(Xi ) Aǫ ∩Bδ
lim − log = lim − log (3.12) !
n p(X1 , X2 , . . . , Xn ) n p(Xi ) ≤ 2−n(H−ǫ) (3.19)
q(X) (n)
Aǫ ∩Bδ
(n)
= −E(log ) w.p. 1 (3.13)
p(X) (n)
= |A(n)
ǫ ∩ Bδ |2
−n(H−ǫ)
(3.20)
! q(x) (n)
= − p(x) log (3.14) ≤ |Bδ |2−n(H−ǫ) . (3.21)
p(x)
! p(x) (c) Complete the proof of the theorem.
= p(x) log (3.15)
q(x)
Solution: Proof of Theorem 3.3.1.
= D(p||q). (3.16)
(a) Let Ac denote the complement of A . Then
10. Random box size. An n -dimensional rectangular box with sides X 1 , X2 , X3 , . . . , Xn
C
is to be constructed. The volume is Vn = ni=1 Xi . The edge length l of a n -cube P (Ac ∪ B c ) ≤ P (Ac ) + P (B c ). (3.22)
1/n
with the same volume as the random box is l = V n . Let X1 , X2 , . . . be i.i.d. uniform Since P (A) ≥ 1 − ǫ1 , P (Ac ) ≤ ǫ1 . Similarly, P (B c ) ≤ ǫ2 . Hence
1/n
random variables over the unit interval [0, 1]. Find lim n→∞ Vn , and compare to P (A ∩ B) = 1 − P (Ac ∪ B c ) (3.23)
1
(EVn ) n . Clearly the expected edge length does not capture the idea of the volume
≥ 1 − P (Ac ) − P (B c ) (3.24)
of the box. The geometric mean, rather than the arithmetic mean, characterizes the
behavior of products. ≥ 1 − ǫ 1 − ǫ2 . (3.25)
C
Solution: Random box size. The volume V n = ni=1 Xi is a random variable, since (b) To complete the proof, we have the following chain of inequalities
the Xi are random variables uniformly distributed on [0, 1] . V n tends to 0 as n → ∞ . (a)
(n)
However 1−ǫ−δ ≤ Pr(A(n)
ǫ ∩ Bδ ) (3.26)
1 1 1! (b) !
loge Vnn = loge Vn = loge Xi → E(loge (X)) a.e. = p(xn ) (3.27)
n n
(n) (n)
by the Strong Law of Large Numbers, since X i and loge (Xi ) are i.i.d. and E(log e (X)) < Aǫ ∩Bδ

∞ . Now (c) !
2 1 ≤ 2−n(H−ǫ) (3.28)
E(loge (Xi )) = loge (x) dx = −1 (n) (n)
0 Aǫ ∩Bδ

Hence, since ex is a continuous function, (d) (n)

= |A(n)
ǫ ∩ Bδ |2
−n(H−ǫ)
(3.29)
1 1 1 1 (e)
lim Vnn = elimn→∞ n loge Vn = < . ≤
(n)
|Bδ |2−n(H−ǫ) . (3.30)
n→∞ e 2
The Asymptotic Equipartition Property 57 58 The Asymptotic Equipartition Property

where (a) follows from the previous part, (b) follows by definition of probability of (b) The trick to this part is similar to part a) and involves rewriting p̂ n in terms of
a set, (c) follows from the fact that the probability of elements of the typical set are p̂n−1 . We see that,
(n) (n)
bounded by 2−n(H−ǫ) , (d) from the definition of |Aǫ ∩ Bδ | as the cardinality
(n) (n) (n) (n) (n)
of the set Aǫ ∩ Bδ , and (e) from the fact that Aǫ ∩ Bδ ⊆ Bδ . 1 n−1
! I(Xn = x)
p̂n = I(Xi = x) +
n i=0 n
12. Monotonic convergence of the empirical distribution. Let p̂ n denote the empir-
ical probability mass function corresponding to X 1 , X2 , . . . , Xn i.i.d. ∼ p(x), x ∈ X .
Specifically, or in general,
n
1! 1! I(Xj = x)
p̂n (x) = I(Xi = x) p̂n = I(Xi = x) + ,
n i=1 n i+=j n
is the proportion of times that Xi = x in the first n samples, where I is the indicator
function. where j ranges from 1 to n .
(a) Show for X binary that Summing over j we get,

ED(p̂2n 5 p) ≤ ED(p̂n 5 p). n

n−1!
np̂n = p̂j + p̂n ,
Thus the expected relative entropy “distance” from the empirical distribution to n j=1 n−1
the true distribution decreases with sample size.
Hint: Write p̂2n = 21 p̂n + 21 p̂'n and use the convexity of D . or,
(b) Show for an arbitrary discrete X that 1! n
p̂n = p̂j
ED(p̂n 5 p) ≤ ED(p̂n−1 5 p). n j=1 n−1

Hint: Write p̂n as the average of n empirical mass functions with each of the n
where,
samples deleted in turn.
n
! 1 !
Solution: Monotonic convergence of the empirical distribution. p̂jn−1 = I(Xi = x).
j=1
n − 1 i+=j
(a) Note that,

1 !2n Again using the convexity of D(p||q) and the fact that the D(p̂ jn−1 ||p) are identi-
p̂2n (x) = I(Xi = x) cally distributed for all j and hence have the same expected value, we obtain the
2n i=1
final result.
n 2n
11! 11 !
= I(Xi = x) + I(Xi = x)
2 n i=1 2 n i=n+1 (n)
13. Calculation of typical set To clarify the notion of a typical set A ǫ and the smallest
1 1 (n)
= p̂n (x) + p̂'n (x). set of high probability Bδ , we will calculate the set for a simple example. Consider a
2 2 sequence of i.i.d. binary random variables, X 1 , X2 , . . . , Xn , where the probability that
Using convexity of D(p||q) we have that, Xi = 1 is 0.6 (and therefore the probability that X i = 0 is 0.4).
1 1 1 1
D(p̂2n ||p) = D( p̂n + p̂'n || p + p)
2 2 2 2 (a) Calculate H(X) .
1 1
≤ D(p̂n ||p) + D(p̂'n ||p). (n)
2 2 (b) With n = 25 and ǫ = 0.1 , which sequences fall in the typical set A ǫ ? What
Taking expectations and using the fact the X i ’s are identically distributed we get, is the probability of the typical set? How many elements are there in the typical
set? (This involves computation of a table of probabilities for sequences with k
ED(p̂2n ||p) ≤ ED(p̂n ||p). 1’s, 0 ≤ k ≤ 25 , and finding those sequences that are in the typical set.)

The Asymptotic Equipartition Property 59 60 The Asymptotic Equipartition Property

,n- ,n -
k k k pk (1 − p)n−k − n1 log p(xn ) The number of elements in the typical set can be found using the third column.
0 1 0.000000 1.321928 . / . / . /
19 19 10
1 25 0.000000 1.298530 ! n ! n ! n
|A(n)
ǫ |= = − = 33486026 − 7119516 = 26366510.
2 300 0.000000 1.275131 k=11
k k=0
k k=0
k
3 2300 0.000001 1.251733 (3.31)
4 12650 0.000007 1.228334 (n)
Note that the upper and lower bounds for the size of the A ǫ can be calculated
5 53130 0.000054 1.204936 as 2 n(H+ǫ) = 2 25(0.97095+0.1) = 226.77 8
= 1.147365 × 10 , and (1 − ǫ)2 n(H−ǫ) =
6 177100 0.000227 1.181537 0.9 × 225(0.97095−0.1) = 0.9 × 221.9875 = 3742308 . Both bounds are very loose!
7 480700 0.001205 1.158139 (n)
(c) To find the smallest set Bδ of probability 0.9, we can imagine that we are filling
8 1081575 0.003121 1.134740
a bag with pieces such that we want to reach a certain weight with the minimum
9 2042975 0.013169 1.111342
number of pieces. To minimize the number of pieces that we use, we should use
10 3268760 0.021222 1.087943 the largest possible pieces. In this case, it corresponds to using the sequences with
11 4457400 0.077801 1.064545 the highest probability.
12 5200300 0.075967 1.041146
Thus we keep putting the high probability sequences into this set until we reach
13 5200300 0.267718 1.017748
a total probability of 0.9. Looking at the fourth column of the table, it is clear
14 4457400 0.146507 0.994349
that the probability of a sequence increases monotonically with k . Thus the set
15 3268760 0.575383 0.970951 consists of sequences of k = 25, 24, . . . , until we have a total probability 0.9.
16 2042975 0.151086 0.947552 (n)
Using the cumulative probability column, it follows that the set B δ consist
17 1081575 0.846448 0.924154
of sequences with k ≥ 13 and some sequences with k = 12 . The sequences with
18 480700 0.079986 0.900755 (n)
19 177100 0.970638 0.877357 k ≥ 13 provide a total probability of 1−0.153768 = 0.846232 to the set B δ . The
remaining probability of 0.9 − 0.846232 = 0.053768 should come from sequences
20 53130 0.019891 0.853958
with k = 12 . The number of such sequences needed to fill this probability is at
21 12650 0.997633 0.830560
least 0.053768/p(xn ) = 0.053768/1.460813×10−8 = 3680690.1 , which we round up
22 2300 0.001937 0.807161
to 3680691. Thus the smallest set with probability 0.9 has 33554432 − 16777216 +
23 300 0.999950 0.783763 (n)
3680691 = 20457907 sequences. Note that the set B δ is not uniquely defined
24 25 0.000047 0.760364
- it could include any 3680691 sequences with k = 12 . However, the size of the
25 1 0.000003 0.736966
smallest set is well defined.
(c) How many elements are there in the smallest set that has probability 0.9? (n) (n)
(d) The intersection of the sets Aǫ and Bδ in parts (b) and (c) consists of all
(d) How many elements are there in the intersection of the sets in part (b) and (c)? sequences with k between 13 and 19, and 3680691 sequences with k = 12 . The
What is the probability of this intersection? probability of this intersection = 0.970638 − 0.153768 + 0.053768 = 0.870638 , and
the size of this intersection = 33486026 − 16777216 + 3680691 = 20389501 .
Solution:

(a) H(X) = −0.6 log 0.6 − 0.4 log 0.4 = 0.97095 bits.
(n)
(b) By definition, Aǫ for ǫ = 0.1 is the set of sequences such that − n1 log p(xn ) lies
in the range (H(X)−ǫ, H(X)+ǫ) , i.e., in the range (0.87095, 1.07095). Examining
the last column of the table, it is easy to see that the typical set is the set of all
sequences with the number of ones lying between 11 and 19.
The probability of the typical set can be calculated from cumulative probability
column. The probability that the number of 1’s lies between 11 and 19 is equal to
F (19) − F (10) = 0.970638 − 0.034392 = 0.936246 . Note that this is greater than
1 − ǫ , i.e., the n is large enough for the probability of the typical set to be greater
than 1 − ǫ .
62 Entropy Rates of a Stochastic Process
m
= 1 log (4.5)
m
= 0, (4.6)

where the inequality follows from the log sum inequality.

1
(b) If the matrix is doubly stochastic, the substituting µ i = m , we can easily check
Chapter 4 that it satisfies µ = µP .
(c) If the uniform is a stationary distribution, then

1 ! 1 !
= µi = µj Pji = Pji , (4.7)
Entropy Rates of a Stochastic m j
m j

Process
$
or j Pji = 1 or that the matrix is doubly stochastic.

2. Time’s arrow. Let {Xi }∞

i=−∞ be a stationary stochastic process. Prove that

H(X0 |X−1 , X−2 , . . . , X−n ) = H(X0 |X1 , X2 , . . . , Xn ).

1. Doubly stochastic matrices. An n × n matrix P = [P ij ] is said to be doubly
$ $
stochastic if Pij ≥ 0 and j Pij = 1 for all i and i Pij = 1 for all j . An n × n In other words, the present has a conditional entropy given the past equal to the
matrix P is said to be a permutation matrix if it is doubly stochastic and there is conditional entropy given the future.
precisely one Pij = 1 in each row and each column.
This is true even though it is quite easy to concoct stationary random processes for
It can be shown that every doubly stochastic matrix can be written as the convex which the flow into the future looks quite different from the flow into the past. That is
combination of permutation matrices. to say, one can determine the direction of time by looking at a sample function of the
(a) Let at = (a1 , a2 , . . . , an ) , ai ≥ 0 ,
$
ai = 1 , be a probability vector. Let b = aP , process. Nonetheless, given the present state, the conditional uncertainty of the next
where P is doubly stochastic. Show that b is a probability vector and that symbol in the future is equal to the conditional uncertainty of the previous symbol in
H(b1 , b2 , . . . , bn ) ≥ H(a1 , a2 , . . . , an ) . Thus stochastic mixing increases entropy. the past.
(b) Show that a stationary distribution µ for a doubly stochastic matrix P is the Solution: Time’s arrow. By the chain rule for entropy,
uniform distribution.
H(X0 |X−1 , . . . , X−n ) = H(X0 , X−1 , . . . , X−n ) − H(X−1 , . . . , X−n ) (4.8)
(c) Conversely, prove that if the uniform distribution is a stationary distribution for
a Markov transition matrix P , then P is doubly stochastic. = H(X0 , X1 , X2 , . . . , Xn ) − H(X1 , X2 , . . . , Xn ) (4.9)
= H(X0 |X1 , X2 , . . . , Xn ), (4.10)
Solution: Doubly Stochastic Matrices.
where (4.9) follows from stationarity.
(a)
! ! 3. Shuffles increase entropy. Argue that for any distribution on shuffles T and any
H(b) − H(a) = − bj log bj + ai log ai (4.1)
j i distribution on card positions X that
!! ! !
= ai Pij log( ak Pkj ) + ai log ai (4.2)
H(T X) ≥ H(T X|T ) (4.11)
j i k i
!! ai = H(T −1 T X|T ) (4.12)
= ai Pij log $ (4.3)
i j k ak Pkj = H(X|T ) (4.13)
  $ = H(X), (4.14)
i,j ai
!
≥  ai Pij  log $ (4.4)
i,j i,j bj if X and T are independent.
61

Entropy Rates of a Stochastic Process 63 64 Entropy Rates of a Stochastic Process

Solution: Shuffles increase entropy. At time k , choose one of the k − 1 terminal nodes according to a uniform distribution
and expand it. Continue until n terminal nodes have been generated. Thus a sequence
H(T X) ≥ H(T X|T ) (4.15) leading to a five node tree might look like this:
= H(T −1 T X|T ) (4.16)
= H(X|T ) (4.17)
"! "! "! "!
= H(X). (4.18) " ! " ! " ! " !
" ! " ! " ! " !
The inequality follows from the fact that conditioning reduces entropy and the first "! "! "!
"
" !! "
" !! "
" !!
equality follows from the fact that given T , we can reverse the shuffle.
"! "!
" ! " !
4. Second law of thermodynamics. Let X1 , X2 , X3 . . . be a stationary first-order " ! " !
Markov chain. In Section 4.4, it was shown that H(X n | X1 ) ≥ H(Xn−1 | X1 ) for "!
n = 2, 3 . . . . Thus conditional uncertainty about the future grows with time. This is "" !!
true although the unconditional uncertainty H(X n ) remains constant. However, show
by example that H(Xn |X1 = x1 ) does not necessarily grow with n for every x 1 . Surprisingly, the following method of generating random trees yields the same probabil-
ity distribution on trees with n terminal nodes. First choose an integer N 1 uniformly
Solution: Second law of thermodynamics.
distributed on {1, 2, . . . , n − 1} . We then have the picture.
H(Xn |X1 ) ≤ H(Xn |X1 , X2 ) (Conditioning reduces entropy) (4.19)
= H(Xn |X2 ) (by Markovity) (4.20) "!
" !
= H(Xn−1 |X1 ) (by stationarity) (4.21) " !
N1 n − N1
Alternatively, by an application of the data processing inequality to the Markov chain Then choose an integer N2 uniformly distributed over {1, 2, . . . , N 1 − 1} , and indepen-
X1 → Xn−1 → Xn , we have dently choose another integer N3 uniformly over {1, 2, . . . , (n − N1 ) − 1} . The picture
I(X1 ; Xn−1 ) ≥ I(X1 ; Xn ). (4.22) is now:

#$
## $$
Expanding the mutual informations in terms of entropies, we have
## $$
H(Xn−1 ) − H(Xn−1 |X1 ) ≥ H(Xn ) − H(Xn |X1 ). (4.23) "! "!
" ! " !
By stationarity, H(Xn−1 ) = H(Xn ) and hence we have " ! " !
N2 N1 − N2 N3 n − N 1 − N3
H(Xn−1 |X1 ) ≤ H(Xn |X1 ). (4.24)
Continue the process until no further subdivision can be made. (The equivalence of
5. Entropy of a random tree. Consider the following method of generating a random these two tree generation schemes follows, for example, from Polya’s urn model.)
tree with n nodes. First expand the root node: Now let Tn denote a random n -node tree generated as described. The probability
distribution on such trees seems difficult to describe, but we can find the entropy of
"! this distribution in recursive form.
" !
" !
First some examples. For n = 2 , we have only one tree. Thus H(T 2 ) = 0 . For n = 3 ,
Then expand one of the two terminal nodes at random: we have two equally probable trees:

"! "! "! "!

" ! " ! " ! " !
" ! " ! " ! " !
"! "! "! "!
"" !! "" !! "" !! "" !!
Entropy Rates of a Stochastic Process 65 66 Entropy Rates of a Stochastic Process

Thus H(T3 ) = log 2 . For n = 4 , we have five possible trees, with probabilities 1/3, k) = H(Tk ) + H(Tn−k ) , so
1/6, 1/6, 1/6, 1/6.
Now for the recurrence relation. Let N 1 (Tn ) denote the number of terminal nodes of 1 n−1
!
H(Tn |N1 ) = (H(Tk ) + H(Tn−k )) . (4.35)
Tn in the right half of the tree. Justify each of the steps in the following: n − 1 k=1

(a)
H(Tn ) = H(N1 , Tn ) (4.25) (e) By a simple change of variables,
(b)
= H(N1 ) + H(Tn |N1 ) (4.26) n−1
! n−1
!
(c) H(Tn−k ) = H(Tk ). (4.36)
= log(n − 1) + H(Tn |N1 ) (4.27) k=1 k=1
(d) 1 n−1
!
= log(n − 1) + [H(Tk ) + H(Tn−k )] (4.28) (f) Hence if we let Hn = H(Tn ) ,
n − 1 k=1
(e) 2 n−1
! n−1
!
= log(n − 1) + H(Tk ). (4.29) (n − 1)Hn = (n − 1) log(n − 1) + 2 Hk (4.37)
n − 1 k=1
k=1
2 n−1
! n−2
!
= log(n − 1) + Hk . (4.30) (n − 2)Hn−1 = (n − 2) log(n − 2) + 2 Hk (4.38)
n − 1 k=1
k=1
(4.39)
(f) Use this to show that

(n − 1)Hn = nHn−1 + (n − 1) log(n − 1) − (n − 2) log(n − 2), (4.31) Subtracting the second equation from the first, we get

or (n − 1)Hn − (n − 2)Hn−1 = (n − 1) log(n − 1) − (n − 2) log(n − 2) + 2H n−1 (4.40)

Hn Hn−1
= + cn , (4.32)
n n−1 or
$
for appropriately defined cn . Since cn = c < ∞ , you have proved that n1 H(Tn ) Hn Hn−1 log(n − 1) (n − 2) log(n − 2)
converges to a constant. Thus the expected number of bits necessary to describe the = + − (4.41)
n n−1 n n(n − 1)
random tree Tn grows linearly with n . Hn−1
= + Cn (4.42)
Solution: Entropy of a random tree. n−1

(a) H(Tn , N1 ) = H(Tn ) + H(N1 |Tn ) = H(Tn ) + 0 by the chain rule for entropies and where
since N1 is a function of Tn . log(n − 1) (n − 2) log(n − 2)
(b) H(Tn , N1 ) = H(N1 ) + H(Tn |N1 ) by the chain rule for entropies. Cn = − (4.43)
n n(n − 1)
(c) H(N1 ) = log(n − 1) since N1 is uniform on {1, 2, . . . , n − 1} . log(n − 1) log(n − 2) 2 log(n − 2)
= − + (4.44)
(d) n (n − 1) n(n − 1)

n−1
! Substituting the equation for H n−1 in the equation for Hn and proceeding recursively,
H(Tn |N1 ) = P (N1 = k)H(Tn |N1 = k) (4.33) we obtain a telescoping sum
k=1
n
1 n−1
! Hn ! H2
= H(Tn |N1 = k) (4.34) = Cj + (4.45)
n − 1 k=1 n j=3
2
n
by the definition of conditional entropy. Since conditional on N 1 , the left subtree
! 2 log(j − 2) 1
= + log(n − 1). (4.46)
and the right subtree are chosen independently, H(T n |N1 = k) = H(Tk , Tn−k |N1 = j=3
j(j − 1) n

Entropy Rates of a Stochastic Process 67 68 Entropy Rates of a Stochastic Process

1
Since limn→∞ n log(n − 1) = 0 which further implies, by averaging both sides, that,
$n−1 i−1 )
∞ i=1 H(Xi |X
Hn ! 2 H(Xn |X n−1 ) ≤ (4.56)
lim = log(j − 2) (4.47) n−1
n→∞ n
j=3
j(j − 1) H(X1 , X2 , . . . , Xn−1 )
∞ = . (4.57)
! 2 n−1
≤ log(j − 1) (4.48)
j=3
(j − 1)2 Combining (4.55) and (4.57) yields,
∞ 4 5
! 2 H(X1 , X2 , . . . , Xn ) 1 H(X1 , X2 , . . . , Xn−1 )
= log j (4.49) ≤ + H(X1 , X2 , . . . , Xn−1 )
j2 n n n−1
j=2
H(X1 , X2 , . . . , Xn−1 )
√ = . (4.58)
For sufficiently large j , log j ≤ j and hence the sum in (4.49) is dominated by the n−1
$ −3
sum j j 2 which converges. Hence the above sum converges. In fact, computer (b) By stationarity we have for all 1 ≤ i ≤ n ,
evaluation shows that
H(Xn |X n−1 ) ≤ H(Xi |X i−1 ),
∞
Hn ! 2 which implies that
lim = log(j − 2) = 1.736 bits. (4.50)
n j(j − 1) $n n−1 )
i=1 H(Xn |X
j=3
H(Xn |X n−1 ) = (4.59)
n
Thus the number of bits required to describe a random n -node tree grows linearly with $n i−1 )
i=1 H(Xi |X
n. ≤ (4.60)
n
H(X1 , X2 , . . . , Xn )
6. Monotonicity of entropy per element. For a stationary stochastic process X 1 , X2 , . . . , Xn , = . (4.61)
show that n
7. Entropy rates of Markov chains.
(a)
(a) Find the entropy rate of the two-state Markov chain with transition matrix
H(X1 , X2 , . . . , Xn ) H(X1 , X2 , . . . , Xn−1 )
≤ . (4.51) " #
n n−1 1 − p01 p01
P = .
p10 1 − p10
(b)
H(X1 , X2 , . . . , Xn ) (b) What values of p01 , p10 maximize the rate of part (a)?
≥ H(Xn |Xn−1 , . . . , X1 ). (4.52)
n (c) Find the entropy rate of the two-state Markov chain with transition matrix
" #
Solution: Monotonicity of entropy per element. 1−p p
P = .
1 0
(a) By the chain rule for entropy,
$n (d) Find the maximum value of the entropy rate of the Markov chain of part (c). We
H(X1 , X2 , . . . , Xn ) i−1 )
i=1 H(Xi |X expect that the maximizing value of p should be less than 1/2 , since the 0 state
= (4.53)
n n permits more information to be generated than the 1 state.
n−1 $n−1
H(Xn |X ) + i=1 H(Xi |X i−1 ) (e) Let N (t) be the number of allowable state sequences of length t for the Markov
= (4.54)
n chain of part (c). Find N (t) and calculate
H(Xn |X n−1 ) + H(X1 , X2 , . . . , Xn−1 ) 1
= . (4.55) H0 = lim
log N (t) .
n t t→∞

From stationarity it follows that for all 1 ≤ i ≤ n , Hint: Find a linear recurrence that expresses N (t) in terms of N (t − 1) and
N (t − 2) . Why is H0 an upper bound on the entropy rate of the Markov chain?
H(Xn |X n−1 ) ≤ H(Xi |X i−1 ), Compare H0 with the maximum entropy found in part (d).
Entropy Rates of a Stochastic Process 69 70 Entropy Rates of a Stochastic Process

Solution: Entropy rates of Markov chains. 8. Maximum entropy process. A discrete memoryless source has alphabet {1, 2}
where the symbol 1 has duration 1 and the symbol 2 has duration 2. The proba-
(a) The stationary distribution is easily calculated. (See EIT pp. 62–63.)
bilities of 1 and 2 are p1 and p2 , respectively. Find the value of p 1 that maximizes
p10 p01 the source entropy per unit time H(X)/El X . What is the maximum value H ?
µ0 = , µ0 = .
p01 + p10 p01 + p10
Solution: Maximum entropy process. The entropy per symbol of the source is
Therefore the entropy rate is
H(p1 ) = −p1 log p1 − (1 − p1 ) log(1 − p1 )
p10 H(p01 ) + p01 H(p10 )
H(X2 |X1 ) = µ0 H(p01 ) + µ1 H(p10 ) = .
p01 + p10 and the average symbol duration (or time per symbol) is
(b) The entropy rate is at most 1 bit because the process has only two states. This T (p1 ) = 1 · p1 + 2 · p2 = p1 + 2(1 − p1 ) = 2 − p1 = 1 + p2 .
rate can be achieved if (and only if) p 01 = p10 = 1/2 , in which case the process is
Therefore the source entropy per unit time is
actually i.i.d. with Pr(Xi = 0) = Pr(Xi = 1) = 1/2 .
(c) As a special case of the general two-state Markov chain, the entropy rate is H(p1 ) −p1 log p1 − (1 − p1 ) log(1 − p1 )
f (p1 ) = = .
T (p1 ) 2 − p1
H(p)
H(X2 |X1 ) = µ0 H(p) + µ1 H(1) = . Since f (0) = f (1) = 0 , the maximum value of f (p 1 ) must occur for some point p1
p+1
such that 0 < p1 < 1 and ∂f /∂p1 = 0 and
(d) By straightforward calculus,
√ we find that the maximum value of H(X) of part (c)
occurs for p = (3 − 5)/2 = 0.382 . The maximum value is ∂ H(p1 ) T (∂H/∂p1 ) − H(∂T /∂p1 )
=
.√ / ∂p1 T (p1 ) T2
5−1
H(p) = H(1 − p) = H = 0.694 bits . After some calculus, we find that the numerator of the above expression (assuming
2
√ natural logarithms) is
Note that ( 5 − 1)/2 = 0.618 is (the reciprocal of) the Golden Ratio.
T (∂H/∂p1 ) − H(∂T /∂p1 ) = ln(1 − p1 ) − 2 ln p1 ,
(e) The Markov chain of part (c) forbids consecutive ones. Consider any allowable √
sequence of symbols of length t . If the first symbol is 1, then the next symbol which is zero when 1 − p1 = p21 = p2 , that is, p1 = 21 ( 5 − 1) = 0.61803 , the reciprocal
√
must be 0; the remaining N (t − 2) symbols can form any allowable sequence. If of the golden ratio, 21 ( 5 + 1) = 1.61803 . The corresponding entropy per unit time is
the first symbol is 0, then the remaining N (t − 1) symbols can be any allowable
H(p1 ) −p1 log p1 − p21 log p21 −(1 + p21 ) log p1
sequence. So the number of allowable sequences of length t satisfies the recurrence = = = − log p1 = 0.69424 bits.
T (p1 ) 2 − p1 1 + p21
N (t) = N (t − 1) + N (t − 2) N (1) = 2, N (2) = 3
Note that this result is the same as the maximum entropy rate for the Markov chain
(The initial conditions are obtained by observing that for t = 2 only the sequence in problem 4.7(d). This is because a source in which every 1 must be followed by a 0
11 is not allowed. We could also choose N (0) = 1 as an initial condition, since is equivalent to a source in which the symbol 1 has duration 2 and the symbol 0 has
there is exactly one allowable sequence of length 0, namely, the empty sequence.) duration 1.
The sequence N (t) grows exponentially, that is, N (t) ≈ cλ t , where λ is the
9. Initial conditions. Show, for a Markov chain, that
maximum magnitude solution of the characteristic equation
H(X0 |Xn ) ≥ H(X0 |Xn−1 ).
1 = z −1 + z −2 .
√ Thus initial conditions X0 become more difficult to recover as the future X n unfolds.
Solving the characteristic equation yields λ = (1+ 5)/2 , the Golden Ratio. (The
sequence {N (t)} is the sequence of Fibonacci numbers.) Therefore Solution: Initial conditions. For a Markov chain, by the data processing theorem, we
√ have
1
H0 = lim log N (t) = log(1 + 5)/2 = 0.694 bits . I(X0 ; Xn−1 ) ≥ I(X0 ; Xn ). (4.62)
n→∞ t

Since there are only N (t) possible outcomes for X 1 , . . . , Xt , an upper bound on Therefore
H(X1 , . . . , Xt ) is log N (t) , and so the entropy rate of the Markov chain of part (c) H(X0 ) − H(X0 |Xn−1 ) ≥ H(X0 ) − H(X0 |Xn ) (4.63)
is at most H0 . In fact, we saw in part (d) that this upper bound can be achieved. or H(X0 |Xn ) increases with n .

Entropy Rates of a Stochastic Process 71 72 Entropy Rates of a Stochastic Process

10. Pairwise independence. Let X1 , X2 , . . . , Xn−1 be i.i.d. random variables taking since Xn is a function of the previous Xi ’s. The total entropy is not n , which is
$n−1
values in {0, 1} , with Pr{Xi = 1} = 12 . Let Xn = 1 if i=1 Xi is odd and Xn = 0 what would be obtained if the Xi ’s were all independent. This example illustrates
otherwise. Let n ≥ 3 . that pairwise independence does not imply complete independence.

(a) Show that Xi and Xj are independent, for i %= j , i, j ∈ {1, 2, . . . , n} . 11. Stationary processes. Let . . . , X−1 , X0 , X1 , . . . be a stationary (not necessarily
(b) Find H(Xi , Xj ) , for i %= j . Markov) stochastic process. Which of the following statements are true? Prove or
provide a counterexample.
(c) Find H(X1 , X2 , . . . , Xn ) . Is this equal to nH(X1 ) ?

Solution: (Pairwise Independence) X1 , X2 , . . . , Xn−1 are i.i.d. Bernoulli(1/2) random (a) H(Xn |X0 ) = H(X−n |X0 ) .
$k
variables. We will first prove that for any k ≤ n − 1 , the probability that i=1 Xi is (b) H(Xn |X0 ) ≥ H(Xn−1 |X0 ) .
odd is 1/2 . We will prove this by induction. Clearly this is true for k = 1 . Assume (c) H(Xn |X1 , X2 , . . . , Xn−1 , Xn+1 ) is nonincreasing in n .
$k
that it is true for k − 1 . Let Sk = i=1 Xi . Then
(d) H(Xn |X1 , . . . , Xn−1 , Xn+1 , . . . , X2n ) is non-increasing in n .
P (Sk odd) = P (Sk−1 odd)P (Xk = 0) + P (Sk−1 even)P (Xk = 1) (4.64)
11 11 Solution: Stationary processes.
= + (4.65)
22 22
1 (a) H(Xn |X0 ) = H(X−n |X0 ) .
= . (4.66) This statement is true, since
2
Hence for all k ≤ n − 1 , the probability that S k is odd is equal to the probability that H(Xn |X0 ) = H(Xn , X0 ) − H(X0 ) (4.76)
it is even. Hence,
1 H(X−n |X0 ) = H(X−n , X0 ) − H(X0 ) (4.77)
P (Xn = 1) = P (Xn = 0) = . (4.67)
2
and H(Xn , X0 ) = H(X−n , X0 ) by stationarity.
(a) It is clear that when i and j are both less than n , X i and Xj are independent.
(b) H(Xn |X0 ) ≥ H(Xn−1 |X0 ) .
The only possible problem is when j = n . Taking i = 1 without loss of generality,
This statement is not true in general, though it is true for first order Markov chains.
n−1
! A simple counterexample is a periodic process with period n . Let X 0 , X1 , X2 , . . . , Xn−1
P (X1 = 1, Xn = 1) = P (X1 = 1, Xi even) (4.68)
i=2
be i.i.d. uniformly distributed binary random variables and let X k = Xk−n for
n−1
! k ≥ n . In this case, H(Xn |X0 ) = 0 and H(Xn−1 |X0 ) = 1 , contradicting the
= P (X1 = 1)P ( Xi even) (4.69) statement H(Xn |X0 ) ≥ H(Xn−1 |X0 ) .
i=2
(c) H(Xn |X1n−1 , Xn+1 ) is non-increasing in n .
11
= (4.70) This statement is true, since by stationarity H(X n |X1n−1 , Xn+1 ) = H(Xn+1 |X2n , Xn+2 ) ≥
22
= P (X1 = 1)P (Xn = 1) (4.71) H(Xn+1 |X1n , Xn+2 ) where the inequality follows from the fact that conditioning
reduces entropy.
and similarly for other possible values of the pair (X 1 , Xn ) . Hence X1 and Xn
are independent. 12. The entropy rate of a dog looking for a bone. A dog walks on the integers,
possibly reversing direction at each step with probability p = .1. Let X 0 = 0 . The
(b) Since Xi and Xj are independent and uniformly distributed on {0, 1} ,
first step is equally likely to be positive or negative. A typical walk might look like this:
H(Xi , Xj ) = H(Xi ) + H(Xj ) = 1 + 1 = 2 bits. (4.72)
(X0 , X1 , . . .) = (0, −1, −2, −3, −4, −3, −2, −1, 0, 1, . . .).
(c) By the chain rule and the independence of X 1 , X2 , . . . , Xn1 , we have
(a) Find H(X1 , X2 , . . . , Xn ).
H(X1 , X2 , . . . , Xn ) = H(X1 , X2 , . . . , Xn−1 ) + H(Xn |Xn−1 , . . . , X1 )(4.73)
n−1
!
(b) Find the entropy rate of this browsing dog.
= H(Xi ) + 0 (4.74) (c) What is the expected number of steps the dog takes before reversing direction?
i=1
= n − 1, (4.75) Solution: The entropy rate of a dog looking for a bone.
Entropy Rates of a Stochastic Process 73 74 Entropy Rates of a Stochastic Process

(a) By the chain rule, since H(X1 , X2 , . . . , Xn ) = H(Xn+1 , Xn+2 , . . . , X2n ) by stationarity.
n
! Thus
H(X0 , X1 , . . . , Xn ) = H(Xi |X i−1 )
i=0 1
n
lim I(X1 , X2 , . . . , Xn ; Xn+1 , Xn+2 , . . . , X2n )
! n→∞ 2n
= H(X0 ) + H(X1 |X0 ) + H(Xi |Xi−1 , Xi−2 ), 1 1
i=2
= lim 2H(X1 , X2 , . . . , Xn ) − lim H(X1 , X2 , . . . , Xn , Xn+1 , Xn+2 , . . . ,(4.80)
X2n )
n→∞ 2n n→∞ 2n
1 1
since, for i > 1 , the next position depends only on the previous two (i.e., the = lim H(X1 , X2 , . . . , Xn ) − lim H(X1 , X2 , . . . , Xn , Xn+1 , Xn+2 , . . . , X(4.81)
2n )
n→∞ n n→∞ 2n
dog’s walk is 2nd order Markov, if the dog’s position is the state). Since X 0 = 0
deterministically, H(X0 ) = 0 and since the first step is equally likely to be positive
or negative, H(X1 |X0 ) = 1 . Furthermore for i > 1 , Now limn→∞ n1 H(X1 , X2 , . . . , Xn ) = limn→∞ 2n
1
H(X1 , X2 , . . . , Xn , Xn+1 , Xn+2 , . . . , X2n )
since both converge to the entropy rate of the process, and therefore
H(Xi |Xi−1 , Xi−2 ) = H(.1, .9).
1
lim I(X1 , X2 , . . . , Xn ; Xn+1 , Xn+2 , . . . , X2n ) = 0. (4.82)
Therefore, n→∞ 2n
H(X0 , X1 , . . . , Xn ) = 1 + (n − 1)H(.1, .9).
14. Functions of a stochastic process.
(b) From a),
(a) Consider a stationary stochastic process X 1 , X2 , . . . , Xn , and let Y1 , Y2 , . . . , Yn be
H(X0 , X1 , . . . Xn ) 1 + (n − 1)H(.1, .9)
= defined by
n+1 n+1
→ H(.1, .9). Yi = φ(Xi ), i = 1, 2, . . . (4.83)

for some function φ . Prove that

(c) The dog must take at least one step to establish the direction of travel from which
it ultimately reverses. Letting S be the number of steps taken between reversals, H(Y) ≤ H(X ) (4.84)
we have
∞
! (b) What is the relationship between the entropy rates H(Z) and H(X ) if
E(S) = s(.9)s−1 (.1)
s=1
Zi = ψ(Xi , Xi+1 ), i = 1, 2, . . . (4.85)
= 10.

Starting at time 0, the expected number of steps to the first reversal is 11. for some function ψ .

13. The past has little to say about the future. For a stationary stochastic process Solution: The key point is that functions of a random variable have lower entropy.
X1 , X2 , . . . , Xn , . . . , show that Since (Y1 , Y2 , . . . , Yn ) is a function of (X1 , X2 , . . . , Xn ) (each Yi is a function of the
corresponding Xi ), we have (from Problem 2.4)
1
lim I(X1 , X2 , . . . , Xn ; Xn+1 , Xn+2 , . . . , X2n ) = 0. (4.78)
n→∞ 2n H(Y1 , Y2 , . . . , Yn ) ≤ H(X1 , X2 , . . . , Xn ) (4.86)
Thus the dependence between adjacent n -blocks of a stationary process does not grow
linearly with n . Dividing by n , and taking the limit as n → ∞ , we have
Solution: H(Y1 , Y2 , . . . , Yn ) H(X1 , X2 , . . . , Xn )
lim ≤ lim (4.87)
I(X1 , X2 , . . . , Xn ; Xn+1 , Xn+2 , . . . , X2n )
n→∞ n n→∞ n
= H(X1 , X2 , . . . , Xn ) + H(Xn+1 , Xn+2 , . . . , X2n ) − H(X1 , X2 , . . . , Xn , Xn+1 , Xn+2 , . . . , X2n ) or
= 2H(X1 , X2 , . . . , Xn ) − H(X1 , X2 , . . . , Xn , Xn+1 , Xn+2 , . . . , X2n ) (4.79) H(Y) ≤ H(X ) (4.88)

Entropy Rates of a Stochastic Process 75 76 Entropy Rates of a Stochastic Process

15. Entropy rate. Let {Xi } be a discrete stationary stochastic process with entropy rate
H(X ). Show
1
H(Xn , . . . , X1 | X0 , X−1 , . . . , X−k ) → H(X ), (4.89)
n
for k = 1, 2, . . . .
Solution: Entropy rate of a stationary process. By the Cesáro mean theorem, the
running average of the terms tends to the same limit as the limit of the terms. Hence
n
1 1!
H(X1 , X2 , . . . , Xn |X0 , X−1 , . . . , X−k ) = H(Xi |Xi−1 , Xi−2 , . . . , X−k(4.90)
)
n n i=1
→ lim H(Xn |Xn−1 , Xn−2 , . . . , X−k(4.91)
)
= H, (4.92)

the entropy rate of the process.

16. Entropy rate of constrained sequences. In magnetic recording, the mechanism of Figure 4.1: Entropy rate of constrained sequence
recording and reading the bits imposes constraints on the sequences of bits that can be
recorded. For example, to ensure proper sychronization, it is often necessary to limit
the length of runs of 0’s between two 1’s. Also to reduce intersymbol interference, it Using the eigenvalue decomposition of A for the case of distinct eigenvalues, we
may be necessary to require at least one 0 between any two 1’s. We will consider a can write A = U −1 ΛU , where Λ is the diagonal matrix of eigenvalues. Then
simple example of such a constraint. An−1 = U −1 Λn−1 U . Show that we can write

Suppose that we are required to have at least one 0 and at most two 0’s between any X(n) = λ1n−1 Y1 + λ2n−1 Y2 + λ3n−1 Y3 , (4.96)
pair of 1’s in a sequences. Thus, sequences like 101001 and 0101001 are valid sequences,
where Y1 , Y2 , Y3 do not depend on n . For large n , this sum is dominated by
but 0110010 and 0000101 are not. We wish to calculate the number of valid sequences
the largest term. Therefore argue that for i = 1, 2, 3 , we have
of length n .
1
log Xi (n) → log λ, (4.97)
(a) Show that the set of constrained sequences is the same as the set of allowed paths n
on the following state diagram: where λ is the largest (positive) eigenvalue. Thus the number of sequences of
(b) Let Xi (n) be the number of valid paths of length n ending at state i . Argue that length n grows as λn for large n . Calculate λ for the matrix A above. (The
X(n) = [X1 (n) X2 (n) X3 (n)]t satisfies the following recursion: case when the eigenvalues are not distinct can be handled in a similar manner.)
     (d) We will now take a different approach. Consider a Markov chain whose state
X1 (n) 0 1 1 X1 (n − 1) diagram is the one given in part (a), but with arbitrary transition probabilities.
    
 X2 (n)  =  1 0 0   X2 (n − 1)  , (4.93) Therefore the probability transition matrix of this Markov chain is
X3 (n) 0 1 0 X3 (n − 1)  
0 1 0
with initial conditions X(1) = [1 1 0]t .  
P =  α 0 1 − α . (4.98)
(c) Let 1 0 0
 
0 1 1 Show that the stationary distribution of this Markov chain is
 
A =  1 0 0 . (4.94) 4 5
0 1 0 1 1 1−α
µ= , , . (4.99)
3−α 3−α 3−α
Then we have by induction
(e) Maximize the entropy rate of the Markov chain over choices of α . What is the
X(n) = AX(n − 1) = A2 X(n − 2) = · · · = An−1 X(1). (4.95) maximum entropy rate of the chain?
Entropy Rates of a Stochastic Process 77 78 Entropy Rates of a Stochastic Process

(f) Compare the maximum entropy rate in part (e) with log λ in part (c). Why are where Y1 , Y2 , Y3 do not depend on n . Without loss of generality, we can assume
the two answers the same? that λ1 > λ2 > λ3 . Thus

X1 (n) = λn−1
1 Y11 + λn−1
2 Y21 + λ3n−1 Y31 (4.106)
Solution:
X2 (n) = λ1n−1 Y12 + λ2n−1 Y22 + λ3n−1 Y32 (4.107)
Entropy rate of constrained sequences.
X3 (n) = λ1n−1 Y13 + λ2n−1 Y23 + λ3n−1 Y33 (4.108)
(a) The sequences are constrained to have at least one 0 and at most two 0’s between For large n , this sum is dominated by the largest term. Thus if Y 1i > 0 , we have
two 1’s. Let the state of the system be the number of 0’s that has been seen since
the last 1. Then a sequence that ends in a 1 is in state 1, a sequence that ends in 1
log Xi (n) → log λ1 . (4.109)
10 in is state 2, and a sequence that ends in 100 is in state 3. From state 1, it is n
only possible to go to state 2, since there has to be at least one 0 before the next To be rigorous, we must also show that Y 1i > 0 for i = 1, 2, 3 . It is not difficult
1. From state 2, we can go to either state 1 or state 3. From state 3, we have to to prove that if one of the Y1i is positive, then the other two terms must also be
go to state 1, since there cannot be more than two 0’s in a row. Thus we can the positive, and therefore either
state diagram in the problem.
1
(b) Any valid sequence of length n that ends in a 1 must be formed by taking a valid log Xi (n) → log λ1 . (4.110)
n
sequence of length n − 1 that ends in a 0 and adding a 1 at the end. The number
of valid sequences of length n − 1 that end in a 0 is equal to X 2 (n − 1) + X3 (n − 1) for all i = 1, 2, 3 or they all tend to some other value.
and therefore, The general argument is difficult since it is possible that the initial conditions of
X1 (n) = X2 (n − 1) + X3 (n − 1). (4.100) the recursion do not have a component along the eigenvector that corresponds to
the maximum eigenvalue and thus Y1i = 0 and the above argument will fail. In
By similar arguments, we get the other two equations, and we have our example, we can simply compute the various quantities, and thus
      
X1 (n) 0 1 1 X1 (n − 1) 0 1 1
A =  1 0 0  = U −1 ΛU,
 
  
 X2 (n)  =  1 0
 
0   X2 (n − 1)  . (4.101) (4.111)
X3 (n) 0 1 0 X3 (n − 1) 0 1 0

where
The initial conditions are obvious, since both sequences of length 1 are valid and  
therefore X(1) = [1 1 0]T . 1.3247 0 0
 
(c) The induction step is obvious. Now using the eigenvalue decomposition of A = Λ= 0 −0.6624 + 0.5623i 0 , (4.112)
U −1 ΛU , it follows that A2 = U −1 ΛU U −1 ΛU = U −1 Λ2 U , etc. and therefore 0 0 −0.6624 − 0.5623i

and
X(n) = An−1 X(1) = U −1 Λn−1 U X(1) (4.102)  
   
λn−1 0 0 1 −0.5664 −0.7503 −0.4276
 1  
U =  0.6508 − 0.0867i −0.3823 + 0.4234i −0.6536 − 0.4087i  , (4.113)
= U −1  0 λn−1
  
2 0 U  1  (4.103)
0 0 λ3n−1 0 0.6508 + 0.0867i −0.3823i0.4234i −0.6536 + 0.4087i
       
1 0 0 1 0 0 0 1 and therefore  
= λ1n−1 U −1  0 0 0  U  1  + λ2n−1 U −1  0 1 0  U  1 
        0.9566
 
0 0 0 0 0 0 0 0 Y1 =  0.7221  , (4.114)
    0.5451
0 0 0 1
+λ3n−1 U −1  0 0
   
0 U  1  (4.104) which has all positive components. Therefore,
0 0 1 0 1
log Xi (n) → log λi = log 1.3247 = 0.4057 bits. (4.115)
= λ1n−1 Y1 + λ2n−1 Y2 + λ3n−1 Y3 , (4.105) n

Entropy Rates of a Stochastic Process 79 80 Entropy Rates of a Stochastic Process

(d) To verify the that the Markov type classes that satisfy the state constraints of part (a), at least one
4 5T
1 1 1−α of them has the same exponent as the total number of sequences that satisfy the
µ= , , . (4.116)
3−α 3−α 3−α state constraint. For this Markov type, the number of sequences in the typeclass
is the stationary distribution, we have to verify that P µ = µ . But this is straight- is 2nH , and therefore for this type class, H = log λ 1 .
forward. This result is a very curious one that connects two apparently unrelated objects -
the maximum eigenvalue of a state transition matrix, and the maximum entropy
(e) The entropy rate of the Markov chain (in nats) is
rate for a probability transition matrix with the same state diagram. We don’t
! ! 1 know a reference for a formal proof of this result.
H=− µi Pij ln Pij = (−α ln α − (1 − α) ln(1 − α)) , (4.117)
i j
3−α 17. Waiting times are insensitive to distributions. Let X 0 , X1 , X2 , . . . be drawn i.i.d.
∼ p(x), x ∈ X = {1, 2, . . . , m} and let N be the waiting time to the next occurrence
and differentiating with respect to α to find the maximum, we find that of X0 , where N = minn {Xn = X0 } .
dH 1 1
= (−α ln α − (1 − α) ln(1 − α))+ (−1 − ln α + 1 + ln(1 − α)) = 0, (a) Show that EN = m .
dα (3 − α)2 3−α
(4.118) (b) Show that E log N ≤ H(X) .
or (c) (Optional) Prove part (a) for {Xi } stationary and ergodic.
(3 − α) (ln a − ln(1 − α)) = (−α ln α − (1 − α) ln(1 − α)) (4.119)
Solution: Waiting times are insensitive to distributions. Since X 0 , X1 , X2 , . . . , Xn are
which reduces to
drawn i.i.d. ∼ p(x) , the waiting time for the next occurrence of X 0 has a geometric
3 ln α = 2 ln(1 − α), (4.120) distribution with probability of success p(x 0 ) .
i.e.,
3 2
α = α − 2α + 1, (4.121) (a) Given X0 = i , the expected time until we see it again is 1/p(i) . Therefore,
* +
which can be solved (numerically) to give α = 0.5698 and the maximum entropy
! 1
EN = E[E(N |X0 )] = p(X0 = i) = m. (4.122)
rate as 0.2812 nats = 0.4057 bits. p(i)
(f) The answers in parts (c) and (f) are the same. Why? A rigorous argument is (b) By the same argument, since given X0 = i , N has a geometric distribution with
quite involved, but the essential idea is that both answers give the asymptotics of mean p(i) and
the number of sequences of length n for the state diagram in part (a). In part (c) 1
we used a direct argument to calculate the number of sequences of length n and E(N |X0 = i) = . (4.123)
p(i)
found that asymptotically, X(n) ≈ λn1 .
Then using Jensen’s inequality, we have
If we extend the ideas of Chapter 3 (typical sequences) to the case of Markov !
chains, we can see that there are approximately 2 nH typical sequences of length E log N = p(X0 = i)E(log N |X0 = i) (4.124)
n for a Markov chain of entropy rate H . If we consider all Markov chains with !
i
state diagram given in part (a), the number of typical sequences should be less ≤ p(X0 = i) log E(N |X0 = i) (4.125)
than the total number of sequences of length n that satisfy the state constraints. i
Thus, we see that 2nH ≤ λn1 or H ≤ log λ1 . ! 1
= p(i) log (4.126)
To complete the argument, we need to show that there exists a Markov transition i
p(i)
matrix that achieves the upper bound. This can be done by two different methods. = H(X). (4.127)
One is to derive the Markov transition matrix from the eigenvalues, etc. of parts
(a)–(c). Instead, we will use an argument from the method of types. In Chapter 12, (c) The property that EN = m is essentially a combinatorial property rather than
we show that there are at most a polynomial number of types, and that therefore, a statement about expectations. We prove this for stationary ergodic sources. In
the largest type class has the same number of sequences (to the first order in essence, we will calculate the empirical average of the waiting time, and show that
the exponent) as the entire set. The same arguments can be applied to Markov this converges to m . Since the process is ergodic, the empirical average converges
types. There are only a polynomial number of Markov types and therefore of all to the expected value, and thus the expected value must be m .
Entropy Rates of a Stochastic Process 81 82 Entropy Rates of a Stochastic Process

Let X1 = a , and define a sequence of random variables N 1 , N2 , . . . , where N1 = recurrence X Y1 Y2 Probability

time for X1 , etc. It is clear that the N process is also stationary and ergodic. 1 H H p2
Let Ia (Xi ) be the indicator that Xi = a and Ja (Xi ) be the indicator that Xi %= a . 1 H T p(1 − p)
Then for all i , all a ∈ X , Ia (Xi ) + Ja (Xi ) = 1 . 1 T H p(1 − p)
1 T T (1 − p)2
Let N1 (a), N2 (a), . . . be the recurrence times of the symbol a in the sequence. 2 H H (1 − p)2
Thus X1 = a , Xi %= a, 1 < i < N1 (a) , and XN1 (a) = a , etc. Thus the sum of 2 H T p(1 − p)
$
Ja (Xi ) over all i is equal to the j (Nj (a) − 1) . Or equivalently 2 T H p(1 − p)
! ! !
2 T T p2
Nj (a) = Ja (Xi ) + Ia (Xi ) = n (4.128)
Thus the joint distribution of (Y1 , Y2 ) is ( 21 (p2 + (1 − p)2 ), p(1 − p), p(1 − p), 12 (p2 +
j i i
(1 − p)2 )) , and we can now calculate
Summing this over all a ∈ X , we obtain
!! I(X; Y1 , Y2 ) = H(Y1 , Y2 ) − H(Y1 , Y2 |X) (4.130)
Nj (a) = nm (4.129) = H(Y1 , Y2 ) − H(Y1 |X) − H(Y2 |X) (4.131)
a j
= H(Y1 , Y2 ) − 2H(p) (4.132)
* +
1 2 1
There are n terms in this sum, and therefore the empirical mean of N j (Xi ) is m . = H (p + (1 − p)2 ), p(1 − p), p(1 − p), (p2 + (1 − p)2 ) − 2H(p)
2 2
Thus the empirical average of N over any sample sequence is m and thus the
= H(p(1 − p)) + 1 − 2H(p) (4.133)
expected value of N must also be m .

18. Stationary but not ergodic process. A bin has two biased coins, one with prob- where the last step follows from using the grouping rule for entropy.
ability of heads p and the other with probability of heads 1 − p . One of these coins
is chosen at random (i.e., with probability 1/2), and is then tossed n times. Let X (c)
denote the identity of the coin that is picked, and let Y 1 and Y2 denote the results of
the first two tosses.
H(Y1 , Y2 , . . . , Yn )
H(Y) = lim (4.134)
n
(a) Calculate I(Y1 ; Y2 |X) . H(X, Y1 , Y2 , . . . , Yn ) − H(X|Y1 , Y2 , . . . , Yn )
= lim (4.135)
(b) Calculate I(X; Y1 , Y2 ) . n
H(X) + H(Y1 , Y2 , . . . , Yn |X) − H(X|Y1 , Y2 , . . . , Yn )
(c) Let H(Y) be the entropy rate of the Y process (the sequence of coin tosses). = lim (4.136)
n
Calculate H(Y) . (Hint: Relate this to lim n→∞ n1 H(X, Y1 , Y2 , . . . , Yn ) ).

You can check the answer by considering the behavior as p → 1/2 . Since 0 ≤ H(X|Y1 , Y2 , . . . , Yn ) ≤ H(X) ≤ 1 , we have lim n1 H(X) = 0 and sim-
ilarly n1 H(X|Y1 , Y2 , . . . , Yn ) = 0 . Also, H(Y1 , Y2 , . . . , Yn |X) = nH(p) , since the
Solution: Yi ’s are i.i.d. given X . Combining these terms, we get

(a) SInce the coin tosses are indpendent conditional on the coin chosen, I(Y 1 ; Y2 |X) =
nH(p)
0. H(Y) = lim = H(p) (4.137)
n
(b) The key point is that if we did not know the coin being used, then Y 1 and Y2
are not independent. The joint distribution of Y 1 and Y2 can be easily calculated
from the following table 19. Random walk on graph. Consider a random walk on the graph

Entropy Rates of a Stochastic Process 83 84 Entropy Rates of a Stochastic Process

What about the entropy rate of rooks, bishops and queens? There are two types of
2! bishops.
!"
! " Solution:
! "!3
! %& Random walk on the chessboard.
! % &
! % & Notice that the king cannot remain where it is. It has to move from one state to the
! % & next. The stationary distribution is given by µ i = Ei /E , where Ei = number of edges
1 !
!## % & $
emanating from node i and E = 9i=1 Ei . By inspection, E1 = E3 = E7 = E9 = 3 ,
" ####% &
" #
% ### &&! E2 = E4 = E6 = E8 = 5 , E5 = 8 and E = 40 , so µ1 = µ3 = µ7 = µ9 = 3/40 ,
" % $$
4 µ2 = µ4 = µ6 = µ8 = 5/40 and µ5 = 8/40 . In a random walk the next state is
" % $$
"!%$ chosen with equal probability among possible choices, so H(X 2 |X1 = i) = log 3 bits
5 for i = 1, 3, 7, 9 , H(X2 |X1 = i) = log 5 for i = 2, 4, 6, 8 and H(X2 |X1 = i) = log 8
(a) Calculate the stationary distribution. bits for i = 5 . Therefore, we can calculate the entropy rate of the king as

(b) What is the entropy rate? 9

!
H = µi H(X2 |X1 = i) (4.143)
(c) Find the mutual information I(Xn+1 ; Xn ) assuming the process is stationary. i=1
= 0.3 log 3 + 0.5 log 5 + 0.2 log 8 (4.144)
Solution: = 2.24 bits. (4.145)
(a) The stationary distribution for a connected graph of undirected edges with equal
weight is given as µi = 2E Ei
where Ei denotes the number of edges emanating 21. Maximal entropy graphs. Consider a random walk on a connected graph with 4
from node i and E is the total number of edges in the graph. Hence, the station- edges.
3 3 3 3 4
ary distribution is [ 16 , 16 , 16 , 16 , 16 ] ; i.e., the first four nodes exterior nodes have (a) Which graph has the highest entropy rate?
3
steady state probability of 16 while node 5 has steady state probability of 41 .
(b) Which graph has the lowest?
3 4
(b) Thus, the entropy rate of the random walk on this graph is 4 16 log2 (3)+ 16 log2 (4) =
3 1 Solution: Graph entropy.
4 log2 (3) + 2 = log 16 − H(3/16, 3/16, 3/16, 3/16, 1/4)
(c) The mutual information There are five graphs with four edges.

I(Xn+1 ; Xn ) = H(Xn+1 ) − H(Xn+1 |Xn ) (4.138)

= H(3/16, 3/16, 3/16, 3/16, 1/4) − (log16 − H(3/16, 3/16, 3/16, 3/16,
(4.139)
1/4))
= 2H(3/16, 3/16, 3/16, 3/16, 1/4) − log16 (4.140)
3 16 1
= 2( log + log 4) − log16 (4.141)
4 3 4
3
= 3 − log 3 (4.142)
2

20. Random walk on chessboard. Find the entropy rate of the Markov chain associated
with a random walk of a king on the 3 × 3 chessboard

1 2 3
4 5 6
7 8 9
Entropy Rates of a Stochastic Process 85 86 Entropy Rates of a Stochastic Process

(a) What is the stationary distribution?

(b) What is the entropy rate of this random walk?

Solution: 3D Maze.
The entropy rate of a random walk on a graph with equal weights is given by equation
4.41 in the text:
* +
E1 Em
H(X ) = log(2E) − H ,...,
2E 2E
There are 8 corners, 12 edges, 6 faces and 1 center. Corners have 3 edges, edges have
4 edges, faces have 5 edges and centers have 6 edges. Therefore, the total number of
edges E = 54 . So,
* + * + * + * +
3 3 4 4 5 5 6 6
H(X ) = log(108) + 8 log + 12 log +6 log +1 log
108 108 108 108 108 108 108 108
= 2.03 bits

23. Entropy rate

Let {Xi } be a stationary stochastic process with entropy rate H(X ) .

(a) Argue that H(X ) ≤ H(X1 ) .

(b) What are the conditions for equality?

Solution: Entropy Rate

(a) From Theorem 4.2.1

H(X ) = H(X1 |X0 , X−1 , . . .) ≤ H(X1 ) (4.146)

since conditioning reduces entropy

(b) We have equality only if X1 is independent of the past X0 , X−1 , . . . , i.e., if and
only if Xi is an i.i.d. process.
Where the entropy rates are 1/2+ 3/8 log(3) ≈ 1.094, 1, .75, 1 and 1/4+ 3/8 log(3) ≈
24. Entropy rates
.844.
Let {Xi } be a stationary process. Let Yi = (Xi , Xi+1 ) . Let Zi = (X2i , X2i+1 ) . Let
(a) From the above we see that the first graph maximizes entropy rate with and Vi = X2i . Consider the entropy rates H(X ) , H(Y) , H(Z) , and H(V) of the processes
entropy rate of 1.094. {Xi } , {Yi } , {Zi } , and {Vi } . What is the inequality relationship ≤ , =, or ≥ between
each of the pairs listed below:
(b) From the above we see that the third graph minimizes entropy rate with and ≥
entropy rate of .75. (a) H(X ) ≤ H(Y) .
≥
(b) H(X ) ≤ H(Z) .
22. 3-D Maze.
≥
A bird is lost in a 3 × 3 × 3 cubical maze. The bird flies from room to room going to (c) H(X ) ≤ H(V) .
adjoining rooms with equal probability through each of the walls. To be specific, the ≥
corner rooms have 3 exits. (d) H(Z) ≤ H(X ) .

Entropy Rates of a Stochastic Process 87 88 Entropy Rates of a Stochastic Process

Solution: Entropy rates 27. Entropy rate

{Xi } is a stationary process, Yi = (Xi , Xi+1 ) . Let Zi = (X2i , X2i+1 ) . Let Vi = X2i . Let {Xi } be a stationary {0, 1} valued stochastic process obeying
Consider the entropy rates H(X ) , H(Y) , H(Z) , and H(V) of the processes {X i } ,
{Zi } , and {Vi } . Xk+1 = Xk ⊕ Xk−1 ⊕ Zk+1 ,

(a) H(X ) = H (Y) , since H(X1 , X2 , . . . , Xn , Xn+1 ) = H(Y1 , Y2 , . . . , Yn ) , and dividing where {Zi } is Bernoulli( p ) and ⊕ denotes mod 2 addition. What is the entropy rate
by n and taking the limit, we get equality. H(X ) ?
(b) H(X ) < H (Z) , since H(X1 , . . . , X2n ) = H(Z1 , . . . , Zn ) , and dividing by n and Solution: Entropy Rate
taking the limit, we get 2H(X ) = H(Z) .
(c) H(X ) > H (V) , since H(V1 |V0 , . . .) = H(X2 |X0 , X−2 , . . .) ≤ H(X2 |X1 , X0 , X−1 , . . .) . H(X ) = H(Xk+1 |Xk , Xk−1 , . . .) = H(Xk+1 |Xk , Xk−1 ) = H(Zk+1 ) = H(p) (4.151)
(d) H(Z) = 2H (X ) since H(X1 , . . . , X2n ) = H(Z1 , . . . , Zn ) , and dividing by n and
taking the limit, we get 2H(X ) = H(Z) . 28. Mixture of processes
Suppose we observe one of two stochastic processes but don’t know which. What is the
25. Monotonicity.
entropy rate? Specifically, let X11 , X12 , X13 , . . . be a Bernoulli process with parameter
(a) Show that I(X; Y1 , Y2 , . . . , Yn ) is non-decreasing in n . p1 and let X21 , X22 , X23 , . . . be Bernoulli (p2 ) . Let
(b) Under what conditions is the mutual information constant for all n ? 
 1, 1
with probability 2
Solution: Monotonicity θ=
 2, 1
with probability 2
(a) Since conditioning reduces entropy,
and let Yi = Xθi , i = 1, 2, . . . , be the observed stochastic process. Thus Y observes
H(X|Y1 , Y2 , . . . , Yn ) ≥ H(X|Y1 , Y2 , . . . , Yn , Yn+1 ) (4.147) the process {X1i } or {X2i } . Eventually Y will know which.

and hence (a) Is {Yi } stationary?

I(X; Y1 , Y2 , . . . , Yn ) = H(X) − H(X|Y1 , Y2 , . . . , Yn ) (4.148) (b) Is {Yi } an i.i.d. process?
≤ H(X) − H(X|Y1 , Y2 , . . . , Yn ,n+1 ) (4.149) (c) What is the entropy rate H of {Yi } ?
= I(X; Y1 , Y2 , . . . , Yn , Yn+1 ) (4.150) (d) Does
1
− log p(Y1 , Y2 , . . . Yn ) −→ H?
(b) We have equality if and only if H(X|Y 1 , Y2 , . . . , Yn ) = H(X|Y1 ) for all n , i.e., if n
X is conditionally independent of Y 2 , . . . given Y1 . 1
(e) Is there a code that achieves an expected per-symbol description length n ELn −→
26. Transitions in Markov chains. Suppose {X i } forms an irreducible Markov chain H?
with transition matrix P and stationary distribution µ . Form the associated “edge-
process” {Yi } by keeping track only of the transitions. Thus the new process {Y i } Now let θi be Bern( 12 ). Observe
takes values in X × X , and Yi = (Xi−1 , Xi ) .
Zi = Xθi i , i = 1, 2, . . . ,
For example
X = 3, 2, 8, 5, 7, . . . Thus θ is not fixed for all time, as it was in the first part, but is chosen i.i.d. each time.
becomes Answer (a), (b), (c), (d), (e) for the process {Z i } , labeling the answers (a ' ), (b ' ), (c ' ),
Y = (∅, 3), (3, 2), (2, 8), (8, 5), (5, 7), . . . (d ' ), (e ' ).
Solution: Mixture of processes.
Find the entropy rate of the edge process {Y i } .
Solution: Edge Process H(X ) = H (Y) , since H(X 1 , X2 , . . . , Xn , Xn+1 ) = H(Y1 , Y2 , . . . , Yn ) , (a) Yes, {Yi } is stationary, since the scheme that we use to generate the Y i s doesn’t
and dividing by n and taking the limit, we get equality. change with time.
Entropy Rates of a Stochastic Process 89 90 Entropy Rates of a Stochastic Process

(b) No, it is not IID, since there’s dependence now – all Y i s have been generated Let Sn be the waiting time for the nth head to appear.
according to the same parameter θ . Thus,
Alternatively, we can arrive at the result by examining I(Y n+1 ; Y n ) . If the process S0 = 0
were to be IID, then the expression I(Y n+1 ; Y n ) would have to be 0 . However,
if we are given Y n , then we can estimate what θ is, which in turn allows us to Sn+1 = Sn + Xn+1
predict Yn+1 . Thus, I(Yn+1 ; Y n ) is nonzero. where X1 , X2 , X3 , . . . are i.i.d according to the distribution above.
(c) The process {Yi } is the mixture of two Bernoulli processes with different param-
(a) Is the process {Sn } stationary?
eters, and its entropy rate is the mixture of the two entropy rates of the two
processes so it’s given by (b) Calculate H(S1 , S2 , . . . , Sn ) .
H(p1 ) + H(p2 ) (c) Does the process {Sn } have an entropy rate? If so, what is it? If not, why not?
.
2 (d) What is the expected number of fair coin flips required to generate a random
More rigorously, variable having the same distribution as S n ?
1 Solution: Waiting time process.
H = lim H(Y n )
n
n→∞
1
= lim (H(θ) + H(Y n |θ) − H(θ|Y n )) (a) For the process to be stationary, the distribution must be time invariant. It turns
n→∞ n
H(p1 ) + H(p2 ) out that process {Sn } is not stationary. There are several ways to show this.
= • S0 is always 0 while Si , i %= 0 can take on several values. Since the marginals
2
for S0 and S1 , for example, are not the same, the process can’t be stationary.
• It’s clear that the variance of S n grows with n , which again implies that the
Note that only H(Y n |θ) grows with n . The rest of the term is finite and will go
marginals are not time-invariant.
to 0 as n goes to ∞ .
• Process {Sn } is an independent increment process. An independent increment
(d) The process {Yi } is not ergodic, so the AEP does not apply and the quantity process is not stationary (not even wide sense stationary), since var(S n ) =
−(1/n) log P (Y1 , Y2 , . . . , Yn ) does NOT converge to the entropy rate. (But it does var(Xn ) + var(Sn−1 ) > var(Sn−1 ) .
converge to a random variable that equals H(p 1 ) w.p. 1/2 and H(p2 ) w.p. 1/2.)
(b) We can use chain rule and Markov properties to obtain the following results.
(e) Since the process is stationary, we can do Huffman coding on longer and longer
n
blocks of the process. These codes will have an expected per-symbol length H(S1 , S2 , . . . , Sn ) = H(S1 ) +
!
H(Si |S i−1 )
bounded above by H(X1 ,X2n,...,Xn )+1 and this converges to H(X ) . i=2
!n
(a’) Yes, {Yi } is stationary, since the scheme that we use to generate the Y i ’s doesn’t = H(S1 ) + H(Si |Si−1 )
change with time. i=2
n
(b’) Yes, it is IID, since there’s no dependence now – each Y i is generated according = H(X1 ) +
!
H(Xi )
to an independent parameter θi , and Yi ∼ Bernoulli( (p1 + p2 )/2) . i=2
(c’) Since the process is now IID, its entropy rate is n
!
* + = H(Xi )
p1 + p 2 i=1
H .
2 = 2n

(d’) Yes, the limit exists by the AEP. (c) It follows trivially from the previous part that
(e’) Yes, as in (e) above. H(S n )
H(S) = lim
n
n→∞
29. Waiting times. 2n
Let X be the waiting time for the first heads to appear in successive flips of a fair coin. = lim
n→∞ n
Thus, for example, P r{X = 3} = ( 12 )3 . = 2

Entropy Rates of a Stochastic Process 91 92 Entropy Rates of a Stochastic Process

Note that the entropy rate can still exist even when the process is not stationary. the exact value of H(Sn ) is 5.8636 . The expected number of flips required is
Furthermore, the entropy rate (for this problem) is the same as the entropy of X. somewhere between 5.8636 and 7.8636 .
(d) The expected number of flips required can be lower-bounded by H(S n ) and upper-
bounded by H(Sn ) + 2 (Theorem.5.12.3, page 115). S n has a negative binomial 30. Markov chain transitions.
/
k−1  1 1 1

distribution; i.e., P r(Sn = k) = ( 12 )k for k ≥ n . (We have the n th 2 4 4
n−1  1 1 1

success at the k th trial if and only if we have exactly n − 1 successes in k − 1 P = [Pij ] = 
 4 2 4


1 1 1
trials and a suceess at the k th trial.) 4 4 2

Since computing the exact value of H(S n ) is difficult (and fruitless in the exam Let X1 be uniformly distributed over the states {0, 1, 2}. Let {X i }∞ 1 be a Markov
setting), it would be sufficient to show that the expected number of flips required chain with transition matrix P , thus P (Xn+1 = j|Xn = i) = Pij , i, j ∈ {0, 1, 2}.
is between H(Sn ) and H(Sn ) + 2 , and set up the expression of H(S n ) in terms
of the pmf of Sn . (a) Is {Xn } stationary?
(b) Find limn→∞ n1 H(X1 , . . . , Xn ).
Note, however, that for large n , however, the distribution of S n will tend to
Gaussian with mean np = 2n and variance n(1 − p)/p2 = 2n . Now consider the derived process Z1 , Z2 , . . . , Zn , where
Let pk = P r(Sn = k + ESn ) = P r(Sn = k + 2n) . Let φ(x) be the normal √
density
function with mean zero and variance 2n , i.e. φ(x) = exp (−x 2 /2σ 2 )/ 2πσ 2 , Z1 = X1
where σ 2 = 2n . Zi = Xi − Xi−1 (mod 3), i = 2, . . . , n.
Then for large n , since the entropy is invariant under any constant shift of a
random variable and φ(x) log φ(x) is Riemann integrable, Thus Z n encodes the transitions, not the states.
H(Sn ) = H(Sn − E(Sn ))
! (c) Find H(Z1 , Z2 , ..., Zn ).
= − pk log pk
! (d) Find H(Zn ) and H(Xn ), for n ≥ 2 .
≈ − φ(k) log φ(k)
2 (e) Find H(Zn |Zn−1 ) for n ≥ 2 .
≈ − φ(x) log φ(x)dx
(f) Are Zn−1 and Zn independent for n ≥ 2 ?
2
= (− log e) φ(x) ln φ(x)dx
Solution:
2
x2 √
= (− log e) φ(x)(− − ln 2πσ 2 ) (a) Let µn denote the probability mass function at time n . Since µ 1 = ( 13 , 13 , 13 ) and
2σ 2
1 1 µ2 = µ1 P = µ1 , µn = µ1 = ( 31 , 13 , 13 ) for all n and {Xn } is stationary.
= (log e)( + ln 2πσ 2 )
2 2 Alternatively, the observation P is doubly stochastic will lead the same conclusion.
1
= log 2πeσ 2 (b) Since {Xn } is stationary Markov,
2
1
= log nπe + 1. lim H(X1 , . . . , Xn ) = H(X2 |X1 )
2 n→∞
2
(Refer to Chapter 9 for a more general discussion of the entropy of a continuous !
= P (X1 = k)H(X2 |X1 = k)
random variable and its relation to discrete entropy.) k=0
1 1 1 1
Here is = 3× × H( , , )
. a specific/example for n = 100 . Based on earlier discussion, P r(S 100 = 3 2 4 4
k−1 3
k) = ( 12 )k . The Gaussian approximation of H(S n ) is 5.8690 while = .
100 − 1 2
Entropy Rates of a Stochastic Process 93 94 Entropy Rates of a Stochastic Process

(c) Since (X1 , . . . , Xn ) and (Z1 , . . . , Zn ) are one-to-one, by the chain rule of entropy 31. Markov.
and the Markovity,
Let {Xi } ∼ Bernoulli(p) . Consider the associated Markov chain {Y i }ni=1 where
H(Z1 , . . . , Zn ) = H(X1 , . . . , Xn ) Yi = (the number of 1’s in the current run of 1’s) . For example, if X n = 101110 . . . ,
n
! we have Y n = 101230 . . . .
= H(Xk |X1 , . . . , Xk−1 )
k=1 (a) Find the entropy rate of X n .
n
!
= H(X1 ) + H(Xk |Xk−1 ) (b) Find the entropy rate of Y n .
k=2
= H(X1 ) + (n − 1)H(X2 |X1 ) Solution: Markov solution.
3
= log 3 + (n − 1).
2 (a) For an i.i.d. source, H(X ) = H(X) = H(p) .
Alternatively, we can use the results of parts (d), (e), and (f). Since Z 1 , . . . , Zn (b) Observe that X n and Y n have a one-to-one mapping. Thus, H(Y) = H(X ) =
are independent and Z2 , . . . , Zn are identically distributed with the probability H(p) .
distribution ( 12 , 14 , 14 ) ,
32. Time symmetry.
H(Z1 , . . . , Zn ) = H(Z1 ) + H(Z2 ) + . . . + H(Zn )
Let {Xn } be a stationary Markov process. We condition on (X 0 , X1 ) and look into
= H(Z1 ) + (n − 1)H(Z2 ) the past and future. For what index k is
3
= log 3 + (n − 1).
2 H(X−n |X0 , X1 ) = H(Xk |X0 , X1 )?
(d) Since {Xn } is stationary with µn = ( 13 , 13 , 13 ) ,
Give the argument.
1 1 1 Solution: Time symmetry.
H(Xn ) = H(X1 ) = H( , , ) = log 3.
3 3 3 The trivial solution is k = −n. To find other possible values of k we expand
 1
 0,
 2,
H(X−n |X0 , X1 ) = H(X−n , X0 , X1 ) − H(X0 , X1 )
For n ≥ 2 , Zn = 1, 4,1


2, 4.1 = H(X−n ) + H(X0 , X1 |X−n ) − H(X0 , X1 )
Hence, H(Zn ) = H( 12 , 14 , 14 ) = 32 . = H(X−n ) + H(X0 |X−n ) + H(X1 |X0 , X−n ) − H(X0 , X1 )
(a)
(e) Due to the symmetry of P , P (Zn |Zn−1 ) = P (Zn ) for n ≥ 2. Hence, H(Zn |Zn−1 ) = = H(X−n ) + H(X0 |X−n ) + H(X1 |X0 ) − H(X0 , X1 )
H(Zn ) = 32 . = H(X−n ) + H(X0 |X−n ) − H(X0 )
Alternatively, using the result of part (f), we can trivially reach the same conclu- (b)
= H(X0 ) + H(X0 |X−n ) − H(X0 )
sion.
(c)
(f) Let k ≥ 2 . First observe that by the symmetry of P , Z k+1 = Xk+1 − Xk is = H(Xn |X0 )
independent of Xk . Now that (d)
= H(Xn |X0 , X−1 )
H(Zk+1 |Xk , Xk−1 ) = H(Xk+1 − Xk |Xk , Xk−1 ) (e)
= H(Xn+1 |X1 , X0 )
= H(Xk+1 − Xk |Xk )
= H(Xk+1 − Xk )
= H(Zk+1 ), where (a) and (d) come from Markovity and (b), (c) and (e) come from stationarity.
Hence k = n + 1 is also a solution. There are no other solution since for any other
Zk+1 is independent of (Xk , Xk−1 ) and hence independent of Zk = Xk − Xk−1 . k, we can construct a periodic Markov process as a counterexample. Therefore k ∈
For k = 1 , again by the symmetry of P , Z2 is independent of Z1 = X1 trivially. {−n, n + 1}.

Entropy Rates of a Stochastic Process 95 96 Entropy Rates of a Stochastic Process

33. Chain inequality: Let X1 → X2 → X3 → X4 form a Markov chain. Show that Thus the second difference is negative, establishing that H(X n |X0 ) is a concave func-
tion of n .
I(X1 ; X3 ) + I(X2 ; X4 ) ≤ I(X1 ; X4 ) + I(X2 ; X3 ) (4.152)
Solution: Concavity of second law of thermodynamics
Solution: Chain inequality X1 → X2 → X3 → X4 Since X0 → Xn−2 → Xn−1 → Xn is a Markov chain

Solution: Broadcast Channel

X → Y → (Z, W ) , hence by the data processing inequality, I(X; Y ) ≥ I(X; (Z, W )) ,
and hence

I(X : Y ) +I(Z; W ) − I(X; Z) − I(X; W ) (4.163)

≥ I(X : Z, W ) + I(Z; W ) − I(X; Z) − I(X; W ) (4.164)
= H(Z, W ) + H(X) − H(X, W, Z) + H(W ) + H(Z) − H(W, Z)
−H(Z) − H(X) + H(X, Z)) − H(W ) − H(X) + H(W, X)(4.165)
= −H(X, W, Z) + H(X, Z) + H(X, W ) − H(X) (4.166)
= H(W |X) − H(W |X, Z) (4.167)
= I(W ; Z|X) (4.168)
≥ 0 (4.169)

35. Concavity of second law. Let {Xn }∞ −∞ be a stationary Markov process. Show that
H(Xn |X0 ) is concave in n . Specifically show that

Uniquely decodable codes satisfy Kraft’s inequality. Therefore

f (D) = D −1 + D −1 + D −2 + D −3 + D −2 + D −3 ≤ 1. (5.4)

We have f (2) = 7/4 > 1 , hence D > 2 . We have f (3) = 26/27 < 1 . So a possible
value of D is 3. Our counting system is base 10, probably because we have 10 fingers.
Chapter 5 Perhaps the Martians were using a base 3 representation because they have 3 fingers.
(Maybe they are like Maine lobsters ?)

3. Slackness in the Kraft inequality. An instantaneous code has word lengths l 1 , l2 , . . . , lm

which satisfy the strict inequality
Data Compression m
!
D−li < 1.
i=1

$ The code alphabet is D = {0, 1, 2, . . . , D − 1}. Show that there exist arbitrarily long
1. Uniquely decodable and instantaneous codes. Let L = m 100 be the ex-
i=1 pi li
sequences of code symbols in D ∗ which cannot be decoded into sequences of codewords.
pected value of the 100th power of the word lengths associated with an encoding of the
random variable X. Let L1 = min L over all instantaneous codes; and let L 2 = min L Solution:
over all uniquely decodable codes. What inequality relationship exists between L 1 and Slackness in the Kraft inequality. Instantaneous codes are prefix free codes, i.e., no
L2 ? codeword is a prefix of any other codeword. Let n max = max{n1 , n2 , ..., nq }. There
Solution: Uniquely decodable and instantaneous codes. are D nmax sequences of length nmax . Of these sequences, D nmax −ni start with the
i -th codeword. Because of the prefix condition no two sequences can start with the
m
! same codeword. Hence the total number of sequences which start with some codeword
L= pi n100 (5.1)
is qi=1 Dnmax −ni = D nmax qi=1 D−ni < D nmax . Hence there are sequences which do
i $ $
i=1
not start with any codeword. These and all longer sequences with these length n max
L1 = min L (5.2) sequences as prefixes cannot be decoded. (This situation can be visualized with the aid
Instantaneous codes of a tree.)
L2 = min L (5.3) Alternatively, we can map codewords onto dyadic intervals on the real line correspond-
Uniquely decodable codes
ing to real numbers whose decimal expansions start with that codeword. Since the
$ −ni
Since all instantaneous codes are uniquely decodable, we must have L 2 ≤ L1 . Any set length of the interval for a codeword of length n i is D −ni , and D < 1 , there ex-
of codeword lengths which achieve the minimum of L 2 will satisfy the Kraft inequality ists some interval(s) not used by any codeword. The binary sequences in these intervals
and hence we can construct an instantaneous code with the same codeword lengths, do not begin with any codeword and hence cannot be decoded.
and hence the same L . Hence we have L1 ≤ L2 . From both these conditions, we must
have L1 = L2 . 4. Huffman coding. Consider the random variable

2. How many fingers has a Martian? Let . /

x1 x2 x3 x4 x5 x6 x7
X=
. / 0.49 0.26 0.12 0.04 0.04 0.03 0.02
S1 , . . . , S m
S= .
p1 , . . . , p m
(a) Find a binary Huffman code for X.
The Si ’s are encoded into strings from a D -symbol output alphabet in a uniquely de- (b) Find the expected codelength for this encoding.
codable manner. If m = 6 and the codeword lengths are (l 1 , l2 , . . . , l6 ) = (1, 1, 2, 3, 2, 3), (c) Find a ternary Huffman code for X.
find a good lower bound on D. You may wish to explain the title of the problem.
Solution: How many fingers has a Martian? Solution: Examples of Huffman codes.
97

Data Compression 99 100 Data Compression

(a) The Huffman tree for this distribution is Huffman code produced by the Huffman encoding algorithm is optimal, they all have
Codeword the same average length.)
1 x1 0.49 0.49 0.49 0.49 0.49 0.51 1
6. Bad codes. Which of these codes cannot be Huffman codes for any probability as-
00 x2 0.26 0.26 0.26 0.26 0.26 0.49
signment?
011 x3 0.12 0.12 0.12 0.13 0.25
01000 x4 0.04 0.05 0.08 0.12
(a) {0, 10, 11}.
01001 x5 0.04 0.04 0.05
01010 x6 0.03 0.04 (b) {00, 01, 10, 110}.
01011 x7 0.02 (c) {01, 10}.
(b) The expected length of the codewords for the binary Huffman code is 2.02 bits.
Solution: Bad codes
( H(X) = 2.01 bits)
(c) The ternary Huffman tree is (a) {0,10,11} is a Huffman code for the distribution (1/2,1/4,1/4).
Codeword (b) The code {00,01,10, 110} can be shortened to {00,01,10, 11} without losing its
0 x1 0.49 0.49 0.49 1.0 instantaneous property, and therefore is not optimal, so it cannot be a Huffman
1 x2 0.26 0.26 0.26 code. Alternatively, it is not a Huffman code because there is a unique longest
20 x3 0.12 0.12 0.25 codeword.
22 x4 0.04 0.09
210 x5 0.04 0.04 (c) The code {01,10} can be shortened to {0,1} without losing its instantaneous prop-
211 x6 0.03 erty, and therefore is not optimal and not a Huffman code.
212 x7 0.02
7. Huffman 20 questions. Consider a set of n objects. Let X i = 1 or 0 accordingly as
This code has an expected length 1.34 ternary symbols. ( H 3 (X) = 1.27 ternary the i-th object is good or defective. Let X 1 , X2 , . . . , Xn be independent with Pr {Xi =
symbols). 1} = pi ; and p1 > p2 > . . . > pn > 1/2 . We are asked to determine the set of all
5. More Huffman codes. Find the binary Huffman code for the source with probabilities defective objects. Any yes-no question you can think of is admissible.
(1/3, 1/5, 1/5, 2/15, 2/15) . Argue that this code is also optimal for the source with
probabilities (1/5, 1/5, 1/5, 1/5, 1/5). (a) Give a good lower bound on the minimum average number of questions required.
(b) If the longest sequence of questions is required by nature’s answers to our questions,
Solution: More Huffman codes. The Huffman code for the source with probabilities
what (in words) is the last question we should ask? And what two sets are we
( 13 , 15 , 15 , 15
2 2
, 15 ) has codewords {00,10,11,010,011}.
distinguishing with this question? Assume a compact (minimum average length)
To show that this code (*) is also optimal for (1/5, 1/5, 1/5, 1/5, 1/5) we have to sequence of questions.
show that it has minimum expected length, that is, no shorter code can be constructed
(c) Give an upper bound (within 1 question) on the minimum average number of
without violating H(X) ≤ EL .
questions required.
H(X) = log 5 = 2.32 bits. (5.5)
Solution: Huffman 20 Questions.
3 2 12
E(L(∗)) = 2 × +3× = bits. (5.6) (a) We will be using the questions to determine the sequence X 1 , X2 , . . . , Xn , where
5 5 5
Since Xi is 1 or 0 according to whether the i -th object is good or defective. Thus the
C
5
! li k most likely sequence is all 1’s, with a probability of ni=1 pi , and the least likely
E(L(any code)) = = bits (5.7) C
sequence is the all 0’s sequence with probability ni=1 (1 − pi ) . Since the optimal
i=1
5 5
set of questions corresponds to a Huffman code for the source, a good lower bound
for some integer k , the next lowest possible value of E(L) is 11/5 = 2.2 bits ¡ 2.32 on the average number of questions is the entropy of the sequence X 1 , X2 , . . . , Xn .
bits. Hence (*) is optimal. But since the Xi ’s are independent Bernoulli random variables, we have
Note that one could also prove the optimality of (*) by showing that the Huffman ! !
code for the (1/5, 1/5, 1/5, 1/5, 1/5) source has average length 12/5 bits. (Since each EQ ≥ H(X1 , X2 , . . . , Xn ) = H(Xi ) = H(pi ). (5.8)
Data Compression 101 102 Data Compression

(b) The last bit in the Huffman code distinguishes between the least likely source To find the unconditional average, we have to find the stationary distribution on the
symbols. (By the conditions of the problem, all the probabilities are different, states. Let µ be the stationary distribution. Then
and thus the two least likely sequences are uniquely defined.) In this case, the  
two least likely sequences are 000 . . . 00 and 000 . . . 01 , which have probabilities 1/2 1/4 1/4
 
(1 − p1 )(1 − p2 ) . . . (1 − pn ) and (1 − p1 )(1 − p2 ) . . . (1 − pn−1 )pn respectively. Thus µ = µ  1/4 1/2 1/4  (5.9)
the last question will ask “Is Xn = 1 ”, i.e., “Is the last item defective?”. 0 1/2 1/2
(c) By the same arguments as in Part (a), an upper bound on the minimum average We can solve this to find that µ = (2/9, 4/9, 1/3) . Thus the unconditional average
number of questions is an upper bound on the average length of a Huffman code, number of bits per source symbol
$
namely H(X1 , X2 , . . . , Xn ) + 1 = H(pi ) + 1 .
3
!
8. Simple optimum compression of a Markov source. Consider the 3-state Markov EL = µi E(L|Ci ) (5.10)
process U1 , U2 , . . . , having transition matrix i=1
2 4 1
= × 1.5 + × 1.5 + × 1 (5.11)
Un−1 \Un S1 S2 S3 9 9 3
4
S1 1/2 1/4 1/4 = bits/symbol. (5.12)
S2 1/4 1/2 1/4 3
S3 0 1/2 1/2
The entropy rate H of the Markov chain is

Thus the probability that S1 follows S3 is equal to zero. Design 3 codes C1 , C2 , C3 H = H(X2 |X1 ) (5.13)
(one for each state 1, 2 and 3 ), each code mapping elements of the set of S i ’s into 3
!
sequences of 0’s and 1’s, such that this Markov process can be sent with maximal = µi H(X2 |X1 = Si ) (5.14)
compression by the following scheme: i=1
= 4/3 bits/symbol. (5.15)
(a) Note the present symbol Xn = i .
(b) Select code Ci . Thus the unconditional average number of bits per source symbol and the entropy rate
H of the Markov chain are equal, because the expected length of each code C i equals
(c) Note the next symbol Xn+1 = j and send the codeword in Ci corresponding to
the entropy of the state after state i , H(X 2 |X1 = Si ) , and thus maximal compression
j.
is obtained.
(d) Repeat for the next symbol.
What is the average message length of the next symbol conditioned on the previous 9. Optimal code lengths that require one bit above entropy. The source coding
state Xn = i using this coding scheme? What is the unconditional average number theorem shows that the optimal code for a random variable X has an expected length
of bits per source symbol? Relate this to the entropy rate H(U) of the Markov less than H(X) + 1 . Give an example of a random variable for which the expected
chain. length of the optimal code is close to H(X) + 1 , i.e., for any ǫ > 0 , construct a
distribution for which the optimal code has L > H(X) + 1 − ǫ .
Solution: Simple optimum compression of a Markov source. Solution: Optimal code lengths that require one bit above entropy. There is a trivial
It is easy to design an optimal code for each state. A possible solution is example that requires almost 1 bit above its entropy. Let X be a binary random
variable with probability of X = 1 close to 1. Then entropy of X is close to 0 , but
Next state S1 S2 S3
the length of its optimal code is 1 bit, which is almost 1 bit above its entropy.
Code C1 0 10 11 E(L|C1 ) = 1.5 bits/symbol
code C2 10 0 11 E(L|C2 ) = 1.5 bits/symbol 10. Ternary codes that achieve the entropy bound. A random variable X takes
code C3 - 0 1 E(L|C3 ) = 1 bit/symbol on m values and has entropy H(X) . An instantaneous ternary code is found for this
The average message lengths of the next symbol conditioned on the previous state source, with average length
being Si are just the expected lengths of the codes C i . Note that this code assignment H(X)
L= = H3 (X). (5.16)
achieves the conditional entropy lower bound. log2 3

Data Compression 103 104 Data Compression

(a) Show that each symbol of X has a probability of the form 3 −i for some i . (b) Show that there exist two different sets of optimal lengths for the codewords,
(b) Show that m is odd. namely, show that codeword length assignments (1, 2, 3, 3) and (2, 2, 2, 2) are
both optimal.
Solution: Ternary codes that achieve the entropy bound. (c) Conclude that there are optimal codes with codeword lengths for some symbols
1
that exceed the Shannon code length ⌈log p(x) ⌉.
(a) We will argue that an optimal ternary code that meets the entropy bound corre-
sponds to complete ternary tree, with the probability of each leaf of the form 3 −i . Solution: Shannon codes and Huffman codes.
To do this, we essentially repeat the arguments of Theorem 5.3.1. We achieve the
ternary entropy bound only if D(p||r) = 0 and c = 1 , in (5.25). Thus we achieve (a) Applying the Huffman algorithm gives us the following table
the entropy bound if and only if pi = 3−j for all i . Code Symbol Probability
(b) We will show that any distribution that has p i = 3−li for all i must have an 0 1 1/3 1/3 2/3 1
odd number of symbols. We know from Theorem 5.2.1, that given the set of 11 2 1/3 1/3 1/3
lengths, li , we can construct a ternary tree with nodes at the depths l i . Now, 101 3 1/4 1/3
since
$ −l
3 i = 1 , the tree must be complete. A complete ternary tree has an 100 4 1/12
odd number of leaves (this can be proved by induction on the number of internal which gives codeword lengths of 1,2,3,3 for the different codewords.
nodes). Thus the number of source symbols is odd. (b) Both set of lengths 1,2,3,3 and 2,2,2,2 satisfy the Kraft inequality, and they both
Another simple argument is to use basic number theory. We know that for achieve the same expected length (2 bits) for the above distribution. Therefore
$ −li $
this distribution, 3 = 1 . We can write this as 3−lmax 3lmax −li = 1 or they are both optimal.
$ lmax −li
3 = 3lmax . Each of the terms in the sum is odd, and since their sum is (c) The symbol with probability 1/4 has an Huffman code of length 3, which is greater
odd, the number of terms in the sum has to be odd (the sum of an even number than ⌈log p1 ⌉ . Thus the Huffman code for a particular symbol may be longer than
of odd terms is even). Thus there are an odd number of source symbols for any the Shannon code for that symbol. But on the average, the Huffman code cannot
code that meets the ternary entropy bound. be longer than the Shannon code.
11. Suffix condition. Consider codes that satisfy the suffix condition, which says that 13. Twenty questions. Player A chooses some object in the universe, and player B
no codeword is a suffix of any other codeword. Show that a suffix condition code is attempts to identify the object with a series of yes-no questions. Suppose that player B
uniquely decodable, and show that the minimum average length over all codes satisfying is clever enough to use the code achieving the minimal expected length with respect to
the suffix condition is the same as the average length of the Huffman code for that player A’s distribution. We observe that player B requires an average of 38.5 questions
random variable. to determine the object. Find a rough lower bound to the number of objects in the
Solution: Suffix condition. The fact that the codes are uniquely decodable can be universe.
seen easily be reversing the order of the code. For any received sequence, we work Solution: Twenty questions.
backwards from the end, and look for the reversed codewords. Since the codewords
satisfy the suffix condition, the reversed codewords satisfy the prefix condition, and the 37.5 = L∗ − 1 < H(X) ≤ log |X | (5.17)
we can uniquely decode the reversed code.
The fact that we achieve the same minimum expected length then follows directly from and hence number of objects in the universe > 2 37.5 = 1.94 × 1011 .
the results of Section 5.5. But we can use the same reversal argument to argue that
corresponding to every suffix code, there is a prefix code of the same length and vice 14. Huffman code. Find the (a) binary and (b) ternary Huffman codes for the random
versa, and therefore we cannot achieve any lower codeword lengths with a suffix code variable X with probabilities
than we can with a prefix code.
1 2 3 4 5 6
p=( , , , , , ) .
12. Shannon codes and Huffman codes. Consider a random variable X which takes 21 21 21 21 21 21
on four values with probabilities ( 13 , 13 , 14 , 12
1
). $
(c) Calculate L = pi li in each case.
(a) Construct a Huffman code for this random variable. Solution: Huffman code.
Data Compression 105 106 Data Compression

(a) The Huffman tree for this distribution is (a) Construct a binary Huffman code for this random variable. What is its average
Codeword length?
00 x1 6/21 6/21 6/21 9/21 12/21 1 (b) Construct a quaternary Huffman code for this random variable, i.e., a code over
10 x2 5/21 5/21 6/21 6/21 9/21 an alphabet of four symbols (call them a, b, c and d ). What is the average length
11 x3 4/21 4/21 5/21 6/21 of this code?
010 x4 3/21 3/21 4/21
(c) One way to construct a binary code for the random variable is to start with a
0110 x5 2/21 3/21
quaternary code, and convert the symbols into binary using the mapping a → 00 ,
0111 x6 1/21
b → 01 , c → 10 and d → 11 . What is the average length of the binary code for
(b) The ternary Huffman tree is the above random variable constructed by this process?
Codeword
1 x1 6/21 6/21 10/21 1 (d) For any random variable X , let LH be the average length of the binary Huffman
2 x2 5/21 5/21 6/21 code for the random variable, and let L QB be the average length code constructed
00 x3 4/21 4/21 5/21 by first building a quaternary Huffman code and converting it to binary. Show
01 x4 3/21 3/21 that
020 x5 2/21 3/21 LH ≤ LQB < LH + 2 (5.18)
021 x6 1/21 (e) The lower bound in the previous example is tight. Give an example where the
022 x7 0/21 code constructed by converting an optimal quaternary code is also the optimal
(c) The expected length of the codewords for the binary Huffman code is 51/21 = 2.43 binary code.
bits.
(f) The upper bound, i.e., LQB < LH + 2 is not tight. In fact, a better bound is
The ternary code has an expected length of 34/21 = 1.62 ternary symbols. LQB ≤ LH + 1 . Prove this bound, and provide an example where this bound is
15. Huffman codes. tight.

(a) Construct a binary Huffman code for the following distribution on 5 symbols p = Solution: Huffman codes: Consider a random variable X which takes 6 values {A, B, C, D, E, F }
(0.3, 0.3, 0.2, 0.1, 0.1) . What is the average length of this code? with probabilities (0.5, 0.25, 0.1, 0.05, 0.05, 0.05) respectively.
(b) Construct a probability distribution p'
on 5 symbols for which the code that you
(a) Construct a binary Huffman code for this random variable. What is its average
constructed in part (a) has an average length (under p ' ) equal to its entropy
length?
H(p' ) .
Solution:
Solution: Huffman codes Code Source symbol Prob.
0 A 0.5 0.5 0.5 0.5 0.5 1.0
(a) The code constructed by the standard Huffman procedure 10 B 0.25 0.25 0.25 0.25 0.5
Codeword X Probability 1100 C 0.1 0.1 0.15 0.25
10 1 0.3 0.3 0.4 0.6 1 1101 D 0.05 0.1 0.1
11 2 0.3 0.3 0.3 0.4 1110 E 0.05 0.05
00 3 0.2 0.2 0.3 1111 F 0.05
010 4 0.1 0.2
The average length of this code is 1×0.5+2×0.25+4×(0.1+0.05+0.05+0.05) = 2
011 5 0.1
bits. The entropy H(X) in this case is 1.98 bits.
The average length = 2 ∗ 0.8 + 3 ∗ 0.2 = 2.2 bits/symbol.
(b) Construct a quaternary Huffman code for this random variable, i.e., a code over
(b) The code would have a rate equal to the entropy if each of the codewords was of
an alphabet of four symbols (call them a, b, c and d ). What is the average length
length 1/p(X) . In this case, the code constructed above would be efficient for the
of this code?
distrubution (0.25.0.25,0.25,0.125,0.125).
Solution:Since the number of symbols, i.e., 6 is not of the form 1 + k(D − 1) ,
16. Huffman codes: Consider a random variable X which takes 6 values {A, B, C, D, E, F } we need to add a dummy symbol of probability 0 to bring it to this form. In this
with probabilities (0.5, 0.25, 0.1, 0.05, 0.05, 0.05) respectively. case, drawing up the Huffman tree is straightforward.

Data Compression 107 108 Data Compression

Code Symbol Prob. (f) (Optional, no credit) The upper bound, i.e., L QB < LH + 2 is not tight. In fact, a
a A 0.5 0.5 1.0 better bound is LQB ≤ LH + 1 . Prove this bound, and provide an example where
b B 0.25 0.25 this bound is tight.
d C 0.1 0.15 Solution:Consider a binary Huffman code for the random variable X and consider
ca D 0.05 0.1 all codewords of odd length. Append a 0 to each of these codewords, and we will
cb E 0.05 obtain an instantaneous code where all the codewords have even length. Then we
cc F 0.05 can use the inverse of the mapping mentioned in part (c) to construct a quaternary
cd G 0.0 code for the random variable - it is easy to see that the quatnerary code is also
The average length of this code is 1 × 0.85 + 2 × 0.15 = 1.15 quaternary symbols. instantaneous. Let LBQ be the average length of this quaternary code. Since the
(c) One way to construct a binary code for the random variable is to start with a length of the quaternary codewords of BQ are half the length of the corresponding
quaternary code, and convert the symbols into binary using the mapping a → 00 , binary codewords, we have
b → 01 , c → 10 and d → 11 . What is the average length of the binary code for  
the above random variable constructed by this process? 1 ! LH + 1
LBQ = LH + pi  < (5.22)
Solution:The code constructed by the above process is A → 00 , B → 01 , C → 2 2
i:li is odd
11 , D → 1000 , E → 1001 , and F → 1010 , and the average length is 2 × 0.85 +
4 × 0.15 = 2.3 bits. and since the BQ code is at best as good as the quaternary Huffman code, we
(d) For any random variable X , let LH be the average length of the binary Huffman have
code for the random variable, and let L QB be the average length code constructed LBQ ≥ LQ (5.23)
by firsting building a quaternary Huffman code and converting it to binary. Show
Therefore LQB = 2LQ ≤ 2LBQ < LH + 1 .
that
An example where this upper bound is tight is the case when we have only two
LH ≤ LQB < LH + 2 (5.19)
possible symbols. Then LH = 1 , and LQB = 2 .
Solution:Since the binary code constructed from the quaternary code is also in-
stantaneous, its average length cannot be better than the average length of the 17. Data compression. Find an optimal set of binary codeword lengths l 1 , l2 , . . . (min-
$
best instantaneous code, i.e., the Huffman code. That gives the lower bound of imizing pi li ) for an instantaneous code for each of the following probability mass
the inequality above. functions:
To prove the upper bound, the LQ be the length of the optimal quaternary code. 10 9 8 7 7
(a) p = ( 41 , 41 , 41 , 41 , 41 )
Then from the results proved in the book, we have 9 9 1 9 1 2 9 1 3
(b) p = ( 10 , ( 10 )( 10 ), ( 10 )( 10 ) , ( 10 )( 10 ) , . . .)
H4 (X) ≤ LQ < H4 (X) + 1 (5.20)
Solution: Data compression
Also, it is easy to see that LQB = 2LQ , since each symbol in the quaternary code
is converted into two bits. Also, from the properties of entropy, it follows that Code Source symbol Prob.
H4 (X) = H2 (X)/2 . Substituting these in the previous equation, we get 10 A 10/41 14/41 17/41 24/41 41/41
00 B 9/41 10/41 14/41 17/41
H2 (X) ≤ LQB < H2 (X) + 2. (5.21) (a)
01 C 8/41 9/41 10/41
Combining this with the bound that H 2 (X) ≤ LH , we obtain LQB < LH + 2 . 110 D 7/41 8/41
(e) The lower bound in the previous example is tight. Give an example where the 111 E 7/41
code constructed by converting an optimal quaternary code is also the optimal (b) This is case of an Huffman code on an infinite alphabet. If we consider an initial
binary code? subset of the symbols, we can see that the cumulative probability of all symbols
$ j−1 = 0.9(0.1)i−1 (1/(1 − 0.1)) = (0.1)i−1 . Since
Solution:Consider a random variable that takes on four equiprobable values. {x : x > i} is j>i 0.9 ∗ (0.1)
Then the quaternary Huffman code for this is 1 quaternary symbol for each source this is less than 0.9 ∗ (0.1)i−1 , the cumulative sum of all the remaining terms is
symbol, with average length 1 quaternary symbol. The average length L QB for less than the last term used. Thus Huffman coding will always merge the last two
this code is then 2 bits. The Huffman code for this case is also easily seen to assign terms. This in terms implies that the Huffman code in this case is of the form
2 bit codewords to each symbol, and therefore for this case, L H = LQB . 1,01,001,0001, etc.
Data Compression 109 110 Data Compression

18. Classes of codes. Consider the code {0, 01} irrelevant to this procedure. He will choose the 63 most valuable outcomes, and
his first question will be “Is X = i ?” where i is the median of these 63 numbers.
(a) Is it instantaneous? After isolating to either half, his next question will be “Is X = j ?”, where j is
(b) Is it uniquely decodable? the median of that half. Proceeding this way, he will win if X is one of the 63
(c) Is it nonsingular? most valuable outcomes, and lose otherwise. This strategy maximizes his expected
winnings.
Solution: Codes. (b) Now if arbitrary questions are allowed, the game reduces to a game of 20 questions
$
(a) No, the code is not instantaneous, since the first codeword, 0, is a prefix of the to determine the object. The return in this case to the player is x p(x)(v(x) −
second codeword, 01. l(x)) , where l(x) is the number of questions required to determine the object.
Maximizing the return is equivalent to minimizing the expected number of ques-
(b) Yes, the code is uniquely decodable. Given a sequence of codewords, first isolate tions, and thus, as argued in the text, the optimal strategy is to construct a
occurrences of 01 (i.e., find all the ones) and then parse the rest into 0’s. Huffman code for the source and use that to construct a question strategy. His
$ $
(c) Yes, all uniquely decodable codes are non-singular. expected return is therefore between p(x)v(x) − H and p(x)v(x) − H − 1 .

19. The game of Hi-Lo. (c) A computer wishing to minimize the return to player will want to minimize
$
p(x)v(x) − H(X) over choices of p(x) . We can write this as a standard mini-
(a) A computer generates a number X according to a known probability mass function mization problem with constraints. Let
p(x), x ∈ {1, 2, . . . , 100} . The player asks a question, “Is X = i ?” and is told ! ! !
“Yes”, “You’re too high,” or “You’re too low.” He continues for a total of six J(p) = pi vi + pi log pi + λ pi (5.24)
questions. If he is right (i.e., he receives the answer “Yes”) during this sequence,
and differentiating and setting to 0, we obtain
he receives a prize of value v(X). How should the player proceed to maximize his
expected winnings?
vi + log pi + 1 + λ = 0 (5.25)
(b) The above doesn’t have much to do with information theory. Consider the fol-
lowing variation: X ∼ p(x), prize = v(x) , p(x) known, as before. But arbitrary or after normalizing to ensure that the p i ’s form a probability distribution,
Yes-No questions are asked sequentially until X is determined. (“Determined”
doesn’t mean that a “Yes” answer is received.) Questions cost one unit each. How 2−vi
pi = $ −vj . (5.26)
should the player proceed? What is the expected payoff? j2

(c) Continuing (b), what if v(x) is fixed, but p(x) can be chosen by the computer
To complete the proof, we let ri = $2
−vi
(and then announced to the player)? The computer wishes to minimize the player’s −vj , and rewrite the return as
j
2
expected return. What should p(x) be? What is the expected return to the player?
! ! ! !
pi vi + pi log pi = pi log pi − pi log 2−vi (5.27)
Solution: The game of Hi-Lo. ! ! !
= pi log pi − pi log ri − log( 2−vj ) (5.28)
(a) The first thing to recognize in this problem is that the player cannot cover more !
= D(p||r) − log( 2−vj ), (5.29)
than 63 values of X with 6 questions. This can be easily seen by induction.
With one question, there is only one value of X that can be covered. With two and thus the return is minimized by choosing p i = ri . This is the distribution
questions, there is one value of X that can be covered with the first question, that the computer must choose to minimize the return to the player.
and depending on the answer to the first question, there are two possible values
of X that can be asked in the next question. By extending this argument, we see 20. Huffman codes with costs. Words like Run! Help! and Fire! are short, not because
that we can ask at more 63 different questions of the form “Is X = i ?” with 6 they are frequently used, but perhaps because time is precious in the situations in which
questions. (The fact that we have narrowed the range at the end is irrelevant, if these words are required. Suppose that X = i with probability p i , i = 1, 2, . . . , m. Let
we have not isolated the value of X .) li be the number of binary symbols in the codeword associated with X = i, and let c i
Thus if the player seeks to maximize his return, he should choose the 63 most denote the cost per letter of the codeword when X = i. Thus the average cost C of
$
valuable outcomes for X , and play to isolate these values. The probabilities are the description of X is C = m i=1 pi ci li .

Data Compression 111 112 Data Compression

$
(a) Minimize C over all l1 , l2 , . . . , lm such that 2−li ≤ 1. Ignore any implied in- 21. Conditions for unique decodability. Prove that a code C is uniquely decodable if
teger constraints on li . Exhibit the minimizing l1∗ , l2∗ , . . . , lm
∗ and the associated (and only if) the extension
minimum value C ∗ .
C k (x1 , x2 , . . . , xk ) = C(x1 )C(x2 ) · · · C(xk )
(b) How would you use the Huffman code procedure to minimize C over all uniquely
decodable codes? Let CHuf f man denote this minimum. is a one-to-one mapping from X k to D ∗ for every k ≥ 1 . (The only if part is obvious.)
(c) Can you show that Solution: Conditions for unique decodability. If C k is not one-to-one for some k , then
m
! C is not UD, since there exist two distinct sequences, (x 1 , . . . , xk ) and (x'1 , . . . , x'k ) such
C ∗ ≤ CHuf f man ≤ C ∗ + pi ci ?
that
i=1
C k (x1 , . . . , xk ) = C(x1 ) · · · C(xk ) = C(x'1 ) · · · C(x'k ) = C(x'1 , . . . , x'k ) .
Solution: Huffman codes with costs. Conversely, if C is not UD then by definition there exist distinct sequences of source
$ $ symbols, (x1 , . . . , xi ) and (y1 , . . . , yj ) , such that
(a) We wish to minimize C = pi ci ni subject to 2−ni ≤ 1 . We will assume
$
equality in the constraint and let ri = 2−ni and let Q = i pi ci . Let qi = C(x1 )C(x2 ) · · · C(xi ) = C(y1 )C(y2 ) · · · C(yj ) .
(pi ci )/Q . Then q also forms a probability distribution and we can write C as
! Concatenating the input sequences (x 1 , . . . , xi ) and (y1 , . . . , yj ) , we obtain
C = pi ci ni (5.30)
C(x1 ) · · · C(xi )C(y1 ) · · · C(yj ) = C(y1 ) · · · C(yj )C(x1 ) · · · C(xi ) ,
! 1
= Q qi log (5.31)
ri which shows that C k is not one-to-one for k = i + j .
*! +
qi !
= Q qi log − qi log qi (5.32) 22. Average length of an optimal code. Prove that L(p 1 , . . . , pm ) , the average code-
ri
= Q(D(q||r) + H(q)). (5.33) word length for an optimal D -ary prefix code for probabilities {p 1 , . . . , pm } , is a con-
tinuous function of p1 , . . . , pm . This is true even though the optimal code changes
Since the only freedom is in the choice of r i , we can minimize C by choosing discontinuously as the probabilities vary.
r = q or Solution: Average length of an optimal code. The longest possible codeword in an
pi ci
n∗i = − log $ , (5.34) optimal code has n − 1 binary digits. This corresponds to a completely unbalanced tree
pj cj in which each codeword has a different length. Using a D -ary alphabet for codewords
where we have ignored any integer constraints on n i . The minimum cost C ∗ for can only decrease its length. Since we know the maximum possible codeword length,
this assignment of codewords is there are only a finite number of possible codes to consider. For each candidate code C ,
the average codeword length is determined by the probability distribution p 1 , p2 , . . . , pn :
C ∗ = QH(q) (5.35) n
!
L(C) = p i ℓi .
(b) If we use q instead of p for the Huffman procedure, we obtain a code minimizing i=1
expected cost.
This is a linear, and therefore continuous, function of p 1 , p2 , . . . , pn . The optimal
(c) Now we can account for the integer constraints. code is the candidate code with the minimum L , and its length is the minimum of a
Let finite number of continuous functions and is therefore itself a continuous function of
ni = ⌈− log qi ⌉ (5.36) p1 , p2 , . . . , p n .

Then 23. Unused code sequences. Let C be a variable length code that satisfies the Kraft
− log qi ≤ ni < − log qi + 1 (5.37) inequality with equality but does not satisfy the prefix condition.

Multiplying by pi ci and summing over i , we get the relationship (a) Prove that some finite sequence of code alphabet symbols is not the prefix of any
sequence of codewords.
C ∗ ≤ CHuf f man < C ∗ + Q. (5.38) (b) (Optional) Prove or disprove: C has infinite decoding delay.
Data Compression 113 114 Data Compression

Solution: Unused code sequences. Let C be a variable length code that satisfies the If two codes differ by 2 bits or more, call m s the message with the shorter codeword
Kraft inequality with equality but does not satisfy the prefix condition. Cs and mℓ the message with the longer codeword C ℓ . Change the codewords
for these two messages so that the new codeword C s' is the old Cs with a zero
(a) When a prefix code satisfies the Kraft inequality with equality, every (infinite) appended (Cs' = Cs 0) and Cℓ' is the old Cs with a one appended (Cℓ' = Cs 1) . Cs'
sequence of code alphabet symbols corresponds to a sequence of codewords, since and Cℓ' are legitimate codewords since no other codeword contained C s as a prefix
the probability that a random generated sequence begins with a codeword is (by definition of a prefix code), so obviously no other codeword could contain C s'
m
! or Cℓ' as a prefix. The length of the codeword for m s increases by 1 and the
D−ℓi = 1 . length of the codeword for mℓ decreases by at least 1. Since these messages are
i=1 equally likely, L' ≤ L . By this method we can transform any optimal code into a
If the code does not satisfy the prefix condition, then at least one codeword, say code in which the length of the shortest and longest codewords differ by at most
C(x1 ) , is a prefix of another, say C(xm ) . Then the probability that a random one bit. (In fact, it is easy to see that every optimal code has this property.)
generated sequence begins with a codeword is at most For a source with n messages, ℓ(ms ) = ⌊log 2 n⌋ and ℓ(mℓ ) = ⌈log 2 n⌉ . Let d be
the difference between n and the next smaller power of 2:
m−1
!
D−ℓi ≤ 1 − D −ℓm < 1 , d = n − 2⌊log2 n⌋ .
i=1
Then the optimal code has 2d codewords of length ⌈log 2 n⌉ and n−2d codewords
which shows that not every sequence of code alphabet symbols is the beginning of
of length ⌊log 2 n⌋ . This gives
a sequence of codewords.
(b) (Optional) A reference to a paper proving that C has infinite decoding delay will 1
L = (2d⌈log 2 n⌉ + (n − 2d)⌊log 2 n⌋)
be supplied later. It is easy to see by example that the decoding delay cannot be n
finite. An simple example of a code that satisfies the Kraft inequality, but not the 1
= (n⌊log 2 n⌋ + 2d)
prefix condition is a suffix code (see problem 11). The simplest non-trivial suffix n
2d
code is one for three symbols {0, 01, 11} . For such a code, consider decoding a = ⌊log2 n⌋ + .
string 011111 . . . 1110. If the number of one’s is even, then the string must be n
parsed 0,11,11, . . . ,11,0, whereas if the number of 1’s is odd, the string must be Note that d = 0 is a special case in the above equation.
parsed 01,11, . . . ,11. Thus the string cannot be decoded until the string of 1’s has (b) The average codeword length equals the entropy if and only if n is a power of 2.
ended, and therefore the decoding delay could be infinite. To see this, consider the following calculation of L :
! !
24. Optimal codes for uniform distributions. Consider a random variable with m L= p i ℓi = − pi log2 2−ℓi = H + D(p5q) ,
equiprobable outcomes. The entropy of this information source is obviously log 2 m i i
bits.
where qi = 2−ℓi. Therefore L = H only if pi = qi , that is, when all codewords
(a) Describe the optimal instantaneous binary code for this source and compute the have equal length, or when d = 0 .
average codeword length Lm . (c) For n = 2m + d , the redundancy r = L − H is given by
(b) For what values of m does the average codeword length L m equal the entropy r = L − log2 n
H = log2 m ?
2d
(c) We know that L < H + 1 for any probability distribution. The redundancy of a = ⌊log2 n⌋ + − log2 n
n
variable length code is defined to be ρ = L − H . For what value(s) of m , where 2d
2k ≤ m ≤ 2k+1 , is the redundancy of the code maximized? What is the limiting = m+ − log2 (2m + d)
n
value of this worst case redundancy as m → ∞ ? 2d ln(2m + d)
= m+ m − .
2 +d ln 2
Solution: Optimal codes for uniform distributions.
Therefore
(a) For uniformly probable codewords, there exists an optimal binary variable length ∂r (2m + d)(2) − 2d 1 1
= − ·
prefix code such that the longest and shortest codewords differ by at most one bit. ∂d (2m + d)2 ln 2 2m + d

Data Compression 115 116 Data Compression

Setting this equal to zero implies d∗ = 2m (2 ln 2 − 1) . Since there is only one subtree beginning with 10; that is, we replace codewords of the form 10x . . . by
maximum, and since the function is convex ∩ , the maximizing d is one of the two 0x . . . and we let c1 = 10 . This improvement contradicts the assumption that
integers nearest (.3862)(2m ) . The corresponding maximum redundancy is ℓ1 = 1 , and so ℓ1 ≥ 2 .

2d∗ ln(2m + d∗ ) 26. Merges. Companies with values W 1 , W2 , . . . , Wm are merged as follows. The two least
r∗ ≈ m + − valuable companies are merged, thus forming a list of m − 1 companies. The value
2m + d∗ ln 2
2(.3862)(2m ) ln(2m + (.3862)2m ) of the merge is the sum of the values of the two merged companies. This continues
= m+ m − until one supercompany remains. Let V equal the sum of the values of the merges.
2 + (.3862)(2m ) ln 2
= .0861 . Thus V represents the total reported dollar volume of the merges. For example, if
W = (3, 3, 2, 2) , the merges yield (3, 3, 2, 2) → (4, 3, 3) → (6, 4) → (10) , and V =
This is achieved with arbitrary accuracy as n → ∞ . (The quantity σ = 0.0861 is 4 + 6 + 10 = 20 .
one of the lesser fundamental constants of the universe. See Robert Gallager[8]).
(a) Argue that V is the minimum volume achievable by sequences of pair-wise merges
25. Optimal codeword lengths. Although the codeword lengths of an optimal variable terminating in one supercompany. (Hint: Compare to Huffman coding.)
$
length code are complicated functions of the message probabilities {p 1 , p2 , . . . , pm } , it (b) Let W = Wi , W̃i = Wi /W , and show that the minimum merge volume V
can be said that less probable symbols are encoded into longer codewords. Suppose satisfies
that the message probabilities are given in decreasing order p 1 > p2 ≥ · · · ≥ pm . W H(W̃) ≤ V ≤ W H(W̃) + W (5.39)

(a) Prove that for any binary Huffman code, if the most probable message symbol has Solution: Problem: Merges
probability p1 > 2/5 , then that symbol must be assigned a codeword of length 1.
(a) We first normalize the values of the companies to add to one. The total volume of
(b) Prove that for any binary Huffman code, if the most probable message symbol the merges is equal to the sum of value of each company times the number of times
has probability p1 < 1/3 , then that symbol must be assigned a codeword of it takes part in a merge. This is identical to the average length of a Huffman code,
length ≥ 2 . with a tree which corresponds to the merges. Since Huffman coding minimizes
average length, this scheme of merges minimizes total merge volume.
Solution: Optimal codeword lengths. Let {c 1 , c2 , . . . , cm } be codewords of respective
lengths {ℓ1 , ℓ2 , . . . , ℓm } corresponding to probabilities {p 1 , p2 , . . . , pm } . (b) Just as in the case of Huffman coding, we have

(a) We prove that if p1 > p2 and p1 > 2/5 then ℓ1 = 1 . Suppose, for the sake of H ≤ EL < H + 1, (5.40)
contradiction, that ℓ1 ≥ 2 . Then there are no codewords of length 1; otherwise
we have in this case for the corresponding merge scheme
c1 would not be the shortest codeword. Without loss of generality, we can assume
that c1 begins with 00. For x, y ∈ {0, 1} let Cxy denote the set of codewords W H(W̃) ≤ V ≤ W H(W̃) + W (5.41)
beginning with xy . Then the sets C01 , C10 , and C11 have total probability
1 − p1 < 3/5 , so some two of these sets (without loss of generality, C 10 and C11 ) 27. The Sardinas-Patterson test for unique decodability. A code is not uniquely
have total probability less 2/5. We can now obtain a better code by interchanging decodable if and only if there exists a finite sequence of code symbols which can be
the subtree of the decoding tree beginning with 1 with the subtree beginning with resolved in two different ways into sequences of codewords. That is, a situation such as
00; that is, we replace codewords of the form 1x . . . by 00x . . . and codewords of
the form 00y . . . by 1y . . . . This improvement contradicts the assumption that | A1 | A2 | A3 ... Am |
ℓ1 ≥ 2 , and so ℓ1 = 1 . (Note that p1 > p2 was a hidden assumption for this | B1 | B2 | B3 ... Bn |
problem; otherwise, for example, the probabilities {.49, .49, .02} have the optimal must occur where each Ai and each Bi is a codeword. Note that B1 must be a
code {00, 1, 01} .) prefix of A1 with some resulting “dangling suffix.” Each dangling suffix must in turn
(b) The argument is similar to that of part (a). Suppose, for the sake of contradiction, be either a prefix of a codeword or have another codeword as its prefix, resulting in
that ℓ1 = 1 . Without loss of generality, assume that c 1 = 0 . The total probability another dangling suffix. Finally, the last dangling suffix in the sequence must also be
of C10 and C11 is 1 − p1 > 2/3 , so at least one of these two sets (without loss a codeword. Thus one can set up a test for unique decodability (which is essentially
of generality, C10 ) has probability greater than 2/3. We can now obtain a better the Sardinas-Patterson test[12]) in the following way: Construct a set S of all possible
code by interchanging the subtree of the decoding tree beginning with 0 with the dangling suffixes. The code is uniquely decodable if and only if S contains no codeword.
Data Compression 117 118 Data Compression

(a) State the precise rules for building the set S . (b) A simple upper bound can be obtained from the fact that all strings in the sets
(b) Suppose the codeword lengths are l i , i = 1, 2, . . . , m . Find a good upper bound Si have length less than lmax , and therefore the maximum number of elements in
on the number of elements in the set S . S is less than 2lmax .

(c) Determine which of the following codes is uniquely decodable: (c) i. {0, 10, 11} . This code is instantaneous and hence uniquely decodable.
ii. {0, 01, 11} . This code is a suffix code (see problem 11). It is therefore uniquely
i. {0, 10, 11} .
decodable. The sets in the Sardinas-Patterson test are S 1 = {0, 01, 11} ,
ii. {0, 01, 11} . S2 = {1} = S3 = S4 = . . . .
iii. {0, 01, 10} . iii. {0, 01, 10} . This code is not uniquely decodable. The sets in the test are
iv. {0, 01} . S1 = {0, 01, 10} , S2 = {1} , S3 = {0} , . . . . Since 0 is codeword, this code
v. {00, 01, 10, 11} . fails the test. It is easy to see otherwise that the code is not UD - the string
vi. {110, 11, 10} . 010 has two valid parsings.
vii. {110, 11, 100, 00, 10} . iv. {0, 01} . This code is a suffix code and is therefore UD. THe test produces
sets S1 = {0, 01} , S2 = {1} , S3 = φ .
(d) For each uniquely decodable code in part (c), construct, if possible, an infinite
encoded sequence with a known starting point, such that it can be resolved into v. {00, 01, 10, 11} . This code is instantaneous and therefore UD.
codewords in two different ways. (This illustrates that unique decodability does vi. {110, 11, 10} . This code is uniquely decodable, by the Sardinas-Patterson
not imply finite decodability.) Prove that such a sequence cannot arise in a prefix test, since S1 = {110, 11, 10} , S2 = {0} , S3 = φ .
code. vii. {110, 11, 100, 00, 10} . This code is UD, because by the Sardinas Patterson
test, S1 = {110, 11, 100, 00, 10} , S2 = {0} , S3 = {0} , etc.
Solution: Test for unique decodability. (d) We can produce infinite strings which can be decoded in two ways only for examples
The proof of the Sardinas-Patterson test has two parts. In the first part, we will show where the Sardinas Patterson test produces a repeating set. For example, in part
that if there is a code string that has two different interpretations, then the code will fail (ii), the string 011111 . . . could be parsed either as 0,11,11, . . . or as 01,11,11, . . . .
the test. The simplest case is when the concatenation of two codewords yields another Similarly for (viii), the string 10000 . . . could be parsed as 100,00,00, . . . or as
codeword. In this case, S2 will contain a codeword, and hence the test will fail. 10,00,00, . . . . For the instantaneous codes, it is not possible to construct such a
In general, the code is not uniquely decodeable, iff there exists a string that admits two string, since we can decode as soon as we see a codeword string, and there is no
different parsings into codewords, e.g. way that we would need to wait to decode.

x1 x2 x3 x4 x5 x6 x7 x8 = x1 x2 , x3 x4 x5 , x6 x7 x8 = x1 x2 x3 x4 , x5 x6 x7 x8 . (5.42) 28. Shannon code. Consider the following method for generating a code for a random
variable X which takes on m values {1, 2, . . . , m} with probabilities p 1 , p2 , . . . , pm .
In this case, S2 will contain the string x3 x4 , S3 will contain x5 , S4 will contain Assume that the probabilities are ordered so that p 1 ≥ p2 ≥ · · · ≥ pm . Define
x6 x7 x8 , which is a codeword. It is easy to see that this procedure will work for any
i−1
string that has two different parsings into codewords; a formal proof is slightly more !
Fi = pk , (5.43)
difficult and using induction. k=1
In the second part, we will show that if there is a codeword in one of the sets S i , i ≥ 2 ,
the sum of the probabilities of all symbols less than i . Then the codeword for i is the
then there exists a string with two different possible interpretations, thus showing that
number Fi ∈ [0, 1] rounded off to li bits, where li = ⌈log p1i ⌉ .
the code is not uniquely decodeable. To do this, we essentially reverse the construction
of the sets. We will not go into the details - the reader is referred to the original paper. (a) Show that the code constructed by this process is prefix-free and the average length
satisfies
(a) Let S1 be the original set of codewords. We construct S i+1 from Si as follows:
H(X) ≤ L < H(X) + 1. (5.44)
A string y is in Si+1 iff there is a codeword x in S1 , such that xy is in Si or if
there exists a z ∈ Si such that zy is in S1 (i.e., is a codeword). Then the code (b) Construct the code for the probability distribution (0.5, 0.25, 0.125, 0.125) .
is uniquely decodable iff none of the S i , i ≥ 2 contains a codeword. Thus the set
S = ∪i≥2 Si . Solution: Shannon code.

Data Compression 119 120 Data Compression

1
(a) Since li = ⌈log pi ⌉ , we have (a) For a dyadic distribution, the Huffman code acheives the entropy bound. The
code tree constructed be the Huffman algorithm is a complete tree with leaves at
1 1 depth li with probability pi = 2−li .
log ≤ li < log + 1 (5.45)
pi pi For such a complete binary tree, we can prove the following properties

which implies that • The probability of any internal node at depth k is 2 −k .

! We can prove this by induction. Clearly, it is true for a tree with 2 leaves.
H(X) ≤ L = pi li < H(X) + 1. (5.46)
Assume that it is true for all trees with n leaves. For any tree with n + 1
The difficult part is to prove that the code is a prefix code. By the choice of l i , leaves, at least two of the leaves have to be siblings on the tree (else the tree
we have would not be complete). Let the level of these siblings be j . The probability of
2−li ≤ pi < 2−(li −1) . (5.47) the parent of these two siblings (at level j − 1 ) has probability 2 j + 2j = 2j−1 .
We can now replace the two siblings with their parent, without changing the
Thus Fj , j > i differs from Fi by at least 2−li , and will therefore differ from Fi probability of any other internal node. But now we have a tree with n leaves
is at least one place in the first li bits of the binary expansion of Fi . Thus the which satisfies the required property. Thus, by induction, the property is true
codeword for Fj , j > i , which has length lj ≥ li , differs from the codeword for for all complete binary trees.
Fi at least once in the first li places. Thus no codeword is a prefix of any other • From the above property, it follows immediately the the probability of the left
codeword. child is equal to the probability of the right child.
(b) We build the following table (b) For a sequence X1 , X2 , we can construct a code tree by first constructing the
Symbol Probability Fi in decimal Fi in binary li Codeword optimal tree for X1 , and then attaching the optimal tree for X 2 to each leaf of
1 0.5 0.0 0.0 1 0 the optimal tree for X1 . Proceeding this way, we can construct the code tree for
2 0.25 0.5 0.10 2 10 X1 , X2 , . . . , Xn . When Xi are drawn i.i.d. according to a dyadic distribution, it
3 0.125 0.75 0.110 3 110 is easy to see that the code tree constructed will be also be a complete binary tree
4 0.125 0.875 0.111 3 111 with the properties in part (a). Thus the probability of the first bit being 1 is 1/2,
The Shannon code in this case achieves the entropy bound (1.75 bits) and is and at any internal node, the probability of the next bit produced by the code
optimal. being 1 is equal to the probability of the next bit being 0. Thus the bits produced
by the code are i.i.d. Bernoulli(1/2), and the entropy rate of the coded sequence
29. Optimal codes for dyadic distributions. For a Huffman code tree, define the is 1 bit per symbol.
probability of a node as the sum of the probabilities of all the leaves under that node. (c) Assume that we have a coded sequence of bits from a code that met the entropy
Let the random variable X be drawn from a dyadic distribution, i.e., p(x) = 2 −i , for bound with equality. If the coded sequence were compressible, then we could used
some i , for all x ∈ X . Now consider a binary Huffman code for this distribution. the compressed version of the coded sequence as our code, and achieve an average
length less than the entropy bound, which will contradict the bound. Thus the
(a) Argue that for any node in the tree, the probability of the left child is equal to the
coded sequence cannot be compressible, and thus must have an entropy rate of 1
probability of the right child.
bit/symbol.
(b) Let X1 , X2 , . . . , Xn be drawn i.i.d. ∼ p(x) . Using the Huffman code for p(x) , we
map X1 , X2 , . . . , Xn to a sequence of bits Y1 , Y2 , . . . , Yk(X1 ,X2 ,...,Xn ) . (The length 30. Relative entropy is cost of miscoding: Let the random variable X have five
of this sequence will depend on the outcome X 1 , X2 , . . . , Xn .) Use part (a) to possible outcomes {1, 2, 3, 4, 5} . Consider two distributions p(x) and q(x) on this
argue that the sequence Y1 , Y2 , . . . , forms a sequence of fair coin flips, i.e., that random variable
Pr{Yi = 0} = Pr{Yi = 1} = 21 , independent of Y1 , Y2 , . . . , Yi−1 . Symbol p(x) q(x) C1 (x) C2 (x)
Thus the entropy rate of the coded sequence is 1 bit per symbol. 1 1/2 1/2 0 0
2 1/4 1/8 10 100
(c) Give a heuristic argument why the encoded sequence of bits for any code that 3 1/8 1/8 110 101
achieves the entropy bound cannot be compressible and therefore should have an 4 1/16 1/8 1110 110
entropy rate of 1 bit per symbol. 5 1/16 1/8 1111 111
Solution: Optimal codes for dyadic distributions. (a) Calculate H(p) , H(q) , D(p||q) and D(q||p) .
Data Compression 121 122 Data Compression

(b) The last two columns above represent codes for the random variable. Verify that (b) Let LIN ST be the expected length of the best instantaneous code and L ∗1:1 be the
the average length of C1 under p is equal to the entropy H(p) . Thus C 1 is expected length of the best non-singular code for X . Argue that L ∗1:1 ≤ L∗IN ST ≤
optimal for p . Verify that C2 is optimal for q . H(X) + 1 .
(c) Now assume that we use code C2 when the distribution is p . What is the average (c) Give a simple example where the average length of the non-singular code is less
length of the codewords. By how much does it exceed the entropy p ? than the entropy.
(d) What is the loss if we use code C1 when the distribution is q ? (d) The set of codewords available for an non-singular code is {0, 1, 00, 01, 10, 11, 000, . . .} .
$
Since L1:1 = m i=1 pi li , show that this is minimized if we allot the shortest code-
Solution: Cost of miscoding
words to the most probable symbols.
1
(a) H(p) = 2 log 2 + 14 log 4 + 18 log 8 + 161
log 16 + 161
log 16 = 1.875 bits. Thus0 l1 =1 l2 = 1 , l3 = l4 = l5 = l6 = 2 , 0etc. Show 1 that in general li =
$
H(q) = 1
2 log 2 + 18 log 8 + 18 log 8 + 18 log 8 + 81 log 8 = 2 bits. ⌈log 2i + 1 ⌉ , and therefore L∗1:1 = m i
i=1 pi ⌈log 2 + 1 ⌉ .
D(p||q) = 1
2 log 1/2
1/2 +
1
4 log 1/4
1/8 +
1
8 log 1/8
1/8 +
1
16 log 1/16
1/8 +
1
16 log 1/16
1/8 = 0.125 bits. (e) The previous part shows that it is easy to find the optimal non-singular code for
D(p||q) = 1
log 1/2 1
log 1/8 1
log 1/8 1 1/8
+ 18 log 1/8 a distribution. However, it is a little more tricky to deal with the average length
2 1/2 + 8 1/4 + 8 1/8 + 8 log 1/16 1/16 = 0.125 bits.
of this code. We now bound0 this average
1 length. It follows from the previous part
(b) The average length of C1 for p(x) is 1.875 bits, which is the entropy of p . Thus -$
that L∗1:1 ≥ L̃= m i
i=1 pi log 2 + 1 . Consider the difference
C1 is an efficient code for p(x) . Similarly, the average length of code C 2 under
q(x) is 2 bits, which is the entropy of q . Thus C 2 is an efficient code for q . m m * +
! ! i
(c) If we use code C2 for p(x) , then the average length is 12 ∗ 1 + 41 ∗ 3 + 81 ∗ 3 + 16
1
∗ F (p) = H(X) − L̃ = − pi log pi − pi log +1 . (5.49)
1 i=1 i=1
2
3 + 16 ∗ 3 = 2 bits. It exceeds the entropy by 0.125 bits, which is the same as
D(p||q) . Prove by the method of Lagrange multipliers that the maximum of F (p) occurs
(d) Similary, using code C1 for q has an average length of 2.125 bits, which exceeds when pi = c/(i+2) , where c = 1/(Hm+2 −H2 ) and Hk is the sum of the harmonic
the entropy of q by 0.125 bits, which is D(q||p) . series, i.e.,
k
-! 1
31. Non-singular codes: The discussion in the text focused on instantaneous codes, with Hk = (5.50)
i
extensions to uniquely decodable codes. Both these are required in cases when the i=1
code is to be used repeatedly to encode a sequence of outcomes of a random variable. (This can also be done using the non-negativity of relative entropy.)
But if we need to encode only one outcome and we know when we have reached the
(f) Complete the arguments for
end of a codeword, we do not need unique decodability - only the fact that the code is
non-singular would suffice. For example, if a random variable X takes on 3 values a, H(X) − L∗1:1 ≤ H(X) − L̃ (5.51)
b and c, we could encode them by 0, 1, and 00. Such a code is non-singular but not ≤ log(2(Hm+2 − H2 )) (5.52)
uniquely decodable.
In the following, assume that we have a random variable X which takes on m values Now it is well known (see, e.g. Knuth, “Art of Computer Programming”, Vol.
1 1 1
with probabilities p1 , p2 , . . . , pm and that the probabilities are ordered so that p 1 ≥ 1) that Hk ≈ ln k (more precisely, Hk = ln k + γ + 2k − 12k 2 + 120k 4 − ǫ where

p2 ≥ . . . ≥ p m . 0 < ǫ < 1/252n6 , and γ = Euler’s constant = 0.577 . . . ). Either using this or a
simple approximation that Hk ≤ ln k + 1 , which can be proved by integration of
(a) By viewing the non-singular binary code as a ternary code with three symbols, 1 ∗
x , it can be shown that H(X) − L 1:1 < log log m + 2 . Thus we have
0,1 and “STOP”, show that the expected length of a non-singular code L 1:1 for a
random variable X satisfies the following inequality: H(X) − log log |X | − 2 ≤ L∗1:1 ≤ H(X) + 1. (5.53)

H2 (X) A non-singular code cannot do much better than an instantaneous code!

L1:1 ≥ −1 (5.48)
log2 3
Solution:
where H2 (X) is the entropy of X in bits. Thus the average length of a non-
singular code is at least a constant fraction of the average length of an instanta- (a) In the text, it is proved that the average length of any prefix-free code in a D -ary
neous code. alphabet was greater than HD (X) , the D -ary entropy. Now if we start with any

Data Compression 123 124 Data Compression

binary non-singular code and add the additional symbol “STOP” at the end, the use codewords of length k . Now by using the formula for the sum of the geometric
new code is prefix-free in the alphabet of 0,1, and “STOP” (since “STOP” occurs series, it is easy to see that
only at the end of codewords, and every codeword has a “STOP” symbol, so the
! ! 2k−1 − 1
only way a code word can be a prefix of another is if they were equal). Thus each ck = j = 1k−1 2j = 2 j = 0k−2 2j = 2 = 2k − 2 (5.55)
code word in the new alphabet is one symbol longer than the binary codewords, 2−1
and the average length is 1 symbol longer. Thus all sources with index i , where 2k − 1 ≤ i ≤ 2k − 2 + 2k = 2k+1 − 2 use
Thus we have L1:1 + 1 ≥ H3 (X) , or L1:1 ≥ Hlog 2 (X)
3 − 1 = 0.63H(X) − 1 . codewords of length k . This corresponds to 2 k < i + 2 ≤ 2k+1 or k < log(i + 2) ≤
(b) Since an instantaneous code is also a non-singular code, the best non-singular code k + 1 or k − 1 < log i+2 2 ≤ k . Thus the length of the codeword for the i -
is at least as good as the best instantaneous code. Since the best instantaneous th symbol is k = ⌈log i+22 ⌉ . Thus the best non-singular code assigns codeword
$
code has average length ≤ H(X) + 1 , we have L ∗1:1 ≤ L∗IN ST ≤ H(X) + 1 . length li∗ = ⌈log(i/2+1)⌉ to symbol i , and therefore L ∗1:1 = m
i=1 pi ⌈log(i/2+1)⌉ .
- $m
0 1
i
(c) For a 2 symbol alphabet, the best non-singular code and the best instantaneous (e) Since ⌈log(i/2 + 1)⌉ ≥ log(i/2 + 1) , it follows that L ∗1:1 ≥ L̃= i=1 pi log 2 +1 .
code are the same. So the simplest example where they differ is when |X | = 3 . Consider the difference
In this case, the simplest (and it turns out, optimal) non-singular code has three m
! m
! *
i
+
codewords 0, 1, 00 . Assume that each of the symbols is equally likely. Then F (p) = H(X) − L̃ = − pi log pi − pi log +1 . (5.56)
i=1 i=1
2
H(X) = log 3 = 1.58 bits, whereas the average length of the non-singular code
is 13 .1 + 31 .1 + 31 .2 = 4/3 = 1.3333 < H(X) . Thus a non-singular code could do We want to maximize this function over all probability distributions, and therefore
better than entropy. we use the method of Lagrange multipliers with the constraint
$
pi = 1 .
(d) For a given set of codeword lengths, the fact that allotting the shortest codewords Therefore let
to the most probable symbols is proved in Lemma 5.8.1, part 1 of EIT. m m * + m
! ! i !
This result is a general version of what is called the Hardy-Littlewood-Polya in- J(p) = − pi log pi − pi log + 1 + λ( pi − 1) (5.57)
i=1 i=1
2 i=1
equality, which says that if a < b , c < d , then ad + bc < ac + bd . The general
version of the Hardy-Littlewood-Polya inequality states that if we were given two Then differentiating with respect to p i and setting to 0, we get
sets of numbers A = {aj } and B = {bj } each of size m , and let a[i] be the i -th * +
∂J i
largest element of A and b[i] be the i -th largest element of set B . Then = −1 − log pi − log +1 +λ=0 (5.58)
∂pi 2
m m m
! ! ! i+2
a[i] b[m+1−i] ≤ ai bi ≤ a[i] b[i] (5.54) log pi = λ − 1 − log (5.59)
i=1 i=1 i=1 2
2
An intuitive explanation of this inequality is that you can consider the a i ’s to the pi = 2λ−1 (5.60)
i+2
position of hooks along a rod, and bi ’s to be weights to be attached to the hooks. Now substituting this in the constraint that
$
pi = 1 , we get
To maximize the moment about one end, you should attach the largest weights to
m
the furthest hooks. ! 1
2λ =1 (5.61)
The set of available codewords is the set of all possible sequences. Since the only i=1
i+2
restriction is that the code be non-singular, each source symbol could be alloted $ $k
1 1
to any codeword in the set {0, 1, 00, . . .} . or 2λ = 1/( i i+2 ) . Now using the definition Hk = j=1 j , it is obvious that
Thus we should allot the codewords 0 and 1 to the two most probable source m m+2
symbols, i.e., to probablities p 1 and p2 . Thus l1 = l2 = 1 . Similarly, l3 = l4 =
! 1 ! 1 1
= − 1 − = Hm+2 − H2 . (5.62)
l5 = l6 = 2 (corresponding to the codewords 00, 01, 10 and 11). The next 8 i=1
i+2 i=1
i 2
symbols will use codewords of length 3, etc. 1
Thus 2λ = Hm+2 −H2 , and
We will now find the general form for li . We can prove it by induction, but we will
$k−1 j
derive the result from first principles. Let c k = j=1 2 . Then by the arguments of 1 1
pi = (5.63)
the previous paragraph, all source symbols of index c k +1, ck +2, . . . , ck +2k = ck+1 Hm+2 − H2 i + 2
Data Compression 125 126 Data Compression

Substituting this value of pi in the expression for F (p) , we obtain We therefore have the following bounds on the average length of a non-singular
m m * + code
! ! i
F (p) = − pi log pi − pi log +1 (5.64) H(X) − log log |X | − 2 ≤ L∗1:1 ≤ H(X) + 1 (5.77)
i=1 i=1
2
m A non-singular code cannot do much better than an instantaneous code!
! i+2
= − pi log pi (5.65) 32. Bad wine. One is given 6 bottles of wine. It is known that precisely one bottle has gone
i=1
2
m bad (tastes terrible). From inspection of the bottles it is determined that the probability
! 1 8 6 4 2 2 1
= − pi log (5.66) pi that the ith bottle is bad is given by (p1 , p2 , . . . , p6 ) = ( 23 , 23 , 23 , 23 , 23 , 23 ) . Tasting
i=1
2(Hm+2 − H2 ) will determine the bad wine.
= log 2(Hm+2 − H2 ) (5.67) Suppose you taste the wines one at a time. Choose the order of tasting to minimize
Thus the extremal value of F (p) is log 2(H m+2 − H2 ) . We have not showed that the expected number of tastings required to determine the bad bottle. Remember, if
it is a maximum - that can be shown be taking the second derivative. But as usual, the first 5 wines pass the test you don’t have to taste the last.
it is easier to see it using relative entropy. Looking at the expressions above, we can
(a) What is the expected number of tastings required?
see that if we define qi = Hm+21 −H2 i+2 1
, then qi is a probability distribution (i.e.,
$ i+2 (b) Which bottle should be tasted first?
qi ≥ 0 , qi = 1 ). Also, 2= 1 1 , and substuting this in the expression
2(Hm+2 −H2 ) qi
for F (p) , we obtain Now you get smart. For the first sample, you mix some of the wines in a fresh glass and
m m * + sample the mixture. You proceed, mixing and tasting, stopping when the bad bottle
! ! i has been determined.
F (p) = − pi log pi − pi log +1 (5.68)
i=1 i=1
2
m
! i+2 (c) What is the minimum expected number of tastings required to determine the bad
= − pi log pi (5.69) wine?
i=1
2
m (d) What mixture should be tasted first?
! 1 1
= − pi log pi (5.70)
i=1
2(Hm+2 − H2 ) qi Solution: Bad Wine
m m
! pi ! 1 (a) If we taste one bottle at a time, to minimize the expected number of tastings the
= − pi log − pi log (5.71)
i=1
qi i=1 2(Hm+2 − H2 ) order of tasting should be from the most likely wine to be bad to the least. The
= log 2(Hm+2 − H2 ) − D(p||q) (5.72) expected number of tastings required is
≤ log 2(Hm+2 − H2 ) (5.73) 6
! 8 6 4 2 2 1
pi li = 1 × +2× +3× +4× +5× +5×
with equality iff p = q . Thus the maximum value of F (p) is log 2(H m+2 − H2 ) i=1
23 23 23 23 23 23
(f) 55
=
H(X) − L∗1:1 ≤ H(X) − L̃ (5.74) 23
= 2.39
≤ log 2(Hm+2 − H2 ) (5.75)
8
(b) The first bottle to be tasted should be the one with probability 23 .
The first inequality follows from the definition of L̃ and the second from the result
of the previous part. (c) The idea is to use Huffman coding. With Huffman coding, we get codeword lengths
as (2, 2, 2, 3, 4, 4) . The expected number of tastings required is
To complete the proof, we will use the simple inequality H k ≤ ln k + 1 , which can
be shown by integrating x1 between 1 and k . Thus Hm+2 ≤ ln(m + 2) + 1 , and 6
! 8 6 4 2 2 1
2(Hm+2 − H2 ) = 2(Hm+2 − 1 − 21 ) ≤ 2(ln(m + 2) + 1 − 1 − 21 ) ≤ 2(ln(m + 2)) = pi li = 2 × +2× +2× +3× +4× +4×
i=1
23 23 23 23 23 23
2 log(m + 2)/ log e ≤ 2 log(m + 2) ≤ 2 log m 2 = 4 log m where the last inequality
54
is true for m ≥ 2 . Therefore =
23
H(X) − L1:1 ≤ log 2(Hm+2 − H2 ) ≤ log(4 log m) = log log m + 2 (5.76) = 2.35

Data Compression 127 128 Data Compression

0$ 1
(d) The mixture of the first and second bottles should be tasted first. Thus log2 i 2Ti is the analog of entropy for this problem.

33. Huffman vs. Shannon. A random variable X takes on three values with probabil- Solution:
ities 0.6, 0.3, and 0.1.
Tree construction:
(a) What are the lengths of the binary Huffman codewords for X ? What are the
1
lengths of the binary Shannon codewords (l(x) = ⌈log( p(x) )⌉) for X ? (a) The proof is identical to the proof of optimality of Huffman coding. We first show
that for the optimal tree if Ti < Tj , then li ≥ lj . The proof of this is, as in the
(b) What is the smallest integer D such that the expected Shannon codeword length case of Huffman coding, by contradiction. Assume otherwise, i.e., that if T i < Tj
with a D -ary alphabet equals the expected Huffman codeword length with a D - and li < lj , then by exchanging the inputs, we obtain a tree with a lower total
ary alphabet? cost, since
Solution: Huffman vs. Shannon max{Ti + li , Tj + lj } ≥ max{Ti + lj , Tj + li } (5.81)
Thus the longest branches are associated with the earliest times.
(a) It is obvious that an Huffman code for the distribution (0.6,0.3,0.1) is (1,01,00),
with codeword lengths (1,2,2). The Shannon code would use lengths ⌈log p1 ⌉ , The rest of the proof is identical to the Huffman proof. We show that the longest
which gives lengths (1,2,4) for the three symbols. branches correspond to the two earliest times, and that they could be taken as
siblings (inputs to the same gate). Then we can reduce the problem to constructing
(b) For any D > 2 , the Huffman code for the three symbols are all one character. The
the optimal tree for a smaller problem. By induction, we extend the optimality to
Shannon code length ⌈log D 1p ⌉ would be equal to 1 for all symbols if log D 0.1
1
= 1,
the larger problem, proving the optimality of the above algorithm.
i.e., if D = 10 . Hence for D ≥ 10 , the Shannon code is also optimal.
Given any tree of gates, the earliest that the output corresponding to a particular
34. Huffman algorithm for tree construction. Consider the following problem: m signal would be available is Ti + li , since the signal undergoes li gate delays. Thus
binary signals S1 , S2 , . . . , Sm are available at times T1 ≤ T2 ≤ . . . ≤ Tm , and we maxi (Ti + li ) is a lower bound on the time at which the final answer is available.
would like to find their sum S1 ⊕ S2 ⊕ · · · ⊕ Sm using 2-input gates, each gate with The fact that the tree achieves this bound can be shown by induction. For any
1 time unit delay, so that the final result is available as quickly as possible. A simple internal node of the tree, the output is available at time equal to the maximum of
greedy algorithm is to combine the earliest two results, forming the partial result at the input times plus 1. Thus for the gates connected to the inputs T i and Tj , the
time max(T1 , T2 ) + 1 . We now have a new problem with S 1 ⊕ S2 , S3 , . . . , Sm , available output is available at time max(Ti , Tj ) + 1 . For any node, the output is available
at times max(T1 , T2 ) + 1, T3 , . . . , Tm . We can now sort this list of T’s, and apply the at time equal to maximum of the times at the leaves plus the gate delays to get
same merging step again, repeating this until we have the final result. from the leaf to the node. This result extneds to the complete tree, and for the
root, the time at which the final result is available is max i (Ti + li ) . The above
(a) Argue that the above procedure is optimal, in that it constructs a circuit for which algorithm minimizes this cost.
the final result is available as quickly as possible. $ $
(b) Let c1 = i 2Ti and c2 = i 2−li . By the Kraft inequality, c2 ≤ 1 . Now let
(b) Show that this procedure finds the tree that minimizes T
pi = $2 i Tj , and let ri = $2 −l
−li
. Clearly, pi and ri are probability mass
2 j 2
j j
C(T ) = max(Ti + li ) (5.78) functions. Also, we have Ti = log(pi c1 ) and li = − log(ri c2 ) . Then
i

where Ti is the time at which the result alloted to the i -th leaf is available, and C(T ) = max(Ti + li ) (5.82)
i
li is the length of the path from the i -th leaf to the root.
= max (log(pi c1 ) − log(ri c2 )) (5.83)
(c) Show that i
. / pi
! = log c1 − log c2 + max log (5.84)
C(T ) ≥ log 2 2Ti (5.79) i ri
i
Now the maximum of any random variable is greater than its average under any
for any tree T .
distribution, and therefore
(d) Show that there exists a tree such that ! pi
. / C(T ) ≥ log c1 − log c2 + pi log (5.85)
! ri
C(T ) ≤ log 2 2Ti +1 (5.80) i

i ≥ log c1 − log c2 + D(p||r) (5.86)

Data Compression 129 130 Data Compression

Since −logc2 ≥ 0 and D(p||r) ≥ 0 , we have (b) What word lengths l = (l1 , l2 , . . .) can arise from binary Huffman codes?

C(T ) ≥ log c1 (5.87) Solution: Optimal Word Lengths

We first answer (b) and apply the result to (a).
which is the desired result.
(b) Word lengths of a binary Huffman code must satisfy the Kraft inequality with
$ −li
(c) From the previous part, we achieve the lower bound if p i = ri and c2 = 1 . equality, i.e., i2 = 1 . An easy way to see this is the following: every node in
However, since the li ’s are constrained to be integers, we cannot achieve equality the tree has a sibling (property of optimal binary code), and if we assign each node a
in all cases. ‘weight’, namely 2−li , then 2 × 2−li is the weight of the father (mother) node. Thus,
$
Instead, if we let ‘collapsing’ the tree back, we have that i 2−li = 1 .
J K L $ Tj M
1 j2 (a) Clearly, (1, 2, 2) satisfies Kraft with equality, while (2, 2, 3, 3) does not. Thus,
li = log = log , (5.88)
pi 2Ti (1, 2, 2) can arise from Huffman code, while (2, 2, 3, 3) cannot.
$ $
it is easy to verify that 2−li ≤ pi = 1 , and that thus we can construct a tree 37. Codes. Which of the following codes are
that achieves !
Tj (a) uniquely decodable?
Ti + li ≤ log( 2 )+1 (5.89)
j (b) instantaneous?
for all i . Thus this tree achieves within 1 unit of the lower bound.
$ C1 = {00, 01, 0}
Clearly, log( j 2Tj ) is the equivalent of entropy for this problem! C2 = {00, 01, 100, 101, 11}
C3 = {0, 10, 110, 1110, . . .}
35. Generating random variables. One wishes to generate a random variable X C4 = {0, 00, 000, 0000}
)
1, with probability p Solution: Codes.
X= (5.90)
0, with probability 1 − p
(a) C1 = {00, 01, 0} is uniquely decodable (suffix free) but not instantaneous.
You are given fair coin flips Z1 , Z2 , . . . . Let N be the (random) number of flips needed (b) C2 = {00, 01, 100, 101, 11} is prefix free (instantaneous).
to generate X . Find a good way to use Z1 , Z2 , . . . to generate X . Show that EN ≤ 2 . (c) C3 = {0, 10, 110, 1110, . . .} is instantaneous
Solution: We expand p = 0.p1 p2 . . . as a binary number. Let U = 0.Z1 Z2 . . . , the se- (d) C4 = {0, 00, 000, 0000} is neither uniquely decodable or instantaneous.
quence Z treated as a binary number. It is well known that U is uniformly distributed 6 6 4 4 3 2
38. Huffman. Find the Huffman D -ary code for (p 1 , p2 , p3 , p4 , p5 , p6 ) = ( 25 , 25 , 25 , 25 , 25 , 25 )
on [0, 1) . Thus, we generate X = 1 if U < p and 0 otherwise.
and the expected word length
The procedure for generated X would therefore examine Z 1 , Z2 , . . . and compare with
p1 , p2 , . . . , and generate a 1 at the first time one of the Z i ’s is less than the correspond- (a) for D = 2 .
ing pi and generate a 0 the first time one of the Z i ’s is greater than the corresponding (b) for D = 4 .
pi ’s. Thus the probability that X is generated after seeing the first bit of Z is the Solution: Huffman Codes.
probability that Z1 %= p1 , i.e., with probability 1/2. Similarly, X is generated after 2
bits of Z if Z1 = p1 and Z2 %= p2 , which occurs with probability 1/4. Thus (a) D=2

1 1 1
EN = 1. + 2 + 3 + . . . + (5.91) 6 6 6 8 11 14 25
2 4 8
= 2 (5.92) 6 6 6 6 8 11
4 4 5 6 6
36. Optimal word lengths. 4 4 4 5
2 3 4
(a) Can l = (1, 2, 2) be the word lengths of a binary Huffman code. What about 2 2
(2,2,3,3)? 1

Data Compression 131 132 Data Compression

(a) Since the code is non-singular, the function X → C(X) is one to one, and hence
6 6 4 4 2 2 1
H(X) = H(C(X)) . (Problem 2.4)
pi 25 25 25 25 25 25 25
li 2 2 3 3 3 4 4 (b) Since the code is not uniquely decodable, the function X n → C(X n ) is many to
one, and hence H(X n ) ≥ H(C(X n )) .

40. Code rate.

7
! Let X be a random variable with alphabet {1, 2, 3} and distribution
E(l) = pi li 
i=1
 1,
 with probability 1/2
1 X= 2, with probability 1/4
= (6 × 2 + 6 × 2 + 4 × 3 + 4 × 3 + 2 × 3 + 2 × 4 + 1 × 4)
25 
 3, with probability 1/4.
66
= = 2.66
25 The data compression code for X assigns codewords

(b) D=4  0,
 if x = 1
C(x) = 10, if x = 2

 11, if x = 3.

6 9 25 Let X1 , X2 , . . . be independent identically distributed according to this distribution

6 6 and let Z1 Z2 Z3 . . . = C(X1 )C(X2 ) . . . be the string of binary symbols resulting from
4 6 concatenating the corresponding codewords. For example, 122 becomes 01010 .
4 4
2 (a) Find the entropy rate H(X ) and the entropy rate H(Z) in bits per symbol. Note
2 that Z is not compressible further.
1 (b) Now let the code be 
 00,
 if x = 1
C(x) = 10, if x = 2
6 6 4 4 2 2 1 
 01,
pi 25 25 25 25 25 25 25
if x = 3.
li 1 1 1 2 2 2 2
and find the entropy rate H(Z).
(c) Finally, let the code be
7 
 00, if x = 1
! 
E(l) = pi li
i=1
C(x) = 1, if x = 2

 01,
1 if x = 3.
= (6 × 1 + 6 × 1 + 4 × 1 + 4 × 2 + 2 × 2 + 2 × 2 + 1 × 2)
25
34 and find the entropy rate H(Z).
= = 1.36
25 Solution: Code rate.

39. Entropy of encoded bits. Let C : X −→ {0, 1} ∗ be a nonsingular but nonuniquely This is a slightly tricky question. There’s no straightforward rigorous way to calculate
decodable code. Let X have entropy H(X). the entropy rates, so you need to do some guessing.

(a) Compare H(C(X)) to H(X) . (a) First, since the Xi ’s are independent, H(X ) = H(X1 ) = 1/2 log 2+2(1/4) log(4) =
3/2.
(b) Compare H(C(X n )) to H(X n ) .
Now we observe that this is an optimal code for the given distribution on X ,
Solution: Entropy of encoded bits and since the probabilities are dyadic there is no gain in coding in blocks. So the
Data Compression 133 134 Data Compression

resulting process has to be i.i.d. Bern(1/2), (for otherwise we could get further 41. Optimal codes. Let l1 , l2 , . . . , l10 be the binary Huffman codeword lengths for the
compression from it). probabilities p1 ≥ p2 ≥ . . . ≥ p10 . Suppose we get a new distribution by splitting the
Therefore H(Z) = H(Bern(1/2)) = 1 . last probability mass. What can you say about the optimal binary codeword lengths
l˜1 , l˜2 , . . . , l11
˜ for the probabilities p1 , p2 , . . . , p9 , αp10 , (1 − α)p10 , where 0 ≤ α ≤ 1 .
(b) Here it’s easy.
Solution: Optimal codes.
H(Z1 , Z2 , . . . , Zn )
H(Z) = lim
n→∞ n
To construct a Huffman code, we first combine the two smallest probabilities. In this
H(X1 , X2 , . . . , Xn/2 )
= lim case, we would combine αp10 and (1 − α)p10 . The result of the sum of these two
n→∞ n probabilities is p10 . Note that the resulting probability distribution is now exactly the
H(X ) n2
= lim same as the original probability distribution. The key point is that an optimal code
n→∞ n for p1 , p2 , . . . , p10 yields an optimal code (when expanded) for p 1 , p2 , . . . , p9 , αp10 , (1 −
= 3/4.
α)p10 . In effect, the first 9 codewords will be left unchanged, while the 2 new code-
(We’re being a little sloppy and ignoring the fact that n above may not be a even, words will be XXX0 and XXX1 where XXX represents the last codeword of the
but in the limit as n → ∞ this doesn’t make a difference). original distribution.

(c) This is the tricky part. In short, the lengths of the first 9 codewords remain unchanged, while the lengths of
the last 2 codewords (new codewords) are equal to l 10 + 1 .
Suppose we encode the first n symbols X1 X2 · · · Xn into
42. Ternary codes. Which of the following codeword lengths can be the word lengths of
Z1 Z2 · · · Zm = C(X1 )C(X2 ) · · · C(Xn ). a 3-ary Huffman code and which cannot?
Here m = L(C(X1 ))+L(C(X2 ))+· · ·+L(C(Xn )) is the total length of the encoded (a) (1, 2, 2, 2, 2)
sequence (in bits), and L is the (binary) length function. Since the concatenated
codeword sequence is an invertible function of (X 1 , . . . , Xn ) , it follows that (b) (2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3)

nH(X ) = H(X1 X2 · · · Xn ) = H(Z1 Z2 · · · Z$n L(C(Xi )) ) (5.93) Solution: Ternary codes.

(a) The word lengths (1, 2, 2, 2, 2) CANNOT be the word lengths for a 3-ary Huffman
The first equality above is trivial since the X i ’s are independent. Similarly, may
code. This can be seen by drawing the tree implied by these lengths, and seeing
guess that the right-hand-side above can be written as
that one of the codewords of length 2 can be reduced to a codeword of length 1
n
! which is shorter. Since the Huffman tree produces the minimum expected length
H(Z1 Z2 · · · Z$n L(C(Xi )) ) = E[ L(C(Xi ))]H(Z) tree, these codeword lengths cannot be the word lengths for a Huffman tree.
1
i=1
= nE[L(C(X1 ))]H(Z) (5.94) (b) The word lengths (2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3) ARE the word lengths for a 3-ary
$
Huffman code. Again drawing the tree will verify this. Also, i 3−li = 8 × 3−2 +
(This is not trivial to prove, but it is true.) 3 × 3−3 = 1 , so these word lengths satisfy the Kraft inequality with equality.
Combining the left-hand-side of (5.93) with the right-hand-side of (5.94) yields Therefore the word lengths are optimal for some distribution, and are the word
lengths for a 3-ary Huffman code.
H(X )
H(Z) = 43. Piecewise Huffman. Suppose the codeword that we use to describe a random variable
E[L(C(X1 ))]
3/2 X ∼ p(x) always starts with a symbol chosen from the set {A, B, C} , followed by binary
= digits {0, 1} . Thus we have a ternary code for the first symbol and binary thereafter.
7/4
Give the optimal uniquely decodeable code (minimum expected number of symbols) for
6
= , the probability distribution
7
* +
$3 16 15 12 10 8 8
where E[L(C(X1 ))] = x=1 p(x)L(C(x)) = 7/4. p= , , , , , . (5.95)
69 69 69 69 69 69

Data Compression 135 136 Data Compression

Solution: Piecewise Huffman. (e) Use Markov’s inequality Pr{X ≥ tµ} ≤ 1t , to show that the probability of error
Codeword (one or more wrong object remaining) goes to zero as n −→ ∞ .
a x1 16 16 22 31 69
Solution: Random “20” questions.
b1 x2 15 16 16 22
c1 x3 12 15 16 16 (a) Obviously, Huffman codewords for X are all of length n . Hence, with n deter-
c0 x4 10 12 15 ministic questions, we can identify an object out of 2 n candidates.
b01 x5 8 10
(b) Observe that the total number of subsets which include both object 1 and object
b00 x6 8
2 or neither of them is 2m−1 . Hence, the probability that object 2 yields the same
Note that the above code is not only uniquely decodable, but it is also instantaneously answers for k questions as object 1 is (2m−1 /2m )k = 2−k .
decodable. Generally given a uniquely decodable code, we can construct an instan- More information theoretically, we can view this problem as a channel coding
taneous code with the same codeword lengths. This is not the case with the piece- problem through a noiseless channel. Since all subsets are equally likely, the
wise Huffman construction. There exists a code with smaller expected lengths that is probability the object 1 is in a specific random subset is 1/2 . Hence, the question
uniquely decodable, but not instantaneous. whether object 1 belongs to the k th subset or not corresponds to the k th bit of
Codeword the random codeword for object 1, where codewords X k are Bern( 1/2 ) random
a k -sequences.
b Object Codeword
c 1 0110 . . . 1
a0 2 0010 . . . 0
b0 ..
c0 .
0 1 Now we observe a noiseless output Y k of X k and figure out which object was
1 1 1
44. Huffman. Find the word lengths of the optimal binary encoding of p = 100 , 100 , . . . , 100 . sent. From the same line of reasoning as in the achievability proof of the channel
Solution: Huffman. coding theorem, i.e. joint typicality, it is obvious the probability that object 2 has
the same codeword as object 1 is 2−k .
Since the distribution is uniform the Huffman tree will consist of word lengths of
(c) Let
⌈log(100)⌉ = 7 and ⌊log(100)⌋ = 6 . There are 64 nodes of depth 6, of which (64-
k ) will be leaf nodes; and there are k nodes of depth 6 which will form 2k leaf nodes )
1, object j yields the same answers for k questions as object 1
of depth 7. Since the total number of leaf nodes is 100, we have 1j = ,
0, otherwise
(64 − k) + 2k = 100 ⇒ k = 36. for j = 2, . . . , m.
So there are 64 - 36 = 28 codewords of word length 6, and 2 × 36 = 72 codewords of Then,
word length 7.
m
!
45. Random “20” questions. Let X be uniformly distributed over {1, 2, . . . , m} . As- E(# of objects in {2, 3, . . . , m} with the same answers) = E( 1j )
j=2
sume m = 2n . We ask random questions: Is X ∈ S1 ? Is X ∈ S2 ?...until only one
m
integer remains. All 2m subsets of {1, 2, . . . , m} are equally likely. !
= E(1j )
(a) How many deterministic questions are needed to determine X ? j=2
!m
(b) Without loss of generality, suppose that X = 1 is the random object. What is = 2−k
the probability that object 2 yields the same answers for k questions as object 1? j=2
(c) What is the expected number of objects in {2, 3, . . . , m} that have the same = (m − 1)2−k
answers to the questions as does the correct object 1? = (2n − 1)2−k .
√
(d) Suppose we ask n + n random questions. What is the expected number of
√ √
wrong objects agreeing with the answers? (d) Plugging k = n + n into (c) we have the expected number of (2n − 1)2−n− n.
Data Compression 137

(e) Let N by the number of wrong objects remaining. Then, by Markov’s inequality

P (N ≥ 1) ≤ EN
√
= (2n − 1)2−n− n
√
≤ 2− n

→ 0,
Chapter 6
where the first equality follows from part (d).

Gambling and Data Compression

1. Horse race. Three horses run a race. A gambler offers 3-for-1 odds on each of the
horses. These are fair odds under the assumption that all horses are equally likely to
win the race. The true win probabilities are known to be

1 1 1
* +
p = (p1 , p2 , p3 ) = , , . (6.1)
2 4 4
$
Let b = (b1 , b2 , b3 ) , bi ≥ 0 , bi = 1 , be the amount invested on each of the horses.
The expected log wealth is thus
3
!
W (b) = pi log 3bi . (6.2)
i=1

(a) Maximize this over b to find b∗ and W ∗ . Thus the wealth achieved in repeated
horse races should grow to infinity like 2nW with probability one.
∗

(b) Show that if instead we put all of our money on horse 1, the most likely winner,
we will eventually go broke with probability one.

Solution: Horse race.

(a) The doubling rate

!
W (b) = pi log bi oi (6.3)
i
!
= pi log 3bi (6.4)
i
! ! ! pi
= pi log 3 + pi log pi − pi log (6.5)
bi
= log 3 − H(p) − D(p||b) (6.6)
≤ log 3 − H(p), (6.7)
139

140 Gambling and Data Compression Gambling and Data Compression 141
$m
with equality iff p = b . Hence b∗ = p = ( 21 , 14 , 14 ) and W ∗ = log 3 − H( 21 , 14 , 14 ) = over all choices b with bi ≥ 0 and i=0 bi = 1.
1 9
2 log 8 = 0.085 . Approach 1: Relative Entropy
By the strong law of large numbers,
B
We try to express W (b, p) as a sum of relative entropies.
Sn = 3b(Xj ) (6.8) !
j W (b, p) = pi log(b0 + bi oi ) (6.15)
1
$ . b0 /
= 2
n( n j
log 3b(Xj ))
(6.9) ! oi + bi
= pi log 1 (6.16)
→ 2nE log 3b(X) (6.10) oi
nW (b)
. b0 /
= 2 (6.11) ! oi + bi p i
= pi log 1 (6.17)
(6.12) pi oi
!
. nW ∗ = pi log pi oi + log K − D(p||r), (6.18)
When b = b∗ , W (b) = W ∗ and Sn =2 = 20.085n = (1.06)n .
(b) If we put all the money on the first horse, then the probability that we do not where
go broke in n races is ( 21 )n . Since this probability goes to zero with n , the ! b0 ! 1 ! ! 1
K= ( + bi ) = b 0 + bi = b0 ( − 1) + 1, (6.19)
probability of the set of outcomes where we do not ever go broke is zero, and we oi oi oi
will go broke with probability 1. and
b0
Alternatively, if b = (1, 0, 0) , then W (b) = −∞ and + bi oi
ri = (6.20)
K
Sn → 2nW = 0 w.p.1 (6.13)
is a kind of normalized portfolio. Now both K and r depend on the choice of b . To
by the strong law of large numbers. maximize W (b, p) , we must maximize log K and at the same time minimize D(p||r) .
Let us consider the two cases:
2. Horse race with subfair odds. If the odds are bad (due to a track take) the
1
(a)
$
gambler may wish to keep money in his pocket. Let b(0) be the amount in his oi ≤ 1 . This is the case of superfair or fair odds. In these cases, it seems intu-
pocket and let b(1), b(2), . . . , b(m) be the amount bet on horses 1, 2, . . . , m , with itively clear that we should put all of our money in the race. For example, in the
odds o(1), o(2), . . . , o(m) , and win probabilities p(1), p(2), . . . , p(m) . Thus the result- case of a superfair gamble, one could invest any cash using a “Dutch book” (in-
ing wealth is S(x) = b(0) + b(x)o(x), with probability p(x), x = 1, 2, . . . , m. vesting inversely proportional to the odds) and do strictly better with probability
1.
(a) Find b∗ maximizing E log S if
$
1/o(i) < 1. Examining the expression for K , we see that K is maximized for b 0 = 0 . In this
(b) Discuss b∗ if
$
1/o(i) > 1. (There isn’t an easy closed form solution in this case, case, setting bi = pi would imply that ri = pi and hence D(p||r) = 0 . We have
but a “water-filling” solution results from the application of the Kuhn-Tucker succeeded in simultaneously maximizing the two variable terms in the expression
conditions.) for W (b, p) and this must be the optimal solution.
Hence, for fair or superfair games, the gambler should invest all his money in the
Solution: (Horse race with a cash option). race using proportional gambling, and not leave anything aside as cash.
Since in this case, the gambler is allowed to keep some of the money as cash, the (b) 1
oi > 1 . In this case, sub-fair odds, the argument breaks down. Looking at the
mathematics becomes more complicated. In class, we used two different approaches to expression for K , we see that it is maximized for b 0 = 1 . However, we cannot
prove the optimality of proportional betting when the gambler is not allowed keep any simultaneously minimize D(p||r) .
of the money as cash. We will use both approaches for this problem. But in the case
If pi oi ≤ 1 for all horses, then the first term in the expansion of W (b, p) , that
of subfair odds, the relative entropy approach breaks down, and we have to use the
is,
$
pi log pi oi is negative. With b0 = 1 , the best we can achieve is proportional
calculus approach.
betting, which sets the last term to be 0. Hence, with b 0 = 1 , we can only achieve a
The setup of the problem is straight-forward. We want to maximize the expected log negative expected log return, which is strictly worse than the 0 log return achieved
return, i.e., be setting b0 = 1 . This would indicate, but not prove, that in this case, one should
m
W (b, p) = E log S(X) =
!
pi log(b0 + bi oi ) (6.14) leave all one’s money as cash. A more rigorous approach using calculus will prove
i=1 this.
142 Gambling and Data Compression Gambling and Data Compression 143

We can however give a simple argument to show that in the case of sub-fair odds, Differentiating w.r.t. λ , we get the constraint
the gambler should leave at least some of his money as cash and that there is !
at least one horse on which he does not bet any money. We will prove this by bi = 1. (6.30)
contradiction—starting with a portfolio that does not satisfy these criteria, we will
generate one which does better with probability one. The solution to these three equations, if they exist, would give the optimal portfolio b .
But substituting the first equation in the second, we obtain the following equation
Let the amount bet on each of the horses be (b 1 , b2 , . . . , bm ) with m
$
i=1 bi = 1 , so
that there is no money left aside. Arrange the horses in order of decreasing b i oi , ! 1
so that the m -th horse is the one with the minimum product. λ = λ. (6.31)
oi
Consider a new portfolio with
1
Clearly in the case when
$
%= 1 , the only solution to this equation is λ = 0 ,
oi
bm om
b'i = bi − (6.21) which indicates that the solution is on the boundary of the region over which the
oi maximization is being carried out. Actually, we have been quite cavalier with the
$
for all i . Since bi oi ≥ bm om for all i , b'i ≥ 0 . We keep the remaining money, i.e., setup of the problem—in addition to the constraint bi = 1 , we have the inequality
m m *
constraints bi ≥ 0 . We should have allotted a Lagrange multiplier to each of these.
bm om
+
Rewriting the functional with Lagrange multipliers
! !
1− b'i = 1 − bi − (6.22)
i=1 i=1
oi
m m
. /
m ! ! !
! bm om J(b) = pi log(b0 + bi oi ) + λ bi + γi bi (6.32)
= (6.23)
i=1
oi i=1 i=0

as cash. Differentiating with respect to bi , we obtain

The return on the new portfolio if horse i wins is ∂J pi oi
m
= + λ + γi = 0. (6.33)
*
bm om
+
bm om ∂bi b0 + bi oi
b'i oi =
!
bi − oi + (6.24)
oi i=1
oi Differentiating with respect to b0 , we obtain
m
. /
! 1 m
= bi oi + bm om −1 (6.25) ∂J ! pi
i=1
oi = + λ + γ0 = 0. (6.34)
∂b0 b + bi oi
i=1 0
> bi oi , (6.26)
$ Differentiating w.r.t. λ , we get the constraint
since 1/oi > 1 . Hence irrespective of which horse wins, the new portfolio does
better than the old one and hence the old portfolio could not be optimal. !
bi = 1. (6.35)
Approach 2: Calculus
Now, carrying out the same substitution, we get
We set up the functional using Lagrange multipliers as before:
! 1 ! γi
m m λ + γ0 = λ + , (6.36)
. /
oi oi
! !
J(b) = pi log(b0 + bi oi ) + λ bi (6.27)
i=1 i=0
1
which indicates that if
$
oi %= 1 , at least one of the γ ’s is non-zero, which indicates
Differentiating with respect to bi , we obtain that the corresponding constraint has become active, which shows that the solution is
∂J pi oi on the boundary of the region.
= + λ = 0. (6.28)
∂bi b0 + bi oi In the case of solutions on the boundary, we have to use the Kuhn-Tucker conditions
Differentiating with respect to b0 , we obtain to find the maximum. These conditions are described in Gallager[7], pg. 87. The
m
conditions describe the behavior of the derivative at the maximum of a concave function
∂J ! pi over a convex region. For any coordinate which is in the interior of the region, the
= + λ = 0. (6.29)
∂b0 b + bi oi
i=1 0 derivative should be 0. For any coordinate on the boundary, the derivative should be

144 Gambling and Data Compression Gambling and Data Compression 145

negative in the direction towards the interior of the region. More formally, for a concave Claim: The optimal strategy for the horse race when the odds are subfair and
function F (x1 , x2 , . . . , xn ) over the region xi ≥ 0 , some of the pi oi are greater than 1 is: set

∂F ≤ 0 if xi = 0 b0 = Ct , (6.43)
(6.37)
∂xi = 0 if xi > 0
and for i = 1, 2, . . . , t , set
Applying the Kuhn-Tucker conditions to the present maximization, we obtain Ct
bi = pi − , (6.44)
oi
pi oi ≤0 if bi = 0
+λ (6.38) and for i = t + 1, . . . , m , set
b0 + bi oi =0 if bi > 0
bi = 0. (6.45)
and
! pi ≤0 if b0 = 0 The above choice of b satisfies the Kuhn-Tucker conditions with λ = 1 . For b 0 ,
+λ (6.39) the Kuhn-Tucker condition is
b0 + bi oi =0 if b0 > 0
t m t
Theorem 4.4.1 in Gallager[7] proves that if we can find a solution to the Kuhn-Tucker pi 1 pi 1 1 − ti=1 pi
$
! ! ! !
= + = + = 1. (6.46)
conditions, then the solution is the maximum of the function in the region. Let us bo + bi oi o C o Ct
i=1 i i=t+1 t i=1 i
consider the two cases:
1 For 1 ≤ i ≤ t , the Kuhn Tucker conditions reduce to
(a)
$
oi ≤ 1 . In this case, we try the solution we expect, b 0 = 0 , and bi = pi .
Setting λ = −1 , we find that all the Kuhn-Tucker conditions are satisfied. Hence, pi oi pi oi
= = 1. (6.47)
this is the optimal portfolio for superfair or fair odds. b0 + bi oi pi oi
1
(b)
$
oi> 1 . In this case, we try the expected solution, b 0 = 1 , and bi = 0 . We find For t + 1 ≤ i ≤ m , the Kuhn Tucker conditions reduce to
that all the Kuhn-Tucker conditions are satisfied if all p i oi ≤ 1 . Hence under this
condition, the optimum solution is to not invest anything in the race but to keep pi oi pi oi
= ≤ 1, (6.48)
everything as cash. b0 + bi oi Ct
In the case when some pi oi > 1 , the Kuhn-Tucker conditions are no longer satisfied by the definition of t . Hence the Kuhn Tucker conditions are satisfied, and this
by b0 = 1 . We should then invest some money in the race; however, since the is the optimal solution.
denominator of the expressions in the Kuhn-Tucker conditions also changes, more
than one horse may now violate the Kuhn-Tucker conditions. Hence, the optimum 3. Cards. An ordinary deck of cards containing 26 red cards and 26 black cards is shuffled
solution may involve investing in some horses with p i oi ≤ 1 . There is no explicit and dealt out one card at at time without replacement. Let X i be the color of the ith
form for the solution in this case. card.
The Kuhn Tucker conditions for this case do not give rise to an explicit solution.
(a) Determine H(X1 ).
Instead, we can formulate a procedure for finding the optimum distribution of
capital: (b) Determine H(X2 ).
Order the horses according to pi oi , so that (c) Does H(Xk | X1 , X2 , . . . , Xk−1 ) increase or decrease?
(d) Determine H(X1 , X2 , . . . , X52 ).
p1 o1 ≥ p2 o2 ≥ · · · ≥ pm om . (6.40)

Define  Solution:
$k
1− pi
if k ≥ 1

 $ki=1 1 (a) P(first card red) = P(first card black) = 1/2 . Hence H(X 1 ) = (1/2) log 2 +
Ck = 1− i=1 oi (6.41)
 1 (1/2) log 2 = log 2 = 1 bit.
if k = 0

(b) P(second card red) = P(second card black) = 1/2 by symmetry. Hence H(X 2 ) =
Define
(1/2) log 2 + (1/2) log 2 = log 2 = 1 bit. There is no change in the probability from
t = min{n|pn+1 on+1 ≤ Cn }. (6.42)
X1 to X2 (or to Xi , 1 ≤ i ≤ 52 ) since all the permutations of red and black
Clearly t ≥ 1 since p1 o1 > 1 = C0 . cards are equally likely.
146 Gambling and Data Compression Gambling and Data Compression 147

(c) Since all permutations are equally likely, the joint distribution of X k and X1 , . . . , Xk−1 5. Beating the public odds. Consider a 3-horse race with win probabilities
is the same as the joint distribution of X k+1 and X1 , . . . , Xk−1 . Therefore
1 1 1
(p1 , p2 , p3 ) = ( , , )
H(Xk |X1 , . . . , Xk−1 ) = H(Xk+1 |X1 , . . . , Xk−1 ) ≥ H(Xk+1 |X1 , . . . , Xk ) (6.49) 2 4 4

and so the conditional entropy decreases as we proceed along the sequence. and fair odds with respect to the (false) distribution
Knowledge of the past reduces uncertainty and thus means that the conditional 1 1 1
entropy of the k -th card’s color given all the previous cards will decrease as k (r1 , r2 , r3 ) = ( , , ) .
4 4 2
increases.
(d) All 52
, - Thus the odds are
26 possible sequences of 26 red cards and 26 black cards are equally likely.
Thus (o1 , o2 , o3 ) = (4, 4, 2) .
. /
52 (a) What is the entropy of the race?
H(X1 , X2 , . . . , X52 ) = log = 48.8 bits (3.2 bits less than 52) (6.50)
26 (b) Find the set of bets (b1 , b2 , b3 ) such that the compounded wealth in repeated plays
will grow to infinity.
4. Gambling. Suppose one gambles sequentially on the card outcomes in Problem 3.
Even odds of 2-for-1 are paid. Thus the wealth S n at time n is Sn = 2n b(x1 , x2 , . . . , xn ),
Solution: Beating the public odds.
where b(x1 , x2 , . . . , xn ) is the proportion of wealth bet on x1 , x2 , . . . , xn . Find maxb(·) E log S52 .
Solution: Gambling on red and black cards. (a) The entropy of the race is given by

E[log Sn ] = E[log[2n b(X1 , X2 , ..., Xn )]] (6.51) 1 1 1

H(p) = log 2 + log 4 + log 4
= n log 2 + E[log b(X)] (6.52) 2 4 4
! 3
= n+ p(x) log b(x) (6.53) = .
x∈X n
2
! b(x) (b) Compounded wealth will grow to infinity for the set of bets (b 1 , b2 , b3 ) such that
= n+ p(x)[log − log p(x)] (6.54)
x∈X n
p(x) W (b, p) > 0 where
= n + D(p(x)||b(x)) − H(X). (6.55)
W (b, p) = D(p5r) − D(p5b)
Taking p(x) = b(x) makes D(p(x)||b(x)) = 0 and maximizes E log S 52 . 3
! bi
= pi log .
i=1
ri
max E log S52 = 52 − H(X) (6.56) Calculating D(p5r) , this criterion becomes
b(x)
52! 1
= 52 − log (6.57) D(p5b) < .
26!26! 4
= 3.2 (6.58)
6. Horse race: A 3 horse race has win probabilities p = (p 1 , p2 , p3 ) , and odds o =
Alternatively, as in the horse race, proportional betting is log-optimal. Thus b(x) =
$
(1, 1, 1) . The gambler places bets b = (b 1 , b2 , b3 ) , bi ≥ 0, bi = 1 , where bi denotes
p(x) and, regardless of the outcome, the proportion on wealth bet on horse i . These odds are very bad. The gambler gets
his money back on the winning horse and loses the other bets. Thus the wealth S n at
252
S52 = ,52- = 9.08. (6.59) time n resulting from independent gambles goes expnentially to zero.
26
(a) Find the exponent.
and hence
log S52 = max E log S52 = log 9.08 = 3.2. (6.60) (b) Find the optimal gambling scheme b , i.e., the bet b ∗ that maximizes the expo-
b(x) nent.

148 Gambling and Data Compression Gambling and Data Compression 149

(c) Assuming b is chosen as in (b), what distribution p causes S n to go to zero at (b) What is the optimal growth rate that you can achieve in this game?
the fastest rate? (c) If (f1 , f2 , . . . , f8 ) = (1/8, 1/8, 1/4, 1/16, 1/16, 1/16, 1/4, 1/16) , and you start with
Solution: Minimizing losses. $1, how long will it be before you become a millionaire?

(a) Despite the bad odds, the optimal strategy is still proportional gambling. Thus Solution:
the optimal bets are b = p , and the exponent in this case is
(a) The probability of winning does not depend on the number you choose, and there-
W∗ =
!
pi log pi = −H(p). (6.61) fore, irrespective of the proportions of the other players, the log optimal strategy
i is to divide your money uniformly over all the tickets.
(b) The optimal gambling strategy is still proportional betting. (b) If there are n people playing, and f i of them choose number i , then the number
of people sharing the jackpot of n dollars is nf i , and therefore each person gets
(c) The worst distribution (the one that causes the doubling rate to be as negative as
n/nfi = 1/fi dollars if i is picked at the end of the day. Thus the odds for number
possible) is that distribution that maximizes the entropy. Thus the worst W ∗ is
i is 1/fi , and does not depend on the number of people playing.
− log 3 , and the gambler’s money goes to zero as 3 −n .
Using the results of Section 6.1, the optimal growth rate is given by
7. Horse race. Consider a horse race with 4 horses. Assume that each of the horses pays !1 1
W ∗ (p) =
!
4-for-1 if it wins. Let the probabilities of winning of the horses be { 12 , 14 , 18 , 18 } . If you pi log oi − H(p) = log − log 8 (6.62)
8 fi
started with $100 and bet optimally to maximize your long term growth rate, what
are your optimal bets on each horse? Approximately how much money would you have (c) Substituing these fraction in the previous equation we get
after 20 races with this strategy ? 1! 1
W ∗ (p) = log − log 8 (6.63)
Solution: Horse race. The optimal betting strategy is proportional betting, i.e., divid- 8 fi
ing the investment in proportion to the probabilities of each horse winning. Thus the 1
bets on each horse should be (50%, 25%,12.5%,12.5%), and the growth rate achieved = (3 + 3 + 2 + 4 + 4 + 4 + 2 + 4) − 3 (6.64)
8
by this strategy is equal to log 4 − H(p) = log 4 − H( 21 , 14 , 18 , 18 ) = 2 − 1.75 = 0.25 . After = 0.25 (6.65)
20 races with this strategy, the wealth is approximately 2 nW = 25 = 32 , and hence the
wealth would grow approximately 32 fold over 20 races. and therefore after N days, the amount of money you would have would be approx-
imately 20.25N . The number of days before this crosses a million = log 2 (1, 000, 000)/0.25 =
8. Lotto. The following analysis is a crude approximation to the games of Lotto conducted 79.7 , i.e., in 80 days, you should have a million dollars.
by various states. Assume that the player of the game is required pay $1 to play and is There are many problems with the analysis, not the least of which is that the
asked to choose 1 number from a range 1 to 8. At the end of every day, the state lottery state governments take out about half the money collected, so that the jackpot
commission picks a number uniformly over the same range. The jackpot, i.e., all the is only half of the total collections. Also there are about 14 million different
money collected that day, is split among all the people who chose the same number as possible tickets, and it is therefore possible to use a uniform distribution using $1
the one chosen by the state. E.g., if 100 people played today, and 10 of them chose the tickets only if we use capital of the order of 14 million dollars. And with such
number 2, and the drawing at the end of the day picked 2, then the $100 collected is large investments, the proportions of money bet on the different possibilities will
split among the 10 people, i.e., each of persons who picked 2 will receive $10, and the change, which would further complicate the analysis.
others will receive nothing. However, the fact that people’s choices are not uniform does leave a loophole
The general population does not choose numbers uniformly - numbers like 3 and 7 are that can be exploited. Under certain conditions, i.e., if the accumulated jackpot
supposedly lucky and are more popular than 4 or 8. Assume that the fraction of people has reached a certain size, the expected return can be greater than 1, and it is
choosing the various numbers 1, 2, . . . , 8 is (f 1 , f2 , . . . , f8 ) , and assume that n people worthwhile to play, despite the 50% cut taken by the state. But under normal
play every day. Also assume that n is very large, so that any single person’s choice circumstances, the 50% cut of the state makes the odds in the lottery very unfair,
choice does not change the proportion of people betting on any number. and it is not a worthwhile investment.

(a) What is the optimal strategy to divide your money among the various possible 9. Horse race. Suppose one is interested in maximizing the doubling rate for a horse
tickets so as to maximize your long term growth rate? (Ignore the fact that you race. Let p1 , p2 , . . . , pm denote the win probabilities of the m horses. When do the
cannot buy fractional tickets.) odds (o1 , o2 , . . . , om ) yield a higher doubling rate than the odds (o '1 , o'2 , . . . , o'm ) ?
150 Gambling and Data Compression Gambling and Data Compression 151

Solution: Horse Race )

Let W and W ' denote the optimal doubling rates for the odds (o 1 , o2 , . . . , om ) and X, with probability 1 − p(x)
Z= (6.67)
(o'1 , o'2 , . . . , o'm ) respectively. By Theorem 6.1.2 in the book, Y, with probability p(x)
!
W = pi log oi − H(p), and (a) Show that E(X) = E(Y ) = 3b
.
! 2
W' = pi log o'i − H(p) (b) Show that E(Y /X) = 5/4 . Since the expected ratio of the amount in the other
where p is the probability vector (p 1 , p2 , . . . , pm ) . Then W > W' exactly when envelope to the one in hand is 5/4, it seems that one should always switch.
$
pi log oi > pi log o'i ; that is, when
$ (This is the origin of the switching paradox.) However, observe that E(Y ) %=
E(X)E(Y /X) . Thus, although E(Y /X) > 1 , it does not follow that E(Y ) >
E log oi > E log o'i . E(X) .
(c) Let J be the index of the envelope containing the maximum amount of money,
and let J ' be the index of the envelope chosen by the algorithm. Show that for
10. Horse race with probability estimates
any b , I(J; J ' ) > 0 . Thus the amount in the first envelope always contains some
(a) Three horses race. Their probabilities of winning are ( 21 , 14 , 14 ) . The odds are information about which envelope to choose.
(4-for-1, 3-for-1 and 3-for-1). Let W ∗ be the optimal doubling rate. (d) Show that E(Z) > E(X) . Thus you can do better than always staying or always
Suppose you believe the probabilities are ( 14 , 12 , 14 ) . If you try to maximize the switching. In fact, this is true for any monotonic decreasing switching function
doubling rate, what doubling rate W will you achieve? By how much has your p(x) . By randomly switching according to p(x) , you are more likely to trade up
doubling rate decreased due to your poor estimate of the probabilities, i.e., what than trade down.
is ∆W = W ∗ − W ?
(b) Now let the horse race be among m horses, with probabilities p = (p 1 , p2 , . . . , pm ) Solution: Two envelope problem:
and odds o = (o1 , o2 , . . . , om ) . If you believe the true probabilities to be q =
(a) X = b or 2b with prob. 1/2, and therefore E(X) = 1.5b . Y has the same
(q1 , q2 , . . . , qm ) , and try to maximize the doubling rate W , what is W ∗ − W ?
unconditional distribution.
Solution: Horse race with probability estimates (b) Given X = x , the other envelope contains 2x with probability 1/2 and contains
(a) If you believe that the probabilities of winning are ( 14 , 12 , 14 ) , you would bet pro- x/2 with probability 1/2. Thus E(Y /X) = 5/4 .
portional to this, and would achieve a growth rate pi log bi oi = 21 log 4 14 +
$
(c) Without any conditioning, J = 1 or 2 with probability (1/2,1/2). By symmetry,
1 1 1 1 1 9
4 log 3 2 + 4 log 3 4 = 4 log 8 . If you bet according to the true probabilities, you it is not difficult to see that the unconditional probability distribution of J ' is also
1 1 1
would bet ( 2 , 4 , 4 ) on the three horses, achieving a growth rate
$
pi log bi oi = the same. We will now show that the two random variables are not independent,
1 1 1 1 1 1 1 3
2 log 4 2 + 4 log 3 4 + 4 log 3 4 = 2 log 2 . The loss in growth rate due to incorrect es- and therefore I(J; J ' ) %= 0 . To do this, we will calculate the conditional probability
timation of the probabilities is the difference between the two growth rates, which P (J ' = 1|J = 1) .
is 14 log 2 = 0.25 . Conditioned on J = 1 , the probability that X = b or 2b is still (1/2,1/2). How-
(b) For m horses, the growth rate with the true distribution is ever, conditioned on (J = 1, X = 2b) , the probability that Z = X , and therefore
$
pi log pi oi , and
with the incorrect estimate is J ' = 1 is p(2b) . Similary, conditioned on (J = 1, X = b) , the probability that
$
pi log qi oi . The difference between the two is
p1
J ' = 1 is 1 − p(b) . Thus,
$
pi log qi = D(p||q) .
11. The two envelope problem: One envelope contains b dollars, the other 2b dollars. P (J ' = 1|J = 1) = P (X = b|J = 1)P (J ' = 1|X = b, J = 1)
The amount b is unknown. An envelope is selected at random. Let X be the amount
+P (X = 2b|J = 1)P (J ' = 1|X = 2b, J = 1) (6.68)
observed in this envelope, and let Y be the amount in the other envelope.
1 1
Adopt the strategy of switching to the other envelope with probability p(x) , where = (1 − p(b)) + p(2b) (6.69)
2 2
e−x
p(x) = e−x +ex . Let Z be the amount that the player receives. Thus 1 1
= + (p(2b) − p(b)) (6.70)
) 2 2
(b, 2b), with probability 1/2 1
(X, Y ) = (6.66) > (6.71)
(2b, b), with probability 1/2
2

152 Gambling and Data Compression Gambling and Data Compression 153

Thus the conditional distribution is not equal to the unconditional distribution Therefore the minimum value of the growth rate occurs when p i = qi . This
and J and J ' are not independent. is the0 distribution that minimizes the growth rate, and the minimum value is
$ 11
(d) We use the above calculation of the conditional distribution to calculate E(Z) . − log j oj .
Without loss of generality, we assume that J = 1 , i.e., the first envelope contains (b) The maximum growth rate occurs when the horse with the maximum odds wins
2b . Then in all the races, i.e., pi = 1 for the horse that provides the maximum odds
E(Z|J = 1) = P (X = b|J = 1)E(Z|X = b, J = 1) 13. Dutch book. Consider a horse race with m = 2 horses,
+P (X = 2b|J = 1)E(Z|X = 2b, J = 1) (6.72)
1 1 X = 1, 2
= E(Z|X = b, J = 1) + E(Z|X = 2b, J = 1) (6.73)
2 2 p = 1/2, 1/2
1,
= p(J ' = 1|X = b, J = 1)E(Z|J ' = 1, X = b, J = 1) Odds (for one) = 10, 30
2
+p(J ' = 2|X = b, J = 1)E(Z|J ' = 2, X = b, J = 1) Bets = b, 1 − b.

+p(J ' = 1|X = 2b, J = 1)E(Z|J ' = 1, X = 2b, J = 1) The odds are super fair.
+ p(J ' = 2|X = 2b, J = 1)E(Z|J ' = 2, X = 2b, J = (6.74)
-
1)
(a) There is a bet b which guarantees the same payoff regardless of which horse wins.
1
= ([1 − p(b)]2b + p(b)b + p(2b)2b + [1 − p(2b)]b) (6.75) Such a bet is called a Dutch book. Find this b and the associated wealth factor
2
S(X).
3b 1
= + b(p(2b) − p(b)) (6.76) (b) What is the maximum growth rate of the wealth for this gamble? Compare it to
2 2
3b the growth rate for the Dutch book.
> (6.77)
2
Solution: Solution: Dutch book.
as long as p(2b) − p(b) > 0 . Thus E(Z) > E(X) .
12. Gambling. Find the horse win probabilities p 1 , p2 , . . . , pm (a)
(a) maximizing the doubling rate W ∗ for given fixed known odds o1 , o2 , . . . , om .
10bD = 30(1 − bD )
(b) minimizing the doubling rate for given fixed odds o 1 , o2 , . . . , om .
40bD = 30
Solution: Gambling bD = 3/4.
(a) From Theorem 6.1.2, W ∗ =
$
pi log oi − H(p) . We can also write this as Therefore,
!
W∗ = pi log pi oi (6.78) 1 3
* +
1
*
1
+
i W (bD , P ) = log 10 + log 30
! pi 2 4 2 4
= pi log 1 (6.79) = 2.91
i oi
 
! pi ! ! 1 and
= pi log − pi log   (6.80) SD (X) = 2W (bD ,P ) = 7.5.
i
qi i j
oj
  (b) In general,
! pi ! 1 1 1
= pi log − log   (6.81) W (b, p) = log(10b) + log(30(1 − b)).
i
qi j
oj 2 2
∂W
where Setting the ∂b to zero we get
1
qi = $oi 1 1 10 1 −30
* + * +
(6.82) + = 0
j oj 2 10b∗ 2 30 − 30b∗
154 Gambling and Data Compression Gambling and Data Compression 155

1 1 (a) Find the expected wealth ES(X) .

+ = 0
2b∗ 2(b∗ − 1)
(b) Find W ∗ , the optimal growth rate of wealth.
(b∗ − 1) + b∗
= 0 (c) Suppose
2b∗ (b∗ − 1) )
2b∗ − 1 1, X = 1 or 2
Y =
= 0 0, otherwise
4b∗ (1 − b∗ )
1 If this side information is available before the bet, how much does it increase the
b∗ = .
2 growth rate W ∗ ?
Hence (d) Find I(X; Y ) .

1 1 Solution: Entropy of a fair horse race.

W ∗ (p) = log(5) + log(15) = 3.11
2 2
W (bD , p) = 2.91 (a) The expected wealth ES(X) is
m
!
ES(X) = S(x)p(x) (6.83)
and x=1
!m
S ∗ = 2W = 8.66
∗
= b(x)o(x)p(x) (6.84)
x=1
SD = 2WD = 7.5 m
!
= b(x), (since o(x) = 1/p(x)) (6.85)
x=1
Thus gambling (a little) with b∗ beats the sure win of 7.5 given by the Dutch book = 1. (6.86)

14. Horse race. Suppose one is interested in maximizing the doubling rate for a horse (b) The optimal growth rate of wealth, W∗, is achieved when b(x) = p(x) for all x ,
race. Let p1 , p2 , . . . , pm denote the win probabilities of the m horses. When do the in which case,
odds (o1 , o2 , . . . , om ) yield a higher doubling rate than the odds (o '1 , o'2 , . . . , o'm ) ?
Solution: Horse Race (Repeat of problem 9) W ∗ = E(log S(X)) (6.87)
m
Let W and W ' denote the optimal doubling rates for the odds (o 1 , o2 , . . . , om ) and
!
= p(x) log(b(x)o(x)) (6.88)
(o'1 , o'2 , . . . , o'm ) respectively. By Theorem 6.1.2 in the book, x=1
!m
!
W = pi log oi − H(p), and = p(x) log(p(x)/p(x)) (6.89)
x=1
W' = pi log o'i − H(p)
!
!m
= p(x) log(1) (6.90)
where p is the probability vector (p 1 , p2 , . . . , pm ) . Then W > W ' exactly when x=1
pi log oi > pi log o'i ; that is, when = 0, (6.91)
$ $

E log oi > E log o'i . so we maintain our current wealth.

(c) The increase in our growth rate due to the side information is given by I(X; Y ) .
Let q = Pr(Y = 1) = p(1) + p(2) .
15. Entropy of a fair horse race. Let X ∼ p(x) , x = 1, 2, . . . , m , denote the winner of
1 I(X; Y ) = H(Y ) − H(Y |X) (6.92)
a horse race. Suppose the odds o(x) are fair with respect to p(x) , i.e., o(x) = p(x) .
$m = H(Y ) (since Y is a deterministic function of X) (6.93)
Let b(x) be the amount bet on horse x , b(x) ≥ 0 , 1 b(x) = 1 . Then the resulting
wealth factor is S(x) = b(x)o(x) , with probability p(x) . = H(q). (6.94)

156 Gambling and Data Compression Gambling and Data Compression 157

(d) Already computed above. Show that this limit is ∞ or 0 , with probability one, accordingly as c < c ∗ or
c > c∗ . Identify the “fair” entry fee c∗ .
16. Negative horse race Consider a horse race with m horses with win probabili-
ties p1 , p2 , . . . pm . Here the gambler hopes a given horse will lose. He places bets More realistically, the gambler should be allowed to keep a proportion b = 1 − b of his
(b1 , b2 , . . . , bm ), m
$
i=1 bi = 1 , on the horses,$loses his bet bi if horse i wins, and retains money in his pocket and invest the rest in the St. Petersburg game. His wealth at time
the rest of his bets. (No odds.) Thus S = j+=i bj , with probability pi , and one wishes n is then
n *
bXi
+
to maximize
$
pi ln(1 − bi ) subject to the constraint
$
bi = 1. B
Sn = b+ . (6.96)
i=1
c
(a) Find the growth rate optimal investment strategy b ∗ . Do not constrain the bets
to be positive, but do constrain the bets to sum to 1. (This effectively allows short Let
∞
. /
selling and margin.) b2k
2−k log 1 − b +
!
W (b, c) = . (6.97)
(b) What is the optimal growth rate? k=1
c
We have
Solution: Negative horse race .
Sn =2nW (b,c) (6.98)
(a) Let b'i = 1 − bi ≥ 0 , and note that i b'i = m − 1 . Let qi = b'i / ' . Then, {qi }
$ $
j bj Let
is a probability distribution on {1, 2, . . . , m} . Now, ∗
W (c) = max W (b, c). (6.99)
! 0≤b≤1
W = pi log(1 − bi )
Here are some questions about W ∗ (c).
i
!
= pi log qi (m − 1) (c) For what value of the entry fee c does the optimizing value b ∗ drop below 1?
i
! qi (d) How does b∗ vary with c ?
= log(m − 1) + pi log pi
i
pi (e) How does W ∗ (c) fall off with c ?
= log(m − 1) − H(p) − D(p5q) . Note that since W ∗ (c) > 0 , for all c , we can conclude that any entry fee c is fair.
Thus, W ∗ is obtained upon setting D(p5q) = 0 , which means making the bets Solution: The St. Petersburg paradox.
such that pi = qi = b'i /(m − 1) , or bi = 1 − (m − 1)pi . Alternatively, one can use
Lagrange multipliers to solve the problem. (a) The expected return,

(b) From (a) we directly see that setting D(p5q) = 0 implies W ∗ = log(m−1)−H(p) . ∞ ∞ ∞
p(X = 2k )2k = 2−k 2k =
! ! !
EX = 1 = ∞. (6.100)
17. The St. Petersburg paradox. Many years ago in ancient St. Petersburg the k=1 k=1 k=1

following gambling proposition caused great consternation. For an entry fee of c units, Thus the expected return on the game is infinite.
a gambler receives a payoff of 2k units with probability 2−k , k = 1, 2, . . . .
(b) By the strong law of large numbers, we see that
(a) Show that the expected payoff for this game is infinite. For this reason, it was n
argued that c = ∞ was a “fair” price to pay to play this game. Most people find 1 1!
log Sn = log Xi − log c → E log X − log c, w.p.1 (6.101)
this answer absurd. n n i=1
(b) Suppose that the gambler can buy a share of the game. For example, if he in- and therefore Sn goes to infinity or 0 according to whether E log X is greater or
vests c/2 units in the game, he receives 1/2 a share and a return X/2 , where less than log c . Therefore
Pr(X = 2k ) = 2−k , k = 1, 2, . . . . Suppose X1 , X2 , . . . are i.i.d. according to this
distribution and the gambler reinvests all his wealth each time. Thus his wealth ∞
log c∗ = E log X = k2−k = 2.
!
Sn at time n is given by (6.102)
n k=1
B Xi
Sn = . (6.95)
i=1
c Therefore a fair entry fee is 2 units if the gambler is forced to invest all his money.
158 Gambling and Data Compression Gambling and Data Compression 159

1
"fileplot"
"filebstar"

0.8
W(b,c)

2.5
2 0.6
1.5
1
0.5

b*
0
-0.5
-1 0.4

1 2 0.5
3 b 0.2
4 5 6 7
c 8 9 10 0

0
1 2 3 4 5 6 7 8 9 10
c

Figure 6.1: St. Petersburg: W (b, c) as a function of b and c . Figure 6.2: St. Petersburg: b∗ as a function of c .

(c) If the gambler is not required to invest all his money, then the growth rate is
3
∞
. /
! b2k "filewstar"
W (b, c) = 2−k log 1 − b + . (6.103)
k=1
c
2.5

For b = 0 , W = 1 , and for b = 1 , W = E log X −log c = 2−log c . Differentiating

to find the optimum value of b , we obtain
2

∞
. /
∂W (b, c) 1 2k
2−k 0
!
= 1 −1 + (6.104)
∂b 1−b+ b2k c

b*
k=1 1.5
c

Unfortunately, there is no explicit solution for the b that maximizes W for a given
value of c , and we have to solve this numerically on the computer. 1

We have illustrated the results with three plots. The first (Figure 6.1) shows
W (b, c) as a function of b and c . The second (Figure 6.2)shows b ∗ as a function
0.5
of c and the third (Figure 6.3) shows W ∗ as a function of c .
From Figure 2, it is clear that b∗ is less than 1 for c > 3 . We can also see this
analytically by calculating the slope ∂W∂b(b,c) at b = 1 . 0
1 2 3 4 5 6 7 8 9 10
c
∞
. /
∂W (b, c) 1 2k
2−k 0
!
= 1 −1 + (6.105) Figure 6.3: St. Petersburg: W ∗ (b∗ , c) as a function of c .
∂b k=1 1−b+ b2 k
c
c

160 Gambling and Data Compression Gambling and Data Compression 161
. /
! 2−k 2k Expected log ratio
= 2k
−1 (6.106)
k
d
c
∞ ∞
−k −2k
! !
= 2 − c2 (6.107)
k=1 k=1
c
= 1− (6.108) ER(b,c)
3
which is positive for c < 3 . Thus for c < 3 , the optimal value of b lies on the 0

boundary of the region of b ’s, and for c > 3 , the optimal value of b lies in the -1

interior. -2
-3
(d) The variation of b∗ with c is shown in Figure 6.2. As c → ∞ , b∗ → 0 . We have
-4
a conjecture (based on numerical results) that b ∗ → √12 c2−c as c → ∞ , but we
1
do not have a proof. 0.8
0.9
0.7
(e) The variation of W ∗ with c is shown in Figure 6.3. 0.6
0.5
0.4 b
5 0.3
0.2
18. Super St. Petersburg. Finally, we have the super St. Petersburg paradox, where c
10
0.1
k
Pr(X = 22 ) = 2−k , k = 1, 2, . . . . Here the expected log wealth is infinite for all b > 0 ,
for all c , and the gambler’s wealth grows to infinity faster than exponentially for any
b > 0. But that doesn’t mean all investment ratios b are equally good. To see this,
we wish to maximize the relative growth rate with respect to some other portfolio, say,
Figure 6.4: Super St. Petersburg: J(b, c) as a function of b and c .
b = ( 12 , 12 ). Show that there exists a unique b maximizing
k
(b + bX/c) complications. For example, for k = 6 , 22 is outside the normal range of numbers
E ln
( 12 + 21 X/c) representable on a standard computer. However, for k ≥ 6 , we can approximate the
b
ratio within the log by 0.5 without any loss of accuracy. Using this, we can do a simple
and interpret the answer. numerical computation as in the previous problem.
k
Solution: Super St. Petersburg. With Pr(X = 2 2 ) = 2−k , k = 1, 2, . . . , we have As before, we have illustrated the results with three plots. The first (Figure 6.4) shows
k
J(b, c) as a function of b and c . The second (Figure 6.5)shows b ∗ as a function of c
2−k log 22 = ∞,
!
E log X = (6.109) and the third (Figure 6.6) shows J ∗ as a function of c .
k
These plots indicate that for large values of c , the optimum strategy is not to put all
and thus with any constant entry fee, the gambler’s money grows to infinity faster than the money into the game, even though the money grows at an infinite rate. There exists
exponentially, since for any b > 0 , a unique b∗ which maximizes the expected ratio, which therefore causes the wealth to
∞
. k
/ k
grow to infinity at the fastest possible rate. Thus there exists an optimal b ∗ even when
! b22 ! b22 the log optimal portfolio is undefined.
W (b, c) = 2−k log 1 − b + > 2−k log = ∞. (6.110)
k=1
c c

But if we wish to maximize the wealth relative to the ( 21 , 12 ) portfolio, we need to

maximize
2k
(1 − b) + b2c
2−k log
!
J(b, c) = (6.111)
1 1 22k
k 2 + 2 c

As in the case of the St. Petersburg problem, we cannot solve this problem explicitly.
In this case, a computer solution is fairly straightforward, although there are some
162 Gambling and Data Compression

0.9

0.8

0.7 Chapter 7
0.6
b*

0.5
Channel Capacity
0.4

0.3

1. Preprocessing the output. One is given a communication channel with transition

0.2 probabilities p(y|x) and channel capacity C = max p(x) I(X; Y ). A helpful statistician
preprocesses the output by forming Ỹ = g(Y ). He claims that this will strictly improve
0.1
0.1 1 10 100 1000 the capacity.
c

Figure 6.5: Super St. Petersburg: b∗ as a function of c . (a) Show that he is wrong.
(b) Under what conditions does he not strictly decrease the capacity?
1
Solution: Preprocessing the output.
0.9

0.8 (a) The statistician calculates Ỹ = g(Y ) . Since X → Y → Ỹ forms a Markov chain,
we can apply the data processing inequality. Hence for every distribution on x ,
0.7

0.6 I(X; Y ) ≥ I(X; Ỹ ). (7.1)

0.5
Let p̃(x) be the distribution on x that maximizes I(X; Ỹ ) . Then
J*

0.4
C = max I(X; Y ) ≥ I(X; Y )p(x)=p̃(x) ≥ I(X; Ỹ )p(x)=p̃(x) = max I(X; Ỹ ) = C̃.
0.3 p(x) p(x)
(7.2)
0.2
Thus, the statistician is wrong and processing the output does not increase capac-
0.1 ity.
0 (b) We have equality (no decrease in capacity) in the above sequence of inequalities
only if we have equality in the data processing inequality, i.e., for the distribution
-0.1
0.1 1 10 100 1000 that maximizes I(X; Ỹ ) , we have X → Ỹ → Y forming a Markov chain.
c

2. An additive noise channel. Find the channel capacity of the following discrete
Figure 6.6: Super St. Petersburg: J ∗ (b∗ , c) as a function of c .
memoryless channel:
163

164 Channel Capacity Channel Capacity 165

Z and Zi are not independent.

I(X1 , X2 , . . . , Xn ; Y1 , Y2 , . . . , Yn )
#$
& = H(X1 , X2 , . . . , Xn ) − H(X1 , X2 , . . . , Xn |Y1 , Y2 , . . . , Yn )
X % % Y
!" = H(X1 , X2 , . . . , Xn ) − H(Z1 , Z2 , . . . , Zn |Y1 , Y2 , . . . , Yn )
≥ H(X1 , X2 , . . . , Xn ) − H(Z1 , Z2 , . . . , Zn ) (7.7)
!
≥ H(X1 , X2 , . . . , Xn ) − H(Zi ) (7.8)
where Pr{Z = 0} = Pr{Z = a} = 21 . The alphabet for x is X = {0, 1}. Assume that
= H(X1 , X2 , . . . , Xn ) − nH(p) (7.9)
Z is independent of X.
= n − nH(p), (7.10)
Observe that the channel capacity depends on the value of a.
Solution: A sum channel. if X1 , X2 , . . . , Xn are chosen i.i.d. ∼ Bern( 12 ). The capacity of the channel with
memory over n uses of the channel is
Y =X +Z X ∈ {0, 1}, Z ∈ {0, a} (7.3)
nC (n) = max I(X1 , X2 , . . . , Xn ; Y1 , Y2 , . . . , Yn ) (7.11)
p(x1 ,x2 ,...,xn )
We have to distinguish various cases depending on the values of a .
≥ I(X1 , X2 , . . . , Xn ; Y1 , Y2 , . . . , Yn )p(x1 ,x2 ,...,xn )=Bern( 1 ) (7.12)
2
a = 0 In this case, Y = X , and max I(X; Y ) = max H(X) = 1 . Hence the capacity ≥ n(1 − H(p)) (7.13)
is 1 bit per transmission.
= nC. (7.14)
a %= 0, ±1 In this case, Y has four possible values 0, 1, a and 1 + a . Knowing Y ,
we know the X which was sent, and hence H(X|Y ) = 0 . Hence max I(X; Y ) = Hence channels with memory have higher capacity. The intuitive explanation for this
max H(X) = 1 , achieved for an uniform distribution on the input X . result is that the correlation between the noise decreases the effective noise; one could
a = 1 In this case Y has three possible output values, 0, 1 and 2 , and the channel use the information from the past samples of the noise to combat the present noise.
is identical to the binary erasure channel discussed in class, with a = 1/2 . As
4. Channel capacity. Consider the discrete memoryless channel Y = X + Z (mod 11),
derived in class, the capacity of this channel is 1 − a = 1/2 bit per transmission.
where . /
a = −1 This is similar to the case when a = 1 and the capacity here is also 1/2 bit 1, 2, 3
Z=
per transmission. 1/3, 1/3, 1/3

3. Channels with memory have higher capacity. Consider a binary symmetric chan- and X ∈ {0, 1, . . . , 10} . Assume that Z is independent of X .
nel with Yi = Xi ⊕ Zi , where ⊕ is mod 2 addition, and Xi , Yi ∈ {0, 1}.
Suppose that {Zi } has constant marginal probabilities Pr{Z i = 1} = p = 1 − Pr{Zi = (a) Find the capacity.
0}, but that Z1 , Z2 , . . . , Zn are not necessarily independent. Assume that Z n is inde- (b) What is the maximizing p∗ (x) ?
pendent of the input X n . Let C = 1 − H(p, 1 − p). Show that
Solution: Channel capacity.
max I(X1 , X2 , . . . , Xn ; Y1 , Y2 , . . . , Yn ) ≥ nC. (7.4)
p(x1 ,x2 ,...,xn )
Y = X + Z(mod 11) (7.15)

Solution: Channels with memory have a higher capacity. where 

 1
 with probability1/3
Yi = Xi ⊕ Zi , (7.5) Z= 2 with probability1/3 (7.16)
 3

with probability1/3
where )
1 with probability p In this case,
Zi = (7.6)
0 with probability 1 − p H(Y |X) = H(Z|X) = H(Z) = log 3, (7.17)
166 Channel Capacity Channel Capacity 167

independent of the distribution of X , and hence the capacity of the channel is 6. Noisy typewriter. Consider a 26-key typewriter.
C = max I(X; Y ) (7.18) (a) If pushing a key results in printing the associated letter, what is the capacity C
p(x)
= max H(Y ) − H(Y |X) (7.19) in bits?
p(x)
(b) Now suppose that pushing a key results in printing that letter or the next (with
= max H(Y ) − log 3 (7.20) equal probability). Thus A → A or B, . . . , Z → Z or A. What is the capacity?
p(x)
= log 11 − log 3, (7.21) (c) What is the highest rate code with block length one that you can find that achieves
zero probability of error for the channel in part (b) .
which is attained when Y has a uniform distribution, which occurs (by symmetry)
when X has a uniform distribution. Solution: Noisy typewriter.
(a) The capacity of the channel is log 11
3 bits/transmission. (a) If the typewriter prints out whatever key is struck, then the output, Y , is the
1
(b) The capacity is achieved by an uniform distribution on the inputs. p(X = i) = 11 same as the input, X , and
for i = 0, 1, . . . , 10 .
C = max I(X; Y ) = max H(X) = log 26, (7.32)
5. Using two channels at once. Consider two discrete memoryless channels (X 1 , p(y1 |
x1 ), Y1 ) and (X2 , p(y2 | x2 ), Y2 ) with capacities C1 and C2 respectively. A new channel attained by a uniform distribution over the letters.
(X1 × X2 , p(y1 | x1 ) × p(y2 | x2 ), Y1 × Y2 ) is formed in which x1 ∈ X1 and x2 ∈ X2 , are
(b) In this case, the output is either equal to the input (with probability 12 ) or equal
simultaneously sent, resulting in y 1 , y2 . Find the capacity of this channel.
to the next letter ( with probability 21 ). Hence H(Y |X) = log 2 independent of
Solution: Using two channels at once. Suppose we are given two channels, (X 1 , p(y1 |x1 ), Y1 ) the distribution of X , and hence
and (X2 , p(y2 |x2 ), Y2 ) , which we can use at the same time. We can define the product
channel as the channel, (X1 × X2 , p(y1 , y2 |x1 , x2 ) = p(y1 |x1 )p(y2 |x2 ), Y1 × Y2 ) . To find C = max I(X; Y ) = max H(Y ) − log 2 = log 26 − log 2 = log 13, (7.33)
the capacity of the product channel, we must find the distribution p(x 1 , x2 ) on the
input alphabet X1 × X2 that maximizes I(X1 , X2 ; Y1 , Y2 ) . Since the joint distribution attained for a uniform distribution over the output, which in turn is attained by
a uniform distribution on the input.
p(x1 , x2 , y1 , y2 ) = p(x1 , x2 )p(y1 |x1 )p(y2 |x2 ), (7.22)
(c) A simple zero error block length one code is the one that uses every alternate
Y1 → X1 → X2 → Y2 forms a Markov chain and therefore letter, say A,C,E,. . . ,W,Y. In this case, none of the codewords will be confused,
since A will produce either A or B, C will produce C or D, etc. The rate of this
I(X1 , X2 ; Y1 , Y2 ) = H(Y1 , Y2 ) − H(Y1 , Y2 |X1 , X2 ) (7.23)
code,
= H(Y1 , Y2 ) − H(Y1 |X1 , X2 ) − H(Y2 |X1 , X2 ) (7.24) log(# codewords) log 13
R= = = log 13. (7.34)
= H(Y1 , Y2 ) − H(Y1 |X1 ) − H(Y2 |X2 ) (7.25) Block length 1
≤ H(Y1 ) + H(Y2 ) − H(Y1 |X1 ) − H(Y2 |X2 ) (7.26) In this case, we can achieve capacity with a simple code with zero error.
= I(X1 ; Y1 ) + I(X2 ; Y2 ), (7.27)
7. Cascade of binary symmetric channels. Show that a cascade of n identical
where (7.24) and (7.25) follow from Markovity, and we have equality in (7.26) if Y 1 and independent binary symmetric channels,
Y2 are independent. Equality occurs when X1 and X2 are independent. Hence
X0 → BSC →1 → · · · → Xn−1 → BSC →n
C = max I(X1 , X2 ; Y1 , Y2 ) (7.28)
p(x1 ,x2 )
each with raw error probability p , is equivalent to a single BSC with error probability
≤ max I(X1 ; Y1 ) + max I(X2 ; Y2 ) (7.29) 1 n
p(x1 ,x2 ) p(x1 ,x2 ) 2 (1 − (1 − 2p) ) and hence that n→∞ lim I(X0 ; Xn ) = 0 if p %= 0, 1 . No encoding or
= max I(X1 ; Y1 ) + max I(X2 ; Y2 ) (7.30) decoding takes place at the intermediate terminals X 1 , . . . , Xn−1 . Thus the capacity
p(x1 ) p(x2 ) of the cascade tends to zero.
= C1 + C2 . (7.31)
Solution: Cascade of binary symmetric channels. There are many ways to solve
with equality iff p(x1 , x2 ) = p∗ (x1 )p∗ (x2 ) and p∗ (x1 ) and p∗ (x2 ) are the distributions this problem. One way is to use the singular value decomposition of the transition
that maximize C1 and C2 respectively. probability matrix for a single BSC.

168 Channel Capacity Channel Capacity 169

Let, " #
Solution: The Z channel. First we express I(X; Y ) , the mutual information between
1−p p the input and output of the Z-channel, as a function of x = Pr(X = 1) :
A=
p 1−p
H(Y |X) = Pr(X = 0) · 0 + Pr(X = 1) · 1 = x
be the transition probability matrix for our BSC. Then the transition probability matrix H(Y ) = H(Pr(Y = 1)) = H(x/2)
for the cascade of n of these BSC’s is given by, I(X; Y ) = H(Y ) − H(Y |X) = H(x/2) − x

An = An . Since I(X; Y ) = 0 when x = 0 and x = 1 , the maximum mutual information is

obtained for some value of x such that 0 < x < 1 .
Now check that, " # Using elementary calculus, we determine that
1 0
A = T −1 T d 1 1 − x/2
0 1 − 2p I(X; Y ) = log2 − 1,
dx 2 x/2
where, " # which is equal to zero for x = 2/5 . (It is reasonable that Pr(X = 1) < 1/2 because
1 1
T = . X = 1 is the noisy input to the channel.) So the capacity of the Z-channel in bits is
1 −1
H(1/5) − 2/5 = 0.722 − 0.4 = 0.322 .
Using this we have, 9. Suboptimal codes. For the Z channel of the previous problem, assume that we choose
a (2nR , n) code at random, where each codeword is a sequence of fair coin tosses. This
An = An will not achieve capacity. Find the maximum rate R such that the probability of error
(n)
" #
1 0 Pe , averaged over the randomly generated codes, tends to zero as the block length n
= T −1 T
0 (1 − 2p)n tends to infinity.
Solution: Suboptimal codes. From the proof of the channel coding theorem, it follows
" #
1 1
2 (1 + (1 − 2p)n ) 2 (1 − (1 − 2p)n )
= 1 1 . that using a random code with codewords generated according to probability p(x) , we
2 (1 − (1 − 2p)n ) 2 (1 + (1 − 2p)n )
can send information at a rate I(X; Y ) corresponding to that p(x) with an arbitrarily
From this we see that the cascade of n BSC’s is also a BSC with probablility of error, low probability of error. For the Z channel described in the previous problem, we can
calculate I(X; Y ) for a uniform distribution on the input. The distribution on Y is
1 (3/4, 1/4), and therefore
pn = (1 − (1 − 2p)n ).
2
3 1 1 1 1 3 3
I(X; Y ) = H(Y ) − H(Y |X) = H( , ) − H( , ) = − log 3. (7.35)
The matrix, T , is simply the matrix of eigenvectors of A . 4 4 2 2 2 2 4
This problem can also be solved by induction on n . 10. Zero-error capacity. A channel with alphabet {0, 1, 2, 3, 4} has transition probabil-
Probably the simplest way to solve the problem is to note that the probability of ities of the form )
error for the cascade channel is simply the sum of the odd terms of the binomial 1/2 if y = x ± 1 mod 5
p(y|x) =
expansion of (x + y)n with x = p and y = 1 − p . But this can simply be written as 0 otherwise.
1 n 1 n 1 n
2 (x + y) − 2 (y − x) = 2 (1 − (1 − 2p) . (a) Compute the capacity of this channel in bits.
8. The Z channel. The Z-channel has binary input and output alphabets and transition (b) The zero-error capacity of a channel is the number of bits per channel use that
probabilities p(y|x) given by the following matrix: can be transmitted with zero probability of error. Clearly, the zero-error capacity
of this pentagonal channel is at least 1 bit (transmit 0 or 1 with probability 1/2).
Find a block code that shows that the zero-error capacity is greater than 1 bit.
" #
1 0
Q= x, y ∈ {0, 1} Can you estimate the exact value of the zero-error capacity?
1/2 1/2
(Hint: Consider codes of length 2 for this channel.)
Find the capacity of the Z-channel and the maximizing input probability distribution. The zero-error capacity of this channel was finally found by Lovasz[9].
170 Channel Capacity Channel Capacity 171

Solution: Zero-error capacity.

i 1−p
' 0
0 * )
(
* (
(a) Since the channel is symmetric, it is easy to compute its capacity: * (
* (
* (
* (
* ( pi
H(Y |X) = 1 (
*
( * pi
( *
I(X; Y ) = H(Y ) − H(Y |X) = H(Y ) − 1 . ( *
( *
( *
( *
1 ( '
+
* 1
So mutual information is maximized when Y is uniformly distributed, which oc- 1 − pi
curs when the input X is uniformly distributed. Therefore the capacity in bits is
C = log2 5 − 1 = log 2 2.5 = 1.32 .
Let X = (X1 , X2 , . . . , Xn ), Y = (Y1 , Y2 , . . . , Yn ). Find maxp(x) I(X; Y).
(b) Let us construct a block code consisting of 2-tuples. We need more than 4 code- Solution: Time-varying channels.
words in order to achieve capacity greater than 2 bits, so we will pick 5 codewords
We can use the same chain of inequalities as in the proof of the converse to the channel
with distinct first symbols: {0a, 1b, 2c, 3d, 4e} . We must choose a, b, c, d, e so that
the receiver will be able to determine which codeword was transmitted. A sim- coding theorem. Hence
ple repetition code will not work, since if, say, 22 is transmitted, then 11 might I(X n ; Y n ) = H(Y n ) − H(Y n |X n ) (7.36)
be received, and the receiver could not tell whether the codeword was 00 or 22. n
= H(Y n ) − H(Yi |Y1 , . . . , Yi−1 , X n )
!
Instead, using codewords of the form (i+1 mod 5, 2i+1 mod 5) yields the code (7.37)
11,23,30,42,04. i=1
n
= H(Y n ) −
!
Here is the decoding table for the pentagon channel: H(Yi |Xi ), (7.38)
i=1
040. 43. 2320101. 34. 34. 1212
since by the definition of the channel, Y i depends only on Xi and is conditionally
It is amusing to note that the five pairs that cannot be received are exactly the 5
independent of everything else. Continuing the series of inequalities, we have
codewords.
n
Then whenever xy is received, there is exactly one possible codeword. (Each I(X n ; Y n ) = H(Y n ) −
!
H(Yi |Xi ) (7.39)
codeword will be received as one of 4 possible 2-tuples; so there are 20 possible i=1
received 2-tuples, out of a total of 25.) Since there are 5 possible error-free messages n
! n
!
with 2 channel uses, the zero-error capacity of this channel is at least 12 log2 5 = ≤ H(Yi ) − H(Yi |Xi ) (7.40)
i=1 i=1
1.161 bits. n
!
In fact, the zero-error capacity of this channel is exactly 21 log2 5 . This result ≤ (1 − h(pi )), (7.41)
was obtained by László Lovász, “On the Shannon capacity of a graph,” IEEE i=1

Transactions on Information Theory, Vol IT-25, pp. 1–7, January 1979. The with equality if X1 , X2 , . . . , Xn is chosen i.i.d. ∼ Bern(1/2). Hence
first results on zero-error capacity are due to Claude E. Shannon, “The zero-
n
error capacity of a noisy channel,” IEEE Transactions on Information Theory, Vol max I(X1 , X2 , . . . , Xn ; Y1 , Y2 , . . . , Yn ) =
!
(1 − h(pi )). (7.42)
IT-2, pp. 8–19, September 1956, reprinted in Key Papers in the Development of p(x)
i=1
Information Theory, David Slepian, editor, IEEE Press, 1974.
12. Unused symbols. Show that the capacity of the channel with probability transition
matrix  
11. Time-varying channels. Consider a time-varying discrete memoryless channel. Let 2/3 1/3 0
Y1 , Y2 , . . . , Yn be conditionally independent given X1 , X2 , . . . , Xn , with conditional dis- Py|x =  1/3 1/3 1/3  (7.43)
 
tribution given by p(y | x) = ni=1 pi (yi | xi ). 0 1/3 2/3
C

172 Channel Capacity Channel Capacity 173

1−α−ǫ ' 0
0 *
is achieved by a distribution that places zero probability on one of input symbols. What . * -
,
. * ,
. **
is the capacity of this channel? Give an intuitive reason why that letter is not used. ,
. * ,
Solution: Unused symbols Let the probabilities of the three input symbols be p 1 , p2 . * ,
*
and p3 . Then the probabilities of the three output symbols can be easily calculated to . ǫ * ,
. *,
be ( 32 p1 + 13 p2 , 13 , 31 p2 + 23 p3 ) , and therefore *
. , * α
. , *
*
I(X; Y ) = H(Y ) − H(Y |X) (7.44) ., +
* e
,. )
(
2 1 1 1 2 2 1 (
= H( p1 + p2 , , p2 + p3 ) − (p1 + p3 )H( , ) − p2 log 3 (7.45) , . (α
3 3 3 3 3 3 3 , . ((
1 1 1 1 1 2 1 ,ǫ (.
= H( + (p1 − p3 ), , − (p1 − p3 )) − (p1 + p3 )H( , ) − (1 − p1 − p3(7.46)
) log 3 , ( .
3 3 3 3 3 3 3 (
, ( .
, ( ( .
where we have substituted p2 = 1 − p1 − p3 . Now if we fix p1 + p3 , then the second and , ( .
, ( .
third terms are fixed, and the first term is maximized if p 1 − p3 = 0 , i.e., if p1 = p3 . (
1 (
, '/
. 1
(The same conclusion can be drawn from the symmetry of the problem.)
1−α−ǫ
Now setting p1 = p3 , we have (a) Find the capacity of this channel.
1 1 1 2 1 (b) Specialize to the case of the binary symmetric channel ( α = 0 ).
I(X; Y ) = H( , , ) − (p1 + p3 )H( , ) − (1 − p1 − p3 ) log 3 (7.47)
3 3 3 3 3 (c) Specialize to the case of the binary erasure channel ( ǫ = 0 ).
2 1
= log 3 − (p1 + p3 )H( , ) − (1 − p1 − p3 ) log 3 (7.48)
3 3
2 1 Solution:
= (p1 + p3 )(log 3 − H( , )) (7.49)
3 3
(a) As with the examples in the text, we set the input distribution for the two inputs
which is maximized if p1 + p3 is as large as possible (since log 3 > H( 23 , 13 ) ). There- to be π and 1 − π . Then
fore the maximizing distribution corresponds to p 1 + p3 = 1 , p1 = p3 , and therefore
(p1 , p2 , p3 ) = ( 21 , 0, 12 ) . The capacity of this channel = log 3 − H( 23 , 13 ) = log 3 − (log 3 − C = max I(X; Y ) (7.50)
2 2 p(x)
3 ) = 3 bits. = max(H(Y ) − H(Y |X)) (7.51)
p(x)
The intuitive reason why p2 = 0 for the maximizing distribution is that conditional
on the input being 2, the output is uniformly distributed. The same uniform output = max H(Y ) − H(1 − ǫ − α, α, ǫ). (7.52)
p(x)
distribution can be achieved without using the symbol 2 (by setting p 1 = p3 ), and
therefore the use of symbol 2 does not add any information (it does not change the As in the case of the erasure channel, the maximum value for H(Y ) cannot be
entropy of the output and the conditional entropy H(Y |X = 2) is the maximum log 3 , since the probability of the erasure symbol is α independent of the input
possible, i.e., log 3 , so any positive probability for symbol 2 will only reduce the mutual distribution. Thus,
information.
Note that not using a symbol is optimal only if the uniform output distribution can be H(Y ) = H(π(1 − ǫ − α) + (1 − π)ǫ, α, (1 − π)(1 − ǫ − α) + πǫ) (7.53)
π + ǫ − πα − 2πǫ 1 − π − ǫ + 2ǫπ − α + απ
* +
achieved without use of that symbol. For example, in the Z channel example above, both = H(α) + (1 − α)H , (7.54)
symbols are used, even though one of them gives a conditionally uniform distribution 1−α 1−α
on the output. ≤ H(α) + (1 − α) (7.55)

13. Erasures and errors in a binary channel. Consider a channel with binary inputs with equality iff π+ǫ−πα−2πǫ
1−α = 21 , which can be achieved by setting π = 12 . (The
that has both erasures and errors. Let the probability of error be ǫ and the probability fact that π = 1 − π = 21 is the optimal distribution should be obvious from the
of erasure be α , so the the channel is as illustrated below: symmetry of the problem, even though the channel is not weakly symmetric.)
174 Channel Capacity Channel Capacity 175

Therefore the capacity of this channel is (b) The capacity of the channel is achieved by a uniform distribution over the inputs,
which produces a uniform distribution on the output pairs
C = H(α) + 1 − α − H(1 − α − ǫ, α, ǫ) (7.56)
*
1−α−ǫ ǫ
+ C = max I(X1 , X2 ; Y1 , Y2 ) = 2 bits (7.64)
= H(α) + 1 − α − H(α) − (1 − α)H , (7.57) p(x1 ,x2 )
1−α 1−α
1
1−α−ǫ ǫ and the maximizing p(x1 , x2 ) puts probability on each of the pairs 00, 01, 10
* * ++
4
= (1 − α) 1 − H , (7.58)
1−α 1−α and 11.
(c) To calculate I(X1 ; Y1 ) , we need to calculate the joint distribution of X 1 and Y1 .
(b) Setting α = 0 , we get The joint distribution of X1 X2 and Y1 Y2 under an uniform input distribution is
C = 1 − H(ǫ), (7.59) given by the following matrix
X1 X2 \Y1 Y2 00 01 10 11
which is the capacity of the binary symmetric channel. 00 0 1
0 0
4
1
(c) Setting ǫ = 0 , we get 01 0 0 4 0
1
C =1−α (7.60) 10 0 0 0 4
1
11 4 0 0 0
which is the capacity of the binary erasure channel. From this, it is easy to calculate the joint distribution of X 1 and Y1 as
X1 \Y1 0 1
14. Channels with dependence between the letters. Consider the following channel 0 1 1
4 4
over a binary alphabet that takes in two bit symbols and produces a two bit output, 1 1 1
4 4
as determined by the following mapping: 00 → 01 , 01 → 10 , 10 → 11 , and 11 → and therefore we can see that the marginal distributions of X 1 and Y1 are both
00 . Thus if the two bit sequence 01 is the input to the channel, the output is 10 (1/2, 1/2) and that the joint distribution is the product of the marginals, i.e., X 1
with probability 1. Let X1 , X2 denote the two input symbols and Y1 , Y2 denote the is independent of Y1 , and therefore I(X1 ; Y1 ) = 0 .
corresponding output symbols.
Thus the distribution on the input sequences that achieves capacity does not nec-
essarily maximize the mutual information between individual symbols and their
(a) Calculate the mutual information I(X 1 , X2 ; Y1 , Y2 ) as a function of the input corresponding outputs.
distribution on the four possible pairs of inputs.
15. Jointly typical sequences. As we did in problem 13 of Chapter 3 for the typical
(b) Show that the capacity of a pair of transmissions on this channel is 2 bits.
set for a single random variable, we will calculate the jointly typical set for a pair of
(c) Show that under the maximizing input distribution, I(X 1 ; Y1 ) = 0 . random variables connected by a binary symmetric channel, and the probability of error
Thus the distribution on the input sequences that achieves capacity does not nec- for jointly typical decoding for such a channel.
essarily maximize the mutual information between individual symbols and their
corresponding outputs.

Solution: 0.9
0 * ' 0
)
(
(a) If we look at pairs of inputs and pairs of outputs, this channel is a noiseless * (
* (
four input four output channel. Let the probabilities of the four input pairs be * (
* (
p00 , p01 , p10 and p11 respectively. Then the probability of the four pairs of output * (
* ( 0.1
bits is p11 , p00 , p01 and p10 respectively, and (
*
( * 0.1
( *
( *
I(X1 , X2 ; Y1 , Y2 ) = H(Y1 , Y2 ) − H(Y1 , Y2 |X1 , X2 ) (7.61) ( *
( *
( *
= H(Y1 , Y2 ) − 0 (7.62) 1 ( +
'
* 1
= H(p11 , p00 , p01 , p10 ) (7.63) 0.9

176 Channel Capacity Channel Capacity 177

,n- ,n -
We will consider a binary symmetric channel with crossover probability 0.1. The input k k k pk (1 − p)n−k − n1 log p(xn )
distribution that achieves capacity is the uniform distribution, i.e., p(x) = (1/2, 1/2) , 0 1 0.071790 0.152003
which yields the joint distribution p(x, y) for this channel is given by 1 25 0.199416 0.278800
2 300 0.265888 0.405597
X\Y 0 1
3 2300 0.226497 0.532394
0 0.45 0.05
4 12650 0.138415 0.659191
1 0.05 0.45
5 53130 0.064594 0.785988
The marginal distribution of Y is also (1/2, 1/2) . 6 177100 0.023924 0.912785
7 480700 0.007215 1.039582
(a) Calculate H(X) , H(Y ) , H(X, Y ) and I(X; Y ) for the joint distribution above. 8 1081575 0.001804 1.166379
9 2042975 0.000379 1.293176
(b) Let X1 , X2 , . . . , Xn be drawn i.i.d. according the Bernoulli(1/2) distribution.
10 3268760 0.000067 1.419973
Of the 2n possible input sequences of length n , which of them are typical, i.e.,
(n) (n) 11 4457400 0.000010 1.546770
member of Aǫ (X) for ǫ = 0.2 ? Which are the typical sequences in A ǫ (Y ) ?
12 5200300 0.000001 1.673567
(n)
(c) The jointly typical set Aǫ (X, Y ) is defined as the set of sequences that satisfy (Sequences with more than 12 ones are omitted since their total probability is
equations (7.35-7.37). The first two equations correspond to the conditions that negligible (and they are not in the typical set).)
(n) (n)
xn and y n are in Aǫ (X) and Aǫ (Y ) respectively. Consider the last condition, What is the size of the set Aǫ (Z) ?
(n)
which can be rewritten to state that − n1 log p(xn , y n ) ∈ (H(X, Y )−ǫ, H(X, Y )+ǫ) .
Let k be the number of places in which the sequence x n differs from y n ( k is a (e) Now consider random coding for the channel, as in the proof of the channel coding
function of the two sequences). Then we can write theorem. Assume that 2nR codewords X n (1), X n (2), . . . , X n (2nR ) are chosen uni-
formly over the 2n possible binary sequences of length n . One of these codewords
n is chosen and sent over the channel. The receiver looks at the received sequence
p(xn , y n ) =
B
p(xi , yi ) (7.65) and tries to find a codeword in the code that is jointly typical with the received
i=1
sequence. As argued above, this corresponds to finding a codeword X n (i) such
= (0.45)n−k (0.05)k (7.66) (n)
that Y n − X n (i) ∈ Aǫ (Z) . For a fixed codeword xn (i) , what is the probability
* +n
1 that the received sequence Y n is such that (xn (i), Y n ) is jointly typical?
= (1 − p)n−k pk (7.67)
2
(f) Now consider a particular received sequence y n = 000000 . . . 0 , say. Assume that
An alternative way at looking at this probability is to look at the binary symmetric we choose a sequence X n at random, uniformly distributed among all the 2n
channel as in additive channel Y = X ⊕ Z , where Z is a binary random variable possible binary n -sequences. What is the probability that the chosen sequence
that is equal to 1 with probability p , and is independent of X . In this case, is jointly typical with this y n ? (Hint: this is the probability of all sequences x n
(n)
such that y n − xn ∈ Aǫ (Z) .)
p(xn , y n ) = p(xn )p(y n |xn ) (7.68)
(g) Now consider a code with 29 = 512 codewords of length 12 chosen at random,
= p(xn )p(z n |xn ) (7.69) uniformly distributed among all the 2n sequences of length n = 25 . One of
n n
= p(x )p(z ) (7.70) these codewords, say the one corresponding to i = 1 , is chosen and sent over the
* +n
1 channel. As calculated in part (e), the received sequence, with high probability, is
= (1 − p)n−k pk (7.71)
2 jointly typical with the codeword that was sent. What is probability that one or
more of the other codewords (which were chosen at random, independently of the
Show that the condition that (xn , y n ) being jointly typical is equivalent to the sent codeword) is jointly typical with the received sequence? (Hint: You could use
condition that xn is typical and z n = y n − xn is typical. the union bound but you could also calculate this probability exactly, using the
(n) result of part (f) and the independence of the codewords)
(d) We now calculate the size of Aǫ (Z) for n = 25 and ǫ = 0.2 . As in problem 13
of Chapter 3, here is a table of the probabilities and numbers of sequences of with (h) Given that a particular codeword was sent, the probability of error (averaged over
k ones the probability distribution of the channel and over the random choice of other
178 Channel Capacity Channel Capacity 179

codewords) can be written as condition, which can be rewritten to state that − n1 log p(xn , y n ) ∈ (H(X, Y ) −
ǫ, H(X, Y ) + ǫ) . Let k be the number of places in which the sequence x n differs
Pr(Error|xn (1) sent) = p(y n |xn (1))
!
(7.72) from y n ( k is a function of the two sequences). Then we can write
y n :y n causes error
n
There are two kinds of error: the first occurs if the received sequence is not yn p(xn , y n ) =
B
p(xi , yi ) (7.73)
jointly typical with the transmitted codeword, and the second occurs if there is i=1
another codeword jointly typical with the received sequence. Using the result of = (0.45)n−k (0.05)k (7.74)
the previous parts, calculate this probability of error. * +n
1
By the symmetry of the random coding argument, this does not depend on which = (1 − p)n−k pk (7.75)
2
codeword was sent.
An alternative way at looking at this probability is to look at the binary symmetric
The calculations above show that average probability of error for a random code with channel as in additive channel Y = X ⊕ Z , where Z is a binary random variable
512 codewords of length 25 over the binary symmetric channel of crossover probability that is equal to 1 with probability p , and is independent of X . In this case,
0.1 is about 0.34. This seems quite high, but the reason for this is that the value of
ǫ that we have chosen is too large. By choosing a smaller ǫ , and a larger n in the p(xn , y n ) = p(xn )p(y n |xn ) (7.76)
(n)
definitions of Aǫ , we can get the probability of error to be as small as we want, as = p(xn )p(z n |xn ) (7.77)
long as the rate of the code is less than I(X; Y ) − 3ǫ . n
= p(x )p(z ) n
(7.78)
* +n
Also note that the decoding procedure described in the problem is not optimal. The 1
optimal decoding procedure is maximum likelihood, i.e., to choose the codeword that = (1 − p)n−k pk (7.79)
2
is closest to the received sequence. It is possible to calculate the average probability
of error for a random code for which the decoding is based on an approximation to Show that the condition that (xn , y n ) being jointly typical is equivalent to the
maximum likelihood decoding, where we decode a received sequence to the unique condition that xn is typical and z n = y n − xn is typical.
codeword that differs from the received sequence in ≤ 4 bits, and declare an error (n)
Solution:The conditions for (xn , y n ) ∈ Aǫ (X, Y ) are
otherwise. The only difference with the jointly typical decoding described above is
that in the case when the codeword is equal to the received sequence! The average A(n)
ǫ = {(x , y ) ∈ X × Y n :
n n n
(7.80)
probability of error for this decoding scheme can be shown to be about 0.285. 3 1
3 3
3− log p(xn ) − H(X)3 < ǫ,
3
3 n (7.81)
Solution: Jointly typical set
3
3 1
3 3
3− log p(y n ) − H(Y )3 < ǫ,
3
(a) Calculate H(X) , H(Y ) , H(X, Y ) and I(X; Y ) for the joint distribution above. 3 n 3 (7.82)
Solution: H(X) = H(Y ) = 1 bit, H(X, Y ) = H(X) + H(Y |X) = 1 + H(p) = 1 − 3 1
3 3
3− log p(xn , y n ) − H(X, Y )3 < ǫ},
3
0.9 log 0.9−0.1 log 0.1 = 1+0.469 = 1.469 bits, and I(X; Y ) = H(Y )−H(Y |X) = 3 n 3 (7.83)
0.531 bits.
(b) Let X1 , X2 , . . . , Xn be drawn i.i.d. according the Bernoulli(1/2) distribution. Of But, as argued above, every sequence x n and y n satisfies the first two conditions.
the 2n possible sequences of length n , which of them are typical, i.e., member of Thereofre, the only condition that matters is the last one. As argued above,
(n) (n)
Aǫ (X) for ǫ = 0.2 ? Which are the typical sequences in A ǫ (Y ) ? 1 1 1 n k
** + +
Solution:In the case for the uniform distribution, every sequence has probability − log p(xn , y n ) = − log p (1 − p)n−k (7.84)
n n 2
(1/2)n , and therefore for every sequence, − n1 log p(xn ) = 1 = H(X) , and therefore k n−k
(n) = 1 − log p − log(1 − p) (7.85)
every sequence is typical, i.e., ∈ Aǫ (X) . n n
(n)
Similarly, every sequence y n is typical, i.e., ∈ Aǫ (Y ) .
Thus the pair (xn , y n ) is jointly typical iff |1− nk log p− n−k
n log(1−p)−H(X, Y )| <
(n)
(c) The jointly typical set Aǫ (X, Y ) is defined as the set of sequences that satisfy ǫ , i.e., iff | − nk log p − n−k
n log(1 − p) − H(p)| < ǫ , which is exactly the condition
equations (7.35-7.37) of EIT. The first two equations correspond to the conditions for z n = y n ⊕ xn to be typical. Thus the set of jointly typical pairs (x n , y n ) is the
(n) (n)
that xn and y n are in Aǫ (X) and Aǫ (Y ) respectively. Consider the last set such that the number of places in which x n differs from y n is close to np .

180 Channel Capacity Channel Capacity 181

(n)
(d) We now calculate the size of Aǫ (Z) for n = 25 and ǫ = 0.2 . As in problem 7 of (f) Now consider a particular received sequence y n = 000000 . . . 0 , say. Assume that
Homework 4, here is a table of the probabilities and numbers of sequences of with we choose a sequence X n at random, uniformly distributed among all the 2n
k ones ,n- $ ,n- ,n- k
possible binary n -sequences. What is the probability that the chosen sequence
k k j≤k j p(xn ) = pk (1 − p)n−k k p (1 − p)
n−k Cumul. pr. − n1 log p(xn ) is jointly typical with this y n ? (Hint: this is the probability of all sequences x n
(n)
0 1 1 7.178975e-02 0.071790 0.071790 0.152003 such that y n − xn ∈ Aǫ (Z) .)
1 25 26 7.976639e-03 0.199416 0.271206 0.278800 Solution:Since all xn sequences are chosen with the same probability ( (1/2) n ),
2 300 326 8.862934e-04 0.265888 0.537094 0.405597 the probability that the xn sequence chosen is jointly typical with the received y n
3 2300 2626 9.847704e-05 0.226497 0.763591 0.532394 is equal to the number of possible jointly typical (x n , y n ) pairs times (1/2)n . The
4 12650 15276 1.094189e-05 0.138415 0.902006 0.659191 number of sequences xn that are jointly typical with a given y n is equal to number
5 53130 68406 1.215766e-06 0.064594 0.966600 0.785988 of typical z n , where z n = xn ⊕ y n . Thus the probability that a randomly chosen
(n)
6 177100 245506 1.350851e-07 0.023924 0.990523 0.912785 xn is typical with the given y n is |Aǫ (Z)| ∗ ( 21 )n = 15275 ∗ 2−25 = 4.552 × 10−4 .
7 480700 726206 1.500946e-08 0.007215 0.997738 1.039582 (g) Now consider a code with 29 = 512 codewords of length 12 chosen at random,
8 1081575 1807781 1.667718e-09 0.001804 0.999542 1.166379 uniformly distributed among all the 2n sequences of length n = 25 . One of
9 2042975 3850756 1.853020e-10 0.000379 0.999920 1.293176 these codewords, say the one corresponding to i = 1 , is chosen and sent over the
10 3268760 7119516 2.058911e-11 0.000067 0.999988 1.419973 channel. As calculated in part (e), the received sequence, with high probability, is
11 4457400 11576916 2.287679e-12 0.000010 0.999998 1.546770 jointly typical with the codeword that was sent. What is probability that one or
12 5200300 16777216 2.541865e-13 0.000001 0.999999 1.673567 more of the other codewords (which were chosen at random, independently of the
(Sequences with more than 12 ones are omitted since their total probability is sent codeword) is jointly typical with the received sequence? (Hint: You could use
negligible (and they are not in the typical set).) the union bound but you could also calculate this probability exactly, using the
(n)
What is the size of the set Aǫ (Z) ? result of part (f) and the independence of the codewords)
Solution: H(Z) = H(0.1) = 0.469 . Solution:Each of the other codewords is jointly typical with received sequence
with probability 4.552 × 10−4 , and each of these codewords is independent. The
Setting ǫ = 0.2 , the typical set for Z is the set sequences for which − n1 log p(z n ) ∈
probability that none of the 511 codewords are jointly typical with the received
(H(Z) − ǫ, H(Z) + ǫ) = (0.269, 0.669) . Looking at the table above for n = 25 , it
sequence is therefore (1 − 4.552 × 10−4 )511 = 0.79241 , and the probability that
follows that the typical Z sequences are those with 1,2,3 or 4 ones.
(n)
at least one of them is jointly typical with the received sequence is therefore 1 −
The total probability of the set Aǫ (Z) = 0.902006 − 0.071790 = 0.830216 and 0.79241 = 0.20749 .
the size of this set is 15276 − 1 = 15275 . Using the simple union of events bound gives the probability of another codeword
(e) Now consider random coding for the channel, as in the proof of the channel coding being jointly typical with the received sequence to be 4.552×10 −4 ×511 = 0.23262 .
theorem. Assume that 2nR codewords X n (1), X n (2), . . . , X n (2nR ) are chosen uni- The previous calculation gives the more exact answer.
formly over the 2n possible binary sequences of length n . One of these codewords (h) Given that a particular codeword was sent, the probability of error (averaged over
is chosen and sent over the channel. The receiver looks at the received sequence the probability distribution of the channel and over the random choice of other
and tries to find a codeword in the code that is jointly typical with the received codewords) can be written as
sequence. As argued above, this corresponds to finding a codeword X n (i) such
Pr(Error|xn (1) sent) = p(y n |xn (1))
!
(n) (7.86)
that Y n − X n (i) ∈ Aǫ (Z) . For a fixed codeword xn (i) , what is the probability
that the received sequence Y n is such that (xn (i), Y n ) is jointly typical? y n :y n causes error
Solution:The easiest way to calculate this probability is to view the BSC as an There are two kinds of error: the first occurs if the received sequence y n is not
additive channel Y = X ⊕ Z , where Z is Bernoulli( p ). Then the probability that jointly typical with the transmitted codeword, and the second occurs if there is
for a given codeword, xn (i) , that the output Y n is jointly typical with it is equal another codeword jointly typical with the received sequence. Using the result of
(n)
to the probability that the noise sequence Z n is typical, i.e., in Aǫ (Z) . The noise the previous parts, calculate this probability of error.
sequence is drawn i.i.d. according to the distribution (1 − p, p) , and as calculated By the symmetry of the random coding argument, this does not depend on which
(n)
above, the probability that the sequence is typical, i.e., Pr(A ǫ (Z)) = 0.830216 . codeword was sent.
Therefore the probability that the received sequence is not jointly typical with the Solution:There are two error events, which are conditionally independent, given
transmitted codeword is 0.169784. the received sequence. In the previous part, we showed that the conditional proba-
182 Channel Capacity Channel Capacity 183

bility of error of the second kind was 0.20749, irrespective of the received sequence (b) What is the capacity of this channel in bits per transmission of the original chan-
yn . nel?
The probability of error of the first kind is 0.1698, conditioned on the input code- (c) What is the capacity of the original BSC with crossover probability 0.1?
(n)
word. In part (e), we calculated the probability that (x n (i), Y n ) ∈ / Aǫ (X, Y ) , (d) Prove a general result that for any channel, considering the encoder, channel and
but this was conditioned on a particular input sequence. Now by the symmetry decoder together as a new channel from messages to estimated messages will not
and uniformity of the random code construction, this probability does not depend increase the capacity in bits per transmission of the original channel.
(n)
on xn (i) , and therefore the probability that (X n , Y n ) ∈
/ Aǫ (X, Y ) is also equal
to this probability, i.e., to 0.1698. Solution: Encoder and Decoder as part of the channel:
We can therefore use a simple union of events bound to bound the total probability
(a) The probability of error with these 3 bits codewords was 2.8%, and thus the
of error ≤ 0.1698 + 0.2075 = 0.3773 .
crossover probability of this channel is 0.028.
Thus we can send 512 codewords of length 25 over a BSC with crossover probability
0.1 with probability of error less than 0.3773. (b) The capacity of a BSC with crossover probability 0.028 is 1 − H(0.028) , i.e., 1-
0.18426 or 0.81574 bits for each 3 bit codeword. This corresponds to 0.27191 bits
A little more accurate calculation can be made of the probability of error using the
per transmission of the original channel.
fact that conditioned on the received sequence, both kinds of error are independent.
Using the symmetry of the code construction process, the probability of error of (c) The original channel had capacity 1 − H(0.1) , i.e., 0.531 bits/transmission.
the first kind conditioned on the received sequence does not depend on the received (d) The general picture for the channel with encoder and decoder is shown below
sequence, and is therefore = 0.1698 . Therefore the probability that neither type
of error occurs is (using their independence) = (1 − 0.1698)(1 − 0.2075) = 0.6579
and therefore, the probability of error is 1 − 0.6579 = 0.3421
W % Xn % Channel Yn % Ŵ %
Encoder p(y|x) Decoder
The calculations above show that average probability of error for a random code with Message Estimate
512 codewords of length 25 over the binary symmetric channel of crossover probability of
0.1 is about 0.34. This seems quite high, but the reason for this is that the value of Message
ǫ that we have chosen is too large. By choosing a smaller ǫ , and a larger n in the
(n)
definitions of Aǫ , we can get the probability of error to be as small as we want, as By the data processing inequality, I(W ; Ŵ ) ≤ I(X n ; Y n ) , and therefore
long as the rate of the code is less than I(X; Y ) − 3ǫ .
1 1
Also note that the decoding procedure described in the problem is not optimal. The CW = max I(W ; Ŵ ) ≤ max I(X n ; Y n ) = C (7.87)
n p(w) n p(xn )
optimal decoding procedure is maximum likelihood, i.e., to choose the codeword that
is closest to the received sequence. It is possible to calculate the average probability Thus the capacity of the channel per transmission is not increased by the addition
of error for a random code for which the decoding is based on an approximation to of the encoder and decoder.
maximum likelihood decoding, where we decode a received sequence to the unique
17. Codes of length 3 for a BSC and BEC: In Problem 16, the probability of error was
codeword that differs from the received sequence in ≤ 4 bits, and declare an error
calculated for a code with two codewords of length 3 (000 and 111) sent over a binary
otherwise. The only difference with the jointly typical decoding described above is
symmetric channel with crossover probability ǫ . For this problem, take ǫ = 0.1 .
that in the case when the codeword is equal to the received sequence! The average
probability of error for this decoding scheme can be shown to be about 0.285. (a) Find the best code of length 3 with four codewords for this channel. What is
the probability of error for this code? (Note that all possible received sequences
16. Encoder and decoder as part of the channel: Consider a binary symmetric chan-
should be mapped onto possible codewords)
nel with crossover probability 0.1. A possible coding scheme for this channel with two
codewords of length 3 is to encode message a 1 as 000 and a2 as 111. With this coding (b) What is the probability of error if we used all the 8 possible sequences of length 3
scheme, we can consider the combination of encoder, channel and decoder as forming as codewords?
a new BSC, with two inputs a1 and a2 and two outputs a1 and a2 . (c) Now consider a binary erasure channel with erasure probability 0.1. Again, if we
used the two codeword code 000 and 111, then received sequences 00E,0E0,E00,0EE,E0E,EE0
(a) Calculate the crossover probability of this channel. would all be decoded as 0, and similarly we would decode 11E,1E1,E11,1EE,E1E,EE1

184 Channel Capacity Channel Capacity 185

as 1. If we received the sequence EEE we would not know if it was a 000 or a 111 Received Sequences codeword
that was sent - so we choose one of these two at random, and are wrong half the 000, 00E, 0E0, E00 000
time. 011, 01E, 0E1, E11 011
What is the probability of error for this code over the erasure channel? 110, 11E, 1E0, E10 110
101, 10E, 1E1, E01 101
(d) What is the probability of error for the codes of parts (a) and (b) when used over 0EE 000 or 011 with prob. 0.5
the binary erasure channel? EE0 000 or 110 with prob. 0.5
..
.
Solution: Codes of length 3 for a BSC and BEC: EE1 011 or 101 with prob. 0.5
EEE 000 or 011 or 110 or 101 with prob. 0.25
(a) To minimize the probability of confusion, the codewords should be as far apart as Essentially all received sequences with only one erasure can be decoded correctly.
possible. With four codewords, the minimum distance is at most 2, and there are If there are two erasures, then there are two possible codewords that could have
various sets of codewords that achieve this minimum distance. An example set is caused the received sequence, and the conditional probability of error is 0.5. If
000, 011, 110 and 101. Each of these codewords differs from the other codewords there are three erasures, any of the codewords could have caused it, and the
in at least two places. conditional probability of error is 0.75. Thus the probability of error given that
To calculate the probability of error, we need to find the best decoding rule, i.e,. 000 was sent is the probability of two erasures times 0.5 plus the probability of 3
we need to map all possible recieved sequences onto codewords. As argued in the erasures times 0.75, i.e, 3 ∗ 0.9 ∗ 0.1 ∗ 0.1 ∗ 0.5 + 0.1 ∗ 0.1 ∗ 0.1 ∗ 0.75 = 0.01425 .
previous homework, the best decoding rule assigns to each received sequence the This is also the average probability of error.
nearest codeword, with ties being broken arbitrarily. Of the 8 possible received If all input sequences are used as codewords, then we will be confused if there
sequences, 4 are codewords, and each of the other 4 sequences has three codewords is any erasure in the received sequence. The conditional probability of error if
within distance one of them. We can assign these received sequences to any of the there is one erasure is 0.5, two erasures is 0.75 and three erasures is 0.875 (these
nearest codewords, or alternatively, for symmetry, we might toss a three sided corrospond to the numbers of other codewords that could have caused the received
coin on receiving the sequence, and choose one of the nearest codewords with sequence). Thus the probability of error given any codeword is 3 ∗ 0.9 ∗ 0.9 ∗ 0.1 ∗
probability (1/3, 1/3, 1/3). All these decoding strategies will give the same average 0.5 + 3 ∗ 0.9 ∗ 0.1 ∗ 0.1 ∗ 0.75 + 0.1 ∗ 0.1 ∗ 0.1 ∗ 0.875 = 0.142625 . This is also the
probability of error. average probability of error.
In the current example, there are 8 possible received sequences, and we will use
18. Channel capacity: Calculate the capacity of the following channels with probability
the following decoder mapping 000, 001 → 000; 011, 010 → 011; 110, 100 →
transition matrices:
110; and 101, 111 → 101.
Under this symmetric mapping, the codeword and one received sequence at dis- (a) X = Y = {0, 1, 2}
tance 1 from the codeword are mapped on to the codeword. The probability there-
 
1/3 1/3 1/3
fore that the codeword is decoded correctly is 0.9∗0.9∗0.9+0.9∗0.9∗0.1 = 0.81 and p(y|x) =  1/3 1/3 1/3 
 
(7.88)
the probability of error (for each codeword) is 0.19. Thus the average probability 1/3 1/3 1/3
of error is also 0.19.
(b) X = Y = {0, 1, 2}
(b) If we use all possible input sequences as codewords, then we have an error if any of 
1/2 1/2 0

the bits is changed. The probability that all the three bits are received correctly p(y|x) =  0 1/2 1/2 
 
(7.89)
is 0.9 ∗ 0.9 ∗ 0.9 = 0.729 and therefore the probability of error is 0.271. 1/2 0 1/2
(c) There will be an error only if all three bits of the codeword are erased, and on
(c) X = Y = {0, 1, 2, 3}
receiveing EEE, the decoder choses the wrong codeword. The probability of re-
ceiving EEE is 0.001 and conditioned on that, the probability of error is 0.5, so
 
p 1−p 0 0
the probability of error for this code over the BEC is 0.0005.  1−p p 0 0 
p(y|x) =  (7.90)
 
 0 0 q 1−q

(d) For the code of part (a), the four codewords are 000, 011,110, and 101. We use 
the following decoder mapping: 0 0 1−q q
186 Channel Capacity Channel Capacity 187

Solution: Channel Capacity: Set up an appropriate model for the channel in each of the above cases, and indicate
how to go about finding the capacity.
(a) X = Y = {0, 1, 2} 
1/3 1/3 1/3
 Solution: Capacity of the carrier pigeon channel.
p(y|x) =  1/3 1/3 1/3  (7.91)
 
(a) The channel sends 8 bits every 5 minutes, or 96 bits/hour.
1/3 1/3 1/3
(b) This is the equivalent of an erasure channel with an input alphabet of 8 bit symbols,
This is a symmetric channel and by the results of Section 8.2, i.e., 256 different symbols. For any symbols sent, a fraction α of them are received
as an erasure. We would expect that the capacity of this channel is (1 − α)8
C = log |Y| − H(r) = log 3 − log 3 = 0. (7.92) bits/pigeon. We will justify it more formally by mimicking the derivation for the
binary erasure channel.
In this case, the output is independent of the input.
Consider a erasure channel with 256 symbol inputs and 257 symbol output - the
(b) X = Y = {0, 1, 2}   extra symbol is the erasure symbol, which occurs with probability α . Then
1/2 1/2 0
p(y|x) =  0 1/2 1/2 
 
(7.93) I(X; Y ) = H(Y ) − H(Y |X) = H(Y ) − H(α) (7.97)
1/2 0 1/2
since the probability of erasure is independent of the input.
Again the channel is symmetric, and by the results of Section 8.2, However, we cannot get H(Y ) to attain its maximum value, i.e., log 257 , since
the probability of the erasure channel is α independent of our input distribution.
C = log |Y| − H(r) = log 3 − log = 0.58 bits (7.94)
However, if we let E be the erasure event, then
(c) X = Y = {0, 1, 2, 3} H(Y ) = H(Y, E) = H(E) + H(Y |E) = H(α) + α × 0 + (1 − α)H(Y |E = 0) (7.98)
 
p 1−p 0 0 and we can maximize H(Y ) by maximizing H(Y |E = 0) . However, H(Y |E = 0)
 1−p p 0 0 
p(y|x) = 
 
(7.95) is just the entropy of the input distribution, and this is maximized by the uniform.
 0 0 q 1−q 

Thus the maximum value of H(Y ) is H(α) + (1 − α) log 256 , and the capacity of
0 0 1−q q this channel is (1 − α) log 256 bits/pigeon, or (1 − α)96 bits/hour, as we might
have expected from intuitive arguments.
This channel consists of a sum of two BSC’s, and using the result of Problem 2 of
Homework 9, the capacity of the channel is (c) In this case, we have a symmetric channel with 256 inputs and 256 output. With
0 1 probability (1 − α) + α/256 , the output symbol is equal to the input, and with
1−H(p) 1−H(q) probability α/256 , it is transformed to any of the other 255 symbols. This channel
C = log 2 +2 (7.96)
is symmetric in the sense of Section 8.2, and therefore the capacity of the channel
19. Capacity of the carrier pigeon channel. Consider a commander of an army be- is
sieged a fort for whom the only means of communication to his allies is a set of carrier
C = log |Y| − H(r) (7.99)
pigeons. Assume that each carrier pigeon can carry one letter (8 bits), and assume that
pigeons are released once every 5 minutes, and that each pigeon takes exactly 3 minutes = log 256 − H(1 − α + α/256, α/256, α/256, . . . , α/256) (7.100)
to reach its destination. 255 255
= 8 − H(1 − α) − αH(1/255, 1/255, . . . , 1/255) (7.101)
256 256
(a) Assuming all the pigeons reach safely, what is the capacity of this link in bits/hour? 255 255
= 8 − H(1 − α) − α log 255 (7.102)
(b) Now assume that the enemies try to shoot down the pigeons, and that they manage 256 256
to hit a fraction α of them. Since the pigeons are sent at a constant rate, the We have to multiply this by 12 to get the capacity in bits/hour.
receiver knows when the pigeons are missing. What is the capacity of this link?
20. A channel with two independent looks at Y. Let Y 1 and Y2 be conditionally
(c) Now assume that the enemy is more cunning, and every time they shoot down a independent and conditionally identically distributed given X.
pigeon, they send out a dummy pigeon carrying a random letter (chosen uniformly
from all 8-bit letters). What is the capacity of this link in bits/hour? (a) Show I(X; Y1 , Y2 ) = 2I(X; Y1 ) − I(Y1 , Y2 ).

188 Channel Capacity Channel Capacity 189

(b) Conclude that the capacity of the channel (a) The average height of the individuals in the population is 5 feet. So n1 hi = 5
$

where n is the population size and hi is the height of the i -th person. If more
X % % (Y1 , Y2 ) than 31 of the population is at least 15 feet tall, then the average will be greater
than 13 15 = 5 feet since each person is at least 0 feet tall. Thus no more than 13
of the population is 15 feet tall.
is less than twice the capacity of the channel
(b) By the same reasoning as in part (a), at most 12 of the poplulation is 10 feet tall
X % % Y1 and at most 13 of the population weighs 300 lbs. Therefore at most 31 are both
10 feet tall and weigh 300 lbs.

Solution: A channel with two independent looks at Y 22. Can signal alternatives lower capacity? Show that adding a row to a channel
transition matrix does not decrease capacity.
(a)

I(X; Y1 , Y2 ) = H(Y1 , Y2 ) − H(Y1 , Y2 |X) (7.103) Solution: Can signal alternatives lower capacity?
Adding a row to the channel transition matrix is equivalent to adding a symbol to the
= H(Y1 ) + H(Y2 ) − I(Y1 ; Y2 ) − H(Y1 |X) − H(Y2 |X) (7.104)
input alphabet X . Suppose there were m symbols and we add an (m + 1) -st. We can
(since Y1 and Y2 are conditionally independent given X)
(7.105) always choose not to use this extra symbol.
= I(X; Y1 ) + I(X; Y2 ) − I(Y1 ; Y2 ) (7.106)
Specifically, let Cm and Cm+1 denote the capacity of the original channel and the new
= 2I(X; Y1 ) − I(Y1 ; Y2 ) (since Y1 and Y2 are conditionally(7.107)
iden-. channel, respectively. Then
tically distributed)
Cm+1 = max I(X; Y )
(b) The capacity of the single look channel X → Y 1 is p(x1 ,...,xm+1 )
≥ max I(X; Y )
C1 = max I(X; Y1 ). (7.108) p(x1 ,...,xm ,0)
p(x)
= Cm .
The capacity of the channel X → (Y1 , Y2 ) is
23. Binary multiplier channel
C2 = max I(X; Y1 , Y2 ) (7.109)
p(x)
= max 2I(X; Y1 ) − I(Y1 ; Y2 ) (7.110) (a) Consider the channel Y = XZ where X and Z are independent binary random
p(x) variables that take on values 0 and 1. Z is Bernoulli( α ), i.e. P (Z = 1) = α .
≤ max 2I(X; Y1 ) (7.111) Find the capacity of this channel and the maximizing distribution on X .
p(x)
= 2C1 . (7.112) (b) Now suppose the receiver can observe Z as well as Y . What is the capacity?

Hence, two independent looks cannot be more than twice as good as one look. Solution: Binary Multiplier Channel

21. Tall, fat people (a) Let P (X = 1) = p . Then P (Y = 1) = P (X = 1)P (Z = 1) = αp .

Suppose that average height of people in a room is 5 feet. Suppose the average weight
is 100 lbs. I(X; Y ) = H(Y ) − H(Y |X)
= H(Y ) − P (X = 1)H(Z)
1
(a) Argue that no more than 3 of the population is 15 feet tall.
= H(αp) − pH(α)
(b) Find an upper bound on the fraction of 300 lb, 10 footers in the room.
1
We find that p∗ = H(α) maximizes I(X; Y ) . The capacity is calculated to
Solution: α(2 α +1)
H(α)
H(α)
Tall, fat people. be log(2 α + 1) − α .
190 Channel Capacity Channel Capacity 191

(b) Let P (X = 1) = p . Then where x = {1, 2, . . . , m} , y = {1, 2, . . . , m} , and v = {1, 2, . . . , k} . Here p(v|x) and
p(y|v) are arbitrary and the channel has transition probability p(y|x) = v p(v|x)p(y|v) .
$
I(X; Y, Z) = I(X; Z) + I(X; Y |Z)
Show C ≤ log k .
= H(Y |Z) − H(Y |X, Z)
Solution: Bottleneck channel
= H(Y |Z)
= αH(p) The capacity of the cascade of channels is C = max p(x) I(X; Y ) . By the data processing
inequality, I(X; Y ) ≤ I(V ; Y ) = H(V ) − H(V |Y ) ≤ H(V ) ≤ log k .
The expression is maximized for p = 1/2 , resulting in C = α . Intuitively, we can
only get X through when Z is 1, which happens α of the time. 26. Noisy typewriter. Consider the channel with x, y ∈ {0, 1, 2, 3} and transition prob-
abilities p(y|x) given by the following matrix:
24. Noise alphabets  1 1 
0 0
Consider the channel 
2 2

1 1
 0 0 
 
2 2
Z  0

0 1 1 
2 2

1 1
2 0 0 2

#$
& (a) Find the capacity of this channel.
X % Σ % Y
!" (b) Define the random variable z = g(y) where
)
A if y ∈ {0, 1}
g(y) = .
X = {0, 1, 2, 3} , where Y = X + Z , and Z is uniformly distributed over three distinct B if y ∈ {2, 3}
integer values Z = {z1 , z2 , z3 }.
For the following two PMFs for x , compute I(X; Z)
(a) What is the maximum capacity over all choices of the Z alphabet? Give distinct i.
integer values z1 , z2 , z3 and a distribution on X achieving this. )
1
(b) What is the minimum capacity over all choices for the Z alphabet? Give distinct 2 if x ∈ {1, 3}
p(x) =
integer values z1 , z2 , z3 and a distribution on X achieving this. 0 if x ∈ {0, 2}

Solution: Noise alphabets ii.

)
(a) Maximum capacity is C = 2 bits. Z = {10, 20, 30} and p(X) = ( 41 , 14 , 14 , 14 ) . 0 if x ∈ {1, 3}
p(x) = 1
(b) Minimum capacity is C = 1 bit. Z = {0, 1, 2} and p(X) = ( 21 , 0, 0, 12 ) . 2 if x ∈ {0, 2}

25. Bottleneck channel (c) Find the capacity of the channel between x and z , specifically where x ∈ {0, 1, 2, 3} ,
z ∈ {A, B} , and the transition probabilities P (z|x) are given by
Suppose a signal X ∈ X = {1, 2, . . . , m} goes through an intervening transition X −→
V −→ Y :
!
p(Z = z|X = x) = P (Y = y0 |X = x)
g(y0 )=z

(d) For the X distribution of part i. of b , does X → Z → Y form a Markov chain?

X p(v|x) V p(y|v) Y
Solution: Noisy typewriter

(a) This is a noisy typewriter channel with 4 inputs, and is also a symmetric channel.
Capacity of the channel by Theorem 7.2.1 is log 4 − 1 = 1 bit per transmission.

192 Channel Capacity Channel Capacity 193

(b) i. The resulting conditional distribution of Z given X is The capacity of the channel is
 
1 0

 1 1

 C = max I(X; S) (7.113)
 2 2  p(x)
 0 1 
 
1 1
2 2
Define a new random variable Z , a function of S , where Z = 1 if S = e and Z =
If ) 0 otherwise. Note that p(Z = 1) = α independent of X . Expanding the mutual
1
2 if x ∈ {1, 3} information,
p(x) =
0 if x ∈ {0, 2}
then it is easy to calculate H(Z|X) = 0 , and I(X; Z) = 1 . If
I(X; S) = H(S) − H(S|X) (7.114)
)
0 if x ∈ {1, 3} = H(S, Z) − H(S, Z|X) (7.115)
p(x) = 1
2 if x ∈ {0, 2} + H(Z) + H(S|Z) − H(Z|X) − H(S|X, Z) (7.116)
= I(X; Z) + I(S; X|Z) (7.117)
then H(Z|X) = 1 and I(X; Y ) = 0 .
= 0 + αI(X; S|Z = 1) + (1 − α)I(X; S|Z = 0) (7.118)
ii. Since I(X; Z) ≤ H(Z) ≤ 1 , the capacity of the channel is 1, achieved by the
input distribution )
1
p(x) = 2 if x ∈ {1, 3} When Z = 1 , S = e and H(S|Z = 1) = H(S|X, Z = 1) = 0 . When Z = 0 , S = Y ,
0 if x ∈ {0, 2} and I(X; S|Z = 0) = I(X; Y ) . Thus
(c) For the input distribution that achieves capacity, X ↔ Z is a one-to-one func-
tion, and hence p(x, z) = 1 or 0 . We can therefore see the that p(x, y, z) =
I(X; S) = (1 − α)I(X; Y ) (7.119)
p(z, x)p(y|x, z) = p(z, x)p(y|z) , and hence X → Z → Y forms a Markov chain.

27. Erasure channel

and therefore the capacity of the cascade of a channel with an erasure channel is (1− α)
Let {X , p(y|x), Y} be a discrete memoryless channel with capacity C . Suppose this times the capacity of the original channel.
channel is immediately cascaded with an erasure channel {Y, p(s|y), S} that erases α
of its symbols.
28. Choice of channels.

(
(
X p(y|x) Y $ ( S Find the capacity C of the union of 2 channels (X 1 , p1 (y1 |x1 ), Y1 ) and (X2 , p2 (y2 |x2 ), Y2 )
$$(
( where, at each time, one can send a symbol over channel 1 or over channel 2 but not
'''$$(
''$ both. Assume the output alphabets are distinct and do not intersect.
'(
$
'(
' e
$

Specifically, S = {y1 , y2 , . . . , ym , e}, and (a) Show 2C = 2C1 + 2C2 . Thus 2C is the effective alphabet size of a channel with
capacity C .
Pr{S = y|X = x} = ᾱp(y|x), y ∈ Y,
Pr{S = e|X = x} = α. (b) Compare with problem 10 of Chapter 2 where 2 H = 2H1 + 2H2 , and interpret (a)
in terms of the effective number of noise-free symbols.
Determine the capacity of this channel.
Solution: Erasure channel (c) Use the above result to calculate the capacity of the following channel
194 Channel Capacity Channel Capacity 195

' 0
1−p
0 * Maximizing over α one gets the desired result. The maximum occurs for H ' (α) + C1 −
)
(
* ( C2 = 0 , or α = 2C1 /(2C1 + 2C2 ) .
* (
* (
* ( (b) If one interprets M = 2C as the effective number of noise free symbols, then the
* (
* ( p above result follows in a rather intuitive manner: we have M 1 = 2C1 noise free symbols
(
*
( * p from channel 1, and M2 = 2C2 noise free symbols from channel 2. Since at each
( *
( * step we get to chose which channel to use, we essentially have M 1 + M2 = 2C1 + 2C2
( *
( * noise free symbols for the new channel. Therefore, the capacity of this channel is
( *
1 ( +
'
* 1 C = log2 (2C1 + 2C2 ) .
1−p
This argument is very similiar to the effective alphabet argument given in Problem 10,
Chapter 2 of the text.

29. Binary multiplier channel.

(a) Consider the discrete memoryless channel Y = XZ where X and Z are inde-
2 ' 2
pendent binary random variables that take on values 0 and 1. Let P (Z = 1) = α .
1
Find the capacity of this channel and the maximizing distribution on X .
Solution: Choice of Channels (b) Now suppose the receiver can observe Z as well as Y . What is the capacity?
(a) This is solved by using the very same trick that was used to solve problem 2.10.
Consider the following communication scheme: Solution: Binary Multiplier Channel (Repeat of problem 7.23)
)
X1 Probability α (a) Let P (X = 1) = p . Then P (Y = 1) = P (X = 1)P (Z = 1) = αp .
X=
X2 Probability (1 − α)
I(X; Y ) = H(Y ) − H(Y |X)
Let
= H(Y ) − P (X = 1)H(Z)
)
1 X = X1
θ(X) = = H(αp) − pH(α)
2 X = X2
Since the output alphabets Y1 and Y2 are disjoint, θ is a function of Y as well, i.e. We find that p∗ = 1
maximizes I(X; Y ) . The capacity is calculated to
H(α)
X → Y → θ. α(2 α +1)
H(α)
H(α)
be log(2 α + 1) − α .
I(X; Y, θ) = I(X; θ) + I(X; Y |θ)
= I(X; Y ) + I(X; θ|Y ) (b) Let P (X = 1) = p . Then

I(X; Y, Z) = I(X; Z) + I(X; Y |Z)

Since X → Y → θ , I(X; θ|Y ) = 0 . Therefore,
= H(Y |Z) − H(Y |X, Z)
I(X; Y ) = I(X; θ) + I(X; Y |θ) = H(Y |Z)
= H(θ) − H(θ|X) + αI(X1 ; Y1 ) + (1 − α)I(X2 ; Y2 ) = αH(p)
= H(α) + αI(X1 ; Y1 ) + (1 − α)I(X2 ; Y2 )
The expression is maximized for p = 1/2 , resulting in C = α . Intuitively, we can
only get X through when Z is 1, which happens α of the time.
Thus, it follows that
30. Noise alphabets.
C = sup {H(α) + αC1 + (1 − α)C2 } . Consider the channel
α

196 Channel Capacity Channel Capacity 197

Z 32. Random “20” questions

Let X be uniformly distributed over {1, 2, . . . , m} . Assume m = 2 n . We ask random
questions: Is X ∈ S1 ? Is X ∈ S2 ?...until only one integer remains. All 2m subsets S
#$
& of {1, 2, . . . , m} are equally likely.
X % Σ % Y
!"
(a) How many deterministic questions are needed to determine X ?
(b) Without loss of generality, suppose that X = 1 is the random object. What is
X = {0, 1, 2, 3} , where Y = X + Z , and Z is uniformly distributed over three distinct the probability that object 2 yields the same answers for k questions as object 1?
integer values Z = {z1 , z2 , z3 }.
(c) What is the expected number of objects in {2, 3, . . . , m} that have the same
(a) What is the maximum capacity over all choices of the Z alphabet? Give distinct answers to the questions as does the correct object 1?
integer values z1 , z2 , z3 and a distribution on X achieving this. √
(d) Suppose we ask n + n random questions. What is the expected number of
(b) What is the minimum capacity over all choices for the Z alphabet? Give distinct wrong objects agreeing with the answers?
integer values z1 , z2 , z3 and a distribution on X achieving this. (e) Use Markov’s inequality Pr{X ≥ tµ} ≤ 1t , to show that the probability of error
Solution: Noise alphabets (Repeat of problem 7.24) (one or more wrong object remaining) goes to zero as n −→ ∞ .

(a) Maximum capacity is C = 2 bits. Z = {10, 20, 30} and p(X) = ( 41 , 14 , 14 , 14 ) . Solution: Random “20” questions. (Repeat of Problem 5.45)
(b) Minimum capacity is C = 1 bit. Z = {0, 1, 2} and p(X) = ( 21 , 0, 0, 12 ) . (a) Obviously, Huffman codewords for X are all of length n . Hence, with n deter-
ministic questions, we can identify an object out of 2 n candidates.
31. Source and channel.
We wish to encode a Bernoulli( α ) process V 1 , V2 , . . . for transmission over a binary (b) Observe that the total number of subsets which include both object 1 and object
symmetric channel with crossover probability p . 2 or neither of them is 2m−1 . Hence, the probability that object 2 yields the same
answers for k questions as object 1 is (2m−1 /2m )k = 2−k .
1−p % More information theoretically, we can view this problem as a channel coding
+ *
)
+ ) problem through a noiseless channel. Since all subsets are equally likely, the
V n % X n (V n ) % +) p %Y n %̂ n
)+ p V probability the object 1 is in a specific random subset is 1/2 . Hence, the question
) +
) %
,
+ whether object 1 belongs to the k th subset or not corresponds to the k th bit of
1−p
the random codeword for object 1, where codewords X k are Bern( 1/2 ) random
k -sequences.
Find conditions on α and p so that the probability of error P ( V̂ n %= V n ) can be made
Object Codeword
to go to zero as n −→ ∞ .
1 0110 . . . 1
Solution: Source And Channel 2 0010 . . . 0
Suppose we want to send a binary i.i.d. Bernoulli( α ) source over a binary symmetric ..
.
channel with error probability p .
Now we observe a noiseless output Y k of X k and figure out which object was
By the source-channel separation theorem, in order to achieve an error rate that vanishes
sent. From the same line of reasoning as in the achievability proof of the channel
asymptotically, P (V̂ n %= V n ) → 0 , we need the entropy of the source to be smaller than
coding theorem, i.e. joint typicality, it is obvious the probability that object 2 has
the capacity of the channel. In this case this translates to
the same codeword as object 1 is 2−k .
H(α) + H(p) < 1, (c) Let
or, equivalently, )
1 1, object j yields the same answers for k questions as object 1
αα (1 − α)1−α pp (1 − p)1−p < . 1j = ,
2 0, otherwise
for j = 2, . . . , m.
198 Channel Capacity Channel Capacity 199

Then,
m
I(X n ; Y n ) = 1 + (n − 1)H(p) − H(p) = 1 + (n − 2)H(p)
!
E(# of objects in {2, 3, . . . , m} with the same answers) = E( 1j ) and,
j=2
m
1 1 + (n − 2)H(p)
I(X n ; Y n ) = lim
!
= E(1j ) lim = H(p)
n→∞ n n→∞ n
j=2
!m (b) For the BSC C = 1 − H(p) . For p = 0.5 , C = 0 , while lim n→∞ n1 I(X n ; Y n ) =
= 2−k H(0.5) = 1 .
j=2 1 n
(c) Using this scheme n I(W ; Y ) → 0.
= (m − 1)2−k
n −k 34. Capacity. Find the capacity of
= (2 − 1)2 .
√ √ (a) Two parallel BSC’s
(d) Plugging k = n + n into (c) we have the expected number of (2n − 1)2−n− n.
1 1 ' 1
(e) Let N by the number of wrong objects remaining. Then, by Markov’s inequality 1 p $0
$
11
$$
$ $p 11
P (N ≥ 1) ≤ EN 2 $ 2
1 2
'
√
= (2n − 1)2−n− n
X Y
√ 3 1 ' 3
0
$
≤ 2− n 11p $$
$
1 $p 1
→ 0, $ 12
4 $ '
1 4
where the first equality follows from part (d).

33. BSC with feedback. Suppose that feedback is used on a binary symmetric channel
with parameter p . Each time a Y is received, it becomes the next transmission. Thus (b) BSC and single symbol.
X1 is Bern(1/2), X2 = Y1 , X3 = Y2 , . . . , Xn = Yn−1 . 1 1 '
1 $0$ 1
1p1
$$
(a) Find limn→∞ n1 I(X n ; Y n ) . $$p 11
2 $ 2
' 2
1
(b) Show that for some values of p , this can be higher than capacity. X Y
(c) Using this feedback transmission scheme, ' 3
X n (W, Y n ) = (X1 (W ), Y1 , Y2 , . . . , Ym−1 ) , what is the asymptotic communication 3
rate achieved; that is, what is limn→∞ n1 I(W ; Y n ) ?
(c) BSC and ternary channel.
Solution: BSC with feedback solution. '
1
1 1 2 1
1 4
3
11 3
1

1
2
(a) 31
2
'
1
1
n n n n n
I(X ; Y ) = H(Y ) − H(Y |X ). 2 11 3 2
2

13 1
321 11
2

H(Y n |X n ) = H(Yi |Y i−1 , X n ) = H(Y1 |X1 ) + H(Yi |Y n ) = H(p) + 0.

! ! 1

X 3 2 12'
1 Y
i i 3 3
'
1
4 1112 $0$ 4
2

H(Y n ) = H(Yi |Y i−1 ) = H(Y1 ) +

! !
H(Yi |Xi ) = 1 + (n − 1)H(p) 11$
$
$21$ 11
1
i i 2

5 $ 2
'
1 5
So,

200 Channel Capacity Channel Capacity 201

(d) Ternary channel. " # (b) This part is also an application of the conclusion problem 7.28. Here the capacity
2/3 1/3 0 of the added channel is log k.
p(y|x) = . (7.120)
0 1/3 2/3
Ĉ = log(2log k + 2C )
Solution: Capacity Ĉ = log(k + 2C )
Recall the parallel channels problem (problem 7.28 showed that for two channels in
parallel with capacities C1 and C2 , the capacity C of the new channel satisfies 36. Channel with memory.
2C = 2C1 + 2C2 Consider the discrete memoryless channel Y i = Zi Xi with input alphabet Xi ∈
{−1, 1}.
(a) Here C1 = C2 = 1 − H(p) , and hence 2C = 2C1 +1 , or,
(a) What is the capacity of this channel when {Z i } is i.i.d. with
C = 2 − H(p).
)
(b) Here C1 = 1 − H(p) but C2 = 0 and so 2C = 2C1 + 1 , or, 1, p = 0.5
Zi = ? (7.121)
0 1 −1, p = 0.5
1−H(p)
C = log 2 +1 .
Now consider the channel with memory. Before transmission begins, Z is ran-
(c) The p in the figure is a typo. All the transition probabilities are 1/2. The domly chosen and fixed for all time. Thus Y i = ZXi .
capacity of the ternary channel (which is symmetric) is log 3 − H( 21 ) = log 3 − 1 .
The capacity of the BSC is 0, and together the parallel channels have a capacity (b) What is the capacity if )
2C = 3/2 + 1 , or C = log 52 . 1, p = 0.5
Z= ? (7.122)
−1, p = 0.5
(d) The channel is weakly symmetric and hence the capacity is log3 − H( 31 , 23 ) =
log 3 − (log 3 − 32 ) = 32 . Solution: Channel with memory solution.
35. Capacity.
(a) This is a BSC with cross over probability 0.5, so C = 1 − H(p) = 0 .
Suppose channel P has capacity C, where P is an m × n channel matrix.
(b) Consider the coding scheme of sending X n = (1, b1 , b2 , . . . , bn−1 ) where the first
(a) What is the capacity of symbol is always a zero and the rest of the n − 1 symbols are ±1 bits. For the
" # first symbol Y1 = Z , so the receiver knows Z exactly. After that the receiver
P 0 can recover the remaining bits error free. So in n symbol transmissions n bits
P̃ =
0 1 are sent, for a rate R = n−1
n → 1 . The capacity C is bounded by log |X | = 1 ,
therefore the capacity is 1 bit per symbol.

(b) What about the capacity of 37. Joint typicality.

" # Let (Xi , Yi , Zi ) be i.i.d. according to p(x, y, z). We will say that (x n , y n , z n ) is jointly
P 0 (n)
typical (written (xn , y n , z n ) ∈ Aǫ ) if
P̂ =
0 Ik
where Ik if the k × k identity matrix. • p(xn ) ∈ 2−n(H(X)±ǫ)
• p(y n ) ∈ 2−n(H(Y )±ǫ)
Solution: Solution: Capacity.
• p(z n ) ∈ 2−n(H(Z)±ǫ)
(a) By adding the extra column and row to the transition matrix, we have two channels • p(xn , y n ) ∈ 2−n(H(X,Y )±ǫ)
in parallel. You can transmit on either channel. From problem 7.28, it follows that • p(xn , z n ) ∈ 2−n(H(X,Z)±ǫ)
C̃ = log(20 + 2C ) • p(y n , z n ) ∈ 2−n(H(Y,Z)±ǫ)
C̃ = log(1 + 2C ) • p(xn , y n , z n ) ∈ 2−n(H(X,Y,Z)±ǫ)
202 Channel Capacity

Now suppose (X̃ n , Ỹ n , Z̃ n ) is drawn according to p(xn )p(y n )p(z n ). Thus X̃ n , Ỹ n , Z̃ n

have the same marginals as p(xn , y n , z n ) but are independent. Find (bounds on)
(n)
P r{(X̃ n , Ỹ n , Z̃ n ) ∈ Aǫ } in terms of the entropies H(X), H(Y ), H(Z), H(X, Y ), H(X, Z), H(Y, Z)
and H(X, Y, Z).
Solution: Joint typicality.

Chapter 8
P r{(X̃ n , Ỹ n , Z̃ n ) ∈ A(n) p(xn )p(y n )p(z n )
!
ǫ } =
(n)
(xn ,y n ,z n )∈Aǫ

2−n(H(X)+H(Y )+H(Z)−3ǫ) Differential Entropy

!
≤
(n)
(xn ,y n ,z n )∈Aǫ

≤ |A(n)
ǫ |2
−n(H(X)+H(Y )+H(Z)−3ǫ)

≤ 2n(H(X,Y,Z)+ǫ) 2−n(H(X)+H(Y )+H(Z)−3ǫ)

≤ 2n(H(X,Y,Z)−H(X)−H(Y )−H(Z)+4ǫ)
N
1. Differential entropy. Evaluate the differential entropy h(X) = − f ln f for the
following:

(a) The exponential density, f (x) = λe−λx , x ≥ 0.

(b) The Laplace density, f (x) = 12 λe−λ|x| .
n n n
Aǫ(n) } n n n
!
P r{(X̃ , Ỹ , Z̃ ) ∈ = p(x )p(y )p(z )
(n) (c) The sum of X1 and X2 , where X1 and X2 are independent normal random
(xn ,y n ,z n )∈Aǫ
variables with means µi and variances σi2 , i = 1, 2.
2−n(H(X)+H(Y )+H(Z)+3ǫ)
!
≥
(n)
(xn ,y n ,z n )∈Aǫ Solution: Differential Entropy.
≥ |A(n)
ǫ |2
−n(H(X)+H(Y )+H(Z)−3ǫ)

≥ (1 − ǫ)2n(H(X,Y,Z)−ǫ) 2−n(H(X)+H(Y )+H(Z)−3ǫ) (a) Exponential distribution.

≥ (1 − ǫ)2n(H(X,Y,Z)−H(X)−H(Y )−H(Z)−4ǫ) 2 ∞
h(f ) = − λe−λx [ln λ − λx]dx (8.1)
0
= − ln λ + 1 nats. (8.2)
Note that the upper bound is true for all n, but the lower bound only hold for n large. e
= log bits. (8.3)
λ

(b) Laplace density.

2 ∞ 1 −λ|x| 1
h(f ) = − λe [ln + ln λ − λ|x|] dx (8.4)
−∞ 2 2
1
= − ln − ln λ + 1 (8.5)
2
2e
= ln nats. (8.6)
λ
2e
= log bits. (8.7)
λ
203

204 Differential Entropy Differential Entropy 205

(c) Sum of two normal distributions. and for a > 1 the density for Y is
The sum of two normal random variables is also normal, so applying the result 
 y + (a + 1)/2 −(a + 1)/2 ≤ y ≤ −(a − 1)/2
derived the class for the normal distribution, since X 1 + X2 ∼ N (µ1 + µ2 , σ12 + σ22 ) ,

pY (y) = 1/a −(a − 1)/2 ≤ y ≤ +(a − 1)/2
1  −y − (a + 1)/2

+(a − 1)/2 ≤ y ≤ +(a + 1)/2
h(f ) = log 2πe(σ12 + σ22 ) bits. (8.8)
2
(When a = 1 , the density of Y is triangular over the interval [−1, +1] .)
2. Concavity of determinants. Let K1 and K2 be two symmetric nonnegative definite
n × n matrices. Prove the result of Ky Fan[5]: (a) We use the identity I(X; Y ) = h(Y ) − h(Y |X) . It is easy to compute h(Y )
directly, but it is even easier to use the grouping property of entropy. First suppose
| λK1 + λK2 |≥| K1 |λ | K2 |λ , for 0 ≤ λ ≤ 1, λ = 1 − λ,
that a < 1 . With probability 1 − a , the output Y is conditionally uniformly
where | K | denotes the determinant of K. distributed in the interval [−(1 − a)/2, +(1 − a)/2] ; whereas with probability a ,
Hint: Let Z = Xθ , where X1 ∼ N (0, K1 ), X2 ∼ N (0, K2 ) and θ = Bernoulli (λ). Y has a split triangular density where the base of the triangle has width a .
Then use h(Z | θ) ≤ h(Z). 1
h(Y ) = H(a) + (1 − a) ln(1 − a) + a( + ln a)
Solution: Concavity of Determinants. Let X 1 and X2 be normally distributed n - 2
a a
vectors, Xi ∼ φKi (x) , i = 1, 2 . Let the random variable θ have distribution Pr{θ = = −a ln a − (1 − a) ln(1 − a) + (1 − a) ln(1 − a) + + a ln a = nats.
1} = λ , Pr{θ = 2} = 1 − λ, 0 ≤ λ ≤ 1 . Let θ , X1 , and X2 be independent and 2 2
let Z = Xθ . Then Z has covariance KZ = λK1 + (1 − λ)K2 . However, Z will not be If a > 1 the trapezoidal density of Y can be scaled by a factor a , which yields
multivariate normal. However, since a normal distribution maximizes the entropy for h(Y ) = ln a+1/2a . Given any value of x , the output Y is conditionally uniformly
a given variance, we have distributed over an interval of length a , so the conditional differential entropy in
nats is h(Y |X) = h(Z) = ln a for all a > 0 . Therefore the mutual information in
1 1 1
ln(2πe)n |λK1 +(1−λ)K2 | ≥ h(Z) ≥ h(Z|θ) = λ ln(2πe)n |K1 |+(1−λ) ln(2πe)n |K2 | . nats is
2 2 2
)
(8.9) a/2 − ln a if a ≤ 1
I(X; Y ) =
Thus 1/2a if a ≥ 0 .
λ 1−λ
|λK1 + (1 − λ)K2 | ≥ |K1 | |K2 | , (8.10) As expected, I(X; Y ) → ∞ as a → 0 and I(X; Y ) → 0 as a → ∞ .
as desired. (b) As usual with additive noise, we can express I(X; Y ) in terms of h(Y ) and h(Z) :
3. Uniformly distributed noise. Let the input random variable X to a channel be I(X; Y ) = h(Y ) − h(Y |X) = h(Y ) − h(Z) .
uniformly distributed over the interval −1/2 ≤ x ≤ +1/2 . Let the output of the
channel be Y = X + Z , where the noise random variable is uniformly distributed over Since both X and Z are limited to the interval [−1/2, +1/2] , their sum Y is
limited to the interval [−1, +1] . The differential entropy of Y is at most that of
the interval −a/2 ≤ z ≤ +a/2 .
a random variable uniformly distributed on that interval; that is, h(Y ) ≤ 1 . This
(a) Find I(X; Y ) as a function of a . maximum entropy can be achieved if the input X takes on its extreme values x =
(b) For a = 1 find the capacity of the channel when the input X is peak-limited; that ±1 each with probability 1/2. In this case, I(X; Y ) = h(Y ) − h(Z) = 1 − 0 = 1 .
is, the range of X is limited to −1/2 ≤ x ≤ +1/2 . What probability distribution Decoding for this channel is quite simple:
on X maximizes the mutual information I(X; Y ) ? )
−1/2 if y < 0
(c) (Optional) Find the capacity of the channel for all values of a , again assuming X̂ =
+1/2 if y ≥ 0 .
that the range of X is limited to −1/2 ≤ x ≤ +1/2 .
This coding scheme transmits one bit per channel use with zero error probability.
Solution: Uniformly distributed noise. The probability density function for Y = X +Z (Only a received value y = 0 is ambiguous, and this occurs with probability 0.)
is the convolution of the densities of X and Z . Since both X and Z have rectangular
(c) When a is of the form 1/m for m = 2, 3, . . . , we can achieve the maximum
densities, the density of Y is a trapezoid. For a < 1 the density for Y is
 possible value I(X; Y ) = log m when X is uniformly distributed over the discrete
 (1/2a)(y + (1 + a)/2)
 −(1 + a)/2 ≤ y ≤ −(1 − a)/2 points {−1, −1+2/(m−1), . . . , +1−2/(m−1), +1} . In this case Y has a uniform
pY (y) = 1 −(1 − a)/2 ≤ y ≤ +(1 − a)/2 probability density on the interval [−1 − 1/(m − 1), +1 + 1/(m − 1)] . Other values
 (1/2a)(−y − (1 + a)/2)

+(1 − a)/2 ≤ y ≤ +(1 + a)/2 of a are left as an exercise.
206 Differential Entropy Differential Entropy 207

4. Quantized random variables. Roughly how many bits are required on the average or
to describe to 3 digit accuracy the decay time (in years) of a radium atom if the half-life q(x) = c' xp(x) (8.21)
of radium is 80 years? Note that half-life is the median of the distribution.
where c' has to be chosen to satisfy the constraint,
$
x q(x) = 1 . Thus
Solution: Quantized random variables. The differential entropy of an exponentially
distributed random variable with mean 1/λ is log λe bits. If the median is 80 years, 1
c' = $ (8.22)
then 2 80 x xp(x)
1
λe−λx dx = (8.11)
0 2 Substituting this in the expression for J , we obtain
or
ln 2 c' xp(x)
J∗ = c' xp(x) ln x − c' xp(x) ln
! !
λ= = 0.00866 (8.12) (8.23)
80 p(x)
x x
and the differential entropy is log e/λ . To represent the random variable to 3 digits ≈
= − ln c' + c' xp(x) ln x − c' xp(x) ln x
! !
(8.24)
10 bits accuracy would need log e/λ + 10 bits = 18.3 bits. x x
!
= ln xp(x) (8.25)
N
5. Scaling. Let h(X) = − f (x) log f (x) dx . Show h(AX) = log | det(A) | +h(X).
x
Solution: Scaling. Let Y = AX . Then the density of Y is
1 To verify this is indeed a maximum value, we use the standard technique of writing it
g(y) = f (A−1 y). (8.13) as a relative entropy. Thus
|A|
Hence
! ! ! q(x) ! q(x)
ln xp(x) − q(x) ln x + q(x) ln = q(x) ln (8.26)
2
x x x p(x) x $xp(x)
yp(y)
h(AX) = − g(y) ln g(y) dy (8.14) y

= D(q||p' ) (8.27)
1
2 O P
= − f (A−1 y) ln f (A−1 y) − log |A| dy (8.15) ≥ 0 (8.28)
|A|
1
2
= − f (x) [ln f (x) − log |A|] |A| dx (8.16) Thus
|A| !
ln xp(x) = sup (EQ ln(X) − D(Q||P )) (8.29)
= h(X) + log |A|. (8.17) x Q

6. Variational inequality: Verify, for positive random variables X , that This is a special case of a general relationship that is a key in the theory of large
deviations.
log EP (X) = sup [EQ (log X) − D(Q||P )] (8.18)
Q 7. Differential entropy bound on discrete entropy: Let X be a discrete random
variable on the set X = {a1 , a2 , . . .} with Pr(X = ai ) = pi . Show that
where EP (X) = x xP (x) and D(Q||P ) = x Q(x) log PQ(x)
$ $
(x) , and the supremum is
over all Q(x) ≥ 0 ,
$
Q(x) = 1 . It is enough to extremize J(Q) = E Q ln X −D(Q||P )+
 /2 
∞ ∞
.
1 1
log(2πe)  pi i2 −
! !
λ( Q(x) − 1) .
$
H(p1 , p2 , . . .) ≤ ipi + . (8.30)
2 i=1 i=1
12
Solution: Variational inequality
Using the calculus of variations to extremize Moreover, for every permutation σ ,
! ! q(x) !
∞

∞
. /2 
J(Q) = q(x) ln x − q(x) ln + λ( q(x) − 1) (8.19) 1 1
log(2πe)  pσ(i) i2 −
! !
x x p(x) x H(p1 , p2 , . . .) ≤ ipσ(i) + . (8.31)
2 i=1 i=1
12
we differentiate with respect to q(x) to obtain
∂J q(x) Hint: Construct a random variable X ' such that Pr(X ' = i) = pi . Let U be an
= ln x − ln −1+λ=0 (8.20)
∂q(x) p(x) uniform(0,1] random variable and let Y = X ' + U , where X ' and U are independent.

208 Differential Entropy Differential Entropy 209

Use the maximum entropy bound on Y to obtain the bounds in the problem. This
bound is due to Massey (unpublished) and Willems(unpublished). X̃ 5
Solution: Differential entropy bound on discrete entropy
Of all distributions with the same variance, the normal maximizes the entropy. So the
entropy of the normal gives a good bound on the differential entropy in terms of the
variance of the random variable.
Let X be a discrete random variable on the set X = {a 1 , a2 , . . .} with

Pr(X = ai ) = pi . (8.32)

 /2 
∞ ∞
.
1 1 '
log(2πe)  pi i2 −
! !
H(p1 , p2 , . . .) ≤ ipi + . (8.33)
2 12
i=1 i=1 f (X̃)
Moreover, for every permutation σ ,
 /2  Figure 8.1: Distribution of X̃ .
∞ ∞
.
1 1
log(2πe)  pσ(i) i2 −
! !
H(p1 , p2 , . . .) ≤ ipσ(i) + . (8.34)
2 12
i=1 i=1 Hence we have the following chain of inequalities:

Define two new random variables. The first, X 0 , is an integer-valued discrete random H(X) = H(X0 ) (8.42)
variable with the distribution = h(X̃) (8.43)
Pr(X0 = i) = pi . (8.35) 1
≤ log(2πe)Var(X̃) (8.44)
Let U be a random variable uniformly distributed on the range [0, 1] , independent of 2
X0 . Define the continuous random variable X̃ by 1
= log(2πe) (Var(X0 ) + Var(U )) (8.45)
2
X̃ = X0 + U. (8.36)
 /2 
∞ ∞
.
1 1
log(2πe)  pi i2 −
! !
= ipi + . (8.46)
2 i=1 i=1
12
The distribution of the r.v. X̃ is shown in Figure 8.1.
It is clear that H(X) = H(X0 ) , since discrete entropy depends only on the probabilities
and not on the values of the outcomes. Now Since entropy is invariant with respect to permutation of p 1 , p2 , . . . , we can also obtain
a bound by a permutation of the pi ’s. We conjecture that a good bound on the variance
∞
! will be achieved when the high probabilities are close together, i.e, by the assignment
H(X0 ) = − pi log pi (8.37)
i=1
. . . , p5 , p3 , p1 , p2 , p4 , . . . for p1 ≥ p2 ≥ · · · .
∞ *2 i+1 i+1 How good is this bound? Let X be a Bernoulli random variable with parameter 12 ,
! + *2 +
= − fX̃ (x) dx log fX̃ (x) dx (8.38) which implies that H(X) = 1 . The corresponding random variable X 0 has variance
i=1 i i
1
!∞ 2 i+1 4 , so the bound is
= − fX̃ (x) log fX̃ (x) dx (8.39)
1 1 1
* +
i=1 i H(X) ≤ log(2πe) + = 1.255 bits. (8.47)
2 ∞ 2 4 12
= − fX̃ (x) log fX̃ (x) dx (8.40)
1
= h(X̃), (8.41) 8. Channel with uniformly distributed noise: Consider a additive channel whose
input alphabet X = {0, ±1, ±2} , and whose output Y = X + Z , where Z is uniformly
since fX̃ (x) = pi for i ≤ x < i + 1 . distributed over the interval [−1, 1] . Thus the input of the channel is a discrete random
210 Differential Entropy Differential Entropy 211

variable, while the output is continuous. Calculate the capacity C = max p(x) I(X; Y ) We can thus conclude that
of this channel.
1
Solution: Uniformly distributed noise I(X; Y ) = − log(1 − ρ2xy ρ2zy )
2
We can expand the mutual information
I(X; Y ) = h(Y ) − h(Y |X) = h(Y ) − h(Z) (8.48) 10. The Shape of the Typical Set
and h(Z) = log 2 , since Z ∼ U (−1, 1) . Let Xi be i.i.d. ∼ f (x) , where
4
f (x) = ce−x .
The output Y is a sum a of a discrete and a continuous random variable, and if the
probabilities of X are p−2 , p−1 , . . . , p2 , then the output distribution of Y has a uniform (n)
= {xn ∈ Rn :
N
Let h = − f ln f . Describe the shape (or form) or the typical set A ǫ
distribution with weight p−2 /2 for −3 ≤ Y ≤ −2 , uniform with weight (p −2 + p−1 )/2 f (xn ) ∈ 2−n(h±ǫ) } .
for −2 ≤ Y ≤ −1 , etc. Given that Y ranges from -3 to 3, the maximum entropy
Solution: The Shape of the Typical Set
that it can have is an uniform over this range. This can be achieved if the distribution
of X is (1/3, 0, 1/3,0,1/3). Then h(Y ) = log 6 and the capacity of this channel is
C = log 6 − log 2 = log 3 bits. We are interested in the set { xn ∈ R : f (xn ) ∈ 2−n(h±ǫ) }. This is:

9. Gaussian mutual information. Suppose that (X, Y, Z) are jointly Gaussian and
2−n(h−ǫ) ≤ f (xn ) ≤ 2−n(h+ǫ)
that X → Y → Z forms a Markov chain. Let X and Y have correlation coefficient
ρ1 and let Y and Z have correlation coefficient ρ 2 . Find I(X; Z) . Since Xi are i.i.d.,
Solution: Gaussian Mutual Information
First note that we may without any loss of generality assume that the means of X ,
n
Y and Z are zero. If in fact the means are not zero one can subtract the vector of
f (xn ) =
B
f (x) (8.49)
means without affecting the mutual information or the conditional independence of X ,
i=1
Z given Y . Let . / n
σx2 −x4i
B
σx σz ρxz = ce (8.50)
Λ= ,
σ σ ρ σ2
x z xz z i=1
$n
x4i
be the covariance matrix of X and Z . We can now use Eq. (8.34) to compute = enln(c)− i=1 (8.51)
(8.52)
I(X; Z) = h(X) + h(Z) − h(X, Z)
1 1 1
= log (2πeσx2 ) + log (2πeσx2 ) − log (2πe|Λ|) Plugging this in for f (xn ) in the above inequality and using algebraic manipulation
2 2 2
1 gives:
2
= − log(1 − ρxz )
2 n
x4i ≥ n(ln(c) + (h + ǫ)ln(2))
!
Now, n(ln(c) + (h − ǫ)ln(2)) ≥
i=1
E{XZ}
ρxz = @ A
σx σz So the shape of the typcial set is the shell of a 4-norm ball xn : ||xn ||4 ∈ (n(ln(c) + (h ± ǫ)ln(2)))1/4
E{E{XZ|Y }} .
=
σx σz
E{E{X|Y }E{Z|Y }} 11. Non ergodic Gaussian process.
= Consider a constant signal V in the presence of iid observational noise {Z i } . Thus
σx σz
0
σx ρxy
10
σz ρzx
1 Xi = V + Zi , where V ∼ N (0, S) , and Zi are iid ∼ N (0, N ) . Assume V and {Zi }
E{ σy Y σy Y }
= are independent.
σx σz
= ρxy ρzy (a) Is {Xi } stationary?

212 Differential Entropy Differential Entropy 213

1 $n
(b) Find limn−→∞ n i=1 Xi . Is the limit random? (d) By iterated expectation we can write
(c) What is the entropy rate h of {Xi } ? 0 12 * *0 12 33 ++
E Xn+1 − X̂n+1 (X n ) =E E Xn+1 − X̂n+1 (X n ) 33 X n (8.64)
(d) Find the least mean squared error predictor X̂n+1 (X n ) and find σ∞
2 = lim
n−→∞ E(X̂n −
Xn )2 . We note that minimizing the expression is equivalent to minimizing the inner
(e) Does {Xi } have an AEP? That is, does − n1 log f (X n ) −→ h ? expectation, and that for the inner expectation the predictor is a nonrandom
variable. Expanding the inner expectation and taking the derivative with respect
Solution: Nonergodic Gaussian process to the estimator X̂n+1 (X n ) , we get
0 1
(a) Yes. EXi = EV + Zi = 0 for all i , and E (Xn+1 − X̂n+1 (X n ))2 |X n
0 1
2
− 2Xn+1 X̂n+1 (X n ) + X̂n+1
2
(X n ))|X n
)
S, i=j =E (Xn+1 (8.65)
EXi Xj = E(V + Zi )(V + Zj ) = (8.53)
S + N. i =
% j
so
0 1
Since Xi is Gaussian distributed it is completely characterized by its first and dE (Xn+1 − X̂n+1 (X n ))2 |X n 0 1
second moments. Since the moments are stationary, X i is wide sense stationary, = E −2Xn+1 + 2X̂n+1 (X n )|X n (8.66)
which for a Gaussian distribution implies that X i is stationary. dX̂n+1 (X n )
= −2E(Xn+1 |X n ) + 2X̂n+1 (X n ) (8.67)
(b)
n n
Setting the derivative equal to 0, we see that the optimal X̂n+1 (X n ) = E(Xn+1 |X n ) .
1! 1!
lim Xi = lim (Zi + V ) (8.54) To find the limiting squared error for this estimator, we use the fact that V and
n→∞ n i=1 n→∞ n i=1 X n are normal random variables with known covariance matrix, and therefore the
n conditional distribution of V given X n is
1!
= V + lim Zi (8.55)
n→∞ n . n
/
i=1 S SN
f (V |X n ) ∼ N
!
= V + EZi (by the strong law of large numbers) (8.56) Xi , (8.68)
nS + N i=1 nS + N
= V (8.57)
Now
The limit is a random variable N (0, S) .
X̂n+1 (X n ) = E(Xn+1 |X n ) (8.69)
(c) Note that Xn ∼ N (0, KX n ) , where KX n has diagonal values of S + N and off
= E(V |X n ) + E(Zn+1 |X n ) (8.70)
diagonal values of S . Also observe that the determinant is |K X n | = N n (nS/N + n
1) . We now compute the entropy rate as: S !
= Xi + 0 (8.71)
nS + N i=1
1
h(X ) = lim h(X1 , X2 , . . . , Xn ) (8.58) and hence the limiting squared error
n
1
= lim log((2πe)n |KX n |) (8.59) e2 = lim E(X̂n − Xn )2 (8.72)
2n n→∞
1 nS /2
* * ++
n−1
.
= lim log (2πe)n N n +1 (8.60) S !
2n N = lim E Xi − Xn (8.73)
n→∞ (n − 1)S + N i=1
1 1 nS
* +
= lim log(2πeN )n + log +1 (8.61) . n−1
/2
2n 2n N S !
1 1
*
nS
+ = lim E (Zi + V ) − Zn − V (8.74)
= log 2πeN + lim log +1 (8.62)
n→∞ (n − 1)S + N i=1
2 2n N /2
n−1
.
1 S ! N
= log 2πeN (8.63) = lim E Zi − Zn − V (8.75)
2 n→∞ (n − 1)S + N i=1
(n − 1)S + N
214 Differential Entropy Differential Entropy 215
+2 n−1 +2 ǫ
S N setting s = .
* *
EZi2 EZn2 2
!
= lim + + EV(8.76) 2(1+ǫ)
n→∞ (n − 1)S + N i=1
(n − 1)S + N Thus
+2 +2
S N 1 31 1 ! 2 33
* * Q3 3 R Q3 * +3 R
Pr 33− ln f (X n ) − hn 33 > ǫ
3 3
= lim (n − 1)N + N + S (8.77) = Pr 33 −1 + Wi 3 > ǫ (8.92)
n→∞ (n − 1)S + N (n − 1)S + N n 2 n
= 0+N +0 (8.78) −n (ǫ−ln(1+ǫ))
≤ e 2 (8.93)
= N (8.79)
and the bound goes to 0 as n → ∞ , and therefore by the Borel Cantelli lemma,
(e) Even though the process is not ergodic, it is stationary, and it does have an AEP 1
because − ln f (X n ) − hn → 0 (8.94)
n
1 1 1 t −1
with probability 1. So Xi satisfies the AEP even though it is not ergodic.
− ln f (X n ) = − ln e−X KX n X/2 (8.80)
n n (2π)n/2 |KX n | 12
1 1 1 t −1
= ln(2π)n + ln |KX n | + X KX n X (8.81)
2n 2n 2n
1 1 1 t −1
= ln(2πe)n |KX n | − + X KX n X (8.82)
2n 2 2n
1 1 1 t −1
= h(X n ) − + X KX n X (8.83)
n 2 2n
(8.84)
1
Since X ∼ N (0, K) , we can write X = K 2 W , where W ∼ N (0, I) . Then
1 1
X t K −1 X = W t K 2 K −1 K 2 W = W t W =
$ 2
Wi , and therefore X t K −1 X has a
χ2 distribution with n degrees of freedom. The density of the χ 2 distribution is
n x
x 2 −1 e− 2
f (x) = n (8.85)
Γ( n2 )2 2

The moment generating function of the χ 2 distribution is

2
M (t) = f (x)etx dx (8.86)
n x
x 2 −1 e− 2 tx
2
= n e dx (8.87)
Γ( n2 )2 2
n
1
2 n ((1 − 2t)x) 2 −1 e−(1−2t)x/2
(1−2t) 2 −1
= n (1 − 2t) dx (8.88)
Γ( n2 )2 2
1
= n (8.89)
(1 − 2t) 2

By the Chernoff bound (Lemma 11.19.1)

1! 2
Q R
n
Pr Wi > 1 + ǫ ≤ min e−s(1+ǫ) (1 − 2s)− 2 (8.90)
n s
−n (ǫ−ln(1+ǫ))
≤ e 2 (8.91)

Ict Solution
No ratings yet
Ict Solution
41 pages
Chapter 2
No ratings yet
Chapter 2
68 pages
Elements of Information Theory-Chapter1-2
No ratings yet
Elements of Information Theory-Chapter1-2
63 pages
2 Information Theory
No ratings yet
2 Information Theory
40 pages
(397 P. COMPLETE SOLUTIONS) Elements of Information Theory 2nd Edition - COMPLETE Solutions Manual (Chapters 1-17)
85% (55)
(397 P. COMPLETE SOLUTIONS) Elements of Information Theory 2nd Edition - COMPLETE Solutions Manual (Chapters 1-17)
397 pages
Mathematical Problems and Solutions On Information Theory
No ratings yet
Mathematical Problems and Solutions On Information Theory
28 pages
Information Theory Textbook
No ratings yet
Information Theory Textbook
14 pages
2 Information Measurement and Entropy
No ratings yet
2 Information Measurement and Entropy
23 pages
It Co 1 en
No ratings yet
It Co 1 en
26 pages
Ifo Theo&Coding-Compiled-PSS
No ratings yet
Ifo Theo&Coding-Compiled-PSS
156 pages
Entropy 4
No ratings yet
Entropy 4
10 pages
CS340 Machine Learning Information Theory
No ratings yet
CS340 Machine Learning Information Theory
22 pages
Elements of Information Theory 2006 Thomas M. Cover and Joy A. Thomas
No ratings yet
Elements of Information Theory 2006 Thomas M. Cover and Joy A. Thomas
16 pages
Lec38 - 210108071 - AKSHAY KUMAR JHA
No ratings yet
Lec38 - 210108071 - AKSHAY KUMAR JHA
12 pages
Information Theory and Computing Assignment No. 1: April 10, 2020
No ratings yet
Information Theory and Computing Assignment No. 1: April 10, 2020
12 pages
E2 201: Information Theory (2019) Solutions To Homework 3
No ratings yet
E2 201: Information Theory (2019) Solutions To Homework 3
11 pages
Information Theory and Computing Assignment No. 1: April 10, 2020
No ratings yet
Information Theory and Computing Assignment No. 1: April 10, 2020
12 pages
Lecture 3 - Entropy
No ratings yet
Lecture 3 - Entropy
35 pages
MIT16 36s09 Lec03
No ratings yet
MIT16 36s09 Lec03
10 pages
Mutual Information
No ratings yet
Mutual Information
48 pages
Notes It
No ratings yet
Notes It
46 pages
Math7224 Notes
No ratings yet
Math7224 Notes
32 pages
SS 19
No ratings yet
SS 19
22 pages
Practice 2
No ratings yet
Practice 2
7 pages
Entropy, Relative Entropy and Mutual Information
No ratings yet
Entropy, Relative Entropy and Mutual Information
38 pages
LECTURE 1: Introduction
No ratings yet
LECTURE 1: Introduction
16 pages
No of Flips For First Head
No ratings yet
No of Flips For First Head
8 pages
Copenhagen Burnout Inventory
100% (3)
Copenhagen Burnout Inventory
12 pages
Solved Problems
No ratings yet
Solved Problems
7 pages
Problem Set 1
No ratings yet
Problem Set 1
3 pages
CoverThomas Ch2 PDF
No ratings yet
CoverThomas Ch2 PDF
38 pages
Lecturer: Mark Braverman Scribe: Mark Braverman: COS597D: Information Theory in Computer Science
No ratings yet
Lecturer: Mark Braverman Scribe: Mark Braverman: COS597D: Information Theory in Computer Science
5 pages
Ee5143 Pset1 PDF
No ratings yet
Ee5143 Pset1 PDF
4 pages
Lecture 3: Entropy, Relative Entropy, and Mutual Information
No ratings yet
Lecture 3: Entropy, Relative Entropy, and Mutual Information
5 pages
HW 1 Sol
No ratings yet
HW 1 Sol
5 pages
Homework 3 Solutions
No ratings yet
Homework 3 Solutions
9 pages
1.1 Shannon's Information Measures: Lecture 1 - January 26
No ratings yet
1.1 Shannon's Information Measures: Lecture 1 - January 26
5 pages
Midit 10
No ratings yet
Midit 10
5 pages
Solution To Homework #1: (A) (B) (C) (D)
No ratings yet
Solution To Homework #1: (A) (B) (C) (D)
4 pages
Solution To Homework #1: (A) (B) (C) (D)
No ratings yet
Solution To Homework #1: (A) (B) (C) (D)
4 pages
(Some) Solutions For HW Set # 2
No ratings yet
(Some) Solutions For HW Set # 2
3 pages
Machine Learning Assignment Report - Cars
100% (4)
Machine Learning Assignment Report - Cars
42 pages
HW 2
No ratings yet
HW 2
3 pages
Lecture 1: Introduction, Entropy and ML Estimation
No ratings yet
Lecture 1: Introduction, Entropy and ML Estimation
5 pages
HW 1
No ratings yet
HW 1
4 pages
Info Theory Exercise Solutions
No ratings yet
Info Theory Exercise Solutions
16 pages
Probc 1
No ratings yet
Probc 1
4 pages
Elements of Information Theory.2nd Ex 2.4
No ratings yet
Elements of Information Theory.2nd Ex 2.4
4 pages
Assignment 1
No ratings yet
Assignment 1
3 pages
HW 1
No ratings yet
HW 1
3 pages
The Binary Entropy Function: ECE 7680 Lecture 2 - Definitions and Basic Facts
No ratings yet
The Binary Entropy Function: ECE 7680 Lecture 2 - Definitions and Basic Facts
8 pages
Indian Institute of Technology Bombay
No ratings yet
Indian Institute of Technology Bombay
6 pages
Chapter 2 Exercises
No ratings yet
Chapter 2 Exercises
2 pages
Exercise Problems: Information Theory and Coding
No ratings yet
Exercise Problems: Information Theory and Coding
6 pages
Ateneo de Davao University: Welcome To Statistics and Probability Subject!
No ratings yet
Ateneo de Davao University: Welcome To Statistics and Probability Subject!
13 pages
Math5846 Chapter10
No ratings yet
Math5846 Chapter10
76 pages
Lecture 3: Entropy, Relative Entropy, and Mutual Information
No ratings yet
Lecture 3: Entropy, Relative Entropy, and Mutual Information
5 pages
Entropy, Relative Entropy and Mutual Information
No ratings yet
Entropy, Relative Entropy and Mutual Information
4 pages
PT Module5
No ratings yet
PT Module5
30 pages
06 - Data Preprocessing
No ratings yet
06 - Data Preprocessing
68 pages
Ms. Koni Bernadette C. Tarayao Faculty in Mathematics College of Education
No ratings yet
Ms. Koni Bernadette C. Tarayao Faculty in Mathematics College of Education
68 pages
Strategic Practice and Homework 5
100% (1)
Strategic Practice and Homework 5
14 pages
Robust Estimation Methods and Outlier Detection in Mediation Model
No ratings yet
Robust Estimation Methods and Outlier Detection in Mediation Model
25 pages
Panel Stata Command
No ratings yet
Panel Stata Command
6 pages
Production Planning and Control (MEFB 433) : Ts. Zubaidi Faiesal Email: Zubaidi@uniten - Edu.my Room No. BN-1-010
No ratings yet
Production Planning and Control (MEFB 433) : Ts. Zubaidi Faiesal Email: Zubaidi@uniten - Edu.my Room No. BN-1-010
36 pages
Chapter 9 QA
No ratings yet
Chapter 9 QA
4 pages
Sample Problemsfor Confidence Intervals 924152003
No ratings yet
Sample Problemsfor Confidence Intervals 924152003
22 pages
Exercises and Cases in Econometrics
No ratings yet
Exercises and Cases in Econometrics
30 pages
Denoising Diffusion Probabilistic Models in Six Simple Steps
No ratings yet
Denoising Diffusion Probabilistic Models in Six Simple Steps
15 pages
CHAPTER 4-5 Cashless Policy Data Analysis
No ratings yet
CHAPTER 4-5 Cashless Policy Data Analysis
22 pages
Math F432
No ratings yet
Math F432
3 pages
Causal Research
No ratings yet
Causal Research
5 pages
Course Outline - Probability & Statistics (14-02-2022)
No ratings yet
Course Outline - Probability & Statistics (14-02-2022)
4 pages
Section 1 - Section 1 Question No.1 Bookmark: Examination: M.Sc. Statistics
No ratings yet
Section 1 - Section 1 Question No.1 Bookmark: Examination: M.Sc. Statistics
22 pages
Probability & Statistics, Discrete Random Variables, RL 2.1.1
No ratings yet
Probability & Statistics, Discrete Random Variables, RL 2.1.1
21 pages
Logistic Regression Using SAS
No ratings yet
Logistic Regression Using SAS
22 pages
Statistics & Probability Quarter 3 Week 1 Day1&2
No ratings yet
Statistics & Probability Quarter 3 Week 1 Day1&2
5 pages
Data Analysis - Answer
No ratings yet
Data Analysis - Answer
9 pages
Study Guide - Normal Probability Distributions
No ratings yet
Study Guide - Normal Probability Distributions
13 pages
Stats Cheatsheet Final
No ratings yet
Stats Cheatsheet Final
2 pages
What Is Considered A Good AUC Score
No ratings yet
What Is Considered A Good AUC Score
5 pages
Chapter 6 The Normal Distribution
No ratings yet
Chapter 6 The Normal Distribution
3 pages
A Skewed Generalized Discrete Laplace Distribution: Seetha Lekshmi, V - Simi Sebastian
No ratings yet
A Skewed Generalized Discrete Laplace Distribution: Seetha Lekshmi, V - Simi Sebastian
8 pages
QNT 561 Week 2 Weekly Learning Assessments - Assignment
No ratings yet
QNT 561 Week 2 Weekly Learning Assessments - Assignment
6 pages
Square Summable Power Series
From Everand
Square Summable Power Series
Louis de Branges
5/5 (1)
Theory of Approximation
From Everand
Theory of Approximation
N. I. Achieser
No ratings yet
Differential Forms
From Everand
Differential Forms
Henri Cartan
5/5 (2)
The Green Book of Mathematical Problems
From Everand
The Green Book of Mathematical Problems
Kenneth Hardy
4.5/5 (3)
Calculus I Essentials
From Everand
Calculus I Essentials
Editors of REA
1/5 (1)
Math for Computer Applications
From Everand
Math for Computer Applications
The Editors of REA
No ratings yet