Concentration
Karthik Sridharan
Abstract
These notes are meant to be a review of some basic inequalities and
bounds on random variables. A basic understanding of probability
theory and set algebra might be required of the reader. This document
aims to provide clear and complete proofs of some inequalities. For
readers familiar with the topics, many of the steps might seem trivial.
Nonetheless, they are provided to simplify the proofs for readers new
to the topic. These notes also provide, to the best of my knowledge,
the most generalized statement and proof of the symmetrization lemma. I
also provide the less famous but generalized proofs of Jensen's inequality
and the logarithmic Sobolev inequality. Refer to [2] for a more detailed review
of many of these inequalities with examples demonstrating their uses.
1 Preliminary
Throughout these notes we shall consider a probability space $(\Omega, \mathcal{E}, P)$ where
$\Omega$ is the sample space, $\mathcal{E}$ is the event class, which is a $\sigma$-algebra on $\Omega$, and $P$
is a probability measure. Further, we shall assume that there exists a Borel
measurable function mapping every point $\omega \in \Omega$ uniquely to a real number,
called a random variable. We shall call the space of random variables $X$
(note that $X \subseteq \mathbb{R}$). Further, we shall assume that all the functions and sets
defined in these notes are measurable under the probability measure $P$.
2 Chebyshev's Inequality
Let us start these notes by proving what is referred to as Chebyshev's Inequality
in [3]. Note that, often, by Chebyshev's inequality one means an inequality derived from
the theorem proved below. However, [3] refers to this inequality as
Tchebycheff's Inequality and the same convention is followed in these notes.
Theorem 1 For some $a \in X$ where $X \subseteq \mathbb{R}$, let $f$ be a non-negative function
such that $f(x) \geq b$ for all $x \geq a$, where $b \in Y$ and $Y \subseteq \mathbb{R}$. Then the following
inequality holds,
$$P(x \geq a) \leq \frac{E\{f(x)\}}{b}$$
Proof Let the set $X_1 = \{x : x \geq a \text{ and } x \in X\}$; therefore we have
$$X_1 \subseteq X$$
Since $f$ is a non-negative function, taking the Lebesgue integral of the function over the sets $X_1$ and $X$ we have,
$$\int_X f\,dP \geq \int_{X_1} f\,dP \geq b\int_{X_1} dP$$
where $\int_{X_1} dP = P(X_1)$. However, the Lebesgue integral of a function over a probability measure is its expectation. Hence we have,
$$E\{f(x : x \in X)\} \geq b\,P(X_1)$$
$$\Rightarrow E\{f(x)\} \geq b\,P(x \geq a)$$
and hence
$$P(x \geq a) \leq \frac{E\{f(x)\}}{b} \qquad (1)$$
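As a quick numerical sanity check of Theorem 1 (this snippet is only an illustrative sketch; the exponential distribution, sample size and threshold are arbitrary choices, not part of the notes), one can estimate $P(x \geq a)$ by Monte Carlo and compare it with $E\{f(x)\}/b$ for $f(x) = x$ and $b = a$:

```python
import numpy as np

# Illustrative check of Theorem 1 with f(x) = x and b = a (a Markov-type bound).
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative random variable, E[x] = 1
a = 3.0

empirical = np.mean(x >= a)    # Monte Carlo estimate of P(x >= a)
bound = x.mean() / a           # E{f(x)}/b with f(x) = x, b = a

print(f"P(x >= {a}) ~ {empirical:.4f}  <=  bound {bound:.4f}")
```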
3 Information Theoretic Bounds
3.1 Jensen’s Inequality
Here we shall state and prove a generalized, measure-theoretic form of
Jensen's inequality. In probability theory, a more specific form
of Jensen's inequality is better known. But before that, we shall first define a convex function.
Proof From our assumptions, the range of $f$ is $(a, b)$, which is the interval
on which $\phi(x)$ is convex. Hence, consider the number
$$x_0 = \frac{\int_A f\,d\mu}{\int_A d\mu}$$
Clearly it is within the interval $(a, b)$. Further, from Equation (4) we have, for almost every $x$,
$$\phi(f(x)) \geq \phi(x_0) + \beta\,(f(x) - x_0)$$
where $\beta$ is the slope of a supporting line of $\phi$ at $x_0$. Integrating both sides over $A$ with respect to $\mu$ and dividing by $\int_A d\mu$ gives
$$\frac{\int_A \phi(f)\,d\mu}{\int_A d\mu} \geq \phi\Big(\frac{\int_A f\,d\mu}{\int_A d\mu}\Big)$$
Now note that if $\mu$ is a probability measure then $\int_A d\mu = 1$, and since the
expected value is simply the Lebesgue integral of the function with respect to the probability
measure, we have
$$E[\phi(f(x))] \geq \phi(E[f(x)]) \qquad (6)$$
for any function $\phi$ convex on the range of the function $f(x)$.
Now a function $\phi(x)$ is convex if $\phi'(x)$ exists and is monotonically increasing, or if its second derivative exists and is non-negative. Therefore we
can conclude that the function $-\log x$ is a convex function. Therefore, by
Jensen's inequality, we have
$$E[-\log f(x)] \geq -\log(E[f(x)])$$
Now if we take the function $f(x)$ to be the probability, we get the result
that the entropy $H(P)$ is always greater than or equal to $0$. If we make $f(x)$ the
ratio of the two probability measures $dQ$ and $dP$, we get the result that the relative
entropy or KL divergence of two distributions is always non-negative. That
is,
$$D(P\|Q) = E_P\Big[-\log\Big(\frac{dQ(x)}{dP(x)}\Big)\Big] \geq -\log\Big(E_P\Big[\frac{dQ(x)}{dP(x)}\Big]\Big) = -\log\Big(\int dQ(x)\Big) = 0$$
Therefore,
$$D(P\|Q) \geq 0$$
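Both consequences above are easy to verify numerically. The sketch below (the two discrete distributions are arbitrary examples, not taken from the notes) computes $D(P\|Q)$ and checks Jensen's inequality for $\phi(x) = -\log x$ with $f(x) = \frac{dQ(x)}{dP(x)}$:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # distribution P (arbitrary example)
q = np.array([0.2, 0.5, 0.3])   # distribution Q (arbitrary example)

# KL divergence D(P||Q) = E_P[-log(q/p)]
kl = np.sum(p * np.log(p / q))
print("D(P||Q) =", kl, ">= 0:", kl >= 0)

# Jensen's inequality with phi(x) = -log x and f(x) = q(x)/p(x)
lhs = np.sum(p * (-np.log(q / p)))   # E_P[phi(f)]
rhs = -np.log(np.sum(p * (q / p)))   # phi(E_P[f]) = -log(1) = 0
print("E[phi(f)] =", lhs, ">= phi(E[f]) =", rhs)
```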
3.2 Han’s Inequality
We shall first prove Han's inequality for entropy, and then using that
result, we shall prove Han's inequality for relative entropy.
By the chain rule for entropy, for any $i$,
$$H(x_1, \ldots, x_n) = H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_i\,|\,x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$
Since we have already seen that $H(X) \geq H(X|Y)$, applying this we have,
$$H(x_1, \ldots, x_n) \leq H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_i\,|\,x_1, \ldots, x_{i-1})$$
Summing over $i = 1, \ldots, n$ and noting that, by the chain rule, $\sum_{i=1}^n H(x_i\,|\,x_1, \ldots, x_{i-1}) = H(x_1, \ldots, x_n)$, we get
$$n\,H(x_1, \ldots, x_n) \leq \sum_{i=1}^n H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_1, \ldots, x_n)$$
Therefore,
$$H(x_1, \ldots, x_n) \leq \frac{1}{n-1}\sum_{i=1}^n H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$
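Han's inequality for entropy can be checked by brute force on a small joint distribution. The following sketch (a randomly generated joint distribution over three binary variables, chosen purely for illustration) compares $H(x_1, \ldots, x_n)$ with $\frac{1}{n-1}\sum_{i=1}^n H(x^{(i)})$:

```python
import numpy as np

rng = np.random.default_rng(1)
joint = rng.random((2, 2, 2))
joint /= joint.sum()                       # joint pmf of (x1, x2, x3)

def entropy(pmf):
    p = pmf.ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

n = joint.ndim
h_full = entropy(joint)
h_leave_one_out = [entropy(joint.sum(axis=i)) for i in range(n)]  # H(x^(i))

print("H(x1,...,xn)             =", h_full)
print("(1/(n-1)) * sum H(x^(i)) =", sum(h_leave_one_out) / (n - 1))
print("Han's inequality holds:", h_full <= sum(h_leave_one_out) / (n - 1))
```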
Now we shall prove Han's inequality for relative entropies. Let
$x_1^n = x_1, x_2, \ldots, x_n$ be discrete random variables from sample space $X$ just
like in the previous case. Now let $P$ and $Q$ be probability distributions on
the product space $X^n$, and let $P$ be a distribution such that
$$\frac{dP(x_1, \ldots, x_n)}{dx_1^n} = \frac{dP_1(x_1)}{dx_1}\frac{dP_2(x_2)}{dx_2}\cdots\frac{dP_n(x_n)}{dx_n}$$
That is, distribution $P$ assumes independence of the variables $(x_1, \ldots, x_n)$, with the probability density function of each variable $x_i$ being $\frac{dP_i(x_i)}{dx_i}$. Let $x^{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$. Now with this setting we shall state and prove Han's relative entropy inequality,
$$D(Q\|P) \geq \frac{1}{n-1}\sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$$
where
$$Q^{(i)}(x^{(i)}) = \int_X \frac{dQ(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$
and
$$P^{(i)}(x^{(i)}) = \int_X \frac{dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$
Proof By definition of relative entropy,
$$D(Q\|P) = \int_{X^n} dQ\,\log\Big(\frac{dQ}{dP}\Big) = \int_{X^n} \frac{dQ}{dx_1^n}\Big(\log\Big(\frac{dQ}{dx_1^n}\Big) - \log\Big(\frac{dP}{dx_1^n}\Big)\Big)\, dx_1^n \qquad (9)$$
In the above equation, consider the term $\int_{X^n} \frac{dQ}{dx_1^n}\log\big(\frac{dP}{dx_1^n}\big)\, dx_1^n$. From our
assumption about $P$ we know that
$$\frac{dP(x_1^n)}{dx_1^n} = \frac{dP_1(x_1)}{dx_1}\frac{dP_2(x_2)}{dx_2}\cdots\frac{dP_n(x_n)}{dx_n}$$
Now $P^{(i)}(x^{(i)}) = \int_X \frac{dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$, therefore,
$$\int_{X^n} \frac{dQ}{dx_1^n}\log\Big(\frac{dP(x_1^n)}{dx_1^n}\Big) dx_1^n = \frac{1}{n}\sum_{i=1}^n\Big(\int_{X^{n-1}} \frac{dQ^{(i)}}{dx^{(i)}}\log\Big(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\Big) dx^{(i)}\Big) + \frac{1}{n}\int_{X^n} \frac{dQ}{dx_1^n}\log\Big(\frac{dP(x_1^n)}{dx_1^n}\Big) dx_1^n$$
Rearranging the terms we get,
$$\int_{X^n} \frac{dQ}{dx_1^n}\log\Big(\frac{dP(x_1^n)}{dx_1^n}\Big) dx_1^n = \frac{1}{n-1}\sum_{i=1}^n\int_{X^{n-1}} \frac{dQ^{(i)}}{dx^{(i)}}\log\Big(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\Big) dx^{(i)}$$
Substituting this back into Equation (9) and applying Han's inequality for entropy to the first term of (9), we get
$$D(Q\|P) \geq \frac{1}{n-1}\sum_{i=1}^n\int_{X^{n-1}} \frac{dQ^{(i)}}{dx^{(i)}}\log\Big(\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}}\Big) dx^{(i)} - \frac{1}{n-1}\sum_{i=1}^n\int_{X^{n-1}} \frac{dQ^{(i)}}{dx^{(i)}}\log\Big(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\Big) dx^{(i)}$$
Thus, finally simplifying, we get the required result as
$$D(Q\|P) \geq \frac{1}{n-1}\sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$$
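The relative-entropy version can be verified the same way when $P$ is a product distribution. The sketch below (an arbitrary small example with $n = 3$ binary coordinates; the marginals of $P$ and the joint $Q$ are made up for illustration) compares $D(Q\|P)$ with $\frac{1}{n-1}\sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$:

```python
import numpy as np

rng = np.random.default_rng(2)

# P is a product of independent binary marginals; Q is an arbitrary joint pmf.
marginals = [np.array([0.3, 0.7]), np.array([0.6, 0.4]), np.array([0.5, 0.5])]
P = np.ones((2, 2, 2))
for i, m in enumerate(marginals):
    shape = [1, 1, 1]
    shape[i] = 2
    P = P * m.reshape(shape)
Q = rng.random((2, 2, 2))
Q /= Q.sum()

def kl(q, p):
    q, p = q.ravel(), p.ravel()
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

n = 3
lhs = kl(Q, P)
rhs = sum(kl(Q.sum(axis=i), P.sum(axis=i)) for i in range(n)) / (n - 1)  # marginalize out x_i
print("D(Q||P) =", lhs, ">=", rhs, ":", lhs >= rhs)
```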
4 Inequalities of Sums of Random Variables
4.1 Hoeffding's Inequality
Hoeffding's inequality bounds the deviation of a sum of independent bounded random variables from its expectation. Let $x_1, \ldots, x_n$ be independent random variables with $x_i \in [p_i, q_i]$. Then
$$P\Big(\sum_{i=1}^n x_i - E\Big(\sum_{i=1}^n x_i\Big) \geq a\Big) \leq e^{-\frac{2a^2}{\sum_{i=1}^n (q_i - p_i)^2}}$$
Proof Applying Theorem 1 with the non-negative function $f(y) = e^{sy}$ (for $s > 0$) to the random variable $x - E(x)$ gives
$$P(x - E(x) \geq a) \leq \frac{E\,e^{s(x - E(x))}}{e^{sa}}$$
Let $S_n = \sum_{i=1}^n x_i$. Therefore we have,
$$P(S_n - E(S_n) \geq a) \leq \frac{E\,e^{s(S_n - E(S_n))}}{e^{sa}} = e^{-sa}\prod_{i=1}^n E\,e^{s(x_i - E(x_i))} \qquad (10)$$
where the equality uses the independence of the $x_i$.
Now consider a single term $E\,e^{sy}$ with $y = x_i - E(x_i)$, where $x_i \in [p, q]$ (dropping the index $i$ for brevity), so that $y$ lies in an interval of length $(q - p)$ with $E\,y = 0$. Writing $\alpha = \frac{E(x_i) - p}{q - p}$ and using the convexity of $z \mapsto e^{sz}$ on this interval,
$$E\,e^{sy} \leq \big(\alpha e^{s(q-p)} + (1-\alpha)\big)\,e^{-s\alpha(q-p)}$$
$$\Rightarrow E\,e^{sy} \leq e^{\log(\alpha e^{s(q-p)} + (1-\alpha)) - s\alpha(q-p)}$$
$$\Rightarrow E\,e^{sy} \leq e^{\psi(u)} \qquad (11)$$
where the function $\psi(u) = \log(\alpha e^u + (1-\alpha)) - u\alpha$ and $u = s(q-p)$.
Now using Taylor's theorem, we have for some $\eta$,
$$\psi(x) = \psi(0) + x\,\psi'(0) + \frac{x^2}{2}\psi''(\eta) \qquad (12)$$
But we have $\psi(0) = 0$ and $\psi'(x) = \frac{\alpha e^x}{\alpha e^x + (1-\alpha)} - \alpha$. Therefore $\psi'(0) = 0$. Now,
$$\psi''(x) = \frac{\alpha(1-\alpha)e^x}{(1 - \alpha + \alpha e^x)^2}$$
Therefore, for any $x$, since $(1 - \alpha + \alpha e^x)^2 \geq 4\alpha(1-\alpha)e^x$,
$$\psi''(x) \leq \frac{\alpha(1-\alpha)e^x}{4\alpha(1-\alpha)e^x} = \frac{1}{4}$$
Therefore from (12) and (11) we have
$$E\,e^{sy} \leq e^{\frac{u^2}{8}} = e^{\frac{s^2(q-p)^2}{8}} \qquad (13)$$
Substituting (13) for each term in (10) gives
$$P(S_n - E(S_n) \geq a) \leq e^{-sa}\, e^{\frac{s^2\sum_{i=1}^n (q_i - p_i)^2}{8}}$$
Now we find the best bound by minimizing the right-hand side of the above bound with respect to $s$. Therefore we have
$$\frac{d}{ds}\Big(e^{-sa}\, e^{\frac{s^2\sum_{i=1}^n (q_i - p_i)^2}{8}}\Big) = e^{-sa}\, e^{\frac{s^2\sum_{i=1}^n (q_i - p_i)^2}{8}}\Big(\frac{2s\sum_{i=1}^n (q_i - p_i)^2}{8} - a\Big) = 0$$
ds 8
Therefore for the best bound we have
4a
s = Pn
i (qi pi )2
and correspondingly we get
Pn 2a2
pi )2
P (Sn E(Sn ) a) e (q
i=1 i (14)
2
Sn E(Sn ) Pn2(aN )
pi )2
) P( a) e (q
i=1 i
n n
E(Sn )
Now En x = Snn denotes the emperical average of x and n = Ex. There-
fore we have 2 2
Pn 2n a
pi )2
P (En (x) E(x) a) e i=1
(qi
(15)
4.2 Bernstein’s Inequality
Hoeffding's inequality does not use any knowledge about the distribution of the
variables. Bernstein's inequality [7] uses the variance of the distribution
to get a tighter bound.
Proof We need to re-estimate the bound starting from (10). Here we assume that the variables are centered, $E(x_i) = 0$, and bounded, $|x_i| \leq \xi$. Let
$$F_i = \sum_{r=2}^{\infty} \frac{s^{r-2}\,E(x_i^r)}{r!\;\sigma_i^2}$$
where $\sigma_i^2 = E x_i^2$. Now $e^x = 1 + x + \sum_{r=2}^{\infty}\frac{x^r}{r!}$. Therefore,
$$E\,e^{sx_i} = 1 + sE x_i + \sum_{r=2}^{\infty}\frac{s^r E(x_i^r)}{r!} = 1 + s^2\sigma_i^2 F_i$$
Consider the term $E x_i^r$. Since the expectation of a function is just the Lebesgue
integral of the function with respect to the probability measure, we have $E x_i^r = \int_P x_i^{r-1} x_i$. Using Schwarz's inequality we get,
$$E x_i^r = \int_P x_i^{r-1} x_i \leq \Big(\int_P |x_i^{r-1}|^2\Big)^{\frac{1}{2}}\Big(\int_P |x_i|^2\Big)^{\frac{1}{2}} = \sigma_i\Big(\int_P |x_i^{r-1}|^2\Big)^{\frac{1}{2}}$$
Proceeding to use Schwarz's inequality recursively $n$ times we get
$$E x_i^r \leq \sigma_i^{1 + \frac{1}{2} + \frac{1}{2^2} + \cdots + \frac{1}{2^{n-1}}}\Big(\int_P |x_i|^{(2^n r - 2^{n+1} + 2)}\Big)^{\frac{1}{2^n}}$$
$$= \sigma_i^{2(1 - \frac{1}{2^n})}\Big(\int_P |x_i|^{(2^n r - 2^{n+1} + 2)}\Big)^{\frac{1}{2^n}}$$
Now we know that $|x_i| \leq \xi$. Therefore
$$\Big(\int_P |x_i|^{(2^n r - 2^{n+1} + 2)}\Big)^{\frac{1}{2^n}} \leq \big(\xi^{(2^n r - 2^{n+1} + 2)}\big)^{\frac{1}{2^n}}$$
Hence we get
$$E x_i^r \leq \sigma_i^{2(1 - \frac{1}{2^n})}\,\xi^{(r - 2 + \frac{2}{2^n})}$$
Taking the limit as $n$ goes to infinity we get
$$E x_i^r \leq \lim_{n\to\infty}\; \sigma_i^{2(1 - \frac{1}{2^n})}\,\xi^{(r - 2 + \frac{2}{2^n})}$$
$$\Rightarrow E x_i^r \leq \sigma_i^2\,\xi^{r-2} \qquad (16)$$
Therefore,
$$F_i = \sum_{r=2}^{\infty} \frac{s^{r-2}\,E(x_i^r)}{r!\;\sigma_i^2} \leq \sum_{r=2}^{\infty} \frac{s^{r-2}\,\sigma_i^2\,\xi^{r-2}}{r!\;\sigma_i^2}$$
Therefore,
$$F_i \leq \frac{1}{s^2\xi^2}\sum_{r=2}^{\infty} \frac{(s\xi)^r}{r!} = \frac{1}{s^2\xi^2}\big(e^{s\xi} - 1 - s\xi\big)$$
Since $1 + z \leq e^z$, applying this together with the bound above on $F_i$ to the expression for $E\,e^{sx_i}$ we get,
$$E\,e^{sx_i} \leq e^{s^2\sigma_i^2 F_i} \leq e^{\frac{s^2\sigma_i^2\,(e^{s\xi} - 1 - s\xi)}{s^2\xi^2}}$$
Now using (10) and the fact that $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ we get,
$$P(S_n \geq a) \leq e^{-sa}\, e^{\frac{s^2\sigma^2 n\,(e^{s\xi} - 1 - s\xi)}{s^2\xi^2}} \qquad (17)$$
Setting the derivative of the exponent with respect to $s$ to zero, the optimal $s$ satisfies
$$\frac{(e^{s\xi} - 1)}{\xi} = \frac{a}{n\sigma^2}$$
Therefore we have
$$s = \frac{1}{\xi}\log\Big(\frac{a\xi}{n\sigma^2} + 1\Big)$$
Using this $s$ in (17) we get
$$P(S_n \geq a) \leq e^{\frac{n\sigma^2}{\xi^2}\big(\frac{a\xi}{n\sigma^2} - \log(\frac{a\xi}{n\sigma^2} + 1)\big) - \frac{a}{\xi}\log(\frac{a\xi}{n\sigma^2} + 1)} = e^{-\frac{n\sigma^2}{\xi^2}\,H\big(\frac{a\xi}{n\sigma^2}\big)}$$
where $H(x) = (1+x)\log(1+x) - x$. This is called Bennett's inequality [?]. We can derive Bernstein's
inequality by further bounding the function $H(x)$. Let the function $G(x) = \frac{3}{2}\cdot\frac{x^2}{x+3}$. We see that $H(0) = G(0) = H'(0) = G'(0) = 0$ and we see that
$$H(x) \geq G(x) \quad \forall x \geq 0$$
$$\Rightarrow P\Big(\sum_{i=1}^n x_i \geq a\Big) \leq e^{-\frac{3a^2}{2(a\xi + 3n\sigma^2)}}$$
Now let $a = n\varepsilon$. Therefore,
$$P\Big(\sum_{i=1}^n x_i \geq n\varepsilon\Big) \leq e^{-\frac{3n^2\varepsilon^2}{2(n\varepsilon\xi + 3n\sigma^2)}}$$
Therefore we get,
$$P\Big(\frac{1}{n}\sum_{i=1}^n x_i \geq \varepsilon\Big) \leq e^{-\frac{n\varepsilon^2}{2\sigma^2 + 2\xi\varepsilon/3}} \qquad (19)$$
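When the variance is small, Bernstein's bound (19) is much tighter than Hoeffding's bound. The sketch below (zero-mean variables taking values $-p$ and $1-p$ with a small $p$; all constants are illustrative assumptions) simply evaluates the two bounds side by side:

```python
import numpy as np

n, eps, p = 1000, 0.05, 0.01
sigma2 = p * (1 - p)   # variance of a centered Bernoulli(p) variable
xi = 1.0               # |x_i| <= 1 since x_i takes values in {-p, 1-p}

bernstein = np.exp(-n * eps**2 / (2 * sigma2 + 2 * xi * eps / 3))  # bound (19)
hoeffding = np.exp(-2 * n * eps**2)                                # bound (15), range length 1

print("Bernstein bound:", bernstein)
print("Hoeffding bound:", hoeffding)
```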
5 Inequalities of Functions of Random Variables
5.1 Efron-Stein's Inequality
Till now we only considered sums of R.V.s. Now we shall consider functions of R.V.s. The so-called Efron-Stein inequality, due to Efron and Stein [9] and generalized by Michael Steele [13], given below is one of the tightest bounds known:
$$\mathrm{Var}(Z) \leq \frac{1}{2}\sum_{i=1}^n E[(Z - Z_i')^2]$$
where $Z = S(x_1, \ldots, x_n)$ for independent random variables $x_1, \ldots, x_n$, $Z_i' = S(x_1, \ldots, x_i', \ldots, x_n)$, and $\{x_1', \ldots, x_n'\}$ is another sample from the
same distribution as that of $\{x_1, \ldots, x_n\}$.
Proof Let
$$E_i Z = E[Z\,|\,x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$$
and let $V_i = E[Z\,|\,x_1, \ldots, x_i] - E[Z\,|\,x_1, \ldots, x_{i-1}]$, so that $Z - E[Z] = \sum_{i=1}^n V_i$.
Now $E[XY] = E[E[XY|Y]] = E[Y\,E[X|Y]]$; therefore, for $j > i$, $E[V_j V_i] = E[V_i\,E[V_j\,|\,x_1, \ldots, x_i]] = 0$, since $E[V_j\,|\,x_1, \ldots, x_i] = 0$. Therefore we have,
$$\mathrm{Var}(Z) = E\Big[\Big(\sum_{i=1}^n V_i\Big)^2\Big] = E\Big[\sum_{i=1}^n V_i^2\Big] = \sum_{i=1}^n E[V_i^2]$$
$$\mathrm{Var}(Z) = \sum_{i=1}^n E\big[(E[Z\,|\,x_1, \ldots, x_i] - E[Z\,|\,x_1, \ldots, x_{i-1}])^2\big]$$
$$= \sum_{i=1}^n E_{x_1^i}\big[(E_{x_{i+1}^n}[Z\,|\,x_1, \ldots, x_i] - E_{x_i^n}[Z\,|\,x_1, \ldots, x_{i-1}])^2\big]$$
$$= \sum_{i=1}^n E_{x_1^i}\big[(E_{x_{i+1}^n}[Z\,|\,x_1, \ldots, x_i] - E_{x_{i+1}^n}[E_{x_i}[Z\,|\,x_1, \ldots, x_{i-1}]])^2\big]$$
Therefore, applying Jensen's inequality to the inner expectation $E_{x_{i+1}^n}$,
$$\mathrm{Var}(Z) \leq \sum_{i=1}^n E[(Z - E_i[Z])^2]$$
where $E_i[Z] = E[Z\,|\,x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$. Now let $x$ and $y$ be two independent samples from the same distribution, so that $Z_i'$ can be formed by replacing the $i$-th coordinate of $x$ with $y_i$. Conditioned on $x^{(i)}$, the variables $Z$ and $Z_i'$ are i.i.d., and for i.i.d. variables $W$, $W'$ we have $E[(W - W')^2] = 2\,\mathrm{Var}(W)$; hence $E[(Z - E_i[Z])^2] = \frac{1}{2}E[(Z - Z_i')^2]$, which gives the stated bound.
Notice that if the function $S$ is the sum of the random variables, the inequality
becomes an equality. Hence the bound is tight. It is often referred to as the
jackknife bound.
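The Efron-Stein bound is easy to estimate by Monte Carlo for a concrete function. The sketch below uses $Z = \max_i x_i$ for independent uniform variables (a standard illustrative choice, not one made in the notes) and compares $\mathrm{Var}(Z)$ with $\frac{1}{2}\sum_i E[(Z - Z_i')^2]$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 10, 100_000

x = rng.random((trials, n))
z = x.max(axis=1)                      # Z = S(x_1,...,x_n) = max_i x_i

# Efron-Stein upper bound: (1/2) * sum_i E[(Z - Z_i')^2]
x_prime = rng.random((trials, n))      # independent copy of the sample
es_bound = 0.0
for i in range(n):
    replaced = x.copy()
    replaced[:, i] = x_prime[:, i]     # replace coordinate i by x_i'
    z_i = replaced.max(axis=1)
    es_bound += 0.5 * np.mean((z - z_i) ** 2)

print("Var(Z)            ~", z.var())
print("Efron-Stein bound ~", es_bound)
```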
An exponential concentration bound for $Z = S(x_1, \ldots, x_n)$ can be obtained whenever the function has bounded differences [10]. That is,
$$\sup_{x_1, \ldots, x_n, x_i'} |S(x_1, x_2, \ldots, x_n) - S(x_1, \ldots, x_i', \ldots, x_n)| \leq \xi_i$$
where $Z_i' = S(x_1, \ldots, x_i', \ldots, x_n)$ and $\{x_1', \ldots, x_n'\}$ is a sample from the same
distribution as $\{x_1, \ldots, x_n\}$.
Now let $V_i = E[Z\,|\,x_1, \ldots, x_i] - E[Z\,|\,x_1, \ldots, x_{i-1}]$ as in the previous proof, so that $Z - E[Z] = \sum_{i=1}^n V_i$, and let $V_i$ be bounded by the interval $[L_i, U_i]$. We know that $|Z - Z_i'| \leq \xi_i$;
hence it follows that $|V_i| \leq \xi_i$ and hence $|U_i - L_i| \leq \xi_i$. Using (13) on $E[e^{sV_i}]$
we get,
$$E[e^{sV_i}] \leq e^{\frac{s^2(U_i - L_i)^2}{8}} \leq e^{\frac{s^2\xi_i^2}{8}}$$
Therefore,
$$P(Z - E[Z] \geq a) \leq e^{-sa}\,E\Big[e^{s\sum_{i=1}^n V_i}\Big] \leq e^{-sa}\prod_{i=1}^n e^{\frac{s^2\xi_i^2}{8}} = e^{-sa + \frac{s^2\sum_{i=1}^n \xi_i^2}{8}}$$
where the second inequality follows by applying the bound on $E[e^{sV_i}]$ conditionally on $x_1, \ldots, x_{i-1}$, one term at a time. Optimizing over $s$ as in the proof of Hoeffding's inequality,
$$\Rightarrow P(Z - E[Z] \geq a) \leq e^{-\frac{2a^2}{\sum_{i=1}^n \xi_i^2}}$$
Hence we get,
$$P(|Z - E[Z]| \geq a) \leq 2e^{-\frac{2a^2}{\sum_{i=1}^n \xi_i^2}} \qquad (22)$$
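Inequality (22) can again be checked by simulation. In the sketch below, $Z$ is the empirical mean of $n$ variables in $[0, 1]$, so each $\xi_i = 1/n$ (this setup and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, a, trials = 100, 0.1, 200_000

x = rng.random((trials, n))
z = x.mean(axis=1)                                      # Z = empirical mean, so xi_i = 1/n
empirical = np.mean(np.abs(z - 0.5) >= a)               # E[Z] = 0.5 for uniform variables
bound = 2 * np.exp(-2 * a**2 / (n * (1.0 / n) ** 2))    # 2 exp(-2 a^2 / sum xi_i^2)

print(f"P(|Z - E[Z]| >= {a}) ~ {empirical:.5f}  <=  {bound:.5f}")
```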
Theorem 10 If the function $\phi(x) = e^x - x - 1$, then
$$sE\{Ze^{sZ}\} - E\{e^{sZ}\}\log E\{e^{sZ}\} \leq \sum_{i=1}^n E\{e^{sZ}\phi(-s(Z - Z_i'))\} \qquad (24)$$
Proof Let measure $P$ denote the distribution of $(x_1, \ldots, x_n)$ and let distribution
$Q$ be given by its density with respect to $P$,
$$Y = \frac{dQ}{dP} = \frac{e^{sZ}}{E\{e^{sZ}\}}$$
Now we have already shown that $D(Q\|P) = E\{Y\log Y\}$ and, further,
$E\{E_i\{Y\log Y\}\} = E\{Y\log Y\}$; therefore, $D(Q\|P) = E\{E_i\{Y\log Y\}\}$. Now
by definition,
$$\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}} = \int_X \frac{dQ(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i = \int_X \frac{Y(x_1, \ldots, x_n)\,dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$
Due to the independence assumption on the sample, we can rewrite the above
as,
$$\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}} = \frac{dP(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)}{dx^{(i)}}\int_X Y(x_1, \ldots, x_n)\, dP_i(x_i)$$
Using the above with Equation (25) and substituting for $Y$ we get the
first required result as,
$$sE\{Ze^{sZ}\} - E\{e^{sZ}\}\log E\{e^{sZ}\} \leq \sum_{i=1}^n E\{e^{sZ}\phi(-s(Z - Z_i'))\}$$
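Since (24) holds for every $s$, it can be verified by direct enumeration on a tiny discrete example. The sketch below (two Bernoulli variables and an arbitrary function $f$, all chosen only for illustration) computes both sides exactly:

```python
import numpy as np
from itertools import product

def f(x1, x2):
    return max(x1, x2) + 0.5 * x1       # arbitrary function of two bits (illustrative)

def phi(u):
    return np.exp(u) - u - 1

p = 0.3                                  # Bernoulli parameter (arbitrary)
prob = {0: 1 - p, 1: p}
s = 0.7

# Left-hand side: s E[Z e^{sZ}] - E[e^{sZ}] log E[e^{sZ}]
EZe = sum(prob[a] * prob[b] * f(a, b) * np.exp(s * f(a, b)) for a, b in product((0, 1), repeat=2))
Ee = sum(prob[a] * prob[b] * np.exp(s * f(a, b)) for a, b in product((0, 1), repeat=2))
lhs = s * EZe - Ee * np.log(Ee)

# Right-hand side: sum_i E[e^{sZ} phi(-s (Z - Z_i'))], expectation also over the copy x_i'
rhs = 0.0
for a, b in product((0, 1), repeat=2):
    for c in (0, 1):                     # value of the independent copy
        w = prob[a] * prob[b] * prob[c]
        z = f(a, b)
        rhs += w * np.exp(s * z) * phi(-s * (z - f(c, b)))   # replace x_1 by the copy
        rhs += w * np.exp(s * z) * phi(-s * (z - f(a, c)))   # replace x_2 by the copy

print("LHS =", lhs, "<= RHS =", rhs, ":", lhs <= rhs)
```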
6 Symmetrization Lemma
The symmetrization lemma is probably one of the easier bounds we review
in these notes. However, it is extremely powerful, since it allows us to
bound the difference between the empirical mean of a function and its expected
value using the difference between the empirical means of the function
for two independent samples of the same size as the original sample. Note
that in most of the literature, the symmetrization lemma is stated and proved only
for zero-one functions like the loss function or the actual classification function.
Here we derive a more generalized version, where we prove the lemma
for bounding the difference between the expectation of any measurable function
with bounded variance and its empirical mean.
Lemma Let $f$ be a measurable function with $\mathrm{Var}\{f\} \leq C$, and let the sample size satisfy $n > \frac{8C}{a^2}$. Then
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) \geq \frac{1}{2}\,P\big(|E\{f\} - \hat{E}_n\{f\}| > a\big) \qquad (28)$$
where $\hat{E}_n'\{f\}$ and $\hat{E}_n''\{f\}$ stand for the empirical means of the function $f(x)$
estimated using the samples $(x_1', x_2', \ldots, x_n')$ and $(x_1'', x_2'', \ldots, x_n'')$ respectively.
Proof By definition,
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) = \int_{X^n}\int_{X^n} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP'$$
Now since the set $Y = \{(x_1, x_2, \ldots, x_n) : |E\{f\} - \hat{E}_n\{f\}| > a\}$ is a subset of
$X^n$ and the term inside the integral is always non-negative,
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) \geq \int_{Y}\int_{X^n} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP'$$
Now let $Z = \{(x_1, x_2, \ldots, x_n) : |\hat{E}_n\{f\} - E\{f(x)\}| \leq \frac{a}{2}\}$. Clearly, for any
sample $(x_1, x_2, \ldots, x_{2n})$, if $(x_1, x_2, \ldots, x_n) \in Y$ and $(x_{n+1}, x_{n+2}, \ldots, x_{2n}) \in Z$,
then $(x_1, x_2, \ldots, x_{2n})$ is a sample such that $|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}$.
Therefore, coming back to the integral, since $Z \subset X^n$,
$$\int_Y\int_{X^n} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP' \geq \int_Y\int_{Z} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP'$$
Now since the integral is over the sets $Y$ and $Z$, on which the indicator equals $1$ as we saw earlier,
$$\int_Y\int_{Z} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP' = \int_Y\int_{Z} 1\; dP''\, dP' = \int_Y P\Big(|\hat{E}_n''\{f\} - E\{f\}| \leq \frac{a}{2}\Big)\, dP'$$
Therefore,
$$\int_Y\int_{X^n} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP' \geq \int_Y P\Big(|\hat{E}_n''\{f\} - E\{f\}| \leq \frac{a}{2}\Big)\, dP'$$
Now,
$$P\Big(|\hat{E}_n''\{f\} - E\{f\}| \leq \frac{a}{2}\Big) = 1 - P\Big(|\hat{E}_n''\{f\} - E\{f\}| > \frac{a}{2}\Big)$$
Using Equation (2) (often called Chebyshev's inequality) we get,
$$P\Big(|\hat{E}_n''\{f\} - E\{f\}| > \frac{a}{2}\Big) \leq \frac{4\,\mathrm{Var}\{f\}}{na^2} \leq \frac{4C}{na^2}$$
Now if we choose $n$ such that $n > \frac{8C}{a^2}$, as per our assumption, then
$$P\Big(|\hat{E}_n''\{f\} - E\{f\}| > \frac{a}{2}\Big) \leq \frac{1}{2}$$
Therefore,
$$P\Big(|\hat{E}_n''\{f\} - E\{f\}| \leq \frac{a}{2}\Big) \geq 1 - \frac{1}{2} = \frac{1}{2}$$
Putting this back in the integral we get,
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) \geq \int_Y \frac{1}{2}\; dP' = \frac{1}{2}\int_Y dP' = \frac{1}{2}\,P\big(|E\{f\} - \hat{E}_n\{f\}| > a\big)$$
Therefore we get the final result as,
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) \geq \frac{1}{2}\,P\big(|E\{f\} - \hat{E}_n\{f\}| > a\big)$$
Note that if we make the function $f$ a zero-one function, then the
maximum possible variance is $\frac{1}{4}$. Hence if we set $C$ to $\frac{1}{4}$, then the condition
under which the inequality holds becomes $n > \frac{2}{a^2}$. Further note that if
we choose the zero-one function $f(x)$ such that it is $1$ when, say, $x$ takes a
particular value and $0$ if not, then the result basically bounds the absolute
difference between the probability of $x$ taking that particular value and
the frequency estimate of $x$ taking that value in a sample of size $n$, using
the difference in the frequencies of occurrence of the value in two independent
samples of size $n$.
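The symmetrization lemma can be illustrated numerically with exactly this kind of zero-one function. The sketch below (Bernoulli data; $n$, $a$ and the Bernoulli parameter are arbitrary choices satisfying $n > 2/a^2$) estimates both probabilities in (28) by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(6)
n, a, trials, p = 400, 0.1, 100_000, 0.3   # f is the 0/1 indicator of the event {x = 1}

x1 = rng.random((trials, n)) < p           # first sample of size n
x2 = rng.random((trials, n)) < p           # second, independent sample of size n

true_dev = np.mean(np.abs(x1.mean(axis=1) - p) > a)                   # P(|E{f} - E_n{f}| > a)
sym_dev = np.mean(np.abs(x1.mean(axis=1) - x2.mean(axis=1)) >= a / 2) # P(|E_n'{f} - E_n''{f}| >= a/2)

print("P(|E{f} - E_n{f}| > a)         ~", true_dev)
print("P(|E_n'{f} - E_n''{f}| >= a/2) ~", sym_dev)
print("Lemma (first <= 2 * second, since n > 2/a^2):", true_dev <= 2 * sym_dev)
```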
References
[1] V. N. Vapnik, A. Ya. Chervonenkis, "On the Uniform Convergence of
Relative Frequencies of Events to Their Probabilities", Theory of Probability
and its Applications, 16(2):264-281, 1971.