Concentration
Karthik Sridharan
Abstract
These notes are meant to be a review of some basic inequalities and
bounds on random variables. A basic understanding of probability
theory and set algebra might be required of the reader. This document
aims to provide clear and complete proofs of some inequalities. For
readers familiar with the topics, many of the steps might seem trivial.
Nonetheless, they are provided to simplify the proofs for readers new
to the topic. These notes also provide, to the best of my knowledge,
the most generalized statement and proof of the symmetrization lemma. I
also provide the less famous but generalized proofs of Jensen's inequality
and the logarithmic Sobolev inequality. Refer to [2] for a more detailed review
of many of these inequalities with examples demonstrating their uses.
1 Preliminary
Throughout these notes we shall consider a probability space $(\Omega, \mathcal{E}, P)$ where
$\Omega$ is the sample space, $\mathcal{E}$ is the event class, which is a $\sigma$-algebra on $\Omega$, and $P$
is a probability measure. Further, we shall assume that there exists a Borel
measurable function mapping every point $\omega \in \Omega$ uniquely to a real number,
called a random variable. We shall call the space of random variables $X$
(note that $X \subseteq \mathbb{R}$). Further, we shall assume that all the functions and sets
defined in these notes are measurable under the probability measure $P$.
2 Chebyshev's Inequality
Let us start these notes by proving what is referred to as Chebyshev's Inequality
in [3]. Note that, often, by Chebyshev's inequality one means an inequality derived from
the theorem proved below. However, [3] refers to this inequality as
Tchebycheff's Inequality and the same convention is followed in these notes.
Theorem 1 For some $a \in X$ where $X \subseteq \mathbb{R}$, let $f$ be a non-negative function
such that $f(x) \geq b$ for all $x \geq a$, where $b \in Y$ and $Y \subseteq \mathbb{R}$. Then the following
inequality holds,
$$P(x \geq a) \leq \frac{E\{f(x)\}}{b}$$
Proof Let the set $X_1 = \{x : x \geq a \text{ and } x \in X\}$; therefore we have
$$X_1 \subseteq X$$
Since $f$ is a non-negative function, taking the Lebesgue integral of the function over the sets $X_1$ and $X$ we have,
$$\int_X f\,dP \geq \int_{X_1} f\,dP \geq b\int_{X_1} dP$$
where $\int_{X_1} dP = P(X_1)$. However, the Lebesgue integral of a function over a probability measure is its expectation. Hence we have,
$$E\{f(x : x \in X)\} \geq b\,P(X_1)$$
$$\Rightarrow E\{f(x)\} \geq b\,P(x \geq a)$$
and hence
$$P(x \geq a) \leq \frac{E\{f(x)\}}{b} \qquad (1)$$
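As a quick numerical sanity check of Theorem 1 (this snippet is only an illustrative sketch; the exponential distribution, sample size and threshold are arbitrary choices, not part of the notes), one can estimate $P(x \geq a)$ by Monte Carlo and compare it with $E\{f(x)\}/b$ for $f(x) = x$ and $b = a$:

```python
import numpy as np

# Illustrative check of Theorem 1 with f(x) = x and b = a (a Markov-type bound).
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative random variable, E[x] = 1
a = 3.0

empirical = np.mean(x >= a)    # Monte Carlo estimate of P(x >= a)
bound = x.mean() / a           # E{f(x)}/b with f(x) = x, b = a

print(f"P(x >= {a}) ~ {empirical:.4f}  <=  bound {bound:.4f}")
```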
3 Information Theoretic Bounds
3.1 Jensen’s Inequality
Here we shall state and prove a generalized, measure-theoretic form of
Jensen's inequality. In probability theory, a more specific form
of Jensen's inequality is better known. But before that, we shall first define a convex function.
Proof From our assumptions, the range of $f$ is $(a, b)$, which is the interval
on which $\phi(x)$ is convex. Hence, consider the number
$$x_0 = \frac{\int_A f\,d\mu}{\int_A d\mu}$$
Clearly it is within the interval $(a, b)$. Further, from Equation (4) we have, for almost every $x$,
$$\phi(f(x)) \geq \phi(x_0) + \beta\,(f(x) - x_0)$$
where $\beta$ is the slope of a supporting line of $\phi$ at $x_0$. Integrating both sides over $A$ with respect to $\mu$ and dividing by $\int_A d\mu$ gives
$$\frac{\int_A \phi(f)\,d\mu}{\int_A d\mu} \geq \phi\Big(\frac{\int_A f\,d\mu}{\int_A d\mu}\Big)$$
Now note that if $\mu$ is a probability measure then $\int_A d\mu = 1$, and since the
expected value is simply the Lebesgue integral of the function with respect to the probability
measure, we have
$$E[\phi(f(x))] \geq \phi(E[f(x)]) \qquad (6)$$
for any function $\phi$ convex on the range of the function $f(x)$.
Now a function $\phi(x)$ is convex if $\phi'(x)$ exists and is monotonically increasing, or if its second derivative exists and is non-negative. Therefore we
can conclude that the function $-\log x$ is a convex function. Therefore, by
Jensen's inequality, we have
$$E[-\log f(x)] \geq -\log(E[f(x)])$$
Now if we take the function $f(x)$ to be the probability, we get the result
that the entropy $H(P)$ is always greater than or equal to $0$. If we make $f(x)$ the
ratio of the two probability measures $dQ$ and $dP$, we get the result that the relative
entropy or KL divergence of two distributions is always non-negative. That
is,
$$D(P\|Q) = E_P\Big[-\log\Big(\frac{dQ(x)}{dP(x)}\Big)\Big] \geq -\log\Big(E_P\Big[\frac{dQ(x)}{dP(x)}\Big]\Big) = -\log\Big(\int dQ(x)\Big) = 0$$
Therefore,
$$D(P\|Q) \geq 0$$
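Both consequences above are easy to verify numerically. The sketch below (the two discrete distributions are arbitrary examples, not taken from the notes) computes $D(P\|Q)$ and checks Jensen's inequality for $\phi(x) = -\log x$ with $f(x) = \frac{dQ(x)}{dP(x)}$:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # distribution P (arbitrary example)
q = np.array([0.2, 0.5, 0.3])   # distribution Q (arbitrary example)

# KL divergence D(P||Q) = E_P[-log(q/p)]
kl = np.sum(p * np.log(p / q))
print("D(P||Q) =", kl, ">= 0:", kl >= 0)

# Jensen's inequality with phi(x) = -log x and f(x) = q(x)/p(x)
lhs = np.sum(p * (-np.log(q / p)))   # E_P[phi(f)]
rhs = -np.log(np.sum(p * (q / p)))   # phi(E_P[f]) = -log(1) = 0
print("E[phi(f)] =", lhs, ">= phi(E[f]) =", rhs)
```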
3.2 Han’s Inequality
We shall first prove Han's inequality for entropy, and then using that
result, we shall prove Han's inequality for relative entropy.
By the chain rule for entropy, for any $i$,
$$H(x_1, \ldots, x_n) = H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_i\,|\,x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$
Since we have already seen that $H(X) \geq H(X|Y)$, applying this we have,
$$H(x_1, \ldots, x_n) \leq H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_i\,|\,x_1, \ldots, x_{i-1})$$
Summing over $i = 1, \ldots, n$ and noting that, by the chain rule, $\sum_{i=1}^n H(x_i\,|\,x_1, \ldots, x_{i-1}) = H(x_1, \ldots, x_n)$, we get
$$n\,H(x_1, \ldots, x_n) \leq \sum_{i=1}^n H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_1, \ldots, x_n)$$
Therefore,
$$H(x_1, \ldots, x_n) \leq \frac{1}{n-1}\sum_{i=1}^n H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$
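Han's inequality for entropy can be checked by brute force on a small joint distribution. The following sketch (a randomly generated joint distribution over three binary variables, chosen purely for illustration) compares $H(x_1, \ldots, x_n)$ with $\frac{1}{n-1}\sum_{i=1}^n H(x^{(i)})$:

```python
import numpy as np

rng = np.random.default_rng(1)
joint = rng.random((2, 2, 2))
joint /= joint.sum()                       # joint pmf of (x1, x2, x3)

def entropy(pmf):
    p = pmf.ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

n = joint.ndim
h_full = entropy(joint)
h_leave_one_out = [entropy(joint.sum(axis=i)) for i in range(n)]  # H(x^(i))

print("H(x1,...,xn)             =", h_full)
print("(1/(n-1)) * sum H(x^(i)) =", sum(h_leave_one_out) / (n - 1))
print("Han's inequality holds:", h_full <= sum(h_leave_one_out) / (n - 1))
```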
Now we shall prove Han's inequality for relative entropies. Let
$x_1^n = x_1, x_2, \ldots, x_n$ be discrete random variables from sample space $X$ just
like in the previous case. Now let $P$ and $Q$ be probability distributions on
the product space $X^n$, and let $P$ be a distribution such that
$$\frac{dP(x_1, \ldots, x_n)}{dx_1^n} = \frac{dP_1(x_1)}{dx_1}\frac{dP_2(x_2)}{dx_2}\cdots\frac{dP_n(x_n)}{dx_n}$$
That is, distribution $P$ assumes independence of the variables $(x_1, \ldots, x_n)$, with the probability density function of each variable $x_i$ being $\frac{dP_i(x_i)}{dx_i}$. Let $x^{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$. Now with this setting we shall state and prove Han's relative entropy inequality,
$$D(Q\|P) \geq \frac{1}{n-1}\sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$$
where
$$Q^{(i)}(x^{(i)}) = \int_X \frac{dQ(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$
and
$$P^{(i)}(x^{(i)}) = \int_X \frac{dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$
Proof By definition of relative entropy,
$$D(Q\|P) = \int_{X^n} dQ\,\log\Big(\frac{dQ}{dP}\Big) = \int_{X^n} \frac{dQ}{dx_1^n}\Big(\log\Big(\frac{dQ}{dx_1^n}\Big) - \log\Big(\frac{dP}{dx_1^n}\Big)\Big)\, dx_1^n \qquad (9)$$
In the above equation, consider the term $\int_{X^n} \frac{dQ}{dx_1^n}\log\big(\frac{dP}{dx_1^n}\big)\, dx_1^n$. From our
assumption about $P$ we know that
$$\frac{dP(x_1^n)}{dx_1^n} = \frac{dP_1(x_1)}{dx_1}\frac{dP_2(x_2)}{dx_2}\cdots\frac{dP_n(x_n)}{dx_n}$$
Now $P^{(i)}(x^{(i)}) = \int_X \frac{dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$, therefore,
$$\int_{X^n} \frac{dQ}{dx_1^n}\log\Big(\frac{dP(x_1^n)}{dx_1^n}\Big) dx_1^n = \frac{1}{n}\sum_{i=1}^n\Big(\int_{X^{n-1}} \frac{dQ^{(i)}}{dx^{(i)}}\log\Big(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\Big) dx^{(i)}\Big) + \frac{1}{n}\int_{X^n} \frac{dQ}{dx_1^n}\log\Big(\frac{dP(x_1^n)}{dx_1^n}\Big) dx_1^n$$
Rearranging the terms we get,
$$\int_{X^n} \frac{dQ}{dx_1^n}\log\Big(\frac{dP(x_1^n)}{dx_1^n}\Big) dx_1^n = \frac{1}{n-1}\sum_{i=1}^n\int_{X^{n-1}} \frac{dQ^{(i)}}{dx^{(i)}}\log\Big(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\Big) dx^{(i)}$$
Substituting this back into Equation (9) and applying Han's inequality for entropy to the first term of (9), we get
$$D(Q\|P) \geq \frac{1}{n-1}\sum_{i=1}^n\int_{X^{n-1}} \frac{dQ^{(i)}}{dx^{(i)}}\log\Big(\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}}\Big) dx^{(i)} - \frac{1}{n-1}\sum_{i=1}^n\int_{X^{n-1}} \frac{dQ^{(i)}}{dx^{(i)}}\log\Big(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\Big) dx^{(i)}$$
Thus, finally simplifying, we get the required result as
$$D(Q\|P) \geq \frac{1}{n-1}\sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$$
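The relative-entropy version can be verified the same way when $P$ is a product distribution. The sketch below (an arbitrary small example with $n = 3$ binary coordinates; the marginals of $P$ and the joint $Q$ are made up for illustration) compares $D(Q\|P)$ with $\frac{1}{n-1}\sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$:

```python
import numpy as np

rng = np.random.default_rng(2)

# P is a product of independent binary marginals; Q is an arbitrary joint pmf.
marginals = [np.array([0.3, 0.7]), np.array([0.6, 0.4]), np.array([0.5, 0.5])]
P = np.ones((2, 2, 2))
for i, m in enumerate(marginals):
    shape = [1, 1, 1]
    shape[i] = 2
    P = P * m.reshape(shape)
Q = rng.random((2, 2, 2))
Q /= Q.sum()

def kl(q, p):
    q, p = q.ravel(), p.ravel()
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

n = 3
lhs = kl(Q, P)
rhs = sum(kl(Q.sum(axis=i), P.sum(axis=i)) for i in range(n)) / (n - 1)  # marginalize out x_i
print("D(Q||P) =", lhs, ">=", rhs, ":", lhs >= rhs)
```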
4 Inequalities of Sums of Random Variables
4.1 Hoeffding's Inequality
Hoeffding's inequality bounds the deviation of a sum of independent bounded random variables from its expectation. Let $x_1, \ldots, x_n$ be independent random variables with $x_i \in [p_i, q_i]$. Then
$$P\Big(\sum_{i=1}^n x_i - E\Big(\sum_{i=1}^n x_i\Big) \geq a\Big) \leq e^{-\frac{2a^2}{\sum_{i=1}^n (q_i - p_i)^2}}$$
Proof Applying Theorem 1 with the non-negative function $f(y) = e^{sy}$ (for $s > 0$) to the random variable $x - E(x)$ gives
$$P(x - E(x) \geq a) \leq \frac{E\,e^{s(x - E(x))}}{e^{sa}}$$
Let $S_n = \sum_{i=1}^n x_i$. Therefore we have,
$$P(S_n - E(S_n) \geq a) \leq \frac{E\,e^{s(S_n - E(S_n))}}{e^{sa}} = e^{-sa}\prod_{i=1}^n E\,e^{s(x_i - E(x_i))} \qquad (10)$$
where the equality uses the independence of the $x_i$.
Now consider a single term $E\,e^{sy}$ with $y = x_i - E(x_i)$, where $x_i \in [p, q]$ (dropping the index $i$ for brevity), so that $y$ lies in an interval of length $(q - p)$ with $E\,y = 0$. Writing $\alpha = \frac{E(x_i) - p}{q - p}$ and using the convexity of $z \mapsto e^{sz}$ on this interval,
$$E\,e^{sy} \leq \big(\alpha e^{s(q-p)} + (1-\alpha)\big)\,e^{-s\alpha(q-p)}$$
$$\Rightarrow E\,e^{sy} \leq e^{\log(\alpha e^{s(q-p)} + (1-\alpha)) - s\alpha(q-p)}$$
$$\Rightarrow E\,e^{sy} \leq e^{\psi(u)} \qquad (11)$$
where the function $\psi(u) = \log(\alpha e^u + (1-\alpha)) - u\alpha$ and $u = s(q-p)$.
Now using Taylor's theorem, we have for some $\eta$,
$$\psi(x) = \psi(0) + x\,\psi'(0) + \frac{x^2}{2}\psi''(\eta) \qquad (12)$$
But we have $\psi(0) = 0$ and $\psi'(x) = \frac{\alpha e^x}{\alpha e^x + (1-\alpha)} - \alpha$. Therefore $\psi'(0) = 0$. Now,
$$\psi''(x) = \frac{\alpha(1-\alpha)e^x}{(1 - \alpha + \alpha e^x)^2}$$
Therefore, for any $x$, since $(1 - \alpha + \alpha e^x)^2 \geq 4\alpha(1-\alpha)e^x$,
$$\psi''(x) \leq \frac{\alpha(1-\alpha)e^x}{4\alpha(1-\alpha)e^x} = \frac{1}{4}$$
Therefore from (12) and (11) we have
$$E\,e^{sy} \leq e^{\frac{u^2}{8}} = e^{\frac{s^2(q-p)^2}{8}} \qquad (13)$$
Substituting (13) for each term in (10) gives
$$P(S_n - E(S_n) \geq a) \leq e^{-sa}\, e^{\frac{s^2\sum_{i=1}^n (q_i - p_i)^2}{8}}$$
Now we find the best bound by minimizing the right-hand side of the above bound with respect to $s$. Therefore we have
$$\frac{d}{ds}\Big(e^{-sa}\, e^{\frac{s^2\sum_{i=1}^n (q_i - p_i)^2}{8}}\Big) = e^{-sa}\, e^{\frac{s^2\sum_{i=1}^n (q_i - p_i)^2}{8}}\Big(\frac{2s\sum_{i=1}^n (q_i - p_i)^2}{8} - a\Big) = 0$$
ds 8
Therefore for the best bound we have
4a
s = Pn
i (qi pi )2
and correspondingly we get
Pn 2a2
pi )2
P (Sn E(Sn ) a) e (q
i=1 i (14)
2
Sn E(Sn ) Pn2(aN )
pi )2
) P( a) e (q
i=1 i
n n
E(Sn )
Now En x = Snn denotes the emperical average of x and n = Ex. There-
fore we have 2 2
Pn 2n a
pi )2
P (En (x) E(x) a) e i=1
(qi
(15)
4.2 Bernstein’s Inequality
Hoeffding's inequality does not use any knowledge about the distribution of the
variables. Bernstein's inequality [7] uses the variance of the distribution
to get a tighter bound.
Proof We need to re-estimate the bound starting from (10). Here we assume that the variables are centered, $E(x_i) = 0$, and bounded, $|x_i| \leq \xi$. Let
$$F_i = \sum_{r=2}^{\infty} \frac{s^{r-2}\,E(x_i^r)}{r!\;\sigma_i^2}$$
where $\sigma_i^2 = E x_i^2$. Now $e^x = 1 + x + \sum_{r=2}^{\infty}\frac{x^r}{r!}$. Therefore,
$$E\,e^{sx_i} = 1 + sE x_i + \sum_{r=2}^{\infty}\frac{s^r E(x_i^r)}{r!} = 1 + s^2\sigma_i^2 F_i$$
Consider the term $E x_i^r$. Since the expectation of a function is just the Lebesgue
integral of the function with respect to the probability measure, we have $E x_i^r = \int_P x_i^{r-1} x_i$. Using Schwarz's inequality we get,
$$E x_i^r = \int_P x_i^{r-1} x_i \leq \Big(\int_P |x_i^{r-1}|^2\Big)^{\frac{1}{2}}\Big(\int_P |x_i|^2\Big)^{\frac{1}{2}} = \sigma_i\Big(\int_P |x_i^{r-1}|^2\Big)^{\frac{1}{2}}$$
Proceeding to use Schwarz's inequality recursively $n$ times we get
$$E x_i^r \leq \sigma_i^{1 + \frac{1}{2} + \frac{1}{2^2} + \cdots + \frac{1}{2^{n-1}}}\Big(\int_P |x_i|^{(2^n r - 2^{n+1} + 2)}\Big)^{\frac{1}{2^n}}$$
$$= \sigma_i^{2(1 - \frac{1}{2^n})}\Big(\int_P |x_i|^{(2^n r - 2^{n+1} + 2)}\Big)^{\frac{1}{2^n}}$$
Now we know that $|x_i| \leq \xi$. Therefore
$$\Big(\int_P |x_i|^{(2^n r - 2^{n+1} + 2)}\Big)^{\frac{1}{2^n}} \leq \big(\xi^{(2^n r - 2^{n+1} + 2)}\big)^{\frac{1}{2^n}}$$
Hence we get
$$E x_i^r \leq \sigma_i^{2(1 - \frac{1}{2^n})}\,\xi^{(r - 2 + \frac{2}{2^n})}$$
Taking the limit as $n$ goes to infinity we get
$$E x_i^r \leq \lim_{n\to\infty}\; \sigma_i^{2(1 - \frac{1}{2^n})}\,\xi^{(r - 2 + \frac{2}{2^n})}$$
$$\Rightarrow E x_i^r \leq \sigma_i^2\,\xi^{r-2} \qquad (16)$$
Therefore,
$$F_i = \sum_{r=2}^{\infty} \frac{s^{r-2}\,E(x_i^r)}{r!\;\sigma_i^2} \leq \sum_{r=2}^{\infty} \frac{s^{r-2}\,\sigma_i^2\,\xi^{r-2}}{r!\;\sigma_i^2}$$
Therefore,
$$F_i \leq \frac{1}{s^2\xi^2}\sum_{r=2}^{\infty} \frac{(s\xi)^r}{r!} = \frac{1}{s^2\xi^2}\big(e^{s\xi} - 1 - s\xi\big)$$
Since $1 + z \leq e^z$, applying this together with the bound above on $F_i$ to the expression for $E\,e^{sx_i}$ we get,
$$E\,e^{sx_i} \leq e^{s^2\sigma_i^2 F_i} \leq e^{\frac{s^2\sigma_i^2\,(e^{s\xi} - 1 - s\xi)}{s^2\xi^2}}$$
Now using (10) and the fact that $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ we get,
$$P(S_n \geq a) \leq e^{-sa}\, e^{\frac{s^2\sigma^2 n\,(e^{s\xi} - 1 - s\xi)}{s^2\xi^2}} \qquad (17)$$
Setting the derivative of the exponent with respect to $s$ to zero, the optimal $s$ satisfies
$$\frac{(e^{s\xi} - 1)}{\xi} = \frac{a}{n\sigma^2}$$
Therefore we have
$$s = \frac{1}{\xi}\log\Big(\frac{a\xi}{n\sigma^2} + 1\Big)$$
Using this $s$ in (17) we get
$$P(S_n \geq a) \leq e^{\frac{n\sigma^2}{\xi^2}\big(\frac{a\xi}{n\sigma^2} - \log(\frac{a\xi}{n\sigma^2} + 1)\big) - \frac{a}{\xi}\log(\frac{a\xi}{n\sigma^2} + 1)} = e^{-\frac{n\sigma^2}{\xi^2}\,H\big(\frac{a\xi}{n\sigma^2}\big)}$$
where $H(x) = (1+x)\log(1+x) - x$. This is called Bennett's inequality [?]. We can derive Bernstein's
inequality by further bounding the function $H(x)$. Let the function $G(x) = \frac{3}{2}\cdot\frac{x^2}{x+3}$. We see that $H(0) = G(0) = H'(0) = G'(0) = 0$ and we see that
$$H(x) \geq G(x) \quad \forall x \geq 0$$
$$\Rightarrow P\Big(\sum_{i=1}^n x_i \geq a\Big) \leq e^{-\frac{3a^2}{2(a\xi + 3n\sigma^2)}}$$
Now let $a = n\varepsilon$. Therefore,
$$P\Big(\sum_{i=1}^n x_i \geq n\varepsilon\Big) \leq e^{-\frac{3n^2\varepsilon^2}{2(n\varepsilon\xi + 3n\sigma^2)}}$$
Therefore we get,
$$P\Big(\frac{1}{n}\sum_{i=1}^n x_i \geq \varepsilon\Big) \leq e^{-\frac{n\varepsilon^2}{2\sigma^2 + 2\xi\varepsilon/3}} \qquad (19)$$
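When the variance is small, Bernstein's bound (19) is much tighter than Hoeffding's bound. The sketch below (zero-mean variables taking values $-p$ and $1-p$ with a small $p$; all constants are illustrative assumptions) simply evaluates the two bounds side by side:

```python
import numpy as np

n, eps, p = 1000, 0.05, 0.01
sigma2 = p * (1 - p)   # variance of a centered Bernoulli(p) variable
xi = 1.0               # |x_i| <= 1 since x_i takes values in {-p, 1-p}

bernstein = np.exp(-n * eps**2 / (2 * sigma2 + 2 * xi * eps / 3))  # bound (19)
hoeffding = np.exp(-2 * n * eps**2)                                # bound (15), range length 1

print("Bernstein bound:", bernstein)
print("Hoeffding bound:", hoeffding)
```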
5 Inequalities of Functions of Random Variables
5.1 Efron-Stein's Inequality
Till now we only considered sums of R.V.s. Now we shall consider functions of R.V.s. The so-called Efron-Stein inequality, due to Efron and Stein [9] and generalized by Michael Steele [13], given below is one of the tightest bounds known:
$$\mathrm{Var}(Z) \leq \frac{1}{2}\sum_{i=1}^n E[(Z - Z_i')^2]$$
where $Z = S(x_1, \ldots, x_n)$ for independent random variables $x_1, \ldots, x_n$, $Z_i' = S(x_1, \ldots, x_i', \ldots, x_n)$, and $\{x_1', \ldots, x_n'\}$ is another sample from the
same distribution as that of $\{x_1, \ldots, x_n\}$.
Proof Let
$$E_i Z = E[Z\,|\,x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$$
and let $V_i = E[Z\,|\,x_1, \ldots, x_i] - E[Z\,|\,x_1, \ldots, x_{i-1}]$, so that $Z - E[Z] = \sum_{i=1}^n V_i$.
Now $E[XY] = E[E[XY|Y]] = E[Y\,E[X|Y]]$; therefore, for $j > i$, $E[V_j V_i] = E[V_i\,E[V_j\,|\,x_1, \ldots, x_i]] = 0$, since $E[V_j\,|\,x_1, \ldots, x_i] = 0$. Therefore we have,
$$\mathrm{Var}(Z) = E\Big[\Big(\sum_{i=1}^n V_i\Big)^2\Big] = E\Big[\sum_{i=1}^n V_i^2\Big] = \sum_{i=1}^n E[V_i^2]$$
$$\mathrm{Var}(Z) = \sum_{i=1}^n E\big[(E[Z\,|\,x_1, \ldots, x_i] - E[Z\,|\,x_1, \ldots, x_{i-1}])^2\big]$$
$$= \sum_{i=1}^n E_{x_1^i}\big[(E_{x_{i+1}^n}[Z\,|\,x_1, \ldots, x_i] - E_{x_i^n}[Z\,|\,x_1, \ldots, x_{i-1}])^2\big]$$
$$= \sum_{i=1}^n E_{x_1^i}\big[(E_{x_{i+1}^n}[Z\,|\,x_1, \ldots, x_i] - E_{x_{i+1}^n}[E_{x_i}[Z\,|\,x_1, \ldots, x_{i-1}]])^2\big]$$
Therefore, applying Jensen's inequality to the inner expectation $E_{x_{i+1}^n}$,
$$\mathrm{Var}(Z) \leq \sum_{i=1}^n E[(Z - E_i[Z])^2]$$
where $E_i[Z] = E[Z\,|\,x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$. Now let $x$ and $y$ be two independent samples from the same distribution, so that $Z_i'$ can be formed by replacing the $i$-th coordinate of $x$ with $y_i$. Conditioned on $x^{(i)}$, the variables $Z$ and $Z_i'$ are i.i.d., and for i.i.d. variables $W$, $W'$ we have $E[(W - W')^2] = 2\,\mathrm{Var}(W)$; hence $E[(Z - E_i[Z])^2] = \frac{1}{2}E[(Z - Z_i')^2]$, which gives the stated bound.
Notice that if the function $S$ is the sum of the random variables, the inequality
becomes an equality. Hence the bound is tight. It is often referred to as the
jackknife bound.
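The Efron-Stein bound is easy to estimate by Monte Carlo for a concrete function. The sketch below uses $Z = \max_i x_i$ for independent uniform variables (a standard illustrative choice, not one made in the notes) and compares $\mathrm{Var}(Z)$ with $\frac{1}{2}\sum_i E[(Z - Z_i')^2]$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 10, 100_000

x = rng.random((trials, n))
z = x.max(axis=1)                      # Z = S(x_1,...,x_n) = max_i x_i

# Efron-Stein upper bound: (1/2) * sum_i E[(Z - Z_i')^2]
x_prime = rng.random((trials, n))      # independent copy of the sample
es_bound = 0.0
for i in range(n):
    replaced = x.copy()
    replaced[:, i] = x_prime[:, i]     # replace coordinate i by x_i'
    z_i = replaced.max(axis=1)
    es_bound += 0.5 * np.mean((z - z_i) ** 2)

print("Var(Z)            ~", z.var())
print("Efron-Stein bound ~", es_bound)
```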
An exponential concentration bound for $Z = S(x_1, \ldots, x_n)$ can be obtained whenever the function has bounded differences [10]. That is,
$$\sup_{x_1, \ldots, x_n, x_i'} |S(x_1, x_2, \ldots, x_n) - S(x_1, \ldots, x_i', \ldots, x_n)| \leq \xi_i$$
where $Z_i' = S(x_1, \ldots, x_i', \ldots, x_n)$ and $\{x_1', \ldots, x_n'\}$ is a sample from the same
distribution as $\{x_1, \ldots, x_n\}$.
Now let $V_i = E[Z\,|\,x_1, \ldots, x_i] - E[Z\,|\,x_1, \ldots, x_{i-1}]$ as in the previous proof, so that $Z - E[Z] = \sum_{i=1}^n V_i$, and let $V_i$ be bounded by the interval $[L_i, U_i]$. We know that $|Z - Z_i'| \leq \xi_i$;
hence it follows that $|V_i| \leq \xi_i$ and hence $|U_i - L_i| \leq \xi_i$. Using (13) on $E[e^{sV_i}]$
we get,
$$E[e^{sV_i}] \leq e^{\frac{s^2(U_i - L_i)^2}{8}} \leq e^{\frac{s^2\xi_i^2}{8}}$$
Therefore,
$$P(Z - E[Z] \geq a) \leq e^{-sa}\,E\Big[e^{s\sum_{i=1}^n V_i}\Big] \leq e^{-sa}\prod_{i=1}^n e^{\frac{s^2\xi_i^2}{8}} = e^{-sa + \frac{s^2\sum_{i=1}^n \xi_i^2}{8}}$$
where the second inequality follows by applying the bound on $E[e^{sV_i}]$ conditionally on $x_1, \ldots, x_{i-1}$, one term at a time. Optimizing over $s$ as in the proof of Hoeffding's inequality,
$$\Rightarrow P(Z - E[Z] \geq a) \leq e^{-\frac{2a^2}{\sum_{i=1}^n \xi_i^2}}$$
Hence we get,
$$P(|Z - E[Z]| \geq a) \leq 2e^{-\frac{2a^2}{\sum_{i=1}^n \xi_i^2}} \qquad (22)$$
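Inequality (22) can again be checked by simulation. In the sketch below, $Z$ is the empirical mean of $n$ variables in $[0, 1]$, so each $\xi_i = 1/n$ (this setup and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, a, trials = 100, 0.1, 200_000

x = rng.random((trials, n))
z = x.mean(axis=1)                                      # Z = empirical mean, so xi_i = 1/n
empirical = np.mean(np.abs(z - 0.5) >= a)               # E[Z] = 0.5 for uniform variables
bound = 2 * np.exp(-2 * a**2 / (n * (1.0 / n) ** 2))    # 2 exp(-2 a^2 / sum xi_i^2)

print(f"P(|Z - E[Z]| >= {a}) ~ {empirical:.5f}  <=  {bound:.5f}")
```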
Theorem 10 If the function $\phi(x) = e^x - x - 1$, then
$$sE\{Ze^{sZ}\} - E\{e^{sZ}\}\log E\{e^{sZ}\} \leq \sum_{i=1}^n E\{e^{sZ}\phi(-s(Z - Z_i'))\} \qquad (24)$$
Proof Let measure $P$ denote the distribution of $(x_1, \ldots, x_n)$ and let distribution
$Q$ be given by its density with respect to $P$,
$$Y = \frac{dQ}{dP} = \frac{e^{sZ}}{E\{e^{sZ}\}}$$
Now we have already shown that $D(Q\|P) = E\{Y\log Y\}$ and, further,
$E\{E_i\{Y\log Y\}\} = E\{Y\log Y\}$; therefore, $D(Q\|P) = E\{E_i\{Y\log Y\}\}$. Now
by definition,
$$\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}} = \int_X \frac{dQ(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i = \int_X \frac{Y(x_1, \ldots, x_n)\,dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$
Due to the independence assumption on the sample, we can rewrite the above
as,
$$\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}} = \frac{dP(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)}{dx^{(i)}}\int_X Y(x_1, \ldots, x_n)\, dP_i(x_i)$$
Using the above with Equation (25) and substituting for $Y$ we get the
first required result as,
$$sE\{Ze^{sZ}\} - E\{e^{sZ}\}\log E\{e^{sZ}\} \leq \sum_{i=1}^n E\{e^{sZ}\phi(-s(Z - Z_i'))\}$$
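Since (24) holds for every $s$, it can be verified by direct enumeration on a tiny discrete example. The sketch below (two Bernoulli variables and an arbitrary function $f$, all chosen only for illustration) computes both sides exactly:

```python
import numpy as np
from itertools import product

def f(x1, x2):
    return max(x1, x2) + 0.5 * x1       # arbitrary function of two bits (illustrative)

def phi(u):
    return np.exp(u) - u - 1

p = 0.3                                  # Bernoulli parameter (arbitrary)
prob = {0: 1 - p, 1: p}
s = 0.7

# Left-hand side: s E[Z e^{sZ}] - E[e^{sZ}] log E[e^{sZ}]
EZe = sum(prob[a] * prob[b] * f(a, b) * np.exp(s * f(a, b)) for a, b in product((0, 1), repeat=2))
Ee = sum(prob[a] * prob[b] * np.exp(s * f(a, b)) for a, b in product((0, 1), repeat=2))
lhs = s * EZe - Ee * np.log(Ee)

# Right-hand side: sum_i E[e^{sZ} phi(-s (Z - Z_i'))], expectation also over the copy x_i'
rhs = 0.0
for a, b in product((0, 1), repeat=2):
    for c in (0, 1):                     # value of the independent copy
        w = prob[a] * prob[b] * prob[c]
        z = f(a, b)
        rhs += w * np.exp(s * z) * phi(-s * (z - f(c, b)))   # replace x_1 by the copy
        rhs += w * np.exp(s * z) * phi(-s * (z - f(a, c)))   # replace x_2 by the copy

print("LHS =", lhs, "<= RHS =", rhs, ":", lhs <= rhs)
```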
6 Symmetrization Lemma
The symmetrization lemma is probably one of the easier bounds we review
in these notes. However, it is extremely powerful, since it allows us to
bound the difference between the empirical mean of a function and its expected
value using the difference between the empirical means of the function
for two independent samples of the same size as the original sample. Note
that in most of the literature, the symmetrization lemma is stated and proved only
for zero-one functions like the loss function or the actual classification function.
Here we derive a more generalized version, where we prove the lemma
for bounding the difference between the expectation of any measurable function
with bounded variance and its empirical mean.
Lemma Let $f$ be a measurable function with $\mathrm{Var}\{f\} \leq C$, and let the sample size satisfy $n > \frac{8C}{a^2}$. Then
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) \geq \frac{1}{2}\,P\big(|E\{f\} - \hat{E}_n\{f\}| > a\big) \qquad (28)$$
where $\hat{E}_n'\{f\}$ and $\hat{E}_n''\{f\}$ stand for the empirical means of the function $f(x)$
estimated using the samples $(x_1', x_2', \ldots, x_n')$ and $(x_1'', x_2'', \ldots, x_n'')$ respectively.
Proof By definition,
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) = \int_{X^n}\int_{X^n} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP'$$
Now since the set $Y = \{(x_1, x_2, \ldots, x_n) : |E\{f\} - \hat{E}_n\{f\}| > a\}$ is a subset of
$X^n$ and the term inside the integral is always non-negative,
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) \geq \int_{Y}\int_{X^n} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP'$$
Now let $Z = \{(x_1, x_2, \ldots, x_n) : |\hat{E}_n\{f\} - E\{f(x)\}| \leq \frac{a}{2}\}$. Clearly, for any
sample $(x_1, x_2, \ldots, x_{2n})$, if $(x_1, x_2, \ldots, x_n) \in Y$ and $(x_{n+1}, x_{n+2}, \ldots, x_{2n}) \in Z$,
then $(x_1, x_2, \ldots, x_{2n})$ is a sample such that $|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}$.
Therefore, coming back to the integral, since $Z \subset X^n$,
$$\int_Y\int_{X^n} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP' \geq \int_Y\int_{Z} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP'$$
Now since the integral is over the sets $Y$ and $Z$, on which the indicator equals $1$ as we saw earlier,
$$\int_Y\int_{Z} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP' = \int_Y\int_{Z} 1\; dP''\, dP' = \int_Y P\Big(|\hat{E}_n''\{f\} - E\{f\}| \leq \frac{a}{2}\Big)\, dP'$$
Therefore,
$$\int_Y\int_{X^n} \mathbb{1}_{\big[|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\big]}\; dP''\, dP' \geq \int_Y P\Big(|\hat{E}_n''\{f\} - E\{f\}| \leq \frac{a}{2}\Big)\, dP'$$
Now,
$$P\Big(|\hat{E}_n''\{f\} - E\{f\}| \leq \frac{a}{2}\Big) = 1 - P\Big(|\hat{E}_n''\{f\} - E\{f\}| > \frac{a}{2}\Big)$$
Using Equation (2) (often called Chebyshev's inequality) we get,
$$P\Big(|\hat{E}_n''\{f\} - E\{f\}| > \frac{a}{2}\Big) \leq \frac{4\,\mathrm{Var}\{f\}}{na^2} \leq \frac{4C}{na^2}$$
Now if we choose $n$ such that $n > \frac{8C}{a^2}$, as per our assumption, then
$$P\Big(|\hat{E}_n''\{f\} - E\{f\}| > \frac{a}{2}\Big) \leq \frac{1}{2}$$
Therefore,
$$P\Big(|\hat{E}_n''\{f\} - E\{f\}| \leq \frac{a}{2}\Big) \geq 1 - \frac{1}{2} = \frac{1}{2}$$
Putting this back in the integral we get,
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) \geq \int_Y \frac{1}{2}\; dP' = \frac{1}{2}\int_Y dP' = \frac{1}{2}\,P\big(|E\{f\} - \hat{E}_n\{f\}| > a\big)$$
Therefore we get the final result as,
$$P\Big(|\hat{E}_n'\{f\} - \hat{E}_n''\{f\}| \geq \frac{a}{2}\Big) \geq \frac{1}{2}\,P\big(|E\{f\} - \hat{E}_n\{f\}| > a\big)$$
Note that if we make the function $f$ a zero-one function, then the
maximum possible variance is $\frac{1}{4}$. Hence if we set $C$ to $\frac{1}{4}$, then the condition
under which the inequality holds becomes $n > \frac{2}{a^2}$. Further note that if
we choose the zero-one function $f(x)$ such that it is $1$ when, say, $x$ takes a
particular value and $0$ if not, then the result basically bounds the absolute
difference between the probability of $x$ taking that particular value and
the frequency estimate of $x$ taking that value in a sample of size $n$, using
the difference in the frequencies of occurrence of the value in two independent
samples of size $n$.
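The symmetrization lemma can be illustrated numerically with exactly this kind of zero-one function. The sketch below (Bernoulli data; $n$, $a$ and the Bernoulli parameter are arbitrary choices satisfying $n > 2/a^2$) estimates both probabilities in (28) by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(6)
n, a, trials, p = 400, 0.1, 100_000, 0.3   # f is the 0/1 indicator of the event {x = 1}

x1 = rng.random((trials, n)) < p           # first sample of size n
x2 = rng.random((trials, n)) < p           # second, independent sample of size n

true_dev = np.mean(np.abs(x1.mean(axis=1) - p) > a)                   # P(|E{f} - E_n{f}| > a)
sym_dev = np.mean(np.abs(x1.mean(axis=1) - x2.mean(axis=1)) >= a / 2) # P(|E_n'{f} - E_n''{f}| >= a/2)

print("P(|E{f} - E_n{f}| > a)         ~", true_dev)
print("P(|E_n'{f} - E_n''{f}| >= a/2) ~", sym_dev)
print("Lemma (first <= 2 * second, since n > 2/a^2):", true_dev <= 2 * sym_dev)
```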
References
[1] V. N. Vapnik, A. Ya. Chervonenkis, "On the Uniform Convergence of
Relative Frequencies of Events to Their Probabilities", Theory of Probability
and its Applications, 16(2):264-281, 1971.