
CR12: Statistical Learning & Applications

Johnson-Lindenstrauss theory

Lecturer: Joseph Salmon Scribes: Jordan Frecon and Thomas Sibut-Pinote

1 Subgaussian random variables


In probability, Gaussian random variables are among the easiest to handle and the most commonly encountered distributions. Subgaussian random variables generalize them while keeping the same control on the moment generating function.

Definition (Subgaussian). A random variable X is said to be σ-subgaussian if there exists σ > 0 such that
\[
\forall t \in \mathbb{R}, \quad \mathbb{E}[\exp(tX)] \le \exp\Big(\frac{\sigma^2 t^2}{2}\Big). \tag{1}
\]

The quantity E[exp(tX)] is called the moment generating function by probabilists, or the Laplace transform by analysts.

Proposition. Assume that X is σ-subgaussian. Then the following statements are true: E(X) = 0 and E(X²) = Var(X) ≤ σ².

Proof. By Fubini,
\[
\mathbb{E}(\exp(tX)) = \sum_{n \ge 0} \frac{t^n\,\mathbb{E}(X^n)}{n!} \le \exp\Big(\frac{t^2\sigma^2}{2}\Big) = \sum_{n \ge 0} \Big(\frac{\sigma^2 t^2}{2}\Big)^n \frac{1}{n!}.
\]
Keeping the terms up to order 2 and gathering all terms of order greater than 2 into a remainder, we obtain
\[
1 + t\,\mathbb{E}(X) + \frac{t^2}{2}\,\mathbb{E}(X^2) \le 1 + \frac{\sigma^2 t^2}{2} + g(t), \tag{2}
\]
where g(t)/t² → 0 as t → 0. Dividing both sides by t and letting t → 0⁺ shows that E(X) ≤ 0; letting t → 0⁻ gives E(X) ≥ 0, so E(X) = 0. Dividing both sides of (2) by t² and taking the limit, we obtain E(X²) ≤ σ².

Example. 1. N(0, σ²) is σ-subgaussian.

Indeed, it was checked in a previous lecture that if X ∼ N(0, 1) then
\[
\mathbb{E}(\exp(tX)) = \int_{-\infty}^{+\infty} \exp(tx)\,\exp\Big(-\frac{x^2}{2}\Big)\frac{dx}{\sqrt{2\pi}}
= \int_{-\infty}^{+\infty} \exp\Big(-\frac{(x-t)^2}{2}\Big)\exp\Big(\frac{t^2}{2}\Big)\frac{dx}{\sqrt{2\pi}}
= \exp\Big(\frac{t^2}{2}\Big),
\]
so X is 1-subgaussian. Now if Y ∼ N(0, σ²), then Y/σ = X ∼ N(0, 1), and the same computation gives E(exp(tY)) = E(exp(tσX)) = exp(σ²t²/2), so Y is σ-subgaussian.


2. A Rademacher variable ε (ε = +1 or ε = −1, each with probability 1/2) is 1-subgaussian:
\[
\mathbb{E}(\exp(t\varepsilon)) = \mathbb{P}(\varepsilon = -1)\exp(-t) + \mathbb{P}(\varepsilon = +1)\exp(t)
= \frac{\exp(-t) + \exp(t)}{2} = \cosh(t) = \sum_{n \ge 0}\frac{t^{2n}}{(2n)!}
\le \sum_{n \ge 0}\frac{(t^2)^n}{2^n\,n!} = \exp\Big(\frac{t^2}{2}\Big),
\]
using 2ⁿ n! ≤ (2n)! (see Appendix).

3. A uniform random variable over a compact interval [−a, a] is a-subgaussian:
\[
\mathbb{E}(\exp(tX)) = \int_{-a}^{a}\exp(tx)\,\frac{dx}{2a}
= \frac{\exp(ta) - \exp(-ta)}{2at} = \frac{\sinh(at)}{at}
= \sum_{n \ge 0}\frac{(at)^{2n}}{(2n+1)!}
\le \exp\Big(\frac{a^2t^2}{2}\Big),
\]
using now 2ⁿ n! ≤ (2n + 1)!.
In this case a² is only an upper bound on the variance of X, since Var(X) = ∫₋ₐᵃ x² dx/(2a) = a²/3. Can the bound be made sharper?
4. Let X be a bounded and centered random variable with X ∈ [a, b]. Then X is (b−a)/2-subgaussian (cf. Hoeffding's inequality and McDiarmid's proof, lecture 3). Remark that here we do not need a = −b.
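To make these examples concrete, here is a minimal Monte Carlo sketch (not part of the original notes) that compares the empirical moment generating function E[exp(tX)] with the claimed bound exp(σ²t²/2) for the Gaussian, Rademacher and uniform cases; the sample size, the grid of t values and the value a = 1.5 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                      # Monte Carlo sample size (arbitrary)
ts = np.linspace(-1.5, 1.5, 7)   # grid of t values (arbitrary)

# (sample, claimed subgaussian constant sigma)
examples = {
    "N(0,1)":        (rng.standard_normal(n), 1.0),
    "Rademacher":    (rng.choice([-1.0, 1.0], size=n), 1.0),
    "Uniform[-a,a]": (rng.uniform(-1.5, 1.5, size=n), 1.5),  # a = 1.5
}

for name, (x, sigma) in examples.items():
    # ratio of the empirical MGF to exp(sigma^2 t^2 / 2); stays <= 1 up to MC noise
    ratios = [np.exp(t * x).mean() / np.exp(sigma**2 * t**2 / 2) for t in ts]
    print(f"{name:14s} max E[exp(tX)] / exp(sigma^2 t^2 / 2) over grid: {max(ratios):.3f}")
```

For the Gaussian the ratio is essentially 1 (the bound is an equality there), while for the Rademacher and uniform cases it stays strictly below 1, illustrating that a is only an upper bound in Example 3.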
Theorem. Assume that X is σ-subgaussian and let α ∈ ℝ. Then αX is (|α|σ)-subgaussian. Moreover, assume that X1 is σ1-subgaussian and X2 is σ2-subgaussian. Then X1 + X2 is (σ1 + σ2)-subgaussian.

Proof. For the first part,
\[
\mathbb{E}(\exp(t\alpha X)) \le \exp\Big(\frac{\sigma^2(t\alpha)^2}{2}\Big) = \exp\Big(\frac{(|\alpha|\sigma)^2 t^2}{2}\Big).
\]
For the second part, compute
\[
\mathbb{E}\big(\exp(t(X_1 + X_2))\big) = \mathbb{E}\big(\exp(tX_1)\exp(tX_2)\big).
\]
Then, let us introduce p ≥ 1 and q such that 1/p + 1/q = 1. Hölder's inequality leads to
\[
\mathbb{E}\big(\exp(t(X_1+X_2))\big) \le \mathbb{E}\big(\exp(tpX_1)\big)^{1/p}\,\mathbb{E}\big(\exp(tqX_2)\big)^{1/q}
\le \exp\Big(\frac{\sigma_1^2 t^2 p^2}{2}\Big)^{1/p}\exp\Big(\frac{\sigma_2^2 t^2 q^2}{2}\Big)^{1/q}
= \exp\Big(\frac{t^2}{2}\big(p\sigma_1^2 + q\sigma_2^2\big)\Big).
\]
For example, if we choose p = q = 2 (Cauchy-Schwarz), the exponent is t²(σ1² + σ2²), i.e. a subgaussian constant of √(2(σ1² + σ2²)) ≥ σ1 + σ2 (meaning that Cauchy-Schwarz is suboptimal in that case). The idea is to optimize this bound over p ≥ 1. This gives the following choice:
\[
p^* = 1 + \frac{\sigma_2}{\sigma_1},
\]
and thus leads to the bound E[exp(t(X1 + X2))] ≤ exp(t²(σ1 + σ2)²/2).

Theorem. Assume that X1 is σ1-subgaussian, X2 is σ2-subgaussian, and that moreover X1 and X2 are independent. Then X1 + X2 is √(σ1² + σ2²)-subgaussian.

Proof.
\[
\mathbb{E}\big[e^{t(X_1+X_2)}\big] = \mathbb{E}\big[e^{tX_1}\big]\,\mathbb{E}\big[e^{tX_2}\big]
\le e^{\frac{\sigma_1^2 t^2}{2}}\,e^{\frac{\sigma_2^2 t^2}{2}}
= \exp\Big(\frac{t^2(\sigma_1^2 + \sigma_2^2)}{2}\Big),
\]
where the first equality holds because X1 and X2 are independent.

Theorem (Characterization of subgaussian variables). Assume E(X) = 0. Then the following propositions are equivalent (and in particular each of them is equivalent to X being subgaussian):

1. ∃c1 > 0, ∀λ ≥ 0, P(|X| ≥ λ) ≤ 2 exp(−λ²c1)   (tail bound)

2. ∃c2 > 0, ∀p ≥ 1, (E|X|^p)^{1/p} ≤ c2 √p   (moment control)

3. ∃c3 > 0, E exp(c3 X²) ≤ 2   (the Laplace transform of X² is bounded)

4. ∃c4 > 0, ∀t ∈ ℝ, E(exp(tX)) ≤ exp(c4 t²/2)   (Laplace transform decay)

Remark. The number 2 in the third claim is arbitrary.

Remark. One can find articles/books where the first proposition is taken as the definition of subgaussian.

Proof. 1 ⇒ 2:

We can assume that c1 = 1 (otherwise replace X by √c1 X). Then use Fubini's theorem to show that
\[
\mathbb{E}|X|^p = \int_0^{+\infty} p\,t^{p-1}\,\mathbb{P}(|X| \ge t)\,dt.
\]
Indeed, for X ≥ 0, X = ∫₀^X dt = ∫₀^{+∞} 1_{\{X ≥ t\}} dt, and by Fubini, E(X) = ∫₀^{+∞} P(X ≥ t) dt (a very simple equality, but one that is used very frequently in probability). In the same manner,
\[
\mathbb{E}(|X|^p) = \mathbb{E}\Big(\int_0^{|X|} p\,t^{p-1}\,dt\Big)
= \mathbb{E}\Big(\int_0^{+\infty} p\,t^{p-1}\,\mathbf{1}_{\{|X| \ge t\}}\,dt\Big)
= \int_0^{+\infty} p\,t^{p-1}\,\mathbb{P}(|X| \ge t)\,dt.
\]
Now,
\[
\begin{aligned}
\mathbb{E}(|X|^p) &\le p\int_0^{+\infty} 2t^{p-1}\exp(-t^2)\,dt\\
&= p\int_0^{+\infty} 2u^{\frac{p-1}{2}}\exp(-u)\,\frac{du}{2\sqrt{u}} \quad (\text{change of variable } u = t^2)\\
&= p\int_0^{+\infty} u^{\frac{p}{2}-1}\exp(-u)\,du = p\,\Gamma\Big(\frac{p}{2}\Big) = 2\,\Gamma\Big(\frac{p}{2}+1\Big)\\
&\le 2\Big(\frac{p}{2}\Big)^{\frac{p}{2}},
\end{aligned}
\]
where we have used the definition of the Γ function and the classical inequality Γ(x + 1) ≤ x^x, valid for x ≥ 1 (see Appendix), hence here for p ≥ 2.

And so, for p ≥ 2,
\[
\big(\mathbb{E}|X|^p\big)^{1/p} \le 2^{1/p}\Big(\frac{p}{2}\Big)^{1/2} \le \sqrt{2}\,\sqrt{\frac{p}{2}} = \sqrt{p} \quad (\text{since } 2^{1/p} \le \sqrt{2} \text{ for } p \ge 2),
\]
and the range 1 ≤ p < 2 follows from the case p = 2 by monotonicity of the L^p norms, so one can take c2 = √2.

2 ⇒ 3: Same remark: we start by assuming c2 = 1; otherwise we reduce to that case by dividing X by c2.

\[
\begin{aligned}
\mathbb{E}[\exp(aX^2)] &= 1 + \sum_{n \ge 1} \frac{\mathbb{E}[(aX^2)^n]}{n!} = 1 + \sum_{n \ge 1} a^n\,\frac{\mathbb{E}(X^{2n})}{n!}\\
&\le 1 + \sum_{n \ge 1} a^n\,\frac{(\sqrt{2n})^{2n}}{n!} \quad (\text{since } c_2 = 1)\\
&= 1 + \sum_{n \ge 1} a^n\,\frac{2^n n^n}{n!}\\
&\le 1 + \sum_{n \ge 1} a^n (2e)^n \quad \Big(\text{by using } n! \ge \big(\tfrac{n}{e}\big)^n,\ \text{see Appendix}\Big)\\
&\le 2 \quad \Big(\text{choosing } a \text{ such that } 2ae \le \tfrac12,\ \text{and using } \sum_{n \ge 1}\big(\tfrac12\big)^n = 1\Big).
\end{aligned}
\]

3 ⇒ 4:
\[
\begin{aligned}
\mathbb{E}(\exp(tX)) &= 1 + \int_0^1 (1-y)\,\mathbb{E}\big(t^2X^2\exp(ytX)\big)\,dy \quad (\text{Taylor expansion with integral remainder, and } \mathbb{E}(X) = 0)\\
&\le 1 + \int_0^1 (1-y)\,\mathbb{E}\big(t^2 X^2 \exp(|t|\,|X|)\big)\,dy\\
&\le 1 + \frac{t^2}{2}\,\mathbb{E}\big(X^2\exp(|t|\,|X|)\big)\\
&\le 1 + \frac{t^2}{2}\,\mathbb{E}\Big(X^2\exp\Big(\frac{t^2}{2c_3} + \frac{c_3X^2}{2}\Big)\Big) \quad (\text{using } ab \le a^2/2 + b^2/2)\\
&\le 1 + \frac{t^2}{2}\exp\Big(\frac{t^2}{2c_3}\Big)\,\underbrace{\mathbb{E}\Big(X^2\exp\Big(\frac{c_3X^2}{2}\Big)\Big)}_{\le\,\frac{2}{c_3}\mathbb{E}[\exp(c_3X^2)]\,\le\,\frac{4}{c_3}\ \text{using } x \le \exp(x)}\\
&\le \exp\Big(\frac{5t^2}{2c_3}\Big).
\end{aligned}
\]
4 ⇒ 1 (Chernoff-Bernstein; it seems that Bernstein should be credited too for this method):
For all λ ≥ 0, P(X ≥ t) = P(exp(λX) ≥ exp(λt)), so by Markov's inequality (recall also the Chebyshev inequality: P(X ≥ t) ≤ E[X²]/t²),
\[
\begin{aligned}
\mathbb{P}(X \ge t) &\le \frac{\mathbb{E}(\exp(\lambda X))}{\exp(\lambda t)}\\
&\le \exp\Big(c_4\frac{\lambda^2}{2} - \lambda t\Big) \quad \Big(\text{optimization w.r.t. } \lambda \to \lambda^* = \frac{t}{c_4}\Big)\\
&\le \exp\Big(\frac{-t^2}{2c_4}\Big).
\end{aligned}
\]
Applying the same bound to −X and taking a union bound gives P(|X| ≥ t) ≤ 2 exp(−t²/(2c4)), i.e. the tail condition with c1 = 1/(2c4).
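To make the Chernoff step concrete, here is a minimal Monte Carlo sketch (not from the notes) comparing the empirical Gaussian tail with the bound exp(−t²/(2c4)); taking X ∼ N(0, 1) and c4 = 1 is an illustrative choice consistent with statement 4.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)   # X ~ N(0,1): statement 4 holds with c4 = 1
c4 = 1.0

for t in (1.0, 2.0, 3.0):
    emp = np.mean(x >= t)                  # empirical P(X >= t)
    chernoff = np.exp(-t**2 / (2 * c4))    # exp(-t^2 / (2 c4))
    print(f"t={t}: empirical tail {emp:.2e} <= Chernoff bound {chernoff:.2e}")
```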

Lemma. Assume X is subgaussian with (E[|X|^p])^{1/p} ≤ K√p for some K ≥ 0, and that E(X) = 0 and E(X²) = 1. Then there exists c > 0 such that
\[
\mathbb{E}\big[\exp\big(t(X^2 - 1)\big)\big] \le \exp(t^2 c) \quad \text{for } |t| \le \frac{1}{2eK^2}. \tag{12}
\]

Proof. Define Y = X² − 1. Then, as before, we can write
\[
\mathbb{E}(\exp(tY)) = 1 + \mathbb{E}(Y)\,t + \sum_{p \ge 2}\mathbb{E}(Y^p)\,\frac{t^p}{p!} = 1 + \sum_{p \ge 2}\mathbb{E}(Y^p)\,\frac{t^p}{p!},
\]
since E(Y) = E(X²) − 1 = 0. Recalling the Minkowski inequality,
\[
\big(\mathbb{E}|X^2 - 1|^p\big)^{1/p} \le \big(\mathbb{E} X^{2p}\big)^{1/p} + 1 \le 2pK^2 + 1,
\]
using (E X^{2p})^{1/p} = ((E|X|^{2p})^{1/(2p)})² ≤ (K√(2p))² = 2pK², one obtains
\[
\begin{aligned}
\mathbb{E}(\exp(tY)) &\le 1 + \sum_{p \ge 2}\frac{|t|^p}{p!}\big(2^p p^p K^{2p} + 1\big)\\
&\le 1 + \sum_{p \ge 2}\Big[\big(2|t|eK^2\big)^p + \frac{|t|^p}{p!}\Big] \quad \Big(\text{by using } p! \ge \big(\tfrac{p}{e}\big)^p,\ \text{see Appendix}\Big)\\
&\le 1 + t^2\big(2eK^2\big)^2\sum_{p \ge 0}\big(2|t|eK^2\big)^p + t^2\sum_{p \ge 0}\frac{|t|^p}{(p+2)!}.
\end{aligned}
\]
For |t| ≤ 1/(2eK²) there exists c > 0 such that
\[
\mathbb{E}(\exp(tY)) \le 1 + ct^2 \le \exp(ct^2).
\]

Corollary. Let X_i ∼ X i.i.d. for i = 1, ..., k, with X subgaussian such that (E|X|^p)^{1/p} ≤ K√p (and, as in the lemma, E(X) = 0 and E(X²) = 1). Then there exists c > 0 such that
\[
\mathbb{E}\Big[\exp\Big(\frac{t}{\sqrt{k}}\sum_{i=1}^{k}\big(X_i^2 - 1\big)\Big)\Big] \le \exp(t^2 c) \quad \text{for } |t| \le \frac{\sqrt{k}}{2eK^2}.
\]

Proof.
\[
\mathbb{E}\Big[\exp\Big(\frac{t}{\sqrt{k}}\sum_{i=1}^{k}\big(X_i^2 - 1\big)\Big)\Big]
= \prod_{i=1}^{k}\mathbb{E}\Big[\exp\Big(\frac{t}{\sqrt{k}}\big(X_i^2 - 1\big)\Big)\Big] \quad (\text{by independence})
\le \prod_{i=1}^{k}\exp\Big(\frac{t^2 c}{k}\Big) \quad \Big(\text{for } |t| \le \frac{\sqrt{k}}{2eK^2}\Big)
= \exp(t^2 c).
\]
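The corollary quantifies how fast the normalized sum of squares concentrates around its mean. Here is a minimal Monte Carlo sketch (not in the original notes) illustrating this; the choice of Gaussian entries, the deviation level and the values of k are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials = 20_000
eps = 0.1   # deviation level (arbitrary)

# X_i ~ N(0, 1): centered, unit variance, subgaussian.
for k in (10, 100, 1000):
    X = rng.standard_normal((n_trials, k))
    dev = (X**2).mean(axis=1) - 1.0            # (1/k) sum_i X_i^2 - 1
    p_emp = np.mean(np.abs(dev) >= eps)        # empirical P(|deviation| >= eps)
    print(f"k={k:5d}  P(|1/k sum X_i^2 - 1| >= {eps}) ~ {p_emp:.4f}")

# The empirical probability decays roughly exponentially in k,
# in line with the exp(-eps^2 k / (4c)) tail used in the next section.
```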

2 Random projections in high dimension


2.1 Theoretical results
Theorem (Johnson-Lindenstrauss Lemma). Let X be a subgaussian random variable with E(X) = 0, E(X²) = 1, (E|X|^p)^{1/p} ≤ K√p, and
\[
\mathbb{E}\big(\exp\big(t(X^2 - 1)\big)\big) \le \exp(t^2 c) \quad \text{for } |t| \le \frac{1}{2eK^2}.
\]
For any ε ≤ c/(eK²), define k = (4c/ε²) β log(d) for some β > 0. Then generate R_{i,j} ∼ X i.i.d., where R is a k × d matrix, and introduce T : ℝ^d → ℝ^k such that for x ∈ ℝ^d,
\[
(Tx)_i = \frac{1}{\sqrt{k}}\sum_{j=1}^{d} R_{i,j}\,x_j, \quad i = 1, \dots, k.
\]
Then, for any fixed x ∈ ℝ^d, with probability at least 1 − 2/d^β,
\[
(1-\varepsilon)\|x\|^2 \le \|Tx\|^2 \le (1+\varepsilon)\|x\|^2, \tag{13}
\]
and, equivalently, for any fixed pair (x, y) ∈ ℝ^d × ℝ^d, with the same probability,
\[
(1-\varepsilon)\|x - y\|^2 \le \|Tx - Ty\|^2 \le (1+\varepsilon)\|x - y\|^2. \tag{14}
\]
(A union bound over the pairs of a finite set of points then yields the usual simultaneous formulation for m points, at the price of a factor m² in the failure probability.)
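As an illustration (not part of the original notes), here is a minimal NumPy sketch of the map T with Gaussian entries, checking empirically that the squared norm of a few fixed vectors is preserved up to a factor 1 ± ε; the values of d, ε, β, and the stand-in constant c = 1 are arbitrary choices, since the notes leave c unspecified.

```python
import numpy as np

rng = np.random.default_rng(42)
d, eps, beta = 10_000, 0.25, 2.0
c = 1.0                                    # illustrative constant (unspecified in the notes)
k = int(np.ceil(4 * c / eps**2 * beta * np.log(d)))

R = rng.standard_normal((k, d))            # R_{i,j} iid N(0,1), subgaussian
T = lambda x: (R @ x) / np.sqrt(k)         # (Tx)_i = (1/sqrt(k)) sum_j R_{i,j} x_j

for _ in range(5):                         # a few fixed test vectors
    x = rng.standard_normal(d)
    ratio = np.linalg.norm(T(x))**2 / np.linalg.norm(x)**2
    print(f"k={k}, ||Tx||^2 / ||x||^2 = {ratio:.3f} (target within 1 +/- {eps})")
```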

Proof. Let x ∈ ℝ^d, set u = x/‖x‖, and let Y_i denote the (unnormalized) output coordinates, i.e. Y_i = Σ_{j=1}^d R_{i,j} u_j. Then
\[
\mathbb{E}(Y_i) = \mathbb{E}\Big(\sum_{j=1}^{d} R_{i,j}u_j\Big) = \sum_{j=1}^{d} u_j\,\mathbb{E}(R_{i,j}) = 0,
\]
\[
\operatorname{Var}(Y_i) = \operatorname{Var}\Big(\sum_{j=1}^{d} R_{i,j}u_j\Big)
= \sum_{j=1}^{d}\operatorname{Var}(R_{i,j}u_j)
= \sum_{j=1}^{d} u_j^2\,\operatorname{Var}(R_{i,j})
= \sum_{j=1}^{d} u_j^2 = 1.
\]
So (Y_i)_{i=1,...,k} are independent and subgaussian, thanks to the theorem on sums of independent subgaussian variables (with the same constant as X). Note that ‖Tu‖² = (1/k)(Y_1² + ··· + Y_k²). Defining Z = (1/√k)(Y_1² + ··· + Y_k² − k), one can state the following bound:
\[
\mathbb{P}\big(\|Tu\|^2 \ge 1 + \varepsilon\big) = \mathbb{P}\big(Z \ge \varepsilon\sqrt{k}\big) \le \exp\Big(-\frac{\varepsilon^2 k}{4c}\Big) \quad (\text{by the lemma below}).
\]
Recall that k = (4c/ε²) β log d, so
\[
\mathbb{P}\big(\|Tu\|^2 \ge 1 + \varepsilon\big) \le \exp(-\beta\log d) = \Big(\frac{1}{d}\Big)^{\beta}.
\]
The same kind of derivation leads to
\[
\mathbb{P}\big(\|Tu\|^2 \le 1 - \varepsilon\big) \le \exp(-\beta\log d) = \Big(\frac{1}{d}\Big)^{\beta}.
\]

Lemma. Z = (1/√k)(Y_1² + ··· + Y_k² − k) satisfies P(Z ≥ ε√k) ≤ exp(−ε²k/(4c)) for ε ≤ c/(eK²).

Proof. For all λ ≥ 0 with λ ≤ √k/(2eK²),
\[
\mathbb{P}\big(Z \ge \varepsilon\sqrt{k}\big) \le \frac{\mathbb{E}(\exp(\lambda Z))}{\exp(\lambda\varepsilon\sqrt{k})}
\le \exp\big(\lambda^2 c - \lambda\varepsilon\sqrt{k}\big)
\le \exp\Big(-\frac{\varepsilon^2 k}{4c}\Big),
\]
optimizing with respect to λ, i.e. taking λ* = ε√k/(2c), which is admissible (λ* ≤ √k/(2eK²)) precisely because ε ≤ c/(eK²).

Remark. Here are a few comments on the previous result:

• ε is the precision required.
• β is a confidence parameter governing the probability.
• k ≍ log(d)/ε²: the target dimension grows only logarithmically in d.

2.2 Historical remarks

• [Johnson and Lindenstrauss(1984)]: random subspace of dimension k. Technical tool: "concentration on the sphere". The proof was not constructive.

• [Indyk and Motwani(1998)] and then [Dasgupta and Gupta(2003)]: the random projections are generated in an explicit way: R_{i,j} ∼ N(0, 1) i.i.d.

• [Achlioptas(2003)] extended the property to computationally more tractable random projections: R_{i,j} ∼ Rademacher i.i.d. An interesting feature of this distribution is that it requires only sums and subtractions.

• [Matoušek(2008)] generalized the proof to any subgaussian random variables for the entries R_{i,j}.

• [Ailon and Chazelle(2009)] focused on an even faster implementation: R = M_{k,d} H_d D_d, where M_{k,d} is a random sparse k × d matrix (each entry is non-zero with probability q ≍ log²(d)/d, with Gaussian values), D_d is diagonal with entries generated according to Rademacher distributions, and H_d is the Hadamard matrix defined by
\[
H_{2d} = \begin{pmatrix} H_d & H_d \\ H_d & -H_d \end{pmatrix}, \qquad H_1 = (1).
\]
The latter allows for fast computation of matrix/vector products: one can proceed recursively using only sums/subtractions, leading to O(d log d) operations (similar to the standard FFT); a minimal sketch follows after this list.
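Here is a minimal sketch (not from the original notes) of the recursive Hadamard-vector product mentioned above, assuming d is a power of two; it uses only additions and subtractions and runs in O(d log d). The final check against an explicit matrix assumes SciPy is installed.

```python
import numpy as np

def hadamard_transform(x):
    """Compute H_d @ x for d a power of two, in O(d log d) time,
    using only additions and subtractions (no explicit matrix)."""
    x = np.asarray(x, dtype=float).copy()
    d = x.shape[0]
    assert d & (d - 1) == 0, "d must be a power of two"
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            a = x[start:start + h].copy()
            b = x[start + h:start + 2 * h].copy()
            x[start:start + h] = a + b          # top block:    H_d x1 + H_d x2
            x[start + h:start + 2 * h] = a - b  # bottom block: H_d x1 - H_d x2
        h *= 2
    return x

if __name__ == "__main__":
    from scipy.linalg import hadamard        # explicit Sylvester construction, for checking
    d = 8
    v = np.arange(d, dtype=float)
    assert np.allclose(hadamard_transform(v), hadamard(d) @ v)
    print("fast Hadamard transform matches the explicit H_d")
```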

2.3 Application: k-Nearest-Neighbors (k-nn)

Let us consider m points (x_1, ..., x_m) in ℝ^d and suppose that a new point x arrives. Imagine that one needs to find the closest point to x in ℝ^d for simplicity (the k-nn problem can be dealt with in a similar way), meaning the following problem needs to be solved:
\[
\arg\min_{i = 1, \dots, m}\ \underbrace{d^2(x_i, x)}_{\|x_i - x\|^2},
\]
where ‖·‖ is the Euclidean norm in ℝ^d.



Computational cost of the naive approach: O(md) operations are needed, because one has to compute, for each of the m points, the distance to x in ℝ^d. On the other hand, using the Johnson-Lindenstrauss theory and random projections of the form T : ℝ^d → ℝ^k, x_i ↦ Tx_i, one only needs to perform O(m log(d)) operations (note that this does not take into account the projection step, which can be done as a preliminary treatment).
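A minimal sketch (not in the original notes) of nearest-neighbour search after a random projection, compared with the exact answer on the original points; the Gaussian projection, the dimensions and the synthetic dataset are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
m, d, k = 5_000, 2_000, 200           # number of points, ambient and projected dimensions

points = rng.standard_normal((m, d))  # the m reference points x_1, ..., x_m
query = rng.standard_normal(d)        # the new point x

# Preliminary treatment: project everything once with T(x) = R x / sqrt(k).
R = rng.standard_normal((k, d))
proj_points = points @ R.T / np.sqrt(k)
proj_query = R @ query / np.sqrt(k)

# Exact nearest neighbour in R^d (O(md)).
exact = np.argmin(((points - query) ** 2).sum(axis=1))

# Approximate nearest neighbour in R^k (O(mk), with k of order log d).
approx = np.argmin(((proj_points - proj_query) ** 2).sum(axis=1))

print("exact NN index:", exact, " projected NN index:", approx)
# With high probability the two indices coincide, or the projected neighbour's
# true distance is within a (1 + eps) factor of the optimum.
```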

Techniques similar to J-L: one can also perform a "randomized" SVD/eigenvalue decomposition. You might not get a perfect eigenvalue decomposition, but with high probability you will get something that is "accurate enough".

Appendix: Standard inequalities
Cauchy-Schwarz, Hölder, etc.

Simple ones:
\[
2^n\, n! \le (2n)! \tag{18}
\]
Indeed it is true for n = 0, and for n ≥ 1 we have (2n)! = 2n(2n−1)···(n+1)·n!, and it suffices to lower bound each of the first n factors by 2.
\[
n! \ge \Big(\frac{n}{e}\Big)^n \tag{19}
\]
Indeed, e^n = Σ_{i=0}^{+∞} n^i/i! ≥ n^n/n!, where the latter holds by lower bounding the sum by the term corresponding to i = n; rearranging gives (19).

References

[Achlioptas(2003)] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.

[Ailon and Chazelle(2009)] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302-322, 2009. ISSN 0097-5397.

[Dasgupta and Gupta(2003)] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60-65, 2003.

[Indyk and Motwani(1998)] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604-613. ACM, 1998.

[Johnson and Lindenstrauss(1984)] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[Matoušek(2008)] J. Matoušek. On variants of the Johnson-Lindenstrauss Lemma. Random Structures & Algorithms, 33(2):142-156, 2008.
