Johnson-Lindenstrauss Theory
Definition (Subgaussian). A random variable X is σ-subgaussian if there exists σ > 0 such that
∀t ∈ R, E[exp(tX)] ≤ exp(σ²t²/2). (1)
The quantity E[exp(tX)] is called the moment generating function by probabilists, or the Laplace transform by analysts.
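As a quick sanity check of the definition (a minimal numerical sketch added here, not from the original notes): a Rademacher variable X, taking values ±1 with probability 1/2, has E[exp(tX)] = cosh(t) ≤ exp(t²/2) and is therefore 1-subgaussian. The snippet verifies the bound on a grid of t.

```python
import numpy as np

# Verify numerically that a Rademacher variable is 1-subgaussian:
# its MGF E[exp(tX)] = cosh(t) must stay below exp(t^2 / 2) for all t.
t = np.linspace(-5, 5, 1001)
mgf = np.cosh(t)               # exact MGF of a Rademacher variable
bound = np.exp(t**2 / 2)       # subgaussian bound with sigma = 1
assert np.all(mgf <= bound + 1e-12)
print("max ratio mgf/bound:", (mgf / bound).max())  # <= 1
```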
Proposition. Assume that X is σ-subgaussian. Then the following statements are true:
E(X) = 0 and E(X²) = Var(X) ≤ σ².
Proof. By Fubini's theorem,
E(exp(tX)) = Σ_{n≥0} t^n E(X^n)/n! ≤ exp(σ²t²/2) = Σ_{n≥0} (σ²t²/2)^n · 1/n!.
Comparing the low-order terms on both sides yields
1 + t E(X) + (t²/2) E(X²) ≤ 1 + σ²t²/2 + g(t), (2)
where g(t)/t² → 0 as t → 0. So by dividing both sides by t and taking the limit when t → 0⁺ we show that E(X) ≤ 0; with t → 0⁻ we prove that E(X) ≥ 0. So E(X) = 0.
By dividing both sides of (2) by t² and taking the limit we obtain E(X²) ≤ σ².
Example. For X ∼ N(0,1),
E(exp(tX)) = ∫_{−∞}^{+∞} exp(tx) exp(−x²/2) dx/√(2π) = exp(t²/2),
so X is 1-subgaussian. Now if Y ∼ N(0, σ²), then Y = σX with X ∼ N(0,1), the same computation holds, and so E(exp(tY)) = exp(σ²t²/2): Y is σ-subgaussian.
Proposition. If X is σ-subgaussian and α ∈ R, then αX is |α|σ-subgaussian. Moreover, if X1 is σ1-subgaussian and X2 is σ2-subgaussian (not necessarily independent), then X1 + X2 is (σ1 + σ2)-subgaussian.
Proof. For the first part, apply the definition of σ-subgaussianity at the point tα:
E(exp(tαX)) ≤ exp((tα)²σ²/2) (3)
= exp(|α|²σ²t²/2). (4)
For the second part compute, using Hölder's inequality with exponents p, q ≥ 1 such that 1/p + 1/q = 1:
E(exp(t(X1 + X2))) ≤ E(exp(ptX1))^{1/p} E(exp(qtX2))^{1/q}
≤ exp(σ1²p²t²/2)^{1/p} exp(σ2²q²t²/2)^{1/q} = exp((t²/2)(pσ1² + qσ2²)).
For example, if we choose p = q = 2 (Cauchy-Schwarz) we get the constant 2(σ1² + σ2²), which is at least (σ1 + σ2)², with equality only when σ1 = σ2 (meaning that Cauchy-Schwarz is suboptimal in that case). The idea is to optimize this bound over p ≥ 1 (writing q = p/(p − 1)). This gives the following choice:
p* = σ2/σ1 + 1,
and thus leads to the bound E[exp(t(X1 + X2))] ≤ exp(t²(σ1 + σ2)²/2).
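To make the optimization explicit (an added verification sketch using sympy): with q = p/(p − 1), minimizing φ(p) = pσ1² + qσ2² over p > 1 gives p* = 1 + σ2/σ1 and the optimal value (σ1 + σ2)².

```python
import sympy as sp

# Minimize phi(p) = p*s1^2 + q*s2^2 with the Hölder conjugate q = p/(p-1).
p, s1, s2 = sp.symbols('p s1 s2', positive=True)
q = p / (p - 1)
phi = p * s1**2 + q * s2**2
roots = sp.solve(sp.diff(phi, p), p)   # roots 1 +/- s2/s1; only p > 1 is admissible
print(roots)
p_star = 1 + s2 / s1
print(sp.factor(sp.simplify(phi.subs(p, p_star))))  # (s1 + s2)**2
```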
Theorem. Assume that X1 is σ1-subgaussian and X2 is σ2-subgaussian, and that moreover X1 and X2 are independent. Then X1 + X2 is √(σ1² + σ2²)-subgaussian.
Proof.
E[e^{t(X1+X2)}] = E[e^{tX1}] E[e^{tX2}] ≤ e^{σ1²t²/2} e^{σ2²t²/2} = exp(t²(σ1² + σ2²)/2),
where the first equality holds because X1 and X2 are independent.
Theorem (Characterization of subgaussian variables). Assume E(X) = 0. Then the following propositions are equivalent (and in particular each of them is equivalent to X being subgaussian), each holding for some constant ci > 0:
1. ∀t ≥ 0, P(|X| ≥ t) ≤ 2 exp(−c1 t²);
2. ∀p ≥ 2, (E[|X|^p])^{1/p} ≤ c2 √p;
3. E[exp(c3 X²)] ≤ 2;
4. ∀t ∈ R, E[exp(tX)] ≤ exp(c4 t²/2).
Proof. 1 ⇒ 2:
We can assume that c1 = 1 (otherwise consider √c1 X instead of X). Then use Fubini's theorem to show that
E(|X|^p) = ∫₀^{+∞} p t^{p−1} P(|X| ≥ t) dt
(a very simple identity, but one very frequently used in probability), so that
E(|X|^p) ≤ 2 ∫₀^{+∞} p t^{p−1} exp(−t²) dt = 2 Γ(p/2 + 1) ≤ 2 (p/2)^{p/2}.
And so, (E(|X|^p))^{1/p} ≤ 2^{1/p} (p/2)^{1/2} ≤ √p (since p ≥ 2 gives 2^{1/p} ≤ √2), i.e. property 2 holds with c2 = 1.
2 ⇒ 3: Same remark: we start by assuming c2 = 1; otherwise we can reduce the problem to that one by dividing X by c2. Then:
E[exp(aX²)] = 1 + Σ_{n≥1} E[(aX²)^n]/n!
= 1 + Σ_{n≥1} a^n E(X^{2n})/n!
≤ 1 + Σ_{n≥1} a^n (√(2n))^{2n}/n! (since c2 = 1)
= 1 + Σ_{n≥1} a^n 2^n n^n/n!
≤ 1 + Σ_{n≥1} a^n (2e)^n (by using n! ≥ (n/e)^n, see Appendix)
≤ 2 (choosing a such that 2ae ≤ 1/2, and using Σ_{n≥1} (1/2)^n = 1),
so property 3 holds with c3 = a = 1/(4e).
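As a sanity check of this step (an added sketch, assuming X ∼ N(0,1)): the Gaussian integral gives E[exp(aX²)] = 1/√(1 − 2a) for a < 1/2, and the choice a = 1/(4e) indeed keeps this below 2.

```python
import numpy as np

# For X ~ N(0,1): E[exp(a X^2)] = 1/sqrt(1 - 2a) for a < 1/2 (Gaussian integral).
a = 1 / (4 * np.e)                 # the choice 2ae <= 1/2 from the proof
closed_form = 1 / np.sqrt(1 - 2 * a)
print(closed_form)                 # ~1.11, indeed <= 2

# Monte Carlo cross-check of the same expectation.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
print(np.exp(a * x**2).mean())     # close to the closed form
```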
3 ⇒ 4:
E(exp(tX)) = 1 + ∫₀¹ (1 − y) E(t²X² exp(ytX)) dy (Taylor expansion with integral remainder + E(X) = 0)
≤ 1 + ∫₀¹ (1 − y) E(X²t² exp(|t||X|)) dy
= 1 + (t²/2) E(X² exp(|t||X|))
≤ 1 + (t²/2) E(X² exp(t²/(2c3) + c3X²/2)) (using ab ≤ a²/2 + b²/2 with a = |t|/√c3, b = √c3|X|)
= 1 + (t²/2) exp(t²/(2c3)) E(X² exp(c3X²/2)).
Now E(X² exp(c3X²/2)) ≤ (2/c3) E[exp(c3X²)] ≤ 4/c3, using x ≤ exp(x) with x = c3X²/2 and then property 3, so that
E(exp(tX)) ≤ 1 + (2t²/c3) exp(t²/(2c3)) ≤ exp(5t²/(2c3)),
i.e. property 4 holds with c4 = 5/c3.
4 ⇒ 1 (Chernoff-Bernstein): for any λ ≥ 0, P(X ≥ t) = P(exp(λX) ≥ exp(λt)), so by Markov's inequality (reminder of the Chebyshev inequality, its order-2 variant: P(X ≥ t) ≤ E[X²]/t²):
P(X ≥ t) ≤ E(exp(λX))/exp(λt) (9)
≤ exp(c4λ²/2 − λt) (optimization w.r.t. λ → λ* = t/c4) (10)
≤ exp(−t²/(2c4)). (11)
Applying the same bound to −X yields the factor 2 in property 1, with c1 = 1/(2c4).
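To illustrate the chain 4 ⇒ 1 numerically (an added sketch; for X ∼ N(0,1) one can take c4 = 1): compare the Chernoff bound exp(−t²/(2c4)) with empirical Gaussian tail frequencies.

```python
import numpy as np

# Compare the Chernoff tail bound exp(-t^2 / (2 c4)) with empirical
# Gaussian tails; for X ~ N(0,1) the MGF bound holds with c4 = 1.
rng = np.random.default_rng(1)
x = rng.standard_normal(2_000_000)
for t in (1.0, 2.0, 3.0):
    empirical = (x >= t).mean()
    chernoff = np.exp(-t**2 / 2)   # property 1 with c4 = 1
    print(f"t={t}: empirical={empirical:.2e}  bound={chernoff:.2e}")
```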
Lemma. Assume X is subgaussian with E(X) = 0 and E(X²) = 1, and such that (E[|X|^p])^{1/p} ≤ K√p for some K ≥ 0. Then there exists c > 0 such that
E[exp(t(X² − 1))] ≤ exp(t²c) for |t| ≤ 1/(2eK²). (12)
Proof. Set Y = X² − 1, so that E(Y) = 0. Then
E(exp(tY)) = 1 + E(Y)t + Σ_{p≥2} E(Y^p) t^p/p!
= 1 + Σ_{p≥2} E(Y^p) t^p/p!.
Since |X² − 1|^p ≤ X^{2p} + 1 pointwise, the moment assumption gives
E(|X² − 1|^p) ≤ E(X^{2p}) + 1 ≤ (K√(2p))^{2p} + 1 = 2^p p^p K^{2p} + 1,
and one obtains
E(exp(tY)) ≤ 1 + Σ_{p≥2} |t|^p (2^p p^p K^{2p} + 1)/p!
≤ 1 + Σ_{p≥2} ((2|t|eK²)^p + |t|^p/p!) (by using p! ≥ (p/e)^p, see Appendix)
= 1 + t² ((2eK²)² Σ_{p≥0} (2|t|eK²)^p + Σ_{p≥0} |t|^p/(p + 2)!).
For |t| ≤ 1/(2eK²) the series above are bounded, so there exists c > 0 such that E(exp(tY)) ≤ 1 + ct² ≤ exp(t²c), which is (12).
Corollary. Let X1, ..., Xk be independent copies of X as in the previous lemma, and set Z = (1/√k) Σ_{i=1}^k (X_i² − 1). Then E[exp(tZ)] ≤ exp(t²c) for |t| ≤ √k/(2eK²).
Proof.
E[exp((t/√k) Σ_{i=1}^k (X_i² − 1))] = Π_{i=1}^k E[exp((t/√k)(X_i² − 1))] (by independence)
≤ Π_{i=1}^k exp(t²c/k) (for |t| ≤ √k/(2eK²), applying the lemma at t/√k)
= exp(t²c).
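An empirical look at this concentration (an illustrative sketch with arbitrarily chosen k, assuming Gaussian entries): Z has mean 0 and variance 2, and its upper tail at level ε√k is small, in line with the exp(−ε²k/(4c)) bound used below.

```python
import numpy as np

# Empirical check of the concentration of Z = (X_1^2 + ... + X_k^2 - k)/sqrt(k)
# for Gaussian X_i: E[Z] = 0, Var(Z) = 2, and P(Z >= eps*sqrt(k)) is small.
rng = np.random.default_rng(2)
k, n_rep = 200, 20_000
z = (rng.standard_normal((n_rep, k))**2 - 1).sum(axis=1) / np.sqrt(k)
print("mean:", z.mean(), "variance:", z.var())   # ~0 and ~2
eps = 0.2
print("P(Z >= eps*sqrt(k)):", (z >= eps * np.sqrt(k)).mean())
```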
Theorem (Johnson-Lindenstrauss lemma). Let X be a subgaussian random variable with E(X) = 0, E(X²) = 1, such that (E(|X|^p))^{1/p} ≤ K√p and
E(exp(t(X² − 1))) ≤ exp(t²c) for |t| ≤ 1/(2eK²).
Let R be a k × d random matrix with i.i.d. entries distributed as X, and define T : R^d → R^k by
(Tx)_i = (1/√k) Σ_{j=1}^d R_{i,j} x_j.
Then, for k = (4c/ε²) β log d, any fixed pair (x, y) ∈ R^d × R^d satisfies
(1 − ε)‖x − y‖² ≤ ‖Tx − Ty‖² ≤ (1 + ε)‖x − y‖² (14)
with probability at least 1 − 2/d^β.
Proof. Let x ∈ R^d with x ≠ 0, denote u = x/‖x‖, and let Y_i be the coordinates of the output, i.e. Y_i = Σ_{j=1}^d R_{i,j} u_j. Then
E(Y_i) = E(Σ_{j=1}^d R_{i,j} u_j) = Σ_{j=1}^d u_j E(R_{i,j}) = 0,
Var(Y_i) = Var(Σ_{j=1}^d R_{i,j} u_j) = Σ_{j=1}^d Var(R_{i,j} u_j) = Σ_{j=1}^d u_j² Var(R_{i,j}) = ‖u‖² = 1.
So (Y_i)_{i=1,...,k} are independent and subgaussian thanks to Theorem 1 (with the same constant as X). Defining Z = (1/√k)(Y_1² + ... + Y_k² − k) and noting that ‖Tu‖² = (1/k) Σ_{i=1}^k Y_i², one can state the following bound:
P(‖Tu‖² ≥ 1 + ε) = P(Z ≥ ε√k)
≤ exp(−ε²k/(4c)) (following lemma).
Recall that k = (4c/ε²) β log d, so
P(‖Tu‖² ≥ 1 + ε) ≤ exp(−β log d) = 1/d^β,
and similarly
P(‖Tu‖² ≤ 1 − ε) ≤ exp(−β log d) = 1/d^β.
Applying this to u = (x − y)/‖x − y‖ and using the linearity of T gives (14).
Proof. For all λ ≥ 0,
P(Z ≥ ε√k) ≤ E(exp(λZ))/exp(λε√k) (15)
≤ exp(λ²c − λε√k) (optimize w.r.t. λ → λ* = ε√k/(2c)) (16)
≤ exp(−ε²k/(4c)). (17)
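A minimal numerical illustration of the theorem (a sketch; the sizes d, k, m are arbitrary choices): project a cloud of points with a Gaussian matrix and check the distortion of all pairwise squared distances.

```python
import numpy as np

# Sketch of the JL map: T x = (1/sqrt(k)) R x with i.i.d. N(0,1) entries.
# Check empirically that pairwise squared distances are preserved up to 1 +/- eps.
rng = np.random.default_rng(3)
d, k, m = 10_000, 1_000, 20
points = rng.standard_normal((m, d))
R = rng.standard_normal((k, d))
proj = points @ R.T / np.sqrt(k)     # shape (m, k)

ratios = []
for i in range(m):
    for j in range(i + 1, m):
        num = np.sum((proj[i] - proj[j])**2)
        den = np.sum((points[i] - points[j])**2)
        ratios.append(num / den)
print("min/max distortion:", min(ratios), max(ratios))   # close to 1
```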
• [Indyk and Motwani(1998)] and then [Dasgupta and Gupta(2003)]: the random maps are generated in an explicit way: R_{i,j} ∼ N(0,1) i.i.d.
• [Achlioptas(2003)] extended the property to computationally more tractable random maps: R_{i,j} i.i.d. Rademacher. An interesting feature of this distribution is that it requires only addition and subtraction operations.
• [Matoušek(2008)] generalized the proof to any subgaussian distribution for the entries R_{i,j}.
• [Ailon and Chazelle(2009)] focused on an even faster implementation: R = M_{k,d} H_d D_d, where M_{k,d} is a random k × d sparse matrix (each entry is nonzero with probability q ≈ log²(d)/d, the nonzero entries being Gaussian), D_d has a diagonal generated according to Rademacher distributions, and H_d is the Hadamard matrix defined by
H_{2d} = [ H_d  H_d ; H_d  −H_d ],  H_1 = (1).
The latter allows for fast computation of matrix/vector multiplications: one can use recursively only sums/subtractions, leading to O(d log d) operations (similar to the standard FFT); see the sketch below the list.
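Below is a minimal sketch of the recursive Hadamard multiplication mentioned above (an in-place fast Walsh-Hadamard transform; the function name and the unnormalized convention are our choices, not taken from the references).

```python
import numpy as np

def fwht(x):
    """Multiply by the (unnormalized) Hadamard matrix H_d in O(d log d),
    using only sums and subtractions; d = len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # top block:    H_d  H_d
            x[i + h:i + 2 * h] = a - b  # bottom block: H_d -H_d
        h *= 2
    return x

# Cross-check against the explicit recursive definition of H_4.
H2 = np.array([[1, 1], [1, -1]])
H4 = np.block([[H2, H2], [H2, -H2]])
v = np.array([3.0, 1.0, 4.0, 1.0])
assert np.allclose(fwht(v), H4 @ v)
```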
Let us consider m points (x1, ..., xm) in R^d and suppose that a new point x arrives. Imagine that one needs to find the point closest to x (for simplicity; the k-nearest-neighbors problem can be dealt with in a similar way), meaning the following problem needs to be solved:
argmin_{i=1,...,m} d²(x_i, x), where d²(x_i, x) = ‖x_i − x‖².
Computational cost of the naive approach: O(md) operations are needed, because one has to compute, for each of the m points, the distance to x in R^d. On the other hand, using Johnson-Lindenstrauss theory and random projections of the form T : R^d → R^k, x_i ↦ T x_i, one only needs to perform O(m log d) operations (note that this does not take into account the projection step, which can be done as a preliminary treatment); see the sketch below.
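A minimal sketch of the nearest-neighbor speedup (the dimensions, the planted neighbor, and the Gaussian projection are illustrative assumptions):

```python
import numpy as np

# Naive nearest-neighbor search in R^d (O(md) per query) vs. search in the
# projected space R^k, k ~ log(d)/eps^2 (O(mk) per query after projecting).
rng = np.random.default_rng(4)
d, m, k = 5_000, 2_000, 200
data = rng.standard_normal((m, d))
query = rng.standard_normal(d)
data[0] = query + 0.1 * rng.standard_normal(d)   # plant a clear nearest neighbor

T = rng.standard_normal((k, d)) / np.sqrt(k)     # JL projection matrix
data_proj = data @ T.T                            # preliminary treatment
query_proj = T @ query

naive = np.argmin(((data - query)**2).sum(axis=1))
fast = np.argmin(((data_proj - query_proj)**2).sum(axis=1))
print("naive:", naive, "projected:", fast)        # both find the planted index 0
```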
Techniques similar to JL: one can perform a "randomized" SVD or eigenvalue decomposition. One might not get an exact eigenvalue decomposition, but with high probability one gets something that is "accurate enough".
Appendix: Standard inequalities
Cauchy-Schwarz, Hölder, etc.
Simple ones:
2^n n! ≤ (2n)! (18)
Indeed, it is true for n = 0, and for n ≥ 1 we have (2n)! = 2n(2n − 1)···(n + 1) · n!; then lower bound each of the first n factors by 2.
n! ≥ (n/e)^n (19)
Indeed, e^n = Σ_{i=0}^{+∞} n^i/i! ≥ n^n/n!, where the latter holds by lower bounding the sum by the term corresponding to i = n.
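A quick numeric confirmation of (18) and (19) for small n (added as a sanity check):

```python
import math

# Check 2^n n! <= (2n)! and n! >= (n/e)^n for n = 0..14.
for n in range(15):
    assert 2**n * math.factorial(n) <= math.factorial(2 * n)
    assert math.factorial(n) >= (n / math.e)**n
print("inequalities (18) and (19) hold for n = 0..14")
```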
References
[Achlioptas(2003)] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.
[Ailon and Chazelle(2009)] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302-322, 2009.
[Dasgupta and Gupta(2003)] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60-65, 2003.
[Indyk and Motwani(1998)] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604-613. ACM, 1998.
[Matoušek(2008)] J. Matoušek. On variants of the Johnson-Lindenstrauss lemma. Random Structures & Algorithms, 33(2):142-156, 2008.