Modern Statistical Methods
Michaelmas 2017
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
The remarkable development of computing power and other technology now allows
scientists and businesses to routinely collect datasets of immense size and complexity.
Most classical statistical methods were designed for situations with many observations
and a few, carefully chosen variables. However, we now often gather data where we
have huge numbers of variables, in an attempt to capture as much information as we
can about anything which might conceivably have an influence on the phenomenon
of interest. This dramatic increase in the number of variables makes modern datasets
strikingly different: well-established traditional methods either perform very poorly,
or do not work at all.
Developing methods that are able to extract meaningful information from these large
and challenging datasets has recently been an area of intense research in statistics,
machine learning and computer science. In this course, we will study some of the
methods that have been developed to analyse such datasets. We aim to cover some of
the following topics.
– Kernel machines: the kernel trick, the representer theorem, support vector
machines, the hashing trick.
– Penalised regression: Ridge regression, the Lasso and variants.
– Graphical modelling: neighbourhood selection and the graphical Lasso. Causal
inference through structural equation modelling; the PC algorithm.
– High-dimensional inference: the closed testing procedure and the Benjamini–
Hochberg procedure; the debiased Lasso.
Pre-requisites
Basic knowledge of statistics, probability, linear algebra and real analysis. Some
background in optimisation would be helpful but is not essential.
Contents III Modern Statistical Methods
Contents

0 Introduction
1 Classical statistics
2 Kernel machines
  2.1 Ridge regression
  2.2 v-fold cross-validation
  2.3 The kernel trick
  2.4 Making predictions
  2.5 Other kernel machines
  2.6 Large-scale kernel machines
3 The Lasso and beyond
4 Graphical modelling
  4.1 Conditional independence graphs
  4.2 Structural equation modelling
  4.3 The PC algorithm
5 High-dimensional inference
  5.1 Multiple testing
  5.2 Inference in high-dimensional regression
Index
0 Introduction
In recent years, there has been a rather significant change in what sorts of data
we have to handle and what questions we ask about them, witnessed by the
popularity of the buzzwords “big data” and “machine learning”. In classical
statistics, we usually have a small set of parameters, and a very large data set.
We then use the large data set to estimate the parameters.
However, nowadays we often see scenarios where we have a very large number
of parameters, and the data set is relatively small. If we tried to apply our
classical linear regression, then we would be able to tune the parameters so
that we have a perfect fit, and still have great freedom to change the parameters
without affecting the fit.
One example is that we might want to test which genes are responsible for
a particular disease. In this case, there is a huge number of genes to consider,
and there is good reason to believe that most of them are irrelevant, i.e. the
parameters should be set to zero. Thus, we want to develop methods that find
the “best” fitting model that takes this into account.
Another problem we might encounter is that we just have a large data set,
and doing linear regression seems a bit silly. If we have so much data, we might
as well try to fit more complicated curves, such as polynomial functions and
friends. Perhaps more ambitiously, we might try to find the best continuously
differentiable function that fits the curve, or, as analysts will immediately suggest
as an alternative, weakly differentiable functions.
There are many things we can talk about, and we can’t talk about all of
them. In this course, we are going to cover four topics of different sizes:
– Kernel machines
– The Lasso and its extensions
– Graphical modelling
– High-dimensional inference
1 Classical statistics
This is a course on modern statistical methods. Before we study methods, we
give a brief summary of what we are not going to talk about, namely classical
statistics.
So suppose we are doing regression. We have some predictors xi ∈ Rp and
responses Yi ∈ R, and we hope to find a model that describes Y as a function of
x. For convenience, define the matrix and vector

    X = (x_1, . . . , x_n)^T ∈ R^{n×p},   Y = (Y_1, . . . , Y_n)^T ∈ R^n.

The linear model then takes the form

    Y = Xβ^0 + ε,
where ε is some (hopefully small) error random variable. Our goal is then to
estimate β 0 given the data we have.
If X has full column rank, so that X^T X is invertible, then we can use ordinary
least squares to estimate β^0, with estimate

    β̂^OLS = (X^T X)^{-1} X^T Y.

This assumes nothing about ε itself, but if we assume that Eε = 0 and var(ε) =
σ^2 I, then this estimate satisfies
– E β̂^OLS = (X^T X)^{-1} X^T X β^0 = β^0
– var(β̂^OLS) = σ^2 (X^T X)^{-1}
If we instead have a parametric model with log-likelihood ℓ(θ), then the maximum
likelihood estimator maximizes ℓ(θ) over θ to get θ̂.
Similar to ordinary least squares, there is a theorem that says maximum
likelihood estimation is the "best". To do so, we introduce the Fisher information
matrix. This is a family of d × d matrices indexed by θ, defined by

    I_jk(θ) = -E_θ [ ∂²ℓ(θ) / (∂θ_j ∂θ_k) ].
The relevant theorem is the Cramér–Rao lower bound, which says that, under
regularity conditions, any unbiased estimator θ̃ of θ satisfies that var_θ(θ̃) - I^{-1}(θ)
is positive semi-definite.
2 Kernel machines
We are going to start a little bit slowly, and think about our linear model
Y = Xβ 0 + ε, where E(ε) = 0 and var(ε) = σ 2 I. Ordinary least squares is an
unbiased estimator, so let’s look at biased estimators.
For a biased estimator β̃, we should not study the variance, but the mean
squared error

    E ||β̃ - β^0||_2^2 = E ||β̃ - E β̃||_2^2 + ||E β̃ - β^0||_2^2.

The first term is, of course, just the variance, and the second is the squared bias.
So the point is that if we pick a clever biased estimator with a tiny variance,
then this might do better than unbiased estimators with large variance.
Ridge regression estimates (μ, β^0) via

    (μ̂_λ^R, β̂_λ^R) = argmin_{(μ,β) ∈ R × R^p} { ||Y - μ1 - Xβ||_2^2 + λ||β||_2^2 },

where λ ≥ 0 is a tuning parameter.
Note also that the Ridge regression formula makes sense only if each entry of β
has the same order of magnitude, or else the penalty will only have a significant
effect on the terms of large magnitude. Standard practice is to subtract from
each column of X its mean, and then scale it to have ℓ2 norm √n. The actual
number is not important here, but it will matter in the case of the Lasso.
By differentiating, one sees that the solution to the optimization problem is

    μ̂_λ^R = Ȳ = (1/n) Σ_{i=1}^n Y_i,
    β̂_λ^R = (X^T X + λI)^{-1} X^T Y.
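As a quick illustration (a sketch of my own, not part of the notes), we can compute this closed form directly; the helper `ridge` and the simulated data below are hypothetical:

```python
import numpy as np

def ridge(X, Y, lam):
    """Return (mu_hat, beta_hat) for ridge regression with penalty lam.
    mu_hat is Ybar; beta_hat = (Xc^T Xc + lam*I)^{-1} Xc^T (Y - Ybar),
    where Xc is the column-centred design matrix."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # centre each column
    mu = Y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (Y - mu))
    return mu, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta0 = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
Y = 2.0 + X @ beta0 + rng.normal(scale=0.1, size=200)
mu, beta = ridge(X, Y, lam=0.01)       # small lam: close to OLS
```

With a small λ and low noise, the estimate is close to the true coefficient vector.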
Note that we can always pick λ such that the matrix (X T X + λI) is invertible.
In particular, this can work even if we have more parameters than data points.
This will be important when we work with the Lasso later on.
If some god-like being told us a suitable value of λ to use, then Ridge
regression would always work better than ordinary least squares.
Theorem. Suppose rank(X) = p. Then for λ > 0 sufficiently small (depending
on β^0 and σ^2), the matrix

    E[(β̂^OLS - β^0)(β̂^OLS - β^0)^T] - E[(β̂_λ^R - β^0)(β̂_λ^R - β^0)^T]

is positive definite.
Proof. We know that the first term is just σ^2 (X^T X)^{-1}. The second term has
a variance part and a bias part. We first look at the bias:

    E[β̂_λ^R] - β^0 = (X^T X + λI)^{-1} X^T X β^0 - β^0
                    = (X^T X + λI)^{-1} (X^T X + λI - λI) β^0 - β^0
                    = -λ (X^T X + λI)^{-1} β^0.
One can then show that the difference of the two matrices is positive definite
provided that

    2σ^2 I + σ^2 λ (X^T X)^{-1} - λ β^0 (β^0)^T

is positive definite, which is true for 0 < λ < 2σ^2 / ||β^0||_2^2.
While this is nice, this is not really telling us much, because we don’t know
how to pick the correct λ. It also doesn’t tell us when we should expect a big
improvement from Ridge regression.
To understand this better, we need to use the singular value decomposition.
One thing to note about this is that we are thinking of X̃ and Ỹ as arbitrary
data sets of size n, as opposed to the one we have actually got. This might be a
more tractable problem, since we are not working with our actual data set, but
general data sets.
The method of cross-validation estimates this by splitting the data into v
folds (X^{(1)}, Y^{(1)}), . . . , (X^{(v)}, Y^{(v)}), with corresponding observation indices
A_1, . . . , A_v, where A_i is the set of all indices j such that the jth data point
lies in the ith fold.
We let (X^{(-k)}, Y^{(-k)}) be all the data except that in the kth fold. We define

    CV(λ) = (1/n) Σ_{k=1}^v Σ_{i ∈ A_k} (Y_i - x_i^T β̂_λ^R(X^{(-k)}, Y^{(-k)}))^2.
We write λ_CV for the minimizer of this, and pick β̂_{λ_CV}^R(X, Y) as our estimate.
This tends to work very well in practice.
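As a sketch (my own code, not from the notes), v-fold cross-validation for ridge regression over a grid of λ values might look like:

```python
import numpy as np

def cv_ridge(X, Y, lam, v=5, seed=0):
    """Compute CV(lambda): average squared out-of-fold prediction error."""
    n, p = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    total = 0.0
    for A in np.array_split(idx, v):           # A_k: indices of the kth fold
        mask = np.ones(n, dtype=bool)
        mask[A] = False
        Xk, Yk = X[mask], Y[mask]              # data with the kth fold removed
        beta = np.linalg.solve(Xk.T @ Xk + lam * np.eye(p), Xk.T @ Yk)
        total += np.sum((Y[A] - X[A] @ beta) ** 2)
    return total / n

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
Y = X @ rng.normal(size=10) + rng.normal(size=100)
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
lam_cv = min(grid, key=lambda lam: cv_ridge(X, Y, lam))
```

The grid and data here are arbitrary; in practice the grid is usually taken on a log scale.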
But we can ask ourselves — why do we have to pick a single λ to make the
estimate? We can instead average over a range of different λ. Suppose we have
computed β̂λR on a grid of λ-values λ1 > λ2 > · · · > λL . Our plan is to take a
good weighted average of the λ. Concretely, we want to minimize
    (1/n) Σ_{k=1}^v Σ_{i ∈ A_k} ( Y_i - Σ_{l=1}^L w_l x_i^T β̂_{λ_l}^R(X^{(-k)}, Y^{(-k)}) )^2
over w ∈ [0, ∞)^L. This is known as stacking. This tends to work better than
just doing v-fold cross-validation. Indeed, it must do at least as well, since
cross-validation is just the special case where w is restricted to be zero in all
but one entry. Of course, this method comes with some heavy computational costs.
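A sketch of stacking (my own illustration): the weights solve a nonnegative least squares problem over the matrix of out-of-fold ridge predictions, which we can solve by projected gradient descent:

```python
import numpy as np

def stack_weights(X, Y, lams, v=5, seed=0, iters=5000):
    """Nonnegative weights over a lambda-grid, fitted to out-of-fold ridge predictions."""
    n, p = X.shape
    P = np.zeros((n, len(lams)))       # P[i, l] = out-of-fold prediction at lambda_l
    idx = np.random.default_rng(seed).permutation(n)
    for A in np.array_split(idx, v):
        mask = np.ones(n, dtype=bool)
        mask[A] = False
        G = X[mask].T @ X[mask]
        b = X[mask].T @ Y[mask]
        for l, lam in enumerate(lams):
            beta = np.linalg.solve(G + lam * np.eye(p), b)
            P[A, l] = X[A] @ beta
    # minimise ||Y - P w||_2^2 over w >= 0 by projected gradient descent
    lr = 1.0 / np.linalg.norm(P, 2) ** 2
    w = np.zeros(len(lams))
    for _ in range(iters):
        w = np.maximum(w - lr * P.T @ (P @ w - Y), 0.0)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
Y = X @ rng.normal(size=6) + rng.normal(size=80)
w = stack_weights(X, Y, [0.01, 0.1, 1.0, 10.0])
```

Dedicated NNLS solvers would normally be used instead of this simple projected gradient loop.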
Of course, we can do Ridge regression, as long as we add the products xik xi`
as predictors. But this requires O(p2 ) many predictors. Even if we use the
XX T way, this has a cost of O(n2 p2 ), and if we used the naive method, it would
require O(np4 ) operations.
We can do better than this. The idea is that we might be able to compute
K directly. Consider

    (1 + x_i^T x_j)^2 = 1 + 2 x_i^T x_j + Σ_{k,ℓ} x_ik x_iℓ x_jk x_jℓ.
If we set Kij = (1+xTi xj )2 and form K(K +λI)−1 Y , this is equivalent to forming
ridge regression with (∗) as our predictors. Note that here we don’t scale our
columns to have the same `2 norm. This is pretty interesting, because computing
this is only O(n2 p). We managed to kill a factor of p in this computation. The
key idea here was that the fitted values depend only on K, and not on the values
of xij itself.
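We can check this equivalence numerically (a sketch of my own, not from the notes): the explicit feature map φ(x) = (1, √2 x, vec(x x^T)) satisfies φ(x)^T φ(x') = (1 + x^T x')^2, so kernel ridge and explicit-feature ridge give identical fitted values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 20, 3, 0.5
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Kernel side: K_ij = (1 + x_i^T x_j)^2, fitted values K (K + lam I)^{-1} Y.
K = (1 + X @ X.T) ** 2
fit_kernel = K @ np.linalg.solve(K + lam * np.eye(n), Y)

# Explicit side: phi(x) = (1, sqrt(2) x, outer(x, x) flattened).
Phi = np.hstack([np.ones((n, 1)), np.sqrt(2) * X,
                 np.stack([np.outer(x, x).ravel() for x in X])])
fit_explicit = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]),
                                     Phi.T @ Y)

assert np.allclose(fit_kernel, fit_explicit)
```

The equality follows from the identity Φ(Φ^T Φ + λI)^{-1} = (ΦΦ^T + λI)^{-1}Φ with K = ΦΦ^T.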
Consider the general scenario where we try to predict the value of Y given
a predictor x ∈ X . In general, we don’t even assume X has some nice linear
structure where we can do linear regression.
If we want to do Ridge regression, then one thing we can do is that we
can try to construct some map φ : X → RD for some D, and then run Ridge
regression using {φ(xi )} as our predictors. If we want to fit some complicated,
non-linear model, then D can potentially be very huge, and this can be costly.
The observation above is that instead of trying to do this, we can perhaps find
some magical way of directly computing
If we can do so, then we can simply use the above formula to obtain the fitted
values (we shall discuss the problem of making new predictions later).
Since it is only the function k that matters, perhaps we can specify a
“regression method” simply by providing a suitable function k : X × X → R.
If we are given such a function k, when would it arise from a “feature map”
φ : X → RD ?
More generally, we will find that it is convenient to allow for φ to take values
in infinite-dimensional vector spaces instead (since we don’t actually have to
compute φ!).
Definition (Inner product space). An inner product space is a real vector space
H endowed with a map ⟨·, ·⟩ : H × H → R obeying
– Symmetry: hu, vi = hv, ui
– Linearity: If a, b ∈ R, then hau + bw, vi = ahu, vi + bhw, vi.
– Positive definiteness: hu, ui ≥ 0 with hu, ui = 0 iff u = 0.
If we want k to come from a feature map φ, then an immediate necessary
condition is that k has to be symmetric. There is also another condition that
corresponds to the positive-definiteness of the inner product.
Proposition. Given φ : X → H, define k : X × X → R by k(x, x') = ⟨φ(x), φ(x')⟩.
Then for any x_1, . . . , x_n ∈ X, the matrix K with entries

    K_ij = k(x_i, x_j)

is positive semi-definite.
Indeed, for any α ∈ R^n,

    Σ_{i,j} α_i K_ij α_j = || Σ_i α_i φ(x_i) ||^2 ≥ 0.

This motivates the definition of a (positive-definite) kernel: a symmetric map
k : X × X → R such that for all n ∈ N and all x_1, . . . , x_n ∈ X, the matrix K
with entries

    K_ij = k(x_i, x_j)

is positive semi-definite.
We will prove that every positive-definite kernel comes from a feature map.
However, before that, let’s look at some examples.
Example. Suppose k_1, k_2, . . . are kernels. Then
– If α_1, α_2 ≥ 0, then α_1 k_1 + α_2 k_2 is a kernel. Moreover, if k(x, x') =
lim_{n→∞} k_n(x, x') exists for all x, x', then k is a kernel.
– The pointwise product k_1 k_2 is a kernel.
Example. The linear kernel is k(x, x0 ) = xT x0 . To see this, we see that this is
given by the feature map φ = id and taking the standard inner product on Rp .
Example. The polynomial kernel is k(x, x0 ) = (1 + xT x0 )d for all d ∈ N.
We saw this last time with d = 2. This is a kernel since both 1 and xT x0 are
kernels, and sums and products preserve kernels.
Example. The Gaussian kernel is

    k(x, x') = exp( -||x - x'||_2^2 / (2σ^2) ).
The quantity σ is known as the bandwidth.
To show that it is a kernel, we decompose

    ||x - x'||_2^2 = ||x||_2^2 - 2 x^T x' + ||x'||_2^2.

We define

    k_1(x, x') = exp( -||x||_2^2 / (2σ^2) ) exp( -||x'||_2^2 / (2σ^2) ).
This is a kernel by taking φ(·) = exp( -||·||_2^2 / (2σ^2) ).
Next, we can define

    k_2(x, x') = exp( x^T x' / σ^2 ) = Σ_{r=0}^∞ (1/r!) ( x^T x' / σ^2 )^r.
We see that this is the infinite linear combination of powers of kernels, hence is
a kernel. Thus, it follows that k = k1 k2 is a kernel.
Note also that any feature map giving this k must take values in an infinite
dimensional inner product space. Roughly speaking, this is because we have
arbitrarily large powers of x and x0 in the expansion of k.
Example. The Sobolev kernel is defined as follows: we take X = [0, 1], and let
k(x, x') = min(x, x').
A slick proof that this is a kernel is to notice that this is the covariance function
of Brownian motion.
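To make this concrete (my own numerical check, not in the notes), we can verify that the Gram matrix of min(x, x') is positive semi-definite, as its role as the covariance function of Brownian motion guarantees:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(size=30)               # points in [0, 1]
K = np.minimum.outer(x, x)             # K_ij = min(x_i, x_j) = Cov(B_{x_i}, B_{x_j})
eigs = np.linalg.eigvalsh(K)           # all eigenvalues of a PSD matrix are >= 0
assert eigs.min() > -1e-10             # allow tiny negative values from round-off
```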
Example. The Jaccard similarity is defined as follows: Take X to be the set of
all subsets of {1, . . . , p}. For x, x0 ∈ X , define
    k(x, x') = |x ∩ x'| / |x ∪ x'|   if x ∪ x' ≠ ∅,
    k(x, x') = 1                     otherwise.
We have to check that this is an inner product, but even before that, we need to
check that this is well-defined, since f and g can be represented in the form (∗)
in multiple ways. To do so, simply observe that

    Σ_{i=1}^n Σ_{j=1}^m α_i β_j k(x_i, x'_j) = Σ_{i=1}^n α_i g(x_i) = Σ_{j=1}^m β_j f(x'_j).   (†)
The first equality shows that the definition of our inner product does not depend
on the representation of g, while the second equality shows that it doesn’t depend
on the representation of f .
To show this is an inner product, note that it is clearly symmetric and bilinear.
To show it is positive definite, note that we have

    ⟨f, f⟩ = Σ_{i=1}^n Σ_{j=1}^n α_i k(x_i, x_j) α_j ≥ 0,

since k is a kernel.
It remains to show that ⟨f, f⟩ = 0 implies f ≡ 0. For any x, we have

    f(x)^2 = ⟨f, k(·, x)⟩^2 ≤ ⟨f, f⟩ ⟨k(·, x), k(·, x)⟩ = 0.

Here we used the Cauchy–Schwarz inequality, which, if you inspect the proof,
does not require positive definiteness, just positive semi-definiteness. So it follows
that f ≡ 0. Thus, we know that H is indeed an inner product space.
We now have to construct a feature map. Define φ : X → H by

    φ(x) = k(·, x).

Then by the definition of the inner product,

    ⟨φ(x), φ(x')⟩ = ⟨k(·, x), k(·, x')⟩ = k(x, x'),

as desired.
We shall boost this theorem up to its full strength, where we say there is
a “unique” inner product space with such a feature map. Of course, it will not
be unique unless we impose appropriate quantifiers, since we can just throw in
some random elements into H.
Recall (or learn) from functional analysis that any inner product space B is
a normed space, with norm
    ||f||_B^2 = ⟨f, f⟩_B.
and it can be shown that after augmenting H with such limits, we obtain a
Hilbert space. In fact, it is a special type of Hilbert space.
Definition (Reproducing kernel Hilbert space (RKHS)). A Hilbert space B of
functions f : X → R is a reproducing kernel Hilbert space if for each x ∈ X,
there exists a k_x ∈ B such that

    ⟨k_x, f⟩ = f(x)

for all f ∈ B.
The function k : X × X → R given by k(x, x') = ⟨k_x, k_{x'}⟩ is called the
reproducing kernel of B.

Example. Take the linear kernel

    k(x, x') = x^T x'.

By definition, we have

    H = { f : R^p → R | f(x) = Σ_{i=1}^n α_i x_i^T x }

for some n ∈ N and x_1, . . . , x_n ∈ R^p. We then see that this is precisely the
set of all linear functions f(x) = β^T x with β ∈ R^p.
Example. Take the Sobolev kernel k(x, x') = min(x, x') on X = [0, 1]. The
generating functions k(·, x') are piecewise-linear ramps, increasing linearly up
to x' and constant afterwards.
Since we allow arbitrary linear combinations of these things and pointwise limits,
this gives rise to a large class of functions. In particular, this includes all Lipschitz
functions that are 0 at the origin.
In fact, the resulting space is a Sobolev space, with the norm given by

    ||f|| = ( ∫_0^1 f'(x)^2 dx )^{1/2}.
Ridge regression can be written as

    f̂ = argmin_{f ∈ H} { Σ_{i=1}^n (Y_i - f(x_i))^2 + λ||f||_H^2 },   (∗)

where H is the RKHS of the linear kernel. Now this way of writing ridge
regression makes sense for an arbitrary RKHS, so we might think this is what
we should solve in general.
But if H is infinite dimensional, then naively, it would be quite difficult to
solve (∗). The solution is provided by the representer theorem.
Theorem (Representer theorem). Let H be an RKHS with reproducing kernel
k. Let c be an arbitrary loss function and J : [0, ∞) → R any strictly increasing
function. Then the minimizer f̂ ∈ H of

    Q(f) = c(Y, x_1, . . . , x_n, f(x_1), . . . , f(x_n)) + J(||f||_H^2)

is of the form f̂ = Σ_{i=1}^n α̂_i k(·, x_i), and thus we can rewrite our optimization
problem as looking for the α̂ ∈ R^n that minimizes

    c(Y, x_1, . . . , x_n, Kα) + J(α^T Kα).
In the case of ridge regression, minimizing over α gives α̂ = (K + λI)^{-1} Y, so
the fitted values are

    K α̂ = K(K + λI)^{-1} Y.
Proof. Decompose

    f̂ = u + v,

where u lies in the span of k(·, x_1), . . . , k(·, x_n) and v is orthogonal to that
span. So we know that

    f̂(x_i) = ⟨f̂, k(·, x_i)⟩ = ⟨u, k(·, x_i)⟩ = u(x_i),

and the loss term is unchanged if we replace f̂ by u. Meanwhile,

    ||f̂||_H^2 = ||u + v||_H^2 = ||u||_H^2 + ||v||_H^2 ≥ ||u||_H^2,

using the fact that u and v are orthogonal. Since J is strictly increasing, any
minimizer must have v = 0.
How well does our kernel machine do? Let H be an RKHS with reproducing
kernel k, and f 0 ∈ H. Consider a model
Yi = f 0 (xi ) + εi
Theorem. Suppose ||f^0||_H ≤ 1, and let d_1 ≥ · · · ≥ d_n ≥ 0 be the eigenvalues
of K. Then

    (1/n) Σ_{i=1}^n E(f^0(x_i) - f̂_λ(x_i))^2 ≤ (σ^2/n) Σ_{i=1}^n d_i^2/(d_i + λ)^2 + λ/(4n)
                                             ≤ (σ^2/(nλ)) Σ_{i=1}^n min(d_i/4, λ) + λ/(4n).
Given a data set, we can compute the eigenvalues di , and thus we can compute
this error bound.
Proof. We know from the representer theorem that the fitted values are
K α̂ = K(K + λI)^{-1} Y, and that we can write (f^0(x_1), . . . , f^0(x_n))^T = Kα
for some α with

    1 ≥ ||f^0||_H^2 ≥ α^T Kα.

Write K = U D U^T with D = diag(d_1, . . . , d_n) and U orthogonal. The squared
bias term is

    ||λ(K + λI)^{-1} Kα||_2^2 = λ^2 ||U(D + λI)^{-1} D U^T α||_2^2
                              = Σ_{i=1}^n λ^2 θ_i^2 / (d_i + λ)^2,

where θ = D U^T α.
Now we have

    α^T Kα = α^T U D U^T α = α^T U D D^+ D U^T α = Σ_{i : d_i > 0} θ_i^2 / d_i,

where D^+ is diagonal with

    D^+_{ii} = 1/d_i if d_i > 0, and 0 otherwise.
by Hölder’s inequality with (p, q) = (1, ∞). Finally, use the inequality that
(a + b)2 ≥ 4ab
= σ 2 tr K 2 (K + λI)−2
= σ 2 tr U D2 U T (U DU T + λI)−2
= σ 2 tr D2 (D + λI)−2
n
2
X d2i
=σ .
i=1
(di + λ)2
Finally, writing

    d_i^2/(d_i + λ)^2 = (d_i/λ) · ( d_i λ / (d_i + λ)^2 ),

we have

    d_i^2/(d_i + λ)^2 ≤ min( 1, d_i/(4λ) ),

and we have the second bound.
How can we interpret this? If we look at the formula, then we see that we can
get good bounds if the d_i decay quickly, and the exact values of the d_i depend
only on the choice of the x_i. So suppose the x_i are random, i.i.d. and independent
of ε. Then the entire analysis is still valid by conditioning on x_1, . . . , x_n.
We define μ̂_i = d_i/n and λ_n = λ/n. Then we can rewrite our result to say

    (1/n) Σ_{i=1}^n E(f^0(x_i) - f̂_λ(x_i))^2 ≤ E[ (σ^2/λ_n)(1/n) Σ_{i=1}^n min(μ̂_i/4, λ_n) + λ_n/4 ] ≡ E δ_n(λ_n).
We want to relate this more directly to the kernel somehow. Given a density
p(x) on X, Mercer's theorem guarantees the eigenexpansion

    k(x, x') = Σ_{j=1}^∞ μ_j e_j(x) e_j(x'),
and

    ∫_X e_k(x) e_j(x) p(x) dx = 1_{j=k}.
One can then show that

    E (1/n) Σ_{i=1}^n min(μ̂_i/4, λ_n) ≤ (1/n) Σ_{i=1}^∞ min(μ_i/4, λ_n)
up to some absolute constant factors. For a particular density p(x) of the input
space, we can try to solve the integral equation to figure out what the µj ’s are,
and we can find the expected bound.
When k is the Sobolev kernel and p(x) is the uniform density, then it turns
out we have

    μ_j/4 = 1/( π^2 (2j - 1)^2 ).
By drawing a picture, we see that we can bound Σ_{i=1}^∞ min(μ_i/4, λ_n) by

    Σ_{i=1}^∞ min(μ_i/4, λ_n) ≤ ∫_0^∞ λ_n ∧ (1/(π^2 (2j - 1)^2)) dj ≤ √λ_n/π + λ_n/2.
So we find that

    E(δ_n(λ_n)) = O( σ^2/(n λ_n^{1/2}) + λ_n ).
We can then find the λ_n that minimizes this, and we see that we should pick

    λ_n ∼ (σ^2/n)^{2/3},

which gives an error rate of ∼ (σ^2/n)^{2/3}.
Suppose first that the two classes, with labels Y_i ∈ {-1, 1}, can be separated by
a hyperplane through the origin. Then we

    maximize M over β ∈ R^p, M ≥ 0, subject to Y_i x_i^T β / ||β||_2 ≥ M for all i.
This optimization problem gives the hyperplane that maximizes the margin
between the two classes.
Let’s think about the more realistic case where we cannot completely separate
the two sets. We then want to impose a penalty for each point on the wrong
side, and attempt to minimize the distance.
There are two options. We can use the penalty

    λ Σ_{i=1}^n ( M - Y_i x_i^T β / ||β||_2 )_+.
Replacing the maximization of 1/||β||_2 with the minimization of ||β||_2^2, and
adding instead of subtracting the penalty part, we modify this to say

    min_{β ∈ R^p} ( ||β||_2^2 + λ Σ_{i=1}^n (1 - Y_i x_i^T β)_+ ).
The final change we make is that we replace λ with 1/λ, and multiply the whole
equation by λ to get

    min_{β ∈ R^p} ( λ||β||_2^2 + Σ_{i=1}^n (1 - Y_i x_i^T β)_+ ).
This looks much more like what we've seen before, with λ||β||_2^2 being the
penalty term and Σ_{i=1}^n (1 - Y_i x_i^T β)_+ being the loss function.
The final modification is that we want to allow planes that don’t necessarily
pass through the origin. To do this, we allow ourselves to translate all the xi ’s
by a fixed vector δ ∈ Rp . This gives
    min_{β ∈ R^p, δ ∈ R^p} ( λ||β||_2^2 + Σ_{i=1}^n (1 - Y_i (x_i - δ)^T β)_+ ).
The representer theorem (or rather, a slight variant of it) tells us that the above
optimization problem is equivalent to the support vector machine
    (μ̂_λ, α̂_λ) = argmin_{(μ,α) ∈ R × R^n} { Σ_{i=1}^n (1 - Y_i(K_i^T α + μ))_+ + λ α^T Kα }.
We can kernelize logistic regression in the same way: we add a penalty term of
λ||β||_2^2, and then kernelize. The resulting optimization problem is then given by

    argmin_{f ∈ H} ( Σ_{i=1}^n log(1 + exp(-Y_i f(x_i))) + λ||f||_H^2 ).
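By the representer theorem, f = Σ_i α_i k(·, x_i), so this reduces to a finite-dimensional convex problem in α. A minimal sketch of my own (gradient descent with a Gaussian kernel; the data and step sizes are arbitrary):

```python
import numpy as np

def kernel_logistic(K, Y, lam, steps=2000, lr=0.1):
    """Minimise (1/n)[sum_i log(1 + exp(-Y_i (K alpha)_i)) + lam * alpha^T K alpha]
    by gradient descent. Y has entries in {-1, +1}."""
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(steps):
        m = Y * (K @ alpha)                          # margins Y_i f(x_i)
        grad = -K @ (Y / (1 + np.exp(m))) + 2 * lam * K @ alpha
        alpha -= lr * grad / n
    return alpha

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)       # separable labels
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)                                  # Gaussian kernel, bandwidth 1
alpha = kernel_logistic(K, Y, lam=0.1)
acc = np.mean(np.sign(K @ alpha) == Y)               # training accuracy
```

In practice one would use Newton-type methods, but the point is only that the objective depends on f through Kα alone.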
The resulting estimate is then

    θ̂ = (Φ^T Φ + λI)^{-1} Φ^T Y.
The computational cost of this is O((Lb)3 + n(Lb)2 ). The key thing of course is
that this is linear in n, and we can choose the size of L so that we have a good
trade-off between the computational cost and the accuracy of the approximation.
Now to predict at a new x, we form

    (1/√L) (φ̂_1(x), . . . , φ̂_L(x))^T θ̂,

and this costs O(Lb).
In 2007, Rahimi and Recht proposed a random map for shift-invariant kernels,
i.e. kernels k such that k(x, x0 ) = h(x − x0 ) for some h (we work with X = Rp ).
A common example is the Gaussian kernel.
One motivation of this comes from a classical result known as Bochner’s
theorem.
Theorem (Bochner’s theorem). Let k : Rp × Rp → R be a continuous kernel.
Then k is shift-invariant if and only if there exists some distribution F on Rp
and c > 0 such that if W ∼ F , then
    k(x, x') = c E cos((x - x')^T W).
The idea of random Fourier features is to find a random φ̂ with

    E φ̂(x) φ̂(x') = c E cos((x - x')^T W).

To this end, let u ∼ U[-π, π] independently of W, and consider

    E cos(x + u) cos(y + u) = E(cos x cos u - sin x sin u)(cos y cos u - sin y sin u).

Since u has the same distribution as -u, we see that E cos u sin u = 0.
Also, cos^2 u + sin^2 u = 1. Since u ranges uniformly in [-π, π], by symmetry,
we have E cos^2 u = E sin^2 u = 1/2. So we find that

    E cos(x + u) cos(y + u) = (1/2)(cos x cos y + sin x sin y) = (1/2) cos(x - y).
Example. Take

    k(x, x') = exp( -||x - x'||_2^2 / (2σ^2) ),

the Gaussian kernel. If W ∼ N(0, σ^{-2} I), then

    E e^{i t^T W} = e^{-||t||_2^2/(2σ^2)} = E cos(t^T W),

since the imaginary part vanishes by symmetry. So the Gaussian kernel is of
the above form with F = N(0, σ^{-2} I) and c = 1.
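Putting this together, here is a sketch (my own illustration) of the Rahimi–Recht random Fourier feature approximation for the Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(6)
p, sigma, L = 4, 1.5, 20000
x, y = rng.normal(size=p), rng.normal(size=p)

W = rng.normal(scale=1 / sigma, size=(L, p))         # W_j ~ N(0, sigma^{-2} I)
u = rng.uniform(-np.pi, np.pi, size=L)               # random phases

# phi(z)_j = sqrt(2/L) cos(w_j^T z + u_j); then phi(x)^T phi(y) ~ k(x, y).
phi = lambda z: np.sqrt(2 / L) * np.cos(W @ z + u)

approx = phi(x) @ phi(y)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
```

The approximation error decays like 1/√L, so with L = 20000 the two values agree to a couple of decimal places.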
3 The Lasso and beyond
A classical approach is forward selection: at each step, add the predictor that
improves the fit the most, and keep doing this until a fixed number of predictors
have been added. This is quite a nice approach, and is computationally quite fast
even if p is large. However, this method is greedy, and if we make a mistake at the
beginning, then the method blows up. In general, this method is rather unstable,
and is not great from a practical perspective.
The key difference is that we use an ℓ1 norm on β rather than the ℓ2 norm,

    ||β||_1 = Σ_{k=1}^p |β_k|.
This makes it drastically different from Ridge regression. We will see that for λ
large, it will make all of the entries of β exactly zero, as opposed to being very
close to zero.
We can compare this to best subset regression, where we replace ||β||_1 with
something of the form Σ_{k=1}^p 1_{β_k ≠ 0}. But the beautiful property of the Lasso
is that the optimization problem is now continuous, and in fact convex. This
allows us to solve it using standard convex optimization techniques.
Why is the `1 norm so different from the `2 norm? Just as in Ridge regression,
we may center and scale X, and center Y , so that we can remove µ from the
objective. Define

    Q_λ(β) = (1/(2n)) ||Y - Xβ||_2^2 + λ||β||_1.
Any minimizer β̂_λ^L of Q_λ(β) must also be a solution to

    minimize ||Y - Xβ||_2^2 subject to ||β||_1 ≤ ||β̂_λ^L||_1.

So imagine we are given the value of ||β̂_λ^L||_1, and we try to solve the above
optimization problem with pictures. The region ||β||_1 ≤ ||β̂_λ^L||_1 is a rotated
square.
On the other hand, the minimum of kY − Xβk22 is at β̂ OLS , and the contours
are ellipses centered around this point.
To solve the minimization problem, we should pick the smallest contour that hits
the square, and pick the intersection point to be our estimate. The point is
that since the unit ball in the ℓ1-norm has corners, the estimate is likely to lie
on a corner, and hence have a lot of zeroes. Compare this to the case of Ridge
regression, where the constraint set is a ball with no corners.
More generally, we can consider ℓq penalties for q > 0. Plotting their unit
spheres, we see that q = 1 is the smallest value of q for which there are corners,
and also the largest value of q for which the constraint set is still convex. Thus,
q = 1 is the sweet spot for doing regression.
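A standard way to minimize Q_λ in practice is coordinate descent with soft-thresholding. Here is a minimal sketch (my own code, not from the notes), assuming the columns of X are scaled to have ℓ2 norm √n:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator sgn(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, sweeps=200):
    """Coordinate descent on Q_lambda(beta) = ||Y - X beta||^2/(2n) + lam ||beta||_1.
    With ||X_j||_2^2 = n, each coordinate update is a soft-threshold."""
    n, p = X.shape
    beta = np.zeros(p)
    r = Y.copy()                       # current residual Y - X beta
    for _ in range(sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]     # remove j's contribution from residual
            beta[j] = soft(X[:, j] @ r / n, lam)
            r -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(7)
n, p = 100, 20
X = rng.normal(size=(n, p))
X = X / np.sqrt((X ** 2).sum(axis=0)) * np.sqrt(n)   # columns have l2-norm sqrt(n)
beta0 = np.zeros(p)
beta0[:3] = [3.0, -2.0, 1.5]                          # sparse truth
Y = X @ beta0 + 0.1 * rng.normal(size=n)
beta = lasso_cd(X, Y, lam=0.1)
```

Most of the estimated coefficients outside the true support come out exactly zero, which is the point of the corners of the ℓ1 ball.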
But is this actually good? Suppose the columns of X are scaled to have ℓ2
norm √n, and, after centering, we have a normal linear model

    Y = Xβ^0 + ε - ε̄1,
We’d like to bound kX T εk∞ , but it can be arbitrarily large since ε is a Gaussian.
However, with high probability, it is small. Precisely, define
    Ω = { (1/n) ||X^T ε||_∞ ≤ λ }.
In a later lemma, we will show that P(Ω) ≥ 1 - 2p^{-(A^2/2 - 1)}. Assuming Ω
holds, we have

    (1/(2n)) ||Xβ^0 - Xβ̂||_2^2 ≤ λ||β̂ - β^0||_1 - λ||β̂||_1 + λ||β^0||_1 ≤ 2λ||β^0||_1.
Note that EeαW is just the moment generating function of W , which we can
compute quite straightforwardly.
We can immediately apply this when W is a normal random variable, W ∼
N(0, σ^2). Then

    E e^{αW} = e^{α^2 σ^2/2}.

So we have

    P(W ≥ t) ≤ inf_{α>0} exp( α^2 σ^2/2 - αt ) = e^{-t^2/(2σ^2)}.
Observe that in fact this tail bound works for any random variable whose moment
2 2
generating function is bounded above by eα σ /2 .
Definition (Sub-Gaussian). A random variable W is sub-Gaussian with parameter
σ if

    E e^{α(W - EW)} ≤ e^{α^2 σ^2/2}

for all α ∈ R.
Corollary. Any mean-zero sub-Gaussian random variable W with parameter σ
satisfies

    P(W ≥ t) ≤ e^{-t^2/(2σ^2)}.
Recall that the sum of two independent Gaussians is still a Gaussian. This
continues to hold for sub-Gaussian random variables.
Proposition. Let (W_i)_{i=1}^n be independent mean-zero sub-Gaussian random
variables with parameters (σ_i)_{i=1}^n, and let γ ∈ R^n. Then γ^T W is sub-Gaussian
with parameter

    ( Σ_{i=1}^n (γ_i σ_i)^2 )^{1/2}.
Proof. We have

    E exp( α Σ_{i=1}^n γ_i W_i ) = Π_{i=1}^n E exp( α γ_i W_i )
                                 ≤ Π_{i=1}^n exp( (α^2/2) γ_i^2 σ_i^2 )
                                 = exp( (α^2/2) Σ_{i=1}^n σ_i^2 γ_i^2 ).
We can now prove our bound for the Lasso, which in fact works for any
sub-Gaussian random variable.
Lemma. Suppose (ε_i)_{i=1}^n are independent, mean-zero sub-Gaussian with
common parameter σ. Let

    λ = Aσ √( (log p)/n ).

Let X be a matrix whose columns all have ℓ2 norm √n. Then

    P( (1/n) ||X^T ε||_∞ ≤ λ ) ≥ 1 - 2p^{-(A^2/2 - 1)}.
Note that we have the factor of 2 since we need to consider the two cases
(1/n) X_j^T ε > λ and -(1/n) X_j^T ε > λ.
Plugging in our expression for λ, we write the bound as

    2p exp( -(A^2/2) log p ) = 2p^{1 - A^2/2}.
This is all we need for our result on the Lasso. We are going to go a bit
further into this topic of concentration inequalities, because we will need them
later when we impose conditions on the design matrix. In particular, we would
like to bound the tail probabilities of products.
Definition (Bernstein’s condition). We say that a random variable W satisfies
Bernstein’s condition with parameters (σ, b) where a, b > 0 if
1
E[|W − EW |k ] ≤ k! σ 2 bk−2
2
for k = 2, 3, . . ..
The point is that these bounds on the moments let us bound the moment
generating function of W.
Proposition (Bernstein's inequality). Let W_1, W_2, . . . , W_n be independent
random variables with EW_i = μ, and suppose each W_i satisfies Bernstein's
condition with parameters (σ, b). Then

    E e^{α(W_i - μ)} ≤ exp( (α^2 σ^2/2) / (1 - b|α|) )   for all |α| < 1/b,

    P( (1/n) Σ_{i=1}^n W_i - μ ≥ t ) ≤ exp( - n t^2 / (2(σ^2 + bt)) )   for all t ≥ 0.
Note that for large t, the bound goes as e^{-t} instead of e^{-t^2}.
Proof. For the first part, we fix i and write W = W_i. Let |α| < 1/b. Then

    E e^{α(W - μ)} = Σ_{k=0}^∞ (α^k/k!) E (W - μ)^k
                  ≤ 1 + (σ^2 α^2/2) Σ_{k=2}^∞ |α|^{k-2} b^{k-2}
                  = 1 + (σ^2 α^2/2) · 1/(1 - |α|b)
                  ≤ exp( (α^2 σ^2/2) / (1 - b|α|) ).
For the second part, we have

    E exp( (α/n) Σ_{i=1}^n (W_i - μ) ) = Π_{i=1}^n E exp( (α/n)(W_i - μ) )
                                       ≤ exp( n · ((α/n)^2 σ^2/2) / (1 - b(α/n)) ),

assuming α/n < 1/b. So it follows that

    P( (1/n) Σ_{i=1}^n W_i - μ ≥ t ) ≤ e^{-αt} exp( n · ((α/n)^2 σ^2/2) / (1 - b(α/n)) ).
Setting

    α/n = t/(bt + σ^2) ∈ (0, 1/b)

gives the result.
Lemma. Let W, Z be mean-zero sub-Gaussian random variables with parameters
σ_W and σ_Z respectively. Then WZ satisfies Bernstein's condition with
parameters (8σ_W σ_Z, 4σ_W σ_Z).
Proof. For any random variable Y (which we will later take to be WZ), for
k > 1, we know

    E|Y - EY|^k = 2^k E | (1/2)Y - (1/2)EY |^k
               ≤ 2^k E ( (1/2)|Y| + (1/2)|EY| )^k.

Note that

    ( (1/2)|Y| + (1/2)|EY| )^k ≤ (1/2)|Y|^k + (1/2)|EY|^k

by Jensen's inequality. Applying Jensen's inequality again, we have

    |EY|^k ≤ E|Y|^k.

Combining the two, we get

    E|Y - EY|^k ≤ 2^k E|Y|^k.
where again we have a factor of 2 to account for both signs. We perform yet
another substitution

    x = t^2/(2σ_W^2),   dx = (t/σ_W^2) dt.

Then we get

    E W^{2k} ≤ 2^{k+1} σ_W^{2k} k ∫_0^∞ x^{k-1} e^{-x} dx = 2^{k+1} k! σ_W^{2k}.
We are actually more interested in convex functions. We shall allow our functions
to take the value ∞, so let us define R̄ = R ∪ {∞}. The point is that if we want
our function to be defined on [a, b], then it is convenient to extend it to be
defined on all of R by setting the function to be ∞ outside of [a, b].
Definition (Convex function). A function f : R^d → R̄ is convex if

    f((1 - t)x + ty) ≤ (1 - t)f(x) + t f(y)

for all x, y ∈ R^d and t ∈ (0, 1). Moreover, we require that f(x) < ∞ for at
least one x.
We say it is strictly convex if the inequality is strict for all x ≠ y and t ∈ (0, 1).
Why is this helpful? Suppose we want to minimize f(x) subject to a constraint
g(x) = 0, and form the Lagrangian L(x, θ) = f(x) + θ^T g(x). If c^∗ is the
constrained minimum of f, then for any θ, we have

    min_x L(x, θ) ≤ min_{x : g(x) = 0} L(x, θ) = c^∗.

Thus, if we can find some θ^∗, x^∗ such that x^∗ minimizes L(x, θ^∗) and g(x^∗) = 0,
then this is indeed the optimal solution.
This gives us a method to solve the optimization problem: for each fixed θ,
solve the unconstrained optimization problem argmin_x L(x, θ). If we are doing
this analytically, then we would have a formula for x in terms of θ. We then
seek a θ such that g(x) = 0 holds.
Subgradients
Usually, when we have a function to optimize, we take its derivative and set it
to zero. This works well if our function is actually differentiable. However, the
`1 norm is not a differentiable function, since |x| is not differentiable at 0. This
is not some exotic case we may hope to avoid most of the time — when solving
the Lasso, we actively want our solutions to have zeroes, so we really want to
get to these non-differentiable points.
Thus, we seek some generalized notion of derivative that works on functions
that are not differentiable.
Definition (Subgradient). A vector v ∈ R^d is a subgradient of a convex function
f at x if f(y) ≥ f(x) + v^T(y - x) for all y ∈ R^d.
The set of subgradients of f at x is denoted ∂f(x), and is called the
subdifferential.
Proof. Both sides are equivalent to the requirement that f (y) ≥ f (x∗ ) for all
y.
We are interested in applying this to the Lasso. So we want to compute the
subdifferential of the ℓ1 norm. Let's first introduce some notation.
Notation. For x ∈ R^d and A ⊆ {1, . . . , d}, we write x_A for the sub-vector of x
formed by the components of x indexed by A. We write x_{−j} = x_{{j}^c} = x_{{1,...,d}\{j}}.
Similarly, we write x_{−jk} = x_{{j,k}^c}, etc.
We write
sgn(x_i) = −1 if x_i < 0,  1 if x_i > 0,  and 0 otherwise.
So the condition is that
|y_j| ≥ v^T(y − x)
for all y. We claim this holds iff v_j ∈ [−1, 1] and v_{−j} = 0. The ⇐ direction is an immediate
calculation. To show ⇒, we pick y_{−j} = v_{−j} + x_{−j} and y_j = 0. Then the condition gives
0 ≥ v_{−j}^T v_{−j},
so v_{−j} = 0. The condition then reduces to
|y_j| ≥ v_j y_j
for all y_j, which is true iff v_j ∈ [−1, 1]. Forming the set sum of the ∂g_j(x)
gives the result.
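The claimed subdifferential can be checked numerically. In the sketch below (my own illustration; the helper is_subgradient is not from the lectures), we test the defining inequality ‖y‖_1 ≥ ‖x‖_1 + v^T(y − x) on randomly drawn points y:

```python
import random

def is_subgradient(v, x, trials=1000):
    """Check the subgradient inequality ||y||_1 >= ||x||_1 + v^T (y - x)
    on randomly sampled test points y (a sampled necessary condition)."""
    random.seed(1)
    for _ in range(trials):
        y = [random.uniform(-5, 5) for _ in x]
        lhs = sum(abs(yi) for yi in y)
        rhs = sum(abs(xi) for xi in x) + sum(vi * (yi - xi) for vi, yi, xi in zip(v, y, x))
        if lhs < rhs - 1e-12:
            return False
    return True

x = [2.0, 0.0, -3.0]
# v_j = sgn(x_j) on nonzero coordinates, and any v_j in [-1, 1] at the zero:
valid = is_subgradient([1.0, 0.5, -1.0], x)
# taking v_j = 2 at the zero coordinate violates the inequality:
invalid = is_subgradient([1.0, 2.0, -1.0], x)
```

Random sampling can only find counterexamples, not certify membership, but it illustrates exactly which vectors the ℓ1 subdifferential contains at a point with a zero coordinate.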
where ρ̂^L_λ = {j : β̂^L_{λ,j} ≠ 0}.
Proof. Fix λ > 0 and stop writing it. Suppose β̂ (1) and β̂ (2) are two Lasso
solutions at λ. Then
Q(β̂ (1) ) = Q(β̂ (2) ) = c∗ .
As Q is convex, we know
c* ≤ Q( (1/2) β̂^{(1)} + (1/2) β̂^{(2)} ) ≤ (1/2) Q(β̂^{(1)}) + (1/2) Q(β̂^{(2)}) = c*.
Moreover, since ‖·‖_2^2 is strictly convex, we can have equality only if X β̂^{(1)} =
X β̂^{(2)}. So we are done.
Definition (Equicorrelation set). Define the equicorrelation set Êλ to be the
set of k such that
(1/n) |X_k^T (Y − X β̂_λ^L)| = λ,
or equivalently, the k with ν̂_k = ±1, which is well-defined since it depends only
on the fitted values.
By the KKT conditions, Êλ contains the set of non-zeroes of any Lasso solution,
but may be strictly bigger than it.
Note that if rk(X_{Êλ}) = |Êλ|, then the Lasso solution must be unique: any two
solutions β̂^{(1)}, β̂^{(2)} are supported on Êλ and have the same fitted values
X β̂^{(1)} = X β̂^{(2)}, and since the columns of X_{Êλ} are linearly independent,
β̂^{(1)} = β̂^{(2)}.
Y = Xβ^0.
S = {k : β_k^0 ≠ 0}.
We can understand this a bit as follows — the kth entry of this is the dot product
of sgn(β_S^0) with (X_S^T X_S)^{-1} X_S^T X_k. The latter is the coefficient vector we would obtain
if we tried to regress X_k on X_S. If this is large, then X_k looks correlated
with X_S, and so it would be difficult to determine whether k is part of
S or not.
Theorem.
for all k ∈ S, then there exists a Lasso solution β̂λL with sgn(β̂λL ) = sgn(β 0 ).
(ii) If there exists a Lasso solution with sgn(β̂λL ) = sgn(β 0 ), then k∆k∞ ≤ 1.
      [ X_S^T X_S   X_S^T X_N ] [ β_S^0 − β̂_S ]       [ ν̂_S ]
(1/n) [                       ] [             ]  =  λ [      ]
      [ X_N^T X_S   X_N^T X_N ] [    −β̂_N    ]       [ ν̂_N ]
Call the top and bottom equations (1) and (2) respectively.
By the KKT conditions, we know ‖ν̂_N‖_∞ ≤ 1, and the LHS is exactly λ∆.
To prove (i), we need to exhibit a β̂ that agrees in sign with β^0 and satisfies
equations (1) and (2). In particular, β̂_N = 0. So we try
(β̂_S, ν̂_S) = ( β_S^0 − λ ((1/n) X_S^T X_S)^{-1} sgn(β_S^0),  sgn(β_S^0) ),
(β̂_N, ν̂_N) = (0, ∆).
sgn(β̂S ) = sgn(βS0 ),
Y = Xβ 0 + ε − ε̄1,
where the εi are independent and have common sub-Gaussian parameter σ. Let
S, s, N be as before.
As before, the Lasso doesn’t always behave well, and whether or not it does
is controlled by the compatibility factor.
Definition (Compatibility factor). Define the compatibility factor to be
φ^2 = inf_{β ∈ R^p : β_S ≠ 0, ‖β_N‖_1 ≤ 3‖β_S‖_1}  ( (1/n)‖Xβ‖_2^2 ) / ( ‖β_S‖_1^2 / s )
    = inf_{β ∈ R^p : ‖β_S‖_1 = 1, ‖β_N‖_1 ≤ 3}  (s/n) ‖X_S β_S − X_N β_N‖_2^2.
Note that we are free to use a minus sign inside the norm since the problem
is symmetric in βN ↔ −βN .
In some sense, this φ measures how well we can approximate XS βS just with
the noise variables.
and so
φ^2 ≥ inf_{β ≠ 0} ( (1/n)‖Xβ‖_2^2 ) / ‖β‖_2^2 = c_min,
the minimum eigenvalue of (1/n)X^T X.
Of course, in the high-dimensional setting we don't expect this minimum eigenvalue
to be positive, but the infimum in the definition of φ^2 is restricted, so we can
still hope to have a positive value of φ^2.
Theorem. Assume φ^2 > 0, and let β̂ be the Lasso solution with
λ = Aσ √((log p)/n).
Then with probability at least 1 − 2p^{−(A^2/8 − 1)}, we have
We can move 2λ‖β̂‖_1 to the other side, and applying the triangle inequality, we
have
(1/n) ‖X(β̂ − β^0)‖_2^2 ≤ 3λ ‖β̂ − β^0‖_1.
If we manage to bound the RHS from above as well, so that
3λ ‖β̂ − β^0‖_1 ≤ cλ · (1/√n) ‖X(β̂ − β^0)‖_2
3λ ‖β̂ − β^0‖_1 ≤ c^2 λ^2.
Simplifying, we obtain
(1/n) ‖X(β̂ − β^0)‖_2^2 + λ ‖β̂ − β^0‖_1 ≤ 16λ^2 s / φ^2.
If we want to be impressed by this result, then we should make sure that
the compatibility condition is a reasonable assumption to make on the design
matrix.
φ^2_Σ(S) = inf_{β : ‖β_N‖_1 ≤ 3‖β_S‖_1, β_S ≠ 0}  (β^T Σ β) / ( ‖β_S‖_1^2 / |S| ).
Suppose
max_{j,k} |Θ_jk − Σ_jk| ≤ φ^2_Θ(S) / (32|S|).
Then
φ^2_Σ(S) ≥ (1/2) φ^2_Θ(S).
Proof. We suppress the dependence on S for notational convenience. Let s = |S|
and
t = φ^2_Θ(S) / (32s).
We have
|β^T (Σ − Θ) β| ≤ ‖β‖_1 ‖(Σ − Θ)β‖_∞ ≤ t ‖β‖_1^2,
where we applied Hölder twice.
If ‖β_N‖_1 ≤ 3‖β_S‖_1, then we have
‖β‖_1 ≤ 4‖β_S‖_1 ≤ 4 √(β^T Θ β) / (φ_Θ / √s).
Thus, we have
β^T Θ β − (φ^2_Θ/(32s)) · (16 β^T Θ β)/(φ^2_Θ/s) = (1/2) β^T Θ β ≤ β^T Σ β.
Define
φ^2_{Σ,s} = min_{S : |S| = s} φ^2_Σ(S).
Note that if
max_{j,k} |Θ_jk − Σ_jk| ≤ φ^2_{Θ,s} / (32s),
then
φ^2_Σ(S) ≥ (1/2) φ^2_Θ(S)
for all S with |S| = s. In particular,
φ^2_{Σ,s} ≥ (1/2) φ^2_{Θ,s}.
Theorem. Suppose the rows of X are iid and each entry is sub-Gaussian with
parameter v. Suppose s √((log p)/n) → 0 as n → ∞, and φ^2_{Σ^0,s} is bounded away
from 0, where Σ^0 = E Σ̂. Then we have
P( φ^2_{Σ̂,s} ≥ (1/2) φ^2_{Σ^0,s} ) → 1  as n → ∞.
Proof. It is enough to show that
P( max_{j,k} |Σ̂_jk − Σ^0_jk| ≥ φ^2_{Σ^0,s} / (32s) ) → 0
as n → ∞.
Let t = φ^2_{Σ^0,s} / (32s). Then
P( max_{j,k} |Σ̂_jk − Σ^0_jk| ≥ t ) ≤ Σ_{j,k} P( |Σ̂_jk − Σ^0_jk| ≥ t ).
Recall that
Σ̂_jk = (1/n) Σ_{i=1}^n X_ij X_ik.
So we can apply Bernstein's inequality to bound
P( |Σ̂_jk − Σ^0_jk| ≥ t ) ≤ 2 exp( − n t^2 / ( 2(64v^4 + 4v^2 t) ) ),
since σ = 8v^2 and b = 4v^2. So we can bound
P( max_{j,k} |Σ̂_jk − Σ^0_jk| ≥ t ) ≤ 2p^2 exp( −cn/s^2 ) = 2 exp( −(n/s^2)( c − (2s^2 log p)/n ) ) → 0
for some constant c.
Corollary. Suppose the rows of X are iid mean-zero multivariate Gaussian
with variance Σ^0. Suppose Σ^0 has minimum eigenvalue bounded from below by
c_min > 0, and suppose the diagonal entries of Σ^0 are bounded from above. If
s √((log p)/n) → 0, then
P( φ^2_{Σ̂,s} ≥ (1/2) c_min ) → 1  as n → ∞.
where g is convex and differentiable, and each h_j is convex but not necessarily
differentiable. In the case of the Lasso, g is the least squares term, and
the h_j make up the ℓ1 term.
There are two things we can do to make this faster. We typically solve the
Lasso on a grid of λ values λ_0 > λ_1 > · · · > λ_L, and then pick the appropriate
λ by v-fold cross-validation. In this case, we can start solving at λ_0, and then
for each i > 0, we solve the λ = λ_i problem by picking x^{(0)} to be the optimal
solution to the λ_{i−1} problem. In fact, even if we already have a fixed λ value we
want to use, it is often advantageous to first solve the Lasso with a larger λ value,
and then use that as a warm start to get to our desired λ value.
Another strategy is an active set strategy. If p is large, then the coordinate
descent loop may take a very long time. Since we know the Lasso should set a lot of
coefficients to zero, for ℓ = 1, . . . , L, we set
A = {k : β̂^L_{λ_{ℓ−1},k} ≠ 0}.
We then perform coordinate descent only on coordinates in A to obtain a candidate
solution β̂ with β̂_{A^c} = 0. This may not be the actual Lasso solution. To check
this, we use the KKT conditions. We set
V = { k ∈ A^c : (1/n) |X_k^T (Y − X β̂)| > λ_ℓ }.
If V = ∅, then we are done. Otherwise, we add V to our active set A, and then
run coordinate descent again on this active set.
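The inner loop that these strategies accelerate is plain coordinate descent with soft-thresholding. Below is a minimal pure-Python sketch of my own (a serious implementation would use compiled code, proper convergence checks and centering); the `beta` argument allows a warm start from a previous, larger λ:

```python
def soft_threshold(z, lam):
    # sgn(z) * max(|z| - lam, 0): the coordinate update dictated by the
    # subgradient (KKT) conditions for the l1 penalty.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, beta=None, n_iter=200):
    """Coordinate descent for (1/(2n))||y - X beta||_2^2 + lam * ||beta||_1.
    Pass `beta` from a previous (larger) lam as a warm start."""
    n, p = len(X), len(X[0])
    beta = list(beta) if beta is not None else [0.0] * p
    resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
    for _ in range(n_iter):
        for j in range(p):
            # rho = (1/n) X_j^T (partial residual with coordinate j put back)
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            norm_j = sum(X[i][j] ** 2 for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / norm_j
            if new_bj != beta[j]:
                for i in range(n):
                    resid[i] += X[i][j] * (beta[j] - new_bj)
                beta[j] = new_bj
    return beta

# Warm starts down a grid lam_0 > lam_1 > ... (toy data for illustration):
X = [[1, 0], [-1, 0], [0, 1], [0, -1]]
y = [2, -2, 0.1, -0.1]
beta = None
for lam in [1.0, 0.5]:
    beta = lasso_cd(X, y, lam, beta=beta)
```

On this toy orthogonal design one can verify the KKT conditions by hand: at λ = 0.5 the solution keeps the first coordinate and sets the second to zero.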
For example, if
Y_i = µ_i + ε_i,
and we believe that (µ_i)_{i=1}^n form a piecewise constant sequence, we can estimate
µ^0 by
argmin_µ { ‖Y − µ‖_2^2 + λ Σ_{i=1}^{n−1} |µ_{i+1} − µ_i| }.
Example (Elastic net). We can use
β̂^{EN}_{λ,α} = argmin_β { (1/(2n)) ‖Y − Xβ‖_2^2 + λ( α‖β‖_1 + (1 − α)‖β‖_2^2 ) }
for α ∈ [0, 1]. The ℓ2 penalty encourages highly positively
correlated variables to have similar estimated coefficients.
For example, if we have duplicate columns, then the ℓ1 penalty alone encourages
us to take one of the coefficients to be 0, while the ℓ2 penalty encourages the
coefficients to be the same.
Another class of variations try to reduce the bias of the Lasso. Although
the bias of the Lasso is a necessary by-product of reducing the variance of the
estimate, it is sometimes desirable to reduce this bias.
The LARS-OLS hybrid takes the Ŝλ obtained by the Lasso, and then re-estimates
β^0_{Ŝλ} by OLS. We can also re-estimate using the Lasso on X_{Ŝλ}, and this
is known as the relaxed Lasso.
In the adaptive Lasso, we obtain an initial estimate β̂ of β^0, e.g. with the Lasso,
and then solve
β̂^{adapt}_λ = argmin_{β : β_{Ŝ^c} = 0} { (1/(2n)) ‖Y − Xβ‖_2^2 + λ Σ_{k∈Ŝ} |β_k| / |β̂_k| },
where Ŝ is the support of the initial estimate β̂.
4 Graphical modelling
4.1 Conditional independence graphs
So far, we have been looking at prediction problems. But sometimes we may
want to know more than that. For example, there is a positive correlation
between the wine budget of a college, and the percentage of students getting
firsts. This information allows us to make predictions, in the sense that if we
happen to know the wine budget of a college, but forgot the percentage of
students getting firsts, then we can make a reasonable prediction of the latter
based on the former. However, this does not suggest any causal relation between
the two — increasing the wine budget is probably not a good way to increase
the percentage of students getting firsts!
Of course, it is unlikely that we can actually figure out causation just by
looking at the data. However, there are some things we can try to answer. If
we gather more data about the colleges, then we would probably find that the
colleges that have larger wine budget and more students getting firsts are also
the colleges with larger endowment and longer history. If we condition on all
these other variables, then there is not much correlation left between the wine
budget and the percentage of students getting firsts. This is what we are trying
to capture in conditional independence graphs.
We first introduce some basic graph terminology. For the purpose of condi-
tional independence graphs, we only need undirected graphs. But later, we need
the notion of directed graphs as well, so our definitions will be general enough to
include those.
Definition (Graph). A graph is a pair G = (V, E), where V is a set and
E ⊆ V × V is such that (v, v) ∉ E for all v ∈ V.
Definition (Edge). We say there is an edge between j and k and that j and k
are adjacent if (j, k) ∈ E or (k, j) ∈ E.
Definition (Undirected edge). An edge (j, k) is undirected if also (k, j) ∈ E.
Otherwise, it is directed and we write j → k to represent it. We also say that j
is a parent of k, and write pa(k) for the set of all parents of k.
Definition ((Un)directed graph). A graph is (un)directed if all its edges are
(un)directed.
Definition (Skeleton). The skeleton of G is a copy of G with every edge replaced
by an undirected edge.
Definition (Subgraph). A graph G_1 = (V_1, E_1) is a subgraph of G = (V, E) if
V_1 ⊆ V and E_1 ⊆ E. A proper subgraph is one where either of the inclusions is
a proper inclusion.
As discussed, we want a graph that encodes the conditional dependence of
different variables. We first define what this means. In this section, we only
work with undirected graphs.
Definition (Conditional independence). Let X, Y, Z be random vectors with
joint density f_{XYZ}. We say that X is conditionally independent of Y given Z,
written X ⊥⊥ Y | Z, if
f_{XY|Z}(x, y | z) = f_{X|Z}(x | z) f_{Y|Z}(y | z).
Equivalently,
fX|Y Z (x | y, z) = fX|Z (x | z)
for all y.
We shall ignore all the technical difficulties, and take as an assumption that
all these conditional densities exist.
Since we are going to talk about conditional distributions a lot, the following
calculation will be extremely useful.
Proposition. Suppose Z ∼ N_p(µ, Σ) and Σ is positive definite. Then for any partition of the indices into sets A and B,
Z_A | Z_B = z_B ∼ N( µ_A + Σ_{A,B} Σ_{B,B}^{-1} (z_B − µ_B),  Σ_{A,A} − Σ_{A,B} Σ_{B,B}^{-1} Σ_{B,A} ).
Proof. Of course, we can just compute this directly, maybe using moment
generating functions. But for pleasantness, we adopt a different approach. Note
that for any M , we have
ZA = M ZB + (ZA − M ZB ).
We shall pick M such that ZA − M ZB is independent of ZB , i.e. such that
0 = cov(ZB , ZA − M ZB ) = ΣB,A − ΣB,B M T .
So we should take
M = (Σ_{B,B}^{-1} Σ_{B,A})^T = Σ_{A,B} Σ_{B,B}^{-1}.
We already know that ZA − M ZB is Gaussian, so to understand it, we only need
to know its mean and variance. We have
E[Z_A − M Z_B] = µ_A − M µ_B = µ_A − Σ_{A,B} Σ_{B,B}^{-1} µ_B,
var(Z_A − M Z_B) = Σ_{A,A} − 2 Σ_{A,B} Σ_{B,B}^{-1} Σ_{B,A} + Σ_{A,B} Σ_{B,B}^{-1} Σ_{B,B} Σ_{B,B}^{-1} Σ_{B,A}
                 = Σ_{A,A} − Σ_{A,B} Σ_{B,B}^{-1} Σ_{B,A}.
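A quick numerical sanity check of this computation (my own illustration, with an arbitrary 2 × 2 covariance): taking Σ_{A,A} = 2, Σ_{A,B} = Σ_{B,A} = 1, Σ_{B,B} = 1 gives M = 1 and conditional variance 2 − 1 = 1, and the residual Z_A − M Z_B should be uncorrelated with Z_B:

```python
import random

# Sigma_{A,A} = 2, Sigma_{A,B} = 1, Sigma_{B,B} = 1 (an arbitrary example).
M = 1.0 / 1.0                # Sigma_{A,B} Sigma_{B,B}^{-1}
cond_var = 2.0 - 1.0 * M     # Sigma_{A,A} - Sigma_{A,B} Sigma_{B,B}^{-1} Sigma_{B,A}

random.seed(0)
pairs = []
for _ in range(100000):
    zb = random.gauss(0, 1)        # Z_B ~ N(0, 1)
    za = zb + random.gauss(0, 1)   # gives var(Z_A) = 2, cov(Z_A, Z_B) = 1
    pairs.append((za, zb))

n = len(pairs)
resid = [za - M * zb for za, zb in pairs]
mean_r = sum(resid) / n
mean_b = sum(zb for _, zb in pairs) / n
# Empirically: cov(Z_A - M Z_B, Z_B) should be near 0, var near cond_var = 1.
emp_cov = sum((r - mean_r) * (zb - mean_b) for r, (_, zb) in zip(resid, pairs)) / n
emp_var = sum((r - mean_r) ** 2 for r in resid) / n
```

This is exactly the decorrelation property used in the proof: Z_A − M Z_B is independent of Z_B for the chosen M.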
Neighbourhood selection
We are going to specialize to A = {k} and B = {1, . . . , p} \ {k}. Then we can
separate out the “mean” part and write
Z_k = M_k + Z_{−k}^T Σ_{−k,−k}^{-1} Σ_{−k,k} + ε_k,
where
M_k = µ_k − µ_{−k}^T Σ_{−k,−k}^{-1} Σ_{−k,k},
Proof. If the jth component of Σ_{−k,−k}^{-1} Σ_{−k,k} is 0, then the distribution of
Z_k | Z_{−k} will not depend on (Z_{−k})_j = Z_{j′} (here j′ is either j or j + 1, depending
on where k is). So we know
Z_k | Z_{−k} =^d Z_k | Z_{−kj′}.
This is the same as saying Z_k ⊥⊥ Z_{j′} | Z_{−kj′}.
Neighbourhood selection exploits this fact. Given x_1, . . . , x_n which are iid
∼ Z and
X = (x_1, · · · , x_n)^T,
we can estimate Σ_{−k,−k}^{-1} Σ_{−k,k} by regressing X_k on X_{−k} using the Lasso (with
an intercept term). We then obtain selected sets Ŝ_k. There are two ways of
estimating the CIG based on these:
S = P − QR^{-1}Q^T.
(ii)
         [ S^{-1}               −S^{-1} Q R^{-1}                     ]
M^{-1} = [                                                           ].
         [ −R^{-1} Q^T S^{-1}    R^{-1} + R^{-1} Q^T S^{-1} Q R^{-1} ]
This tells us Z_k ⊥⊥ Z_j | Z_{−kj} iff Ω_jk = 0. Thus, we can approximate the conditional
independence graph by computing the precision matrix Ω.
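For intuition (my own illustration), take the chain Z_1 → Z_2 → Z_3 with Z_1 ∼ N(0, 1), Z_2 = Z_1 + noise and Z_3 = Z_2 + noise, so that Σ = [[1,1,1],[1,2,2],[1,2,3]]. Inverting Σ shows Ω_13 = 0, matching Z_1 ⊥⊥ Z_3 | Z_2, while Ω_12 and Ω_23 are nonzero:

```python
def inverse_3x3(S):
    """Invert a 3x3 matrix via the adjugate (cofactor) formula."""
    def det2(a, b, c, d):
        return a * d - b * c
    det = (S[0][0] * det2(S[1][1], S[1][2], S[2][1], S[2][2])
           - S[0][1] * det2(S[1][0], S[1][2], S[2][0], S[2][2])
           + S[0][2] * det2(S[1][0], S[1][1], S[2][0], S[2][1]))
    cof = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            rows = [r for r in range(3) if r != i]
            cols = [c for c in range(3) if c != j]
            minor = det2(S[rows[0]][cols[0]], S[rows[0]][cols[1]],
                         S[rows[1]][cols[0]], S[rows[1]][cols[1]])
            cof[i][j] = (-1) ** (i + j) * minor
    # inverse = adjugate / det, where the adjugate is the cofactor transpose
    return [[cof[j][i] / det for j in range(3)] for i in range(3)]

Sigma = [[1.0, 1.0, 1.0],
         [1.0, 2.0, 2.0],
         [1.0, 2.0, 3.0]]
Omega = inverse_3x3(Sigma)  # equals [[2,-1,0],[-1,2,-1],[0,-1,1]]
```

The zero pattern of Ω is exactly the missing edge 1–3 of the chain's conditional independence graph.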
Our method to estimate the precision matrix is similar to the Lasso. Recall
that the density of N_p(µ, Σ) is
P(z) = 1/( (2π)^{p/2} (det Σ)^{1/2} ) · exp( −(1/2) (z − µ)^T Σ^{-1} (z − µ) ).
Then
Σ_i (x_i − µ)^T Ω (x_i − µ) = Σ_i (x_i − X̄ + X̄ − µ)^T Ω (x_i − X̄ + X̄ − µ)
                            = Σ_i (x_i − X̄)^T Ω (x_i − X̄) + n (X̄ − µ)^T Ω (X̄ − µ).
We have
Σ_i (x_i − X̄)^T Ω (x_i − X̄) = tr( Σ_i (x_i − X̄)(x_i − X̄)^T Ω ) = n tr(SΩ).
So we now have
ℓ(µ, Ω) = −(n/2) ( tr(SΩ) − log det Ω + (X̄ − µ)^T Ω (X̄ − µ) ).
We are really interested in estimating Ω. So we should try to maximize this over
µ, but that is easy, since this is the same as minimizing (X̄ − µ)T Ω(X̄ − µ), and
we know Ω is positive-definite. So we should set µ = X̄. Thus, we have
ℓ(Ω) = max_{µ ∈ R^p} ℓ(µ, Ω) = −(n/2) ( tr(SΩ) − log det Ω ).
So we can solve for the MLE of Ω by solving
min_{Ω : Ω ≻ 0} ( − log det Ω + tr(SΩ) ).
One can show that this is convex, and to find the MLE, we can just differentiate:
(∂/∂Ω_jk) log det Ω = (Ω^{-1})_jk,    (∂/∂Ω_jk) tr(SΩ) = S_jk,
using that S and Ω are symmetric. So provided that S is positive definite, the
maximum likelihood estimate is just
Ω̂ = S^{-1}.
But we are interested in the high-dimensional situation, where we have many more
variables than observations, and then S cannot be positive definite. To solve this,
we use the graphical Lasso.
The graphical Lasso solves
argmin_{Ω : Ω ≻ 0} ( − log det Ω + tr(SΩ) + λ‖Ω‖_1 ),
where
‖Ω‖_1 = Σ_{j,k} |Ω_jk|.
Often, people don’t sum over the diagonal elements, as we want to know if off-
diagonal elements ought to be zero. Similar to the case of the Lasso, this gives a
sparse estimate of Ω from which we may estimate the conditional independence
graph.
Z_k = h_k(Z_{pa(k)}, ε_k),
Z_3 = ε_3 ∼ Bernoulli(0.25)
Z_2 = 1_{ε_2 (1 + Z_3) > 1/2},    ε_2 ∼ U[0, 1]
Z_1 = 1_{ε_1 (Z_2 + Z_3) > 1/2},    ε_1 ∼ U[0, 1].
[Figure: the DAG with edges Z_3 → Z_2, Z_2 → Z_1 and Z_3 → Z_1]
Note that the structural equation model for Z determines its distribution,
but the converse is not true. For example, the following two distinct structural
equation models give rise to the same distribution:
Z_1 = ε, Z_2 = Z_1        and        Z_1 = Z_2, Z_2 = ε.
Indeed, if we have two variables that are just always the same, it is hard to tell
which is the cause and which is the effect.
It would be convenient if we could order our variables in a way that Zk
depends only on Zj for j < k. This is known as a topological ordering:
Definition (Descendant). We say k is a descendant of j if there is a directed
path from j to k. The set of descendants of j will be denoted de(j).
Definition (Topological ordering). Given a DAG G with V = {1, . . . , p} we say
that a permutation π : V → V is a topological ordering if π(j) < π(k) whenever
k ∈ de(j).
Thus, given a topological ordering π, we can write Z_k as a function of
ε_{π^{-1}(1)}, . . . , ε_{π^{-1}(π(k))}.
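A topological ordering can be computed by repeatedly removing vertices with no remaining parents (Kahn's algorithm). A sketch of my own (the dictionary-of-parent-sets representation is an assumption, not from the lectures):

```python
from collections import deque

def topological_order(parents, p):
    """Kahn's algorithm: returns a list ordering the vertices 1..p so that
    every vertex appears after all of its parents; `parents[k]` is pa(k)."""
    children = {j: [] for j in range(1, p + 1)}
    indeg = {k: len(parents.get(k, set())) for k in range(1, p + 1)}
    for k, pa in parents.items():
        for j in pa:
            children[j].append(k)
    queue = deque(k for k in range(1, p + 1) if indeg[k] == 0)
    order = []
    while queue:
        j = queue.popleft()
        order.append(j)
        for k in children[j]:
            indeg[k] -= 1
            if indeg[k] == 0:
                queue.append(k)
    return order  # has length p iff the graph is acyclic

# The example DAG above: pa(1) = {2, 3}, pa(2) = {3}, pa(3) = {}.
order = topological_order({1: {2, 3}, 2: {3}, 3: set()}, 3)
```

For the example SEM this returns the ordering Z_3, Z_2, Z_1, which is exactly the order in which one would simulate the model.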
How do we understand structural equation models? They give us information
that is not encoded in the distribution itself. One way to think about them
is via interventions. We can modify a structural equation model by replacing
the equation for Z_k by setting, e.g. Z_k = a. In real life, this may correspond
to forcing all students to go to catch-up workshops. This is called a perfect
intervention. The modified SEM gives a new joint distribution for Z. Expecta-
tions or probabilities with respect to the new distribution are written by adding
“do(Z_k = a)”. For example, after the intervention do(Z_2 = 1), the SEM above becomes
Z_3 = ε_3 ∼ Bernoulli(0.25)
Z_2 = 1
Z_1 = 1_{ε_1 (1 + Z_3) > 1/2},    ε_1 ∼ U[0, 1].
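We can simulate both the original SEM and its intervened version to see the effect of do(Z_2 = 1) on P(Z_1 = 1) (my own illustration; the exact values 23/64 and 9/16 follow from the model by direct calculation):

```python
import random

def sample(do_z2=None):
    """One draw from the SEM above; do_z2 = 1 replaces the equation for Z2
    (a perfect intervention)."""
    z3 = 1 if random.random() < 0.25 else 0           # Z3 ~ Bernoulli(0.25)
    if do_z2 is None:
        z2 = 1 if random.random() * (1 + z3) > 0.5 else 0
    else:
        z2 = do_z2
    z1 = 1 if random.random() * (z2 + z3) > 0.5 else 0
    return z1

random.seed(0)
n = 100000
p_obs = sum(sample() for _ in range(n)) / n        # P(Z1 = 1), exactly 23/64
p_do = sum(sample(do_z2=1) for _ in range(n)) / n  # P(Z1 = 1 | do(Z2 = 1)), exactly 9/16
```

The intervention raises the probability that Z_1 = 1, which is information about the causal mechanism that the observational distribution alone does not determine.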
[Figures: example graphs on vertices a, b, s (and p, q) in which paths between a and b pass through s]
Here we expect:
(i) Z_a and Z_b are not independent.
(ii) Z_a and Z_b are independent given Z_s.
To see (i), we observe that if we know about Z_a, then this allows us to predict
some information about Z_s, which would in turn let us say something about Z_b.
[Figure: a path segment j_{ℓ−1} → j_ℓ ← j_{ℓ+1}]
or there is some j_ℓ such that the path is of this form, but neither j_ℓ nor any of
its descendants are in S.
(ii) the global Markov property if for all disjoint A, B, S such that A and B are
d-separated by S, we have Z_A ⊥⊥ Z_B | Z_S.
Z_k ⊥⊥ Z_{{1,...,p} \ ({k} ∪ pa(k))} | Z_{pa(k)}.
Thus,
f (zk | z1 , . . . , zk−1 ) = f (zk | zpa(k) ).
Causal minimality
If P is generated by an SEM with DAG G, then from the above, we know that
P is Markov with respect to G. The converse is also true: if P is Markov with
respect to a DAG G, then there exists a SEM with DAG G that generates P .
This immediately implies that P will be Markov with respect to many DAGs.
For example, a DAG whose skeleton is complete will always work. This suggests
the following definition:
Definition (Causal minimality). A distribution P satisfies causal minimality
with respect to G if it is Markov with respect to G, but not with respect to any
proper subgraph of G.
Faithfulness
To describe the final issue, consider the SEM
[Figures: DAGs on Z_1, Z_2, Z_3]
This gives us the Markov equivalence class, and we may orient the other edges
using other properties like acyclicity.
If we want to apply this to data sets, then we need to use conditional
independence tests instead of querying our oracle to figure out whether variables are
conditionally independent. However, errors in the algorithm propagate, and the
whole process may be derailed by early errors. Moreover, the result of the
algorithm may depend on how we iterate through the nodes. People have tried
many ways to fix these problems, but in general, this method is rather unstable.
Yet, if we have large data sets, this can produce quite decent results.
5 High-dimensional inference
5.1 Multiple testing
Finally, we talk about high-dimensional inference. Suppose we have come up
with a large number of potential drugs, and want to see if they are effective in
killing bacteria. Naively, we might try to run a hypothesis test on each of them,
using a p < 0.05 threshold. But this is a terrible idea: each test has a 0.05
chance of giving a false positive, so even if all the drugs are actually useless, we
would still expect about 5% of the tests to reject, and we would incorrectly
conclude that many of the drugs are useful.
In general, suppose we have some null hypotheses H_1, . . . , H_m. By definition,
a p-value p_i for H_i is a random variable such that
P_{H_i}(p_i ≤ α) ≤ α
for all α ∈ [0, 1].
[Figure: the closed testing lattice of intersection hypotheses, from H_{1234} down to H_1, H_2, H_3, H_4]
In this case, we reject H1 but not H2 by closed testing. While H23 is rejected,
we cannot tell if it is H2 or H3 that should be rejected.
This might seem like a very difficult procedure to analyze, but it turns out it
is extremely simple.
Theorem. Closed testing makes no false rejections with probability ≥ 1 − α.
In particular, FWER ≤ α.
Proof. In order for there to be a false rejection, we must have falsely rejected
H_{I_0} with the local test, which has probability at most α.
But of course this doesn’t immediately give us an algorithm we can apply to
data. Different choices for the local test give rise to different multiple testing
procedures. One famous example is Holm's procedure. This takes φ_I to be the
Bonferroni test, where φ_I = 1 iff p_i ≤ α/|I| for some i ∈ I.
When m is large, we don't want to compute all φ_I, since there are 2^m − 1
non-empty sets I to consider. So we might want to find a shortcut. With a moment of
thought, we see that Holm's procedure amounts to the following:
– Let (i) be the index of the ith smallest p-value, so p_(1) ≤ p_(2) ≤ · · · ≤ p_(m).
– Step 1: If p_(1) ≤ α/m, then reject H_(1) and go to step 2. Otherwise, accept
all null hypotheses.
– Step i: If p_(i) ≤ α/(m − i + 1), then reject H_(i) and go to step i + 1. Otherwise,
accept H_(i), H_(i+1), . . . , H_(m).
– Step m: If p_(m) ≤ α, then reject H_(m). Otherwise, accept H_(m).
The interesting thing about this is that it has the same bound on FWER as the
Bonferroni correction, but its rejection thresholds are more lenient.
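The step-down description translates directly into code (a sketch of my own):

```python
def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: returns the set of rejected indices
    (0-based). Controls FWER at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = set()
    for step, i in enumerate(order):               # step i uses threshold alpha/(m - i + 1)
        if p_values[i] <= alpha / (m - step):
            rejected.add(i)
        else:
            break                                   # accept this and all larger p-values
    return rejected

# With alpha = 0.05 and four p-values, Holm rejects the two smallest here,
# while plain Bonferroni (fixed threshold 0.0125) would reject only one.
rej = holm([0.014, 0.04, 0.03, 0.005])
```

The example shows the leniency gained: once the smallest p-value is rejected, the next is compared against α/(m − 1) rather than α/m.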
But if m is very large, then the criterion for rejecting even H_(1) is still quite strict.
The problem is that controlling FWER is a very strong condition. Instead of
controlling the probability that there is one false rejection, when m is large, it
might be more reasonable to control the proportion of false discoveries. Many
modern multiple testing procedures aim to control the false discovery rate
FDR = E[ N / max(R, 1) ],
where R is the total number of rejections and N is the number of false rejections.
The funny maximum in the denominator is just to avoid division by zero. When
R = 0, then we must have N = 0 as well, so what is put in the denominator
doesn’t really matter.
The Benjamini–Hochberg procedure attempts to control the FDR at level α
and works as follows:
– Let k̂ = max{ i : p_(i) ≤ αi/m }. Then reject H_(1), . . . , H_(k̂), or accept all
hypotheses if k̂ is not defined.
Under certain conditions, this does control the false discovery rate.
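A direct implementation of the procedure (my own sketch):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: k = max{i : p_(i) <= alpha * i / m}; rejects the
    hypotheses with the k smallest p-values (returned as 0-based indices)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):   # rank plays the role of i in p_(i)
        if p_values[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])

rej = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```

Note that because k̂ is a maximum, a p-value above its own threshold can still be rejected if some larger p-value falls below its threshold; in the example above, however, only the two smallest p-values are rejected.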
The brilliant idea is, for each i ∈ I_0, to let R_i be the number of rejections when
applying a modified Benjamini–Hochberg procedure to p^{\i} = {p_1, . . . , p_m} \ {p_i}
with cutoff
k̂_i = max{ j : p^{\i}_(j) ≤ α(j + 1)/m }.
We observe that for i ∈ I_0 and r ≥ 1, we have
{ p_i ≤ αr/m, R = r } = { p_i ≤ αr/m, p_(r) ≤ αr/m, p_(s) > αs/m for all s > r }
                      = { p_i ≤ αr/m, p^{\i}_(r−1) ≤ αr/m, p^{\i}_(s−1) > αs/m for all s > r }
                      = { p_i ≤ αr/m, R_i = r − 1 }.
The key point is that the event R_i = r − 1 depends only on the other p-values. So the
FDR is equal to
FDR = Σ_{i ∈ I_0} Σ_{r=1}^{m} (1/r) P( p_i ≤ αr/m, R_i = r − 1 )
    = Σ_{i ∈ I_0} Σ_{r=1}^{m} (1/r) P( p_i ≤ αr/m ) P( R_i = r − 1 )
Using that P(p_i ≤ αr/m) ≤ αr/m by definition, this is
≤ (α/m) Σ_{i ∈ I_0} Σ_{r=1}^{m} P( R_i = r − 1 )
= (α/m) Σ_{i ∈ I_0} P( R_i ∈ {0, . . . , m − 1} )
= α m_0 / m.
This is one of the most used procedures in modern statistics, especially in
the biological sciences.
We hope we can choose Θ̂ so that ∆ is small. We can then use the quantity
b = β̂ + λΘ̂ν̂ = β̂ + (1/n) Θ̂ X^T (Y − X β̂)
as our modified estimator, called the debiased Lasso.
How do we bound ∆? We already know that (under compatibility and
sparsity conditions), we can make the ℓ1 norm ‖β^0 − β̂‖_1 small with high
probability. So if the ℓ∞ norm of each of the rows of Θ̂Σ̂ − I is small, then
Hölder allows us to bound ∆.
Write θ̂_j for the jth row of Θ̂. We want ‖(Σ̂Θ̂^T − I)_j‖_∞ to be small, and we take
X θ̂_j = (X_j − X_{−j} γ̂^{(j)}) / ( X_j^T (X_j − X_{−j} γ̂^{(j)}) / n ).
Thus, we have (1/n) X_j^T X θ̂_j = 1, and by the KKT conditions for the Lasso, we have
(τ̂_j^2 / n) ‖X_{−j}^T X θ̂_j‖_∞ ≤ λ_j.
Now this is good as long as we can ensure λ_j / τ̂_j^2 to be small. When is this true?
We can consider a random design setting, where each row of X is iid Np (0, Σ)
for some positive-definite Σ. Write Ω = Σ−1 .
Then by our study of the neighbourhood selection procedure, we know that
for each j, we can write
X_j = X_{−j} γ^{(j)} + ε^{(j)},
where the ε_i^{(j)} | X_{−j} ∼ N(0, Ω_jj^{-1}) are iid and γ^{(j)} = −Ω_jj^{-1} Ω_{−j,j}. To apply our
results, we need to ensure that the γ^{(j)} are sparse. Let us therefore define
s_j = Σ_{k ≠ j} 1_{Ω_jk ≠ 0},
and as n, p → ∞,
P( ‖∆‖_∞ > A_2 s log(p)/√n ) → 0.
Note that here X is not centered and scaled.
We see that in particular, √n(b_j − β_j^0) ∼ N(0, σ^2 (Θ̂Σ̂Θ̂^T)_jj) approximately. In fact, one
can show that
d_j = ( (1/n) ‖X_j − X_{−j} γ̂^{(j)}‖_2^2 ) / τ̂_j^4.
This suggests an approximate (1 − α)-level confidence interval for β_j^0:
CI = ( b_j − Z_{α/2} σ √(d_j/n),  b_j + Z_{α/2} σ √(d_j/n) ),
where Z_α is the upper α point of N(0, 1). Note that here we are getting confidence
intervals of width ∼ √(1/n). In particular, there is no log p dependence if we are
only trying to estimate a single β_j^0.
Consider the event Λ_n on which the following hold:
– (2/n) ‖X^T ε‖_∞ ≤ λ and (2/n) ‖X_{−j}^T ε^{(j)}‖_∞ ≤ λ_j.
– (1/n) ‖ε^{(j)}‖_2^2 ≥ Ω_jj^{-1} (1 − 4√((log p)/n)).
Question 13 on example sheet 4 shows that P(Λn ) → 1 for A1 sufficiently
large. So we will work on the event Λn .
for some constant c_1. We now seek a lower bound for τ̂_j^2. Considering the linear
model for X_j given X_{−j}, we have
Ω_jj^{-1} = var(X_ij | X_{i,−j}) ≤ var(X_ij) = Σ_jj ≤ A.
Index
de, 52
faithfulness, 56
false discovery rate, 60
family-wise error rate, 59
Fisher information matrix, 4
FWER, 59
fused Lasso, 45
sgn, 36
v-fold cross-validation, 9
v-structure, 54
xA, 35
x−j, 35