Modern Statistical Methods

This document provides an overview and outline of the course "Modern Statistical Methods". It discusses how modern datasets have increased in size and complexity, making traditional statistical methods less applicable. The course will cover four main topics: kernel machines, the Lasso and its extensions, graphical modeling and causal inference, and multiple testing and high-dimensional inference. These topics develop methods to extract meaningful information from large, high-dimensional datasets. The document provides context on classical statistical methods like linear regression and maximum likelihood estimation before focusing on modern approaches.

Part III — Modern Statistical Methods

Based on lectures by R. D. Shah


Notes taken by Dexter Chua

Michaelmas 2017

These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.

The remarkable development of computing power and other technology now allows
scientists and businesses to routinely collect datasets of immense size and complexity.
Most classical statistical methods were designed for situations with many observations
and a few, carefully chosen variables. However, we now often gather data where we
have huge numbers of variables, in an attempt to capture as much information as we
can about anything which might conceivably have an influence on the phenomenon
of interest. This dramatic increase in the number of variables makes modern datasets
strikingly different, as well-established traditional methods perform either very poorly,
or often do not work at all.

Developing methods that are able to extract meaningful information from these large
and challenging datasets has recently been an area of intense research in statistics,
machine learning and computer science. In this course, we will study some of the
methods that have been developed to analyse such datasets. We aim to cover some of
the following topics.

– Kernel machines: the kernel trick, the representer theorem, support vector
machines, the hashing trick.
– Penalised regression: Ridge regression, the Lasso and variants.
– Graphical modelling: neighbourhood selection and the graphical Lasso. Causal
inference through structural equation modelling; the PC algorithm.
– High-dimensional inference: the closed testing procedure and the Benjamini–Hochberg
procedure; the debiased Lasso.

Pre-requisites

Basic knowledge of statistics, probability, linear algebra and real analysis. Some
background in optimisation would be helpful but is not essential.


Contents
0 Introduction

1 Classical statistics

2 Kernel machines
2.1 Ridge regression
2.2 v-fold cross-validation
2.3 The kernel trick
2.4 Making predictions
2.5 Other kernel machines
2.6 Large-scale kernel machines

3 The Lasso and beyond
3.1 The Lasso estimator
3.2 Basic concentration inequalities
3.3 Convex analysis and optimization theory
3.4 Properties of Lasso solutions
3.5 Variable selection
3.6 Computation of Lasso solutions
3.7 Extensions of the Lasso

4 Graphical modelling
4.1 Conditional independence graphs
4.2 Structural equation modelling
4.3 The PC algorithm

5 High-dimensional inference
5.1 Multiple testing
5.2 Inference in high-dimensional regression

Index


0 Introduction
In recent years, there has been a rather significant change in what sorts of data
we have to handle and what questions we ask about them, witnessed by the
popularity of the buzzwords “big data” and “machine learning”. In classical
statistics, we usually have a small set of parameters, and a very large data set.
We then use the large data set to estimate the parameters.
However, nowadays we often see scenarios where we have a very large number
of parameters, and the data set is relatively small. If we tried to apply our
classical linear regression, then we would just be able to tune the parameters so
that we have a perfect fit, and still have great freedom to change the parameters
without affecting the fitting.
One example is that we might want to test which genes are responsible for
a particular disease. In this case, there is a huge number of genes to consider,
and there is good reason to believe that most of them are irrelevant, i.e. the
corresponding parameters should be set to zero. Thus, we want to develop methods that find
the “best” fitting model that takes this into account.
Another problem we might encounter is that we just have a large data set,
and doing linear regression seems a bit silly. If we have so much data, we might
as well try to fit more complicated curves, such as polynomial functions and
friends. Perhaps more ambitiously, we might try to find the best continuously
differentiable function that fits the data, or, as analysts will immediately suggest
as an alternative, the best weakly differentiable function.
There are many things we can talk about, and we can’t talk about all of
them. In this course, we are going to cover 4 different topics of different size:
– Kernel machines
– The Lasso and its extensions
– Graphical modeling and causal inference
– Multiple testing and high-dimensional inference
The four topics are rather disjoint, and draw on different mathematical skills.


1 Classical statistics
This is a course on modern statistical methods. Before we study methods, we
give a brief summary of what we are not going to talk about, namely classical
statistics.
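So suppose we are doing regression. We have some predictors $x_i \in \mathbb{R}^p$ and
responses $Y_i \in \mathbb{R}$, and we hope to find a model that describes $Y$ as a function of
$x$. For convenience, define the matrix $X$ and the vector $Y$ by
\[
  X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \quad
  Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}.
\]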

The linear model then assumes there is some β 0 ∈ Rp such that

Y = Xβ 0 + ε,

where ε is some (hopefully small) error random variable. Our goal is then to
estimate β 0 given the data we have.
If $X$ has full column rank, so that $X^T X$ is invertible, then we can use ordinary
least squares to estimate $\beta^0$, with estimate
\[
  \hat{\beta}^{OLS} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 = (X^T X)^{-1} X^T Y.
\]

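This assumes nothing about $\varepsilon$ itself, but if we assume that $\mathbb{E}\varepsilon = 0$ and
$\operatorname{var}(\varepsilon) = \sigma^2 I$, then this estimate satisfies
– $\mathbb{E}_\beta \hat{\beta}^{OLS} = (X^T X)^{-1} X^T X \beta^0 = \beta^0$
– $\operatorname{var}_\beta(\hat{\beta}^{OLS}) = (X^T X)^{-1} X^T \operatorname{var}(\varepsilon) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$.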


In particular, this is an unbiased estimator. Even better, this is the best linear
unbiased estimator. More precisely, the Gauss–Markov theorem says that any other
linear unbiased estimator $\tilde{\beta} = AY$ has $\operatorname{var}(\tilde{\beta}) - \operatorname{var}(\hat{\beta}^{OLS})$ positive semi-definite.
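As a rough numerical illustration of the ordinary least squares formulas above, the following sketch uses Python with NumPy on simulated data. The sample sizes, the seed, the noise level and the coefficient vector `beta0` are all assumptions made for the example and do not come from the lectures.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is reproducible
n, p = 200, 3                           # illustrative sizes (assumption)
X = rng.normal(size=(n, p))             # design matrix, full column rank here
beta0 = np.array([1.0, -2.0, 0.5])      # hypothetical true coefficients
sigma = 0.3                             # hypothetical noise standard deviation
Y = X @ beta0 + sigma * rng.normal(size=n)

# Closed form (X^T X)^{-1} X^T Y, computed by solving the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# The same estimate from a least-squares solver minimising ||Y - X b||_2^2
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Covariance sigma^2 (X^T X)^{-1} of the estimator, using the known sigma
cov_ols = sigma**2 * np.linalg.inv(X.T @ X)
```

Solving the normal equations rather than forming the inverse explicitly is a common numerical choice; the two routes agree up to rounding error.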
Of course, ordinary least squares is not the only way to estimate β 0 . Another
common method for estimating parameters is maximum likelihood estimation,
and this works for more general models than linear regression. For people who are
already sick of meeting likelihoods, this will be the last time we meet likelihoods
in this course.
Suppose we want to estimate a parameter $\theta$ via knowledge of some data $Y$.
We assume $Y$ has density $f(y; \theta)$. We define the log-likelihood by
\[
  \ell(\theta) = \log f(Y; \theta).
\]
The maximum likelihood estimator then maximizes $\ell(\theta)$ over $\theta$ to get $\hat{\theta}$.
Similar to ordinary least squares, there is a theorem that says maximum
likelihood estimation is the “best”. To state it, we introduce the Fisher information
matrix. This is a family of $d \times d$ matrices indexed by $\theta$, defined by
\[
  I_{jk}(\theta) = -\mathbb{E}_\theta\left[ \frac{\partial^2}{\partial\theta_j \, \partial\theta_k} \ell(\theta) \right].
\]
The relevant theorem is


Theorem (Cramér–Rao bound). If $\tilde{\theta}$ is an unbiased estimator for $\theta$, then
$\operatorname{var}(\tilde{\theta}) - I^{-1}(\theta)$ is positive semi-definite.
Moreover, asymptotically, as $n \to \infty$, the maximum likelihood estimator is
unbiased and achieves the Cramér–Rao bound.
Another wonderful fact about the maximum likelihood estimator is that
asymptotically, it is normally distributed, and so it is something we understand
well.
This might seem very wonderful, but there are a few problems here. The
results we stated are asymptotic, but what we actually see in real life is that
as n → ∞, the value of p also increases. In these contexts, the asymptotic
property doesn’t tell us much. Another issue is that these all talk about unbiased
estimators. In a lot of situations of interest, it turns out biased methods do
much better than the unbiased methods we have.
Another thing we might be interested in is that as n gets large, we might want
to use more complicated models than simple parametric models, as we have much
more data to mess with. This is not something ordinary least squares provides
us with.


2 Kernel machines
We are going to start a little bit slowly, and think about our linear model
Y = Xβ 0 + ε, where E(ε) = 0 and var(ε) = σ 2 I. Ordinary least squares is an
unbiased estimator, so let’s look at biased estimators.
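For a biased estimator $\tilde{\beta}$, we should not study the variance, but the mean
squared error
\begin{align*}
  \mathbb{E}[(\tilde{\beta} - \beta^0)(\tilde{\beta} - \beta^0)^T]
  &= \mathbb{E}[(\tilde{\beta} - \mathbb{E}\tilde{\beta} + \mathbb{E}\tilde{\beta} - \beta^0)(\tilde{\beta} - \mathbb{E}\tilde{\beta} + \mathbb{E}\tilde{\beta} - \beta^0)^T] \\
  &= \operatorname{var}(\tilde{\beta}) + (\mathbb{E}\tilde{\beta} - \beta^0)(\mathbb{E}\tilde{\beta} - \beta^0)^T.
\end{align*}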

The first term is, of course, just the variance, and the second is the squared bias.
So the point is that if we pick a clever biased estimator with a tiny variance,
then this might do better than unbiased estimators with large variance.

2.1 Ridge regression


Ridge regression was introduced around 1970. The idea of Ridge regression is
to try to shrink our estimator a bit in order to lessen the variance, at the cost of
introducing a bias. We would hope then that this will result in a smaller mean
squared error.

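Definition (Ridge regression). Ridge regression solves
\[
  (\hat{\mu}^R_\lambda, \hat{\beta}^R_\lambda) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \left\{ \|Y - \mu \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\},
\]
where $\mathbf{1}$ is a vector of all 1’s. Here $\lambda \geq 0$ is a tuning parameter, and it controls
how much we penalize a large choice of $\beta$.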
Here we explicitly have an intercept term. Usually, we eliminate this by
adding a column of 1’s in X. But here we want to separate that out, since
we do not want to give a penalty for large µ. For example, if the response is
temperature, then if we decide to measure in degrees Celsius rather than Kelvins,
then we don’t want the resulting β̂ to change, and µ̂ should simply shift by the corresponding constant.
More precisely, our formulation of Ridge regression is such that if we make the
modification
\[
  Y' = c\mathbf{1} + Y,
\]
then we have
\[
  \hat{\mu}^R_\lambda(Y') = \hat{\mu}^R_\lambda(Y) + c.
\]

Note also that the Ridge regression formula makes sense only if each entry in $\beta$
has the same order of magnitude, or else the penalty will only have a significant
effect on the terms of large magnitude. Standard practice is to subtract from
each column of $X$ its mean, and then scale it to have $\ell_2$ norm $\sqrt{n}$. The actual
number is not important here, but it will be in the case of the Lasso.
By differentiating, one sees that the solution to the optimization problem is
\begin{align*}
  \hat{\mu}^R_\lambda &= \bar{Y} = \frac{1}{n} \sum_i Y_i, \\
  \hat{\beta}^R_\lambda &= (X^T X + \lambda I)^{-1} X^T Y.
\end{align*}


Note that we can always pick λ such that the matrix (X T X + λI) is invertible.
In particular, this can work even if we have more parameters than data points.
This will be important when we work with the Lasso later on.
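The closed-form Ridge solution, together with the centring and scaling convention described earlier, can be sketched as follows. This is a minimal Python/NumPy illustration; the function name `ridge_fit`, the simulated data and the value of `lam` are hypothetical choices for the example.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Return (mu_hat, beta_hat, Xs) for ridge regression with an unpenalised intercept.

    Each column of X is centred and rescaled to have l2 norm sqrt(n);
    beta_hat is expressed in terms of the rescaled columns Xs.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)                               # centre each column
    Xs = Xc * (np.sqrt(n) / np.linalg.norm(Xc, axis=0))   # rescale to norm sqrt(n)
    mu_hat = Y.mean()                                     # intercept is the mean of Y
    # (Xs^T Xs + lam I)^{-1} Xs^T Y, obtained by solving a p x p linear system
    beta_hat = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ Y)
    return mu_hat, beta_hat, Xs

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=50)
mu, beta, Xs = ridge_fit(X, Y, lam=2.0)

# Shifting the response by a constant moves the intercept and leaves beta alone
mu_shift, beta_shift, _ = ridge_fit(X, Y + 10.0, lam=2.0)
```

Because the columns are centred, adding a constant to every response changes only the fitted intercept, which is the invariance the separate intercept term is meant to provide.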
If some god-like being told us a suitable value of λ to use, then Ridge
regression always works better than ordinary least squares.
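Theorem. Suppose $\operatorname{rank}(X) = p$. Then for $\lambda > 0$ sufficiently small (depending
on $\beta^0$ and $\sigma^2$), we have that
\[
  \mathbb{E}(\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^T - \mathbb{E}(\hat{\beta}^R_\lambda - \beta^0)(\hat{\beta}^R_\lambda - \beta^0)^T \tag{$*$}
\]
is positive definite.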
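Proof. We know that the first term is just $\sigma^2 (X^T X)^{-1}$. The second term has
a variance part and a bias part. We first look at the bias:
\begin{align*}
  \mathbb{E}[\hat{\beta}^R_\lambda - \beta^0] &= (X^T X + \lambda I)^{-1} X^T X \beta^0 - \beta^0 \\
  &= (X^T X + \lambda I)^{-1} (X^T X + \lambda I - \lambda I) \beta^0 - \beta^0 \\
  &= -\lambda (X^T X + \lambda I)^{-1} \beta^0.
\end{align*}
We can also compute the variance
\[
  \operatorname{var}(\hat{\beta}^R_\lambda) = \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1}.
\]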

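Note that both terms appearing in the squared error look like
\[
  (X^T X + \lambda I)^{-1} \, \text{something} \, (X^T X + \lambda I)^{-1}.
\]
So let’s try to write $\sigma^2 (X^T X)^{-1}$ in this form. Note that
\begin{align*}
  (X^T X)^{-1} &= (X^T X + \lambda I)^{-1} (X^T X + \lambda I) (X^T X)^{-1} (X^T X + \lambda I) (X^T X + \lambda I)^{-1} \\
  &= (X^T X + \lambda I)^{-1} (X^T X + 2\lambda I + \lambda^2 (X^T X)^{-1}) (X^T X + \lambda I)^{-1}.
\end{align*}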

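Thus, we can write ($*$) as
\begin{align*}
  &(X^T X + \lambda I)^{-1} \left[ \sigma^2 (X^T X + 2\lambda I + \lambda^2 (X^T X)^{-1}) - \sigma^2 X^T X - \lambda^2 \beta^0 (\beta^0)^T \right] (X^T X + \lambda I)^{-1} \\
  &\quad = \lambda (X^T X + \lambda I)^{-1} \left[ 2\sigma^2 I + \sigma^2 \lambda (X^T X)^{-1} - \lambda \beta^0 (\beta^0)^T \right] (X^T X + \lambda I)^{-1}.
\end{align*}
Since $\lambda > 0$, this is positive definite iff
\[
  2\sigma^2 I + \sigma^2 \lambda (X^T X)^{-1} - \lambda \beta^0 (\beta^0)^T
\]
is positive definite, which is true for $0 < \lambda < 2\sigma^2 / \|\beta^0\|_2^2$.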
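The conclusion of the proof can be checked numerically by building the two mean squared error matrices from the bias and variance formulas and inspecting the eigenvalues of their difference. The sketch below (Python with NumPy) uses a simulated design and a hypothetical `beta0` and `sigma2`, with `lam` chosen at half of the bound $2\sigma^2 / \|\beta^0\|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma2 = 50, 4, 1.0
X = rng.normal(size=(n, p))                      # simulated design of rank p
beta0 = np.array([1.0, -1.0, 0.5, 2.0])          # hypothetical true coefficients
lam = 0.5 * (2.0 * sigma2 / (beta0 @ beta0))     # inside (0, 2 sigma^2 / ||beta0||^2)

G = X.T @ X
R = np.linalg.inv(G + lam * np.eye(p))           # (X^T X + lam I)^{-1}

mse_ols = sigma2 * np.linalg.inv(G)              # OLS is unbiased: MSE = variance
bias = -lam * R @ beta0                          # bias of the ridge estimator
mse_ridge = sigma2 * R @ G @ R + np.outer(bias, bias)   # variance + squared bias

# Eigenvalues of the difference; all positive means it is positive definite
gap_eigenvalues = np.linalg.eigvalsh(mse_ols - mse_ridge)
```

No simulation of the noise is needed here, since both matrices are available in closed form once the design, `beta0` and `sigma2` are fixed.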

While this is nice, this is not really telling us much, because we don’t know
how to pick the correct λ. It also doesn’t tell us when we should expect a big
improvement from Ridge regression.
To understand this better, we need to use the singular value decomposition.


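Theorem (Singular value decomposition). Let $X \in \mathbb{R}^{n \times p}$ be any matrix. Then
it has a singular value decomposition (SVD)
\[
  \underset{n \times p}{X} = \underset{n \times n}{U} \; \underset{n \times p}{D} \; \underset{p \times p}{V^T},
\]
where $U, V$ are orthogonal matrices, and $D_{11} \geq D_{22} \geq \cdots \geq D_{mm} \geq 0$, where
$m = \min(n, p)$, and all other entries of $D$ are zero.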
When $n > p$, there is an alternative version where $U$ is an $n \times p$ matrix
with orthonormal columns, and $D$ is a $p \times p$ diagonal matrix. This is done by
replacing $U$ by its first $p$ columns and $D$ by its first $p$ rows. This is known as
the thin singular value decomposition. In this case, $U^T U = I_p$ but $U U^T$ is not
the identity.
Let’s now apply this to understand Ridge regression a little better. Suppose
$n > p$. Then using the thin SVD, the fitted values from ridge regression are
\begin{align*}
  X \hat{\beta}^R_\lambda &= X (X^T X + \lambda I)^{-1} X^T Y \\
  &= U D V^T (V D^2 V^T + \lambda I)^{-1} V D U^T Y.
\end{align*}
We now note that $V D^2 V^T + \lambda I = V (D^2 + \lambda I) V^T$, since $V$ is still orthogonal.
We then have
\[
  (V (D^2 + \lambda I) V^T)^{-1} = V (D^2 + \lambda I)^{-1} V^T.
\]
Since $D^2$ and $\lambda I$ are both diagonal, the inverse of $D^2 + \lambda I$ is easy to compute.
We then have
\[
  X \hat{\beta}^R_\lambda = U D^2 (D^2 + \lambda I)^{-1} U^T Y = \sum_{j=1}^p U_j \frac{D_{jj}^2}{D_{jj}^2 + \lambda} U_j^T Y.
\]
Here $U_j$ refers to the $j$th column of $U$.


Now note that the columns of U form an orthonormal basis for the column
space of X. If λ = 0, then this is just a fancy formula for the projection of Y
onto the column space of X. Thus, what this formula is telling us is that we
look at this projection, look at the coordinates in terms of the basis given by
the columns of U , and scale accordingly.
We can now concretely see the effect of our λ. The shrinking depends on the
size of Djj , and the larger Djj is, the less shrinking we do.
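The shrinkage formula can be compared against the direct expression for the fitted values. The following sketch (Python with NumPy, simulated data, with `lam` an arbitrary illustrative value) computes both and the per-component shrinkage factors $D_{jj}^2 / (D_{jj}^2 + \lambda)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 5, 3.0                  # illustrative sizes and penalty
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Thin SVD: U is n x p with orthonormal columns, d holds D_11 >= ... >= D_pp
U, d, Vt = np.linalg.svd(X, full_matrices=False)

shrink = d**2 / (d**2 + lam)            # factor applied to the j-th coordinate
fitted_svd = U @ (shrink * (U.T @ Y))   # sum_j U_j * shrink_j * (U_j^T Y)

# Direct formula X (X^T X + lam I)^{-1} X^T Y for comparison
fitted_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```

The factors in `shrink` lie strictly between 0 and 1 and decrease along with the singular values, so directions with small $D_{jj}$ are shrunk the most.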
This is not very helpful if we don’t have a concrete interpretation of the Djj ,
or rather, the columns of U . It turns out the columns of U are what are known
as the normalized principal components of X.
Take $u \in \mathbb{R}^p$ with $\|u\|_2 = 1$. The sample variance of $Xu \in \mathbb{R}^n$ is then
\[
  \frac{1}{n} \|Xu\|_2^2 = \frac{1}{n} u^T X^T X u = \frac{1}{n} u^T V D^2 V^T u.
\]
We write $w = V^T u$. Then $\|w\|_2 = 1$ since $V$ is orthogonal. So we have
\[
  \frac{1}{n} \|Xu\|_2^2 = \frac{1}{n} w^T D^2 w = \frac{1}{n} \sum_j w_j^2 D_{jj}^2 \leq \frac{1}{n} D_{11}^2,
\]
and the bound is achieved when $w = e_1$, or equivalently, $u = V e_1 = V_1$. Thus, $V_1$
gives the coefficients of the linear combination of columns of $X$ that has largest
sample variance. We can then write the result as
\[
  X V_1 = U D V^T V_1 = U_1 D_{11}.
\]


We can extend this to a description of the other columns of U, which is done
in the example sheet. Roughly, the subsequent principal components obey the
same optimality conditions with the added constraint of being orthogonal to all
earlier principal components.
The conclusion is that Ridge regression works best if EY lies in the space
spanned by the large principal components of X.

2.2 v-fold cross-validation


In practice, the above analysis is not very useful, since it doesn’t actually tell
us what λ we should pick. If we are given a concrete data set, how do we know
what λ we should pick?
One common method is to use v-fold cross-validation. This is a very general
technique that allows us to pick a regression method from a variety of options.
We shall explain the method in terms of Ridge regression, where our regression
methods are parametrized by a single parameter λ, but it should be clear that
this is massively more general.
Suppose we have some data set $(Y_i, x_i)_{i=1}^n$, and we are asked to predict the
value of $Y^*$ given a new predictor $x^*$. We may want to pick $\lambda$ to minimize the
following quantity:
\[
  \mathbb{E}\left\{ (Y^* - (x^*)^T \hat{\beta}^R_\lambda(X, Y))^2 \mid X, Y \right\}.
\]
It is difficult to actually do this, and so an easier target to minimize is the
expected prediction error
\[
  \mathbb{E}\left[ \mathbb{E}\left\{ (Y^* - (x^*)^T \hat{\beta}^R_\lambda(\tilde{X}, \tilde{Y}))^2 \mid \tilde{X}, \tilde{Y} \right\} \right].
\]

One thing to note about this is that we are thinking of X̃ and Ỹ as arbitrary
data sets of size n, as opposed to the one we have actually got. This might be a
more tractable problem, since we are not working with our actual data set, but
general data sets.
The method of cross-validation estimates this by splitting the data into $v$
folds.

[Figure: the data split into $v$ folds $(X^{(1)}, Y^{(1)}), \ldots, (X^{(v)}, Y^{(v)})$, with index sets $A_1, \ldots, A_v$.]

$A_i$ is the set of observation indices of the $i$th fold, i.e. the set of all indices $j$ such that the
$j$th data point lies in the $i$th fold.
We let $(X^{(-k)}, Y^{(-k)})$ be all data except that in the $k$th fold. We define
\[
  CV(\lambda) = \frac{1}{n} \sum_{k=1}^v \sum_{i \in A_k} (Y_i - x_i^T \hat{\beta}^R_\lambda(X^{(-k)}, Y^{(-k)}))^2.
\]


We write $\lambda_{CV}$ for the minimizer of this, and pick $\hat{\beta}^R_{\lambda_{CV}}(X, Y)$ as our estimate.
This tends to work very well in practice.
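The procedure of computing $CV(\lambda)$ on a grid, picking the minimizer and refitting on all the data can be sketched as below. This is a minimal Python/NumPy illustration on simulated data; the helper names `ridge_beta` and `cv_error`, the grid of $\lambda$ values and the random fold assignment are assumptions made for the example, and the intercept is omitted for brevity.

```python
import numpy as np

def ridge_beta(X, Y, lam):
    """Ridge coefficients (X^T X + lam I)^{-1} X^T Y (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def cv_error(X, Y, lam, folds):
    """CV(lam): squared error on each fold of the fit to the other folds, over n."""
    n = len(Y)
    total = 0.0
    for A_k in folds:                              # A_k: indices in the k-th fold
        train = np.setdiff1d(np.arange(n), A_k)    # all indices outside fold k
        beta = ridge_beta(X[train], Y[train], lam)
        total += np.sum((Y[A_k] - X[A_k] @ beta) ** 2)
    return total / n

rng = np.random.default_rng(3)
n, p, v = 60, 5, 5
X = rng.normal(size=(n, p))
Y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=n)

folds = np.array_split(rng.permutation(n), v)      # random split into v folds
grid = [0.01, 0.1, 1.0, 10.0, 100.0]               # hypothetical lambda grid
cv = [cv_error(X, Y, lam, folds) for lam in grid]
lam_cv = grid[int(np.argmin(cv))]                  # minimiser of CV over the grid
beta_cv = ridge_beta(X, Y, lam_cv)                 # refit on all the data
```

Each candidate $\lambda$ requires $v$ separate fits, which is the main computational cost of the method.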
But we can ask ourselves: why do we have to pick a single $\lambda$ to make the
estimate? We can instead average over a range of different $\lambda$. Suppose we have
computed $\hat{\beta}^R_\lambda$ on a grid of $\lambda$-values $\lambda_1 > \lambda_2 > \cdots > \lambda_L$. Our plan is to take a
good weighted average of the corresponding estimates. Concretely, we want to minimize
\[
  \frac{1}{n} \sum_{k=1}^v \sum_{i \in A_k} \left( Y_i - \sum_{l=1}^L w_l \, x_i^T \hat{\beta}^R_{\lambda_l}(X^{(-k)}, Y^{(-k)}) \right)^2
\]

over $w \in [0, \infty)^L$. This is known as stacking. This tends to work better than
just doing v-fold cross-validation. Indeed, in terms of the criterion above it can do
no worse: cross-validation is just the special case where we restrict $w$ to have
exactly one non-zero entry, equal to 1. Of course, this
method comes with some heavy computational costs.

2.3 The kernel trick


We now come to the actual, main topic of the chapter. We start with a very
simple observation, which gives an alternative way of writing the fitted values from Ridge
regression. We have
\[
  (X^T X + \lambda I) X^T = X^T (X X^T + \lambda I).
\]
One should be careful that the two $\lambda I$ are different matrices, as they
have different dimensions ($p \times p$ on the left, $n \times n$ on the right). Multiplying by the inverses on both sides, we have
\[
  X^T (X X^T + \lambda I)^{-1} = (X^T X + \lambda I)^{-1} X^T.
\]

We can multiply this on the right by $Y$ to obtain the $\hat{\beta}$ from Ridge regression,
and then multiply on the left by $X$ to obtain the fitted values. So we have
\[
  X X^T (X X^T + \lambda I)^{-1} Y = X (X^T X + \lambda I)^{-1} X^T Y = X \hat{\beta}^R_\lambda.
\]
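The two expressions for the fitted values can be compared numerically. The sketch below (Python with NumPy, simulated data, illustrative sizes with more predictors than observations) solves the $n \times n$ system on one side and the $p \times p$ system on the other.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 8, 20, 0.5                  # more predictors than observations
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

K = X @ X.T                             # n x n matrix X X^T

# Fitted values through the n x n system: K (K + lam I_n)^{-1} Y
fitted_kernel = K @ np.linalg.solve(K + lam * np.eye(n), Y)

# Fitted values through the p x p system: X (X^T X + lam I_p)^{-1} X^T Y
fitted_primal = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```

Here $X^T X$ itself is singular because $p > n$, but adding $\lambda I$ with $\lambda > 0$ makes both systems solvable.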

Note that $X^T X$ is a $p \times p$ matrix, and takes $O(np^2)$ time to compute. On the
other hand, $X X^T$ is an $n \times n$ matrix, and takes $O(n^2 p)$ time to compute. If
$p \gg n$, then this way of computing the fitted values would be much quicker. The
key point to make is that the fitted values from Ridge regression depend only
on $K = X X^T$ (and $Y$). Why is this important?
Suppose we believe we have a quadratic signal
\[
  Y_i = x_i^T \beta + \sum_{k, \ell} x_{ik} x_{i\ell} \theta_{k\ell} + \varepsilon_i.
\]
Of course, we can do Ridge regression, as long as we add the products $x_{ik} x_{i\ell}$
as predictors. But this requires $O(p^2)$ many predictors. Even if we use the
$X X^T$ way, this has a cost of $O(n^2 p^2)$, and if we used the naive method, it would
require $O(np^4)$ operations.
We can do better than this. The idea is that we might be able to compute
$K$ directly. Consider
\[
  (1 + x_i^T x_j)^2 = 1 + 2 x_i^T x_j + \sum_{k, \ell} x_{ik} x_{i\ell} x_{jk} x_{j\ell}.
\]


This is equal to the inner product between vectors of the form
\[
  (1, \sqrt{2} x_{i1}, \ldots, \sqrt{2} x_{ip}, x_{i1} x_{i1}, \ldots, x_{i1} x_{ip}, x_{i2} x_{i1}, \ldots, x_{ip} x_{ip}). \tag{$*$}
\]
If we set $K_{ij} = (1 + x_i^T x_j)^2$ and form $K (K + \lambda I)^{-1} Y$, this is equivalent to performing
ridge regression with ($*$) as our predictors. Note that here we don’t scale our
columns to have the same $\ell_2$ norm. This is pretty interesting, because computing
this is only $O(n^2 p)$. We managed to kill a factor of $p$ in this computation. The
key idea here was that the fitted values depend only on $K$, and not on the values
of $x_{ij}$ themselves.
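The agreement between the directly computed $K$ and the inner products of the explicit feature vectors ($*$) can be checked on a small example. In this Python/NumPy sketch the function name `phi` and the simulated inputs are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Explicit feature vector: 1, sqrt(2) * x_k for each k, and all products x_k x_l."""
    return np.concatenate(([1.0], np.sqrt(2.0) * x, np.outer(x, x).ravel()))

rng = np.random.default_rng(5)
n, p = 6, 4
X = rng.normal(size=(n, p))

K_direct = (1.0 + X @ X.T) ** 2         # O(n^2 p): the products are never formed
Phi = np.array([phi(x) for x in X])     # n x (1 + p + p^2) feature matrix
K_features = Phi @ Phi.T                # O(n^2 p^2): inner products of features
```

The direct route needs only the $n \times n$ matrix of inner products $x_i^T x_j$, while the explicit route builds $1 + p + p^2$ features per observation.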
Consider the general scenario where we try to predict the value of Y given
a predictor x ∈ X . In general, we don’t even assume X has some nice linear
structure where we can do linear regression.
If we want to do Ridge regression, then one thing we can do is to
construct some map $\phi : \mathcal{X} \to \mathbb{R}^D$ for some $D$, and then run Ridge
regression using $\{\phi(x_i)\}$ as our predictors. If we want to fit some complicated,
non-linear model, then $D$ can potentially be very large, and this can be costly.
The observation above is that instead of trying to do this, we can perhaps find
some magical way of directly computing
\[
  K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.
\]

If we can do so, then we can simply use the above formula to obtain the fitted
values (we shall discuss the problem of making new predictions later).
Since it is only the function k that matters, perhaps we can specify a
“regression method” simply by providing a suitable function k : X × X → R.
If we are given such a function k, when would it arise from a “feature map”
φ : X → RD ?
More generally, we will find that it is convenient to allow for φ to take values
in infinite-dimensional vector spaces instead (since we don’t actually have to
compute φ!).
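Definition (Inner product space). An inner product space is a real vector space
$H$ endowed with a map $\langle \,\cdot\, , \,\cdot\, \rangle : H \times H \to \mathbb{R}$ that obeys
– Symmetry: $\langle u, v \rangle = \langle v, u \rangle$.
– Linearity: If $a, b \in \mathbb{R}$, then $\langle au + bw, v \rangle = a \langle u, v \rangle + b \langle w, v \rangle$.
– Positive definiteness: $\langle u, u \rangle \geq 0$ with $\langle u, u \rangle = 0$ iff $u = 0$.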
If we want k to come from a feature map φ, then an immediate necessary
condition is that k has to be symmetric. There is also another condition that
corresponds to the positive-definiteness of the inner product.
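Proposition. Given $\phi : \mathcal{X} \to H$, define $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ by
\[
  k(x, x') = \langle \phi(x), \phi(x') \rangle.
\]
Then for any $x_1, \ldots, x_n \in \mathcal{X}$, the matrix $K \in \mathbb{R}^{n \times n}$ with entries
\[
  K_{ij} = k(x_i, x_j)
\]
is positive semi-definite.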


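Proof. Let $x_1, \ldots, x_n \in \mathcal{X}$, and $\alpha \in \mathbb{R}^n$. Then
\begin{align*}
  \sum_{i,j} \alpha_i k(x_i, x_j) \alpha_j &= \sum_{i,j} \alpha_i \alpha_j \langle \phi(x_i), \phi(x_j) \rangle \\
  &= \left\langle \sum_i \alpha_i \phi(x_i), \sum_j \alpha_j \phi(x_j) \right\rangle \\
  &\geq 0
\end{align*}
since the inner product is positive definite.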


This suggests the following definition:
Definition (Positive-definite kernel). A positive-definite kernel (or simply
kernel) is a symmetric map $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for all $n \in \mathbb{N}$ and
$x_1, \ldots, x_n \in \mathcal{X}$, the matrix $K \in \mathbb{R}^{n \times n}$ with entries
\[
  K_{ij} = k(x_i, x_j)
\]
is positive semi-definite.
We will prove that every positive-definite kernel comes from a feature map.
However, before that, let’s look at some examples.
Example. Suppose $k_1, k_2, \ldots$ are kernels. Then
– If $\alpha_1, \alpha_2 \geq 0$, then $\alpha_1 k_1 + \alpha_2 k_2$ is a kernel. Moreover, if
\[
  k(x, x') = \lim_{m \to \infty} k_m(x, x')
\]
exists for all $x, x'$, then $k$ is a kernel.
– The pointwise product $k_1 k_2$ is a kernel, where
\[
  (k_1 k_2)(x, x') = k_1(x, x') k_2(x, x').
\]

Example. The linear kernel is $k(x, x') = x^T x'$. This is a kernel since it is
given by the feature map $\phi = \mathrm{id}$ with the standard inner product on $\mathbb{R}^p$.
Example. The polynomial kernel is $k(x, x') = (1 + x^T x')^d$ for any $d \in \mathbb{N}$.
We saw this last time with $d = 2$. This is a kernel since both $1$ and $x^T x'$ are
kernels, and sums and products preserve kernels.
Example. The Gaussian kernel is
\[
  k(x, x') = \exp\left( -\frac{\|x - x'\|_2^2}{2\sigma^2} \right).
\]
The quantity $\sigma$ is known as the bandwidth.
To show that it is a kernel, we decompose
\[
  \|x - x'\|_2^2 = \|x\|_2^2 + \|x'\|_2^2 - 2 x^T x'.
\]
We define
\[
  k_1(x, x') = \exp\left( -\frac{\|x\|_2^2}{2\sigma^2} \right) \exp\left( -\frac{\|x'\|_2^2}{2\sigma^2} \right).
\]
This is a kernel by taking $\phi(\,\cdot\,) = \exp\left( -\frac{\|\cdot\|_2^2}{2\sigma^2} \right)$.
Next, we can define
\[
  k_2(x, x') = \exp\left( \frac{x^T x'}{\sigma^2} \right) = \sum_{r=0}^\infty \frac{1}{r!} \left( \frac{x^T x'}{\sigma^2} \right)^r.
\]
We see that this is a limit of non-negative linear combinations of powers of kernels, hence is
a kernel. Thus, it follows that $k = k_1 k_2$ is a kernel.
Note also that any feature map giving this k must take values in an infinite
dimensional inner product space. Roughly speaking, this is because we have
arbitrarily large powers of x and x0 in the expansion of k.
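As a numerical sanity check of the claim that the Gaussian kernel is positive semi-definite, one can form its Gram matrix on a few points and look at the eigenvalues. In the sketch below (Python with NumPy) the function name `gaussian_kernel`, the simulated points and the bandwidth value are assumptions for the example.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gram matrix with entries exp(-||x_i - x_j||_2^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    # ||x - x'||^2 = ||x||^2 + ||x'||^2 - 2 x^T x', the decomposition in the text
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by rounding before exponentiating
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(6)
X = rng.normal(size=(10, 3))
K = gaussian_kernel(X, sigma=1.5)       # sigma plays the role of the bandwidth
eigenvalues = np.linalg.eigvalsh(K)     # K is symmetric, so eigvalsh applies
```

A check on one set of points is of course not a proof; it only illustrates what the definition of a kernel requires of each Gram matrix.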
Example. The Sobolev kernel is defined as follows: we take $\mathcal{X} = [0, 1]$, and let
\[
  k(x, x') = \min(x, x').
\]
A slick proof that this is a kernel is to notice that this is the covariance function
of Brownian motion.
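The Brownian motion argument can be made concrete on a finite grid: writing Brownian motion at times $t_1 < \cdots < t_m$ as a sum of independent increments gives a matrix $L$ with $L L^T$ equal to the Gram matrix of $\min$. The sketch below (Python with NumPy) uses an arbitrary illustrative grid `t`.

```python
import numpy as np

t = np.linspace(0.1, 1.0, 10)                  # grid 0 < t_1 < ... < t_m <= 1
K = np.minimum.outer(t, t)                     # K_ij = min(t_i, t_j)

# Brownian motion at t_i is a sum of independent increments with variances
# t_j - t_{j-1} (with t_0 = 0), so (B_{t_1}, ..., B_{t_m}) = L Z for standard normal Z.
increments = np.diff(np.concatenate(([0.0], t)))
L = np.tril(np.ones((len(t), len(t)))) * np.sqrt(increments)
cov_brownian = L @ L.T                         # covariance of (B_{t_1}, ..., B_{t_m})

smallest_eigenvalue = np.linalg.eigvalsh(K).min()
```

The rows of `L` act as explicit feature vectors for the grid points, since $K = L L^T$ is a matrix of inner products.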
Example. The Jaccard similarity is defined as follows: Take $\mathcal{X}$ to be the set of
all subsets of $\{1, \ldots, p\}$. For $x, x' \in \mathcal{X}$, define
\[
  k(x, x') = \begin{cases} \dfrac{|x \cap x'|}{|x \cup x'|} & x \cup x' \neq \emptyset \\ 1 & \text{otherwise} \end{cases}.
\]
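The Jaccard similarity is simple to compute with set operations. The following sketch (Python, with NumPy used only to inspect a small Gram matrix) uses a hypothetical function name `jaccard` and a handful of subsets of $\{1, 2, 3\}$ chosen for illustration.

```python
import numpy as np

def jaccard(x, y):
    """Jaccard similarity |x & y| / |x | y| of two sets, defined as 1 if both are empty."""
    union = x | y
    if not union:                        # both sets are empty
        return 1.0
    return len(x & y) / len(union)

subsets = [set(), {1}, {1, 2}, {2, 3}, {1, 2, 3}]   # some subsets of {1, 2, 3}
K = np.array([[jaccard(a, b) for b in subsets] for a in subsets])
smallest_eigenvalue = np.linalg.eigvalsh(K).min()   # non-negative up to rounding
```

For instance, `jaccard({1, 2, 3}, {2, 3, 4})` has intersection of size 2 and union of size 4.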

Theorem (Moore–Aronszajn theorem). For every kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there
exists an inner product space $H$ and a feature map $\phi : \mathcal{X} \to H$ such that
\[
  k(x, x') = \langle \phi(x), \phi(x') \rangle.
\]
This is not actually the full Moore–Aronszajn theorem, but a simplified
version of it.
Proof. Let $H$ denote the vector space of functions $f : \mathcal{X} \to \mathbb{R}$ of the form
\[
  f(\,\cdot\,) = \sum_{i=1}^n \alpha_i k(\,\cdot\,, x_i) \tag{$*$}
\]
for some $n \in \mathbb{N}$, $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and $x_1, \ldots, x_n \in \mathcal{X}$. If
\[
  g(\,\cdot\,) = \sum_{j=1}^m \beta_j k(\,\cdot\,, x'_j) \in H,
\]
then we tentatively define the inner product of $f$ and $g$ by
\[
  \langle f, g \rangle = \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j k(x_i, x'_j).
\]
We have to check that this is an inner product, but even before that, we need to
check that this is well-defined, since $f$ and $g$ can be represented in the form ($*$)
in multiple ways. To do so, simply observe that
\[
  \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j k(x_i, x'_j) = \sum_{i=1}^n \alpha_i g(x_i) = \sum_{j=1}^m \beta_j f(x'_j). \tag{$\dagger$}
\]


The first equality shows that the definition of our inner product does not depend
on the representation of g, while the second equality shows that it doesn’t depend
on the representation of f .
To show this is an inner product, note that it is clearly symmetric and bilinear.
To show it is positive definite, note that we have
\[
  \langle f, f \rangle = \sum_{i=1}^n \sum_{j=1}^n \alpha_i k(x_i, x_j) \alpha_j \geq 0
\]
since the kernel is positive semi-definite. It remains to check that if $\langle f, f \rangle = 0$,
then $f = 0$ as a function. To this end, note the important reproducing property:
by ($\dagger$), we have
\[
  \langle k(\,\cdot\,, x), f \rangle = f(x).
\]
This says $k(\,\cdot\,, x)$ represents the evaluation-at-$x$ linear functional.
In particular, if $\langle f, f \rangle = 0$, we have
\[
  f(x)^2 = \langle k(\,\cdot\,, x), f \rangle^2 \leq \langle k(\,\cdot\,, x), k(\,\cdot\,, x) \rangle \langle f, f \rangle = 0.
\]
Here we used the Cauchy–Schwarz inequality, which, if you inspect the proof,
does not require positive definiteness, just positive semi-definiteness. So it follows
that $f \equiv 0$. Thus, we know that $H$ is indeed an inner product space.
We now have to construct a feature map. Define $\phi : \mathcal{X} \to H$ by
\[
  \phi(x) = k(\,\cdot\,, x).
\]
Then we have already observed that
\[
  \langle \phi(x), \phi(x') \rangle = \langle k(\,\cdot\,, x), k(\,\cdot\,, x') \rangle = k(x, x'),
\]
as desired.
We shall boost this theorem up to its full strength, where we say there is
a “unique” inner product space with such a feature map. Of course, it will not
be unique unless we impose appropriate quantifiers, since we can just throw in
some random elements into H.
Recall (or learn) from functional analysis that any inner product space $B$ is
a normed space, with norm
\[
  \|f\|_B^2 = \langle f, f \rangle_B.
\]
We say a sequence $(f_m)$ in a normed space $B$ is Cauchy if $\|f_m - f_n\|_B \to 0$ as
$n, m \to \infty$. An inner product space is complete if every Cauchy sequence has a
limit (in the space), and a complete inner product space is called a Hilbert space.
One important property of a Hilbert space $B$ is that if $V$ is a closed subspace
of $B$, then every $f \in B$ has a unique representation as $f = v + w$, where $v \in V$
and
\[
  w \in V^\perp = \{u \in B : \langle u, y \rangle = 0 \text{ for all } y \in V\}.
\]
By adding limits of Cauchy sequences to $H$, we can obtain a Hilbert space.
Indeed, if $(f_m)$ is Cauchy in $H$, then
\[
  |f_m(x) - f_n(x)| \leq k^{1/2}(x, x) \|f_m - f_n\|_H \to 0
\]


as $m, n \to \infty$. Since every Cauchy sequence in $\mathbb{R}$ converges (i.e. $\mathbb{R}$ is complete),
it makes sense to define a limiting function
\[
  f(x) = \lim_{n \to \infty} f_n(x),
\]
and it can be shown that after augmenting $H$ with such limits, we obtain a
Hilbert space. In fact, it is a special type of Hilbert space.
Definition (Reproducing kernel Hilbert space (RKHS)). A Hilbert space $B$ of
functions $f : \mathcal{X} \to \mathbb{R}$ is a reproducing kernel Hilbert space if for each $x \in \mathcal{X}$,
there exists a $k_x \in B$ such that
\[
  \langle k_x, f \rangle = f(x)
\]
for all $f \in B$.
The function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by
\[
  k(x, x') = \langle k_x, k_{x'} \rangle = k_x(x') = k_{x'}(x)
\]
is called the reproducing kernel associated with $B$.


By the Riesz representation theorem, this condition is equivalent to saying
that pointwise evaluation is continuous.
We know the reproducing kernel associated with an RKHS is a positive-definite
kernel, and the Moore–Aronszajn theorem can be extended to show
that to each kernel $k$, there is a unique reproducing kernel Hilbert space $H$ and
feature map $\phi : \mathcal{X} \to H$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$.
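Example. Take the linear kernel
\[
  k(x, x') = x^T x'.
\]
By definition, we have
\[
  H = \left\{ f : \mathbb{R}^p \to \mathbb{R} \;\middle|\; f(x) = \sum_{i=1}^n \alpha_i x_i^T x \right\}
\]
for some $n \in \mathbb{N}$, $x_1, \ldots, x_n \in \mathbb{R}^p$. We then see that this is equal to
\[
  H = \{ f : \mathbb{R}^p \to \mathbb{R} \mid f(x) = \beta^T x \text{ for some } \beta \in \mathbb{R}^p \},
\]
and if $f(x) = \beta^T x$, then $\|f\|_H^2 = k(\beta, \beta) = \|\beta\|_2^2$.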


Example. Take the Sobolev kernel, where $\mathcal{X} = [0, 1]$ and $k(x, x') = \min(x, x')$.
Then $H$ includes all linear combinations of functions of the form $x \mapsto \min(x, x')$,
where $x' \in [0, 1]$, and their pointwise limits. These functions look like

[Figure: plot of $\min(x, x')$ against $x$, with $x'$ marked on the horizontal axis.]


Since we allow arbitrary linear combinations of these things and pointwise limits,
this gives rise to a large class of functions. In particular, this includes all Lipschitz
functions that are 0 at the origin.
In fact, the resulting space is a Sobolev space, with the norm given by
\[
  \|f\| = \left( \int_0^1 f'(x)^2 \, dx \right)^{1/2}.
\]
For example, if we take $f = \min(\,\cdot\,, x)$, then by definition we have
\[
  \|\min(\,\cdot\,, x)\|_H^2 = \min(x, x) = x,
\]
whereas the formula we claimed gives
\[
  \int_0^x 1^2 \, dt = x.
\]

In general, it is difficult to understand the RKHS, but if we can do so, this
provides a lot of information on what we are regularizing in the kernel trick.
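The claimed norm formula for the Sobolev kernel can be checked on finite kernel expansions. The sketch below assumes NumPy and uses hypothetical helper names. For $f = \sum_i \alpha_i \min(\,\cdot\,, x_i)$, the derivative $f'(t) = \sum_i \alpha_i 1_{t < x_i}$ is piecewise constant, so the integral of $f'(t)^2$ can be computed exactly and compared with $\alpha^T K \alpha$.

```python
import numpy as np

def sobolev_norm_sq_via_kernel(alpha, pts):
    """alpha^T K alpha with K_ij = min(x_i, x_j)."""
    K = np.minimum.outer(pts, pts)
    return alpha @ K @ alpha

def sobolev_norm_sq_via_integral(alpha, pts):
    """Exact integral of f'(t)^2 over [0, 1] for f = sum_i alpha_i min(., x_i).

    f'(t) = sum_i alpha_i 1{t < x_i} is piecewise constant, so we integrate
    interval by interval between consecutive breakpoints.
    """
    breaks = np.concatenate(([0.0], np.sort(pts), [1.0]))
    total = 0.0
    for left, right in zip(breaks[:-1], breaks[1:]):
        mid = 0.5 * (left + right)
        slope = alpha[pts > mid].sum()      # value of f' on (left, right)
        total += slope**2 * (right - left)
    return total

rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 1.0, size=6)
alpha = rng.normal(size=6)
via_kernel = sobolev_norm_sq_via_kernel(alpha, pts)
via_integral = sobolev_norm_sq_via_integral(alpha, pts)
```

The two numbers agree because $\int_0^1 1_{t < x_i} 1_{t < x_j} \, dt = \min(x_i, x_j)$.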

2.4 Making predictions


Let’s now try to understand what exactly we are doing when we do ridge
regression with a kernel k. To do so, we first think carefully what we were doing
in ordinary ridge regression, which corresponds to using the linear kernel. For
the linear kernel, the L2 norm $\|\beta\|_2^2$ corresponds exactly to the norm in the
RKHS $\|f\|_H^2$. Thus, an alternative way of expressing ridge regression is

\[ \hat{f} = \operatorname*{argmin}_{f \in H} \left\{ \sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda \|f\|_H^2 \right\}, \tag{$*$} \]

where H is the RKHS of the linear kernel. Now this way of writing ridge
regression makes sense for an arbitrary RKHS, so we might think this is what
we should solve in general.
But if H is infinite dimensional, then naively, it would be quite difficult to
solve (∗). The solution is provided by the representer theorem.
Theorem (Representer theorem). Let H be an RKHS with reproducing kernel
k. Let c be an arbitrary loss function and J : [0, ∞) → R any strictly increasing
function. Then the minimizer $\hat{f} \in H$ of

\[ Q_1(f) = c(Y, x_1, \ldots, x_n, f(x_1), \ldots, f(x_n)) + J(\|f\|_H^2) \]

lies in the linear span of $\{k(\,\cdot\,, x_i)\}_{i=1}^n$.


Given this theorem, we know our optimal solution can be written in the form

\[ \hat{f}(\,\cdot\,) = \sum_{i=1}^n \hat{\alpha}_i k(\,\cdot\,, x_i), \]

and thus we can rewrite our optimization problem as looking for the $\hat{\alpha} \in \mathbb{R}^n$
that minimizes

\[ Q_2(\alpha) = c(Y, x_1, \ldots, x_n, K\alpha) + J(\alpha^T K \alpha) \]

over $\alpha \in \mathbb{R}^n$ (with $K_{ij} = k(x_i, x_j)$).


For an arbitrary c, this can still be quite difficult, but in the case of ridge
regression, this tells us that (∗) is equivalent to minimizing

\[ \|Y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha, \]

and by differentiating, we find that the minimizer satisfies

\[ K\hat{\alpha} = K(K + \lambda I)^{-1} Y. \]

Of course $K\hat{\alpha}$ is also the vector of fitted values, so we see that if we try to minimize
(∗), then the fitted values are indeed given by $K(K + \lambda I)^{-1} Y$, as in our original
motivation.
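As an illustration, here is a minimal sketch of kernel ridge regression in the form derived above, assuming NumPy and synthetic data. The helper names `gaussian_kernel` and `kernel_ridge_fit` are hypothetical. It takes $\hat{\alpha} = (K + \lambda I)^{-1} Y$, which satisfies the stationarity condition, and checks that with the linear kernel the fitted values coincide with ordinary ridge regression.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix of exp(-||a - b||^2 / (2 sigma^2)) between rows of A and B."""
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * sigma**2))

def kernel_ridge_fit(K, Y, lam):
    """Return alpha_hat = (K + lam I)^{-1} Y, one solution of the stationarity condition."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), Y)

rng = np.random.default_rng(0)
n, p, lam = 30, 2, 0.5
X = rng.normal(size=(n, p))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Gaussian kernel: fitted values are K alpha_hat = K (K + lam I)^{-1} Y.
K = gaussian_kernel(X, X)
alpha_hat = kernel_ridge_fit(K, Y, lam)
fitted = K @ alpha_hat

# Linear kernel: the same recipe reproduces ordinary ridge regression.
K_lin = X @ X.T
fitted_kernel = K_lin @ kernel_ridge_fit(K_lin, Y, lam)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
fitted_ridge = X @ beta_ridge

# Prediction at new points uses k(x, x_i) against the training points.
X_new = rng.normal(size=(3, p))
pred_new = gaussian_kernel(X_new, X) @ alpha_hat
```

Note that only the n × n matrix K is ever needed, never an explicit feature map.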
Proof. Suppose $\hat{f}$ minimizes $Q_1$. We can then write

\[ \hat{f} = u + v \]

where $u \in V = \operatorname{span}\{k(\,\cdot\,, x_i) : i = 1, \ldots, n\}$ and $v \in V^\perp$. Then

\[ \hat{f}(x_i) = \langle \hat{f}, k(\,\cdot\,, x_i) \rangle = \langle u + v, k(\,\cdot\,, x_i) \rangle = \langle u, k(\,\cdot\,, x_i) \rangle = u(x_i). \]

So we know that

\[ c(Y, x_1, \ldots, x_n, \hat{f}(x_1), \ldots, \hat{f}(x_n)) = c(Y, x_1, \ldots, x_n, u(x_1), \ldots, u(x_n)). \]

Meanwhile,
\[ \|\hat{f}\|_H^2 = \|u + v\|_H^2 = \|u\|_H^2 + \|v\|_H^2, \]
using the fact that u and v are orthogonal. So we know

\[ J(\|\hat{f}\|_H^2) \ge J(\|u\|_H^2) \]

with equality iff v = 0. Hence $Q_1(\hat{f}) \ge Q_1(u)$ with equality iff v = 0, and so we
must have v = 0 by optimality.
Thus, we know that the optimizer in fact lies in V , and then the formula of
$Q_2$ just expresses $Q_1$ in terms of α.

How well does our kernel machine do? Let H be an RKHS with reproducing
kernel k, and $f^0 \in H$. Consider a model

\[ Y_i = f^0(x_i) + \varepsilon_i \]

for i = 1, . . . , n, and assume $\mathbb{E}\varepsilon = 0$, $\operatorname{var}(\varepsilon) = \sigma^2 I$.

For convenience, we assume $\|f^0\|_H^2 \le 1$. There is no loss in generality by
assuming this, since we can always achieve this by scaling the norm.
Let K be the kernel matrix $K_{ij} = k(x_i, x_j)$ with eigenvalues $d_1 \ge d_2 \ge \cdots \ge d_n \ge 0$. Define

\[ \hat{f}_\lambda = \operatorname*{argmin}_{f \in H} \left\{ \sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda \|f\|_H^2 \right\}. \]


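Theorem. We have

\[ \frac{1}{n} \sum_{i=1}^n \mathbb{E}(f^0(x_i) - \hat{f}_\lambda(x_i))^2 \le \frac{\sigma^2}{n} \sum_{i=1}^n \frac{d_i^2}{(d_i + \lambda)^2} + \frac{\lambda}{4n} \le \frac{\sigma^2}{n} \frac{1}{\lambda} \sum_{i=1}^n \min\left( \frac{d_i}{4}, \lambda \right) + \frac{\lambda}{4n}. \]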

Given a data set, we can compute the eigenvalues di , and thus we can compute
this error bound.
Proof. We know from the representer theorem that

\[ (\hat{f}_\lambda(x_1), \ldots, \hat{f}_\lambda(x_n))^T = K(K + \lambda I)^{-1} Y. \]

Also, there is some $\alpha \in \mathbb{R}^n$ such that

\[ (f^0(x_1), \ldots, f^0(x_n))^T = K\alpha. \]

Moreover, on the example sheet, we show that

\[ 1 \ge \|f^0\|_H^2 \ge \alpha^T K \alpha. \]

Consider the eigen-decomposition $K = U D U^T$, where U is orthogonal, $D_{ii} = d_i$
and $D_{ij} = 0$ for $i \ne j$. Then we have
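\[ \mathbb{E} \sum_{i=1}^n (f^0(x_i) - \hat{f}_\lambda(x_i))^2 = \mathbb{E}\|K\alpha - K(K + \lambda I)^{-1}(K\alpha + \varepsilon)\|_2^2. \]

Noting that $K\alpha = (K + \lambda I)(K + \lambda I)^{-1} K\alpha$, we obtain

\begin{align*}
\mathbb{E} \sum_{i=1}^n (f^0(x_i) - \hat{f}_\lambda(x_i))^2
&= \mathbb{E}\left\| \lambda (K + \lambda I)^{-1} K\alpha - K(K + \lambda I)^{-1} \varepsilon \right\|_2^2 \\
&= \underbrace{\lambda^2 \left\| (K + \lambda I)^{-1} K\alpha \right\|_2^2}_{\text{(I)}} + \underbrace{\mathbb{E}\left\| K(K + \lambda I)^{-1} \varepsilon \right\|_2^2}_{\text{(II)}}.
\end{align*}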

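At this stage, we can throw in the eigen-decomposition of K to write (I) as

\begin{align*}
\text{(I)} &= \lambda^2 \left\| (U D U^T + \lambda I)^{-1} U D U^T \alpha \right\|_2^2 \\
&= \lambda^2 \| U (D + \lambda I)^{-1} \underbrace{D U^T \alpha}_{\theta} \|_2^2 \\
&= \sum_{i=1}^n \theta_i^2 \frac{\lambda^2}{(d_i + \lambda)^2}.
\end{align*}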

Now we have
\[ \alpha^T K \alpha = \alpha^T U D U^T \alpha = \alpha^T U D D^+ D U^T \alpha, \]
where $D^+$ is diagonal and
\[ D^+_{ii} = \begin{cases} d_i^{-1} & d_i > 0 \\ 0 & \text{otherwise} \end{cases}. \]

We can then write this as

\[ \alpha^T K \alpha = \sum_{i : d_i > 0} \frac{\theta_i^2}{d_i}. \]

The key thing we know about this is that it is ≤ 1.


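Note that by definition of $\theta_i$, we see that if $d_i = 0$, then $\theta_i = 0$. So we can
write
\[ \text{(I)} = \sum_{i : d_i > 0} \frac{\theta_i^2}{d_i} \cdot \frac{d_i \lambda^2}{(d_i + \lambda)^2} \le \lambda \max_{i = 1, \ldots, n} \frac{d_i \lambda}{(d_i + \lambda)^2} \]

by Hölder’s inequality with (p, q) = (1, ∞). Finally, use the inequality that

\[ (a + b)^2 \ge 4ab \]

to see that we have

\[ \text{(I)} \le \frac{\lambda}{4}. \]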
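So we are left with (II), which is a bit easier. Using the trace trick, we have

\begin{align*}
\text{(II)} &= \mathbb{E}\, \varepsilon^T (K + \lambda I)^{-1} K^2 (K + \lambda I)^{-1} \varepsilon \\
&= \mathbb{E} \operatorname{tr}\left[ K(K + \lambda I)^{-1} \varepsilon \varepsilon^T (K + \lambda I)^{-1} K \right] \\
&= \operatorname{tr}\left[ K(K + \lambda I)^{-1} \mathbb{E}(\varepsilon \varepsilon^T) (K + \lambda I)^{-1} K \right] \\
&= \sigma^2 \operatorname{tr}\left[ K^2 (K + \lambda I)^{-2} \right] \\
&= \sigma^2 \operatorname{tr}\left[ U D^2 U^T (U D U^T + \lambda I)^{-2} \right] \\
&= \sigma^2 \operatorname{tr}\left[ D^2 (D + \lambda I)^{-2} \right] \\
&= \sigma^2 \sum_{i=1}^n \frac{d_i^2}{(d_i + \lambda)^2}.
\end{align*}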

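Finally, writing $\frac{d_i^2}{(d_i + \lambda)^2} = \frac{d_i}{\lambda} \cdot \frac{d_i \lambda}{(d_i + \lambda)^2}$, we have

\[ \frac{d_i^2}{(d_i + \lambda)^2} \le \min\left( 1, \frac{d_i}{4\lambda} \right), \]
and we have the second bound.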
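The two bounds can be evaluated for a concrete design. The sketch below assumes NumPy, the Sobolev kernel $\min(x, x')$ and randomly drawn design points; all names are illustrative. It takes $f^0 = \sum_i \alpha_i k(\,\cdot\,, x_i)$ scaled so that $\alpha^T K \alpha = 1$, computes the exact mean squared error from the terms (I) and (II) in the proof, and compares it with the two bounds of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, sigma2 = 40, 0.3, 1.0

# Kernel matrix of the Sobolev kernel min(x, x') at random design points.
x = np.sort(rng.uniform(0.0, 1.0, size=n))
K = np.minimum.outer(x, x)
d = np.clip(np.linalg.eigvalsh(K), 0.0, None)       # eigenvalues d_i >= 0

# f^0 = sum_i alpha_i k(., x_i), rescaled so ||f^0||_H^2 = alpha^T K alpha = 1.
alpha = rng.normal(size=n)
alpha /= np.sqrt(alpha @ K @ alpha)

# Exact error at the design points: (term (I) + term (II)) / n.
M = np.linalg.inv(K + lam * np.eye(n))
bias_sq = lam**2 * np.sum((M @ K @ alpha) ** 2)     # term (I)
variance = sigma2 * np.sum(d**2 / (d + lam) ** 2)   # term (II)
exact_mse = (bias_sq + variance) / n

# The two bounds from the theorem.
bound_1 = variance / n + lam / (4 * n)
bound_2 = sigma2 / (n * lam) * np.sum(np.minimum(d / 4, lam)) + lam / (4 * n)
```

One expects `exact_mse <= bound_1 <= bound_2`, with the gap depending on how quickly the eigenvalues decay.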
How can we interpret this? If we look at the formula, then we see that we can
have good benefits if the $d_i$’s decay quickly, and the exact values of the $d_i$ depend
only on the choice of $x_i$. So suppose the $x_i$ are random and iid and independent
of ε. Then the entire analysis is still valid by conditioning on $x_1, \ldots, x_n$.
We define $\hat{\mu}_i = d_i / n$ and $\lambda_n = \lambda / n$. Then we can rewrite our result to say

\[ \frac{1}{n} \mathbb{E} \sum_{i=1}^n (f^0(x_i) - \hat{f}_\lambda(x_i))^2 \le \frac{\sigma^2}{\lambda_n} \frac{1}{n} \sum_{i=1}^n \mathbb{E} \min\left( \frac{\hat{\mu}_i}{4}, \lambda_n \right) + \frac{\lambda_n}{4} \equiv \mathbb{E}\delta_n(\lambda_n). \]

We want to relate this more directly to the kernel somehow. Given a density
p(x) on X , Mercer’s theorem guarantees the following eigenexpansion

\[ k(x, x') = \sum_{j=1}^\infty \mu_j e_j(x) e_j(x'), \]


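where the eigenvalues $\mu_j$ and eigenfunctions $e_j$ obey

\[ \mu_j e_j(x') = \int_X k(x, x') e_j(x) p(x) \, dx \]

and
\[ \int_X e_k(x) e_j(x) p(x) \, dx = 1_{j = k}. \]
One can then show that

\[ \frac{1}{n} \mathbb{E} \sum_{i=1}^n \min\left( \frac{\hat{\mu}_i}{4}, \lambda_n \right) \le \frac{1}{n} \sum_{i=1}^\infty \min\left( \frac{\mu_i}{4}, \lambda_n \right) \]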

up to some absolute constant factors. For a particular density p(x) of the input
space, we can try to solve the integral equation to figure out what the µj ’s are,
and we can find the expected bound.
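When k is the Sobolev kernel and p(x) is the uniform density, then it turns
out we have
\[ \frac{\mu_j}{4} = \frac{1}{\pi^2 (2j - 1)^2}. \]
By drawing a picture, we see that we can bound $\sum_{i=1}^\infty \min\left( \frac{\mu_i}{4}, \lambda_n \right)$ by

\[ \sum_{i=1}^\infty \min\left( \frac{\mu_i}{4}, \lambda_n \right) \le \int_0^\infty \lambda_n \wedge \frac{1}{\pi^2 (2j - 1)^2} \, dj \le \frac{\sqrt{\lambda_n}}{\pi} + \frac{\lambda_n}{2}. \]

So we find that
\[ \mathbb{E}(\delta_n(\lambda_n)) = O\left( \frac{\sigma^2}{n \lambda_n^{1/2}} + \lambda_n \right). \]
We can then find the $\lambda_n$ that minimizes this, and we see that we should pick
\[ \lambda_n \sim \left( \frac{\sigma^2}{n} \right)^{2/3}, \]
which gives an error rate of $\sim \left( \frac{\sigma^2}{n} \right)^{2/3}$.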
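The bound on the eigenvalue sum and the choice of $\lambda_n$ can be explored numerically. This is a rough sketch assuming NumPy; the function names and the grid are illustrative choices. It compares a long partial sum of $\min(\mu_j/4, \lambda_n)$ with $\sqrt{\lambda_n}/\pi + \lambda_n/2$, and minimizes $\sigma^2/(n\sqrt{\lambda_n}) + \lambda_n$ over a grid. Setting the derivative of that expression to zero gives the exact minimizer $(\sigma^2/(2n))^{2/3}$, which matches the rate quoted above up to a constant.

```python
import numpy as np

def truncated_sum(lam_n, terms=200_000):
    """Partial sum of min(mu_j / 4, lam_n), mu_j / 4 = 1 / (pi^2 (2j - 1)^2).

    The omitted tail is positive but tiny (of order 1 / terms).
    """
    j = np.arange(1, terms + 1)
    return np.minimum(1.0 / (np.pi**2 * (2 * j - 1) ** 2), lam_n).sum()

def integral_bound(lam_n):
    """The bound sqrt(lam_n) / pi + lam_n / 2."""
    return np.sqrt(lam_n) / np.pi + lam_n / 2

def rate(lam_n, sigma2, n):
    """The quantity sigma^2 / (n sqrt(lam_n)) + lam_n inside the O(.)."""
    return sigma2 / (n * np.sqrt(lam_n)) + lam_n

sigma2, n = 1.0, 1000
grid = np.logspace(-6, 0, 2001)
lam_best = grid[np.argmin(rate(grid, sigma2, n))]
lam_theory = (sigma2 / (2 * n)) ** (2 / 3)    # exact minimizer of rate(.)
```

The grid minimizer `lam_best` should sit within one grid step of `lam_theory`.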

2.5 Other kernel machines


Recall that the representer theorem was much more general than what we used
it for. We could apply it to any loss function. In particular, we can apply it
to classification problems. For example, we might want to predict whether an
email is spam. In this case, we have $Y \in \{-1, 1\}^n$.
Suppose we are trying to find a hyperplane that separates the two sets
$\{x_i\}_{i : Y_i = 1}$ and $\{x_i\}_{i : Y_i = -1}$. Consider the really optimistic case where they can
indeed be completely separated by a hyperplane through the origin. So there
exists $\beta \in \mathbb{R}^p$ (the normal vector) with $Y_i x_i^T \beta > 0$ for all i. Moreover, it would
be nice if we can maximize the minimum distance between any of the points and
the hyperplane defined by β. This is given by
\[ \max_{\beta} \min_{i} \frac{Y_i x_i^T \beta}{\|\beta\|_2}. \]
Thus, we can formulate our problem as

maximize M among $\beta \in \mathbb{R}^p$, $M \ge 0$ subject to $\frac{Y_i x_i^T \beta}{\|\beta\|_2} \ge M$ for all i.

This optimization problem gives the hyperplane that maximizes the margin
between the two classes.
Let’s think about the more realistic case where we cannot completely separate
the two sets. We then want to impose a penalty for each point on the wrong
side, and attempt to minimize the distance.
There are two options. We can use the penalty

\[ \lambda \sum_{i=1}^n \left( M - \frac{Y_i x_i^T \beta}{\|\beta\|_2} \right)_+ \]

where $t_+ = t 1_{t \ge 0}$. Another option is to take

\[ \lambda \sum_{i=1}^n \left( 1 - \frac{Y_i x_i^T \beta}{M \|\beta\|_2} \right)_+ \]

which is perhaps more natural, since now everything is “dimensionless”. We
will actually use neither, but will start with the second form and massage it to
something that can be attacked with our existing machinery. Starting with the
second option, we can write our optimization problem as

\[ \max_{M \ge 0,\, \beta \in \mathbb{R}^p} \left( M - \lambda \sum_{i=1}^n \left( 1 - \frac{Y_i x_i^T \beta}{M \|\beta\|_2} \right)_+ \right). \]

Since the objective function is invariant to any positive scaling of β, we may
assume $\|\beta\|_2 = \frac{1}{M}$, and rewrite this problem as maximizing

\[ \frac{1}{\|\beta\|_2} - \lambda \sum_{i=1}^n (1 - Y_i x_i^T \beta)_+. \]

Replacing max $\frac{1}{\|\beta\|_2}$ with minimizing $\|\beta\|_2^2$ and adding instead of subtracting the
penalty part, we modify this to say

\[ \min_{\beta \in \mathbb{R}^p} \left( \|\beta\|_2^2 + \lambda \sum_{i=1}^n (1 - Y_i x_i^T \beta)_+ \right). \]

The final change we make is that we replace λ with $\frac{1}{\lambda}$, and multiply the whole
expression by λ to get

\[ \min_{\beta \in \mathbb{R}^p} \left( \lambda \|\beta\|_2^2 + \sum_{i=1}^n (1 - Y_i x_i^T \beta)_+ \right). \]

This looks much more like what we’ve seen before, with $\lambda \|\beta\|_2^2$ being the penalty
term and $\sum_{i=1}^n (1 - Y_i x_i^T \beta)_+$ being the loss function.
The final modification is that we want to allow planes that don’t necessarily
pass through the origin. To do this, we allow ourselves to translate all the $x_i$’s
by a fixed vector $\delta \in \mathbb{R}^p$. This gives

\[ \min_{\beta \in \mathbb{R}^p,\, \delta \in \mathbb{R}^p} \left( \lambda \|\beta\|_2^2 + \sum_{i=1}^n \left( 1 - Y_i (x_i - \delta)^T \beta \right)_+ \right). \]


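Since δ only enters through the product $\delta^T \beta$, we can simply replace $-\delta^T \beta$ with a constant µ,
and write our problem as

\[ (\hat{\mu}, \hat{\beta}) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \sum_{i=1}^n \left( 1 - Y_i (x_i^T \beta + \mu) \right)_+ + \lambda \|\beta\|_2^2. \tag{$*$} \]

This gives the support vector classifier.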
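To make the objective (∗) concrete, here is a minimal sketch that minimizes it by plain subgradient descent. This is a technique chosen for illustration, not one described in the notes, and it assumes NumPy with synthetic, well-separated data. The function names, step-size schedule and data are all hypothetical choices.

```python
import numpy as np

def svc_objective(mu, beta, X, Y, lam):
    """sum_i (1 - Y_i (x_i^T beta + mu))_+ + lam ||beta||_2^2."""
    margins = Y * (X @ beta + mu)
    return np.maximum(1.0 - margins, 0.0).sum() + lam * beta @ beta

def fit_svc_subgradient(X, Y, lam, steps=500, step0=0.01):
    """Subgradient descent on the objective; returns the best (objective, mu, beta) seen."""
    n, p = X.shape
    mu, beta = 0.0, np.zeros(p)
    best = (svc_objective(mu, beta, X, Y, lam), mu, beta.copy())
    for t in range(steps):
        active = Y * (X @ beta + mu) < 1.0        # points with positive hinge loss
        g_beta = -(Y[active, None] * X[active]).sum(axis=0) + 2 * lam * beta
        g_mu = -Y[active].sum()
        eta = step0 / np.sqrt(t + 1)              # decaying step size
        beta = beta - eta * g_beta
        mu = mu - eta * g_mu
        obj = svc_objective(mu, beta, X, Y, lam)
        if obj < best[0]:
            best = (obj, mu, beta.copy())
    return best

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3.0, 0.5, size=(20, 2)),
               rng.normal(-3.0, 0.5, size=(20, 2))])
Y = np.concatenate([np.ones(20), -np.ones(20)])

obj, mu_hat, beta_hat = fit_svc_subgradient(X, Y, lam=0.1)
train_accuracy = np.mean(np.sign(X @ beta_hat + mu_hat) == Y)
```

In practice one would hand this convex problem to a dedicated solver; the loop above is only meant to show which terms enter the objective and its subgradient.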


This is still just fitting a hyperplane. Now we would like to “kernelize” this,
so that we can fit a surface instead. Let H be the RKHS corresponding to
the linear kernel. We can then write (∗) as

\[ (\hat{\mu}_\lambda, \hat{f}_\lambda) = \operatorname*{argmin}_{(\mu, f) \in \mathbb{R} \times H} \sum_{i=1}^n (1 - Y_i (f(x_i) + \mu))_+ + \lambda \|f\|_H^2. \]

The representer theorem (or rather, a slight variant of it) tells us that the above
optimization problem is equivalent to the support vector machine

\[ (\hat{\mu}_\lambda, \hat{\alpha}_\lambda) = \operatorname*{argmin}_{(\mu, \alpha) \in \mathbb{R} \times \mathbb{R}^n} \sum_{i=1}^n (1 - Y_i (K_i^T \alpha + \mu))_+ + \lambda \alpha^T K \alpha \]

where $K_{ij} = k(x_i, x_j)$, $K_i$ is the ith column of K, and k is the reproducing kernel of H. Predictions at a
new x are then given by

\[ \operatorname{sign}\left( \hat{\mu}_\lambda + \sum_{i=1}^n \hat{\alpha}_{\lambda, i} k(x, x_i) \right). \]

One possible alternative to this is to use logistic regression. Suppose that

\[ \log\left( \frac{\mathbb{P}(Y_i = 1)}{\mathbb{P}(Y_i = -1)} \right) = x_i^T \beta^0, \]

and that $Y_1, \ldots, Y_n$ are independent. An estimate of $\beta^0$ can be obtained through
maximizing the likelihood, or equivalently,

\[ \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \log(1 + \exp(-Y_i x_i^T \beta)). \]

To kernelize this, we introduce a penalty term of $\lambda \|\beta\|_2^2$, and then kernelize this.
The resulting optimization problem is then given by

\[ \operatorname*{argmin}_{f \in H} \left( \sum_{i=1}^n \log(1 + \exp(-Y_i f(x_i))) + \lambda \|f\|_H^2 \right). \]

We can then solve this using the representer theorem.

2.6 Large-scale kernel machines


How do we apply kernel machines at large scale? Whenever we want to make a
prediction with a kernel machine, we need O(n) many steps, which is quite a pain
if we work with a large data set, say a few million of them. But even before that,

forming $K \in \mathbb{R}^{n \times n}$ and inverting $K + \lambda I$ or equivalent can be computationally
too expensive.
One of the more popular approaches to tackle these difficulties is to form a
randomized feature map $\hat{\phi} : X \to \mathbb{R}^b$ such that

\[ \mathbb{E}\hat{\phi}(x)^T \hat{\phi}(x') = k(x, x'). \]

To increase the quality of the approximation, we can use

\[ x \mapsto \frac{1}{\sqrt{L}} (\hat{\phi}_1(x), \ldots, \hat{\phi}_L(x))^T \in \mathbb{R}^{Lb}, \]
where the $\hat{\phi}_i(x)$ are iid with the same distribution as $\hat{\phi}$.
Let Φ be the matrix with ith row $\frac{1}{\sqrt{L}} (\hat{\phi}_1(x_i), \ldots, \hat{\phi}_L(x_i))$. When performing,
e.g. kernel ridge regression, we can compute

\[ \hat{\theta} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y. \]

The computational cost of this is $O((Lb)^3 + n(Lb)^2)$. The key thing of course is
that this is linear in n, and we can choose the size of L so that we have a good
trade-off between the computational cost and the accuracy of the approximation.
Now to predict a new x, we form
\[ \frac{1}{\sqrt{L}} (\hat{\phi}_1(x), \ldots, \hat{\phi}_L(x))^T \hat{\theta}, \]
and this is O(Lb).
In 2007, Rahimi and Recht proposed a random map for shift-invariant kernels,
i.e. kernels k such that $k(x, x') = h(x - x')$ for some h (we work with $X = \mathbb{R}^p$).
A common example is the Gaussian kernel.
One motivation of this comes from a classical result known as Bochner’s
theorem.
Theorem (Bochner’s theorem). Let $k : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ be a continuous kernel.
Then k is shift-invariant if and only if there exists some distribution F on $\mathbb{R}^p$
and c > 0 such that if W ∼ F , then

\[ k(x, x') = c\, \mathbb{E} \cos((x - x')^T W). \]

Our strategy is then to find some $\hat{\phi}$ depending on W (and possibly on some additional independent randomness) such that

\[ c \cos((x - x')^T W) = \mathbb{E}[\hat{\phi}(x) \hat{\phi}(x') \mid W]. \]

We are going to take b = 1, so we don’t need a transpose on the right.


The idea is then to play around with trigonometric identities to try to write
$c \cos((x - x')^T W)$ in this form. After some work, we find the following solution:
Let u ∼ U [−π, π], and take x, y ∈ R. Then

\[ \mathbb{E} \cos(x + u) \cos(y + u) = \mathbb{E}(\cos x \cos u - \sin x \sin u)(\cos y \cos u - \sin y \sin u). \]

Since u has the same distribution as −u, we see that $\mathbb{E} \cos u \sin u = 0$.
Also, $\cos^2 u + \sin^2 u = 1$. Since u ranges uniformly in [−π, π], by symmetry,
we have $\mathbb{E} \cos^2 u = \mathbb{E} \sin^2 u = \frac{1}{2}$. So we find that

\[ 2\, \mathbb{E} \cos(x + u) \cos(y + u) = \cos x \cos y + \sin x \sin y = \cos(x - y). \]


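Thus, given a shift-invariant kernel k with associated distribution F , we set
W ∼ F independently of u. Define

\[ \hat{\phi}(x) = \sqrt{2c} \cos(W^T x + u). \]

Then

\begin{align*}
\mathbb{E}\hat{\phi}(x) \hat{\phi}(x') &= 2c\, \mathbb{E}[\mathbb{E}[\cos(W^T x + u) \cos(W^T x' + u) \mid W]] \\
&= c\, \mathbb{E} \cos(W^T (x - x')) \\
&= k(x, x').
\end{align*}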

In order to get this to work, we must find the appropriate W . In certain
cases, this W is actually not too hard to find:

Example. Take
\[ k(x, x') = \exp\left( -\frac{1}{2\sigma^2} \|x - x'\|_2^2 \right), \]
the Gaussian kernel. If $W \sim N(0, \sigma^{-2} I)$, then
\[ \mathbb{E} e^{i t^T W} = e^{-\|t\|_2^2 / (2\sigma^2)} = \mathbb{E} \cos(t^T W). \]
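The construction can be tried out directly for the Gaussian kernel, where c = 1. Below is a minimal sketch assuming NumPy and synthetic data; the helper name `random_fourier_features` and the sizes are illustrative. It stacks L independent copies of $\sqrt{2} \cos(W^T x + u)$ scaled by $1/\sqrt{L}$, compares $\Phi \Phi^T$ with the exact kernel matrix, and then runs ridge regression in feature space.

```python
import numpy as np

def random_fourier_features(X, W, u):
    """Map each row x to sqrt(2 / L) * cos(W_l^T x + u_l), l = 1..L (b = 1, c = 1)."""
    L = W.shape[0]
    return np.sqrt(2.0 / L) * np.cos(X @ W.T + u)

rng = np.random.default_rng(0)
n, p, L, sigma, lam = 200, 3, 2000, 1.5, 0.1
X = rng.normal(size=(n, p))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

W = rng.normal(scale=1.0 / sigma, size=(L, p))    # W_l ~ N(0, sigma^{-2} I)
u = rng.uniform(-np.pi, np.pi, size=L)            # u_l ~ U[-pi, pi]
Phi = random_fourier_features(X, W, u)            # n x L feature matrix

# Phi Phi^T should approximate the Gaussian kernel matrix.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2 * sigma**2))
max_abs_error = np.abs(Phi @ Phi.T - K_exact).max()

# Ridge regression in feature space: theta_hat = (Phi^T Phi + lam I)^{-1} Phi^T Y.
theta_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(L), Phi.T @ Y)
x_new = rng.normal(size=(1, p))
prediction = random_fourier_features(x_new, W, u) @ theta_hat
```

The approximation error shrinks like $1/\sqrt{L}$, so L trades accuracy against the cost of the L × L solve.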


3 The Lasso and beyond


We are interested in the situation where we have a very large number of variables
compared to the size of the data set. For example the data set might be all the
genetic and environmental information about patients, and we want to predict if
they have diabetes. The key property is that we expect that most of the data
are not useful, and we want to select a small subset of variables that are relevant,
and form prediction models based on them.
In these cases, ordinary least squares is not a very good tool. If there are
more variables than the number of data points, then it is likely that we can
pick the regression coefficients so that our model fits our data exactly, but this
model is likely to be spurious, since we are fine-tuning our model based on the
random fluctuations of a large number of irrelevant variables. Thus, we want to
find a regression method that penalizes non-zero coefficients. Note that Ridge
regression is not good enough: it encourages small coefficients, but since it
uses the ℓ2 norm, it is quite lenient towards small, non-zero coefficients, as the
square of a small number is really small.
There is another reason to favour models that set a lot of coefficients to
zero. Consider the linear model
\[ Y = X\beta^0 + \varepsilon, \]
where as usual $\mathbb{E}\varepsilon = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2 I$. Let $S = \{k : \beta^0_k \ne 0\}$, and let s = |S|.
As before, we suppose we have some a priori belief that $s \ll p$.
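For the purposes of this investigation, suppose X has full column rank, so
that we can use ordinary least squares. Then the prediction error is
\begin{align*}
\frac{1}{n} \mathbb{E}\|X\beta^0 - X\hat{\beta}^{OLS}\|_2^2
&= \frac{1}{n} \mathbb{E}(\hat{\beta}^{OLS} - \beta^0)^T X^T X (\hat{\beta}^{OLS} - \beta^0) \\
&= \frac{1}{n} \mathbb{E} \operatorname{tr}\left[ (\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^T X^T X \right] \\
&= \frac{1}{n} \operatorname{tr}\left[ \mathbb{E}(\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^T X^T X \right] \\
&= \frac{1}{n} \operatorname{tr}\left[ \sigma^2 (X^T X)^{-1} X^T X \right] \\
&= \sigma^2 \frac{p}{n}.
\end{align*}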
Note that this does not depend on what $\beta^0$ is, but only on $\sigma^2$, p and n.
In the situation we are interested in, we expect $s \ll p$. So if we can find
S and run ordinary least squares just on these variables, then we would have a mean
squared prediction error of $\sigma^2 \frac{s}{n}$, which would be much smaller.
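The two risks can be compared exactly without simulation. The sketch below assumes NumPy and a random Gaussian design; `ols_prediction_risk` is a hypothetical helper. It uses the fact that $X\hat{\beta}^{OLS} - X\beta^0 = P\varepsilon$ with P the hat matrix, so the risk equals $\sigma^2 \operatorname{tr}(P)/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, sigma2 = 100, 20, 3, 2.0
X = rng.normal(size=(n, p))           # full column rank with probability one

def ols_prediction_risk(X_sub, sigma2):
    """Exact E ||X beta0 - X beta_hat||^2 / n for OLS on the columns X_sub.

    Assumes the true support lies within the columns of X_sub, so that
    X beta_hat - X beta0 = P eps with P the hat matrix of X_sub.
    """
    P = X_sub @ np.linalg.solve(X_sub.T @ X_sub, X_sub.T)
    return sigma2 * np.trace(P) / X_sub.shape[0]

risk_full = ols_prediction_risk(X, sigma2)             # all p columns
risk_oracle = ols_prediction_risk(X[:, :s], sigma2)    # only s relevant columns
```

Here `risk_full` is $\sigma^2 p / n$ and `risk_oracle` is $\sigma^2 s / n$, whatever the design.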
We first discuss a few classical model selection methods that attempt this.
The first approach we may think of is the best subsets method, where we try
to do regression on all possible choices of S and see which is the “best”. For any
set M of indices, we let XM be the submatrix of X formed from the columns of
X with index in M . We then regress Y on XM for every M ⊆ {1, . . . , p}, and
then pick the best model via cross-validation, for example. A big problem with
this is that the number of subsets grows exponentially with p, and so becomes
infeasible for, say, p > 50.
Another method might be forward selection. We start with an intercept-only
model, and then add to the existing model the predictor that reduces the RSS
the most, and then keep doing this until a fixed number of predictors have been
added. This is quite a nice approach, and is computationally quite fast even
if p is large. However, this method is greedy, and if we make a mistake at the
beginning, the method cannot undo it. In general, this method is rather unstable,
and is not great from a practical perspective.
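The greedy procedure can be sketched as follows, assuming NumPy and synthetic data in which two predictors carry the signal; `rss` and `forward_selection` are illustrative names. Each step refits least squares with one extra candidate column and keeps the column giving the smallest residual sum of squares.

```python
import numpy as np

def rss(X_sub, Y):
    """Residual sum of squares of the least-squares fit of Y on X_sub."""
    coef, *_ = np.linalg.lstsq(X_sub, Y, rcond=None)
    resid = Y - X_sub @ coef
    return resid @ resid

def forward_selection(X, Y, max_predictors):
    """Greedy forward selection starting from the intercept-only model."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    for _ in range(max_predictors):
        candidates = [j for j in range(p) if j not in selected]
        scores = [rss(np.hstack([intercept, X[:, selected + [j]]]), Y)
                  for j in candidates]
        selected.append(candidates[int(np.argmin(scores))])
    return selected

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
Y = 5.0 * X[:, 2] - 4.0 * X[:, 7] + 0.1 * rng.normal(size=n)
chosen = forward_selection(X, Y, max_predictors=2)
```

Each step costs at most p least-squares fits, in contrast with the $2^p$ fits of best subsets.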

3.1 The Lasso estimator


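The Lasso (Tibshirani, 1996) is a seemingly small modification of Ridge regression
that solves our problems. It solves
\[ (\hat{\mu}^L_\lambda, \hat{\beta}^L_\lambda) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \frac{1}{2n} \|Y - \mu \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_1. \]

The key difference is that we use an ℓ1 norm on β rather than the ℓ2 norm,
\[ \|\beta\|_1 = \sum_{k=1}^p |\beta_k|. \]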

This makes it drastically different from Ridge regression. We will see that for λ
large, it will make all of the entries of β exactly zero, as opposed to being very
close to zero.
We can compare this to best subset regression, where we replace $\|\beta\|_1$ with
something of the form $\sum_{k=1}^p 1_{\beta_k \ne 0}$. But the beautiful property of the Lasso
is that the optimization problem is now continuous, and in fact convex. This
allows us to solve it using standard convex optimization techniques.
Why is the ℓ1 norm so different from the ℓ2 norm? Just as in Ridge regression,
we may center and scale X, and center Y , so that we can remove µ from the
objective. Define
\[ Q_\lambda(\beta) = \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1. \]
Any minimizer $\hat{\beta}^L_\lambda$ of $Q_\lambda(\beta)$ must also be a solution to

\[ \min \|Y - X\beta\|_2^2 \quad \text{subject to } \|\beta\|_1 \le \|\hat{\beta}^L_\lambda\|_1. \]

Similarly, $\hat{\beta}^R_\lambda$ is a solution of

\[ \min \|Y - X\beta\|_2^2 \quad \text{subject to } \|\beta\|_2 \le \|\hat{\beta}^R_\lambda\|_2. \]

So imagine we are given a value of $\|\hat{\beta}^L_\lambda\|_1$, and we try to solve the above
optimization problem with pictures. The region $\|\beta\|_1 \le \|\hat{\beta}^L_\lambda\|_1$ is given by a
rotated square


On the other hand, the minimum of $\|Y - X\beta\|_2^2$ is at $\hat{\beta}^{OLS}$, and the contours
are ellipses centered around this point.

To solve the minimization problem, we should pick the smallest contour that hits
the square, and pick the intersection point to be our estimate of $\beta^0$. The point
is that since the unit ball in the ℓ1-norm has these corners, this estimate is likely to
be on the corners, hence has a lot of zeroes. Compare this to the case of Ridge
regression, where the constraint set is a ball:

Generically, for Ridge regression, we would expect the solution to be non-zero
everywhere.
More generally, we can try to use the ℓq norm given by
\[ \|\beta\|_q = \left( \sum_{k=1}^p |\beta_k|^q \right)^{1/q}. \]

We can plot their unit spheres, and see that they look like

[Figure: unit spheres of the ℓq norm for q = 0.5, q = 1, q = 1.5 and q = 2.]

We see that q = 1 is the largest value of q for which there are corners, and
also the smallest value of q for which the constraint set is still convex. Thus, q = 1
is the sweet spot for doing regression.


But is this actually good? Suppose the columns of X are scaled to have ℓ2
norm $\sqrt{n}$, and, after centering, we have a normal linear model

\[ Y = X\beta^0 + \varepsilon - \bar{\varepsilon} \mathbf{1}, \]

with $\varepsilon \sim N_n(0, \sigma^2 I)$.

Theorem. Let $\hat{\beta}$ be the Lasso solution with

\[ \lambda = A\sigma \sqrt{\frac{\log p}{n}} \]
for some A. Then with probability at least $1 - 2p^{-(A^2/2 - 1)}$, we have
\[ \frac{1}{n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le 4A\sigma \sqrt{\frac{\log p}{n}} \|\beta^0\|_1. \]
Crucially, this grows like $\sqrt{\log p}$, and not like p. On the other hand,
unlike what we usually see for ordinary least squares, we have $\frac{1}{\sqrt{n}}$, and not $\frac{1}{n}$.
We will later obtain better bounds when we make some assumptions on the
design matrix.
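Proof. We don’t really have a closed form solution for $\hat{\beta}$, and in general it doesn’t
exist. So the only thing we can use is that it in fact minimizes $Q_\lambda(\beta)$. Thus, by
definition, we have
\[ \frac{1}{2n} \|Y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_1 \le \frac{1}{2n} \|Y - X\beta^0\|_2^2 + \lambda \|\beta^0\|_1. \]
We know exactly what Y is. It is $X\beta^0 + \varepsilon - \bar{\varepsilon} \mathbf{1}$. If we plug this in, we get
\[ \frac{1}{2n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le \frac{1}{n} \varepsilon^T X(\hat{\beta} - \beta^0) + \lambda \|\beta^0\|_1 - \lambda \|\hat{\beta}\|_1. \]
Here we use the fact that X is centered, and so is orthogonal to $\mathbf{1}$.
Now Hölder tells us

\[ |\varepsilon^T X(\hat{\beta} - \beta^0)| \le \|X^T \varepsilon\|_\infty \|\hat{\beta} - \beta^0\|_1. \]

We’d like to bound $\|X^T \varepsilon\|_\infty$, but it can be arbitrarily large since ε is a Gaussian.
However, with high probability, it is small. Precisely, define
\[ \Omega = \left\{ \frac{1}{n} \|X^T \varepsilon\|_\infty \le \lambda \right\}. \]
In a later lemma, we will show that $\mathbb{P}(\Omega) \ge 1 - 2p^{-(A^2/2 - 1)}$. Assuming Ω
holds, we have
\[ \frac{1}{2n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le \lambda \|\hat{\beta} - \beta^0\|_1 - \lambda \|\hat{\beta}\|_1 + \lambda \|\beta^0\|_1 \le 2\lambda \|\beta^0\|_1, \]
where the last step is the triangle inequality. Multiplying by 2 and substituting the value of λ gives the claim.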


3.2 Basic concentration inequalities


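We now have to prove the lemma we left out in the theorem just now. In this
section, we are going to prove a bunch of useful inequalities that we will later
use to prove bounds.
Consider the event Ω as defined before. Then
\begin{align*}
\mathbb{P}\left( \frac{1}{n} \|X^T \varepsilon\|_\infty > \lambda \right)
&= \mathbb{P}\left( \bigcup_{j=1}^p \left\{ \frac{1}{n} |X_j^T \varepsilon| > \lambda \right\} \right) \\
&\le \sum_{j=1}^p \mathbb{P}\left( \frac{1}{n} |X_j^T \varepsilon| > \lambda \right).
\end{align*}
Now $\frac{1}{n} X_j^T \varepsilon \sim N(0, \frac{\sigma^2}{n})$. So we just have to bound tail probabilities of normal
random variables.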
The simplest tail bound we have is Markov’s inequality.
Lemma (Markov’s inequality). Let W be a non-negative random variable. Then for any t > 0,
\[ \mathbb{P}(W \ge t) \le \frac{1}{t} \mathbb{E}W. \]
Proof. We have
\[ t 1_{W \ge t} \le W. \]
The result then follows from taking expectations and then dividing both sides
by t.
While this is a very simple bound, it is actually quite powerful, because it
assumes almost nothing about W . This allows for the following trick: given any
strictly increasing function ϕ : R → (0, ∞) and any random variable W , we have
\[ \mathbb{P}(W \ge t) = \mathbb{P}(\varphi(W) \ge \varphi(t)) \le \frac{\mathbb{E}\varphi(W)}{\varphi(t)}. \]
So we get a bound on the tail probability of W for any such function. Even
better, we can minimize the right hand side over a class of functions to get an
even better bound.
In particular, applying this with $\varphi(t) = e^{\alpha t}$ gives
Corollary (Chernoff bound). For any random variable W , we have
\[ \mathbb{P}(W \ge t) \le \inf_{\alpha > 0} e^{-\alpha t} \mathbb{E}e^{\alpha W}. \]

Note that $\mathbb{E}e^{\alpha W}$ is just the moment generating function of W , which we can
compute quite straightforwardly.
We can immediately apply this when W is a normal random variable, $W \sim N(0, \sigma^2)$. Then
\[ \mathbb{E}e^{\alpha W} = e^{\alpha^2 \sigma^2 / 2}. \]
So we have
\[ \mathbb{P}(W \ge t) \le \inf_{\alpha > 0} \exp\left( \frac{\alpha^2 \sigma^2}{2} - \alpha t \right) = e^{-t^2 / (2\sigma^2)}. \]
Observe that in fact this tail bound works for any random variable whose moment
generating function is bounded above by $e^{\alpha^2 \sigma^2 / 2}$.
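To see how tight the Chernoff bound is in the Gaussian case, one can compare it with the exact tail probability. The sketch below uses only the Python standard library; the function names are illustrative. The exact tail is $\frac{1}{2} \operatorname{erfc}(t / (\sigma \sqrt{2}))$, and the infimum in the bound is attained at $\alpha = t / \sigma^2$.

```python
import math

def gaussian_upper_tail(t, sigma):
    """Exact P(W >= t) for W ~ N(0, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc(t / (sigma * math.sqrt(2.0)))

def chernoff_bound(t, sigma):
    """Chernoff bound exp(-t^2 / (2 sigma^2)), attained at alpha = t / sigma^2."""
    return math.exp(-t**2 / (2.0 * sigma**2))

sigma = 2.0
table = [(t, gaussian_upper_tail(t, sigma), chernoff_bound(t, sigma))
         for t in (0.5, 1.0, 2.0, 4.0, 8.0)]
```

Each row of `table` holds t, the exact tail and the bound; the bound is loose for small t but captures the right exponential decay.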


Definition (Sub-Gaussian random variable). A random variable W is sub-Gaussian
(with parameter σ) if
\[ \mathbb{E}e^{\alpha(W - \mathbb{E}W)} \le e^{\alpha^2 \sigma^2 / 2} \]

for all α ∈ R.
Corollary. Any sub-Gaussian random variable W with parameter σ satisfies
\[ \mathbb{P}(W - \mathbb{E}W \ge t) \le e^{-t^2 / (2\sigma^2)} \]
for all t ≥ 0.

In general, bounded random variables are sub-Gaussian.


Lemma (Hoeffding’s lemma). If W has mean zero and takes values in [a, b],
then W is sub-Gaussian with parameter $\frac{b - a}{2}$.

Recall that the sum of two independent Gaussians is still a Gaussian. This
continues to hold for sub-Gaussian random variables.
Proposition. Let $(W_i)_{i=1}^n$ be independent mean-zero sub-Gaussian random
variables with parameters $(\sigma_i)_{i=1}^n$, and let $\gamma \in \mathbb{R}^n$. Then $\gamma^T W$ is sub-Gaussian
with parameter
\[ \left( \sum_{i=1}^n (\gamma_i \sigma_i)^2 \right)^{1/2}. \]

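Proof. We have
\begin{align*}
\mathbb{E} \exp\left( \alpha \sum_{i=1}^n \gamma_i W_i \right)
&= \prod_{i=1}^n \mathbb{E} \exp(\alpha \gamma_i W_i) \\
&\le \prod_{i=1}^n \exp\left( \frac{\alpha^2}{2} \gamma_i^2 \sigma_i^2 \right) \\
&= \exp\left( \frac{\alpha^2}{2} \sum_{i=1}^n \sigma_i^2 \gamma_i^2 \right).
\end{align*}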

We can now prove our bound for the Lasso, which in fact works for any
sub-Gaussian random variable.
Lemma. Suppose $(\varepsilon_i)_{i=1}^n$ are independent, mean-zero sub-Gaussian with common
parameter σ. Let
\[ \lambda = A\sigma \sqrt{\frac{\log p}{n}}. \]

Let X be a matrix whose columns all have norm $\sqrt{n}$. Then
\[ \mathbb{P}\left( \frac{1}{n} \|X^T \varepsilon\|_\infty \le \lambda \right) \ge 1 - 2p^{-(A^2/2 - 1)}. \]

In particular, this includes $\varepsilon \sim N_n(0, \sigma^2 I)$.


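Proof. We have
\[ \mathbb{P}\left( \frac{1}{n} \|X^T \varepsilon\|_\infty > \lambda \right) \le \sum_{j=1}^p \mathbb{P}\left( \frac{1}{n} |X_j^T \varepsilon| > \lambda \right). \]

But $\pm \frac{1}{n} X_j^T \varepsilon$ are both sub-Gaussian with parameter

\[ \sigma \left( \sum_i \left( \frac{X_{ij}}{n} \right)^2 \right)^{1/2} = \frac{\sigma}{\sqrt{n}}. \]
Then by our previous corollary, we get
\[ \sum_{j=1}^p \mathbb{P}\left( \frac{1}{n} |X_j^T \varepsilon| > \lambda \right) \le 2p \exp\left( -\frac{\lambda^2 n}{2\sigma^2} \right). \]

Note that we have the factor of 2 since we need to consider the two cases
$\frac{1}{n} X_j^T \varepsilon > \lambda$ and $-\frac{1}{n} X_j^T \varepsilon > \lambda$.
Plugging in our expression of λ, we write the bound as
\[ 2p \exp\left( -\frac{1}{2} A^2 \log p \right) = 2p^{1 - A^2/2}. \]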
This is all we need for our result on the Lasso. We are going to go a bit
further into this topic of concentration inequalities, because we will need them
later when we impose conditions on the design matrix. In particular, we would
like to bound the tail probabilities of products.
Definition (Bernstein’s condition). We say that a random variable W satisfies
Bernstein’s condition with parameters (σ, b) where σ, b > 0 if
\[ \mathbb{E}[|W - \mathbb{E}W|^k] \le \frac{1}{2} k!\, \sigma^2 b^{k-2} \]
for k = 2, 3, . . ..
The point is that these bounds on the moments let us bound the moment
generating function of W .
Proposition (Bernstein’s inequality). Let $W_1, W_2, \ldots, W_n$ be independent random
variables with $\mathbb{E}W_i = \mu$, and suppose each $W_i$ satisfies Bernstein’s condition
with parameters (σ, b). Then
\begin{align*}
\mathbb{E}e^{\alpha(W_i - \mu)} &\le \exp\left( \frac{\alpha^2 \sigma^2 / 2}{1 - b|\alpha|} \right) && \text{for all } |\alpha| < \frac{1}{b}, \\
\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n W_i - \mu \ge t \right) &\le \exp\left( -\frac{n t^2}{2(\sigma^2 + bt)} \right) && \text{for all } t \ge 0.
\end{align*}
Note that for large t, the bound goes as $e^{-t}$ instead of $e^{-t^2}$.
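Proof. For the first part, we fix i and write $W = W_i$. Let $|\alpha| < \frac{1}{b}$. Then
\begin{align*}
\mathbb{E}e^{\alpha(W - \mu)} &= \sum_{k=0}^\infty \frac{1}{k!} \alpha^k\, \mathbb{E}(W - \mu)^k \\
&\le 1 + \frac{\sigma^2 \alpha^2}{2} \sum_{k=2}^\infty |\alpha|^{k-2} b^{k-2} \\
&= 1 + \frac{\sigma^2 \alpha^2}{2} \frac{1}{1 - |\alpha| b} \\
&\le \exp\left( \frac{\alpha^2 \sigma^2 / 2}{1 - b|\alpha|} \right),
\end{align*}
where the k = 1 term vanishes since $\mathbb{E}(W - \mu) = 0$.
Then
\begin{align*}
\mathbb{E} \exp\left( \frac{1}{n} \sum_{i=1}^n \alpha(W_i - \mu) \right)
&= \prod_{i=1}^n \mathbb{E} \exp\left( \frac{\alpha}{n} (W_i - \mu) \right) \\
&\le \exp\left( n \frac{\left( \frac{\alpha}{n} \right)^2 \sigma^2 / 2}{1 - b \left| \frac{\alpha}{n} \right|} \right),
\end{align*}
assuming $\left| \frac{\alpha}{n} \right| < \frac{1}{b}$. So it follows that

\[ \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n W_i - \mu \ge t \right) \le e^{-\alpha t} \exp\left( n \frac{\left( \frac{\alpha}{n} \right)^2 \sigma^2 / 2}{1 - b \left| \frac{\alpha}{n} \right|} \right). \]

Setting
\[ \frac{\alpha}{n} = \frac{t}{bt + \sigma^2} \in \left[ 0, \frac{1}{b} \right) \]
gives the result.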
Lemma. Let W, Z be mean-zero sub-Gaussian random variables with parameters
σW and σZ respectively. Then W Z satisfies Bernstein’s condition with parameter
(8σW σZ , 4σW σZ ).
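Proof. For any random variable Y (which we will later take to be W Z), for
k > 1, we know
\begin{align*}
\mathbb{E}|Y - \mathbb{E}Y|^k &= 2^k\, \mathbb{E}\left| \frac{1}{2} Y - \frac{1}{2} \mathbb{E}Y \right|^k \\
&\le 2^k\, \mathbb{E}\left( \frac{1}{2} |Y| + \frac{1}{2} |\mathbb{E}Y| \right)^k.
\end{align*}

Note that
\[ \left( \frac{1}{2} |Y| + \frac{1}{2} |\mathbb{E}Y| \right)^k \le \frac{|Y|^k + |\mathbb{E}Y|^k}{2} \]
by Jensen’s inequality. Applying Jensen’s inequality again, we have

\[ |\mathbb{E}Y|^k \le \mathbb{E}|Y|^k. \]

Putting the whole thing together, we have

\[ \mathbb{E}|Y - \mathbb{E}Y|^k \le 2^k\, \mathbb{E}|Y|^k. \]

Now take Y = W Z. Then

\[ \mathbb{E}|WZ - \mathbb{E}WZ|^k \le 2^k\, \mathbb{E}|WZ|^k \le 2^k \left( \mathbb{E}W^{2k} \right)^{1/2} \left( \mathbb{E}Z^{2k} \right)^{1/2}, \]

by the Cauchy–Schwarz inequality.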


We know that sub-Gaussians satisfy a bound on the tail probability. We can
then use this to bound the moments of W and Z. First note that
\[ W^{2k} = \int_0^\infty 1_{x < W^{2k}} \, dx. \]

Taking expectations of both sides, we get

\[ \mathbb{E}W^{2k} = \int_0^\infty \mathbb{P}(x < W^{2k}) \, dx. \]

Since we have a tail bound on W instead of $W^{2k}$, we substitute $x = t^{2k}$. Then
$dx = 2k t^{2k-1} \, dt$. So we get
\begin{align*}
\mathbb{E}W^{2k} &= 2k \int_0^\infty t^{2k-1}\, \mathbb{P}(|W| > t) \, dt \\
&\le 4k \int_0^\infty t^{2k-1} \exp\left( -\frac{t^2}{2\sigma_W^2} \right) dt,
\end{align*}
where again we have a factor of 2 to account for both signs. We perform yet
another substitution
\[ x = \frac{t^2}{2\sigma_W^2}, \quad dx = \frac{t}{\sigma_W^2} \, dt. \]
Then we get
\[ \mathbb{E}W^{2k} \le 2^{k+1} \sigma_W^{2k}\, k \int_0^\infty x^{k-1} e^{-x} \, dx = 2^{k+1} k!\, \sigma_W^{2k}. \]

Plugging this back in, we have

\begin{align*}
\mathbb{E}|WZ - \mathbb{E}WZ|^k &\le 2^k\, 2^{k+1} k!\, \sigma_W^k \sigma_Z^k \\
&= \frac{1}{2} k!\, 2^{2k+2} \sigma_W^k \sigma_Z^k \\
&= \frac{1}{2} k!\, (8\sigma_W \sigma_Z)^2 (4\sigma_W \sigma_Z)^{k-2}.
\end{align*}

3.3 Convex analysis and optimization theory


We’ll leave these estimates aside for a bit, and give more background on convex
analysis and convex optimization. Recall the following definition:
Definition (Convex set). A set $A \subseteq \mathbb{R}^d$ is convex if for any x, y ∈ A and
t ∈ [0, 1], we have
\[ (1 - t)x + ty \in A. \]

[Figure: a non-convex set (left) and a convex set (right).]

We are actually more interested in convex functions. We shall allow our functions
to take the value ∞, so let us define $\bar{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$. The point is that if we want our
function to be defined in [a, b], then it is convenient to extend it to be defined
on all of R by setting the function to be ∞ outside of [a, b].


Definition (Convex function). A function $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is convex iff

\[ f((1 - t)x + ty) \le (1 - t)f(x) + tf(y) \]

for all $x, y \in \mathbb{R}^d$ and t ∈ (0, 1). Moreover, we require that f (x) < ∞ for at least
one x.
We say it is strictly convex if the inequality is strict for all x ≠ y and t ∈ (0, 1).

[Figure: graph of a convex function, showing the chord value (1 − t)f(x) + tf(y) at the point (1 − t)x + ty between x and y.]

Definition (Domain). Define the domain of a function f : Rd → R̄ to be

dom f = {x : f (x) < ∞}.

One sees that the domain of a convex function is always convex.


Proposition.
(i) Let $f_1, \ldots, f_m : \mathbb{R}^d \to \bar{\mathbb{R}}$ be convex with $\operatorname{dom} f_1 \cap \cdots \cap \operatorname{dom} f_m \ne \emptyset$, and
let $c_1, \ldots, c_m \ge 0$. Then $c_1 f_1 + \cdots + c_m f_m$ is a convex function.
(ii) If $f : \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable, then
(a) f is convex iff its Hessian is positive semi-definite everywhere.
(b) f is strictly convex if its Hessian is positive definite everywhere.
Note that having a positive definite Hessian is not necessary for strict convexity,
e.g. $x \mapsto x^4$ is strictly convex but has vanishing Hessian at 0.
Now consider a constrained optimization problem
minimize f (x) subject to g(x) = 0
where $x \in \mathbb{R}^d$ and $g : \mathbb{R}^d \to \mathbb{R}^b$. The Lagrangian for this problem is

\[ L(x, \theta) = f(x) + \theta^T g(x). \]

Why is this helpful? Suppose $c^*$ is the minimum of f subject to the constraint. Then note that for any θ,
we have

\[ \inf_{x \in \mathbb{R}^d} L(x, \theta) \le \inf_{x \in \mathbb{R}^d : g(x) = 0} L(x, \theta) = \inf_{x \in \mathbb{R}^d : g(x) = 0} f(x) = c^*. \]

Thus, if we can find some $\theta^*, x^*$ such that $x^*$ minimizes $L(x, \theta^*)$ and $g(x^*) = 0$,
then $f(x^*) = L(x^*, \theta^*) \le c^*$, so this is indeed the optimal solution.
This gives us a method to solve the optimization problem — for each fixed θ,
solve the unconstrained optimization problem argminx L(x, θ). If we are doing
this analytically, then we would have a formula for x in terms of θ. Then seek
for a θ such that g(x) = 0 holds.
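As a worked instance of this recipe, consider the hypothetical problem of minimizing $f(x) = \frac{1}{2} \|x\|_2^2$ subject to $g(x) = Ax - b = 0$. This example is not from the notes. For fixed θ the Lagrangian is minimized at $x(\theta) = -A^T \theta$, and requiring $g(x(\theta)) = 0$ gives a linear system for θ. The sketch assumes NumPy and a randomly generated A with full row rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b_dim = 5, 2
A = rng.normal(size=(b_dim, d))    # hypothetical constraint matrix, full row rank
b = rng.normal(size=b_dim)

# Problem: minimize f(x) = ||x||^2 / 2 subject to g(x) = A x - b = 0.
# Step 1: for fixed theta, L(x, theta) = ||x||^2 / 2 + theta^T (A x - b)
#         is minimized over x at x(theta) = -A^T theta.
def x_of_theta(theta):
    return -A.T @ theta

# Step 2: choose theta so that g(x(theta)) = 0, i.e. -A A^T theta = b.
theta_star = np.linalg.solve(A @ A.T, -b)
x_star = x_of_theta(theta_star)

# Sanity check against the minimum-norm solution of A x = b.
x_min_norm = np.linalg.pinv(A) @ b
```

The result coincides with the minimum-norm solution of Ax = b, as expected for this objective.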


Subgradients
Usually, when we have a function to optimize, we take its derivative and set it
to zero. This works well if our function is actually differentiable. However, the
ℓ1 norm is not a differentiable function, since |x| is not differentiable at 0. This
is not some exotic case we may hope to avoid most of the time: when solving
the Lasso, we actively want our solutions to have zeroes, so we really want to
get to these non-differentiable points.
Thus, we seek some generalized notion of derivative that works on functions
that are not differentiable.
Definition (Subgradient). A vector $v \in \mathbb{R}^d$ is a subgradient of a convex function
f at x if $f(y) \ge f(x) + v^T (y - x)$ for all $y \in \mathbb{R}^d$.
The set of subgradients of f at x is denoted ∂f (x), and is called the subdifferential.

[Figure: two plots of a convex function f illustrating subgradients.]

This is indeed a generalization of the derivative, since

Proposition. Let f be convex and differentiable at x ∈ int(dom f ). Then
∂f (x) = {∇f (x)}.
The following properties are immediate from definition.
Proposition. Suppose f and g are convex with int(dom f ) ∩ int(dom g) ≠ ∅,
and α > 0. Then

\begin{align*}
\partial(\alpha f)(x) &= \alpha\, \partial f(x) = \{\alpha v : v \in \partial f(x)\} \\
\partial(f + g)(x) &= \partial g(x) + \partial f(x).
\end{align*}

The condition (for convex differentiable functions) that “x is a minimum iff
$f'(x) = 0$” now becomes
Proposition. If f is convex, then

\[ x^* \in \operatorname*{argmin}_{x \in \mathbb{R}^d} f(x) \iff 0 \in \partial f(x^*). \]

Proof. Both sides are equivalent to the requirement that $f(y) \ge f(x^*)$ for all
y.
We are interested in applying this to the Lasso. So we want to compute the
subdifferential of the ℓ1 norm. Let’s first introduce some notation.
Notation. For $x \in \mathbb{R}^d$ and A ⊆ {1, . . . , d}, we write $x_A$ for the sub-vector of x
formed by the components of x indexed by A. We write $x_{-j} = x_{\{j\}^c} = x_{\{1, \ldots, d\} \setminus \{j\}}$.
Similarly, we write $x_{-jk} = x_{\{j, k\}^c}$ etc.


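We write
\[ \operatorname{sgn}(x_i) = \begin{cases} -1 & x_i < 0 \\ 1 & x_i > 0 \\ 0 & \text{otherwise} \end{cases} \]
and $\operatorname{sgn}(x) = (\operatorname{sgn}(x_1), \ldots, \operatorname{sgn}(x_d))^T$.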


First note that $\|\cdot\|_1$ is convex, as it is a norm.
Proposition. For $x \in \mathbb{R}^d$ and $A = \{j : x_j \neq 0\}$, we have

\[ \partial\|x\|_1 = \{v \in \mathbb{R}^d : \|v\|_\infty \leq 1,\ v_A = \operatorname{sgn}(x_A)\}. \]

Proof. It suffices to look at the subdifferential of the absolute value function,
and then add them up.
For $j = 1, \ldots, d$, we define $g_j : \mathbb{R}^d \to \mathbb{R}$ by sending $x$ to $|x_j|$. If $x_j \neq 0$, then
$g_j$ is differentiable at $x$, and so we know $\partial g_j(x) = \{\operatorname{sgn}(x_j) e_j\}$, with $e_j$ the $j$th
standard basis vector.
When $x_j = 0$, if $v \in \partial g_j(x)$, then

\[ g_j(y) \geq g_j(x) + v^T(y - x) \]

for all $y$. So
\[ |y_j| \geq v^T(y - x). \]
We claim this holds for all $y$ iff $v_j \in [-1, 1]$ and $v_{-j} = 0$. The ⇐ direction is an immediate
calculation, and to show ⇒, we pick $y_{-j} = v_{-j} + x_{-j}$ and $y_j = 0$. Then we have
\[ 0 \geq v_{-j}^T v_{-j}. \]

So we know that $v_{-j} = 0$. Once we know this, the condition says

\[ |y_j| \geq v_j y_j \]

for all $y_j$. This is then true iff $v_j \in [-1, 1]$. Forming the set sum of the $\partial g_j(x)$
gives the result.
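
The proposition can be sanity-checked numerically. The following is a minimal sketch, not part of the lectures: the function names, the parameter `free` (the common value placed on the zero coordinates) and the random test points are all assumptions made for illustration. It builds one element of $\partial\|x\|_1$ and checks the defining inequality $\|y\|_1 \geq \|x\|_1 + v^T(y - x)$ at randomly drawn $y$.

```python
import numpy as np

def l1_subgradient(x, free=0.0):
    """Return one element of the subdifferential of ||.||_1 at x.

    Coordinates with x_j != 0 are forced to sgn(x_j); coordinates with
    x_j == 0 may take any value in [-1, 1], here the constant `free`.
    """
    x = np.asarray(x, dtype=float)
    if not -1.0 <= free <= 1.0:
        raise ValueError("free must lie in [-1, 1]")
    return np.where(x != 0, np.sign(x), free)

def satisfies_subgradient_inequality(x, v, trials=1000, seed=0):
    """Check ||y||_1 >= ||x||_1 + v^T (y - x) at randomly drawn points y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    ys = rng.normal(scale=3.0, size=(trials, x.size))
    lhs = np.abs(ys).sum(axis=1)
    rhs = np.abs(x).sum() + (ys - x) @ v
    return bool(np.all(lhs >= rhs - 1e-12))

x = np.array([2.0, 0.0, -1.5])
v_good = l1_subgradient(x, free=0.3)   # entries sgn(x_j) on the support, 0.3 elsewhere
v_bad = np.array([1.0, 1.5, -1.0])     # violates ||v||_inf <= 1 on the zero coordinate
print(satisfies_subgradient_inequality(x, v_good))
print(satisfies_subgradient_inequality(x, v_bad))
```

A random search of this kind can only refute the inequality, never prove it, so it is a check on the statement rather than a substitute for the proof.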

3.4 Properties of Lasso solutions


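Let’s now apply this to the Lasso. Recall that the Lasso objective was
\[ Q_\lambda(\beta) = \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1. \]

We know this is a convex function in $\beta$. So we know $\hat\beta^L_\lambda$ is a minimizer of $Q_\lambda(\beta)$
iff $0 \in \partial Q_\lambda(\hat\beta^L_\lambda)$.
Let’s subdifferentiate $Q_\lambda$ and see what this amounts to. Writing $\hat\beta = \hat\beta^L_\lambda$, we have
\[ \partial Q_\lambda(\hat\beta) = -\frac{1}{n}X^T(Y - X\hat\beta) + \lambda\left\{\hat\nu \in \mathbb{R}^p : \|\hat\nu\|_\infty \leq 1,\ \hat\nu_{\hat\rho^L_\lambda} = \operatorname{sgn}(\hat\beta^L_{\lambda, \hat\rho^L_\lambda})\right\}, \]

where $\hat\rho^L_\lambda = \{j : \hat\beta^L_{\lambda, j} \neq 0\}$.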


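Thus, $0 \in \partial Q_\lambda(\hat\beta_\lambda)$ is equivalent to
\[ \hat\nu = \frac{1}{\lambda} \cdot \frac{1}{n} X^T(Y - X\hat\beta_\lambda) \]
satisfying
\[ \|\hat\nu\|_\infty \leq 1, \quad \hat\nu_{\hat\rho^L_\lambda} = \operatorname{sgn}(\hat\beta^L_{\lambda, \hat\rho^L_\lambda}). \]

These are known as the KKT conditions for the Lasso.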
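
The KKT conditions give a direct way to test whether a candidate vector is a Lasso solution. The sketch below is illustrative only: the function names, the tolerance `tol` and the toy design are assumptions, not something from the lectures. It uses the standard fact that when $\frac{1}{n}X^TX = I$ the Lasso solution is the soft-thresholding of $\frac{1}{n}X^TY$ at level $\lambda$.

```python
import numpy as np

def lasso_kkt_check(X, Y, beta, lam, tol=1e-8):
    """Return (nu, ok), where nu = X^T (Y - X beta) / (n * lam) and ok says
    whether ||nu||_inf <= 1 and nu = sgn(beta) on the support of beta."""
    n = X.shape[0]
    nu = X.T @ (Y - X @ beta) / (n * lam)
    active = beta != 0
    bounded = np.all(np.abs(nu) <= 1 + tol)
    signs_match = np.allclose(nu[active], np.sign(beta[active]), atol=tol)
    return nu, bool(bounded and signs_match)

def soft_threshold(z, lam):
    """Componentwise sgn(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Toy orthonormal design (an assumption for illustration): X^T X / n = I.
n = 4
X = np.sqrt(n) * np.eye(n)
Y = np.array([3.0, -0.2, 1.0, 0.0])
lam = 0.25

beta_hat = soft_threshold(X.T @ Y / n, lam)
nu, ok = lasso_kkt_check(X, Y, beta_hat, lam)
print(beta_hat, nu, ok)

# The zero vector fails the check here, since |X_1^T Y| / n exceeds lam.
_, ok_zero = lasso_kkt_check(X, Y, np.zeros(n), lam)
print(ok_zero)
```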


In principle, there could be several solutions to the Lasso. However, at least
the fitted values are always unique.

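Proposition. $X\hat\beta^L_\lambda$ is unique.

Proof. Fix $\lambda > 0$ and stop writing it. Suppose $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$ are two Lasso
solutions at $\lambda$. Then
\[ Q(\hat\beta^{(1)}) = Q(\hat\beta^{(2)}) = c^*. \]
As $Q$ is convex and $c^*$ is its minimum value, we know
\[ c^* \leq Q\left(\frac{1}{2}\hat\beta^{(1)} + \frac{1}{2}\hat\beta^{(2)}\right) \leq \frac{1}{2}Q(\hat\beta^{(1)}) + \frac{1}{2}Q(\hat\beta^{(2)}) = c^*. \]

So $\frac{1}{2}\hat\beta^{(1)} + \frac{1}{2}\hat\beta^{(2)}$ is also a minimizer.
Since the two terms in $Q(\beta)$ are individually convex, it must be the case that
\[ \left\|\frac{1}{2}(Y - X\hat\beta^{(1)}) + \frac{1}{2}(Y - X\hat\beta^{(2)})\right\|_2^2 = \frac{1}{2}\left\|Y - X\hat\beta^{(1)}\right\|_2^2 + \frac{1}{2}\left\|Y - X\hat\beta^{(2)}\right\|_2^2, \]
\[ \left\|\frac{1}{2}(\hat\beta^{(1)} + \hat\beta^{(2)})\right\|_1 = \frac{1}{2}\|\hat\beta^{(1)}\|_1 + \frac{1}{2}\|\hat\beta^{(2)}\|_1. \]

Moreover, since $\|\cdot\|_2^2$ is strictly convex, we can have equality in the first line only if
$X\hat\beta^{(1)} = X\hat\beta^{(2)}$. So we are done.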
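Definition (Equicorrelation set). Define the equicorrelation set $\hat E_\lambda$ to be the
set of $k$ such that
\[ \frac{1}{n}\left|X_k^T(Y - X\hat\beta^L_\lambda)\right| = \lambda, \]
or equivalently, the $k$ with $\hat\nu_k = \pm 1$, which is well-defined since it depends only
on the fitted values.
By the KKT conditions, $\hat E_\lambda$ contains the set of non-zeroes of any Lasso solution,
but may be strictly bigger than that.
Note that if $\operatorname{rk}(X_{\hat E_\lambda}) = |\hat E_\lambda|$, then the Lasso solution must be unique. Indeed,
any two solutions $\hat\beta^{(1)}, \hat\beta^{(2)}$ vanish outside $\hat E_\lambda$ and have the same fitted values, so

\[ X_{\hat E_\lambda}\left(\hat\beta^{(1)}_{\hat E_\lambda} - \hat\beta^{(2)}_{\hat E_\lambda}\right) = 0. \]

So $\hat\beta^{(1)} = \hat\beta^{(2)}$.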


3.5 Variable selection


So we have seen that the Lasso picks out some “important” variables and discards
the rest. How well does it do the job?
For simplicity, consider a noiseless linear model

\[ Y = X\beta^0. \]

Our objective is to find the set

\[ S = \{k : \beta^0_k \neq 0\}. \]

We may wlog assume $S = \{1, \ldots, s\}$, and $N = \{1, \ldots, p\} \setminus S$ (as usual $X$ is $n \times p$).
We further assume that $\operatorname{rk}(X_S) = s$.
In general, it is difficult to find out $S$ even if we know $|S|$. Under certain
conditions, the Lasso can do the job correctly. This condition is dictated by the
$\ell_\infty$ norm of the quantity
\[ \Delta = X_N^T X_S (X_S^T X_S)^{-1} \operatorname{sgn}(\beta^0_S). \]

We can understand this a bit as follows: the $k$th entry of this is the dot product
of $\operatorname{sgn}(\beta^0_S)$ with $(X_S^T X_S)^{-1} X_S^T X_k$. This is the coefficient vector we would obtain
if we tried to regress $X_k$ on $X_S$. If this is large, then this suggests we expect $X_k$
to look correlated to $X_S$, and so it would be difficult to determine if $k$ is part of
$S$ or not.
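Theorem.

(i) If $\|\Delta\|_\infty \leq 1$, or equivalently

\[ \max_{k \in N} \left|\operatorname{sgn}(\beta^0_S)^T (X_S^T X_S)^{-1} X_S^T X_k\right| \leq 1, \]

and moreover
\[ |\beta^0_k| > \lambda \left|\operatorname{sgn}(\beta^0_S)^T \left[\left(\frac{1}{n} X_S^T X_S\right)^{-1}\right]_k\right| \]

for all $k \in S$, then there exists a Lasso solution $\hat\beta^L_\lambda$ with $\operatorname{sgn}(\hat\beta^L_\lambda) = \operatorname{sgn}(\beta^0)$.

(ii) If there exists a Lasso solution with $\operatorname{sgn}(\hat\beta^L_\lambda) = \operatorname{sgn}(\beta^0)$, then $\|\Delta\|_\infty \leq 1$.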
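
To get a feel for the condition $\|\Delta\|_\infty \leq 1$, one can compute $\Delta$ directly for a given design. This is a small sketch under assumptions of our own: the function name, the 0-based index convention and the toy design matrices are hypothetical and chosen so that $\Delta$ can be read off by hand.

```python
import numpy as np

def irrepresentable_vector(X, S, signs):
    """Delta = X_N^T X_S (X_S^T X_S)^{-1} sgn(beta_S^0), with N the complement of S.

    S holds 0-based column indices; signs is the vector sgn(beta_S^0).
    Assumes X_S has full column rank, as in the text.
    """
    p = X.shape[1]
    S = np.asarray(S)
    N = np.setdiff1d(np.arange(p), S)
    XS, XN = X[:, S], X[:, N]
    coef = np.linalg.solve(XS.T @ XS, np.asarray(signs, dtype=float))
    return XN.T @ XS @ coef

def toy_design(a, b):
    """Columns e1, e2 and a third column a*e1 + b*e2 + e3 (illustrative only)."""
    return np.array([[1.0, 0.0, a],
                     [0.0, 1.0, b],
                     [0.0, 0.0, 1.0]])

# Here X_S^T X_S = I and X_N^T X_S = (a, b), so Delta = a + b when both signs are +1.
delta_ok = irrepresentable_vector(toy_design(0.3, 0.4), S=[0, 1], signs=[1, 1])
delta_bad = irrepresentable_vector(toy_design(0.8, 0.8), S=[0, 1], signs=[1, 1])
print(np.max(np.abs(delta_ok)) <= 1, np.max(np.abs(delta_bad)) <= 1)
```

In the second design the noise column is too well explained by the signal columns, so by part (ii) of the theorem no Lasso solution can recover the signs.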

We are going to make heavy use of the KKT conditions.


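Proof. Write $\hat\beta = \hat\beta^L_\lambda$, and write $\hat S = \{k : \hat\beta_k \neq 0\}$. Since $Y = X\beta^0$, the KKT
conditions are that
\[ \frac{1}{n} X^T X(\beta^0 - \hat\beta) = \lambda\hat\nu, \]
where $\|\hat\nu\|_\infty \leq 1$ and $\hat\nu_{\hat S} = \operatorname{sgn}(\hat\beta_{\hat S})$.
We expand this to say

\[ \frac{1}{n}\begin{pmatrix} X_S^T X_S & X_S^T X_N \\ X_N^T X_S & X_N^T X_N \end{pmatrix} \begin{pmatrix} \beta^0_S - \hat\beta_S \\ -\hat\beta_N \end{pmatrix} = \lambda \begin{pmatrix} \hat\nu_S \\ \hat\nu_N \end{pmatrix}. \]

Call the top and bottom equations (1) and (2) respectively.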


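It is easier to prove (ii) first. If there is such a solution, then $\hat\beta_N = 0$. So
from (1), we must have
\[ \frac{1}{n} X_S^T X_S(\beta^0_S - \hat\beta_S) = \lambda\hat\nu_S. \]
Inverting $\frac{1}{n} X_S^T X_S$ and using $\hat\nu_S = \operatorname{sgn}(\hat\beta_S) = \operatorname{sgn}(\beta^0_S)$, we learn that
\[ \beta^0_S - \hat\beta_S = \lambda\left(\frac{1}{n} X_S^T X_S\right)^{-1} \operatorname{sgn}(\beta^0_S). \]

Substitute this into (2) to get

\[ \lambda \cdot \frac{1}{n} X_N^T X_S \left(\frac{1}{n} X_S^T X_S\right)^{-1} \operatorname{sgn}(\beta^0_S) = \lambda\hat\nu_N. \]

By the KKT conditions, we know $\|\hat\nu_N\|_\infty \leq 1$, and the LHS is exactly $\lambda\Delta$.
To prove (i), we need to exhibit a $\hat\beta$ that agrees in sign with $\beta^0$ and satisfies
the equations (1) and (2). In particular, $\hat\beta_N = 0$. So we try
\[ (\hat\beta_S, \hat\nu_S) = \left(\beta^0_S - \lambda\left(\frac{1}{n} X_S^T X_S\right)^{-1} \operatorname{sgn}(\beta^0_S),\ \operatorname{sgn}(\beta^0_S)\right), \]
\[ (\hat\beta_N, \hat\nu_N) = (0, \Delta). \]

This solves (1) and (2) by construction, and $\|\hat\nu_N\|_\infty = \|\Delta\|_\infty \leq 1$ by assumption.
We then only need to check that

\[ \operatorname{sgn}(\hat\beta_S) = \operatorname{sgn}(\beta^0_S), \]

which follows from the second condition.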

Prediction and estimation


We now consider the other question of how well the Lasso functions as a regression
method. Consider the model

\[ Y = X\beta^0 + \varepsilon - \bar\varepsilon 1, \]

where the εi are independent and have common sub-Gaussian parameter σ. Let
S, s, N be as before.
As before, the Lasso doesn’t always behave well, and whether or not it does
is controlled by the compatibility factor.
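Definition (Compatibility factor). Define the compatibility factor to be
\[ \phi^2 = \inf_{\substack{\beta \in \mathbb{R}^p \\ \|\beta_N\|_1 \leq 3\|\beta_S\|_1 \\ \beta_S \neq 0}} \frac{\frac{1}{n}\|X\beta\|_2^2}{\frac{1}{s}\|\beta_S\|_1^2} = \inf_{\substack{\beta \in \mathbb{R}^p \\ \|\beta_S\|_1 = 1 \\ \|\beta_N\|_1 \leq 3}} \frac{s}{n}\|X_S\beta_S - X_N\beta_N\|_2^2. \]

Note that we are free to use a minus sign inside the norm since the problem
is symmetric in $\beta_N \leftrightarrow -\beta_N$.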
In some sense, this φ measures how well we can approximate XS βS just with
the noise variables.


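Definition (Compatibility condition). The compatibility condition is $\phi^2 > 0$.

Note that if $\hat\Sigma = \frac{1}{n}X^TX$ has minimum eigenvalue $c_{\min} > 0$, then we have
$\phi^2 \geq c_{\min}$. Indeed,
\[ \|\beta_S\|_1 = \operatorname{sgn}(\beta_S)^T\beta_S \leq \sqrt{s}\,\|\beta_S\|_2 \leq \sqrt{s}\,\|\beta\|_2, \]

and so
\[ \phi^2 \geq \inf_{\beta \neq 0} \frac{\frac{1}{n}\|X\beta\|_2^2}{\|\beta\|_2^2} = c_{\min}. \]
Of course, in the high-dimensional setting $p > n$ the minimum eigenvalue of $\hat\Sigma$ is
zero, but the infimum in the definition of $\phi^2$ is taken over a restricted set of $\beta$,
so we can still hope to have a positive value of $\phi^2$.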
Theorem. Assume $\phi^2 > 0$, and let $\hat\beta$ be the Lasso solution with
\[ \lambda = A\sigma\sqrt{\log p / n}. \]
Then with probability at least $1 - 2p^{-(A^2/8 - 1)}$, we have

\[ \frac{1}{n}\|X(\beta^0 - \hat\beta)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \leq \frac{16\lambda^2 s}{\phi^2} = \frac{16A^2\log p}{\phi^2} \cdot \frac{s\sigma^2}{n}. \]
This is actually two bounds: it simultaneously bounds the error in the
fitted values and the error $\hat\beta - \beta^0$ in estimating $\beta^0$.
Recall that in our previous bound, we had a bound of $\sim \frac{1}{\sqrt{n}}$, and now we have
$\sim \frac{1}{n}$. Note also that $\frac{s\sigma^2}{n}$ is the error we would get if we magically knew which
were the non-zero variables and did ordinary least squares on those variables.
This also tells us that if $\beta^0$ has a component that is large, then $\hat\beta$ must be
non-zero in that component as well. So while the Lasso cannot predict exactly
which variables are non-zero, we can at least get the important ones.
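Proof. Start with the basic inequality $Q_\lambda(\hat\beta) \leq Q_\lambda(\beta^0)$, which gives us
\[ \frac{1}{2n}\|X(\beta^0 - \hat\beta)\|_2^2 + \lambda\|\hat\beta\|_1 \leq \frac{1}{n}\varepsilon^T X(\hat\beta - \beta^0) + \lambda\|\beta^0\|_1. \]
We work on the event
\[ \Omega = \left\{\frac{1}{n}\|X^T\varepsilon\|_\infty \leq \frac{1}{2}\lambda\right\}, \]
where after applying Hölder’s inequality, we get
\[ \frac{1}{n}\|X(\beta^0 - \hat\beta)\|_2^2 + 2\lambda\|\hat\beta\|_1 \leq \lambda\|\hat\beta - \beta^0\|_1 + 2\lambda\|\beta^0\|_1. \]

We can move $2\lambda\|\hat\beta\|_1$ to the other side, and applying the triangle inequality, we
have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 \leq 3\lambda\|\hat\beta - \beta^0\|_1. \]
If we manage to bound the RHS from above as well, so that
\[ 3\lambda\|\hat\beta - \beta^0\|_1 \leq c\lambda \frac{1}{\sqrt{n}}\|X(\hat\beta - \beta^0)\|_2 \]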


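for some $c$, then we obtain the bound
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 \leq c^2\lambda^2. \]
Plugging this back into the second bound, we also have

\[ 3\lambda\|\hat\beta - \beta^0\|_1 \leq c^2\lambda^2. \]

To obtain these bounds, we want to apply the definition of $\phi^2$ to $\hat\beta - \beta^0$. We thus
need to show that $\hat\beta - \beta^0$ satisfies the conditions required in the infimum
taken.
Write
\[ a = \frac{1}{n\lambda}\|X(\hat\beta - \beta^0)\|_2^2. \]

Then, dividing the inequality obtained from Hölder by $\lambda$ and using $\beta^0_N = 0$, we have

\[ a + 2(\|\hat\beta_N\|_1 + \|\hat\beta_S\|_1) \leq \|\hat\beta_S - \beta^0_S\|_1 + \|\hat\beta_N\|_1 + 2\|\beta^0_S\|_1. \]

Simplifying, we obtain

\[ a + \|\hat\beta_N\|_1 \leq \|\hat\beta_S - \beta^0_S\|_1 + 2\|\beta^0_S\|_1 - 2\|\hat\beta_S\|_1. \]

Using the triangle inequality, we write this as

\[ a + \|\hat\beta_N - \beta^0_N\|_1 \leq 3\|\hat\beta_S - \beta^0_S\|_1. \]

So we immediately know we can apply the compatibility condition, which gives
us
\[ \phi^2 \leq \frac{\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2}{\frac{1}{s}\|\hat\beta_S - \beta^0_S\|_1^2}. \]
Also, adding $\|\hat\beta_S - \beta^0_S\|_1$ to both sides of the previous inequality and multiplying by $\lambda$, we have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \leq 4\lambda\|\hat\beta_S - \beta^0_S\|_1. \]
Thus, using the compatibility condition, we have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \leq \frac{4\lambda}{\phi}\sqrt{\frac{s}{n}}\,\|X(\hat\beta - \beta^0)\|_2. \]

Thus, dropping the second term on the left and dividing through by $\frac{1}{\sqrt{n}}\|X(\hat\beta - \beta^0)\|_2$, we obtain

\[ \frac{1}{\sqrt{n}}\|X(\hat\beta - \beta^0)\|_2 \leq \frac{4\lambda\sqrt{s}}{\phi}. \quad (*) \]

So we substitute $(*)$ into the RHS of the previous inequality and obtain

\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \leq \frac{16\lambda^2 s}{\phi^2}. \]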
If we want to be impressed by this result, then we should make sure that
the compatibility condition is a reasonable assumption to make on the design
matrix.


The compatibility condition and random design


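For any $\Sigma \in \mathbb{R}^{p \times p}$, define

\[ \phi^2_\Sigma(S) = \inf_{\beta : \|\beta_N\|_1 \leq 3\|\beta_S\|_1,\ \beta_S \neq 0} \frac{\beta^T\Sigma\beta}{\|\beta_S\|_1^2/|S|}. \]

Our original $\phi^2$ is then the same as $\phi^2_{\hat\Sigma}(S)$.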


If we want to analyze how φ2Σ (S) behaves for a “random” Σ, then it would
be convenient to know that this depends continuously on Σ. For our purposes,
the following lemma suffices:
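Lemma. Let $\Theta, \Sigma \in \mathbb{R}^{p \times p}$. Suppose $\phi^2_\Theta(S) > 0$ and

\[ \max_{j,k} |\Theta_{jk} - \Sigma_{jk}| \leq \frac{\phi^2_\Theta(S)}{32|S|}. \]

Then
\[ \phi^2_\Sigma(S) \geq \frac{1}{2}\phi^2_\Theta(S). \]
Proof. We suppress the dependence on $S$ for notational convenience. Let $s = |S|$
and
\[ t = \frac{\phi^2_\Theta(S)}{32s}. \]
We have
\[ |\beta^T(\Sigma - \Theta)\beta| \leq \|\beta\|_1\|(\Sigma - \Theta)\beta\|_\infty \leq t\|\beta\|_1^2, \]
where we applied Hölder twice.
If $\|\beta_N\|_1 \leq 3\|\beta_S\|_1$, then we have
\[ \|\beta\|_1 \leq 4\|\beta_S\|_1 \leq 4\frac{\sqrt{\beta^T\Theta\beta}}{\phi_\Theta/\sqrt{s}}. \]

Thus, we have

\[ \beta^T\Theta\beta - \frac{\phi^2_\Theta}{32s} \cdot \frac{16\beta^T\Theta\beta}{\phi^2_\Theta/s} = \frac{1}{2}\beta^T\Theta\beta \leq \beta^T\Sigma\beta. \]
Dividing by $\|\beta_S\|_1^2/s$ and taking the infimum over $\beta$ gives the result.

Define
\[ \phi^2_{\Sigma,s} = \min_{S : |S| = s} \phi^2_\Sigma(S). \]

Note that if
\[ \max_{j,k} |\Theta_{jk} - \Sigma_{jk}| \leq \frac{\phi^2_{\Theta,s}}{32s}, \]
then
\[ \phi^2_\Sigma(S) \geq \frac{1}{2}\phi^2_\Theta(S) \]
for all $S$ with $|S| = s$. In particular,
\[ \phi^2_{\Sigma,s} \geq \frac{1}{2}\phi^2_{\Theta,s}. \]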


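Theorem. Suppose the rows of $X$ are iid and each entry is sub-Gaussian with
parameter $v$. Suppose $s\sqrt{\log p/n} \to 0$ as $n \to \infty$, and $\phi^2_{\Sigma^0,s}$ is bounded away
from 0. Then if $\Sigma^0 = \mathbb{E}\hat\Sigma$, we have
\[ \mathbb{P}\left(\phi^2_{\hat\Sigma,s} \geq \frac{1}{2}\phi^2_{\Sigma^0,s}\right) \to 1 \text{ as } n \to \infty. \]
Proof. By the lemma, it is enough to show that
\[ \mathbb{P}\left(\max_{j,k} |\hat\Sigma_{jk} - \Sigma^0_{jk}| \leq \frac{\phi^2_{\Sigma^0,s}}{32s}\right) \to 1 \]
as $n \to \infty$.
Let $t = \frac{\phi^2_{\Sigma^0,s}}{32s}$. Then
\[ \mathbb{P}\left(\max_{j,k} |\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq t\right) \leq \sum_{j,k} \mathbb{P}(|\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq t). \]

Recall that
\[ \hat\Sigma_{jk} = \frac{1}{n}\sum_{i=1}^n X_{ij}X_{ik}. \]
So we can apply Bernstein’s inequality to bound
\[ \mathbb{P}(|\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq t) \leq 2\exp\left(-\frac{nt^2}{2(64v^4 + 4v^2t)}\right), \]
since $\sigma = 8v^2$ and $b = 4v^2$. So we can bound
\[ \mathbb{P}\left(\max_{j,k} |\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq t\right) \leq 2p^2\exp\left(-\frac{cn}{s^2}\right) = 2\exp\left(-\frac{n}{s^2}\left(c - \frac{2s^2}{n}\log p\right)\right) \to 0 \]
for some constant $c$.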
Corollary. Suppose the rows of $X$ are iid mean-zero multivariate Gaussian
with variance $\Sigma^0$. Suppose $\Sigma^0$ has minimum eigenvalue bounded from below by
$c_{\min} > 0$, and suppose the diagonal entries of $\Sigma^0$ are bounded from above. If
$s\sqrt{\log p/n} \to 0$, then
\[ \mathbb{P}\left(\phi^2_{\hat\Sigma,s} \geq \frac{1}{2}c_{\min}\right) \to 1 \text{ as } n \to \infty. \]

3.6 Computation of Lasso solutions


We have had enough of bounding things. In this section, let’s think about how
we can actually run the Lasso. What we present here is actually a rather general
method to find the minimum of a function, known as coordinate descent.
Suppose we have a function $f : \mathbb{R}^d \to \mathbb{R}$. We start with an initial guess $x^{(0)}$
and repeat for $m = 1, 2, \ldots$
\[ x_1^{(m)} = \operatorname*{argmin}_{x_1} f(x_1, x_2^{(m-1)}, \ldots, x_d^{(m-1)}) \]
\[ x_2^{(m)} = \operatorname*{argmin}_{x_2} f(x_1^{(m)}, x_2, x_3^{(m-1)}, \ldots, x_d^{(m-1)}) \]
\[ \vdots \]
\[ x_d^{(m)} = \operatorname*{argmin}_{x_d} f(x_1^{(m)}, x_2^{(m)}, \ldots, x_{d-1}^{(m)}, x_d) \]


until the result stabilizes.


This was proposed for solving the Lasso a long time ago, and a Stanford
group tried this out. However, instead of using $x_1^{(m)}$ when computing $x_2^{(m)}$, they
used $x_1^{(m-1)}$ instead. This turned out to be pretty useless, and so the group
abandoned the method. After trying some other methods, which weren’t very
good, they decided to revisit this method and fixed their mistake.
For this to work well, of course the coordinatewise minimizations have to be
easy (which is the case for the Lasso, where we even have explicit solutions). This
converges to the global minimizer if the minimizer is unique, $\{x : f(x) \leq f(x^{(0)})\}$
is compact, and if $f$ has the form
\[ f(x) = g(x) + \sum_j h_j(x_j), \]

where $g$ is convex and differentiable, and each $h_j$ is convex but not necessarily
differentiable. In the case of the Lasso, $g$ is the least squares term, and
the $h_j$ make up the $\ell_1$ term.
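
For the Lasso objective $\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$, the explicit coordinatewise solution is a soft-thresholding step: with $r_j = Y - X_{-j}\beta_{-j}$ the partial residual, the minimizer over $\beta_j$ is $\operatorname{sgn}(\rho_j)(|\rho_j| - \lambda)_+ / (\frac{1}{n}\|X_j\|_2^2)$, where $\rho_j = \frac{1}{n}X_j^Tr_j$. The following is a minimal sketch of cyclic coordinate descent built on this update. The function names, the stopping rule and the defaults `max_sweeps` and `tol` are our own assumptions, and the code assumes no column of $X$ is identically zero.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimiser of 0.5 * (b - z)^2 + lam * |b| over scalar b."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, Y, lam, beta0=None, max_sweeps=1000, tol=1e-10):
    """Cyclic coordinate descent for (1/2n)||Y - X beta||_2^2 + lam * ||beta||_1.

    beta0 is an optional starting point; the loop stops once a full sweep
    changes no coordinate by more than tol.
    """
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.array(beta0, dtype=float)
    col_sq = np.sum(X ** 2, axis=0) / n        # (1/n) ||X_j||^2 for each column
    resid = Y - X @ beta                       # full residual, kept up to date
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # (1/n) X_j^T r_j, where r_j is the residual with coordinate j removed
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != old:
                # Update immediately, so later coordinates see the new value
                resid -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta

# Toy usage on simulated data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=50)
beta_hat = lasso_coordinate_descent(X, Y, lam=0.1)
print(beta_hat)
```

Keeping the residual up to date in place is what makes each coordinate update cost $O(n)$; it is also exactly the point of using $x_1^{(m)}$ rather than $x_1^{(m-1)}$ in the scheme above.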
There are two things we can do to make this faster. We typically solve the
Lasso on a grid of $\lambda$ values $\lambda_0 > \lambda_1 > \cdots > \lambda_L$, and then pick the appropriate
$\lambda$ by v-fold cross-validation. In this case, we can start solving at $\lambda_0$, and then
for each $i > 0$, we solve the $\lambda = \lambda_i$ problem by picking $x^{(0)}$ to be the optimal
solution to the $\lambda_{i-1}$ problem. In fact, even if we already have a fixed $\lambda$ value we
want to use, it is often advantageous to solve the Lasso with a larger $\lambda$-value,
and then use that as a warm start to get to our desired $\lambda$ value.
Another strategy is an active set strategy. If $p$ is large, then this loop may
take a very long time. Since we know the Lasso should set a lot of things to zero,
for $\ell = 1, \ldots, L$, we set
\[ A = \{k : \hat\beta^L_{\lambda_{\ell-1},k} \neq 0\}. \]
We then perform coordinate descent only on coordinates in $A$ to obtain a Lasso
solution $\hat\beta$ with $\hat\beta_{A^c} = 0$. This may not be the actual Lasso solution. To check
this, we use the KKT conditions. We set
\[ V = \left\{k \in A^c : \frac{1}{n}|X_k^T(Y - X\hat\beta)| > \lambda_\ell\right\}. \]
If $V = \emptyset$, we are done. Otherwise, we add $V$ to our active set $A$, and then
run coordinate descent again on this active set.

3.7 Extensions of the Lasso


There are many ways we can modify the Lasso. The first thing we might want to
change in the Lasso is to replace the least squares loss with other log-likelihoods.
Another way to modify the Lasso is to replace the `1 penalty with something
else in order to encourage a different form of sparsity.
Example (Group Lasso). Given a partition
\[ G_1 \cup \cdots \cup G_q = \{1, \ldots, p\}, \]
the group Lasso penalty is
\[ \lambda\sum_{j=1}^q m_j\|\beta_{G_j}\|_2, \]
where $\{m_j\}$ is some sort of weight to account for the fact that the groups have
different sizes. Typically, we take $m_j = \sqrt{|G_j|}$.
If we take $G_i = \{i\}$, then we recover the original Lasso. If we take $q = 1$,
then we recover Ridge regression (with an unsquared $\ell_2$ penalty). What this does
is that it encourages the entire group to be all zero, or all non-zero.
Example. Another variation is the fused Lasso. If $\beta^0_{j+1}$ is expected to be close
to $\beta^0_j$, then a fused Lasso penalty may be appropriate, with
\[ \lambda_1\sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j| + \lambda_2\|\beta\|_1. \]

For example, if
\[ Y_i = \mu_i + \varepsilon_i, \]
and we believe that $(\mu_i)_{i=1}^n$ form a piecewise constant sequence, we can estimate
$\mu^0$ by
\[ \operatorname*{argmin}_\mu \left\{\|Y - \mu\|_2^2 + \lambda\sum_{i=1}^{n-1} |\mu_{i+1} - \mu_i|\right\}. \]
Example (Elastic net). We can use
\[ \hat\beta^{EN}_{\lambda,\alpha} = \operatorname*{argmin}_\beta \left\{\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda(\alpha\|\beta\|_1 + (1 - \alpha)\|\beta\|_2^2)\right\} \]
for $\alpha \in [0, 1]$. What the $\ell_2$ norm does is that it encourages highly positively
correlated variables to have similar estimated coefficients.
For example, if we have duplicate columns, then the $\ell_1$ penalty encourages
us to take one of the coefficients to be 0, while the $\ell_2$ penalty encourages the
coefficients to be the same.
Another class of variations tries to reduce the bias of the Lasso. Although
the bias of the Lasso is a necessary by-product of reducing the variance of the
estimate, it is sometimes desirable to reduce this bias.
The LARS-OLS hybrid takes the $\hat S_\lambda$ obtained by the Lasso, and then
re-estimates $\beta^0_{\hat S_\lambda}$ by OLS. We can also re-estimate using the Lasso on $X_{\hat S_\lambda}$, and this
is known as the relaxed Lasso.
In the adaptive Lasso, we obtain an initial estimate $\hat\beta$ of $\beta^0$, e.g. with the Lasso,
with support $\hat S$, and then solve
\[ \hat\beta^{\mathrm{adapt}}_\lambda = \operatorname*{argmin}_{\beta : \beta_{\hat S^c} = 0} \left\{\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\sum_{k \in \hat S} \frac{|\beta_k|}{|\hat\beta_k|}\right\}. \]

We can also try to use a non-convex penalty. We can attempt to solve
\[ \operatorname*{argmin}_\beta \left\{\frac{1}{2n}\|Y - X\beta\|_2^2 + \sum_{k=1}^p p_\lambda(|\beta_k|)\right\}, \]

where $p_\lambda : [0, \infty) \to [0, \infty)$ is a non-convex function. One common example is
the MCP, given by
\[ p'_\lambda(u) = \left(\lambda - \frac{u}{\gamma}\right)_+, \]
where $\gamma$ is an extra tuning parameter. This tends to give estimates even sparser
than the Lasso.
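
The MCP is specified above through its derivative. Integrating, and assuming the normalization $p_\lambda(0) = 0$ (an assumption here, since only $p'_\lambda$ is given), we get $p_\lambda(u) = \lambda u - \frac{u^2}{2\gamma}$ for $u \leq \gamma\lambda$ and $p_\lambda(u) = \frac{\gamma\lambda^2}{2}$ for $u > \gamma\lambda$. The short sketch below, with function names of our own choosing, encodes both and compares the penalty with the Lasso penalty $\lambda u$.

```python
def mcp_derivative(u, lam, gamma):
    """p'_lambda(u) = (lam - u / gamma)_+ for u >= 0."""
    return max(lam - u / gamma, 0.0)

def mcp_penalty(u, lam, gamma):
    """Integral of mcp_derivative from 0 to u, assuming p_lambda(0) = 0."""
    if u < 0:
        raise ValueError("the penalty is defined on [0, infinity)")
    if u <= gamma * lam:
        return lam * u - u * u / (2.0 * gamma)
    return gamma * lam * lam / 2.0    # flat beyond gamma * lam

lam, gamma = 1.0, 3.0
for u in (0.5, 1.0, 3.0, 10.0):
    # MCP penalty next to the Lasso penalty lam * u at the same point
    print(u, mcp_penalty(u, lam, gamma), lam * u)
```

The penalty agrees with the $\ell_1$ penalty to first order near 0 but is constant beyond $\gamma\lambda$, so large coefficients are not shrunk further.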


4 Graphical modelling
4.1 Conditional independence graphs
So far, we have been looking at prediction problems. But sometimes we may
want to know more than that. For example, there is a positive correlation
between the wine budget of a college, and the percentage of students getting
firsts. This information allows us to make predictions, in the sense that if we
happen to know the wine budget of a college, but forgot the percentage of
students getting firsts, then we can make a reasonable prediction of the latter
based on the former. However, this does not suggest any causal relation between
the two — increasing the wine budget is probably not a good way to increase
the percentage of students getting firsts!
Of course, it is unlikely that we can actually figure out causation just by
looking at the data. However, there are some things we can try to answer. If
we gather more data about the colleges, then we would probably find that the
colleges that have larger wine budget and more students getting firsts are also
the colleges with larger endowment and longer history. If we condition on all
these other variables, then there is not much correlation left between the wine
budget and the percentage of students getting firsts. This is what we are trying
to capture in conditional independence graphs.
We first introduce some basic graph terminology. For the purpose of
conditional independence graphs, we only need undirected graphs. But later, we need
the notion of directed graphs as well, so our definitions will be general enough to
include those.
Definition (Graph). A graph is a pair $G = (V, E)$, where $V$ is a set and
$E \subseteq V \times V$ such that $(v, v) \notin E$ for all $v \in V$.
Definition (Edge). We say there is an edge between j and k and that j and k
are adjacent if (j, k) ∈ E or (k, j) ∈ E.
Definition (Undirected edge). An edge (j, k) is undirected if also (k, j) ∈ E.
Otherwise, it is directed and we write j → k to represent it. We also say that j
is a parent of k, and write pa(k) for the set of all parents of k.
Definition ((Un)directed graph). A graph is (un)directed if all its edges are
(un)directed.
Definition (Skeleton). The skeleton of G is a copy of G with every edge replaced
by an undirected edge.
Definition (Subgraph). A graph $G_1 = (V_1, E_1)$ is a subgraph of $G = (V, E)$ if
$V_1 \subseteq V$ and $E_1 \subseteq E$. A proper subgraph is one where either of the inclusions is a
proper inclusion.
As discussed, we want a graph that encodes the conditional dependence of
different variables. We first define what this means. In this section, we only
work with undirected graphs.
Definition (Conditional independence). Let $X, Y, Z$ be random vectors with
joint density $f_{XYZ}$. We say that $X$ is conditionally independent of $Y$ given $Z$,
written $X \perp\!\!\!\perp Y \mid Z$, if
\[ f_{XY \mid Z}(x, y \mid z) = f_{X \mid Z}(x \mid z) f_{Y \mid Z}(y \mid z). \]
Equivalently,
\[ f_{X \mid YZ}(x \mid y, z) = f_{X \mid Z}(x \mid z) \]
for all $y$.
We shall ignore all the technical difficulties, and take as an assumption that
all these conditional densities exist.

Definition (Conditional independence graph (CIG)). Let $P$ be the law of
$Z = (Z_1, \ldots, Z_p)^T$. The conditional independence graph (CIG) is the graph whose
vertices are $\{1, \ldots, p\}$, and contains an edge between $j$ and $k$ iff $Z_j$ and $Z_k$ are
conditionally dependent given $Z_{-jk}$.

More generally, we make the following definition:


Definition (Pairwise Markov property). Let $P$ be the law of $Z = (Z_1, \ldots, Z_p)^T$.
We say $P$ satisfies the pairwise Markov property with respect to a graph $G$ if for
any distinct, non-adjacent vertices $j, k$, we have $Z_j \perp\!\!\!\perp Z_k \mid Z_{-jk}$.
Example. If G is a complete graph, then P satisfies the pairwise Markov
property with respect to G.
The conditional independence graph is thus the minimal graph satisfying the
pairwise Markov property. It turns out that under mild conditions, the pairwise
Markov property implies something much stronger.

Definition (Separates). Given a triple of (disjoint) subsets of nodes $A, B, S$,
we say $S$ separates $A$ from $B$ if every path from a node in $A$ to a node in $B$
contains a node in $S$.
Definition (Global Markov property). We say $P$ satisfies the global Markov
property with respect to $G$ if for any triple $(A, B, S)$ of disjoint subsets of $V$, if
$S$ separates $A$ and $B$, then $Z_A \perp\!\!\!\perp Z_B \mid Z_S$.

Proposition. If $P$ has a positive density, then if it satisfies the pairwise Markov
property with respect to $G$, then it also satisfies the global Markov property.
This is a really nice result, but we will not prove this. However, we will prove
a special case in the example sheet.
So how do we actually construct the conditional independence graph? To do
so, we need to test our variables for conditional dependence. In general, this is
quite hard. However, in the case where we have Gaussian data, it is much easier,
since independence is the same as vanishing covariance.
Notation (MA,B ). Let M be a matrix. Then MA,B refers to the submatrix
given by the rows in A and columns in B.

Since we are going to talk about conditional distributions a lot, the following
calculation will be extremely useful.
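Proposition. Suppose $Z \sim N_p(\mu, \Sigma)$ and $\Sigma$ is positive definite. Then

\[ Z_A \mid Z_B = z_B \sim N_{|A|}\left(\mu_A + \Sigma_{A,B}\Sigma_{B,B}^{-1}(z_B - \mu_B),\ \Sigma_{A,A} - \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A}\right). \]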


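Proof. Of course, we can just compute this directly, maybe using moment
generating functions. But for pleasantness, we adopt a different approach. Note
that for any $M$, we have
\[ Z_A = MZ_B + (Z_A - MZ_B). \]
We shall pick $M$ such that $Z_A - MZ_B$ is independent of $Z_B$, i.e. such that
\[ 0 = \operatorname{cov}(Z_B, Z_A - MZ_B) = \Sigma_{B,A} - \Sigma_{B,B}M^T. \]
So we should take
\[ M = (\Sigma_{B,B}^{-1}\Sigma_{B,A})^T = \Sigma_{A,B}\Sigma_{B,B}^{-1}. \]
We already know that $Z_A - MZ_B$ is Gaussian, so to understand it, we only need
to know its mean and variance. We have
\[ \mathbb{E}[Z_A - MZ_B] = \mu_A - M\mu_B = \mu_A - \Sigma_{A,B}\Sigma_{B,B}^{-1}\mu_B, \]
\[ \operatorname{var}(Z_A - MZ_B) = \Sigma_{A,A} - 2\Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A} + \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,B}\Sigma_{B,B}^{-1}\Sigma_{B,A} = \Sigma_{A,A} - \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A}. \]

Since $Z_A - MZ_B$ is independent of $Z_B$, conditional on $Z_B = z_B$ we have that $Z_A$ is
$Mz_B$ plus a Gaussian with this mean and variance. Then we are done.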
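
The formula in the proposition translates directly into a few lines of linear algebra. This is a minimal sketch; the function name and the 0-based index lists `A` and `B` are our own conventions, not from the lectures.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, A, B, z_B):
    """Mean and covariance of Z_A | Z_B = z_B for Z ~ N(mu, Sigma).

    A and B are lists of 0-based indices; Sigma[B, B] must be invertible.
    """
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    S_AA = Sigma[np.ix_(A, A)]
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    # M = Sigma_AB Sigma_BB^{-1}, computed by a linear solve (Sigma_BB is symmetric)
    M = np.linalg.solve(S_BB, S_AB.T).T
    cond_mean = mu[A] + M @ (np.asarray(z_B, dtype=float) - mu[B])
    cond_cov = S_AA - M @ S_AB.T
    return cond_mean, cond_cov

# Bivariate example with unit variances and correlation 0.5
mean, cov = gaussian_conditional([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], A=[0], B=[1], z_B=[2.0])
print(mean, cov)
```

In the bivariate case this reduces to the familiar conditional mean $\rho z_B$ and conditional variance $1 - \rho^2$.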

Neighbourhood selection
We are going to specialize to $A = \{k\}$ and $B = \{1, \ldots, p\} \setminus \{k\}$. Then we can
separate out the “mean” part and write
\[ Z_k = M_k + Z_{-k}^T\Sigma_{-k,-k}^{-1}\Sigma_{-k,k} + \varepsilon_k, \]

where
\[ M_k = \mu_k - \mu_{-k}^T\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}, \]
\[ \varepsilon_k \mid Z_{-k} \sim N(0, \Sigma_{k,k} - \Sigma_{k,-k}\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}). \]

This now looks like we are doing regression.
We observe that
Lemma. Given $k$, let $j'$ be such that $(Z_{-k})_j = Z_{j'}$. This $j'$ is either $j$ or $j + 1$,
depending on whether it comes before or after $k$.
If the $j$th component of $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ is 0, then $Z_k \perp\!\!\!\perp Z_{j'} \mid Z_{-kj'}$.

Proof. If the $j$th component of $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ is 0, then the distribution of
$Z_k \mid Z_{-k}$ will not depend on $(Z_{-k})_j = Z_{j'}$. So we know
\[ Z_k \mid Z_{-k} \overset{d}{=} Z_k \mid Z_{-kj'}. \]
This is the same as saying $Z_k \perp\!\!\!\perp Z_{j'} \mid Z_{-kj'}$.
Neighbourhood selection exploits this fact. Given $x_1, \ldots, x_n$ which are iid
$\sim Z$ and
\[ X = (x_1^T, \cdots, x_n^T)^T, \]
we can estimate $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ by regressing $X_k$ on $X_{-k}$ using the Lasso (with
an intercept term). We then obtain selected sets $\hat S_k$. There are two ways of
estimating the CIG based on these:

– OR rule: We add the edge (j, k) if j ∈ Ŝk or k ∈ Ŝj .

– AND rule: We add the edge (j, k) if j ∈ Ŝk and k ∈ Ŝj .

The graphical Lasso


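Another way of finding the conditional independence graph is to compute
$\operatorname{var}(Z_{\{j,k\}} \mid Z_{-jk})$ directly. The following lemma will be useful:
Lemma. Let $M \in \mathbb{R}^{p \times p}$ be positive definite, and write
\[ M = \begin{pmatrix} P & Q \\ Q^T & R \end{pmatrix}, \]

where $P$ and $R$ are square. The Schur complement of $R$ is

\[ S = P - QR^{-1}Q^T. \]

Note that this has the same size as $P$. Then

(i) $S$ is positive definite.

(ii)
\[ M^{-1} = \begin{pmatrix} S^{-1} & -S^{-1}QR^{-1} \\ -R^{-1}Q^TS^{-1} & R^{-1} + R^{-1}Q^TS^{-1}QR^{-1} \end{pmatrix}. \]

(iii) $\det(M) = \det(S)\det(R)$.

We have seen this Schur complement when we looked at $\operatorname{var}(Z_A \mid Z_{A^c})$
previously, where we got

\[ \operatorname{var}(Z_A \mid Z_{A^c}) = \Sigma_{A,A} - \Sigma_{A,A^c}\Sigma_{A^c,A^c}^{-1}\Sigma_{A^c,A} = (\Omega_{A,A})^{-1}, \]

where $\Omega = \Sigma^{-1}$ is the precision matrix.
Specializing to the case where $A = \{j, k\}$, we have
\[ \operatorname{var}(Z_{\{j,k\}} \mid Z_{-jk}) = \frac{1}{\det(\Omega_{A,A})}\begin{pmatrix} \Omega_{k,k} & -\Omega_{j,k} \\ -\Omega_{j,k} & \Omega_{j,j} \end{pmatrix}. \]

This tells us $Z_k \perp\!\!\!\perp Z_j \mid Z_{-kj}$ iff $\Omega_{jk} = 0$. Thus, we can approximate the conditional
independence graph by computing the precision matrix $\Omega$.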
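
When $\Sigma$ is known, reading off the conditional independence graph from $\Omega$ is immediate. The sketch below is an illustration under assumptions of our own: the function name, the threshold `tol` for treating an entry as zero, and the example covariance $\Sigma_{jk} = \rho^{|j-k|}$, whose inverse is tridiagonal.

```python
import numpy as np

def cig_edges_from_covariance(Sigma, tol=1e-10):
    """Edges (j, k), j < k, 0-based, of the Gaussian CIG: pairs with Omega_jk != 0."""
    Omega = np.linalg.inv(np.asarray(Sigma, dtype=float))
    p = Omega.shape[0]
    return [(j, k) for j in range(p) for k in range(j + 1, p)
            if abs(Omega[j, k]) > tol]

# Covariance of a stationary Gaussian chain: Sigma_jk = rho^|j - k|
rho = 0.5
idx = np.arange(3)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])

edges = cig_edges_from_covariance(Sigma)
print(Sigma[0, 2])   # marginal covariance between the end points is non-zero
print(edges)         # but the end points are not joined in the CIG
```

The first and last variables are marginally correlated, yet conditionally independent given the middle one, which is exactly the distinction the precision matrix captures.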
Our method to estimate the precision matrix is similar to the Lasso. Recall
that the density of $N_p(\mu, \Sigma)$ is
\[ P(z) = \frac{1}{(2\pi)^{p/2}(\det\Sigma)^{1/2}}\exp\left(-\frac{1}{2}(z - \mu)^T\Sigma^{-1}(z - \mu)\right). \]

The log-likelihood of $(\mu, \Omega)$ based on an iid sample $(x_1, \ldots, x_n)$ is (after dropping
a constant)
\[ \ell(\mu, \Omega) = \frac{n}{2}\log\det\Omega - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T\Omega(x_i - \mu). \]


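To simplify this, we let
\[ \bar x = \frac{1}{n}\sum_{i=1}^n x_i, \quad S = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)^T. \]

Then
\[ \sum (x_i - \mu)^T\Omega(x_i - \mu) = \sum (x_i - \bar x + \bar x - \mu)^T\Omega(x_i - \bar x + \bar x - \mu) = \sum (x_i - \bar x)^T\Omega(x_i - \bar x) + n(\bar x - \mu)^T\Omega(\bar x - \mu). \]

We have
\[ \sum (x_i - \bar x)^T\Omega(x_i - \bar x) = \sum \operatorname{tr}\left((x_i - \bar x)^T\Omega(x_i - \bar x)\right) = n\operatorname{tr}(S\Omega). \]

So we now have
\[ \ell(\mu, \Omega) = -\frac{n}{2}\left(\operatorname{tr}(S\Omega) - \log\det\Omega + (\bar x - \mu)^T\Omega(\bar x - \mu)\right). \]
We are really interested in estimating $\Omega$. So we should try to maximize this over
$\mu$, but that is easy, since this is the same as minimizing $(\bar x - \mu)^T\Omega(\bar x - \mu)$, and
we know $\Omega$ is positive-definite. So we should set $\mu = \bar x$. Thus, we have
\[ \ell(\Omega) = \max_{\mu \in \mathbb{R}^p} \ell(\mu, \Omega) = -\frac{n}{2}\left(\operatorname{tr}(S\Omega) - \log\det\Omega\right). \]
So we can solve for the MLE of $\Omega$ by solving
\[ \min_{\Omega : \Omega \succ 0} \left\{-\log\det\Omega + \operatorname{tr}(S\Omega)\right\}. \]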

One can show that this is convex, and to find the MLE, we can just differentiate
\[ \frac{\partial}{\partial\Omega_{jk}}\log\det\Omega = (\Omega^{-1})_{jk}, \quad \frac{\partial}{\partial\Omega_{jk}}\operatorname{tr}(S\Omega) = S_{jk}, \]
using that $S$ and $\Omega$ are symmetric. So provided that $S$ is positive definite, the
maximum likelihood estimate is just

\[ \hat\Omega = S^{-1}. \]

But we are interested in the high dimensional situation, where we have loads of
variables ($p \geq n$), and $S$, which has rank at most $n - 1$, cannot be positive definite.
To solve this, we use the graphical Lasso.
The graphical Lasso solves
\[ \operatorname*{argmin}_{\Omega : \Omega \succ 0} \left\{-\log\det\Omega + \operatorname{tr}(S\Omega) + \lambda\|\Omega\|_1\right\}, \]

where
\[ \|\Omega\|_1 = \sum_{j,k} |\Omega_{jk}|. \]

Often, people don’t sum over the diagonal elements, as we want to know if
off-diagonal elements ought to be zero. Similar to the case of the Lasso, this gives a
sparse estimate of $\Omega$ from which we may estimate the conditional independence
graph.


4.2 Structural equation modelling


The conditional independence graph only tells us which variables are related to
one another. However, it doesn’t tell us any causal relation between the different
variables. We first need to explain what we mean by a causal model. For this,
we need the notion of a directed acyclic graph.
Definition (Path). A path from $j$ to $k$ is a sequence $j = j_1, j_2, \ldots, j_m = k$ of
(at least two) distinct vertices such that $j_\ell$ and $j_{\ell+1}$ are adjacent.
A path is directed if $j_\ell \to j_{\ell+1}$ for all $\ell$.
Definition (Directed acyclic graph (DAG)). A directed cycle is (almost) a
directed path but with the start and end points the same.
A directed acyclic graph (DAG) is a directed graph containing no directed
cycles.

[Figure: two directed graphs on the vertices a, b, c; the left one is a DAG, the right one is not a DAG.]

We will use directed acyclic graphs to encode causal structures, where we
have a path from $a$ to $b$ if $a$ “affects” $b$.

Definition (Structural equation model (SEM)). A structural equation model $S$
for a random vector $Z \in \mathbb{R}^p$ is a collection of equations

\[ Z_k = h_k(Z_{p_k}, \varepsilon_k), \]

where $k = 1, \ldots, p$ and $\varepsilon_1, \ldots, \varepsilon_p$ are independent, and $p_k \subseteq \{1, \ldots, p\} \setminus \{k\}$ and
such that the graph with $\operatorname{pa}(k) = p_k$ is a directed acyclic graph.
Example. Consider three random variables:
– Z1 = 1 if a student is taking a course, 0 otherwise

– Z2 = 1 if a student is attending catch up lectures, 0 otherwise


– Z3 = 1 if a student heard about machine learning before attending the
course, 0 otherwise.
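Suppose

\[ Z_3 = \varepsilon_3 \sim \mathrm{Bernoulli}(0.25) \]
\[ Z_2 = 1_{\{\varepsilon_2(1 + Z_3) > \frac{1}{2}\}}, \quad \varepsilon_2 \sim U[0, 1] \]
\[ Z_1 = 1_{\{\varepsilon_1(Z_2 + Z_3) > \frac{1}{2}\}}, \quad \varepsilon_1 \sim U[0, 1]. \]

This is then an example of a structural equation model, and the corresponding
DAG is

[Figure: the DAG on $Z_1, Z_2, Z_3$ with edges $Z_3 \to Z_2$, $Z_3 \to Z_1$ and $Z_2 \to Z_1$.]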

Note that the structural equation model for $Z$ determines its distribution,
but the converse is not true. For example, the following two distinct structural
equation models give rise to the same distribution:

\[ \begin{cases} Z_1 = \varepsilon \\ Z_2 = Z_1 \end{cases} \qquad \begin{cases} Z_1 = Z_2 \\ Z_2 = \varepsilon \end{cases} \]

Indeed, if we have two variables that are just always the same, it is hard to tell
which is the cause and which is the effect.
It would be convenient if we could order our variables in a way that $Z_k$
depends only on $Z_j$ for $j < k$. This is known as a topological ordering:
Definition (Descendant). We say $k$ is a descendant of $j$ if there is a directed
path from $j$ to $k$. The set of descendants of $j$ will be denoted $\operatorname{de}(j)$.
Definition (Topological ordering). Given a DAG $G$ with $V = \{1, \ldots, p\}$ we say
that a permutation $\pi : V \to V$ is a topological ordering if $\pi(j) < \pi(k)$ whenever
$k \in \operatorname{de}(j)$.
Thus, given a topological ordering $\pi$, we can write $Z_k$ as a function of
$\varepsilon_{\pi^{-1}(1)}, \ldots, \varepsilon_{\pi^{-1}(\pi(k))}$.
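
A topological ordering can be found by repeatedly removing nodes with no remaining parents. The sketch below is one standard way to do this (Kahn's algorithm), which is not something the lectures prescribe; the function name and the input format, a dictionary sending each node $k$ to $\operatorname{pa}(k)$ with every node appearing as a key, are assumptions made for illustration.

```python
from collections import deque

def topological_ordering(parents):
    """Return pi as a dict node -> position in {1, ..., p}.

    parents maps each node k to the set pa(k); every node must appear as a key.
    Raises ValueError if the graph contains a directed cycle.
    """
    children = {k: [] for k in parents}
    indegree = {k: len(parents[k]) for k in parents}
    for k, pa in parents.items():
        for j in pa:
            children[j].append(k)
    # Start from the nodes without parents
    queue = deque(sorted(k for k in parents if indegree[k] == 0))
    pi = {}
    while queue:
        j = queue.popleft()
        pi[j] = len(pi) + 1
        for k in children[j]:
            indegree[k] -= 1
            if indegree[k] == 0:      # all parents of k have been placed
                queue.append(k)
    if len(pi) != len(parents):
        raise ValueError("graph contains a directed cycle")
    return pi

# The three-variable SEM from the example above: pa(3) = {}, pa(2) = {3}, pa(1) = {2, 3}
pi = topological_ordering({1: {2, 3}, 2: {3}, 3: set()})
print(pi)
```

Every parent is placed before its children, so $\pi(j) < \pi(k)$ whenever $k \in \operatorname{de}(j)$, as the definition requires.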
How do we understand structural equation models? They give us information
that is not encoded in the distribution itself. One way to think about them
is via interventions. We can modify a structural equation model by replacing
the equation for $Z_k$ by setting, e.g. $Z_k = a$. In real life, this may correspond
to forcing all students to go to catch up workshops. This is called a perfect
intervention. The modified SEM gives a new joint distribution for $Z$.
Expectations or probabilities with respect to the new distribution are written by adding
“$\operatorname{do}(Z_k = a)$”. For example, we write

\[ \mathbb{E}(Z_j \mid \operatorname{do}(Z_k = a)). \]

In general, this is different to $\mathbb{E}(Z_j \mid Z_k = a)$, since, for example, if we conditioned
on $Z_2 = a$ in our example, then that would tell us something about $Z_3$.
Example. After the intervention $\operatorname{do}(Z_2 = 1)$, i.e. we force everyone to go to the
catch-up lectures, we have a new SEM with

\[ Z_3 = \varepsilon_3 \sim \mathrm{Bernoulli}(0.25) \]
\[ Z_2 = 1 \]
\[ Z_1 = 1_{\{\varepsilon_1(1 + Z_3) > \frac{1}{2}\}}, \quad \varepsilon_1 \sim U[0, 1]. \]

Then, for example, we can compute
\[ \mathbb{P}(Z_1 = 1 \mid \operatorname{do}(Z_2 = 1)) = \frac{1}{2} \cdot \frac{3}{4} + \frac{3}{4} \cdot \frac{1}{4} = \frac{9}{16}, \]
and by high school probability, we also have
\[ \mathbb{P}(Z_1 = 1 \mid Z_2 = 1) = \frac{7}{12}. \]
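
Both probabilities can be reproduced exactly by enumerating over $Z_3$. The following sketch uses exact rational arithmetic; the helper names are our own. The only difference between the two computations is the weight given to each value of $Z_3$: its marginal law under the intervention, versus its law reweighted by $\mathbb{P}(Z_2 = 1 \mid Z_3)$ under conditioning.

```python
from fractions import Fraction as F

P_Z3 = {0: F(3, 4), 1: F(1, 4)}        # Z3 ~ Bernoulli(1/4)

def p_indicator(c):
    """P(eps * c > 1/2) for eps ~ U[0, 1] and a constant c >= 0."""
    if c <= 0:
        return F(0)
    return max(F(0), 1 - F(1, 2) / c)

def p_z2_given_z3(z2, z3):
    """P(Z2 = z2 | Z3 = z3) from Z2 = 1{eps2 (1 + Z3) > 1/2}."""
    p1 = p_indicator(1 + z3)
    return p1 if z2 == 1 else 1 - p1

def p_z1_given_z2_z3(z1, z2, z3):
    """P(Z1 = z1 | Z2 = z2, Z3 = z3) from Z1 = 1{eps1 (Z2 + Z3) > 1/2}."""
    p1 = p_indicator(z2 + z3)
    return p1 if z1 == 1 else 1 - p1

# Intervention do(Z2 = 1): Z3 keeps its marginal law, Z2 is fixed at 1.
p_do = sum(P_Z3[z3] * p_z1_given_z2_z3(1, 1, z3) for z3 in (0, 1))

# Conditioning on Z2 = 1: Z3 is reweighted by P(Z2 = 1 | Z3).
joint = sum(P_Z3[z3] * p_z2_given_z3(1, z3) * p_z1_given_z2_z3(1, 1, z3) for z3 in (0, 1))
marginal = sum(P_Z3[z3] * p_z2_given_z3(1, z3) for z3 in (0, 1))
p_cond = joint / marginal

print(p_do, p_cond)   # 9/16 7/12
```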
To understand the DAGs associated to structural equation models, we would
like to come up with Markov properties analogous to what we had for undirected
graphs. This, in particular, requires the correct notion of “separation”, which we
call d-separation. Our notion should be such that if S d-separates A and B in
the DAG, then ZA and ZB are conditionally independent given ZS . Let’s think
about some examples. For example, we might have a DAG that looks like this:

[Figure: a DAG on the nodes a, p, s, q, b, in which s is a common descendant of a and b.]

Then we expect that


(i) Za and Zs are not independent;
(ii) $Z_a$ and $Z_s$ are independent given $Z_p$;
(iii) Za and Zb are independent;
(iv) Za and Zb are not independent given Zs .
We can explain a bit more about the last one. For example, the structural
equation model might tell us $Z_s = Z_a + Z_b + \varepsilon$. In this case, if we know that
$Z_a$ is large but $Z_s$ is small, then chances are, $Z_b$ is also large (in the opposite
sign). The point is that $Z_a$ and $Z_b$ both contribute to $Z_s$, and if we know
one of the contributions and the result, we can say something about the other
contribution as well.
Similarly, if we have a DAG that looks like

[Figure: a second DAG involving a, b and s.]

then as above, we know that $Z_a$ and $Z_b$ are not independent given $Z_s$.


Another example is

[Figure: a DAG on the nodes a, p, s, q, b, in which s is a common ancestor of a and b.]

Here we expect
– $Z_a$ and $Z_b$ are not independent.
– $Z_a$ and $Z_b$ are independent given $Z_s$.
To see the first, we observe that if we know about $Z_a$, then this allows us to predict
some information about $Z_s$, which would in turn let us say something about $Z_b$.

Definition (Blocked). In a DAG, we say a path $(j_1, \ldots, j_m)$ between $j_1$ and
$j_m$ is blocked by a set of nodes $S$ (with neither $j_1$ nor $j_m$ in $S$) if there is some
$j_\ell \in S$ and the path is not of the form

\[ j_{\ell-1} \to j_\ell \leftarrow j_{\ell+1}, \]

or there is some $j_\ell$ such that the path is of this form, but neither $j_\ell$ nor any of
its descendants are in $S$.

Definition (d-separate). If G is a DAG, given a triple of (disjoint) subsets of
nodes A, B, S, we say S d-separates A from B if S blocks every path from A to
B.
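The two definitions above translate directly into a brute-force check on small graphs. The sketch below is illustrative only: it enumerates every path in the skeleton, so it is exponential in general, and the function names and the test graph a → p → s ← q ← b are our own assumptions rather than anything fixed by the notes.

```python
def descendants(children, node):
    """All nodes reachable from `node` along directed edges (excluding itself)."""
    seen, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def path_blocked(path, edges, children, S):
    """Apply the definition of 'blocked' to one path in the skeleton."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edges and (nxt, mid) in edges
        if collider:
            # Blocked if neither the collider nor any descendant is in S.
            if mid not in S and not (descendants(children, mid) & S):
                return True
        elif mid in S:
            # Blocked by a non-collider that lies in S.
            return True
    return False

def d_separated(edges, A, B, S):
    """True if S blocks every path between A and B in the DAG given by `edges`."""
    edges, S = set(edges), set(S)
    children, neighbours = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    def paths(current, target, visited):
        # Enumerate simple paths in the undirected skeleton.
        if current == target:
            yield visited
            return
        for n in neighbours.get(current, ()):
            if n not in visited:
                yield from paths(n, target, visited + [n])

    return all(
        path_blocked(p, edges, children, S)
        for a in A for b in B
        for p in paths(a, b, [a])
    )

# Hypothetical DAG with a collider at s: a -> p -> s <- q <- b.
edges = [("a", "p"), ("p", "s"), ("b", "q"), ("q", "s")]
print(d_separated(edges, {"a"}, {"b"}, set()))   # True
print(d_separated(edges, {"a"}, {"b"}, {"s"}))   # False
```

In this assumed graph, conditioning on the collider s opens the path between a and b, which matches the intuition described for Zs = Za + Zb + ε.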
For convenience, we define

Definition (v-structure). A set of three nodes is called a v-structure if one node
is a child of the two other nodes, and these two nodes are not adjacent.
It is now natural to define
Definition (Markov properties). Let P be the distribution of Z and let f be
the density. Given a DAG G, we say P satisfies

(i) the Markov factorization criterion if

\[ f(z_1, \ldots, z_p) = \prod_{k=1}^p f(z_k \mid z_{\mathrm{pa}(k)}); \]

(ii) the global Markov property if $Z_A \perp\!\!\!\perp Z_B \mid Z_S$ for all disjoint A, B, S
such that A and B are d-separated by S.

Proposition. If P has a density with respect to a product measure, then (i)
and (ii) are equivalent.
How does this fit into the structural equation model?
Proposition. Let P be generated by a structural equation model with DAG G. Then P
obeys the Markov factorization criterion.

Proof. We assume G is topologically ordered (i.e. the identity map is a topological
ordering). Then we can always write

\[ f(z_1, \ldots, z_p) = f(z_1) f(z_2 \mid z_1) \cdots f(z_p \mid z_1, z_2, \ldots, z_{p-1}). \]

By definition of a topological order, we know pa(k) ⊆ {1, . . . , k − 1}. Since Zk
is a function of Zpa(k) and the noise εk , which is independent of Z1 , . . . , Zk−1 , we know

\[ Z_k \perp\!\!\!\perp Z_{\{1, \ldots, k-1\} \setminus \mathrm{pa}(k)} \mid Z_{\mathrm{pa}(k)}. \]

Thus,
\[ f(z_k \mid z_1, \ldots, z_{k-1}) = f(z_k \mid z_{\mathrm{pa}(k)}). \]


4.3 The PC algorithm


We now want to try to find out the structural equation model given some data,
and in particular, determine the causal structure. As we previously saw, there is
no hope of determining this completely, even if we know the distribution of the
Z completely. Let’s consider the different obstacles to this problem.

Causal minimality
If P is generated by an SEM with DAG G, then from the above, we know that
P is Markov with respect to G. The converse is also true: if P is Markov with
respect to a DAG G, then there exists an SEM with DAG G that generates P .
This immediately implies that P will be Markov with respect to many DAGs.
For example, a DAG whose skeleton is complete will always work. This suggests
the following definition:
Definition (Causal minimality). A distribution P satisfies causal minimality
with respect to G if P is Markov with respect to G but not with respect to any
proper subgraph of G.

Markov equivalent DAGs


It is natural to aim for finding a causally minimal DAG. However, this does not
give a unique solution, as we saw previously with the two variables that are
always the same.
In general, two different DAGs may satisfy the same set of d-separations,
and then a distribution is Markov with respect to one iff it is Markov with respect
to the other, and we cannot distinguish between the two.
Definition (Markov equivalence). For a DAG G, we let

M(G) = {distributions P such that P is Markov with respect to G}.

We say two DAGs G1 , G2 are Markov equivalent if M(G1 ) = M(G2 ).


What is nice is that there is a rather easy way of determining when two
DAGs are Markov equivalent.
Proposition. Two DAGs are Markov equivalent iff they have the same skeleton
and the same set of v-structures.
The set of all DAGs that are Markov equivalent to a given DAG can be
represented by a CPDAG (completed partially directed acyclic graph), which
contains an edge (j, k) iff some member of the equivalence class does.
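The skeleton-and-v-structure criterion is easy to check mechanically. Here is a small sketch, assuming each DAG is given as a list of directed edges (parent, child) on a common node set with sortable labels; the function names are hypothetical.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Set of (unordered parent pair, child) with the two parents non-adjacent."""
    edges = set(edges)
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    result = set()
    for child, pa in parents.items():
        for j, k in combinations(sorted(pa), 2):
            if frozenset((j, k)) not in skel:
                result.add((frozenset((j, k)), child))
    return result

def markov_equivalent(edges1, edges2):
    """Apply the criterion: same skeleton and same v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

chain = [(1, 2), (2, 3)]        # 1 -> 2 -> 3
reverse = [(2, 1), (3, 2)]      # 1 <- 2 <- 3
fork = [(2, 1), (2, 3)]         # 1 <- 2 -> 3
collider = [(1, 2), (3, 2)]     # 1 -> 2 <- 3
print(markov_equivalent(chain, reverse), markov_equivalent(chain, fork))  # True True
print(markov_equivalent(chain, collider))                                 # False
```

The three graphs without a collider share a skeleton and have no v-structures, so they cannot be told apart from the distribution alone, whereas the collider forms its own equivalence class.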

Faithfulness
To describe the final issue, consider the SEM

Z1 = ε1 , Z2 = αZ1 + ε2 , Z3 = βZ1 + γZ2 + ε3 .

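We take ε ∼ N3 (0, I). Then we have Z = (Z1 , Z2 , Z3 ) ∼ N3 (0, Σ), where

\[
\Sigma = \begin{pmatrix}
1 & \alpha & \beta + \alpha\gamma \\
\alpha & \alpha^2 + 1 & \alpha(\beta + \alpha\gamma) + \gamma \\
\beta + \alpha\gamma & \alpha(\beta + \alpha\gamma) + \gamma & \beta^2 + \gamma^2(\alpha^2 + 1) + 1 + 2\alpha\beta\gamma
\end{pmatrix}.
\]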


[Figure: the DAG with edges Z1 → Z2 , Z1 → Z3 and Z2 → Z3 .]

Now if we pick values of α, β, γ such that

\[ \beta + \alpha\gamma = 0, \]

then we obtain an extra independence relation $Z_1 \perp\!\!\!\perp Z_3$ in our system. For
example, if we pick β = −1 and α = γ = 1, then

\[
\Sigma = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix}.
\]
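While there is an extra independence relation, we cannot remove any edge while
still satisfying the Markov property. Indeed:
– If we remove 1 → 2, then this would require $Z_1 \perp\!\!\!\perp Z_2$, but this is not true.
– If we remove 2 → 3, then this would require $Z_2 \perp\!\!\!\perp Z_3 \mid Z_1$, but we have
\[
\operatorname{var}((Z_2, Z_3) \mid Z_1) = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} - \begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix},
\]
and this is not diagonal.
– If we remove 1 → 3, then this would require $Z_1 \perp\!\!\!\perp Z_3 \mid Z_2$, but
\[
\operatorname{var}((Z_1, Z_3) \mid Z_2) = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} - \frac{1}{2} \begin{pmatrix} 1 \\ 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \end{pmatrix},
\]
which is again non-diagonal.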
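These covariance computations can be reproduced exactly. The sketch below builds Σ from the SEM with α = γ = 1, β = −1 and forms the conditional covariances via the Schur complement; the helper `cond_cov` is an illustrative name, and indices are 0-based in the code.

```python
from fractions import Fraction

one, zero = Fraction(1), Fraction(0)
alpha, beta, gamma = Fraction(1), Fraction(-1), Fraction(1)

# Rows express Z1, Z2, Z3 as linear combinations of (eps1, eps2, eps3).
A = [
    [one, zero, zero],
    [alpha, one, zero],
    [beta + alpha * gamma, gamma, one],
]

# Since eps ~ N(0, I), the covariance of Z is A A^T.
Sigma = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(3)]
         for i in range(3)]

def cond_cov(Sigma, keep, given):
    """Covariance of Z_keep given the single coordinate Z_given (Schur complement)."""
    return [
        [Sigma[a][b] - Sigma[a][given] * Sigma[given][b] / Sigma[given][given]
         for b in keep]
        for a in keep
    ]

print(Sigma[0][2])                     # 0, the extra independence of Z1 and Z3
print(cond_cov(Sigma, [1, 2], 0))      # var((Z2, Z3) | Z1), off-diagonal entry 1
print(cond_cov(Sigma, [0, 2], 1))      # var((Z1, Z3) | Z2), off-diagonal entry -1/2
```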
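So this DAG satisfies causal minimality. However, P can also be generated by
the structural equation model
\[
\tilde Z_1 = \tilde\varepsilon_1, \quad \tilde Z_2 = \tilde Z_1 + \frac{1}{2} \tilde Z_3 + \tilde\varepsilon_2, \quad \tilde Z_3 = \tilde\varepsilon_3,
\]
where the $\tilde\varepsilon_i$ are independent with
\[
\tilde\varepsilon_1 \sim N(0, 1), \quad \tilde\varepsilon_2 \sim N(0, \tfrac{1}{2}), \quad \tilde\varepsilon_3 \sim N(0, 2).
\]
(These variances reproduce Σ: var(Z̃3 ) = 2, cov(Z̃2 , Z̃3 ) = ½ · 2 = 1 and
var(Z̃2 ) = 1 + ¼ · 2 + ½ = 2.) Then this has the DAG

[Figure: the DAG with edges Z1 → Z2 and Z3 → Z2 .]

This is a strictly smaller DAG in terms of the number of edges involved. It is
easy to see that this satisfies causal minimality.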
Definition (Faithfulness). We say P is faithful to a DAG G if it is Markov
with respect to G and for all A, B, S disjoint, $Z_A \perp\!\!\!\perp Z_B \mid Z_S$ implies A, B are
d-separated by S.


Determining the DAG


We shall assume our distribution is faithful to some $G^0$, and see if we can figure
out $G^0$ from P , or even better, from data.
To find $G^0$, the following proposition helps us compute the skeleton:
Proposition. If nodes j and k are adjacent in a DAG G, then no set can
d-separate them.
If they are not adjacent, and π is a topological order for G with π(j) < π(k),
then they are d-separated by pa(k).
Proof. Only the last part requires proof. Consider a path $j = j_1, \ldots, j_m = k$.
Start reading the path from k and go backwards. If it starts as $j_{m-1} \to k$, then
$j_{m-1}$ is a parent of k and blocks the path. Otherwise, it looks like $k \to j_{m-1}$.
We keep going along the path until we first see something of the form

\[ \cdots \to a \leftarrow \cdots \]

This must exist, since j is not a descendant of k by topological ordering. So it
suffices to show that neither a nor any of its descendants is in pa(k). But a is a
descendant of k, so if this were the case, it would form a directed cycle.
Finding the v-structures is harder, and at best we can do so up to Markov
equivalence. To do that, observe the following:
Proposition. Suppose we have j − ℓ − k in the skeleton of a DAG.

(i) If j → ℓ ← k, then no S that d-separates j and k can have ℓ ∈ S.
(ii) If there exists S that d-separates j and k and ℓ ∉ S, then j → ℓ ← k.
Denote the set of nodes adjacent to the vertex k in the graph G by adj(G, k).
We can now describe the first part of the PC algorithm, which finds the
skeleton of the “true DAG”:
(i) Set Ĝ to be the complete undirected graph. Set ℓ = −1.
(ii) Repeat the following steps:
(a) Set ℓ = ℓ + 1.
(b) Repeat the following steps:
i. Select a new ordered pair of nodes j, k that are adjacent in Ĝ and
such that | adj(Ĝ, j) \ {k}| ≥ ℓ.
ii. Repeat the following steps:
A. Choose a new S ⊆ adj(Ĝ, j) \ {k} with |S| = ℓ.
B. If Zj ⊥⊥ Zk | ZS , then delete the edge jk, and store S(k, j) =
S(j, k) = S.
C. Repeat until j − k is deleted or all S chosen.
iii. Repeat until all pairs of adjacent nodes are inspected.


(c) Repeat until ℓ ≥ p − 2.
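The skeleton phase above can be sketched in a few lines, assuming access to a conditional independence oracle. In the code below, `indep(j, k, S)` is a hypothetical callable standing in for that oracle, and node labels are assumed to be sortable; the order in which pairs are visited is one arbitrary choice among many.

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """Skeleton phase of the PC algorithm.

    `indep(j, k, S)` is an assumed oracle returning True when Z_j and Z_k
    are conditionally independent given Z_S.
    Returns the adjacency sets and the stored separating sets S(j, k).
    """
    nodes = list(nodes)
    adj = {j: set(nodes) - {j} for j in nodes}    # complete undirected graph
    sepset = {}
    level = 0
    while level <= len(nodes) - 2:                # levels 0, 1, ..., p - 2
        for j in nodes:
            for k in sorted(adj[j]):              # snapshot of current neighbours
                candidates = sorted(adj[j] - {k})
                if len(candidates) < level:
                    continue
                for S in combinations(candidates, level):
                    if indep(j, k, set(S)):
                        adj[j].discard(k)
                        adj[k].discard(j)
                        sepset[(j, k)] = sepset[(k, j)] = set(S)
                        break
        level += 1
    return adj, sepset

# Oracle for the chain 1 -> 2 -> 3: the only independence is Z1 and Z3 given Z2.
chain_oracle = lambda j, k, S: {j, k} == {1, 3} and S == {2}
adj, sepset = pc_skeleton([1, 2, 3], chain_oracle)
print(adj)             # {1: {2}, 2: {1, 3}, 3: {2}}
print(sepset[(1, 3)])  # {2}
```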


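Suppose P is faithful to a DAG $G^0$. At each stage of the algorithm, the skeleton
of $G^0$ will be a subgraph of Ĝ. On the other hand, edges (j, k) remaining at
termination will have

\[ Z_j \not\perp\!\!\!\perp Z_k \mid Z_S \quad \text{for all } S \subseteq \mathrm{adj}(\hat G, j) \text{ and all } S \subseteq \mathrm{adj}(\hat G, k). \]

Since non-adjacent nodes of $G^0$ are d-separated by the parents of one of them,
and these parents remain adjacent to it in Ĝ, such j and k must be adjacent in
$G^0$. Thus, Ĝ and $G^0$ have the same skeleton.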


To find the v-structures, we perform:
(i) For all j − ℓ − k in Ĝ with j and k not adjacent, do:
(a) If ℓ ∉ S(j, k), then orient j → ℓ ← k.

This gives us the Markov equivalence class, and we may orient the other edges
using other properties like acyclicity.
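As a rough illustration of this orientation step, the sketch below takes adjacency sets and stored separating sets in the (assumed) format of dictionaries keyed by node and by ordered pair, and returns the directed edges forced by v-structures. It does not attempt the further orientation by acyclicity.

```python
def orient_v_structures(adj, sepset):
    """Directed edges (parent, child) implied by v-structures.

    `adj[l]` is the set of neighbours of l in the estimated skeleton and
    `sepset[(j, k)]` the set stored when the edge j - k was deleted; both
    formats are assumptions made for this sketch.
    """
    directed = set()
    for l in adj:
        nbrs = sorted(adj[l])
        for idx, j in enumerate(nbrs):
            for k in nbrs[idx + 1:]:
                # j - l - k with j, k non-adjacent and l outside S(j, k).
                if k not in adj[j] and l not in sepset[(j, k)]:
                    directed.add((j, l))
                    directed.add((k, l))
    return directed

skeleton = {1: {2}, 2: {1, 3}, 3: {2}}
# Z1 and Z3 marginally independent: 2 must be a collider.
print(orient_v_structures(skeleton, {(1, 3): set(), (3, 1): set()}))
# Z1 and Z3 independent given Z2: no v-structure, nothing is oriented.
print(orient_v_structures(skeleton, {(1, 3): {2}, (3, 1): {2}}))
```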
If we want to apply this to data sets, then we need to apply some conditional
independence tests instead of querying our oracle to figure out if things are
conditionally dependent. However, errors in the algorithm propagate, and the
whole process may be derailed by early errors. Moreover, the result of the
algorithm may depend on how we iterate through the nodes. People have tried
many ways to fix these problems, but in general, this method is rather unstable.
Yet, if we have large data sets, this can produce quite decent results.


5 High-dimensional inference
5.1 Multiple testing
Finally, we talk about high-dimensional inference. Suppose we have come up
with a large number of potential drugs, and want to see if they are effective in
killing bacteria. Naively, we might try to run a hypothesis test on each of them,
using a p < 0.05 threshold. But this is a terrible idea, since each test has a 0.05
chance of giving a false positive, so even if all the drugs are actually useless, we
would expect to incorrectly conclude that about 5% of them are useful.
In general, suppose we have some null hypotheses H1 , . . . , Hm . By definition,
a p-value pi for Hi is a random variable such that

\[ P_{H_i}(p_i \le \alpha) \le \alpha \]

for all α ∈ [0, 1].
Let I0 ⊆ {1, . . . , m} be the set of indices of the true null hypotheses, and let
m0 = |I0 | be their number. Given a procedure for
rejecting hypotheses (a multiple testing procedure), we let N be the number of
false rejections (false positives), and R the total number of rejections. One can
also think about the number of false negatives, but we shall not do that here.
Traditionally, multiple-testing procedures sought to control the family-wise
error rate (FWER), defined by P(N ≥ 1). The simplest way to control this is
to use the Bonferroni correction, which rejects Hi if $p_i \le \alpha/m$. Usually, we might
have α ∼ 0.05, and so this threshold would be very small if we have lots of hypotheses (e.g.
a million). Unsurprisingly, we have

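Theorem. When using the Bonferroni correction, we have

\[ \mathrm{FWER} \le E(N) \le \frac{m_0 \alpha}{m} \le \alpha. \]

Proof. The first inequality is Markov's inequality, and the last is obvious. The
second follows from
\[
E(N) = E\left( \sum_{i \in I_0} \mathbf{1}_{p_i \le \alpha/m} \right) = \sum_{i \in I_0} P\left( p_i \le \frac{\alpha}{m} \right) \le \frac{m_0 \alpha}{m}.
\]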

The Bonferroni correction is a rather conservative procedure, since all these inequalities
can be quite loose. When we have a large number of hypotheses, the criterion
for rejection is very strict. Can we do better?
A more sophisticated approach is the closed testing procedure. For each
non-empty subset I ⊆ {1, . . . , m}, we let HI be the null hypothesis that Hi is
true for all i ∈ I. This is known as an intersection hypothesis. Suppose for each
I ⊆ {1, . . . , m} non-empty, we have an α-level test φI for HI (a local test), so
that
PHI (φI = 1) ≤ α.
Here φI takes values in {0, 1}, and φI = 1 means rejection. The closed testing
procedure then rejects HI iff for all J ⊇ I, we have φJ = 1.
Example. Consider the following lattice of intersection hypotheses (in the figure,
the ones rejected by their local tests are coloured red):

H1234

H123 H124 H134 H234

H12 H13 H14 H23 H24 H34

H1 H2 H3 H4

In this case, we reject H1 but not H2 by closed testing. While H23 is rejected,
we cannot tell if it is H2 or H3 that should be rejected.
This might seem like a very difficult procedure to analyze, but it turns out it
is extremely simple.
Theorem. Closed testing makes no false rejections with probability ≥ 1 − α.
In particular, FWER ≤ α.
Proof. In order for there to be a false rejection, we must have falsely rejected
HI0 with the local test, which has probability at most α.
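For small m, the closed testing procedure can be carried out by brute force over all non-empty subsets. The sketch below assumes the local tests are supplied as a callable `local_test(I)`; as one possible choice it uses a Bonferroni-type local test on made-up p-values, which is an illustration rather than part of the theorem.

```python
from itertools import combinations

def closed_testing(m, local_test):
    """Individual hypotheses (0-based indices) rejected by closed testing.

    `local_test(I)` is an assumed callable returning True when the local
    test rejects the intersection hypothesis H_I, for I a frozenset of indices.
    """
    subsets = [frozenset(c) for r in range(1, m + 1)
               for c in combinations(range(m), r)]
    rejected_locally = {I for I in subsets if local_test(I)}
    # H_i is rejected iff every intersection hypothesis containing i is rejected.
    return {i for i in range(m)
            if all(J in rejected_locally for J in subsets if i in J)}

# Hypothetical p-values and a Bonferroni-type local test at level alpha.
p = [0.01, 0.04, 0.03, 0.005]
alpha = 0.05
bonferroni_local = lambda I: min(p[i] for i in I) <= alpha / len(I)
print(closed_testing(len(p), bonferroni_local))  # {0, 3}
```

The loop visits all 2^m − 1 subsets, so this is only feasible for a handful of hypotheses.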
But of course this doesn’t immediately give us an algorithm we can apply to
data. Different choices for the local test give rise to different multiple testing
procedures. One famous example is Holm’s procedure. This takes φI to be the
Bonferroni test, where φI = 1 iff $p_i \le \alpha/|I|$ for some i ∈ I.
When m is large, we don’t want to compute all φI , since there are $2^m - 1$
of them. So we might want to find a shortcut. With a moment of
thought, we see that Holm’s procedure amounts to the following:
– Let (i) be the index of the ith smallest p-value, so $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.
– Step 1: If $p_{(1)} \le \alpha/m$, then reject $H_{(1)}$ and go to step 2. Otherwise, accept
all null hypotheses.
– Step i: If $p_{(i)} \le \alpha/(m - i + 1)$, then reject $H_{(i)}$ and go to step i + 1. Otherwise,
accept $H_{(i)}, H_{(i+1)}, \ldots, H_{(m)}$.
– Step m: If $p_{(m)} \le \alpha$, then reject $H_{(m)}$. Otherwise, accept $H_{(m)}$.
The interesting thing about this is that it has the same bound on FWER as the
Bonferroni correction, but the conditions here are more lenient.
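The step-down description can be written out directly and compared with the plain Bonferroni correction. This is a minimal sketch with hypothetical p-values; indices are 0-based in the code.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices i with p_i <= alpha / m."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]

def holm(pvalues, alpha=0.05):
    """Indices rejected by Holm's step-down procedure, in order of rejection."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = []
    for step, i in enumerate(order, start=1):
        # Step `step` compares the step-th smallest p-value with alpha / (m - step + 1).
        if pvalues[i] <= alpha / (m - step + 1):
            rejected.append(i)
        else:
            break   # accept this and all remaining hypotheses
    return rejected

p = [0.01, 0.04, 0.015, 0.005]
print(bonferroni(p))  # [0, 3]
print(holm(p))        # [3, 0, 2, 1]
```

On these made-up p-values, Bonferroni rejects two hypotheses while Holm rejects all four, since Holm's thresholds α/(m − i + 1) grow as hypotheses are rejected.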
But if m is very large, then the criterion for rejecting H(1) is still quite strict.
The problem is that controlling FWER is a very strong condition. Instead of
controlling the probability that there is one false rejection, when m is large, it
might be more reasonable to control the proportion of false discoveries. Many
modern multiple testing procedures aim to control the false discovery rate
\[
\mathrm{FDR} = E\left( \frac{N}{\max\{R, 1\}} \right).
\]
The funny maximum in the denominator is just to avoid division by zero. When
R = 0, then we must have N = 0 as well, so what is put in the denominator
doesn’t really matter.
The Benjamini–Hochberg procedure attempts to control the FDR at level α
and works as follows:


– Let $\hat k = \max\{ i : p_{(i)} \le \alpha i / m \}$. Then reject $H_{(1)}, \ldots, H_{(\hat k)}$, or accept all
hypotheses if $\hat k$ is not defined.
Under certain conditions, this does control the false discovery rate.
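A short sketch of the procedure follows; it returns the 0-based indices of the rejected hypotheses, and the p-values in the example are made up for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg procedure at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_hat = 0                                   # 0 encodes "k_hat not defined"
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_hat = rank                        # keep the largest qualifying rank
    return sorted(order[:k_hat])

p = [0.005, 0.03, 0.035, 0.04]
print(benjamini_hochberg(p))  # [0, 1, 2, 3]
```

Note that the second smallest p-value here exceeds its own threshold 2α/m = 0.025, yet it is still rejected because a larger rank qualifies; this "largest i" rule is what distinguishes the procedure from a step-down rule.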

Theorem. Suppose that for each i ∈ I0 , pi is independent of {pj : j ≠ i}. Then
using the Benjamini–Hochberg procedure, the false discovery rate satisfies
\[
\mathrm{FDR} = E\, \frac{N}{\max(R, 1)} \le \frac{\alpha m_0}{m} \le \alpha.
\]

Curiously, while the proof requires pi to be independent of the others, in
simulations, even when there is no hope that the pi are independent, it appears
that the Benjamini–Hochberg procedure still works very well, and people are still trying to
understand what it is that makes Benjamini–Hochberg work so well in general.
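Proof. The false discovery rate is
\[
\begin{aligned}
E\, \frac{N}{\max(R, 1)} &= \sum_{r=1}^m E\left[ \frac{N}{r} \mathbf{1}_{R = r} \right] \\
&= \sum_{r=1}^m \frac{1}{r} E\left[ \sum_{i \in I_0} \mathbf{1}_{p_i \le \alpha r/m} \mathbf{1}_{R = r} \right] \\
&= \sum_{r=1}^m \sum_{i \in I_0} \frac{1}{r} P\left( p_i \le \frac{\alpha r}{m}, R = r \right).
\end{aligned}
\]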

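The brilliant idea is, for each i ∈ I0 , to let Ri be the number of rejections when
applying a modified Benjamini–Hochberg procedure to $p^{\setminus i} = \{p_1, \ldots, p_m\} \setminus \{p_i\}$
with cutoff
\[
\hat k_i = \max\left\{ j : p^{\setminus i}_{(j)} \le \frac{\alpha(j + 1)}{m} \right\}.
\]
We observe that for i ∈ I0 and r ≥ 1, we have
\[
\begin{aligned}
\left\{ p_i \le \frac{\alpha r}{m}, R = r \right\} &= \left\{ p_i \le \frac{\alpha r}{m},\ p_{(r)} \le \frac{\alpha r}{m},\ p_{(s)} > \frac{\alpha s}{m} \text{ for all } s > r \right\} \\
&= \left\{ p_i \le \frac{\alpha r}{m},\ p^{\setminus i}_{(r-1)} \le \frac{\alpha r}{m},\ p^{\setminus i}_{(s-1)} > \frac{\alpha s}{m} \text{ for all } s > r \right\} \\
&= \left\{ p_i \le \frac{\alpha r}{m},\ R_i = r - 1 \right\}.
\end{aligned}
\]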
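The key point is that Ri = r − 1 depends only on the other p-values. So the
FDR is equal to
\[
\begin{aligned}
\mathrm{FDR} &= \sum_{r=1}^m \sum_{i \in I_0} \frac{1}{r} P\left( p_i \le \frac{\alpha r}{m}, R_i = r - 1 \right) \\
&= \sum_{r=1}^m \sum_{i \in I_0} \frac{1}{r} P\left( p_i \le \frac{\alpha r}{m} \right) P(R_i = r - 1).
\end{aligned}
\]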


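Using that $P(p_i \le \alpha r/m) \le \alpha r/m$ by definition, this is
\[
\begin{aligned}
&\le \frac{\alpha}{m} \sum_{i \in I_0} \sum_{r=1}^m P(R_i = r - 1) \\
&= \frac{\alpha}{m} \sum_{i \in I_0} P(R_i \in \{0, \ldots, m - 1\}) \\
&= \frac{\alpha m_0}{m}.
\end{aligned}
\]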
This is one of the most used procedures in modern statistics, especially in
the biological sciences.

5.2 Inference in high-dimensional regression


We have more-or-less some answer as to how to do hypothesis testing, given that
we know how to obtain these p-values. But how do we obtain these in the first
place?
For example, we might be trying to do regression, and are trying to figure out
which coefficients are non-zero. Consider the normal
linear model $Y = X\beta^0 + \varepsilon$, where $\varepsilon \sim N_n(0, \sigma^2 I)$. In the low-dimensional setting,
we have $\sqrt{n}(\hat\beta^{OLS} - \beta^0) \sim N_p(0, \sigma^2 (\frac{1}{n} X^T X)^{-1})$. Since this distribution does not depend on
$\beta^0$, we can use this to form confidence intervals and hypothesis tests.
However, if we have more coefficients than there are data points, then we
can’t do ordinary least squares. So we need to look for something else. For
example, we might want to replace the OLS estimate with the Lasso estimate.
However, $\sqrt{n}(\hat\beta^L_\lambda - \beta^0)$ has an intractable distribution. In particular, since $\hat\beta^L_\lambda$
has a bias, the distribution will depend on $\beta^0$ in a complicated way.
The recently introduced debiased Lasso tries to overcome these issues. See
van de Geer, Bühlmann, Ritov, Dezeure (2014) for more details. Let $\hat\beta$ be the
Lasso solution at λ > 0. Recall the KKT conditions, which say that $\hat\nu$ defined by
\[
\frac{1}{n} X^T (Y - X\hat\beta) = \lambda \hat\nu
\]
satisfies $\|\hat\nu\|_\infty \le 1$ and $\hat\nu_{\hat S} = \operatorname{sgn}(\hat\beta_{\hat S})$, where $\hat S = \{k : \hat\beta_k \ne 0\}$.
We set $\hat\Sigma = \frac{1}{n} X^T X$. Then we can rewrite the KKT conditions as
\[
\hat\Sigma(\hat\beta - \beta^0) + \lambda\hat\nu = \frac{1}{n} X^T \varepsilon.
\]
What we are trying to understand is $\hat\beta - \beta^0$. So it would be nice if we can find
some sort of inverse to $\hat\Sigma$. If so, then $\hat\beta - \beta^0$ plus some correction term involving
$\hat\nu$ would then be equal to a Gaussian.
Of course, the problem is that in the high-dimensional setting, $\hat\Sigma$ has no
hope of being invertible. So we want to find some approximate inverse $\hat\Theta$ so that
the error we make is not too large. If we are equipped with such a $\hat\Theta$, then we
have
\[
\sqrt{n}(\hat\beta + \lambda\hat\Theta\hat\nu - \beta^0) = \frac{1}{\sqrt{n}} \hat\Theta X^T \varepsilon + \Delta,
\]
where
\[
\Delta = \sqrt{n}(\hat\Theta\hat\Sigma - I)(\beta^0 - \hat\beta).
\]
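As a sketch of why this identity holds, using only the rewritten KKT conditions and the definition of ∆, one can multiply the KKT identity by Θ̂, then add and subtract β̂ − β⁰:

```latex
\begin{aligned}
\hat\Theta\hat\Sigma(\hat\beta - \beta^0) + \lambda\hat\Theta\hat\nu
  &= \tfrac{1}{n}\hat\Theta X^T\varepsilon, \\
(\hat\beta - \beta^0) + \lambda\hat\Theta\hat\nu
  &= \tfrac{1}{n}\hat\Theta X^T\varepsilon - (\hat\Theta\hat\Sigma - I)(\hat\beta - \beta^0), \\
\sqrt{n}\,(\hat\beta + \lambda\hat\Theta\hat\nu - \beta^0)
  &= \tfrac{1}{\sqrt{n}}\hat\Theta X^T\varepsilon + \sqrt{n}\,(\hat\Theta\hat\Sigma - I)(\beta^0 - \hat\beta).
\end{aligned}
```

The first term on the right is Gaussian conditional on X, and the second is exactly ∆, which vanishes when Θ̂ is a true inverse of Σ̂.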


We hope we can choose $\hat\Theta$ so that ∆ is small. We can then use the quantity
\[
\hat b = \hat\beta + \lambda\hat\Theta\hat\nu = \hat\beta + \frac{1}{n} \hat\Theta X^T (Y - X\hat\beta)
\]
as our modified estimator, called the debiased Lasso.
How do we bound ∆? We already know that (under compatibility and
sparsity conditions) we can make the $\ell_1$ norm $\|\beta^0 - \hat\beta\|_1$ small with high
probability. So if the $\ell_\infty$ norm of each of the rows of $\hat\Theta\hat\Sigma - I$ is small, then
Hölder's inequality allows us to bound ∆.
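Write $\hat\theta_j$ for the jth row of $\hat\Theta$, and $I_j$ for the jth column of the identity matrix. Then

\[ \|(\hat\Sigma\hat\Theta^T)_j - I_j\|_\infty \le \eta \]

is equivalent to $|(\hat\Sigma\hat\Theta^T)_{kj}| \le \eta$ for k ≠ j and $|(\hat\Sigma\hat\Theta^T)_{jj} - 1| \le \eta$. Using the
definition of $\hat\Sigma$, these are equivalent to

\[
\frac{1}{n} |X_k^T X \hat\theta_j| \le \eta \quad (k \ne j), \qquad \left| \frac{1}{n} X_j^T X \hat\theta_j - 1 \right| \le \eta.
\]

The first is the same as saying

\[
\frac{1}{n} \|X_{-j}^T X \hat\theta_j\|_\infty \le \eta.
\]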
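This is quite reminiscent of the KKT conditions for the Lasso. So let us define
\[
\begin{aligned}
\hat\gamma^{(j)} &= \operatorname*{argmin}_{\gamma \in \mathbb{R}^{p-1}} \left\{ \frac{1}{2n} \|X_j - X_{-j}\gamma\|_2^2 + \lambda_j \|\gamma\|_1 \right\}, \\
\hat\tau_j^2 &= \frac{1}{n} X_j^T (X_j - X_{-j}\hat\gamma^{(j)}) = \frac{1}{n} \|X_j - X_{-j}\hat\gamma^{(j)}\|_2^2 + \lambda_j \|\hat\gamma^{(j)}\|_1.
\end{aligned}
\]
The second equality is an exercise on the example sheet.
We can then set
\[
\hat\theta_j = -\frac{1}{\hat\tau_j^2} \left( \hat\gamma^{(j)}_1, \ldots, \hat\gamma^{(j)}_{j-1}, -1, \hat\gamma^{(j)}_j, \ldots, \hat\gamma^{(j)}_{p-1} \right)^T.
\]
The factor $1/\hat\tau_j^2$ is there so that the second inequality holds.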


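Then by construction, we have

\[
X\hat\theta_j = \frac{X_j - X_{-j}\hat\gamma^{(j)}}{X_j^T (X_j - X_{-j}\hat\gamma^{(j)})/n}.
\]

Thus, we have $X_j^T X \hat\theta_j / n = 1$, and by the KKT conditions for the Lasso, we have

\[
\frac{\hat\tau_j^2}{n} \|X_{-j}^T X \hat\theta_j\|_\infty \le \lambda_j.
\]

Thus, with the choice of $\hat\Theta$ above, we have

\[
\|\Delta\|_\infty \le \sqrt{n}\, \|\hat\beta - \beta^0\|_1 \max_j \frac{\lambda_j}{\hat\tau_j^2}.
\]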


Now this is good as long as we can ensure $\lambda_j / \hat\tau_j^2$ to be small. When is this true?
We can consider a random design setting, where each row of X is iid $N_p(0, \Sigma)$
for some positive-definite Σ. Write $\Omega = \Sigma^{-1}$.
Then by our study of the neighbourhood selection procedure, we know that
for each j, we can write
\[ X_j = X_{-j}\gamma^{(j)} + \varepsilon^{(j)}, \]
where $\varepsilon^{(j)}_i \mid X_{-j} \sim N(0, \Omega_{jj}^{-1})$ are iid and $\gamma^{(j)} = -\Omega_{jj}^{-1} \Omega_{-j,j}$. To apply our
results, we need to ensure that the $\gamma^{(j)}$ are sparse. Let us therefore define
\[ s_j = \sum_{k \ne j} \mathbf{1}_{\Omega_{jk} \ne 0}, \]
and set $s_{\max} = \max(\max_j s_j, s)$.


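Theorem. Suppose the minimum eigenvalue of Σ is always at least $c_{\min} > 0$
and $\max_j \Sigma_{jj} \le 1$. Suppose further that $s_{\max} \sqrt{\log(p)/n} \to 0$. Then there exist
constants $A_1, A_2$ such that setting $\lambda = \lambda_j = A_1 \sqrt{\log(p)/n}$, we have
\[
\begin{aligned}
\sqrt{n}(\hat b - \beta^0) &= W + \Delta, \\
W \mid X &\sim N_p(0, \sigma^2 \hat\Theta\hat\Sigma\hat\Theta^T),
\end{aligned}
\]
and as n, p → ∞,
\[
P\left( \|\Delta\|_\infty > A_2 s \frac{\log(p)}{\sqrt{n}} \right) \to 0.
\]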
n
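Note that here X is not centered and scaled.
We see that in particular, $\sqrt{n}(\hat b_j - \beta^0_j)$ is approximately $N(0, \sigma^2 (\hat\Theta\hat\Sigma\hat\Theta^T)_{jj})$. In fact,
writing $d_j = (\hat\Theta\hat\Sigma\hat\Theta^T)_{jj} = \frac{1}{n}\|X\hat\theta_j\|_2^2$, the expression for $X\hat\theta_j$ above shows that
\[
d_j = \frac{1}{n} \frac{\|X_j - X_{-j}\hat\gamma^{(j)}\|_2^2}{\hat\tau_j^4}.
\]
This suggests an approximate (1 − α)-level confidence interval for $\beta^0_j$,
\[
\mathrm{CI} = \left( \hat b_j - Z_{\alpha/2}\, \sigma \sqrt{d_j/n},\ \hat b_j + Z_{\alpha/2}\, \sigma \sqrt{d_j/n} \right),
\]
where $Z_\alpha$ is the upper α point of N (0, 1). Note that here we are getting confidence
intervals of width $\sim \sqrt{1/n}$. In particular, there is no log p dependence if we are
only trying to estimate $\beta^0_j$.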
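Given the debiased estimate of one coefficient, the quantity d_j and a value for σ, the interval is a one-line computation. The sketch below assumes σ is known (or has been estimated separately); the function name and the numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def debiased_ci(b_j, d_j, sigma, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for one coefficient.

    b_j is the debiased Lasso estimate, d_j the (j, j) entry of
    Theta Sigma-hat Theta^T, sigma the noise standard deviation (assumed known).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)    # upper alpha/2 point of N(0, 1)
    half_width = z * sigma * sqrt(d_j / n)
    return b_j - half_width, b_j + half_width

lower, upper = debiased_ci(b_j=1.0, d_j=4.0, sigma=1.0, n=100)
print(round(lower, 3), round(upper, 3))  # 0.608 1.392
```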

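Proof. Consider the sequence of events $\Lambda_n$ defined by the following properties:

– $\phi^2_{\hat\Sigma, s} \ge c_{\min}/2$ and $\phi^2_{\hat\Sigma_{-j,-j}, s_j} \ge c_{\min}/2$ for all j;
– $\frac{2}{n} \|X^T \varepsilon\|_\infty \le \lambda$ and $\frac{2}{n} \|X_{-j}^T \varepsilon^{(j)}\|_\infty \le \lambda$;
– $\frac{1}{n} \|\varepsilon^{(j)}\|_2^2 \ge (\Omega_{jj})^{-1} \left( 1 - 4\sqrt{(\log p)/n} \right)$.

Question 13 on example sheet 4 shows that $P(\Lambda_n) \to 1$ for $A_1$ sufficiently
large. So we will work on the event $\Lambda_n$.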


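By our results on the Lasso, we know

\[ \|\beta^0 - \hat\beta\|_1 \le c_1 s \sqrt{\log p / n} \]

for some constant $c_1$. We now seek a lower bound for $\hat\tau_j^2$. Consider linear models

\[ X_j = X_{-j}\gamma^{(j)} + \varepsilon^{(j)}, \]

where the sparsity of $\gamma^{(j)}$ is $s_j$, and $\varepsilon^{(j)}_i \mid X_{-j} \sim N(0, \Omega_{jj}^{-1})$. Note that

\[ \Omega_{jj}^{-1} = \operatorname{var}(X_{ij} \mid X_{i,-j}) \le \operatorname{var}(X_{ij}) = \Sigma_{jj} \le 1. \]

Also, the maximum eigenvalue of Ω is at most $c_{\min}^{-1}$. So $\Omega_{jj} \le c_{\min}^{-1}$, i.e.
$\Omega_{jj}^{-1} \ge c_{\min}$. So by Lasso theory, we know
\[
\|\gamma^{(j)} - \hat\gamma^{(j)}\|_1 \le c_2 s_j \sqrt{\frac{\log p}{n}}
\]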
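for some constant $c_2$. Then we have
\[
\begin{aligned}
\hat\tau_j^2 &= \frac{1}{n} \|X_j - X_{-j}\hat\gamma^{(j)}\|_2^2 + \lambda \|\hat\gamma^{(j)}\|_1 \\
&\ge \frac{1}{n} \|\varepsilon^{(j)} + X_{-j}(\gamma^{(j)} - \hat\gamma^{(j)})\|_2^2 \\
&\ge \frac{1}{n} \|\varepsilon^{(j)}\|_2^2 - \frac{2}{n} \|X_{-j}^T \varepsilon^{(j)}\|_\infty \|\gamma^{(j)} - \hat\gamma^{(j)}\|_1 \\
&\ge \Omega_{jj}^{-1} \left( 1 - 4\sqrt{\frac{\log p}{n}} \right) - c_2 s_j \sqrt{\frac{\log p}{n}} \cdot A_1 \sqrt{\frac{\log p}{n}}.
\end{aligned}
\]

In the limit, this tends to $\Omega_{jj}^{-1}$. So for large n, this is $\ge \frac{1}{2} \Omega_{jj}^{-1} \ge \frac{1}{2} c_{\min}$.
Thus, we have
\[
\|\Delta\|_\infty \le 2\lambda \sqrt{n}\, c_1 s \sqrt{\frac{\log p}{n}}\, c_{\min}^{-1} = A_2 s \frac{\log p}{\sqrt{n}}.
\]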

