Administrivia
HW2 is due in about 1 week (9/14 at 2pm).
Each person gets 1 late day. If you want to use a late day, we’ll ask you to
fill out a form (the late day counts towards the group member filling out the form).
Max 1 late day per HW.
Please submit your groups by end of today (form on Ed Discussion
post by Rachitha).
Supervised learning in one slide
Loss function: What is the right loss function for the task?
Representation: What class of functions should we use?
Optimization: How can we efficiently solve the empirical risk
minimization problem? (Written out below.)
Generalization: Will the predictions of our model transfer
gracefully to unseen examples?
All related! And the fuel which powers everything is data.
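For reference, the empirical risk minimization (ERM) problem mentioned above, written in generic notation (ℓ is the loss, ℱ the function class, and {(xᵢ, yᵢ)} the n training examples; these are the standard symbols rather than definitions taken from this slide):

$$
\hat{f} \;=\; \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big)
$$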
Summary: Optimization methods
GD/SGD is a first-order optimization method.
GD/SGD converges to a stationary point. For convex objectives, this is all
we need. For nonconvex objectives, it is possible to get stuck at local
minimizers or “bad” saddle points (random initialization escapes “good”
saddle points).
Newton’s method is a second-order optimization method.
Newton’s method has a much faster convergence rate, but each iteration
also takes much longer. Usually for large scale problems, GD/SGD and their
variants are the methods of choice.
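As a concrete illustration of the first-order vs. second-order updates (a minimal sketch; the quadratic test objective, step size, and iteration counts below are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """First-order method: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def newton(grad, hess, x0, steps=10):
    """Second-order method: x <- x - H(x)^{-1} grad(x) (costlier per step)."""
    x = x0
    for _ in range(steps):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Illustrative strongly convex quadratic f(x) = 0.5 * x^T A x - b^T x
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda x: A @ x - b
hess = lambda x: A          # constant Hessian for a quadratic

x_gd = gradient_descent(grad, x0=np.zeros(2))
x_nt = newton(grad, hess, x0=np.zeros(2))   # reaches the minimizer in one step here
```

On this quadratic, Newton’s method lands on the minimizer in a single (more expensive) step, while gradient descent needs many cheap steps, which matches the trade-off summarized above.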
Relaxing our assumptions
We assumed that the function class is finite-sized. Results can be
extended to infinite function classes (such as separating hyperplanes).
We considered 0-1 loss. Can extend to real-valued loss (such as for
regression).
We assumed realizability. We can prove a similar theorem which guarantees a
small generalization gap without realizability (but with an ε² instead of an ε
in the denominator of the bound). This is called agnostic learning.
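For concreteness, a sketch of the kind of statement meant here (the exact form is the standard Hoeffding-plus-union-bound result, not copied from the slides): for a finite class ℱ and any data distribution, with probability at least 1 − δ every f ∈ ℱ has |training error − test error| ≤ ε, provided

$$
n \;\ge\; \frac{\log\!\big(2|\mathcal{F}|/\delta\big)}{2\,\epsilon^{2}},
$$

so the required sample size scales with 1/ε² rather than the 1/ε of the realizable case.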
Rule of thumb for generalization
Suppose the functions f in our function class ℱ have d parameters which can be set.
Assume we discretize these parameters so they can each take k possible values.
How much data do we need to have a small generalization gap?
A useful rule of thumb: to guarantee generalization, make sure that your
training data set size n is at least linear in the number d of free parameters
in the function that you’re trying to learn.
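To see where this comes from (a sketch, assuming the finite-class sample-complexity bound n ≳ (log|ℱ| + log(1/δ)) / ε from the analysis above): discretizing d parameters into k values each gives

$$
|\mathcal{F}| = k^{d}
\;\;\Longrightarrow\;\;
\log|\mathcal{F}| = d\log k
\;\;\Longrightarrow\;\;
n \;\gtrsim\; \frac{d\log k + \log(1/\delta)}{\epsilon},
$$

which is linear in d, up to the logarithmic dependence on k.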
What if a linear model is not a good fit?
Let’s go back to the regression setup (output y ∈ ℝ).
A linear model could be a bad fit for the following data:
[Figure: scatter plot of y against x in which the points do not follow a straight line.]
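One standard remedy, sketched below, is to keep the linear-in-parameters machinery but apply it to richer features of x; the cubic feature map, synthetic data, and least-squares fit here are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Illustrative 1-D regression data where y depends nonlinearly on x
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = np.sin(x) + 0.1 * rng.standard_normal(100)

# Polynomial feature map phi(x) = [1, x, x^2, x^3]; the model is still
# linear in the parameters theta, so least squares applies unchanged.
degree = 3
Phi = np.vander(x, N=degree + 1, increasing=True)   # shape (100, degree + 1)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # ERM with squared loss

y_hat = Phi @ theta    # predictions of the fitted model, nonlinear in x
```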
Why is regularization useful?
If you don’t have sufficient data to fit your more expressive model, then ERM will overfit.
Regularization helps with generalization.
So shouldn’t it be unnecessary in many practical settings, where we have enough data?
In general, one viewpoint is that we should always be trying to fit a more expressive model if
possible. We want our function class to be rich enough that we would overfit if we were not
careful.
Since we’re often in this regime where the models we want to fit are more and more
complex, regularization is very useful to help generalization (it’s also a relatively simple
knob to control).
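As a concrete example of this knob (a sketch; the synthetic over-parameterized data, the closed-form solution, and the choice λ = 1.0 are illustrative assumptions): ℓ₂-regularized least squares (ridge regression) adds λ||θ||₂² to the ERM objective, and λ controls how strongly the expressive model is reined in.

```python
import numpy as np

# Illustrative setup: more features (d) than samples (n), so plain ERM
# (ordinary least squares) would overfit badly.
rng = np.random.default_rng(1)
n, d = 50, 200
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Minimize ||X theta - y||^2 + lam * ||theta||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta_hat = ridge(X, y, lam=1.0)   # lam is the regularization "knob"
```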
A “Bayesian view” of ℓ₂ regularization
Maximum a posteriori probability (MAP) estimation: A Bayesian generalization of
maximum likelihood estimation (MLE).
Let’s continue with the linear model, and Q3 from the practice problems for today.
Bayesian view: A prior over θ
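Reconstructing the derivation this slide builds up (a sketch with the standard assumptions: Gaussian noise yᵢ = θᵀxᵢ + εᵢ with εᵢ ~ N(0, σ²), and a Gaussian prior θ ~ N(0, τ²I); these distributions and symbols are assumptions consistent with the usual setup, not copied from the slide):

$$
\hat{\theta}_{\text{MAP}}
= \arg\max_{\theta}\;\Big[\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) + \log p(\theta)\Big]
= \arg\min_{\theta}\;\sum_{i=1}^{n}\big(y_i - \theta^{\top}x_i\big)^2 + \frac{\sigma^{2}}{\tau^{2}}\,\|\theta\|_2^2
$$

So MAP estimation under a Gaussian prior is exactly ℓ₂-regularized least squares with λ = σ²/τ²; dropping the prior term recovers MLE, i.e. unregularized least squares.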
An equivalent form, and a “Frequentist view”
The “frequentist” approach to justifying regularization is to argue that if the true model has a
specific property, then regularization will allow you to recover a good approximation to the
true model. With this view, we can equivalently formulate regularization as:
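The formulation itself did not survive extraction; a standard reconstruction (under the assumption that the slide shows the usual constrained/penalized equivalence for the ℓ₂ case) is the constrained problem

$$
\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_{\theta}(x_i),\, y_i\big)
\quad \text{subject to} \quad \|\theta\|_2 \le c,
$$

which corresponds to the penalized objective with a matching regularization strength λ(c): a smaller radius c plays the role of a larger λ.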
Encouraging sparsity: ℓ₁ regularization
Continuing from the frequentist view, having a small norm is one possible structure to impose
on the model. Another very common one is sparsity.
Sparsity of θ: the number of non-zero coefficients in θ. Same as ||θ||₀.
E.g. θ = (1, 0, −1, 0, 0.2, 0, 0) is 3-sparse.
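Penalizing ||θ||₀ directly is combinatorially hard, so the usual convex surrogate replaces it with the ℓ₁ norm; the lasso objective below is the standard form of this idea (a reconstruction, not copied from the slides):

$$
\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \theta^{\top}x_i\big)^2 + \lambda \|\theta\|_1,
\qquad \|\theta\|_1 = \sum_{j}|\theta_j|.
$$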
Encouraging sparsity: ℓ₁ regularization
Advantages:
Sparse models are a natural inductive bias in many settings. In many applications we have
numerous possible features, only some of which may have any relationship with the label.
E.g. suppose we want to fit a linear model from gene expression to an outcome (disease,
phenotype, etc.). [Figure: matrix of expression levels for d genes across n samples.]
d is huge, but it is likely that only a few genes are related to the outcome.
Sparse models may also be more interpretable. They can narrow down a small number
of features which carry a lot of signal.
E.g. θ = (1.5, 0, −1.1, 0, 0.25, 0, 0) is more interpretable than θ = (1, 0.2, −1.3, 0.15, 0.2, 0.05, 0.12).
For a sparse model, it can be easier to understand the model. It is also easier to verify
whether the features which have a high weight have a relation with the outcome (and are
not spurious artifacts of the data).
The data required to learn a sparse model may be significantly less than the data required
to learn a dense model.
We’ll see more on the third point next.
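Before that, a quick way to see the sparsity-inducing effect in practice (a sketch on synthetic data; using scikit-learn’s Lasso and Ridge with these particular alpha values is an illustrative choice, not something prescribed in the lecture):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic "gene expression"-style setup: many features, few of them relevant.
rng = np.random.default_rng(0)
n, d = 100, 500
X = rng.standard_normal((n, d))
theta_true = np.zeros(d)
theta_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.5]     # only 5 non-zero coefficients
y = X @ theta_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty -> many coefficients exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)   # l2 penalty -> small but dense coefficients

print("non-zeros (lasso):", np.sum(lasso.coef_ != 0))
print("non-zeros (ridge):", np.sum(ridge.coef_ != 0))
```

On data like this, the ℓ₁ fit typically keeps only a handful of non-zero coefficients, while the ℓ₂ fit spreads small weight over all d features.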