CSCI 567: Machine Learning
Vatsal Sharan
Fall 2022
Lecture 3, Sep 8
Administrivia
HW2 due in about 1 week (9/14 at 2pm).
Each person gets 1 late day; if you want to use a late day, we’ll ask you to
fill out a form (the late day counts towards the group member who fills the form).
Max 1 late day per HW.
Please submit your groups by end of today (form on Ed Discussion
post by Rachitha).
Recap
Supervised learning in one slide
Loss function: What is the right loss function for the task?
Representation: What class of functions should we use?
Optimization: How can we efficiently solve the empirical risk
minimization problem?
Generalization: Will the predictions of our model transfer
gracefully to unseen examples?
All related! And the fuel which powers everything is data.
Summary: Optimization methods
GD/SGD is a first-order optimization method.
GD/SGD converges to a stationary point. For convex objectives, this is all
we need. For nonconvex objectives, it is possible to get stuck at local
minimizers or at “bad” saddle points (random initialization escapes “good”
saddle points).
Newton’s method is a second-order optimization method.
Newton’s method has a much faster convergence rate, but each iteration
also takes much longer. Usually for large scale problems, GD/SGD and their
variants are the methods of choice.
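To make the first-order vs. second-order contrast concrete, here is a minimal sketch (not from the slides) comparing a gradient-descent loop with a single Newton step on a toy least-squares objective; the data, step size, and iteration count are arbitrary choices for illustration.

import numpy as np

# Toy least-squares objective F(w) = 0.5 * ||Xw - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def grad(w):
    return X.T @ (X @ w - y)      # first-order information: the gradient

def hess():
    return X.T @ X                # second-order information: the Hessian (constant here)

# Gradient descent: cheap per iteration, but needs many iterations
w_gd = np.zeros(5)
eta = 1e-3                        # step size (assumed; needs tuning in general)
for _ in range(500):
    w_gd = w_gd - eta * grad(w_gd)

# Newton's method: each step solves a d x d linear system, but for this
# quadratic objective a single step reaches the exact minimizer
w_newton = np.zeros(5)
w_newton = w_newton - np.linalg.solve(hess(), grad(w_newton))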
Linear classifiers
Definition: The function class of separating hyperplanes is defined as
$\mathcal{F} = \{ f(x) = \mathrm{sign}(w^\top x) : w \in \mathbb{R}^d \}$.
Representation
Use a convex surrogate loss
Loss function
Optimization
Maximum likelihood estimation
Minimizing logistic loss is exactly doing MLE for the sigmoid model!
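As a quick reminder of why this is true (a standard derivation; labels are assumed here to be $y_i \in \{-1, +1\}$ and $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid):

\[
P(y \mid x; w) = \sigma(y\, w^\top x)
\;\Longrightarrow\;
\max_{w} \prod_{i=1}^{n} \sigma(y_i\, w^\top x_i)
\;\Longleftrightarrow\;
\min_{w} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^\top x_i}\right),
\]

i.e. the negative log-likelihood of the sigmoid model is exactly the total logistic loss.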
Generalization
Reviewing definitions
The analysis we’ll do could also help you solve Problem 3 on HW1.
Assumptions for today’s theory
Intuition: When does ERM generalize?
Relaxing our assumptions
We assumed that the function class is finite-sized. Results can be
extended to infinite function classes (such as separating hyperplanes).
We considered 0-1 loss. Can extend to real-valued loss (such as for
regression).
We assumed realizability. We can prove a similar theorem which guarantees a
small generalization gap without realizability (but with an $\epsilon^2$ instead of
$\epsilon$ in the denominator). This is called agnostic learning.
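For reference, the finite-class bounds these slides build on (stated up to constants, with the notation assumed here: $n$ samples, confidence $1 - \delta$): in the realizable case, with probability at least $1 - \delta$ the ERM solution over a finite class $\mathcal{F}$ has error at most $\epsilon$ provided

\[
n \;\ge\; \frac{\log|\mathcal{F}| + \log(1/\delta)}{\epsilon},
\]

while in the agnostic case the analogous guarantee on the generalization gap requires

\[
n \;\ge\; \frac{2\left(\log|\mathcal{F}| + \log(1/\delta)\right)}{\epsilon^{2}},
\]

which is where the $\epsilon^2$ (rather than $\epsilon$) in the denominator comes from.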
Rule of thumb for generalization
Suppose the functions $f$ in our function class $\mathcal{F}$ have $d$ parameters which can be set.
Assume we discretize these parameters so that each can take $k$ possible values.
How much data do we need to have a small generalization gap?
A useful rule of thumb: to guarantee generalization, make sure that your
training data set size $n$ is at least linear in the number $d$ of free parameters
in the function that you’re trying to learn.
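To see where the rule of thumb comes from, note that discretizing each of the $d$ parameters to $k$ values gives a finite class, and we can plug its size into the bound above:

\[
|\mathcal{F}| = k^{d}
\;\Longrightarrow\;
\log|\mathcal{F}| = d \log k
\;\Longrightarrow\;
n \;\gtrsim\; \frac{d \log k + \log(1/\delta)}{\epsilon},
\]

i.e. the required training set size is linear in the number of free parameters $d$ (up to the $\log k$ factor).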
Nonlinear basis
What if a linear model is not a good fit?
Let’s go back to the regression setup (output $y \in \mathbb{R}$).
A linear model could be a bad fit for the following data:
[Figure: scatter plot of $y$ vs. $x$ for which a straight-line fit is poor.]
A solution: nonlinearly transformed features
Regression with nonlinear basis
Example
See Colab notebook
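In the spirit of the Colab example (though the notebook’s exact data and code may differ), here is a minimal sketch of regression with a polynomial basis on 1-D inputs: the features are transformed nonlinearly, but the model stays linear in the weights $w$.

import numpy as np

# Toy 1-D data for which a straight line is a poor fit
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)

def poly_features(x, degree):
    """Nonlinear basis phi(x) = (1, x, x^2, ..., x^degree)."""
    return np.vstack([x**j for j in range(degree + 1)]).T

# Ordinary least squares on the transformed features: still linear in w
Phi = poly_features(x, degree=5)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w                   # predictions from the nonlinear fit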
Why nonlinear?
Overfitting and Regularization
Should we use a very complicated mapping?
See Colab notebook
Underfitting and overfitting
See Colab notebook
Method 1: More data!!
See Colab notebook
Method 2: Control model complexity
Magnitude of the weights
See Colab notebook
How to make the weights small?
$\ell_2$ regularization with non-linear basis: The effect of $\lambda$
See Colab notebook
$\ell_2$ regularization with non-linear basis: A tradeoff
See Colab notebook
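A minimal sketch of the kind of $\lambda$ sweep the notebook illustrates (the polynomial basis, train/test split, and $\lambda$ grid here are arbitrary choices, not necessarily the notebook’s): larger $\lambda$ shrinks the weights and reduces overfitting, but too large a $\lambda$ underfits.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)
Phi = np.vstack([x**j for j in range(6)]).T      # degree-5 polynomial basis

# Alternate points into train/test for a quick check of generalization
Phi_tr, Phi_te = Phi[::2], Phi[1::2]
y_tr, y_te = y[::2], y[1::2]

# Ridge regression: minimize ||Phi w - y||^2 + lam * ||w||^2
for lam in [0.0, 1e-3, 1e-1, 1.0, 10.0]:
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(Phi.shape[1]),
                        Phi_tr.T @ y_tr)
    mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"lam={lam:g}  ||w||={np.linalg.norm(w):.2f}  test MSE={mse:.3f}")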
Why is regularization useful?
If you don’t have sufficient data to fit your more expressive model, then ERM will overfit.
Regularization helps with generalization.
So shouldn’t it be unnecessary in many practical settings, where we have enough data?
In general, a viewpoint is that we should always be trying to fit a more expressive model if
possible. We want our function class to be rich enough that we could almost overfit if we
are not careful.
Since we’re often in this regime where the models we want to fit are more and more
complex, regularization is very useful to help generalization (it’s also a relatively simple
knob to control).
Understanding regularization
How to solve the regularized objective?
Let’s go back to the original linear model.
Aside: Least-squares when $X^\top X$ is not invertible
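For reference, the closed-form solutions being compared in this aside (standard results, with design matrix $X \in \mathbb{R}^{n \times d}$ and targets $y \in \mathbb{R}^n$):

\[
w_{\mathrm{LS}} = (X^\top X)^{-1} X^\top y
\qquad \text{vs.} \qquad
w_{\lambda} = \arg\min_{w}\; \|Xw - y\|_2^2 + \lambda \|w\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top y,
\]

and the regularized solution is always well-defined for $\lambda > 0$, since $X^\top X + \lambda I$ is positive definite (hence invertible) even when $X^\top X$ is not.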
A “Bayesian view” of $\ell_2$ regularization
Maximum a posteriori probability (MAP) estimation: A Bayesian generalization of
maximum likelihood estimation (MLE).
Let’s continue with the linear model, and Q3 from the practice problems for today.
Bayesian view: A prior over $w$
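A sketch of the MAP calculation these slides walk through, under the usual assumptions (Gaussian noise $y_i = w^\top x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, and a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$):

\[
w_{\mathrm{MAP}}
= \arg\max_{w}\; \Big( \prod_{i=1}^{n} p(y_i \mid x_i, w) \Big)\, p(w)
= \arg\min_{w}\; \sum_{i=1}^{n} \frac{(y_i - w^\top x_i)^2}{2\sigma^2} + \frac{\|w\|_2^2}{2\tau^2}
= \arg\min_{w}\; \|Xw - y\|_2^2 + \lambda \|w\|_2^2,
\]

with $\lambda = \sigma^2 / \tau^2$: MAP estimation with a Gaussian prior on $w$ is exactly $\ell_2$-regularized least squares.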
An equivalent form, and a “Frequentist view”
The “frequentist” approach to justifying regularization is to argue that if the true model has a
specific property, then regularization will allow you to recover a good approximation to the
true model. With this view, we can equivalently formulate regularization as:
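The equivalent constrained form being referred to is presumably the standard one (stated here for the $\ell_2$ case; the correspondence between $\lambda$ and the radius $B$ follows from Lagrangian duality):

\[
\min_{w}\; \|Xw - y\|_2^2 + \lambda \|w\|_2^2
\qquad \Longleftrightarrow \qquad
\min_{w}\; \|Xw - y\|_2^2 \;\; \text{subject to} \;\; \|w\|_2^2 \le B,
\]

for an appropriate radius $B$ depending on $\lambda$ (and vice versa).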
Encouraging sparsity: $\ell_1$ regularization
Continuing from the frequentist view, having a small norm is one possible structure to impose
on the model. Another very common one is sparsity.
Sparsity of $w$: the number of non-zero coefficients in $w$, i.e. $\|w\|_0$.
E.g. $w = (1, 0, -1, 0, 0.2, 0, 0)$ is 3-sparse.
Advantages:
Sparse models are a natural inductive bias in many settings. In many applications we have
numerous possible features, only some of which may have any relationship with the label.
[Figure: gene expression matrix with genes as rows and samples as columns.]
E.g. suppose we want to fit a linear model from gene expression levels measured in $n$ samples
to an outcome (disease, phenotype, etc.). The number of genes $d$ is huge, but it is likely that
only a few genes are related to the outcome.
Sparse models may also be more interpretable. They could narrow down a small number
of features which carry a lot of signal.
E.g. $w = (1.5, 0, -1.1, 0, 0.25, 0, 0)$ is more interpretable than
$w = (1, 0.2, -1.3, 0.15, 0.2, 0.05, 0.12)$.
For a sparse model, it could be easier to understand the model. It is also easier to verify
whether the features which have a high weight have a relation with the outcome (i.e., that they
are not spurious artifacts of the data).
The data required to learn a sparse model may be significantly less than that required to learn
a dense model.
We’ll see more on the third point next.
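As an illustration of how an $\ell_1$ penalty actually produces sparse solutions (a sketch, not from the slides), here is a comparison of Lasso ($\ell_1$) and Ridge ($\ell_2$) on synthetic data where only a few features matter; the dimensions and regularization strengths are arbitrary choices.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: d = 100 features, but only k = 5 affect the label
rng = np.random.default_rng(0)
n, d, k = 200, 100, 5
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:k] = rng.normal(size=k)
y = X @ w_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty: many coefficients driven exactly to zero
ridge = Ridge(alpha=0.1).fit(X, y)   # l2 penalty: shrinks coefficients, but rarely to exactly zero

print("nonzero coefficients (lasso):", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
print("nonzero coefficients (ridge):", int(np.sum(np.abs(ridge.coef_) > 1e-6)))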
