CSCI 567: Machine Learning
Vatsal Sharan
Fall 2022
Lecture 3, Sep 8
Administrivia
HW2 due in about 1 week (9/14 at 2pm).
Each person gets 1 late day; if you want to use a late day, we’ll ask you to
fill out a form (the late day counts towards the group member who fills the form).
Max 1 late day per HW.
Please submit your groups by end of today (form on Ed Discussion
post by Rachitha).
Recap
Supervised learning in one slide
Loss function: What is the right loss function for the task?
Representation: What class of functions should we use?
Optimization: How can we efficiently solve the empirical risk
minimization problem?
Generalization: Will the predictions of our model transfer
gracefully to unseen examples?
All related! And the fuel which powers everything is data.
Summary: Optimization methods
GD/SGD is a first-order optimization method.
GD/SGD converges to a stationary point. For convex objectives, this is all
we need. For nonconvex objectives, it is possible to get stuck at local
minimizers or at “bad” saddle points (random initialization escapes “good”
saddle points).
Newton’s method is a second-order optimization method.
Newton’s method has a much faster convergence rate, but each iteration
also takes much longer. Usually for large scale problems, GD/SGD and their
variants are the methods of choice.
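To make the first-order vs. second-order contrast concrete, here is a minimal sketch (not from the slides) comparing a gradient-descent loop with a single Newton step on a toy least-squares objective; the data, step size, and iteration count are arbitrary choices for illustration.

import numpy as np

# Toy least-squares objective F(w) = 0.5 * ||Xw - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def grad(w):
    return X.T @ (X @ w - y)      # first-order information: the gradient

def hess():
    return X.T @ X                # second-order information: the Hessian (constant here)

# Gradient descent: cheap per iteration, but needs many iterations
w_gd = np.zeros(5)
eta = 1e-3                        # step size (assumed; needs tuning in general)
for _ in range(500):
    w_gd = w_gd - eta * grad(w_gd)

# Newton's method: each step solves a d x d linear system, but for this
# quadratic objective a single step reaches the exact minimizer
w_newton = np.zeros(5)
w_newton = w_newton - np.linalg.solve(hess(), grad(w_newton))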
Linear classifiers
Definition: The function class of separating hyperplanes is defined as
$\mathcal{F} = \{ f(x) = \mathrm{sign}(w^\top x) : w \in \mathbb{R}^d \}$.
Representation
Use a convex surrogate loss
Loss function
Optimization
Maximum likelihood estimation
Minimizing logistic loss is exactly doing MLE for the sigmoid model!
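As a quick reminder of why this is true (a standard derivation; labels are assumed here to be $y_i \in \{-1, +1\}$ and $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid):

\[
P(y \mid x; w) = \sigma(y\, w^\top x)
\;\Longrightarrow\;
\max_{w} \prod_{i=1}^{n} \sigma(y_i\, w^\top x_i)
\;\Longleftrightarrow\;
\min_{w} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^\top x_i}\right),
\]

i.e. the negative log-likelihood of the sigmoid model is exactly the total logistic loss.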
Generalization
Reviewing definitions
The analysis we’ll do could also help you solve Problem 3 on HW1.
Assumptions for today’s theory
Intuition: When does ERM generalize?
Relaxing our assumptions
We assumed that the function class is finite-sized. Results can be
extended to infinite function classes (such as separating hyperplanes).
We considered 0-1 loss. Can extend to real-valued loss (such as for
regression).
We assumed realizability. We can prove a similar theorem which guarantees a
small generalization gap without realizability (but with an $\epsilon^2$ instead of
$\epsilon$ in the denominator). This is called agnostic learning.
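For reference, the finite-class bounds these slides build on (stated up to constants, with the notation assumed here: $n$ samples, confidence $1 - \delta$): in the realizable case, with probability at least $1 - \delta$ the ERM solution over a finite class $\mathcal{F}$ has error at most $\epsilon$ provided

\[
n \;\ge\; \frac{\log|\mathcal{F}| + \log(1/\delta)}{\epsilon},
\]

while in the agnostic case the analogous guarantee on the generalization gap requires

\[
n \;\ge\; \frac{2\left(\log|\mathcal{F}| + \log(1/\delta)\right)}{\epsilon^{2}},
\]

which is where the $\epsilon^2$ (rather than $\epsilon$) in the denominator comes from.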
Rule of thumb for generalization
Suppose the functions $f$ in our function class $\mathcal{F}$ have $d$ parameters which can be set.
Assume we discretize these parameters so that each can take $k$ possible values.
How much data do we need to have a small generalization gap?
A useful rule of thumb: to guarantee generalization, make sure that your
training data set size $n$ is at least linear in the number $d$ of free parameters
in the function that you’re trying to learn.
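To see where the rule of thumb comes from, note that discretizing each of the $d$ parameters to $k$ values gives a finite class, and we can plug its size into the bound above:

\[
|\mathcal{F}| = k^{d}
\;\Longrightarrow\;
\log|\mathcal{F}| = d \log k
\;\Longrightarrow\;
n \;\gtrsim\; \frac{d \log k + \log(1/\delta)}{\epsilon},
\]

i.e. the required training set size is linear in the number of free parameters $d$ (up to the $\log k$ factor).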
Nonlinear basis
What if a linear model is not a good fit?
Let’s go back to the regression setup (output $y \in \mathbb{R}$).
A linear model could be a bad fit for the following data:
[Figure: scatter plot of $y$ vs. $x$ for which a straight-line fit is poor.]
A solution: nonlinearly transformed features
Regression with nonlinear basis
Example
See Colab notebook
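In the spirit of the Colab example (though the notebook’s exact data and code may differ), here is a minimal sketch of regression with a polynomial basis on 1-D inputs: the features are transformed nonlinearly, but the model stays linear in the weights $w$.

import numpy as np

# Toy 1-D data for which a straight line is a poor fit
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)

def poly_features(x, degree):
    """Nonlinear basis phi(x) = (1, x, x^2, ..., x^degree)."""
    return np.vstack([x**j for j in range(degree + 1)]).T

# Ordinary least squares on the transformed features: still linear in w
Phi = poly_features(x, degree=5)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w                   # predictions from the nonlinear fit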
Why nonlinear?
Overfitting and Regularization
Should we use a very complicated mapping?
See Colab notebook
Underfitting and overfitting
See Colab notebook
Method 1: More data!!
See Colab notebook
Method 2: Control model complexity
Magnitude of the weights
See Colab notebook
How to make the weights small?
$\ell_2$ regularization with non-linear basis: The effect of $\lambda$
See Colab notebook
$\ell_2$ regularization with non-linear basis: A tradeoff
See Colab notebook
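A minimal sketch of the kind of $\lambda$ sweep the notebook illustrates (the polynomial basis, train/test split, and $\lambda$ grid here are arbitrary choices, not necessarily the notebook’s): larger $\lambda$ shrinks the weights and reduces overfitting, but too large a $\lambda$ underfits.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)
Phi = np.vstack([x**j for j in range(6)]).T      # degree-5 polynomial basis

# Alternate points into train/test for a quick check of generalization
Phi_tr, Phi_te = Phi[::2], Phi[1::2]
y_tr, y_te = y[::2], y[1::2]

# Ridge regression: minimize ||Phi w - y||^2 + lam * ||w||^2
for lam in [0.0, 1e-3, 1e-1, 1.0, 10.0]:
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(Phi.shape[1]),
                        Phi_tr.T @ y_tr)
    mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"lam={lam:g}  ||w||={np.linalg.norm(w):.2f}  test MSE={mse:.3f}")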
Why is regularization useful?
If you don’t have sufficient data to fit your more expressive model, then ERM will overfit.
Regularization helps with generalization.
So shouldn’t it be unnecessary in many practical settings, where we have enough data?
In general, a viewpoint is that we should always be trying to fit a more expressive model if
possible. We want our function class to be rich enough that we could almost overfit if we
are not careful.
Since we’re often in this regime where the models we want to fit are more and more
complex, regularization is very useful to help generalization (it’s also a relatively simple
knob to control).
Understanding regularization
How to solve the regularized objective?
Let’s go back to the original linear model.
Aside: Least-squares when $X^\top X$ is not invertible
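For reference, the closed-form solutions being compared in this aside (standard results, with design matrix $X \in \mathbb{R}^{n \times d}$ and targets $y \in \mathbb{R}^n$):

\[
w_{\mathrm{LS}} = (X^\top X)^{-1} X^\top y
\qquad \text{vs.} \qquad
w_{\lambda} = \arg\min_{w}\; \|Xw - y\|_2^2 + \lambda \|w\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top y,
\]

and the regularized solution is always well-defined for $\lambda > 0$, since $X^\top X + \lambda I$ is positive definite (hence invertible) even when $X^\top X$ is not.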
A “Bayesian view” of $\ell_2$ regularization
Maximum a posteriori probability (MAP) estimation: A Bayesian generalization of
maximum likelihood estimation (MLE).
Let’s continue with the linear model, and Q3 from the practice problems for today.
Bayesian view: A prior over $w$
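A sketch of the MAP calculation these slides walk through, under the usual assumptions (Gaussian noise $y_i = w^\top x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, and a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$):

\[
w_{\mathrm{MAP}}
= \arg\max_{w}\; \Big( \prod_{i=1}^{n} p(y_i \mid x_i, w) \Big)\, p(w)
= \arg\min_{w}\; \sum_{i=1}^{n} \frac{(y_i - w^\top x_i)^2}{2\sigma^2} + \frac{\|w\|_2^2}{2\tau^2}
= \arg\min_{w}\; \|Xw - y\|_2^2 + \lambda \|w\|_2^2,
\]

with $\lambda = \sigma^2 / \tau^2$: MAP estimation with a Gaussian prior on $w$ is exactly $\ell_2$-regularized least squares.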
An equivalent form, and a “Frequentist view”
The “frequentist” approach to justifying regularization is to argue that if the true model has a
specific property, then regularization will allow you to recover a good approximation to the
true model. With this view, we can equivalently formulate regularization as:
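The equivalent constrained form being referred to is presumably the standard one (stated here for the $\ell_2$ case; the correspondence between $\lambda$ and the radius $B$ follows from Lagrangian duality):

\[
\min_{w}\; \|Xw - y\|_2^2 + \lambda \|w\|_2^2
\qquad \Longleftrightarrow \qquad
\min_{w}\; \|Xw - y\|_2^2 \;\; \text{subject to} \;\; \|w\|_2^2 \le B,
\]

for an appropriate radius $B$ depending on $\lambda$ (and vice versa).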
Encouraging sparsity: $\ell_1$ regularization
Continuing from the frequentist view, having a small norm is one possible structure to impose
on the model. Another very common one is sparsity.
Sparsity of $w$: the number of non-zero coefficients in $w$, i.e. $\|w\|_0$.
E.g. $w = (1, 0, -1, 0, 0.2, 0, 0)$ is 3-sparse.
Advantages:
Sparse models are a natural inductive bias in many settings. In many applications we have
numerous possible features, only some of which may have any relationship with the label.
[Figure: gene expression matrix with genes as rows and samples as columns.]
E.g. suppose we want to fit a linear model from gene expression levels measured in $n$ samples
to an outcome (disease, phenotype, etc.). The number of genes $d$ is huge, but it is likely that
only a few genes are related to the outcome.
Sparse models may also be more interpretable. They could narrow down a small number
of features which carry a lot of signal.
E.g. $w = (1.5, 0, -1.1, 0, 0.25, 0, 0)$ is more interpretable than
$w = (1, 0.2, -1.3, 0.15, 0.2, 0.05, 0.12)$.
For a sparse model, it could be easier to understand the model. It is also easier to verify
whether the features which have a high weight have a relation with the outcome (i.e., that they
are not spurious artifacts of the data).
The data required to learn a sparse model may be significantly less than that required to learn
a dense model.
We’ll see more on the third point next.
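As an illustration of how an $\ell_1$ penalty actually produces sparse solutions (a sketch, not from the slides), here is a comparison of Lasso ($\ell_1$) and Ridge ($\ell_2$) on synthetic data where only a few features matter; the dimensions and regularization strengths are arbitrary choices.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: d = 100 features, but only k = 5 affect the label
rng = np.random.default_rng(0)
n, d, k = 200, 100, 5
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:k] = rng.normal(size=k)
y = X @ w_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty: many coefficients driven exactly to zero
ridge = Ridge(alpha=0.1).fit(X, y)   # l2 penalty: shrinks coefficients, but rarely to exactly zero

print("nonzero coefficients (lasso):", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
print("nonzero coefficients (ridge):", int(np.sum(np.abs(ridge.coef_) > 1e-6)))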
