Learning 3
Question
What’s the true objective of machine learning?
• So far in this class, we have tried to cast everything as a well-defined optimization problem. We have even written down an objective function, which is the average loss (error) on the training data.
• But it turns out that that's not really the true goal. That's only what we tell our optimization friends so that there's something concrete and actionable. The true goal is to minimize error on unseen future examples; in other words, we need to generalize. As we'll see, this is perhaps the most important aspect of machine learning and statistics, albeit a more elusive one.

Review

Feature extractor φ(x) (arbitrary!), applied to an example email-address string:
  length>10    : 1
  fracOfAlpha  : 0.85
  contains @   : 1
  endsWith com : 1
  endsWith org : 0
Prediction score:
• Linear predictor: score = w · φ(x)
• Neural network: score = ∑_{j=1}^{k} w_j σ(v_j · φ(x))
• First a review: last lecture we spoke at length about the importance of features, how to organize them
using feature templates, and how we can get interesting non-linearities by choosing the feature extractor
φ judiciously. This is you using all your domain knowledge about the problem.
Review
• Given the feature extractor φ, we can use that to define a prediction score, either using a linear predictor or a neural network. If you use neural networks, you typically have to work less hard at designing features, but you end up with a harder learning problem. There is a human-machine tradeoff here.

Loss function Loss(x, y, w):
[Plot: Loss(x, y, w) as a function of the margin (w · φ(x))y, for binary classification]

Stochastic gradient descent update:
w ← w − η∇w Loss(x, y, w)
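As a concrete illustration of the linear and neural network scores from the review above, here is a minimal Python sketch; the feature vector and all weight values are made-up examples (not from the lecture), and σ is taken to be the logistic function:

import math

def sigma(z):
    # logistic function
    return 1 / (1 + math.exp(-z))

def linear_score(w, phi):
    # score = w · φ(x)
    return sum(wj * pj for wj, pj in zip(w, phi))

def nn_score(w, V, phi):
    # score = Σ_j w_j σ(v_j · φ(x)): a one-hidden-layer neural network
    return sum(wj * sigma(linear_score(vj, phi)) for wj, vj in zip(w, V))

phi = [1, 0.85, 1, 1, 0]                      # example feature vector (made up)
print(linear_score([0.5, -1.0, 2.0, 1.0, 0.3], phi))
print(nn_score([1.0, -2.0],                   # top-level weights w_j (made up)
               [[1.0, 0.0, -1.0, 0.5, 0.2],   # hidden weight vectors v_j (made up)
                [-0.5, 1.0, 0.0, 0.3, 0.1]],
               phi))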
Generalization
Unsupervised learning
Summary
Training error

• Now let's be a little more critical about what we've set out to optimize. So far, we've declared that we want to minimize the training loss.

Loss minimization:
min_w TrainLoss(w)
TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} Loss(x, y, w)
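To make this objective concrete, here is a minimal sketch assuming a linear predictor with the hinge loss (one of the losses from the earlier lectures); the tiny dataset of (φ(x), y) pairs is made up:

def dot(w, phi):
    return sum(wj * pj for wj, pj in zip(w, phi))

def hinge_loss(phi, y, w):
    # Loss(x, y, w) = max(0, 1 - (w · φ(x)) y)
    return max(0.0, 1.0 - dot(w, phi) * y)

def train_loss(w, data):
    # TrainLoss(w) = average of Loss(x, y, w) over (x, y) in Dtrain
    return sum(hinge_loss(phi, y, w) for phi, y in data) / len(data)

D_train = [([1.0, 2.0], +1), ([2.0, -1.0], -1)]  # (φ(x), y) pairs, made up
print(train_loss([0.1, 0.2], D_train))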
A strawman algorithm

• Clearly, machine learning can't be about just minimizing the training loss. The rote learning algorithm does a perfect job of that, and yet is clearly a bad idea. It overfits to the training data and doesn't generalize to unseen examples.
Evaluation

• So what is the true objective then? Taking a step back, what we're doing is building a system which happens to use machine learning, and then we're going to deploy it. What we really care about is how accurate that system is on those unseen future inputs.
• Of course, we can't access unseen future examples, so the next best thing is to create a test set. As much as possible, we should treat the test set as a pristine thing that's unseen and from the future. We definitely should not tune our predictor based on the test error, because we wouldn't be able to do that on future examples.
• Of course at some point we have to run our algorithm on the test set, but just be aware that each time this is done, the test set becomes less good of an indicator of how well you're actually doing.

Dtrain → Learner → f
How good is the predictor f?
Overfitting example

• To demonstrate overfitting in a simple setting, consider the problem of predicting whether a number x is positive. The data we get is noisy, where 25% of the labels have been flipped randomly.
• If we use rote learning, we will get 0% training error, memorizing the labels. The linear classifier will get 25% error because it misclassifies the noisy examples.
• However, in the test set, the noise might come in different positions. Now, the rote predictor will get 50% error because it misclassifies the inputs where there was noise in the training set or in the test set. In contrast, the linear classifier is stable and still has 25% error.
• Rote learning overfits. The linear predictor generalizes better.

Example: overfitting
Input: x ∈ {−4, −3, −2, −1, 1, 2, 3, 4}
Output: y = sign(x), but 25% of labels flipped

x                  | -4 | -3 | -2 | -1 |  1 |  2 |  3 |  4
y (train)          |  - |  + |  - |  - |  + |  + |  + |  -
y (test)           |  + |  - |  - |  - |  + |  - |  + |  +
rote predictions   |  - |  + |  - |  - |  + |  + |  + |  -
linear predictions |  - |  - |  - |  - |  + |  + |  + |  +

        | Train error | Test error
Rote    | 0%          | 50%  (overfits!)
Linear  | 25%         | 25%  (generalizes!)
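A minimal sketch that reproduces the numbers in this table; the rote predictor simply memorizes the training label for each x (every test input also appears in training here), and the "linear" predictor is just sign(x):

def sign(v):
    return +1 if v > 0 else -1

x_vals  = [-4, -3, -2, -1, 1, 2, 3, 4]
y_train = [-1, +1, -1, -1, +1, +1, +1, -1]   # sign(x) with two labels flipped (25% noise)
y_test  = [+1, -1, -1, -1, +1, -1, +1, +1]   # noise lands on different inputs at test time

rote = dict(zip(x_vals, y_train))            # rote learning: memorize training labels

def error(predict, ys):
    return sum(predict(x) != y for x, y in zip(x_vals, ys)) / len(x_vals)

print("rote:   train", error(lambda x: rote[x], y_train), "test", error(lambda x: rote[x], y_test))
print("linear: train", error(sign, y_train), "test", error(sign, y_test))
# rote: 0% train, 50% test; linear: 25% train, 25% test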
Another strawman algorithm

• It's useful to look at another simple algorithm, which always returns the most frequent output no matter what the input is. Though this algorithm cannot fit the data very well (and thus gets high error rates), it generalizes very well.
• Note that strictly speaking, generalization doesn't necessarily mean that you're doing well at test time. It just means that your training error and your test error are not too different.

Algorithm: majority algorithm
Generalization

• So far, we have an intuitive feel for what overfitting is. How do we make this precise? In particular, when does a learning algorithm generalize from the training set to the test set?

Dtrain → Dtest
Approximation and estimation error

• Here's a cartoon that can help you understand the balance between fitting and generalization. Out there somewhere, there is a magical predictor f* that classifies everything perfectly. This predictor is unattainable; all we can hope to do is to use a combination of our domain knowledge and data to approximate that. The question is: how far are we away from f*?
• Recall that our learning framework consists of (i) choosing a hypothesis class F (by defining the feature extractor, i.e., feature extraction) and then (ii) choosing a particular predictor f̂ from F (learning).

[Diagram: within the set of all predictors, feature extraction picks out the hypothesis class F; f* lies outside F, g is the best predictor in F (approx. error between f* and g), and learning produces f̂ (est. error between g and f̂)]

• Approximation error is how far the entire hypothesis class is from the target predictor f*. Larger hypothesis classes have lower approximation error. Let g ∈ F be the best predictor in the hypothesis class in the sense of minimizing test error, g = arg min_{f∈F} Err(f). Here, distance is just the difference in test error: Err(g) − Err(f*).
• Estimation error is how good the predictor f̂ returned by the learning algorithm is with respect to the best in the hypothesis class: Err(f̂) − Err(g). Larger hypothesis classes have higher estimation error because it's harder to find a good predictor based on limited data.
• We'd like both approximation and estimation errors to be small, but there's a tradeoff here.
Estimation error analogy

• Without formalizing it, we can understand the learning theory using the following analogy. Suppose you find a wallet on the ground, and you're trying to figure out who it belongs to. (Assume all people are honest in this example.)
• If your hypothesis is that it was just the people around you, you can go to each of them and ask a question to see if it is that person's wallet. If there are only a few people, then you could just ask a few basic questions (e.g., first name), and with high confidence if you find a match, then it's probably the right person.
• However, if you decide to email 10,000 people and ask the same basic questions, then you'll probably have a lot of matches, and if you just choose one arbitrarily, then chances are that you've got the wrong person.
• In this analogy, the questions (examples) try to help you identify the correct person (hypothesis). To stretch this analogy a bit, stochastic gradient descent is like asking a person: "which direction do you think I should go to find the correct person?"

Scenario 1: ask a few people around
Training and test error

• Another way to visualize generalization is by looking at how the various errors vary as a function of the size of the hypothesis class (something we will define more precisely later).
• If the hypothesis class is too small (majority algorithm), then the training error will be large (corresponding to large approximation error).
• If the hypothesis class is too large (rote learning), then the approximation error is small, but the estimation error is large.
• The Goldilocks hypothesis class is one where we balance both errors, resulting in a nice fit.

[Plot: training error and test error (y-axis, from 0.0 to 0.5) as a function of hypothesis class size; removing features moves toward a smaller hypothesis class]
Controlling size of hypothesis class

• For each weight vector w, we have a predictor f_w (for classification, f_w(x) = sign(w · φ(x))). So the hypothesis class F = {f_w} is all the predictors as w ranges. By controlling the number of possible values of w that the learning algorithm is allowed to choose from, we control the size of the hypothesis class and thus guard against overfitting.
• There are two ways to do this: keeping the dimensionality d small, and keeping the norm ‖w‖ (length of w) small.

Linear predictors are specified by a weight vector w ∈ R^d
[whiteboard: x ↦ w_1 x]
Controlling the dimensionality

• The most intuitive way to reduce overfitting is to reduce the number of features (or feature templates). Mathematically, you can think about removing a feature φ(x)_37 as simply only allowing its corresponding weight to be zero (w_37 = 0).
• Operationally, if you have a few feature templates, then it's probably easier to just manually include or exclude them; this will give you more intuition.
• If you have a lot of individual features, you can apply more automatic methods for selecting features, but these are beyond the scope of this class.

Manual feature (template) selection:
• Add features if they help
• Remove features if they don't help
Controlling the norm: regularization

min_w TrainLoss(w) + (λ/2) ‖w‖²

Algorithm: gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T:
  w ← w − η (∇_w TrainLoss(w) + λw)
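A minimal sketch of this update rule, assuming a linear predictor with the squared loss so that ∇TrainLoss has a simple closed form; the data, λ, η, and T are made-up illustrative values:

def dot(w, phi):
    return sum(wj * pj for wj, pj in zip(w, phi))

def regularized_gradient_descent(data, d, lam=0.1, eta=0.05, T=100):
    # data: list of (φ(x), y) pairs; d: number of features
    w = [0.0] * d
    for t in range(T):
        # Gradient of TrainLoss(w) for the squared loss (w·φ(x) − y)², averaged over the data
        grad = [0.0] * d
        for phi, y in data:
            residual = dot(w, phi) - y
            for j in range(d):
                grad[j] += 2 * residual * phi[j] / len(data)
        # Update with the extra λw term coming from the regularizer (λ/2)‖w‖²
        w = [w[j] - eta * (grad[j] + lam * w[j]) for j in range(d)]
    return w

print(regularized_gradient_descent([([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)], d=2))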
Controlling the norm: early stopping

• A really cheap way to keep the weights small is to do early stopping. As we run more iterations of gradient descent, the objective function improves. If we cared about the objective function, this would always be a good thing. However, our true objective is not the training loss.
• Each time we update the weights, w has the potential of getting larger, so by running gradient descent for fewer iterations, we are implicitly ensuring that w stays small.
• Though early stopping seems hacky, there is actually some theory behind it. And one paradoxical note is that we can sometimes get better solutions by performing less computation.

Algorithm: gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T:
  w ← w − η ∇_w TrainLoss(w)

Intuition: if we make fewer updates, then ‖w‖ can't get too big.
Lesson: try to minimize the training error, but don't try too hard.
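A minimal sketch of early stopping; here, rather than fixing T by hand, we track error on a held-out validation set and keep the weights from the iteration where it was lowest (one common way to decide when to stop). train_grad and val_error are hypothetical stand-ins for "gradient of TrainLoss at w" and "validation error of w":

def early_stopped_descent(train_grad, val_error, d, eta=0.05, T=100):
    # train_grad(w): gradient of TrainLoss at w; val_error(w): error on the validation set
    w = [0.0] * d
    best_w, best_err = list(w), val_error(w)
    for t in range(T):
        g = train_grad(w)
        w = [w[j] - eta * g[j] for j in range(d)]
        err = val_error(w)
        if err < best_err:
            # Keep the weights that generalize best so far; stopping earlier
            # means fewer updates, so ‖w‖ stays smaller.
            best_w, best_err = list(w), err
    return best_w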
Summary so far

• We've seen several ways to control the size of the hypothesis class (and thus reduce variance) based on either reducing the dimensionality or reducing the norm.
• It is important to note that what matters is the size of the hypothesis class, not how "complex" the predictors in the hypothesis class look. To put it another way, using complex features backed by 1000 lines of code doesn't hurt you if there are only 5 of them.
• Now the question is: how do we actually decide how big to make the hypothesis class, and in what ways (which features)?

Key idea: keep it simple
Try to minimize training error, but keep the hypothesis class small.
Hyperparameters: properties of the learning algorithm (features, regularization parameter λ, number of iterations T, step size η, etc.).

Solution: randomly take out 10-50% of the training data and use it instead of the test set to estimate test error.

• However, if we make the hypothesis class too small, then the approximation error gets too big. In practice, how do we decide the appropriate size? Generally, our learning algorithm has multiple hyperparameters to set. These hyperparameters cannot be set by the learning algorithm on the training data, because we would just choose a degenerate solution and overfit. On the other hand, we can't use the test set either, because then we would spoil the test set.
• The solution is to invent something that looks like a test set. There's no other data lying around, so we'll have to steal it from the training set. The resulting set is called the validation set.
• With this validation set, we can now simply try out a bunch of different hyperparameters and choose the setting that yields the lowest error on the validation set. Which hyperparameter values should we try? Generally, you should start by getting the right order of magnitude (e.g., λ = 0.0001, 0.001, 0.01, 0.1, 1, 10) and then refine if necessary.
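A minimal sketch of this procedure; train_and_evaluate is a hypothetical stand-in for "train on these examples with regularization strength lam and return the error on the held-out examples", and the 80/20 split is just one reasonable choice in the 10-50% range:

import random

def tune_lambda(examples, train_and_evaluate, lambdas=(0.0001, 0.001, 0.01, 0.1, 1, 10)):
    # Carve a validation set out of the training data (here 20%)
    examples = list(examples)
    random.shuffle(examples)
    split = int(0.8 * len(examples))
    train, validation = examples[:split], examples[split:]
    # Try each hyperparameter setting; keep the one with the lowest validation error
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = train_and_evaluate(train, validation, lam)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam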
Development cycle

Problem: simplified named-entity recognition
Input: a string x (e.g., President [Barack Obama] in)
Output: y, whether x contains a person or not (e.g., +1)
Unsupervised learning:
• Clustering: Dtrain only contains inputs x
• Unlabeled data is much cheaper to obtain (we can get 100 million unlabeled examples)
Types of unsupervised learning

• There are many forms of unsupervised learning, corresponding to different types of latent structures you want to pull out of your data. In this class, we will focus on one of them: clustering.

Clustering

• The task of clustering is to take a set of points as input and return a partitioning of the points into K clusters. We will represent the partitioning using an assignment vector z = [z1, . . . , zn]. For each i, zi ∈ {1, . . . , K} specifies which of the K clusters point i is assigned to.
Definition: clustering
[whiteboard]
Setup:
• Each cluster k = 1, . . . , K is represented by a centroid µk ∈ R^d
• Intuition: want each point φ(xi) close to its assigned centroid µ_{zi}

Objective function:
Loss_kmeans(z, µ) = ∑_{i=1}^{n} ‖φ(xi) − µ_{zi}‖²
K-means: simple example

• How do we solve this optimization problem? We can't quite just use gradient descent because there are discrete variables (the assignment variables zi). We can't really use dynamic programming because there are continuous variables (the centroids µk).
• To motivate the solution, consider a simple example with four points. As always, let's try to break up the problem into subproblems.
• What if we knew the optimal centroids? Then computing the assignment vectors is trivial (for each point, choose the closest centroid).
• What if we knew the optimal assignments? Then computing the centroids is also trivial (one can check that this is just averaging the points assigned to each centroid).
• The only problem is that we don't know the optimal centroids or assignments, and unlike in dynamic programming, the two depend on one another cyclically.

Example: one-dimensional
Input: Dtrain = {0, 2, 10, 12}
Output: K = 2 centroids µ1, µ2 ∈ R

If we know the assignments z1 = z2 = 1, z3 = z4 = 2:
µ1 = arg min_µ [(0 − µ)² + (2 − µ)²] = 1
µ2 = arg min_µ [(10 − µ)² + (12 − µ)²] = 11
K-means algorithm

• And now the leap of faith is this: start with an arbitrary setting of the centroids (not optimal). Then alternate between choosing the best assignments given the centroids, and choosing the best centroids given the assignments. This is the K-means algorithm.

K-means algorithm (Step 2)

• Now, turning things around, let's suppose we knew what the assignments z were. We can again look at the K-means objective function and try to optimize it with respect to the centroids µ. The best choice is to place the centroid µk at the average of all the points assigned to cluster k; this is step 2.

K-means algorithm

• Now we have the two ingredients to state the full K-means algorithm. We start by initializing all the centroids randomly. Then, we iteratively alternate back and forth between steps 1 and 2, optimizing z given µ and vice versa.
Objective: Loss_kmeans(z, µ) = ∑_{i=1}^{n} ‖φ(xi) − µ_{zi}‖²
Algorithm: K-means
Initialize µ1 , . . . , µK randomly.
For t = 1, . . . , T :
Step 1: set assignments z given µ
Step 2: set centroids µ given z
[demo]
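A minimal sketch of this algorithm for points in R^d; centroids are initialized by sampling K training points, and empty clusters are simply left where they are:

import random

def kmeans(points, K, T=100):
    # points: list of feature vectors φ(x_i); returns (assignments z, centroids µ)
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    mu = [list(p) for p in random.sample(points, K)]   # initialize centroids randomly
    z = [0] * len(points)
    for t in range(T):
        # Step 1: assign each point to its closest centroid
        z = [min(range(K), key=lambda k: dist2(p, mu[k])) for p in points]
        # Step 2: move each centroid to the mean of its assigned points
        for k in range(K):
            members = [p for p, zk in zip(points, z) if zk == k]
            if members:
                mu[k] = [sum(coords) / len(members) for coords in zip(*members)]
    return z, mu

print(kmeans([[0.0], [2.0], [10.0], [12.0]], K=2))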
Example: one-dimensional
Initialization (random): µ1 = 0, µ2 = 2
Iteration 1:
• Step 1: z1 = 1, z2 = 2, z3 = 2, z4 = 2
• Step 2: µ1 = 0, µ2 = 8
Iteration 2:
• Step 1: z1 = 1, z2 = 1, z3 = 2, z4 = 2
• Step 2: µ1 = 1, µ2 = 11
Local minima

• K-means is guaranteed to decrease the loss function each iteration and will converge to a local minimum, but it is not guaranteed to find the global minimum, so one must exercise caution when applying K-means.
• One solution is to simply run K-means several times from multiple random initializations and then choose the solution that has the lowest loss.
• Or we could try to be smarter in how we initialize K-means. K-means++ is an initialization scheme which places centroids on training points so that these centroids tend to be distant from one another.

K-means is guaranteed to converge to a local minimum, but is not guaranteed to find the global minimum (difficult optimization).
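A minimal sketch of the random-restart strategy, reusing the kmeans sketch above and scoring each run with the K-means objective:

def kmeans_loss(points, z, mu):
    # Loss_kmeans(z, µ) = Σ_i ‖φ(x_i) − µ_{z_i}‖²
    return sum(sum((pi - mi) ** 2 for pi, mi in zip(p, mu[zi]))
               for p, zi in zip(points, z))

def kmeans_with_restarts(points, K, restarts=10):
    runs = [kmeans(points, K) for _ in range(restarts)]             # several random initializations
    return min(runs, key=lambda run: kmeans_loss(points, *run))     # keep the lowest-loss run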
Unsupervised learning
A brief history

• Many of the ideas surrounding fitting functions were known in other fields long before computers, let alone AI.
• When computers arrived on the scene, learning was definitely on people's radar, although this was detached from the theoretical statistical and optimization foundations.
• In 1969, Minsky and Papert wrote the famous book Perceptrons, which showed the limitations of linear classifiers with the famous XOR example (similar to our car collision example), and which killed off this type of research. AI largely turned to rule-based and symbolic methods.
• Since the 1980s, machine learning has increased its role in AI and has been placed on a more solid mathematical foundation through its connections with optimization and statistics.
• While there is a lot of optimism today about the potential of machine learning, there are still a lot of unsolved problems.

1795: Gauss proposed least squares (astronomy)
1940s: logistic regression (statistics)
1952: Arthur Samuel built a program that learned to play checkers (AI)
Challenges

• Going ahead, one major thrust is to improve the capabilities of machine learning. Broadly construed, machine learning is about learning predictors from some input to some output. The simplest case is when the output is just a label, but increasingly, researchers have been using the same machine learning tools for doing translation (output is a sentence), speech synthesis (output is a waveform), and image generation (output is an image).
• Another important direction is being able to leverage the large amounts of unlabeled data to learn good representations. Can we automatically discover the underlying structure (e.g., a 3D model of the world from videos)? Can we learn a causal model of the world? How can we make sure that the representations we are learning are useful for some other task?
• A second major thrust has to do with the context in which machine learning is now routinely being applied, for example in high-stakes scenarios such as self-driving cars. But machine learning does not exist in a vacuum. When machine learning systems are deployed to real users, they change user behavior, and since the same systems are being trained on this user-generated data, feedback loops result.
• We also want to build ML systems which are fair. The real world is not fair; thus the data generated from it will reflect these discriminatory biases. Can we overcome these biases?
• The strength of machine learning lies in being able to aggregate information across many individuals. However, this appears to require a central organization that collects all this data, which seems poor practice from the point of view of protecting privacy. Can we perform machine learning while protecting individual privacy? For example, local differential privacy mechanisms inject noise into an individual's measurement before sending it to the central server.
• Finally, there is the issue of trust of machine learning systems in high-stakes situations. As these systems become more complex, it becomes harder for humans to "understand" how and why a system is making a particular decision.

Capabilities:
• More complex prediction problems (translation, generation)
• Unsupervised learning: automatically discover structure

Responsibilities:
• Feedback loops: predictions affect user behavior, which generates data
• Fairness: build classifiers that don't discriminate?
• Privacy: can we pool data together?
• Interpretability: can we understand what algorithms are doing?