
Beyond Classification

Rob Schapire
Princeton University
[currently visiting Yahoo! Research]
Classification and Beyond

• earlier, studied classification learning


• goal: learn to classify examples into fixed set of categories
• want to predict correct class as often as possible
• many applications
• however, often faced with learning problems that don’t fit this
paradigm:
• predicting real-valued quantities:
• how many times will some web page be visited?
• how much will be bid on a particular advertisement?
• predicting probabilities:
• what is the probability user will click on some link?
• how likely is it that some user is a spammer?
This Lecture

• general techniques for:


• predicting real-valued quantities — “regression”
• predicting probabilities
• central, unifying idea: loss minimization
Regression
Example: Weather Prediction

• meteorologists A and B apply for job


• to test which is better:
• ask each to predict how much it will rain
• observe actual amount
• repeat

              predictions      actual
              A       B        outcome
Monday        1.2     0.5      0.9
Tuesday       0.1     0.3      0.0
Wednesday     2.0     1.0      2.1
• how to judge who gave better predictions?
Example (cont.)
• natural idea:
• measure discrepancy between predictions and outcomes
• e.g., measure using absolute difference
• choose forecaster with closest predictions overall

              predictions      actual       difference
              A       B        outcome      A       B
Monday        1.2     0.5      0.9          0.3     0.4
Tuesday       0.1     0.3      0.0          0.1     0.3
Wednesday     2.0     1.0      2.1          0.1     1.1
total                                       0.5     1.8

• could have measured discrepancy in other ways


• e.g., difference squared
• which measure to use?
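
As a quick check of both measures, here is a short sketch (assuming Python with NumPy; the numbers are read off the table above):

```python
import numpy as np

# predictions and actual rainfall from the table above
pred_A = np.array([1.2, 0.1, 2.0])
pred_B = np.array([0.5, 0.3, 1.0])
actual = np.array([0.9, 0.0, 2.1])

for name, pred in [("A", pred_A), ("B", pred_B)]:
    abs_total = np.abs(pred - actual).sum()    # total absolute difference
    sq_total = ((pred - actual) ** 2).sum()    # total squared difference
    print(f"{name}: absolute = {abs_total:.2f}, squared = {sq_total:.2f}")
# A: absolute = 0.50, squared = 0.11
# B: absolute = 1.80, squared = 1.46
```

Under either measure, A's forecasts are closer to the outcomes here; the next slides look at what each measure rewards in general.
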
Loss

• each forecast scored using loss function


x = weather conditions
f(x) = predicted amount
y = actual outcome
• loss function L(f(x), y) measures discrepancy between prediction f(x) and outcome y
• e.g.:
• absolute loss: L(f(x), y) = |f(x) − y|
• square loss: L(f(x), y) = (f(x) − y)²
• which L to use?
• need to understand properties of loss functions
Square Loss
• square loss often sensible because encourages predictions close
to true expectation
• fix x
• say y random with µ = E[y]
• predict f = f(x)
• can show:

    E[L(f, y)] = E[(f − y)²] = (f − µ)² + Var(y)

  where the Var(y) term is the intrinsic randomness in y

• therefore:
• minimized when f = µ
• lower square loss ⇒ f closer to µ
• forecaster with lowest square loss has predictions closest
to E [y |x] on average
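
A small numerical check of this decomposition (a sketch, not part of the original slides; assumes NumPy and an arbitrary choice of distribution for y):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
y = rng.normal(loc=mu, scale=sigma, size=100_000)   # E[y] = 2.0, Var(y) = 2.25

for f in [1.0, 1.5, 2.0, 2.5]:
    monte_carlo = np.mean((f - y) ** 2)             # estimate of E[(f - y)^2]
    formula = (f - mu) ** 2 + sigma ** 2            # (f - mu)^2 + Var(y)
    print(f"f = {f}: E[(f - y)^2] ≈ {monte_carlo:.3f}, formula = {formula:.3f}")
# the loss is smallest at f = mu, and the two columns agree up to sampling noise
```
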
Learning for Regression
• say examples (x, y ) generated at random
• expected square loss

    E[L_f] ≡ E[(f(x) − y)²]

  minimized when f(x) = E[y|x] for all x
• how to minimize from training data (x_1, y_1), ..., (x_m, y_m)?
• attempt to find f with minimum empirical loss:

    Ê[L_f] ≡ (1/m) ∑_{i=1}^m (f(x_i) − y_i)²

• if ∀f : Ê[L_f] ≈ E[L_f], then the f that minimizes Ê[L_f] will approximately minimize E[L_f]
• to be possible, need to choose f of restricted form to avoid
overfitting
Linear Regression

• e.g., if x ∈ R^n, could choose to use linear predictors of the form f(x) = w · x
• then need to find w to minimize

    (1/m) ∑_{i=1}^m (w · x_i − y_i)²

• can solve in closed form


• can also minimize on-line (e.g. using gradient descent)
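
A minimal sketch of both routes on synthetic data (NumPy assumed; the step size and iteration count below are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 3
X = rng.normal(size=(m, n))                    # rows are the x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)      # noisy linear data

# closed form: least-squares solution of the empirical objective
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# batch gradient descent on (1/m) * sum (w . x_i - y_i)^2
w = np.zeros(n)
step = 0.1
for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ w - y)       # gradient of the empirical square loss
    w -= step * grad

print(w_closed, w)                             # both should be close to w_true
```
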
Regularization

• to constrain the predictor further, common to add a regularization term to encourage small weights:

    (1/m) ∑_{i=1}^m (w · x_i − y_i)² + λ‖w‖²

  (in this case, called “ridge regression”)


• can significantly improve performance by limiting overfitting
• requires tuning of λ parameter
• different forms of regularization have different properties
• e.g., using ‖w‖₁ instead tends to encourage “sparse” solutions
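
A closed-form sketch for the ridge objective above; with the 1/m scaling, setting the gradient to zero gives w = (XᵀX + λmI)⁻¹ Xᵀy (a detail whose exact form depends on how the objective is scaled):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of (1/m) * sum (w . x_i - y_i)^2 + lam * ||w||^2."""
    m, n = X.shape
    # zero gradient  <=>  (X^T X + lam * m * I) w = X^T y
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

for lam in [0.0, 0.1, 10.0]:
    print(lam, np.linalg.norm(ridge(X, y, lam)))   # larger lam shrinks the weights
```
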
Absolute Loss

• what if instead we use L(f(x), y) = |f(x) − y| ?
• can show E[|f(x) − y|] minimized when f(x) = median of y’s conditional distribution, given x
• potentially, quite different behavior from square loss
• not used so often
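
A quick numerical illustration of the difference (a sketch on a skewed distribution, chosen arbitrarily; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=1.0, size=100_000)    # skewed: mean = 1.0, median = ln 2 ≈ 0.69

candidates = np.linspace(0.0, 2.0, 201)
abs_loss = [np.mean(np.abs(f - y)) for f in candidates]
sq_loss = [np.mean((f - y) ** 2) for f in candidates]

print("absolute loss minimized near", candidates[np.argmin(abs_loss)])   # ≈ median
print("square loss minimized near", candidates[np.argmin(sq_loss)])      # ≈ mean
```
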
Summary so far

• can handle prediction of real-valued outcomes by:


• choosing a loss function
• computing a prediction rule with minimum loss on
training data
• different loss functions have different properties:
• square loss estimates conditional mean
• absolute loss estimates conditional median
• what if goal is to estimate entire conditional distribution of y
given x?
Estimating Probabilities
Weather Example (revisited)

• say goal now is to predict probability of rain


• again, can compare A and B’s predictions:
              predictions      actual
              A       B        outcome
Monday        60%     80%      rain
Tuesday       20%     70%      no-rain
Wednesday     90%     50%      no-rain
• which is better?
Plausible Approaches

• similar to classification
• but goal now is to predict probability of class
• could reduce to regression:

    y = 1 if rain, 0 if no-rain

• minimize square loss to estimate

    E[y|x] = Pr[y = 1|x] = Pr[rain|x]

• reasonable, though somewhat awkward and unnatural


(especially when more than two possible outcomes)
Different Approach: Maximum Likelihood

• each forecaster is predicting a distribution over the set of outcomes y ∈ {rain, no-rain} for given x
• can compute probability of observed outcomes according to
each forecaster — “likelihood”
              predictions      actual       likelihood
              A       B        outcome      A       B
Monday        60%     80%      rain         0.6     0.8
Tuesday       20%     70%      no-rain      0.8     0.3
Wednesday     90%     50%      no-rain      0.1     0.5

likelihood(A) = 0.6 × 0.8 × 0.1 = 0.048
likelihood(B) = 0.8 × 0.3 × 0.5 = 0.12
• intuitively, higher likelihood ⇒ better fit of estimated
probabilities to observations
• so: choose maximum-likelihood forecaster
Log Loss

• given training data (x_1, y_1), ..., (x_m, y_m)
• f(y|x) = predicted probability of y on given x
• likelihood of f = ∏_{i=1}^m f(y_i|x_i)
• maximizing likelihood ≡ minimizing negative log likelihood

    ∑_{i=1}^m (− log f(y_i|x_i))

• L(f(·|x), y) = − log f(y|x) called “log loss”
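
On the three days in the earlier table, the log losses work out as follows (a small sketch, not part of the slides; NumPy assumed):

```python
import numpy as np

# per-day likelihoods of the observed outcomes, from the earlier table
lik_A = np.array([0.6, 0.8, 0.1])
lik_B = np.array([0.8, 0.3, 0.5])

for name, lik in [("A", lik_A), ("B", lik_B)]:
    total_log_loss = np.sum(-np.log(lik))   # negative log likelihood
    print(f"{name}: likelihood = {np.prod(lik):.3f}, log loss = {total_log_loss:.3f}")
# B has the higher likelihood (0.120 vs 0.048) and therefore the lower log loss
```
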


Estimating Probabilities

• Pr[y|x] = true probability of y given x
• can prove: E[− log f(y|x)] minimized when f(y|x) = Pr[y|x]
• more generally,

    E[− log f(y|x)] = (average distance between f(·|x) and Pr[·|x])
                      + (intrinsic uncertainty of Pr[·|x])

• so: minimizing log loss encourages choice of predictor close to true conditional probabilities
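
Reading the “distance” term as the KL divergence and the “uncertainty” term as the entropy, a one-case numerical check (a sketch with arbitrarily chosen probabilities):

```python
import numpy as np

# single x with true Pr[rain|x] = p and predicted f(rain|x) = q
p, q = 0.7, 0.4

expected_log_loss = -(p * np.log(q) + (1 - p) * np.log(1 - q))
kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))       # distance term
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))               # uncertainty term

print(expected_log_loss, kl + entropy)   # the two quantities agree exactly
```
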
Learning

• given training data (x_1, y_1), ..., (x_m, y_m), choose f(y|x) to minimize

    (1/m) ∑_i (− log f(y_i|x_i))

• as before, need to restrict form of f
• e.g.: if x ∈ R^n, y ∈ {0, 1}, common to use f of the form

    f(y = 1|x) = σ(w · x)

  where σ(z) = 1/(1 + e^{−z})


• can numerically find w to minimize log loss
• “logistic regression”
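
A minimal logistic-regression sketch, fitting w by gradient descent on the average log loss (synthetic data; the step size and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
m, n = 500, 2
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0])
y = (rng.random(m) < sigmoid(X @ w_true)).astype(float)   # labels in {0, 1}

# gradient descent on (1/m) * sum -log f(y_i | x_i), with f(y=1|x) = sigmoid(w . x)
w = np.zeros(n)
step = 0.5
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= step * (X.T @ (p - y) / m)    # gradient of the average log loss
print(w)                               # roughly recovers w_true
```
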
Log Loss and Square Loss

• e.g.: if x ∈ R^n, y ∈ R, can take f(y|x) to be Gaussian with mean w · x and fixed variance
• then minimizing log loss ≡ linear regression
• in general: square loss ≡ log loss with Gaussian conditional probability distributions (and fixed variance)
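
To spell the equivalence out (a worked step, not on the slides): writing f(y|x) as a Gaussian density with mean w · x and variance σ²,

```latex
-\log f(y \mid x)
  = -\log\!\left[\frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left(-\frac{(y - w \cdot x)^2}{2\sigma^2}\right)\right]
  = \frac{(y - w \cdot x)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)
```

so for fixed σ² the log loss is the square loss up to a positive scaling and an additive constant, and both are minimized by the same w.
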
Classification and Loss Minimization

• in classification learning, try to minimize 0-1 loss

    L(f(x), y) = 1 if f(x) ≠ y, 0 else

• expected 0-1 loss = generalization error


• empirical 0-1 loss = training error
• computationally and numerically difficult loss since it is discontinuous and not convex
• to handle this, both AdaBoost and SVMs minimize alternative surrogate losses
• AdaBoost: “exponential” loss
• SVMs: “hinge” loss
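
A sketch of the three losses as functions of the margin y · f(x), with labels taken as y ∈ {−1, +1}; the exponential and hinge forms below are the standard ones, which the slides only name:

```python
import numpy as np

# losses as functions of the margin y * f(x), labels y in {-1, +1};
# a classification mistake corresponds to margin <= 0
def zero_one_loss(margin):
    return np.where(margin <= 0, 1.0, 0.0)

def exponential_loss(margin):        # surrogate minimized by AdaBoost
    return np.exp(-margin)

def hinge_loss(margin):              # surrogate minimized by SVMs
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2.0, 2.0, 9)
for m, z, e, h in zip(margins, zero_one_loss(margins),
                      exponential_loss(margins), hinge_loss(margins)):
    print(f"margin {m:+.1f}: 0-1 = {z:.0f}, exp = {e:.2f}, hinge = {h:.2f}")
# both surrogates are convex upper bounds on the 0-1 loss
```
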
Summary

• much of learning can be viewed simply as loss minimization


• different losses have different properties and purposes
• regression (real-valued labels):
• use square loss to estimate conditional mean
• use absolute loss to estimate conditional median
• estimating conditional probabilities:
• use log loss (≡ maximum likelihood)
• classification:
• use 0/1-loss (or surrogate)
• provides unified and flexible means of algorithm design
