
Beyond Classification

Rob Schapire
Princeton University
[currently visiting Yahoo! Research]
Classification and Beyond

• earlier, studied classification learning


• goal: learn to classify examples into fixed set of categories
• want to predict correct class as often as possible
• many applications
• however, often faced with learning problems that don’t fit this
paradigm:
• predicting real-valued quantities:
• how many times will some web page be visited?
• how much will be bid on a particular advertisement?
• predicting probabilities:
• what is the probability user will click on some link?
• how likely is it that some user is a spammer?
This Lecture

• general techniques for:


• predicting real-valued quantities — “regression”
• predicting probabilities
• central, unifying idea: loss minimization
Regression
Example: Weather Prediction

• meteorologists A and B apply for job


• to test which is better:
• ask each to predict how much it will rain
• observe actual amount
• repeat

              predictions      actual
              A       B        outcome
Monday        1.2     0.5      0.9
Tuesday       0.1     0.3      0.0
Wednesday     2.0     1.0      2.1
• how to judge who gave better predictions?
Example (cont.)
• natural idea:
• measure discrepancy between predictions and outcomes
• e.g., measure using absolute difference
• choose forecaster with closest predictions overall

              predictions      actual       difference
              A       B        outcome      A       B
Monday        1.2     0.5      0.9          0.3     0.4
Tuesday       0.1     0.3      0.0          0.1     0.3
Wednesday     2.0     1.0      2.1          0.1     1.1
total                                       0.5     1.8

• could have measured discrepancy in other ways


• e.g., difference squared
• which measure to use?
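
As a quick check of both measures, here is a short sketch (assuming Python with NumPy; the numbers are read off the table above):

```python
import numpy as np

# predictions and actual rainfall from the table above
pred_A = np.array([1.2, 0.1, 2.0])
pred_B = np.array([0.5, 0.3, 1.0])
actual = np.array([0.9, 0.0, 2.1])

for name, pred in [("A", pred_A), ("B", pred_B)]:
    abs_total = np.abs(pred - actual).sum()    # total absolute difference
    sq_total = ((pred - actual) ** 2).sum()    # total squared difference
    print(f"{name}: absolute = {abs_total:.2f}, squared = {sq_total:.2f}")
# A: absolute = 0.50, squared = 0.11
# B: absolute = 1.80, squared = 1.46
```

Under either measure, A's forecasts are closer to the outcomes here; the next slides look at what each measure rewards in general.
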
Loss

• each forecast scored using loss function


x = weather conditions
f(x) = predicted amount
y = actual outcome
• loss function L(f(x), y) measures discrepancy between prediction f(x) and outcome y
• e.g.:
• absolute loss: L(f(x), y) = |f(x) − y|
• square loss: L(f(x), y) = (f(x) − y)²
• which L to use?
• need to understand properties of loss functions
Square Loss
• square loss often sensible because encourages predictions close
to true expectation
• fix x
• say y random with µ = E[y]
• predict f = f(x)
• can show:

    E[L(f, y)] = E[(f − y)²] = (f − µ)² + Var(y)

  where the Var(y) term is the intrinsic randomness in y

• therefore:
• minimized when f = µ
• lower square loss ⇒ f closer to µ
• forecaster with lowest square loss has predictions closest
to E [y |x] on average
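
A small numerical check of this decomposition (a sketch, not part of the original slides; assumes NumPy and an arbitrary choice of distribution for y):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
y = rng.normal(loc=mu, scale=sigma, size=100_000)   # E[y] = 2.0, Var(y) = 2.25

for f in [1.0, 1.5, 2.0, 2.5]:
    monte_carlo = np.mean((f - y) ** 2)             # estimate of E[(f - y)^2]
    formula = (f - mu) ** 2 + sigma ** 2            # (f - mu)^2 + Var(y)
    print(f"f = {f}: E[(f - y)^2] ≈ {monte_carlo:.3f}, formula = {formula:.3f}")
# the loss is smallest at f = mu, and the two columns agree up to sampling noise
```
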
Learning for Regression
• say examples (x, y ) generated at random
• expected square loss

    E[L_f] ≡ E[(f(x) − y)²]

  minimized when f(x) = E[y|x] for all x
• how to minimize from training data (x_1, y_1), ..., (x_m, y_m)?
• attempt to find f with minimum empirical loss:

    Ê[L_f] ≡ (1/m) ∑_{i=1}^m (f(x_i) − y_i)²

• if ∀f : Ê[L_f] ≈ E[L_f], then the f that minimizes Ê[L_f] will approximately minimize E[L_f]
• to be possible, need to choose f of restricted form to avoid
overfitting
Linear Regression

• e.g., if x ∈ R^n, could choose to use linear predictors of the form f(x) = w · x
• then need to find w to minimize

    (1/m) ∑_{i=1}^m (w · x_i − y_i)²

• can solve in closed form


• can also minimize on-line (e.g. using gradient descent)
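
A minimal sketch of both routes on synthetic data (NumPy assumed; the step size and iteration count below are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 3
X = rng.normal(size=(m, n))                    # rows are the x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)      # noisy linear data

# closed form: least-squares solution of the empirical objective
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# batch gradient descent on (1/m) * sum (w . x_i - y_i)^2
w = np.zeros(n)
step = 0.1
for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ w - y)       # gradient of the empirical square loss
    w -= step * grad

print(w_closed, w)                             # both should be close to w_true
```
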
Regularization

• to constrain the predictor further, common to add a regularization term to encourage small weights:

    (1/m) ∑_{i=1}^m (w · x_i − y_i)² + λ‖w‖²

  (in this case, called “ridge regression”)


• can significantly improve performance by limiting overfitting
• requires tuning of λ parameter
• different forms of regularization have different properties
• e.g., using ‖w‖₁ instead tends to encourage “sparse” solutions
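
A closed-form sketch for the ridge objective above; with the 1/m scaling, setting the gradient to zero gives w = (XᵀX + λmI)⁻¹ Xᵀy (a detail whose exact form depends on how the objective is scaled):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of (1/m) * sum (w . x_i - y_i)^2 + lam * ||w||^2."""
    m, n = X.shape
    # zero gradient  <=>  (X^T X + lam * m * I) w = X^T y
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

for lam in [0.0, 0.1, 10.0]:
    print(lam, np.linalg.norm(ridge(X, y, lam)))   # larger lam shrinks the weights
```
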
Absolute Loss

• what if instead we use L(f(x), y) = |f(x) − y| ?
• can show E[|f(x) − y|] minimized when f(x) = median of y’s conditional distribution, given x
• potentially, quite different behavior from square loss
• not used so often
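
A quick numerical illustration of the difference (a sketch on a skewed distribution, chosen arbitrarily; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=1.0, size=100_000)    # skewed: mean = 1.0, median = ln 2 ≈ 0.69

candidates = np.linspace(0.0, 2.0, 201)
abs_loss = [np.mean(np.abs(f - y)) for f in candidates]
sq_loss = [np.mean((f - y) ** 2) for f in candidates]

print("absolute loss minimized near", candidates[np.argmin(abs_loss)])   # ≈ median
print("square loss minimized near", candidates[np.argmin(sq_loss)])      # ≈ mean
```
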
Summary so far

• can handle prediction of real-valued outcomes by:


• choosing a loss function
• computing a prediction rule with minimum loss on
training data
• different loss functions have different properties:
• square loss estimates conditional mean
• absolute loss estimates conditional median
• what if goal is to estimate entire conditional distribution of y
given x?
Estimating Probabilities
Weather Example (revisited)

• say goal now is to predict probability of rain


• again, can compare A and B’s predictions:
              predictions      actual
              A       B        outcome
Monday        60%     80%      rain
Tuesday       20%     70%      no-rain
Wednesday     90%     50%      no-rain
• which is better?
Plausible Approaches

• similar to classification
• but goal now is to predict probability of class
• could reduce to regression:

    y = 1 if rain, 0 if no-rain

• minimize square loss to estimate

    E[y|x] = Pr[y = 1|x] = Pr[rain|x]

• reasonable, though somewhat awkward and unnatural


(especially when more than two possible outcomes)
Different Approach: Maximum Likelihood

• each forecaster is predicting a distribution over the set of outcomes y ∈ {rain, no-rain} for given x
• can compute probability of observed outcomes according to
each forecaster — “likelihood”
              predictions      actual       likelihood
              A       B        outcome      A       B
Monday        60%     80%      rain         0.6     0.8
Tuesday       20%     70%      no-rain      0.8     0.3
Wednesday     90%     50%      no-rain      0.1     0.5

likelihood(A) = 0.6 × 0.8 × 0.1 = 0.048
likelihood(B) = 0.8 × 0.3 × 0.5 = 0.12
• intuitively, higher likelihood ⇒ better fit of estimated
probabilities to observations
• so: choose maximum-likelihood forecaster
Log Loss

• given training data (x_1, y_1), ..., (x_m, y_m)
• f(y|x) = predicted probability of y on given x
• likelihood of f = ∏_{i=1}^m f(y_i|x_i)
• maximizing likelihood ≡ minimizing negative log likelihood

    ∑_{i=1}^m (− log f(y_i|x_i))

• L(f(·|x), y) = − log f(y|x) called “log loss”
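
On the three days in the earlier table, the log losses work out as follows (a small sketch, not part of the slides; NumPy assumed):

```python
import numpy as np

# per-day likelihoods of the observed outcomes, from the earlier table
lik_A = np.array([0.6, 0.8, 0.1])
lik_B = np.array([0.8, 0.3, 0.5])

for name, lik in [("A", lik_A), ("B", lik_B)]:
    total_log_loss = np.sum(-np.log(lik))   # negative log likelihood
    print(f"{name}: likelihood = {np.prod(lik):.3f}, log loss = {total_log_loss:.3f}")
# B has the higher likelihood (0.120 vs 0.048) and therefore the lower log loss
```
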


Estimating Probabilities

• Pr[y|x] = true probability of y given x
• can prove: E[− log f(y|x)] minimized when f(y|x) = Pr[y|x]
• more generally,

    E[− log f(y|x)] = (average distance between f(·|x) and Pr[·|x])
                      + (intrinsic uncertainty of Pr[·|x])

• so: minimizing log loss encourages choice of predictor close to true conditional probabilities
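
Reading the “distance” term as the KL divergence and the “uncertainty” term as the entropy, a one-case numerical check (a sketch with arbitrarily chosen probabilities):

```python
import numpy as np

# single x with true Pr[rain|x] = p and predicted f(rain|x) = q
p, q = 0.7, 0.4

expected_log_loss = -(p * np.log(q) + (1 - p) * np.log(1 - q))
kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))       # distance term
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))               # uncertainty term

print(expected_log_loss, kl + entropy)   # the two quantities agree exactly
```
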
Learning

• given training data (x_1, y_1), ..., (x_m, y_m), choose f(y|x) to minimize

    (1/m) ∑_i (− log f(y_i|x_i))

• as before, need to restrict form of f
• e.g.: if x ∈ R^n, y ∈ {0, 1}, common to use f of the form

    f(y = 1|x) = σ(w · x)

  where σ(z) = 1/(1 + e^{−z})


• can numerically find w to minimize log loss
• “logistic regression”
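
A minimal logistic-regression sketch, fitting w by gradient descent on the average log loss (synthetic data; the step size and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
m, n = 500, 2
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0])
y = (rng.random(m) < sigmoid(X @ w_true)).astype(float)   # labels in {0, 1}

# gradient descent on (1/m) * sum -log f(y_i | x_i), with f(y=1|x) = sigmoid(w . x)
w = np.zeros(n)
step = 0.5
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= step * (X.T @ (p - y) / m)    # gradient of the average log loss
print(w)                               # roughly recovers w_true
```
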
Log Loss and Square Loss

• e.g.: if x ∈ R^n, y ∈ R, can take f(y|x) to be Gaussian with mean w · x and fixed variance
• then minimizing log loss ≡ linear regression
• in general: square loss ≡ log loss with Gaussian conditional probability distributions (and fixed variance)
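
To spell the equivalence out (a worked step, not on the slides): writing f(y|x) as a Gaussian density with mean w · x and variance σ²,

```latex
-\log f(y \mid x)
  = -\log\!\left[\frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left(-\frac{(y - w \cdot x)^2}{2\sigma^2}\right)\right]
  = \frac{(y - w \cdot x)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)
```

so for fixed σ² the log loss is the square loss up to a positive scaling and an additive constant, and both are minimized by the same w.
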
Classification and Loss Minimization

• in classification learning, try to minimize 0-1 loss

    L(f(x), y) = 1 if f(x) ≠ y, 0 else

• expected 0-1 loss = generalization error


• empirical 0-1 loss = training error
• computationally and numerically difficult loss since it is discontinuous and not convex
• to handle this, both AdaBoost and SVMs minimize alternative surrogate losses
• AdaBoost: “exponential” loss
• SVMs: “hinge” loss
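
A sketch of the three losses as functions of the margin y · f(x), with labels taken as y ∈ {−1, +1}; the exponential and hinge forms below are the standard ones, which the slides only name:

```python
import numpy as np

# losses as functions of the margin y * f(x), labels y in {-1, +1};
# a classification mistake corresponds to margin <= 0
def zero_one_loss(margin):
    return np.where(margin <= 0, 1.0, 0.0)

def exponential_loss(margin):        # surrogate minimized by AdaBoost
    return np.exp(-margin)

def hinge_loss(margin):              # surrogate minimized by SVMs
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2.0, 2.0, 9)
for m, z, e, h in zip(margins, zero_one_loss(margins),
                      exponential_loss(margins), hinge_loss(margins)):
    print(f"margin {m:+.1f}: 0-1 = {z:.0f}, exp = {e:.2f}, hinge = {h:.2f}")
# both surrogates are convex upper bounds on the 0-1 loss
```
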
Summary

• much of learning can be viewed simply as loss minimization


• different losses have different properties and purposes
• regression (real-valued labels):
• use square loss to estimate conditional mean
• use absolute loss to estimate conditional median
• estimating conditional probabilities:
• use log loss (≡ maximum likelihood)
• classification:
• use 0/1-loss (or surrogate)
• provides unified and flexible means of algorithm design
