
Introduction to Machine Learning

Dr. Muhammad Amjad Iqbal
Associate Professor
University of Central Punjab, Lahore.
[email protected]
https://sites.google.com/a/ucp.edu.pk/mai/iml/

Slides adapted from Prof. Dr. Andrew Ng (Stanford) and Dr. Humayoun
Logistic Regression
A Classification Algorithm

One of the most popular and most widely used learning algorithms today.
Classification

Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?

Binary case, y ∈ {0, 1}:
  0: "Negative Class" (e.g., benign tumor)
  1: "Positive Class" (e.g., malignant tumor)

Multi-class case, y ∈ {0, 1, 2, 3}:
  0: "Negative Class" (e.g., benign tumor)
  1: "Positive Class 1" (e.g., type 1 tumor)
  2: "Positive Class 2" (e.g., type 2 tumor)
  3: "Positive Class 3" (e.g., type 3 tumor)
One idea: fit linear regression to the labeled tumor data and threshold the classifier output h_θ(x) at 0.5:
  If h_θ(x) ≥ 0.5, predict "y = 1"
  If h_θ(x) < 0.5, predict "y = 0"

This can look reasonable on one dataset, but adding a single example far to the right shifts the fitted line, and the same 0.5 threshold starts misclassifying examples. Using linear regression this way is a bad thing to do for classification; in the first case we just got lucky.
Classification: y = 0 or 1, but the linear regression hypothesis h_θ(x) can be > 1 or < 0.

Logistic regression guarantees 0 ≤ h_θ(x) ≤ 1. Despite the name "regression", it is used for classification tasks.
Hypothesis Representation

Logistic Regression Model
Want 0 ≤ h_θ(x) ≤ 1.

  h_θ(x) = g(θᵀx),  where  g(z) = 1 / (1 + e^(−z))

g is the sigmoid (logistic) function: it asymptotes at 0 for large negative z, at 1 for large positive z, and equals 0.5 at z = 0.

We need to select the parameters θ so that this hypothesis fits the data; we do that with an algorithm described later.
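A minimal sketch of this hypothesis in Python with NumPy (the function and variable names are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x) for each row of X (X includes the x0 = 1 column)."""
    return sigmoid(X @ theta)

# g(0) = 0.5; large positive z -> close to 1; large negative z -> close to 0
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.9999546]
```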
Interpretation of Hypothesis Output

h_θ(x) = estimated probability that y = 1 on a new input x.

Example: if h_θ(x) = 0.7 for a patient's tumor features, tell the patient there is a 70% chance of the tumor being malignant.

  h_θ(x) = P(y = 1 | x; θ),  "the probability that y = 1, given x, parameterized by θ".

Since y = 0 or 1:  P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ).
Decision Boundary

Logistic regression: h_θ(x) = g(θᵀx), with g(z) = 1 / (1 + e^(−z)).

Predict "y = 1" if h_θ(x) ≥ 0.5; this happens exactly when z = θᵀx ≥ 0 (if z is positive, g(z) ≥ 0.5).
Predict "y = 0" if h_θ(x) < 0.5, i.e. when θᵀx < 0.

Example (linear boundary): h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2) with the plotted parameters θ_0 = −3, θ_1 = 1, θ_2 = 1.
Predict "y = 1" if −3 + x_1 + x_2 ≥ 0, i.e. x_1 + x_2 ≥ 3.
The line x_1 + x_2 = 3 is the decision boundary: any example with features x_1, x_2 satisfying this inequality is predicted as y = 1; anything below the line is predicted as y = 0.
Non-linear Decision Boundaries

Adding higher-order polynomial features gives non-linear boundaries, e.g. h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2²).

With the plotted parameters θ = [−1, 0, 0, 1, 1], predict "y = 1" if x_1² + x_2² ≥ 1: the decision boundary is the unit circle (crossing the axes at −1 and 1). With even higher-order features, the boundary can take far more complex shapes.
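As a hedged illustration of the decision rule (predict y = 1 exactly when θᵀx ≥ 0), here is a small sketch using the circular-boundary parameters assumed from the plot; the helper names and sample points are mine, not the lecture's:

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict y = 1 where h_theta(x) >= threshold, i.e. where theta^T x >= 0."""
    return (X @ theta >= 0.0).astype(int)

# Polynomial features [1, x1, x2, x1^2, x2^2] and the assumed parameters
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])
points = np.array([[0.5, 0.5],   # inside the unit circle  -> predict 0
                   [1.5, 0.0]])  # outside the unit circle -> predict 1
X = np.column_stack([np.ones(len(points)),
                     points[:, 0], points[:, 1],
                     points[:, 0]**2, points[:, 1]**2])
print(predict(theta, X))  # [0 1]
```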
Cost Function

To fit the parameters θ we need a training set of m examples:

  {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))},  with x_0 = 1 and y ∈ {0, 1},

and the hypothesis h_θ(x) = 1 / (1 + e^(−θᵀx)).

How do we choose the parameters θ?
Cost Function

For linear regression we used the squared-error cost

  J(θ) = (1/m) Σ_{i=1}^{m} ½ (h_θ(x^(i)) − y^(i))².

If we plug the logistic hypothesis h_θ(x) = 1 / (1 + e^(−θᵀx)) into this squared-error cost, J(θ) becomes a non-convex function of θ with many local optima. We want a convex cost function, so that gradient descent is guaranteed to reach the global minimum.
Logistic Regression Cost Function

  Cost(h_θ(x), y) = −log(h_θ(x))      if y = 1
  Cost(h_θ(x), y) = −log(1 − h_θ(x))  if y = 0

If y = 1: the cost is 0 when h_θ(x) = 1 and grows without bound as h_θ(x) → 0. This captures the intuition that if we predict h_θ(x) = 0 (i.e. P(y = 1 | x; θ) = 0) but in fact y = 1, we penalize the learning algorithm with a very large cost.

If y = 0: the cost is 0 when h_θ(x) = 0, but as h_θ(x) → 1 the cost goes to infinity; a confident but wrong prediction is again penalized very heavily.
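A tiny numeric sketch of these two cost curves (the example values are my own, chosen only to show the penalty growing as the prediction becomes confidently wrong):

```python
import numpy as np

def example_cost(h, y):
    """Cost(h, y) = -log(h) if y == 1, else -log(1 - h)."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

for h in (0.99, 0.5, 0.01):
    print(f"h={h:.2f}  cost if y=1: {example_cost(h, 1):6.2f}  "
          f"cost if y=0: {example_cost(h, 0):6.2f}")
# h=0.99: cheap when y=1 (~0.01), very expensive when y=0 (~4.61)
# h=0.01: the reverse
```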
Simplified Cost Function and Gradient Descent

Logistic regression cost function (a single expression covering both cases):

  Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))

  If y = 1:  Cost(h_θ(x), y) = −log(h_θ(x))
  If y = 0:  Cost(h_θ(x), y) = −log(1 − h_θ(x))
Logistic Regression Cost Function
Why do we choose this function when other cost functions exist?
• It can be derived from statistics using the principle of maximum likelihood estimation, an efficient method for finding parameters for many different models.
• It is a convex function.
Logistic Regression Cost Function

  J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

To fit the parameters θ: minimize J(θ).

To make a prediction given a new x: output h_θ(x) = 1 / (1 + e^(−θᵀx)), the hypothesis's estimate of the probability that y = 1.
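A minimal sketch of this (unregularized) cost J(θ) in Python, assuming the `sigmoid` helper from the earlier sketch; the small epsilon guarding the logs against exact 0 or 1 is my own addition, not something the slides discuss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )."""
    m = len(y)
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1.0 - eps)          # avoid log(0)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```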
Gradient Descent

Want min_θ J(θ).

Repeat {
  θ_j := θ_j − α ∂J(θ)/∂θ_j
}  (simultaneously update all θ_j)

where  ∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
Gradient Descent

Substituting the derivative, the update becomes:

Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
}  (simultaneously update all θ_j)

The algorithm looks identical to linear regression! The difference is the hypothesis: for linear regression h_θ(x) = θᵀx, while for logistic regression h_θ(x) = 1 / (1 + e^(−θᵀx)), so the two algorithms are actually very different.
Hypothesis:        h_θ(x) = 1 / (1 + e^(−θᵀx))

Cost function:     J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

Gradient descent:  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
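Putting the three pieces together, here is a hedged sketch of batch gradient descent for logistic regression; the learning rate, iteration count, and toy data are arbitrary choices of mine, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Simultaneously update all theta_j: theta_j -= alpha * (1/m) * sum((h - y) * x_j)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        grad = (X.T @ (h - y)) / m      # vector of partial derivatives dJ/dtheta_j
        theta = theta - alpha * grad    # simultaneous update of all parameters
    return theta

# Toy 1-D example: y = 1 roughly when the feature exceeds 3
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])  # first column is x0 = 1
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(np.round(sigmoid(X @ theta), 2))  # predicted probabilities rise with the feature
```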
Multi-class Classification
One-vs-all Algorithm

Examples of multiclass problems:
  Email foldering/tagging: Work, Friends, Family, Hobby
  Medical diagnosis: Not ill, Cold, Flu
  Weather: Sunny, Cloudy, Rain, Snow

Binary classification has two groups of points in the (x_1, x_2) plane; multi-class classification has three or more groups, one per class.
One-vs-all (one-vs-rest):

Turn the multi-class problem into several binary problems: for each class i, treat class i as positive and all other classes as negative, and fit a classifier h_θ^(i)(x).

  Class 1: h_θ^(1)(x)
  Class 2: h_θ^(2)(x)
  Class 3: h_θ^(3)(x)

One-vs-all

Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes h_θ^(i)(x).
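A minimal one-vs-all sketch under the assumptions above, reusing the (hypothetical) `gradient_descent` helper from the previous sketch as the binary trainer: one classifier per class, then pick the class with the highest score.

```python
import numpy as np

def one_vs_all(X, y, classes, train_binary):
    """Fit one logistic regression classifier per class (class c vs. the rest)."""
    thetas = {}
    for c in classes:
        y_binary = (y == c).astype(float)   # 1 for class c, 0 for every other class
        thetas[c] = train_binary(X, y_binary)
    return thetas

def predict_one_vs_all(thetas, X):
    """For each example, pick the class whose classifier gives the largest h_theta(x)."""
    classes = list(thetas.keys())
    # theta^T x is monotone in h_theta(x), so comparing raw scores picks the same class
    scores = np.column_stack([X @ thetas[c] for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```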
Regularization

The Problem of Overfitting
• So far we've seen a few learning algorithms (linear regression, logistic regression).
• They work well for many applications, but can suffer from the problem of overfitting.
Overfitting with Linear Regression
Example: linear regression on housing prices. Three fits of Price against Size: a straight line that underfits, a quadratic that fits well, and a high-order polynomial that passes through every point and overfits.

Overfitting: if we have too many features, the learned hypothesis may fit the training set very well (J(θ) ≈ 0) but fail to generalize to new examples (e.g., predicting prices of houses it has not seen).

The hypothesis is simply too flexible and too variable, and we don't have enough data to constrain it to give a good hypothesis.
Example: Logistic Regression
The same pattern appears in classification: in the (x_1, x_2) plane, a linear decision boundary may underfit, a moderate non-linear boundary fits well, and a highly convoluted boundary overfits (g = sigmoid function).
Addressing Overfitting
With many features (size of house, no. of bedrooms, no. of floors, age of house, average income in neighborhood, kitchen size, ...) predicting Price:
• Plotting the hypothesis is one way to decide whether overfitting occurs.
• But with lots of features and little data we cannot visualize the fit, and therefore it is:
  • hard to select the degree of polynomial, and
  • hard to decide which features to keep and which to drop.
Addressing Overfitting

Options:
1. Reduce the number of features (but this means losing information).
   ― Manually select which features to keep.
   ― Model selection algorithm (later in the course).
2. Regularization.
   ― Keep all the features, but reduce the magnitude/values of the parameters θ_j.
   ― Works well when we have a lot of features, each of which contributes a bit to predicting y.
Cost Function

Intuition
Two fits of Price against Size of house: a quadratic, and a high-order polynomial that overfits.

Suppose we penalize θ_3 and θ_4 and make them really small, e.g. minimize

  (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² + 1000·θ_3² + 1000·θ_4².

The minimizer then has θ_3 ≈ 0 and θ_4 ≈ 0, and the fit is essentially quadratic again.

Regularization
Small values for the parameters θ_0, θ_1, ..., θ_n give:
  ― a "simpler" hypothesis
  ― less prone to overfitting

Housing example:
  ― Features: x_1, x_2, ..., x_n
  ― Parameters: θ_0, θ_1, ..., θ_n
Unlike the polynomial example, we don't know in advance which are the high-order terms, so how do we pick the parameters that need to be shrunk? With regularization, we take the cost function and modify it to shrink all the parameters.

By convention we don't penalize θ_0; the penalty runs from θ_1 onwards.
Regularization

The regularized objective (the cost function with a regularization term) is

  J(θ) = (1/2m) [ Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1}^{n} θ_j² ]

• Using this regularized objective, we get a much smoother curve of Price against Size of house, one that still fits the data and gives a much better hypothesis.

λ is the regularization parameter. It controls a trade-off between our two goals:
  1) fitting the training set well, and
  2) keeping the parameters small.
In regularized linear regression, we choose θ to minimize

  J(θ) = (1/2m) [ Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1}^{n} θ_j² ]

What if λ is set to an extremely large value (perhaps too large for our problem, say λ = 10^10)?
  - Algorithm works fine; setting λ to be very large can't hurt it.
  - Algorithm fails to eliminate overfitting.
  - Algorithm results in underfitting (fails to fit even the training data well).
  - Gradient descent will fail to converge.

The result is underfitting: with λ that large, θ_1, ..., θ_n are all driven close to 0, leaving h_θ(x) ≈ θ_0, a flat line across the Price vs. Size-of-house plot that fails to fit even the training data.
Regularized Linear Regression

Gradient descent:
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_0^(i)
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i) + (λ/m) θ_j ]    (j = 1, ..., n)
}  (regularized)

The θ_0 update is the same as before, since θ_0 is not regularized. The θ_j update can be rewritten as

  θ_j := θ_j (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)

The interesting term is (1 − α λ/m): the learning rate α is usually small and m is large, so this factor is slightly less than 1 (e.g., around 0.99), and each update shrinks θ_j a little toward zero before taking the usual gradient step.
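A hedged sketch of one regularized gradient-descent update for linear regression, showing the (1 − αλ/m) shrinkage factor explicitly and leaving θ_0 unpenalized; the variable names are mine:

```python
import numpy as np

def regularized_linear_step(theta, X, y, alpha, lam):
    """One gradient step on J = (1/2m)[sum((h - y)^2) + lam * sum(theta_j^2, j >= 1)]."""
    m = len(y)
    h = X @ theta                      # linear regression hypothesis
    grad = (X.T @ (h - y)) / m         # unregularized gradient for every theta_j
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0                    # by convention theta_0 is not regularized
    return theta * shrink - alpha * grad
```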
Regularized Logistic Regression

Regularization likewise smooths an overly complex logistic regression decision boundary in the (x_1, x_2) plane.

Cost function:

  J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} θ_j²

Gradient descent:
Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_0^(i)
  θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i) + (λ/m) θ_j ]    (j = 1, ..., n)
}  (regularized)

The update rule looks identical to regularized linear regression, but h_θ(x) is now the sigmoid hypothesis.
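Finally, a sketch of the regularized logistic-regression cost and gradient as described above (again assuming the `sigmoid` helper from earlier); this is the kind of pair you could hand to an optimizer or to the gradient-descent loop sketched before:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost_and_grad(theta, X, y, lam):
    """J(theta) with the (lam/2m) * sum(theta_j^2, j >= 1) penalty, plus its gradient."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    reg = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)            # theta_0 not penalized
    J = -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) + reg
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]                            # penalty term for j >= 1
    return J, grad
```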
End
