
ML-UNIT II

SCHOOL OF COMPUTER ENGINEERING & TECHNOLOGY


Syllabus
Supervised Learning Techniques
• Regression Analysis : Residual Analysis , Multiple Linear Regression,
Logistic regression
• Distance Based Models: Nearest Neighbor classification
• Tree Based Models: Decision Trees
• Probabilistic Model: Naive Bayes Classifier
• Support Vector Machine: Maximum Margin Classifier, Kernels, SVM
Case Study



Decision Tree
Step 1: Model Construction
Training data is fed to a classification algorithm, which learns a classifier (model).
An example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Step 2: Model Usage
The classifier is applied to testing data and then to new, unseen data.
Example: the unseen tuple (Jeff, Professor, 4) is given to the classifier, which predicts whether Jeff is tenured.
Decision Tree Classification
Decision Tree Terminologies

 Decision tree may be n-ary, n ≥ 2.


 There is a special node called root node.
 Internal nodes are test attribute/decision attribute
 Leaf nodes are class labels
 Edges of a node represent the outcome for a value of the test node.
 In a path, a node with same label is never repeated.
 Decision tree is not unique, as different orderings of the internal nodes can give different decision trees
Rule Extraction From Decision Tree
■ Rules are easier to understand than large trees
■ One rule is created for each path from the root to a leaf
■ Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
■ Rules are mutually exclusive and exhaustive

Example: Rule extraction from our buys_computer decision-tree


IF age = youth AND student = no THEN buys_computer = no
IF age = youth AND student = yes THEN buys_computer = yes
IF age = middle_aged THEN buys_computer = yes
IF age = senior AND credit_rating = excellent THEN buys_computer = no
IF age = senior AND credit_rating = fair THEN buys_computer = yes
Attribute Selection: Information Gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”

1 : Calculate Entropy for Class Labels

2 : Calculate Information of Each Attribute (one by one)

3 : Calculate Gain of Each Attribute (one by one)

4: Select Attribute with max gain as a root attribute

Gain(age) is greater than the gain of any other attribute, so age is selected as the root node.
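As a rough sketch of steps 1–3 in Python (the helper functions and the four illustrative rows below are hypothetical and not the full 14-row buys_computer table):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(attr) = entropy of class labels - weighted entropy after splitting on attr."""
    total = len(rows)
    base = entropy([r[target] for r in rows])
    weighted = sum(
        (count / total) * entropy([r[target] for r in rows if r[attr] == value])
        for value, count in Counter(r[attr] for r in rows).items())
    return base - weighted

# Tiny illustrative subset (hypothetical rows)
rows = [
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "middle_aged", "student": "no", "buys_computer": "yes"},
    {"age": "senior", "student": "no", "buys_computer": "yes"},
]
print(info_gain(rows, "age", "buys_computer"))      # gain of age
print(info_gain(rows, "student", "buys_computer"))  # gain of student
```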
Intermediate DT of the buys_computer
dataset

Age is the splitting attribute at the root node of the DT. Repeat the procedure to determine the splitting attribute (excluding age) along each of the branches from the root node until the stopping condition is reached, to generate the final DT.
Final DT of the buys_computer
dataset
Example : DT Creation
Example : DT Creation
Example : DT Creation
Example : Usage of Information Gain
and Entropy in DT Creation
Decision Tree
A classifier with a tree structure, used for both classification and regression; classification is its most common use.
The decision tree model acts as the classifier: a new (unlabeled, unknown) input is fed to the model, and the model assigns it to a particular class.
A decision tree has two kinds of nodes:
 Decision node (branch: a test is conducted, e.g. yes or no) [corresponds to an Attribute]
 Leaf node (no branch) [corresponds to a Class Label]
The path for a new sample ends at a leaf node, which assigns the class to that sample.



Classification & Regression Trees (CART)
 DT creates a model that predicts the value of a target (or dependent variable)
based on the values of several input (or independent variables)
 CART was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard
Olshen and Charles Stone.
 The main elements of CART (and any decision tree algorithm) are:
• Rules for splitting data at a node based on the value of one variable
• Stopping rules for deciding when a branch is terminal and can be split no more
• Finally, a prediction for the target variable in each terminal node.

Decision Tree
 Ex.: Let’s say you want to predict whether a person is fit given their
information like age, eating habit, and physical activity, etc.

 The decision nodes here are questions like ‘What’s the age?’, ‘Does he exercise?’, ‘Does he eat a lot of pizzas?’, and the leaves are outcomes like ‘fit’ or ‘unfit’.

 Binary classification problem (yes /no type)



Decision Tree Types
There are two main types of Decision Trees:
 Classification trees (Yes/No types)
 the outcome or decision variable (like ‘fit’ or ‘unfit’) is categorical; the tree is used to identify the "class"
 Regression trees (continuous data types)
 the decision or outcome variable is continuous (like 123); the tree is used to predict the target variable's value.
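A minimal sketch of the two tree types, assuming scikit-learn and its built-in toy datasets (not part of the original slides):

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical outcome, predicts a class label
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(clf.predict(X[:2]))        # predicted classes

# Regression tree: continuous outcome, predicts a numeric target value
Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3).fit(Xr, yr)
print(reg.predict(Xr[:2]))       # predicted values
```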
Entropy
 Entropy, also called as Shannon Entropy is denoted by H(S) for a finite set S,
 It’s the measure of the amount of uncertainty or randomness in data.
 Intuitively, it tells us about the predictability of a certain event.
 Example, consider a coin toss whose probability of heads is 0.5 and probability of tails is
0.5. Here the entropy is the highest possible, since there’s no way of determining what the
outcome might be.
 Alternatively, consider a coin which has heads on both sides: the outcome of such an event can be predicted perfectly, since we know beforehand that it’ll always be heads. In other words, this event has no randomness, hence its entropy is zero.
 In particular, lower values imply less uncertainty while higher values imply high uncertainty.
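A quick numeric check of the two coin examples (a small sketch; the convention 0·log 0 = 0 is handled by skipping zero probabilities):

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(H([0.5, 0.5]))  # fair coin -> 1.0 bit (maximum uncertainty)
print(H([1.0, 0.0]))  # two-headed coin -> 0.0 (no randomness)
```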



Support Vector Machines (SVM)
UNIT II
SVM
• “Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges.
• It is mostly used in classification problems.
• In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
• Then, we perform classification by finding the hyper-plane that differentiates the two classes very well (look at the snapshot below).
How does it work?

A thumb rule to identify the right hyper-plane: “Select the hyper-plane which segregates the
two classes better”. In this scenario, hyper-plane “B” has excellently performed this job.
How does it work?

Maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us to decide the right hyper-plane. This distance is called the Margin.
Margin in SVM
The margin for hyper-plane C is high as compared to both A and B. Hence, we name the right hyper-plane as C.
If we select a hyper-plane having a low margin, then there is a high chance of mis-classification.
How does it work?

You might select hyper-plane B as it has a higher margin compared to A. But here is the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin.
Here, hyper-plane B has a classification error and A has classified all points correctly. Therefore, the right hyper-plane is A.
How does it work?

The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, we can say SVM classification is robust to outliers.
How does it work?

When the classes cannot be separated by a straight line in the original space, SVM can solve this problem easily by introducing an additional feature. Here, we will add a new feature z = x^2 + y^2. Now, let's plot the data points on the x and z axes:
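A small numeric sketch of that idea (the points below are hypothetical; the new feature z = x^2 + y^2 is just the squared distance from the origin, so a threshold on z separates an inner cluster from an outer ring):

```python
import numpy as np

# Hypothetical points: class 0 near the origin, class 1 further out
inner = np.array([[0.5, 0.2], [-0.3, 0.4], [0.1, -0.6]])
outer = np.array([[2.0, 0.1], [-1.8, 1.1], [0.3, -2.2]])

def add_z(points):
    """New feature z = x^2 + y^2 for each (x, y) point."""
    return points[:, 0] ** 2 + points[:, 1] ** 2

print(add_z(inner))  # small z values
print(add_z(outer))  # large z values -> classes become linearly separable in (x, z)
```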
kernel trick
• The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem.
• It does some extremely complex data transformations, then finds out the process to separate the data based on the labels or outputs you’ve defined.
Different Kernel Functions
SVM in Python
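A minimal sketch along those lines, assuming scikit-learn and the iris dataset (details may differ from the example shown on the slide):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF kernel; C controls the trade-off between a wide margin and misclassification
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```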
Exercise
• https://www.youtube.com/watch?v=LXGaYVXkGtg&t=189s
Bayesian Classification

 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayesian Classification - Example
Bayesian Classification - Probability

 Probability: How likely something is to happen
 Probability of an event happening = (Number of times it can happen) / (Total number of outcomes)
Bayesian Classification Theorem

 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., given that X will buy computer, the probability that X is 31..40 with medium income
Bayesian Classification Theorem

 Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
 P(H|X) = P(X|H) P(H) / P(X)
 Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
 Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
Naïve Bayesian Classification Theorem
 Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Naïve Bayesian Classification - Example

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age = youth, income = medium, student = yes, credit_rating = fair)
Naïve Bayesian Classification - Example
Test for X = (age = youth, income = medium, student = yes, credit_rating = fair)
• Prior probability P(Ci): P(buys_computer = yes) = 9/14 = 0.643; P(buys_computer = no) = 5/14 = 0.357
• To compute P(X|Ci) for each class, compute the following conditional probabilities:
P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
• P(X|Ci): P(X|buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
• P(X|Ci)*P(Ci): P(X|buys_computer = yes) * P(buys_computer = yes) = 0.028
P(X|buys_computer = no) * P(buys_computer = no) = 0.007
• Since 0.028 > 0.007, X belongs to class (“buys_computer = yes”).
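The same calculation can be scripted; a short sketch that simply hard-codes the probabilities from the worked example above:

```python
# Priors and conditional probabilities taken from the worked example
p_yes, p_no = 9 / 14, 5 / 14
likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)  # age, income, student, credit_rating | yes
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)   # age, income, student, credit_rating | no

score_yes = likelihood_yes * p_yes   # ~0.028
score_no = likelihood_no * p_no      # ~0.007

print("buys_computer =", "yes" if score_yes > score_no else "no")
```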
Comment on Naive Bayes Classification

 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence i.e. effect of an
attribute value on a given class is independent of the values of
other attributes, therefore loss of accuracy
https://www.kaggle.com/code/skalskip/iris-data-visualization-and-knn-classification
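A compact nearest-neighbour sketch in the spirit of that notebook (assuming scikit-learn; the linked notebook's exact steps may differ):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Classify each test point by a majority vote of its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```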
Feature Scaling
• Feature scaling is the method to limit the range of variables so that they can be compared on common grounds.
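Two common ways to do this, sketched with scikit-learn (min-max scaling to [0, 1] and standardization to zero mean and unit variance; the sample matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled to the range [0, 1]
print(StandardScaler().fit_transform(X))  # each column rescaled to mean 0, std 1
```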
What is Regression
Types of Regression
Cntd..
Regression analysis is a statistical technique that attempts to explore
and model the relationship between two or more variables.
Understanding Regression
• Dependent variable (the value to be predicted)
• Independent variable (the predictors)

• Assume the relationship between the independent and dependent variable is a straight line.
• y = a + bx
• a – intercept on the y axis (indicates the value of y when x = 0)
• b – slope (how much the line rises for each increase in x)

Cntd..
• Linear Regression (Straight line model)
i. Simple linear Regression (Single independent variable and dependent variable is
continuous)
ii. Multiple regression (More than one independent variable and dependent
variable is continuous)
• Logistic Regression (Binary categorical outcome)
 The logistic model is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick.
 This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc.

• For example, weight, height, and age represent continuous variables.
• A person's gender, occupation, or marital status are categorical or discrete variables.
Cntd..
Example: Advertising through TV, Radio, Newspaper to increase sales
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?

• linear regression can be used to answer each of these questions

Cntd..
• Straightforward simple linear regression approach for predicting a
quantitative response Y on the basis of a single predictor variable X

• Mathematically regressing Y on X (or Y onto X)


• X may represent TV advertising and Y may represent sales.

Simple Linear Regression
[Scatter plot: analyzing the data, with the independent variable (IV) on the x-axis and the dependent variable (DV) on the y-axis]
Simple Linear Regression
Salary (₹) vs. experience: the fitted line comes from the equation
y = b0 + b1 * x1
SALARY = b0 + b1 * EXPERIENCE
For each additional year of experience (+1 yr), the slope b1 tells us how much the salary will increase (e.g., +10K).
Simple Linear Regression
y = b0 + b1 * x1
b0 is the constant (intercept) and b1 the coefficient (slope); y is the dependent variable (DV) and x1 the independent variable (IV).
Simple Linear Regression
ORDINARY LEAST SQUARES
• How SLR finds the best fitting line from our data: for each observed salary yi (e.g., Mr. ABC) the model gives a fitted value y^i, and the line is chosen so that SUM ( yi - y^i )2 -> min.
Least Square Method

• Finds the line of best fit for a dataset, providing a visual demonstration of the relationship between the data points.
• The differences between the actual and estimated function values on the training examples are called residuals.
• The least-squares method consists in finding the estimate f^ such that the sum of squared residuals, SUM ( yi - f^(xi) )2, is minimised.
Ex: Least Square method using Univariate Regression

x   y   x – x mean   y – y mean   (x – x mean)^2   (x – x mean).(y – y mean)
1   2   -2           -2           4                4
2   4   -1           0            1                0
3   5   0            1            0                0
4   4   1            0            1                0
5   5   2            1            4                2
Problem Statement

• Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.
• What linear regression equation best predicts statistics performance, based on math aptitude scores?
• If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
• How well does the regression equation fit the data?
Student   x     y     x – x mean   y – y mean   (x – x mean)^2   (x – x mean).(y – y mean)
1         95    85    17           8            289              136
2         85    95    7            18           49               126
3         80    70    2            -7           4                -14
4         70    65    -8           -12          64               96
5         60    70    -18          -7           324              126
Sum       390   385                             730              470
mean      78    77
How to Find the Regression Equation

• The regression equation is a linear equation of the form: ŷ = b0 + b1x


• First, we solve for the regression coefficient (b1):
• b1 = Σ [ (xi - x)(yi - y) ] / Σ [ (xi - x)2]
• b1 = 470/730
• b1 = 0.644
• Once we know the value of the regression coefficient (b1), we can solve for the regression intercept (b0):
• b0 = y - b1 * x
• b0 = 77 - (0.644)(78)
• b0 = 26.768
• Therefore, the regression equation is: ŷ = 26.768 + 0.644x .
How to Use the Regression Equation

• In our example, the independent variable is the student's score on the aptitude test.
• The dependent variable is the student's statistics grade.
• If a student made an 80 on the aptitude test, the estimated statistics grade (ŷ) would be:
• ŷ = b0 + b1x
• ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80
• ŷ = 26.768 + 51.52 = 78.288
How to Find the Coefficient of Determination

• Whenever you use a regression equation, you should ask how well the equation fits the data.
• One way to assess fit is to check the coefficient of determination, which can be computed from the following formula.
• R2 = { ( 1 / N ) * Σ [ (xi - x) * (yi - y) ] / (σx * σy ) }2
• where N is the number of observations used to fit the model,
• Σ is the summation symbol, xi is the x value for observation i,
• x is the mean x value,
• yi is the y value for observation i,
• y is the mean y value,
• σx is the standard deviation of x, and
• σy is the standard deviation of y.
• σx = sqrt [ Σ ( xi - x )2 / N ]
• σx = sqrt( 730/5 ) = sqrt(146) = 12.083
• σy = sqrt [ Σ ( yi - y )2 / N ]
• σy = sqrt( 630/5 ) = sqrt(126) = 11.225
• And finally, we compute the coefficient of determination
• (R2): R2 = { ( 1 / N ) * Σ [ (xi - x) * (yi - y) ] / (σx * σy ) }2
• R2 = [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]2
• R2 = ( 94 / 135.632 )2 = ( 0.693 )2 = 0.48
• A coefficient of determination equal to 0.48 indicates that about 48% of the
variation in statistics grades (the dependent variable) can be explained by the
relationship to math aptitude scores (the independent variable).
• This would be considered a good fit to the data, in the sense that it would
substantially improve an educator's ability to predict student performance in
statistics class.
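A short pure-Python sketch that reproduces the numbers in this worked example (slope, intercept, the prediction for x = 80, and R^2):

```python
x = [95, 85, 80, 70, 60]  # math aptitude scores
y = [85, 95, 70, 65, 70]  # statistics grades

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n

sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))  # 470
sxx = sum((xi - x_mean) ** 2 for xi in x)                         # 730
syy = sum((yi - y_mean) ** 2 for yi in y)                         # 630

b1 = sxy / sxx             # slope ~ 0.644
b0 = y_mean - b1 * x_mean  # intercept ~ 26.768
print(f"y^ = {b0:.3f} + {b1:.3f} x")
print("prediction for x = 80:", b0 + b1 * 80)      # ~ 78.3
print("R^2:", (sxy / (sxx * syy) ** 0.5) ** 2)     # ~ 0.48
```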
Cntd..
x   y
1   1
2   3
4   3
3   2
5   5

When we have a single input attribute (x) and we want to use linear regression, this is called simple linear regression.

If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear regression.

The procedure for simple linear regression is different from, and simpler than, that for multiple linear regression.
Cntd..
With simple linear regression we want to model our data as follows: y = B0 + B1 * x
This is a line where y is the output variable we want to predict, x is the input variable we know, and B0 and B1 are coefficients that we need to estimate that move the line around.
Calculating the sum of these squared values gives us a denominator of 10. Now we can calculate the value of our slope: B1 = 8 / 10, so B1 = 0.8.

Linear Regression
 Simple linear regression: models the relationship between a dependent variable and one independent variable using a linear function



What is “Linear”?
 Remember the LINE equation : Y = mX + B
 Expected value of (y) at a given level of (x): E(yi | xi) = α + β*xi



Predicted value for an individual…

yi = α + β*xi + random error_i

The term α + β*xi is fixed (it lies exactly on the line); the random error follows a normal distribution.


Multiple Linear Regression
 Multiple linear regression: two or more (multiple) explanatory variables are used to predict the dependent variable

 Ex. Demand for a product varies directly with changes in the demographic characteristics (age, income) of a market area.


Regression Modeling
 A simple regression model (one independent variable) fits a regression line in 2-dimensional space

 A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space
Multiple Linear Regression



Simple and Multiple Regression



Multiple Linear Regression
Suppose we have the following dataset with one response variable y and two
predictor variables X1 and X2:

Use the following steps to fit a multiple linear regression model to this dataset.

Multiple Linear Regression
Step 1: Calculate X1^2, X2^2, X1y, X2y and X1X2.

Multiple Linear Regression
Step 2: Calculate Regression Sums.
Next, make the following regression sum calculations:
Σx1^2 = ΣX1^2 – (ΣX1)^2 / n = 38,767 – (555)^2 / 8 = 263.875
Σx2^2 = ΣX2^2 – (ΣX2)^2 / n = 2,823 – (145)^2 / 8 = 194.875
Σx1y = ΣX1y – (ΣX1 Σy) / n = 101,895 – (555*1,452) / 8 = 1,162.5
Σx2y = ΣX2y – (ΣX2 Σy) / n = 25,364 – (145*1,452) / 8 = -953.5
Σx1x2 = ΣX1X2 – (ΣX1 ΣX2) / n = 9,859 – (555*145) / 8 = -200.375

Multiple Linear Regression
Step 3: Calculate b0, b1, and b2.

The formula to calculate b1 is: [(Σx2^2)(Σx1y) – (Σx1x2)(Σx2y)] / [(Σx1^2)(Σx2^2) – (Σx1x2)^2]

Thus, b1 = [(194.875)(1,162.5) – (-200.375)(-953.5)] / [(263.875)(194.875) – (-200.375)^2] = 3.148

The formula to calculate b2 is: [(Σx1^2)(Σx2y) – (Σx1x2)(Σx1y)] / [(Σx1^2)(Σx2^2) – (Σx1x2)^2]

Thus, b2 = [(263.875)(-953.5) – (-200.375)(1,162.5)] / [(263.875)(194.875) – (-200.375)^2] = -1.656

The formula to calculate b0 is: y – b1*X1 – b2*X2 (using the means of y, X1 and X2)

Thus, b0 = 181.5 – 3.148(69.375) – (-1.656)(18.125) = -6.867

Multiple Linear Regression
Step 4: Place b0, b1, and b2 in the estimated linear regression equation.

The estimated linear regression equation is: ŷ = b0 + b1*x1 + b2*x2

In our example, it is ŷ = -6.867 + 3.148x1 – 1.656x2
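A sketch that reproduces Steps 2–4 directly from the regression sums quoted above (only the sums and means are used, since the raw data table is not repeated here):

```python
# Regression sums and means from the example above
sx1x1, sx2x2 = 263.875, 194.875
sx1y, sx2y = 1162.5, -953.5
sx1x2 = -200.375
x1_mean, x2_mean, y_mean = 69.375, 18.125, 181.5

den = sx1x1 * sx2x2 - sx1x2 ** 2
b1 = (sx2x2 * sx1y - sx1x2 * sx2y) / den   # ~ 3.148
b2 = (sx1x1 * sx2y - sx1x2 * sx1y) / den   # ~ -1.656
b0 = y_mean - b1 * x1_mean - b2 * x2_mean  # ~ -6.867

print(f"y^ = {b0:.3f} + {b1:.3f}*x1 + {b2:.3f}*x2")
```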

Residual
 Residual: the degree of discrepancy between the model assumed and the data observed
 Residuals (e): the difference between the observed value of the dependent variable (y) and the predicted value (ŷ)

 ei = yi – ŷi

 Each data point has one residual
 Residual = [Observed value] – [Predicted value]
 For a least-squares fit, the residuals sum (and average) to zero



Residual

Fig.1: Plotting the residual (Random Error)


Residual Analysis
Ex.

Fig.2: Plotting the four data points for given equation



Residual Analysis

 Residual analysis helps in evaluating the goodness of a model
 Ideally, the deviation must be near to zero

Fig.3: Residual analysis plot


Thinking Challenge

How would you draw a line through the points? How do you determine which line ‘fits best’?

[Scatter plots of Y vs. X with candidate lines: slope changed with intercept unchanged; slope unchanged with intercept changed; and both slope and intercept changed]
Least Squares (LS)
 ‘Best Fit’ means the differences between actual Y-values and predicted Y-values are a minimum.
 But positive differences off-set negative ones, so square the errors!
 LS minimizes the Sum of the Squared Differences (errors) (SSE):

SSE = Σ (Yi – Ŷi)^2, summed over i = 1 to n


Least Squares Graphically
LS minimizes Σ εi^2 = ε1^2 + ε2^2 + ε3^2 + ε4^2
Each observation satisfies Yi = β0 + β1*Xi + εi (e.g., Y2 = β0 + β1*X2 + ε2); the fitted line Ŷi is obtained from the estimated coefficients, and each εi is the vertical distance from a point to that line.
Coefficient Equations
• Prediction equation: ŷi = b0 + b1*xi (with estimated coefficients b0, b1)
• Sample slope: b1 = SSxy / SSxx = Σ (xi – x̄)(yi – ȳ) / Σ (xi – x̄)^2
• Sample Y-intercept: b0 = ȳ – b1*x̄
Linear classification
Nonlinearity
Logistic Regression

• Logistic Regression is one of the most commonly used Machine Learning algorithms; it is used to model a binary variable that takes only 2 values – 0 and 1.
• The objective of Logistic Regression is to develop a mathematical equation that can give us a score in the range of 0 to 1. This score gives us the probability of the variable taking the value 1.
Logistic regression

• Even though it is called regression, this is a classification method which is based on the probability of a sample belonging to a class.
• As our probabilities must be continuous in R and bounded between (0, 1), it's necessary to introduce a threshold function to filter the term z.
• The name logistic comes from the decision to use the sigmoid (or logistic) function:
Logistic regression

• If ‘Z’ goes to infinity, Y(predicted) will become 1, and if ‘Z’ goes to negative infinity, Y(predicted) will become 0.
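A tiny sketch of this thresholding behaviour:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1

# Decision rule: predict class 1 when the probability is at least 0.5
def predict(z):
    return 1 if sigmoid(z) >= 0.5 else 0
```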
Applications

• How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day?
• Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?
• Spam Detection
• Tumour Prediction
• Credit Card Fraud
• Marketing
The Logistic Regression Model
 The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
 The estimated probability is:

p = 1 / [1 + exp(-α - β*X)]

 if you let α + β*X = 0, then p = 0.50
 as α + β*X gets really big, p approaches 1
 as α + β*X gets really small, p approaches 0


The Logistic Regression Model



Linear Vs Logistic



Regularization
Overfitting is one of the most serious kinds of problems related to machine
learning. It occurs when a model learns the training data too well. The model then
learns not only the relationships among data but also the noise in the dataset.
Overfitted models tend to have good performance with the data used to fit them
(the training data), but they behave poorly with unseen data (or test data, which is
data not used to fit the model).
Overfitting usually occurs with complex models. Regularization normally tries to
reduce or penalize the complexity of the model. Regularization techniques applied
with logistic regression mostly tend to penalize large coefficients 𝑏₀, 𝑏₁, …, 𝑏ᵣ:
•L1 regularization penalizes the LLF with the scaled sum of the absolute values of
the weights: |𝑏₀|+|𝑏₁|+⋯+|𝑏ᵣ|.
•L2 regularization penalizes the LLF with the scaled sum of the squares of the
weights: 𝑏₀²+𝑏₁²+⋯+𝑏ᵣ².
•Elastic-net regularization is a linear combination of L1 and L2 regularization.
Regularization can significantly improve model performance on unseen data.
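In scikit-learn these options map onto the penalty argument of LogisticRegression; a minimal sketch on synthetic data (the solver choices below are required for the L1 and elastic-net penalties):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, solver="saga",
                          max_iter=5000).fit(X, y)

# Smaller C means a stronger penalty on the model coefficients (weights)
print(l2.score(X, y), l1.score(X, y), enet.score(X, y))
```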
Overfitting
 The natural end of the tree-growing process in a DT is 100% purity in each leaf
 This overfits the data, ending up fitting the noise in the data
 Overfitting leads to low predictive accuracy on new data
 Past a certain point, the error rate for the validation data starts to increase



Overfitting & Underfitting
 Training a Model:
 Overfit: the model performs much better on the train set but does not perform well on the cross-validation set
 Overfitting is called a “High Variance” problem
 Underfit: the model does not perform well even on the train set itself
 Underfitting is considered a “High Bias” problem


