
ML-UNIT II

SCHOOL OF COMPUTER ENGINEERING & TECHNOLOGY


Syllabus
Supervised Learning Techniques
• Regression Analysis : Residual Analysis , Multiple Linear Regression,
Logistic regression
• Distance Based Models: Nearest Neighbor classification
• Tree Based Models: Decision Trees
• Probabilistic Model: Naive Bayes Classifier
• Support Vector Machine: Maximum Margin Classifier, Kernels, SVM
Case Study



Decision Tree
Step 1: Model Construction
Training data is fed to a classification algorithm, which learns a classifier (model).
An example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Step 2: Model Usage
The classifier is applied to testing data and then to new, unseen data.
Example: the unseen tuple (Jeff, Professor, 4) is given to the classifier, which predicts whether Jeff is tenured.
Decision Tree Classification
Decision Tree Terminologies

 Decision tree may be n-ary, n ≥ 2.


 There is a special node called root node.
 Internal nodes are test attribute/decision attribute
 Leaf nodes are class labels
 Edges of a node represent the outcome for a value of the test node.
 In a path, a node with same label is never repeated.
 Decision tree is not unique, as different orderings of the internal nodes can give different decision trees
Rule Extraction From Decision Tree
■ Rules are easier to understand than large trees
■ One rule is created for each path from the root to a leaf
■ Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
■ Rules are mutually exclusive and exhaustive

Example: Rule extraction from our buys_computer decision-tree


IF age = youth AND student = no THEN buys_computer = no
IF age = youth AND student = yes THEN buys_computer = yes
IF age = middle_aged THEN buys_computer = yes
IF age = senior AND credit_rating = excellent THEN buys_computer = no
IF age = senior AND credit_rating = fair THEN buys_computer = yes
Attribute Selection: Information Gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”

1 : Calculate Entropy for Class Labels

2 : Calculate Information of Each Attribute (one by one)

3 : Calculate Gain of Each Attribute (one by one)

4: Select Attribute with max gain as a root attribute

Gain(age) is greater than the gain of any other attribute, so age is selected as the root node.
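As a rough sketch of steps 1–3 in Python (the helper functions and the four illustrative rows below are hypothetical and not the full 14-row buys_computer table):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(attr) = entropy of class labels - weighted entropy after splitting on attr."""
    total = len(rows)
    base = entropy([r[target] for r in rows])
    weighted = sum(
        (count / total) * entropy([r[target] for r in rows if r[attr] == value])
        for value, count in Counter(r[attr] for r in rows).items())
    return base - weighted

# Tiny illustrative subset (hypothetical rows)
rows = [
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "middle_aged", "student": "no", "buys_computer": "yes"},
    {"age": "senior", "student": "no", "buys_computer": "yes"},
]
print(info_gain(rows, "age", "buys_computer"))      # gain of age
print(info_gain(rows, "student", "buys_computer"))  # gain of student
```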
Intermediate DT of the buys_computer
dataset

Age is the splitting attribute at the root node of the DT. Repeat the procedure to determine the splitting attribute (excluding age) along each of the branches from the root node until the stopping condition is reached, to generate the final DT.
Final DT of the buys_computer
dataset
Example : DT Creation
Example : DT Creation
Example : DT Creation
Example : Usage of Information Gain
and Entropy in DT Creation
Decision Tree
A classifier with a tree structure, used for both classification and regression; classification is its most common use.
The decision tree model acts as the classifier: a new (unlabeled, unknown) input is fed to the model, and the model assigns it to a particular class.
A decision tree has two kinds of nodes:
 Decision node (branch: a test is conducted, e.g. yes or no) [corresponds to an Attribute]
 Leaf node (no branch) [corresponds to a Class Label]
The path for a new sample ends at a leaf node, which assigns the class to that sample.



Classification & Regression Trees (CART)
 DT creates a model that predicts the value of a target (or dependent variable)
based on the values of several input (or independent variables)
 CART was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard
Olshen and Charles Stone.
 The main elements of CART (and any decision tree algorithm) are:
• Rules for splitting data at a node based on the value of one variable
• Stopping rules for deciding when a branch is terminal and can be split no more
• Finally, a prediction for the target variable in each terminal node.

Decision Tree
 Ex.: Let’s say you want to predict whether a person is fit given their
information like age, eating habit, and physical activity, etc.

 The decision nodes here are questions like ‘What’s the age?’, ‘Does he exercise?’, ‘Does he eat a lot of pizzas?’, and the leaves are outcomes like ‘fit’ or ‘unfit’.

 Binary classification problem (yes /no type)



Decision Tree Types
There are two main types of Decision Trees:
 Classification trees (Yes/No types)
 the outcome or decision variable (like ‘fit’ or ‘unfit’) is categorical; the tree is used to identify the "class"
 Regression trees (continuous data types)
 the decision or outcome variable is continuous (like 123); the tree is used to predict the target variable's value.
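A minimal sketch of the two tree types, assuming scikit-learn and its built-in toy datasets (not part of the original slides):

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical outcome, predicts a class label
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(clf.predict(X[:2]))        # predicted classes

# Regression tree: continuous outcome, predicts a numeric target value
Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3).fit(Xr, yr)
print(reg.predict(Xr[:2]))       # predicted values
```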
Entropy
 Entropy, also called as Shannon Entropy is denoted by H(S) for a finite set S,
 It’s the measure of the amount of uncertainty or randomness in data.
 Intuitively, it tells us about the predictability of a certain event.
 Example, consider a coin toss whose probability of heads is 0.5 and probability of tails is
0.5. Here the entropy is the highest possible, since there’s no way of determining what the
outcome might be.
 Alternatively, consider a coin which has heads on both sides: the outcome of such an event can be predicted perfectly, since we know beforehand that it’ll always be heads. In other words, this event has no randomness, hence its entropy is zero.
 In particular, lower values imply less uncertainty while higher values imply high uncertainty.
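A quick numeric check of the two coin examples (a small sketch; the convention 0·log 0 = 0 is handled by skipping zero probabilities):

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(H([0.5, 0.5]))  # fair coin -> 1.0 bit (maximum uncertainty)
print(H([1.0, 0.0]))  # two-headed coin -> 0.0 (no randomness)
```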



Support Vector Machines (SVM)
UNIT II
SVM
• “Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges.
• It is mostly used in classification problems.
• In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
• Then, we perform classification by finding the hyper-plane that differentiates the two classes very well (look at the snapshot below).
How does it work?

A thumb rule to identify the right hyper-plane: “Select the hyper-plane which segregates the
two classes better”. In this scenario, hyper-plane “B” has excellently performed this job.
How does it work?

Maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us to decide the right hyper-plane. This distance is called the Margin.
Margin in SVM
The margin for hyper-plane C is high as compared to both A and B. Hence, we name the right hyper-plane as C.
If we select a hyper-plane having a low margin, then there is a high chance of mis-classification.
How does it work?

You might select hyper-plane B as it has a higher margin compared to A. But here is the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin.
Here, hyper-plane B has a classification error and A has classified all points correctly. Therefore, the right hyper-plane is A.
How does it work?

The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, we can say SVM classification is robust to outliers.
How does it work?

When the classes cannot be separated by a straight line in the original space, SVM can solve this problem easily by introducing an additional feature. Here, we will add a new feature z = x^2 + y^2. Now, let's plot the data points on the x and z axes:
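A small numeric sketch of that idea (the points below are hypothetical; the new feature z = x^2 + y^2 is just the squared distance from the origin, so a threshold on z separates an inner cluster from an outer ring):

```python
import numpy as np

# Hypothetical points: class 0 near the origin, class 1 further out
inner = np.array([[0.5, 0.2], [-0.3, 0.4], [0.1, -0.6]])
outer = np.array([[2.0, 0.1], [-1.8, 1.1], [0.3, -2.2]])

def add_z(points):
    """New feature z = x^2 + y^2 for each (x, y) point."""
    return points[:, 0] ** 2 + points[:, 1] ** 2

print(add_z(inner))  # small z values
print(add_z(outer))  # large z values -> classes become linearly separable in (x, z)
```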
kernel trick
• The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem.
• It does some extremely complex data transformations, then finds out the process to separate the data based on the labels or outputs you’ve defined.
Different Kernel Functions
SVM in Python
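A minimal sketch along those lines, assuming scikit-learn and the iris dataset (details may differ from the example shown on the slide):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF kernel; C controls the trade-off between a wide margin and misclassification
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```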
Exercise
• https://www.youtube.com/watch?v=LXGaYVXkGtg&t=189s
Bayesian Classification

 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayesian Classification - Example
Bayesian Classification - Probability

 Probability: How likely something is to happen
 Probability of an event happening = (Number of times it can happen) / (Total number of outcomes)
Bayesian Classification Theorem

 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., given that X will buy computer, the probability that X is 31..40 with medium income
Bayesian Classification Theorem

 Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
 P(H|X) = P(X|H) P(H) / P(X)
 Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
 Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
Naïve Bayesian Classification Theorem
 Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Naïve Bayesian Classification - Example

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age = youth, income = medium, student = yes, credit_rating = fair)
Naïve Bayesian Classification - Example
Test for X = (age = youth, income = medium, student = yes, credit_rating = fair)
• Prior probability P(Ci): P(buys_computer = yes) = 9/14 = 0.643; P(buys_computer = no) = 5/14 = 0.357
• To compute P(X|Ci) for each class, compute the following conditional probabilities:
P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
• P(X|Ci): P(X|buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
• P(X|Ci)*P(Ci): P(X|buys_computer = yes) * P(buys_computer = yes) = 0.028
P(X|buys_computer = no) * P(buys_computer = no) = 0.007
• Since 0.028 > 0.007, X belongs to class (“buys_computer = yes”).
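The same calculation can be scripted; a short sketch that simply hard-codes the probabilities from the worked example above:

```python
# Priors and conditional probabilities taken from the worked example
p_yes, p_no = 9 / 14, 5 / 14
likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)  # age, income, student, credit_rating | yes
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)   # age, income, student, credit_rating | no

score_yes = likelihood_yes * p_yes   # ~0.028
score_no = likelihood_no * p_no      # ~0.007

print("buys_computer =", "yes" if score_yes > score_no else "no")
```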
Comment on Naive Bayes Classification

 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence i.e. effect of an
attribute value on a given class is independent of the values of
other attributes, therefore loss of accuracy
https://www.kaggle.com/code/skalskip/iris-data-visualization-and-knn-classification
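A compact nearest-neighbour sketch in the spirit of that notebook (assuming scikit-learn; the linked notebook's exact steps may differ):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Classify each test point by a majority vote of its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```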
Feature Scaling
• Feature scaling is the method to limit the range of variables so that they can be compared on common grounds.
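Two common ways to do this, sketched with scikit-learn (min-max scaling to [0, 1] and standardization to zero mean and unit variance; the sample matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled to the range [0, 1]
print(StandardScaler().fit_transform(X))  # each column rescaled to mean 0, std 1
```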
What is Regression
Types of Regression
Cntd..
Regression analysis is a statistical technique that attempts to explore
and model the relationship between two or more variables.
Understanding Regression
• Dependent variable (the value to be predicted)
• Independent variable (the predictors)

• Assume the relationship between the independent and dependent variable is a straight line.
• y = a + bx
• a – intercept on the y axis (indicates the value of y when x = 0)
• b – slope (how much the line rises for each increase in x)

Cntd..
• Linear Regression (Straight line model)
i. Simple linear Regression (Single independent variable and dependent variable is
continuous)
ii. Multiple regression (More than one independent variable and dependent
variable is continuous)
• Logistic Regression (Binary categorical outcome)
 The logistic model is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick.
 This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc.

• For example, weight, height, and age represent continuous variables.
• A person's gender, occupation, or marital status are categorical or discrete variables.
Cntd..
Example: Advertising through TV, Radio, Newspaper to increase sales
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?

• linear regression can be used to answer each of these questions

Cntd..
• Straightforward simple linear regression approach for predicting a
quantitative response Y on the basis of a single predictor variable X

• Mathematically regressing Y on X (or Y onto X)


• X may represent TV advertising and Y may represent sales.

Simple Linear Regression
[Scatter plot: analyzing the data, with the independent variable (IV) on the x-axis and the dependent variable (DV) on the y-axis]
Simple Linear Regression
Salary (₹) vs. experience: the fitted line comes from the equation
y = b0 + b1 * x1
SALARY = b0 + b1 * EXPERIENCE
For each additional year of experience (+1 yr), the slope b1 tells us how much the salary will increase (e.g., +10K).
Simple Linear Regression
y = b0 + b1 * x1
b0 is the constant (intercept) and b1 the coefficient (slope); y is the dependent variable (DV) and x1 the independent variable (IV).
Simple Linear Regression
ORDINARY LEAST SQUARES
• How SLR finds the best fitting line from our data: for each observed salary yi (e.g., Mr. ABC) the model gives a fitted value y^i, and the line is chosen so that SUM ( yi - y^i )2 -> min.
Least Square Method

• Finds the line of best fit for a dataset, providing a visual demonstration of the relationship between the data points.
• The differences between the actual and estimated function values on the training examples are called residuals.
• The least-squares method consists in finding the estimate f^ such that the sum of squared residuals, SUM ( yi - f^(xi) )2, is minimised.
Ex: Least Square method using Univariate Regression

x   y   x – x mean   y – y mean   (x – x mean)^2   (x – x mean).(y – y mean)
1   2   -2           -2           4                4
2   4   -1           0            1                0
3   5   0            1            0                0
4   4   1            0            1                0
5   5   2            1            4                2
Problem Statement

• Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.
• What linear regression equation best predicts statistics performance, based on math aptitude scores?
• If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
• How well does the regression equation fit the data?
Student   x     y     x – x mean   y – y mean   (x – x mean)^2   (x – x mean).(y – y mean)
1         95    85    17           8            289              136
2         85    95    7            18           49               126
3         80    70    2            -7           4                -14
4         70    65    -8           -12          64               96
5         60    70    -18          -7           324              126
Sum       390   385                             730              470
mean      78    77
How to Find the Regression Equation

• The regression equation is a linear equation of the form: ŷ = b0 + b1x


• First, we solve for the regression coefficient (b1):
• b1 = Σ [ (xi - x)(yi - y) ] / Σ [ (xi - x)2]
• b1 = 470/730
• b1 = 0.644
• Once we know the value of the regression coefficient (b1), we can solve for the regression intercept (b0):
• b0 = y - b1 * x
• b0 = 77 - (0.644)(78)
• b0 = 26.768
• Therefore, the regression equation is: ŷ = 26.768 + 0.644x .
How to Use the Regression Equation

• In our example, the independent variable is the student's score on the aptitude test.
• The dependent variable is the student's statistics grade.
• If a student made an 80 on the aptitude test, the estimated statistics grade (ŷ) would be:
• ŷ = b0 + b1x
• ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80
• ŷ = 26.768 + 51.52 = 78.288
How to Find the Coefficient of Determination

• Whenever you use a regression equation, you should ask how well the equation fits the data.
• One way to assess fit is to check the coefficient of determination, which can be computed from the following formula.
• R2 = { ( 1 / N ) * Σ [ (xi - x) * (yi - y) ] / (σx * σy ) }2
• where N is the number of observations used to fit the model,
• Σ is the summation symbol, xi is the x value for observation i,
• x is the mean x value,
• yi is the y value for observation i,
• y is the mean y value,
• σx is the standard deviation of x, and
• σy is the standard deviation of y.
• σx = sqrt [ Σ ( xi - x )2 / N ]
• σx = sqrt( 730/5 ) = sqrt(146) = 12.083
• σy = sqrt [ Σ ( yi - y )2 / N ]
• σy = sqrt( 630/5 ) = sqrt(126) = 11.225
• And finally, we compute the coefficient of determination
• (R2): R2 = { ( 1 / N ) * Σ [ (xi - x) * (yi - y) ] / (σx * σy ) }2
• R2 = [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]2
• R2 = ( 94 / 135.632 )2 = ( 0.693 )2 = 0.48
• A coefficient of determination equal to 0.48 indicates that about 48% of the
variation in statistics grades (the dependent variable) can be explained by the
relationship to math aptitude scores (the independent variable).
• This would be considered a good fit to the data, in the sense that it would
substantially improve an educator's ability to predict student performance in
statistics class.
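A short pure-Python sketch that reproduces the numbers in this worked example (slope, intercept, the prediction for x = 80, and R^2):

```python
x = [95, 85, 80, 70, 60]  # math aptitude scores
y = [85, 95, 70, 65, 70]  # statistics grades

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n

sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))  # 470
sxx = sum((xi - x_mean) ** 2 for xi in x)                         # 730
syy = sum((yi - y_mean) ** 2 for yi in y)                         # 630

b1 = sxy / sxx             # slope ~ 0.644
b0 = y_mean - b1 * x_mean  # intercept ~ 26.768
print(f"y^ = {b0:.3f} + {b1:.3f} x")
print("prediction for x = 80:", b0 + b1 * 80)      # ~ 78.3
print("R^2:", (sxy / (sxx * syy) ** 0.5) ** 2)     # ~ 0.48
```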
Cntd..
x   y
1   1
2   3
4   3
3   2
5   5

When we have a single input attribute (x) and we want to use linear regression, this is called simple linear regression.

If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear regression.

The procedure for simple linear regression is different from, and simpler than, that for multiple linear regression.
Cntd..
With simple linear regression we want to model our data as follows: y = B0 + B1 * x
This is a line where y is the output variable we want to predict, x is the input variable we know, and B0 and B1 are coefficients that we need to estimate that move the line around.
Calculating the sum of these squared values gives us a denominator of 10. Now we can calculate the value of our slope: B1 = 8 / 10, so B1 = 0.8.

Linear Regression
 Simple linear regression: models the relationship between a dependent variable and one independent variable using a linear function



What is “Linear”?
 Remember the LINE equation : Y = mX + B
 Expected value of (y) at a given level of (x): E(yi | xi) = α + β*xi



Predicted value for an individual…

yi = α + β*xi + random error_i

The term α + β*xi is fixed (it lies exactly on the line); the random error follows a normal distribution.


Multiple Linear Regression
 Multiple linear regression: two or more (multiple) explanatory variables are used to predict the dependent variable

 Ex. Demand for a product varies directly with changes in the demographic characteristics (age, income) of a market area.


Regression Modeling
 A simple regression model (one independent variable) fits a regression line in 2-dimensional space

 A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space
Multiple Linear Regression



Simple and Multiple Regression



Multiple Linear Regression
Suppose we have the following dataset with one response variable y and two
predictor variables X1 and X2:

Use the following steps to fit a multiple linear regression model to this dataset.

Multiple Linear Regression
Step 1: Calculate X1^2, X2^2, X1y, X2y and X1X2.

Multiple Linear Regression
Step 2: Calculate Regression Sums.
Next, make the following regression sum calculations:
Σx1^2 = ΣX1^2 – (ΣX1)^2 / n = 38,767 – (555)^2 / 8 = 263.875
Σx2^2 = ΣX2^2 – (ΣX2)^2 / n = 2,823 – (145)^2 / 8 = 194.875
Σx1y = ΣX1y – (ΣX1 Σy) / n = 101,895 – (555*1,452) / 8 = 1,162.5
Σx2y = ΣX2y – (ΣX2 Σy) / n = 25,364 – (145*1,452) / 8 = -953.5
Σx1x2 = ΣX1X2 – (ΣX1 ΣX2) / n = 9,859 – (555*145) / 8 = -200.375

Multiple Linear Regression
Step 3: Calculate b0, b1, and b2.

The formula to calculate b1 is: [(Σx2^2)(Σx1y) – (Σx1x2)(Σx2y)] / [(Σx1^2)(Σx2^2) – (Σx1x2)^2]

Thus, b1 = [(194.875)(1,162.5) – (-200.375)(-953.5)] / [(263.875)(194.875) – (-200.375)^2] = 3.148

The formula to calculate b2 is: [(Σx1^2)(Σx2y) – (Σx1x2)(Σx1y)] / [(Σx1^2)(Σx2^2) – (Σx1x2)^2]

Thus, b2 = [(263.875)(-953.5) – (-200.375)(1,162.5)] / [(263.875)(194.875) – (-200.375)^2] = -1.656

The formula to calculate b0 is: y – b1*X1 – b2*X2 (using the means of y, X1 and X2)

Thus, b0 = 181.5 – 3.148(69.375) – (-1.656)(18.125) = -6.867

Multiple Linear Regression
Step 4: Place b0, b1, and b2 in the estimated linear regression equation.

The estimated linear regression equation is: ŷ = b0 + b1*x1 + b2*x2

In our example, it is ŷ = -6.867 + 3.148x1 – 1.656x2
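A sketch that reproduces Steps 2–4 directly from the regression sums quoted above (only the sums and means are used, since the raw data table is not repeated here):

```python
# Regression sums and means from the example above
sx1x1, sx2x2 = 263.875, 194.875
sx1y, sx2y = 1162.5, -953.5
sx1x2 = -200.375
x1_mean, x2_mean, y_mean = 69.375, 18.125, 181.5

den = sx1x1 * sx2x2 - sx1x2 ** 2
b1 = (sx2x2 * sx1y - sx1x2 * sx2y) / den   # ~ 3.148
b2 = (sx1x1 * sx2y - sx1x2 * sx1y) / den   # ~ -1.656
b0 = y_mean - b1 * x1_mean - b2 * x2_mean  # ~ -6.867

print(f"y^ = {b0:.3f} + {b1:.3f}*x1 + {b2:.3f}*x2")
```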

Residual
 Residual: the degree of discrepancy between the model assumed and the data observed
 Residuals (e): the difference between the observed value of the dependent variable (y) and the predicted value (ŷ)

 ei = yi – ŷi

 Each data point has one residual
 Residual = [Observed value] – [Predicted value]
 For a least-squares fit, the residuals sum (and average) to zero



Residual

Fig.1: Plotting the residual (Random Error)


Residual Analysis
Ex.

Fig.2: Plotting the four data points for given equation



Residual Analysis

 Residual analysis helps in evaluating the goodness of a model
 Ideally, the deviation must be near to zero

Fig.3: Residual analysis plot


Thinking Challenge

How would you draw a line through the points? How do you determine which line ‘fits best’?

[Scatter plots of Y vs. X with candidate lines: slope changed with intercept unchanged; slope unchanged with intercept changed; and both slope and intercept changed]
Least Squares (LS)
 ‘Best Fit’ means the differences between actual Y-values and predicted Y-values are a minimum.
 But positive differences off-set negative ones, so square the errors!
 LS minimizes the Sum of the Squared Differences (errors) (SSE):

SSE = Σ (Yi – Ŷi)^2, summed over i = 1 to n


Least Squares Graphically
LS minimizes Σ εi^2 = ε1^2 + ε2^2 + ε3^2 + ε4^2
Each observation satisfies Yi = β0 + β1*Xi + εi (e.g., Y2 = β0 + β1*X2 + ε2); the fitted line Ŷi is obtained from the estimated coefficients, and each εi is the vertical distance from a point to that line.
Coefficient Equations
• Prediction equation: ŷi = b0 + b1*xi (with estimated coefficients b0, b1)
• Sample slope: b1 = SSxy / SSxx = Σ (xi – x̄)(yi – ȳ) / Σ (xi – x̄)^2
• Sample Y-intercept: b0 = ȳ – b1*x̄
Linear classification
Nonlinearity
Logistic Regression

• Logistic Regression is one of the most commonly used Machine Learning algorithms; it is used to model a binary variable that takes only 2 values – 0 and 1.
• The objective of Logistic Regression is to develop a mathematical equation that can give us a score in the range of 0 to 1. This score gives us the probability of the variable taking the value 1.
Logistic regression

• Even though it is called regression, this is a classification method which is based on the probability of a sample belonging to a class.
• As our probabilities must be continuous in R and bounded between (0, 1), it's necessary to introduce a threshold function to filter the term z.
• The name logistic comes from the decision to use the sigmoid (or logistic) function:
Logistic regression

• If ‘Z’ goes to infinity, Y(predicted) will become 1, and if ‘Z’ goes to negative infinity, Y(predicted) will become 0.
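A tiny sketch of this thresholding behaviour:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1

# Decision rule: predict class 1 when the probability is at least 0.5
def predict(z):
    return 1 if sigmoid(z) >= 0.5 else 0
```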
Applications

• How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day?
• Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?
• Spam Detection
• Tumour Prediction
• Credit Card Fraud
• Marketing
The Logistic Regression Model
 The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
 The estimated probability is:

p = 1 / [1 + exp(-α - β*X)]

 if you let α + β*X = 0, then p = 0.50
 as α + β*X gets really big, p approaches 1
 as α + β*X gets really small, p approaches 0


The Logistic Regression Model



Linear Vs Logistic



Regularization
Overfitting is one of the most serious kinds of problems related to machine
learning. It occurs when a model learns the training data too well. The model then
learns not only the relationships among data but also the noise in the dataset.
Overfitted models tend to have good performance with the data used to fit them
(the training data), but they behave poorly with unseen data (or test data, which is
data not used to fit the model).
Overfitting usually occurs with complex models. Regularization normally tries to
reduce or penalize the complexity of the model. Regularization techniques applied
with logistic regression mostly tend to penalize large coefficients 𝑏₀, 𝑏₁, …, 𝑏ᵣ:
•L1 regularization penalizes the LLF with the scaled sum of the absolute values of
the weights: |𝑏₀|+|𝑏₁|+⋯+|𝑏ᵣ|.
•L2 regularization penalizes the LLF with the scaled sum of the squares of the
weights: 𝑏₀²+𝑏₁²+⋯+𝑏ᵣ².
•Elastic-net regularization is a linear combination of L1 and L2 regularization.
Regularization can significantly improve model performance on unseen data.
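In scikit-learn these options map onto the penalty argument of LogisticRegression; a minimal sketch on synthetic data (the solver choices below are required for the L1 and elastic-net penalties):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, solver="saga",
                          max_iter=5000).fit(X, y)

# Smaller C means a stronger penalty on the model coefficients (weights)
print(l2.score(X, y), l1.score(X, y), enet.score(X, y))
```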
Overfitting
 The natural end of the tree-growing process in a DT is 100% purity in each leaf
 This overfits the data, ending up fitting the noise in the data
 Overfitting leads to low predictive accuracy on new data
 Past a certain point, the error rate for the validation data starts to increase



Overfitting & Underfitting
 Training a Model:
 Overfit: the model performs much better on the train set but does not perform well on the cross-validation set
 Overfitting is called a “High Variance” problem
 Underfit: the model does not perform well even on the train set itself
 Underfitting is considered a “High Bias” problem


