Unit 3 ML
📦 Example:
If we have some photos (input) and each one is labelled as either a dog or a cat (label), we train the model so that it can look at a new photo and tell whether it shows a dog or a cat.
✅ Definition:
A Decision Tree is a supervised learning algorithm used for both classification and regression tasks.
It splits the data into smaller parts based on certain conditions (features), forming a tree-like
structure with decision rules.
🔍 How it works:
The algorithm selects the best feature to split the data.
It uses criteria like:
Gini Index
Entropy
Information Gain
It continues splitting until all data is classified or the tree reaches a stopping point.
📊 Example:
Predict if a person will play cricket based on weather:
            [Weather?]
            /        \
        Sunny        Rainy
          |            |
        Play        [Windy?]
                    /      \
                  Yes       No
              Don't Play   Play
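As a hedged illustration, here is how this toy example might look with scikit-learn's DecisionTreeClassifier; the numeric encoding of Weather and Windy below is made up just for the sketch.

# Hypothetical encoding: Weather (0 = Sunny, 1 = Rainy), Windy (0 = No, 1 = Yes)
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]            # [Weather, Windy] for each day
y = ["Play", "Play", "Play", "Don't Play"]      # label for each day

model = DecisionTreeClassifier(criterion="entropy")  # split using entropy / information gain
model.fit(X, y)

print(model.predict([[1, 1]]))   # Rainy and Windy -> expected "Don't Play"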
⭐ Advantages:
Easy to understand and interpret
No need for much data preprocessing
Can handle both numerical and categorical data
Good for small datasets
⚠️ Disadvantages:
Can overfit the data (too many branches)
Sensitive to small changes in data
Complex trees are hard to interpret
Gini Index: Measures data impurity, similar to entropy (used in CART trees)
📌 Applications:
Medical diagnosis
Email spam filtering
Customer segmentation
Weather prediction
Loan approval
"Kitna confusion (impurity) kam hua ek particular feature (question) ko use karke?"
💡 Real-Life Analogy:
Tum class mein ho. Agar tum ek question puchte ho jaise:
Toh Information Gain zyada hai → kyunki tumne clearly divide kar diya.
🔢 Formula:
🔷 2. Gini Index:
This is also a kind of impurity checker.
💡 Real-Life Analogy:
If a box contains only mangoes, Gini = 0 → perfectly pure.
If mangoes, apples, and bananas are all mixed together → Gini is high.
🔢 Formula:
Gini = 1 − Σ (pᵢ)², where pᵢ is the proportion of class i in the node.
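A small from-scratch sketch (plain Python) of how these impurity measures and Information Gain could be computed for a split; the label lists are invented for illustration.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gini(labels):
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # how much entropy (confusion) is removed by splitting 'parent' into 'children'
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["mango", "mango", "apple", "apple"]          # mixed box -> impure
left, right = ["mango", "mango"], ["apple", "apple"]   # a perfect split

print(gini(parent))                              # 0.5
print(entropy(parent))                           # 1.0
print(information_gain(parent, [left, right]))   # 1.0 -> confusion fully removed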
✅ 1. Definition:
Naive Bayes is a supervised learning classification algorithm based on Bayes Theorem.
It assumes that features are independent of each other, which is why it is called "Naive".
✅ 2. Bayes Theorem:
Naive Bayes is built on Bayes’ Theorem of probability:
P(A|B) = [P(B|A) × P(A)] / P(B)
where P(A|B) is the posterior probability, P(B|A) is the likelihood, P(A) is the prior probability, and P(B) is the evidence.
✅ 4. Working Steps:
1. Convert the data into a frequency table (for categorical) or use mean/variance (for continuous).
2. Calculate Prior Probability for each class (e.g., spam or not spam).
3. Calculate Likelihood for each feature given the class.
4. Apply Bayes’ Theorem to calculate posterior probability for each class.
5. Choose the class with the highest posterior probability as the prediction.
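A hedged sketch of these steps using scikit-learn's MultinomialNB for the spam example; the tiny email dataset below is purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "meeting at noon", "win a free prize", "lunch tomorrow"]
labels = ["spam", "not spam", "spam", "not spam"]

vec = CountVectorizer()                  # step 1: build a word-count (frequency) table
X = vec.fit_transform(emails)

model = MultinomialNB()                  # steps 2-4: priors, likelihoods, Bayes' theorem
model.fit(X, labels)

test = vec.transform(["free money prize"])
print(model.predict(test))               # step 5: class with the highest posterior
print(model.predict_proba(test))         # posterior probability for each class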
Multinomial Naive Bayes: Used for text classification (like spam detection); assumes features are word counts.
Bernoulli Naive Bayes: Used when features are binary (0 or 1, yes or no).
✅ 7. Advantages:
Simple and easy to implement
Works well with high-dimensional data
Fast and efficient for large datasets
Performs well even with less training data
✅ 8. Disadvantages:
Assumes independence of features (which is rarely true)
May not perform well with correlated features
Probability estimates can be poor if data is sparse
✅ 1. Definition:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification (and sometimes regression).
It aims to find the optimal boundary (hyperplane) that best separates data into different classes.
✅ 2. Goal of SVM:
To find a decision boundary (hyperplane) that separates classes with the maximum margin, i.e., the largest possible distance between the boundary and the
nearest data points from each class.
✅ 3. Important Terminologies:
Support Vectors: The data points that lie closest to the hyperplane and influence its position.
Margin: The distance between the hyperplane and the nearest support vectors from either class.
✅ 4. Types of SVM:
1. Linear SVM
Used when the data is linearly separable, i.e., classes can be divided by a straight line or plane.
2. Non-Linear SVM
Used when the data is not linearly separable.
In this case, SVM uses a kernel function to project data into a higher-dimensional space where it can be separated linearly.
Radial Basis Function (RBF) / Gaussian kernel: Widely used; works well with non-linear data.
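A minimal scikit-learn sketch contrasting the two types; the 2-D points are made up only to show the API.

from sklearn.svm import SVC

# Toy 2-D data with two classes (illustrative only)
X = [[0, 0], [1, 1], [1, 0], [0, 1], [3, 3], [4, 4], [4, 3], [3, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

linear_svm = SVC(kernel="linear").fit(X, y)   # Linear SVM: straight-line boundary
rbf_svm = SVC(kernel="rbf").fit(X, y)         # Non-linear SVM: RBF (Gaussian) kernel

print(linear_svm.support_vectors_)            # the points closest to the hyperplane
print(rbf_svm.predict([[2, 2]]))              # classify a new point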
✅ 7. Example:
Let’s say you want to classify emails as “Spam” or “Not Spam” based on keywords.
SVM will analyze the training data, find a boundary that separates spam emails from non-spam emails, and then use that boundary to classify future emails.
✅ 8. Applications of SVM:
Text classification (e.g., spam detection)
Handwriting recognition
Face and object detection
Medical diagnosis
Bioinformatics (e.g., gene classification)
✅ 9. Advantages of SVM:
Effective in high-dimensional spaces
Works well with clear margin of separation
Robust to overfitting, especially in high-dimensional data
Suitable for both linear and non-linear classification
Random Forest is a machine learning algorithm that uses many decision trees to make better predictions. Each tree looks at a different random part of the data, and their results are combined by voting (for classification) or averaging (for regression). This helps improve accuracy and reduce errors.
Random Forest is also an ensemble learning technique, which you can learn more about under: Ensemble Learning
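A short sketch with scikit-learn's RandomForestClassifier; the built-in iris dataset is used here only as convenient example data.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 decision trees, each trained on a random bootstrap sample of the data
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

print(forest.predict(X[:3]))   # class decided by majority vote of the trees
print(forest.score(X, y))      # accuracy on the training data (illustration only)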
✅ 1. Definition:
Linear Regression is a supervised machine learning algorithm used for solving regression problems.
It models the relationship between a dependent variable (output) and one or more independent variables (input) using a straight
line.
✅ 2. Objective:
To find the best-fit straight line that predicts the value of the dependent variable based on the independent variable(s).
Simple Linear Regression: One independent variable (e.g., house price vs. size)
Multiple Linear Regression: Two or more independent variables (e.g., house price vs. size, location, number of rooms)
✅ 5. Working Steps:
1. Collect input (X) and output (Y) data.
2. Fit a line that minimizes the error between actual and predicted values.
3. Use least squares method to find the best values of a and b.
4. Predict new values of Y based on new inputs.
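These steps as a minimal scikit-learn sketch; the (X, Y) values are invented so that Y is roughly 2X + 1.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])     # step 1: input data
Y = np.array([3.1, 5.0, 6.9, 9.2, 11.1])    # output data (roughly Y = 2X + 1)

model = LinearRegression()                  # steps 2-3: fit the least-squares line
model.fit(X, Y)

print(model.coef_[0], model.intercept_)     # slope b and intercept a of the best-fit line
print(model.predict([[6]]))                 # step 4: predict Y for a new input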
✅ 6. Use Case Examples:
Problem → Regression Type (e.g., predicting house price from size alone → simple linear regression; from size, location, and number of rooms → multiple linear regression)
✅ 7. Evaluation Metric:
R² (R-squared score): How well the model fits the data (1 = perfect fit)
✅ 8. Advantages:
Simple to understand and implement
Fast and efficient
Works well when there is a linear relationship between variables
Good for baseline regression tasks
✅ 9. Disadvantages:
Only works when relationship is linear
Sensitive to outliers
Doesn’t work well with non-linear data
Assumes independence and no multicollinearity among variables
✅ 1. Definition:
Ordinary Least Squares (OLS) Regression is a method used in linear regression to find the best-fitting line by minimizing the sum of
the squared errors (differences between predicted and actual values).
✅ 2. Objective:
To estimate the regression line in such a way that the sum of squared residuals is as small as possible.
✅ 4. Residual (Error):
Residual = Actual value (Y) − Predicted value (Ŷ)
OLS tries to minimize the sum of the squares of these residuals:
SSE = Σ (Yᵢ − Ŷᵢ)²
For simple linear regression (Ŷ = a + bX), the values of b and a that minimize this are:
b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²   and   a = Ȳ − bX̄
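A small numpy sketch computing b and a directly from these formulas (the data points are made up for illustration).

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([3.1, 5.0, 6.9, 9.2, 11.1])

x_bar, y_bar = X.mean(), Y.mean()

b = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)   # slope
a = y_bar - b * x_bar                                              # intercept

residuals = Y - (a + b * X)
print(b, a)                    # the OLS estimates
print(np.sum(residuals ** 2))  # the minimized sum of squared residuals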
✅ 6. Assumptions of OLS:
1. Linearity – Relationship between X and Y is linear
2. Independence – Observations are independent of each other
3. Homoscedasticity – Constant variance of errors
4. Normality – Errors are normally distributed
5. No multicollinearity – In multiple regression, independent variables are not highly correlated
✅ 8. Applications:
Predicting house prices
Estimating salary based on experience
Forecasting sales based on marketing spend
Modeling relationships in economics or finance
✅ 9. Advantages of OLS:
Simple to understand and easy to compute
Provides best linear unbiased estimates (BLUE) under assumptions
Efficient when assumptions hold true
✅ 10. Disadvantages:
Sensitive to outliers
Assumes linear relationship
Performance declines if assumptions (like homoscedasticity, normality) are violated
Not suitable for non-linear problems without transformation
Logistic Regression is a supervised machine learning algorithm used for classification problems. Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a specific class. It is used for binary classification, where the output can be one of two possible categories such as Yes/No, True/False, or 0/1. It uses the sigmoid function to convert inputs into a probability value between 0 and 1. Below are the basics of logistic regression and its core concepts.
1. Binomial Logistic Regression: This type is used when the dependent variable has only two possible
categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common form of logistic regression and
is used for binary classification problems.
2. Multinomial Logistic Regression: This is used when the dependent variable has three or more possible
categories that are not ordered. For example, classifying animals into categories like "cat," "dog" or "sheep." It
extends the binary logistic regression to handle multiple classes.
3. Ordinal Logistic Regression: This type applies when the dependent variable has three or more
categories with a natural order or ranking. Examples include ratings like "low," "medium" and "high." It takes
the order of the categories into account when modeling.
1. Independent observations: Each data point is assumed to be independent of the others, meaning there should be no correlation or dependence between the input samples.
2. Binary dependent variable: The model assumes the dependent variable is binary, i.e., it can take only two values. For more than two categories, the softmax function is used (multinomial logistic regression).
3. Linearity relationship between independent variables and log odds: The model assumes a
linear relationship between the independent variables and the log odds of the dependent variable which
means the predictors affect the log odds in a linear way.
4. No outliers: The dataset should not contain extreme outliers as they can distort the estimation of the logistic
regression coefficients.
5. Large sample size: It requires a sufficiently large sample size to produce reliable and stable results.
1. The sigmoid function is a mathematical function used to map predicted values to probabilities.
2. This function takes any real number and maps it into the range 0 to 1, forming an "S"-shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie between 0 and 1, the sigmoid function is well suited for this purpose.
3. In logistic regression, we use a threshold value, usually 0.5, to decide the class label.
If the sigmoid output is equal to or above the threshold, the input is classified as Class 1.
If it is below the threshold, the input is classified as Class 0.
This approach transforms continuous input values into meaningful class predictions.
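A tiny plain-Python sketch of the sigmoid-plus-threshold idea; the value of z used here is arbitrary.

import math

def sigmoid(z):
    # maps any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

def classify(z, threshold=0.5):
    # Class 1 if the probability is at or above the threshold, else Class 0
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(2.0))    # about 0.88 -> Class 1
print(classify(2.0))   # 1
print(classify(-1.5))  # 0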
The input X is a matrix of n observations and m features:

X = ⎡ x11 ⋯ x1m ⎤
    ⎢  ⋮   ⋱   ⋮ ⎥
    ⎣ xn1 ⋯ xnm ⎦

The target variable Y is binary:

Y = 0 if Class 1, 1 if Class 2

Here x_i is the i-th observation of X, w = [w1, w2, w3, ⋯, wm] is the vector of weights (coefficients), and b is the bias term, also known as the intercept. The linear part of the model can be written as the dot product of the weights with the input, plus the bias:

z = w ⋅ X + b
At this stage, z is a continuous value from the linear regression. Logistic regression then applies the sigmoid
function to z to convert it into a probability between 0 and 1 which can be used to predict the class.
We now apply the sigmoid function with z as its input to obtain this probability, i.e., the predicted y.
σ(z) = 1 / (1 + e^(−z))

(Figure: the sigmoid function curve)

As shown above, the sigmoid function converts continuous values into probabilities, i.e., numbers between 0 and 1.
σ(z) tends towards 1 as z→∞
σ(z) tends towards 0 as z → −∞
σ(z) is always bounded between 0 and 1
P (y = 1) = σ(z)
P (y = 0) = 1 − σ(z)
It models the odds of the dependent event occurring which is the ratio of the probability of the event to the
probability of it not occurring:
p(x) / (1 − p(x)) = e^z
Taking the natural logarithm of the odds gives the log-odds or logit:
log[ p(x) / (1 − p(x)) ] = z
log[ p(x) / (1 − p(x)) ] = w ⋅ X + b
p(x) / (1 − p(x)) = e^(w ⋅ X + b)   ⋯ exponentiating both sides
The goal is to find weights w and bias b that maximize the likelihood of observing the data.
For each data point i:
for y = 1, the predicted probability is: p(X; b, w) = p(x)
for y = 0, the predicted probability is: 1 − p(X; b, w) = 1 − p(x)

The likelihood of the whole dataset is:

L(b, w) = ∏_{i=1}^{n} p(x_i)^{y_i} (1 − p(x_i))^{1 − y_i}

Taking the log gives the log-likelihood:

l(b, w) = ∑_{i=1}^{n} [ y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) ]
        = ∑_{i=1}^{n} [ y_i log p(x_i) + log(1 − p(x_i)) − y_i log(1 − p(x_i)) ]
        = ∑_{i=1}^{n} log(1 − p(x_i)) + ∑_{i=1}^{n} y_i log[ p(x_i) / (1 − p(x_i)) ]
        = ∑_{i=1}^{n} −log(1 + e^{w·x_i + b}) + ∑_{i=1}^{n} y_i (w·x_i + b)
To find the best w and b we use gradient ascent on the log-likelihood function. The gradient with respect to each
weight wj is:
∂l(b, w) / ∂w_j = − ∑_{i=1}^{n} [ e^{w·x_i + b} / (1 + e^{w·x_i + b}) ] x_ij + ∑_{i=1}^{n} y_i x_ij
                = − ∑_{i=1}^{n} p(x_i; b, w) x_ij + ∑_{i=1}^{n} y_i x_ij
                = ∑_{i=1}^{n} ( y_i − p(x_i; b, w) ) x_ij
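A from-scratch numpy sketch of this gradient-ascent update; the learning rate, number of iterations, and toy data are all arbitrary choices for illustration.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: one feature, where class 1 tends to have larger x (illustrative only)
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

for _ in range(1000):
    p = sigmoid(X @ w + b)     # p(x_i; b, w) for every sample
    w += lr * (X.T @ (y - p))  # gradient ascent step: sum_i (y_i - p_i) * x_ij
    b += lr * np.sum(y - p)

print(w, b)
print(sigmoid(X @ w + b).round(2))   # predicted probabilities for the training points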
1. Independent Variables: These are the input features or predictor variables used to make predictions
about the dependent variable.
2. Dependent Variable: This is the target variable that we aim to predict. In logistic regression, the
dependent variable is categorical.
3. Logistic Function: This function transforms the independent variables into a probability between 0 and 1
which represents the likelihood that the dependent variable is either 0 or 1.
4. Odds: This is the ratio of the probability of an event happening to the probability of it not happening. It differs
from probability because probability is the ratio of occurrences to total possibilities.
5. Log-Odds (Logit): The natural logarithm of the odds. In logistic regression, the log-odds are modeled as a
linear combination of the independent variables and the intercept.
6. Coefficient: These are the parameters estimated by the logistic regression model which shows how strongly
the independent variables affect the dependent variable.
7. Intercept: The constant term in the logistic regression model which represents the log-odds when all
independent variables are equal to zero.
8. Maximum Likelihood Estimation (MLE): This method is used to estimate the coefficients of the logistic
regression model by maximizing the likelihood of observing the given data.