Unit 3 ML

Supervised learning is a machine learning technique where models are trained on labeled data to understand the relationship between inputs and outputs. It includes two main types: regression for numeric predictions and classification for categorical predictions. Techniques like Decision Trees, Naive Bayes, and Support Vector Machines are commonly used in supervised learning for various applications.


What Is Supervised Learning?

Supervised learning is a machine learning technique in which:


We already have data whose answer is known (this answer is called the label).
We train the model so that it learns the relationship between the input (X) and the output (Y).

📦 Example:
Suppose we have some photos (input) and each one is labelled as a dog or a cat (label). We train the model so that, when it sees a new photo, it can tell whether it shows a dog or a cat.

✅ Supervised learning techniques have 2 main types:

1. Regression (when the output is a number)


📊 Used when you need to predict a numeric value.
📌 Examples:
Predicting the price of a house
Estimating the temperature
Predicting a student's marks

🧠 Common Regression Techniques:


Linear Regression – straight-line relationship
Polynomial Regression – curved relationship
Ridge/Lasso Regression – regularized regression (protects against overfitting)

2. Classification (when the output is a category)


📚 Used when you need to predict a category or class.
📌 Examples:
Whether an email is spam or not
Whether a photo shows a dog or a cat
Whether a patient has a disease or not

🧠 Common Classification Techniques:


Logistic Regression – yes/no prediction
K-Nearest Neighbors (KNN) – predicts from nearby points
Decision Tree – follows a tree structure
Random Forest – many decision trees make the decision together
Support Vector Machine (SVM) – builds the best boundary between classes
Naive Bayes – probability based (very fast)

⚙️ Supervised Learning Process – Easy Steps


1. Data Collection: Collect labeled data
2. Data Splitting: Divide the data into training and testing sets
3. Model Training: Train the model on the training data
4. Model Testing: Evaluate the model on the test data
5. Prediction: Make predictions on new data (see the sketch below)
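
A minimal sketch of these five steps using scikit-learn; the dataset (iris) and the choice of classifier (KNN) here are illustrative assumptions, not part of the notes:

```python
# Illustrative sketch of the supervised learning workflow (assumes scikit-learn is installed).
from sklearn.datasets import load_iris                 # 1. Data Collection: a built-in labeled dataset
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # inputs X, labels y

# 2. Data Splitting: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Model Training: fit a classifier on the training data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# 4. Model Testing: evaluate on the unseen test data
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5. Prediction: use the trained model on new data
print("Prediction for one new sample:", model.predict(X_test[:1]))
```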

📈 Bonus: Ways to Check Accuracy


For classification: Accuracy, Precision, Recall, F1-Score, Confusion Matrix
For regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), R² Score (a short code sketch follows below)
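
A small sketch of how these metrics can be computed with scikit-learn; the `y_true`/`y_pred` arrays are made-up example values:

```python
# Illustrative sketch: common evaluation metrics with scikit-learn (example values are invented).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error, r2_score)

# Classification metrics
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

# Regression metrics
y_actual    = [3.0, 5.0, 2.5, 7.0]
y_estimated = [2.8, 5.4, 2.0, 6.5]
print("MAE:", mean_absolute_error(y_actual, y_estimated))
print("MSE:", mean_squared_error(y_actual, y_estimated))
print("R² :", r2_score(y_actual, y_estimated))
```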
🌳 Decision Tree in Machine Learning – For Exam

✅ Definition:
A Decision Tree is a supervised learning algorithm used for both classification and regression tasks.
It splits the data into smaller parts based on certain conditions (features), forming a tree-like
structure with decision rules.

🧩 Structure of a Decision Tree:


1. Root Node – The first/main decision (based on one feature)
2. Internal Nodes – Represent the decisions (conditions) based on features
3. Branches – Show the outcome of the decision (Yes/No, True/False)
4. Leaf Nodes – Final output or result (class label or value)

🔍 How it works:
The algorithm selects the best feature to split the data.
It uses criteria like:
Gini Index
Entropy
Information Gain
It continues splitting until all data is classified or the tree reaches a stopping point.

📊 Example:
Predict if a person will play cricket based on weather:

            [Weather?]
            /        \
        Sunny        Rainy
          |            |
         Play       [Windy?]
                     /     \
                   Yes      No
             Don't Play    Play
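
A minimal sketch of the same idea with scikit-learn's DecisionTreeClassifier; the tiny "play cricket" dataset below is invented purely for illustration:

```python
# Illustrative sketch: a decision tree on a tiny, made-up "play cricket" dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [weather (0 = Sunny, 1 = Rainy), windy (0 = No, 1 = Yes)]
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]]
y = [1, 1, 1, 0, 1, 0]   # 1 = Play, 0 = Don't Play

tree = DecisionTreeClassifier(criterion="gini", max_depth=2)  # Gini Index as the split criterion
tree.fit(X, y)

# Print the learned decision rules (root node, internal nodes, leaf nodes)
print(export_text(tree, feature_names=["weather", "windy"]))

# Predict for a new day: Rainy and not windy -> expected "Play" (1)
print(tree.predict([[1, 0]]))
```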
⭐ Advantages:
Easy to understand and interpret
No need for much data preprocessing
Can handle both numerical and categorical data
Good for small datasets

⚠️ Disadvantages:
Can overfit the data (too many branches)
Sensitive to small changes in data
Complex trees are hard to interpret

🧠 Important Terms (for theory questions):


Term – Meaning

Entropy – Measures impurity/disorder in data
Information Gain – How much entropy is reduced after a split
Gini Index – Measures data purity, similar to entropy (used in CART trees)

📌 Applications:
Medical diagnosis
Email spam filtering
Customer segmentation
Weather prediction
Loan approval

✍️ Sample Answer (write this in exam):


A Decision Tree is a tree-structured algorithm used in supervised learning for classification and regression
tasks. It works by splitting the data based on certain conditions (features) and making decisions at each node.
The tree continues splitting until it reaches a final output at the leaf node. It uses criteria like Gini Index,
Entropy, and Information Gain to find the best splits. Decision Trees are easy to understand but may overfit the
data if not pruned properly.
🔶 1. Information Gain (IG):
Imagine you have a basket containing two different fruits: mangoes and apples.
Now you want to divide the basket so that each group contains only one type of fruit.
You check whether there is some question (or feature) that can give you this clean split.

📌 Definition in Easy Words:


Information Gain tells you:

"How much confusion (impurity) was reduced by using a particular feature (question)?"

💡 Real-Life Analogy:
You are in a class. Suppose you ask a question like:

"All the students who are good at science, move to one side."

If this question clearly creates 2 groups:


One group: science toppers
Other group: average students

Then the Information Gain is high → because you achieved a clean split.

🔢 Formula:

Information Gain = Entropy (Before Split) – Weighted Entropy (After Split)

Entropy means how much randomness/confusion there is.

🔷 2. Gini Index:
This is also a kind of impurity checker.

📌 Definition in Easy Words:


The Gini Index tells you:

"How mixed is a group?"


If everyone in a group belongs to the same category → Gini is low (best).
If the group is all mixed up → Gini is high (bad).

💡 Real-Life Analogy:
If a box contains only mangoes, Gini = 0 → perfect.
If mangoes, apples, and bananas are all mixed together → Gini is high.

🔢 Formula:

Gini Index = 1 − (p₁² + p₂² + ... + pₙ²)

where p₁, p₂, ... are the probabilities of each class.
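
A small sketch computing entropy, Gini Index, and Information Gain for a toy split; the fruit counts are made up for illustration:

```python
# Illustrative sketch: entropy, Gini Index, and Information Gain for a toy split.
import math

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def gini(labels):
    """Gini Index = 1 - sum(p_i^2)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in probs)

# Parent node: 5 mangoes and 5 apples (maximum confusion)
parent = ["mango"] * 5 + ["apple"] * 5
# A feature splits it into two clean groups
left, right = ["mango"] * 5, ["apple"] * 5

# Information Gain = entropy(before split) - weighted entropy(after split)
weighted_after = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
print("Entropy before split:", entropy(parent))                 # 1.0
print("Information Gain   :", entropy(parent) - weighted_after)  # 1.0 (perfect split)
print("Gini of pure group  :", gini(left))                       # 0.0
print("Gini of mixed group :", gini(parent))                     # 0.5
```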

🤔 Difference Between IG & Gini:


Feature – Information Gain – Gini Index

Based on – Entropy (log based) – Probability (square based)
Goal – How much confusion was reduced? – How pure is the group?
Value – Higher = Better – Lower = Better
Used in – ID3 algorithm – CART algorithm

📌 2–3 Lines Worth Writing in the Exam:


Information Gain is a measure that tells how much randomness is reduced after splitting the data on a particular feature. It is based on the concept of entropy.
The Gini Index tells how pure a group is. If all items in a group belong to the same class, the Gini is 0. It is used in the CART algorithm.
📘 Naive Bayes Algorithm – For Exams (Point by Point)

✅ 1. Definition:
Naive Bayes is a supervised learning classification algorithm based on Bayes Theorem.
It assumes that features are independent of each other, which is why it is called "Naive".

✅ 2. Bayes Theorem:
Naive Bayes is built on Bayes’ Theorem of probability:

P(A∣B) = [P(B∣A) ⋅ P(A)] / P(B)

P (A∣B): Probability of A given B (posterior probability)


P (B∣A): Probability of B given A (likelihood)
P (A): Probability of A (prior probability)
P (B): Probability of B (evidence)

✅ 3. Naive Assumption (Why "Naive"?):


It naively assumes that all features (input variables) are independent of each other.
In real-world data, this is rarely true, but the algorithm still works well.

✅ 4. Working Steps:
1. Convert the data into a frequency table (for categorical) or use mean/variance (for continuous).
2. Calculate Prior Probability for each class (e.g., spam or not spam).
3. Calculate Likelihood for each feature given the class.
4. Apply Bayes’ Theorem to calculate posterior probability for each class.
5. Choose the class with the highest posterior probability as the prediction.
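
A minimal sketch of these steps using scikit-learn's MultinomialNB on a tiny, invented spam example; the message texts and labels are assumptions made purely for illustration:

```python
# Illustrative sketch: Naive Bayes text classification on a tiny, made-up dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win free money now", "free money offer", "meeting at noon", "lunch tomorrow?"]
labels   = ["spam", "spam", "not spam", "not spam"]

# Step 1: convert the text into a word-count (frequency) table
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Steps 2-4: MultinomialNB computes priors and likelihoods, then applies Bayes' Theorem
model = MultinomialNB()
model.fit(X, labels)

# Step 5: pick the class with the highest posterior probability for a new message
new_msg = vectorizer.transform(["free money"])
print(model.predict(new_msg))          # expected: ['spam']
print(model.predict_proba(new_msg))    # posterior probabilities for each class
```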

✅ 5. Types of Naive Bayes:


Type – Description

Multinomial Naive Bayes – Used for text classification (like spam detection); assumes word counts.
Bernoulli Naive Bayes – Used when features are binary (0 or 1, yes or no).
Gaussian Naive Bayes – Used for continuous data, assuming a normal distribution (like age, salary).
✅ 6. Applications:
Email spam filtering
Sentiment analysis
Document classification
Medical diagnosis
Real-time predictions

✅ 7. Advantages:
Simple and easy to implement
Works well with high-dimensional data
Fast and efficient for large datasets
Performs well even with less training data

✅ 8. Disadvantages:
Assumes independence of features (which is rarely true)
May not perform well with correlated features
Probability estimates can be poor if data is sparse

✅ 9. Simple Example (for understanding):


Let’s say we want to classify whether a message is Spam or Not Spam based on the words it contains:
Words like “free”, “win”, “money” are common in spam messages.
Naive Bayes will:
Count how many times “free” appears in spam vs non-spam.
Calculate the probability of “Spam” given “free”.
Repeat for all words and pick the class with highest probability.

✅ 10. Key Points for Exam:


Based on Bayes Theorem
Assumes independence between features
Used for classification problems
Simple, fast, and works well with large datasets
Three types: Multinomial, Bernoulli, Gaussian
📘 Support Vector Machines (SVM) – For Classification Problems (Exam-Oriented)

✅ 1. Definition:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification (and sometimes regression).
It aims to find the optimal boundary (hyperplane) that best separates data into different classes.

✅ 2. Goal of SVM:
To find a decision boundary (hyperplane) that separates classes with the maximum margin, i.e., the largest possible distance between the boundary and the
nearest data points from each class.

✅ 3. Important Terminologies:
Term – Explanation

Hyperplane – A line (in 2D), a plane (in 3D), or a higher-dimensional surface that separates data into classes
Support Vectors – The data points that lie closest to the hyperplane and influence its position
Margin – The distance between the hyperplane and the nearest support vectors from either class
Kernel – A mathematical function that transforms data into a higher dimension to make it linearly separable

✅ 4. Types of SVM:
1. Linear SVM
Used when the data is linearly separable, i.e., classes can be divided by a straight line or plane.
2. Non-Linear SVM
Used when the data is not linearly separable.
In this case, SVM uses a kernel function to project data into a higher-dimensional space where it can be separated linearly.

✅ 5. Kernel Functions (for Non-Linear SVM):


Kernel Type – Description

Linear Kernel – Used for linearly separable data
Polynomial Kernel – Maps data to a higher dimension using polynomial functions
Radial Basis Function (RBF) / Gaussian Kernel – Widely used; works well with non-linear data
Sigmoid Kernel – Similar to neural networks; less commonly used

✅ 6. Working of SVM (Step-by-Step):


1. Take input data with labels (classes).
2. Identify the best hyperplane that separates the classes.
3. Calculate the margin for different possible hyperplanes.
4. Choose the one with maximum margin.
5. Use support vectors (nearest points) to define the final hyperplane.
6. For non-linear data, apply a kernel function to transform it into a higher dimension.
7. Use this model to classify new/unseen data points.
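
A minimal sketch of these steps with scikit-learn's SVC; the synthetic dataset and the kernel/parameter choices are illustrative assumptions:

```python
# Illustrative sketch: SVM classification with scikit-learn (the data is synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Linear SVM: finds the maximum-margin hyperplane for (roughly) linearly separable data
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
print("Number of support vectors:", len(linear_svm.support_vectors_))
print("Linear SVM test accuracy :", linear_svm.score(X_test, y_test))

# Non-linear SVM: the RBF kernel maps data into a higher-dimensional space
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("RBF SVM test accuracy    :", rbf_svm.score(X_test, y_test))
```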

✅ 7. Example:
Let’s say you want to classify emails as “Spam” or “Not Spam” based on keywords.
SVM will analyze the training data, find a boundary that separates spam emails from non-spam emails, and then use that boundary to classify future emails.

✅ 8. Applications of SVM:
Text classification (e.g., spam detection)
Handwriting recognition
Face and object detection
Medical diagnosis
Bioinformatics (e.g., gene classification)

✅ 9. Advantages of SVM:
Effective in high-dimensional spaces
Works well with clear margin of separation
Robust to overfitting, especially in high-dimensional data
Suitable for both linear and non-linear classification

✅ 10. Disadvantages of SVM:


Not suitable for large datasets (slow training)
Performance is sensitive to choice of kernel and parameters
Doesn’t perform well with noisy or overlapping data
Less efficient when the number of features >> number of samples

✅ 11. Key Points to Write in Exam (Short Answer Format):


SVM is a supervised classification algorithm based on maximum margin separation.
It finds an optimal hyperplane that best separates different classes.
The closest data points to the hyperplane are called support vectors.
For non-linear data, SVM uses kernel functions to transform data.
SVM is useful in text classification, image recognition, and bioinformatics.
The main goal of SVM is to maximize the margin between the two classes. The larger the
margin the better the model performs on new and unseen data.

Key Concepts of Support Vector Machine


Hyperplane: A decision boundary separating different classes in feature space and is
represented by the equation wx + b = 0 in linear classification.
Support Vectors: The closest data points to the hyperplane, crucial for determining the
hyperplane and margin in SVM.
Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
Kernel: A function that maps data to a higher-dimensional space enabling SVM to handle
non-linearly separable data.
Hard Margin: A maximum-margin hyperplane that perfectly separates the data without
misclassifications.
Soft Margin: Allows some misclassifications by introducing slack variables, balancing
margin maximization and misclassification penalties when data is not perfectly separable.
C: A regularization term balancing margin maximization and misclassification penalties. A
higher C value forces stricter penalty for misclassifications.
Hinge Loss: A loss function penalizing misclassified points or margin violations and is
combined with regularization in SVM.
Dual Problem: Involves solving for Lagrange multipliers associated with support vectors,
facilitating the kernel trick and efficient computation.
Random Forest Algorithm in Machine Learning

Random Forest is a machine learning algorithm that uses many decision trees to make better predictions. Each tree looks
at different random parts of the data and their results are combined by voting for classification or averaging for regression.
This helps in improving accuracy and reducing errors.

Working of Random Forest Algorithm


Create Many Decision Trees: The algorithm makes many decision trees each using a random part of the data. So
every tree is a bit different.
Pick Random Features: When building each tree it doesn’t look at all the features (columns) at once. It picks a few
at random to decide how to split the data. This helps the trees stay different from each other.
Each Tree Makes a Prediction: Every tree gives its own answer or prediction based on what it learned from its
part of the data.
Combine the Predictions:
For classification, the final answer is the category that most trees agree on, i.e. majority voting.
For regression, the final answer is the average of all the trees' predictions.
Why It Works Well: Using random data and features for each tree helps avoid overfitting and makes the overall
prediction more accurate and trustworthy.

Random Forest is also an ensemble learning technique, which you can learn more about from: Ensemble Learning
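
A minimal sketch of this working with scikit-learn's RandomForestClassifier; the dataset (breast cancer) and hyperparameters are illustrative assumptions:

```python
# Illustrative sketch: Random Forest with scikit-learn (dataset choice is an assumption).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 decision trees, each trained on a random sample of rows and a random subset of features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)

# Predictions are combined by majority voting across the trees
print("Test accuracy:", forest.score(X_test, y_test))

# Feature importance: which columns contributed most to the predictions
print("Largest feature importance:", max(forest.feature_importances_))
```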

Key Features of Random Forest


Handles Missing Data: It can work even if some data is missing so you don’t always need to fill in the gaps yourself.
Shows Feature Importance: It tells you which features (columns) are most useful for making predictions which
helps you understand your data better.
Works Well with Big and Complex Data: It can handle large datasets with many features without slowing down
or losing accuracy.
Used for Different Tasks: You can use it for both classification like predicting types or labels
and regression like predicting numbers or amounts.

Assumptions of Random Forest


Each tree makes its own decisions: Every tree in the forest makes its own predictions without relying on others.
Random parts of the data are used: Each tree is built using random samples and features to reduce mistakes.
Enough data is needed: Sufficient data ensures the trees are different and learn unique patterns and variety.
Different predictions improve accuracy: Combining the predictions from different trees leads to a more
accurate final result.
📘 Linear Regression – For Regression Problems

✅ 1. Definition:
Linear Regression is a supervised machine learning algorithm used for solving regression problems.
It models the relationship between a dependent variable (output) and one or more independent variables (input) using a straight
line.

✅ 2. Objective:
To find the best-fit straight line that predicts the value of the dependent variable based on the independent variable(s).

✅ 3. Types of Linear Regression:


Type – Description

Simple Linear Regression – One independent variable (e.g., height vs weight)
Multiple Linear Regression – Two or more independent variables (e.g., house price vs size, location, number of rooms)

✅ 4. Equation of Linear Regression:


🔹 Simple Linear Regression:
Y = a + bX
Where:
Y = Predicted value (dependent variable)
X = Input value (independent variable)
a = Intercept (value of Y when X = 0)
b = Slope (rate of change in Y for one unit change in X)

🔹 Multiple Linear Regression:


Y = a + b₁X₁ + b₂X₂ + ⋯ + bₙXₙ

✅ 5. Working Steps:
1. Collect input (X) and output (Y) data.
2. Fit a line that minimizes the error between actual and predicted values.
3. Use least squares method to find the best values of a and b.
4. Predict new values of Y based on new inputs.
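
A minimal sketch of these steps with scikit-learn's LinearRegression; the study-hours-vs-marks numbers are invented for illustration:

```python
# Illustrative sketch: simple linear regression (study hours vs marks, values invented).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # study hours (independent variable)
y = np.array([35, 50, 62, 74, 88])        # marks (dependent variable)

model = LinearRegression().fit(X, y)      # fits Y = a + bX by least squares
print("Intercept a:", model.intercept_)
print("Slope b    :", model.coef_[0])

# Predict marks for a student who studies 6 hours
print("Predicted marks:", model.predict([[6]])[0])
```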
✅ 6. Use Case Examples:
Problem – Regression Type

Predicting house price – Multiple Linear Regression
Predicting student marks from study hours – Simple Linear Regression
Predicting salary based on experience – Simple Linear Regression

✅ 7. Evaluation Metrics (for checking performance):


Metric – Meaning

MSE (Mean Squared Error) – Average of squared prediction errors
RMSE (Root Mean Squared Error) – Square root of MSE
R² (R-squared score) – How well the model fits the data (1 = perfect fit)

✅ 8. Advantages:
Simple to understand and implement
Fast and efficient
Works well when there is a linear relationship between variables
Good for baseline regression tasks

✅ 9. Disadvantages:
Only works when relationship is linear
Sensitive to outliers
Doesn’t work well with non-linear data
Assumes independence and no multicollinearity among variables

✅ 10. Exam-Ready 3–4 Line Answer:


Linear Regression is a supervised learning algorithm used for predicting numerical (continuous) values.
It finds the best-fit straight line to model the relationship between the dependent and independent variable(s).
It is used in applications like house price prediction, salary estimation, and student score forecasting.
📘 Ordinary Least Squares (OLS) Regression – For Exams

✅ 1. Definition:
Ordinary Least Squares (OLS) Regression is a method used in linear regression to find the best-fitting line by minimizing the sum of
the squared errors (differences between predicted and actual values).

✅ 2. Objective:
To estimate the regression line in such a way that the sum of squared residuals is as small as possible.

This is why it's called "least squares" method.

✅ 3. Equation of Linear Regression (used in OLS):


Y = a + bX + e
Where:
Y = Dependent variable (output)
X = Independent variable (input)
a = Intercept (value of Y when X = 0)
b = Slope (change in Y per unit change in X)
e = Error term (residual)

✅ 4. Residual (Error):
Residual = Actual value (Y) − Predicted value (Ŷ)
OLS tries to minimize the sum of the squares of these residuals:

Minimize: ∑(Yᵢ − Ŷᵢ)²

✅ 5. Working Steps of OLS Regression:


1. Take dataset with input (X) and output (Y).
2. Fit a linear equation Y = a + bX to the data.
3. Calculate the residuals (errors).
4. Use calculus to minimize the total squared error.
5. Find optimal values of a and b using formulas:

b = ∑(Xᵢ − X̄)(Yᵢ − Ȳ) / ∑(Xᵢ − X̄)²   and   a = Ȳ − bX̄
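A small sketch computing the OLS slope and intercept directly from these formulas with NumPy; the X/Y values are made up for illustration:

```python
# Illustrative sketch: Ordinary Least Squares from the closed-form formulas.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)   # made-up inputs
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # made-up outputs

x_bar, y_bar = X.mean(), Y.mean()

# b = sum((Xi - X̄)(Yi - Ȳ)) / sum((Xi - X̄)^2),  a = Ȳ - b·X̄
b = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
a = y_bar - b * x_bar

Y_hat = a + b * X                 # predicted values
residuals = Y - Y_hat             # residual = actual - predicted
print("a =", a, " b =", b)
print("Sum of squared residuals:", np.sum(residuals ** 2))
```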


✅ 6. Assumptions of OLS:
1. Linearity – Relationship between X and Y is linear
2. Independence – Observations are independent of each other
3. Homoscedasticity – Constant variance of errors
4. Normality – Errors are normally distributed
5. No multicollinearity – In multiple regression, independent variables are not highly correlated

✅ 7. Evaluation Metrics (for performance):


Metric – Description

R² (R-squared) – Measures how well the model explains the variation in the output
MSE (Mean Squared Error) – Average of squared residuals
RMSE – Square root of MSE (gives error in the same units as the output)

✅ 8. Applications:
Predicting house prices
Estimating salary based on experience
Forecasting sales based on marketing spend
Modeling relationships in economics or finance

✅ 9. Advantages of OLS:
Simple to understand and easy to compute
Provides best linear unbiased estimates (BLUE) under assumptions
Efficient when assumptions hold true

✅ 10. Disadvantages:
Sensitive to outliers
Assumes linear relationship
Performance declines if assumptions (like homoscedasticity, normality) are violated
Not suitable for non-linear problems without transformation

✅ 11. Exam-Ready Summary (3–4 lines):


Ordinary Least Squares (OLS) is a method in linear regression that estimates the best-fit line by minimizing the sum of squared
residuals (errors between actual and predicted values).
It gives accurate and unbiased results under specific statistical assumptions like linearity, independence, and homoscedasticity.
OLS is widely used in prediction and data modeling tasks.
Logistic Regression in Machine Learning

Logistic Regression is a supervised machine learning algorithm used for classification problems. Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a specific class. It is used for binary classification, where the output can be one of two possible categories such as Yes/No, True/False or 0/1. It uses the sigmoid function to convert inputs into a probability value between 0 and 1. In this article, we will see the basics of logistic regression and its core concepts.

Types of Logistic Regression


Logistic regression can be classified into three main types based on the nature of the dependent variable:

1. Binomial Logistic Regression: This type is used when the dependent variable has only two possible
categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common form of logistic regression and
is used for binary classification problems.
2. Multinomial Logistic Regression: This is used when the dependent variable has three or more possible
categories that are not ordered. For example, classifying animals into categories like "cat," "dog" or "sheep." It
extends the binary logistic regression to handle multiple classes.
3. Ordinal Logistic Regression: This type applies when the dependent variable has three or more
categories with a natural order or ranking. Examples include ratings like "low," "medium" and "high." It takes
the order of the categories into account when modeling.

Assumptions of Logistic Regression


Understanding the assumptions behind logistic regression is important to ensure the model is applied correctly,
main assumptions are:

1. Independent observations: Each data point is assumed to be independent of the others, meaning there should be no correlation or dependence between the input samples.
2. Binary dependent variable: The dependent variable is assumed to be binary, meaning it can take only two values. For more than two categories, the softmax function is used.
3. Linear relationship between independent variables and log odds: The model assumes a linear relationship between the independent variables and the log odds of the dependent variable, which means the predictors affect the log odds in a linear way.
4. No outliers: The dataset should not contain extreme outliers as they can distort the estimation of the logistic
regression coefficients.
5. Large sample size: It requires a sufficiently large sample size to produce reliable and stable results.

Understanding Sigmoid Function


1. The sigmoid function is an important part of logistic regression; it is used to convert the raw output of the model into a probability value between 0 and 1.

2. This function takes any real number and maps it into the range 0 to 1, forming an "S" shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie between 0 and 1, the sigmoid function is perfect for this purpose.

3. In logistic regression, we use a threshold value, usually 0.5, to decide the class label.

If the sigmoid output is equal to or above the threshold, the input is classified as Class 1.
If it is below the threshold, the input is classified as Class 0.

This approach helps to transform continuous input values into meaningful class predictions.
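
A small sketch of the sigmoid function and the 0.5 threshold rule in NumPy; the raw output values are made up for illustration:

```python
# Illustrative sketch: the sigmoid function and the 0.5 threshold rule.
import numpy as np

def sigmoid(z):
    """Maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

raw_outputs = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])   # example raw model outputs z
probs = sigmoid(raw_outputs)
labels = (probs >= 0.5).astype(int)                    # Class 1 if probability >= threshold

print("Probabilities    :", probs)
print("Predicted classes:", labels)
```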

How does Logistic Regression work?


The logistic regression model transforms the continuous output of the linear regression function into a categorical output using the sigmoid function, which maps any real-valued combination of the independent variables into a value between 0 and 1. This function is known as the logistic function.

Suppose we have input features represented as a matrix:

X = [ x₁₁ ⋯ x₁ₘ
      x₂₁ ⋯ x₂ₘ
      ⋮   ⋱   ⋮
      xₙ₁ ⋯ xₙₘ ]

and the dependent variable Y takes only binary values, i.e. 0 or 1:

Y = 0 if Class 1, 1 if Class 2

Then apply the multi-linear function to the input variables X:

z = (∑ᵢ₌₁ⁿ wᵢ xᵢ) + b

Here xᵢ is the i-th observation of X, w = [w₁, w₂, w₃, ⋯ , wₘ] are the weights or coefficients, and b is the bias term, also known as the intercept. This can simply be represented as the dot product of the weights and the input plus the bias:

z = w ⋅ X + b
At this stage, z is a continuous value from the linear function. Logistic regression then applies the sigmoid function to z to convert it into a probability between 0 and 1, which can be used to predict the class (i.e. the predicted y):

σ(z) = 1 / (1 + e^(−z))

[Figure: the sigmoid function curve]

As shown above, the sigmoid function converts the continuous value z into a probability between 0 and 1:
σ(z) tends towards 1 as z → ∞
σ(z) tends towards 0 as z → −∞
σ(z) is always bounded between 0 and 1

where the probability of being a class can be measured as:

P (y = 1) = σ(z)
P (y = 0) = 1 − σ(z)

Logistic Regression Equation and Odds:

It models the odds of the dependent event occurring which is the ratio of the probability of the event to the
probability of it not occurring:
p(x) / (1 − p(x)) = e^z

Taking the natural logarithm of the odds gives the log-odds or logit:

log[ p(x) / (1 − p(x)) ] = z
log[ p(x) / (1 − p(x)) ] = w⋅X + b

Exponentiating both sides:

p(x) / (1 − p(x)) = e^(w⋅X + b)
p(x) = e^(w⋅X + b) ⋅ (1 − p(x))
p(x) = e^(w⋅X + b) − e^(w⋅X + b) ⋅ p(x)
p(x) + e^(w⋅X + b) ⋅ p(x) = e^(w⋅X + b)
p(x) (1 + e^(w⋅X + b)) = e^(w⋅X + b)
p(x) = e^(w⋅X + b) / (1 + e^(w⋅X + b))

then the final logistic regression equation will be:

p(X; b, w) = e^(w⋅X + b) / (1 + e^(w⋅X + b)) = 1 / (1 + e^(−(w⋅X + b)))

This formula represents the probability of the input belonging to Class 1.

Likelihood Function for Logistic Regression

The goal is to find weights w and bias b that maximize the likelihood of observing the data.
For each data point i:
for yᵢ = 1, the predicted probability is p(xᵢ)
for yᵢ = 0, the predicted probability is 1 − p(xᵢ)

L(b, w) = ∏ᵢ₌₁ⁿ p(xᵢ)^yᵢ (1 − p(xᵢ))^(1−yᵢ)

Taking natural logs on both sides:

log L(b, w) = ∑ᵢ₌₁ⁿ [ yᵢ log p(xᵢ) + (1 − yᵢ) log(1 − p(xᵢ)) ]
            = ∑ᵢ₌₁ⁿ [ yᵢ log p(xᵢ) + log(1 − p(xᵢ)) − yᵢ log(1 − p(xᵢ)) ]
            = ∑ᵢ₌₁ⁿ log(1 − p(xᵢ)) + ∑ᵢ₌₁ⁿ yᵢ log[ p(xᵢ) / (1 − p(xᵢ)) ]
            = ∑ᵢ₌₁ⁿ −log(1 + e^(w⋅xᵢ + b)) + ∑ᵢ₌₁ⁿ yᵢ (w⋅xᵢ + b)

This is known as the log-likelihood function.

Gradient of the log-likelihood function

To find the best w and b we use gradient ascent on the log-likelihood function. The gradient with respect to each weight wⱼ is:

∂ log L(b, w) / ∂wⱼ = −∑ᵢ₌₁ⁿ [ e^(w⋅xᵢ + b) / (1 + e^(w⋅xᵢ + b)) ] xᵢⱼ + ∑ᵢ₌₁ⁿ yᵢ xᵢⱼ
                    = −∑ᵢ₌₁ⁿ p(xᵢ; b, w) xᵢⱼ + ∑ᵢ₌₁ⁿ yᵢ xᵢⱼ
                    = ∑ᵢ₌₁ⁿ (yᵢ − p(xᵢ; b, w)) xᵢⱼ
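
A small NumPy sketch of training logistic regression by gradient ascent using this gradient; the tiny one-feature dataset and the learning rate are assumptions made purely for demonstration:

```python
# Illustrative sketch: gradient ascent on the log-likelihood of logistic regression,
# using the gradient sum_i (y_i - p(x_i)) * x_ij derived above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5], [1.5], [2.5], [3.5]])   # one feature, four samples (made up)
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])
b = 0.0
learning_rate = 0.1

for _ in range(1000):
    p = sigmoid(X @ w + b)              # predicted probabilities p(x_i; b, w)
    error = y - p                       # the (y_i - p(x_i)) term in the gradient
    w += learning_rate * (X.T @ error)  # ascend the log-likelihood w.r.t. each w_j
    b += learning_rate * error.sum()    # gradient w.r.t. the bias is sum_i (y_i - p(x_i))

print("Learned weight:", w, "bias:", b)
print("Predicted probabilities:", sigmoid(X @ w + b))
```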

Terminologies involved in Logistic Regression


Here are some common terms involved in logistic regression:

1. Independent Variables: These are the input features or predictor variables used to make predictions
about the dependent variable.
2. Dependent Variable: This is the target variable that we aim to predict. In logistic regression, the
dependent variable is categorical.
3. Logistic Function: This function transforms the independent variables into a probability between 0 and 1
which represents the likelihood that the dependent variable is either 0 or 1.
4. Odds: This is the ratio of the probability of an event happening to the probability of it not happening. It differs
from probability because probability is the ratio of occurrences to total possibilities.
5. Log-Odds (Logit): The natural logarithm of the odds. In logistic regression, the log-odds are modeled as a
linear combination of the independent variables and the intercept.
6. Coefficient: These are the parameters estimated by the logistic regression model which shows how strongly
the independent variables affect the dependent variable.
7. Intercept: The constant term in the logistic regression model which represents the log-odds when all
independent variables are equal to zero.
8. Maximum Likelihood Estimation (MLE): This method is used to estimate the coefficients of the logistic
regression model by maximizing the likelihood of observing the given data.
