Unit-5 Notes Updated
Python Code:
def simple_probability(favorable, total):
    return favorable / total
ML Application:
• Used in Naive Bayes classifiers for prior probability estimation.
B. Conditional Probability
Definition: Probability of an event A given that another event B has occurred.
Formula:
P(A | B) = P(A ∩ B) / P(B)
Numerical Example:
In a deck of 52 cards:
• What is the probability of drawing a King given that the card is a Heart?
• P(King ∩ Heart) = 1/52 (only the King of Hearts)
• P(Heart) = 13/52 = 0.25
• Conditional Probability:
P(King | Heart) = (1/52) / (13/52) = 1/13 ≈ 0.077
Python Code:
def conditional_probability(p_a_and_b, p_b):
    return p_a_and_b / p_b
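Applying this function to the card example above:
conditional_probability(1/52, 13/52)  # ≈ 0.0769, i.e. 1/13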
ML Application:
• Used in Hidden Markov Models (HMMs) for state transitions.
C. Bayes’ Theorem
Definition: Updates probability estimates based on new evidence.
Formula:
P(A | B) = P(B | A) · P(A) / P(B)
Numerical Example:
A disease affects 1% of a population. A test detects the disease 99% of the time and gives a false positive for 1% of healthy people.
• P(Disease) = 0.01
• P(Test+ | Disease) = 0.99
• P(Test+ | No Disease) = 0.01
• Probability of having the disease given a positive test:
P(Disease | Test+) = (0.99 × 0.01) / [(0.99 × 0.01) + (0.01 × 0.99)] = 0.5
Python Code:
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    p_not_a = 1 - p_a
    p_b = (p_b_given_a * p_a) + (p_b_given_not_a * p_not_a)
    return (p_b_given_a * p_a) / p_b
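Plugging in the disease-screening numbers from the example above:
bayes_theorem(p_a=0.01, p_b_given_a=0.99, p_b_given_not_a=0.01)  # returns 0.5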
ML Application:
• Spam detection (classifying emails using prior word probabilities).
Q2. Explain discrete and continuous probability distributions. Provide examples of
each type and explain why they are suitable for different types of data.
Probability distributions describe how probabilities are distributed over the values of a
random variable. They are broadly classified into discrete and continuous distributions,
depending on the nature of the data.
Key Differences
Feature                Discrete Distribution               Continuous Distribution
Variable Type          Countable (integers)                Measurable (real numbers)
Probability Function   PMF (Probability Mass Function)     PDF (Probability Density Function)
Example Use Cases      Defect counts, coin flips           Height, temperature, time
Conclusion
• Use discrete distributions for countable data (e.g., number of successes).
• Use continuous distributions for measurable data (e.g., time, weight).
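As a quick illustration of the PMF/PDF distinction, a minimal scipy.stats sketch (the binomial parameters n = 10, p = 0.5 and the standard normal are illustrative choices, not values from these notes):
Python Code:
from scipy.stats import binom, norm

# Discrete: the PMF assigns a probability to each individual count
print(binom.pmf(3, n=10, p=0.5))       # P(exactly 3 heads in 10 fair flips) ≈ 0.117

# Continuous: the PDF gives a density; probabilities come from integrating (here via the CDF)
print(norm.pdf(0.0))                   # density at 0 ≈ 0.3989 (not itself a probability)
print(norm.cdf(1.0) - norm.cdf(-1.0))  # P(-1 ≤ X ≤ 1) ≈ 0.683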
Q3. Explain Bayes' Theorem in detail, including its mathematical formulation, its derivation
from conditional probability, and the significance of prior, likelihood, and posterior probabilities.
Bayes' Theorem is a fundamental result in probability theory that describes how to update the
probabilities of hypotheses when given evidence. It is derived from the definition of
conditional probability.
Conditional Probability Basics:
• 𝑃(𝐴 ∣ 𝐵): Probability of event 𝐴 occurring given that 𝐵 is true.
• 𝑃(𝐵 ∣ 𝐴): Probability of event 𝐵 occurring given that 𝐴 is true.
Derivation:
1. Start with the definition of conditional probability for P(A | B):
P(A | B) = P(A ∩ B) / P(B)
Here, P(A ∩ B) is the joint probability of A and B.
2. Similarly, write the conditional probability of B given A:
P(B | A) = P(A ∩ B) / P(A)
3. Solve both equations for the joint probability P(A ∩ B) and equate them: P(A | B) P(B) = P(B | A) P(A). Dividing both sides by P(B) gives Bayes' Theorem:
P(A | B) = P(B | A) · P(A) / P(B)
Key Components:
• Posterior Probability (𝑃(𝐴 ∣ 𝐵)): Updated probability of 𝐴 after observing 𝐵.
• Prior Probability (𝑃(𝐴)): Initial probability of 𝐴 before seeing 𝐵.
• Likelihood (𝑃(𝐵 ∣ 𝐴)): Probability of observing 𝐵 if 𝐴 is true.
• Marginal Probability (𝑃(𝐵)): Total probability of 𝐵, calculated as:
𝑃(𝐵) = 𝑃(𝐵 ∣ 𝐴)𝑃(𝐴) + 𝑃(𝐵 ∣ ¬𝐴)𝑃(¬𝐴)
(Here A and ¬A are mutually exclusive and exhaustive, so they partition the sample space.)
Problem Statement
At Springfield High School:
• 60% of students are female (P(Female) = 0.6)
• 40% are male (P(Male) = 0.4)
• 30% of females wear glasses (P(Glasses|Female) = 0.3)
• 20% of males wear glasses (P(Glasses|Male) = 0.2)
Question: If a randomly selected student wears glasses, what is the probability they are
female? (Find P(Female|Glasses))
Step-by-Step Solution
1. Visual Representation (Optional but Helpful)
First, let's visualize the data for 100 students:
                 Female (60)       Male (40)        Total
Wears Glasses    18 (30% of 60)    8 (20% of 40)    26
No Glasses       42                32               74
Total            60                40               100
2. Apply Bayes' Theorem
P(Glasses) = P(Glasses | Female) P(Female) + P(Glasses | Male) P(Male) = (0.3)(0.6) + (0.2)(0.4) = 0.26
P(Female | Glasses) = P(Glasses | Female) P(Female) / P(Glasses) = 0.18 / 0.26 ≈ 0.692
3. Interpretation
If a randomly selected student wears glasses, there is about a 69.2% chance they are female (consistent with the table: 18 of the 26 glasses-wearers are female).
4. Python Verification
# Given probabilities
P_Female = 0.6
P_Male = 0.4
P_Glasses_given_Female = 0.3
P_Glasses_given_Male = 0.2

# Total probability of wearing glasses, then Bayes' theorem
P_Glasses = P_Glasses_given_Female * P_Female + P_Glasses_given_Male * P_Male
P_Female_given_Glasses = (P_Glasses_given_Female * P_Female) / P_Glasses
print(round(P_Female_given_Glasses, 3))  # 0.692
Scenario:
A machine learning model is deployed in a manufacturing plant to classify whether items
produced on an assembly line are defective (𝑌 = 1) or non-defective (𝑌 = 0). The model's
precision (probability of correctly identifying a defective item) is estimated to be 𝑝 = 0.95
based on historical data. In a batch of 𝑛 = 100 items, the model flags 𝑘 items as defective.
Tasks:
Probability Calculation:
o Calculate the probability that the model produces 3 or fewer false positives in the batch of n = 100 items.
Hypothesis Testing:
o The plant manager claims the model’s defective detection rate is less than 10%. Using the observed data (k = 10 defective predictions in n = 100 trials), test this claim at a 5% significance level.
A false positive occurs when the model incorrectly classifies a non-defective item as
defective. The probability of a false positive is:
𝑃(FP) = 1 − 𝑝 = 0.05
The number of false positives follows:
𝑋 ∼ Binomial(𝑛 = 100, 𝑝 = 0.05)
We compute the binomial probability
P(X ≤ 3) = Σ_{k=0}^{3} C(100, k) (0.05)^k (0.95)^(100−k)
and evaluate it with the Poisson approximation, using λ = np = 100 × 0.05 = 5.
Calculations:
P(X = 0) = e^(−5) ≈ 0.0067
P(X = 1) = 5 e^(−5) ≈ 0.0337
P(X = 2) = (5² e^(−5)) / 2 ≈ 0.0842
P(X = 3) = (5³ e^(−5)) / 6 ≈ 0.1404
Total Probability:
𝑃(𝑋 ≤ 3) ≈ 0.0067 + 0.0337 + 0.0842 + 0.1404 = 0.2650 (26.5%)
Interpretation:
There is a 26.5% chance that the model makes 3 or fewer false positives in 100 trials.
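As a cross-check, a short scipy.stats sketch (note the hand calculation above uses the Poisson approximation with λ = 5; the exact binomial value is slightly lower):
Python Code:
from scipy.stats import binom, poisson

print(binom.cdf(3, n=100, p=0.05))  # exact binomial: ≈ 0.258
print(poisson.cdf(3, mu=5))         # Poisson approximation: ≈ 0.265 (matches the figure above)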
95% Confidence Interval for the observed defective rate (p̂ = k/n = 10/100 = 0.10):
p̂ ± z_(α/2) √( p̂(1 − p̂) / n )
Calculations:
Step 1. Standard Error (SE):
SE = √( (0.10 × 0.90) / 100 ) = √0.0009 = 0.03
Step 2. Confidence Interval (z_(α/2) = 1.96):
0.10 ± 1.96 × 0.03 = [0.0412, 0.1588]
Interpretation:
We are 95% confident that the true defective rate lies between 4.12% and 15.88%.
Hypothesis Test (H₀: p = 0.10 vs. H₁: p < 0.10):
z = (p̂ − p₀) / √( p₀(1 − p₀) / n ) = (0.10 − 0.10) / √( (0.10 × 0.90) / 100 ) = 0 / 0.03 = 0
Critical Value (One-Tailed Test, 𝛼 = 0.05):
𝑧critical = −1.645
Decision Rule:
• If 𝑧 < −1.645, reject 𝐻0 .
• Since 𝑧 = 0 > −1.645, we fail to reject 𝐻0 .
Conclusion:
There is not enough evidence to support the claim that the defective rate is less than 10%.
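A short NumPy sketch verifying the interval and test statistic above (the numbers are the same ones used in the hand calculation):
Python Code:
import numpy as np

p_hat, p0, n = 0.10, 0.10, 100
se = np.sqrt(p_hat * (1 - p_hat) / n)          # 0.03
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)    # (0.0412, 0.1588)
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)  # 0.0
print(ci, z, z < -1.645)                       # z is not < -1.645, so we fail to reject H0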
Practical Implications:
1. Model Calibration: If the model’s predicted defective rate (p̂) does not match the true rate, recalibration (e.g., Platt scaling) may be needed.
2. False Positive Control: The false-positive analysis above (P(X ≤ 3) ≈ 26.5%) sets expectations for how many false alarms a batch of 100 items is likely to produce.
Summary
Task                           Result
P(X = 5)                       2.26 × 10⁻¹¹⁸ (extremely unlikely)
P(X ≤ 3 false positives)       26.5%
95% CI for p                   [4.12%, 15.88%]
Hypothesis test (p < 0.10)     Fail to reject H₀ (no evidence the defective rate is below 10%)
Task:
Find the optimal hyperplane (decision boundary) using SVM.
Answer:
1. Optimal Hyperplane & Margin Maximization
• SVM finds the best decision boundary (a line/plane) that separates two classes.
• The "optimal" hyperplane is the one with the widest margin (empty space) between
the closest points of each class.
• Why? A larger margin makes the model more confident and less likely to overfit.
(Figure: multiple candidate hyperplanes separating the data from two classes.)
2. Handling Non-Separable Data (Soft-Margin SVM)
• If data overlaps (no perfect line), SVM uses slack variables to allow some
misclassifications.
• The C parameter controls how much misclassification is allowed (see the sketch after this list):
o Small C: Wide margin, more errors allowed.
o Large C: Narrow margin, fewer errors.
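A minimal sketch of this trade-off, using a small synthetic 2-D dataset from sklearn's make_blobs (the data and parameter values are illustrative assumptions, not part of these notes); a smaller C typically leaves more points inside the wider margin, so more support vectors are kept:
Python Code:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic, partially overlapping clusters (for illustration only)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100):  # small C: wide, soft margin; large C: narrow, strict margin
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_vectors_)} support vectors")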
3. Kernel Trick for Nonlinear Data
• Problem: SVM is linear, but real-world data is often not.
• Solution: The kernel trick maps data to a higher dimension where it becomes
separable.
• Example Kernels:
o RBF (Radial Basis Function): Works well for complex, nonlinear patterns.
o Polynomial: Fits curved boundaries.
4. Real-World Example
• Spam Detection: SVM classifies emails as "spam" or "not spam" by finding patterns
in words (e.g., "free," "win").
Comparison to Logistic Regression
• SVM: Focuses on the "boundary" (margin).
• Logistic Regression: Focuses on "probability" of class membership.
Python Example:
from sklearn import svm
model = svm.SVC(kernel='rbf', C=1.0) # RBF kernel for nonlinear data
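# X_train, y_train are assumed to be a prepared feature matrix (e.g., word counts per email) and spam/not-spam labels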
model.fit(X_train, y_train)
Key Idea:
SVM is like drawing the widest possible street between two classes, even if a few points
must be on the sidewalk.
We will use the Support Vector Machine (SVM) framework to find the optimal hyperplane
for binary classification. The goal of SVM is to maximize the margin between the two classes
while ensuring that all data points are correctly classified.
The support vectors are the points that lie closest to the decision boundary. For this dataset,
the support vectors are:
• From Class +1: (1,1)
• From Class −1: (0,0)
These points are closest to the decision boundary and will help define it.
4.2 Equation of the Decision Boundary
The decision boundary is equidistant from the support vectors. Let’s compute it step-by-step:
The line passing through the support vectors (1,1) and (0,0) has a slope:
m = (1 − 0) / (1 − 0) = 1
The decision boundary is perpendicular to this line. The slope of the perpendicular
line is:
m⊥ = −1/m = −1
The midpoint of the two support vectors is:
((1 + 0)/2, (1 + 0)/2) = (0.5, 0.5)
The equation of the decision boundary (hyperplane) is:
𝑦 = −𝑥 + 1
Or equivalently:
𝑥+𝑦−1=0
Thus, the equation of the decision boundary is:
x + y − 1 = 0
Each support vector lies at a perpendicular distance of |x + y − 1| / √2 = 1/√2 from this boundary, so the total margin between the two classes is 2 × (1/√2) = √2.
Final Answer:
Optimal Hyperplane: 𝑥 + 𝑦 − 1 = 0
Support Vectors: (1,1) and (0,0)
Margin: √2
Python Implementation
import numpy as np
from sklearn.svm import SVC

# Minimal sketch: fit on the two support vectors listed above (the full dataset is not reproduced here)
clf = SVC(kernel='linear', C=1e6).fit(np.array([[1, 1], [0, 0]]), [1, -1])
print(clf.coef_, clf.intercept_)  # approximately [[1. 1.]] and [-1.], i.e. x + y - 1 = 0
A hospital wants to predict whether patients have Diabetes (Class +1) or No Diabetes (Class
-1) based on two simple blood test metrics:
1. Glucose Level (mg/dL)
2. BMI (Body Mass Index)
Dataset (4 Patients):
Patient   Glucose (x₁)   BMI (x₂)   Diagnosis (y)
1         150            30         +1 (Diabetic)
2         160            35         +1 (Diabetic)
3         80             20         −1 (Healthy)
4         90             22         −1 (Healthy)
Tasks:
Primal SVM Formulation:
o Plot the data and guess which points are likely support vectors.
Solve for Decision Boundary:
o Assume support vectors are Patient 1 (150, 30) and Patient 4 (90, 22).
o Solve for weights w = [w₁, w₂] and bias b.
Calculate Margin and Classify a New Patient:
o A new patient has Glucose = 130, BMI = 28. Predict their class.
Ans- SVM Solution for Diabetes Prediction
Objective:
Find the optimal separating hyperplane by solving:
min_{w₁, w₂, b} (1/2)(w₁² + w₂²)
Constraints:
For each patient 𝑖:
𝑦𝑖 (𝑤1 𝑥𝑖1 + 𝑤2 𝑥𝑖2 + 𝑏) ≥ 1
Explicit constraints (a numerical sketch for solving this program follows the list):
• Patient 1 (Diabetic): 150𝑤1 + 30𝑤2 + 𝑏 ≥ 1
• Patient 2 (Diabetic): 160𝑤1 + 35𝑤2 + 𝑏 ≥ 1
• Patient 3 (Healthy): −80𝑤1 − 20𝑤2 − 𝑏 ≥ 1
• Patient 4 (Healthy): −90𝑤1 − 22𝑤2 − 𝑏 ≥ 1
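A sketch of solving this primal quadratic program directly with scipy.optimize (SLSQP). The variable layout params = [w₁, w₂, b] and the zero starting point are choices made here, not part of the notes; the result should agree with the sklearn output reported further below:
Python Code:
import numpy as np
from scipy.optimize import minimize

X = np.array([[150, 30], [160, 35], [80, 20], [90, 22]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

def objective(params):          # params = [w1, w2, b]
    w = params[:2]
    return 0.5 * np.dot(w, w)   # (1/2) * ||w||^2

# One inequality constraint y_i (w . x_i + b) - 1 >= 0 per patient
cons = [{"type": "ineq", "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=cons)
print(res.x)                    # expected to be close to [0.0333, 0.0, -4.0]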
From a plot of the four patients, the closest points to the potential decision boundary are:
• Patient 1 (150, 30)
• Patient 4 (90, 22)
These are the support vectors that will define the margin.
Treating the support-vector constraints as equalities and taking w₂ = 0 (confirmed by the Python output below), 150w₁ + b = 1 and −(90w₁ + b) = 1 give w₁ = 1/30 ≈ 0.0333 and b = −4.
Margin = 1 / ‖w‖ = 1 / √( (1/30)² + 0² ) = 30 units
Python Verification
from sklearn import svm

X = [[150, 30], [160, 35], [80, 20], [90, 22]]
y = [1, 1, -1, -1]
clf = svm.SVC(kernel='linear', C=1e5)  # large C approximates a hard margin
clf.fit(X, y)

print("Weights:", clf.coef_[0])
print("Bias:", clf.intercept_[0])
print("Support Vectors:", clf.support_vectors_)
Output:
Weights: [0.0333 0.]
Bias: -4.0
Support Vectors: [[150. 30.], [90. 22.]]
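The remaining task asks for a prediction for a new patient with Glucose = 130 and BMI = 28. Using the fitted model from the verification snippet above (by hand, the decision value is 130/30 − 4 ≈ 0.33 > 0):
print(clf.predict([[130, 28]]))  # [1] -> the new patient is classified as Diabetic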
Interpretation
• The model uses only Glucose levels (BMI is ignored) because the data is separable
along Glucose.
• Patients with Glucose > 120 mg/dL are classified as diabetic.
• The large margin (30 units) indicates a robust classifier.
This simple example mirrors real-world scenarios where one feature (e.g., Glucose) may
dominate predictions, and SVM provides an interpretable, margin-maximizing solution.