
UNIT - 2

BASIC MATHEMATICAL CONCEPTS FOR ML

RANDOM VARIABLE
A random variable is a fundamental concept in statistics that bridges the gap between theoretical
probability and real-world data. A random variable in statistics is a function that assigns a real
value to each outcome in the sample space of a random experiment. For example, if you roll a die,
you can assign a number to each possible outcome.

X: S →R
where,
X is Random Variable (It is usually denoted using capital letter)
S is Sample Space
R is Set of Real Numbers

Example 1
If two unbiased coins are tossed, find the random variable associated with that event.
Solution:
Suppose two unbiased coins are tossed and let X = number of heads. [X is a random variable or function]
Here, the sample space S = {HH, HT, TH, TT}, and X maps each outcome to a real value:
X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = 0, so X takes the values 0, 1, 2.

Example 2
Suppose a random variable X takes m different values, X = {x1, x2, x3, ..., xm}, with
corresponding probabilities P(X = xi) = pi, where 1 ≤ i ≤ m.
The probabilities must satisfy the following conditions:
0 ≤ pi ≤ 1, where 1 ≤ i ≤ m
p1 + p2 + p3 + ... + pm = 1; that is, 0 ≤ pi ≤ 1 and ∑pi = 1
TYPES OF RANDOM VARIABLES

Discrete Random Variable

A Discrete Random Variable takes on a finite (or countably infinite) number of values. The
probability function associated with it is called the PMF.

PMF (Probability Mass Function)


If X is a discrete random variable and the PMF of X is P(xi), then 0 ≤ P(xi) ≤ 1 and ∑P(xi) = 1,
where the sum is taken over all possible values of x.
Example: Let S = {0, 1, 2}

xi          0    1    2
P(X = xi)   P1   0.3  0.5

Find the value of P(X = 0).

Since the sum of all probabilities equals 1, and letting P(X = 0) = P1:
P1 + 0.3 + 0.5 = 1
P1 = 0.2
Then, P(X = 0) is 0.2
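The same check can be written as a short Python sketch (names and values follow the example above):

pmf = {0: None, 1: 0.3, 2: 0.5}   # P(X = 0) = P1 is unknown

known = sum(p for p in pmf.values() if p is not None)
pmf[0] = round(1.0 - known, 10)   # P1 = 1 - (0.3 + 0.5) = 0.2

assert all(0.0 <= p <= 1.0 for p in pmf.values())   # each pi in [0, 1]
assert abs(sum(pmf.values()) - 1.0) < 1e-9          # probabilities sum to 1
print(pmf)                        # {0: 0.2, 1: 0.3, 2: 0.5}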

Continuous Random Variable

A Continuous Random Variable takes on an uncountably infinite number of values. The probability
function associated with it is called the PDF (Probability Density Function).

PDF (Probability Density Function)
If X is a continuous random variable with P(x < X < x + dx) = f(x)dx, then
f(x) ≥ 0 for all x
∫ f(x) dx = 1 over all values of x
Then f(x) is said to be the PDF of the distribution.
Example
Find the value of P(1 < X < 2) given that f(x) = kx³ for 0 ≤ x ≤ 3 and f(x) = 0 otherwise, where
f(x) is a density function.
If a function f is a density function, then the total probability equals 1. Since X is a
continuous random variable, the integral over the whole sample space S is 1:
∫ f(x) dx = 1
∫₀³ kx³ dx = 1
k[x⁴/4]₀³ = 1
k(3⁴ – 0⁴)/4 = 1
k(81/4) = 1
k = 4/81
Thus,
P(1 < X < 2) = ∫₁² kx³ dx = k × (2⁴ – 1⁴)/4
P = (4/81) × (16 – 1)/4
P = 15/81
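The calculation can be verified numerically with a short Python sketch (this assumes SciPy is available):

from scipy.integrate import quad

k = 4 / 81
f = lambda x: k * x**3

total, _ = quad(f, 0, 3)              # integral over [0, 3]; should be 1
p, _ = quad(f, 1, 2)                  # P(1 < X < 2)
print(round(total, 6), round(p, 6))   # 1.0 0.185185  (= 15/81)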

Random Variable Formulas


There are two main random variable formulas,
 Mean of Random Variable
 Variance of Random Variable
Mean of Random Variable
For any random variable X, where P is the probability of each of its values, we define its mean as:
Mean (μ) = ∑ X·P
where,
X ranges over all possible values of the random variable
P is the probability of the respective value

Variance of Random Variable


The variance of a random variable tells us how the random variable is spread about the mean value
of the random variable. The variance of the Random Variable is calculated using the formula,
Var(X) = σ² = E(X²) – {E(X)}²
where,
E(X²) = ∑X²·P
E(X) = ∑X·P
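A minimal Python sketch of both formulas, reusing the PMF from the earlier example (the values are illustrative):

xs = [0, 1, 2]
ps = [0.2, 0.3, 0.5]

mean = sum(x * p for x, p in zip(xs, ps))     # E(X) = 1.3
ex2 = sum(x**2 * p for x, p in zip(xs, ps))   # E(X²) = 2.3
var = ex2 - mean**2                           # 2.3 - 1.69 ≈ 0.61
print(mean, var)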

Probability Distribution Function


Probability Distribution refers to the function that gives the probability of all possible
values of a random variable. It shows how the probabilities are assigned to the different possible
values of the random variable.
A Probability Distribution Function is a mathematical function that describes the likelihood
of different outcomes in a random experiment. For any random variable X evaluated at a point x,
the probability distribution function gives the probability that X takes a value less than or
equal to x.
We represent the probability distribution function as F(x) = P(X ≤ x).
The Probability Distribution Function is also called the Cumulative Distribution Function (CDF). The CDF
represents the cumulative probability up to a certain value of the random variable.
The cumulative probability for the half-open interval (a, b] is given by:
P(a < X ≤ b) = F(b) – F(a)
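As a quick Python illustration (assuming SciPy), the interval probability P(a < X ≤ b) = F(b) – F(a) for a standard normal X:

from scipy.stats import norm

a, b = -1.0, 1.0
p = norm.cdf(b) - norm.cdf(a)   # F(b) - F(a)
print(round(p, 4))              # 0.6827, i.e., about 68% within one sigma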

Uses of Probability Distribution Function


 Statistical Inference: PDFs are fundamental in statistical inference, allowing for the
estimation of population parameters and hypothesis testing.
 Modeling and Simulations: PDFs are used to model real-world phenomena and to
simulate random processes in fields like engineering, finance, and the natural sciences.
 Risk Assessment: In finance and insurance, PDFs help assess risks and determine the
likelihood of various financial outcomes.
Types of Probability Distribution
There are two types of probability distributions:
 Discrete probability distributions
 Continuous probability distributions

Discrete probability distributions


A discrete probability distribution is a probability distribution of a categorical or discrete variable.
Discrete probability distributions only include the probabilities of values that are possible. In other
words, a discrete probability distribution doesn’t include any values with a probability of zero. For
example, a probability distribution of dice rolls doesn’t include 2.5 since it’s not a possible
outcome of dice rolls.

The probabilities of all possible values in a discrete probability distribution add up to one. It's certain
(i.e., a probability of one) that an observation will have one of the possible values.

Probability mass functions


A probability mass function (PMF) is a mathematical function that describes a discrete probability
distribution. It gives the probability of every possible value of a variable. A probability mass
function can be represented as an equation or as a graph.
Continuous probability distributions
A continuous probability distribution is the probability distribution of a continuous variable. A
continuous variable can have any value between its lowest and highest values. Therefore,
continuous probability distributions include every number in the variable’s range. The probability
that a continuous variable will have any specific value is so infinitesimally small that it’s
considered to have a probability of zero. However, the probability that a value will fall within a
certain interval of values within its range is greater than zero.

Probability density functions


A probability density function (PDF) is a mathematical function that describes a continuous
probability distribution. It provides the probability density of each value of a variable, which can
be greater than one.

A probability density function can be represented as an equation or as a graph. In graph form, a
probability density function is a curve. You can determine the probability that a value will fall
within a certain interval by calculating the area under the curve within that interval. You can use
reference tables or software to calculate the area.

The area under the whole curve is always exactly one because it’s certain (i.e., a probability of
one) that an observation will fall somewhere in the variable’s range.

A cumulative distribution function is another type of function that describes a continuous
probability distribution.
PROPERTIES OF PROBABILITY DISTRIBUTION

1. Mean (Expectation)

Mean is a measure of central tendency that represents the average value of the dataset. It
is defined as the ratio of the sum of all observations to the total number of observations.

Formula: Mean (µ) = Σxᵢ / n

where,
Σxᵢ: Sum of all data points (xᵢ)
n: Number of data points
2. Median

Median: Given that the data collection is arranged in ascending or descending order, the
following method is applied:
 If the number of values or observations in the given data is odd, then the median is given by the
[(n + 1)/2]th observation:

Median = ((n + 1)/2)th term

 If the number of values or observations in the given data set is even, then the median is
given by the average of the (n/2)th and [(n/2) + 1]th observations.

3. Mode

The most frequent number occurring in the data set is known as the mode.

Example: In the given set of data: 2, 4, 5, 5, 6, 7, the mode of the data set is 5 since it has
appeared in the set twice.

4. Variance
Variance is the measure of dispersion, referred to as the second moment in statistics. It is
the average squared difference of data points from the mean. The higher the variance value, the
greater the variability (i.e., data points are far away from the mean) within the dataset, and vice
versa.
Variance measures how far each data point in the dataset is from the mean.
Formula of Variance:

Variance (σ²) = Σ(xᵢ – µ)²/n


where:
Σ(xᵢ – µ)²: Sum of the squared difference between each data point (xᵢ) and the mean (µ)
n: Number of data points

5. Standard Deviation

It is defined as the square root of the variance. It is useful for comparing the spread of two
different datasets that have approximately the same mean.
Formula of Standard Deviation:
Standard Deviation = √(variance)
6. Skewness

Skewness is the third moment, which measures the deviation of the given distribution of a
random variable from a symmetric distribution. In simple terms, skewness means a lack of
symmetry.
Formula of Skewness:
Skewness (γ) = Σ(xᵢ – µ)³/(n * σ³)
where:
Σ(xᵢ – µ)³: Sum of the cubed difference between each data point (xᵢ) and the mean (µ)
n: Number of data points
σ: Standard deviation

There are two types of skewness:


Positive Skewness:
 In positive skewness, the extreme data values are greater, which, in turn, increases the
mean value of the dataset.
 In positive skewness: Mode < Median < Mean.
Negative Skewness:
 In negative skewness, the extreme data values are smaller, which, in turn, decreases the
mean value of the dataset.
 In negative skewness: Mean < Median < Mode.
 Note: When there is no skewness in the dataset, it is referred to as Symmetrical
Distribution.

In Symmetrical Distribution: Mean = Median = Mode.

7. Kurtosis

Kurtosis is the fourth moment, which measures the presence of outliers in the distribution.
It describes the graph as either heavy-tailed or light-tailed due to the presence of outliers. In simple
terms, kurtosis measures the peakedness or flatness of a distribution.

 If the graph has heavier tails and a sharper peak, then kurtosis is said to be high.
 If the graph has lighter tails and a flatter top, then kurtosis is said to be low.
There are three types of Kurtosis:
 Mesokurtic: This is the same as Normal distribution, i.e., a type of distribution in which
the extreme ends of the graph are similar.
 Leptokurtic: This distribution indicates that a more significant percentage of data is
present near the tail, which implies a longer tail. Leptokurtic has a greater value of
kurtosis than Mesokurtic.
 Platykurtic: This distribution indicates that there is less data in the tail portion, which
implies a shorter tail. Platykurtic has a lesser value of kurtosis than Mesokurtic.

8. Moments

Moments are measures that describe the shape of a distribution. They provide insights into the
central tendency, dispersion, skewness, and kurtosis of a dataset. Statistical moments play a
crucial role when we specify the probability distribution to work with, since with the help of
moments we can describe the properties of a statistical distribution; therefore, they help define
the distribution. Statistical moments are also required in statistical estimation and hypothesis
testing, which are based on numerical values arrived at for each distribution.

Types of Moments:

 First Moment (r = 1): Mean


 Second Moment (r = 2): Variance
 Third Moment (r = 3): Skewness (asymmetry)
 Fourth Moment (r = 4): Kurtosis (tailedness)
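All four moment-based summaries can be computed with NumPy and SciPy (the sample data here is illustrative):

import numpy as np
from scipy.stats import skew, kurtosis

data = np.array([2, 4, 5, 5, 6, 7, 9, 12])
print(np.mean(data))      # first moment: mean
print(np.var(data))       # second central moment: variance
print(skew(data))         # third standardized moment: skewness
print(kurtosis(data))     # fourth: excess kurtosis (0 for a normal curve)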

5 Number Summary

The concept of a 5 number summary is a way to describe a distribution using 5 numbers:
the minimum, quartile 1, the median (quartile 2), quartile 3, and the maximum. The 5 number
summary comes under statistics, which deals with collecting data, analyzing it, interpreting it,
and presenting it in an organized manner.
As stated above, it gives a rough idea of what the given dataset looks like by reporting the
minimum value, maximum value, median, and quartile values.

Calculating 5 number summary

In order to find the 5 number summary, the data must be sorted. If it is not, sort it in
ascending order first.
 Minimum Value: It is the smallest number in the given data, and the first number when it is
sorted in ascending order.

 Maximum Value: It is the largest number in the given data, and the last number when it is
sorted in ascending order.

 Median: The middle value between the minimum and maximum values. The formula to
find the median is:

Median = ((n + 1)/2)th term

 Quartile 1: The middle value between the minimum and the median. For a small dataset, we can
simply identify the middle value between the median and the minimum. For a big dataset with
many numbers, it is better to use the formula:
Quartile 1 = ((n + 1)/4)th term

 Quartile 3: The middle value between the median and the maximum value.

Quartile 3 = (3(n + 1)/4)th term

To get a better grip on this, let's look at a few examples.


Sample Questions
Question 1: What is the minimum value in the given data 10, 20, 5, 15, 25, 30, 8?

Solution:
 Step-1 Sort the given data in ascending order.
5, 8, 10, 15, 20, 25, 30
 Step-2 Find minimum number
Here the first number is the minimum number as it is sorted in ascending order.
Minimum value = 5
Question 2: What is the maximum value in the given data 10, 20, 5, 15, 25, 30, 8?

Solution:
 Step-1 Sort the given data in ascending order.
5, 8, 10, 15, 20, 25, 30
 Step-2 Find maximum number
Here the last number is the maximum number as it is sorted in ascending order.
Maximum value = 30

Question 3: What is the median value in the given data 10, 20, 5, 15, 25, 30, 8?

Solution:
 Step-1 Sort the given data in ascending order.
5, 8, 10, 15, 20, 25, 30
 Step-2 Find the median
Here we find the median by the formula ((n + 1)/2)th term, where n is the total count
of numbers.
Here n = 7
So the median position = (7 + 1)/2 = 8/2 = 4th term
The 4th term is the median, which is 15.

Summarize a Data Set with a Five-Number Summary

Step 1: Order the values from least to greatest.


Step 2: Determine the minimum and maximum of the data set by identifying the lowest and highest
values.
Step 3: Find the median of the data set. Separate the lower half from the upper half.
Step 4: Find the first and third quartiles by finding the median of the lower half and upper half of
the data.
Step 5: Summarize the data set by stating the minimum, first quartile, median, third quartile, and
maximum.
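A plain Python sketch of these steps, using the data from the questions above (the whole-number positions work here because n + 1 is divisible by 4):

data = sorted([10, 20, 5, 15, 25, 30, 8])   # Step 1: order the values
n = len(data)                               # n = 7

minimum = data[0]                           # 5
q1 = data[(n + 1) // 4 - 1]                 # 2nd value -> 8
median = data[(n + 1) // 2 - 1]             # 4th value -> 15
q3 = data[3 * (n + 1) // 4 - 1]             # 6th value -> 25
maximum = data[-1]                          # 30

print(minimum, q1, median, q3, maximum)     # 5 8 15 25 30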

Multidimensional probability distribution


A multidimensional probability distribution involves two or more random variables, which can
be either discrete, continuous, or a mix of both. These distributions are essential for studying the
relationships between multiple variables and analyzing joint behavior.
Key Concepts in Multidimensional Probability Distributions

1. Joint Probability Distribution:

 Describes the probability of two or more random variables occurring simultaneously.
 Discrete case: P(X = x, Y = y)
 Continuous case: joint probability density function f(x, y), with f(x, y) ≥ 0 and ∬ f(x, y) dx dy = 1

2. Marginal Probability Distribution:

 The probability distribution of a subset of random variables, obtained by summing
(discrete case) or integrating (continuous case) over the other variables.

3. Conditional Probability Distribution:


 Describes the distribution of one variable given that another variable has a specific
value.

4. Independence:
 Two variables X and Y are independent if:
P(X = x, Y = y) = P(X = x) · P(Y = y) for all x and y (equivalently, f(x, y) = fX(x) · fY(y) in the continuous case)

Key Metrics in Multidimensional Distributions

1. Covariance:
o Measures the linear relationship between two variables:
Cov(X, Y) = E[(X – E[X])(Y – E[Y])]
2. Correlation:
 Normalized measure of linear dependence between variables:
Corr(X, Y) = Cov(X, Y) / (σX σY)
3. Covariance Matrix:
 Generalizes variance and covariance for multiple variables.
 For a k-dimensional random vector X, it is the k × k matrix Σ with entries:
Σij = Cov(Xi, Xj)
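A short NumPy sketch of these metrics (the data is illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # e.g., hours studied
y = np.array([52.0, 61.0, 70.0, 77.0, 88.0])   # e.g., test scores

cov_matrix = np.cov(x, y)        # 2 x 2: variances on the diagonal,
print(cov_matrix)                # Cov(X, Y) off the diagonal
print(np.corrcoef(x, y)[0, 1])   # correlation, always in [-1, 1]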

Applications of Multidimensional Probability Distributions

 Machine Learning: Multivariate Gaussian models in classification and regression.
 Finance: Portfolio risk analysis using covariance matrices.
 Physics: Modeling spatial phenomena or multidimensional processes.
 Econometrics: Analyzing relationships between economic indicators.

Joint Probability Distribution

A Joint probability distribution tells us how likely two or more random variables occur together.
It describes the relationship between these variables and shows the probability of different
combinations of their values. For example, if we want to understand the likelihood of it being hot
(X) and raining (Y) on the same day, a joint probability distribution provides this information.

Imagine a simple case with two random variables:

1. X: The number of hours you study (e.g., 1, 2, or 3 hours).
2. Y: Your test score (e.g., 50, 75, or 100 points).

The joint probability distribution tells us the probability of studying for X hours and scoring Y
points. For instance, P(X=2,Y=75)=0.2 means there is a 20% chance that you studied for 2 hours
and scored 75.

How Does It Work?

To understand a joint probability distribution, you need:

1. Random variables: The things you are observing or measuring (e.g., study hours and test
scores).
2. Combinations of values: All possible pairs of outcomes (e.g., studying 2 hours and scoring
75 points).
3. Probabilities: The likelihood of each combination happening.

All the probabilities in the joint distribution must add up to 1 because one of the combinations
must happen.

Example: Let’s consider a situation where:

 X is the number of hours studied (1, 2, or 3 hours).
 Y is the test score (50, 75, or 100 points).

The joint probability distribution can be laid out as a table with rows X = 1, 2, 3 and columns
Y = 50, 75, 100; the probabilities discussed below are entries and totals of that table.

How to read this table

 Each cell shows the probability of a specific combination. For example:
o P(X = 2, Y = 75) = 0.2: there's a 20% chance that you studied for 2 hours and scored 75.
o P(X = 3, Y = 100) = 0.15: there's a 15% chance that you studied for 3 hours and scored 100.
 The rows show probabilities for different study hours. For example:
o For X = 1, the total is 0.1 + 0.1 + 0.1 = 0.3, meaning there's a 30% chance you studied
for 1 hour (regardless of your score).
 The columns show probabilities for different scores. For example:
o For Y = 75, the total is 0.1 + 0.2 + 0.1 = 0.4, meaning there's a 40% chance you scored
75 (regardless of your study hours).
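The table can be reconstructed in a short Python sketch. The cells quoted above are used directly; the three cells the text does not state, marked "assumed", are chosen only so that everything sums to 1:

joint = {
    (1, 50): 0.10, (1, 75): 0.10, (1, 100): 0.10,  # row X = 1 totals 0.3
    (2, 50): 0.10, (2, 75): 0.20, (2, 100): 0.10,  # (2,50), (2,100) assumed
    (3, 50): 0.05, (3, 75): 0.10, (3, 100): 0.15,  # (3,50) assumed
}
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Marginals: sum over the other variable.
p_x = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (1, 2, 3)}
p_y = {y: sum(p for (x, yi), p in joint.items() if yi == y) for y in (50, 75, 100)}
print(p_x)                # X = 1 gives 0.3, as in the row total above
print(round(p_y[75], 2))  # 0.4, matching the column total above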

Marginal Probability in a Joint Probability Distribution


Marginal probability helps us focus on one variable while ignoring or summarizing the impact of
other variables in a joint distribution. It provides a simplified view of the probabilities when we
are not interested in the combined relationships between variables.

Key Reasons for Declaring Marginal Probability

1. To Simplify Analysis:
o Joint probability distributions can be complex, involving multiple variables.
Marginal probability simplifies the data by focusing on one variable, making it
easier to analyze.

Example: From the joint distribution of study hours (X) and test scores (Y), you may only
want to know the probability of studying for 2 hours, regardless of the score:
P(X = 2) = Σy P(X = 2, Y = y)

2. To Determine Overall Likelihoods:

 Marginal probabilities give the overall likelihood of an event happening,
independent of the other variables.

Example: If you’re interested in the overall probability of scoring 75 points (P(Y=75)), marginal
probability provides that information without considering study hours.

3. To Serve as a Foundation for Conditional Probability:

 Marginal probability is essential for calculating conditional probability, which
shows the likelihood of one event happening given that another event has occurred.

Formula:
P(X | Y) = P(X, Y) / P(Y)
 Here, P(Y), the marginal probability of Y, is required for the calculation.

4. To Identify Relationships Between Variables:

 Comparing marginal probabilities with joint probabilities helps determine if variables are
independent or dependent.

5. To Make Real-World Decisions:

 Many real-world problems only require marginal probabilities, especially when specific
interactions between variables are not relevant.

Example: A company might want to know the probability of selling a product on a given day,
regardless of customer demographics. This would involve marginal probabilities.

Conditional Probability Distribution

Conditional probability is the probability of an event occurring given that another event has already
occurred. It is mathematically expressed as:

P(A | B) = P(A ∩ B) / P(B)

where:

 P(A∣B) is the probability of event A occurring given that B has already occurred.
 P(A∩B) is the probability of both A and B occurring.
 P(B) is the probability of event B occurring.

A conditional probability distribution describes the probability distribution of a random variable
given that another variable has a specific value. This is crucial in statistics, machine learning, and
probability theory.

For two random variables X and Y, the conditional probability distribution of X given Y = y is:

P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
Example 1: Conditional Probability in a Discrete Case

Problem Statement:

Suppose a company has two departments: Software Development (S) and Marketing (M). The
probability distribution of employees in each department based on experience level (Junior or
Senior) is given as a joint table (illustrated with assumed numbers in the sketch below).

Interpretation: If we randomly select an employee from the Marketing department, there is a
60% chance that they are a Senior.
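A sketch with assumed numbers (the original table is not reproduced in the text; these values are chosen so that P(Senior | Marketing) = 0.6, matching the interpretation above):

joint = {
    ("S", "Junior"): 0.40, ("S", "Senior"): 0.30,  # Software Development
    ("M", "Junior"): 0.12, ("M", "Senior"): 0.18,  # Marketing (assumed split)
}
p_m = joint[("M", "Junior")] + joint[("M", "Senior")]  # P(M) = 0.30
p_senior_given_m = joint[("M", "Senior")] / p_m        # 0.18 / 0.30
print(round(p_senior_given_m, 2))                      # 0.6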

Example 2: Conditional Probability in a Continuous Case

Problem Statement:
Suppose a company tracks the time X (in hours) it takes to complete a task, based on the experience
level Y of the employee (Junior or Senior). For a Junior employee, assume the task time follows
the probability density function (an exponential density, consistent with the answer below):

f(x | Y = Junior) = 0.2 e^(–0.2x), x ≥ 0

Question:

What is the probability that a Junior employee completes the task within 2 hours?

P(X ≤ 2 | Y = Junior) = ∫₀² 0.2 e^(–0.2x) dx = 1 – e^(–0.4) ≈ 0.3297

Interpretation: A Junior employee has a 32.97% chance of completing the task within 2 hours.
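A one-line numeric check of this integral in Python:

import math

rate = 0.2                   # f(x | Junior) = 0.2 * exp(-0.2 * x)
p = 1 - math.exp(-rate * 2)  # P(X <= 2 | Junior)
print(round(p, 4))           # 0.3297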

Applications of Conditional Probability Distribution

1. Machine Learning
o Naïve Bayes Classifier uses conditional probability to make predictions.
o Hidden Markov Models (HMM) in speech recognition and NLP rely on conditional
distributions.
2. Medical Diagnosis
o Given symptoms S, doctors estimate the probability of a disease D using P(D | S).
3. Finance & Risk Analysis
o Stock price movements conditioned on economic indicators.
4. Natural Language Processing (NLP)
o Predicting the next word in a sentence given previous words.

Bayes Theorem

Bayes theorem is also known by other names such as Bayes rule or Bayes law. Bayes
theorem helps to determine the probability of an event using prior knowledge. It is used to
calculate the probability of one event occurring while another has already occurred, and it is the
standard method for relating conditional probability and marginal probability. In simple words, we
can say that Bayes theorem helps to produce more accurate results.

Bayes Theorem is used to estimate the precision of values and provides a method for
calculating conditional probability. Although it is a deceptively simple calculation, it is used to
easily calculate the conditional probability of events where intuition often fails. Some data
scientists assume that Bayes theorem is most widely used in the financial industry, but that is not
the case; besides finance, Bayes theorem is also extensively applied in health and medicine,
research and the survey industry, the aeronautical sector, etc.

What is Bayes Theorem?

Bayes theorem is one of the most popular machine learning concepts. It helps to calculate the
probability of one event occurring, under uncertain knowledge, while another event has already occurred.

Bayes' theorem can be derived using product rule and conditional probability of event X with
known event Y:

o According to the product rule, we can express the probability of event X with known
event Y as follows:
 P(X ∩ Y) = P(X|Y) P(Y) {equation 1}

o Further, the probability of event Y with known event X:
 P(X ∩ Y) = P(Y|X) P(X) {equation 2}

o Mathematically, Bayes theorem can be expressed by equating the right-hand sides of
the two equations and solving for P(X|Y). We get:

P(X|Y) = [P(Y|X) P(X)] / P(Y)

The above equation is called Bayes Rule or Bayes Theorem.

o P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated
probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence given that the hypothesis is true.
o P(X) is called the prior probability, i.e., the probability of the hypothesis before considering the
evidence.
o P(Y) is called the marginal probability. It is defined as the probability of the evidence under any
consideration.
Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
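As a tiny Python sketch of this reading of the rule (the numbers are illustrative, not from the text):

def posterior(likelihood, prior, evidence):
    # P(X|Y) = P(Y|X) * P(X) / P(Y)
    return likelihood * prior / evidence

print(posterior(0.9, 0.01, 0.05))   # 0.18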

Prerequisites for Bayes Theorem

While studying the Bayes theorem, we need to understand a few important concepts. These are as
follows:

1. Experiment

An experiment is defined as a planned operation carried out under controlled conditions, such as
tossing a coin, drawing a card, or rolling a die.

2. Sample Space

What we get as a result during an experiment is called an outcome, and the set of all possible
outcomes of an experiment is known as the sample space. For example, if we are rolling a die,
the sample space will be:

S1 = {1, 2, 3, 4, 5, 6}

Similarly, if our experiment is tossing a coin and recording its outcomes, then the sample space
will be:
S2 = {Head, Tail}

3. Event

An event is defined as a subset of the sample space in an experiment. It is also called a set of
outcomes.

Assume in our experiment of rolling a die, there are two events A and B such that:

A = Event when an even number is obtained = {2, 4, 6}

B = Event when a number is greater than 4 = {5, 6}

o Probability of the event A, P(A) = Number of favourable outcomes / Total number of
possible outcomes
P(A) = 3/6 = 1/2 = 0.5
o Similarly, probability of the event B, P(B) = Number of favourable outcomes / Total
number of possible outcomes
P(B) = 2/6
= 1/3
≈ 0.333

Union of events A and B:
A∪B = {2, 4, 5, 6}
Intersection of events A and B:
A∩B = {6}

o Disjoint Event: If the intersection of events A and B is an empty set (null), then such
events are known as disjoint events or mutually exclusive events.

4. Random Variable:

It is a real-valued function that maps the sample space of an experiment to the real line. A
random variable takes on some random values, each value having some probability. However, it
is neither random nor a variable; it behaves as a function, which can be discrete, continuous, or
a combination of both.

5. Exhaustive Event:

As the name suggests, a set of events where at least one event must occur is called an
exhaustive set of events of an experiment.

Thus, two events A and B are said to be exhaustive if either A or B definitely occurs; if they are
also mutually exclusive, exactly one of them occurs, e.g., while tossing a coin, the outcome will be
either a Head or a Tail.
6. Independent Event:

Two events are said to be independent when the occurrence of one event does not affect the
occurrence of the other. In simple words, we can say that the probability of the outcome of one
event does not depend on the other.

Mathematically, two events A and B are said to be independent if:

P(A ∩ B) = P(AB) = P(A)*P(B)

7. Conditional Probability:

Conditional probability is defined as the probability of an event A, given that another event
B has already occurred (i.e., A conditioned on B). This is represented by P(A|B), and we can define
it as:

P(A|B) = P(A ∩ B) / P(B)

8. Marginal Probability:

Marginal probability is defined as the probability of an event A occurring independent of
any other event B. Further, it is considered the probability of the evidence under any consideration.

P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)

Here ~B represents the event that B does not occur.

How to apply Bayes Theorem or Bayes rule in Machine Learning?

Bayes theorem helps us to calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A).
This rule is very helpful in scenarios where we have good estimates of P(A|B), P(B), and
P(A) and need to determine the fourth term.
Naïve Bayes classifier is one of the simplest applications of Bayes theorem; it is used in
classification algorithms and is valued for its accuracy and speed.

Let's understand the use of Bayes theorem in machine learning with below example.

Suppose we have a vector A with i attributes, i.e.,

A = A1, A2, A3, A4, ..., Ai

Further, we have n classes represented as C1, C2, C3, C4, ..., Cn.

These two pieces are given to us, and our classifier, built with machine learning, has to
predict the best possible class for A. With the help of Bayes theorem, we can write this as:

P(Ci|A) = [P(A|Ci) * P(Ci)] / P(A)

Here:

P(A) is the condition-independent entity. It remains constant across classes, i.e., it does not
change its value with respect to a change in class. To maximize P(Ci|A), we therefore have to
maximize the term P(A|Ci) * P(Ci).

With n classes on the probability list, let's assume that any class is equally likely to be the
right answer. Considering this factor, we can say that:

P(C1) = P(C2) = P(C3) = P(C4) = ... = P(Cn)

This process helps to reduce the computation cost as well as time. This is how Bayes theorem
plays a significant role in machine learning, and the naïve independence assumption simplifies the
conditional probability tasks without greatly affecting the precision. Under this assumption, we can write:

P(A|Ci) = P(A1|Ci) * P(A2|Ci) * P(A3|Ci) * ... * P(An|Ci)

Hence, by using Bayes theorem in machine learning, we can easily describe the probability of a
compound observation in terms of the probabilities of smaller events.

Naïve Bayes Classifier Algorithm

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms,
helping to build fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.

Working of Naïve Bayes' Classifier:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play on a particular day according to the
weather conditions. To solve this problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes
11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Likelihood table for the weather conditions:

Weather No Yes P(Weather)

Overcast 0 5 5/14 = 0.35

Rainy 2 2 4/14 = 0.29

Sunny 2 3 5/14 = 0.35

All 4/14 = 0.29 10/14 = 0.71

Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
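The whole calculation can be reproduced with a short Python sketch based on the frequency table above:

n_yes, n_no, n_total = 10, 4, 14

p_sunny_yes = 3 / n_yes              # P(Sunny|Yes) = 0.30
p_sunny_no = 2 / n_no                # P(Sunny|No)  = 0.50
p_yes, p_no = n_yes / n_total, n_no / n_total   # 0.71, 0.29
p_sunny = 5 / n_total                # P(Sunny) = 0.35

p_yes_sunny = p_sunny_yes * p_yes / p_sunny     # = 0.60
p_no_sunny = p_sunny_no * p_no / p_sunny        # = 0.40 (0.41 with rounding)
print("Play" if p_yes_sunny > p_no_sunny else "Don't play")   # Play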

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
