Unit II_ML
RANDOM VARIABLE
A random variable is a fundamental concept in statistics that bridges the gap between theoretical
probability and real-world data. A random variable is a function that assigns a real value to each
outcome in the sample space of a random experiment. For example, if you roll a die, you can
assign a number to each possible outcome.
X: S →R
where,
X is Random Variable (It is usually denoted using capital letter)
S is Sample Space
R is Set of Real Numbers
Example 1
If two unbiased coins are tossed then find the random variable associated with that event.
Solution:
Suppose two (unbiased) coins are tossed and let
X = number of heads. [X is a random variable or function]
Here, the sample space S = {HH, HT, TH, TT}, and X maps these outcomes to real values:
X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0
So X takes the values 0, 1, and 2.
Example 2
Suppose a random variable X takes m different values x1, x2, x3, …, xm, with
corresponding probabilities P(X = xi) = pi, where 1 ≤ i ≤ m.
The probabilities must satisfy the following conditions:
0 ≤ pi ≤ 1, where 1 ≤ i ≤ m
p1 + p2 + p3 + … + pm = 1; in short, 0 ≤ pi ≤ 1 and Σpi = 1
TYPES OF RANDOM VARIABLES
A Discrete Random Variable takes on a finite (or countably infinite) number of values. The
probability function associated with it is called the probability mass function (PMF). For the
two-coin example above, X = number of heads has the PMF:
xi        0    1    2
P(X = xi) 1/4  1/2  1/4
The probabilities of all possible values in a discrete probability distribution add up to one. It’s certain
(i.e., a probability of one) that an observation will have one of the possible values.
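As an illustrative check, the following minimal Python sketch enumerates the two-coin sample
space, maps each outcome to X = number of heads, and verifies that the resulting PMF sums to one:

# Derive the PMF of X = number of heads for two fair coin tosses
from itertools import product
from collections import Counter

sample_space = list(product("HT", repeat=2))          # HH, HT, TH, TT
x_values = [outcome.count("H") for outcome in sample_space]

counts = Counter(x_values)
pmf = {x: count / len(sample_space) for x, count in counts.items()}

print(pmf)                    # {2: 0.25, 1: 0.5, 0: 0.25}
print(sum(pmf.values()))      # 1.0 -- the probabilities add up to one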
A Continuous Random Variable takes on values in a continuous range rather than a countable set,
and the probability function associated with it is the probability density function (PDF). The area
under the whole density curve is always exactly one because it’s certain (i.e., a probability of
one) that an observation will fall somewhere in the variable’s range.
1. Mean (Expectation)
Mean is a measure of central tendency that represents the average value of the dataset. It
is defined as the ratio of the sum of all observations to the total number of observations.
Mean (µ) = Σxᵢ / n
where,
Σxᵢ: Sum of all data points (xᵢ)
n: Number of data points
2. Median
Median: Given that the data collection is arranged in ascending or descending order, the
following method is applied. If the number of values or observations in the given data is odd, then
the median is given by the [(n + 1)/2]th observation:
Median = ((n + 1)/2)th term
If the number of values or observations in the given data set is even, then the median is
given by the average of the (n/2)th and [(n/2) + 1]th observations.
3. Mode
The most frequent number occurring in the data set is known as the mode.
Example: In the given set of data: 2, 4, 5, 5, 6, 7, the mode of the data set is 5 since it has
appeared in the set twice.
4. Variance
Variance is a measure of dispersion, referred to as the second moment in statistics. It is
the average squared difference of the data points from the mean. The higher the variance value, the
greater the variability (i.e., data points are far away from the mean) within the dataset, and vice
versa.
Variance measures how far each data point in the dataset is from the mean.
Formula of Variance:
Variance (σ²) = Σ(xᵢ – µ)² / n
5. Standard Deviation
It is defined as the square root of the variance. It is used to compare the spread of two different
datasets with approximately the same mean.
Formula of Standard Deviation:
Standard Deviation (σ) = √(variance) = (variance)^(1/2)
6. Skewness
Skewness is the third moment, which measures the deviation of the given distribution of a
random variable from a symmetric distribution. In simple terms, skewness means a lack of
symmetry: the distribution leans more to one side than the other.
Formula of Skewness:
Skewness (γ) = Σ(xᵢ – µ)³/(n * σ³)
where:
Σ(xᵢ – µ)³: Sum of the cubed difference between each data point (xᵢ) and the mean (µ)
n: Number of data points
σ: Standard deviation
7. Kurtosis
Kurtosis is the fourth moment, which measures the heaviness of the tails of a distribution, i.e., the
presence of outliers. It describes the graph as either heavy-tailed or light-tailed relative to a normal
distribution. In simple terms, kurtosis measures the peakedness or flatness of a distribution.
If the graph has heavier tails and a sharper peak, then the kurtosis is said to be high.
If the graph has lighter tails and a flatter top, then the kurtosis is said to be low.
There are three types of Kurtosis:
Mesokurtic: This is the same as Normal distribution, i.e., a type of distribution in which
the extreme ends of the graph are similar.
Leptokurtic: This distribution indicates that a more significant percentage of data is
present near the tails, which implies longer/heavier tails. Leptokurtic has a greater value of
kurtosis than Mesokurtic.
Platykurtic: This distribution indicates that there is less data in the tail portion, which
implies a shorter tail. Platykurtic has a lesser value of kurtosis than Mesokurtic.
8. Moments
Moments are measures that describe the shape of a distribution. They provide insights into the
central tendency, dispersion, skewness, and kurtosis of a dataset. Statistical moments play a
crucial role when we specify the probability distribution to work with, since with their help we
can describe the properties of a statistical distribution; therefore, they help define the distribution.
Statistical moments are also required in statistical estimation and testing of hypotheses, which
are based on the numerical values obtained for each distribution.
Types of Moments:
First moment – Mean (central tendency)
Second moment – Variance (dispersion)
Third moment – Skewness (asymmetry)
Fourth moment – Kurtosis (tailedness/peakedness)
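As a brief illustration, the summary measures above (the first four moments and the related
statistics) can be computed in Python. The dataset reuses the values from the Mode example and
the formulas are the population versions given in this section:

# Mean, median, mode, variance, standard deviation, skewness and kurtosis
import statistics

data = [2, 4, 5, 5, 6, 7]          # dataset from the Mode example
n = len(data)

mean = sum(data) / n                                                  # 1st moment
variance = sum((x - mean) ** 2 for x in data) / n                     # 2nd central moment
std_dev = variance ** 0.5
skewness = sum((x - mean) ** 3 for x in data) / (n * std_dev ** 3)    # 3rd standardized moment
kurtosis = sum((x - mean) ** 4 for x in data) / (n * std_dev ** 4)    # 4th standardized moment

print("mean:", mean, " median:", statistics.median(data), " mode:", statistics.mode(data))
print("variance:", round(variance, 3), " std dev:", round(std_dev, 3))
print("skewness:", round(skewness, 3), " kurtosis:", round(kurtosis, 3))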
5 Number Summary
In order to find the 5 number summary, we need the data to be sorted. If it is not, sort it first in
ascending order and then find the summary.
Minimum Value: It is the smallest number in the given data, and the first number when it is
sorted in ascending order.
Maximum Value: It is the largest number in the given data, and the last number when it is
sorted in ascending order.
Median: Middle value between the minimum and maximum value. The formula to find the
median is:
Median = ((n + 1)/2)th term
Quartile 1: Middle/center value between the minimum and the median. We can simply
identify the middle value between the median and the minimum value for a small dataset. If it is a
big dataset with many numbers, then it is better to use the formula:
Quartile 1 = ((n + 1)/4)th term
Quartile 3: Middle/center value between the median and the maximum. Its formula is:
Quartile 3 = (3(n + 1)/4)th term
Question 1: What is the minimum value in the given data 10, 20, 5, 15, 25, 30, 8?
Solution:
Step-1 Sort the given data in ascending order.
5, 8, 10, 15, 20, 25, 30
Step-2 Find the minimum number
Here the first number is the minimum number as it is sorted in ascending order.
Minimum value = 5
Question 2: What is the maximum value in the given data 10, 20, 5, 15, 25, 30, 8.
Solution:
Step-1 Sort the given data in ascending order.
5, 8, 10, 15, 20, 25, 30
Step-2 Find maximum number
Here the last number is the maximum number as it is sorted in ascending order.
Maximum value = 30
Question 3: What is the median value in the given data 10, 20, 5, 15, 25, 30, 8
Solution:
Step-1 Sort the given data in ascending order.
5, 8, 10, 15, 20, 25, 30
Step-2 Find the median
Here we need to find the median value using the formula ((n + 1)/2)th term, where n is the total
count of numbers.
Here n = 7
So the median position = (7 + 1)/2 = 8/2 = 4, i.e., the 4th term.
The 4th term is the median, which is 15.
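The same answers, together with the remaining parts of the 5 number summary, can be reproduced
with a short Python sketch using the ((n + 1)/4), ((n + 1)/2) and (3(n + 1)/4) position formulas from
these notes; fractional positions are handled by simple interpolation, which is one common convention:

# Five-number summary for the data used in Questions 1-3
data = sorted([10, 20, 5, 15, 25, 30, 8])       # 5, 8, 10, 15, 20, 25, 30
n = len(data)

def term(position):
    """Value at a 1-based (possibly fractional) position, with linear interpolation."""
    index = int(position) - 1
    fraction = position - int(position)
    if fraction == 0:
        return data[index]
    return data[index] + fraction * (data[index + 1] - data[index])

summary = {
    "minimum": data[0],
    "Q1": term((n + 1) / 4),
    "median": term((n + 1) / 2),
    "Q3": term(3 * (n + 1) / 4),
    "maximum": data[-1],
}
print(summary)    # {'minimum': 5, 'Q1': 8, 'median': 15, 'Q3': 25, 'maximum': 30}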
4. Independence:
Two variables X and Y are independent if:
P(X = x, Y = y) = P(X = x) · P(Y = y) for all values x and y.
1. Covariance:
o Measures the linear relationship between two variables:
Cov(X, Y) = E[(X – µX)(Y – µY)]
Correlation:
o Covariance scaled to lie between –1 and 1:
Correlation (ρ) = Cov(X, Y) / (σX · σY)
2. Covariance Matrix:
Generalizes variance and covariance for multiple variables.
For a k-dimensional random vector X, it is the k × k matrix:
Σ = E[(X – µ)(X – µ)ᵀ], whose (i, j)th entry is Cov(Xi, Xj).
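A brief numpy sketch makes these definitions concrete; the two variables below are illustrative
values (e.g., hours studied and exam score), not data taken from these notes:

# Covariance, correlation and the covariance matrix (population versions)
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])         # e.g., hours studied
y = np.array([50.0, 60.0, 80.0, 90.0])     # e.g., exam score

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X, Y)
corr_xy = cov_xy / (x.std() * y.std())              # correlation, always in [-1, 1]

data = np.vstack([x, y])                    # rows = variables, columns = observations
cov_matrix = np.cov(data, bias=True)        # 2 x 2 covariance matrix

print(cov_xy, corr_xy)                      # 35.0  ~0.99
print(cov_matrix)                           # [[  5.  35.] [ 35. 250.]]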
Joint Probability Distribution
A joint probability distribution tells us how likely two or more random variables are to occur
together. It describes the relationship between these variables and shows the probability of
different combinations of their values. For example, if we want to understand the likelihood of it
being hot (X) and raining (Y) on the same day, a joint probability distribution provides this information.
For example:
Let X represent the number of hours you study for an exam (e.g., 1, 2, or 3 hours).
Let Y represent your score on the exam (e.g., 50, 75, or 100 points).
The joint probability distribution shows the likelihood of each combination of X and Y. For
instance, P(X=2,Y=75)=0.2 means there is a 20% chance that you studied for 2 hours and scored
75.
A joint probability distribution involves three components:
1. Random variables: The things you are observing or measuring (e.g., study hours and test
scores).
2. Combinations of values: All possible pairs of outcomes (e.g., studying 2 hours and scoring
75 points).
3. Probabilities: The likelihood of each combination happening.
All the probabilities in the joint distribution must add up to 1 because one of the combinations
must happen.
Marginal Probability
Marginal probability is the probability of a single variable taking a particular value, irrespective of
the values of the other variables. It is obtained by summing the joint probabilities over the other
variables. It is useful for several reasons:
1. To Simplify Analysis:
o Joint probability distributions can be complex, involving multiple variables.
Marginal probability simplifies the data by focusing on one variable, making it
easier to analyze.
Example: From the joint distribution of study hours (X) and test scores (Y), you may only
want to know the probability of studying for 2 hours, regardless of the score:
P(X = 2) = Σy P(X = 2, Y = y)
Example: If you’re interested in the overall probability of scoring 75 points (P(Y = 75)), marginal
probability provides that information without considering study hours.
Formula:
P(Y = y) = Σx P(X = x, Y = y)
Here, P(Y), the marginal probability of Y, is required for the calculation.
Comparing marginal probabilities with joint probabilities helps determine if variables are
independent or dependent.
Many real-world problems only require marginal probabilities, especially when specific
interactions between variables are not relevant.
Example: A company might want to know the probability of selling a product on a given day,
regardless of customer demographics. This would involve marginal probabilities.
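To make the joint/marginal distinction concrete, here is a small Python sketch for the study-hours
and score example. Only P(X = 2, Y = 75) = 0.2 comes from the text above; the remaining joint
probabilities are made-up values chosen so the table sums to one:

# Joint probability table for X = study hours, Y = exam score
joint = {
    (1, 50): 0.10, (1, 75): 0.10, (1, 100): 0.05,
    (2, 50): 0.10, (2, 75): 0.20, (2, 100): 0.10,   # P(X=2, Y=75) = 0.2 as in the text
    (3, 50): 0.05, (3, 75): 0.10, (3, 100): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9    # all combinations must add up to 1

# Marginal probabilities: sum the joint probabilities over the other variable
p_x2 = sum(p for (x, y), p in joint.items() if x == 2)      # P(X = 2)
p_y75 = sum(p for (x, y), p in joint.items() if y == 75)    # P(Y = 75)

print(round(p_x2, 2))     # 0.4
print(round(p_y75, 2))    # 0.4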
Conditional Probability
Conditional probability is the probability of an event occurring given that another event has already
occurred. It is mathematically expressed as:
P(A|B) = P(A ∩ B) / P(B)
where:
P(A|B) is the probability of event A occurring given that B has already occurred.
P(A ∩ B) is the probability of both A and B occurring.
P(B) is the probability of event B occurring.
For two random variables X and Y, the conditional probability distribution of X given Y = y is:
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
Example 1: Conditional Probability in a Discrete Case
Problem Statement:
Suppose a company has two departments: Software Development (S) and Marketing (M). The
probability distribution of employees in each department based on experience level (Junior or
Senior) is given as follows:
Example 2: Conditional Probability in a Continuous Case
Problem Statement:
Suppose a company tracks the time it takes to complete a task X (in hours) based on the experience
level Y of the employee (Junior or Senior). The probability density functions (PDFs) are given:
Question:
What is the probability that a Junior employee completes the task within 2 hours?
Interpretation: A Junior employee has a 32.97% chance of completing the task within 2 hours.
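The specific density functions for this example are not shown here, so the Python sketch below
assumes, purely for illustration, that a Junior's completion time is exponentially distributed with a
mean of 5 hours; that assumption is chosen because it reproduces the stated 32.97% figure, and the
calculation is just the exponential CDF evaluated at 2 hours:

# Assumed PDF: f(x | Junior) = (1/5) * exp(-x/5) for x >= 0  (mean 5 hours)
import math

rate = 1 / 5                                    # assumed rate parameter (1 / mean)
p_within_2 = 1 - math.exp(-rate * 2)            # P(X <= 2 | Junior), exponential CDF

print(round(p_within_2, 4))                     # 0.3297  -> about 32.97%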
Applications of Conditional Probability
1. Machine Learning
o Naïve Bayes Classifier uses conditional probability to make predictions.
o Hidden Markov Models (HMM) in speech recognition and NLP rely on conditional
distributions.
2. Medical Diagnosis
o Given symptoms S, doctors estimate the probability of a disease D using P(D | S).
3. Finance & Risk Analysis
o Stock price movements conditioned on economic indicators.
4. Natural Language Processing (NLP)
o Predicting the next word in a sentence given previous words.
Bayes Theorem
Bayes theorem is also known by other names such as Bayes rule or Bayes law. Bayes
theorem helps to determine the probability of an event with uncertain knowledge. It is used to
calculate the probability of one event occurring while another one has already occurred. It is the best
method to relate conditional probability and marginal probability. In simple words, we can say
that Bayes theorem helps to produce more accurate results.
Bayes theorem is used to estimate the precision of values and provides a method for
calculating conditional probability. Although it is hypothetically a simple calculation, it is
used to easily calculate the conditional probability of events where intuition often fails. Some
data scientists assume that Bayes theorem is most widely used in the financial industries, but that is
not the case. Besides finance, Bayes theorem is also extensively applied in health and medicine,
research and survey work, the aeronautical sector, etc.
Bayes theorem is one of the most popular machine learning concepts, which helps to calculate the
probability of one event occurring with uncertain knowledge while another one has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability of event X with
known event Y:
o According to the product rule, we can express the probability of event X with known
event Y as follows:
P(X ∩ Y) = P(X|Y) P(Y) {equation 1}
o Similarly, the probability of event Y with known event X is:
P(X ∩ Y) = P(Y|X) P(X) {equation 2}
Equating the right-hand sides of equations 1 and 2 and dividing by P(X), we obtain Bayes theorem:
P(Y|X) = P(X|Y) P(Y) / P(X)
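A quick numeric sanity check of this derivation, with arbitrary illustrative probabilities:

# Verify that P(Y|X) = P(X|Y) * P(Y) / P(X) agrees with the direct product rule
p_x_and_y = 0.12        # P(X ∩ Y), chosen arbitrarily
p_y = 0.30              # P(Y)
p_x = 0.40              # P(X)

p_x_given_y = p_x_and_y / p_y                   # from equation 1
p_y_given_x_direct = p_x_and_y / p_x            # from equation 2
p_y_given_x_bayes = p_x_given_y * p_y / p_x     # Bayes theorem

print(round(p_y_given_x_direct, 3), round(p_y_given_x_bayes, 3))   # 0.3 0.3 -- both routes agree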
While studying the Bayes theorem, we need to understand a few important concepts. These are as
follows:
1. Experiment
An experiment is defined as a planned operation carried out under controlled conditions, such as
tossing a coin, drawing a card, or rolling a die.
2. Sample Space
The results we get during an experiment are called possible outcomes, and the set of all
possible outcomes of an experiment is known as the sample space. For example, if we are rolling a
die, the sample space will be:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment involves tossing a coin and recording its outcome, then the sample space
will be:
S2 = {Head, Tail}
3. Event
An event is defined as a subset of the sample space in an experiment. Further, it is also called a set of
outcomes.
Assume that in our experiment of rolling a die, there are two events A and B.
o Disjoint Events: If the intersection of events A and B is an empty set (null), then such
events are known as disjoint events or mutually exclusive events.
4. Random Variable:
It is a real-valued function which maps the sample space of an experiment to the real line. A
random variable takes on some random values, each having some probability. However, it is
neither random nor a variable; it behaves as a function which can be either discrete, continuous,
or a combination of both.
5. Exhaustive Event:
As the name suggests, a set of events in which at least one event occurs at a time is called an
exhaustive event of an experiment.
Thus, two events A and B are said to be exhaustive if either A or B definitely occurs, and both are
mutually exclusive; for example, while tossing a coin, the result will be either a Head or a Tail.
6. Independent Event:
Two events are said to be independent when the occurrence of one event does not affect the
occurrence of the other. In simple words, we can say that the probability of the outcome of one
event does not depend on the other.
7. Conditional Probability:
Conditional probability is defined as the probability of an event A, given that another event
B has already occurred (i.e., A conditional on B). This is represented by P(A|B) and we can define it
as:
P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0
8. Marginal Probability:
Marginal probability is defined as the probability of an event A occurring irrespective of any other
event B; it is denoted simply by P(A).
Bayes theorem helps us to calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A).
This rule is very helpful in scenarios where we have good estimates of P(A|B), P(B), and
P(A) and need to determine the fourth term.
Naïve Bayes classifier is one of the simplest applications of Bayes theorem, used in
classification algorithms to separate data into classes with good speed and accuracy.
Let's understand the use of Bayes theorem in machine learning with the example below.
Suppose we are given a vector A of attribute values and a set of n classes C1, C2, …, Cn. These are
the two things given to us, and our classifier, which works on machine learning, has to choose the
best possible class for A. With the help of Bayes theorem, we can write this as:
P(Ci|A) = P(A|Ci) * P(Ci) / P(A)
Here:
P(A) will remain constant for every class, meaning it does not change its value with respect to a
change in class. Therefore, to maximize P(Ci|A), we have to maximize the value of the term
P(A|Ci) * P(Ci).
With n classes on the probability list, let's assume that the possibility of any class being the
right answer is equally likely. Considering this factor, we can say that:
P(C1) = P(C2) = P(C3) = P(C4) = ….. = P(Cn)
This process helps us to reduce the computation cost as well as time. This is how Bayes theorem
plays a significant role in machine learning, and the Naïve Bayes classifier has simplified
conditional probability tasks without affecting the precision. Hence, we can conclude that:
P(Ci|A) ∝ P(A|Ci)
Hence, by using Bayes theorem in Machine Learning we can easily describe the possibilities of
smaller events.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to the
weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the Player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the weather conditions:
Weather No Yes P(Weather)
Overcast 0 5 5/14=0.35
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.35
All 4/14=0.29 10/14=0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny). Hence, on a sunny day,
the Player can play the game.
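The frequency counts and both posteriors can be reproduced with a short Python sketch; the two
lists below are transcribed from the dataset table above (exact fractions are used, so P(No|Sunny)
comes out as 0.40 rather than the rounded 0.41):

# Naive Bayes posterior for Outlook = Sunny, computed directly from the dataset
from collections import Counter

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play =    ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
play_counts = Counter(play)          # {'Yes': 10, 'No': 4}

def posterior(label, weather="Sunny"):
    """P(label | weather) = P(weather | label) * P(label) / P(weather)."""
    p_weather_given_label = sum(1 for o, p in zip(outlook, play)
                                if o == weather and p == label) / play_counts[label]
    p_label = play_counts[label] / n
    p_weather = outlook.count(weather) / n
    return p_weather_given_label * p_label / p_weather

print(round(posterior("Yes"), 2))    # 0.6 -> P(Yes | Sunny)
print(round(posterior("No"), 2))     # 0.4 -> P(No | Sunny), so play on a sunny day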
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.