Foundation of Data Science/22CSC202- Learning Materials Unit II

School of Physical Sciences

Department of Mathematics

Course Materials

Course Name : Foundation of Data Science

Course Code : 22CSC202

Programme Name : Int. M.Sc. Data Science

Year : II

Semester : III

Course Coordinator : Dr. P. Sriramakrishnan


Syllabus
Unit I
Introduction, Causality and Experiments, Data Preprocessing: Data cleaning, Data reduction, Data
transformation, Data discretization. Visualization and Graphing: Visualizing Categorical
Distributions, Visualizing Numerical Distributions, Overlaid Graphs, plots, and summary statistics of
exploratory data analysis
Unit-II
Randomness, Probability, Sampling, Sample Means and Sample Sizes
Unit-III
Introduction to Statistics, Descriptive statistics – Central tendency, dispersion, variance, covariance,
kurtosis, five point summary, Distributions, Bayes Theorem, Error Probabilities; Permutation
Testing
Unit-IV
Statistical Inference; Hypothesis Testing, Assessing Models, Decisions and Uncertainty, Comparing
Samples, A/B Testing, P-Values, Causality.
Unit-V
Estimation, Prediction, Confidence Intervals, Inference for Regression, Classification, Graphical
Models, Updating Predictions.
Text Books:

1. Ani Adhikari and John DeNero, “Computational and Inferential Thinking: The Foundations of Data Science”, e-book.
Reference Books:

1. Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel and Kenneth C. Lichtendahl Jr., “Data Mining for Business Analytics: Concepts, Techniques and Applications in R”, Wiley India, 2018.

2. Rachel Schutt and Cathy O’Neil, “Doing Data Science”, O’Reilly, First Edition, 2013.

Unit II

1. Randomness

• A random sequence of events, symbols or steps often has no order and does not follow an intelligible pattern or combination.

• A process is random if it is unpredictable.

Random Variable

A random variable is a rule that assigns a numerical value to each outcome in a sample space.
Random variables may be either discrete or continuous.

Discrete Random Variable

A discrete random variable takes a countable number of distinct values, such as 0, 1, 2, 3, 4, … The set of values may be finite or countably infinite. Consider an experiment where a coin is tossed two times. If X represents the number of times that the coin comes up heads, then X is a discrete random variable that can only have the values 0, 1, or 2 (from no heads in two successive coin tosses to all heads). No other value is possible for X.
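The two-toss experiment can be simulated to see which values X takes and how often. The following is a minimal sketch; the helper name `count_heads`, the seed, and the 10,000 repetitions are illustrative assumptions, not part of the course material.

```python
import random
from collections import Counter

def count_heads(n_tosses=2):
    """Toss a fair coin n_tosses times and return the number of heads (X)."""
    return sum(random.choice("HT") == "H" for _ in range(n_tosses))

random.seed(0)   # fixed seed so the run is repeatable
trials = 10_000  # illustrative number of repetitions
counts = Counter(count_heads() for _ in range(trials))

# Relative frequency of each observed value of X
for x in sorted(counts):
    print(x, counts[x] / trials)
```

For a fair coin the relative frequencies should come out close to 1/4, 1/2 and 1/4 for X = 0, 1 and 2.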

Continuous Random Variable

Continuous random variables can represent any value within a specified range or interval and can
take on an infinite number of possible values. An example of a continuous random variable
would be an experiment that involves measuring the amount of rainfall in a city over a year or
the average height of a random group of 25 people.

Applications

Gaming: Lottery

Cryptography: Random numbers are crucial for the digital encryption of passwords, browsers
and other online and digital data.

Cryptocurrency wallets: Seed phrases generated from random numbers under the BIP39 standard are used to derive the encryption keys for cryptocurrency wallets.

Simulations: Pseudorandom number sequences are used to test and re-run simulations, such as
Monte Carlo simulations, to sample something for the estimation of unknown ratios and areas.

Machine learning (ML): Random numbers and ML model-free learning frameworks such as
domain randomization (DR) are used in many real-world applications, including robotic vacuum
cleaners and the OpenAI hand dexterity learning project.

Computing: Random numbers are important for TCP/IP sequence numbers, Transport Layer
Security nonces, password salts and DNS source port numbers.
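As one illustration of the simulation use case, the sketch below estimates an area (and from it the value of π) by Monte Carlo sampling. The function name `estimate_pi`, the default seed, and the sample size are assumptions chosen for the example.

```python
import random

def estimate_pi(n_points, seed=42):
    """Estimate pi from the fraction of random points in the unit square
    that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # seeded generator so the simulation can be re-run
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of quarter circle) / (area of square) = pi / 4
    return 4 * inside / n_points

print(estimate_pi(100_000))
```

Because the generator is seeded, calling the function again with the same arguments reproduces the same estimate, which is what makes such simulations repeatable.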

2. Randomness in Python

Python defines a set of functions that are used to generate or manipulate random numbers
through the random module.

Functions in the random module rely on a pseudo-random number generator function random(), which generates a random float in the half-open interval [0.0, 1.0). Functions of this type are used in games, lotteries, and any other application that requires random number generation.

2.1 Discrete Random variables

Generating a Random number using choice()

import random

# prints a random value from the list

list1 = [1, 2, 3, 4, 5, 6]

print(random.choice(list1))

# prints a random item from the string

string = "striver"

print(random.choice(string))

Output: (varies from run to run: one element of list1, then one character of the string)

Shuffling a sequence using shuffle()

The shuffle() function is used to shuffle a mutable sequence such as a list. Shuffling means changing the position of the elements of the sequence. Here, the shuffling operation is done in place: the original list is modified and the function returns None.

# import the random module

import random

# declare a list

sample_list = ['A', 'B', 'C', 'D', 'E']

print("Original list : ")

print(sample_list)

# first shuffle

random.shuffle(sample_list)

print("\nAfter the first shuffle : ")

print(sample_list)

# second shuffle

random.shuffle(sample_list)

print("\nAfter the second shuffle : ")

print(sample_list)

Output:

Original list :

['A', 'B', 'C', 'D', 'E']

After the first shuffle :

['A', 'B', 'E', 'C', 'D']

After the second shuffle :

['C', 'E', 'B', 'D', 'A']

2.2 Continuous Random Variables

Example:

import random

num = random.random()

print(num)

Output:

0.30078080420602904

Seeding the generator using seed()

The random.seed() function initializes the pseudo-random number generator. Seeding with the same value makes the generator produce the same sequence of random numbers on multiple executions of the code, whether on the same machine or on different machines. If no seed is given, Python seeds the generator from the current system time or from a randomness source provided by the operating system.

# importing "random" for random operations

import random

# using random() to generate a random number between 0 and 1

print("A random number between 0 and 1 is : ", end="")

print(random.random())

# using seed() to seed a random number

random.seed(5)

# printing mapped random number

print("The mapped random number with 5 is : ", end="")

print(random.random())

# using seed() to seed different random number

random.seed(7)

# printing mapped random number

print("The mapped random number with 7 is : ", end="")

print(random.random())

# using seed() to seed to 5 again

random.seed(5)

# printing mapped random number

print("The mapped random number with 5 is : ", end="")

print(random.random())

# using seed() to seed to 7 again

random.seed(7)

# printing mapped random number

print("The mapped random number with 7 is : ", end="")

print(random.random())

Output:

A random number between 0 and 1 is : 0.510721762520941

The mapped random number with 5 is : 0.6229016948897019

The mapped random number with 7 is : 0.32383276483316237

The mapped random number with 5 is : 0.6229016948897019

The mapped random number with 7 is : 0.32383276483316237

Generating a Random Number using randrange()

The random module offers a function that can generate Python random numbers from a specified
range and also allows room for steps to be included, called randrange().

# importing "random" for random operations

import random

# using randrange() to generate a number in the range 20 to 49. The last argument 3 is the

# step size, so the candidate values are 20, 23, 26, ..., 47.

print("A random number from range is : ", end="")

print(random.randrange(20, 50, 3))

Output:

A random number from range is : 41

Problem: A website requires a password or captcha of length 10. Each character may be a letter
(capital or small), number, or special characters @ and #. Write a program in Python to generate
one such password/captcha using randint() or choice().

Code:

import random

valid_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
valid_chars += valid_chars.lower() + "0123456789" + "@#"

def generate_password(length):
    password = ""
    counter = 0
    while counter < length:
        # draw a random ASCII character
        # (alternative: rchar = random.choice(valid_chars))
        rchar = chr(random.randint(0, 127))
        # keep the character only if it is allowed
        if rchar in valid_chars:
            password += rchar
            counter += 1
    return password

print(generate_password(10))
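The problem statement also allows choice(). A minimal sketch of that variant follows; the names `VALID_CHARS` and `generate_password_choice` are illustrative, and the character set is built from the standard `string` module instead of being typed out.

```python
import random
import string

# Letters (both cases), digits, and the two allowed special characters
VALID_CHARS = string.ascii_letters + string.digits + "@#"

def generate_password_choice(length=10):
    """Pick each character directly from the allowed set, so no draw is rejected."""
    return "".join(random.choice(VALID_CHARS) for _ in range(length))

print(generate_password_choice())
```

This avoids the rejection loop because every draw is valid by construction. For passwords that must resist guessing, the standard library's `secrets.choice` is the more appropriate source of randomness; the `random` module is intended for simulation, not security.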

Random Variables and Probability Distributions:

Concept of a Random Variable:


• In a statistical experiment, it is often very important to
allocate numerical values to the outcomes.
Example:
• Experiment: testing two components.
(D=defective, N=non-defective)
• Sample space: S={DD,DN,ND,NN}
• Let X = number of defective components when two
components are tested.
• Assigned numerical values to the outcomes are:

Sample point (Outcome)    Assigned Numerical Value (x)
DD                        2
DN                        1
ND                        1
NN                        0

• Notice that the set of all possible values of the random variable X is {0, 1, 2}.
Definition 3.1:
A random variable X is a function that associates each element in the sample space with a real number (i.e., $X : S \to \mathbb{R}$).

Notation: "X" denotes the random variable; "x" denotes a value of the random variable X.
Types of Random Variables:
• A random variable X is called a discrete random variable if its set of possible values is countable, i.e., $x \in \{x_1, x_2, \dots, x_n\}$ or $x \in \{x_1, x_2, \dots\}$.
• A random variable X is called a continuous random variable if it can take values on a continuous scale, i.e., $x \in \{x : a < x < b;\ a, b \in \mathbb{R}\}$.
Discrete Probability Distributions
•A discrete random variable X assumes each of its values with a
certain probability

Example 3.3:
A shipment of 8 similar microcomputers
to a retail outlet contains 3 that are
defective and 5 are non-defective.
If a school makes a random purchase of 2
of these computers, find the probability
distribution of the number of defectives.
Solution:
We need to find the probability distribution of the random
variable: X = the number of defective computers purchased.
Experiment: selecting 2 computers at random out of 8.

$$n(S) = \binom{8}{2} = 28 \text{ equally likely outcomes}$$

The possible values of X are: x = 0, 1, 2.
Consider the events:

$$(X=0) = \{0D \text{ and } 2N\} \Rightarrow n(X=0) = \binom{3}{0}\binom{5}{2}$$
$$(X=1) = \{1D \text{ and } 1N\} \Rightarrow n(X=1) = \binom{3}{1}\binom{5}{1}$$
$$(X=2) = \{2D \text{ and } 0N\} \Rightarrow n(X=2) = \binom{3}{2}\binom{5}{0}$$

$$f(0) = P(X=0) = \frac{n(X=0)}{n(S)} = \frac{\binom{3}{0}\binom{5}{2}}{\binom{8}{2}} = \frac{10}{28}$$
$$f(1) = P(X=1) = \frac{n(X=1)}{n(S)} = \frac{\binom{3}{1}\binom{5}{1}}{\binom{8}{2}} = \frac{15}{28}$$
$$f(2) = P(X=2) = \frac{n(X=2)}{n(S)} = \frac{\binom{3}{2}\binom{5}{0}}{\binom{8}{2}} = \frac{3}{28}$$

In general, for x = 0, 1, 2, we have:

$$f(x) = P(X=x) = \frac{n(X=x)}{n(S)} = \frac{\binom{3}{x}\binom{5}{2-x}}{\binom{8}{2}}$$

The probability distribution of X can be given in the following table:

x                 0       1       2       Total
f(x) = P(X=x)     10/28   15/28   3/28    1.00

The probability distribution of X can be written as a formula as follows (Hypergeometric Distribution):

$$f(x) = P(X=x) = \begin{cases} \dfrac{\binom{3}{x}\binom{5}{2-x}}{\binom{8}{2}} ; & x = 0, 1, 2 \\ 0 ; & \text{otherwise} \end{cases}$$
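The counting formula for this example can be evaluated directly in Python. The sketch below is illustrative only: the function name `defective_pmf` and its parameter names are assumptions, and exact fractions are used so the results can be compared with the table.

```python
from fractions import Fraction
from math import comb

def defective_pmf(x, defective=3, good=5, drawn=2):
    """P(X = x): probability of x defectives when `drawn` items are
    selected at random from `defective + good` items."""
    if x < 0 or x > drawn:
        return Fraction(0)  # impossible values have probability 0
    ways = comb(defective, x) * comb(good, drawn - x)
    return Fraction(ways, comb(defective + good, drawn))

for x in range(3):
    print(x, defective_pmf(x))
```

Note that `Fraction` prints values in lowest terms, so 10/28 appears as 5/14. The three probabilities sum to 1, as a probability distribution must.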
3.3. Continuous Probability Distributions
For any continuous random variable, X, there exists a non-
negative function f(x), called the probability density function
(p.d.f) through which we can find probabilities of events
expressed in term of X.

$$P(a < X < b) = \int_a^b f(x)\,dx = \text{area under the curve of } f(x) \text{ over the interval } (a, b)$$

$$P(X \in A) = \int_A f(x)\,dx = \text{area under the curve of } f(x) \text{ over the region } A$$

where $f : \mathbb{R} \to [0, \infty)$.

Definition 3.6:
The function f(x) is a probability density function (pdf) for a continuous random variable X, defined on the set of real numbers, if:
1. $f(x) \ge 0 \quad \forall\, x \in \mathbb{R}$
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
3. $P(a \le X \le b) = \int_a^b f(x)\,dx \quad \forall\, a, b \in \mathbb{R};\ a \le b$
Note:
For a continuous random variable X, we have:
1. $f(x) \ne P(X=x)$ (in general)
2. $P(X=a) = 0$ for any $a \in \mathbb{R}$
3. $P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b)$
4. $P(X \in A) = \int_A f(x)\,dx$

[Figure: areas under the pdf curve]
Total area $= \int_{-\infty}^{\infty} f(x)\,dx = 1$
area $= P(a \le X \le b) = \int_a^b f(x)\,dx$
area $= P(X \ge b) = \int_b^{\infty} f(x)\,dx$
area $= P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$

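Example 3.6:
Suppose that the error in the reaction temperature, in °C, for a controlled laboratory experiment is a continuous random variable X having the following probability density function:

$$f(x) = \begin{cases} \frac{1}{3}x^2 ; & -1 < x < 2 \\ 0 ; & \text{elsewhere} \end{cases}$$

1. Verify that: (a) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
2. Find $P(0 < X \le 1)$

Solution:
X = the error in the reaction temperature in °C.
X is a continuous random variable with

$$f(x) = \begin{cases} \frac{1}{3}x^2 ; & -1 < x < 2 \\ 0 ; & \text{elsewhere} \end{cases}$$

1. (a)

$$\begin{aligned}
\int_{-\infty}^{\infty} f(x)\,dx &= \int_{-\infty}^{-1} 0\,dx + \int_{-1}^{2} \tfrac{1}{3}x^2\,dx + \int_{2}^{\infty} 0\,dx \\
&= \int_{-1}^{2} \tfrac{1}{3}x^2\,dx = \left[\tfrac{1}{9}x^3\right]_{x=-1}^{x=2} \\
&= \tfrac{1}{9}\,(8 - (-1)) = 1
\end{aligned}$$

2.

$$\begin{aligned}
P(0 < X \le 1) &= \int_0^1 f(x)\,dx = \int_0^1 \tfrac{1}{3}x^2\,dx \\
&= \left[\tfrac{1}{9}x^3\right]_{x=0}^{x=1} = \tfrac{1}{9}\,(1 - 0) = \tfrac{1}{9}
\end{aligned}$$
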
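Both results of Example 3.6 can be checked numerically. The sketch below uses a simple midpoint rule; the helper `integrate` and the number of sub-intervals are assumptions made for illustration, not a method used in the course material.

```python
def f(x):
    """pdf of Example 3.6: x^2 / 3 on (-1, 2), zero elsewhere."""
    return x * x / 3 if -1 < x < 2 else 0.0

def integrate(func, a, b, n=100_000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

total = integrate(f, -1, 2)  # should be close to 1
prob = integrate(f, 0, 1)    # should be close to 1/9
print(total, prob)
```

The numerical values agree with the exact answers 1 and 1/9 up to the small error of the approximation.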
Mathematical Expectation:

4.1 Mean of a Random Variable:


Definition 4.1:
Let X be a random variable with a probability distribution f(x). The mean (or expected value) of X is denoted by $\mu_X$ (or E(X)) and is defined by:

$$E(X) = \mu_X = \begin{cases} \sum\limits_{\text{all } x} x\,f(x) ; & \text{if X is discrete} \\ \int_{-\infty}^{\infty} x\,f(x)\,dx ; & \text{if X is continuous} \end{cases}$$

Example 4.1: (Reading Assignment)

Example: (Example 3.3)


A shipment of 8 similar microcomputers to a retail outlet
contains 3 that are defective and 5 are non-defective. If a school
makes a random purchase of 2 of these computers, find the
expected number of defective computers purchased
Solution:
Let X = the number of defective computers purchased.
In Example 3.3, we found that the probability distribution of X is:

x                 0       1       2
f(x) = P(X=x)     10/28   15/28   3/28

or:

$$f(x) = P(X=x) = \begin{cases} \dfrac{\binom{3}{x}\binom{5}{2-x}}{\binom{8}{2}} ; & x = 0, 1, 2 \\ 0 ; & \text{otherwise} \end{cases}$$

The expected value of the number of defective computers purchased is the mean (or the expected value) of X, which is:

$$\begin{aligned}
E(X) = \mu_X &= \sum_{x=0}^{2} x\,f(x) \\
&= (0)\,f(0) + (1)\,f(1) + (2)\,f(2) \\
&= (0)\tfrac{10}{28} + (1)\tfrac{15}{28} + (2)\tfrac{3}{28} \\
&= \tfrac{15}{28} + \tfrac{6}{28} = \tfrac{21}{28} = 0.75 \text{ (computers)}
\end{aligned}$$
Example 4.3:
Let X be a continuous random variable that represents the life (in hours) of a certain electronic device. The pdf of X is given by:

$$f(x) = \begin{cases} \dfrac{20{,}000}{x^3} ; & x > 100 \\ 0 ; & \text{elsewhere} \end{cases}$$

Find the expected life of this type of device.
Solution:

$$\begin{aligned}
E(X) = \mu_X &= \int_{-\infty}^{\infty} x\,f(x)\,dx = \int_{100}^{\infty} x\,\frac{20000}{x^3}\,dx \\
&= 20000 \int_{100}^{\infty} \frac{1}{x^2}\,dx = 20000 \left[-\frac{1}{x}\right]_{x=100}^{x=\infty} \\
&= -20000 \left[0 - \frac{1}{100}\right] = 200 \text{ (hours)}
\end{aligned}$$

Therefore, we expect this type of electronic device to last, on average, 200 hours.

Theorem 4.1:
Let X be a random variable with a probability distribution f(x), and let g(X) be a function of the random variable X. The mean (or expected value) of the random variable g(X) is denoted by $\mu_{g(X)}$ (or E[g(X)]) and is defined by:

$$E[g(X)] = \mu_{g(X)} = \begin{cases} \sum\limits_{\text{all } x} g(x)\,f(x) ; & \text{if X is discrete} \\ \int_{-\infty}^{\infty} g(x)\,f(x)\,dx ; & \text{if X is continuous} \end{cases}$$
Example:
Let X be a discrete random variable with the following probability distribution

x       0       1       2
f(x)    10/28   15/28   3/28

Find E[g(X)], where $g(X) = (X-1)^2$.
Solution:
$g(X) = (X-1)^2$

$$\begin{aligned}
E[g(X)] = \mu_{g(X)} &= \sum_{x=0}^{2} g(x)\,f(x) = \sum_{x=0}^{2} (x-1)^2 f(x) \\
&= (0-1)^2 f(0) + (1-1)^2 f(1) + (2-1)^2 f(2) \\
&= (-1)^2\,\tfrac{10}{28} + (0)^2\,\tfrac{15}{28} + (1)^2\,\tfrac{3}{28} \\
&= \tfrac{10}{28} + 0 + \tfrac{3}{28} = \tfrac{13}{28}
\end{aligned}$$
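The discrete case of the expectation formula translates directly into a short function. This is a minimal sketch under the assumption that the distribution is stored as a dictionary mapping each value x to f(x); the names `pmf` and `expectation` are illustrative.

```python
from fractions import Fraction

# Probability distribution of the number of defective computers (Example 3.3)
pmf = {0: Fraction(10, 28), 1: Fraction(15, 28), 2: Fraction(3, 28)}

def expectation(dist, g=lambda x: x):
    """Return E[g(X)], the sum of g(x) * f(x) over all x (discrete case)."""
    return sum(g(x) * p for x, p in dist.items())

mean = expectation(pmf)                            # E(X)
second = expectation(pmf, lambda x: (x - 1) ** 2)  # E[(X - 1)^2]
print(mean, second)
```

With exact fractions the results are 3/4 (that is, 0.75) and 13/28, matching the hand calculations.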
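Example:
In Example 4.3, find $E\left(\frac{1}{X}\right)$. {note: $g(X) = \frac{1}{X}$}
Solution:

$$f(x) = \begin{cases} \dfrac{20{,}000}{x^3} ; & x > 100 \\ 0 ; & \text{elsewhere} \end{cases} \qquad g(X) = \frac{1}{X}$$

$$\begin{aligned}
E\left(\frac{1}{X}\right) = E[g(X)] = \mu_{g(X)} &= \int_{-\infty}^{\infty} g(x)\,f(x)\,dx = \int_{-\infty}^{\infty} \frac{1}{x}\,f(x)\,dx \\
&= \int_{100}^{\infty} \frac{1}{x}\,\frac{20000}{x^3}\,dx = 20000 \int_{100}^{\infty} \frac{1}{x^4}\,dx \\
&= \frac{20000}{-3} \left[\frac{1}{x^3}\right]_{x=100}^{x=\infty} = \frac{-20000}{3} \left[0 - \frac{1}{1000000}\right] = 0.0067
\end{aligned}$$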
4.2 Variance (of a Random Variable):
The most important measure of variability of a random variable X is called the variance of X and is denoted by Var(X) or $\sigma_X^2$.

Definition 4.3:
Let X be a random variable with a probability distribution f(x) and mean µ. The variance of X is defined by:

$$\mathrm{Var}(X) = \sigma_X^2 = E[(X-\mu)^2] = \begin{cases} \sum\limits_{\text{all } x} (x-\mu)^2 f(x) ; & \text{if X is discrete} \\ \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx ; & \text{if X is continuous} \end{cases}$$

Definition:
The positive square root of the variance of X, $\sigma_X = \sqrt{\sigma_X^2}$, is called the standard deviation of X.

Note:
Var(X) = E[g(X)], where $g(X) = (X-\mu)^2$

Theorem 4.2:
The variance of the random variable X is given by:

$$\mathrm{Var}(X) = \sigma_X^2 = E(X^2) - \mu^2$$

where

$$E(X^2) = \begin{cases} \sum\limits_{\text{all } x} x^2 f(x) ; & \text{if X is discrete} \\ \int_{-\infty}^{\infty} x^2 f(x)\,dx ; & \text{if X is continuous} \end{cases}$$
Example 4.9:
Let X be a discrete random variable with the following probability distribution

x       0      1      2      3
f(x)    0.51   0.38   0.10   0.01

Find $\mathrm{Var}(X) = \sigma_X^2$.
Solution:

$$\begin{aligned}
\mu = \sum_{x=0}^{3} x\,f(x) &= (0)\,f(0) + (1)\,f(1) + (2)\,f(2) + (3)\,f(3) \\
&= (0)(0.51) + (1)(0.38) + (2)(0.10) + (3)(0.01) = 0.61
\end{aligned}$$

1. First method:

$$\begin{aligned}
\mathrm{Var}(X) = \sigma_X^2 &= \sum_{x=0}^{3} (x-\mu)^2 f(x) = \sum_{x=0}^{3} (x-0.61)^2 f(x) \\
&= (0-0.61)^2 f(0) + (1-0.61)^2 f(1) + (2-0.61)^2 f(2) + (3-0.61)^2 f(3) \\
&= (-0.61)^2 (0.51) + (0.39)^2 (0.38) + (1.39)^2 (0.10) + (2.39)^2 (0.01) \\
&= 0.4979
\end{aligned}$$

2. Second method:

$$\mathrm{Var}(X) = \sigma_X^2 = E(X^2) - \mu^2$$

$$\begin{aligned}
E(X^2) = \sum_{x=0}^{3} x^2 f(x) &= (0^2)\,f(0) + (1^2)\,f(1) + (2^2)\,f(2) + (3^2)\,f(3) \\
&= (0)(0.51) + (1)(0.38) + (4)(0.10) + (9)(0.01) = 0.87
\end{aligned}$$

$$\mathrm{Var}(X) = \sigma_X^2 = E(X^2) - \mu^2 = 0.87 - (0.61)^2 = 0.4979$$
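The two methods of Example 4.9 can be compared in a few lines of Python. The sketch below stores the distribution in a dictionary named `pmf` (an illustrative choice) and uses ordinary floating-point arithmetic, so the two results agree only up to rounding.

```python
# Probability distribution from Example 4.9
pmf = {0: 0.51, 1: 0.38, 2: 0.10, 3: 0.01}

# Mean: sum of x * f(x)
mu = sum(x * p for x, p in pmf.items())

# First method: Var(X) = sum of (x - mu)^2 * f(x)
var_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Second method: Var(X) = E(X^2) - mu^2
e_x2 = sum(x ** 2 * p for x, p in pmf.items())
var_shortcut = e_x2 - mu ** 2

print(round(mu, 4), round(var_definition, 4), round(var_shortcut, 4))
```

Both variance values round to 0.4979, and the square root of either gives the standard deviation.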
Example 4.10:
Let X be a continuous random variable with the following pdf:

$$f(x) = \begin{cases} 2(x-1) ; & 1 < x < 2 \\ 0 ; & \text{elsewhere} \end{cases}$$

Find the mean and the variance of X.
Solution:

$$\mu = E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx = \int_1^2 x\,[2(x-1)]\,dx = 2\int_1^2 x(x-1)\,dx = \frac{5}{3}$$

$$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_1^2 x^2\,[2(x-1)]\,dx = 2\int_1^2 x^2 (x-1)\,dx = \frac{17}{6}$$

$$\mathrm{Var}(X) = \sigma_X^2 = E(X^2) - \mu^2 = \frac{17}{6} - \left(\frac{5}{3}\right)^2 = \frac{1}{18}$$
3. Probability

Probability is a way to quantify how likely something is to happen. Many occurrences are difficult to predict with 100 percent accuracy; we can only estimate the chance of an event occurring. The values range from 0 to 1, with 0 denoting impossibility and 1 denoting certainty. When all outcomes are equally likely, we calculate the probability by dividing the number of favorable outcomes by the total number of outcomes.

Probability Formula

The formula defines the possibility that an event will occur. It is defined as the ratio of the
number of favorable outcomes to the total number of outcomes.

The probability formula can be expressed as:

P(E) = Number of favorable outcomes / Total number of outcomes

where P(E) denotes the probability of an event E.

Solved Examples

Example 1. There are 8 balls in a container, 4 are red, 1 is yellow and 3 are blue. What is
the probability of picking a yellow ball?

Answer:

The probability is equal to the number of yellow balls in the container divided by the total
number of balls in the container, i.e. 1/8.

Example 2: A die is rolled. What is the probability that an even number has been obtained?

Answer:

When a fair six-sided die is rolled, there are six possible outcomes: 1, 2, 3, 4, 5, or 6.

Out of these, half are even (2, 4, 6) and half are odd (1, 3, 5). Therefore, the probability of
getting an even number is:

P(even) = number of even outcomes / total number of outcomes

P(even) = 3 / 6

P(even) = 1/2

Applications of Probability –Coin Toss Probability

Now let us take into account the case of coin tossing to understand probability in a better way.

Tossing a Coin

A single coin when flipped has two possible outcomes, a head or a tail. The definition of probability can be applied here to find the probability of getting a head or getting a tail.

The total number of possible outcomes = 2

Sample Space = {H, T} H: Head, T: Tail

P(H) = Number of Heads/ Total Number of outcomes = 1/2

P(T) = Number of Tails/ Total Number of outcomes = 1/2

Tossing Two Coins

There are a total of four possible results when tossing two coins. We can calculate the probability
of two heads or two tails using the formula.

The probability of getting two tails can be calculated as :

Total number of outcomes = 4

Sample Space = {(H,H),(H,T),(T,H),(T,T)}

P(2T) = P(0 H) = Number of outcomes with two tails/ Total number of outcomes = 1/4

P(1H) = P(1T) = Number of outcomes with one head/ Total number of outcomes = 2/4 = 1/2

Tossing Three Coins

The number of total outcomes on tossing three coins is 2^3 = 8. For these outcomes, we can find the probability of various cases such as getting one tail, two tails, three tails, or no tails; the probabilities for the number of heads can be calculated similarly.

Total number of outcomes = 2^3 = 8

Sample space = {(H, H, H), (H, H, T),(H, T, H), (T, H, H), (T, T, H), (T, H, T), (H, T, T), (T, T,
T)}.

P(3T) = P(0 H) = Number of outcomes with three tails/ Total Number of outcomes = 1/8

Dice Roll Probability

Various games use dice to decide the movements of the player during the games. A die has six outcomes. Some games are played using two dice. Now let us calculate the outcomes and their probabilities for one die and two dice respectively.

Rolling One Die

The number of outcomes is 6 when a die is rolled, and the sample space is {1,2,3,4,5,6}.

Let us now discuss some cases

P(Even Number) = Number of outcomes in which even number occur/Total Outcomes = 3/6 =
1/2

P(Prime Number) = Number of prime number outcomes/ Total Outcomes = 3/6 = 1/2

Rolling Two Dice

The number of total outcomes, when two dice are rolled, is 6^2 = 36.

Let us check a few cases in the above example,

Probability of getting a doublet (same number on both dice) = 6/36 = 1/6

Probability of getting a number 3 on at least one die = 11/36
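The two-dice probabilities can be verified by listing the whole sample space and counting favorable outcomes. This is a minimal sketch; the helper name `probability` and the use of lambda functions to describe events are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (first die, second die)
rolls = list(product(range(1, 7), repeat=2))

def probability(event):
    """Number of favorable outcomes divided by the total number of outcomes."""
    favorable = sum(1 for roll in rolls if event(roll))
    return Fraction(favorable, len(rolls))

p_doublet = probability(lambda r: r[0] == r[1])  # same number on both dice
p_three = probability(lambda r: 3 in r)          # a 3 on at least one die
print(len(rolls), p_doublet, p_three)
```

The enumeration gives 36 outcomes, 1/6 for a doublet and 11/36 for at least one 3, in agreement with the values above.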

4. Probability using Python

To write a program for a probability question, there are two basic steps:

First, collect a set of items/events.

Then, write a function to solve the problem.


Example

Let’s take the example of choosing one number out of a set of numbers. We can find different
probabilities for this particular problem; for instance:

i. selecting an even number

ii. selecting an odd number

iii. selecting a prime number

iv. selecting a number divisible by 3 and 5

Assigning values to list

set_of_numbers = []
n = int(input("enter the number of elements: "))
print("enter the elements: ")
for i in range(0, n):
    j = int(input())
    set_of_numbers.append(j)

# print the set of numbers
print(" the set of numbers are: ", set_of_numbers)

Output

enter the number of elements: 5

enter the elements:

10

20

30

45

59

the set of numbers are: [10, 20, 30, 45, 59]

i. Checking Probability of Odd numbers in the list

def probability_of_odd(a):
    c = 0
    # here the set of numbers is the list a
    total = len(a)
    for i in a:
        # checks whether the number is odd
        if i % 2 != 0:
            c += 1
    # returns the number of odd numbers / total outcomes
    return c / total

print(probability_of_odd(set_of_numbers))

Output: 0.4

ii. Checking probability of even numbers in the list

def probability_of_even(a):
    c = 0
    # here the set of numbers is the list a
    total = len(a)
    for i in a:
        # checks whether the number is even
        if i % 2 == 0:
            c += 1
    # returns the number of even numbers / total outcomes
    return c / total

print(probability_of_even(set_of_numbers))

Output: 0.6

iii. Checking probability of prime numbers in the list

def probability_of_prime(a):
    c = 0
    # here the set of numbers is the list a
    total = len(a)
    for i in a:
        # check whether i is prime: it must exceed 1 and have no divisor in 2..i-1
        if i > 1:
            c2 = 0
            for j in range(2, i):
                if i % j == 0:
                    c2 += 1
            if c2 == 0:
                c += 1
    # probability value:
    return c / total

print(probability_of_prime(set_of_numbers))

Output:0.2

iv. Checking the probability of list divisible by 3 and 5

def probability_divisible(a):
    c = 0
    # here the set of numbers is the list a
    total = len(a)
    for i in a:
        # check divisibility by both 3 and 5
        if i % 3 == 0 and i % 5 == 0:
            c += 1
    # probability value:
    return c / total

print(probability_divisible(set_of_numbers))

Output

0.4

Exercise 1: An urn contains 10 red balls, 10 green balls, 10 blue balls. 3 balls are randomly
selected from them. Write a program in Python for the same.

Exercise 2: In a lottery, 5 random numbers are generated from numbers in the range 1 to 20. The
same number cannot be repeated twice. Write a Python program for this.

Exercise 3: Write a Python program to shuffle a deck of cards and draw 5 cards from it.

5. Sampling
• The process or method of sample selection from the population.
• Sampling is a technique of selecting individual members or a subset of the population to make statistical inferences from them and estimate the characteristics of the whole population.
Example: suppose a drug manufacturer would like to research the adverse side effects of a drug
on the country’s population. In that case, it is almost impossible to conduct a research study that
involves everyone. In this case, the researcher decides on a sample of people from each
demographic and then researches them, giving him/her indicative feedback on the drug’s
behavior.

The Need for Sampling


- Reduced cost
- Greater speed
- Greater accuracy

- Greater scope
- More detailed information can be obtained.
6. Types of sampling:
Sampling in market action research is of two types – probability sampling and non-probability
sampling.
Probability sampling: Probability sampling is a technique in which the researcher chooses samples from a larger population using a method based on probability theory. For a participant to be considered part of a probability sample, he/she must be selected using random selection.
Non-probability sampling: In non-probability sampling, the researcher chooses members for research by a non-random method, such as convenience or personal judgment. There is no fixed or predefined selection process, so not all population elements have an equal opportunity to be included in a sample.

6.1 Probability sampling with examples:

Probability sampling is a technique in which researchers choose samples from a larger population based on the theory of probability. This sampling method considers every member of the population and forms samples based on a fixed process.

For example, in a population of 1000 members, every member will have a 1/1000 chance of being picked in any single random draw. Probability sampling reduces sampling bias and gives all members a chance to be included in the sample.

a. Simple random sampling

b. Cluster sampling

c. Systematic sampling

d. Stratified random sampling

a. Simple random sampling

Simple random sampling is one of the best probability sampling techniques for saving time and resources. It is a reliable method of obtaining information in which every member of the sample is chosen randomly, merely by chance. Each individual has the same probability of being chosen to be a part of the sample.

Example: in an organization of 500 employees, if the HR team decides on conducting team-building activities, they would likely prefer picking chits out of a bowl. In this case, each of the 500 employees has an equal opportunity of being selected.
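Picking chits out of a bowl corresponds to drawing without replacement, which `random.sample` does. In the sketch below the employee IDs, the team size of 20, and the seed are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical employee IDs EMP001 ... EMP500
employees = [f"EMP{i:03d}" for i in range(1, 501)]

rng = random.Random(1)                # seeded so the draw can be repeated
team = rng.sample(employees, k=20)    # 20 distinct employees, each equally likely

print(team[:5])
```

Because `sample` draws without replacement, no employee can be selected twice.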

b. Cluster sampling:

Cluster sampling is a method where the researchers divide the entire population into sections
or clusters representing a population. Clusters are identified and included in a sample based on
demographic parameters like age, sex, location, etc. This makes it very simple for a survey
creator to derive effective inferences from the feedback.

Example: To estimate the average annual household income in a large city, we can use cluster sampling, because simple random sampling would require a complete list of households in the city from which to sample. A less expensive way is to let each block within the city represent a cluster. A sample of clusters could then be randomly selected, and every household within these clusters could be interviewed to find the average annual household income.
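The block-based procedure can be sketched as follows. All data here are synthetic: the number of blocks, the household counts, and the income values are made-up assumptions used only to show the mechanics of selecting whole clusters.

```python
import random
from statistics import mean

rng = random.Random(7)

# Synthetic city: 50 blocks, each holding 20-40 households with made-up incomes
blocks = {
    b: [rng.randint(20_000, 90_000) for _ in range(rng.randint(20, 40))]
    for b in range(50)
}

# Step 1: randomly select a few clusters (blocks)
chosen_blocks = rng.sample(sorted(blocks), k=5)

# Step 2: survey every household inside the chosen clusters
surveyed = [income for b in chosen_blocks for income in blocks[b]]

estimate = mean(surveyed)  # estimated average household income
print(chosen_blocks, len(surveyed), round(estimate))
```

Only a list of blocks is needed to draw the sample; a complete list of households is required only within the blocks that were chosen.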

c. Systematic sampling:

Researchers use the systematic sampling method to choose the sample members of a population at regular intervals. It requires selecting a starting point and a fixed sampling interval, determined from the sample size, which is then applied repeatedly. Because the selection follows a predefined pattern, this sampling technique is among the least time-consuming.

Example: a researcher intends to collect a systematic sample of 500 people in a population of 5000. He/she numbers each element of the population from 1 to 5000 and chooses every 10th individual to be a part of the sample (Total population/ Sample Size = 5000/500 = 10).
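The every-10th-individual rule can be written with list slicing. This is a minimal sketch; the random starting point within the first interval and the seed are assumptions added for illustration.

```python
import random

population = list(range(1, 5001))      # individuals numbered 1 to 5000
sample_size = 500
k = len(population) // sample_size     # sampling interval: 5000 / 500 = 10

start = random.Random(3).randrange(k)  # random start among the first k positions
sample = population[start::k]          # every k-th individual from the start

print(k, len(sample), sample[:5])
```

Whatever the starting point, the slice returns exactly 500 individuals spaced 10 apart.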

d. Stratified random sampling:

Stratified random sampling is a method in which the researcher divides the population into smaller groups (strata) that don’t overlap but together represent the entire population. The researcher then draws a sample from each group separately.

Example: A researcher looking to analyze the characteristics of people belonging to different income divisions will create strata (groups) according to the monthly family income, e.g. less than 20,000, 21,000 – 30,000, 31,000 to 40,000, 41,000 to 50,000, etc. By doing this, the researcher can draw conclusions about the characteristics of people belonging to different income groups. Marketers can analyze which income groups to target and which ones to eliminate to create a roadmap that would bear fruitful results.
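A stratified draw can be sketched by grouping the population first and then sampling inside each group. The household data below are synthetic, and the stratum boundaries, the 10% sampling fraction, and all names are assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(11)

# Synthetic population: (household id, monthly family income)
households = [(i, rng.randint(5_000, 50_000)) for i in range(1000)]

def stratum(income):
    """Assign an income to one of four non-overlapping strata."""
    if income <= 20_000:
        return "up to 20,000"
    if income <= 30_000:
        return "20,001-30,000"
    if income <= 40_000:
        return "30,001-40,000"
    return "40,001-50,000"

# Group the households by stratum
strata = defaultdict(list)
for household in households:
    strata[stratum(household[1])].append(household)

# Draw a simple random sample of about 10% from each stratum separately
sample = {
    name: rng.sample(members, k=max(1, len(members) // 10))
    for name, members in strata.items()
}

for name in sample:
    print(name, len(strata[name]), len(sample[name]))
```

Sampling each stratum separately guarantees that every income group appears in the final sample.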

6.2 Non-probability sampling:

Non-probability sampling is defined as a sampling technique in which the researcher selects samples based on subjective judgment rather than random selection.

a. Convenience sampling:

Convenience sampling is a non-probability sampling technique where samples are selected from the population only because they are conveniently available to the researcher. Researchers choose these samples just because they are easy to recruit, without considering whether the sample represents the entire population.

b. Volunteer sampling

The respondents are only volunteers in this method. Generally, volunteers must be screened
so as to get a set of characteristics suitable for the purposes of the survey (e.g. individuals
with a particular disease). This method can be subject to large selection biases, but is
sometimes necessary. For example, for ethical reasons, volunteers with particular medical
conditions may have to be solicited for some medical experiments.

Another example of volunteer sampling is callers to a radio or television show, when an issue
is discussed and listeners are invited to call in to express their opinions. Only the people who
care strongly enough about the subject one way or another tend to respond. The silent
majority does not typically respond, resulting in a large selection bias. Volunteer sampling is
often used to select individuals for focus groups or in-depth interviews.

c. Quota sampling:

In quota sampling, the researcher divides the population into groups and non-randomly
selects a fixed number (quota) of individuals from each group of interest.
Example: A researcher wants to study the career goals of male and female employees in an
organization. There are 500 employees in the organization, also known as the population.
To understand the population better, the researcher needs only a sample, not the
entire population. Further, the researcher is interested in particular groups within the
population.

d. Judgmental or Purposive sampling:

With this method, sampling is done based on previous ideas of population composition and
behaviour. In the judgmental sampling method, researchers select the samples based purely
on the researcher’s knowledge and credibility. In other words, researchers choose only
those people who they deem fit to participate in the research study.

8. Sample mean

A sample mean is the average of a set of sample data. The sample mean is a measure of central
tendency and is also used in calculating the variance and standard deviation of a data set. It can be
applied to a variety of uses, including estimating population averages. Many industries also make
use of statistical data, such as:

Scientific fields like ecology, biology and meteorology

Medical fields and pharmacology

Data and computer science, information technology and cybersecurity

Aerospace and aeronautical industries

Fields in engineering and design

How to calculate the sample mean

Calculating the sample mean is as simple as adding up the values of the items in a sample set and then
dividing that sum by the number of items in the sample set. To calculate the sample mean with
spreadsheet software or a calculator, you can use the formula:


x̄ = ( Σ xi ) / n

Here, x̄ represents the sample mean, Σ tells us to add, xi refers to all the X-values and n stands for
the number of items in the data set.

When calculating the sample mean using the formula, you will plug in the values for each of the
symbols. The following steps will show you how to calculate the sample mean of a data set:

1. Add up the sample items

First, you will need to count how many sample items you have within a data set and add up their
values.

Example: A teacher wants to find the average score for the 100 students in his class. He randomly
selects seven students to compute a sample mean. The teacher's sample set has seven
test scores: 78, 89, 93, 95, 88, 78, 95. He adds all the scores together and gets a sum of 616.
He can use this sum in the next step to find his sample mean.

2. Divide sum by the number of samples

Next, divide the sum from step one by the total number of items in the data set. Using the teacher as
an example, here is what this looks like:

Example: The teacher uses the sum of 616 to find the average score. He divides 616 by seven since
the number of scores in his data set was seven. The resulting quotient is 88.

3. The result is the mean

After dividing, the resulting quotient becomes your sample mean, or average. In the example of the
teacher:

Example: The students' scores in the sample give an average grade of 88%. You can use
the sample mean to further calculate variance, standard deviation and standard error.
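
The three steps for the sample mean can be written out in a few lines of Python. This is only a sketch using the teacher's seven scores; the variable names are illustrative.

```python
# The teacher's seven sampled test scores
scores = [78, 89, 93, 95, 88, 78, 95]

total = sum(scores)          # step 1: add up the sample items
n = len(scores)              # number of items in the sample
sample_mean = total / n      # step 2: divide the sum by n

print(total, n, sample_mean)  # 616 7 88.0
```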

4. Use the mean to find the variance

You can use the sample mean in further calculations by finding the variance of the data sample.
Variance represents how spread out the sample items are within a data set. To calculate
variance, you find the difference between each data item and the mean, square each difference, and
average the squared differences. Using the teacher example, let's see how this works:

Example: The teacher wants to find the variance of his students' scores, so he first finds the
difference between each of the seven scores and the average score of 88:

Differences = (78-88, 89-88, 93-88, 95-88, 88-88, 78-88, 95-88) = (-10, 1, 5, 7, 0, -10, 7)

Then, the teacher squares each difference (100, 1, 25, 49, 0, 100, 49) and, just like the mean, adds all
the numbers up and divides by seven:

Variance = mean(100, 1, 25, 49, 0, 100, 49) = 324 / 7 = 46.3, or approximately 46.

The larger the variance, the more spread out from the mean the data is.

5. Use the variance to find the standard deviation

You can also take the sample mean even further by calculating the standard deviation of the sample
set. Standard deviation measures the typical distance of the data values from the mean, and it is the
square root of the variance. Let's look at an example:

Standard deviation = sqrt(variance)

Example: The teacher uses the variance of 46 to find the standard deviation: √46 = 6.78. This
number tells the teacher how far, on average, a score in the sample set lies above or below the
grade average of 88%.
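
The variance and standard deviation steps can be checked with a short Python sketch. It follows the text in dividing by n = 7; as an added note that is not part of the notes, dividing by n - 1 instead gives the usual unbiased sample variance, which is what `statistics.variance` computes.

```python
import math
import statistics

scores = [78, 89, 93, 95, 88, 78, 95]
mean = sum(scores) / len(scores)                     # 88.0

# Squared difference of each score from the mean
squared_diffs = [(x - mean) ** 2 for x in scores]    # 100, 1, 25, 49, 0, 100, 49

# Divide by n, as in the worked example (same as statistics.pvariance)
variance_n = sum(squared_diffs) / len(scores)        # 324 / 7

# Divide by n - 1 (same as statistics.variance), shown only for comparison
variance_n1 = sum(squared_diffs) / (len(scores) - 1) # 324 / 6

std_n = math.sqrt(variance_n)

print(round(variance_n, 2), variance_n1, round(std_n, 2))
```

The standard deviation printed here is about 6.80 rather than 6.78, because the code takes the square root of the unrounded variance 46.29 instead of the rounded value 46.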

What is Population mean?

The population mean is the mean of all the values contained in a population. Here, the population
mean is the average score of all 100 students. Actual population mean = 79%

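How to calculate the standard error of the mean (SEM)?

The standard error of the mean (SEM) estimates how far the sample mean is likely to be from the
true population mean. It is the standard deviation divided by the square root of the sample size:

SEM = standard deviation / √n = 6.78 / √7 ≈ 2.56

The actual difference between the sample mean and the population mean in this example is the
sampling error: 88-79 = 9%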
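
As a hedged illustration, the sketch below computes the standard error of the mean with the commonly used formula SEM = standard deviation / √n, alongside the observed gap between the sample mean (88) and the population mean (79) quoted in the text. Using the divide-by-n standard deviation from the worked example is an assumption of this sketch; many textbooks use the n - 1 version instead.

```python
import math

scores = [78, 89, 93, 95, 88, 78, 95]
population_mean = 79                     # value quoted in the notes

n = len(scores)
sample_mean = sum(scores) / n            # 88.0

# Standard deviation with divisor n, matching the worked example
std = math.sqrt(sum((x - sample_mean) ** 2 for x in scores) / n)

sem = std / math.sqrt(n)                 # standard error of the mean
sampling_error = sample_mean - population_mean   # observed difference

print(round(sem, 2), sampling_error)
```

The two numbers answer different questions: the SEM describes the expected variability of sample means, while the sampling error is the gap actually observed for this one sample.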

9. Sample Size

‘Sample size’ is a market research term used for defining the number of individuals included in
a research study. Researchers choose their sample based on demographics, such as age, gender,
or physical location.

Sample size determination is the process of choosing the right number of observations or people
from a larger group to use in a sample.

The goal of figuring out the sample size is to ensure that the sample is big enough to give
statistically valid results and accurate estimates of population parameters but small enough to be
manageable and cost-effective.

9.1 Sample size calculation formula

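If population size (N) unknown; Sample size X = Z² × P × (1 − P) / (Margin Error)²

If population size (N) known; Sample Size (n) = (N × X) / (N + X − 1)

where Z is the Z-score, P is the population proportion, N is the population size, X is the sample
size for an unknown (infinite) population, and n is the sample size for a known population.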

What are the terms used around the sample size?

Before we jump into sample size determination, let’s take a look at the terms you should know:

a. Population size (N):

Population size is how many people fit your demographic. Example: Which political party will win
the upcoming election? The actual target is the entire state or nation, but we can randomly choose
people using any one of the sampling methods, such as random sampling or cluster sampling. The
population size and the number of samples need to be fixed before the survey starts.

b. Population proportion (P)

The population proportion is what you expect the results to be. This can often be determined by
using the results from a previous survey, or by running a small pilot study. If you are unsure, use
50%, which is conservative and gives the largest sample size.

c. Confidence level:

The confidence level tells you how sure you can be that your results reflect the population. It is
expressed as a percentage and is tied to the confidence interval. For example, a confidence level of
95% means that if the survey were repeated many times, about 95% of the resulting confidence
intervals would contain the true population value.

d. Finding the Z-score (Z):

The Z-score can be considered as a constant value that is set automatically depending on the
confidence level. Z-score shows the number of standard deviations or the standard normal score
between the average/mean of the population and any selected value.

Confidence Level    Z value
80%                 1.28
85%                 1.44
90%                 1.65
95%                 1.96
99%                 2.58
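
The Z values for each confidence level can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_for_confidence` is hypothetical, and the two-sided convention (splitting the remaining probability equally between both tails) is assumed.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided critical Z value for a confidence level given as a fraction."""
    tail = (1 - confidence) / 2              # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)    # standard normal quantile

for level in (0.80, 0.85, 0.90, 0.95, 0.99):
    print(f"{level:.0%}  {z_for_confidence(level):.3f}")
```

The computed values agree with the usual published Z values to within rounding; for 90% the exact value is close to 1.645, which is commonly quoted as 1.65.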

e. The margin of error (calculated using confidence interval):

There’s no way to be 100% accurate when it comes to surveys. The confidence interval tells you how
far from the population mean you are willing to allow your estimate to fall.
A margin of error describes how close you can reasonably expect a survey result to fall relative to
the real population value.
In the examples here, the margin of error is taken as 100% - confidence level (for example, 5% at a
95% confidence level); in general it can be chosen independently of the confidence level.

Example 1: Calculate the sample size for an infinite population. Take the confidence level as 95% and
the margin of error as 5%.

We will first calculate the sample size for an infinite population and then adjust it to the required
population size.

Given: Z = 1.960, population proportion (P) = 50% or 0.5, Margin error = 0.05, N=100000

X = Z² × P × (1 − P) / (Margin Error)²

X = 1.96² × 0.5 × (1 − 0.5) / 0.05²

X = 3.8416 × 0.25 / 0.0025

Sample size for infinite population X = 384.16


Example 2: Calculate the sample size for a population size of 100000. Take the confidence level as
95% and the margin of error as 5%.

Population size N=100000

Sample size for population (N) = (N × X) / (N + X − 1)

Sample size for population (N) = (100000 × 384.16) / (100000 + 384.16 − 1) ≈ 382.69

The sample size for a population size of 100000 is about 383 people (382.69 rounded up).
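
Both sample size calculations can be reproduced with the short Python sketch below. The function names `sample_size_infinite` and `sample_size_finite` are hypothetical, and rounding the final answer up to a whole person is a common convention assumed here rather than something stated in the notes.

```python
import math

def sample_size_infinite(z, p, margin_error):
    """Sample size X for an unknown (infinite) population."""
    return z ** 2 * p * (1 - p) / margin_error ** 2

def sample_size_finite(x, population_size):
    """Adjust the infinite-population size X for a known population size N."""
    return population_size * x / (population_size + x - 1)

# Example 1: 95% confidence (Z = 1.96), P = 0.5, margin of error = 0.05
x = sample_size_infinite(1.96, 0.5, 0.05)

# Example 2: the same settings with population size N = 100000
n = sample_size_finite(x, 100000)

print(round(x, 2), round(n, 2), math.ceil(n))
```

For large populations the finite-population adjustment changes the answer very little, which is why the two results here differ by fewer than two people.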

Exercise 1: Calculate the sample size for an infinite population and for a finite population of 9000.
Take the confidence level as 85%, the population proportion as 50% and the margin of error as 15%.
