Data Science With Python

This document provides an overview of statistics concepts for data science, including basics, probability distributions, and advanced topics. The agenda covers the basics of statistics like data scales, variance, and standard deviation. It then discusses probability distributions like the normal, binomial, and Poisson distributions. Finally, it outlines advanced statistical concepts like sampling, inferential statistics, hypothesis testing using z-tests and t-tests. The goal is to equip learners with both basic and advanced statistics skills for data science applications in Python.

Uploaded by

Nivas Srini
Copyright: © All Rights Reserved

Data Science with Python

Day 3 - Statistics for Data Science - Basic & Advanced

18008338228 +65 31586636 +1(973) 598-3969 44 203-808-4216 www.cognixia.com • [email protected]



Today’s Agenda

✓ Basics of Statistics
• Types of Random Variables - Based on Scale of Measurement
  o Nominal
  o Ordinal
  o Interval
  o Ratio
• Variance
• Standard Deviation

✓ Probability Distributions
• Normal Distribution
• Standard Normal Distribution and Z-Score
• Binomial Distribution
• Poisson Distribution


Basics of Statistics

Types of Random Variables - Based on Scale of Measurement

NOMINAL
• No order
• No comparison
• No calculation
• No interval
Ex. Gender {M, F}

ORDINAL
• Order
• Comparison
• No calculation
• No interval
Ex. Size {S < M < L}

INTERVAL
• Order
• Comparison
• Calculation
• Regular interval
• No absolute zero
• Cannot calculate ratio
Ex. Temperature {0 °C = 32 °F}, IQ

RATIO
• Order
• Comparison
• Calculation
• Regular interval
• Absolute zero
• Can calculate ratio
Ex. Height, distance


Basics of Statistics
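Variance (σ²) and Standard Deviation (σ)

Variance (σ²): average squared deviation of the values from the mean: $\mathrm{Var}(X) = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
Standard Deviation (σ): square root of the variance: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$

Data Point   Speed (X)   X - µ   (X - µ)²
1            13          -8      64
2            12          -9      81
3            17          -4      16
4            18          -3      9
5            18          -3      9
6            21           0      0
7            26           5      25
8            30           9      81
9            29           8      64
10           28           7      49
Average      21                  Sum = 398

Note: the exact average of the ten speeds is 21.2; the table rounds it to 21 before taking deviations.
Variance = 398/10 = 39.8
SD = √39.8 ≈ 6.3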
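
The table's arithmetic can be checked with a few lines of Python. This is a minimal sketch using the ten speed values from the table; the variable names are illustrative. It computes the population variance both with the exact mean (21.2) and with the mean rounded to 21, as the table does.

```python
import math

# Speed values from the table (10 data points).
speeds = [13, 12, 17, 18, 18, 21, 26, 30, 29, 28]

n = len(speeds)
mean = sum(speeds) / n  # exact mean of the data

# Population variance: average squared deviation from the mean.
variance = sum((x - mean) ** 2 for x in speeds) / n
std_dev = math.sqrt(variance)

# Same calculation with the mean rounded to 21, as in the table.
rounded_mean = 21
variance_rounded = sum((x - rounded_mean) ** 2 for x in speeds) / n

print(f"mean={mean:.1f}, variance={variance:.2f}, sd={std_dev:.2f}")
print(f"variance with mean rounded to 21 = {variance_rounded:.1f}")
```

Rounding the mean changes the variance only slightly here, and the standard deviation rounds to 6.3 either way.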


Basics of Statistics

Mean (µ), Variance (σ²) and Standard Deviation (σ)

Discrete random variable with probability mass function p(x):

Mean (µ): $\mu = \sum_x x\,p(x)$
Variance (σ²): $\sigma^2 = \sum_x (x - \mu)^2\,p(x)$
Standard Deviation (σ): $\sigma = \sqrt{\sigma^2}$

https://en.wikipedia.org
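
As an illustration of the discrete case, the sketch below applies the standard definitions (mean as the probability-weighted sum of values, variance as the probability-weighted sum of squared deviations) to a fair six-sided die. The die is a hypothetical example chosen here, not one from the slides.

```python
from fractions import Fraction
import math

# Hypothetical example: a fair six-sided die, p(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Mean: sum of x * p(x) over all outcomes.
mean = sum(x * p for x, p in pmf.items())

# Variance: sum of (x - mean)^2 * p(x) over all outcomes.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean={mean}, variance={variance}, sd={std_dev:.3f}")
```

Using `Fraction` keeps the mean and variance exact, which makes the weighted sums easy to verify by hand.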


Basics of Statistics

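Mean (µ), Variance (σ²) and Standard Deviation (σ)

Continuous random variable with probability density function f(x):

Mean (µ): $\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx$
Variance (σ²): $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2\,f(x)\,dx$
Standard Deviation (σ): $\sigma = \sqrt{\sigma^2}$

https://en.wikipedia.org/wiki/Mode_(statistics)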
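
For a continuous variable the sums become integrals over the density. The sketch below approximates those integrals numerically with a simple midpoint rule. The uniform density on [0, 1] and the helper `integrate` are assumptions made for illustration; they do not appear in the slides.

```python
import math

def integrate(g, a, b, steps=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def pdf(x):
    """Hypothetical density: uniform on [0, 1]."""
    return 1.0

a, b = 0.0, 1.0

# Mean: integral of x * f(x).
mean = integrate(lambda x: x * pdf(x), a, b)

# Variance: integral of (x - mean)^2 * f(x).
variance = integrate(lambda x: (x - mean) ** 2 * pdf(x), a, b)
std_dev = math.sqrt(variance)

print(f"mean={mean:.4f}, variance={variance:.4f}, sd={std_dev:.4f}")
```

For this density the exact values are 1/2 and 1/12, so the numerical result can be compared against them.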


Probability Distribution
Normal Distribution
The normal (Gaussian, or bell-curve) distribution is a very common continuous probability distribution. It has a bell-shaped curve: a symmetric, unimodal (single-mode) distribution with the highest density at and around the mean.
Ex. Age, Marks
Some important properties:
• Mean = Median = Mode
• Area within 1 Std. Dev. around the mean ≈ 68.3%
• Area within 2 Std. Dev. around the mean ≈ 95.4%
• Area within 3 Std. Dev. around the mean ≈ 99.7%

https://en.wikipedia.org/wiki/Normal_distribution
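
The three area figures can be reproduced from the normal cumulative distribution function. This is a small sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the dictionary name `areas` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Area between -k and +k standard deviations around the mean.
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} SD of the mean: {area:.1%}")
```

Because the areas are expressed in standard deviations, the same percentages hold for any normal distribution, whatever its mean and SD.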


Probability Distribution
Standard Normal Distribution and z Score/ z Statistic
The standard normal distribution is the special case of the normal distribution with Mean = 0, Variance = 1 and Std. Deviation = 1. The total area under the curve is 1, which represents total probability.
Any normal distribution can be converted into the standard normal distribution by applying the following transformation:

$Z = (x - \mu)/\sigma$ {This is called the z-score; it tells us how many SDs a value is from the mean}

https://www.mathsisfun.com/data/standard-normal-distribution.html
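
A short sketch of the z-score transformation follows. The exam-mark numbers (mean 70, SD 10, observed mark 85) are hypothetical values chosen for illustration. It shows that looking up the z-score in the standard normal distribution gives the same probability as using the original distribution directly.

```python
from statistics import NormalDist

# Hypothetical exam marks: normally distributed with mean 70 and SD 10.
mu, sigma = 70.0, 10.0
x = 85.0

# z-score: how many standard deviations x lies from the mean.
z = (x - mu) / sigma

# P(X <= x) via the standard normal distribution and the z-score ...
p_below = NormalDist().cdf(z)

# ... equals P(X <= x) computed on the original distribution.
p_direct = NormalDist(mu, sigma).cdf(x)

print(f"z = {z:.2f}, P(X <= {x}) = {p_below:.4f}")
```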


Probability Distribution

Binomial Distribution
Bi→ 2, Nomial→ Nominal → Only 2 possible outcomes (Success or Failure)
When we repeat a given experiment multiple times and are interested in the number of successes, the experiment is known as a binomial experiment. Ex. Flipping a coin multiple times. Using the binomial distribution we can answer probability questions for any binomial experiment.
Probability of getting x successes out of n trials using the binomial distribution:
$P(x) = \binom{n}{x}\,p^{x}\,(1-p)^{n-x}$ ; p = probability of success in 1 trial
Some important properties:
▪ Fixed number of trials (n)
▪ Only 2 possible, mutually exclusive outcomes
▪ Probability of success remains the same throughout the experiment
▪ All the trials are independent
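
The binomial formula translates directly into code. In this sketch the function name `binomial_pmf` and the coin-flip numbers (5 flips, 3 heads) are illustrative choices. It uses `math.comb` from the standard library for the number of combinations.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example: exactly 3 heads in 5 flips of a fair coin.
n, p = 5, 0.5
p_three_heads = binomial_pmf(3, n, p)

# Sanity check: probabilities over all possible counts sum to 1.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"P(3 heads in 5 flips) = {p_three_heads}")
print(f"sum over all outcomes = {total}")
```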


Probability Distribution
Poisson Distribution
The Poisson distribution is used when we analyze the probability of a number of occurrences of an event during a specified interval of time (or within some other fixed bound, such as an area or volume).

Probability of x occurrences using the Poisson distribution:

$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$ ; λ = mean/expected number of occurrences


Some important properties:
▪ All the occurrences are independent
▪ The expected number of occurrences does not change over the period of time
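
The Poisson probability can be computed the same way. In this sketch the function name `poisson_pmf` and the rate of 2 occurrences per interval are hypothetical values used only for illustration.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when lam are expected."""
    return lam**x * exp(-lam) / factorial(x)

# Hypothetical example: on average 2 calls arrive per minute.
lam = 2.0

p_zero = poisson_pmf(0, lam)                               # no calls in a minute
p_at_most_two = sum(poisson_pmf(k, lam) for k in range(3))  # 0, 1 or 2 calls

print(f"P(0 calls) = {p_zero:.4f}")
print(f"P(at most 2 calls) = {p_at_most_two:.4f}")
```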


“Qs & As”

Data Science with Python
Day 4 - Statistics for Data Science - Advanced


Today’s Agenda

✓ Inferential Statistics
• Sampling
• Inferential Statistics
• Sampling Distribution
• Central Limit Theorem
• Central Limit Theorem Exercise

✓ Hypothesis Testing
• Hypothesis and Hypothesis Testing
• One-tail/Two-tail Test
• Type I and Type II Errors
• Hypothesis Testing using z-test
• Hypothesis Testing using t-test


Advanced Statistics

Sampling
Sampling is taking random samples from the overall population. It is done in order to make judgements about the overall population, because it is often not possible or practical to analyze the whole population. With a sufficient sample size, sampling gives approximately the same results.

[Figure: several samples drawn from one population]
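
The idea can be tried out with the standard `random` module. This is a sketch under stated assumptions: the population (the integers 1 to 10,000), the sample size of 1,000 and the seed are all hypothetical choices made for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: the integers 1..10000.
population = list(range(1, 10_001))

# Simple random sample without replacement.
sample = rng.sample(population, 1000)

population_mean = mean(population)
sample_mean = mean(sample)

print(f"population mean = {population_mean}")
print(f"sample mean     = {sample_mean}")
```

The sample mean will not match the population mean exactly, but with 1,000 of 10,000 values it should land close to it.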


Advanced Statistics

Inferential Statistics
With inferential statistics, we try to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to infer from sample data what the overall population might think, or to make judgments about probabilities for the overall population. Estimating a population parameter with a single value computed from a sample is known as point estimation.

Point Estimators
• Sample mean (x̄) -> estimates the population mean (µ)
• Sample standard deviation (s) -> estimates the population standard deviation (σ)

[Figure: a sample drawn from the population and used for point/parameter estimation]


Advanced Statistics

Sampling Distribution
A sampling distribution is the distribution of a statistic (such as the mean) across samples taken randomly from the population; we use it to make judgments about the overall population. Different samples taken from the same population can show different characteristics; this is known as sampling variability. The larger the sample size, the less the variability.

Expected Value E(x̄) or Sampling Mean (µ_x̄):

We take multiple samples from the overall population and analyze the distribution of these samples to make decisions about the overall population. The mean of the sample means is known as the expected value or sampling mean, and it can be used as an estimate of the overall population mean (µ): $\mu_{\bar{x}} = \mu$.
As the sample size n becomes very large, the sampling mean -> µ

Standard Error of the Mean (σ_x̄):

The standard deviation of the sampling distribution of the mean is known as the standard error of the mean: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
As the sample size n becomes very large, the standard error of the mean -> 0
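
The shrinking standard error can be seen in a small simulation. This is a sketch under stated assumptions: the population is taken to be exponential with rate 1 (so its SD is 1), and the helper name `sd_of_sample_means`, the seed and the repeat count are illustrative. It compares the spread of simulated sample means with σ/√n.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable
SIGMA = 1.0             # SD of the assumed exponential(1) population

def sd_of_sample_means(n, repeats=1000):
    """Draw `repeats` samples of size n and return the SD of their means."""
    means = [fmean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(repeats)]
    return pstdev(means)

se_10 = sd_of_sample_means(10)
se_100 = sd_of_sample_means(100)

print(f"n=10 : simulated SE = {se_10:.3f}, sigma/sqrt(n) = {SIGMA / sqrt(10):.3f}")
print(f"n=100: simulated SE = {se_100:.3f}, sigma/sqrt(n) = {SIGMA / sqrt(100):.3f}")
```

Increasing the sample size tenfold should cut the standard error by a factor of roughly √10.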


Advanced Statistics

Central Limit Theorem (CLT)

"The sample mean will be approximately normally distributed for large sample sizes, regardless of the original distribution from which we are taking samples."
With Mean = Population Mean (µ)
SD = σ/√n
{If σ is not known, then SD = s/√n, where s = sample SD}

Application of CLT
We can apply standard normal distribution concepts to a non-normal population by taking samples, because as per the CLT the sample means will be approximately normally distributed for large sample sizes.
From the CLT we know the SD of the sample mean: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$
From the standard normal distribution we know: $Z = (x - \mu)/\sigma$
So for the sampling distribution of the mean we can say: $Z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$, and we can now calculate probabilities using the standard normal distribution (SND) even for a non-normal population.
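
A simulation gives a feel for the theorem. This is a sketch under stated assumptions: the population is taken to be exponential with rate 1, a clearly non-normal distribution with mean 1 and SD 1, and the sample size, repeat count and seed are illustrative. If the sample means are roughly normal, about 95% of their z-scores should fall within ±1.96.

```python
import random
from math import sqrt
from statistics import fmean

rng = random.Random(1)  # fixed seed so the run is repeatable
MU, SIGMA = 1.0, 1.0    # mean and SD of the assumed exponential(1) population
n, repeats = 50, 2000

z_scores = []
for _ in range(repeats):
    # Mean of one sample of size n from the skewed population.
    x_bar = fmean(rng.expovariate(1.0) for _ in range(n))
    # Standardize using the CLT: Z = (x_bar - mu) / (sigma / sqrt(n)).
    z_scores.append((x_bar - MU) / (SIGMA / sqrt(n)))

share_within = sum(abs(z) <= 1.96 for z in z_scores) / repeats
print(f"share of sample means with |Z| <= 1.96: {share_within:.3f}")
```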


Advanced Statistics

Exercise - Central Limit Theorem

A large freight elevator can transport a maximum of 9800 pounds. Suppose a load of cargo containing 49 boxes must be transported via the elevator. Experience has shown that the weight of boxes of this type of cargo follows a distribution with mean µ = 205 pounds and standard deviation σ = 15 pounds. Based on this information, what is the probability that all 49 boxes can be safely loaded onto the freight elevator and transported?

Solution: Given µ = 205, σ = 15, n = 49. Expected total weight = 49 × 205 = 10045 > 9800.
We know nothing about the original probability distribution, whether it is normal or not, but from the CLT we know the sample mean (and therefore the total of the 49 boxes) will be approximately normally distributed.
We are interested in the weight of 49 boxes, not 1, so let's calculate µ and σ for the total of 49 boxes:

µ_total = 49 × 205 = 10045, σ_total = 15 × √49 = 105 (for independent boxes the variances add, so the SD of the total grows with √n, not n). We need the probability that the total weight is <= 9800, so X = 9800.
Z = (X - µ_total) / σ_total
  = (9800 - 10045) / 105
  = -245/105
Z = -2.33
Equivalently, in terms of the mean weight per box: Z = (200 - 205) / (15/√49) = -2.33.
Using the z table, the probability for Z <= -2.33 is about 0.0099, so there is roughly a 1% chance that the load is within the limit.
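
The exercise can be checked numerically. This is a minimal sketch that treats the total weight as approximately normal, takes the SD of the total as σ√n = 105 (which assumes independent boxes), and uses `statistics.NormalDist` in place of a z table. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_box, sigma_box, n = 205.0, 15.0, 49
capacity = 9800.0

# Mean and SD of the total weight of n boxes.
mu_total = n * mu_box               # 49 * 205
sigma_total = sigma_box * sqrt(n)   # 15 * 7

# z-score of the capacity limit and probability of staying under it.
z = (capacity - mu_total) / sigma_total
p_safe = NormalDist().cdf(z)

# Same z-score expressed through the mean weight per box.
z_mean = (capacity / n - mu_box) / (sigma_box / sqrt(n))

print(f"z = {z:.2f}, P(total <= {capacity:.0f}) = {p_safe:.4f}")
```

The printed probability differs slightly from the z-table value because the table rounds z to two decimals.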


Advanced Statistics

Hypothesis
A hypothesis (plural: hypotheses) is a proposed explanation for a phenomenon. In statistics, a hypothesis can be any theory about the data that we want to validate (generally, reject or fail to reject). We will mainly be working with two types of hypotheses:

1. Null Hypothesis (H0): the current assumption or theory, which is presumed to be correct
2. Alternative Hypothesis (H1): the claim or theory that we want to prove

Ex. H0: When flipping a coin, the probability of getting heads is 0.5; H1: the probability of getting heads is less than 0.5


Advanced Statistics

Hypothesis Testing
Hypothesis testing means validating the null hypothesis (H0) against an alternative hypothesis (H1) based on given sample data. Steps involved in hypothesis testing using P values:

1. Define your null and alternative hypotheses, H0 & H1
2. Decide the type of test (one-tail or two-tail)
3. Define the level of significance α, generally taken to be 0.05 or 0.01
4. Find the test statistic TS (t-test or z-test)
5. Find the P value
6. If P < α, reject the null hypothesis in favour of the alternative hypothesis; otherwise, fail to reject the null hypothesis

• P value: probability of getting the given sample, or even more extreme samples, if the null hypothesis is true
• Significance level (α): the cutoff for the P value; H0 is rejected when the P value falls below it
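
The steps can be walked through for a coin-flip hypothesis (H0: P(heads) = 0.5; H1: P(heads) < 0.5). This is a sketch under stated assumptions: the observed data (40 heads in 100 flips) are hypothetical, and the P value is computed exactly from the binomial distribution rather than with a z or t statistic.

```python
from math import comb

# Step 1: H0: p = 0.5, H1: p < 0.5 (coin-flip example).
p0 = 0.5

# Step 2: H1 is "less than", so this is a one-tail (left-tail) test.
# Step 3: level of significance.
alpha = 0.05

# Hypothetical observed data: 40 heads in 100 flips.
n, heads = 100, 40

# Steps 4-5: P value = probability of 40 or fewer heads if H0 is true.
p_value = sum(comb(n, k) * p0**k * (1 - p0) ** (n - k)
              for k in range(heads + 1))

# Step 6: decision.
reject_h0 = p_value < alpha

print(f"P value = {p_value:.4f}, reject H0: {reject_h0}")
```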


Advanced Statistics

Type I Error or False Positive

Getting a positive result when it should be negative in reality.

Rejecting the null hypothesis (H0) while H0 is correct and should not be rejected. The probability of a Type I error is known as alpha risk.

Type II Error or False Negative

Getting a negative result when it should be positive in reality.

Failing to reject the null hypothesis (H0) while H0 is not correct and should be rejected. The probability of a Type II error is known as beta risk.
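
The link between the significance level and alpha risk can be illustrated by simulation. This is a sketch under stated assumptions: data are generated from a normal population for which H0 is true (mean 0, SD 1), a two-sided z-test at α = 0.05 is applied to each sample, and the sample size, repeat count and seed are illustrative. The share of wrongly rejected tests should come out near α.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

rng = random.Random(7)  # fixed seed so the run is repeatable
alpha = 0.05
mu0, sigma, n, repeats = 0.0, 1.0, 30, 2000

# Two-sided critical value for the chosen alpha (about 1.96).
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

false_positives = 0
for _ in range(repeats):
    # H0 is true here: the data really come from mean mu0.
    x_bar = fmean(rng.gauss(mu0, sigma) for _ in range(n))
    z = (x_bar - mu0) / (sigma / sqrt(n))
    if abs(z) > z_crit:
        false_positives += 1  # rejected a true H0: Type I error

type_i_rate = false_positives / repeats
print(f"simulated Type I error rate = {type_i_rate:.3f} (alpha = {alpha})")
```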


Advanced Statistics

Z-Test for Hypothesis Testing


The z-test is used to perform hypothesis testing when we know the population standard deviation (σ) or the sample size n >= 30. Steps for a 1-sample z-test:

1. Define your null and alternative hypotheses, H0 & H1
2. Define the level of significance α, generally taken to be 0.05 or 0.01
3. Find the test statistic using TS = (x̄ - µ)/(σ/√n), or TS = (x̄ - µ)/(s/√n) when σ is not known but n >= 30 {x̄ = sample mean, µ = population mean under H0, s = sample SD}
4. Find the P value using the z table and TS; if it is a two-sided test, double the P value
5. If P < α, reject the null hypothesis in favour of the alternative hypothesis; otherwise, fail to reject the null hypothesis
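
The z-test steps can be sketched as a small function. The function name `one_sample_z_test` and the example numbers (sample mean 105, hypothesized mean 100, σ = 15, n = 36) are hypothetical; `statistics.NormalDist` stands in for the z table.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, two_sided=True):
    """Return (z, p_value) for a one-sample z-test.

    For a one-sided test the tail is taken in the direction of the
    observed z, so it assumes the data lie on the side H1 predicts.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    tail = 1 - NormalDist().cdf(abs(z))
    p_value = 2 * tail if two_sided else tail
    return z, p_value

# Hypothetical example: H0: mu = 100, H1: mu != 100, sigma known.
alpha = 0.05
z, p_value = one_sample_z_test(sample_mean=105, mu0=100, sigma=15, n=36)
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, P value = {p_value:.4f}, reject H0: {reject_h0}")
```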


Advanced Statistics

T-Test for Hypothesis Testing


The t-test is used to perform hypothesis testing when we don't know the population standard deviation (σ) and the sample size n < 30. Steps for a 1-sample t-test:

1. Define your null and alternative hypotheses, H0 & H1
2. Define the level of significance α, generally taken to be 0.05 or 0.01
3. Calculate the degrees of freedom DF = n - 1
4. Find the test statistic using TS = (x̄ - µ)/(s/√n) {x̄ = sample mean, µ = population mean under H0, s = sample SD}
5. Find the P value or P range using the t table for the calculated TS and DF
6. If P < α, reject the null hypothesis in favour of the alternative hypothesis; otherwise, fail to reject the null hypothesis
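
The t-test steps can be sketched with the standard library alone. This is a sketch under stated assumptions: the ten measurements and the hypothesized mean of 12.0 are hypothetical. Since the standard library has no t distribution, the decision compares |TS| with the two-sided critical value from a t table (2.262 for DF = 9 at α = 0.05), which is equivalent to checking P < α.

```python
from math import sqrt
from statistics import fmean, stdev

# Hypothetical sample of 10 measurements; H0: mu = 12.0, H1: mu != 12.0.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 12.5, 11.7, 12.1]
mu0 = 12.0

n = len(sample)
df = n - 1            # degrees of freedom
x_bar = fmean(sample)
s = stdev(sample)     # sample SD (divides by n - 1)

# Test statistic: TS = (x_bar - mu0) / (s / sqrt(n)).
t_stat = (x_bar - mu0) / (s / sqrt(n))

# Two-sided critical value from a t table for alpha = 0.05, DF = 9.
T_CRIT = 2.262
reject_h0 = abs(t_stat) > T_CRIT

print(f"DF = {df}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

For a different sample size or significance level, the critical value has to be looked up again for the new DF and α.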


“Qs & As”
