Data Science With Python
Today’s Agenda
Basics of Statistics
Variance (σ²) and Standard Deviation (σ)
Variance (σ²): average squared deviation of the values from the mean: Var(X) = (1/n) · Σ (X − µ)²
Standard Deviation (σ): square root of the variance: σ = √[(1/n) · Σ (X − µ)²]
Data Point   Speed (X)   X − µ   (X − µ)²
1            13          -8      64
2            12          -9      81
3            17          -4      16
4            18          -3      9
5            18          -3      9
6            21           0      0
7            26           5      25
8            30           9      81
9            29           8      64
10           28           7      49
Average      21                  Sum = 398
Variance = 398/10 = 39.8
SD = √39.8 ≈ 6.3
(The mean is rounded to 21 in this table; the exact mean of the ten speeds is 21.2.)
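The table's arithmetic can be checked with a short Python sketch. It uses only the ten speeds listed above; the variable names are illustrative. It computes the population variance (dividing by n, as in the formula above) once with the exact mean and once with the rounded mean of 21 used in the table.

```python
import math

# Speeds from the table above (10 data points).
speeds = [13, 12, 17, 18, 18, 21, 26, 30, 29, 28]
n = len(speeds)

# Exact mean, population variance (divide by n) and standard deviation.
mean = sum(speeds) / n
variance = sum((x - mean) ** 2 for x in speeds) / n
std_dev = math.sqrt(variance)

# The table rounds the mean to 21; reproduce its figures with that value.
table_sq_dev_sum = sum((x - 21) ** 2 for x in speeds)
table_variance = table_sq_dev_sum / n
table_std_dev = math.sqrt(table_variance)

print("exact:", mean, variance, round(std_dev, 2))
print("table:", table_sq_dev_sum, table_variance, round(table_std_dev, 2))
```

Both routes give a standard deviation of about 6.3, so rounding the mean barely changes the result here.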
Basics of Statistics
https://en.wikipedia.org
Basics of Statistics
https://en.wikipedia.org/wiki/Mode_(statistics)
Probability Distribution
Normal Distribution
The normal (Gaussian, or bell-shaped curve) distribution is a very common continuous probability distribution. Its curve is bell shaped:
it is a symmetric, unimodal distribution with the highest density at and around the mean.
Ex. Age, Marks
Some Important properties :
• Mean = Median = Mode
https://en.wikipedia.org/wiki/Normal_distribution
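A minimal sketch of these properties using `statistics.NormalDist` from the Python standard library (Python 3.8+). The mean of 60 and standard deviation of 10 are hypothetical exam marks chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical exam marks: mean 60, standard deviation 10 (illustrative values).
marks = NormalDist(mu=60, sigma=10)

# For a normal distribution the mean, median and mode coincide.
print(marks.mean, marks.median, marks.mode)

# Symmetry: the density is the same at equal distances either side of the mean.
left_density = marks.pdf(60 - 7)
right_density = marks.pdf(60 + 7)

# Density is highest around the mean: share of values within 1 SD of it.
within_one_sd = marks.cdf(70) - marks.cdf(50)
print(left_density, right_density, round(within_one_sd, 4))
```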
Probability Distribution
Standard Normal Distribution and z Score/ z Statistic
A special case of the normal distribution with Mean = 0, Variance = 1 and Std. Deviation = 1. The total area under the curve is 1, which
represents the total probability.
Any normal distribution can be converted into the standard normal distribution by applying the following transformation:
Z = (x − µ) / σ {This is called the z score; it tells us how many standard deviations a value lies from the mean}
https://www.mathsisfun.com/data/standard-normal-distribution.html
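The z-score transformation can be sketched in Python as follows. The helper name `z_score` and the example values (x = 75, µ = 60, σ = 10) are assumptions made for illustration; `NormalDist` comes from the standard library.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical value: a mark of 75 where the mean is 60 and the SD is 10.
z = z_score(75, mu=60, sigma=10)

# Probability of a value at or below x, read from the standard normal curve.
p_standard = NormalDist(0, 1).cdf(z)

# The same probability computed directly on the original distribution.
p_original = NormalDist(60, 10).cdf(75)

print(z, round(p_standard, 4), round(p_original, 4))
```

The two probabilities agree, which is the point of the transformation: one table for the standard normal distribution serves every normal distribution.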
Probability Distribution
Binomial Distribution
Bi → 2, Nomial → Nominal → only 2 possible outcomes (Success or Failure)
When we repeat an experiment multiple times and are interested in the number of successes, the experiment is known as a binomial
experiment, Ex. flipping a coin multiple times. Using the binomial distribution we can answer probability questions about any
binomial experiment.
Probability of getting x successes out of n trials using the binomial distribution:
P(x) = nCx · p^x · (1 − p)^(n − x) ; p = probability of success in 1 trial
Some Important properties :
▪ Fixed number of trials (n)
▪ Only 2 possible, mutually exclusive outcomes
▪ Probability of success remains the same throughout the experiment
▪ All the trials are independent
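The binomial formula above translates directly into Python with `math.comb` (Python 3.8+). The function name `binomial_pmf` and the coin example (3 heads in 10 flips of a fair coin) are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x successes in n independent trials), success probability p per trial."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical example: exactly 3 heads in 10 flips of a fair coin.
p_three_heads = binomial_pmf(3, 10, 0.5)

# The probabilities of all possible outcomes (0..10 heads) add up to 1.
total_probability = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(p_three_heads, total_probability)
```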
Probability Distribution
Poisson Distribution
The Poisson distribution is used when we analyze the probability of an event occurring a given number of times during a specified
interval of time, or within some other fixed bound (for example, a region of space).
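The slide gives no formula for this distribution. The standard Poisson probability of exactly k events, when the average number of events per interval is λ, is P(k) = λ^k · e^(−λ) / k!. A minimal sketch follows; the function name and the rate of 3 calls per hour are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events in an interval) when the average count is lam."""
    return lam ** k * exp(-lam) / factorial(k)

# Hypothetical example: a help desk that receives 3 calls per hour on average.
p_no_calls = poisson_pmf(0, 3)
p_two_calls = poisson_pmf(2, 3)

# Summing over enough values of k should give (very nearly) 1.
total_probability = sum(poisson_pmf(k, 3) for k in range(50))

print(round(p_no_calls, 4), round(p_two_calls, 4), total_probability)
```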
Today’s Agenda
Advanced Statistics
Sampling
Sampling means taking random samples from the overall population. It is done in order to make judgements about the overall
population, because it is often not possible or practical to analyze the whole population. With a sufficient sample size, a sample
gives approximately the same results.
[Figure: several samples drawn from one population]
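A small sketch of simple random sampling with the standard `random` module. The population (the integers 1 to 10000), the sample size of 1000 and the seed are all hypothetical choices for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: the integers 1..10000.
population = list(range(1, 10001))
population_mean = mean(population)

# Simple random sample of 1000 members, drawn without replacement.
sample = rng.sample(population, 1000)
sample_mean = mean(sample)

print(population_mean, sample_mean)
```

The sample mean should land close to the population mean even though only a tenth of the population was examined.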
Advanced Statistics
Inferential Statistics
With inferential statistics, we try to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential
statistics to infer from the sample data what the overall population might think, or to make probability judgments about the overall
population. Estimating a population parameter from a sample statistic in this way is known as Point Estimation.
Point Estimators (Point/Parameter Estimation)
• Sampling Mean (µ_X̄) -> Population Mean (µ)
• Sampling Standard Deviation (σ_X̄) -> Population Standard Deviation (σ)
[Figure: a sample drawn from the population]
Advanced Statistics
Sampling Distribution
A sampling distribution is the distribution of a statistic across samples taken randomly from a population; we use it to make
judgments about the overall population. Different samples taken from the same population can show different characteristics; this
is known as sampling variability. The larger the sample size, the lower the variability.
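Sampling variability can be illustrated with a short simulation. Everything here is an assumption for illustration: the uniform population, the sample sizes of 10 and 100, the 500 repetitions and the seed.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 values between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(10_000)]

def spread_of_sample_means(sample_size, repeats=500):
    """Standard deviation of the means of many random samples of one size."""
    means = [mean(rng.sample(population, sample_size)) for _ in range(repeats)]
    return pstdev(means)

spread_small = spread_of_sample_means(10)   # small samples: means vary a lot
spread_large = spread_of_sample_means(100)  # larger samples: means vary less

print(round(spread_small, 2), round(spread_large, 2))
```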
Advanced Statistics
Application of CLT
We can use standard normal distribution concepts for any non-normal population by taking samples, because as per the CLT (Central
Limit Theorem) the sample means will be approximately normally distributed for a large sample size.
From the CLT we know the sampling SD: σ_X̄ = σ / √n
From the standard normal distribution we know: Z = (x − µ) / σ
So for any sampling distribution we can say: Z = (X̄ − µ) / (σ / √n), so we can now calculate probabilities using the standard normal
distribution (SND) for any non-normal population.
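A rough simulation sketch of this idea, under stated assumptions: the population is exponential with mean 1 and SD 1 (clearly non-normal), the sample size is 50, and 2000 samples are drawn with a fixed seed. If the sample means are approximately normal, about 95% of their z values should fall within ±1.96.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumed non-normal population: exponential with mean 1 and SD 1.
mu, sigma, n = 1.0, 1.0, 50
standard_error = sigma / n ** 0.5  # sampling SD = sigma / sqrt(n)

def sample_mean_z():
    """Draw one sample of size n and return the z value of its mean."""
    sample = [rng.expovariate(1.0) for _ in range(n)]
    return (mean(sample) - mu) / standard_error

z_values = [sample_mean_z() for _ in range(2000)]

# Share of sample means within 1.96 standard errors of the population mean.
observed_share = sum(abs(z) <= 1.96 for z in z_values) / len(z_values)

# Share predicted by the standard normal distribution.
expected_share = NormalDist().cdf(1.96) - NormalDist().cdf(-1.96)

print(round(observed_share, 3), round(expected_share, 3))
```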
Advanced Statistics
µ = 10045, σ = 15 × 49 = 735, n = 49. We need the probability that the total weight is <= 9800, so X = 9800.
Z = (X − µ) / (σ / √n)
  = (9800 − 10045) / (735 / 7)
  = −245 / 105
Z = −2.33
Using the z table, the probability for Z <= −2.33 → 0.0099
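The same calculation can be reproduced in Python, with `statistics.NormalDist` standing in for the z table. The numbers are the ones given above; rounding z to two decimals mirrors how a printed z table is indexed.

```python
from math import sqrt
from statistics import NormalDist

mu = 10045        # mean from the example above
sigma = 15 * 49   # 735, as given above
n = 49            # so sqrt(n) = 7
x = 9800          # threshold for the total weight

standard_error = sigma / sqrt(n)   # 735 / 7
z = (x - mu) / standard_error      # -245 / 105
z_rounded = round(z, 2)            # z tables list z to two decimals

# Area under the standard normal curve to the left of z.
probability = NormalDist().cdf(z_rounded)

print(standard_error, z_rounded, round(probability, 4))
```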
Advanced Statistics
Hypothesis
A hypothesis (plural: hypotheses) is a proposed explanation for a phenomenon. In statistics, a hypothesis can be any theory about the
data that we want to validate (generally accept or reject). We will mainly be working with two types of hypotheses:
1. Null Hypothesis (H0): the current assumption or theory, presumed to be correct
2. Alternative Hypothesis (H1): the claim or theory that we want to prove
Ex. H0: when flipping a coin, the probability of getting heads is 0.5; H1: the probability of getting heads is less than 0.5
Advanced Statistics
Hypothesis Testing
Validating the null hypothesis (H0) against an alternative
hypothesis (H1) based on given sample data. Steps
involved in hypothesis testing using p-values:
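As one possible illustration of a p-value test (a sketch, not necessarily the exact steps from the slide), take the coin hypotheses H0: p = 0.5 and H1: p < 0.5. The data (38 heads in 100 flips) and the significance level of 0.05 are assumptions made for this example.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(at most k successes in n trials) with success probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

# Hypothetical sample data: 38 heads observed in 100 flips.
n_flips, heads = 100, 38
p_null = 0.5   # H0: probability of heads is 0.5
alpha = 0.05   # assumed significance level

# One-sided p-value for H1: p < 0.5, i.e. the chance of seeing
# 38 or fewer heads if H0 were true.
p_value = binomial_cdf(heads, n_flips, p_null)

# Reject H0 when the p-value is below the significance level.
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)
```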
Advanced Statistics
Type II Error: failing to reject the null hypothesis (H0) when H0 is not correct and should
be rejected. The probability of a Type II error is known as Beta Risk (β).
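Beta risk depends on what the true value actually is. The sketch below assumes a one-sided coin test (H0: p = 0.5, H1: p < 0.5) with 100 flips and a significance level of 0.05, then computes the chance of failing to reject H0 for two hypothetical true probabilities of heads. All of these numbers are illustrative assumptions.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(at most k successes in n trials) with success probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

n_flips, p_null, alpha = 100, 0.5, 0.05  # assumed test setup

# Reject H0 when the number of heads is at or below the critical value:
# the largest count whose probability under H0 stays within alpha.
critical = max(c for c in range(n_flips + 1)
               if binomial_cdf(c, n_flips, p_null) <= alpha)

def beta_risk(p_true):
    """P(fail to reject H0) when the true probability of heads is p_true."""
    return 1 - binomial_cdf(critical, n_flips, p_true)

beta_at_04 = beta_risk(0.4)  # true p close to 0.5: easy to miss
beta_at_03 = beta_risk(0.3)  # true p far from 0.5: rarely missed

print(critical, round(beta_at_04, 3), round(beta_at_03, 3))
```

The further the true probability is from the value stated in H0, the smaller the beta risk.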