Basic Stats Session
Data Characteristics: Central Tendency, Variation, Shape
1. Data types
2. Measures of central tendency and measures of spread/variation
3. Normal distribution
DATA TYPES:
1. Continuous data: Data whose unit can be broken into smaller units without changing the
original meaning. E.g.: weight, distance, duration (time), height, density and temperature.
*The normal distribution is meant for continuous data.
2. Discrete data: All types of data other than continuous.
a. Count or percentage: E.g.: count of errors, percentage of errors.
b. Binomial data: Data that can take only one of two values. E.g.: on-time delivery (yes/no).
c. Nominal data: Symbolic, real or given names with no inherent order. E.g.: Employee ID,
Machine 1, Machine 2, Dept A, Dept B.
d. Ordinal data: Names or labels that represent an ordered value. E.g.: name of month,
designation, performance rating such as poor, fair, good, excellent; mild hot, very hot;
strongly disagree, disagree, agree.
Central Tendency
Data tends to be close to its centre.
E.g.: Average mileage of a car, based on data captured over the last 55-60 days: CT = 15 Km/Litre.
Arithmetic average (mean)
A good measure of central tendency when there is not too much variation in the data.
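Avg = (1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000) / 10 = 1000
Avg = (1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 46.5 Bn) / 11
= approx. 4.23 Bn Pounds
A single extreme value pulls the average far away from the typical value of 1000.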
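The effect of one extreme value on the average can be checked with a small sketch. It assumes the example above means ten values of 1000 plus a single value of 46.5 Bn; the variable names are illustrative only. The median is computed alongside for comparison.

```python
from statistics import mean, median

# Ten equal values of 1000, as in the example above.
values = [1000] * 10
# The same data with the single extreme value of 46.5 Bn added.
values_with_outlier = values + [46.5e9]

mean_without = mean(values)
mean_with = mean(values_with_outlier)
median_with = median(values_with_outlier)

print("Mean without the extreme value:", mean_without)
print("Mean with the extreme value (Bn):", round(mean_with / 1e9, 2))
print("Median with the extreme value:", median_with)
```

The mean jumps from thousands to billions, while the median stays at the typical value.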
Median –
Recruiters – 35K * 20 = 7 L
Median = 27K –
Mode
The value that occurs most often (highest frequency of occurrence).
Should be used only when the data has limited possible values.
Batsman = 0, 1, 2, 3, 4, 5, 6
1 2 3 4 5 6 7
12 14 100 100 2 100 1
Which data is occurring the most
Mode = 0
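Mode: Dice
Face: 1 2 3 4 5 6
Frequency: 12 14 19 85 2 42
Mode = 4
Face: 1 2 3 4 5 6
Frequency: 12 14 92 92 2 25
Mode = 3 and 4 (two modes)
Face: 1 2 3 4 5 6
Frequency: 12 14 25 25 25 2
Mode = 3, 4 and 5 (three modes)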
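A minimal sketch of finding the mode from the dice tables above. It assumes the first row of each table is the dice face and the second row is how often that face occurred; the helper `modes` and the table names are hypothetical.

```python
def modes(freq):
    """Return every value whose frequency equals the highest frequency."""
    top = max(freq.values())
    return [value for value, count in freq.items() if count == top]

faces = [1, 2, 3, 4, 5, 6]
# Each table maps a dice face to its frequency of occurrence.
table_a = dict(zip(faces, [12, 14, 19, 85, 2, 42]))
table_b = dict(zip(faces, [12, 14, 92, 92, 2, 25]))
table_c = dict(zip(faces, [12, 14, 25, 25, 25, 2]))

for name, table in (("A", table_a), ("B", table_b), ("C", table_c)):
    print(f"Table {name}: mode(s) = {modes(table)}")
```

Returning a list rather than a single value keeps ties visible: a table can have more than one mode.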
3. Measures of spread/variation (Range, Variance, Std. Dev., Sum of Squares, Quartiles, Stability
Factor, Interquartile Range)
Spread tells us about how the data are distributed around the centre point.
A lot of spread = high variation.
Common measures of spread include range, variance and std. dev.
Variation: variation vis-a-vis centre, and positional information (range, quartiles)
QUARTILE: Quartiles are three points that divide the data set into four equal groups, each
representing a fourth of the population being sampled.
Quartiles are very important for understanding any business, as they explain the behaviour
of the data very well.
First quartile (Q1) = lower quartile = splits off the lowest 25% of data = 25th percentile
Second quartile (Q2) = median = splits the data set in half = 50th percentile
Third quartile (Q3) = upper quartile = splits off the highest 25% of data (or lowest 75%) = 75th
percentile
Min to Q1 = 25%
Q1 to Q2 = 25%
Q2 to Q3 = 25%
Q3 to Max = 25%
Min = 0 days
Q1 = 5 days
Q2 = 15 days
Q3 = 25 days
Max = 100 days
From this data behaviour we conclude that 25% of hiring was done within 0-5 days, the next 25%
within 5-15 days, the next 25% within 15-25 days and the last 25% within
25-100 days.
IQR (Interquartile Range) = Q3 - Q1. It represents the middle 50% of the data.
SF (Stability Factor): An SF closer to 1 means that Q1 and Q3 are close to each other. 1 indicates
that Q1 and Q3 are exactly equal. An SF near 0 indicates that Q1 and Q3 are very far from each other.
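The quartile measures can be sketched in a few lines. Two things here are assumptions, not taken from the notes: the `hiring_days` values are invented so that their quartiles match the hiring example (5, 15 and 25 days), and the Stability Factor is taken to be Q1 / Q3, a common convention that agrees with the description above (1 when Q1 = Q3, near 0 when they are far apart).

```python
from statistics import quantiles

# Hypothetical hiring times in days, invented to reproduce Q1=5, Q2=15, Q3=25.
hiring_days = [0, 2, 5, 5, 10, 15, 15, 20, 25, 25, 60, 100]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(hiring_days, n=4, method="inclusive")

iqr = q3 - q1   # spread of the middle 50% of the data
sf = q1 / q3    # ASSUMPTION: Stability Factor defined as Q1 / Q3

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("SF:", sf)
```

Different quartile methods can give slightly different cut points on small data sets; `method="inclusive"` is just one reasonable choice.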
Variance: Tells how far the data values are from the mean overall.
Variation vis-a-vis centre: Sum of Squares, Variance, Std Dev
Data: 1, 2, 3, 4, 5
Mean = (1 + 2 + 3 + 4 + 5) / 5 = 15/5 = 3
Sum of distances from the mean:
(1-3) + (2-3) + (3-3) + (4-3) + (5-3) = (-2) + (-1) + (0) + (+1) + (+2) = 0
Sum of Squares (SS) = (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 4 + 1 + 0 + 1 + 4 = 10
>> This is the sum of the squared distances of data from its centre (mean)
Variance = SS / (n-1) = 10/4 = 2.5
>> This is the average of the squared distances of data from its centre (mean)
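The worked example above can be reproduced step by step. This is a small sketch using the same data 1, 2, 3, 4, 5; the variable names are illustrative, and `statistics.variance` is used only as a cross-check because it also divides by n-1.

```python
from statistics import mean, variance

data = [1, 2, 3, 4, 5]
centre = mean(data)

# Plain distances from the mean cancel out, which is why they are squared.
deviations = [x - centre for x in data]
sum_of_deviations = sum(deviations)

# Sum of Squares: squared distances of each value from the mean.
sum_of_squares = sum(d ** 2 for d in deviations)

# Sample variance: Sum of Squares divided by n-1.
sample_variance = sum_of_squares / (len(data) - 1)

print("Mean:", centre)
print("Sum of distances:", sum_of_deviations)
print("Sum of Squares:", sum_of_squares)
print("Variance:", sample_variance)
print("Library cross-check:", variance(data))
```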
3. NORMAL DISTRIBUTION: A type of probability distribution with which we can easily assign a numerical
value to the uncertainty of a certain event. Prediction is easy on normally distributed data.
* Mainly applicable to continuous data, but it can also approximate some discrete distributions such
as the Binomial and Poisson.
Empirical Law: A law that helps us plan the future performance of a business. According to this law, if our
data is normally distributed and the mean and std. deviation of the distribution are known, then we
can predict what share of the data falls within a given distance of the mean.
+/-1 Std dev = 68%: Almost 68% of the process will fall within +/-1 std dev from the Mean (i.e.
there is a 68% probability that any given data point will fall within +/-1 std dev from the mean).
+/-2 Std dev = 95%: Almost 95% of the process will fall within +/-2 std dev from the Mean
(i.e. there is a 95% probability that any given data point will fall within +/-2 std dev from the
mean).
+/-3 Std dev = 99.73%: Almost 99.73% of the process will fall within +/-3 std dev from the
Mean (i.e. there is a 99.73% probability that any given data point will fall within +/-3 std dev
from the mean).
Property of the normal distribution: the tails of the curve never touch the X-axis, so the curve
never covers exactly 100% of the process. The probability may be close to 99.99%, but it will
never be 100%.
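The three percentages can be recovered from the standard normal distribution itself. The sketch below uses Python's `statistics.NormalDist` and computes the area under the curve between -k and +k std dev; the name `coverage` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, std dev 1

# Probability of falling within +/-k std dev of the mean.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"+/-{k} std dev: {p:.2%}")
```

The exact figures are slightly above 68% and 95%, which is why the notes say "almost".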
Normality
If a process is to be considered “Normal”, it will follow the rules below.
If it does not, the process is considered non-normal or out of control, and the reason is considered
special cause variation. Statistics mandates that you must conduct RCA (root cause analysis) for such
special behaviour.
22 ---
Production
68%
Under normal conditions the production will be 140 – 260 (we are 99.73% sure)
Target - 300
Route1: Mean = 30 Min, Std Dev = 4 Min
68%: 26 - 34 Min
95%: 22 - 38 Min
99.73%: 30 + 3(4) = 30 + 12 = 42; 30 - 3(4) = 30 - 12 = 18; i.e. 18 - 42 Min
Route2
68%
95%
99.73%: 20 + 3(20) = 20 + 60 = 80; 20 - 3(20) = 20 - 60 = -40, taken as 0 because a duration cannot be negative; i.e. 0 - 80 Min
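The route ranges follow mechanically from the mean and std dev, as this sketch shows. The helper `empirical_ranges` and its `floor` parameter are hypothetical. Route2's mean of 20 and std dev of 20 are inferred from its 99.73% line, and flooring the lower limit at 0 is an assumption based on durations not being negative.

```python
def empirical_ranges(mean, std_dev, floor=None):
    """Return the 68%, 95% and 99.73% ranges as (low, high) tuples."""
    ranges = {}
    for k, label in ((1, "68%"), (2, "95%"), (3, "99.73%")):
        low = mean - k * std_dev
        high = mean + k * std_dev
        if floor is not None:
            low = max(low, floor)  # clip impossible values, e.g. negative minutes
        ranges[label] = (low, high)
    return ranges

route1 = empirical_ranges(30, 4)
route2 = empirical_ranges(20, 20, floor=0)  # values inferred, see lead-in

for name, ranges in (("Route1", route1), ("Route2", route2)):
    for label, (low, high) in ranges.items():
        print(f"{name} {label}: {low} - {high} Min")
```

Route2 has a lower average but a far wider range, so Route1 is the more predictable choice.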
HDFC: Avg = 10%, Std Dev = 1%, so 99.73% of values fall from 7% to 13%
Mileage
68%: 19 - 21
95%: 18 - 22
99.73%: 17 - 23
Every time you see special behaviour – you must conduct RCA
26 -- 24
28 -- 22
Normality
Normal
Following criteria:
15 + 2 = 17, 15 - 2 = 13, i.e. 13 - 17
15 + 6 = 21, 15 - 6 = 9, i.e. 9 - 21
If the data adheres to the above, it is referred to as normal (i.e. following a normal distribution).
If any of the above is not met, the process is considered non-normal and special
cause variation is present in the data.
Mileage
Mean = 15
Std Dev = 1
99.73%: 12 - 18
Mean = 25 Min, Std Dev = 2 Min
68%
25 + 2 = 27
25 - 2 = 23
95%
25 + (2*2) = 29
25 - (2*2) = 21
99.73%
25 + (3*2) = 31
25 - (3*2) = 19
30 min Target
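The 30 min target can be compared with the process directly. This sketch estimates the chance of finishing within the target, assuming the process is normally distributed with mean 25 Min and std dev 2 Min (the std dev is inferred from the 3*2 term above) and assuming the target is an upper limit. Neither assumption is stated in the notes.

```python
from statistics import NormalDist

# ASSUMPTION: normal process with mean 25 Min and std dev 2 Min.
process = NormalDist(mu=25, sigma=2)
target = 30  # minutes, treated as an upper limit

# Probability that a single observation is at or below the target.
p_within_target = process.cdf(target)

print(f"Chance of finishing within {target} Min: {p_within_target:.2%}")
```

The target sits 2.5 std dev above the mean, so nearly all observations fall within it.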
Std Deviation
Variation vis-à-vis Centre
o Centre = Mean; we want to study the distance of data from the centre
o Data: 1, 2, 3, 4, 5
Mean = 3
Distance from Centre
(1-3) + (2-3) + (3-3) + (4-3) + (5-3) = 0
Because the distances always sum to 0, each distance is squared
o Sum of Squares = (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 10
o Sum of Squares is the sum of the squared distances of data from its mean
o Variance = average of the squared distances = Sum of Squares / (n-1) = 10/4 = 2.5
Std Dev = square root of Variance = sqrt(2.5) = approx. 1.58
Root of the average squared distance of data from its centre
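A short sketch of the final step, taking the square root of the variance for the data 1, 2, 3, 4, 5. `statistics.stdev` is shown as a cross-check; like `statistics.variance`, it uses the n-1 divisor.

```python
from math import sqrt
from statistics import stdev, variance

data = [1, 2, 3, 4, 5]

# Std dev is the square root of the sample variance (n-1 divisor).
std_dev = sqrt(variance(data))

print("Std Dev:", round(std_dev, 2))
print("Library cross-check:", round(stdev(data), 2))
```

Unlike the variance, the std dev is in the same unit as the data, which makes it easier to interpret.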
Positional Information
o Quartiles
Q1
Q3
IQR
SF
o Range
Min
Max