Basic Stats Session

Uploaded by Neha Sharma

BASIC STATISTICS

Why do we need data?

• To work on it and understand whether things are in order
• To get meaningful information about the situation as is
• To enable decision making
• To compare achievement vs target
• To show performance and the gaps to work upon
• To predict
  o using historical data

Why do we need data?

• To analyse performance
• To understand the system
• To review the system
• To take the correct decision
• To conclude results
• For comparison
• To identify deviations
• For future planning
• For prediction or anticipation
• To find out discrepancies
• For comparison with a standard
• To determine the cause of a problem
• To review plans and goals
• To eliminate repeatability
• To visualize relationships

There are three basic characteristics of data:

Data characteristics
• Central tendency
• Variation
• Shape

Why do we need Data?

 Analyze the current situation and future prediction


o To Analyze trend
 Prepare for corrective action
 Predict

1. Data types
2. Measure of Central tendency & Measure of spread & variation
3. Normal distribution

DATA TYPES:

1. Continuous data: Data whose unit can be broken into smaller units without changing its original meaning. E.g. weight, distance, duration (time), height, density and temperature.
*The normal distribution is mainly applied to continuous data.
2. Discrete data: All types of data other than continuous.
a. Count or percentage: Count of errors, percentage of errors.
b. Binomial data: Data that can take only one of two values. Ex.: On-time delivery (yes/no)
c. Nominal data: Symbolic name, real name or given name, with no inherent order. Ex.: Employee ID, Machine 1, Machine 2, Dept A, Dept B
d. Ordinal data: Names or labels that carry a rank or order. Ex.: name of month, name of designation, performance rating such as poor, fair, good, excellent; mild hot, very hot; strongly disagree, disagree, agree.

Central Tendency
Data tends to be close to its centre.

Example: the average mileage of a car, based on data captured over the last 55-60 days, gives CT = 15 km/litre.

Central tendency
• Mean: arithmetic average
• Median: positional average
• Mode: frequency of occurrence
Mean (most commonly used measure of CT)

Is a good measure of central tendency when there is not too much variation in the data.

Arithmetic average

Disadvantage: gets impacted by extremely high or extremely low values

Example: 10 families earning 1,000 pounds each, total = 10,000 pounds

Avg = sum of all observations / number of observations

= (1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000) / 10

Average or Mean = 1,000 pounds

Now add LNM, with 46.5 Bn pounds, to the same locality:

Avg = (1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 1000 + 46.5 Bn) / 11

= approx. 4.23 Bn pounds

Going by the mean alone, this locality appears to have only billionaires.

150 – 194, 28000

Median

• Prefer the median when your data has high variation.
• The data is arranged in ascending order, and the value at the 50th percentile position (the middle value) becomes your median.
• Advantage: does not get impacted by extremely high or low data points
• Disadvantage: because it is a positional value, it can't be used for further mathematical calculations

Avg = 1L, Total = 1L * 11 = 11L

Median = 1L, i.e. 50% of the time the value is less than or equal to 1L

Recruiters: 35K * 20 = 7L

Median = 27K
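The contrast between mean and median can be checked with a short Python sketch. It reuses the locality example from these notes (ten families at 1,000 pounds plus one earner at 46.5 Bn pounds); the variable names are illustrative only.

```python
from statistics import mean, median

# Ten families earning 1,000 pounds each (figures from the notes).
incomes = [1000] * 10
print(mean(incomes), median(incomes))

# Add one extreme earner with 46.5 billion pounds.
incomes_with_outlier = incomes + [46_500_000_000]

# The mean is dragged up to billions; the median stays at 1,000.
print(round(mean(incomes_with_outlier)))
print(median(incomes_with_outlier))
```

The mean moves from 1,000 pounds to billions because of a single value, while the median does not move at all, which is why the median is preferred for data with high variation.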

Mode –
Should be used only when you have limited possibilities

Frequency of occurrence

Batsman =

0123456

1 2 3 4 5 6 7
12 14 100 100 2 100 1
Which data is occurring the most

Mode = 0

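Example: dice (top row = face, bottom row = number of times it occurred)

Face    1   2   3   4   5   6
Count   12  14  19  85  2   42

Mode = 4

Face    1   2   3   4   5   6
Count   12  14  92  92  2   25

Mode = 3, 4 (referred to as bi-modal)

Face    1   2   3   4   5   6
Count   12  14  25  25  25  2

Mode = 3, 4, 5 (tri-modal)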
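The dice tables above can be reproduced with a small Python sketch. The helper name `modes` is made up for this illustration; it simply returns every face whose count equals the highest count, so it also handles the bi-modal and tri-modal cases.

```python
def modes(freq):
    """Return every value whose count equals the highest count."""
    top = max(freq.values())
    return sorted(value for value, count in freq.items() if count == top)

faces = [1, 2, 3, 4, 5, 6]

# Frequency tables copied from the dice examples in the notes.
unimodal = dict(zip(faces, [12, 14, 19, 85, 2, 42]))
bimodal = dict(zip(faces, [12, 14, 92, 92, 2, 25]))
trimodal = dict(zip(faces, [12, 14, 25, 25, 25, 2]))

print(modes(unimodal))   # one mode
print(modes(bimodal))    # two modes
print(modes(trimodal))   # three modes
```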

Measure of spread/variation (range, variance, std. dev., sum of squares, quartiles, stability factor, interquartile range)
• Spread tells us how the data is distributed around the centre point.
• A lot of spread = high variation.
• Common measures of spread include range, variance and std. dev.

Variation
• Vis-a-vis centre
• Positional information
  o Range
    ▪ Min (0th position)
    ▪ Max (100th position)
  o Quartiles
    ▪ Q1 (25th position)
    ▪ Q3 (75th position)
    ▪ IQR (Q3 - Q1)
    ▪ SF (Q1 / Q3)
RANGE = Max - Min (the difference between the maximum and minimum observations of the given data)

QUARTILES: Three points that divide the ordered data set into four equal groups, each holding a fourth of the population being sampled.

Quartiles are very important for understanding any business, as they explain the behaviour of the data well.

• First quartile (Q1) = lower quartile = splits off the lowest 25% of data = 25th percentile
• Second quartile (Q2) = median = splits the data set in half = 50th percentile
• Third quartile (Q3) = upper quartile = splits off the highest 25% of data (or lowest 75%) = 75th percentile

Each quarter always holds 25% of the data:

Min to Q1 = 25%

Q1 to Median (Q2) = 25%

Q2 to Q3 = 25%

Q3 to Max = 25%

Example: Upon analysis of recruitment TAT

Min – 0 days

Q1- 5 days

Q2 – 15 days

Q3 – 25 days

Max – 100 days

This data shows that 25% of hiring was done in 0-5 days, the next 25% in 5-15 days, the next 25% in 15-25 days and the last 25% in 25-100 days.

IQR (interquartile range) = Q3 - Q1. It represents the middle 50% of the data.

SF (stability factor) = Q1 / Q3. Best case = 1; worst case = as far from 1 as possible (close to 0).

• An SF closer to 1 means that Q1 and Q3 are closer to each other. 1 indicates that Q1 and Q3 are exactly equal. A value near 0 indicates that Q1 and Q3 are very far from each other.
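The positional measures can be computed with Python's standard library, as in the sketch below. The nine TAT values are hypothetical, chosen only so that the summary matches the recruitment example (0, 5, 15, 25, 100); the "inclusive" quantile method is one of several conventions, and other tools may give slightly different quartiles on other data.

```python
from statistics import quantiles

# Hypothetical recruitment TAT values in days (not from the notes).
tat_days = [0, 3, 5, 10, 15, 20, 25, 60, 100]

# Three cut points that split the ordered data into four quarters.
q1, q2, q3 = quantiles(tat_days, n=4, method="inclusive")

data_range = max(tat_days) - min(tat_days)   # Range = Max - Min
iqr = q3 - q1                                # spread of the middle 50%
sf = q1 / q3                                 # stability factor

print(min(tat_days), q1, q2, q3, max(tat_days))
print(data_range, iqr, sf)
```

Here SF = 0.2, far from 1, which signals that Q1 and Q3 are far apart relative to each other.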

Variance: Tells how far the data values are from the mean overall.

• First calculate the mean of all data points (x-bar)
• Calculate the difference between each data point and the mean
• Square those differences for all data points
• Add the squared values together (called the sum of squares)
• Divide the total by n-1 (no. of data values - 1)

Std deviation: The square root of the variance. It indicates the typical distance of a data point from the mean.

Variation
• Vis-a-vis centre
  o Sum of squares
  o Variance
  o Std dev
• Positional information

Worked example: data = 1, 2, 3, 4, 5 (n = 5)

Mean = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3

Distance of the data from the centre:

(1-3) + (2-3) + (3-3) + (4-3) + (5-3) = (-2) + (-1) + (0) + (+1) + (+2) = 0

The plain distances always cancel out to 0, so each distance is squared first.

Sum of squares (SS) = sum of squared distances of the data from its centre (mean)

= (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2

= 4 + 1 + 0 + 1 + 4 = 10

Variance = SS / (n-1) = 10 / 4 = 2.5 (the average of the squared distances of the data from its centre, using n-1)

Std dev = square root of variance = sqrt(2.5) = approx. 1.58
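The same calculation on the data 1, 2, 3, 4, 5 can be followed step by step in Python. This is a minimal sketch using the n-1 (sample) divisor from these notes; the variable names are illustrative.

```python
from math import sqrt

data = [1, 2, 3, 4, 5]
n = len(data)

mean = sum(data) / n                           # centre of the data
deviations = [x - mean for x in data]          # distance of each point from the centre
sum_of_squares = sum(d ** 2 for d in deviations)
variance = sum_of_squares / (n - 1)            # sample variance, divisor n-1
std_dev = sqrt(variance)

print(sum(deviations))     # plain distances cancel out
print(sum_of_squares)
print(variance)
print(round(std_dev, 2))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n-1 divisor, so they can be used to cross-check the manual steps.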

3. NORMAL DISTRIBUTION: A type of probability distribution that lets us easily assign a numerical value to the uncertainty of an event. Prediction is easy on normally distributed data.

* Mainly applicable to continuous data, but also used as an approximation for some discrete distributions such as the Binomial and Poisson.
Empirical Law: A law that helps us plan the future performance of a business. According to this law, if our data is normally distributed and the mean and std. deviation of the distribution are known, then we can predict our distribution.

• +/-1 std dev = 68%: about 68% of the process will fall within +/-1 std dev of the mean (i.e. there is a 68% probability that any single data point falls within +/-1 std dev of the mean).
• +/-2 std dev = 95%: about 95% of the process will fall within +/-2 std dev of the mean (i.e. there is a 95% probability that any single data point falls within +/-2 std dev of the mean).
• +/-3 std dev = 99.73%: about 99.73% of the process will fall within +/-3 std dev of the mean (i.e. there is a 99.73% probability that any single data point falls within +/-3 std dev of the mean).
• A property of the normal distribution is that the tails of the curve never touch the x-axis, so no finite range covers exactly 100% of the process. The probability may be close to 99.99%, but it will never be 100%.
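The 68 / 95 / 99.73 figures can be verified from the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `coverage_within` is made up for this illustration.

```python
from statistics import NormalDist

def coverage_within(k):
    """Probability that a normal value lies within k std devs of its mean."""
    z = NormalDist()  # standard normal: mean 0, std dev 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"+/-{k} std dev: {coverage_within(k):.2%}")
```

The exact values are slightly above 68% and 95%; the notes use the rounded figures, which is the usual convention.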

Normality
If a process is to be considered “normal”, it will follow the rules below:

• Mean - 1 std dev to Mean + 1 std dev = 68%
• Mean - 2 std dev to Mean + 2 std dev = 95%
• Mean - 3 std dev to Mean + 3 std dev = 99.73%

If these rules are not met, the process is considered non-normal or out of control, and the reason is considered special cause variation. Statistics mandates that you must conduct a root cause analysis (RCA) for such special behaviour.

Mileage: Mean = 15, Std Dev = 1

Coverage   Lower            Upper            Mileage
68%        15 - 1 = 14      15 + 1 = 16      14 - 16
95%        15 - 2(1) = 13   15 + 2(1) = 17   13 - 17
99.73%     15 - 3(1) = 12   15 + 3(1) = 18   12 - 18

22 ---

Production

Mean = 200, Std Dev = 20

Coverage   Lower               Upper               Production
68%        200 - 20 = 180      200 + 20 = 220      180 - 220
95%        200 - 2(20) = 160   200 + 2(20) = 240   160 - 240
99.73%     200 - 3(20) = 140   200 + 3(20) = 260   140 - 260

Under normal conditions the production will be 140 - 260 (we are 99.73% sure)

Target - 300
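The band calculation for the production example can be sketched in Python as follows. The function names `sigma_bands` and `is_special` are hypothetical, and the rule "outside mean +/- 3 std dev counts as special" is taken from these notes. The sketch also checks where the target of 300 sits relative to the 99.73% band.

```python
def sigma_bands(mean, std_dev):
    """Return {k: (lower, upper)} for k = 1, 2, 3 std devs around the mean."""
    return {k: (mean - k * std_dev, mean + k * std_dev) for k in (1, 2, 3)}

def is_special(value, mean, std_dev):
    """True when the value falls outside mean +/- 3 std dev."""
    lower, upper = sigma_bands(mean, std_dev)[3]
    return not (lower <= value <= upper)

# Production example from the notes: mean 200, std dev 20.
production = sigma_bands(200, 20)
print(production)

# A target of 300 lies outside the 140-260 band.
print(is_special(300, 200, 20))
```

A value outside the 3 std dev band is not expected under normal conditions, so reaching it would take a change to the process rather than ordinary variation.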
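Route 1: Mean = 30 min, Std Dev = 4 min

Route 2: Mean = 20 min, Std Dev = 20 min

Route 1
Coverage   Lower              Upper              Travel time
68%        30 - 4 = 26        30 + 4 = 34        26 - 34 min
95%        30 - 2(4) = 22     30 + 2(4) = 38     22 - 38 min
99.73%     30 - 3(4) = 18     30 + 3(4) = 42     18 - 42 min

Route 2
Coverage   Lower               Upper               Travel time
68%        20 - 20 = 0         20 + 20 = 40        0 - 40 min
95%        20 - 2(20) = -20    20 + 2(20) = 60     0 - 60 min
99.73%     20 - 3(20) = -40    20 + 3(20) = 80     0 - 80 min

Travel time cannot be negative, so the lower limit for Route 2 is taken as 0.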
HDFC: Mean = 10%, Std Dev = 1%, so returns range from 7% to 13%

Mean + 3 Std Dev = 10 + 3(1) = 13%

Mean - 3 Std Dev = 10 - 3(1) = 7%

ICICI: Mean = 20%, Std Dev = 20%, so returns range from -40% to 80%

Mean + 3 Std Dev = 20 + 3(20) = 20 + 60 = 80%

Mean - 3 Std Dev = 20 - 3(20) = 20 - 60 = -40%

Mileage

Mean = 20, Std Dev = 1

68%: 19 - 21

95%: 18 - 22

99.73%: 17 - 23

Under normal circumstances, this is what my performance will be.

If the data stays within these limits, consider the process “normal” or within control.

Even if one data point is outside, we term that as special.

Every time you see special behaviour, you must conduct an RCA.

Mean = 25 min, Std Dev = 1 min

Mean - 1 Std Dev to Mean + 1 Std Dev = 68%: 24 - 26 min

Mean - 2 Std Dev to Mean + 2 Std Dev = 95%: 23 - 27 min

Mean - 3 Std Dev to Mean + 3 Std Dev = 99.73%: 22 - 28 min

Normality

Statistics has a definition of the term “normal”. The criteria are:

• Mean +/- 1 std dev = 68% of data
• Mean +/- 2 std dev = 95% of data
• Mean +/- 3 std dev = 99.73% of data

Example (mileage, mean = 15, std dev = 2): 13 - 17 covers 68%, 11 - 19 covers 95%, and 15 - 6 to 15 + 6, i.e. 9 - 21, covers 99.73%.

If any data adheres to the above, it is referred to as normal (or following a normal distribution).

If any of the above is not met, the process is considered non-normal and there is presence of special cause variation in the data.
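One rough way to check these criteria on actual data is to count the share of points that fall within 1, 2 and 3 std devs of the mean and compare it with 68%, 95% and 99.73%. The sketch below does this in Python; the helper `share_within` and the synthetic sample (1,000 evenly spaced quantiles of a normal curve with mean 15 and std dev 1) are assumptions made for illustration, not part of the notes.

```python
from statistics import NormalDist, mean, stdev

def share_within(data, k):
    """Fraction of data points within k std devs of the data's own mean."""
    m, s = mean(data), stdev(data)
    inside = [x for x in data if m - k * s <= x <= m + k * s]
    return len(inside) / len(data)

# Hypothetical mileage sample that follows a normal curve closely.
model = NormalDist(mu=15, sigma=1)
sample = [model.inv_cdf((i + 0.5) / 1000) for i in range(1000)]

for k in (1, 2, 3):
    print(f"within +/-{k} std dev: {share_within(sample, k):.1%}")
```

Real data will never match the three percentages exactly, so this comparison is only a first screen; a formal normality test would be the next step if the decision matters.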

Mileage

Mean = 15, Std Dev = 1

14 - 16: 68%. It is 68% likely that the mileage will be between 14 and 16.

13 - 17: 95%. You are 95% sure that the mileage will be between 13 and 17.

12 - 18: 99.73%

Mean = 25 min, Std Dev = 2 min

25 + 2 = 27

25 - 2 = 23

23 - 27 min = 68% of the time


HDFC MF: Mean = 10%, Std Dev = 1%

ICICI MF: Mean = 40%, Std Dev = 20%

         68%       95%      99.73%
HDFC     9 - 11    8 - 12   7 - 13
ICICI    20 - 60   0 - 80   -20 - 100

Normal data exhibits these properties:

• Mean, median and mode will be equal
• Unimodal, i.e. you will have only one mode and it will be at the centre
• The bell curve will accommodate the entire process performance
• If you divide the bell curve at the centre, you will always have identical behaviour on both sides
• The bell curve never touches the x-axis

Mean must be used with std dev (never use the mean alone)

Median should be used with Min, Q1, Q3 and Max


Mean = 25 min, Std dev = 2 min

99.73%: 25 - (3*2) = 19 to 25 + (3*2) = 31

Target = 30 min

95%: 25 - 4 = 21 to 25 + 4 = 29
Std Deviation

Variation

• Vis-à-vis centre
  o Centre = mean; we want to study the distance of the data from the centre
  o Data: 1, 2, 3, 4, 5
    ▪ Mean = 3
    ▪ Distance from centre
    ▪ (1-3) + (2-3) + (3-3) + (4-3) + (5-3) = 0
    ▪ Because the plain distances cancel out to 0, they are squared
      o (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 10
      o Sum of squares is the sum of squared distances of the data from its mean
      o Variance = average = [(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2] / (n-1)
      o Variance = 10/4 = 2.5
    ▪ Std dev = square root of variance
    ▪ Root of the average squared distance of the data from its centre
• Positional information
  o Quartiles
    ▪ Q1
    ▪ Q3
    ▪ IQR
    ▪ SF
  o Range
    ▪ Min
    ▪ Max
