Business statistics
Session 2
Descriptive statistics
Descriptive statistics provide an organization
and summary of a dataset
A small number of summary measures replaces
the entirety of a dataset
We’ll briefly talk about some simple descriptive
summary measures
Learning Objectives
Distinguish between measures of central
tendency, measures of variability, measures of
shape.
Understand the meanings of mean, median,
mode, quartile and range.
Compute mean, median, mode, quartile, range,
variance, standard deviation, and mean absolute
deviation on ungrouped data.
Differentiate between sample and population
variance and standard deviation
Learning Objectives -- Continued
Understand the meaning of standard deviation
as it is applied by using the empirical rule and
Chebyshev’s theorem.
Compute the mean, median, standard deviation,
and variance on grouped data.
Understand box and whisker plots, skewness,
and kurtosis.
Measures of Central Tendency:
Ungrouped Data
Measures of central tendency yield
information about “particular places or
locations in a group of numbers.”
Common Measures of Location
Mode
Median
Mean
Quartiles
Mode
Mode - the most frequently occurring value in a
data set
Applicable to all levels of data measurement
(nominal, ordinal, interval, and ratio)
Can be used to determine what categories
occur most frequently
Bimodal – In a tie for the most frequently
occurring value, two modes are listed
Multimodal -- Data sets that contain more than
two modes
Median
Median - middle value in an ordered array of
numbers.
For an array with an odd number of terms, the
median is the middle number
For an array with an even number of terms the
median is the average of the middle two
numbers
Arithmetic Mean
Mean is the average of a group of numbers
Applicable for interval and ratio data
Not applicable for nominal or ordinal data
Affected by each value in the data set,
including extreme values
Computed by summing all values in the data
set and dividing the sum by the number of
values in the data set
Demonstration Problem
The number of U.S. cars in service by top car rental
companies in a recent year according to Auto Rental
News follows.
Company Number of Cars in Service
Enterprise 643,000; Hertz 327,000; National/Alamo
233,000; Avis 204,000; Dollar/Thrifty 167,000; Budget
144,000; Advantage 20,000; U-Save 12,000; Payless
10,000; ACE 9,000; Fox 9,000; Rent-A-Wreck 7,000;
Triangle 6,000
Compute the mode, the median, and the mean.
Demonstration Problem
Solution:
Mode: 9,000
Median: With 13 different companies in this
group, N = 13. The median is located at the (13
+1)/2 = 7th position. Because the data are
already ordered, the 7th term is 20,000, which is
the median.
Mean: The total number of cars in service is
1,791,000 = ∑x
μ = ∑x/N = (1,791,000/13) = 137,769.23
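The three answers above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the list name `cars` and the choice to work in thousands of cars are illustrative assumptions, not part of the original problem.

```python
from statistics import mean, median, multimode

# Cars in service, in thousands, copied from the table in the problem.
cars = [643, 327, 233, 204, 167, 144, 20, 12, 10, 9, 9, 7, 6]

modes = multimode(cars)   # list of the most frequently occurring value(s)
middle = median(cars)     # 13 values, so this is the 7th term once ordered
average = mean(cars)      # sum of the values divided by N

print("mode(s):", modes)
print("median:", middle)
print("mean:", round(average, 2))
```

Multiplying each result by 1,000 gives the figures in the solution.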
Quartiles
Quartile - measures of central tendency that
divide a group of data into four subgroups
Q1: 25% of the data set is below the first
quartile
Q2: 50% of the data set is below the second
quartile
Q3: 75% of the data set is below the third
quartile
[Figure: ordered data divided at Q1, Q2, and Q3 into four parts, each holding 25% of the values]
Measures of Variability:
Ungrouped Data
Measures of Variability - tools that describe the
spread or the dispersion of a set of data.
Provides more meaningful data when used
with measures of central tendency
Measures of Variability:
Ungrouped Data
Common Measures of Variability
Range
Inter-quartile Range
Mean Absolute Deviation
Variance
Standard Deviation
Z scores
Coefficient of Variation
Range
The difference between the largest and the
smallest values in a set of data
Advantage – easy to compute
Disadvantage – is affected by extreme values
Interquartile Range
Interquartile Range - range of values between
the first and third quartiles
Range of the “middle half”; middle 50%
Useful when researchers are interested in the
middle 50%, and not the extremes
Interquartile Range – used in the construction
of box and whisker plots
Interquartile Range = Q3 - Q1
Mean Absolute Deviation, Variance,
and Standard Deviation
These measures are not meaningful unless the data
are at least interval level data
One way for researchers to look at the spread of
data is to subtract the mean from each data value
Subtracting the mean from each data value
gives the deviation from the mean (X - µ)
Mean Absolute Deviation, Variance,
and Standard Deviation
An examination of deviation from the mean can
reveal information about the variability of the
data
Deviations are used mostly as a tool to compute
other measures of variability
The Sum of Deviation from the arithmetic mean
is always zero
Sum (X - µ) = 0
Mean Absolute Deviation, Variance,
and Standard Deviation
An obvious way to force the sum of deviations to
have a nonzero total is to take the absolute
value of each deviation around the mean
Allows one to solve for the Mean Absolute
Deviation
Mean Absolute Deviation (MAD)
Mean Absolute Deviation - average of the
absolute deviations from the mean
X       X - µ    |X - µ|
5        -8         8
9        -4         4
16       +3         3
17       +4         4
18       +5         5
Total     0        24

µ = 13
M.A.D. = ∑|X - µ| / N = 24/5 = 4.8
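The MAD calculation for the five values 5, 9, 16, 17, 18 can be reproduced step by step. This is a small sketch in plain Python; the variable names are illustrative.

```python
values = [5, 9, 16, 17, 18]

mu = sum(values) / len(values)            # arithmetic mean
deviations = [x - mu for x in values]     # X - mu for each value
abs_deviations = [abs(d) for d in deviations]

# The raw deviations cancel out; the absolute deviations do not.
mad = sum(abs_deviations) / len(values)

print("mean:", mu)
print("sum of deviations:", sum(deviations))
print("MAD:", mad)
```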
Population Variance
Variance - average of the squared deviations
from the arithmetic mean
Population variance is denoted by σ²
The numerator is the sum of squared deviations (SSD)
about the mean of a set of values (also called the sum
of squares of X)
Population Variance
X       X - µ    (X - µ)²
5        -8        64
9        -4        16
16       +3         9
17       +4        16
18       +5        25
Total     0       130

σ² = ∑(X - µ)² / N = 130/5 = 26.0
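The population variance for the same five values can be computed from the sum of squared deviations and cross-checked against the standard library. A minimal sketch; `ssd` and `sigma_sq` are illustrative names.

```python
from statistics import pvariance

values = [5, 9, 16, 17, 18]
mu = sum(values) / len(values)

# Sum of squared deviations about the mean (the numerator of the variance).
ssd = sum((x - mu) ** 2 for x in values)

# Population variance divides by N, the number of values.
sigma_sq = ssd / len(values)

print("SSD:", ssd)
print("population variance:", sigma_sq)
print("statistics.pvariance:", pvariance(values))
```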
Sample Variance
Sample Variance - sum of the squared deviations
from the sample mean, divided by n - 1
Sample Variance – denoted by S²
X         X - X̄     (X - X̄)²
2,398      625      390,625
1,844       71        5,041
1,539     -234       54,756
1,311     -462      213,444
7,092        0      663,866   (totals)

Where, X̄ = 1,773
S² = ∑(X - X̄)² / (n - 1) = 663,866/3 = 221,288.67
Sample Standard Deviation
Sample Std Dev is the square root of the sample
variance.
X         X - X̄     (X - X̄)²
2,398      625      390,625
1,844       71        5,041
1,539     -234       54,756
1,311     -462      213,444
7,092        0      663,866   (totals)

S² = ∑(X - X̄)² / (n - 1) = 663,866/3 = 221,288.67
S = √S² = √221,288.67 = 470.41
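The sample variance and sample standard deviation for the four values 2,398, 1,844, 1,539, and 1,311 can be verified as follows. This sketch relies on `statistics.variance` and `statistics.stdev`, which divide by n - 1; the name `sample` is illustrative.

```python
from statistics import mean, stdev, variance

sample = [2398, 1844, 1539, 1311]

x_bar = mean(sample)                          # sample mean
ssd = sum((x - x_bar) ** 2 for x in sample)   # sum of squared deviations

s_sq = variance(sample)   # sample variance: SSD / (n - 1)
s = stdev(sample)         # sample standard deviation: square root of s_sq

print("mean:", x_bar)
print("SSD:", ssd)
print("sample variance:", round(s_sq, 2))
print("sample std dev:", round(s, 2))
```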
Empirical Rule – used to state the approximate
percentage of values that lie within a given
number of standard deviations of the mean of a set of
data if the data are normally distributed
Empirical rule is used only for three numbers of
standard deviations: 1σ, 2σ, and 3σ
µ ± 1σ contains approximately 68% of the data;
µ ± 2σ contains approximately 95% of the data; and
µ ± 3σ contains approximately 99.7% of the data
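The empirical-rule percentages come from the normal distribution itself, and can be recovered numerically. A brief sketch using `statistics.NormalDist` (Python 3.8 or later); the helper name `normal_coverage` is an illustrative assumption.

```python
from statistics import NormalDist

def normal_coverage(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    standard = NormalDist(mu=0.0, sigma=1.0)
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")
```

The three results are roughly 0.68, 0.95, and 0.997.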
Chebyshev’s Theorem
Empirical rule – applies when data are
approximately normally distributed
Chebyshev’s Theorem – applies to all
distributions, and can be used whenever the data
distribution shape is unknown or non-normal
P(µ - kσ < X < µ + kσ) ≥ 1 - 1/k²
for k > 1
Chebyshev’s Theorem
Chebyshev’s Theorem - states that at least a proportion
(1 – 1/k²) of the values fall within ±k standard deviations of
the mean regardless of the shape of the
distribution
Example: At least 75% of all values are within
±2σ of the mean regardless of the shape of a
distribution
when k = 2, then (1 – 1/k²) = .75
when k = 3, then (1 – 1/k²) = .89
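The Chebyshev bound is a one-line calculation. A minimal sketch; the function name `chebyshev_lower_bound` is illustrative.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is stated here for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2f}")
```

Because the bound holds for any distribution shape, it is weaker than the empirical rule: at k = 2 it guarantees only 75%, against about 95% for normally distributed data.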
Demonstration Problem
The effectiveness of district attorneys can be measured by
several variables, including the number of convictions per
month, the number of cases handled per month, and the total
number of years of conviction per month. A researcher uses a
sample of five district attorneys in a city and determines the
total number of years of conviction that each attorney won
against defendants during the past month, as reported in the
first column in the following tabulations. Compute the mean
absolute deviation, the variance, and the standard deviation
for these figures.
Demonstration Problem
Solution
The researcher computes the mean absolute deviation,
the variance, and the standard deviation for these
data in the following manner.
x        |x - x̄|    (x - x̄)²
55         41        1,681
100         4           16
125        29          841
140        44        1,936
60         36        1,296
∑x = 480   ∑|x - x̄| = 154   ∑(x - x̄)² = 5,770
x̄ = 96
Demonstration Problem
The computational formulas can also be used to
solve for S² and S, and the results compared.
S² = 5,770/4 = 1,442.5 and S = square
root of the variance = √1,442.5 = 37.98
MAD = 154/5 = 30.8
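The figures in this demonstration problem can be reproduced, including the comparison between the deviation formula and the computational formula S² = (∑x² - (∑x)²/n) / (n - 1). This is a sketch only; the variable names are illustrative.

```python
from math import sqrt

years = [55, 100, 125, 140, 60]   # years of conviction per attorney
n = len(years)
x_bar = sum(years) / n

# Mean absolute deviation.
mad = sum(abs(x - x_bar) for x in years) / n

# Sample variance from the deviations about the mean.
s_sq_deviation = sum((x - x_bar) ** 2 for x in years) / (n - 1)

# Sample variance from the computational formula (no deviations needed).
s_sq_computational = (sum(x * x for x in years) - sum(years) ** 2 / n) / (n - 1)

s = sqrt(s_sq_deviation)

print("mean:", x_bar)
print("MAD:", mad)
print("S^2 (deviation form):", s_sq_deviation)
print("S^2 (computational form):", s_sq_computational)
print("S:", round(s, 2))
```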
Z Scores
Z score – represents the number of Std Dev a
value (x) is above or below the mean of a set
of numbers when the data are normally
distributed
Z score allows translation of a value’s raw
distance from the mean into units of std dev.
Z = (x-µ)/σ
Z Scores
If Z is negative, the raw value (x) is below the
mean
If Z is positive, the raw value (x) is above the
mean
For normally distributed data, between
Z = ±1 are approx. 68% of the values
Z = ±2 are approx. 95% of the values
Z = ±3 are approx. 99.7% of the values
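A Z score is a direct application of Z = (x - µ)/σ. The sketch below uses made-up numbers (a mean of 50 and a standard deviation of 10) purely for illustration; they do not come from the slides.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical population: mean 50, standard deviation 10.
print(z_score(70, mu=50, sigma=10))   # two standard deviations above the mean
print(z_score(35, mu=50, sigma=10))   # one and a half below the mean
```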
Coefficient of Variation
Coefficient of Variation (CV) - ratio of the standard
deviation to the mean, expressed as a percentage
useful when comparing Std Dev computed from data
with different means
Measurement of relative dispersion
C.V. = (σ/µ)(100)
Coefficient of Variation
µ₁ = 29, σ₁ = 4.6
µ₂ = 84, σ₂ = 10
C.V.₁ = (σ₁/µ₁)(100) = (4.6/29)(100) = 15.86
C.V.₂ = (σ₂/µ₂)(100) = (10/84)(100) = 11.90
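The two coefficients of variation in this example (µ = 29 with σ = 4.6, and µ = 84 with σ = 10) can be compared with a short helper. A minimal sketch; the function name is illustrative.

```python
def coefficient_of_variation(sigma: float, mu: float) -> float:
    """Standard deviation as a percentage of the mean."""
    if mu == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return sigma / mu * 100

cv1 = coefficient_of_variation(4.6, 29)
cv2 = coefficient_of_variation(10, 84)

print(round(cv1, 2), round(cv2, 2))
```

Although the second data set has the larger standard deviation, its relative dispersion is smaller.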
Measures of Central Tendency
and Variability: Grouped Data
Measures of Central Tendency
Mean
Median
Mode
Measures of Variability
Variance
Standard Deviation
Measures of Central Tendency
and Variability: Grouped Data
Mean – The midpoint of each class interval is used
to represent all the values in a class interval
Midpoint is weighted by the frequency of values in
the class interval
Mean is computed by summing the products of the
class midpoint and the class frequency for each
class and dividing that sum by the total number of
frequencies
Measures of Central Tendency
and Variability: Grouped Data
Median – The middle value in an ordered array
of numbers
Mode – the mode for grouped data is the class
midpoint of the modal class
The modal class is the class interval with the
greatest frequency
Calculation of Grouped Mean
Class Interval   Frequency (f)   Class Midpoint (M)    fM
20-under 30            6                25            150
30-under 40           18                35            630
40-under 50           11                45            495
50-under 60           11                55            605
60-under 70            3                65            195
70-under 80            1                75             75
Total                 50                             2150

µ = ∑fM / ∑f = 2150/50 = 43.0
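The grouped mean can be computed directly from the class limits and frequencies. In this sketch each class is stored as a `(lower, upper, frequency)` tuple, which is an illustrative representation rather than anything prescribed by the slides.

```python
# (lower limit, upper limit, frequency) for each class interval.
classes = [(20, 30, 6), (30, 40, 18), (40, 50, 11),
           (50, 60, 11), (60, 70, 3), (70, 80, 1)]

total_f = sum(f for _, _, f in classes)

# Each class is represented by its midpoint, weighted by its frequency.
sum_fm = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

grouped_mean = sum_fm / total_f

print("sum of f:", total_f)
print("sum of fM:", sum_fm)
print("grouped mean:", grouped_mean)
```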
Median of Grouped Data - Example
Class Interval   Freq   Cum. Freq
20-under 30        6        6
30-under 40       18       24
40-under 50       11       35
50-under 60       11       46
60-under 70        3       49
70-under 80        1       50
N = 50

Md = L + ((N/2 - cfp) / fmed) × W
   = 40 + ((50/2 - 24) / 11) × 10
   = 40.909
Where,
Md = median value
L = lower limit of median class
cfp = cumulative frequency of class preceding median class
fmed = frequency of median class
W = width of class interval
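The grouped-median formula Md = L + ((N/2 - cfp)/fmed) × W can be sketched as a short function that walks the cumulative frequencies until it reaches the median class. The `(lower, upper, frequency)` tuples and the function name are illustrative assumptions.

```python
def grouped_median(classes):
    """Median of grouped data given (lower, upper, frequency) tuples in order."""
    n = sum(f for _, _, f in classes)
    cum_before = 0                      # cfp: cumulative frequency so far
    for lower, upper, f in classes:
        if cum_before + f >= n / 2:     # this is the median class
            width = upper - lower
            return lower + (n / 2 - cum_before) / f * width
        cum_before += f
    raise ValueError("classes must contain at least one observation")

classes = [(20, 30, 6), (30, 40, 18), (40, 50, 11),
           (50, 60, 11), (60, 70, 3), (70, 80, 1)]

print(round(grouped_median(classes), 3))
```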
Mode of Grouped Data
Class Interval   Frequency
20-under 30          6
30-under 40         18
40-under 50         11
50-under 60         11
60-under 70          3
70-under 80          1

Modal class: 30-under 40
Mode = (30 + 40)/2 = 35
Mode of Grouped Data contd…
Mode is given by:
Mode = L + (d1 / (d1 + d2)) × W
d1 = f1 - f0
d2 = f1 - f2
Where,
L = lower limit of modal class
W = width of class
f1 = freq of modal class
f2 = freq of class after the modal class
f0 = freq of class preceding modal class

Class   Freq.
0-1       1
1-2       4
2-3       8
3-4       7
4-5       3
5-6       2
Total    25

Here, L = 2, W = 1, d1 = 8 - 4 = 4, d2 = 8 - 7 = 1
So mode = 2 + (4/5) × 1 = 2.8
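The interpolation formula for the grouped mode can be sketched as follows. Treating a missing neighbouring class (when the modal class is first or last) as having frequency 0 is an assumption of this sketch, as are the tuple layout and the function name.

```python
def grouped_mode(classes):
    """Mode of grouped data given (lower, upper, frequency) tuples in order."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))          # first class with the greatest frequency
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0                 # class before the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0    # class after the modal class
    d1, d2 = f1 - f0, f1 - f2
    return lower + d1 / (d1 + d2) * (upper - lower)

classes = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]

print(round(grouped_mode(classes), 2))
```

Unlike the class-midpoint rule, which would give 2.5 for this table, the interpolated mode is pulled toward the neighbouring class with the higher frequency.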
Variance and Standard Deviation
of Grouped Data
Population:
σ² = ∑f(M - µ)² / N
σ = √σ²
Sample:
S² = ∑f(M - X̄)² / (n - 1)
S = √S²
Population Variance and Standard
Deviation of Grouped Data
Class Interval    f     M     fM    M - µ   (M - µ)²   f(M - µ)²
20-under 30       6    25    150     -18       324        1944
30-under 40      18    35    630      -8        64        1152
40-under 50      11    45    495       2         4          44
50-under 60      11    55    605      12       144        1584
60-under 70       3    65    195      22       484        1452
70-under 80       1    75     75      32      1024        1024
Total            50         2150                          7200

µ = 43
σ² = ∑f(M - µ)² / N = 7200/50 = 144
σ = √144 = 12
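The grouped population variance and standard deviation can be reproduced from the class midpoints and frequencies. A minimal sketch; the `(lower, upper, frequency)` tuples and variable names are illustrative.

```python
from math import sqrt

classes = [(20, 30, 6), (30, 40, 18), (40, 50, 11),
           (50, 60, 11), (60, 70, 3), (70, 80, 1)]

n = sum(f for _, _, f in classes)
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [f for _, _, f in classes]

mu = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Sum of f * (M - mu)^2 over all classes, then divide by N for a population.
weighted_ssd = sum(f * (m - mu) ** 2 for f, m in zip(freqs, midpoints))
sigma_sq = weighted_ssd / n
sigma = sqrt(sigma_sq)

print("mean:", mu)
print("sum of f(M - mu)^2:", weighted_ssd)
print("variance:", sigma_sq)
print("std dev:", sigma)
```

For sample data the divisor would be n - 1 instead of N.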
Further Measures of the Distribution
• While measures of dispersion are useful for helping
us describe the width of the distribution, they tell us
nothing about the shape of the distribution
Further Measures of the Distribution
There are further statistics that describe the
shape of the distribution, using formulae that
are similar to those of the mean and variance
Mean (describes central value)
Variance (describes dispersion)
Skewness (describes asymmetry)
Kurtosis (describes peakedness)
Measures of Shape: Skewness and
Kurtosis
A fundamental task in many statistical analyses is to
characterize the location and variability of a data set
(Measures of central tendency vs. measures of
dispersion)
Neither measure tells us anything about the shape of the
distribution
A further characterization of the data includes
skewness and kurtosis
The histogram is an effective graphical technique for
showing both the skewness and kurtosis of a data
set
Skewness-Measure of asymmetry
Symmetrical – the right half is a mirror image of
the left half
Skewness – shows that the distribution lacks
symmetry; used to denote that the data are sparse at
one end and piled up at the other end
Absence of symmetry
Extreme values in one side of a distribution
Types of Skewness
Skewness-Measure of asymmetry
If skewness equals zero, the histogram is symmetric about the
mean
Positive skewness
There are more observations below the mean than above it
When the mean is greater than the median
Negative skewness
There are a small number of low observations and a large
number of high ones
When the median is greater than the mean
Coefficient of Skewness
Coefficient of Skewness (Sk) - compares the
mean and median in light of the magnitude of
the standard deviation; µ is the mean; Md is the
median; σ is the Std Dev
Sk = 3(µ - Md) / σ
Coefficient of Skewness
Summary measure for skewness
Sk = 3(µ - Md) / σ
If Sk < 0, the distribution is negatively skewed
(skewed to the left).
If Sk = 0, the distribution is symmetric (not
skewed).
If Sk > 0, the distribution is positively skewed
(skewed to the right).
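The coefficient of skewness Sk = 3(µ - Md)/σ can be tried on the car rental data from the earlier demonstration problem, where the mean is far above the median. This sketch treats the 13 companies as a population (hence `pstdev`); that choice and the function name are assumptions made for illustration.

```python
from statistics import mean, median, pstdev

def coefficient_of_skewness(values):
    """Sk = 3 * (mean - median) / standard deviation (population form)."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

# Cars in service, in thousands, from the car rental demonstration problem.
cars = [643, 327, 233, 204, 167, 144, 20, 12, 10, 9, 9, 7, 6]

sk = coefficient_of_skewness(cars)
print("Sk =", round(sk, 2))
print("positively skewed" if sk > 0 else "not positively skewed")
```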
Further Moments – Kurtosis
The kurtosis of a normal distribution is 0
(Mesokurtic) when kurtosis is measured relative to
the normal distribution (excess kurtosis)
Kurtosis characterizes the relative peakedness
or flatness of a distribution compared to the
normal distribution
Further Moments – Kurtosis
Platykurtic – When the kurtosis < 0, the frequencies
throughout the curve are closer to being equal (i.e., the
curve is flatter and wider)
Thus, negative kurtosis indicates a relatively flat
distribution
Leptokurtic – When the kurtosis > 0, there are high
frequencies in only a small part of the curve (i.e., the
curve is more peaked)
Thus, positive kurtosis indicates a relatively peaked
distribution
Low vs high Kurtosis
[Figure: a leptokurtic curve compared with a platykurtic curve]
Kurtosis is based on the size of a distribution's tails.
Negative kurtosis (platykurtic) – distributions with
short tails
Positive kurtosis (leptokurtic) – distributions with
relatively long tails
Why Do We Need Kurtosis?
These two distributions have the same variance,
approximately the same skew, but differ markedly in
kurtosis.
How to Graphically Summarize Data?
Histograms
Box plots
Functions of a Histogram
The function of a histogram is to graphically
summarize the distribution of a data set
The histogram graphically shows the following:
1. Center (i.e., the location) of the data
2. Spread (i.e., the scale) of the data
3. Skewness of the data
4. Kurtosis of the data
5. Presence of outliers
6. Presence of multiple modes in the data.
Functions of a Histogram contd..
The histogram can be used to answer the
following questions:
1. What kind of population distribution do the
data come from?
2. Where are the data located?
3. How spread out are the data?
4. Are the data symmetric or skewed?
5. Are there outliers in the data?
Box Plots
We can also use a box plot to graphically summarize a data set
A box plot represents a graphical summary of what is
sometimes called a “five-number summary” of the distribution
Minimum
Maximum
25th percentile
75th percentile
Median
[Figure: box plot showing the five-number summary and the interquartile range (IQR)]
Box Plots
Example – Consider the first 9 stock prices (in $000)
6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0
Arrange these in order of magnitude
3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0
The median is Q2 = 6.7 (there are 4 values on either
side)
Q1 = 5.9 (median of the 4 smallest values)
Q3 = 10.2 (median of the 4 largest values)
IQR = Q3 – Q1 = 10.2 - 5.9 = 4.3
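The quartiles in this example follow the "median of each half" method, which can be sketched as below. Other quartile conventions exist and may give slightly different values on other data; the variable names are illustrative.

```python
from statistics import median

prices = [6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0]

ordered = sorted(prices)
half = len(ordered) // 2          # size of each half; middle value excluded when n is odd

q2 = median(ordered)
q1 = median(ordered[:half])       # median of the smallest values
q3 = median(ordered[-half:])      # median of the largest values
iqr = q3 - q1

five_number_summary = (ordered[0], q1, q2, q3, ordered[-1])

print("five-number summary:", five_number_summary)
print("IQR:", round(iqr, 4))
```

The unrounded quartiles are 5.895 and 10.2375, which the slide reports as 5.9 and 10.2.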