Ch3 Numerically Summarizing Data

- The document discusses various statistical measures used to summarize quantitative data, including the mean, median, mode, range, standard deviation, variance, percentiles, quartiles, and interquartile range.
- It provides formulas and steps to calculate each measure and explains how to interpret the results. For example, the mean is the average value, the median splits the data in half, and the standard deviation indicates how spread out the data are around the mean.
- Guidance is given on choosing the appropriate statistical measure based on the characteristics of the data, such as whether it is resistant to outliers. The five-number summary and boxplot are also introduced as visual tools to summarize a dataset.

Business Statistics

Dr. Gómez

Numerically Summarizing Data
Measures of central tendency: Mean
• To compute the arithmetic mean of a set of data, the data must be quantitative.
• The arithmetic mean of a variable is computed by adding all the values of the variable in the data set and dividing by the number of observations.
• The population arithmetic mean, $\mu$, is computed using all the individuals in a population. The population mean is a parameter.
• The sample arithmetic mean, $\bar{x}$, is computed using sample data. The sample mean is a statistic.
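Formula to calculate the mean
• If $x_1, x_2, \ldots, x_N$ are the N observations of a variable from a population, then the population mean, $\mu$, is:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

• If $x_1, x_2, \ldots, x_n$ are the n observations of a variable from a sample, then the sample mean, $\bar{x}$, is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$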
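As a quick illustration of the mean formula, here is a minimal Python sketch. The function name `arithmetic_mean` and the quiz scores are made up for this example and do not come from the slides.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by the number of observations."""
    if not values:
        raise ValueError("the mean is undefined for an empty data set")
    return sum(values) / len(values)


# Hypothetical quiz scores, invented for illustration only.
scores = [82, 77, 90, 71, 62]

# The sum is 382 and there are 5 observations.
print(arithmetic_mean(scores))  # 76.4
```

The same computation applies to a population or a sample; only the symbol ($\mu$ or $\bar{x}$) and the interpretation (parameter or statistic) change.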
Mean as a center of gravity
Measures of central tendency: Median
• The median of a variable is the value that lies in the middle of the data when arranged in ascending order. We use M to represent the median.
• Steps to find the median:
1. Arrange the data in ascending order.
2. Determine the number of observations, n.
3. Determine the observations in the middle of the data set.
Odd and even number of observations
• If the number of observations is odd, then the median is the data value exactly in the middle of the data set. That is, the median is the observation that lies in the $\frac{n+1}{2}$ position.
• If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the observations that lie in the $\frac{n}{2}$ position and the $\frac{n}{2} + 1$ position.
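The steps for the odd and even cases can be sketched in Python as follows. The function name `median` and the sample inputs are assumptions for illustration; note that Python lists are indexed from 0, so the position $\frac{n+1}{2}$ corresponds to index `n // 2`.

```python
def median(values):
    """Median following the slide's steps: sort, count, pick the middle."""
    ordered = sorted(values)           # Step 1: ascending order
    n = len(ordered)                   # Step 2: number of observations
    if n == 0:
        raise ValueError("the median is undefined for an empty data set")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the observation in position (n + 1) / 2 (index n // 2).
        return ordered[middle]
    # Even n: mean of the observations in positions n / 2 and n / 2 + 1.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([5, 1, 3]))      # 3
print(median([4, 1, 3, 2]))   # 2.5
```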
Mean vs median
Resistant data
A numerical summary of data is said to be resistant if extreme values (very large or very small values, or outliers) relative to the data do not affect its value substantially.

In this example, the extreme value 48 alters the mean. We conclude that the median is resistant, while the mean is not resistant.

If the mean and the median are close in value, and the distribution is symmetric, we use the mean to describe the data.
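Resistance can be checked on a small data set. The sketch below uses hypothetical values (they are not the data from the slide's example) and replaces the largest observation with the extreme value 48 to see how each measure responds.

```python
import statistics

# Hypothetical data, invented for illustration; not the slide's data set.
original = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 48]   # largest value replaced by an extreme value

print(statistics.mean(original), statistics.median(original))          # 3 3
print(statistics.mean(with_extreme), statistics.median(with_extreme))  # 11.6 3
```

The mean moves from 3 to 11.6 while the median stays at 3, which is what it means for the median to be resistant and the mean not.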
Mean or median?
Measures of central tendency: Mode
• The mode of a variable is the most frequent
observation of the variable that occurs in
the data set.
• To compute the mode, tally the number of
observations that occur for each data value.
• The data value that occurs most often is the
mode.
• A set of data can have no mode, one mode,
or more than one mode.
• If no observation occurs more than once,
we say the data have no mode.
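The tallying procedure above can be sketched with Python's `collections.Counter`. The function name `modes` and the inputs are hypothetical; the function returns a list so that it can represent no mode, one mode, or several modes.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or [] when nothing repeats."""
    if not values:
        return []
    counts = Counter(values)            # tally each data value
    highest = max(counts.values())
    if highest == 1:
        return []                       # no observation occurs more than once
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 2, 3]))      # [2]
print(modes([1, 1, 2, 2, 3]))   # [1, 2]
print(modes([1, 2, 3]))         # []
```

Because the tally only counts occurrences, the same function also works for non-numeric labels such as `["red", "blue", "red"]`.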
Bimodal distribution

• When the data set has two modes, we call the distribution bimodal.
• If the data set has more than two modes, then the distribution is called multimodal.
• We cannot determine the value of the
mean or median of data that are
nominal.
• The only measure of central tendency
that can be determined for nominal data
is the mode.
When to use?
Measures of dispersion
Measures of dispersion are meant to describe how spread out data are.

In other words, they describe how far, on average, each observation is from the typical data value.
Measures of dispersion
Measures of dispersion: Range

• To compute the range, the data must be quantitative.
• The range, R, of a variable is the difference between the largest and the smallest data value. That is, $R = \text{largest data value} - \text{smallest data value}$.
• The range is affected by extreme values, so the range is not resistant.
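A minimal sketch of the range computation, using made-up values. The function name `data_range` is an assumption (chosen to avoid shadowing Python's built-in `range`).

```python
def data_range(values):
    """Largest data value minus smallest data value."""
    if not values:
        raise ValueError("the range is undefined for an empty data set")
    return max(values) - min(values)


# Hypothetical observations, invented for illustration.
data = [3, 9, 4, 7]
print(data_range(data))           # 6

# Adding one extreme value changes the range sharply,
# which is why the range is not resistant.
print(data_range(data + [48]))    # 45
```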


Measures of dispersion: Standard deviation
• Standard deviation is based on the deviation about the mean.
• For a population, the deviation about the mean for the ith observation is $x_i - \mu$.
• For a sample, the deviation about the mean for the ith observation is $x_i - \bar{x}$.
Understanding the standard deviation
• The farther an observation is from the mean, the larger the absolute value of the deviation.
• The sum of all deviations about the mean must equal zero. This condition is always true. That is, $\sum (x_i - \mu) = 0$ and $\sum (x_i - \bar{x}) = 0$.
• Because this sum is zero, we cannot use the average deviation about the mean as a measure of spread.
• We calculate the mean of the squared deviations because squaring a nonzero number always results in a positive number. This leads to variance.
• Variance is difficult to interpret because its units are squared (such as dollars squared).
• We “undo” the squaring process by taking the square root of the variance, which gives the standard deviation.
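The reasoning above can be traced numerically. The sketch below uses a hypothetical data set chosen so that the mean is a whole number; with other data, floating-point rounding may leave the sum of deviations a tiny distance from zero.

```python
import math

# Hypothetical data, invented so that the mean is exactly 5.
data = [2, 4, 6, 8]
mean = sum(data) / len(data)                    # 5.0

deviations = [x - mean for x in data]           # [-3.0, -1.0, 1.0, 3.0]
total_deviation = sum(deviations)               # 0.0: negatives cancel positives

squared = [d ** 2 for d in deviations]          # [9.0, 1.0, 1.0, 9.0]
mean_squared_deviation = sum(squared) / len(data)   # 5.0, in squared units

# Taking the square root returns to the original units.
spread = math.sqrt(mean_squared_deviation)

print(total_deviation, mean_squared_deviation, round(spread, 3))
```

Here the data are treated as a whole population, so the squared deviations are averaged over all four observations.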
Population standard deviation

• The population standard deviation of a variable is the square root of the sum of squared deviations about the population mean divided by the number of observations in the population, N.
• That is, it is the square root of the mean of the squared deviations about the population mean.
• It is represented by the lowercase Greek letter sigma, $\sigma$.
Example: population standard deviation

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
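A minimal sketch of the population standard deviation formula, checked against `statistics.pstdev` from Python's standard library. The function name `population_std` and the data are hypothetical.

```python
import math
import statistics


def population_std(values):
    """Square root of the mean squared deviation about the population mean."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


# Hypothetical population, invented for illustration.
population = [1, 2, 3, 4, 5]

# mu = 3, squared deviations sum to 10, and 10 / 5 = 2, so sigma = sqrt(2).
sigma = population_std(population)
print(round(sigma, 4))                              # 1.4142
print(round(statistics.pstdev(population), 4))      # 1.4142
```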
Sample standard deviation
• The sample standard deviation, s, of a variable is the square root of the sum of the squared deviations about the sample mean divided by $n - 1$, where n is the sample size. That is, $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$.
• We call n - 1 the degrees of freedom because the first n - 1 observations have freedom to be whatever value they wish, but the nth value has no freedom. It must be whatever value forces the sum of the deviations about the mean to equal zero.
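The sketch below computes the sample standard deviation with the $n - 1$ divisor and then illustrates the degrees-of-freedom idea: once the sample mean and the first n - 1 observations are known, the last observation is forced. The function name `sample_std` and the data are hypothetical.

```python
import math
import statistics


def sample_std(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


# Hypothetical sample, invented for illustration.
sample = [4, 8, 6, 2]
s = sample_std(sample)
print(round(s, 4), round(statistics.stdev(sample), 4))   # both 2.582

# Degrees of freedom: given the mean and the first n - 1 values,
# the nth value is the one that makes the deviations sum to zero.
x_bar = sum(sample) / len(sample)
forced_last = len(sample) * x_bar - sum(sample[:-1])
print(forced_last)   # 2.0, the last value in the sample
```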
Interpretation of the standard deviation

The mean measures the center of the distribution, while the standard deviation measures the spread of the distribution.

If we are comparing two populations, then the larger the standard deviation, the more dispersion the distribution has, provided that the variable of interest from the two populations has the same unit of measure.
Measures of dispersion: Variance
• The variance of a variable is the square of the standard deviation. The population variance is $\sigma^2$ and the sample variance is $s^2$.
• What if we divided by $n$ instead of $n - 1$ to obtain the sample variance, as one might expect?
• Dividing by $n$ would make the sample variance consistently underestimate the population variance, so it would be a biased estimator.
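The bias can be seen exactly on a tiny case. The sketch below is an illustration under stated assumptions, not a proof: it takes a hypothetical three-value population, lists every ordered sample of size 2 drawn with replacement, and averages the sample variance computed with each divisor. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor):
    """Sum of squared deviations about the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values) / divisor


# Hypothetical population, invented for illustration.
population = [2, 4, 6]
sigma_squared = variance(population, len(population))     # 8/3

n = 2
samples = list(product(population, repeat=n))             # all 9 ordered samples

avg_dividing_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
avg_dividing_by_n = sum(variance(s, n) for s in samples) / len(samples)

print(sigma_squared)               # 8/3
print(avg_dividing_by_n_minus_1)   # 8/3, matches the population variance
print(avg_dividing_by_n)           # 4/3, too small on average
```

Averaged over all possible samples in this setup, dividing by $n - 1$ recovers the population variance, while dividing by $n$ falls short of it.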
Determine and Interpret z-Scores

• The z-score measures the number of standard deviations an observation is above or below the mean.
• If a data value is larger than the mean, the z-score is positive.
• If a data value is smaller than the mean, the z-score is negative.
• If the data value equals the mean, the z-score is zero.
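The z-score is computed as $z = \frac{x - \mu}{\sigma}$ for a population, or $z = \frac{x - \bar{x}}{s}$ for a sample. A minimal Python sketch follows; the function name `z_score` and the mean of 75 and standard deviation of 5 are hypothetical values chosen for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above or below the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / std_dev


# Hypothetical exam with mean 75 and standard deviation 5.
print(z_score(85, 75, 5))   # 2.0  (above the mean: positive)
print(z_score(65, 75, 5))   # -2.0 (below the mean: negative)
print(z_score(75, 75, 5))   # 0.0  (equal to the mean: zero)
```

Because z-scores are unitless, they allow observations from different distributions to be compared on a common scale.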
Example: Comparing z-scores
Percentiles
• Percentiles divide a set of data that is written in ascending order into 100 parts; thus 99 percentiles can be determined.
• For example, P1 divides the bottom 1% of the observations from the top 99%, P2 divides the bottom 2% of the observations from the top 98%, and so on.
Example: interpret percentiles
Quartiles
• The most common percentiles are quartiles. Quartiles divide data sets into fourths, or four equal parts.
1. Arrange the data set in ascending order.
2. Determine the median, M, or second quartile, Q2.
3. Divide the data set into halves: the observations below (to the left of) M and the observations above M.
4. The first quartile, Q1, is the median of the bottom half of the data and the third quartile, Q3, is the median of the top half of the data.
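The four steps can be sketched in Python as follows. The function name `quartiles` and the inputs are hypothetical. One assumption is worth flagging: when n is odd, the sketch leaves the median itself out of both halves, which matches the wording "below M" and "above M" but is only one of several quartile conventions in use.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves steps."""
    ordered = sorted(values)                      # Step 1
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    q2 = statistics.median(ordered)               # Step 2
    half = n // 2
    lower = ordered[:half]                        # Step 3: below M
    # Skip the middle observation when n is odd (assumed convention).
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    q1 = statistics.median(lower)                 # Step 4
    q3 = statistics.median(upper)
    return q1, q2, q3


print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))   # (2.5, 4.5, 6.5)
print(quartiles([1, 2, 3, 4, 5, 6, 7]))      # (2, 4, 6)
```

Statistical software often interpolates between observations instead, so its quartiles may differ slightly from the values produced by these steps.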
Example: Quartiles
• The Highway Loss Data Institute routinely collects data on collision coverage claims.
• Collision coverage insures against physical damage to an insured individual’s vehicle.
• The data represent a random sample of 18 collision coverage claims based on data obtained from the Highway Loss Data Institute for 2007 models.
• Find and interpret the first, second, and third quartiles for collision coverage claims.
Interquartile range
• The interquartile range, IQR, is the range of the middle 50% of the observations in a data set. That is, the IQR is the difference between the third and first quartiles and is found using the formula $\text{IQR} = Q_3 - Q_1$.
• The interpretation of the interquartile range is similar to that of the range and standard deviation.
• The more spread a set of data has, the higher the interquartile range will be.
Which measure should I use?
Check for outliers
The five-number summary
• Summaries of data represent an exploration. A famous statistician named John Tukey called this material exploratory data analysis.
• The five-number summary of a set of data consists of the smallest data value, Q1, the median, Q3, and the largest data value. We organize the five-number summary as follows:
MINIMUM   Q1   M   Q3   MAXIMUM
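A minimal sketch that assembles the five-number summary and, from it, the interquartile range. The function name `five_number_summary` and the data are hypothetical, and the quartiles are taken as medians of the lower and upper halves, leaving out the median itself when n is odd (an assumed convention).

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q1 = statistics.median(ordered[:half])
    # n % 2 skips the middle observation when n is odd.
    q3 = statistics.median(ordered[half + n % 2:])
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]


# Hypothetical data, invented for illustration.
data = [7, 1, 5, 3, 9, 11, 13, 15]
minimum, q1, m, q3, maximum = five_number_summary(data)
print(minimum, q1, m, q3, maximum)   # 1 4.0 8.0 12.0 15
print(q3 - q1)                       # 8.0, the interquartile range
```

These five values are the ones a boxplot draws: the box spans Q1 to Q3 with a line at the median.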
Example: Five-number summary
The five-number summary can be used to create a boxplot.
Boxplot
