Ch3 Numerically Summarizing Data
Dr. Gómez
Measures of central tendency: Mean
• To compute the arithmetic mean of a set of data, the data must be quantitative.
• The arithmetic mean of a variable is computed by adding all the values of the variable in the data set and dividing by the number of observations.
• When it is computed from every observation in a population, it is the population arithmetic mean, $\mu$. The population mean is a parameter.
• The sample arithmetic mean, $\bar{x}$, is computed using sample data. The sample mean is a statistic.
Formula to calculate the mean
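• If $x_1, x_2, \ldots, x_N$ are the N observations of a variable from a population, then the population mean, $\mu$, is:
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$
• If $x_1, x_2, \ldots, x_n$ are the n observations of a variable from a sample, then the sample mean, $\bar{x}$, is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$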
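The two formulas use the same arithmetic; only the set of observations changes. A minimal sketch, assuming a small set of exam scores invented for illustration (the function name `arithmetic_mean` is also just a label chosen here):

```python
# Hypothetical exam scores, invented for illustration only.
scores = [82, 77, 90, 71, 62, 68, 74, 84, 94, 88]


def arithmetic_mean(values):
    """Add all the values and divide by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Treat all 10 scores as the population: this gives the parameter mu.
population_mean = arithmetic_mean(scores)

# Treat the first 4 scores as a sample: this gives the statistic x-bar.
sample = scores[:4]
sample_mean = arithmetic_mean(sample)

print("population mean:", population_mean)
print("sample mean:", sample_mean)
```

The sample mean will usually differ from the population mean, which is why one is called a statistic and the other a parameter.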
Mean as a center of gravity
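Measures of central tendency: Median
• The median of a variable is the value that lies in the middle of the data when arranged in ascending order. We use M to represent the median.
• Steps to find the median:
1. Arrange the data in ascending order.
2. Determine the number of observations, n.
3. If n is odd, the median is the observation in the middle of the data set. If n is even, the median is the mean of the two middle observations.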
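As a sketch of the procedure: sort the data, then take the middle observation when the number of observations is odd, or the mean of the two middle observations when it is even. The data values below are invented for illustration.

```python
def median(values):
    """Return the median M of a list of quantitative observations."""
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)                  # number of observations
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single middle observation
    # even n: the mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data, invented for illustration only.
odd_data = [82, 77, 90, 71, 62]        # sorted: 62 71 77 82 90
even_data = [82, 77, 90, 71, 62, 68]   # sorted: 62 68 71 77 82 90

print(median(odd_data))   # 77
print(median(even_data))  # 74.0
```

Replacing the largest value with an extreme one (say 9000) leaves the median of `odd_data` at 77, which illustrates that the median is not pulled toward extreme observations.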
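Mean or median?
• If the mean and the median are close in value and the distribution is symmetric, we use the mean to describe the data.
• If the distribution is skewed or contains extreme values, the median is the better description of the center, because extreme observations pull the mean toward them but do not affect the median.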
Measures of central tendency: Mode
• The mode of a variable is the most frequent
observation of the variable that occurs in
the data set.
• To compute the mode, tally the number of
observations that occur for each data value.
• The data value that occurs most often is the
mode.
• A set of data can have no mode, one mode,
or more than one mode.
• If no observation occurs more than once,
we say the data have no mode.
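The tallying procedure can be sketched with `collections.Counter` from the Python standard library. The function name `modes` and the sample data are assumptions made for this illustration; the function returns a list so that it can represent no mode, one mode, or several modes.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of the modes; an empty list means no mode."""
    counts = Counter(values)            # tally the observations per data value
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                       # no observation occurs more than once
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical data sets, invented for illustration only.
print(modes([1, 2, 3, 4]))        # []      -> no mode
print(modes([1, 2, 2, 3]))        # [2]     -> one mode
print(modes([1, 1, 2, 2, 3]))     # [1, 2]  -> two modes (bimodal)
```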
Bimodal distribution
Understanding the standard deviation
• The sum of the deviations about the mean is always zero: $\sum (x_i - \mu) = 0$ and $\sum (x_i - \bar{x}) = 0$.
• Because this sum is zero, we cannot use the average deviation about the mean as a measure of spread.
• Instead, we calculate the mean of the squared deviations, because squaring a nonzero number always results in a positive number. This leads to the variance.
• Variance is difficult to interpret because its units are the square of the units of the data (such as dollars squared).
• We “undo” the squaring process by taking the square root of the variance, which gives the standard deviation.
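The chain of reasoning can be checked numerically. A minimal sketch, assuming a small data set invented for illustration and treated as a whole population:

```python
import math

# Hypothetical population, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(data) / len(data)                    # population mean

deviations = [x - mu for x in data]           # deviations about the mean
print(sum(deviations))                        # 0.0: useless as a measure of spread

squared = [d ** 2 for d in deviations]        # squaring removes the signs
variance = sum(squared) / len(data)           # mean of the squared deviations
std_dev = math.sqrt(variance)                 # back to the original units

print(variance, std_dev)
```

If the data were measured in dollars, `variance` would be in dollars squared, while `std_dev` would again be in dollars.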
Population standard deviation
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
Sample standard deviation
• The sample standard deviation, s, of a variable is the square root of the sum of the squared deviations about the sample mean divided by $n - 1$, where n is the sample size:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
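The only difference between the population and sample calculations is the divisor: N for a population, n - 1 for a sample. A minimal sketch, with function names and data chosen here for illustration:

```python
import math


def population_std(values):
    """Square root of the mean squared deviation about mu (divide by N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def sample_std(values):
    """Square root of the sum of squared deviations about x-bar over n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


# Hypothetical data, invented for illustration only.
data = [10, 12, 14, 16, 18]

print(population_std(data))   # divides the sum of squares (40) by 5
print(sample_std(data))       # divides the sum of squares (40) by 4
```

For the same data the sample formula always gives the larger value, because it divides the same sum of squares by a smaller number.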
The mean measures the center of the distribution, while the standard
deviation measures the spread of the distribution.
• The z-score measures the number of standard deviations an observation is above or below the mean.
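In code, a z-score is the observation minus the mean, divided by the standard deviation. A minimal sketch, assuming the mean and standard deviation are already known; the numbers are invented for illustration:

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / std_dev


# Hypothetical test with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))   #  1.5 -> 1.5 standard deviations above the mean
print(z_score(55, 70, 10))   # -1.5 -> 1.5 standard deviations below the mean
print(z_score(70, 70, 10))   #  0.0 -> exactly at the mean
```

Because z-scores have no units, they allow observations from different distributions to be compared.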
Percentiles
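• The kth percentile, denoted $P_k$, of a set of data is a value such that k percent of the observations are less than or equal to the value.
• Percentiles divide a set of data, arranged in ascending order, into 100 parts; thus 99 percentiles can be determined.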
• For example, P1 divides the bottom 1% of the observations from the
top 99%, P2 divides the bottom 2% of the observations from the top
98%, and so on.
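Python's standard library can produce all 99 percentiles at once with `statistics.quantiles`. This is only a sketch: textbooks and software use several slightly different conventions for locating percentiles, so the values from this function (which uses its default "exclusive" method here) may not match a by-hand calculation exactly. The data set is invented for illustration.

```python
from statistics import quantiles

# Hypothetical data set: the integers 1 through 200.
data = list(range(1, 201))

# Asking for n=100 equal parts returns the 99 cut points P1, P2, ..., P99.
cut_points = quantiles(data, n=100)

p1 = cut_points[0]     # separates the bottom 1% from the top 99%
p50 = cut_points[49]   # the 50th percentile, which is the median
p99 = cut_points[98]   # separates the bottom 99% from the top 1%

print(len(cut_points))
print(p1, p50, p99)
```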
Example: interpret percentiles
Quartiles
• The most common percentiles are quartiles. Quartiles divide data sets into fourths, or four equal
parts.
1. Arrange the dataset in ascending order.
2. Determine the median, M, or second quartile, Q2.
3. Divide the data set into halves: the observations below (to the left of) M and the observations
above M.
4. The first quartile, Q1, is the median of the bottom half of the data and the third quartile, Q3, is the
median of the top half of the data.
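The four steps translate directly into code. This sketch assumes that when the number of observations is odd, the median itself belongs to neither half, which is how "below M" and "above M" are read here. The function name and data are invented for illustration.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure."""
    ordered = sorted(values)                 # step 1: ascending order
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    q2 = median(ordered)                     # step 2: the median M = Q2
    half = n // 2
    lower = ordered[:half]                   # step 3: observations below M
    upper = ordered[half + n % 2:]           #         observations above M
    return median(lower), q2, median(upper)  # step 4: medians of the halves


# Hypothetical data, invented for illustration only.
even_data = [1, 3, 5, 7, 9, 11, 13, 15]
odd_data = [1, 3, 5, 7, 9, 11, 13]

print(quartiles(even_data))   # (4.0, 8.0, 12.0)
print(quartiles(odd_data))    # (3, 7, 11)
```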
Example: Quartiles
• The Highway Loss Data Institute
routinely collects data on collision
coverage claims.
• Collision coverage insures against
physical damage to an insured
individual’s vehicle.
• The data represent a random sample
of 18 collision coverage claims based
on data obtained from the Highway
Loss Data Institute for 2007 models.
Find and interpret the first, second,
and third quartiles for collision
coverage claims.
Interquartile range
• The interquartile range, IQR, is the range of the middle 50% of the observations in a data set. That is, the IQR is the difference between the third and first quartiles and is found using the formula $\text{IQR} = Q_3 - Q_1$.
• The interpretation of the interquartile range is similar to that of the range and standard deviation.
• The more spread a set of data has, the higher the interquartile range will be.
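To see how the IQR responds to spread, the sketch below computes it for two data sets that share the same center. Both data sets are invented for illustration, and the quartiles are found as the medians of the lower and upper halves (with the median excluded from both halves when the number of observations is odd).

```python
from statistics import median


def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q1 = median(ordered[:half])            # median of the bottom half
    q3 = median(ordered[half + n % 2:])    # median of the top half
    return q3 - q1


# Hypothetical data sets with the same median (50) but different spread.
tight = [48, 49, 50, 50, 51, 52]
wide = [20, 35, 50, 50, 65, 80]

print(iqr(tight))   # 2
print(iqr(wide))    # 30
```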
Which measure should I use?
Check for outliers
The five-number summary
• Summaries of data are a form of exploration. The statistician John Tukey called this material exploratory data analysis.
• The five-number summary of a set of data consists of the smallest data value, Q1, the median, Q3, and the largest data value. We organize the five-number summary as follows:
MINIMUM  Q1  M  Q3  MAXIMUM
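A five-number summary can be assembled from the sorted data in a few lines. This is a sketch under the assumption that the quartiles are the medians of the lower and upper halves of the data; the function name and data are invented for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    return (
        ordered[0],                          # smallest data value
        median(ordered[:half]),              # Q1: median of the bottom half
        median(ordered),                     # M: the median
        median(ordered[half + n % 2:]),      # Q3: median of the top half
        ordered[-1],                         # largest data value
    )


# Hypothetical data, invented for illustration only.
data = [7, 1, 15, 3, 9, 5, 11, 13]

print(five_number_summary(data))   # (1, 4.0, 8.0, 12.0, 15)
```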
Example: Five-number summary
The five-number summary can be used to create a boxplot.
Boxplot