Chapter Notes (Chapter 1)
Summarising data
In this chapter we will introduce a few basic concepts about data sets, and how they can be
presented and summarised.
Definitions.
Here are a few common words and their definitions.
Population – refers to the complete set of objects of interest. Examples:
o All new cars sold by Ford UK in Leeds this year (is this representative?)
2 CHAPTER 1. SUMMARISING DATA
o “yes/no”
Quantitative data – are given as numeric values. Quantitative data can be discrete or
continuous.
Quantitative discrete data are typically whole numbers. Examples:
o Number of absent students in a class
Quantitative continuous data can take any value in a range. The data may be rounded
to distinct values but we still think of the data as being for a continuous variable. Examples:
o The height of students in a class
Exercise 1.1
For each of the following variables, state whether it is QUALITATIVE or QUANTITATIVE.
If it is quantitative, is it DISCRETE or CONTINUOUS?
Before looking at the answers, come up with answers yourself first.
Answers: (i) Quantitative discrete (ii) Qualitative (iii) Quantitative continuous
(iv) Quantitative discrete (v) Qualitative (vi) Quantitative continuous (vii) Qualitative
(viii) Quantitative continuous (ix) Quantitative discrete (x) Qualitative
Examples 1.2
o Descriptive statistics: A lecturer wants to summarise the performance of 70 Leeds
foundation year students on MATH0365 over the past two years.
o Inferential statistics: The government wants to assess the opinion of the British public
on whether or not we should join / leave the European Union.
The following video gives a summary of some of the concepts introduced above:
(Open the link to the video in a new tab or new window - usually done by using right-click.)
Notation.
With quantitative data we often use \(x_i\) or \(y_i\) to denote an observation, where \(i\) takes values
\(1, 2, \dots, n\), and \(n\) is the number of observations in the sample.
Example 1.3
Suppose the variable of interest is the age of a student in years, and the population consists
of the students taking this module. If we take a sample of n = 5 observations, the data could
be
\[ x_1 = 20,\quad x_2 = 19,\quad x_3 = 21,\quad x_4 = 18,\quad x_5 = 24. \]
We often need to order quantitative data by size. We write \(x_{(1)}\) for the smallest observation,
\(x_{(2)}\) for the next smallest, and so on, so that \(x_{(n)}\) denotes the largest observation. (Note the
round brackets around the index!) For the example above we have
\[ x_{(1)} = 18,\quad x_{(2)} = 19,\quad x_{(3)} = 20,\quad x_{(4)} = 21,\quad x_{(5)} = 24. \]
Often we need to add together observations for quantitative data. We use the notation
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n. \]
In the example above
\[ \sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 20 + 19 + 21 + 18 + 24 = 102. \]
Similarly, we can calculate
\[ \sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \dots + x_n^2. \]
In the example,
\[ \sum_{i=1}^{5} x_i^2 = 20^2 + 19^2 + 21^2 + 18^2 + 24^2 = 400 + 361 + 441 + 324 + 576 = 2102. \]
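If you would like to check calculations like these on a computer, the following is a minimal Python sketch using the ages from Example 1.3. The variable names are our own choice for illustration. Note that Python lists are indexed from 0, so the ordered observation \(x_{(1)}\) is stored at position 0.

```python
# Ages from Example 1.3: x_1, ..., x_5
x = [20, 19, 21, 18, 24]
n = len(x)

# Ordered observations: ordered[0] is x_(1), ordered[n - 1] is x_(n)
ordered = sorted(x)

# Sum of the observations and sum of the squared observations
total = sum(x)
total_squares = sum(xi ** 2 for xi in x)

print(ordered)
print(total)
print(total_squares)
```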
Exercise 1.4
Let x1 = 2, x2 = 8, x3 = 4, x4 = 12. Calculate each of the following.
(i) \(\sum_{i=1}^{3} x_i =\)
(ii) \(\sum_{i=1}^{4} x_i^2 =\)
(iii) \(x_{(3)} =\)
Try the exercise yourself before looking at this video which explains the answers:
Stem-and-Leaf Diagrams.
Stem-and-leaf diagrams are used to display quantitative data in a more condensed form,
without losing the detail of the original data.
Example 1.5
The scores of 10 adults in a test are:
114, 99, 131, 124, 117, 102, 106, 127, 119, 114.
We use the last digit (the leaf) to represent each data item, with a stem consisting of the
previous digits. We always include a key on the diagram.
 9 | 9
10 | 2 6
11 | 4 4 7 9
12 | 4 7
13 | 1
Key: 11 | 4 represents a score of 114
From the diagram we see that the lowest test score is \(x_{(1)} = 99\) and the highest is \(x_{(10)} = 131\);
40% of the scores are in the range 110-119.
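The construction used in Example 1.5 can be sketched in Python as follows. This is only an illustration under the assumption that the data are non-negative whole numbers; the function name `stem_and_leaf` is our own and not part of any library.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf diagram for whole-number data.

    The leaf is the last digit; the stem consists of the preceding digits.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)

    rows = []
    # List every stem from the smallest to the largest, even one without leaves
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(f"{stem:>3} | {row}".rstrip())
    return rows

# Test scores from Example 1.5
scores = [114, 99, 131, 124, 117, 102, 106, 127, 119, 114]
for row in stem_and_leaf(scores):
    print(row)
```

A key still has to be added by hand, since the code does not know what the numbers represent.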
Example 1.6
The closing prices (to the nearest £) of 20 common stocks on a certain date were
30, 34, 43, 9, 38, 9, 8, 29, 35, 19, 9, 17, 38, 54, 17, 1, 48, 18, 9, 9.
0 | 1 8 9 9 9 9 9
1 | 7 7 8 9
2 | 9
3 | 0 4 5 8 8
4 | 3 8
5 | 4
Key: 3 | 4 represents £34
We can see that the cheapest stock was £1 and the most expensive stock was £54. There
appear to be two different groupings of stock: those with prices in the range £1-£19, and
those with prices of £29 or more.
The following video discusses a few remarks on stem and leaf diagrams:
Histograms.
A stem-and-leaf diagram is useful if there are a small number of observations. For large
datasets histograms are more useful.
To construct a histogram we count the number of observations that lie within classes that we
choose. The classes span the range of the data and are non-overlapping. Bars are then drawn
which have areas proportional to the number of observations in the class.
IMPORTANT: It is the AREA of the bar that is proportional to the number of observations.
The number of observations in a class is called the class frequency.
Example 1.7
The data below refers to the amount spent on food (in £) in one week by 40 UK households.
44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81.
We are free to choose the widths of the classes, though remember they must be non-overlapping
and span the range of the data. In the above data the observations range from 18.60 to 95.90.
One possible approach, among many others, is to choose classes 17.495-27.495, 27.495-37.495,
. . . , 87.495-97.495.
Note: Here the boundaries of the classes were chosen to end in the three digits .495 so that
no data point falls on the boundary between two classes. This is not
strictly necessary, as long as we are consistent and make clear to the reader how we group
the data points into the classes.
Figure 1.1: Histogram showing the amount spent on food (in £) in one week by 40 households.
The following video explains what is represented in the histogram, in particular on the vertical
axis:
Note that the width of each class is 10. The first bar over the interval from 17.495 to 27.495
for example, should have an area of 3, since this is the number of observations, i.e. frequency,
in this class. So the height of this bar is 0.3, since 0.3 × 10 = 3. Similarly, for the other
intervals.
So the height of each bar is the class frequency divided by the class width. This quantity is
referred to as the frequency density of a class, and it is the quantity recorded on the vertical
axis:
\[ \text{Frequency density} = \frac{\text{Class frequency}}{\text{Class width}}. \]
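As a rough check of the numbers in Example 1.7, the sketch below counts the class frequencies and computes the frequency densities in Python, using the classes of width 10 starting at 17.495. The variable names are our own, and this is just one possible way to organise the calculation.

```python
# Weekly food spending (in pounds) of the 40 households in Example 1.7
food = [
    44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
    79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
    78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
    65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81,
]

width = 10
# Lower class boundaries 17.495, 27.495, ..., 87.495
lower_edges = [17.495 + width * k for k in range(8)]

# Class frequency: number of observations falling inside each class
frequencies = [
    sum(1 for v in food if lo <= v < lo + width) for lo in lower_edges
]

# Frequency density = class frequency / class width
densities = [f / width for f in frequencies]

for lo, f, d in zip(lower_edges, frequencies, densities):
    print(f"{lo:.3f}-{lo + width:.3f}: frequency {f}, frequency density {d}")
```

The first class has frequency 3 and frequency density 0.3, in agreement with the calculation above.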
Figure 1.2: Histogram showing amount spent on food (in £) in one week by 40 households.
Note how the area of the rectangle over the interval from 37.495 to 57.495 in Figure 1.2 is
the sum of the areas of the two rectangles over the intervals from 37.495 to 47.495 and 47.495
to 57.495 in Figure 1.1, as explained in this video:
Exercise 1.8
Using some graph or other grid paper (if you have access to some), construct a histogram to
display the following data which refers to the amount of time (to the nearest minute) students
spent studying for a test. Before drawing the histogram, determine the frequency densities,
and decide on the scale for the histogram.
Try to do this yourself first, before watching this video of the solution:
The Mode.
The mode is defined to be the most frequently occurring value in a set of data.
Example 1.9
The data below gives the number of students present at a statistics tutorial over 10 consecutive
weeks.
8, 5, 7, 10, 8, 6, 5, 6, 4, 8.
Arranging in ascending numerical order we have
4, 5, 5, 6, 6, 7, 8, 8, 8, 10.
The mode of this set of data is 8. We also say that “the modal number of students present is
8”.
The mode need not be unique. If two values occur most frequently, the data are said to be
bimodal. If more than two values occur most frequently, the data are multimodal.
Note that the mode can be determined for qualitative and quantitative data, and is guaranteed
to be a value actually observed.
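The definition of the mode, including the case where it is not unique, can be sketched in Python as follows. The function name `modes` is our own; it returns a list so that bimodal and multimodal data are handled as well.

```python
from collections import Counter

def modes(data):
    """Return all most frequently occurring values, in sorted order."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

# Tutorial attendance from Example 1.9
attendance = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
print(modes(attendance))

# The mode also makes sense for qualitative data
print(modes(["yes", "no", "yes"]))
```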
The Median.
The median is defined as the “middle value” when the data are arranged in ascending numerical
order. More precisely, for a data set with n observations,
o if n is odd, then the median is \(x_{((n+1)/2)}\);
o if n is even, then the median is \(\frac{1}{2}\left[x_{(n/2)} + x_{(n/2+1)}\right]\).
So, for an odd number of observations the median is the value “in the middle”; for an even
number of observations the median is the average of the two “middle values”. In this video
we look at these formulas a bit more closely:
Example 1.10
The data set
0, 7, 7, 19, 19, 20, 35,
has 7 data points. So the median is the 4th in this ordered list, i.e. the median is 19.
Example 1.11
Given the data from above with 10 data points,
4, 5, 5, 6, 6, 7, 8, 8, 8, 10,
the median is the average of the 5th and 6th observations, which is
\[ \frac{6+7}{2} = \frac{13}{2} = 6.5, \]
i.e. the median number of students attending the tutorial is 6.5.
Note that the median can be a number which is not an actual observation.
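The two cases in the definition of the median translate directly into a short Python function. This is a sketch only; the function name `median` is our own choice, and the index shifts by one are needed because Python lists start at position 0.

```python
def median(data):
    """Median as the middle value (odd n) or the average of the two middle values (even n)."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the median is x_((n+1)/2)
        return ordered[(n + 1) // 2 - 1]
    # Even n: the median is the average of x_(n/2) and x_(n/2 + 1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([0, 7, 7, 19, 19, 20, 35]))         # data from Example 1.10
print(median([8, 5, 7, 10, 8, 6, 5, 6, 4, 8]))   # data from Example 1.9
```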
The Mean.
The mean, or sample mean, or average, is calculated by adding up all of the observations
and dividing by the number of observations there are. For observations xi , where i takes
values 1, 2, 3 . . . , n, we write
\[ \text{sample mean} = \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i. \]
We denote the sample mean as x̄ (x with a horizontal bar on top), pronounced “x bar”.
Example 1.12
The mean for the data from above is
\[ \bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8}{10} = \frac{67}{10} = 6.7. \]
The mean number of students attending the tutorial is 6.7 (again, not a number that we could
actually observe).
The mean is a very widely used measure, as it has many desirable statistical properties.
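For completeness, here is the same calculation as a minimal Python sketch, using the tutorial attendance data. The function name `sample_mean` is our own.

```python
def sample_mean(data):
    """Add up all observations and divide by the number of observations."""
    return sum(data) / len(data)

attendance = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
print(sample_mean(attendance))
```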
Example 1.13
In a survey, subscribers to Hello magazine are asked the following question: “How many of
the last four issues have you read or looked through?”
Suppose that the following frequency distribution summarises 500 responses.
This video discusses how we should read this table, compute the mean and the median:
Let’s now look at an example where the data are given as grouped data without the original
individual data values.
Example 1.14
Cars traveling on a road with a posted speed limit of 60 miles per hour are checked for speed
by a police radar system, giving the following frequency distribution of speeds.
Speed (miles per hour)   45-49   50-54   55-59   60-64   65-69   70-74   75-79   Total
Midpoint x_i                47      52      57      62      67      72      77
Frequency f_i               10      40     150     175      75      15      10   n = 475
f_i x_i                    470    2080    8550   10850    5025    1080     770   28825
To estimate the mean (it can only be an estimate, since we do not know the original data
values!) we use the midpoints of the classes as the xi ’s in the mean of frequency data formula.
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{m} f_i x_i = \frac{28825}{475} \approx 60.68, \]
where \(m = 7\) is the number of classes.
We see that the mean speed of a car checked by the police was about 60.68 miles per hour, just in
excess of the speed limit!
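The estimate of the mean from grouped data can be reproduced with a few lines of Python. This is a sketch using the class midpoints and frequencies from the table in Example 1.14; the variable names are our own.

```python
# Class midpoints x_i and frequencies f_i from Example 1.14
midpoints = [47, 52, 57, 62, 67, 72, 77]
frequencies = [10, 40, 150, 175, 75, 15, 10]

n = sum(frequencies)                                        # total number of cars
weighted_total = sum(f * x for f, x in zip(frequencies, midpoints))
estimated_mean = weighted_total / n                         # (1/n) * sum of f_i x_i

print(n, weighted_total, round(estimated_mean, 2))
```

Remember that the result is only an estimate, because every car in a class is treated as if it travelled at the class midpoint.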
Remark 1.15
The mean may be significantly affected by the inclusion of a mistaken observation, as the
following example shows:
Example 1.16
Suppose in Example 1.9 the observation 10 is recorded as 100 by mistake. The data is now
4, 5, 5, 6, 6, 7, 8, 8, 8, 100.
The mode does not change, it is still 8. The median is unchanged, it is still \(\frac{6+7}{2} = 6.5\).
However, the mean is now:
\[ \bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{4 + 5 + 5 + 6 + 6 + 7 + 8 + 8 + 8 + 100}{10} = \frac{157}{10} = 15.7. \]
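The effect of the mistaken observation can also be seen by computing all three measures before and after the mistake. The sketch below uses the functions `mean`, `median` and `mode` from Python's standard `statistics` module; for these data they agree with the definitions used in this chapter.

```python
from statistics import mean, median, mode

original = [4, 5, 5, 6, 6, 7, 8, 8, 8, 10]
mistaken = [4, 5, 5, 6, 6, 7, 8, 8, 8, 100]   # 10 recorded as 100

for label, data in [("original", original), ("mistaken", mistaken)]:
    print(label, "mode:", mode(data), "median:", median(data), "mean:", mean(data))
```

Only the mean moves when the largest value is changed; the mode and the median stay the same.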
53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40
Calculate the mean, the median and the mode of these data.
Quartiles.
The median is the value that splits the ordered data into two halves. Recall that we defined
it as follows:
o If n is odd, then the median is \(x_{((n+1)/2)}\), and is an actual data point. In this case we
include the median in both the lower half and the upper half.
o If n is even, then the median is \(\frac{1}{2}\left[x_{(n/2)} + x_{(n/2+1)}\right]\). In this case the lower half consists simply of
the lower n/2 values, and the upper half of the upper n/2 values.
The quartiles split the ordered data into quarters. We refer to the median as Q2 , and define
the lower quartile Q1 to be the median of the lower half, and the upper quartile Q3 to
be the median of the upper half.
!!! Be careful when reading textbooks, or using calculators or other software to compute
quartiles. There are a number of different ways to define the lower and upper quartiles, producing
different results !!!
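The definition of the quartiles used in these notes can be sketched in Python as follows. The function names are our own. In line with the warning above, library routines (for example `statistics.quantiles` in Python's standard library) may follow a different convention and can therefore give different values.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3), with Q1 and Q3 the medians of the lower and upper half."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the median x_((n+1)/2) is included in both halves
        lower = ordered[: n // 2 + 1]
        upper = ordered[n // 2 :]
    else:
        # Even n: lower n/2 values and upper n/2 values
        lower = ordered[: n // 2]
        upper = ordered[n // 2 :]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)

print(quartiles([8, 5, 7, 10, 8, 6, 5, 6, 4, 8]))   # n = 10, data from Example 1.9
print(quartiles([0, 7, 7, 19, 19, 20, 35]))         # n = 7, data from Example 1.10
```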
Example 1.18
In a survey of 21 households the number of telephones used by each household is given
below.
1, 3, 4, 1, 1, 2, 1, 1, 2, 5, 1, 2, 3, 0, 2, 1, 2, 1, 3, 0, 4.
Arranging the data in ascending numerical order gives
0, 0, 1, 1, 1, \(x_{(6)} = 1\), 1, 1, 1, 1, \(x_{(11)} = 2\), 2, 2, 2, 2, \(x_{(16)} = 3\), 3, 3, 4, 4, 5.
There are n = 21 data items. Twenty-one is an odd number, and so the median number of
telephones per household is the data item \(Q_2 = x_{((n+1)/2)} = x_{(11)} = 2\). The lower quartile \(Q_1\) is the
median of the data items \(x_{(1)}, x_{(2)}, x_{(3)}, \dots, x_{(11)}\), which is \(x_{(6)} = 1\). The upper quartile \(Q_3\) is
the median of the data items \(x_{(11)}, x_{(12)}, x_{(13)}, \dots, x_{(21)}\), which is \(x_{(16)} = 3\).
This video explains the example above in more detail:
Example 1.19
The data below gives the monthly starting salaries (in $) of 12 business school graduates.
2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880.
Since 12 is even, the median is
\[ Q_2 = \frac{x_{(6)} + x_{(7)}}{2} = 2905. \]
The lower quartile \(Q_1\) is the median of \(x_{(1)}, x_{(2)}, \dots, x_{(6)}\), which is
\[ Q_1 = \frac{x_{(3)} + x_{(4)}}{2} = 2865. \]
The upper quartile \(Q_3\) is the median of \(x_{(7)}, x_{(8)}, \dots, x_{(12)}\), which is
\[ Q_3 = \frac{x_{(9)} + x_{(10)}}{2} = 3000. \]
See whether you can get the same answers. This video goes through the process of determining
Q1 , Q2 , and Q3 for this example:
We now introduce a number of concepts which measure the dispersion, i.e. the spread
or variability of a quantitative data set.
The Range.
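The range is simply the difference between the largest and the smallest observation, i.e. in
a data set with n observations the range is x(n) − x(1).
The interquartile range is the difference between the upper and the lower quartile,
IQR = Q3 − Q1 .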
For Example 1.19, the interquartile range IQR is $3000 − $2865 = $135.
Example 1.20
The data given in Example 1.9 with the number of students present at a statistics tutorial
over 10 consecutive weeks was
8, 5, 7, 10, 8, 6, 5, 6, 4, 8,
For larger data sets, this involves a lot of calculations. Here is a more “calculator friendly”
form of the sample variance:
\[ s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]. \]
One can show that the two formulas produce the same results. We will show this for n = 3
in this video:
So in the example from above, we could compute the sample variance instead as follows:
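\[ \sum_{i=1}^{n} x_i^2 = 8^2 + 5^2 + 7^2 + 10^2 + 8^2 + 6^2 + 5^2 + 6^2 + 4^2 + 8^2 = 479, \quad \text{and} \]
\[ \sum_{i=1}^{n} x_i = 8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8 = 67, \quad \text{so} \]
\[ s^2 = \frac{1}{9}\left(479 - \frac{67^2}{10}\right) = \frac{1}{9}(479 - 448.9) \approx 3.34. \]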
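To see numerically that the “calculator friendly” form gives the same result, the Python sketch below computes the sample variance in two ways. The first function assumes the usual definition of the sample variance, i.e. the sum of squared deviations from the mean divided by n − 1; the second uses the “calculator friendly” form. The function names are our own.

```python
def variance_from_deviations(data):
    """Assumed usual definition: sum of (x_i - mean)^2, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def variance_calculator_form(data):
    """'Calculator friendly' form: uses only the sum of x_i and the sum of x_i^2."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x ** 2 for x in data)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

attendance = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
print(variance_from_deviations(attendance))
print(variance_calculator_form(attendance))
```

Both values agree up to rounding error in the computer arithmetic.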
Example 1.21
Let us return to the survey on Hello subscribers from earlier in the chapter.
Here is how we calculate the sample variance s2 for these data, using the frequencies fi .
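We have \(n = \sum_{i=1}^{m} f_i = 500\). So
\[ s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{m} f_i x_i\right)^2\right] = \frac{1}{499}\left(6535 - \frac{1745^2}{500}\right) \approx 0.89. \]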
In this video we briefly explain how to apply the formula for the given frequency table:
Example 1.22
We now return to the police speed check example. See whether you can follow how the
formula below is used.
Speed (miles per hour)   45-49    50-54    55-59    60-64    65-69   70-74   75-79   Total
Midpoint x_i                47       52       57       62       67      72      77
Frequency f_i               10       40      150      175       75      15      10   n = 475
f_i x_i                    470     2080     8550    10850     5025    1080     770   28825
f_i x_i^2                22090   108160   487350   672700   336675   77760   59290   1764025
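Here
\[ s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{m} f_i x_i\right)^2\right] = \frac{1}{474}\left(1764025 - \frac{28825^2}{475}\right) \approx 31.23 \text{ mph}^2. \]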
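The calculation for the speed check data can be checked with the following Python sketch, which applies the frequency form of the variance formula to the midpoints and frequencies from the table. The variable names are our own.

```python
# Class midpoints x_i and frequencies f_i from the speed check table
midpoints = [47, 52, 57, 62, 67, 72, 77]
frequencies = [10, 40, 150, 175, 75, 15, 10]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))         # sum of f_i x_i
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))   # sum of f_i x_i^2

# s^2 = (1 / (n - 1)) * [ sum f_i x_i^2 - (1 / n) * (sum f_i x_i)^2 ]
s2 = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

print(n, sum_fx, sum_fx2, round(s2, 2))
```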
Exercise 1.23
(We will go through this one in the lecture.)
The data below refer to the number of brothers and sisters a sample of students have. You
will need to calculate the values that go in place of the constants B1 , B2 , B3 and B4 .
Also calculate the mean, the median, the sample variance, and the standard deviation for
these data.
\[ \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}. \]
Note that this is the same as \(\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}\).
This coefficient takes values between −1 and +1. A value near zero indicates symmetry, a
negative value indicates that the data is negatively skewed, a positive value indicates that the
data is positively skewed. The following video explains how the above formula represents the
skewness of the data:
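The coefficient above is easy to evaluate once the quartiles are known. The Python sketch below applies it to the quartiles found in Example 1.19 (starting salaries) and to the smokers' and non-smokers' quartiles that would arise from symmetric data; the function name `quartile_skewness` is our own label for illustration.

```python
def quartile_skewness(q1, q2, q3):
    """((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1), a value between -1 and +1."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

# Quartiles of the starting salaries in Example 1.19
print(quartile_skewness(2865, 2905, 3000))

# Symmetric case: the median lies exactly halfway between Q1 and Q3
print(quartile_skewness(1, 2, 3))
```

For the salaries the coefficient is positive, since the median is closer to the lower quartile than to the upper quartile.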
Box Plots.
A useful way of comparing two data sets is to produce box plots. To construct a box plot you
must first calculate the median Q2 and the quartiles, Q1 and Q3 . The general form of a box
plot is shown below:
[Figure: general form of a box plot, with the smallest observation, Q1, the median, Q3 and the largest observation marked.]
This video gives a brief introduction to box plots:
Example 1.24
The following data gives systolic blood pressure of 12 smokers and 12 non-smokers.
Smokers 122 146 120 114 124 126 118 128 130 134 116 130
Non-smokers 114 134 114 116 138 110 112 116 132 126 108 116
For the smokers the median Q2 = 125, the lower quartile Q1 = 119, and the upper quartile
Q3 = 130.
For the non-smokers, the median Q2 = 116, the lower quartile Q1 = 113, and the upper
quartile Q3 = 129.
Figure 1.4: Box plots comparing blood pressure in smokers and non-smokers (horizontal axis: blood pressure, from 100 to 150).
We see that the blood pressures of the non-smokers tend to be lower than those of the
smokers (the box in the non-smokers’ plot is shifted to the left compared to the smokers’ plot).
However, blood pressure amongst non-smokers appears to be more variable than amongst
smokers (the box in the non-smokers’ plot is wider than the box in the smokers’ plot).
An outlier is an extremely high or extremely low observation. An outlier may be a data item
that has been incorrectly recorded (in which case it should be removed from the data set), or
it may be a genuine observation (but unusual in some way). An observation is identified as
an outlier if it is less than
\[ Q_1 - \frac{3}{2}(Q_3 - Q_1), \]
or greater than
\[ Q_3 + \frac{3}{2}(Q_3 - Q_1). \]
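The outlier rule can be sketched in Python as follows. As an illustration we apply it to the starting salaries of Example 1.19, using the quartiles Q1 = 2865 and Q3 = 3000 found there. The function names are our own.

```python
def outlier_limits(q1, q3):
    """Return the lower and upper limits Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Observations below the lower limit or above the upper limit."""
    low, high = outlier_limits(q1, q3)
    return [x for x in data if x < low or x > high]

salaries = [2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880]

print(outlier_limits(2865, 3000))
print(find_outliers(salaries, 2865, 3000))
```

The rule only flags observations; whether a flagged value is a recording mistake or a genuine but unusual observation still has to be decided by looking at the data.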
Exercise 1.25
(We will go through this one in the lecture.)
Here are the case prices (in $) for 13 wines produced in the USA.
52, 66, 70, 80, 95, 100, 110, 112, 115, 118, 123, 143, 151.
(iii) Construct a box plot to display the data, using graph or grid paper (if available).