
Chapter Notes (Chapter 1)

The document introduces concepts related to summarizing data sets including populations, samples, variables, observations, qualitative vs quantitative data, descriptive vs inferential statistics, and methods for presenting summarized data including stem-and-leaf diagrams and histograms. Examples are provided to illustrate key definitions and techniques.

Uploaded by

DANIYA GENERAL

CHAPTER 1

Summarising data

In this chapter we will introduce a few basic concepts about data sets, and how they can be
presented and summarised.

Definitions.
Here are a few common words and their definitions.
Population – refers to the complete set of objects of interest. Examples:

o All students at Leeds University

o All new cars sold by Ford UK this year

o All oranges in a crate

Sample – refers to a subset of the population, usually chosen to be representative of the
population with respect to some characteristic. Examples:

o All students registered on this module (is this representative?)

o All new cars sold by Ford UK in Leeds this year (is this representative?)

Variable – refers to the quantity being measured. Examples:

o Age of a student; colour of eyes

o Size of car engine

o Ripeness of the fruit


Observation or data item – refers to the result of a measurement. Examples:


o Age of a student is “19”; colour of eyes is “green”

o Size of car engine is “2200ccm”

o The orange is “half-ripe”

Data can be given in different forms:


Qualitative data – are given as descriptions using names. Examples:
o “brown/blue”

o “yes/no”

o “ripe/half-ripe/not ripe/rotten”

Quantitative data – are given as numeric values. Quantitative data can be discrete or
continuous.
Quantitative discrete data are typically whole numbers. Examples:
o Number of absent students in a class

o Number of red cars sold by Ford in the UK

o Number of rotten oranges in a crate

Quantitative continuous data can take any value in a range. The data may be rounded
to distinct values but we still think of the data as being for a continuous variable. Examples:
o The height of students in a class

o The time taken for Ford to produce a car in the UK

o Size of a car engine

Exercise 1.1
For each of the following variables, state whether it is QUALITATIVE or QUANTITATIVE.
If it is quantitative, is it DISCRETE or CONTINUOUS?
Before looking at the answers, come up with answers yourself first.

(i) Height of a person to the nearest centimetre

(ii) Height of a person, classed as short/medium/tall

(iii) Height of a person

(iv) Annual number of items sold of a product



(v) Soft-drink size, classed as small, medium or large

(vi) Earnings per share

(vii) Method of payment (cash, cheque or credit card)

(viii) Time to pay

(ix) Time to pay in days

(x) Ripeness of oranges

Answers: (i) Quantitative discrete (ii) Qualitative (iii) Quantitative continuous
(iv) Quantitative discrete (v) Qualitative (vi) Quantitative continuous (vii) Qualitative
(viii) Quantitative continuous (ix) Quantitative discrete (x) Qualitative

Statistics – is the science of data collection and data analysis.


Descriptive Statistics – is concerned with methods for summarising the data from a sample.
Inferential Statistics – is concerned with estimating properties of the population based on
data from a sample.
Note: The word ‘inference’ refers to a conclusion or opinion drawn from information, evidence
or reasoning.

Examples 1.2
o Descriptive statistics: A lecturer wants to summarise the performance of 70 Leeds foundation
year students on MATH0365 over the past two years.

o Inferential statistics: An auditor needs to determine whether the transactions on a
client’s balance sheet give an accurate representation of its financial circumstances.

o Inferential statistics: The government wants to assess the opinion of the British public
on whether or not we should join / leave the European Union.

The following video gives a summary of some of the concepts introduced above:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Notation.
With quantitative data we often use $x_i$ or $y_i$ to denote an observation, where $i$ takes values
$1, 2, \dots, n$, and $n$ is the number of observations in the sample.
Example 1.3
Suppose the variable of interest is the age of a student in years, and the population consists
of the students taking this module. If we take a sample of $n = 5$ observations, the data could
be
$$x_1 = 20, \quad x_2 = 19, \quad x_3 = 21, \quad x_4 = 18, \quad x_5 = 24.$$

We often need to order quantitative data by size. We denote by $x_{(1)}$ the smallest observation,
by $x_{(2)}$ the next smallest, and so on, and $x_{(n)}$ denotes the largest observation. (Note the
round brackets around the index!) For the example above we have

$$x_{(1)} = 18, \quad x_{(2)} = 19, \quad x_{(3)} = 20, \quad x_{(4)} = 21, \quad x_{(5)} = 24.$$

Often we need to add together observations for quantitative data. We use the notation
$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$. In the example above

$$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 20 + 19 + 21 + 18 + 24 = 102.$$

Similarly, we can calculate $\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \dots + x_n^2$. In the example,

$$\sum_{i=1}^{5} x_i^2 = 20^2 + 19^2 + 21^2 + 18^2 + 24^2 = 400 + 361 + 441 + 324 + 576 = 2102.$$
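
The same sums and the ordered values can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are our own choice and are not part of the notation used in these notes.

```python
# Sample from Example 1.3 (ages of five students).
x = [20, 19, 21, 18, 24]

n = len(x)                            # number of observations
total = sum(x)                        # sum of x_i for i = 1, ..., n
total_sq = sum(xi ** 2 for xi in x)   # sum of x_i^2 for i = 1, ..., n
ordered = sorted(x)                   # ordered[k - 1] plays the role of x_(k)

print(n, total, total_sq)
print(ordered)
```

Note that Python lists are indexed from 0, so the smallest observation $x_{(1)}$ is `ordered[0]`.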

Exercise 1.4
Let $x_1 = 2$, $x_2 = 8$, $x_3 = 4$, $x_4 = 12$. Calculate each of the following.

(i) $\sum_{i=1}^{3} x_i =$

(ii) $\sum_{i=1}^{4} x_i^2 =$

(iii) $x_{(3)} =$

Try the exercise yourself before looking at this video which explains the answers:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Stem-and-Leaf Diagrams.
Stem-and-leaf diagrams are used to display quantitative data in a more condensed form,
without losing the detail of the original data.
Example 1.5
The scores of 10 adults in a test are:

114, 99, 131, 124, 117, 102, 106, 127, 119, 114.

We use the last digit (the leaf) to represent each data item, with a stem consisting of the
previous digits. We always include a key on the diagram.

 9 | 9
10 | 2 6
11 | 4 4 7 9
12 | 4 7
13 | 1

Table 1.1: Stem-and-leaf diagram. Key: Stem|leaf = 9|9 means 99.
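
For longer lists of whole numbers the diagram can be produced mechanically. The following Python sketch is one possible way to do this; the function name `stem_and_leaf` is our own, and it assumes non-negative integers with the last digit used as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf diagram for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # last digit is the leaf, the rest is the stem
        leaves[stem].append(leaf)
    width = len(str(max(leaves)))
    # Stems with no leaves are still listed, so that gaps in the data stay visible.
    return [
        f"{stem:>{width}} | " + " ".join(str(leaf) for leaf in leaves.get(stem, []))
        for stem in range(min(leaves), max(leaves) + 1)
    ]


scores = [114, 99, 131, 124, 117, 102, 106, 127, 119, 114]
print("\n".join(stem_and_leaf(scores)))
print("Key: 9 | 9 means 99")
```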

From the diagram we see that the lowest test score is $x_{(1)} = 99$ and the highest is $x_{(10)} = 131$;
40% of scores were in the range 110-119.

Example 1.6
The closing prices (to the nearest £) of 20 common stocks on a certain date were

30, 34, 43, 9, 38, 9, 8, 29, 35, 19, 9, 17, 38, 54, 17, 1, 48, 18, 9, 9.

Representing the above in a stem-and-leaf diagram we have

0 | 1 8 9 9 9 9 9
1 | 7 7 8 9
2 | 9
3 | 0 4 5 8 8
4 | 3 8
5 | 4

Table 1.2: Stem-and-leaf diagram. Key: Stem|leaf = 4|3 means £43.

We can see that the cheapest stock was £1 and the most expensive stock was £54. There
appear to be two different groupings of stock: those with prices in the range £1-£19, and
those with prices of £29 or more.
The following video discusses a few remarks on stem and leaf diagrams:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Histograms.
A stem-and-leaf diagram is useful if there are a small number of observations. For large
datasets histograms are more useful.
To construct a histogram we count the number of observations that lie within classes that we
choose. The classes span the range of the data and are non-overlapping. Bars are then drawn
which have areas proportional to the number of observations in the class.
IMPORTANT: It is the AREA of the bar that is proportional to the number of observations.
The number of observations in a class is called the class frequency.
Example 1.7
The data below refers to the amount spent on food (in £) in one week by 40 UK households.

44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81.

We are free to choose the widths of the classes, though remember they must be non-overlapping
and span the range of the data. In the above data the observations range from 18.60 to 95.90.
One possible approach, among many others, is to choose classes 17.495-27.495, 27.495-37.495,
. . . , 87.495-97.495.
Note: Here the boundaries of the classes were chosen to end in the three digits .495 so that
no data point falls on the boundary between two classes. This is not strictly necessary,
as long as we are consistent and make clear to the reader how we group the data points
into the classes.

Before we can construct the histogram we need to produce a frequency table.

Spending per week (£)   Tally            Frequency (number of observations)

17.495-27.495           |||              3
27.495-37.495           ||||             4
37.495-47.495           |||||            5
47.495-57.495           ||||| ||||| |    11
57.495-67.495           ||||| ||         7
67.495-77.495           |||||            5
77.495-87.495           ||||             4
87.495-97.495           |                1
Total                                    40

Table 1.3: Frequency table for food shopping data.
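
The tallying can also be done by a short program. The following Python sketch counts how many observations fall in each class; the names `lower`, `width` and `frequencies` are our own, and the code relies on the fact (noted above) that no observation lies exactly on a class boundary.

```python
spending = [
    44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
    79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
    78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
    65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81,
]

lower, width, n_classes = 17.495, 10, 8   # first boundary, class width, number of classes

# Class k runs from lower + k * width to lower + (k + 1) * width.
frequencies = [0] * n_classes
for value in spending:
    k = int((value - lower) // width)
    frequencies[k] += 1

for k, f in enumerate(frequencies):
    start = lower + k * width
    print(f"{start:.3f}-{start + width:.3f}  {f}")
print("Total", sum(frequencies))
```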



Here is the histogram:

Figure 1.1: Histogram showing the amount spent on food (in £) in one week by 40 households.

The following video explains what is represented in the histogram, in particular on the vertical
axis:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Note that the width of each class is 10. The first bar over the interval from 17.495 to 27.495
for example, should have an area of 3, since this is the number of observations, i.e. frequency,
in this class. So the height of this bar is 0.3, since 0.3 × 10 = 3. Similarly, for the other
intervals.
So the height of each bar is the class frequency divided by the class width. This quantity is
referred to as the frequency density of a class; it is the quantity recorded on the vertical
axis:

$$\text{Frequency density} = \frac{\text{Class frequency}}{\text{Class width}}$$

Looking at a histogram, we can make a number of observations. For example:


The histogram shows that the smallest amount a family spent on food was around £17.5, and
the largest amount was around £97.5. The “average” amount spent on food is around £57.
Note that in the example above all of the classes have the same width, so the frequency
density is the frequency divided by 10 for each of the classes, i.e. in the example above the
frequency density is proportional to the frequency. This is not always the case, and becomes
important when not all the classes have the same width.
Let’s now look at the same data, but with one of the classes having a different width:
Using the same data on food shopping from above, suppose the classes 37.495-47.495 and
47.495-57.495 were merged to produce a single class 37.495-57.495. The frequency table would
appear as follows:

Spending per week (£)   Class width   Frequency   Frequency density

17.495-27.495           10            3           0.3
27.495-37.495           10            4           0.4
37.495-57.495           20            16          0.8
57.495-67.495           10            7           0.7
67.495-77.495           10            5           0.5
77.495-87.495           10            4           0.4
87.495-97.495           10            1           0.1
Total                                 40

Table 1.4: Frequency table for food shopping data.
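
The frequency densities in the table can be reproduced as follows. This is a minimal Python sketch with our own variable names; the densities are rounded to three decimal places only to hide floating-point noise in the class widths.

```python
# (lower boundary, upper boundary, frequency) for each class in Table 1.4
classes = [
    (17.495, 27.495, 3),
    (27.495, 37.495, 4),
    (37.495, 57.495, 16),   # the merged class has width 20
    (57.495, 67.495, 7),
    (67.495, 77.495, 5),
    (77.495, 87.495, 4),
    (87.495, 97.495, 1),
]

# frequency density = class frequency / class width
densities = [round(freq / (upper - lower), 3) for lower, upper, freq in classes]

for (lower, upper, freq), density in zip(classes, densities):
    print(f"{lower:.3f}-{upper:.3f}  frequency {freq:2d}  density {density}")
```

The merged class contains the most observations (16), but its bar is only slightly taller than its neighbour's, because the 16 observations are spread over a class that is twice as wide.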

And the histogram now looks like this:

Figure 1.2: Histogram showing amount spent on food (in £) in one week by 40 households.

Note how the area of the rectangle over the interval from 37.495 to 57.495 in Figure 1.2 is
the sum of the areas of the two rectangles over the intervals from 37.495 to 47.495 and 47.495
to 57.495 in Figure 1.1, as explained in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Exercise 1.8
Using some graph or other grid paper (if you have access to some), construct a histogram to
display the following data which refers to the amount of time (to the nearest minute) students
spent studying for a test. Before drawing the histogram, determine the frequency densities,
and decide on the scale for the histogram.

Time studied (in minutes)   Class width   Frequency   Frequency density

0.5-10.5                    10            5           0.5
10.5-15.5                   5             10
15.5-20.5                   5             12
20.5-30.5                   10            6
30.5-50.5                   20            2

Table 1.5: Frequency table for student study data.

Try to do this yourself first, before watching this video of the solution:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

The Mode.
The mode is defined to be the most frequently occurring value in a set of data.
Example 1.9
The data below gives the number of students present at a statistics tutorial over 10 consecutive
weeks.
8, 5, 7, 10, 8, 6, 5, 6, 4, 8.
Arranging in ascending numerical order we have
4, 5, 5, 6, 6, 7, 8, 8, 8, 10.

The mode of this set of data is 8. We also say that “the modal number of students present is
8”.
The mode need not be unique. If two values occur most frequently, the data are said to be
bimodal. If more than two values occur most frequently, the data are multimodal.
Note that the mode can be determined for qualitative and quantitative data, and is guaranteed
to be a value actually observed.
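
As an illustration, Python's standard library function `statistics.multimode` returns all of the most frequent values, so it also copes with bimodal and multimodal data. The eye-colour list below is made-up data, used only to show that the mode works for qualitative data as well.

```python
from statistics import multimode

# Attendance data from Example 1.9.
attendance = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
attendance_modes = multimode(attendance)

# Hypothetical qualitative data: two colours occur equally often (bimodal).
eye_colours = ["brown", "blue", "green", "blue", "brown"]
colour_modes = multimode(eye_colours)

print(attendance_modes)
print(colour_modes)
```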

The Median.
The median is defined as the “middle value” when the data are arranged in ascending numer-
ical order. More precisely, for a data set with n observations,

if $n$ is odd, then the median is $x_{\left(\frac{n+1}{2}\right)}$,

if $n$ is even, then the median is $\frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right]$.

So, for an odd number of observations the median is the value “in the middle”; for an even
number of observations the median is the average of the two “middle values”. In this video
we look at these formulas a bit more closely:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.10
The data set
0, 7, 7, 19, 19, 20, 35,
has 7 data points. So the median is the 4th in this ordered list, i.e. the median is 19.
Example 1.11
Given the data from above with 10 data points,
4, 5, 5, 6, 6, 7, 8, 8, 8, 10,
the median is the average of the 5th and 6th observations, which is
$$\frac{6 + 7}{2} = \frac{13}{2} = 6.5,$$
i.e. the median number of students attending the tutorial is 6.5.

Note that the median can be a number which is not an actual observation.
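
The two cases of the definition translate directly into code. The following Python sketch uses our own function name `median`; the only subtlety is that Python lists are indexed from 0, so $x_{(k)}$ is stored at position $k - 1$.

```python
def median(data):
    """Median as defined above: middle value, or average of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: x_((n+1)/2), shifted by one for 0-based indexing
        return ordered[(n + 1) // 2 - 1]
    # n even: average of x_(n/2) and x_(n/2 + 1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median([0, 7, 7, 19, 19, 20, 35]))         # data from Example 1.10
print(median([8, 5, 7, 10, 8, 6, 5, 6, 4, 8]))   # data from Example 1.11
```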

The Mean.
The mean, or sample mean, or average, is calculated by adding up all of the observations
and dividing by the number of observations. For observations $x_i$, where $i$ takes
values $1, 2, 3, \dots, n$, we write
$$\text{sample mean} = \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

We denote the sample mean as $\bar{x}$ ($x$ with a horizontal bar on top), pronounced “x bar”.

Example 1.12
The mean for the data from Example 1.9 is
$$\bar{x} = \frac{1}{10} \sum_{i=1}^{10} x_i = \frac{8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8}{10} = \frac{67}{10} = 6.7$$

The mean number of students attending the tutorial is 6.7 (again, not a number that we could
actually observe).

The mean is a very widely used measure, as it has many desirable statistical properties.

Example 1.13
In a survey, subscribers to Hello magazine are asked the following question: “How many of
the last four issues have you read or looked through?”
Suppose that the following frequency distribution summarises 500 responses.

Issues read $x_i$    0     1     2     3     4      Total

Frequency $f_i$      15    10    40    85    350    n = 500
$f_i x_i$            0     10    80    255   1400   1745

Table 1.6: Issues read in a survey on Hello magazine.

This video discusses how we should read this table, compute the mean and the median:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

And here is a written explanation:


How can we calculate the mean number of issues of Hello read by those in the survey? We
could write the data out as a list of 500 observations and calculate the mean from that.
However, it is much more convenient to use the following formula for the mean of frequency
data. Let $n = \sum_{i=1}^{m} f_i$ and let $m$ be the number of classes in the table; $m = 5$ in the example.
Then
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{m} f_i x_i = \frac{1}{500} \sum_{i=1}^{5} f_i x_i = \frac{1745}{500} = 3.49.$$
So the mean number of issues of Hello read by those in the survey is 3.49.
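
The formula for the mean of frequency data can be sketched in Python as follows (the variable names are our own). Each value $x_i$ is weighted by its frequency $f_i$, which gives the same result as writing out all 500 observations.

```python
issues = [0, 1, 2, 3, 4]            # values x_i
freq = [15, 10, 40, 85, 350]        # frequencies f_i

n = sum(freq)                                           # n = sum of f_i
sum_fx = sum(f * x for f, x in zip(freq, issues))       # sum of f_i * x_i
mean = sum_fx / n

# Check against the long way: write out every observation and average.
observations = [x for f, x in zip(freq, issues) for _ in range(f)]
long_mean = sum(observations) / len(observations)

print(n, sum_fx, mean, long_mean)
```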

Let’s now look at an example where the data are given as grouped data without the original
individual data values.
Example 1.14
Cars traveling on a road with a posted speed limit of 60 miles per hour are checked for speed
by a police radar system, giving the following frequency distribution of speeds.

Speed (miles per hour)   45-49   50-54   55-59   60-64   65-69   70-74   75-79   Total
Midpoint $x_i$           47      52      57      62      67      72      77
Frequency $f_i$          10      40      150     175     75      15      10      n = 475
$f_i x_i$                470     2080    8550    10850   5025    1080    770     28825

Table 1.7: Speed when checked by police radar system.

To estimate the mean (it can only be an estimate, since we do not know the original data
values!) we use the midpoints of the classes as the $x_i$'s in the mean of frequency data formula:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{m} f_i x_i = \frac{28825}{475} \approx 60.68.$$

We see that the mean speed of a car checked by the police was about 60.68 miles per hour – just in
excess of the speed limit!
Remark 1.15
The mean may be significantly affected by the inclusion of a mistaken observation, as the
following example shows:
Example 1.16
Suppose in Example 1.9 the observation 10 is recorded as 100 by mistake. The data is now
4, 5, 5, 6, 6, 7, 8, 8, 8, 100.
The mode does not change, it is still 8. The median is unchanged, it is still $\frac{6 + 7}{2} = 6.5$.
However, the mean is now:
$$\bar{x} = \frac{1}{10} \sum_{i=1}^{10} x_i = \frac{4 + 5 + 5 + 6 + 6 + 7 + 8 + 8 + 8 + 100}{10} = \frac{157}{10} = 15.7.$$

The mean has more than doubled.
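
The comparison in Example 1.16 can be repeated with the functions `mode`, `median` and `mean` from Python's standard `statistics` module. This is only a sketch to illustrate the effect of the mistaken observation; the variable names are our own.

```python
from statistics import mean, median, mode

original = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
# The observation 10 is recorded as 100 by mistake.
corrupted = [100 if x == 10 else x for x in original]

for label, data in [("original", original), ("corrupted", corrupted)]:
    print(label, "mode:", mode(data), "median:", median(data), "mean:", mean(data))
```

Only the mean reacts to the mistaken value; the mode and the median are the same for both lists.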


Exercise 1.17
(We will go through this one in the lecture.)
The following data refers to the annual number of deaths from tornadoes in the USA between
the years 1990 and 2000.

53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40

Calculate the mean, the median and the mode of these data.

Quartiles.
The median is the value that splits the ordered data into two halves. Recall that we defined
it as follows:

o If $n$ is odd, then the median is $x_{\left(\frac{n+1}{2}\right)}$, and is an actual data point. In this case we
include the median in both the lower half and the upper half.
o If $n$ is even, then the median is $\frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right]$. In this case the lower half consists simply of
the lower $\frac{n}{2}$ values, and the upper half of the upper $\frac{n}{2}$ values.

The quartiles split the ordered data into quarters. We refer to the median as Q2 , and define
the lower quartile Q1 to be the median of the lower half, and the upper quartile Q3 to
be the median of the upper half.
!!! Be careful when reading textbooks, or using calculators or other software to compute quartiles.
There are a number of different ways to define the lower and upper quartile, producing
different results !!!

Example 1.18
In a survey of 21 households the number of telephones used by each household is given
below.
1, 3, 4, 1, 1, 2, 1, 1, 2, 5, 1, 2, 3, 0, 2, 1, 2, 1, 3, 0, 4.
Arranging the data in ascending numerical order gives
0, 0, 1, 1, 1, $x_{(6)} = 1$, 1, 1, 1, 1, $x_{(11)} = 2$, 2, 2, 2, 2, $x_{(16)} = 3$, 3, 3, 4, 4, 5.

There are $n = 21$ data items. Twenty-one is an odd number, and so the median number of
telephones per household is data item $Q_2 = x_{\left(\frac{n+1}{2}\right)} = x_{(11)} = 2$. The lower quartile $Q_1$ is the
median of the data items $x_{(1)}, x_{(2)}, x_{(3)}, \dots, x_{(11)}$, which is $x_{(6)} = 1$. The upper quartile $Q_3$ is
the median of the data items $x_{(11)}, x_{(12)}, x_{(13)}, \dots, x_{(21)}$, which is $x_{(16)} = 3$.
This video explains the example above in more detail:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.19
The data below gives the monthly starting salaries (in $) of 12 business school graduates.

2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880.
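Arranged in ascending order the data are
2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3325, 9130.
Since 12 is even, the median is $Q_2 = \frac{x_{(6)} + x_{(7)}}{2} = 2905$.
The lower quartile $Q_1$ is the median of $x_{(1)}, x_{(2)}, \dots, x_{(6)}$, which is $Q_1 = \frac{x_{(3)} + x_{(4)}}{2} = 2865$.
The upper quartile $Q_3$ is the median of $x_{(7)}, x_{(8)}, \dots, x_{(12)}$, which is $Q_3 = \frac{x_{(9)} + x_{(10)}}{2} = 3000$.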
See whether you can get the same answers. This video goes through the process of determining
Q1 , Q2 , and Q3 for this example:
(Open the link to the video in a new tab or new window - usually done by using right-click.)
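
The definition of the quartiles used in these notes can be sketched in Python as follows. The function names are our own. Library routines such as `statistics.quantiles` use other conventions and may return different values, which is exactly the warning given above.

```python
def median_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule from these notes."""
    ordered = sorted(data)
    n = len(ordered)
    half = (n + 1) // 2   # for odd n the median is included in both halves
    lower, upper = ordered[:half], ordered[n - half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


phones = [1, 3, 4, 1, 1, 2, 1, 1, 2, 5, 1, 2, 3, 0, 2, 1, 2, 1, 3, 0, 4]
salaries = [2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880]

print(quartiles(phones))     # Example 1.18
print(quartiles(salaries))   # Example 1.19
```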

We are now introducing a number of concepts which measure the dispersion, i.e. the spread
or variability of a quantitative data set.

The Range.
The range is simply the difference between the largest and the smallest observation, i.e. in
a data set with n observations

range = (largest observation)−(smallest observation) = x(n) − x(1) .


In Example 1.19, the range is $9130 − $2710 = $6420. Is this a reasonable representation of
the variability? – probably not, as eleven of the twelve salaries lie between $2710 and $3325,
with one data point of $9130 being much larger than all the others.

The Interquartile Range.


The range can be inflated by the presence of a single very large or very small value. It is
better to give the interquartile range (IQR) which is the range of the middle 50% of the data
values, i.e. the difference between the upper and the lower quartile; this is

IQR = Q3 − Q1 .

For Example 1.19, the interquartile range IQR is $3000 − $2865 = $135.

The Variance and Standard Deviation.


The (sample) variance and (sample) standard deviation are the most widely used measures
of variation or dispersion of a data set, as they have desirable statistical properties.
The sample variance measures how far the data points are spread out from the mean, and
for a set with $n$ data points $x_1, \dots, x_n$ is defined as
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
(The reason why the sample variance is written as a square will become clear further below.)
If $s^2$ is small, the data items lie fairly close to the mean.
If $s^2$ is large, the data items are widely spread about the mean.
The formula above is discussed in more detail in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.20
The data given in Example 1.9 with the number of students present at a statistics tutorial
over 10 consecutive weeks was

8, 5, 7, 10, 8, 6, 5, 6, 4, 8,

with a mean of 6.7.


So the sample variance for these data is
$$s^2 = \frac{1}{9} \left[ (8 - 6.7)^2 + (5 - 6.7)^2 + (7 - 6.7)^2 + (10 - 6.7)^2 + (8 - 6.7)^2 + (6 - 6.7)^2 + (5 - 6.7)^2 + (6 - 6.7)^2 + (4 - 6.7)^2 + (8 - 6.7)^2 \right] = \frac{30.1}{9} \approx 3.3.$$

For larger data sets, this involves a lot of calculations. Here is a more “calculator friendly”
form of the sample variance:
$$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right].$$

One can show that the two formulas produce the same results. We will show this for n = 3
in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

So in the example from above, we could compute the sample variance instead as follows:
$$\sum_{i=1}^{n} x_i^2 = 8^2 + 5^2 + 7^2 + 10^2 + 8^2 + 6^2 + 5^2 + 6^2 + 4^2 + 8^2 = 479, \quad \text{and}$$

$$\sum_{i=1}^{n} x_i = 8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8 = 67, \quad \text{so}$$

$$s^2 = \frac{1}{9} \left[ 479 - \frac{67^2}{10} \right] = \frac{1}{9} (479 - 448.9) \approx 3.3.$$

The sample standard deviation is defined as
$$s = \sqrt{s^2}.$$
It is measured in the same units as the original data and is therefore easier to interpret.
For the example from above $s = \sqrt{s^2} = \sqrt{3.3} \approx 1.8$; so the standard deviation is 1.8 students.
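
Both forms of the sample variance can be checked numerically. The following Python sketch uses our own variable names, and compares the results with `statistics.variance` from the standard library, which also divides by $n - 1$.

```python
from math import isclose, sqrt
from statistics import variance

x = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]   # data from Example 1.9
n = len(x)
x_bar = sum(x) / n

# Definition: squared distances from the mean, divided by n - 1.
s2_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# "Calculator friendly" form: only the sum and the sum of squares are needed.
s2_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

s = sqrt(s2_definition)   # sample standard deviation

print(round(s2_definition, 4), round(s2_shortcut, 4), round(s, 4))
print(isclose(s2_definition, variance(x)))
```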

Example 1.21
Let us return to the survey on Hello subscribers from earlier in the chapter.

Number read $x_i$    0     1     2     3     4      Total

Frequency $f_i$      15    10    40    85    350    n = 500
$f_i x_i$            0     10    80    255   1400   1745
$f_i x_i^2$          0     10    160   765   5600   6535

Table 1.8: Issues read in a survey on Hello magazine.

Here is how we calculate the sample variance $s^2$ for these data, using the frequencies $f_i$.

We have $n = \sum_{i=1}^{m} f_i = 500$. So
$$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{m} f_i x_i \right)^2 \right] = \frac{1}{499} \left[ 6535 - \frac{1745^2}{500} \right] = 0.89.$$

In this video we briefly explain how to apply the formula for the given frequency table:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.22
We now return to the police speed check example. See whether you can follow how the
formula below was used.

Speed (miles per hour)   45-49   50-54    55-59    60-64    65-69    70-74   75-79   Total
Midpoint $x_i$           47      52       57       62       67       72      77
Frequency $f_i$          10      40       150      175      75       15      10      n = 475
$f_i x_i$                470     2080     8550     10850    5025     1080    770     28825
$f_i x_i^2$              22090   108160   487350   672700   336675   77760   59290   1764025

Table 1.9: Speed when checked by police radar system.

Here
$$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{m} f_i x_i \right)^2 \right] = \frac{1}{474} \left[ 1764025 - \frac{28825^2}{475} \right] = 31.23 \text{ mph}^2.$$
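
The calculations for frequency data follow the same pattern each time, so they can be collected in one small function. This is a sketch in Python with a function name of our own choosing; for grouped data the class midpoints are passed as the values, so the results are estimates only.

```python
def frequency_mean_and_variance(values, freqs):
    """Mean and sample variance of frequency data (values x_i, frequencies f_i)."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, values))        # sum of f_i * x_i
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, values))   # sum of f_i * x_i^2
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return mean, variance


# Hello magazine survey (Example 1.21).
hello_mean, hello_var = frequency_mean_and_variance(
    [0, 1, 2, 3, 4], [15, 10, 40, 85, 350]
)

# Police speed check (Example 1.22), using the class midpoints.
speed_mean, speed_var = frequency_mean_and_variance(
    [47, 52, 57, 62, 67, 72, 77], [10, 40, 150, 175, 75, 15, 10]
)

print(round(hello_mean, 2), round(hello_var, 2))
print(round(speed_mean, 2), round(speed_var, 2))
```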

Exercise 1.23
(We will go through this one in the lecture.)
The data below refer to the number of brothers and sisters a sample of students have. You
will need to calculate the values that go in place of the constants B1 , B2 , B3 and B4 .

Number brothers/sisters, xi 0 1 2 3 4 5 Total


Frequency, fi 5 12 8 3 0 1 29
f i xi 0 12 16 B1 0 5 B2
fi x2i 0 12 32 27 0 B3 B4

Table 1.10: Data on numbers of brothers and sisters.

Also calculate the mean, the median, the sample variance, and the standard deviation for
these data.

Skewness and Outliers.


If a data set is approximately symmetric, the median Q2 is roughly equally spaced between
the upper quartile Q3 and the lower quartile Q1 . If the median is much closer to Q1 than to
Q3 , then the data set is positively skewed (containing some rather high values). If the median
is much closer to Q3 than to Q1 , then the data set is negatively skewed (containing some
rather low values).
The quartile coefficient of skewness is given by

$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}.$$
Note that this is the same as $\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$.
This coefficient takes values between −1 and +1. A value near zero indicates symmetry, a
negative value indicates that the data is negatively skewed, a positive value indicates that the
data is positively skewed. The following video explains how the above formula represents the
skewness of the data:
(Open the link to the video in a new tab or new window - usually done by using right-click.)
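
As a small illustration, the coefficient can be computed with a one-line Python function (the function name is our own). It is applied here to the quartiles found in Examples 1.18 and 1.19.

```python
def quartile_skewness(q1, q2, q3):
    """Quartile coefficient of skewness: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)


phones_skew = quartile_skewness(1, 2, 3)              # telephones: Q1 = 1, Q2 = 2, Q3 = 3
salary_skew = quartile_skewness(2865, 2905, 3000)     # salaries: Q1 = 2865, Q2 = 2905, Q3 = 3000

print(phones_skew, round(salary_skew, 2))
```

For the telephone data the median lies exactly halfway between the quartiles, so the coefficient is zero; for the salary data the median is closer to $Q_1$, so the coefficient is positive.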

Box Plots.
A useful way of comparing two data sets is to produce box plots. To construct a box plot you
must first calculate the median $Q_2$ and the quartiles $Q_1$ and $Q_3$. The general form of a box
plot is shown below:

[Figure: a box plot, labelled from left to right with the smallest observation, Q1, the median, Q3 and the largest observation.]

Figure 1.3: A box plot.

This video gives a brief introduction to box plots:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.24
The following data gives systolic blood pressure of 12 smokers and 12 non-smokers.

Smokers 122 146 120 114 124 126 118 128 130 134 116 130
Non-smokers 114 134 114 116 138 110 112 116 132 126 108 116

Table 1.11: Blood pressures of smokers and non-smokers.

For the smokers the median Q2 = 125, the lower quartile Q1 = 119, and the upper quartile
Q3 = 130.
For the non-smokers, the median Q2 = 116, the lower quartile Q1 = 113, and the upper
quartile Q3 = 129.

The two box plots are as follows.

[Figure: box plots for Smokers and Non-smokers, drawn against a common blood pressure axis running from 100 to 150.]

Figure 1.4: Box plots comparing blood pressure in smokers and non-smokers.

We see that the blood pressures of the non-smokers tend to be lower than those of the
smokers (the box in the non-smokers’ plot is shifted to the left compared to the smokers’ plot).
However, blood pressure amongst non-smokers appears to be more variable than amongst
smokers (the box in the non-smokers’ plot is wider than the box in the smokers’ plot).

Skewness for the smokers:


$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{(130 - 125) - (125 - 119)}{130 - 119} \approx -0.09.$$
The blood pressures of the smokers are slightly negatively skewed; Q1 is a bit further away
from Q2 than Q3 is.

Skewness for the non-smokers:


$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{(129 - 116) - (116 - 113)}{129 - 113} = 0.625.$$
The blood pressures of the non-smokers are positively skewed; $Q_3$ is further away from $Q_2$
than $Q_1$ is (i.e. there are some non-smokers with rather high blood pressures compared to
the rest).

An outlier is an extremely high or extremely low observation. An outlier may be a data item
that has been incorrectly recorded (in which case it should be removed from the data set), or
it may be a genuine observation (but unusual in some way). An observation is identified as
an outlier if it is less than
$$Q_1 - \frac{3}{2} (Q_3 - Q_1),$$
or greater than
$$Q_3 + \frac{3}{2} (Q_3 - Q_1).$$

Outliers for the smokers: (for the data from above)


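$$Q_1 - \frac{3}{2} (Q_3 - Q_1) = 119 - \frac{3}{2} (130 - 119) = 102.5, \quad \text{and} \quad Q_3 + \frac{3}{2} (Q_3 - Q_1) = 130 + \frac{3}{2} (130 - 119) = 146.5.$$

All of the smokers’ observations lie between 114 and 146, i.e. within these limits.
Hence there are no outliers in the smokers’ data.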

Outliers for the non-smokers:


$$Q_1 - \frac{3}{2} (Q_3 - Q_1) = 113 - \frac{3}{2} (129 - 113) = 89, \quad \text{and} \quad Q_3 + \frac{3}{2} (Q_3 - Q_1) = 129 + \frac{3}{2} (129 - 113) = 153.$$
There are also no outliers in the non-smokers’ data.
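
The outlier rule is easy to apply by program. The following Python sketch uses function names of our own choosing and takes the quartiles as given (computed as in the examples above). As a further illustration it is also applied to the salary data of Example 1.19, with $Q_1 = 2865$ and $Q_3 = 3000$.

```python
def outlier_limits(q1, q3):
    """Lower and upper limits Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Observations lying below the lower limit or above the upper limit."""
    low, high = outlier_limits(q1, q3)
    return [x for x in data if x < low or x > high]


smokers = [122, 146, 120, 114, 124, 126, 118, 128, 130, 134, 116, 130]
non_smokers = [114, 134, 114, 116, 138, 110, 112, 116, 132, 126, 108, 116]
salaries = [2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880]

print(outlier_limits(119, 130), find_outliers(smokers, 119, 130))
print(outlier_limits(113, 129), find_outliers(non_smokers, 113, 129))
print(outlier_limits(2865, 3000), find_outliers(salaries, 2865, 3000))
```

For the salary data the limits are 2662.5 and 3202.5, so the salaries 9130 and 3325 are both identified as outliers by this rule.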

This video explains the concept of an outlier using box plots:


(Open the link to the video in a new tab or new window - usually done by using right-click.)

Exercise 1.25
(We will go through this one in the lecture.)
Here are the case prices (in $) for 13 wines produced in the USA.

52, 66, 70, 80, 95, 100, 110, 112, 115, 118, 123, 143, 151.

(i) Calculate the Quartile Coefficient of Skewness.

(ii) Identify any outliers.

(iii) Construct a box plot to display the data, using graph or grid paper (if available).
