Chapter Notes (Chapter 1)

The document introduces concepts related to summarizing data sets including populations, samples, variables, observations, qualitative vs quantitative data, descriptive vs inferential statistics, and methods for presenting summarized data including stem-and-leaf diagrams and histograms. Examples are provided to illustrate key definitions and techniques.

Uploaded by

DANIYA GENERAL

CHAPTER 1

Summarising data

In this chapter we will introduce a few basic concepts about data sets, and how they can be
presented and summarised.

Definitions.
Here are a few common words and their definitions.
Population – refers to the complete set of objects of interest. Examples:

o All students at Leeds University

o All new cars sold by Ford UK this year

o All oranges in a crate

Sample – refers to a subset of the population, usually chosen to be representative of the
population with respect to some characteristic. Examples:

o All students registered on this module (is this representative?)

o All new cars sold by Ford UK in Leeds this year (is this representative?)

Variable – refers to the quantity being measured. Examples:

o Age of a student; colour of eyes

o Size of car engine

o Ripeness of the fruit


Observation or data item – refers to the result of a measurement. Examples:

o Age of a student is “19”; colour of eyes is “green”

o Size of car engine is “2200ccm”

o The orange is “half-ripe”

Data can be given in different forms:

Qualitative data – are given as descriptions using names. Examples:
o “brown/blue”

o “yes/no”

o “ripe/half-ripe/not ripe/rotten”

Quantitative data – are given as numeric values. Quantitative data can be discrete or
continuous.
Quantitative discrete data are typically whole numbers. Examples:
o Number of absent students in a class

o Number of red cars sold by Ford in the UK

o Number of rotten oranges in a crate

Quantitative continuous data can take any value in a range. The data may be rounded
to distinct values but we still think of the data as being for a continuous variable. Examples:
o The height of students in a class

o The time taken for Ford to produce a car in the UK

o Size of a car engine

Exercise 1.1
For each of the following variables, state whether they are QUALITATIVE or QUANTITATIVE.
If they are quantitative, are they DISCRETE or CONTINUOUS?
Before looking at the answers, come up with answers yourself first.

(i) Height of a person to the nearest centimetre

(ii) Height of a person, classed as short/medium/tall

(iii) Height of a person

(iv) Annual number of items sold of a product

(v) Soft-drink size, classed as small, medium or large

(vi) Earnings per share

(vii) Method of payment (cash, cheque or credit card)

(viii) Time to pay

(ix) Time to pay in days

(x) Ripeness of oranges

Answers: (i) Quantitative discrete (ii) Qualitative (iii) Quantitative continuous
(iv) Quantitative discrete (v) Qualitative (vi) Quantitative continuous (vii) Qualitative
(viii) Quantitative continuous (ix) Quantitative discrete (x) Qualitative

Statistics – is the science of data collection and data analysis.

Descriptive Statistics – is concerned with methods for summarising the data from a sample.
Inferential Statistics – is concerned with estimating properties of the population based on
data from a sample.
Note: The word ‘inference’ refers to a conclusion or opinion drawn from information, evidence
or reasoning.

Examples 1.2
o Descriptive statistics: A lecturer wants to summarise the performance of 70 Leeds
foundation year students on MATH0365 over the past two years.

o Inferential statistics: An auditor needs to determine whether the transactions on a
client’s balance sheet give an accurate representation of its financial circumstances.

o Inferential statistics: The government wants to assess the opinion of the British public
on whether or not we should join / leave the European Union.

The following video gives a summary of some of the concepts introduced above:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Notation.
With quantitative data we often use \(x_i\) or \(y_i\) to denote an observation where \(i\) takes values
\(1, 2, \ldots, n\), and \(n\) is the number of observations in the sample.
Example 1.3
Suppose the variable of interest is the age of a student in years, and the population consists
of the students taking this module. If we take a sample of \(n = 5\) observations, the data could
be
\[x_1 = 20,\quad x_2 = 19,\quad x_3 = 21,\quad x_4 = 18,\quad x_5 = 24.\]

We often need to order quantitative data by size. We denote \(x_{(1)}\) to be the smallest observation,
\(x_{(2)}\) the next smallest, and so on, and \(x_{(n)}\) denotes the largest observation. (Note the
round brackets around the index!) For the example above we have

\[x_{(1)} = 18,\quad x_{(2)} = 19,\quad x_{(3)} = 20,\quad x_{(4)} = 21,\quad x_{(5)} = 24.\]

Often we need to add together observations for quantitative data. We use the notation
\[\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n.\]
In the example above
\[\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 20 + 19 + 21 + 18 + 24 = 102.\]

Similarly, we can calculate
\[\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \ldots + x_n^2.\]
In the example,
\[\sum_{i=1}^{5} x_i^2 = 20^2 + 19^2 + 21^2 + 18^2 + 24^2 = 400 + 361 + 441 + 324 + 576 = 2102.\]
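The sums and the ordered observations for the ages in Example 1.3 can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of the notes.

```python
# Sample from Example 1.3: ages of n = 5 students.
x = [20, 19, 21, 18, 24]

n = len(x)                                 # number of observations
total = sum(x)                             # x_1 + x_2 + ... + x_n
total_of_squares = sum(v ** 2 for v in x)  # x_1^2 + x_2^2 + ... + x_n^2
ordered = sorted(x)                        # x_(1) <= x_(2) <= ... <= x_(n)

print(n, total, total_of_squares)  # 5 102 2102
print(ordered[0], ordered[-1])     # 18 24  (smallest and largest observation)
```

Note that Python lists are indexed from 0, so the ordered observation \(x_{(k)}\) is `ordered[k - 1]`.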

Exercise 1.4
Let \(x_1 = 2\), \(x_2 = 8\), \(x_3 = 4\), \(x_4 = 12\). Calculate each of the following.

(i) \(\sum_{i=1}^{3} x_i =\)

(ii) \(\sum_{i=1}^{4} x_i^2 =\)

(iii) \(x_{(3)} =\)

Try the exercise yourself before looking at this video which explains the answers:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Stem-and-Leaf Diagrams.
Stem-and-leaf diagrams are used to display quantitative data in a more condensed form,
without losing the detail of the original data.
Example 1.5
The scores of 10 adults in a test are:

114, 99, 131, 124, 117, 102, 106, 127, 119, 114.

We use the last digit (the leaf) to represent each data item, with a stem consisting of the
previous digits. We always include a key on the diagram.

 9 | 9
10 | 2 6
11 | 4 4 7 9
12 | 4 7
13 | 1

Table 1.1: Stem-and-leaf diagram. Key: Stem|leaf = 9|9 means 99.

From the diagram we see that the lowest test score is \(x_{(1)} = 99\) and the highest is \(x_{(10)} = 131\);
40% of scores are in the range 110-119.
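As a rough sketch, the grouping behind a stem-and-leaf diagram can be reproduced in Python. The helper `stem_and_leaf` is a name invented for this illustration, and it assumes the data are non-negative whole numbers with the last digit used as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return a dict mapping each stem to its sorted list of leaves (last digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf = last digit, stem = the digits before it
        rows[stem].append(leaf)
    return dict(rows)

scores = [114, 99, 131, 124, 117, 102, 106, 127, 119, 114]
diagram = stem_and_leaf(scores)

# Print every stem between the smallest and largest, even if it has no leaves.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem:>3} | {leaves}")
print("Key: 9 | 9 means 99")
```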

Example 1.6
The closing prices (to the nearest £) of 20 common stocks on a certain date were

30, 34, 43, 9, 38, 9, 8, 29, 35, 19, 9, 17, 38, 54, 17, 1, 48, 18, 9, 9.

Representing the above in a stem-and-leaf diagram we have

0 | 1 8 9 9 9 9 9
1 | 7 7 8 9
2 | 9
3 | 0 4 5 8 8
4 | 3 8
5 | 4

Table 1.2: Stem-and-leaf diagram. Key: Stem|leaf = 4|3 means £43.

We can see that the cheapest stock was £1 and the most expensive stock was £54. There
appear to be two different groupings of stock: those with prices in the range £1-£19, and
those with prices of £29 or more.
The following video discusses a few remarks on stem and leaf diagrams:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Histograms.
A stem-and-leaf diagram is useful if there are a small number of observations. For large
datasets histograms are more useful.
To construct a histogram we count the number of observations that lie within classes that we
choose. The classes span the range of the data and are non-overlapping. Bars are then drawn
which have areas proportional to the number of observations in the class.
IMPORTANT: It is the AREA of the bar that is proportional to the number of observations.
The number of observations in a class is called the class frequency.
Example 1.7
The data below refers to the amount spent on food (in £) in one week by 40 UK households.

44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81.

We are free to choose the widths of the classes, though remember they must be non-overlapping
and span the range of the data. In the above data the observations range from 18.60 to 95.90.
One possible approach, among many others, is to choose classes 17.495-27.495, 27.495-37.495,
. . . , 87.495-97.495.
Note: Here the boundaries of the classes were chosen to end in the three digits .495 so that
no data point falls on the boundary between two classes. This is not
strictly necessary, as long as we are consistent and make clear to the reader how we group
the data points into the classes.

Before we can construct the histogram we need to produce a frequency table.

Spending per week (£)   Tally            Frequency (number of observations)

17.495-27.495           |||              3
27.495-37.495           ||||             4
37.495-47.495           |||||            5
47.495-57.495           ||||| ||||| |    11
57.495-67.495           ||||| ||         7
67.495-77.495           |||||            5
77.495-87.495           ||||             4
87.495-97.495           |                1
Total                                    40

Table 1.3: Frequency table for food shopping data.
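The class frequencies can also be counted programmatically. The sketch below uses the class boundaries chosen in this example; the variable names are illustrative only.

```python
spending = [
    44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
    79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
    78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
    65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81,
]

# Class boundaries 17.495, 27.495, ..., 97.495 (eight classes of width 10).
boundaries = [17.495 + 10 * k for k in range(9)]

frequencies = []
for lower, upper in zip(boundaries, boundaries[1:]):
    # No observation lies exactly on a boundary, so the choice of < or <= does not matter here.
    count = sum(1 for v in spending if lower <= v < upper)
    frequencies.append(count)
    print(f"{lower:.3f}-{upper:.3f}: {count}")

print("Total:", sum(frequencies))
```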


Here is the histogram:

Figure 1.1: Histogram showing the amount spent on food (in £) in one week by 40 households.

The following video explains what is represented in the histogram, in particular on the vertical
axis:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Note that the width of each class is 10. The first bar over the interval from 17.495 to 27.495
for example, should have an area of 3, since this is the number of observations, i.e. frequency,
in this class. So the height of this bar is 0.3, since 0.3 × 10 = 3. Similarly, for the other
intervals.
So the height of each bar is the class frequency divided by the class width. This quantity is
referred to as the frequency density of a class; it is the quantity recorded on the vertical
axis:

\[\text{Frequency density} = \frac{\text{Class frequency}}{\text{Class width}}\]

Looking at a histogram, we can make a number of observations. For example:

The histogram shows that the smallest amount a family spent on food was around £17.5, and
the largest amount was around £97.5. The “average” amount spent on food is around £57.
Note that in the example above all of the classes have the same width, so the frequency
density is the frequency divided by 10 for each of the classes, i.e. in the example above the
frequency density is proportional to the frequency. This is not always the case, and becomes
important when not all the classes have the same width.
Let’s now look at the same data, but with one of the classes having a different width:
Using the same data on food shopping from above, suppose the classes 37.495-47.495 and
47.495-57.495 were merged to produce a single class 37.495-57.495. The frequency table would
appear as follows:

Spending per week (£)   Class width   Frequency   Frequency density

17.495-27.495           10            3           0.3
27.495-37.495           10            4           0.4
37.495-57.495           20            16          0.8
57.495-67.495           10            7           0.7
67.495-77.495           10            5           0.5
77.495-87.495           10            4           0.4
87.495-97.495           10            1           0.1
Total                                 40

Table 1.4: Frequency table for food shopping data.
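The frequency densities in Table 1.4 follow directly from dividing each frequency by its class width. A minimal sketch, with the table entered by hand as (lower boundary, upper boundary, frequency) triples:

```python
# (lower boundary, upper boundary, frequency), as in Table 1.4.
classes = [
    (17.495, 27.495, 3),
    (27.495, 37.495, 4),
    (37.495, 57.495, 16),  # the merged class, width 20
    (57.495, 67.495, 7),
    (67.495, 77.495, 5),
    (77.495, 87.495, 4),
    (87.495, 97.495, 1),
]

densities = []
for lower, upper, frequency in classes:
    width = upper - lower
    density = frequency / width  # height of the bar; its area is then the frequency
    densities.append(round(density, 3))
    print(f"{lower:.3f}-{upper:.3f}  width {width:.0f}  density {density:.1f}")
```

The merged class has the largest frequency (16), but its bar is only 0.8 high because the frequency is spread over a width of 20.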

And the histogram now looks like this:

Figure 1.2: Histogram showing amount spent on food (in £) in one week by 40 households.

Note how the area of the rectangle over the interval from 37.495 to 57.495 in Figure 1.2 is
the sum of the areas of the two rectangles over the intervals from 37.495 to 47.495 and 47.495
to 57.495 in Figure 1.1, as explained in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Exercise 1.8
Using some graph or other grid paper (if you have access to some), construct a histogram to
display the following data which refers to the amount of time (to the nearest minute) students
spent studying for a test. Before drawing the histogram, determine the frequency densities,
and decide on the scale for the histogram.

Time studied (in minutes)   Class width   Frequency   Frequency density

0.5-10.5                    10            5           0.5
10.5-15.5                   5             10
15.5-20.5                   5             12
20.5-30.5                   10            6
30.5-50.5                   20            2

Table 1.5: Frequency table for student study data.

Try to do this yourself first, before watching this video of the solution:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

The Mode.
The mode is defined to be the most frequently occurring value in a set of data.
Example 1.9
The data below gives the number of students present at a statistics tutorial over 10 consecutive
weeks.
8, 5, 7, 10, 8, 6, 5, 6, 4, 8.
Arranging in ascending numerical order we have

4, 5, 5, 6, 6, 7, 8, 8, 8, 10.

The mode of this set of data is 8. We also say that “the modal number of students present is
8”.
The mode need not be unique. If two values occur most frequently, the data are said to be
bimodal. If more than two values occur most frequently, the data are multimodal.
Note that the mode can be determined for qualitative and quantitative data, and is guaranteed
to be a value actually observed.
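A short sketch of how the mode (or modes) could be found in Python. The function name `modes` is chosen for this illustration; it returns a list so that bimodal and multimodal data are handled, and it works for qualitative data as well. The eye-colour and small numeric lists below are made-up illustrations, not data from the notes.

```python
from collections import Counter

def modes(data):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

attendance = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
print(modes(attendance))                  # [8]
print(modes(["brown", "blue", "brown"]))  # ['brown']
print(modes([1, 1, 2, 2, 3]))             # [1, 2]  (bimodal)
```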

The Median.
The median is defined as the “middle value” when the data are arranged in ascending numerical
order. More precisely, for a data set with \(n\) observations,

if \(n\) is odd, then the median is \(x_{\left(\frac{n+1}{2}\right)}\),

if \(n\) is even, then the median is \(\frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right]\).

So, for an odd number of observations the median is the value “in the middle”; for an even
number of observations the median is the average of the two “middle values”. In this video
we look at these formulas a bit more closely:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.10
The data set
0, 7, 7, 19, 19, 20, 35,
has 7 data points. So the median is the 4th in this ordered list, i.e. the median is 19.
Example 1.11
Given the data from above with 10 data points,

4, 5, 5, 6, 6, 7, 8, 8, 8, 10,

the median is the average of the 5th and 6th observations, which is
\[\frac{6 + 7}{2} = \frac{13}{2} = 6.5,\]
i.e. the median number of students attending the tutorial is 6.5.

Note that the median can be a number which is not an actual observation.
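The two cases of the median formula translate directly into code. This is a minimal sketch (the function name is illustrative); the only subtlety is that Python lists are indexed from 0 while the ordered observations are numbered from 1.

```python
def median(data):
    """Median via the ordered observations x_(1) <= ... <= x_(n)."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: x_((n+1)/2); subtract 1 because list indices start at 0
        return ordered[(n + 1) // 2 - 1]
    # n even: average of x_(n/2) and x_(n/2 + 1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([0, 7, 7, 19, 19, 20, 35]))        # 19
print(median([8, 5, 7, 10, 8, 6, 5, 6, 4, 8]))  # 6.5
```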

The Mean.
The mean, or sample mean, or average, is calculated by adding up all of the observations
and dividing by the number of observations. For observations \(x_i\), where \(i\) takes
values \(1, 2, 3, \ldots, n\), we write
\[\text{sample mean} = \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.\]

We denote the sample mean as x̄ (x with a horizontal bar on top), pronounced “x bar”.

Example 1.12
The mean for the data from above is
\[\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8}{10} = \frac{67}{10} = 6.7.\]

The mean number of students attending the tutorial is 6.7 (again, not a number that we could
actually observe).

The mean is a very widely used measure, as it has many desirable statistical properties.

Example 1.13
Subscribers to Hello magazine are asked the following question in a survey: “How many of
the last four issues have you read or looked through?”
Suppose that the following frequency distribution summarises 500 responses.

Issues read \(x_i\)    0     1     2     3      4       Total

Frequency \(f_i\)      15    10    40    85     350     n = 500
\(f_i x_i\)            0     10    80    255    1400    1745

Table 1.6: Issues read in a survey on Hello magazine.

This video discusses how we should read this table, compute the mean and the median:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

And here is a written explanation:

How can we calculate the mean number of issues of Hello read by those in the survey? We
could write the data out as a list of 500 observations and calculate the mean from that.
However, it is much more convenient to use the following formula for the mean of frequency
data. Let \(n = \sum_{i=1}^{m} f_i\) and \(m\) be the number of classes in the table; \(m = 5\) in the example.
Then
\[\bar{x} = \frac{1}{n}\sum_{i=1}^{m} f_i x_i = \frac{1}{500}\sum_{i=1}^{5} f_i x_i = \frac{1745}{500} = 3.49.\]
So the mean number of issues of Hello read by those in the survey is 3.49.
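The frequency formula for the mean can be sketched in Python as follows, using the values from the survey table; the list names are illustrative.

```python
issues_read = [0, 1, 2, 3, 4]      # x_i
frequency = [15, 10, 40, 85, 350]  # f_i

n = sum(frequency)                                           # sum of the f_i
total = sum(f * x for f, x in zip(frequency, issues_read))   # sum of f_i * x_i
mean = total / n

print(n, total, mean)  # 500 1745 3.49
```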

Let’s now look at an example where the data are given as grouped data without the original
individual data values.
Example 1.14
Cars traveling on a road with a posted speed limit of 60 miles per hour are checked for speed
by a police radar system, giving the following frequency distribution of speeds.

Speed (miles per hour)   45-49   50-54   55-59   60-64   65-69   70-74   75-79   Total
Midpoint \(x_i\)         47      52      57      62      67      72      77
Frequency \(f_i\)        10      40      150     175     75      15      10      n = 475
\(f_i x_i\)              470     2080    8550    10850   5025    1080    770     28825

Table 1.7: Speed when checked by police radar system.

To estimate the mean (it can only be an estimate, since we do not know the original data
values!) we use the midpoints of the classes as the \(x_i\)’s in the mean of frequency data formula.
\[\bar{x} = \frac{1}{n}\sum_{i=1}^{m} f_i x_i = \frac{28825}{475} = 60.68.\]

We see that the mean speed of a car checked by the police was 60.68 miles per hour – just in
excess of the speed limit!
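For grouped data the same calculation applies with class midpoints standing in for the unknown individual values. A small sketch, with each class entered as (lowest speed, highest speed, frequency):

```python
# (lowest speed, highest speed, frequency) for each class in Table 1.7.
classes = [(45, 49, 10), (50, 54, 40), (55, 59, 150), (60, 64, 175),
           (65, 69, 75), (70, 74, 15), (75, 79, 10)]

n = sum(f for _, _, f in classes)
# Use the midpoint (low + high) / 2 of each class as its representative value x_i.
total = sum(f * (low + high) / 2 for low, high, f in classes)
estimated_mean = total / n

print(n, total, round(estimated_mean, 2))  # 475 28825.0 60.68
```

The result is only an estimate of the true mean, since the individual speeds within each class are not known.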
Remark 1.15
The mean may be significantly affected by the inclusion of a mistaken observation, as the
following example shows:
Example 1.16
Suppose in Example 1.9 the observation 10 is recorded as 100 by mistake. The data are now

4, 5, 5, 6, 6, 7, 8, 8, 8, 100.

The mode does not change, it is still 8. The median is unchanged, it is still \(\frac{6+7}{2} = 6.5\).
However, the mean is now:
\[\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{4 + 5 + 5 + 6 + 6 + 7 + 8 + 8 + 8 + 100}{10} = \frac{157}{10} = 15.7.\]

The mean has more than doubled.
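The effect of the mistaken observation can be checked with Python's standard `statistics` module. This is just a quick illustration; the variable names are made up here.

```python
import statistics

correct = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
mistaken = [100 if v == 10 else v for v in correct]  # the 10 recorded as 100

for label, data in (("correct", correct), ("mistaken", mistaken)):
    print(label, statistics.mode(data), statistics.median(data), statistics.mean(data))
# correct 8 6.5 6.7
# mistaken 8 6.5 15.7
```

Only the mean moves; the mode and the median are unaffected by the single extreme value.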

Exercise 1.17
(We will go through this one in the lecture.)
The following data refers to the annual number of deaths from tornadoes in the USA between
the years 1990 and 2000.

53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40

Calculate the mean, the median and the mode of these data.

Quartiles.
The median is the value that splits the ordered data into two halves. Recall that we defined
it as follows:

o If \(n\) is odd, then the median is \(x_{\left(\frac{n+1}{2}\right)}\), and is an actual data point. In this case we
include the median in both the lower half and the upper half.

o If \(n\) is even, then the median is \(\frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right]\). In this case the lower half consists simply of
the lower \(\frac{n}{2}\) values, and the upper half of the upper \(\frac{n}{2}\) values.

The quartiles split the ordered data into quarters. We refer to the median as Q2 , and define
the lower quartile Q1 to be the median of the lower half, and the upper quartile Q3 to
be the median of the upper half.
!!! Be careful when reading textbooks, or using calculators or other software to compute
quartiles. There are a number of different ways to define the lower and upper quartile producing
different results !!!

Example 1.18
In a survey of 21 households the number of telephones used by each household is given
below.
1, 3, 4, 1, 1, 2, 1, 1, 2, 5, 1, 2, 3, 0, 2, 1, 2, 1, 3, 0, 4.
Arranging the data in ascending numerical order gives

0, 0, 1, 1, 1, \(x_{(6)} = 1\), 1, 1, 1, 1, \(x_{(11)} = 2\), 2, 2, 2, 2, \(x_{(16)} = 3\), 3, 3, 4, 4, 5.

There are \(n = 21\) data items. Twenty-one is an odd number, and so the median number of
telephones per household is data item \(Q_2 = x_{\left(\frac{n+1}{2}\right)} = x_{(11)} = 2\). The lower quartile \(Q_1\) is the
median of the data items \(x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(11)}\), which is \(x_{(6)} = 1\). The upper quartile \(Q_3\) is
the median of the data items \(x_{(11)}, x_{(12)}, x_{(13)}, \ldots, x_{(21)}\), which is \(x_{(16)} = 3\).
This video explains the example above in more detail:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.19
The data below gives the monthly starting salaries (in $) of 12 business school graduates.

2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880.
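Arranged in ascending order, the data are

2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3325, 9130.

Since 12 is even, the median is \(Q_2 = \frac{x_{(6)} + x_{(7)}}{2} = 2905\).
The lower quartile \(Q_1\) is the median of \(x_{(1)}, x_{(2)}, \ldots, x_{(6)}\), which is \(Q_1 = \frac{x_{(3)} + x_{(4)}}{2} = 2865\).
The upper quartile \(Q_3\) is the median of \(x_{(7)}, x_{(8)}, \ldots, x_{(12)}\), which is \(Q_3 = \frac{x_{(9)} + x_{(10)}}{2} = 3000\).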
See whether you can get the same answers. This video goes through the process of determining
Q1 , Q2 , and Q3 for this example:
(Open the link to the video in a new tab or new window - usually done by using right-click.)
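The quartile rule used in this chapter (median of the lower half and median of the upper half, with the median itself included in both halves when n is odd) can be sketched as below. The function names are illustrative, and, as the warning above says, other software may use a different definition and give different values.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule from this chapter."""
    ordered = sorted(data)
    n = len(ordered)
    half = (n + 1) // 2            # for odd n the median belongs to both halves
    lower = ordered[:half]         # x_(1), ..., x_(half)
    upper = ordered[n - half:]     # the last `half` ordered observations
    return median(lower), median(ordered), median(upper)

telephones = [1, 3, 4, 1, 1, 2, 1, 1, 2, 5, 1, 2, 3, 0, 2, 1, 2, 1, 3, 0, 4]
salaries = [2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880]

print(quartiles(telephones))  # (1, 2, 3)
print(quartiles(salaries))    # (2865.0, 2905.0, 3000.0)
```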

We now introduce a number of concepts which measure the dispersion, i.e. the spread
or variability, of a quantitative data set.

The Range.
The range is simply the difference between the largest and the smallest observation, i.e. in
a data set with n observations

\[\text{range} = (\text{largest observation}) - (\text{smallest observation}) = x_{(n)} - x_{(1)}.\]

In Example 1.19, the range is $9130 − $2710 = $6420. Is this a reasonable representation of
the variability? – probably not, as eleven of the twelve salaries lie between $2710 and $3325,
with one data point of $9130 being much larger than all the others.

The Interquartile Range.

The range can be inflated by the presence of a single very large or very small value. It is
better to give the interquartile range (IQR) which is the range of the middle 50% of the data
values, i.e. the difference between the upper and the lower quartile; this is

IQR = Q3 − Q1 .

For Example 1.19, the interquartile range IQR is $3000 − $2865 = $135.

The Variance and Standard Deviation.

The (sample) variance and (sample) standard deviation are the most widely used measures
of variation or dispersion of a data set, as they have desirable statistical properties.
The sample variance measures how far the data points are spread out from the mean, and
for a set with \(n\) data points \(x_1, \ldots, x_n\), is defined as
\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.\]
(The reason why the sample variance is denoted by a square will become clear further below.)
If \(s^2\) is small, the data items lie fairly close to the mean.
If \(s^2\) is large, the data items are widely spread about the mean.
The formula above is discussed in more detail in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.20
The data given in Example 1.9 with the number of students present at a statistics tutorial
over 10 consecutive weeks was

8, 5, 7, 10, 8, 6, 5, 6, 4, 8,

with a mean of 6.7.

So the sample variance for these data is
\[s^2 = \frac{1}{9}\left[(8 - 6.7)^2 + (5 - 6.7)^2 + (7 - 6.7)^2 + (10 - 6.7)^2 + (8 - 6.7)^2 + (6 - 6.7)^2 + (5 - 6.7)^2 + (6 - 6.7)^2 + (4 - 6.7)^2 + (8 - 6.7)^2\right] = \frac{30.1}{9} \approx 3.3.\]

For larger data sets, this involves a lot of calculations. Here is a more “calculator friendly”
form of the sample variance:
\[s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right].\]

One can show that the two formulas produce the same results. We will show this for n = 3
in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)
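The video treats the case n = 3. As a sketch of why the two forms agree for any n, one can expand the square and use the fact that the sum of the observations equals n times the mean:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 \;-\; 2\bar{x}\sum_{i=1}^{n} x_i \;+\; n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 \;-\; 2n\bar{x}^2 \;+\; n\bar{x}^2
     \qquad \text{since } \sum_{i=1}^{n} x_i = n\bar{x} \\
  &= \sum_{i=1}^{n} x_i^2 \;-\; n\bar{x}^2
   = \sum_{i=1}^{n} x_i^2 \;-\; \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2 .
\end{align*}
```

Dividing both sides by n − 1 gives the “calculator friendly” form of the sample variance.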

So in the example from above, we could compute the sample variance instead as follows:
\[\sum_{i=1}^{n} x_i^2 = 8^2 + 5^2 + 7^2 + 10^2 + 8^2 + 6^2 + 5^2 + 6^2 + 4^2 + 8^2 = 479,\]
and
\[\sum_{i=1}^{n} x_i = 8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8 = 67,\]
so
\[s^2 = \frac{1}{9}\left[479 - \frac{67^2}{10}\right] = \frac{1}{9}(479 - 448.9) = 3.3.\]

The sample standard deviation is defined as

\[s = \sqrt{s^2}.\]
It is measured in the same units as the original data and is therefore easier to interpret.
For the example from above \(s = \sqrt{s^2} = \sqrt{3.3} = 1.8\); so the standard deviation is 1.8 students.
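Both forms of the sample variance, and the standard deviation, can be checked numerically. The sketch below uses illustrative function names; Python's standard library also offers `statistics.variance` and `statistics.stdev`, which use the same n − 1 divisor.

```python
import math

def sample_variance(data):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_variance_shortcut(data):
    """'Calculator friendly' form using the sum and the sum of squares."""
    n = len(data)
    return (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

attendance = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
s2 = sample_variance(attendance)

print(round(s2, 1), round(sample_variance_shortcut(attendance), 1))  # 3.3 3.3
print(round(math.sqrt(s2), 1))                                       # 1.8
```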

Example 1.21
Let us return to the survey on Hello subscribers from earlier in the chapter.

Number read \(x_i\)    0     1     2      3      4       Total

Frequency \(f_i\)      15    10    40     85     350     n = 500
\(f_i x_i\)            0     10    80     255    1400    1745
\(f_i x_i^2\)          0     10    160    765    5600    6535

Table 1.8: Issues read in a survey on Hello magazine.

Here is how we calculate the sample variance \(s^2\) for these data, using the frequencies \(f_i\).

We have \(n = \sum_{i=1}^{m} f_i = 500\). So
\[s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{m} f_i x_i\right)^2\right] = \frac{1}{499}\left[6535 - \frac{1745^2}{500}\right] = 0.89.\]

In this video we briefly explain how to apply the formula for the given frequency table:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.22
We now return to the police speed check example. See whether you can follow how the
formula below is used.

Speed (miles per hour)   45-49   50-54    55-59    60-64    65-69    70-74   75-79   Total
Midpoint \(x_i\)         47      52       57       62       67       72      77
Frequency \(f_i\)        10      40       150      175      75       15      10      n = 475
\(f_i x_i\)              470     2080     8550     10850    5025     1080    770     28825
\(f_i x_i^2\)            22090   108160   487350   672700   336675   77760   59290   1764025

Table 1.9: Speed when checked by police radar system.

Here
\[s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{m} f_i x_i\right)^2\right] = \frac{1}{474}\left[1764025 - \frac{28825^2}{475}\right] = 31.23 \text{ mph}^2.\]
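The frequency-table form of the variance used in Examples 1.21 and 1.22 can be sketched as a small function. The name `frequency_variance` is made up for this illustration; for grouped data the class midpoints are passed in as the values.

```python
def frequency_variance(values, frequencies):
    """Sample variance from values x_i (or class midpoints) and frequencies f_i."""
    n = sum(frequencies)
    sum_fx = sum(f * x for f, x in zip(frequencies, values))        # sum of f_i * x_i
    sum_fx2 = sum(f * x * x for f, x in zip(frequencies, values))   # sum of f_i * x_i^2
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

hello = frequency_variance([0, 1, 2, 3, 4], [15, 10, 40, 85, 350])
speeds = frequency_variance([47, 52, 57, 62, 67, 72, 77],
                            [10, 40, 150, 175, 75, 15, 10])

print(round(hello, 2), round(speeds, 2))  # 0.89 31.23
```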

Exercise 1.23
(We will go through this one in the lecture.)
The data below refer to the number of brothers and sisters a sample of students have. You
will need to calculate the values that go in place of the constants B1 , B2 , B3 and B4 .

Number brothers/sisters, \(x_i\)    0    1     2     3     4    5     Total

Frequency, \(f_i\)                  5    12    8     3     0    1     29
\(f_i x_i\)                         0    12    16    B1    0    5     B2
\(f_i x_i^2\)                       0    12    32    27    0    B3    B4

Table 1.10: Data on numbers of brothers and sisters.

Also calculate the mean, the median, the sample variance, and the standard deviation for
these data.

Skewness and Outliers.

If a data set is approximately symmetric, the median Q2 is roughly equally spaced between
the upper quartile Q3 and the lower quartile Q1 . If the median is much closer to Q1 than to
Q3 , then the data set is positively skewed (containing some rather high values). If the median
is much closer to Q3 than to Q1 , then the data set is negatively skewed (containing some
rather low values).
The quartile coefficient of skewness is given by

\[\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}.\]
Note that this is the same as \(\dfrac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}\).
This coefficient takes values between −1 and +1. A value near zero indicates symmetry, a
negative value indicates that the data is negatively skewed, a positive value indicates that the
data is positively skewed. The following video explains how the above formula represents the
skewness of the data:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Box Plots.
A useful way of comparing two data sets is to produce box plots. To construct a box plot you
must first calculate the median Q2 and the quartiles, Q1 and Q3 . The general form of a box
plot is shown below:

[Figure: a box plot, with labels marking the smallest observation, Q1, the median, Q3 and the largest observation.]

Figure 1.3: A box plot.

This video gives a brief introduction to box plots:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.24
The following data gives systolic blood pressure of 12 smokers and 12 non-smokers.

Smokers 122 146 120 114 124 126 118 128 130 134 116 130
Non-smokers 114 134 114 116 138 110 112 116 132 126 108 116

Table 1.11: Blood pressures of smokers and non-smokers.

For the smokers the median Q2 = 125, the lower quartile Q1 = 119, and the upper quartile
Q3 = 130.
For the non-smokers, the median Q2 = 116, the lower quartile Q1 = 113, and the upper
quartile Q3 = 129.

The two box plots are as follows.

[Figure: two box plots, one for smokers and one for non-smokers, drawn against a common blood pressure axis running from 100 to 150.]

Figure 1.4: Box plots comparing blood pressure in smokers and non-smokers.

We see that the blood pressures of the non-smokers tend to be lower than those of the
smokers (the box in the non-smokers’ plot is shifted to the left compared to the smokers’ plot).
However, blood pressure amongst non-smokers appears to be more variable than amongst
smokers (the box in the non-smokers’ plot is wider than the box in the smokers’ plot).

Skewness for the smokers:

\[\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{(130 - 125) - (125 - 119)}{130 - 119} = -0.09.\]
The blood pressures of the smokers are slightly negatively skewed; Q1 is a bit further away
from Q2 than Q3 is.

Skewness for the non-smokers:

\[\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{(129 - 116) - (116 - 113)}{129 - 113} = \frac{10}{16} = 0.625.\]
The blood pressures of the non-smokers are positively skewed; Q3 is further away from Q2
than Q1 is (i.e. there are some non-smokers with rather high blood pressures compared to
the rest).
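The two skewness calculations can be reproduced with a one-line function. This is a minimal sketch with an illustrative function name; the quartiles are the ones quoted in Example 1.24.

```python
def quartile_skewness(q1, q2, q3):
    """Quartile coefficient of skewness: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

print(round(quartile_skewness(119, 125, 130), 2))  # -0.09  (smokers)
print(quartile_skewness(113, 116, 129))            # 0.625  (non-smokers)
```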

An outlier is an extremely high or extremely low observation. An outlier may be a data item
that has been incorrectly recorded (in which case it should be removed from the data set), or
it may be a genuine observation (but unusual in some way). An observation is identified as
an outlier if it is less than
\[Q_1 - \frac{3}{2}(Q_3 - Q_1),\]
or greater than
\[Q_3 + \frac{3}{2}(Q_3 - Q_1).\]

Outliers for the smokers: (for the data from above)

\[Q_1 - \frac{3}{2}(Q_3 - Q_1) = 119 - \frac{3}{2}(130 - 119) = 102.5, \quad\text{and}\quad Q_3 + \frac{3}{2}(Q_3 - Q_1) = 130 + \frac{3}{2}(130 - 119) = 146.5.\]

Hence there are no outliers in the smokers’ data.

Outliers for the non-smokers:

\[Q_1 - \frac{3}{2}(Q_3 - Q_1) = 113 - \frac{3}{2}(129 - 113) = 89, \quad\text{and}\quad Q_3 + \frac{3}{2}(Q_3 - Q_1) = 129 + \frac{3}{2}(129 - 113) = 153.\]
There are also no outliers in the non-smokers’ data.
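The outlier rule can be applied mechanically once the quartiles are known. The sketch below takes Q1 and Q3 as given (the values quoted in Example 1.24) and uses function names invented for this illustration.

```python
def outlier_limits(q1, q3):
    """Lower and upper limits Q1 - 1.5 * (Q3 - Q1) and Q3 + 1.5 * (Q3 - Q1)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Return the observations lying outside the two limits."""
    low, high = outlier_limits(q1, q3)
    return [x for x in data if x < low or x > high]

smokers = [122, 146, 120, 114, 124, 126, 118, 128, 130, 134, 116, 130]
non_smokers = [114, 134, 114, 116, 138, 110, 112, 116, 132, 126, 108, 116]

print(outlier_limits(119, 130), find_outliers(smokers, 119, 130))      # (102.5, 146.5) []
print(outlier_limits(113, 129), find_outliers(non_smokers, 113, 129))  # (89.0, 153.0) []
```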

This video explains the concept of an outlier using box plots:

(Open the link to the video in a new tab or new window - usually done by using right-click.)

Exercise 1.25
(We will go through this one in the lecture.)
Here are the case prices (in $) for 13 wines produced in the USA.

52, 66, 70, 80, 95, 100, 110, 112, 115, 118, 123, 143, 151.

(i) Calculate the Quartile Coefficient of Skewness.

(ii) Identify any outliers.

(iii) Construct a box plot to display the data, using graph or grid paper (if available).

113 pages
Statistics - Basic Concepts
No ratings yet
Statistics - Basic Concepts
29 pages
C1S1 Statistics Packet
No ratings yet
C1S1 Statistics Packet
24 pages
DIS VISHNU
No ratings yet
DIS VISHNU
48 pages
Introduction Book 1
No ratings yet
Introduction Book 1
41 pages
STATISTICS
No ratings yet
STATISTICS
4 pages
MATH-Lesson-1-2
No ratings yet
MATH-Lesson-1-2
64 pages
DS Module 01
No ratings yet
DS Module 01
17 pages
1&2 MBA Intro Stat Ch1&2
No ratings yet
1&2 MBA Intro Stat Ch1&2
62 pages
1st Mid
No ratings yet
1st Mid
19 pages
Stats Lec01
No ratings yet
Stats Lec01
9 pages
Inferential Statistics
No ratings yet
Inferential Statistics
92 pages
Lecture 1, 2 and 3_d21432a1071b0bf181cd2be654ea33bb
No ratings yet
Lecture 1, 2 and 3_d21432a1071b0bf181cd2be654ea33bb
45 pages
Chapter 1 Eqt 271 (Part 1) : Basic Statistics
No ratings yet
Chapter 1 Eqt 271 (Part 1) : Basic Statistics
69 pages
CHAPTER 1 - PART 1 Latest PDF
No ratings yet
CHAPTER 1 - PART 1 Latest PDF
69 pages
Organizing-Data_250120_180858
No ratings yet
Organizing-Data_250120_180858
32 pages
S4 Week 7 Exam Prep 2 Statistics
No ratings yet
S4 Week 7 Exam Prep 2 Statistics
135 pages
STAT 251 Statistics Notes UBC
No ratings yet
STAT 251 Statistics Notes UBC
292 pages
GCSE Mathematics Numerical Crosswords Higher Tier Written for the GCSE 9-1 Course
From Everand
GCSE Mathematics Numerical Crosswords Higher Tier Written for the GCSE 9-1 Course
Ian Winkworth
No ratings yet
Good Shepherd International School, Ooty Mid Term Examination - September 2022
No ratings yet
Good Shepherd International School, Ooty Mid Term Examination - September 2022
8 pages
Ibbm HL2-MS
No ratings yet
Ibbm HL2-MS
15 pages
Ib BM - Paper 1
No ratings yet
Ib BM - Paper 1
6 pages
IB BM Paper 1-MS
No ratings yet
IB BM Paper 1-MS
13 pages
Comparative Study of Biomes Using Infographics: o o o o
No ratings yet
Comparative Study of Biomes Using Infographics: o o o o
1 page
ESS IA Template
No ratings yet
ESS IA Template
2 pages
Statistical Analysis of Wind Power Forec
No ratings yet
Statistical Analysis of Wind Power Forec
9 pages
2 Graphical Descriptive Techniques
No ratings yet
2 Graphical Descriptive Techniques
49 pages
Output Viewer User Guide2
No ratings yet
Output Viewer User Guide2
211 pages
Data Preprocessing
No ratings yet
Data Preprocessing
30 pages
TB CH 02
No ratings yet
TB CH 02
26 pages
IG219 Statistics Week 4
No ratings yet
IG219 Statistics Week 4
5 pages
Oracle Statistics
No ratings yet
Oracle Statistics
26 pages
Selfstudys Com File
No ratings yet
Selfstudys Com File
4 pages
Statistics For Management I - Best
No ratings yet
Statistics For Management I - Best
127 pages
Digital Image Processing, 4th Editio: Chapter 1 Introduction
No ratings yet
Digital Image Processing, 4th Editio: Chapter 1 Introduction
13 pages
PETREL 3 Volumetrics Uncertainty
50% (2)
PETREL 3 Volumetrics Uncertainty
15 pages
NP000418 CT127 3 2 Pfda
No ratings yet
NP000418 CT127 3 2 Pfda
28 pages
BC-7000 Communication Protocol V1.4
No ratings yet
BC-7000 Communication Protocol V1.4
15 pages
HP TUNERS VCM Editor
100% (1)
HP TUNERS VCM Editor
2 pages
احصاء شابتر 1
No ratings yet
احصاء شابتر 1
64 pages
Edited and Compiled By:: Dr. Chandrashekhar V. Joshi
No ratings yet
Edited and Compiled By:: Dr. Chandrashekhar V. Joshi
81 pages
MMW-Data Management Part 2 Activity
No ratings yet
MMW-Data Management Part 2 Activity
8 pages
Ploting With Pyplot Data Visualization Worksheets1!5!279202323752
No ratings yet
Ploting With Pyplot Data Visualization Worksheets1!5!279202323752
10 pages
Elements of Biostatistics
No ratings yet
Elements of Biostatistics
12 pages
2 Using JASP and Histograms
No ratings yet
2 Using JASP and Histograms
10 pages
Lab 1 Me 303
No ratings yet
Lab 1 Me 303
12 pages
Introduction ToThe Theory of Error - Yardley Beer
100% (3)
Introduction ToThe Theory of Error - Yardley Beer
84 pages
Chapter-2 (Business Statistics-1 - BA-1315)
No ratings yet
Chapter-2 (Business Statistics-1 - BA-1315)
18 pages
Graph2: Residual Plots For Y5
No ratings yet
Graph2: Residual Plots For Y5
20 pages
A Mobile App Ordering System2
No ratings yet
A Mobile App Ordering System2
18 pages
CHAPTER 1

o All students at Leeds University

o All new cars sold by Ford UK this year

o All oranges in a crate

Sample – refers to a subset of the population, usually chosen to be representative of the

o All students registered on this module (is this representative?)

Variable – refers to the quantity being measured. Examples:

o Age of a student; colour of eyes

o Size of car engine

o Ripeness of the fruit

Observation or data item – refers to the result of a measurement. Examples:

o Size of car engine is “2200ccm”

o The orange is “half-ripe”

Data can be given in different forms:

o “ripe/half-ripe/not ripe/rotten”

o Number of red cars sold by Ford in the UK

o Number of rotten oranges in a crate

o The time taken for Ford to produce a car in the UK

o Size of a car engine

(i) Height of a person to the nearest centimetre

(ii) Height of a person, classed as short/medium/tall

(iii) Height of a person

(iv) Annual number of items sold of a product

(v) Soft-drink size, classed as small, medium or large

(vi) Earnings per share

(vii) Method of payment (cash, cheque or credit card)

(viii) Time to pay

(ix) Time to pay in days

(x) Ripeness of oranges

Statistics – is the science of data collection and data analysis.

o Inferential statistics: An auditor needs to determine whether the transactions on a

Table 1.1: Stem-and-leaf diagram. Key: Stem|leaf = 9|9 means 99.

Representing the above in a stem-and-leaf diagram we have

Table 1.2: Stem-and-leaf diagram. Key: Stem|leaf = 4|3 means £43.
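As an illustration of how such a diagram is put together, here is a minimal Python sketch. It assumes whole-number observations, with the tens as the stem and the units as the leaf, to match the key above. The function name and the spending figures are hypothetical and are not the data from the notes.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf diagram (stem = tens, leaf = units).

    Assumes non-negative whole numbers.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)

    rows = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(f"{stem} | {leaf_text}".rstrip())
    return rows


# Hypothetical weekly spending in pounds.
spending = [43, 27, 35, 48, 21, 43, 50]
for row in stem_and_leaf(spending):
    print(row)
```

Sorting the leaves within each stem makes the median and quartiles easy to read off by counting along the rows.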

Before we can construct the histogram we need to produce a frequency table.

Spending per week (£) Tally Frequency (number of observations)

Table 1.3: Frequency table for food shopping data.

Here is the histogram:

Looking at a histogram, we can make a number of observations. For example:

Spending per week (£) Class width Frequency Frequency density

Table 1.4: Frequency table for food shopping data.
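The frequency density column can be sketched in Python as frequency divided by class width, which is the usual definition when class widths are unequal. The class boundaries and frequencies below are hypothetical, not the values from the table.

```python
def frequency_densities(classes):
    """Return frequency / class width for each (lower, upper, frequency) class."""
    densities = []
    for lower, upper, frequency in classes:
        width = upper - lower
        densities.append(frequency / width)
    return densities


# Hypothetical classes: (lower bound, upper bound, frequency).
classes = [(0, 20, 4), (20, 30, 6), (30, 40, 9), (40, 60, 8), (60, 100, 2)]
for (lower, upper, freq), density in zip(classes, frequency_densities(classes)):
    print(f"{lower}-{upper}: width {upper - lower}, frequency {freq}, density {density:.2f}")
```

Plotting density rather than raw frequency makes the area of each bar proportional to its frequency, so wide classes are not visually exaggerated.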

And the histogram now looks like this:

Time studied Class width Frequency Frequency density

Table 1.5: Frequency table for student study data.

if n is odd, then the median is x((n+1)/2),

Issues read xi 0 1 2 3 4 Total

Table 1.6: Issues read in a survey on Hello magazine.
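For a frequency table of this shape, the mean and median can be sketched in Python as follows. The frequencies are hypothetical, since the table's values are not reproduced here. The even-n case uses the common convention of averaging the two middle observations, which is an assumption about the notes' definition.

```python
def expand(values, freqs):
    """Turn a frequency table into the full sorted list of observations.

    Assumes `values` is given in increasing order.
    """
    data = []
    for x, f in zip(values, freqs):
        data.extend([x] * f)
    return data


def mean_from_table(values, freqs):
    """Mean = sum of (value * frequency) divided by the total frequency."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)


def median(sorted_data):
    """Middle observation if n is odd, else the average of the middle two."""
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 1:
        return sorted_data[mid]  # this is x((n+1)/2) in 1-based notation
    return (sorted_data[mid - 1] + sorted_data[mid]) / 2


issues = [0, 1, 2, 3, 4]
freqs = [10, 6, 5, 3, 1]  # hypothetical frequencies, total 25
print(mean_from_table(issues, freqs))  # 1.16
print(median(expand(issues, freqs)))   # 1
```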

And here is a written explanation:

Table 1.7: Speed when checked by police radar system.

The mean has more than doubled.

range = (largest observation)−(smallest observation) = x(n) − x(1) .
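A short Python sketch of this formula, using hypothetical speeds rather than data from the notes:

```python
def data_range(observations):
    """Range = largest observation minus smallest observation, x(n) - x(1)."""
    ordered = sorted(observations)
    return ordered[-1] - ordered[0]


speeds = [31, 28, 35, 30, 42]  # hypothetical speeds
print(data_range(speeds))  # 14
```

Because it depends only on the two extreme observations, the range changes sharply if a single unusually large or small value is added.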

The Interquartile Range.

The Variance and Standard Deviation.

with a mean of 6.7.

The sample standard deviation is defined as

Number read xi 0 1 2 3 4 Total

Table 1.8: Issues read in a survey on Hello magazine.
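The variance and standard deviation for a frequency table like this one can be sketched in Python. The sketch assumes the usual sample definition with divisor n − 1; if the notes use a different divisor, change that one line. The frequencies are hypothetical.

```python
import math


def sample_variance_from_table(values, freqs):
    """Sample variance from a frequency table, assuming the n - 1 divisor."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    squared_deviations = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return squared_deviations / (n - 1)


def sample_sd_from_table(values, freqs):
    """Sample standard deviation = square root of the sample variance."""
    return math.sqrt(sample_variance_from_table(values, freqs))


issues = [0, 1, 2, 3, 4]
freqs = [10, 6, 5, 3, 1]  # hypothetical frequencies, total 25
print(round(sample_variance_from_table(issues, freqs), 4))
print(round(sample_sd_from_table(issues, freqs), 4))
```

Weighting each squared deviation by its frequency gives the same result as expanding the table into individual observations, with less arithmetic.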

Table 1.9: Speed when checked by police radar system.

Number of brothers/sisters, xi 0 1 2 3 4 5 Total

Table 1.10: Data on numbers of brothers and sisters.

Skewness and Outliers.

Figure 1.3: A box plot.

Table 1.11: Blood pressures of smokers and non-smokers.

The two box plots are as follows.

Skewness for the smokers:

Skewness for the non-smokers:

Outliers for the smokers: (for the data from above)

Hence there are no outliers in the smokers’ data.

Outliers for the non-smokers:

This video explains the concept of an outlier using box plots:

(i) Calculate the Quartile Coefficient of Skewness.

(ii) Identify any outliers.
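Both parts can be sketched in Python once the quartiles are known. The sketch assumes the common definitions: the quartile coefficient of skewness is (Q3 − 2·Q2 + Q1)/(Q3 − Q1), and an outlier is any observation more than 1.5 × IQR below Q1 or above Q3. Check these against the definitions used in the notes. The quartiles and readings are hypothetical.

```python
def quartile_skewness(q1, q2, q3):
    """Quartile coefficient of skewness: (Q3 - 2*Q2 + Q1) / (Q3 - Q1)."""
    return (q3 - 2 * q2 + q1) / (q3 - q1)


def find_outliers(observations, q1, q3, k=1.5):
    """Return observations outside [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 assumed)."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in observations if x < lower_fence or x > upper_fence]


# Hypothetical quartiles and readings.
q1, q2, q3 = 120, 130, 150
readings = [100, 118, 125, 130, 141, 150, 200]

print(round(quartile_skewness(q1, q2, q3), 3))  # 0.333, positive skew
print(find_outliers(readings, q1, q3))          # [200]
```

The quartiles are passed in rather than computed because textbooks and software use slightly different rules for locating them.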
