Topic 3
• Introduction
• Constructing a Frequency Distribution
• Measures of Central Tendency
• Measures of Variability
distribution
• To calculate mean, median and mode for population and sample
• To calculate range, variance and standard deviation for population and sample
To describe situations, draw conclusions, or make inferences about events, one must organise the data in some
meaningful way. The most convenient method of organising data is to construct a frequency distribution. After
organising the data, the researcher must present them so they can be understood by those who will benefit from
reading the study. The most useful method of presenting the data is by constructing statistical charts and graphs.
There are many different types of charts and graphs, and each one has a specific purpose. This lesson presents the
statistical methods that can be used to summarise data: finding the mean (average), median, mode, range,
variance and standard deviation.
Introduction
Statistics
Statistics is the mathematical science that deals with the collection, analysis, and presentation of data, which
can then be used as a basis for inference and induction.
Data
Values assigned to observations or measurements
Information
Data that are transformed into useful facts that can be used for a specific purpose, such as making a decision
Branches of Statistics
1. Descriptive statistics
2. Inferential statistics
2. Sample
Inferential Statistics
A frequency distribution shows the number of data observations that fall into specific intervals.
• Graphically summarise information not readily observable by merely looking at data in a table.
• A class is a category (row) in a frequency distribution.
Continuous data are values that can take on any real numbers, including numbers that contain decimal points.
• Usually measured rather than counted.
• Examples are weight, time, and distance.
Relative frequency distributions display the proportion of observations of each class relative to the total number
of observations.
• Shows the fraction of observations in each class.
• Found by dividing each frequency by the total number of observations.
• The fractions in a relative frequency distribution add up to 1.00.
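The steps above can be sketched in a few lines of Python. This is only an illustration: the class frequencies (6, 12, 10, 4) are borrowed from the pages-viewed example in this topic, and the variable names are made up for the sketch.

```python
# Class frequencies borrowed from the pages-viewed example in this topic.
frequencies = [6, 12, 10, 4]

# Total number of observations across all classes.
total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

for f, r in zip(frequencies, relative):
    print(f"frequency {f:>2}  relative frequency {r:.4f}")

# The relative frequencies should add up to 1.00.
print("sum of relative frequencies:", sum(relative))
```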
Example:
A cumulative relative frequency distribution totals the proportion of observations that are less than or equal to
the class at which you are looking.
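As a rough sketch, the cumulative relative frequencies can be obtained by keeping a running total of the class frequencies and dividing by the number of observations. The frequencies are again borrowed from the pages-viewed example in this topic; the variable names are illustrative only.

```python
from itertools import accumulate

# Class frequencies borrowed from the pages-viewed example in this topic.
frequencies = [6, 12, 10, 4]
total = sum(frequencies)

# Running totals of the frequencies (6, 18, 28, 32), each divided by the total.
cumulative_relative = [c / total for c in accumulate(frequencies)]

for c in cumulative_relative:
    print(f"{c:.4f}")
```

The last value is always 1.00, because every observation is less than or equal to the upper end of the final class.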
A histogram is a graph showing the number of observations in each class of a frequency distribution.
Ideally, the number of classes in a frequency distribution should be between 4 and 20.
• Some data sets, particularly those with continuous data, require several values to be grouped together
in a single class.
• This grouping prevents having too many classes in the frequency distribution, which can make it difficult
to detect patterns.
Number of Classes
One method to determine the number of classes in a frequency distribution is the rule
2^k ≥ n
where k = Number of classes
n = Number of data points
Suppose n = 50
2^5 = 32 < 50 (k = 5 is too small.)
2^6 = 64 > 50 (k = 6 is a good choice.)
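The rule can be checked with a short loop that looks for the smallest k whose power of two reaches n. The function name num_classes is a hypothetical helper introduced for this sketch.

```python
def num_classes(n):
    """Return the smallest k such that 2**k >= n (n = number of data points)."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k


# With n = 50: 2**5 = 32 is too small, 2**6 = 64 is large enough.
print(num_classes(50))
```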
Class Width
• Estimate the class width as (maximum data value − minimum data value) / number of classes.
• Round this estimation to a useful whole number that makes the frequency distribution more readable.
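A minimal sketch of the width calculation follows. It assumes the commonly used estimate (maximum value minus minimum value, divided by the number of classes) and rounds up so that the classes cover the whole range; both the helper name and the call-time data in minutes are hypothetical.

```python
import math


def estimate_class_width(data, k):
    """Estimated class width: (max - min) / number of classes (assumed rule)."""
    return (max(data) - min(data)) / k


# Hypothetical observations, in minutes.
minutes = [3.0, 3.2, 4.5, 5.9, 6.0, 7.1, 8.8, 9.0, 10.4, 11.9]

estimate = estimate_class_width(minutes, 3)

# Rounding up to a whole number keeps the class boundaries easy to read.
width = math.ceil(estimate)
print(round(estimate, 2), width)
```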
Class Boundaries
Class boundaries represent the minimum and maximum values for each class.
• Choose class boundaries that are easy to read.
☺ Easy to read               ☹ Hard to read
3 to less than 6 minutes     3.21 to less than 6.21 minutes
6 to less than 9 minutes     6.21 to less than 9.21 minutes
9 to less than 12 minutes    9.21 to less than 12.21 minutes
Class Frequencies
Find class frequencies by counting and recording the number of observations in each class.
• This is easier when the data are sorted.
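Counting can be sketched as follows, using the easy-to-read classes shown earlier (3 to less than 6, 6 to less than 9, 9 to less than 12 minutes). The observations and the helper name class_frequencies are hypothetical, and the sketch assumes equal-width classes.

```python
def class_frequencies(data, lower, width, k):
    """Count observations in k equal-width classes starting at `lower`.

    Each class runs from its lower boundary up to, but not including,
    its upper boundary ("3 to less than 6").
    """
    counts = [0] * k
    for x in sorted(data):
        index = int((x - lower) // width)
        if 0 <= index < k:  # values outside the classes are ignored
            counts[index] += 1
    return counts


# Hypothetical observations, in minutes.
minutes = [3.0, 3.2, 4.5, 5.9, 6.0, 7.1, 8.8, 9.0, 10.4, 11.9]

counts = class_frequencies(minutes, lower=3, width=3, k=3)
for i, c in enumerate(counts):
    print(f"{3 + 3 * i} to less than {6 + 3 * i} minutes: {c}")
```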
Example:
The Ogive
The ogive is a line graph that plots the cumulative relative frequency distribution.
It provides a simple representation of the frequencies that are less than or equal to a certain number.
Frequency distributions help display qualitative data by indicating the number of occurrences of various
categories.
Pareto Charts
Pareto charts are bar charts that show the frequency of the categories that cause quality control problems.
They show the quality problem categories in decreasing order of frequency:
• The most problematic categories are shown first
Pareto charts also plot the cumulative relative frequency as a line on the chart, known as an ogive.
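The numbers behind a Pareto chart can be prepared as in the sketch below: sort the categories by frequency, then accumulate the relative frequencies for the ogive line. The defect categories and counts are invented purely for illustration.

```python
from itertools import accumulate

# Hypothetical defect counts for a quality control problem.
defects = {"Scratches": 12, "Dents": 30, "Misalignment": 5, "Discolouration": 3}

# Bars: categories in decreasing order of frequency.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Ogive line: cumulative relative frequency in the same order.
total = sum(defects.values())
cumulative = list(accumulate(count / total for _, count in ordered))

for (name, count), cum in zip(ordered, cumulative):
    print(f"{name:<15} {count:>3}  cumulative {cum:.2f}")
```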
7|8 8 9 9 9
8|0 0 0 0 1 1 2 3 3 4 4 4 5 6 7 8
9|0 2 5
7(5) | 8 8 9 9 9
8(0) | 0 0 0 0 1 1 2 3 3 4 4 4
8(5) | 5 6 7 8
9(0) | 0 2
9(5) | 5
• The stem labeled 7(5) stores all the scores between 75 and 79.
• The stem 8(0) stores all the scores between 80 and 84.
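Both displays above can be rebuilt from the raw scores, as this sketch shows. The 24 scores are read back from the leaves of the display; the two function names are hypothetical helpers.

```python
from collections import defaultdict

# Scores read back from the stem and leaf display above.
scores = [78, 78, 79, 79, 79, 80, 80, 80, 80, 81, 81, 82, 83, 83,
          84, 84, 84, 85, 86, 87, 88, 90, 92, 95]


def stem_and_leaf(values):
    """Group the units digits (leaves) under their tens digit (stem)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))


def split_stem_and_leaf(values):
    """Split each stem in two: leaves 0 to 4 under (stem, 0), 5 to 9 under (stem, 5)."""
    stems = defaultdict(list)
    for v in sorted(values):
        half = 0 if v % 10 < 5 else 5
        stems[(v // 10, half)].append(v % 10)
    return dict(sorted(stems.items()))


for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem}|{' '.join(map(str, leaves))}")

for (stem, half), leaves in split_stem_and_leaf(scores).items():
    print(f"{stem}({half}) | {' '.join(map(str, leaves))}")
```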
Disadvantages:
• With only a summary value you lose information about the original data.
The Median
The median is the value in the data set for which half the observations are higher and half the observations are
lower.
• First arrange the data in ascending order.
With n = 7 observations, the median value is in the fourth position of our sorted data:
21 27 27 28 34 45 50
The median is therefore 28.
When there is an odd number of data values, the median is always the middle value in the data set.
When there is an even number of data values, the median is halfway between the two middle values.
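The odd and even cases can be sketched as one small function. The seven values are the sorted data shown above; the six-value list is simply the same data with the last value dropped, used here only to illustrate the even case.

```python
def median(values):
    """Middle value of the sorted data, or halfway between the two middle values."""
    ordered = sorted(values)  # first arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values


odd_sample = [21, 27, 27, 28, 34, 45, 50]   # n = 7, from the text
even_sample = [21, 27, 27, 28, 34, 45]      # n = 6, illustrative only

print(median(odd_sample))   # fourth position of the sorted data
print(median(even_sample))  # halfway between 27 and 28
```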
Example with a sample of size n = 6:
The Mode
The mode is the value that appears most often in a data set.
• If no data value or category appears more than once, then we say that the mode does not exist.
• More than one mode can exist if two or more values tie for the most frequent.
0,0,0,0,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,4,5
• The car that appears most often is Toyota (which occurs 7 times), so the mode is the Toyota model.
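A small sketch of the mode rules, applied to the list of numbers above. The helper name modes is hypothetical; it returns every value tied for most frequent, and an empty list when nothing appears more than once.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest count, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value appears more than once: the mode does not exist
    return [value for value, count in counts.items() if count == top]


data = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5]

print(modes(data))             # the value 2 appears 8 times
print(modes([1, 1, 2, 2, 3]))  # two values tie, so there are two modes
print(modes([1, 2, 3]))        # no repeats, so no mode
```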
Example:
Prices for 5 homes have been collected
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000
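For comparison, the three measures of central tendency for these five house prices can be computed with Python's standard statistics module. This is only a quick check of the arithmetic, not part of the original example.

```python
import statistics

# The five house prices listed above.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # sum / n = 3,000,000 / 5
median_price = statistics.median(prices)  # middle value of the sorted prices
mode_price = statistics.mode(prices)      # most frequent price

print(f"mean   = {mean_price:,}")
print(f"median = {median_price:,}")
print(f"mode   = {mode_price:,}")
```

The single $2,000,000 home pulls the mean well above the median, which is why the choice of measure matters when outliers are present.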
The mean is generally used as it is relatively easy to determine and most widely understood by people with little
statistical training.
If outliers are present, the median is often used, since the median is not sensitive to outliers.
• For example, median home prices may be reported for a region, because a few very expensive homes would pull the mean upward.
Measures of Variability
Advantages:
• Easy to calculate and understand
Disadvantages:
• Only based on two numbers in the data set
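The sketch below computes the range (largest value minus smallest value) next to the variance and standard deviation named in this topic's objectives. The data set is hypothetical, and the population and sample formulas are the ones implemented by Python's statistics module (dividing by N and by n − 1 respectively), assumed here to match the usual textbook definitions.

```python
import statistics

# Hypothetical data set, chosen so the results are easy to check by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: based only on the two extreme values.
data_range = max(data) - min(data)

# Population measures divide the sum of squared deviations by N.
pop_variance = statistics.pvariance(data)
pop_std_dev = statistics.pstdev(data)

# Sample measures divide by n - 1 instead.
sample_variance = statistics.variance(data)
sample_std_dev = statistics.stdev(data)

print("range:", data_range)
print("population variance:", pop_variance, "std dev:", pop_std_dev)
print("sample variance:", round(sample_variance, 3), "std dev:", round(sample_std_dev, 3))
```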
Example:
A smaller coefficient of variation indicates more consistency within a set of data values.
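As an illustration, the coefficient of variation can be computed as the standard deviation divided by the mean, expressed as a percentage. The sketch assumes sample data (so it uses the sample standard deviation); the two data sets and the helper name are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Two hypothetical samples with the same mean (11) but different spread.
steady = [10, 12, 11, 13, 9]
erratic = [5, 20, 11, 2, 17]

print(f"steady:  {coefficient_of_variation(steady):.1f}%")
print(f"erratic: {coefficient_of_variation(erratic):.1f}%")
```

The first sample has the smaller coefficient of variation, so its values are the more consistent of the two.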
Example:
Number of pages    Frequency
1 to under 5       6
5 to under 9       12
9 to under 13      10
13 to under 17     4
The merchant would like to calculate the average number of viewed pages.
Number of pages    Midpoint (mi)    Frequency
1 to under 5       3                6
5 to under 9       7                12
9 to under 13      11               10
13 to under 17     15               4
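A sketch of the grouped-data mean for this example: each class is represented by its midpoint, and the midpoints are weighted by the class frequencies. The four classes and frequencies are those given in the example; the variable names are illustrative.

```python
# (lower boundary, upper boundary), frequency for each class in the example.
classes = [((1, 5), 6), ((5, 9), 12), ((9, 13), 10), ((13, 17), 4)]

# Midpoint of each class: halfway between its boundaries (3, 7, 11, 15).
midpoints = [(low + high) / 2 for (low, high), _ in classes]

# Total number of observations.
total_frequency = sum(f for _, f in classes)

# Grouped mean = sum(frequency * midpoint) / sum(frequency).
grouped_mean = sum(m * f for m, (_, f) in zip(midpoints, classes)) / total_frequency

print("midpoints:", midpoints)
print("approximate mean number of pages viewed:", grouped_mean)
```

Because individual observations are replaced by class midpoints, the result is an approximation of the true mean.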