____________________________ _____________________________
Statistics
Types of Data:
Discrete Data – Ungrouped Data
• Data that can be counted
• Only whole numbers are used
Name an example of discrete data:
Continuous Data – Grouped Data
• Quantitative data that can be measured
• Given in intervals
Name an example of continuous data:
Measures of Central Tendency
Tells you about the centre of a data set: a single value that is typical of the data.
Ungrouped Data
Mean: $\bar{x} = \frac{\sum x}{n}$
Mean = (sum of all values) ÷ (total number of values)
Mode: The value that appears the most
Bimodal: a data set with two modes
Trimodal: a data set with three modes
Median: The middle number in an ordered data set
Position of median = $\frac{1}{2}(n + 1)$
If n is odd, the median will be part of the data set
If n is even, the median will be the average of the two middle numbers
Grouped Data
Estimated Mean: $\bar{x} = \frac{\sum f \cdot x_i}{n}$
Mean = (sum of all frequency × midpoint of interval) ÷ (total frequency)
Mode: The interval with the highest frequency
Median: The interval that contains the middle number of a data set
Position of median = $\frac{1}{2}(n)$
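As a rough illustration of the ungrouped-data formulas above, the Python sketch below works out the mean, the position of the median, the median and the mode. The data values are made up for illustration and do not come from this worksheet.

```python
from statistics import median, multimode

# Illustrative (made-up) ungrouped data, sorted into ascending order.
values = sorted([7, 3, 10, 3, 8, 2, 5])
n = len(values)

# Mean = sum of all values / total number of values.
data_mean = sum(values) / n

# Position of the median = 1/2 (n + 1). Here n is odd, so the median
# is a value that is part of the data set.
median_position = (n + 1) / 2
data_median = median(values)

# multimode returns every value that appears the most, so a bimodal
# data set would give a list with two entries.
modes = multimode(values)

print(f"mean = {data_mean:.2f}")
print(f"median is value number {median_position:.0f}: {data_median}")
print(f"mode(s) = {modes}")
```

Changing one value so that two numbers tie for the highest count makes `modes` hold two entries, which matches the bimodal case.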
1
How do you know which measure of central tendency is best?
Mean:
• Best for symmetrical, normally distributed data where all values are important
• Highly sensitive to outliers and skewed data
Give an example of when the mean would not be an accurate measure of central
tendency:
Median:
• Best for skewed distributions or datasets with outliers (less affected by extreme
values)
Give an example of when the median would be a more accurate measure of central
tendency than the mean:
Mode:
• Best for nominal or categorical data where the values can’t be ordered
Example:
The mean, median and range of the Mathematics IEB Final marks for a certain Matric
group are given as (in percentage):
Mean = 67 Median = 58 Range = 85
The DBE decides to add 3 marks to each student's final mark, excluding learners
who received 100%.
Find the new mean, median and range for the new set of marks.
2
Measures of Position
Quartiles:
Divide the data into four quarters (pieces of 25% each)
𝑄1 = First or Lower Quartile
𝑄2 = Second Quartile (also the median)
𝑄3 = Third or Upper Quartile
To calculate the position of a quartile:
Ungrouped Data
$Q_1 = \frac{1}{4}(n + 1)$
$Q_2 = \frac{1}{2}(n + 1)$
$Q_3 = \frac{3}{4}(n + 1)$
Grouped Data
$Q_1 = \frac{1}{4}(n)$
$Q_2 = \frac{1}{2}(n)$
$Q_3 = \frac{3}{4}(n)$
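A minimal sketch of the quartile position formulas, assuming a made-up ungrouped data set of seven values (chosen so that every position is a whole number). The function name `quartile_positions` is only for illustration.

```python
# Illustrative (made-up) ungrouped data in ascending order.
values = [2, 3, 3, 5, 7, 8, 10]
n = len(values)


def quartile_positions(n, grouped=False):
    """Return the positions of Q1, Q2 and Q3.

    Ungrouped data uses (n + 1); grouped data uses n.
    """
    base = n if grouped else n + 1
    return (base / 4, base / 2, 3 * base / 4)


q1_pos, q2_pos, q3_pos = quartile_positions(n)

# With n = 7 every position is a whole number, so each quartile can be
# read straight from the list (positions count from 1, indexes from 0).
q1 = values[int(q1_pos) - 1]
q2 = values[int(q2_pos) - 1]
q3 = values[int(q3_pos) - 1]

print("positions:", q1_pos, q2_pos, q3_pos)
print("quartiles:", q1, q2, q3)
```

The sketch deliberately leaves out the case where a position is not a whole number, because the notes above do not state how to handle it.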
Percentiles:
Indicates which percentage of the data is below the specific percentile
𝑄1 = 25th percentile 𝑄2 = 50th percentile 𝑄3 = 75th percentile
To calculate the position of a percentile: $i = \frac{p}{100}(n)$
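The percentile position formula can be sketched in a few lines of Python. The value n = 40 is a hypothetical data set size, and `percentile_position` is an illustrative name.

```python
def percentile_position(p, n):
    """Position of the p-th percentile among n values: i = (p / 100) * n."""
    return p * n / 100


n = 40  # hypothetical number of data values

for p in (25, 50, 75, 90):
    print(f"percentile {p}: position {percentile_position(p, n)}")
```

For p = 25, 50 and 75 the positions agree with the grouped-data quartile positions, which fits the statement that the quartiles are the 25th, 50th and 75th percentiles.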
Measures of Dispersion
The amount of variability in a data set
Range = Max – Min
Note! Heavily influenced by outliers
IQR = $Q_3 - Q_1$
Note! Measures the spread of the middle 50% of the data set
Semi-IQR = $\frac{1}{2}(Q_3 - Q_1)$
Note! Good measure of dispersion for skewed data
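The three measures above can be compared with a short Python sketch. The summary values are made up, and the second call only changes the maximum to imitate an outlier.

```python
def dispersion(minimum, q1, q3, maximum):
    """Return (range, IQR, semi-IQR) from four summary values."""
    data_range = maximum - minimum
    iqr = q3 - q1
    semi_iqr = iqr / 2
    return data_range, iqr, semi_iqr


# Made-up summary values without an outlier.
without_outlier = dispersion(minimum=2, q1=3, q3=8, maximum=10)

# Same quartiles, but the maximum is now an extreme value.
with_outlier = dispersion(minimum=2, q1=3, q3=8, maximum=50)

print("range, IQR, semi-IQR:", without_outlier)
print("range, IQR, semi-IQR:", with_outlier)
```

Only the range changes between the two calls, which is why the range is described as heavily influenced by outliers while the IQR and semi-IQR are not.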
3
Five Number Summary:
Minimum; Lower Quartile (𝑄1); Median; Upper Quartile (𝑄3); Maximum
Measures of Dispersion around the Mean:
Variance ($\sigma^2$)
Measures the variability from the mean
Standard Deviation ($\sigma$)
How far the data values differ from the mean.
‘the standard to deviate from the mean’
$\sigma = \sqrt{\frac{\sum f(x - \bar{x})^2}{n}}$
The bigger the standard deviation, the more the data is spread out.
The smaller the standard deviation, the more the data is clustered around the mean,
and the less spread out it is.
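The standard deviation formula can be checked step by step in Python. This is a sketch on a made-up frequency table (value: frequency), and the result is compared with `statistics.pstdev`, which also divides by n.

```python
from math import sqrt
from statistics import pstdev

# Made-up frequency table: each value x mapped to its frequency f.
freq = {2: 1, 4: 3, 5: 2, 7: 1, 9: 1}

n = sum(freq.values())                          # total frequency
mean = sum(f * x for x, f in freq.items()) / n  # x-bar

# Variance = sum of f * (x - mean)^2, divided by n.
variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / n

# Standard deviation = square root of the variance.
std_dev = sqrt(variance)

# Cross-check: expand the table into a plain list and use pstdev.
expanded = [x for x, f in freq.items() for _ in range(f)]

print("n =", n, "mean =", mean)
print("variance =", variance, "standard deviation =", std_dev)
print("pstdev check =", pstdev(expanded))
```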
4
Calculator Steps:
Data Set: 12 5 7 2 8 9 14 6 13 9
Go into STATS mode: Mode – 3 (STAT) – 1 (1-VAR)
Enter the data.
AC – SHIFT – 1 (goes to the menu)
DO NOT PRESS ON OR GO OUT OF STATS MODE!!!!!!!
Mean: 4 (VAR) – 2 (𝑥̅) – =
Standard Deviation: 4 (VAR) – 3 (σx) – =
Minimum: 6 (MinMax) – 1 (minX) – =
Maximum: 6 (MinMax) – 2 (maxX) – =
Lower Quartile: 6 (MinMax) – 3 (𝑄1) – =
Median: 6 (MinMax) – 4 (med) – =
Upper Quartile: 6 (MinMax) – 5 (𝑄3) – =
1. Draw a box and whisker plot to represent the data:
Don’t forget the scale!
2. Calculate the values that will be within one standard deviation of the mean.
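As a cross-check for the calculator work, the same data set can be run through a short Python sketch. Two assumptions are made here: the standard deviation divides by n (population standard deviation), and the quartiles are taken as the medians of the lower and upper halves of the ordered data. Calculators and software can use other quartile conventions, so small differences are possible.

```python
from statistics import mean, median, pstdev

# Data set from the calculator steps above.
data = [12, 5, 7, 2, 8, 9, 14, 6, 13, 9]
ordered = sorted(data)
n = len(ordered)

data_mean = mean(data)
std_dev = pstdev(data)  # divides by n, like the formula in these notes
minimum, maximum = ordered[0], ordered[-1]
data_median = median(ordered)

# Assumption: quartiles are the medians of the lower and upper halves.
lower_half = ordered[: n // 2]
upper_half = ordered[(n + 1) // 2:]
q1 = median(lower_half)
q3 = median(upper_half)

# Values within one standard deviation of the mean.
lower_bound = data_mean - std_dev
upper_bound = data_mean + std_dev
within_one_sd = [x for x in ordered if lower_bound <= x <= upper_bound]

print("mean:", data_mean, "standard deviation:", round(std_dev, 2))
print("five-number summary:", minimum, q1, data_median, q3, maximum)
print("interval:", round(lower_bound, 2), "to", round(upper_bound, 2))
print("values inside:", within_one_sd)
```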
5
For Grouped Data:
Turn the Frequency ON: SHIFT – MODE - ↓ - 4(STAT) – 1(ON)
Calculate the estimated mean:
Intervals   | Frequency | Cumulative Frequency | Midpoint × Frequency
10 < x ≤ 20 | 5         |                      |
20 < x ≤ 30 | 8         |                      |
30 < x ≤ 40 | 9         |                      |
40 < x ≤ 50 | 3         |                      |
Enter the MIDPOINT of the interval as the 𝑥-value and the FREQUENCY as the FREQ value.
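The estimated mean for the table above can be reproduced with a short Python sketch that follows the same midpoint × frequency method. The variable names are illustrative.

```python
# (lower bound, upper bound, frequency) for each interval in the table.
intervals = [(10, 20, 5), (20, 30, 8), (30, 40, 9), (40, 50, 3)]

# Total frequency n.
total_frequency = sum(f for _, _, f in intervals)

# Midpoint of each interval multiplied by its frequency.
midpoint_times_frequency = [
    ((low + high) / 2) * f for low, high, f in intervals
]

# Estimated mean = sum of (midpoint x frequency) / total frequency.
estimated_mean = sum(midpoint_times_frequency) / total_frequency

print("total frequency:", total_frequency)
print("midpoint x frequency:", midpoint_times_frequency)
print("estimated mean:", estimated_mean)
```

The result is an estimate because every value in an interval is treated as if it were equal to the midpoint.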
6
Mixed Example:
The time, in minutes, that it took for a football team to score their first goal in seven
games is recorded. The times are listed in ascending order as 𝑎; 𝑏; 𝑐; 𝑑; 𝑒; 𝑓; 𝑔.
The following observations were made about the data:
• All the goals were scored at different times
• The minimum time for the first goal was five minutes.
• The range of the times was 48 minutes.
• The median time was 22 minutes.
• The difference between the time in 𝑄1 and the minimum time was 7 minutes.
• The IQR of the times was 28 minutes.
• The mean time was 27 minutes.
• 𝑒 = 2𝑐
Find the values of 𝑎 to 𝑔.
7
Symmetrical and Skewed Data:
Skewed Left (negative skew):
Mean < Median
The left 50% is more spread out
Skewed Right (positive skew):
Mean > Median
The right 50% is more spread out
Symmetrical Data:
Mean = Median
The data is symmetrical about
the median.
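The mean-versus-median comparisons above can be turned into a small Python sketch. It applies only this rule of thumb and nothing more, and the three data sets are made up for illustration.

```python
from math import isclose
from statistics import mean, median


def skew_direction(values):
    """Classify a data set by comparing its mean with its median."""
    data_mean = mean(values)
    data_median = median(values)
    if isclose(data_mean, data_median):
        return "symmetrical"
    if data_mean < data_median:
        return "skewed left (negative)"
    return "skewed right (positive)"


# Made-up data sets for illustration.
right_tail = [1, 2, 2, 3, 3, 4, 20]       # one large value pulls the mean up
left_tail = [1, 17, 18, 18, 19, 19, 20]   # one small value pulls the mean down
balanced = [1, 2, 3, 4, 5]

for name, values in [("right_tail", right_tail),
                     ("left_tail", left_tail),
                     ("balanced", balanced)]:
    print(name, "->", skew_direction(values))
```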
Comment on the skewness of the
box and whisker plot on pg. 5
8
Ogives (Cumulative Frequency Graphs):
A survey is conducted to find out the ages of people in a particular restaurant. The data
is represented in the ogive.
The (𝑥; 𝑦) coordinate is:
𝑥: the upper bound of the interval
𝑦: the cumulative frequency for that interval
1. How many people took part in the survey?
2. What was the median age?
3. Calculate the IQR.
4. What is the modal age group?
5. How old would someone be if it was said that there are 50 people the same age
as them?
9
6. Complete the table.
Class Interval | Frequency | Cumulative Frequency
7. Calculate the estimated mean.
8. If everyone under the age of 30 was taken out of the survey results, how would
this affect the mean and the standard deviation?
10
Sketching Ogives:
Intervals   | Frequency | Cumulative Frequency | Coordinate
10 < x ≤ 20 | 5         |                      |
20 < x ≤ 30 | 8         |                      |
30 < x ≤ 40 | 9         |                      |
40 < x ≤ 50 | 3         |                      |
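The ogive coordinates for this table can be generated with a short Python sketch: each point pairs the upper bound of an interval with the cumulative frequency up to that bound. Starting the curve at (10; 0), the lower bound of the first interval, is an assumption based on the usual convention and is not stated in the table.

```python
from itertools import accumulate

# Upper bound and frequency of each interval in the table.
upper_bounds = [20, 30, 40, 50]
frequencies = [5, 8, 9, 3]

# Running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Ogive points: (upper bound; cumulative frequency).
coordinates = list(zip(upper_bounds, cumulative))

# Assumption: the curve starts on the x-axis at the lower bound of the
# first interval, where no data has been counted yet.
points = [(10, 0)] + coordinates

print("cumulative frequency:", cumulative)
print("points to plot:", points)
```

The last cumulative frequency equals the total frequency, which is a quick check that the running total is correct.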
11
Bivariate Data:
The comparison of two data sets
We want to look at how the data correlates with each other – is there a relationship?
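Least squares regression line: 𝑦 = 𝐴 + 𝐵𝑥
Correlation Coefficient: −1 ≤ 𝑟 ≤ 1
• 𝑟 close to −1 or 1: strong correlation
• 𝑟 close to −0.5 or 0.5: moderate correlation
• 𝑟 close to 0: weak correlation
• 𝑟 = 0: no correlation
• 𝑟 < 0: negative correlation; 𝑟 > 0: positive correlation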
Interpolation: Inside the data set. Reliable.
Extrapolation: Outside the data set. Not reliable.
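To show where 𝐴, 𝐵 and 𝑟 come from, the sketch below computes them from first principles for a small made-up data set, using the standard least squares formulas. The `predict` helper is an illustrative name; it also reports whether a prediction is an interpolation (inside the range of the 𝑥-values) or an extrapolation.

```python
from math import sqrt

# Made-up bivariate data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of products of deviations from the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Least squares regression line y = A + Bx.
B = s_xy / s_xx            # gradient
A = mean_y - B * mean_x    # y-intercept; the line passes through the mean point

# Correlation coefficient, always between -1 and 1.
r = s_xy / sqrt(s_xx * s_yy)


def predict(x_value):
    """Return the predicted y and whether x_value lies inside the data."""
    y_hat = A + B * x_value
    is_interpolation = min(x) <= x_value <= max(x)
    return y_hat, is_interpolation


print(f"A = {A:.4f}, B = {B:.4f}, r = {r:.2f}")
print("x = 3:", predict(3))
print("x = 8:", predict(8))
```

A calculator in regression mode should give the same 𝐴, 𝐵 and 𝑟 for the same data, so the sketch can serve as a check on the key sequence.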
12
Example 1:
A shoe shop collects data from a sample of ten regular customers to see the ages of
customers compared to the amount of money they spend annually on shoes (in
hundreds of rands). The results are shown in the table below.
1. Calculate the values of 𝐴 and 𝐵, rounded to four decimal places, in 𝑦 = 𝐴 + 𝐵𝑥,
the equation of the least squares regression line.
2. Sketch the least squares regression line. Clearly indicate the mean point and
another point on the line.
13
3. Determine the correlation coefficient, correct to two decimal places.
4. Comment on the relationship between the age of the customer and the amount
of money they spent on shoes.
5. Predict how much money a 60-year-old customer will spend on shoes, and
comment on the reliability of your prediction.
6. The point (38; 31) was recorded incorrectly and is removed from the data.
Explain how this will affect the gradient of the least squares regression line.
14
Example 2:
The scatter plot below displays the relationships between two sets of data, represented
by 𝑥 and 𝑦. The point 𝑃 is an outlier.
If the point 𝑃 were removed from the data
set, comment on how this would affect:
1. The correlation coefficient.
2. The gradient and 𝑦-intercept of the line of best fit.
15
Example 3:
The table and scatter plot below represent the number of hours a sales consultant
spent with each of nine clients, as well as the value of the sales (in thousands of rands)
to that client.
1. Identify the outlier in the above data set.
2. Describe the trend in the data.
3. Find the equation of the least squares regression line for the data.
16
4. The sales consultant forgot to record the sales of one of his clients. If the sales
consultant spent 80 hours with that client, predict the value of the client’s sales
to the nearest thousand rand.
5. Comment on the strength of the relationship between the time spent with a
client and the value of their sales. Justify your answer.
6. What is the expected increase in sales for each additional hour spent with a
client, to the nearest rand?
17
Interpreting Data:
1. The box-and-whisker plots below represent the marks of four different classes
after writing a Mathematics test.
a. Which class performed the best?
b. Which class performed the worst?
c. Comment on the skewness of each Group.
d. How will the outliers affect Group A’s average? What would happen if they
were removed?
18
e. Did Group D or Group C do better in the test?
2. A teacher calculates the average of a class test. If one student’s high score is
mistakenly entered as lower than it actually is, how does this affect the mean?
3. The mean income of a town is R50,000. Does this mean that most people earn
close to this amount? Why or why not?
4. Two datasets have the same mean but different standard deviations. What does
this tell you about the spread of the data?
19
5. Three Histograms are given below representing three different scenarios.
a. Which histogram could represent the age at which people retire in a
particular country?
b. Which histogram could represent the average monthly income in
Johannesburg?
c. Which histogram could represent the average heights of the Grade 12 boys
at Edify?
d. Comment on the skewness of each of the histograms.
20
6. A teacher finds that the standard deviation of students’ test scores is very high.
What does this imply about the students' performance?
7. If all the numbers in a dataset are the same, what is the standard deviation?
Why?
8. A company measures employees' monthly salaries and finds that the mean is
much higher than the median. What does this suggest about the salary
distribution?
9. A dataset has A numbers with a mean of B and a standard deviation of C. Use
this dataset for questions 10 and 11.
10. If a new number much larger than B is added to the dataset in question 9, what
happens to the mean and standard deviation?
21
11. If a new number equal to B is added to the dataset in question 9, what happens
to the mean and standard deviation?
12. Use the ogive below to answer the following questions.
a. What type of real-world data could this ogive represent?
b. Does this ogive suggest a normal, skewed, or symmetrical distribution? Why?
c. Can the mode be directly identified from an ogive? Why or why not?
22
d. Based on the ogive, which class interval has the highest frequency? How can you
tell?
e. If another ogive were plotted on the same graph and it was consistently above
this one, what would that indicate?
f. What would happen to the ogive if the frequency of the 40–50 class interval were
doubled?
g. If an additional class interval (100–110) were added with a frequency of 5, how
would the ogive change?
23
13. Use the box-and-whisker plots below to answer the following questions.
a. If these datasets represent running times for two different training programs,
which program seems to produce more consistent results?
b. Which dataset has a higher median? Which dataset has a greater range?
What could this suggest about training intensity and effectiveness?
c. If dataset 1 has a lot of outliers on the higher end (slower times), and these
outliers were removed, how would this affect the skewness of the dataset?
24