Topic 3

Uploaded by

eddyyow
Copyright
© All Rights Reserved

Topic 3
Topic: Descriptive Statistics

• Introduction
• Constructing a Frequency Distribution
• Measures of Central Tendency
• Measures of Variability

Objectives:
• To construct a frequency distribution
• To calculate the mean, median and mode for a population and a sample
• To calculate the range, variance and standard deviation for a population and a sample

Why This Topic

To describe situations, draw conclusions, or make inferences about events, one must organise the data in some
meaningful way. The most convenient method of organising data is to construct a frequency distribution. After
organising the data, the researcher must present them so they can be understood by those who will benefit from
reading the study. The most useful method of presenting the data is by constructing statistical charts and graphs.
There are many different types of charts and graphs, and each one has a specific purpose. This lesson shows the
statistical methods that can be used to summarise data: finding the mean, median, mode, range, variance and
standard deviation.

Introduction

Statistics
Statistics is the mathematical science that deals with the collection, analysis, and presentation of data, which
can then be used as a basis for inference and induction.

Data
Values assigned to observations or measurements

Information
Data that are transformed into useful facts that can be used for a specific purpose, such as making a decision

The Two Main Types of Data


Data can be classified into two categories, namely qualitative and quantitative

© UNITAR International University Page 1 of 22


Classifying Data by Level of Measurement

Branches of Statistics

1. Descriptive statistics

• collecting, summarising, and displaying data

2. Inferential statistics

• making claims or conclusions about the data based on a sample

Population and Sample


1. Population

• represents all possible subjects that are of interest in a particular study

2. Sample

• refers to a portion of the population that is representative of the population from which it was selected

Parameter and statistics

• Parameter – a described characteristic of a population


• Statistic – a described characteristic of a sample

Inferential Statistics

Making claims about a population by examining sample results


• Example:

Constructing a Frequency Distribution

A frequency distribution shows the number of data observations that fall into specific intervals.
• Graphically summarise information not readily observable by merely looking at data in a table.
• A class is a category (row) in a frequency distribution.

Example: Number of iPads sold per day


Discrete data are values based on observations that can be counted and are typically represented by whole
numbers.

• Represent something that has been counted.


• Take on whole numbers such as 0, 1, 2, 3.

Continuous data are values that can take on any real numbers, including numbers that contain decimal points.
• Usually measured rather than counted.
• Examples are weight, time, and distance.

Examples of Discrete data


• Number of children per family.
• Number of cars listed per insurance policy.
• Vacation days per month.

Examples of Continuous data


• Time required to read chapter 2.
• Thickness of paint applied to a car body.
• Voltage of batteries produced in August.

Relative frequency distributions display the proportion of observations of each class relative to the total number
of observations.
• Shows the fraction of observations in each class.
• Found by dividing each frequency by the total number of observations.
• The fractions in a relative frequency distribution add up to 1.00.

Example:


Cumulative Relative Frequency Distributions

A cumulative relative frequency distribution totals the proportion of observations that are less than or equal to
the class at which you are looking.

• Shows the accumulated proportion as values vary from low to high
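
The frequency, relative frequency and cumulative relative frequency distributions described above can be sketched in a few lines of Python. This is only an illustration: the `daily_sales` values and the function name `frequency_table` are hypothetical and are not taken from the notes.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values):
    """Return rows of (class, frequency, relative frequency, cumulative relative frequency)."""
    counts = Counter(values)
    n = len(values)
    classes = sorted(counts)
    freqs = [counts[c] for c in classes]
    rel = [f / n for f in freqs]      # each frequency divided by the total
    cum = list(accumulate(rel))       # running total of the proportions
    return list(zip(classes, freqs, rel, cum))


# Hypothetical number of units sold per day (illustrative data only)
daily_sales = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]

rows = frequency_table(daily_sales)
for value, freq, rel, cum in rows:
    print(f"{value:>5} {freq:>9} {rel:>8.2f} {cum:>10.2f}")
```

The relative frequencies add up to 1.00, and the last cumulative relative frequency is therefore also 1.00.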

Using a Histogram to Graph a Frequency Distribution

A histogram is a graph showing the number of observations in each class of a frequency distribution.

The Shape of Histograms


Constructing a Frequency Distribution Using Grouped Quantitative Data

Ideally, the number of classes in a frequency distribution should be between 4 and 20.
• Some data sets, particularly those with continuous data, require several values to be grouped together
in a single class.
• This grouping prevents having too many classes in the frequency distribution, which can make it difficult
to detect patterns.

Number of Classes
One method to determine the number of classes in a frequency distribution is the rule
$2^k \geq n$
where k = Number of classes
n = Number of data points

• Find the lowest value of k that satisfies the rule.

Suppose n = 50
$2^5 = 32 < 50$ (k = 5 is too small.)
$2^6 = 64 > 50$ (k = 6 is a good choice.)
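
The search for the lowest k can be written as a short loop. The function name `number_of_classes` is an assumption made for this sketch.

```python
def number_of_classes(n):
    """Return the smallest k such that 2**k >= n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    k = 0
    while 2 ** k < n:
        k += 1
    return k


k = number_of_classes(50)
print(k)
```

For n = 50 the loop stops at k = 6, matching the worked example.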

Class Width

Once k is known, the width of each class can be found.


• The width is the range of numbers to put into each class.

• Round this estimation to a useful whole number that makes the frequency distribution more readable.
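
The width formula itself is not shown in this copy of the notes. A common estimate, assumed here, is (maximum value - minimum value) / k. The sketch below applies it to hypothetical call lengths. Rounding up to the next whole number is one reasonable choice, not the only one.

```python
import math


def estimated_class_width(values, k):
    """Estimate the class width as (max - min) / k."""
    return (max(values) - min(values)) / k


# Hypothetical call lengths in minutes (illustrative data only)
call_minutes = [3.2, 4.5, 5.1, 6.8, 7.4, 8.0, 9.9, 10.3, 11.7, 14.6]

raw_width = estimated_class_width(call_minutes, k=4)  # (14.6 - 3.2) / 4
width = math.ceil(raw_width)                          # round up to a readable whole number
print(raw_width, width)
```

With these values the estimate is about 2.85, which rounds to a width of 3 minutes.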


There is no one correct answer for the class width.
• The goal is to create a histogram to clearly and usefully show the pattern in the data.
• Often there is more than one acceptable way to accomplish this.

Class Boundaries

Class boundaries represent the minimum and maximum values for each class.
• Choose class boundaries that are easy to read.

☺ Easy to read                 ☹ Hard to read
3 to less than 6 minutes       3.21 to less than 6.21 minutes
6 to less than 9 minutes       6.21 to less than 9.21 minutes
9 to less than 12 minutes      9.21 to less than 12.21 minutes

Class Frequencies

Find class frequencies by counting and recording the number of observations in each class.
• This is easier when the data are sorted.
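
Counting can also be automated. The sketch below assumes equal-width classes of the form "lower to less than upper", so every value falls in exactly one class. The data and the name `class_frequencies` are hypothetical.

```python
def class_frequencies(values, start, width, k):
    """Count observations in k classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < k:
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    return counts


# Hypothetical call lengths in minutes (illustrative data only)
call_minutes = [3.2, 4.5, 5.1, 6.8, 7.4, 8.0, 9.9, 10.3, 11.7, 14.6]

# Classes: 3 to <6, 6 to <9, 9 to <12, 12 to <15
counts = class_frequencies(call_minutes, start=3, width=3, k=4)
print(counts)
```

A value outside the classes raises an error instead of being dropped, so the frequencies always total the number of observations.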

Example:

Rules for Classes for Grouped Data


1. Equal-size classes - all classes in the frequency distribution must be of equal width.
2. Mutually exclusive classes - class boundaries cannot overlap.
3. Include all data values - make sure all data values are accounted for in the total row of the frequency
distribution.
4. Avoid empty classes - it is undesirable for a histogram to display a class so narrow that there are no
observations in it.
5. Avoid open-ended classes (if possible) - these violate the first rule of equal class sizes.

The Consequences of Too Few or Too Many Classes


Wide classes result in few class intervals:
• Can obscure important patterns
• Gives a “blocky” distribution graph
• Summarizes the data too much
• Tells us little about the true distribution shape

Too many narrow classes have consequences:


• Results in a “jagged” histogram
• Some classes may be empty
• Does not summarize the data enough

The Ogive
The ogive is a line graph that plots the cumulative relative frequency distribution.
It provides a simple representation of the frequencies that are less than or equal to a certain number.

Displaying Qualitative Data

Qualitative data are values that are categorical.


• Can be nominal or ordinal measurement level.
• Describe a characteristic, such as gender or level of education.

Frequency distributions help display qualitative data by indicating the number of occurrences of various
categories.


Bar Charts
Bar charts are a good tool for displaying qualitative data that have been organised into categories.

Vertical Bar Chart Horizontal bar chart

Pareto Charts
Pareto charts are bar charts that show the frequency of the categories that cause quality control problems.
Show quality problem categories in decreasing order:
• The most problematic categories are shown first

Pareto charts also plot the cumulative relative frequency as a line on the chart known as an ogive.


Pie Charts
Pie charts are another excellent tool for comparing proportions for categorical data.
Each segment of the pie represents the relative frequency of one category:
• All categories in the data set must be included in the pie.
• Use a pie chart to compare the relative sizes of all possible categories.
• Bar charts are more useful when you want to highlight the actual data values and when the classes
combined don’t form a whole.

Stem and Leaf Display


A stem and leaf display splits the data values into stems (the larger place values) and leaves (the smaller place
values). By listing all of the leaves to the right of each stem, we can graphically describe how the data are
distributed.

• All the original data points are visible on the display.


• Easy to construct by hand.
• Provides a histogram-like view of the distribution.


For this example, use the 10’s digit as the stem
Use the 1’s digit as the leaf

1. Sort the data from lowest to highest.


2. Determine the unique stem values.
7, 8, 9 are the different stem values in this example.
3. List the stems in a vertical column and then add the leaf values to the right of the appropriate stem, in
ascending order.

7|8 8 9 9 9
8|0 0 0 0 1 1 2 3 3 4 4 4 5 6 7 8
9|0 2 5

To get more detail the stems can be split in half

7(5) | 8 8 9 9 9
8(0) | 0 0 0 0 1 1 2 3 3 4 4 4
8(5) | 5 6 7 8
9(0) | 0 2
9(5) | 5

• The stem labeled 7(5) stores all the scores between 75 and 79.
• The stem 8(0) stores all the scores between 80 and 84.
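
The three construction steps can be sketched in Python. The `scores` list is read back from the display above, and the function name `stem_and_leaf` is an assumption. Only the unsplit display is reproduced.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build display lines using the 10's digit as stem and the 1's digit as leaf."""
    stems = defaultdict(list)
    for v in sorted(values):           # step 1: sort the data
        stems[v // 10].append(v % 10)  # steps 2 and 3: group the leaves by stem
    return [
        f"{stem}|{' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


scores = [78, 78, 79, 79, 79, 80, 80, 80, 80, 81, 81, 82,
          83, 83, 84, 84, 84, 85, 86, 87, 88, 90, 92, 95]

lines = stem_and_leaf(scores)
for line in lines:
    print(line)
```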

Measures of Central Tendency


Central tendency is a single value used to describe the center point of a data set.


The Mean
The mean, or average, is the most common measure of central tendency.
• Calculate the mean by adding all the values in a data set and then dividing the result by the number of
observations.

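The formula for the Sample Mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
where n = Number of observations in the sample

The formula for the Population Mean:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
where N = Number of observations in the population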

Example: suppose a sample of size n = 5 gives the following values:


6.2 7.1 4.8 9.0 3.3

The sample mean:
$\bar{x} = \frac{6.2 + 7.1 + 4.8 + 9.0 + 3.3}{5} = \frac{30.4}{5} = 6.08$

Advantages and Disadvantages of Using the Mean to Summarise Data


Advantages:
• Simple to calculate.
• Summarises the data with a single value

Disadvantages:
• With only a summary value you lose information about the original data.


• Sample 1 with n = 3: 999, 1000, 1001 𝑥̅ = 1000
• Sample 2 with n = 3: 0, 1000, 2000 𝑥̅ = 1000
• Just knowing the mean does not help you know what the underlying data looks like.
• The value of the mean is sensitive to outliers (values that are much higher or lower than most of the data).

The Median
The median is the value in the data set for which half the observations are higher and half the observations are
lower.
• First arrange the data in ascending order.

Example with sample of size n = 7:


21 27 27 28 34 45 50

With n = 7, the median is the value in the fourth position of the sorted data, which is 28.
21 27 27 28 34 45 50

The median is not sensitive to outliers.


21 27 27 28 34 45 5000
• The median is still 28.

When there is an odd number of data values, the median is always the middle value in the data set.

When there is an even number of data values, the median is halfway between the two middle values.
Example with a sample of size n = 6:

145 157 170 182 204 209

The two middle values are 170 and 182, so the median is (170 + 182) / 2 = 176.
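
The odd and even cases can be handled by one small function. This is a minimal sketch using the samples from the notes. The function name `median` is an assumption, and Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle value for odd n; halfway between the two middle values for even n."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)            # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [21, 27, 27, 28, 34, 45, 50]
with_outlier = [21, 27, 27, 28, 34, 45, 5000]
even_sample = [145, 157, 170, 182, 204, 209]

print(median(odd_sample), median(with_outlier), median(even_sample))
```

Replacing 50 with 5000 leaves the median at 28, which illustrates why the median is not sensitive to outliers.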

The Mode
The mode is the value that appears most often in a data set.
• If no data value or category repeats more than once, then we say that the mode does not exist.
• More than one mode can exist if two or more values tie for the most frequent.

The mode is a particularly useful way to describe categorical data.

Example with numerical data:

• Number of children per family in a sample of 24 families:

0,0,0,0,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,4,5


Number of children    Frequency
0                     4
1                     5
2                     8
3                     4
4                     2
5                     1

The value that appears most often is 2 (occurs 8 times), so the mode = 2 children.
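
A sketch of finding the mode with `collections.Counter`, following the two rules stated earlier: no mode when nothing repeats, and several modes when values tie. The function name `modes` is an assumption.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                       # the mode does not exist
    return sorted(v for v, c in counts.items() if c == highest)


children = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
            3, 3, 3, 3, 4, 4, 5]

print(modes(children))
```

The same function works for categorical data, for example a list of car makes given as strings.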

Example with categorical data:

• The car that appears most often is Toyota (which occurs 7 times), so the mode is the Toyota model.

Example:
Prices for 5 homes have been collected

House Prices:

$2,000,000
500,000
300,000
100,000
100,000

Sum 3,000,000
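
The three measures for the house prices can be checked with Python's standard `statistics` module. The variable names are chosen for this sketch.

```python
import statistics

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # 3,000,000 / 5
median_price = statistics.median(prices)  # middle value of the sorted prices
mode_price = statistics.mode(prices)      # most frequent price

print(mean_price, median_price, mode_price)
```

The mean is 600,000, the median is 300,000 and the mode is 100,000. The single 2,000,000 house pulls the mean well above what most of the homes cost.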


Which Measure of Central Tendency Should You Use?

The mean is generally used as it is relatively easy to determine and most widely understood by people with little
statistical training.

If outliers are present, the median is often used, since it is not sensitive to outliers.
• For example, median home prices are often reported for a region, because a few very expensive homes would distort the mean.

For categorical data, the mode is the only choice.

Measures of Variability

Measures of variability show how much spread is present in the data.


The Range
The range is the simplest measure of variation: the difference between the highest value and the lowest value in a data set.
Range = Highest value - Lowest value

Advantages:
• Easy to calculate and understand

Disadvantages:
• Only based on two numbers in the data set (ignores the way in which the data are distributed)
• Sensitive to outliers

Example:

The Variance and Standard Deviation


The Standard Deviation
The standard deviation is the square root of the variance.
• Has the same units as the original data

Sample standard deviation formula:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Calculating the Sample Standard Deviation


Short-Cut Formulas for the Sample Variance and Standard Deviation
Equivalent, but easier for hand calculations:
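
The formulas are not reproduced in this copy of the notes. The sketch below assumes the standard forms: the definitional sample variance $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ and the short-cut version $s^2 = \frac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}$. It uses the n = 5 sample from the mean example to show that both give the same result. The function names are hypothetical.

```python
import math


def sample_variance(values):
    """Definitional formula: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Short-cut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


sample = [6.2, 7.1, 4.8, 9.0, 3.3]

s2 = sample_variance(sample)
s2_short = sample_variance_shortcut(sample)
s = math.sqrt(s2)                      # standard deviation is the square root of the variance
print(round(s2, 3), round(s2_short, 3), round(s, 3))
```

Both routes give a sample variance of about 4.74 and a sample standard deviation of about 2.18 for this sample.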

The Variance and Standard Deviation for a Population


Used when the data set represents an entire population rather than a sample from a population


Short-Cut Formulas for the Population Variance and Standard Deviation

Example calculation using short-cut formula:
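
The original worked example is not available in this copy. As a stand-in sketch, the code below treats five values as an entire population (an assumption made purely for illustration) and applies the standard population forms: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ and the short-cut $\sigma^2 = \frac{\sum x_i^2 - (\sum x_i)^2 / N}{N}$. The function names are hypothetical.

```python
import math


def population_variance(values):
    """Definitional formula: squared deviations from mu, divided by N."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N


def population_variance_shortcut(values):
    """Short-cut formula: (sum of x^2 - (sum of x)^2 / N) / N."""
    N = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / N) / N


# Illustrative assumption: these five values are the whole population
population = [6.2, 7.1, 4.8, 9.0, 3.3]

sigma2 = population_variance(population)
sigma2_short = population_variance_shortcut(population)
sigma = math.sqrt(sigma2)
print(round(sigma2, 4), round(sigma2_short, 4), round(sigma, 4))
```

The only difference from the sample version is the divisor: N for a population, n - 1 for a sample.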


The standard deviation is a common measure of consistency in business applications, such as quality control.
• The standard deviation measures the amount of variability around the mean.

The standard deviation is affected by the scale of the data.


• When sample means are very different, comparing standard deviations can be misleading.

The Coefficient of Variation


The coefficient of variation, CV, measures the standard deviation in terms of its percentage of the mean.

• A high CV indicates high variability relative to the size of the mean.


• A low CV indicates low variability relative to the size of the mean.

A smaller coefficient of variation indicates more consistency within a set of data values.
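
The CV formula is not shown in this copy. The sketch below assumes the usual sample form, CV = (s / x̄) × 100%, and applies it to the two three-value samples used earlier in the discussion of the mean. The function name is hypothetical.

```python
import statistics


def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    mean = statistics.mean(sample)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(sample) / mean * 100


cv_1 = coefficient_of_variation([999, 1000, 1001])
cv_2 = coefficient_of_variation([0, 1000, 2000])
print(cv_1, cv_2)
```

Both samples have a mean of 1000, but the first has a CV of about 0.1% and the second a CV of about 100%, so the first sample is far more consistent.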

Example:


Working with Grouped Data
Suppose data has already been summarised by a frequency distribution.
• The individual data values are no longer shown.
• Only grouped data is available.

To estimate the average for the frequency distribution:


• Find the midpoint for each group.

(The midpoint is the halfway point in each group.)


• Use the midpoint as a representative value for that group.

Example: The Mean of Grouped Data


Example An online merchant has collected the following grouped data for the number of web pages viewed
by a sample of its customers:

Number of pages      Frequency
1 to under 5         6
5 to under 9         12
9 to under 13        10
13 to under 17       4

The merchant would like to calculate the average number of viewed pages.

1. Find the midpoint of each class

Number of pages      Midpoint (mi)    Frequency (fi)
1 to under 5         3                6
5 to under 9         7                12
9 to under 13        11               10
13 to under 17       15               4

2. Calculate the mean

$\bar{x} \approx \frac{\sum f_i m_i}{n} = \frac{(6)(3) + (12)(7) + (10)(11) + (4)(15)}{32} = \frac{272}{32} = 8.5$

The average number of viewed pages is about 8.5.
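
The grouped-mean calculation can be sketched as follows, with each class given as (lower boundary, upper boundary, frequency). The function name `grouped_mean` is an assumption.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower, upper, frequency) classes using class midpoints."""
    n = sum(freq for _, _, freq in classes)
    total = sum(freq * (lower + upper) / 2 for lower, upper, freq in classes)
    return total / n


page_views = [(1, 5, 6), (5, 9, 12), (9, 13, 10), (13, 17, 4)]

mean_pages = grouped_mean(page_views)
print(mean_pages)
```

The midpoints work out to 3, 7, 11 and 15, and the weighted total of 272 over 32 customers gives 8.5.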

The Variance and Standard Deviation of Grouped Data
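
The formulas for this section are not included in this copy. A common estimate, assumed here, again uses the class midpoints as representative values: $s^2 \approx \frac{\sum f_i (m_i - \bar{x})^2}{n - 1}$. The sketch applies it to the page-view classes from the grouped-mean example. The function name is hypothetical.

```python
import math


def grouped_variance(classes):
    """Estimate the sample variance from (lower, upper, frequency) classes."""
    n = sum(freq for _, _, freq in classes)
    midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
    freqs = [freq for _, _, freq in classes]
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    return sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)


page_views = [(1, 5, 6), (5, 9, 12), (9, 13, 10), (13, 17, 4)]

variance_pages = grouped_variance(page_views)
sd_pages = math.sqrt(variance_pages)
print(round(variance_pages, 2), round(sd_pages, 2))
```

Under these assumptions the estimated variance is 440 / 31, about 14.19, and the estimated standard deviation is about 3.77 pages.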

- end of content –
