BST 121
Medical Statistics
Prepared By
Dr. Mahmoud Mokhtar
Dr. Marwa Hani Maneea
Vision
The College of Oral and Dental Medicine - Modern
University for Technology and Information aspires to
be one of the most distinguished colleges at the local
and regional levels in the field of dentistry.
Mission
The college is committed to preparing dentists who are
distinguished by professional merit and are able to
comply with the requirements of the labor market and
keep pace with scientific development and contribute
to it through research activities while meeting the needs
of the surrounding community within the framework of
ethical values.
Contents
Chapter 1
1.1 Introduction
1.2 Some Basic Concepts
1.3 Data presentation
1-3-1-Frequency Distribution for Qualitative Data
1-3-2-Charts and graphs
2-1-Mean
2-2 Median
2-3 Mode
Chapter 1
1.1 Introduction:
Statistics is a field of study concerned with
1- collection, organization, summarization and analysis of data.
2- drawing of inferences about a body of data when only a part of the data
is observed.
Statisticians try to interpret and communicate the results to others.
Types of statistics:
Applied statistics can be divided into two areas: descriptive statistics and
inferential statistics.
Descriptive Statistics
Descriptive statistics consists of methods for organizing, displaying, and
describing data by using tables, graphs, and summary measures.
Inferential Statistics
Inferential statistics consists of methods that use sample results to help
make decisions or predictions about a population.
1.2 Some Basic Concepts
Data:
Data is the raw material of statistics.
Statistics:
Statistics is the field of study concerned with:
1-The collection, organization, summarization, and analysis of data.
(Descriptive Statistics)
2-The drawing of inferences and conclusions about a body of data
(population) when only a part of the data (sample) is observed. (Inferential
Statistics)
Sources of Data:
1. Routinely kept records.
2. Surveys.
3.Experiments.
4.External sources. (Published reports, data bank, . . .)
Population:
A population is the largest collection of entities (elements or individuals) in
which we are interested at a particular time and about which we want to draw
some conclusions. When we take a measurement of some variable on each of
the entities in a population, we generate a population of values of that
variable.
Two kinds of populations: finite or infinite.
Example:
medicine, then our population consists of the weights of all of these students,
The number of elements in the population is called the population size
and is denoted by N.
Sample:
data. This part of the population on which we collect data is called the
sample.
Variables:
Example of Variables:
(1) No. of patients (2) Height (3) Sex (4) Educational Level
Independent Variable:
The variable in the study under consideration. The cause for the
Dependent Variable:
The variable being affected by the independent variable. The effect of
the study.
Parameter:
A numerical value summarizing all the data of an entire population
Types of Variables:
Examples
(i)Family Size (ii) No. of patients (iii) Weight (iv) height (v) body
temperature.
(a)Discrete Variables:
Examples:
Family size (x = 1, 2, 3, .
(b)Continuous Variables:
Variables that can assume an infinite number of values between any two
specific values. They are obtained by measuring and they often include
fractions and decimals.
Examples
below zero. Interval scales hold no true zero and can represent values below zero.
For example, you can measure temperature below 0 degrees Celsius, such as -10
degrees. Ratio variables, on the other hand, never fall below zero. Height and
weight measure from 0 and above, but never fall below it.
Examples
place of birth, types of drug, stages of breast cancer (I, II, III, or IV),
degree of pain (minimal, moderate, severe), gender (male or female),
hair color (blond, brown, red, gray, black), Nationality, Students Grades,
Educational level.
Types of Qualitative Variables:
Examples
•Sick - well
exist .
Examples:
• Educational level (elementary, intermediate, .
• Military rank
Types of variables
Qualitative Quantitative
Interval Ratio
1.3 Data presentation
• Tables
– Simplest way to summarize data
– Data is presented as absolute numbers or percentages
• Charts and graphs
– Visual representation of data
– Usually, data is presented using percentages
Example
Construct a frequency distribution table for these data.
Solution
Calculating Percentage
Example:
The frequency of a particular data value is the number of times the data
value occurs.
Ungrouped data
A frequency table is constructed by arranging collected data values in
ascending order of magnitude with their corresponding frequencies.
We use the following steps to construct a frequency table:
Step 1:
Construct a table with three columns. Then in the first column, write
down all of the data values in ascending order of magnitude.
Step 2:
To complete the second column, go through the list of data values and
place one tally mark at the appropriate place in the second column for
every data value. When the fifth tally is reached for a mark, draw a
horizontal line through the first four tally marks as shown for 7 in the
above frequency table. We continue this process until all data values in
the list are tallied.
Step 3:
Count the number of tally marks for each data value and write it in the
third column.
Example
The marks awarded for an assignment set for a Year 8 class of 20
students were as follows:
6 7 5 7 7 8 7 6 9 7
4 10 6 8 8 9 5 6 4 8
Present this information in a frequency table.
Solution
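A minimal Python sketch of the three steps above, applied to the 20 marks of this example. The variable names (marks, frequency) are illustrative assumptions, not part of the course material; the tally column is drawn with plain "|" characters rather than grouped in fives.

```python
from collections import Counter

marks = [6, 7, 5, 7, 7, 8, 7, 6, 9, 7,
         4, 10, 6, 8, 8, 9, 5, 6, 4, 8]

# Steps 2 and 3 combined: count how many times each mark occurs.
frequency = Counter(marks)

print(f"{'Mark':>4}  {'Tally':<7}  {'Frequency':>9}")
for mark in sorted(frequency):            # Step 1: ascending order of magnitude
    count = frequency[mark]
    print(f"{mark:>4}  {'|' * count:<7}  {count:>9}")
print(f"{'Sum':>4}  {'':<7}  {sum(frequency.values()):>9}")
```

The frequencies should add up to the number of students (20), which is a quick check that no mark was missed.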
10, 15, 20 etc. Likewise, if the size of the group is 10, then the groups
should start at 10, 20, 30, 40 etc.
The frequency of a group : is the number of data values that fall in the
range specified by that group (or class interval).
Example
The number of calls from motorists per day for roadside service was
recorded for the month of December 2003. The results were as
follows:
28 122 217 130 120 86 80 90 120 140
70 40 145 187 113 90 68 174 194 170
100 75 104 97 75 123 100 82 109 120
81
40. So, the groups will start at 0, 40, 80, 120, 160 and 200 to include all
of the data. Note that in fact we need 6 groups (1 more than we first
thought).
Step 2: Go through the list of data values. For the first data value in
the list, 28, place a tally mark against the group 0-39 in the second
column. For the second data value in the list, 122, place a tally mark
against the group 120-159 in the second column. For the third data
value in the list, 217, place a tally mark against the group 200-239 in the
second column.
We continue this process until all of the data values in the set are
tallied.
Step 3: Count the number of tally marks for each group and write it in
the third column. The finished frequency table is as follows:
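The grouping steps can be sketched in Python as follows. This is only an illustration under the assumptions stated in the text (groups of size 40 starting at 0); the function name grouped_frequencies and the constant WIDTH are hypothetical.

```python
calls = [28, 122, 217, 130, 120, 86, 80, 90, 120, 140,
         70, 40, 145, 187, 113, 90, 68, 174, 194, 170,
         100, 75, 104, 97, 75, 123, 100, 82, 109, 120,
         81]

WIDTH = 40  # group size used in the text

def grouped_frequencies(values, width):
    """Return {(lower, upper): frequency} for groups starting at multiples of width."""
    top = (max(values) // width + 1) * width
    table = {(lo, lo + width - 1): 0 for lo in range(0, top, width)}
    for v in values:
        lo = (v // width) * width          # start of the group that contains v
        table[(lo, lo + width - 1)] += 1   # one "tally mark" for this group
    return table

table = grouped_frequencies(calls, WIDTH)
for (lo, hi), freq in table.items():
    print(f"{lo:>3}-{hi:<3}  {freq}")
```

For these 31 values the six groups 0-39, 40-79, 80-119, 120-159, 160-199 and 200-239 receive 1, 5, 12, 8, 4 and 1 values respectively.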
Frequency 11 46 70 45 16 1
Solution
Relative frequency: $\text{R.F.} = \dfrac{\text{Freq}}{n}$,
where $n$ = sum of frequencies = sample size.
1-3-2-Charts and graphs
Data may be presented diagrammatically or visually by use of bar graphs,
histograms, frequency polygon, Ogive or Pie-chart. These visual diagrams
give a visual impression to the statistician who now goes ahead to analyze
and make conclusions about the data.
Bar Graph
This is at times called a bar chart. Class frequencies are plotted against class limits.
Since consecutive classes can never have common limits, the bars have spaces
between them when plotted.
Example
Twenty-five students are asked their blood type. Their responses are as follows:
A; B; O; A; AB; O; O; A; O; B; A; A; A; O; O; O; B; O; AB; B; O; B; O; A; A
Create a bar chart.
Solution
Data value Frequency
A 8
AB 2
B 5
O 10
Example
Treatment group Frequency
1 15
2 25
3 20
Histogram:
A histogram is a specific type of bar chart, where the categories are ranges of
numbers. Histograms therefore show combined continuous data.
Example
You have been given a list of ages in years, and you need to show them in a graph.
You can choose to group them into ten-year age categories, 0–10, 11–20, 21–30
and so on:
Age Number of people
0-10 2
11-20 5
21-30 7
31-40 8
41-50 4
51-60 3
Polygon
A graph formed by joining the midpoints of the tops of successive bars in a histogram
with straight lines is called a polygon.
Example
A frequency polygon was constructed from the frequency table below.
(the midpoint =(upper limit+lower limit)/2)
Scores 40-49 50-59 60-69 70-79 80-89 90-99 100-109
Frequency 0 5 10 30 40 15 0
Mid point 44.5 54.5 64.5 74.5 84.5 94.5 104.5
Pie Chart
A circle divided into portions that represent the relative frequencies or percentages
of a population or a sample belonging to different categories is called a pie chart.
Example
The next chart explains the usage of a home budget.
Stem-and-Leaf Graph:
One simple graph, the stem-and-leaf graph or stem plot, comes from the
field of exploratory data analysis. It is a good choice when the data sets
are small.
To create the stem-and-leaf plot:
1.) Divide each observation of data into a stem and a leaf. The leaf consists
of a final significant digit.
For example:
1-The number 23 has stem two and leaf three. The number 432 has stem
43 and leaf two. Likewise, the number 5,432 has stem 543 and leaf two.
The decimal 9.3 has stem nine and leaf three.
2-Write the stems in a vertical line from smallest to largest. Draw a
vertical line to the right of the stems. Then write the leaves in increasing
order next to their corresponding stem.
Example:
For Susan Dean's spring pre-calculus class, scores for the first exam were
as follows:
33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80;
83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 96; 100
Create a Stem and leaf plot.
Solution
Stem Leaf
3 3
4 2 9 9
5 3 5 5
6 1 3 7 8 8 9 9
7 2 3 4 8
8 0 3 8 8 8
9 0 2 4 4 4 4 6
10 0
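The construction above can be sketched in Python. This is a minimal illustration that assumes non-negative whole-number data, with the last digit as the leaf; the function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its leaves in increasing order."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 432 -> stem 43, leaf 2
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Sorting the data first means the leaves of every stem are already in increasing order when they are appended.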
The Cumulative Frequency Curve
This is also called an Ogive. It is obtained by plotting the cumulative
frequencies against the class boundaries.
Example
Solution
Frequency            4   6   5   3   2
Cumulative frequency 4  10  15  18  20
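Each cumulative frequency is the running total of the frequencies up to and including that class. A short Python sketch (the variable names are illustrative):

```python
from itertools import accumulate

frequencies = [4, 6, 5, 3, 2]

# Running totals: 4, 4+6, 4+6+5, ...
cumulative = list(accumulate(frequencies))
print(cumulative)
```

The last cumulative frequency always equals the total number of observations, here 20.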
Exercise (1)
(1) Indicate which of the following variables are quantitative and which
are qualitative.
9. Colors of cars
(2) Identify each of the following as examples of (1) nominal, (2)
ordinal, (3) discrete, or (4) continuous variables:
(3) The following data give the results of a sample survey. The letters A,
B, and C represent the three categories.
ABBACBCCCA
CBCACCBCCA
ABCCBCBACA
(4) The following data give the results of a sample survey. The letters Y,
N, and D
DNNYYYNYDY
YYYYNYYNNY
NYYNDNYYYY
YYNNYYNNDY
(5)For the Park City basketball team, scores for the last 30 games were
as follows (smallest to largest):
32; 32; 33; 34; 38; 40; 42; 42; 43; 44; 46; 47; 47; 48; 48; 48; 49; 50; 50;
51; 52; 52; 52; 53; 54; 56; 57; 57; 60; 61 .
Construct a stem plot for the data.
(6)The following scores were made on a 53-item test:
25 30 34 37 41 42 46 49 53
26 31 34 37 41 42 46 50 53
28 31 35 37 41 43 47 51 54
29 32 36 38 41 44 48 52 54
30 33 36 39 41 44 48 52 55
30 33 37 40 42 45 48 52
1- Set up a frequency table for the above data, then calculate the Relative
Frequency and the percentage frequency.
Chapter 2
Measures of Central Tendency
These include the mean, median and mode. These values locate the
average value of a variable in a specific position of the number line with
respect to the data.
2-1-Mean
The mean, also called the arithmetic mean, is the most frequently used
measure of central tendency.
and the mean calculated for population data is denoted by 𝜇 (Greek letter
mu).
$\text{Mean} = \bar{X} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n}.$
Example 1
Determine the mean mark of a class test using the data 28, 35, 18, 40, 62,
50 and 70.
Solution:
$\text{Mean} = \bar{X} = \dfrac{28 + 35 + 18 + 40 + 62 + 50 + 70}{7} = \dfrac{303}{7} \approx 43.3.$
Example 2
The following are the ages (in years) of eight patients: 53, 32, 61,27, 39,
44, 49, 57. Find the mean age of these patients.
Solution
$\bar{X} = \dfrac{53 + 32 + 61 + 27 + 39 + 44 + 49 + 57}{8} = \dfrac{362}{8} = 45.25 \text{ years}.$
Example 3
The heights of 5 people are 142 cm, 150 cm, 149 cm, 156 cm, and 153 cm. Find the mean height.
Solution
$\text{Mean height} = \bar{X} = \dfrac{142 + 150 + 149 + 156 + 153}{5} = \dfrac{750}{5} = 150 \text{ cm}.$
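The three examples can be checked with Python's standard statistics module. This is only a verification sketch; the dictionary labels are illustrative.

```python
from statistics import mean

examples = {
    "class test marks": [28, 35, 18, 40, 62, 50, 70],
    "patient ages (years)": [53, 32, 61, 27, 39, 44, 49, 57],
    "heights (cm)": [142, 150, 149, 156, 153],
}

for label, data in examples.items():
    # mean = (sum of the observations) / (number of observations)
    print(f"{label}: {sum(data)} / {len(data)} = {mean(data):.2f}")
```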
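$\text{Mean} = \bar{X} = \dfrac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i},$
where $x_i$ is the mid-point of class $i$ and $f_i$ is its frequency.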
Example 1
Calculate the mean
Classes 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64
Frequency 6 6 4 3 3 6 5 1 6
Solution:
Classes frequency 𝑥𝑖 𝑓𝑖 𝑥𝑖
20-24 6 22 132
25-29 6 27 162
30-34 4 32 128
35-39 3 37 111
40-44 3 42 126
45-49 6 47 282
50-54 5 52 260
55-59 1 57 57
60-64 6 62 372
Sum 40 1630
$\bar{X} = \dfrac{1630}{40} = 40.75.$
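A minimal Python sketch of the grouped-data mean, using the class mid-points as $x_i$. The function name grouped_mean is an assumption made for illustration.

```python
classes = [(20, 24), (25, 29), (30, 34), (35, 39), (40, 44),
           (45, 49), (50, 54), (55, 59), (60, 64)]
freqs = [6, 6, 4, 3, 3, 6, 5, 1, 6]

def grouped_mean(classes, freqs):
    """Sum of f_i * x_i divided by sum of f_i, with x_i the class mid-point."""
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    return sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)

print(grouped_mean(classes, freqs))
```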
Example 2
The following table indicates the data on the number of patients visiting a
hospital in a month.
Number of patients visiting hospital    Number of days
0-10                                    2
10-20                                   6
20-30                                   9
30-40                                   7
40-50                                   4
50-60                                   2
Solution
In this case, we find the class mark (also called as mid-point of a class) for
each class.
Note: $\text{Class mark (mid-point)} = \dfrac{\text{lower limit} + \text{upper limit}}{2}.$
$\bar{X} = \dfrac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i} = \dfrac{860}{30} \approx 28.67$
Advantages:
• Uniqueness. For a given set of data there is one and only one
mean.
• Simplicity. It is easy to understand and to compute.
• The mean takes into account all values of the data.
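Disadvantages:
• The mean is affected by extreme values: a single very large or very
small observation can shift it considerably.
Example:
Sample   Data                  Mean
A        2, 4, 5, 7, 7, 10     5.83
B        2, 4, 5, 7, 7, 100    20.83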
(i) If the number of observations in a data set is odd, then the median is
given by the value of the middle term in the ranked data.
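$\tilde{X} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ is odd} \\ \dfrac{1}{2}\left\{ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right\} & n \text{ is even} \end{cases}$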
Example 1
As part of upgrading a hospital, the following data give the prices of seven
new pieces of medical equipment: 312, 257, 421, 289, 526, 374, 497. Find the
median.
Solution
First, we rank the given data in increasing order as follows:
257, 289, 312, 374, 421, 497, 526.
Since there are seven values, the middle term is the fourth term, so
the median = the value of the fourth term in the ranked data = 374.
Example 2
Determine the median of
(i) 12 15 10 11 16 18 14
(ii) 3 7 9 10 13 12 8
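Solution
(i) Ranked data: 10, 11, 12, 14, 15, 16, 18. There are seven values, so the median is the fourth term: 14.
(ii) Ranked data: 3, 7, 8, 9, 10, 12, 13. There are seven values, so the median is the fourth term: 9.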
Example 3
Let's consider the data:82, 56, 67, 54, 34, 78,29, 43, 23. What is the
median?
Solution
Arranging in ascending order, we get: 23,29, 34, 43, 54, 56, 67,
78,82.
n (no. of observations) = 9.
So, the order of the median $= \dfrac{n+1}{2} = \dfrac{9+1}{2} = 5$; the median is the fifth term.
$\text{Median} = \tilde{x} = 54.$
Example 4
Let's consider the data: 50, 67, 24, 34, 78, 43. What is the median?
Solution
Arranging in ascending order, we get: 24, 34, 43, 50, 67, 78.
n (no.of observations) = 6.
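The order of the median $= \dfrac{n}{2}, \dfrac{n}{2} + 1 = \dfrac{6}{2}, \dfrac{6}{2} + 1 = 3, 4$; the median is the mean of the third and fourth terms:
$\text{Median} = \tilde{x} = \dfrac{43 + 50}{2} = 46.5.$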
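The odd and even cases can be sketched in one small Python function. The name median is used here for illustration; the standard library also provides statistics.median, which follows the same rule.

```python
def median(values):
    """Middle term of the ranked data (odd n) or mean of the two middle terms (even n)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                       # term number (n + 1) / 2, counting from 1
    return (ranked[mid - 1] + ranked[mid]) / 2   # terms number n/2 and n/2 + 1

print(median([82, 56, 67, 54, 34, 78, 29, 43, 23]))  # nine values (odd)
print(median([50, 67, 24, 34, 78, 43]))              # six values (even)
```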
(iii) The class that contains the cumulative frequency N/2 is called the
median class.
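$\tilde{X} = L + \dfrac{\frac{N}{2} - cf}{f} \times c$
Where
L = lower limit of the median class,
N = sum of the frequencies,
cf = cumulative frequency of the class preceding the median class,
f = frequency of the median class,
c = width of the median class.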
Example 1
Solution
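Classes   Frequency   Cumulative frequency
0-10      2           2
10-20     12          2 + 12 = 14
20-30     22          14 + 22 = 36
30-40     8           36 + 8 = 44
40-50     6           44 + 6 = 50
$N = 50 \Rightarrow \dfrac{N}{2} = 25,$
The median class: 20-30.
Hence L = 20, cf = 14, f = 22, c = 10, and
$\tilde{X} = 20 + \dfrac{25 - 14}{22} \times 10 = 25.$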
Advantages:
• Uniqueness. For a given set of data there is one and only one
median.
• Simplicity. It is easy to calculate.
• It is not affected by extreme values as is the mean.
Disadvantages
• The median does not take into account all values of the sample.
2-3 Mode
The value which appears most often in the given data i.e. The
Note:
2-Depending upon the number of modes the data has, it can be called
Example 1
The data: 6, 8, 9, 3, 4, 6, 7, 6, 3 the value 6 appears the most number
of times.
Thus, mode = 6.
Example 2
Find the mode of 5, 3, 5, 8, 9
Solution
Mode =5.
Example 3
Find the mode of 8, 9, 9, 7, 8, 2, and 5.
Solution
It is a bimodal Data: 8 and 9
Example 4:
Find the mode of 4, 12, 3, 6, and 7.
Solution
No mode for this data.
Example 5:
Find the mode of {19, 8, 29, 35, 19, 28, 15}
Solution
19 appears twice, all the rest appear only once, so 19 is the mode.
Example 6:
Find the mode of: {1, 3, 3, 3, 4, 4, 6, 6, 6, 9}
Solution
3 appears three times, as does 6.
So there are two modes: at 3 and 6.
Having two modes is called "bimodal".
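The ungrouped examples above can be reproduced with a short Python sketch. The function name modes is hypothetical, and it adopts the convention used in Example 4: if every value occurs only once, the data set is reported as having no mode (an empty list).

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                      # every value appears once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([6, 8, 9, 3, 4, 6, 7, 6, 3]))        # one mode
print(modes([8, 9, 9, 7, 8, 2, 5]))              # bimodal
print(modes([4, 12, 3, 6, 7]))                   # no mode
print(modes([1, 3, 3, 3, 4, 4, 6, 6, 6, 9]))     # bimodal
```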
Step 1: Find modal class i.e., the class with maximum frequency.
$\text{Mode} = \hat{X} = L + \left[ \dfrac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \right] h.$
Where:
L = lower limit of the modal class,
𝑓𝑚 = frequency of the modal class,
𝑓1 = frequency of class preceding the modal class,
𝑓2 = frequency of class succeeding the modal class,
h = width of the modal class.
Example 1
Solution
$\text{Mode} = L + \left[ \dfrac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \right] h = 40 + \left[ \dfrac{12 - 10}{(12 - 10) + (12 - 6)} \right] \times 20$
$\text{Mode} = 45.$
Example 2
The heights, in cm, of 50 students are recorded
Number of students 7 14 10 10 9
Solution
𝑋̂=133.18.
Example 3
Find the mean, mode and median for the following data,
Class 0-10 10-20 20-30 30-40 40-50
Frequency 8 16 36 34 6
Solution:
Class   $f_i$   $x_i$   Cumulative frequency   $x_i f_i$
0-10    8       5       8                      40
10-20   16      15      24                     240
20-30   36      25      60                     900
30-40   34      35      94                     1190
40-50   6       45      100                    270
Sum     100                                    2640
$\text{Mean} = \dfrac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i} = \dfrac{2640}{100} = 26.4.$
Here, N =100 ⇒ N / 2 = 50.
Cumulative frequency just greater than 50 is 60 and corresponding
class is 20-30.
Thus, the median class is 20-30.
Hence, L = 20, c = 10, f = 36, c. f. of preceding class = 24 and N/2=50
$\text{Median} = \tilde{X} = L + \dfrac{\frac{N}{2} - cf}{f} \times c = 20 + \dfrac{50 - 24}{36} \times 10 \approx 27.2.$
Median = 27.2.
Mode = 28.8.
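The grouped-data formulas of this chapter can be sketched together in Python for this example. The function names are hypothetical, and the code assumes classes given as (lower limit, upper limit) pairs with non-zero frequency in the median class. One point worth noting: the interpolation formula for the mode from Section 2-3 gives about 29.09 for these data, whereas the value 28.8 stated above agrees with the empirical relation Mode ≈ 3 × Median − 2 × Mean evaluated with the rounded median 27.2; which of the two the example intends is not stated, so treat this as an observation rather than a correction.

```python
def grouped_mean(classes, freqs):
    """Mean from class mid-points weighted by frequency."""
    mids = [(lo + hi) / 2 for lo, hi in classes]
    return sum(f * x for f, x in zip(freqs, mids)) / sum(freqs)

def grouped_median(classes, freqs):
    """L + ((N/2 - cf) / f) * c for the first class whose cumulative frequency reaches N/2."""
    half = sum(freqs) / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for (lo, hi), f in zip(classes, freqs):
        if cf + f >= half:
            return lo + (half - cf) / f * (hi - lo)
        cf += f

def grouped_mode(classes, freqs):
    """L + [(fm - f1) / ((fm - f1) + (fm - f2))] * h for the class with the largest frequency."""
    m = freqs.index(max(freqs))
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0               # class preceding the modal class
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0  # class succeeding the modal class
    lo, hi = classes[m]
    return lo + (fm - f1) / ((fm - f1) + (fm - f2)) * (hi - lo)

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [8, 16, 36, 34, 6]

mean_value = grouped_mean(classes, freqs)
median_value = grouped_median(classes, freqs)
mode_value = grouped_mode(classes, freqs)
print(f"mean = {mean_value:.2f}, median = {median_value:.2f}, mode (formula) = {mode_value:.2f}")
print(f"3*median - 2*mean = {3 * 27.2 - 2 * 26.4:.1f}")
```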
Disadvantages:
Chapter 3
Measures of Dispersion
The measures of central tendency, such as the mean, median, and mode,
do not reveal the whole picture of the distribution of a data set. Two data
sets with the same mean may have completely different spreads. The
variation among the values of observations for one data set may be much
larger or smaller than for the other data set. (Note that the words
We also need a measure that can provide some information about the
variation among data values. The measures that help us learn about the
spread of a data set are called the measures of dispersion. The measures
They include range, mean deviation, quartiles, percentiles, deciles,
variance and standard deviation.
3-1-The range
It is the simplest measure of dispersion to calculate. It is obtained by
taking the difference between the largest and the smallest values in a
data set.
Example 1
For the set S below, find the range
(𝑖)𝑆 = {12,17,21,14,23,19}
(𝑖𝑖)𝑆 = {43,50,64,74,85,67,79,38}
Solution:
(𝑖)𝑇ℎ𝑒 𝑟𝑎𝑛𝑔𝑒 𝑜𝑓 𝑆 = 23 − 12 = 11.
(𝑖𝑖)𝑇ℎ𝑒 𝑟𝑎𝑛𝑔𝑒 𝑜𝑓 𝑆 = 85 − 38 = 47.
In contrast, a larger value of the standard deviation for a data set
indicates that the values of that data set are spread over a relatively
larger range around the mean.
The standard deviation (𝜎)is obtained by taking the positive square root
of the variance.
The Variance
$\text{Variance}\,(\sigma^2) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n},$
where:
𝑥1 , 𝑥2 , … , 𝑥𝑛 𝑏𝑒 𝑡ℎ𝑒 𝑠𝑎𝑚𝑝𝑙𝑒 𝑣𝑎𝑙𝑢𝑒𝑠.
𝑋̅ is the sample mean,
n is the sample size.
$\text{Standard deviation}\,(\sigma) = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n}}$
Example 2
Find the variance of 43, 46, 50, 53, 57, 61.
Solution
𝑥 (𝑥 − 𝑥̅ ) (𝑥 − 𝑥̅ )2
43 -8.66 74.9956
46 -5.66 32.0356
50 -1.66 2.7556
53 1.34 1.7956
57 5.34 28.5156
61 9.34 87.2356
Sum 227.3336
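$\text{Mean} = \bar{x} = \dfrac{43 + 46 + 50 + 53 + 57 + 61}{6} = \dfrac{310}{6} \approx 51.66$ (truncated to two decimal places, as used in the table).
$\text{Variance} = \dfrac{227.3336}{6} \approx 37.889.$
$\text{The standard deviation} = \sqrt{37.889} \approx 6.155.$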
Example 3
Consider the following set of data: 12,15,11,17,18,20,19. Find the standard
deviation and the variance.
Solution
𝑥𝑖 𝑥𝑖 − 𝑋̅ (𝑥𝑖 − 𝑥̅ )2
12 -4 16
15 -1 1
11 -5 25
17 1 1
18 2 4
20 4 16
19 3 9
Sum 112 72
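$\bar{X} = \dfrac{112}{7} = 16.$
$\text{Variance}\,(\sigma^2) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n} = \dfrac{72}{7} \approx 10.3$
$\text{Standard deviation} = \sigma = \sqrt{\dfrac{72}{7}} \approx 3.21.$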
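A minimal Python sketch of the variance and standard deviation as defined above (dividing by n). The function name population_variance is an assumption; the standard library functions statistics.pvariance and statistics.pstdev use the same divisor, whereas statistics.variance divides by n − 1 and therefore gives a different value.

```python
from math import sqrt
from statistics import pstdev, pvariance

def population_variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

data = [12, 15, 11, 17, 18, 20, 19]
variance = population_variance(data)
std_dev = sqrt(variance)
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```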
Coefficient of variation
It is sometimes useful to describe variability by expressing the standard
deviation as a proportion of mean, usually a percentage. The formula for
it as a percentage is:
$\text{Coefficient of variation} = \dfrac{\text{Standard deviation}}{\text{Mean}} \times 100.$
For Grouped data
The variance:
$\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{X})^2$
The standard deviation:
$\sigma = \sqrt{\dfrac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{X})^2}$
Where
Example 4
Calculate the standard deviation and the variance from the following distribution of
marks
Marks 1-3 3-5 5-7 7-9
No. of students 40 30 20 10
Solution
Marks   $f_i$   $x_i$   $x_i f_i$   $(x_i - \bar{X})^2$   $f_i (x_i - \bar{X})^2$
1-3 40 2 80 4 160
3-5 30 4 120 0 0
5-7 20 6 120 4 80
7-9 10 8 80 16 160
Sum 100 400 400
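$\bar{X} = \dfrac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i} = \dfrac{400}{100} = 4.$
The variance: $\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{X})^2 = \dfrac{400}{100} = 4.$
The standard deviation: $\sigma = \sqrt{\dfrac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{X})^2} = 2.$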
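The grouped-data variance can be sketched in Python as below, using class mid-points as $x_i$ and $N = \sum f_i$. The function name grouped_variance is hypothetical.

```python
from math import sqrt

def grouped_variance(classes, freqs):
    """(1/N) * sum of f_i * (x_i - mean)^2, with x_i the class mid-point."""
    n = sum(freqs)
    mids = [(lo + hi) / 2 for lo, hi in classes]
    mean = sum(f * x for f, x in zip(freqs, mids)) / n
    return sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids)) / n

classes = [(1, 3), (3, 5), (5, 7), (7, 9)]
freqs = [40, 30, 20, 10]

variance = grouped_variance(classes, freqs)
print(f"variance = {variance}, standard deviation = {sqrt(variance)}")
```

The same function can be applied to any of the grouped tables in this chapter by changing classes and freqs.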
Example 5
For the distribution given below, find the standard deviation.
Classes 10-12 13-15 16-18 19-21 22-24 25-27
𝑓 3 8 12 13 10 4
Solution
Classes   $f$   $x$   $xf$   $(x_i - \bar{X})^2$   $f_i (x_i - \bar{X})^2$
$\bar{X} = \dfrac{943}{50} = 18.86.$
$\sigma^2 = 16.1604, \quad \sigma = \sqrt{16.1604} = 4.02.$
Example 6
Calculate the standard deviation and the variance from the following data
Classes 10-14 15-19 20-24 25-29 30-34 35-39
Frequency 2 5 8 12 7 6
Solution
Classes   $f_i$   $x_i$   $x_i f_i$   $(x_i - \bar{X})$   $(x_i - \bar{X})^2$   $f_i (x_i - \bar{X})^2$
10-14     2       12      24          -14.375             206.64                413.281
15-19     5       17      85          -9.375              87.89                 439.453
20-24     8       22      176         -4.375              19.14                 153.125
25-29     12      27      324         0.625               0.39                  4.688
30-34     7       32      224         5.625               31.64                 221.484
35-39     6       37      222         10.625              112.89                677.344
Sum       40              1055                                                  1909.375
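$\bar{X} = \dfrac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i} = \dfrac{1055}{40} = 26.375.$
The variance: $\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{X})^2 = \dfrac{1909.375}{40} = 47.734375.$
The standard deviation: $\sigma = \sqrt{\dfrac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{X})^2} = 6.909.$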
Example 7:
Calculate the standard deviation from the frequency table
Class 5-10 10-15 15-20 20-25
Frequency 5 6 15 10
Solution
$\bar{X} = \dfrac{600}{36} \approx 16.67$
$\sigma^2 = \dfrac{875}{36} \approx 24.31, \quad \sigma = \sqrt{\dfrac{875}{36}} \approx 4.93.$
Exercise (2)
(I)Find: the arithmetic mean, median, mode, the variance and the
standard deviation for the following ungrouped data
1- 43, 47, 56, 66, 78, 88, 95, 101, 105 and 110
2- 21, 30, 38, 45, 50, 56, 71, 82 and 87.
5- 3,6,5,8,6,5,5,4,96.
(II)Find: the arithmetic mean, median, mode, the variance and the
standard deviation for the following grouped data
1-
2-
Class     20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64
Frequency 6     6     4     3     3     6     5     1     6
3-
4-
5-
Chapter 4
CORRELATION AND REGRESSION
4-1-CORRELATION
Correlation coefficients measure the strength of the relationship between
two variables. A correlation between variables indicates that as one
variable changes in value, the other variable tends to change in a specific
direction. Understanding that relationship is useful because we can use
the value of one variable to predict the value of the other variable. For
example, height and weight are correlated—as height increases, weight
also tends to increase. Consequently, if we observe an individual who is
unusually tall, we can predict that his weight is also above the average.
Zero Correlation:
If there is no relationship between x and y then there is zero or no
correlation.
1-The relationship between the speed of a wind turbine and the amount
of energy it produces. As the turbine speed increases, electricity
production also increases.
4- The more time you spend on a project, the more effort you'll have put in.
5. The more overtime you work, the more money you'll earn.
5- The more you work in the office, the less time you'll spend at home.
Examples of No correlation
1. The nicer you treat your employees, the higher their pay will be.
4. The earlier you arrive at work, your need for more supplies increases.
5. The more funds you invest in your business, the more employees will
leave work early.
Important Notes:
1- The correlation coefficient lies between -1 and 1.
2-The correlation coefficient lies between 0 and 1 for a positive
correlation or between −1 and 0 for a negative correlation.
3-If r = +1, then the correlation between the two variables is said to be
perfect and positive. If r = -1, then the correlation between the two
variables is said to be perfect and negative.
Some properties of the correlation coefficient (r) , 𝒓 ∈ [−𝟏, 𝟏]
r=0 No correlation
2-Multiple Correlation:
Under Multiple Correlation three or more than three variables are
studied.
3-Partial correlation:
Analysis recognizes more than two variables but considers only two
variables keeping the other constant.
4-Total correlation:
Is based on all the relevant variables, which is normally not feasible.
2-Non-Linear correlation:
The correlation would be nonlinear if the amount of change in one
variable does not bear a constant ratio to the amount of change in the
other variable.
Types of Correlation
• Based on the direction of change of variables
• Based upon the number of variables studied (including Partial and Total)
• Based upon the constancy of the ratio of change between the variables
−1 ≤ r ≤ 1
Example 1
X 2 4 5 6 8 11
Y 18 12 10 8 7 5
Solution
X y XY X2 Y2
2 18 36 4 324
4 12 48 16 144
5 10 50 25 100
6 8 48 36 64
8 7 56 64 49
11 5 55 121 25
Sum 36 60 293 266 706
$r = \dfrac{6 \times 293 - 36 \times 60}{\sqrt{6 \times 266 - (36)^2}\,\sqrt{6 \times 706 - (60)^2}} = -0.920.$
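The computation above can be sketched as a small Python function built from the same five sums (∑x, ∑y, ∑xy, ∑x², ∑y²). The function name pearson_r is an assumption made for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the raw sums, as in the worked example."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

x = [2, 4, 5, 6, 8, 11]
y = [18, 12, 10, 8, 7, 5]
r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

A negative r close to −1 indicates a strong inverse relation: y tends to decrease as x increases.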
Example 2:
x 2 -1 7 -8 5 -4 0 -5 8 -3
y 3 -4 7 -8 1 0 -3 -5 4 -1
Solution
x y xy 𝑥2 𝑦2
2 3 6 4 9
-1 -4 4 1 16
7 7 49 49 49
-8 -8 64 64 64
5 1 5 25 1
-4 0 0 16 0
0 -3 0 0 9
-5 -5 25 25 25
8 4 32 64 16
-3 -1 3 9 1
Sum 1 -6 188 257 190
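$r = \dfrac{10 \times 188 - (1)(-6)}{\sqrt{10 \times 257 - (1)^2}\,\sqrt{10 \times 190 - (-6)^2}} = \dfrac{1886}{\sqrt{2569}\,\sqrt{1864}} \approx 0.86.$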
The relation between x and y is direct and strong.
Important Note
Pearson’s correlation coefficient (r)does not change if we add or subtract
a constant number from or to all values of (𝑥) and also if we add or
subtract other constant number from or to all values of (𝑦).
Then if we put:
𝑋 = 𝑥 − 𝑥̅ 𝑎𝑛𝑑 𝑌 = 𝑦 − 𝑦̅ ,
Where:
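$\bar{x}$ is the mean of $x$: $\bar{x} = \dfrac{\sum x}{n}$.
$\bar{y}$ is the mean of $y$: $\bar{y} = \dfrac{\sum y}{n}$.
$r = \dfrac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{n \sum X^2 - (\sum X)^2}\,\sqrt{n \sum Y^2 - (\sum Y)^2}}$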
Example 3
Calculate Pearson’s correlation coefficient between the variables x and y for the
following data, then determine its type.
x 56 62 76 77 83 86 90 92 98
y 67 44 53 48 55 42 41 34 39
Solution
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{720}{9} = 80, \quad X = x - 80.$
$\bar{y} = \dfrac{\sum y}{n} = \dfrac{423}{9} = 47, \quad Y = y - 47.$
𝒙 𝒚 𝑿 = 𝒙 − 𝟖𝟎 𝒀 = 𝒚 − 𝟒𝟕 𝑿𝟐 𝒀𝟐 𝑿𝒀
56 67 -24 20 576 400 -480
62 44 -18 -3 324 9 54
76 53 -4 6 16 36 -24
77 48 -3 1 9 1 -3
83 55 3 8 9 64 24
86 42 6 -5 36 25 -30
90 41 10 -6 100 36 -60
92 34 12 -13 144 169 -156
98 39 18 -8 324 64 -144
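Sum 720 423 0 0 1538 804 -819
Since $\sum X = \sum Y = 0$, the formula reduces to
$r = \dfrac{\sum XY}{\sqrt{\sum X^2}\,\sqrt{\sum Y^2}} = \dfrac{-819}{\sqrt{1538}\,\sqrt{804}} \approx -0.74.$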
The relation between x and y is inverse and strong.
Example 4
Find the rank correlation coefficient from the following data:
X 17 13 15 16 6 11 14 9 7 12
Y 36 46 35 24 12 18 27 22 2 8
Solution
x y Rank x=𝑅1 Rank y=𝑅2 d=𝑅1 -𝑅2 𝒅𝟐
17 36 1 2 -1 1
13 46 5 1 4 16
15 35 3 3 0 0
16 24 2 5 -3 9
6 12 10 8 2 4
11 18 7 7 0 0
14 27 4 4 0 0
9 22 8 6 2 4
7 2 9 10 -1 1
12 8 6 9 -3 9
Sum 44
$r = 1 - \dfrac{6(44)}{10(100 - 1)} = 0.733.$
The correlation is direct and strong.
Remark:
If some values are equal (tied), each of them is given the mean of the ranks they occupy.
Example 5
The following table gives the score of 10 students in statistics(x) and
anatomy(y). Find the Spearman’s rank correlation coefficient and
determine its type.
X 68 71 75 80 77 54 65 54 50 70
Y 46 60 40 36 41 36 25 31 52 58
Solution
50 52 1 8 -7 49
70 58 6 9 -3 9
Sum 141.5
Also, y has two equal values, 36 and 36; their ranks are 3 and 4, and
their mean $= \dfrac{3 + 4}{2} = 3.5.$
$r = 1 - \dfrac{6(141.5)}{10(100 - 1)} = 0.14.$
There is a direct, weak relation between the scores in statistics and anatomy.
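The ranking with tied values and the rank correlation formula $r = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)}$ can be sketched in Python. The function names average_ranks and spearman_r are hypothetical, and ranks are assigned in ascending order, as in this example; ranking both variables in descending order instead leaves $\sum d^2$ unchanged.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def sum_squared_rank_differences(xs, ys):
    return sum((rx - ry) ** 2 for rx, ry in zip(average_ranks(xs), average_ranks(ys)))

def spearman_r(xs, ys):
    n = len(xs)
    return 1 - 6 * sum_squared_rank_differences(xs, ys) / (n * (n ** 2 - 1))

statistics_scores = [68, 71, 75, 80, 77, 54, 65, 54, 50, 70]
anatomy_scores = [46, 60, 40, 36, 41, 36, 25, 31, 52, 58]

print(average_ranks(statistics_scores))
print(sum_squared_rank_differences(statistics_scores, anatomy_scores))
print(round(spearman_r(statistics_scores, anatomy_scores), 2))
```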
Example 6
The following table gives the score of 10 students in English(x) and
biophysics(y). Find the Spearman’s rank correlation coefficient and
determine its type.
X pass good pass good good v.good pass v.good pass Good
Y good pass pass good v.good Exc. Pass good pass v.good
Solution
x y Rank x=𝑅1 Rank y=𝑅2 d=𝑅1 -𝑅2 𝒅𝟐
Pass Good 2.5 6 -3.5 12.25
Good Pass 6.5 2.5 4 16
Pass Pass 2.5 2.5 0 0
Good Good 6.5 6 0.5 0.25
Good v.good 6.5 8.5 -2 4
v.good Exc. 9.5 10 -0.5 0.25
Pass Pass 2.5 2.5 0 0
v.good Good 9.5 6 3.5 12.25
Pass pass 2.5 2.5 0 0
good v.good 6.5 8.5 -2 4
Sum 49
In the scores of English (x):
(pass) is repeated four times; their ranks are 1, 2, 3 and 4,
their mean $= \dfrac{1+2+3+4}{4} = 2.5.$
And (good) is repeated four times; their ranks are 5, 6, 7 and 8,
their mean $= \dfrac{5+6+7+8}{4} = 6.5.$
In the scores of biophysics (y):
(good) is repeated three times; their ranks are 5, 6 and 7,
their mean $= \dfrac{5+6+7}{3} = 6.$
$r = 1 - \dfrac{6(49)}{10(100 - 1)} = 0.7.$
There is a direct and strong relation between scores of English and
biophysics.
4-2- REGRESSION
Regression Analysis is a very powerful tool in the field of statistical
analysis in predicting the value of one variable, given the value of
another variable (unknown variable from known variable), when those
variables are related to each other.
3-Non-linear regression.
Simple linear regression
It is used to estimate the relationship between two quantitative
variables. You can use simple linear regression when you want to know
the value of the dependent variable at a certain value of the independent
variable .
This method is used to fit the best straight line to a set of points such that
the sum of squares of the deviations of the points from the straight line is
as small as possible.
2- “a” is a constant indicating the slope of the regression line, and it
gives a measure of the change in “y “for a unit change in “x”. It is also
regression coefficient of “y” on “x”.
$b = \dfrac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}$
Remark
• From basic algebra, recall that the slope is a number that describes
the steepness of the line.
• The sign of the slope a determines whether the line slopes upward
or downward, as shown in the figure below:
a) If a > 0, the line slopes upward to the right.
b) If a = 0, the line is horizontal.
c) If a < 0, the line slopes downward to the right.
• More specifically, when the slope a is positive, an increase in x
results in an increase in y. And if the slope a is negative, then an
increase in x results in a decrease in y.
Example 7
x y xy 𝑥2
0 2 0 0
1 3 3 1
2 5 10 4
3 4 12 9
4 6 24 16
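Sum 10 20 49 30
With n = 5: slope $a = \dfrac{5(49) - (10)(20)}{5(30) - (10)^2} = \dfrac{45}{50} = 0.9$, intercept $b = \dfrac{20 - 0.9(10)}{5} = 2.2$.
(a) The least squares regression line is:
$y = 0.9x + 2.2.$
(b) At x = 10, y = 0.9(10) + 2.2 = 11.2.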
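The least squares line of this example (slope 0.9, intercept 2.2) can be reproduced with a short Python sketch. The function name least_squares_line is an assumption; it returns the slope first and the intercept second.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line of y on x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return slope, intercept

x = [0, 1, 2, 3, 4]
y = [2, 3, 5, 4, 6]

slope, intercept = least_squares_line(x, y)
prediction = intercept + slope * 10   # predicted y at x = 10
print(f"slope = {slope:.1f}, intercept = {intercept:.1f}, y(10) = {prediction:.1f}")
```

Note that x = 10 lies outside the observed range 0 to 4, so the prediction is an extrapolation and should be read with caution.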
Important Remarks:
1-The equation of the regression line of y on x is: $y = ax + b$, then:
$a$ (the regression coefficient of y on x):
$a = \dfrac{n \sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{n(\sum_{i=1}^{n} x_i^2) - (\sum_{i=1}^{n} x_i)^2} \longrightarrow (1)$
$b = \dfrac{(\sum_{i=1}^{n} y_i) - a(\sum_{i=1}^{n} x_i)}{n}$
And if the equation of the regression line of x on y is: $x = cy + d$, then:
$c$ (the regression coefficient of x on y):
$c = \dfrac{n \sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{n(\sum_{i=1}^{n} y_i^2) - (\sum_{i=1}^{n} y_i)^2} \longrightarrow (2)$
$d = \dfrac{(\sum_{i=1}^{n} x_i) - c(\sum_{i=1}^{n} y_i)}{n}$
Solution
n=8
𝑥 𝑦 𝑥2 𝑦2 𝑥𝑦
5 15 25 225 75
6 12 36 144 72
4.5 14 20.25 196 63
6.5 13 42.25 169 84.5
7.5 9 56.25 81 67.5
5.5 13 30.25 169 71.5
4 17 16 289 68
8 9 64 81 72
Sum 47 102 290 1354 573.5
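(i) The regression coefficient of y on x:
$a = \dfrac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n(\sum x_i^2) - (\sum x_i)^2} = \dfrac{8(573.5) - (47)(102)}{8(290) - (47)^2} = \dfrac{-206}{111} \approx -1.86.$
(ii) The regression coefficient of x on y:
$c = \dfrac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n(\sum y_i^2) - (\sum y_i)^2} = \dfrac{8(573.5) - (47)(102)}{8(1354) - (102)^2} = \dfrac{-206}{428} \approx -0.48.$
(iii) The Pearson correlation coefficient between x and y (it takes the common sign of a and c, here negative):
$r = -\sqrt{ac} \approx -0.95.$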
Example 9
The following scores represent a nurse’s assessment (X)and a physician’s
assessment (y)of the condition of 10 patients at time of admission to a
trauma center.
X 18 13 18 15 10 12 8 4 7 3
Y 23 20 18 16 14 11 10 7 6 4
(a) Draw a scatter diagram.
(b) Compute the linear regression equation by the least square method.
(c) Predict physician’s assessment for x=9.
(d) Compute the sample correlation coefficient.
Solution
(a)
(c )𝑦 = 10.948.
(d)𝑟 = 0.912.
It is strong direct correlation.
Example 10
Use the following table to compute:
x 42 45 51 58 60 61 66 69 73 75
y 22 23 25 26 28 29 31 31 32 33
(i) Pearson’s correlation coefficient between the variables x and y, and determine
its type.
(ii) The equation of the regression line of y on x.
Solution
x y 𝑥2 𝑦2 𝑥𝑦
42 22 1764 484 924
45 23 2025 529 1035
51 25 2601 625 1275
58 26 3364 676 1508
60 28 3600 784 1680
61 29 3721 841 1769
66 31 4356 961 2046
69 31 4761 961 2139
73 32 5329 1024 2336
75 33 5625 1089 2475
∑ 600 280 37146 7974 17187
n=10
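(i) $r = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - (\sum x_i)^2}\,\sqrt{n \sum y_i^2 - (\sum y_i)^2}} = \dfrac{3870}{\sqrt{11460}\,\sqrt{1340}} \approx 0.99.$
It is a direct, strong correlation.
(ii) $a = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \dfrac{3870}{11460} \approx 0.3377,$
$b = \dfrac{(\sum y) - a(\sum x)}{n} = \dfrac{280 - 0.3377 \times 600}{10} \approx 7.74.$
The equation of the regression line of y on x:
$y = ax + b = 0.3377x + 7.74.$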
Example 11
The following data, give the relation between the two variables x (the
age) and y (no of cases of hypertension reported in clinic A), where:
Solution
Exercise (3)
Calculate Pearson's correlation coefficient between x and y, rank
correlation and determine the linear regression equation, for the
following tables:
1-
x 2 5 1 3 4 1 5
y 24 28 22 26 25 24 26
2-
3-
x 30 20 10 30 10
y 0.9 0.8 0.5 1 0.8
4-
x 45 30 90 60 105 65 90 80 55 75
y 40 35 75 65 90 50 90 80 45 65
Chapter 5
Normal Distribution
Continuous Distribution
A continuous random variable is a variable whose possible values form
some interval of numbers.
Typically, a continuous variable involves a measurement of something,
such as the height of a person, the weight of a newborn baby, or the length
of time a car battery lasts. Continuous curves such as the one shown are
the graphs of functions called probability densities, or informally,
continuous distributions.
Probability densities are characterized by the fact that the area under the
curve between any two values a and b gives the probability that a random
variable having this continuous distribution will take on a value on the
interval from a to b.
Normal distribution
The normal distribution is the most widely known and used of all
distributions. Because the normal distribution approximates many natural
phenomena so well, it has developed into a standard of reference
for many probability problems.
Characteristics of the Normal distribution
(1) The bell-shaped curve depends on the two values μ (the mean) and σ (the
standard deviation).
(2) Normal distributions are symmetric around their mean (the line 𝑥 =
𝜇, it divides the area into two equal parts).
(3) The mean, median, and mode of a normal distribution are equal.
(5) Normal distributions are denser in the center and less dense in the tails.
(6) Normal distributions are defined by two parameters, the mean (μ) and
the standard deviation (σ).
(10) About 68% of the area of a normal distribution is within one standard
deviation of the mean, i.e., about 2/3 of all cases fall within one standard
deviation of the mean, that is
P(μ − σ ≤ X ≤ μ + σ) = 0.6826.
\[ f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
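The density formula and the one-standard-deviation rule can be illustrated numerically. The sketch below codes the density exactly as written and approximates the area between μ − σ and μ + σ with a simple midpoint rule. The helper names and the choice μ = 82, σ = 4.8 are assumptions made for illustration; any μ and σ > 0 give the same area. The result is about 0.6827, which the table-based figure 0.6826 approximates.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density f(x; mu, sigma^2)."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def area_under_pdf(a, b, mu, sigma, steps=10_000):
    """Midpoint-rule approximation of the area under the density on [a, b]."""
    width = (b - a) / steps
    total = sum(normal_pdf(a + (k + 0.5) * width, mu, sigma) for k in range(steps))
    return total * width

mu, sigma = 82.0, 4.8  # illustrative values only
within_one_sd = area_under_pdf(mu - sigma, mu + sigma, mu, sigma)
print(round(within_one_sd, 4))
```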
The standardized normal distribution
The normal distribution with μ = 0 and σ = 1 is called the standard normal distribution.
As you might suspect from the formula for the normal density function, it
would be difficult and tedious to do the calculus every time we had a new
set of parameters for μ and σ.
So instead, we usually work with the standardized normal distribution,
where μ = 0 and σ = 1, i.e., N (0,1). That is, rather than directly solve a
problem involving a normally distributed variable X with mean μ and
standard deviation σ, an indirect approach is used.
1. We first convert the problem into an equivalent one dealing with a
normal variable measured in standardized deviation units, called a
standardized normal variable. To do this, if X ∼ N(μ, σ²), then
\[ z = \frac{x - \mu}{\sigma} \sim N(0, 1) \]
Example 1
If a random variable has the normal distribution with μ = 82.0 and 𝜎 =
4.8, find the probabilities that it will take on a value
(a) Less than 89.2
(b) Greater than 78.4
(c) Between 83.2 and 88.0
(d) Between 73.6 and 90.4
Solution
(a) We have
\[ z = \frac{89.2 - 82}{4.8} = 1.5 \]
Therefore, the probability is 0.4332 + 0.5 = 0.9332.
(b) We have
\[ z = \frac{78.4 - 82}{4.8} = -0.75 \]
Therefore, the probability is 0.2734 + 0.5 = 0.7734.
(c) We have
\[ z_1 = \frac{83.2 - 82}{4.8} = 0.25, \qquad z_2 = \frac{88 - 82}{4.8} = 1.25 \]
Therefore, the probability is 0.3944 − 0.0987 = 0.2957.
(d) We have
\[ z_1 = \frac{73.6 - 82}{4.8} = -1.75, \qquad z_2 = \frac{90.4 - 82}{4.8} = 1.75 \]
Therefore, the probability is 0.4599 + 0.4599 = 0.9198.
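The four table lookups of Example 1 can be reproduced with Python's standard library, where `statistics.NormalDist().cdf(z)` returns the area to the left of z. This is a sketch for checking the hand calculation rather than a replacement for the table method; the helper `z_score` is a name introduced here. The results agree with the table answers up to rounding in the fourth decimal place.

```python
from statistics import NormalDist

Z = NormalDist()        # standard normal: mu = 0, sigma = 1
mu, sigma = 82.0, 4.8   # parameters given in Example 1

def z_score(x):
    """Standardize x using z = (x - mu) / sigma."""
    return (x - mu) / sigma

p_a = Z.cdf(z_score(89.2))                         # P(X < 89.2)
p_b = 1 - Z.cdf(z_score(78.4))                     # P(X > 78.4)
p_c = Z.cdf(z_score(88.0)) - Z.cdf(z_score(83.2))  # P(83.2 < X < 88.0)
p_d = Z.cdf(z_score(90.4)) - Z.cdf(z_score(73.6))  # P(73.6 < X < 90.4)

for label, p in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(f"({label}) {p:.4f}")
```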
Example 2
As reported in Runner's World magazine, the times of the finishers in the
New York City 10-Km run are normally distributed with mean 61 minutes
and standard deviation 9 minutes.
(a) Determine the percentage of finishers with times between 50 and
70 minutes.
(b) Determine the percentage of finishers with times less than 75 minutes.
Solution
(a) We have
\[ z_1 = \frac{50 - 61}{9} = -1.22, \qquad z_2 = \frac{70 - 61}{9} = 1 \]
Therefore, the probability is 0.3888 + 0.3413 = 0.7301 ⟹ 73.01%
(b) We have
\[ z = \frac{75 - 61}{9} = 1.5556 \approx 1.56 \]
Therefore, the probability is 0.5 + 0.4406 = 0.9406 ⟹ 94.06%
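Software can work with the unstandardized variable directly, which avoids rounding z to two decimals before the table lookup. The sketch below assumes Python's `statistics.NormalDist` with the mean and standard deviation of Example 2; the variable names are illustrative. Because no rounding of z takes place, the percentages come out near 73.05% and 94.01%, slightly different from the table-based answers.

```python
from statistics import NormalDist

# Finishing times in minutes: mean 61, standard deviation 9 (Example 2).
finish_time = NormalDist(mu=61, sigma=9)

p_between = finish_time.cdf(70) - finish_time.cdf(50)  # P(50 < X < 70)
p_below_75 = finish_time.cdf(75)                       # P(X < 75)

print(f"between 50 and 70 minutes: {100 * p_between:.2f}%")
print(f"less than 75 minutes:      {100 * p_below_75:.2f}%")
```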
Example 3
If Z is a standard normal random variable, find the positive real
number "a" which satisfies:
(𝑖)𝑃(𝑍 ≥ 𝑎) = 0.4013 (𝑖𝑖)𝑃(𝑍 ≤ 𝑎) = 0.648
(𝑖𝑖𝑖)𝑃(𝑍 ≥ −𝑎) = 0.8577 (𝑖𝑣)𝑃(𝑍 ≤ −𝑎) = 0.2643
Solution
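(i) P(Z ≥ a) = P(Z ≥ 0) − P(0 ≤ Z ≤ a)
⟹ 0.4013 = 0.5 − P(0 ≤ Z ≤ a) ⟹ P(0 ≤ Z ≤ a) = 0.0987
a = 0.25
(ii) P(Z ≤ a) = P(Z ≤ 0) + P(0 ≤ Z ≤ a)
⟹ 0.648 = 0.5 + P(0 ≤ Z ≤ a) ⟹ P(0 ≤ Z ≤ a) = 0.148
a = 0.38
(iii) Due to symmetry: P(Z ≥ −a) = P(Z ≤ a) = 0.5 + P(0 ≤ Z ≤ a)
⟹ P(0 ≤ Z ≤ a) = 0.8577 − 0.5 = 0.3577
a = 1.07
(iv) Due to symmetry: P(Z ≤ −a) = P(Z ≥ a) = 0.5 − P(0 ≤ Z ≤ a)
⟹ P(0 ≤ Z ≤ a) = 0.5 − 0.2643 = 0.2357
a = 0.63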
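Reading the table "backwards" corresponds to the inverse of the cumulative distribution function. As a hedged cross-check of Example 3, the sketch below uses `statistics.NormalDist().inv_cdf(p)`, which returns the z-value with area p to its left; each part is first rewritten as a left-tail area. The variable names are chosen here for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

a_i = Z.inv_cdf(1 - 0.4013)  # P(Z >= a) = 0.4013  means  P(Z <= a) = 0.5987
a_ii = Z.inv_cdf(0.648)      # P(Z <= a) = 0.648
a_iii = Z.inv_cdf(0.8577)    # P(Z >= -a) = P(Z <= a) = 0.8577 by symmetry
a_iv = -Z.inv_cdf(0.2643)    # P(Z <= -a) = 0.2643, so -a = inv_cdf(0.2643)

for label, a in (("i", a_i), ("ii", a_ii), ("iii", a_iii), ("iv", a_iv)):
    print(f"({label}) a = {a:.2f}")
```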
Example 4
(𝑖)𝑃(−𝑎 ≤ 𝑍 ≤ 𝑎) = 0.9010
(𝑖𝑖)𝑃(−1.4 ≤ 𝑍 ≤ 𝑎) = 0.7270
Solution
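(i) Due to symmetry:
P(−a ≤ Z ≤ a) = P(0 ≤ Z ≤ a) + P(0 ≤ Z ≤ a) = 2P(0 ≤ Z ≤ a) = 0.9010
⟹ P(0 ≤ Z ≤ a) = 0.4505
a = 1.65
(ii) Due to symmetry:
P(−1.4 ≤ Z ≤ a) = P(0 ≤ Z ≤ 1.4) + P(0 ≤ Z ≤ a) = 0.7270
⟹ 0.4192 + P(0 ≤ Z ≤ a) = 0.7270 ⟹ P(0 ≤ Z ≤ a) = 0.3078
a = 0.87
Due to symmetry:
P(0 ≤ Z ≤ a) + 0.2389 = 0.7290 ⟹ P(0 ≤ Z ≤ a) = 0.4901
a = 2.33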
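When the lower limit of the interval is known, the upper limit "a" follows from Φ(a) = P(lower ≤ Z ≤ a) + Φ(lower), where Φ is the standard normal cumulative distribution function. The sketch below applies this to Example 4; the helper `upper_limit` is a hypothetical name introduced here, and the last line handles the case where the table area P(0 ≤ Z ≤ a) is already known.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def upper_limit(lower, prob):
    """Find a such that P(lower <= Z <= a) = prob."""
    return Z.inv_cdf(prob + Z.cdf(lower))

# (i) P(-a <= Z <= a) = 0.9010: by symmetry P(0 <= Z <= a) = 0.9010 / 2.
a_i = Z.inv_cdf(0.5 + 0.9010 / 2)

# (ii) P(-1.4 <= Z <= a) = 0.7270.
a_ii = upper_limit(-1.4, 0.7270)

# Known table area P(0 <= Z <= a) = 0.4901.
a_from_area = Z.inv_cdf(0.5 + 0.4901)

print(round(a_i, 2), round(a_ii, 2), round(a_from_area, 2))
```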
Example (5)
Using the table of the area under the standard normal curve where
Z is a standard normal random variable, find:
(𝑖)𝑃(0.86 ≤ 𝑍 ≤ 1.42)
(𝑖𝑖)𝑃(−1.12 ≤ 𝑍 ≤ 0.64)
(𝑖𝑖𝑖)𝑃(−1.92 ≤ 𝑍 ≤ −0.83)
Solution
(i) P(0.86 ≤ Z ≤ 1.42) = P(0 ≤ Z ≤ 1.42) − P(0 ≤ Z ≤ 0.86)
= 0.4222 − 0.3051 = 0.1171
(ii) Due to symmetry:
P(−1.12 ≤ Z ≤ 0.64) = P(0 ≤ Z ≤ 1.12) + P(0 ≤ Z ≤ 0.64) = 0.3686 + 0.2389
= 0.6075
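All three parts of Example (5) follow the same pattern, P(lo ≤ Z ≤ hi) = Φ(hi) − Φ(lo), regardless of the signs of the limits. The sketch below is an illustrative check with a helper named `prob_between` (a name chosen here); for part (iii) the table method gives 0.4726 − 0.2967 = 0.1759 by symmetry, and the computed values match the table results up to rounding in the fourth decimal place.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def prob_between(lo, hi):
    """P(lo <= Z <= hi) as a difference of left-tail areas."""
    return Z.cdf(hi) - Z.cdf(lo)

p_i = prob_between(0.86, 1.42)
p_ii = prob_between(-1.12, 0.64)
p_iii = prob_between(-1.92, -0.83)

print(f"(i) {p_i:.4f}  (ii) {p_ii:.4f}  (iii) {p_iii:.4f}")
```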
Exercise (4)
(1) If the weights of 1000 persons are normally distributed with mean 80 kg and
standard deviation 5 kg, find:
(i) The number of persons whose weights are more than 92 kg.
(ii) The number of persons whose weights are less than 75 kg.
(iii) The number of persons whose weights are between 60 kg and 98 kg.
(2) If Z is a standard normal random variable, find the positive real
number "a" which satisfies:
(4) When revising 100 books, we find that the misprints are normally
distributed with mean 16 and standard deviation 5. Find the number of
books with less than 10 misprints.
(5) A factory produces tyres for cars whose diameters (X) are
normally distributed with mean μ = 24 and standard deviation σ = 1.5.
Calculate the following:
(i) P(X ≤ 21) (ii) P(X ≥ 25)
(iii) The percentage of tyres whose diameter "a" satisfies 21 ≤ a ≤ 27.
Chapter 6
Test of Hypotheses about Population Mean (𝝁)
This chapter covers hypothesis testing, the second of two general areas of
statistical inference. Hypothesis testing is a topic with which you as a
student are likely to have some familiarity.
Hypotheses are assumptions about the parameters of one or more
populations. We test hypotheses to assess their correctness.
The purpose of hypothesis testing is to help the researcher or administrator
in reaching a decision concerning a population by examining a sample
from that population.
The main steps
2- Test statistic
4- The decision.
Hypothesis
For example, if α = 0.05, this means that there is a 5% chance that you
will reject the null hypothesis (and accept the alternative hypothesis)
when the null hypothesis is actually true.
Critical values
The values of the test statistic that separate the rejection region from the
acceptance region.
Acceptance region
A set of values of the test statistic leading to acceptance of the null
hypothesis (Values of the test not included in the critical region).
Rejection region
A set of values of the test statistic leading to rejection of the null
hypothesis.
Statistical decision
It consists of rejecting or not rejecting the null hypothesis (H0). H0 is
rejected if the computed value of the test statistic falls in the rejection
region, and it is not rejected if the computed value of the test statistic
falls in the acceptance region.
Conclusion
Determine whether or not (H0) can be rejected. If (H0) is rejected, the
statistical conclusion is that the sample data support the alternative
hypothesis (H1).
Two-sided test
If the rejection region is divided between the two tails, the test is called
a two-sided test.
One-sided test
If the rejection region lies in only one tail, the test is called a one-sided test.
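The definitions of critical values, rejection region and one-sided or two-sided tests can be sketched in code for a z-test. The function names and the tail labels "two", "left" and "right" below are conventions assumed for this illustration only; the critical values come from the inverse of the standard normal distribution function.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def critical_values(alpha, tail):
    """Return (lower, upper) critical z-values; None marks a tail with no rejection region."""
    if tail == "two":
        c = Z.inv_cdf(1 - alpha / 2)   # alpha is split equally between the tails
        return (-c, c)
    if tail == "left":
        return (Z.inv_cdf(alpha), None)
    if tail == "right":
        return (None, Z.inv_cdf(1 - alpha))
    raise ValueError("tail must be 'two', 'left' or 'right'")

def reject_h0(z, alpha, tail):
    """True when the computed z falls in the rejection region."""
    lower, upper = critical_values(alpha, tail)
    in_lower_tail = lower is not None and z < lower
    in_upper_tail = upper is not None and z > upper
    return in_lower_tail or in_upper_tail

print(critical_values(0.05, "two"))     # close to (-1.96, 1.96)
print(reject_h0(-2.5, 0.05, "left"))    # z beyond the left critical value
```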
Errors
There are two possible ways to come to the wrong conclusion.
Types of Errors
Type I Error: Rejecting H0 when in fact H0 is actually true.
Type II Error: Accepting H0 when in fact H0 is actually false.

Decision        Accept H0                Reject H0
H0 is True      Correct decision         Incorrect decision (Type I error)
                Probability = 1 − α      Probability = α
To perform a test of hypothesis for the population mean μ when the
population standard deviation σ is known, there are three possible cases,
as follows:
Example 1
Does the evidence support the idea that the average lecture consists of
3000 words if a random sample of the lectures of 16 professors had a
mean of 3472 words, given the population standard deviation is 500
words? Use 𝛼 = 0.01. Assume that lecture lengths are approximately
normally distributed. Show all steps.
Solution
𝜇 = 3000,
𝜎 = 500,
𝑋̅ = 3472,
𝑛 = 16,
𝛼 = 0.01
1- 𝐻0 : 𝜇 = 3000
2- 𝐻1 : 𝜇 ≠ 3000
3- 𝛼 = 0.01
4- \[ z_{\text{calculated}} = z_c = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{3472 - 3000}{500 / \sqrt{16}} = 3.78 \]
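The test statistic of Example 1 can be reproduced as follows. This is a sketch under the assumption that the test is two-tailed (H1: μ ≠ 3000) at α = 0.01, so the critical value is taken as the z-value leaving α/2 in the upper tail; the function name `z_statistic` is chosen here for illustration.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

z = z_statistic(sample_mean=3472, mu0=3000, sigma=500, n=16)

alpha = 0.01
critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
reject = abs(z) > critical

print(f"z = {z:.2f}, critical value = ±{critical:.3f}, reject H0: {reject}")
```

Under these assumptions the computed z lies beyond the critical value, so H0 would be rejected at the 1% level.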
Example 2
A certain breed of rats shows a mean weight gain of 65 gm during the first 3 months
of life. Sixteen of these rats were fed a new diet from birth until the age of 3 months.
The mean weight gain was 60.75 gm. If the population variance is 10 gm², is there
reason to believe, at the 5% level of significance, that the new diet causes a change
in the average amount of weight gained?
Solution
H0: μ = 65, H1: μ ≠ 65, n = 16, σ² = 10, X̄ = 60.75, α = 0.05
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{60.75 - 65}{\sqrt{10 / 16}} = -5.38 \]
Since the calculated value falls in the rejection region (|Z| = 5.38 > 1.96), we
reject H0 and accept H1.
Example 3
Grain millers claim that the average weight of a bag of maize flour is
80 kg. If a random sample of 100 bags had a mean of 79 kg and a standard
deviation of 4 kg, test whether the average weight of the bags is less than
80 kg at the 5% level of significance.
Solution:
The model is normal.
H0: μ = 80 kg
H1: μ < 80 kg
σ = 4, X̄ = 79, n = 100, α = 0.05.
The z-value that leaves 5% of the area to its left is −1.645.
\[ z = \frac{79 - 80}{4 / \sqrt{100}} = -2.5 \]
−2.5 is in the rejection region, so we reject the null hypothesis, i.e.,
we accept the alternative hypothesis that the average weight of bags
is less than 80 kg.
Example 4
𝐻1 : 𝜇 ≠ 58
𝜎=2
This is a two-tailed test, i.e., there is an area of 1% in each tail.
Since 6.3245 is greater than 2.33, we reject H0 and conclude that there is
a significant change in the performance of students at the 2% level of
significance.
Example 5
Solution
Step 4: Make the decision. Since the test value, +1.32, is less than the
critical value, +1.65, and not in the critical region, the decision is “Do not
reject the null hypothesis.”
Example 6
In a certain community, a claim is made that the average income of all
employed individuals is $35,500. A group of citizens suspects this value
is incorrect and gathers a random sample of 140 employed individuals in
hopes of showing that $35,500 is not the correct average. The mean of
the sample is $34,325 with a population standard deviation of $4,200.
Test at α = 0.10. Show all steps.
Solution
μ = 35,500, σ = 4,200, x̄ = 34,325, n = 140, α = 0.10
(1) H0: μ = 35,500
(2) H1: μ ≠ 35,500
(3) α = 0.10
(5) \[ z = \frac{34325 - 35500}{4200 / \sqrt{140}} = -3.31 \]
(6) Reject H0, because −3.31 < −1.645.
p-Value
Assuming that the null hypothesis is true, the p-value can be defined as
the probability that a sample statistic (such as the sample mean) is at
least as far away from the hypothesized value in the direction of the
alternative hypothesis as the one obtained from the sample data under
consideration.
Note that the p-value is the smallest significance level at which the null
hypothesis is rejected.
[Figure: the p-value for a right-tailed test.]
For a two-tailed test, the p-value is twice the area in the tail of the
sampling distribution curve beyond the observed value of the sample
statistic. Each of the areas in the two tails gives one-half the p-value.
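The tail-area description of the p-value translates directly into code for a z-test. The sketch below is illustrative: the function name `z_p_value` and the tail labels "left", "right" and "two" are assumptions made here, and Φ is evaluated with `statistics.NormalDist().cdf`.

```python
from statistics import NormalDist

def z_p_value(z, tail):
    """p-value of an observed z under H0 for a left-, right- or two-tailed test."""
    Z = NormalDist()
    if tail == "left":
        return Z.cdf(z)                   # area to the left of z
    if tail == "right":
        return 1 - Z.cdf(z)               # area to the right of z
    if tail == "two":
        return 2 * (1 - Z.cdf(abs(z)))    # twice the area beyond |z|
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(round(z_p_value(-2.0, "left"), 4))
print(round(z_p_value(-2.0, "two"), 4))
```

H0 is rejected whenever the chosen significance level α is greater than the returned p-value.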
Example 7
The mean GPA at a certain university is 2.80 with a population standard
deviation of 0.3. A random sample of 16 business students from this
university had a mean of 2.91. Test to determine whether the mean GPA
for business students is greater than the university mean at the 0.10 level
of significance. Show all steps.
Solution
𝜇 = 2.80, 𝜎 = 0.3, 𝑥̅ = 2.91, 𝑛 = 16, 𝛼 = 0.10
(1)𝐻0 : 𝜇 = 2.8
(3)𝛼 = 0.10
(4) \[ z = \frac{2.91 - 2.80}{0.3 / \sqrt{16}} = 1.47 \]
Example 8
A study by the Web metrics firm Experian showed that in August of 2011,
the mean time spent per visit to Facebook was 20.8 minutes with a
population standard deviation of 8 minutes. Suppose a simple random
sample of 60 visits in August 2013 has a mean of 21.5 minutes. A social
scientist is interested to know whether the mean time of Facebook visits
has changed. Use α = 0.05. Show all steps.
Solution
𝜇 = 20.8, 𝜎 = 8, 𝑥̅ = 21.5, 𝑛 = 60, 𝛼 = 0.05
(1)𝐻0 : 𝜇 = 20.8
(2)𝐻1 : 𝜇 ≠ 20.8
(3)𝛼 = 0.05
(4) \[ z = \frac{21.5 - 20.8}{8 / \sqrt{60}} = 0.68 \]
Example 9
The management of Priority Health Club claims that its members lose an
average of 10 pounds or more within the first month after joining the
club. A consumer agency that wanted to check this claim took a random
sample of 36 members of this health club and found that they lost
an average of 9.2 pounds within the first month of membership. The
population standard deviation is known to be 2.4 pounds. Find the p-
value for this test. What will your decision be if 𝛼 = 0.01? What if
𝛼 = 0.05?
Solution
Let μ be the mean weight lost during the first month of membership by
all members of this health club, and let x̄ be the corresponding mean for
the sample. From the given information,
𝑛 = 36, 𝑥̅ = 9.2 𝑝𝑜𝑢𝑛𝑑𝑠, 𝑎𝑛𝑑 𝜎 = 2.4 𝑝𝑜𝑢𝑛𝑑𝑠.
The claim of the club is that its members lose, on average, 10 pounds or
more within the first month of membership. To perform the test using
the p-value approach, we apply the following four steps.
Step 1. State the null and alternative hypotheses.
H0: μ ≥ 10 (the mean weight lost is 10 pounds or more)
H1: μ < 10 (the mean weight lost is less than 10 pounds)
Step 2. Select the distribution to use.
Here, the population standard deviation is known, and the sample size is
large (n > 30). Hence, the sampling distribution of x̄ is normal with its
mean equal to μ and the standard deviation equal to σ_x̄ = σ/√n. Consequently,
we will use the normal distribution to find the p-value and perform the
test.
Step 3. Calculate the p-value.
The < sign in the alternative hypothesis indicates that the test is left-
tailed.
The p-value is given by the area to the left of 𝑥̅ = 9.2 under the
sampling distribution curve of 𝑥̅ , as shown in the next figure.
To find this area, we first find the z value for x̄ = 9.2 as follows:
\[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{9.2 - 10}{2.4 / \sqrt{36}} = -2.00 \]
From the normal distribution table, the area to the left of z = −2.00
is 0.0228:
P(z < −2) = 0.5 − P(0 < z < 2) = 0.5 − 0.4772 = 0.0228
Consequently,
p-value = 0.0228.
Step 4. Make a decision.
Thus, based on the p-value of .0228, we can state that for any 𝛼
(significance level) greater than .0228 we will reject the null hypothesis
stated in Step 1, and for any 𝛼 less than or equal to .0228 we will not
reject the null hypothesis.
Since 𝛼 = 0.01 is less than the p-value of .0228, we do not reject the
null hypothesis at this significance level. Consequently, we conclude
that the mean weight lost within the first month of membership by the
members of this club is 10 pounds or more.
Now, because 𝛼 = 0.05 is greater than the p-value of .0228, we reject
the null hypothesis at this significance level. Therefore, we conclude that
the mean weight lost within the first month of membership by the
members of this club is less than 10 pounds.
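The p-value approach of Example 9 can be reproduced in a few lines. This sketch assumes a left-tailed z-test with the figures given in the example and applies the decision rule stated in Step 4 (reject H0 when α is greater than the p-value); the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Figures from Example 9; mu0 is the boundary value of the club's claim.
n, sample_mean, sigma, mu0 = 36, 9.2, 2.4, 10.0

z = (sample_mean - mu0) / (sigma / math.sqrt(n))
p_value = NormalDist().cdf(z)  # left-tailed test: area to the left of z

decisions = {}
for alpha in (0.01, 0.05):
    decisions[alpha] = "reject H0" if alpha > p_value else "do not reject H0"
    print(f"alpha = {alpha}: {decisions[alpha]}")

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```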
Exercise (5)
References
1-Biostatistics: Basic Concepts and Methodology for the Health Sciences,
10th Edition International Student Version.