Ch.3-Estimation module
Topic outline
1. Concepts of statistical estimation
2. Synopsis
3. Wrap up discussion questions
4. Next session’s assignment
Reading Text:
The sampling process is used to draw statistical inferences about the characteristics of a
population or process of interest. On many occasions we do not have enough
information to calculate the exact value of a population parameter (such as μ, σ and P),
and therefore we make the best estimate of this value from the corresponding sample
statistic (such as x̄, s, and p). The need to use sample statistics to draw conclusions
about population characteristics is one of the fundamental applications of statistical
inference in business and economics. For instance, statistical estimation could be used
in the following cases:
A bank needs to understand the proportion of consumers aware of its
services and credit schemes.
Any service centre needs to determine the average amount of time a
customer spends in queue.
In all such cases, a decision-maker needs the two concepts of estimation and
hypothesis testing, which are used to draw statistical inferences about unknown
population or process parameters from random samples. In this section
we discuss methods to estimate unknown population parameters and then to
determine the range of values (confidence interval) likely to contain the parameter
value. Estimation is the procedure of assigning a numerical value to a population
parameter based on information collected from a random sample, through the
corresponding sample statistic. There are two types of estimates that we can make
about a population: a point estimate and an interval estimate. Point estimation is a
statistical procedure in which we use a single value to estimate an unknown population
parameter. A point estimate is a single number, obtained from random sample data,
that is used as an estimate of the unknown population parameter.
For example, suppose that in a sample of 5,000 households the mean housing
expenditure per month is Birr 450. Then, using x̄ as a point estimate of
µ, we can state that the mean housing expenditure per month µ for all households is
about Birr 450. This is point estimation. Instead of saying that the mean housing
expenditure per month for all households is Birr 450, we can obtain an interval by
subtracting a number from, and adding the same number to, Birr 450, and then state
that this interval contains the population mean µ. For purposes of illustration, we can
subtract and add Birr 50 to Birr 450, which gives the interval Birr 400 to Birr 500.
This interval is likely to contain the population mean µ. This procedure is called
interval estimation. Birr 400 is called the lower limit of the interval and Birr 500 the
upper limit; the interval itself is the interval estimate.
A drawback of a point estimate compared with an interval estimate is that the former is
a single value taken from a sampling distribution (not a range of values), whereas the
unknown parameter may lie above or below the estimate. A point estimate also conveys
little information about its accuracy; it does not tell us how confident
we can be that the estimate is close to the parameter it is estimating. In contrast,
interval estimation gives the estimate as a range and specifies the level of
confidence concerning the reliability of the estimate. When we make an estimate of a
population parameter, we use a sample statistic. This sample statistic is an estimator;
refer to Table 27.1.
Population parameter        Point estimator (sample statistic)
Mean, µ                     $\bar{x} = \sum X_i / n$
Variance, σ²                $S^2 = \sum (X_i - \bar{x})^2 / (n - 1)$
Standard deviation, σ       $S = \sqrt{S^2}$
Proportion, π or P          $p = X / n$
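As an illustration of the table above, the following minimal Python sketch computes each point estimator from a small sample. The data and the Birr 450 threshold used for the proportion are hypothetical values chosen only for illustration; they do not come from the text.

```python
import statistics

# Hypothetical sample: monthly housing expenditure (Birr) of 8 households.
sample = [420, 455, 470, 430, 445, 480, 460, 440]

n = len(sample)
x_bar = statistics.mean(sample)      # point estimate of the mean (sum of X_i / n)
s2 = statistics.variance(sample)     # sample variance, divides by n - 1
s = statistics.stdev(sample)         # sample standard deviation, sqrt of s2

# Proportion: X = number of sampled households spending more than Birr 450
# (the threshold is an assumption made for this illustration).
x_count = sum(1 for value in sample if value > 450)
p = x_count / n                      # point estimate of the proportion P

print(f"mean = {x_bar}, variance = {s2:.2f}, sd = {s:.2f}, proportion = {p}")
```

Note that `statistics.variance` and `statistics.stdev` use the n − 1 divisor, which matches the sample formulas in the table.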
The best estimator should be highly reliable and have the following desirable
properties:
Interval estimate: a range of values within which the population parameter is expected
to lie (it has upper and lower bounds).
Confidence interval: a range of values constructed from sample data so that the parameter
lies within that range with a specified probability. The specified probability is called the
level of confidence and is denoted by 1 − α. The confidence level in decimal form is called
the confidence coefficient. Common values are 90%, 95% and 99%; the corresponding
confidence coefficients are 0.90, 0.95 and 0.99. α is called the significance level.
Confidence level (1 − α): the probability specifying the level of accuracy of the
interval estimate; it indicates the degree of certainty that the interval estimate contains
the parameter being estimated. Although any confidence level can be chosen,
the most popular are 90%, 95% and 99%.
[Figure: standard normal curve; the central area of 95% (1 − α) lies between z = −1.96 and z = 1.96, centred at 0.]
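The z values that bound the central area for the common confidence levels can be read from a Z-table; as a rough cross-check, the sketch below obtains them from Python's standard library. The helper name `z_value` is an assumption introduced here for illustration.

```python
from statistics import NormalDist

def z_value(confidence):
    """Two-sided critical value z such that P(-z < Z < z) = confidence."""
    alpha = 1 - confidence
    # The upper tail holds alpha / 2, so z is the (1 - alpha / 2) quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_value(level):.3f}")
```

The values come out as approximately 1.645, 1.960 and 2.576, which agrees with the ±1.96 bounds for the 95% level.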
Synopsis
Distinguish between the point estimation and interval estimation; how is the
latter better than the former?
What are the properties of a good estimator? Explain
How is the confidence interval estimate of a population parameter developed?
What is the margin of error?
What is the standard error?
Compute and interpret a confidence interval for a population mean using the
normal distribution
Topic outline
The standard error of the sampling distribution of sample means is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ or $S_{\bar{x}} = \frac{S}{\sqrt{n}}$.
The margin of error is $E = Z \cdot \sigma_{\bar{x}} = Z \cdot \frac{\sigma}{\sqrt{n}}$ or $E = Z \cdot S_{\bar{x}} = Z \cdot \frac{S}{\sqrt{n}}$,
where n = sample size; x̄ = sample mean; Z is based on the confidence level (divide the given
confidence level by two and read the corresponding z value from the Z-table); σ =
population standard deviation; and S = sample standard deviation.
Note: as the sample size increases the standard error decreases. When sampling from
the same population, using a fixed sample size, the higher the confidence level, the
wider the confidence interval.
Example 28
1. The sponsor of a TV program targeted at the children's market (age 4-10) wants
to find out the average amount of time children spend watching TV. A random
sample of 100 children indicated the average time spent by these children
watching TV per week to be 27.2 hours. From previous experience, the
population standard deviation of weekly TV watching time is known to be 8 hours.
A confidence level of 95% is adequate.
a. What is the population mean of weekly TV watching time for children?
b. What is the best estimate of the population mean? What is this value
called?
c. Develop a 95% confidence interval for the population mean of weekly TV
watching time.
d. Interpret the confidence interval.
Solution:
(1 − α) = 99% = 0.99, $Z_{0.99/2} = Z_{0.4950} \approx 2.58$; use $\bar{x} \pm Z \cdot \frac{S}{\sqrt{n}}$
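For Example 28 as stated (n = 100, x̄ = 27.2 hours, σ = 8 hours, 95% confidence), the interval can be sketched as follows. The population mean itself is unknown; the sample mean of 27.2 hours is its best estimate and is called the point estimate. The value Z = 1.96 is the usual Z-table value for 95% confidence.

```python
import math

n = 100        # sample size
x_bar = 27.2   # sample mean, hours per week (the point estimate)
sigma = 8      # known population standard deviation, hours
z = 1.96       # Z-table value for 95% confidence

std_error = sigma / math.sqrt(n)   # 8 / 10 = 0.8
margin = z * std_error             # 1.96 * 0.8 = 1.568
lower, upper = x_bar - margin, x_bar + margin

print(f"point estimate = {x_bar}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")   # 25.632 to 28.768
```

One way to interpret the result: if many samples of 100 children were drawn and an interval built from each in the same way, about 95% of those intervals would contain the population mean of weekly TV watching time.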
Synopsis
The confidence interval for the population mean µ at a given confidence level
(1-𝛼) is computed as follows:
o $\bar{x} \pm Z \cdot \frac{\sigma}{\sqrt{n}}$ or $\bar{x} \pm Z \cdot \frac{S}{\sqrt{n}}$
Topic outline
Reading Text:
The estimator of the population proportion P is the sample proportion p. If the sample
size is large, p has an approximately normal sampling distribution. For estimating P, as
a rule of thumb, a sample is considered large enough when both n*p and n*q are
greater than 5. The mean of the sampling distribution of p is the population proportion
P, and the standard error (standard deviation) of the sampling distribution of p is
$\sigma_p = \sqrt{\frac{P(1 - P)}{n}}$, where 1 − P is denoted by Q (and, for the sample, 1 − p by q).
Since the standard deviation of the estimator depends on the unknown population
parameter, its value is also unknown to us. It turns out, however, that for large samples
we may use our actual estimate p instead of the unknown parameter P in the formula
for the standard deviation. The (1 − α) confidence interval for the population proportion
P is given by
$p \pm Z \cdot \sqrt{\frac{p(1 - p)}{n}}$
where p is the sample proportion, Z is based on the confidence level, and n is the
sample size.
Example 29
A market research firm wants to estimate the share that foreign companies have in the
U.S. market for certain products. A random sample of 100 consumers is obtained, and
34 people in the sample are found to be users of foreign-made products; the rest are
users of domestic products. Give a 95% confidence interval for the share of foreign
products in this market.
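We have x = 34 and n = 100, so p = 34/100 = 0.34.
$p \pm Z \cdot \sqrt{\frac{p(1 - p)}{n}} = 0.34 \pm 1.96 \cdot \sqrt{\frac{0.34(1 - 0.34)}{100}}$
= 0.34 ± 1.96 × 0.04737 = 0.34 ± 0.0928, that is, from 0.2472 to 0.4328.
We are 95% confident that foreign products hold between about 24.7% and 43.3% of this market.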
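The arithmetic of Example 29 can be reproduced with a short Python sketch. It also checks the large-sample rule of thumb (np and nq greater than 5) before using the normal approximation. The variable names are assumptions introduced here; Z = 1.96 is the Z-table value for 95% confidence.

```python
import math

x, n = 34, 100   # users of foreign-made products, sample size
z = 1.96         # Z-table value for 95% confidence

p = x / n        # sample proportion
q = 1 - p

# Rule of thumb from the text: both n*p and n*q should exceed 5.
large_enough = n * p > 5 and n * q > 5

std_error = math.sqrt(p * q / n)   # estimated standard error of p
margin = z * std_error
lower, upper = p - margin, p + margin

print(f"p = {p}, large sample: {large_enough}")
print(f"95% CI: {lower:.4f} to {upper:.4f}")
```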
Exercise 29
Synopsis
For large samples (np and nq > 5), the confidence interval for the population
proportion P is found by: $p \pm Z \cdot \sqrt{\frac{p(1 - p)}{n}}$
Topic outline
1. Characteristics of t-distribution
2. Developing confidence interval for the population mean using t-distribution
3. Synopsis
4. Wrap up discussion questions
5. Next session’s assignment
Reading Text:
So far, it was indicated that use of the normal distribution in estimating a population
mean is warranted for any large sample (n ≥ 30), and for a small sample (n<30) only if
the population is normally distributed and population standard deviation, σ is known.
However, when the sample is small (n<30) and the population is normal or
approximately normal, but σ is not known, we cannot use the normal distribution for
determining confidence intervals for the unknown population mean, but we can use the
t-distribution.
Note that when σ is approximated by the sample standard deviation (S), the standard error
($S_{\bar{x}}$, i.e. $\frac{S}{\sqrt{n}}$) will be somewhat different from sample to sample, due to the variability of
S. As a result, when S is used in the Z conversion formula for small samples, the
converted values are not distributed as Z values. Instead, they are distributed
according to the t distribution. This distribution was developed by William S. Gosset in
1908. Gosset worked at the Guinness brewery in Ireland and published a paper about the
t-distribution under the pen name ‘Student’. The t distribution shares many
characteristics with the z distribution.
Characteristics of t-distribution:
It is a continuous distribution.
It is bell shaped and symmetrical.
There is no one t distribution, but rather a family of t-distributions. All have the
same mean of 0, but the standard deviation varies according to sample size; thus,
a different t distribution exists for each sample size (refer to Fig. 30.1).
It is more spread out (flatter) and wider than the z distribution, thus:
o The standard deviation of t is greater than that of z, that is, greater than
one. The variance of the t-distribution is df/(df − 2) for df > 2, where
df stands for degrees of freedom.
o The value of t for a given level of confidence is larger in magnitude than
the corresponding z value.
It approaches the z distribution as the sample size n increases (refer to Fig. 30.1). In other
words, the t-distribution is approximately normal for n ≥ 30.
The t-distribution is defined by the degrees of freedom (df), which equal n − 1 and
are its only parameter. The degrees of freedom are the number of items in a
sample that are free to vary. To illustrate the meaning of degrees of freedom:
Assume that the mean of four numbers is known to be 5. The four numbers are 7,
4, 1, and 8. The deviations of these numbers from the mean must total 0. The
deviations of +2, −1, −4, and +3 do total 0. If the deviations of +2, −1, and −4 are
known, then the value of +3 is fixed (restricted) in order to satisfy the condition that
the sum of the deviations must equal 0. Thus, 1 degree of freedom is lost in a
sampling problem involving the standard deviation of the sample because one
number (the arithmetic mean) is known.
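The characteristics listed above can be checked numerically. The sketch below is one possible illustration using only the Python standard library: it integrates the t density with Simpson's rule and finds critical values by bisection. This numerical approach and the helper names (`t_pdf`, `t_cdf`, `t_critical`) are assumptions made here for illustration; in practice the values are read from a t table.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """Two-sided critical value t, found by bisection on the CDF."""
    target = 0.5 + confidence / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)   # 95% two-sided z value
for n in (5, 10, 30, 121):
    df = n - 1
    t = t_critical(0.95, df)
    print(f"n = {n}, df = {df}: t = {t:.3f}, variance = {df / (df - 2):.3f}")
print(f"z = {z:.3f}")
```

For every df the t value exceeds the z value and the variance df/(df − 2) exceeds one, and both move toward the normal values as n grows.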
Computing t values:
$t = \frac{\bar{x} - \mu}{S / \sqrt{n}}$
where x̄ is the sample mean of n measurements, µ is the population mean, and S is the
sample standard deviation.
When the population standard deviation (σ) is not known and n < 30, we cannot simply
substitute S for σ and use the z distribution. In such cases, the t-distribution is
used to construct a confidence interval for estimating the population mean µ, using the
following formula:
The (1 − α) confidence interval for µ is $\bar{X} \pm t_{df} \cdot S_{\bar{X}} = \bar{X} \pm t_{df} \cdot \frac{S}{\sqrt{n}}$
Example 30
The mean operating life for a random sample of (n=10) light bulbs is X=4,000 hours,
with the sample standard deviation S=200 hours. The operating life of bulbs in general
is assumed to be approximately normally distributed.
Required: estimate the mean operating life for the population of bulbs from which this
sample was taken, using a 95 percent confidence interval.
Solution:
Given: n = 10; X (a point estimate) = 4000; S = 200, and confidence level = 95% or
0.95; 𝛼=0.05; degree of freedom =n-1=10-1=9; area in each tail =0.5 - (0.95/2) =
0.5 - 0.4750 = 0.025
From the t distribution table, the value of t for df = 9 and 0.025 area in the right tail
(or an area of 0.05 in the two tails combined) is 2.262.
The 95% confidence interval for µ is $\bar{X} \pm t_{df} \cdot \frac{S}{\sqrt{n}} = 4000 \pm 2.262 \cdot \frac{200}{\sqrt{10}}$
= 4000 ± 2.262 × 63.25 = 4000 ± 143.07, that is, 3,856.93 ≤ µ ≤ 4,143.07
Thus, we can state with 95% confidence that the mean operating life for all bulbs lies
approximately between 3857 hours and 4143 hours.
Sometimes you might be provided with the raw data for the sample (10 bulbs). Under
this condition, you are first required to calculate the sample mean and the sample
standard deviation.
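A minimal Python sketch of both situations follows: first Example 30 from its summary statistics, then the same steps starting from raw data. The ten bulb lifetimes are hypothetical values invented for illustration, and `t_interval` is a helper name assumed here. The t value 2.262 is the table value for df = 9 at 95% confidence.

```python
import math
import statistics

def t_interval(x_bar, s, n, t_value):
    """Confidence interval x_bar +/- t * s / sqrt(n)."""
    margin = t_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

T_95_DF9 = 2.262   # t table: df = 9, area 0.025 in each tail

# Example 30 from summary statistics: n = 10, mean = 4000, S = 200.
lower, upper = t_interval(4000, 200, 10, T_95_DF9)
print(f"Example 30: {lower:.2f} to {upper:.2f}")

# Starting from raw data: compute the mean and standard deviation first.
# These lifetimes (hours) are hypothetical, not taken from the text.
lifetimes = [3900, 4100, 3800, 4200, 4000, 3950, 4050, 4300, 3700, 4000]
raw_mean = statistics.mean(lifetimes)
raw_sd = statistics.stdev(lifetimes)   # uses the n - 1 divisor
raw_lower, raw_upper = t_interval(raw_mean, raw_sd, len(lifetimes), T_95_DF9)
print(f"Raw data: mean = {raw_mean}, sd = {raw_sd:.2f}, "
      f"CI = {raw_lower:.2f} to {raw_upper:.2f}")
```

Without rounding the standard error to 63.25, the margin for Example 30 comes out at about 143.06 rather than 143.07; the difference is only a rounding effect.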
Exercise 30
1. A doctor wanted to estimate the mean cholesterol level for all adult males. He
took a sample of 25 adult males and found that the mean cholesterol level for
this sample is 186 with a standard deviation of 12. Assume that the cholesterol
levels for all adult males are (approximately) normally distributed. Construct a
95% confidence interval for the population mean µ.
2. The high cost of health care is a matter of major concern for a large number of
families. A random sample of 25 families selected from an area showed that they
spend an average of Birr 30 per month on health care with a standard deviation
of Birr 10. Make a 98% confidence interval for the mean health care expenditure
per month incurred by all families in this area. Assume that the monthly health
care expenditures of all families in this area have a normal distribution.
Synopsis
Briefly explain the similarities and the difference between the standard
normal distribution and the t distribution.
What are the parameters of a normal distribution and a t-distribution?
Briefly explain the meaning of the degrees of freedom for a t distribution.
What assumptions must hold true to use the t distribution to make a
confidence interval for µ?
Topic outline
Reading Text:
The populations sampled so far have been very large or infinite. What if the sampled
population is not very large? Some adjustments in the way the standard error of the
sample means and the standard error of the sample proportions are computed are
required. Thus, for a finite population, where the total number of objects or individuals
is N and the number of objects or individuals in the sample is n, we need to adjust the
standard errors in the confidence interval formulas for the population mean and
proportion. This adjustment is called the finite-population correction factor (FPC).
In particular, it is needed when the sampling is done without replacement from a small
population and the sample constitutes more than 5% of the population
(n/N > 0.05).
$FPC = \sqrt{\frac{N - n}{N - 1}}$; multiplying the standard error by this correction factor reduces the standard
error. Logically, if the sample is a substantial percentage of the population, the estimate
of the population parameter is more precise. As N becomes larger relative to n,
n/N becomes small and so the FPC approaches unity. If n/N ≤ 0.05, in other words if the
sample size is not more than 5% of the population size, then the FPC may be omitted.
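Accordingly, to develop a confidence interval for the mean from a finite population with
unknown population standard deviation, the formula is as follows:
$\bar{X} \pm t_{df} \cdot \frac{S}{\sqrt{n}} \cdot \sqrt{\frac{N - n}{N - 1}}$
and, for the proportion:
$p \pm Z \cdot \sqrt{\frac{p(1 - p)}{n}} \cdot \sqrt{\frac{N - n}{N - 1}}$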
Example 31
1. Suppose, 250 families reside around Unity University; and a random sample of 40 of
these families revealed their mean annual community contribution was $450 and the
standard deviation of this was $75.
a. What is the population mean? What is the best estimate of the population mean
annual contribution?
b. Develop a 90% confidence interval for the population mean. What are the
endpoints of the confidence interval?
c. Using the confidence interval, explain why the population mean could be $445.
Could the population mean be $425? Why?
Solution:
a. We do not know the population mean. This is the value we wish to estimate. The
best estimate we have of the population mean is the sample mean, which is
$450.
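b. Given: X̄ = $450, S = $75, N = 250, n = 40, df = 40 − 1 = 39, 1 − α = 0.90, t39 = 1.685
$\bar{X} \pm t_{df} \cdot \frac{S}{\sqrt{n}} \cdot \sqrt{\frac{N - n}{N - 1}} = 450 \pm 1.685 \cdot \frac{75}{\sqrt{40}} \cdot \sqrt{\frac{250 - 40}{250 - 1}}$
= 450 ± 19.98 × √0.8434 = 450 ± 19.98 × 0.9184 = 450 ± 18.35
The endpoints of the confidence interval are $431.65 and $468.35.
c. $445 is a possible value of the population mean because it lies within the confidence
interval; $425 is not likely because it lies outside the interval.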
2. The same study on community contributions, in the above case, revealed that 15 of
the 40 families sampled participate in community wide green initiatives regularly.
Construct the 95% confidence interval for the proportion of families participating in
community wide green initiatives regularly.
Given: N=250, n=40, p=15/40=0.375
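Both parts of Example 31 can be sketched in Python as follows. The sample is 16% of the population (40/250), so the finite-population correction applies. The values t = 1.685 (df = 39, 90% confidence) and Z = 1.96 (95% confidence) are table values; the function and variable names are assumptions introduced here.

```python
import math

def fpc(N, n):
    """Finite-population correction factor: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

N, n = 250, 40
needs_fpc = n / N > 0.05   # 0.16 > 0.05, so the correction is applied

# Part 1: 90% interval for the mean annual contribution.
x_bar, s, t_value = 450, 75, 1.685   # t table value for df = 39
margin_mean = t_value * s / math.sqrt(n) * fpc(N, n)
print(f"mean: {x_bar - margin_mean:.2f} to {x_bar + margin_mean:.2f}")

# Part 2: 95% interval for the proportion in green initiatives.
p, z = 15 / n, 1.96                  # p = 0.375
margin_prop = z * math.sqrt(p * (1 - p) / n) * fpc(N, n)
print(f"proportion: {p - margin_prop:.4f} to {p + margin_prop:.4f}")
```

Under these assumptions the proportion interval runs from roughly 0.237 to 0.513. Here np = 15 and nq = 25 both exceed 5, so the normal approximation is reasonable.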
Synopsis:
Adjust (multiply) the standard errors in the confidence interval formulas for the
population mean and proportion by the finite-population correction factor (FPC);
For sampling done without replacement from a small population;
when the sample constitutes more than 5% of the population (n/N>0.05)
$FPC = \sqrt{\frac{N - n}{N - 1}}$;
What are the conditions to use the finite-population correction factor while
developing confidence intervals for µ and P?
How is the FPC used?
Reading Text:
Because the resources at a researcher's disposal are limited, we avoid taking a
census or a very large sample whenever a smaller sample can satisfactorily
achieve the research objective. Too large a sample wastes resources; too small a sample
may not be representative, making the resulting conclusion uncertain.
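From previous sections we understand that the standard errors $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ and $\sigma_p = \sqrt{\frac{PQ}{n}}$ of the
sampling distributions of the sample statistics x̄ and p are inversely related to the sample size n.
An equation for determining sample size can be derived from the margin of error (E)
formula by solving for n.
For the mean: $E = Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$; accordingly, $n = \frac{Z_{\alpha/2}^2 \cdot \sigma^2}{E^2}$
For the proportion: $E = Z_{\alpha/2} \cdot \sqrt{\frac{PQ}{n}}$; accordingly, $n = \frac{Z_{\alpha/2}^2 \cdot P \cdot Q}{E^2}$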
When samples are drawn without replacement from a finite population of size N, the
use of the finite population correction factor reduces the standard error by a factor of
$\sqrt{\frac{N - n}{N - 1}}$. Accordingly, the sample size formulas for estimating the
population mean and proportion are adjusted for the finite population correction factor.
The revised sample size, taking into consideration the size of the population, is given by
$n = \frac{n_0 \cdot N}{n_0 + (N - 1)}$;
where n0 = sample size obtained from the formulas above without the correction, and
N = population size.
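The sample size formulas above can be sketched in Python as follows. The inputs (σ = 20, E = 4, P = 0.5, E = 0.05, N = 1000) are hypothetical values chosen for illustration, the function names are assumptions introduced here, and rounding the result up to the next whole number is a common convention rather than something stated in the text.

```python
import math
from statistics import NormalDist

def z_for(confidence):
    """Two-sided z value for the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def n_for_mean(sigma, E, confidence):
    """n = Z^2 * sigma^2 / E^2."""
    return (z_for(confidence) * sigma / E) ** 2

def n_for_proportion(P, E, confidence):
    """n = Z^2 * P * Q / E^2, with Q = 1 - P."""
    return z_for(confidence) ** 2 * P * (1 - P) / E ** 2

def adjust_for_finite(n0, N):
    """Revised sample size for a finite population: n0 * N / (n0 + N - 1)."""
    return n0 * N / (n0 + (N - 1))

# Hypothetical inputs, 95% confidence throughout.
n0_mean = n_for_mean(sigma=20, E=4, confidence=0.95)
n0_prop = n_for_proportion(P=0.5, E=0.05, confidence=0.95)
n_finite = adjust_for_finite(n0_mean, N=1000)

# Round up so that the margin of error is not exceeded.
print(math.ceil(n0_mean), math.ceil(n0_prop), math.ceil(n_finite))
```

When no prior estimate of P is available, P = 0.5 is the cautious choice because P × Q is then at its maximum, giving the largest required sample.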
Example 32
Exercise 32
1. A student in public administration wants to estimate the mean monthly earnings
of city council members in large cities. She can tolerate a margin of error of $100
in estimating the mean. She would also prefer to report the interval estimate
with a 95% level of confidence. The student found a report by the Department
of Labor that reported a standard deviation of $1,000. What is the required
sample size?
2. A university’s office of research wants to estimate the arithmetic mean grade
point average (GPA) of all graduating seniors during the past 10 years. GPAs
range between 2.0 and 4.0. The estimate of the population mean GPA should be
within plus or minus 0.05 of the population mean. Based on prior experience, the
population standard deviation is 0.279. Using a 99% level of confidence, how
many student records need to be selected?
3. Suppose the U.S. president wants to estimate the proportion of the population
that supports his current policy toward revisions in the health care system. The
president wants the estimate to be within .04 of the true proportion. Assume a
95% level of confidence. The president’s political advisors found a similar survey
from two years ago that reported that 60% of people supported health care
revisions.
a. How large of a sample is required?
b. How large of a sample would be necessary if no estimate were available
for the proportion supporting current policy?
4. For a population of 500, what sample size is necessary to estimate
the population mean at 95 per cent confidence with a sampling error of 5 and
a standard deviation equal to 10?
Synopsis:
$n = \frac{Z_{\alpha/2}^2 \cdot P \cdot Q}{E^2}$ (for P)
If the population is finite, $n = \frac{n_0 \cdot N}{n_0 + (N - 1)}$