Lecture 3
3) In a factory, bulbs are produced with a standard deviation of 20 hours and an average durability of 550 hours. It is known that the durability of the bulbs is normally distributed. Quality is kept under control by randomly selecting 100 units with a probability of 95%. In a selected sample, the average is calculated as 540 hours.

4) The following tables show the distribution of the land cultivated by farmers selected randomly from among 2000 farmers, and their monthly incomes.

Cultivated Land (in acres)    Number of Farmers
16                            5
20                            8
25                            12
28                            10
30                            3
45                            2

Annual Income    Number of Farmers
140-148          3
148-156          8
156-164          13
164-172          10
172-180          6
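3) $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{20}{\sqrt{100}} = 2$

A) $\bar{x} = 540$
$\bar{x} \pm Z_{\alpha/2}\,\sigma_{\bar{x}} = 540 \pm 1.96(2)$
$536 \le \mu \le 544$
With 95% confidence, the mean durability of the bulbs lies between 536 and 544 hours. Since this interval lies entirely below 550 hours, the bulbs are produced below the standard.

B) $\mu \pm 1.96\,\sigma_{\bar{x}} = 550 \pm 1.96(2)$
$546 \le \bar{x} \le 554$
When the sample average durability lies between 546 and 554 hours, the production is up to the standard.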
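The arithmetic in parts A and B can be checked with a short Python sketch. It assumes the known-sigma z-interval used above; the function name `z_interval` is only illustrative, and the z-value comes from the standard library rather than from a table, so the limits differ from the rounded values in the last decimals.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(center, sigma, n, confidence=0.95):
    """Return (lower, upper) for center +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    half_width = z * sigma / sqrt(n)
    return center - half_width, center + half_width

# A) interval around the observed sample mean of 540 hours
ci_low, ci_high = z_interval(540, sigma=20, n=100)

# B) range for sample means when the process mean is at the 550-hour standard
ctl_low, ctl_high = z_interval(550, sigma=20, n=100)

print(round(ci_low, 2), round(ci_high, 2))
print(round(ctl_low, 2), round(ctl_high, 2))
```

The observed mean of 540 falls outside the range computed in B, which is the same conclusion reached in part A.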
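4) The standard deviation of the population is not known, $n \ge 30$ and $n/N < 0.05$.

A) Estimations for Cultivated Land

Cultivated Land       Number of
(in acres) x_i        Farmers f_i    f_i x_i    f_i x_i^2
16                    5              80         1280
20                    8              160        3200
25                    12             300        7500
28                    10             280        7840
30                    3              90         2700
45                    2              90         4050
Total                 40             1000       26570

$\bar{x} = \dfrac{\sum f_i x_i}{n} = \dfrac{1000}{40} = 25$ acres.

$s = \sqrt{\dfrac{\sum f_i x_i^2 - (\sum f_i x_i)^2 / n}{n - 1}} = \sqrt{\dfrac{26570 - (1000)^2 / 40}{39}} = 6.345$

$n/N = 40/2000 = 0.02 < 0.05$

$\hat{\sigma}_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{6.345}{\sqrt{40}} = 1.003$

B) Estimations for Income

Income       f_i    m_i    f_i m_i    f_i m_i^2
[140-148)    3      144    432        62208
[148-156)    8      152    1216       184832
[156-164)    13     160    2080       332800
[164-172)    10     168    1680       282240
[172-180)    6      176    1056       185856
Total        40            6464       1047936

$s = \sqrt{\dfrac{\sum f_i m_i^2 - (\sum f_i m_i)^2 / n}{n - 1}} = \sqrt{\dfrac{1047936 - (6464)^2 / 40}{39}} = 9.273$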
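The grouped-data mean and standard deviation for both tables can be reproduced with a small Python sketch. The helper name `grouped_mean_sd` is an illustrative choice, and the income classes are represented by their midpoints, which is an assumption of the grouped-data method.

```python
from math import sqrt

def grouped_mean_sd(values, freqs):
    """Mean and sample standard deviation from a frequency table."""
    n = sum(freqs)
    total = sum(f * x for x, f in zip(values, freqs))         # sum of f_i * x_i
    total_sq = sum(f * x * x for x, f in zip(values, freqs))  # sum of f_i * x_i^2
    mean = total / n
    sd = sqrt((total_sq - total ** 2 / n) / (n - 1))
    return n, mean, sd

# Cultivated land (acres) and number of farmers
n, land_mean, land_sd = grouped_mean_sd([16, 20, 25, 28, 30, 45], [5, 8, 12, 10, 3, 2])
land_se = land_sd / sqrt(n)  # n/N = 40/2000 < 0.05, so no finite population correction

# Income classes, represented by class midpoints
_, income_mean, income_sd = grouped_mean_sd([144, 152, 160, 168, 176], [3, 8, 13, 10, 6])

print(n, land_mean, round(land_sd, 3), round(land_se, 3))
print(income_mean, round(income_sd, 3))
```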
5) In a region, 26 districts were randomly selected from 300 districts, and the population distribution in these districts was determined as follows. It is known that the population is normally distributed. Estimate the total population in this region with 95% probability.
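Number of inhabitants    Number of
(1000)                   districts f_i    m_i    f_i m_i    f_i m_i^2
20-24                    7                22     154        3388
24-28                    10               26     260        6760
28-32                    6                30     180        5400
32-40                    3                36     108        3888
Total                    26                      702        19436

$\bar{x} = \dfrac{\sum f_i m_i}{n} = \dfrac{702}{26} = 27$ (thousand), that is, 27000 inhabitants per district.

$s = \sqrt{\dfrac{\sum f_i m_i^2 - (\sum f_i m_i)^2 / n}{n - 1}} = \sqrt{\dfrac{19436 - (702)^2 / 26}{25}} = 4.391$

$n/N = 26/300 = 0.087 > 0.05$

$\hat{\sigma}_{\bar{x}} = \dfrac{s}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}} = \dfrac{4.391}{\sqrt{26}} \sqrt{\dfrac{300 - 26}{300 - 1}} = 0.824$

$N\hat{\mu} = N\,(\bar{x} \pm t_{\alpha/2,\,n-1}\,\hat{\sigma}_{\bar{x}}) = 300\,[27 \pm 2.06(0.824)] = 300\,(27 \pm 1.698)$

$7591 \le N\mu \le 8609$ (thousand)

With 95% confidence, the population of the region lies between 7591000 and 8609000.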
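The same calculation can be sketched in Python. The critical value 2.06 is taken from the solution above rather than computed, and the variable names are illustrative only.

```python
from math import sqrt

midpoints = [22, 26, 30, 36]   # class midpoints, in thousands of inhabitants
districts = [7, 10, 6, 3]      # number of sampled districts in each class
N = 300                        # districts in the region
t_crit = 2.06                  # t_{0.025, 25}, as used in the solution

n = sum(districts)
sum_fm = sum(f * m for m, f in zip(midpoints, districts))
sum_fm2 = sum(f * m * m for m, f in zip(midpoints, districts))

mean = sum_fm / n
s = sqrt((sum_fm2 - sum_fm ** 2 / n) / (n - 1))

se = s / sqrt(n)
if n / N > 0.05:
    # sampling fraction above 5%: apply the finite population correction
    se *= sqrt((N - n) / (N - 1))

lower_total = N * (mean - t_crit * se)   # thousands of inhabitants
upper_total = N * (mean + t_crit * se)
print(round(se, 3), round(lower_total), round(upper_total))
```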
Critical Values of t
If we are sampling from a normal distribution, the t-statistic has a sampling distribution very much like that of
the z-statistic: mound-shaped, symmetric, with mean 0. The primary difference between the sampling
distributions of t and z is that the t-statistic is more variable than the z, which follows intuitively when you
realize that t contains two random quantities ($\bar{x}$ and $s$), whereas z contains only one ($\bar{x}$).
The actual amount of variability in the sampling distribution of t depends on the sample size n. A convenient
way of expressing this dependence is to say that the t-statistic has (n – 1) degrees of freedom (df). Recall that
the quantity (n – 1) is the divisor that appears in the formula for $s^2$. This number plays a key role in the sampling distribution of $s^2$. In particular, the smaller the number of degrees of freedom associated with the t-statistic, the more variable will be its sampling distribution.
• Note that $t_{\alpha}$ values are listed for various degrees of freedom, where $\alpha$ refers to the tail area under the t-distribution to the right of $t_{\alpha}$. For example, if we want the t-value with an area of .025 to its right and 4 df, we look in the table under the column $t_{.025}$ for the entry in the row corresponding to 4 df. This entry, $t_{.025} = 2.776$, is highlighted in Figure 6.9. Recall that the corresponding standard normal z-score is $z_{.025} = 1.96$.
• The last row of Table III, where df = $\infty$, contains the standard normal z-values. This follows from the fact that as the sample size n grows very large, $s$ becomes closer to $\sigma$ and thus t becomes closer in distribution to z. In fact, when df = 29, there is little difference between corresponding tabulated values of z and t. Thus, researchers often choose the arbitrary cutoff of n = 30 (df = 29) to distinguish between the large-sample and small-sample inferential techniques when $\sigma$ is unknown.
• Example: Consider the pharmaceutical company that desires an estimate of the mean increase in blood
pressure of patients who take a new drug. The blood pressure increases (points) for the n = 6 patients in
the human testing phase are shown in the table below. Use this information to construct a 95% confidence
interval for $\mu$, the mean increase in blood pressure associated with the new drug for all patients in the
population.
Solution:
$df = n - 1 = 5 < 30$
$1.286 \le \mu \le 3.28$
We can be 95% confident that the mean increase in blood pressure associated with taking this new drug
is between 1.286 and 3.28 points.
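The statement earlier in this section that the t-statistic is more variable than z for few degrees of freedom can be illustrated by simulation. The following Python sketch is an illustration only: the function names, the seed, and the number of repetitions are arbitrary choices. It draws standard normal samples and counts how often |t| exceeds 1.96, a cutoff that z exceeds about 5% of the time.

```python
import random
from math import sqrt

def t_statistic(rng, n):
    """One t-statistic from a standard normal sample of size n (true mean 0)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x_bar = sum(sample) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
    return x_bar / (s / sqrt(n))

def share_beyond(n, cutoff=1.96, reps=20000, seed=0):
    """Fraction of simulated t-statistics with |t| above the cutoff."""
    rng = random.Random(seed)
    hits = sum(abs(t_statistic(rng, n)) > cutoff for _ in range(reps))
    return hits / reps

share_df4 = share_beyond(5)     # n = 5, df = 4
share_df29 = share_beyond(30)   # n = 30, df = 29
print(share_df4, share_df29)
```

With df = 4 the share should come out well above 5%, while with df = 29 it should be close to it, in line with the n = 30 cutoff mentioned above.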
Examples 6-7
6) In order to determine the monthly average wages of the workers operating in the same sector, 320 people were randomly selected from among 8000 workers. It was determined that the monthly average wage was 8500 TL and the standard deviation was 2600 TL. Between what values does the average worker wage vary with a probability of 99.73%?

7) To estimate the average number of children in a rural area, 350 families were selected randomly from among 5000 families. The average number of children of the selected families was calculated as 5.4 and the standard deviation as 2.1.
a) Construct a 99% confidence interval for the average number of children in the region.
b) For the average number of children, determine the minimum value with 99% confidence.
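6) The standard deviation of the population is not known, the population is normally distributed, $n \ge 30$ and $n/N = 320/8000 = 0.04 < 0.05$.

$\hat{\sigma}_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{2600}{\sqrt{320}} = 145.34$
$\bar{x} \pm Z_{\alpha/2}\,\hat{\sigma}_{\bar{x}} = 8500 \pm 3(145.34)$
$8063.97 \le \mu \le 8936.03$
With 99.73% confidence, the average wage of workers lies between 8063.97 and 8936.03 TL.

7) The standard deviation of the population is not known, $n \ge 30$ and $n/N = 350/5000 = 0.07 > 0.05$.

$\hat{\sigma}_{\bar{x}} = \dfrac{s}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}} = \dfrac{2.1}{\sqrt{350}} \sqrt{\dfrac{5000 - 350}{5000 - 1}} = 0.1083$

b) $\bar{x} - Z_{\alpha}\,\hat{\sigma}_{\bar{x}} = 5.4 - 2.33(0.1083) = 5.4 - 0.2523 = 5.1477$
$\mu \ge 5.1477$
With 99% confidence, the average number of children should be greater than 5.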
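Both solutions follow the same pattern: compute the standard error, apply the finite population correction only when n/N exceeds 0.05, then attach a z-multiplier. A minimal Python sketch of that pattern follows; `standard_error` is an illustrative helper name, and the one-sided z-value is taken from the standard library, so it is 2.326 rather than the rounded 2.33.

```python
from math import sqrt
from statistics import NormalDist

def standard_error(s, n, N):
    """Estimated standard error of the mean, corrected when n/N > 0.05."""
    se = s / sqrt(n)
    if n / N > 0.05:
        se *= sqrt((N - n) / (N - 1))
    return se

# Example 6: two-sided limits with Z = 3 (99.73%); n/N = 0.04, so no correction
se_wage = standard_error(2600, 320, 8000)
wage_low, wage_high = 8500 - 3 * se_wage, 8500 + 3 * se_wage

# Example 7 b: one-sided lower limit at 99%; n/N = 0.07, so the correction applies
se_children = standard_error(2.1, 350, 5000)
z_one_sided = NormalDist().inv_cdf(0.99)
min_children = 5.4 - z_one_sided * se_children

print(round(wage_low, 2), round(wage_high, 2), round(min_children, 4))
```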
Example- 8
Some quality-control experiments require destructive sampling (i.e., the test to determine whether the item is
defective destroys the item) in order to measure some particular characteristic of the product. The cost of
destructive sampling often dictates small samples. For example, suppose a manufacturer of printers for personal
computers wishes to estimate the mean number of characters printed before the printhead fails. Suppose the
printer manufacturer tests n = 15 randomly selected printheads and records the number of characters printed
until failure for each. These 15 measurements (in millions of characters) are listed in the table below.
A) Form a 99% confidence interval for the mean number of characters printed before the printhead
fails. Interpret the result.
B) What assumption is required for the interval, part a, to be valid? Is it reasonably satisfied?
8) $\alpha/2 = 0.01/2 = 0.005$, $n = 15$, $df = n - 1 = 14$, $t_{0.005,14} = 2.977$

$\bar{x} = \dfrac{1.13 + 1.55 + \dots + 1.29}{15} = 1.239$

$s = \sqrt{\dfrac{(1.13 - 1.239)^2 + \dots + (1.29 - 1.239)^2}{14}} = 0.193$

$\bar{x} \pm t_{0.005,14}\,\dfrac{s}{\sqrt{n}} = 1.239 \pm 2.977\,\dfrac{0.193}{\sqrt{15}} = 1.239 \pm 0.148$

$1.091 \le \mu \le 1.387$

a) The manufacturer can be 99% confident that the printhead has a mean life of between 1.091 and 1.387 million characters. If the manufacturer were to advertise that the mean life of its printheads is (at least) 1 million characters, the interval would support such a claim. Our confidence is derived from the fact that 99% of the intervals formed in repeated applications of this procedure would contain $\mu$.

b) Because n is small, we must assume that the number of characters printed before printhead failure is a random variable from a normal distribution. That is, we assume that the population from which the sample of 15 measurements is selected is distributed normally.
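The interval in part a can be reproduced from the summary statistics alone. The sketch below takes the sample mean, the standard deviation, and the critical value 2.977 from the solution above; the function name `t_interval` is illustrative, and the critical value is supplied by the caller because the Python standard library has no t-distribution.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """Small-sample interval x_bar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin, margin

# Summary statistics and t_{0.005, 14} from the printhead example
lower, upper, margin = t_interval(x_bar=1.239, s=0.193, n=15, t_crit=2.977)
print(round(margin, 3), round(lower, 3), round(upper, 3))
```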
!!! An assumption that the population is approximately normally distributed is necessary for making small-sample inferences about $\mu$ when $\sigma$ is unknown and when using the t-statistic. Although many phenomena do have approximately normal distributions, it is also true that many random phenomena have distributions that are not normal or even mound-shaped. Empirical evidence acquired over the years has shown that confidence intervals based on the t-distribution are rather insensitive to moderate departures from normality. That is, use of the t-statistic when sampling from slightly or moderately skewed mound-shaped populations generally produces credible results; however, for cases in which the distribution is distinctly nonnormal, we must either take a large sample or use a nonparametric method !!!
Confidence Interval for a Population Proportion
In some cases it is necessary to estimate the proportion or number of units in the population that have a particular characteristic. In this case, the population consists of two groups, those with and those without a certain feature, or it is transformed into this form for the purpose of the research.
Researchers may sometimes want to treat multigroup populations as two-group populations. For example, while the distribution of a class is multigroup according to the grades received in a course (between 0 and 100), it can be converted into two groups, successful and unsuccessful students in this course, when desired.
If A is the number of people with a certain feature in a population of N units, then the proportion of those who have this feature will be:

$P = \dfrac{A}{N}$          For the sample: $p = \dfrac{a}{n}$

and the proportion of those who do not have this feature will be:

$Q = \dfrac{N - A}{N} = 1 - P$          For the sample: $q = \dfrac{n - a}{n} = 1 - p$
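Connection between ratio and mean
If the units with the examined feature are indicated by (1) in the population and those without this feature are indicated by (0), then:

$\sum_{i=1}^{N} X_i = A$

$\mu = \dfrac{\sum_{i=1}^{N} X_i}{N} = \dfrac{A}{N} = P$

$\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = P(1 - P)^2 + (1 - P)(0 - P)^2$
$= P(1 - 2P + P^2) + (1 - P)P^2$
$= P - 2P^2 + P^3 + P^2 - P^3 = P - P^2 = P(1 - P) = PQ$

For the sample: $s^2 = pq$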
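The identity between the variance of a 0/1-coded population and PQ can be verified numerically. The population below (N = 10, A = 3) is a made-up example and does not come from the lecture; exact fractions are used so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

N, A = 10, 3                          # hypothetical population: 3 of 10 units have the feature
population = [1] * A + [0] * (N - A)  # 1 = has the feature, 0 = does not

P = Fraction(sum(population), N)      # mean of the 0/1 values equals the proportion A/N
Q = 1 - P

# Population variance: average squared deviation from the mean
variance = sum((x - P) ** 2 for x in population) / N

print(P, Q, variance, P * Q)
```

Here both `variance` and `P * Q` equal 21/100, as the derivation predicts.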
The fact that $\hat{p}$ is a "sample mean number of successes per trial" allows us to form confidence intervals about $p$ in a manner that is completely analogous to that used for large-sample estimation of $\mu$.
Example
A food-products company conducted a market study by randomly sampling and interviewing 1,000
consumers to determine which brand of breakfast cereal they prefer. Suppose 313 consumers were found to
prefer the company’s brand. How would you estimate the true fraction of all consumers who prefer the
company’s cereal brand?
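$\hat{p} = \dfrac{313}{1000} = 0.313$
$\hat{q} = 1 - \hat{p} = 1 - 0.313 = 0.687$
$n\hat{p} = 1000(0.313) = 313$
$n\hat{q} = 1000(0.687) = 687$

$\hat{p} \pm Z_{\alpha/2}\,\sigma_{\hat{p}} = 0.313 \pm 1.96 \sqrt{\dfrac{pq}{1000}}$
$\approx 0.313 \pm 1.96 \sqrt{\dfrac{(0.313)(0.687)}{1000}}$
$= 0.313 \pm 0.029$

$0.284 \le p \le 0.342$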
The company can be 95% confident that the interval from 28.4% to 34.2% contains the true
percentage of all consumers who prefer its brand—that is, in repeated construction of confidence
intervals, approximately 95% of all samples would produce confidence intervals that enclose p.
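The cereal calculation can be sketched in Python as follows. The function name is illustrative, and the check that both the number of successes and the number of failures are at least 15 is an assumption about how the sample-size condition is applied.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Large-sample interval p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    # Assumed large-sample check: at least 15 successes and 15 failures
    large_enough = successes >= 15 and (n - successes) >= 15
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin, large_enough

low, high, ok = proportion_interval(313, 1000)
print(round(low, 3), round(high, 3), ok)
```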
Suppose you want to estimate the proportion of executives who die from a work-related injury using a sample size of n = 100. This proportion is likely to be near 0, say, $p \approx 0.001$. If so, then $np \approx 100(0.001) = 0.1$ is less than the recommended value of 15. Consequently, a confidence interval for p based on a sample of n = 100 will probably be misleading. To overcome this potential problem, an extremely large sample size is required. Because the value of n required to satisfy "extremely large" is difficult to determine, statisticians have proposed an alternative method, based on the Wilson (1927) point estimator of p. The procedure is outlined in the box below. Researchers have shown that this confidence interval works well for any p, even when the sample size n is very small.
Example
According to the Bureau of Labor Statistics, the probability of injury while working at a jewelry store is less
than 0.01. Suppose that in a random sample of 200 jewelry store workers, 3 were injured on the job. Estimate
the true proportion of jewelry store workers injured on the job using a 95% confidence interval.
Solution:
Because the number of “successes” (i.e., number of injured jewelry store workers) in the sample is x = 3, the
adjusted sample proportion is
Consequently, we are 95% confident that the true proportion of jewelry store workers who are injured
while on the job falls between 0.004 and 0.046.
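The formula for the adjusted proportion did not survive in these notes. The sketch below assumes the commonly used form of the Wilson-adjusted estimator, (x + 2)/(n + 4), with n + 4 also used in the standard error. That form is an assumption, not something stated in the text above, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def adjusted_proportion_interval(x, n, confidence=0.95):
    """Adjusted interval, assuming p_tilde = (x + 2) / (n + 4)."""
    p_tilde = (x + 2) / (n + 4)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde, p_tilde - margin, p_tilde + margin

p_tilde, low, high = adjusted_proportion_interval(x=3, n=200)
print(round(p_tilde, 4), round(low, 4), round(high, 4))
```

Under this assumption the upper limit agrees with 0.046. The lower limit comes out slightly below the quoted 0.004, a difference that is presumably due to rounding of intermediate values in the worked solution.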
Determining the Sample Size for $\mu$ and $\hat{p}$

Sample Size for Confidence Interval for $\mu$
In order to estimate $\mu$ with a sampling error SE and with $100(1 - \alpha)\%$ confidence, the required sample size is found as follows:

$Z_{\alpha/2} \left( \dfrac{\sigma}{\sqrt{n}} \right) = SE$

The solution for n is given by the equation:

$n = \dfrac{(Z_{\alpha/2})^2\,\sigma^2}{(SE)^2}$

Note: The value of $\sigma$ is usually unknown. It can be estimated by the standard deviation, s, from a prior sample. Alternatively, we may approximate the range R of observations in the population, and (conservatively) estimate $\sigma \approx R/4$. In any case, you should round the value of n obtained upward to ensure that the sample size will be sufficient to achieve the specified reliability.

Sample Size for Confidence Interval for $\hat{p}$
In order to estimate a binomial probability $\hat{p}$ with sampling error SE and with $100(1 - \alpha)\%$ confidence, the required sample size is found by solving the following equation:

$Z_{\alpha/2} \sqrt{\dfrac{pq}{n}} = SE$

The solution for n can be written as follows:

$n = \dfrac{(Z_{\alpha/2})^2\,(pq)}{(SE)^2}$

Note: Because the value of the product pq is unknown, it can be estimated by using the sample fraction of successes, $\hat{p}$, from a prior sample. In any case, you should round the value of n obtained upward to ensure that the sample size will be sufficient to achieve the specified reliability.
Example
In a region where 10000 families live, a firm will conduct a sampling study to investigate
whether there has been a significant change in its market share, which has been 20% in recent
years. What should the sample size be to estimate the market share with a probability of 99%
with a margin of 0.05?
$SE = 0.05$
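The worked solution for this example is not shown beyond the value of SE, so the following Python sketch is only one way to apply the sample-size formulas above. It assumes p = 0.20 from the stated market share, uses the exact z-value from the standard library (about 2.576 rather than a table value such as 2.58, which can change the result by one unit), and ignores any finite population adjustment. The function names are illustrative.

```python
from math import ceil
from statistics import NormalDist

def z_two_sided(confidence):
    """z-value leaving (1 - confidence) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_proportion(p, se, confidence):
    """n = z^2 * p * q / SE^2, rounded upward."""
    z = z_two_sided(confidence)
    return ceil(z ** 2 * p * (1 - p) / se ** 2)

def sample_size_mean(sigma, se, confidence):
    """n = z^2 * sigma^2 / SE^2, rounded upward."""
    z = z_two_sided(confidence)
    return ceil(z ** 2 * sigma ** 2 / se ** 2)

# Market share example: p = 0.20, SE = 0.05, 99% confidence
n_market = sample_size_proportion(0.20, 0.05, 0.99)

# Hypothetical values for the mean, for illustration only: sigma = 10, SE = 2, 95%
n_mean = sample_size_mean(10, 2, 0.95)
print(n_market, n_mean)
```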
• A specialty manufacturer wants to purchase remnants of sheet aluminum foil. The foil, all of
which is the same thickness, is stored on 1,462 rolls, each containing a varying amount of foil.
To obtain an estimate of the total number of square feet of foil on all the rolls, the manufacturer
randomly sampled 100 rolls and measured the number of square feet on each roll. The sample
mean was 47.4, and the sample standard deviation was 12.4.
A) Find an approximate 95% confidence interval for the mean amount of foil on the 1,462 rolls.
B) Estimate the total number of square feet of foil on all the rolls by multiplying the confidence
interval, part a, by 1,462. Interpret the result.
A) $45.01 \le \mu \le 49.79$
characterized by a quantity called the degrees of freedom (df) associated with the distribution. Several chi-square
distributions with different df values are shown in the figure below. You can see that unlike z- and t-distributions,
the chi-square distribution is not symmetric about 0.
Critical Values of $\chi^2$
Example
The number of supermarkets in Myanmar’s most populated cities is increasing and market competition is also high.
Kaggle has published a study regarding the growth of supermarkets. A three-month dataset has been collected based
on the historical sales of a supermarket company located at Mandalay. The product lines under consideration in the
study are electronic accessories, fashion accessories, food and beverages, health and beauty, home and lifestyle, and
sports and travel. To analyze the average unit price of all the electronic accessories, a random sample of eight
accessories’ unit prices (in Kyat) paid by cash are listed in the following table.
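e) We are 95% confident that the true mean unit price is between 27.91 and 84.44 Kyat.

f) The population is normally distributed and the sample is a random sample.

g) $n = \dfrac{(Z_{\alpha/2})^2\,\sigma^2}{(SE)^2} = \dfrac{(1.96)^2 (33.801)^2}{(2.5)^2} = 702.26 \approx 703$

h) $\alpha/2 = 0.01/2 = 0.005$
$\chi^2_{0.005,7} = 20.2777$, $\chi^2_{0.995,7} = 0.98926$
$s^2 = (33.8013)^2 = 1142.53$

$\dfrac{(n - 1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \dfrac{(n - 1)s^2}{\chi^2_{(1 - \alpha/2)}}$

$\dfrac{7(1142.529)}{20.2777} \le \sigma^2 \le \dfrac{7(1142.529)}{0.98926}$

$394.408 \le \sigma^2 \le 8084.53$

$19.86 \le \sigma \le 89.91$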
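The interval in part h can be recomputed with a few lines of Python. The two chi-square critical values are copied from the solution above rather than computed, since the standard library has no chi-square distribution, and the variable names are illustrative.

```python
from math import sqrt

n = 8                  # sample size, so df = n - 1 = 7
s = 33.8013            # sample standard deviation from the solution
chi2_upper = 20.2777   # chi-square value with area 0.005 to its right, 7 df
chi2_lower = 0.98926   # chi-square value with area 0.995 to its right, 7 df

# Dividing by the larger critical value gives the lower limit, and vice versa
var_low = (n - 1) * s ** 2 / chi2_upper
var_high = (n - 1) * s ** 2 / chi2_lower

sd_low, sd_high = sqrt(var_low), sqrt(var_high)
print(round(var_low, 2), round(var_high, 2), round(sd_low, 2), round(sd_high, 2))
```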