Statistics: Problems and Solutions
by
J. Murdoch and J. A. Barnes
Palgrave Macmillan
© J. Murdoch and J. A. Barnes 1973
Published by
THE MACMILLAN PRESS LTD
London and Basingstoke
Associated companies in New York Toronto
Melbourne Dublin Johannesburg and Madras
SBN 333 12017 5
the reader to perform to help him understand the concepts involved. In the
chapters of this book references to Statistical Tables for Science, Engineering
and Management are followed by an asterisk to distinguish them from
references to tables in this book.
The problems and examples given represent work by the authors over many
years and every attempt has been made to select a representative range to
illustrate the basic concepts and application of the techniques. The authors
would like to apologise if inadvertently examples which they have used have
previously been published. It is extremely difficult in collating a problem book
such as this to avoid some cases of duplication.
It is hoped that this new book, together with its two companion books, will
form the basis of an effective approach to the teaching of statistics, and
certainly the results from its trials at Cranfield have proved very stimulating.
J. Murdoch
Cranfield J. A. Barnes
Contents
List of symbols xi
Probability theory
1.2.1 Introduction 1
1.2.2 Measurement of probability
1.2.3 Experimental measurement of probability 2
1.2.4 Basic laws of probability 2
1.2.5 Conditional probability 6
1.2.6 Theory of groups 10
1.2.7 Mathematical expectation 11
1.2.8 Geometric probability 12
1.2.9 Introduction to the hypergeometric law 13
1.2.10 Introduction to the binomial law 14
1.2.11 Management decision theory 15
1.3 Problems 17
1.4 Worked solutions 19
1.5 Practical experiments 25
Appendix 1-specimen experimental results 27
2 Theory of distributions 32
2.2.1 Introduction 32
2.2.2 Frequency distributions 33
2.2.3 Probability distributions 35
2.2.4 Populations 35
2.2.5 Moments of distribution 37
2.2.6 Summary of terms 38
2.2.7 Types of distribution 40
2.2.8 Computation of moments 42
2.2.9 Sheppard's correction 45
2.3 Problems 45
2.4 Worked solutions 48
4 Normal distribution 80
4.2.1 Introduction 80
4.2.2 Equation of normal curve 80
4.2.3 Standardised variate 81
4.2.4 Area under normal curve 81
4.2.5 Percentage points of the normal distribution 82
4.2.6 Ordinates of the normal curve 82
4.2.7 Fitting a normal distribution to data 82
4.2.8 Arithmetic probability paper 82
4.2.9 Worked examples 83
4.3 Problems 89
4.4 Worked solutions 92
4.5 Practical experiments 102
Appendix 1-Experiment 10 of Laboratory Manual* 102
Appendix 2-Experiment 11 of Laboratory Manual* 105
8 Sampling theory and significance testing (II)-'t', 'F' and χ² tests 170
8.2.1 Unbiased estimate of population variance 170
8.2.2 Degrees of freedom 171
8.2.3 The 'u'-test with small samples 171
8.2.4 The 't'-test of significance 172
8.2.5 The 'F'-test of significance 174
8.2.6 The 'χ²'-test of significance 175
8.2.7 One- and two-tailed tests 177
8.2.8 Worked examples 177
8.3 Problems 182
8.4 Worked solutions 184
8.5 Practical experiments 192
Appendix I-Experiment 14 of Laboratory Manual* 193
oi observed frequency for χ² goodness-of-fit test
oij observed frequency in ijth cell of contingency table
P probability
P(A) probability of an event A
nPx number of permutations of n objects taken x at a time
P(A/B) conditional probability of A on assumption that B has occurred
E[x] expected value of variate, x
r sample correlation coefficient
S standard deviation of a sample
S² unbiased sample estimator of population variance
t Student's 't'
u coded variable used in calculating mean and variance of sample; also†
u standardised normal variate
Xi value of variate
Yi value of dependent variable corresponding to Xi in regression
Ŷi estimated value of dependent variable using the regression line
† Little confusion should arise here on the use of the same symbol in two different ways.
Their use in both these areas is too standardised for the authors to suggest a change.
Greek Symbols
Mathematical Symbols
Σ_{i=1}^{n}  summation from i = 1 to n
Note: The authors use (n x), the bracket form of the binomial coefficient, but in order to avoid any confusion both forms are given in the definitions.
1 Probability theory
1.2.1 Introduction
Probability or chance is a concept which enters all activities. We speak of the
chance of it raining today, the chance of winning the football pools, the chance
of getting on the bus in the mornings when the queues are of varying size, the
chance of a stock item going out of stock, etc. However, in most of these uses of
probability, it is very seldom that we attempt to measure or quantify the
statements. Most of our ideas about probability are intuitive and in fact
probability is a quantity rather like length or time and therefore not amenable
to simple definition. However, probability (like length or time) can be measured
and various laws set up to govern its use.
The following sections outline the measurement of probability and the rules
used for combining probabilities.
1/2 or 0.5   Probability that an unbiased coin shows 'heads' after one toss
1/6 or 0.167 Probability that a die shows 'six' on one roll
0            Probability that you will live forever (absolute impossibility)
Figure 1.1. Probability scale.
It will be seen that on this continuous scale, only the two end points are
concerned with deductive logic (although even here, there are certain logical
difficulties with the particular example quoted).
On this scale absolute certainty is represented by p = 1 and an impossible
event has probability of zero. However, it is between these two extremes that
the majority of practical problems lie. For instance, what is the chance that a
machine will produce defective items? What is the probability that a machine
will find the overhead service crane available when required? What is the
probability of running out of stock of any item? Or again, in insurance, what
is the chance that a person of a given age will survive for a further year?
For example, what is the probability of an item's going out of stock in a given
period?
Measurement showed that 189 items ran out in the period out of a total
number of stock items of 2000, therefore the estimate of probability of a stock
running out is
P(A) = 189/2000 = 0.0945
Again, if out of a random sample of 1000 men, 85 were found to be over
1.80 m tall, then
estimate of probability of a man being over 1.80 m tall = 85/1000 = 0.085.
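These relative-frequency estimates are plain ratios of occurrences to trials. A minimal sketch (the helper name is ours, not the book's):

```python
# Relative-frequency estimate of probability from observed counts.
def estimate_probability(occurrences, trials):
    return occurrences / trials

p_stockout = estimate_probability(189, 2000)  # stock items that ran out
p_tall = estimate_probability(85, 1000)       # men over 1.80 m tall

print(round(p_stockout, 4))  # 0.0945
print(round(p_tall, 3))      # 0.085
```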
This law can be extended by repeated application to cover the case of more
than two mutually exclusive events.
Thus P(A or B or C or ...) = P(A) + P(B) + P(C) + ...
The events of this law are mutually exclusive events, which simply means
that the occurrence of one of the events excludes the possibility of the occurrence
of any of the others on the same trial.
For example, if in a football match, the probability that a team will score
0 goals is 0.50, 1 goal is 0.30, 2 goals is 0.15 and 3 or more goals is 0.05, then
the probability of the team scoring either 0 or 1 goals in the match is
P(0 or 1) = P(0) + P(1) = 0.50 + 0.30 = 0.80
Also, the probability that the team will score at least one goal is
P(at least one goal) = P(1) + P(2) + P(3 or more) = 0.30 + 0.15 + 0.05 = 0.50
Any event either occurs or does not occur on a given occasion. From the
definition of probability and the addition law, the probabilities of these two
alternatives must sum to unity. Thus the probability that an event does not
occur is equal to
1 - (probability that the event does occur)
In many examples, this relationship is very useful since it is often easier to
find the probability of the complementary event first.
For example, the probability of a team's scoring at least one goal in a
football match can be obtained as
P(at least 1 goal) = 1 − P(0 goals) = 1 − 0.50 = 0.50
as before.
As a further example, suppose that the probabilities of a man dying from
heart disease, cancer or tuberculosis are 0.51, 0.16 and 0.20 respectively. The
probability that a man will die from heart disease or cancer is 0.51 + 0.16 = 0.67.
The probability that he will die from some cause other than the three mentioned
is 1-(0.51 + 0.16 + 0.20) = 0.13; i.e., 13% of men can be expected to die from
some other cause.
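The addition law for mutually exclusive events and the complement shortcut can be checked with a few lines (a sketch, not from the original text):

```python
# Addition and complement laws on the stated causes of death:
# P(heart disease) = 0.51, P(cancer) = 0.16, P(tuberculosis) = 0.20.
p_heart, p_cancer, p_tb = 0.51, 0.16, 0.20

p_heart_or_cancer = p_heart + p_cancer     # mutually exclusive, so add
p_other = 1 - (p_heart + p_cancer + p_tb)  # complement of the three causes

print(round(p_heart_or_cancer, 2))  # 0.67
print(round(p_other, 2))            # 0.13
```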
However, consider the following example. Suppose that of all new cars sold,
40% are blue and 30% have two doors. Then it cannot be said that the probability
of a person's owning either a blue car or a two-door car is 0.70 (= 0.40 + 0.30)
since the events (blue cars and two-door cars) are not mutually exclusive, i.e., a
car can be both blue and have two doors. To deal with this case, a more general
version of the addition law is necessary. This may be stated as
P(A or B or both) = P(A) + P(B) − P(A and B)
4 Statistics: Problems and Solutions
Examples
1. In the throw of two dice, what is the probability of obtaining two sixes?
One of the dice must show a six and the other must also show a six. Thus the
required probability (independent events) is
p = 1/6 × 1/6 = 1/36
2. In the throw of two dice, what is the probability of a score of 9 points?
Here we must consider the number of mutually exclusive ways in which the
score 9 can occur. These ways are listed below
Dice A   3      4      5      6
         and    and    and    and
Dice B   6  or  5  or  4  or  3

Each of these four mutually exclusive ways has probability 1/36, so the required probability is 4/36 = 1/9.
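Both two-dice results can be confirmed by enumerating all 36 equally likely outcomes (a sketch, not from the original text):

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of throwing two dice.
outcomes = list(product(range(1, 7), repeat=2))

p_two_sixes = Fraction(sum(1 for a, b in outcomes if a == b == 6), len(outcomes))
p_score_9 = Fraction(sum(1 for a, b in outcomes if a + b == 9), len(outcomes))

print(p_two_sixes)  # 1/36
print(p_score_9)    # 1/9
```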
Probability Theory 5
P(blue or two doors or both) = 0.40 + 0.30 − 0.12 = 0.70 − 0.12 = 0.58
[Figure 1.2: a unit square divided into four rectangles, Blue (0.4) against Not blue (0.6) on one side and 2 doors (0.3) against Not 2 doors (0.7) on the other.]
Figure 1.2
This result is valid on the assumption that the number of doors that a car has
is not dependent on its colour.
Figure 1.2 illustrates the situation. The areas of the four rectangles within the
square represent the proportion of all cars having the given combination of
colour and number of doors. The total area of the three shaded rectangles is
equal to 0.58, the proportion of cars that are either blue or have two doors or
are two-door blue cars.
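The general addition law can be sketched directly; the 0.12 overlap follows from the independence assumption stated in the text:

```python
# General addition law: P(A or B or both) = P(A) + P(B) - P(A and B).
# Car example: 40% blue, 30% two-door; number of doors assumed
# independent of colour, so P(blue and two-door) = 0.4 * 0.3 = 0.12.
p_blue = 0.40
p_two_door = 0.30
p_both = p_blue * p_two_door

p_blue_or_two_door = p_blue + p_two_door - p_both
print(round(p_blue_or_two_door, 2))  # 0.58
```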
Either of these two situations satisfies (b) and the probability of at least one
of the burnt-out bulbs being in the first two tested is given by their sum
8/45 + 8/45 = 16/45
(c) The probability of at least one burnt-out bulb being found in two tests
is equal to the sum of the answers to parts (a) and (b), namely
1/45 + 16/45 = 17/45
As a check on this result, the only other possibility is that neither of the
faulty bulbs will be picked out for the first two tests. The probability of this,
using the multiplication law with the appropriate conditional probability, is
8/10 × 7/9 = 28/45
The situation in part (c) therefore has probability of 1 − 28/45 = 17/45 as given by
direct calculation.
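The bulb probabilities can be verified by exhaustive enumeration, assuming (as in the worked example) a box of ten bulbs of which two are burnt out and a random pair tested first:

```python
from fractions import Fraction
from itertools import combinations

# Assumed setup: 10 bulbs, 2 burnt out; the first two bulbs tested
# form a random pair, so all C(10, 2) = 45 pairs are equally likely.
bulbs = ['burnt'] * 2 + ['good'] * 8
pairs = list(combinations(range(10), 2))

def p_burnt_count(k):
    """Probability that exactly k burnt-out bulbs are in the tested pair."""
    hits = sum(1 for pair in pairs
               if sum(bulbs[i] == 'burnt' for i in pair) == k)
    return Fraction(hits, len(pairs))

print(p_burnt_count(2))                     # 1/45  (both burnt out)
print(p_burnt_count(1))                     # 16/45 (exactly one)
print(p_burnt_count(1) + p_burnt_count(2))  # 17/45 (at least one)
```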
Consider a box containing r red balls and w white balls. A random sample of
two balls is drawn. What is the probability of the sample containing two red
balls?
r = red balls
w = white balls
If the first ball is red (event A), probability of this event occurring
P(A) = r/(r + w)
The probability of the second ball being red (event B) given the first was red
is thus
P(B/A) = (r − 1)/(r + w − 1)
since there are now only (r − 1) red balls in the box containing (r + w − 1) balls.
∴ Probability of the sample containing two red balls
= r/(r + w) × (r − 1)/(r + w − 1)
In similar manner, probability of the sample containing two white balls
= w/(r + w) × (w − 1)/(r + w − 1)
Also consider the probability of the sample containing one red and one
white ball. This event can happen in two mutually exclusive ways: first ball red,
second ball white, or first ball white, second ball red.
Thus, the probability of the sample containing one white and one red
ball is
= r/(r + w) × w/(r + w − 1) + w/(r + w) × r/(r + w − 1) = 2rw/[(r + w)(r + w − 1)]
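The three sample compositions are mutually exclusive and exhaustive, so their probabilities must sum to one. A sketch with an illustrative box of r = 5 red and w = 3 white balls (our choice of numbers, not the book's):

```python
from fractions import Fraction

# Sampling two balls without replacement from r red and w white balls.
r, w = 5, 3
n = r + w

p_two_red = Fraction(r, n) * Fraction(r - 1, n - 1)
p_two_white = Fraction(w, n) * Fraction(w - 1, n - 1)
# One of each colour: red-then-white or white-then-red.
p_one_each = (Fraction(r, n) * Fraction(w, n - 1)
              + Fraction(w, n) * Fraction(r, n - 1))

print(p_one_each)                            # 15/28
print(p_two_red + p_two_white + p_one_each)  # 1
```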
Examples
1. In a group of ten people where six are male and four are female, what is the
chance that a committee of four, formed from the group with random selection,
comprises (a) four females, or (b) three females and one male?
(a) Probability of committee with four females
= 4/10 × 3/9 × 2/8 × 1/7 = 0.0048
(b) Committee comprising three females and one male. This committee can
be formed in the following mutually exclusive ways
1st member M F F F
2nd member F or M or F or F
3rd member F F M F
4th member F F F M
The probability of the first arrangement is
6/10 × 4/9 × 3/8 × 2/7 = 0.0286
The probability for the second arrangement is
4/10 × 6/9 × 3/8 × 2/7 = 0.0286
and similarly for the third and fourth columns, the position of the numbers in
the numerator being different in each of the four cases. The required probability
is thus 4 x 0.0286 = 0.114.
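The same committee probabilities follow from counting equally likely committees with binomial coefficients, which reproduces the sequential-multiplication argument above (a sketch, not from the original text):

```python
from math import comb

# Committee of 4 drawn at random from 10 people: 6 male, 4 female.
p_four_female = comb(4, 4) * comb(6, 0) / comb(10, 4)
p_three_female_one_male = comb(4, 3) * comb(6, 1) / comb(10, 4)

print(round(p_four_female, 4))            # 0.0048
print(round(p_three_female_one_male, 3))  # 0.114
```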
(a) Probability of no defective items in the sample only arises in one way.
Probability of no defective items

[Table: the ten mutually exclusive arrangements of a single defective item among ten positions, e.g.

Position  1  2  3  4  5  6  7  8  9  10
          D  G  G  G  G  G  G  G  G  G      D = defective item
          G  D  G  G  G  G  G  G  G  G      G = good item
          ...
          G  G  G  G  G  G  G  G  G  D]
Permutations
Groups form different permutations if they differ in any or all of the following
aspects.
(1) Total number of items in the group.
(2) Number of items of anyone type in the group.
(3) Sequence.
Thus
ABB, BAB are different permutations by (3); AA, BAA are different permutations
by (1) and (2); CAB, CAAB are different permutations by (1) and (2); BAABA, BABBA
are different permutations because of (2).
Thus distinct arrangements differing in (1) and/or (2) and/or (3) form different
permutations.
or (n x) = n!/(x!(n − x)!)
As an example, a committee of three is to be formed from five department
heads. How many different committees can be formed?
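Since the order of committee members does not matter, the count is a combination; the standard library gives it directly (a sketch, not from the original text):

```python
from math import comb, perm

# Committees of three chosen from five department heads.
n_committees = comb(5, 3)  # 5!/(3! 2!) unordered selections
n_ordered = perm(5, 3)     # 5 * 4 * 3 ordered selections, for contrast

print(n_committees)  # 10
print(n_ordered)     # 60
```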
Example
The probability that a man aged 55 will live for another year is 0.99. How large
a premium should he pay for a £2000 life insurance policy for one year?
(Ignore insurance company charges for administration, profit, etc.)
Let s = premium to be paid
Expected return = 0 x 0.99 + £2000 x 0.01 = £20
:. Premium s = £20 (should equal expected return)
12 Statistics: Problems and Solutions
2 1 2
1 4 1
2 1 2
Figure 1.3
Given that the lines are 1 mm thick, the sides of the squares are 60 mm and
the diameter of the coin is 20 mm what is
(a) the chance of getting the coin in the 4 square?
(b) the chance of getting the coin in a 2 square?
(c) the expected return per trial, if returns are made in accordance with the
numbers in the squares?
[Figure 1.4: one square of the grid, of effective side 61 mm including the line; the centre of the 20 mm coin must lie in an inner 40 mm square for the coin not to touch a line.]
Figure 1.4
Considering one square (figure 1.4), total possible area (ignoring small
edge-effects of line thickness) = 61² = 3721
For one square, the probability that the coin does not touch a line is
1600/3721 = 0.43
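The geometric-probability arithmetic can be sketched as follows, assuming (consistent with figure 1.4) that the coin avoids every line only when its centre lies in an inner square of side 61 − 20 − 1 = 40 mm:

```python
# Coin-on-squares game: lines 1 mm thick, squares 60 mm, coin 20 mm.
# One square plus its share of the lines has side 61 mm; the coin's
# centre must stay 10.5 mm from each line, leaving a 40 mm safe square.
total_area = 61 ** 2  # 3721 mm^2
safe_area = 40 ** 2   # 1600 mm^2
p_no_touch = safe_area / total_area

print(total_area)            # 3721
print(round(p_no_touch, 2))  # 0.43
```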
for (b) x= 1,
Both results are the same as before but are obtained more easily.
P(x) = (n x) p^x (1 − p)^(n−x)
To illustrate the use of the binomial law consider the following example. A
firm has 10 lorries in service distributing its goods. Given that each lorry spends
10% of its time in the repair depot, what is the probability of (a) no lorry in the
depot for repair, and (b) more than one in for repair?
(a) Probability of success (i.e., lorry under repair), p = 0.10
Number of trials n = 10 (lorries)
P(0) = (10 0) × 0.10⁰ × 0.90¹⁰ = 0.3487
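The full lorry calculation, including part (b), can be sketched with the binomial law; "more than one" is most easily found as the complement of "none or one":

```python
from math import comb

# Binomial law for the lorries: n = 10, p = 0.10 (chance in repair depot).
n, p = 10, 0.10

def binom(k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p0 = binom(0)
p1 = binom(1)
p_more_than_one = 1 - p0 - p1  # complement of "0 or 1 in for repair"

print(round(p0, 4))               # 0.3487
print(round(p1, 4))               # 0.3874
print(round(p_more_than_one, 4))  # 0.2639
```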
Examples
run criterion may not be relevant to his once only decision. Compared with the
guaranteed price, by advertising, he will either lose £50 or be £20 in pocket with
probabilities of 0.4 and 0.6 respectively. He would probably make his decision
by assessment of the risk of 40% of losing money. In practice, he could probably
increase the chances of a private sale by bargaining and allowing the price to drop
as low as £830 before being out of pocket.
As a further note, the validity of the estimate (usually subjective) of a 60%
chance of selling privately at the price asked should be carefully examined as
well as the sensitivity of any solution to errors in the magnitude of the
probability estimate.
2. A firm is facing the chance of a strike occurring at one of its main plants.
Considering only two points (normally more would be used), management
assesses the following:
(a) An offer of 5% pay increase has only a 10% chance of being accepted
outright. If a strike occurs:
3. A man fires shots at a target, the probability of each shot scoring a hit being
1/4 independently of the results of previous shots. What is the probability that
in three successive shots
(a) he will fail to hit the target?
(b) he will hit the target at least twice?
4. Five per cent of the components in a large batch are defective. If five are
taken at random and tested
(a) What is the probability that no defective components will appear?
(b) What is the probability that the test sample will contain one defective
component?
(c) What is the probability that the test sample will contain two defective
components?
6. A certain type of seed has a 90% germination rate. If six seeds are planted,
what is the chance that
(a) exactly five seeds will germinate?
(b) at least five seeds will germinate?
7. A bag contains 7 white, 3 red, and 5 black balls. Three are drawn at random
without replacement. Find the probabilities that (a) no ball is red, (b) exactly
one is red, (c) at least one is red, (d) all are of the same colour, (e) no two are
of the same colour.
9. If the probability that any person 30 years old will be dead within a year is
0.01, find the probability that out of a group of eight such persons, (a) none,
(b) exactly one, (c) not more than one, (d) at least one will be dead within a
year.
10. A and B arrange to meet between 3 p.m. and 4 p.m., but that each should
wait no longer than 5 min for the other. Assuming all arrival times between
3 o'clock and 4 o'clock to be equally likely, find the probability that they meet.
11. A manufacturer has to decide whether or not to produce and market a new
Christmas novelty toy. If he decides to manufacture he will have to purchase a
special plant and scrap it at the end of the year. If a machine costing £10 000
is bought, the fixed cost of manufacture will be £1 per unit; if he buys a
machine costing £20 000 the fixed cost will be 50p per unit. The selling
price will be £4.50 per unit.
Given the following probabilities of sales as:
Sales £2000 £5000 £10 000
Probability 0.40 0.30 0.30
What is the decision with the best pay-off?
12. Three men arrange to meet one evening at the 'Swan Inn' in a certain town.
There are, however, three inns called 'The Swan' in the town. Assuming that each
man is equally likely to go to anyone of these inns
(a) what is the chance that none of the men meet?
(b) what is the chance that all the men meet?
5 % defective
Figure 1.5
14. A marketing director has just launched four new products onto the market.
A market research survey showed that the chance of any given retailer adopting
the products was
Product A 0.95   Product C 0.80
Product B 0.50   Product D 0.30
What proportion of retailers will (a) take all four new products, (b) take
A, Band C but not D?
By the multiplication law, probability of selecting five good items from the
large batch = 0.95 x 0.95 x 0.95 x 0.95 x 0.95 = 0.77
(b) In a sample of five, one defective item can arise in the following five
ways:
D A A A A
A D A A A
A A D A A      D = defective part
A A A D A      A = acceptable part
A A A A D
The probability of each one of these mutually exclusive ways occurring
= 0.05 x 0.95 x 0.95 x 0.95 x 0.95 = 0.0407
The probability that a sample of five will contain one defective item
= 5 x 0.0407 = 0.2035
(c) In a sample of five, two defective items can occur in the following ways:
D D D D A A A A A A
D A A A D D D A A A
A D A A D A A D D A
A A D A A D A D A D
A A A D A A D A D D
or in 10 ways.
Probability of each separate way = 0.05² × 0.95³ = 0.00214
Probability that the sample will contain two defectives
= 10 x 0.00214 = 0.0214
It will be seen that permutations increase rapidly and the use of basic laws
is limited. The binomial law is of course the quicker method of solving this
problem, particularly if binomial tables are used.
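The shortcut the text describes can be sketched directly: the 10 arrangements are just comb(5, 2), and one binomial term replaces the whole enumeration:

```python
from math import comb

# Sample of five from a batch that is 5% defective: the binomial law
# reproduces the multiplication-law result for two defectives.
p = 0.05
n_ways = comb(5, 2)  # the 10 arrangements listed in the text
p_two_defective = n_ways * p ** 2 * (1 - p) ** 3

print(n_ways)                     # 10
print(round(p_two_defective, 4))  # 0.0214
```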
7. Conditional probability:
(a) Probability that no ball is red = 12/15 × 11/14 × 10/13 = 0.4835
(b) Probability that 1 ball is red = 3 × (3/15 × 12/14 × 11/13) = 0.4352
(c) Probability that at least 1 is red = 1 − 0.4835 = 0.5165
(d) Probability that all are the same colour
= P(all white) + P(all red) + P(all black)
= (7/15 × 6/14 × 5/13) + (3/15 × 2/14 × 1/13) + (5/15 × 4/14 × 3/13) = 0.1011
(e) Probability that all are different = 6 × 7/15 × 3/14 × 5/13 = 0.231
10. At the present stage this is best done geometrically, as in figure 1.6.
A and B will meet if the point representing their two arrival times is in the
shaded area.
P(meet) = 1 − P(point in unshaded area) = 1 − (11/12)² = 23/144
[Figure 1.6: B's arrival time plotted against A's arrival time; the shaded band lies within 5 min of the diagonal.]
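The geometric answer can be checked by simulation: draw both arrival times uniformly in the hour and count how often they differ by 5 minutes or less (a sketch, not from the original text):

```python
from fractions import Fraction
import random

# Exact answer from the geometric argument: 1 - (55/60)^2 = 23/144.
p_exact = 1 - Fraction(55, 60) ** 2

# Monte Carlo check with uniform, independent arrival times.
random.seed(1)
trials = 100_000
meets = sum(abs(random.uniform(0, 60) - random.uniform(0, 60)) <= 5
            for _ in range(trials))

print(p_exact)  # 23/144, about 0.16
print(meets / trials)  # close to 23/144
```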
11. There are three possibilities: (a) to produce the toys on the machine costing
£10 000; (b) to produce the toys on the machine costing £20 000; (c) not to
produce the toys at all.
The solution is obtained by calculating the expected profits for each
possibility.
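A sketch of the expected-profit comparison, on the assumption that the sales figures 2000, 5000 and 10 000 in problem 11 are unit sales (a per-unit selling price of £4.50 is given) and that the quoted per-unit cost applies to every unit made:

```python
# Assumed reading of problem 11: sales are in units, margin per unit is
# selling price minus the quoted per-unit cost, machine cost is sunk.
sales = [2000, 5000, 10_000]
probs = [0.40, 0.30, 0.30]
price = 4.50

def expected_profit(machine_cost, unit_cost):
    return sum(p * q * (price - unit_cost)
               for p, q in zip(probs, sales)) - machine_cost

options = {
    'machine costing 10 000': expected_profit(10_000, 1.00),
    'machine costing 20 000': expected_profit(20_000, 0.50),
    'do not produce': 0.0,
}
best = max(options, key=options.get)
print(round(options['machine costing 10 000'], 2))  # 8550.0
print(round(options['machine costing 20 000'], 2))  # 1200.0
print(best)
```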
13. (a) There will be no defective components in the assembly if all five
components selected are acceptable ones. The chance of such an occurrence is
given by the product of the individual probabilities and is
0.90 x 0.90 x 0.98 x 0.95 x 0.95 = 0.7164
(b) If the assembly contains one defective component, anyone (but only
one) of the five components could be the defective. There are thus five mutually
exclusive ways of getting the required result, each of these ways having its
probability determined by multiplying the appropriate individual probabilities
together.
1st x component   D    A    A    A    A
2nd x component   A    D    A    A    A
y component       A or A or D or A or A      A = acceptable part
1st z component   A    A    A    D    A      D = defective part
2nd z component   A    A    A    A    D
The probability of there being just one defective component in the assembly
is given by
2 × (0.10 × 0.90 × 0.98 × 0.95 × 0.95) + (0.90 × 0.90 × 0.02 × 0.95 × 0.95) +
2 × (0.90 × 0.90 × 0.98 × 0.05 × 0.95) = 0.1592 + 0.0146 + 0.0754 = 0.2492
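With unequal component reliabilities the "exactly one defective" sum can be automated by letting each component in turn be the failure (a sketch, not from the original text):

```python
# Assembly of five components with acceptance probabilities:
# two x at 0.90, one y at 0.98, two z at 0.95.
probs_good = [0.90, 0.90, 0.98, 0.95, 0.95]

# P(no defective): every component acceptable.
p_none = 1.0
for p in probs_good:
    p_none *= p

# P(exactly one defective): sum over which single component fails.
p_one = 0.0
for i, p in enumerate(probs_good):
    term = 1 - p
    for j, q in enumerate(probs_good):
        if j != i:
            term *= q
    p_one += term

print(round(p_none, 4))  # 0.7164
print(round(p_one, 4))   # 0.2492
```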
1.5.1 Experiment 1
This experiment, being the most comprehensive of the experiments in the
book, is unfortunately also the longest as far as data collection goes. However,
as will be seen from the points made, the results more than justify the time.
Should time be critical it is possible to miss experiment 1 and carry out
experiments 2 and 3 which are much speedier. In experiment 1 the data
collection time is relatively long since the three dice have to be thrown 100
times (this cannot be reduced without drastically affecting the results).
Appendix 1 contains full details of the analysis of eight groups' results for
the first experiment, and the following points should be observed in summarising
the experiment:
(1) The variation between the frequency distributions of the number of ones
(or the number of sixes) obtained by the individual groups, and the fact that
the distributions based on the total data (sum of all groups) are closer to the theoretical situation.
(2) The comparison of the distributions of score of the coloured dice and
the total score of three dice show clearly that the total score distribution now
tends to a bell-shaped curve.
1.5.2 Experiment 2
This gives a speedy demonstration of Bernoulli's law. As n, the number of
trials, increases, the estimate of p the probability gets closer to the true
population value. For n = 1 the estimate is either p = 1 or 0 and as n increases,
the estimates tend to get closer to p = 0.5. Figure 1.7 shows a typical result.
[Figure 1.7: the probability estimate (scale 0 to 1.0) plotted against the number of trials; the estimates fluctuate widely at first and settle towards 0.5.]
Figure 1.7
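Experiment 2 is easy to reproduce in software: after each toss, the running estimate heads/tosses is recorded, starting at 0 or 1 and settling towards 0.5 (a sketch, not from the original text):

```python
import random

# Bernoulli's theorem in miniature: running estimate of P(heads)
# for a simulated fair coin over 2000 tosses.
random.seed(42)
heads = 0
estimates = []
for n in range(1, 2001):
    heads += random.random() < 0.5  # one toss; True counts as 1
    estimates.append(heads / n)

print(estimates[0] in (0.0, 1.0))       # first estimate is always 0 or 1
print(abs(estimates[-1] - 0.5) < 0.1)   # settled near 0.5 by n = 2000
```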
1.5.3 Experiment 3
Again this is a simple demonstration of probability laws and sampling errors.
Four coins are tossed 50 times and in each toss the number of heads is
recorded. See table 6 of the laboratory manual.
Note
It is advisable to use the specially designed shakers or something similar.
Otherwise the coins will roll, or bias in the tossing will occur. The results
of this experiment are summarised in table 8 of the laboratory manual, and the
variation in the groups' results is stressed, as is the fact that the results based on all
groups' readings are closer to the theoretical than those for one group only.
Although students should not expect their results to obey the theoretical law exactly, they will have been shown that in statistics
all samples vary, but an underlying pattern emerges. The larger the samples
used the closer this pattern tends to be to results predicted by theory. The
basic laws-those of addition and multiplication-and other concepts of
probability theory, have been illustrated.
Other experiments with decimal dice can be designed.†
Number of persons: 2 or 3.
Object
The experiment is designed to illustrate
(a) the basic laws of probability
(b) that the relative frequency measure of probability becomes more
reliable the greater the number of observations on which it is based, that is,
Bernoulli's theorem.
Method
Throw three dice (2 white, 1 coloured) a hundred times. For each throw,
record in table 1
(a) the number of ones
(b) the number of sixes
(c) the score of the coloured die
(d) the total score of the three dice.
Draw up these results, together with those of other groups, into tables
(2, 3 and 4).
Analysis
1. For each set of 100 results and for the combined figures of all groups,
calculate the probabilities that, in a throw of three dice:
(a) no face shows a one
(b) two or more sixes occur
(c) a total score of more than 13 is obtained
† Details from: Technical Prototypes (Sales) Limited, 1A West Holme Street, Leicester.
[Table 1: specimen record of a hundred throws of three dice, giving for each throw the number of ones, the number of sixes, the score of the coloured die and the total score. The handwritten entries are not recoverable from this reproduction.]

[Table 2: tally chart of the total score of the three dice (3 to 18) for each group, with combined frequencies and the theoretical probabilities, e.g. P(3) = 1/216 = 0.0046, P(6) = 10/216 = 0.0463, P(18) = 1/216 = 0.0046.]
2. Compare these results with those expected from theory and comment on
the agreement both for individual groups and for the combined observations.
3. Draw probability histograms both for the score of the coloured die and for
the total score of the three dice, on page 27. Do this for your own group's
readings and for the combined results of all groups.
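The whole experiment can also be run in software; the sketch below records, for each of 100 throws, the same four quantities as tables 1 to 4 (a sketch, not from the original text):

```python
import random
from collections import Counter

# Software version of experiment 1: 100 throws of three dice,
# recording number of ones, number of sixes, the coloured (first)
# die and the total score.
random.seed(7)

def throw():
    dice = [random.randint(1, 6) for _ in range(3)]
    return {'ones': dice.count(1), 'sixes': dice.count(6),
            'coloured': dice[0], 'total': sum(dice)}

throws = [throw() for _ in range(100)]
total_scores = Counter(t['total'] for t in throws)
p_no_one = sum(t['ones'] == 0 for t in throws) / 100  # cf. analysis (a)

print(len(throws))  # 100
print(min(total_scores) >= 3 and max(total_scores) <= 18)  # True
```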
[Table 3: frequency distribution of the score of the coloured die, faces 1 to 6.]
2 Theory of distributions
7.8 8.0 8.6 8.1 7.9 8.2 8.1 7.9 8.2 8.1
8.4 8.2 7.8 8.0 7.5 7.4 8.0 7.3 7.6 7.8
7.7 7.8 7.5 7.9 7.8 8.3 7.9 8.0 8.2 7.4
7.1 7.5 7.9 8.2 8.5 7.9 7.5 7.8 8.4 8.1
8.2 7.9 8.7 7.7 7.8 8.0 8.1 8.2 7.9 7.3
8.0 8.1 7.8 8.1 7.6 7.8 7.9 8.5 7.8
8.3 7.9 8.1 7.6 7.9 8.3 7.4 7.9 8.7
7.6 8.0 8.0 8.2 8.2 7.9 8.1 8.4 7.6
7.9 7.7 7.9 7.8 7.8 7.7 7.5 7.7 8.1
8.1 8.0 8.1 7.7 8.0 8.0 8.0 8.1 7.7
Referring to these data, it will be seen that the figures vary one from the
other; the first is 7.8 h, the next 8.4 h and so on; there is one as low as 7.1 h and
one as high as 8.7 h.
In statistics the basic logic is inductive, and the data must be looked at as a
whole and not as a collection of individual readings.
It is often surprising to the non-statistician or deterministic scientist how
often regularities appear in these statistical counts.
The process of grouping data consists of two steps usually carried out
together.
These class intervals are usually of equal size, although in certain cases unequal
class intervals are used. For usual sample sizes, the number of class intervals is
chosen to be between 8 and 20, although this should be regarded as a general
rule only. For table 2.1, class intervals of size 0.2 h were chosen, i.e.,
7.1-7.3, 7.3-7.5, ..., 8.7-8.9
(3) More precise definition of the boundaries of the class intervals is, however,
required; otherwise readings which fall, say, at 7.3 can be placed in either of two
class intervals.
Since in practice the reading recorded as 7.3 h could have any value between
7.25 h and 7.35 h (normal technique of rounding off), the class boundaries will
now be taken as:
7.05-7.25, 7.25-7.45, ..., 8.45-8.65, 8.65-8.85
Note: Since an extra digit is used there is no possibility of any reading's falling
on the boundary of a class.
The summarising of data in figure 2.1 into a distribution is shown in
table 2.2. For each observation in table 2.1 a stroke is put opposite the sub-range
into which the reading falls. The strokes are made in groups of five for easy
summation.
[Tally strokes, made in groups of five, are shown in the original and omitted here.]

Class interval   Frequency   Relative frequency
7.05-7.25             1           0.01
7.25-7.45             5           0.05
7.45-7.65            10           0.11
7.65-7.85            19           0.20
7.85-8.05            27           0.28
8.05-8.25            22           0.23
8.25-8.45             6           0.06
8.45-8.65             3           0.03
8.65-8.85             2           0.02
                Total = 95   Total = 1.00
Table 2.2
The last operation is to total the strokes and enter the totals in the next to
last column in table 2.2, obtaining what is called a frequency distribution. There
is, for example, one reading in class interval 7.05-7.25, five readings in the
next, ten in the next, and so on. Such a table is called a frequency distribution
since it shows how the individuals are distributed between the groups or class
intervals. Diagrams are more easily assimilated, so it is normal to plot the
distribution graphically.

[Figure 2.2: frequency histograms for samples of 100, 2400 and 10 000 readings and for the infinite limit, plotted on a standardised scale from −3 to +3.]
Figure 2.2. The effect of the sample size on the histogram shape.
here being taken from an experiment in a laboratory. A sample size of 100 gives
an irregular shape similar to those obtained from the data of output times.
However, with increasing sample size, narrower class intervals can be used and
the frequency distribution becomes more uniform in shape until with a sample of
10 000 it is almost smooth. The limit as the sample size becomes infinite is also
shown. Thus with small samples, irregularities are to be expected in the frequency
distributions, even when the population gives a smooth curve.
It is the assumption that the population from which the data were obtained
has a smooth curve (although not all samples have) that enables the statistician
to use the mathematics of statistics.
[Figure 2.3: a distribution plotted against the variate values x_i.]
Figure 2.3
Consider now the 1st moment of the distribution about the origin

= Σ_{i=1}^{N} p_i x_i = x̄ (the arithmetical average)

Thus the 1st statistic or measure is the arithmetical average x̄. Higher moments
are now taken about this arithmetical average rather than the origin.
Thus, the 2nd moment about the arithmetical average

= Σ_{i=1}^{N} p_i (x_i − x̄)²
This 2nd moment is called the variance in statistics, and its square root is called
the standard deviation.
Thus the standard deviation of the distribution

= √[Σ_{i=1}^{N} p_i (x_i − x̄)²]

4th moment about the average = Σ_{i=1}^{N} p_i (x_i − x̄)⁴

or in general the kth moment about the average = Σ_{i=1}^{N} p_i (x_i − x̄)^k
The first two moments, the mean and the variance, are by far the most
important.
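The moment formulas can be sketched on a concrete discrete distribution; here we use the two-dice total-score distribution as an illustrative example (our choice, not the book's):

```python
from fractions import Fraction
from itertools import product

# Discrete distribution p_i at values x_i: total score of two dice.
counts = {}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] = counts.get(a + b, 0) + 1
dist = {x: Fraction(c, 36) for x, c in counts.items()}

# 1st moment about the origin (the mean), then the 2nd moment about
# the mean (the variance).
mean = sum(p * x for x, p in dist.items())
variance = sum(p * (x - mean) ** 2 for x, p in dist.items())

print(mean)      # 7
print(variance)  # 35/6
```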
Random Sample
A random sample is a sample selected without bias, i.e., one for which every
member of the population has an equal chance of being included in the sample.
Population or Universe
This is the total number of possible observations. This concept of a population
is fundamental to statistics. All data studied are in sample form and the
statistician's sample is regarded as having been drawn from the population of all
possible events. A population may be finite or infinite. In practice, many finite
populations are so large that they can be conveniently considered as infinite in size.
Class        Frequency
3.95-4.95    8
4.95-5.95    7
5.95-6.95    5
The class boundaries shown in this example are suitable for measurements
recorded to the nearest 0.1 of a unit. The boundaries chosen are convenient for
easy summary of the raw data since the first class shown contains all
measurements whose integer part is 4, the next class all measurements starting
with 5 and so on.
It would have been valid but less convenient to choose the classes as, say,
3.25-4.25, 4.25-5.25, ...
In grouping, any group is called a class and the number of values falling in
the class is the class frequency. The magnitude of the range of the group is
called the class interval; for 3.95-4.95 the class interval is 1.
Number of Groups
For simplicity of calculation, the number of intervals chosen should not be too
large, preferably not more than twenty. Again, in order that the results obtained
may be sufficiently accurate, the number must not be too small, preferably
not less than eight.
Types of Variable
Continuous. A continuous variable is one in which the variable can take every
value between certain limits a and b, say.
Discrete. A discrete variable is one which takes certain values only-frequently
part or all of the set of positive integers. For example, each member of a
sample may or may not possess a certain attribute and the observation recorded
(the value of the variable) might be the number of sample members which possess
the given attribute.
Frequency Histogram
A frequency distribution shows the number of observations falling into each class
interval when a sample is grouped according to the magnitude of the values. If the
class frequency is plotted as a rectangular block on the class interval, the
diagram is called a frequency histogram. Note: Area is proportional to frequency.
Probability Histograms
A probability histogram is the graphical picture obtained when the grouped
sample data are plotted, the class probability being erected as a rectangular
block on the class interval. The area above any class interval is equal to the
probability of an observation being in that class since the total area under the
histogram is equal to one.
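As a quick numerical check of the area property, the following sketch builds a probability histogram from the small frequency table quoted earlier (classes 3.95-4.95, 4.95-5.95, 5.95-6.95) and confirms that the total area is one.

```python
# Probability histogram: each class probability p_i = f_i / N is drawn as a
# block over its class interval, so with class width c the block height is
# p_i / c and the total area under the histogram is one.
freqs = [8, 7, 5]                 # class frequencies from the text
width = 1.0                       # class interval
n = sum(freqs)
probs = [f / n for f in freqs]    # class probabilities
heights = [p / width for p in probs]
total_area = sum(h * width for h in heights)
```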
Variate
A variate is a variable which possesses a probability distribution.
Type 1: Unimodal
Examples of this variation pattern are: intelligence quotients of children,
heights (and/or weights) of people, nearly all man-made objects when produced
under controlled conditions (length of bolts mass-produced on capstans, etc.).
A simple example of this type of distribution can be illustrated if one
assumes that the aim is to make each item or product alike but that there
exists a very large number of small independent forces deflecting the aim, and
under such conditions, a unimodal distribution arises. For example, consider
a machine tool mass-producing screws. The setter sets the machine up as
correctly as he can and then passes it over to the operator and the screws
produced form a pattern of variation of type 1. The machine is set to produce
each screw exactly the same, but, because of a large number of deflecting
forces present, such as small particles of grit in the cooling oil, vibrations in
the machine and slight variations in the metal, manufacturing conditions are not
constant; hence there is variation in the final product. (See simple quincunx
unit on page 61.)
Type 4: Bimodal
This type cannot be classified as a separate form unless more evidence of
measures conforming to this pattern of variation is discovered. In most cases
this type arises from the combination of two distributions of type 1 (see
figure 2.5).
Figure 2.5. Bimodal distribution arising from two type-1 distributions with
different means m₁ and m₂.
Type 6: U-Shaped
This type is fascinating in that its pattern is the opposite of type 1. A variable
where the least probable values are those around the average would not be
expected intuitively, and it occurs only rarely in practice. One example,
however, is the degree of cloudiness of the sky-at certain times of the year
the sky is more likely to be completely clear or completely cloudy than anything
in between.
The 1st moment (arithmetic average) = Σᵢ f_i x_i / Σᵢ f_i = x̄
or, using the transformation x = x₀ + cu,
1st moment x̄ = x₀ + c (Σᵢ f_i u_i / Σᵢ f_i)
Example
The values given in table 2.3 have been calculated using the data from table 2.2.
Class        Mid-point x    f     u     uf     u²f
7.05-7.25    7.15           1     -4    -4     16
7.25-7.45    7.35           5     -3    -15    45
7.45-7.65    7.55           10    -2    -20    40
7.65-7.85    7.75           19    -1    -19    19
7.85-8.05    7.95           27    0     0      0
8.05-8.25    8.15           22    +1    +22    22
8.25-8.45    8.35           6     +2    +12    24
8.45-8.65    8.55           3     +3    +9     27
8.65-8.85    8.75           2     +4    +8     32
                            Σf = 95    Σuf = -7    Σu²f = 225
Table 2.3
Table 2.4
Let x₀ = 7.95 and c = 0.2; then, from table 2.3, Σuf = -7 and Σu²f = 225.
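The coding transform can be checked numerically against table 2.3: the mean computed through the coded values u agrees with the mean computed directly from the class mid-points. A short Python sketch:

```python
# Coding transform x = x0 + c*u applied to table 2.3.
mids = [7.15, 7.35, 7.55, 7.75, 7.95, 8.15, 8.35, 8.55, 8.75]
freqs = [1, 5, 10, 19, 27, 22, 6, 3, 2]
x0, c = 7.95, 0.2

n = sum(freqs)                                                  # 95
# coded value u for each class (rounded to absorb floating-point error)
sum_uf = sum(f * round((m - x0) / c) for m, f in zip(mids, freqs))  # -7
coded_mean = x0 + c * sum_uf / n
direct_mean = sum(f * m for m, f in zip(mids, freqs)) / n
```

The two means are identical (apart from rounding in the arithmetic), which is the whole point of the coding method: it shifts and scales the working values without changing the result.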
0.09 0.09 0.11 0.09 0.09 0.11 0.09 0.07 0.09 0.06
0.09 0.09 0.09 0.11 0.09 0.07 0.09 0.06 0.10 0.07
0.09 0.10 0.06 0.10 0.08 0.06 0.09 0.08 0.08 0.08
0.08 0.10 0.08 0.07 0.09 0.08 0.09 0.11 0.09 0.09
0.08 0.10 0.09 0.08 0.10 0.08 0.08 0.09 0.09 0.09
0.08 0.06 0.08 0.08 0.10 0.09 0.09 0.10 0.10 0.11
0.26 0.28 0.31 0.22 0.25 0.28 0.28 0.26 0.29 0.25
0.24 0.29 0.26 0.28 0.24 0.26 0.29 0.23 0.26 0.26
0.25 0.30 0.25 0.29 0.17 0.26 0.33 0.24 0.18 0.34
0.26 0.31 0.23 0.29 0.22 0.26 0.29 0.25 0.24 0.28
0.27 0.32 0.23 0.26 0.25 0.28 0.36 0.42 0.24 0.21
0.23 0.27 0.46 0.23 0.28 0.31 0.29 0.31 0.25
0.24 0.28 0.33 0.24 0.29 0.36 0.32 0.27 0.24
0.25 0.29 0.33 0.25 0.35 0.24 0.33 0.28 0.26
0.26 0.20 0.24 0.26 0.34 0.30 0.30 0.29
0.18 0.22 0.25 0.27 0.33 0.30 0.30 0.23
1.05 1.68 0.78 1.10 0.32 1.61 0.10 0.43 3.70 0.09
0.21 2.71 2.12 2.81 3.30 0.15 0.54 3.12 0.80 1.76
1.14 0.16 0.31 0.91 0.18 0.04 1.16 2.16 1.48 0.63
0.57 0.65 4.60 1.72 0.52 2.32 0.08 0.62 3.80 1.21
1.16 0.58 0.57 0.04 1.19 0.11 0.05 2.68 2.08 0.01
0.15 0.42 0.25 0.05 1.88 3.90
4. The numbers of defects per shift from a large indexing machine are given
below for the last 52 shifts:
2 6 4 5 1 3 2 1 4 2 1 4 6
3 4 3 2 4 5 4 3 6 3 0 7 4
7 3 5 4 3 2 0 5 2 5 3 2 9
5 3 2 1 0 3 3 4 3 2
5. The crane handling times, in minutes, for a sample of 100 jobs lifted and
moved by an outside yard mobile crane are given below:
5 6 21 8 7 8 11 5 10 21
13 15 17 7 27 6 6 11 9 4
7 4 9 192 10 15 31 15 11 38
16 52 87 20 18 22 11 7 9 8
6 10 10 17 37 32 10 26 14 15
28 182 17 27 4 9 19 10 44 20
15 5 20 8 25 14 23 13 12 7
9 92 33 22 19 151 171 21 4 6
31 13 7 45 6 7 17 7 19 42
9 6 55 61 52 7 5 102 8 23
7. The number of goals scored in 57 English and Scottish league matches for
Saturday 23rd September, 1969, was:
0 2 3 3 5 2 2 1 4 4
2 3 3 3 2 2 0 4 6 2 5
6 1 4 4 3 4 2 7 6 2
3 6 4 2 4 3 3 3 6 8 3
5 3 3 2 3 1
1 3 5
9. The sales values for the last 30 periods of a non-seasonal product are given
below in units of £100:
43 41 74 61 79 60 71 69 63 77
70 66 64 71 71 74 56 74 41 71
63 57 57 68 64 62 59 52 40 76
10. The records of the total score of three dice in 100 throws are given below:
16 4 9 12 11 8 15 13 12 13
8 7 6 13 10 11 16 14 7 12
14 14 4 13 9 12 8 10 12 14
8 4 10 6 9 10 13 12 13 13
16 7 13 12 9 8 10 11 12 10
15 12 4 16 10 9 13 10 9 12
9 4 14 13 7 6 11 9 15 8
5 12 7 6 7 13 13 11 13 14
12 7 10 12 12 12 13 9 16 4
† These data were taken from Facts from Figures by M. J. Moroney, Pelican.
Table 2.5
Transforming x = x₀ + cu
Let x₀ = 0.09
c = 0.01
1st moment about the origin = arithmetical mean,
x̄ = x₀ + c(Σuf/Σf) = 0.09 + 0.01 × (-18/60) = 0.087 min
Standard deviation
s' = √(1.64 × 10⁻⁴) = 0.013 min
The histogram is shown in figure 2.6.
2. Range = 0.46 - 0.17 = 0.29 min; size of class interval = 0.03 min, giving
9-10 class intervals.
Class          Mid-point x    f     u     uf     u²f
0.165-0.195    0.18           3     -3    -9     27
0.195-0.225    0.21           5     -2    -10    20
0.225-0.255    0.24           25    -1    -25    25
0.255-0.285    0.27           26    0     0      0
0.285-0.315    0.30           19    +1    +19    19
0.315-0.345    0.33           10    +2    +20    40
0.345-0.375    0.36           3     +3    +9     27
0.375-0.405    0.39           0     +4    0      0
0.405-0.435    0.42           1     +5    +5     25
0.435-0.465    0.45           1     +6    +6     36
                              Σf = 93    Σuf = +15    Σu²f = 219
Table 2.6
(For histogram see figure 2.7.)
Average time
x̄ = 0.27 + 0.03 × (15/93) = 0.275 min
Variance of sample
(s')² = 0.03²[219/93 − (15/93)²] = 0.0021
Standard deviation of sample s' = 0.046 min
3. Range = 4.60 - 0.01 = 4.59 min; width of class interval = 0.5 min.
Class         f     u     uf     u²f
0-0.499       19    -2    -38    76
0.50-0.999    11    -1    -11    11
1.00-1.499    7     0     0      0
1.50-1.999    6     +1    +6     6
2.00-2.499    4     +2    +8     16
2.50-2.999    3     +3    +9     27
3.00-3.499    2     +4    +8     32
3.50-3.999    3     +5    +15    75
4.00-4.499    0     +6    0      0
4.50-4.999    1     +7    +7     49
              Σf = 56    Σuf = +4    Σu²f = 292
Table 2.7
(For histogram, see figure 2.8.)
Transform
x = x₀ + cu
Let
x₀ = 1.25, c = 0.50
1st moment about the origin = arithmetic average,
x̄ = x₀ + c(Σuf/Σf) = 1.25 + 0.50 × (4/56) = 1.29 min
Variance of sample
(s')² = c²[Σu²f/Σf − (Σuf/Σf)²] = 0.5²[292/56 − (4/56)²] = 0.25 × 5.21 = 1.30
Standard deviation of sample s' = 1.14 min
Number of defects x    f     u     uf     u²f
0                      3     -3    -9     27
1                      7     -2    -14    28
2                      9     -1    -9     9
3                      12    0     0      0
4                      9     +1    +9     9
5                      6     +2    +12    24
6                      3     +3    +9     27
7                      2     +4    +8     32
8                      0     +5    0      0
9                      1     +6    +6     36
                       Σf = 52    Σuf = +12    Σu²f = 192
Table 2.8
(For histogram see figure 2.9.)
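As a numerical aside (not part of the book's worked solution), the mean and variance of the table 2.8 distribution can be computed directly; their closeness for count data of this kind anticipates the Poisson law of chapter 3.

```python
# Mean and variance of the defects-per-shift distribution (table 2.8).
xs = list(range(10))                    # 0..9 defects per shift
fs = [3, 7, 9, 12, 9, 6, 3, 2, 0, 1]   # frequencies from table 2.8
n = sum(fs)                             # 52 shifts
mean = sum(f * x for x, f in zip(xs, fs)) / n
variance = sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / n
# mean and variance come out close together (about 3.23 and 3.64)
```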
(For histogram, see figure 2.10.)
Transform x = x₀ + cu
Let c = 10 min
x₀ = 35
(s')² = 10²[1082/100 − (104.5/100)²] = 10²(9.72) = 972
Standard deviation of sample s' = 31.2 min
Class            f     u     uf     u²f
549.5- 649.5     1     -4    -4     16
649.5- 749.5     1     -3    -3     9
749.5- 849.5     10    -2    -20    40
849.5- 949.5     26    -1    -26    26
949.5-1049.5     26    0     0      0
1049.5-1149.5    18    +1    +18    18
1149.5-1249.5    11    +2    +22    44
1249.5-1349.5    7     +3    +21    63
                 Σf = 100    Σuf = +8    Σu²f = 216
Table 2.10
(For histogram, see figure 2.11.)
Transforming x = x₀ + uc
where c = 100 h
x₀ = 1000 h
Average lifetime x̄ = 1000 + 100 × (8/100) = 1008 h
Figure 2.11. Lifetime of electric light bulbs.
Number of goals/match    Frequency (f)    u     uf     u²f
0                        2                -4    -8     32
1                        9                -3    -27    81
2                        11               -2    -22    44
3                        15               -1    -15    15
4                        8                0     0      0
5                        5                +1    +5     5
6                        5                +2    +10    20
7                        1                +3    +3     9
8                        1                +4    +4     16
                         Σf = 57    Σuf = -50    Σu²f = 222
Table 2.11
(For histogram, see figure 2.12.)
x₀ = 4, c = 1
Average goals/match
x̄ = 4 + 1 × (-50/57) = 3.12
Variance of sample
(s')² = 1²[222/57 − (50/57)²] = 3.13
Standard deviation of sample = 1.8
Figure 2.12. Number of goals scored in soccer matches.
Class          f     u     uf     u²f
54.5- 64.5     1     -4    -4     16
64.5- 74.5     2     -3    -6     18
74.5- 84.5     9     -2    -18    36
84.5- 94.5     22    -1    -22    22
94.5-104.5     33    0     0      0
104.5-114.5    22    +1    +22    22
114.5-124.5    8     +2    +16    32
124.5-134.5    2     +3    +6     18
134.5-144.5    1     +4    +4     16
               Σf = 100    Σuf = -2    Σu²f = 180
Table 2.12
Transforming x = x₀ + cu (For histogram, see figure 2.13.)
where x₀ = 99.5
c = 10
Average intelligence quotient
x̄ = 99.5 + 10 × (-2/100) = 99.3
Figure 2.13. Intelligence quotients of children.
Table 2.13
(For histogram, see figure 2.14.)
Transforming x = x₀ + cu
where x₀ = 61.5
c = 4
Average sales/period
x̄ = 61.5 + 4 × (11/30) = 63
Variance of sample
(s')² = 4²[207/30 − (11/30)²] = 108.3
Class        f     u     uf     u²f
3.5- 5.5     7     -3    -21    63
5.5- 7.5     13    -2    -26    52
7.5- 9.5     17    -1    -17    17
9.5-11.5     18    0     0      0
11.5-13.5    29    +1    +29    29
13.5-15.5    11    +2    +22    44
15.5-17.5    5     +3    +15    45
             Σf = 100    Σuf = +2    Σu²f = 250
Table 2.14
where x₀ = 10.5
c = 2
Average score
x̄ = 10.5 + 2 × (2/100) = 10.54
Variance of sample
(s')² = 2²[250/100 − (2/100)²] = 10
Standard deviation s' = 3.16
Figure 2.16
For the cases shown in figure 2.16:
(a) The cutter has held the standard and produced a bell-shaped curve.
(b) Here, either consciously or not, the standard has been changing.
(c) Here the negative skew distribution has arisen from the cutter, again either
consciously or not, placing control on the short end of the straw.
Laboratory Equipment
Shove-halfpenny board or specially designed board (available from Technical
Prototypes (Sales) Ltd).
Method
After one trial, carry out 50 further trials, measuring the distance travelled each
time, the object being to send the disc the same distance at each trial.
Analysis
Summarise the data into a distribution, draw a histogram and calculate the mean
and standard deviation.
time, and if the average number of successes in time t is m, then the probability
of x successes in time t is
p(x) = m^x e^(-m)/x!
as before.
Tutors must stress the relationship between these distributions so that students
can understand the type to use for any given situation.
Tutors can introduce students to the use of the binomial distribution in place of
the hypergeometric distribution in sampling theory when n/N ≤ 0.10.
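The rule of thumb can be illustrated numerically. The sketch below compares the hypergeometric and binomial probabilities for illustrative values N = 1000, M = 100, n = 10 (so n/N = 0.01, well inside the rule); these numbers are not taken from the text.

```python
from math import comb

def hypergeom(x, n, M, N):
    """P(x marked items in a sample of n from N items of which M are marked)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def binomial(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

N, M, n = 1000, 100, 10   # illustrative population and sample
p = M / N                  # 0.10
diffs = [abs(hypergeom(x, n, M, N) - binomial(x, n, p)) for x in range(n + 1)]
# every probability agrees to within about 0.002
```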
Students should be introduced to the use of statistical tables at this stage. For
all examples and problems, the complementary set of tables, namely Statistical
Tables by Murdoch and Barnes, published by Macmillan, has been used. As
mentioned in the preface, references to these tables will be followed by an
asterisk.
Note: The first and second moments of the binomial and Poisson distributions
are given below.
                                          Binomial    Poisson
1st moment (mean) μ                       np          m
2nd moment about the mean (variance) σ²   np(1-p)     m
p(x) = C(13, x) C(39, 13 − x) / C(52, 13)
Probability of exactly five diamonds in the hand
= C(13, 5) C(39, 8) / C(52, 13)
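This hypergeometric probability is easily evaluated directly; a sketch:

```python
from math import comb

# Hypergeometric probability of x diamonds in a 13-card hand dealt from a
# full pack: 13 diamonds among 52 cards, sample size 13.
def p_diamonds(x):
    return comb(13, x) * comb(39, 13 - x) / comb(52, 13)

p5 = p_diamonds(5)   # exactly five diamonds, about 0.125
```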
2. A distribution firm has 50 lorries in service delivering its goods; given that
lorries break down randomly and that each lorry utilisation is 90%, what
proportion of the time will
(a) exactly three lorries be broken down?
(b) more than five lorries be broken down?
(c) less than three lorries be broken down?
This is the binomial distribution since the probability of success, i.e. the
probability of a lorry breaking down, is p = 0.10 and this probability is constant.
The number of trials n = 50.
(a) Probability of exactly three lorries being broken down
p(3) = C(50, 3)(0.10)³(0.90)⁴⁷
(b) Probability of more than five lorries being broken down
p(x > 5) = Σ(x=6 to 50) C(50, x)(0.10)^x (0.90)^(50−x)
from table 1*
p(x > 5) = 0.3839
(c) Probability of less than three lorries being broken down
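All three parts of the lorry problem can be evaluated directly rather than from table 1*; a sketch:

```python
from math import comb

# Binomial point probabilities for the lorry problem: n = 50, p = 0.10.
def b(x, n=50, p=0.10):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

p_exactly_3 = b(3)                                # part (a), about 0.139
p_more_than_5 = sum(b(x) for x in range(6, 51))   # part (b), 0.3839 as in table 1*
p_less_than_3 = b(0) + b(1) + b(2)                # part (c), about 0.112
```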
3. How many times must a die be rolled in order that the probability of a 5
occurring is at least 0.75?
This can be solved using the binomial distribution. Probability of
success, i.e. a 5 occurring, is p = 1/6.
Let k be the unknown number of rolls required; then the probability of x
5's in k rolls is C(k, x)(1/6)^x (5/6)^(k−x).
Probability required is
Σ(x=1 to k) C(k, x)(1/6)^x (5/6)^(k−x) = 0.75
Since
1 − Σ(x=1 to k) C(k, x)(1/6)^x (5/6)^(k−x) = probability of not getting a 5 in k throws = (5/6)^k
∴ (5/6)^k = 1 − 0.75 = 0.25
giving k log(5/6) ≤ log 0.25, i.e. k ≥ 7.6, so at least 8 rolls are needed.
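The same answer follows by searching for the smallest k numerically:

```python
# Smallest number of rolls k with P(at least one 5) = 1 - (5/6)**k >= 0.75.
k = 1
while 1 - (5 / 6) ** k < 0.75:
    k += 1
# the loop stops at k = 8
```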
4. A firm receives very large consignments of nuts from its supplier. A random
sample of 20 is taken from each consignment. If the consignment is in fact 30%
defective, what is
(a) probability of finding no defective nuts in the sample?
(b) probability of finding five or more defective nuts in the sample?
5. The average usage of a spare part is one per month. Assuming that all machines
using the part are independent and that breakdowns occur at random, what is
(a) the probability of using three spares in any month?
(b) the level of spares which must be carried at the beginning of each month
so that the probability of running out of stock in any month is at most 1 in
100?
This is the Poisson distribution.
The expected usage m = 1.0
(a) ∴ Probability of using three spares in any month
p(3) = (1.0³ e⁻¹·⁰)/3! = 0.0613
(b) This question is equivalent to: what demand in a month has a probability
of at most 0.01 of being equalled or exceeded?
Note: A runout occurs if stock falls to zero.
From table 2*
Stocking four spares, probability of four or more = 0.0190
Stocking five spares, probability of five or more = 0.0037
∴ Stock five, the probability of runout being 0.0037
(Note: It is usual to go to a probability of 1/100 or less.)
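The spares calculation can be reproduced directly from the Poisson law rather than from table 2*:

```python
from math import exp, factorial

# Poisson spares calculation with m = 1.0 (average usage one per month).
def poisson(x, m=1.0):
    return m ** x * exp(-m) / factorial(x)

def p_at_least(k, m=1.0):
    """Probability that demand in a month is k or more."""
    return 1.0 - sum(poisson(x, m) for x in range(k))

p_three = poisson(3)        # part (a), about 0.061
p4 = p_at_least(4)          # 0.0190 as in the text
p5 = p_at_least(5)          # 0.0037 as in the text
# part (b): smallest stock level with runout probability at most 1 in 100
stock = next(k for k in range(20) if p_at_least(k) <= 0.01)
```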
random, and calculate and draw the probability distribution of the number of
failures per six months per machine over 100 machines.
Calculate the average and the standard deviation of the distribution.
This is the Poisson distribution.
Expected number of failures per machine per six months, m = 2.
Number of failures    Probability p_i    Expected number of failures f_i    u     uf_i     u²f_i
0                     0.1353             13.5                               -2    -27      54.0
1                     0.2707             27.1                               -1    -27.1    27.1
2                     0.2707             27.1                               0     0        0
3                     0.1804             18.0                               +1    +18      18.0
4                     0.0902             9.0                                +2    +18      36.0
5                     0.0361             3.6                                +3    +10.8    32.4
6                     0.0121             1.2                                +4    +4.8     19.2
7 or over             0.0045             0.5                                +5    +2.5     12.5
                      1.0000             Σf_i = 100                         Σuf_i = 0    Σu²f_i = 199.2
Table 3.1. The values have been calculated from table 2* of statistical tables.
Transform x = uc + x₀
x₀ = 2
c = 1
The arithmetical average x̄ = 2 + (0/100) = 2.0
Variance (s')² = 199.2/100 − 0² = 1.99
Students will be introduced here to some of the logic used later so that they
can see, even at this introductory stage, something of the overall analysis using
statistical methods.
Table 3.2
Table 3.3
The agreement will be seen to be fairly close and when tested (see chapter 8), is
a good fit. It is interesting to see that the greater part of the variation is due to this
basic law of variation. However, larger samples tend to show that the Poisson
does not give a correct fit in this particular context.
Number of deaths/corps/year    0      1     2     3    4
Frequency                      109    65    22    3    1
Table 3.4
From this table the average number of deaths/corps/year, m = 0.61
Setting up the null hypothesis, namely, that the probability of a death has been
constant over the years and is the same for each corps, is equivalent to
postulating that this pattern of variation follows the Poisson law. Fitting a
Poisson distribution to these data and comparing the fit, gives a method of
testing this hypothesis. Using table 2* of statistical tables, and without
interpolating, i.e. using m = 0.60, gives the results shown in table 3.5.
Table 3.5
3. Outbreaks of War
The data in table 3.6 (from Mathematical Statistics by J. F. Ractliffe, O.U.P.) give
the number of outbreaks of war each year between the years 1500 and 1931
inclusive.
Table 3.6
Setting up a hypothesis that war was equally likely to break out at any instant
of time during this 432-year period would give rise to a Poisson distribution. The
fitting of this Poisson distribution to the data gives a method of testing this
hypothesis.
The average number of outbreaks/year = 0.69 ≈ 0.70
Using table 2* of statistical tables, table 3.7 gives a comparison of the actual
variation with that of the Poisson. Again comparison shows the staggering fact
that life has closely followed this basic law of variation.
Number of outbreaks of war    0    1    2    3    4    5 or more    Total
Table 3.7
The actual demands at the MacDill Airforce Base per week for three spares
for the B-47 airframe over a period of 65 weeks are given in table 3.8.
The Poisson frequencies are obtained by using the statistical tables and
table 3.8 gives a comparison of the actual usage distribution with that of the
Poisson distribution.
The theoretical elements assuming the Poisson distribution are shown in the
table also. It will be seen that these distributions agree fairly well with actual
demands.
Demand/week    Actual frequency    Poisson frequency
0              75                  74.2
1              90                  90.1
2              54                  54.8
3              22                  22.2
4              6                   6.8
5              2                   1.6
6 or more      0                   0.4
2. If the chance that anyone of ten telephone lines is busy at any instant is 0.2,
what is the chance that five of the lines are busy?
5. In a quality control scheme, samples of five are taken from the production at
regular intervals of time.
What number of defectives in the samples will be exceeded 1 in 20 times if the
process average defective rate is (a) 10%, (b) 20%, (c) 30%?
7. From a group of eight male operators and five female operators a committee
of five is to be formed. What is the chance of
(a) all five being male?
(b) all five being female?
(c) how many ways can the committee be formed if there is exactly one
female on it?
8. In 1000 readings of the results of trials for an event of small probability, the
frequencies f_i and the numbers x_i of successes were:
x_i    0      1      2      3     4     5    6    7
f_i    305    365    210    80    28    9    2    1
Show that the expected number of successes is 1.2 and calculate the expected
frequencies assuming a Poisson distribution.
Calculate the variance of the distribution.
2. p(5 lines busy) = C(10, 5)(0.2)⁵(0.8)⁵ = 0.0264 from table 1* in statistical tables.
x    Probability (approximately)
0    0.33
1    0.41
2    0.20
3    0.05
4    0.01
5    0
Table 3.10
5. (a) n = 5
p = 0.10
From table 1*
Probability of exceeding 1 = 0.0815
Probability of exceeding 2 = 0.0086
1 in 20 times is a probability of 0.05
∴ Number of defectives exceeded 1 in 20 times is greater than 1 but less
than 2.
(b) n = 5
p = 0.20
From table 1*
Probability of more than 2 = 0.0579
∴ Number of defectives exceeded 1 in 20 times (approximately) is 2
(c) n = 5
p = 0.30
From table 1*
Probability of more than 3 = 0.0318
∴ Number of defectives exceeded 1 in 20 times is nearly 3
6. n = 20
p = 0.20
From table 1*
Probability of more than four rejects = 0.3704
∴ Four will be exceeded 37 times in 100
7. (a) M = 8, N = 13, n = 5
Probability
= C(8, 5)/C(13, 5) = (8 × 7 × 6 × 5 × 4)/(13 × 12 × 11 × 10 × 9) = 0.044
(b) M = 8, N = 13, N − M = 5, n = 5, x = 0
Probability
= C(5, 5) C(8, 0)/C(13, 5) = (5 × 4 × 3 × 2 × 1)/(13 × 12 × 11 × 10 × 9) = 0.00078
(c) Number of ways one female can be chosen from five
= C(5, 1) = 5
Number of ways the committee can then be completed with four males
= 5 × C(8, 4) = 5 × (8 × 7 × 6 × 5)/(4 × 3 × 2 × 1) = 350
8. Taking u = x − 1 (assumed mean 1):
x    f      u     uf      u²f
0    305    -1    -305    305
1    365    0     0       0
2    210    +1    210     210
3    80     +2    160     320
4    28     +3    84      252
5    9      +4    36      144
6    2      +5    10      50
7    1      +6    6       36
     Σf = 1000    Σuf = 201    Σu²f = 1317
Table 3.11
Expected number of successes x̄ = 1 + 201/1000 = 1.2
Variance = [Σu²f − (Σuf)²/Σf]/Σf = (1317 − 40.4)/1000 = 1.277
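The quoted mean and variance, and the expected Poisson frequencies asked for in the problem, can be checked directly:

```python
from math import exp, factorial

# Problem 8: 1000 readings with mean 1.2; expected Poisson frequencies and
# the variance quoted in table 3.11.
xs = list(range(8))
fs = [305, 365, 210, 80, 28, 9, 2, 1]
n = sum(fs)
mean = sum(f * x for x, f in zip(xs, fs)) / n
variance = sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / n
# expected frequencies 1000 * p(x) for a Poisson law with m = 1.2
expected = [n * 1.2 ** x * exp(-1.2) / factorial(x) for x in xs]
```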
Binomial Distribution
Number of persons: 2 or 3.
Object
The experiment is designed to demonstrate the basic properties of the binomial
law.
Method
Using the binomial sampling box, take 50 samples of size 10 from the population,
recording in table 18 the number of coloured balls found in each sample.
(Note: Proportion of coloured (i.e. other than white) balls is 0.15.)
Analysis
1. Group the data of table 18 into the frequency distribution, using the top
part of table 19.
2. Obtain the experimental probability distribution of the number of coloured
balls found per sample and compare it with the theoretical probability
distribution.
3. Combine the frequencies for all groups, using the lower part of table 19,
and obtain the experimental probability distribution for these combined results.
Again, compare the observed and theoretical probability distributions.
4. Enter, in table 20, the total frequencies obtained by combining individual
groups' results. Calculate the mean and standard deviation of this distribution
and compare them with the theoretical values given by np and √[np(1−p)]
respectively where, in the present case, n = 10 and p = 0.15.
Sample Results
1-10     3 2 1 3 4 3 2 0 0 1
11-20    2 3 2 3 3 1 2 1 1 0
21-30    0 0 1 1 2 1 3 1 1 3
31-40    4 4 1 1 4 1 2 3 2 2
41-50    1 1 1 1 2 1 0 0 1 2
Number of coloured balls per sample    0        1        2        3        4        5        6 or more    Total
Experimental frequency                 7        19       11       9        4        0        0            50
Experimental probability               0.14     0.38     0.22     0.18     0.08     0        0            1.0
Theoretical probability                0.197    0.347    0.276    0.130    0.040    0.008    0.001        1.0
Group results
1    7     19     11     9     4     0    0    50
2    12    20     8      8     1     1    0    50
3    10    16     17     4     3     0    0    50
4    8     16     17     5     2     1    1    50
5    7     5      17     13    7     1    0    50
6    5     21     11     10    2     1    0    50
7    12    16     12     9     1     0    0    50
8    13    17     11     7     2     0    0    50
Total frequency (all groups)    74    130    104    65    22    4    1    400
Experimental probability    0.185    0.325    0.260    0.163    0.055    0.010    0.002    1.0
Table 3.13 (Table 19 of the laboratory manual)
Number of coloured balls per sample x    Frequency f    fx     fx²
0                                        74             0      0
1                                        130            130    130
2                                        104            208    416
3                                        65             195    585
4                                        22             88     352
5                                        4              20     100
6                                        1              6      36
                                         Σf = 400    Σfx = 647    Σfx² = 1619
Mean = 647/400 = 1.62 (theoretical np = 10 × 0.15 = 1.5)
Standard deviation = √(1619/400 − 1.62²) = 1.19 (theoretical √[np(1−p)] = 1.13)
This equation is
y = (1/(σ√(2π))) e^(−(x−μ)²/2σ²)
where μ is the mean of the variable x
σ is the standard deviation of x
e is the well-known mathematical constant (= 2.718 approximately)
π is another well-known mathematical constant (= 3.142 approximately)
This equation can be used to derive various properties of the normal distribution.
A useful one is the relation between area under the curve and deviation from the
mean, but before looking at this we need to refer to a standardised variable.
Figure 4.1
Figure 4.2
The shaded area in figure 4.2 represents Prob(a < x ≤ b), the proportion of the
distribution taking values between a and b. This is equal to the probability that a single random
value of x will be bigger than a but less than b.
value of x will be bigger than a but less than b.
By standardising the variable and using the symmetry of the distribution,
table 3* can be used to find this probability as well as the unshaded areas in each
tail.
Figure 4.3
(a) u = 1.0
Area = 0.1587
(b) u = 2.0
Area in right tail = 0.02275
Thus shaded area = 1 - 0.02275 = 0.97725
(c) By symmetry the area to the left of u = −2 is the same as the area to the right
of u = +2.
Thus the shaded area = 0.02275
(d) Area above u = +0.5 is 0.3085
Area below u =-1.5 is 0.0668
Total unshaded area = 0.3753
:. shaded area = 0.6247
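The table 3* areas used above can be reproduced from the error function, since Φ(u) = ½[1 + erf(u/√2)]:

```python
from math import erf, sqrt

def phi(u):
    """Proportion of the standardised normal distribution below u."""
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

area_a = 1 - phi(1.0)          # (a) area above u = 1.0  -> 0.1587
area_b = phi(2.0)              # (b) area below u = 2.0  -> 0.97725
area_c = phi(-2.0)             # (c) area below u = -2.0 -> 0.02275
area_d = phi(0.5) - phi(-1.5)  # (d) area between -1.5 and 0.5 -> 0.6247
```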
2. Jam is packed in tins of nominal net weight 1 kg. The actual weight of jam
delivered to a tin by the filling machine is normally distributed about the set
weight with standard deviation of 12 g.
(a) If the set, or average, filling of jam is 1 kg what proportion of tins
contain
(i) less than 985 g?
(ii) more than 1030 g?
(iii) between 985 and 1030 g?
(b) If not more than one tin in 100 is to contain less than the advertised
net weight, what must be the minimum setting of the filling machine in order to
achieve this requirement?
Figure 4.4
(i) u = (985 − 1000)/12 = −1.25
Using table 3* and the symmetry of the curve, the required proportion is 0.1056
(ii) u = (1030 − 1000)/12 = 30/12 = 2.5
From table 3*, the proportion above this is 0.00621
(iii) The lower and upper tail areas have already been found in (i) and (ii) and
thus the solution is
1 − (0.1056 + 0.00621) = 1 − 0.1118 = 0.8882
(b) In this case, the area in the tail is fixed and in order to find the value of
the mean corresponding to this area, the cut-off point (1000 g) must be
expressed in terms of the number of standard deviations that it lies from the
mean.
Figure 4.5
From table 4* (or table 3* working from the body of the table outwards),
1% of a normal distribution is cut off beyond 2.33 standard deviations from the
mean.
The required minimum value for the mean is thus
1000 + 2.33 x 12 = 1028 g = 1.028 kg
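The jam-tin answers can be reproduced in the same way; the 1% point 2.33 is taken from table 4* as in the text:

```python
from math import erf, sqrt

# Jam-tin calculations: fill weight normal about the set weight, sigma = 12 g.
def phi(u):
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

sigma = 12.0
# (a) machine set at 1000 g
p_under_985 = phi((985 - 1000) / sigma)        # (i)   about 0.1056
p_over_1030 = 1 - phi((1030 - 1000) / sigma)   # (ii)  about 0.0062
p_between = 1 - p_under_985 - p_over_1030      # (iii) about 0.8882
# (b) minimum setting so at most 1 tin in 100 is below 1000 g;
# 2.33 is the 1% point of the normal distribution (table 4*)
min_setting = 1000 + 2.33 * sigma              # about 1028 g
```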
3. The data from problem 1, chapter 2 (page 46), can be used to show the
fitting of a normal distribution. The observed and fitted distributions are also
shown plotted on arithmetic probability paper.
The mean of the distribution was 0.087 min and the standard deviation
0.013 min. The method of finding the proportion falling in each class of a
normal distribution with these parameters is shown in table 4.1. The expected
class frequencies are found by multiplying each class proportion by the total
observed frequency. Notice that the total of the expected normal frequencies is
not 60. The reason is that about ¼% of the fitted distribution lies outside the
range (0.045 to 0.125) that has been considered.
Table 4.2 shows the observed and expected normal class frequencies in
cumulative form as a percentage of the total frequency. Figure 4.6 shows these
two sets of data superimposed on the same piece of normal (or arithmetic)
probability paper.
The dots in figure 4.6 represent the observed points and the crosses represent
the fitted normal frequencies. Note that the plot of the cumulative normal
percentage frequencies does not quite give a straight line. The reason for this
is that the 1/16% of the normal distribution having values less than 0.045 has not
been included. If this 1/16% were added to each of the cumulative percentages in
the right-hand column of table 4.2 then a straight-line plot would be obtained.
Class boundary    u        Area above boundary      Class          Proportion    Expected normal frequency    Observed frequency
                                                    0.035-0.045
0.045             -3.23    1 − 0.0006 = 0.9994
                                                    0.045-0.055    0.0063        0.4                          0
0.055             -2.46    1 − 0.0069 = 0.9931
                                                    0.055-0.065    0.0386        2.3                          5
0.065             -1.69    1 − 0.0455 = 0.9545
                                                    0.065-0.075    0.1333        8.0                          4
0.075             -0.92    1 − 0.1788 = 0.8212
                                                    0.075-0.085    0.2616        15.7                         14
0.085             -0.15    1 − 0.4404 = 0.5596
                                                    0.085-0.095    0.2920        17.5                         23
0.095             0.62     0.2676
                                                    0.095-0.105    0.1838        11.0                         9
0.105             1.38     0.0838
                                                    0.105-0.115    0.0680        4.1                          5
0.115             2.15     0.0158
                                                    0.115-0.125    0.0140        0.8                          0
0.125             2.92     0.0018
                                                    Totals         0.9976        59.8                         60
Table 4.1
Observed Fitted normal
Table 4.2
Figure 4.6
A further point to note is that the cumulative frequencies are plotted against
the upper class boundaries (not the mid-point of the class) since those are the
values below which lie the appropriate cumulative frequencies.
In addition, if the plotted points fall near enough on a straight line, which
implies approximate normality of the distribution, the mean and standard
deviation can be estimated graphically from the plot. To do this the best
straight line is drawn through the points (by eye is good enough). This straight
line will intersect the 16%,50% and 84% lines on the frequency scale at three
points on the scale of the variable.
The value of the variable corresponding to the 50% point gives an estimation
of the median, which is the same as the mean if the distribution being plotted
is approximately symmetrical.
The horizontal separation between the 84% and 16% intercepts is equal to
2σ for a straight-line (normal) plot and so half of this distance gives an estimate
of the standard deviation.
Applying this to the fitted normal points, the mean is estimated as 0.087
and the standard deviation comes out as 0.5 (0.100 − 0.074) = 0.013, the
figures used to derive the fitted frequencies in the first place. The small bias
referred to earlier caused by omitting the bottom 1/16% of the distribution in the
plot has had very little influence on the estimate in this case.
(a) more than twice the standard deviation above the mean?
(b) further than half the standard deviation below the mean?
(c) within one and a half standard deviations of the mean?
6. The data summarised in table 4.3 come from the analysis of 53 samples of
rock taken every few feet during a tin-mining operation. The original data for
each sample were obtained in terms of pounds of tin per ton of host rock but
since the distribution of such a measurement from point to point is quite skew,
the data were transformed by taking the ordinary logarithms of each sample
value and summarising the 53 numbers so obtained into the given frequency
distribution.
Fit a normal distribution to the data.
**7. The individual links used in making chains have a normal distribution of
strength with mean of 1000 kg and standard deviation of 50 kg.
If chains are made up of 20 randomly chosen links
(a) what is the probability that such a chain will fail to support a load of
900 kg?
Class        Frequency
0.6-0.799    1
0.8-0.999    3
1.0-1.199    6
1.2-1.399    8
1.4-1.599    12
1.6-1.799    11
1.8-1.999    6
2.0-2.199    4
2.2-2.399    2
Total        53
Table 4.3
(b) what should the minimum mean link strength be for 99.9% of all chains
to support a load of 900 kg?
(c) what is the median strength of a chain?
**8. The standardised normal variate, u, having mean of 0 and variance of 1, has
probability density function
φ(u) = (1/√(2π)) e^(−u²/2)
If this distribution is truncated at the point u_α (i.e. the shaded portion, α,
of the distribution above u_α is removed; see figure 4.7), obtain an expression in
terms of α and u_α showing the amount by which the mean of the truncated
distribution is displaced from u = 0.
Figure 4.7
Figure 4.8
(b) u = (40 − 56)/10 = −1.6
Thus, 0.0548 of the distribution takes values less than 40.
Figure 4.9
(c) For 65, u = (65 − 56)/10 = 0.9
Figure 4.10
Normal Distribution 93
(d) For 60, u = (60 − 56)/10 = 0.4
Figure 4.11
(e) For 52, u = (52 − 56)/10 = −0.4
Figure 4.12
Figure 4.13
(b) (i) For all normal distributions, 1% in the tail occurs at a point 2.33 standard
deviations from the mean. (See table 4* or use table 3* in reverse.)
Thus, 1% of all children will have an LQ. value greater than
99.3 + 2.33 x 13.4 = 99.3 + 31.2 = 130.5
Figure 4.16
(ii) Similarly, 0.1% of all children will have an I.Q. value greater than
99.3 + 3.09 x 13.4 = 99.3 + 41.5 = 140.8
(iii) Ten per cent of children will have I.Q. values less than the value which
90% exceed.
The u-value corresponding to this point is -1.28 and converting this into the
scale of I.Q. gives
99.3 -1.28 x 13.4 = 99.3 -17.2 = 82.1
Figure 4.18
(c) We need to find the lower and upper limits such that the shaded area is 95%
of the total. There are a number of ways of doing this, depending on how the
remaining 5% is split between the two tails of the distribution. It is usual to
divide them equally. On this basis, each tail will contain 0.025 of the total area
and here the required limits will be 1.96 standard deviations below and above
the mean respectively.
Thus, 95% of children will have I.Q. values between
99.3 - 1.96 x 13.4 and 99.3 + 1.96 x 13.4 i.e.
99.3 - 26.2 and 99.3 + 26.2
We have assumed that the original sample of 100 children was taken randomly
and representatively from the whole population of children about whom the
above probability statements have been made. This kind of assumption should
always be carefully checked for validity in practice.
In addition, the mean and standard deviation of the sample were used as
though they were the corresponding values for the population. In general, they
will not be numerically equal, even for samples as large as 100, and this will
introduce errors into the statements made. However, the answers will be of the
right order of magnitude which is mostly all that is reqUired in practice.
The assumption of normality of the population has already been mentioned.
4. (a) If the mean length is 1.45 m then the maximum deviation allowed for a
stocking to be acceptable is
+ 0.020
- 0.013
(b) This time the two shaded areas are each specified to be 0.025 (2!%).
Therefore the tolerance that can be worked to corresponds to u =± 1.96,
Le. to ± 1.96 x 0.013 = ± 0.025 m, or ±25 mm.
Figure 4.21
~~~ ? 1.45 ?
(c) The lower and upper lengths allowed are 1.425 m and 1.475 m respectively.
The shaded area gives the proportion of stockings that do not meet the standard
when the process mean length is 1.46 m.
5. (a.) For.
183 m, U =
1.83-1.73
0.064 156
.
Figure 4.23
AS: 1.73 1.83
Normal Distribution 97
Js:o
Required proportion = 0.00069
(c) Men shorter than 1.83 - 0.13 = 1.70 will have a clearance of at least
O.13m.
Correspon di ng U = 1.70-1.73
0.064
047
=- .
~64
Figure 4.25 1.70 1.73
(d) The frame height which is exceeded by one man in a thousand will be
3.09 standard deviations above the mean height of men, i.e. at
/13:4
1. 73 + 3.09 x 0.064 = 1.93 m
0.001
Figure 4.26 1.73 ?
For men,.
1 83 m correspon ds to u = 1.83 -1.73 1 56
0.064 =.
Proportion of men taller than 1.83 m = 0.0594
:. Expected proportion of people for whom 1.83 m is too low is
0.00069 x 0.95 + 0.0594 x 0.05 = 0.004, i.e
4 people in a 1000.
98 Statistics: Problems and Solutions
The problem can be extended by allowing some people to wear hats as well
as shoes with different heights of heel.
This problem was intended to give practice in using normal tables of area. Any
practical consideration of the setting of standard frame heights would need to
take account of the physiological and psychological needs of human door users,
of economics and of the requirements of the rest of the building system.
Coded
x [ variable (u) [u [u 2
0.6-0.799 1 -4 -4 16
0.8-0.999 3 -3 -9 27
1.0-1.199 6 -2 -12 24
1.2-1.399 8 -1 -8 8
1.4-1.599 12 0 -33 0
1.6-1.799 11 1 11 11
1.8-1.999 6 2 12 24
2.0-2.199 4 3 12 36
2.2-2.399 2 4 8 32
53 43 178
-33
10
Table 4.4
. .
Standard de."..on " 0.2 J 178- 53
(S3
10 )
2
Using these two values, the areas under the fitted normal curve falling in each
class are found using table 3* of the statistical tables. This operation is carried
out in table 4.5. Note that the symbol u in the table refers to the standardised
normal variate corresponding to the class boundary, whereas in table 4.4 it
represents the coded variable (formed for ease of computation) obtained by
Normal Distribution 99
subtracting 1.5 from each class midpoint and dividing the result by 0.2, the
class width.
Expected
Class Area in
Class u Area above u normal
boundaries each class
frequency
0.4-0.6
0.6 -2.58 1-0.0049 = 0.9951
0.6-0.8 0.0163 0.86
0.8 -2.03 1-0.0212 = 0.9788
0.8-1.0 0.0482 2.55
1.0 -1.48 1-0.0694 = 0.9306
l.0-l.2 0.1068 5.66
1.2 -0.93 1-0.1762 = 0.8238
1.2-l.4 0.1758 9.32
1.4 -0.38 1-0.3520 = 0.6480
l.4-1.6 0.2155 11.42
l.6 0.17 0.4325 10.43
l.6-1.8 0.1967
l.8 0.72 0.2358 7.09
1.8-2.0 0.1338
2.0 1.27 0.1020
2.0-2.2 0.0676 3.58
2.2 1.82 0.0344
2.2-2.4 0.0255 l.35
2.4 2.37 0.0089
2.4-2.6
Table 4.5
7. (a) Since a chain is as strong as its weakest link, the chain will fail to support
a load of 900 kg if one or more of its links is weaker than 900 kg.
The probability that a single link is weaker than 900 kg is given by the area
in the tail of the normal curve below
900-1000 .
u= 50 =-2, 1.e. 0.02275
:. The probability that a single link does not fail at 900 kg = 0.97725 and the
probability that none of the links fails = 0.97725 20 . Thus the probability that
a chain of 20 links will not support a load of 900 kg is
1-(0.97725)20 = 1-0.631 = 0.37
900 1000
Figure 4.27 Single link strength
Let p be the probability that an individual link is stronger than 900 kg.
Then we have that
p20 = 0.999
p = 0.99998 (using 5 figure logarithms)
Figure 4.29
0.0341
~
u =-1.82 1000
Single link strength
Figure 4.30
L o U a u--
Since the mean was previously at u = 0 (Le. when a = 0), the above
expression also represents the shift in mean.
cf>(u Ol ) is the ordinate (from table 5* of statistical tables) of the normal
distribution corresponding to u =u Oi •
The result just obtained can be used to solve the numerical part of the
problem.
The bottle contents are distributed normally but if the segregation process
operates perfectly (which it will not do in practice), the distribution of bottle
contents offered for sale will correspond to the unshaded part of figure 4.31.
991 1000
Figure 4.31 Bottle contents (ml)
Note: The change in mean is positive since the truncation occurs in the lower
tail instead of the upper tail.
The mean volume of bottle contents is therefore 1000 + 0041 = 100004 ml.
Appendix I-Experiment 10
Normal Distribution
Number of persons: 2 or 3.
Object
To give practice in fitting a normal distribution to an observed frequency
distribu tion.
Method
The frequency distribution of total score of three dice obtained by combining
all groups' results in table 2, experiment 1, shOUld be re-listed in tabie 26
(Table 4.6).
Analysis
1. In table 26, calculate the mean and standard deviation of the observed
frequency distribution.
2. Using table 27, fit a normal distribution, haVing the same mean and standard
deviation as the data, to the observed distribution.
3. Draw the observed and normal frequency histograms on page 46 and comment
on the agreement.
Notes
1. It is not implied in this experiment, that the distribution of the total score
of three dice should be normal in form.
2. The total score of three dice is a discrete variable, but the method of fitting
a normal distribution is exactly the same for this case as for a frequency
distribution of grouped values of a continuous variable.
Normal Distribution 103
u =x-xo
c
8.5-9.5 9
13.5-14.5 14
=
14.5-15.5 15
15.5-16.5 16
-
17.5-18.5 18 the sample is given by
= -/(variance)
I~:a~:r~s ~ ~
• •
SI
Tota~:r~s
-ve ~ ~~ =
Net Totals =
7. 5
"". ... ~-
6
6 .5 ..
9
9 .5
"' " .
10
10. 5 ~,
II
1 1. 5 m '"
12
12.5 ..ii' X
13
13 .5
14
14 .5 ,,;
15
15 .5 ~.
16
16.5 .,J,r>,.>
17 '" "
17 . 5 .'"
18
16.5 "",
'"
";,-
Table 4.7 (Table 27 of the laboratory manual)
Notes
1. u is the deviation from the mean, of the class boundary expressed as a
multiple of the standard deviation (with appropriate sign).
. _ class boundary - X
I.e. u - s
2. The area under the normal curve above each class boundary may be found
from the table of area under the normal curve at the end of the book.
The normal curve area or probability for each class is obtained by differencing
the cumulative probabilities in the previous column.
3. Other tables which cumulate the area under the normal curve in a different
way may be used, but some of the column headings will require modification
and the probabilities subtracted or summed as appropriate.
4. In order to obtain equality of expected and observed total frequencies, the
two extreme classes should be treated as open-ended, Le. with class boundaries
of - 0 0 and +00 instead of 2.5 and 18.5 respectively.
Normal Distribution 105
Appendix 2-Experiment 11
Normal Distribution
Number of persons: 2 or 3.
Object
To calculate the mean and standard deviation of a sample from a normal
population and to demonstrate the effect of random sampling fluctuations.
Method
From the red rod population M6/1 (Normally distributed with a mean of 6.0
and standard deviation of 0.2) take a random sample of 50 rods and measure
their lengths to the nearest tenth of a unit using the scale provided. The rods
should be selected one at a time and replaced after measurement, before the
next one is drawn.
Record the measurements in table 28.
Care should be taken to ensure good mixing in order that the sample is
random. The rod popUlation should be placed in a box and stirred-up well
during sampling.
Analysis
1. Summarise the observations into a frequency distribution using table 29.
2. Calculate the mean and standard deviation of the sample data using table 30.
3. Compare, in table 31, the sample estimates of mean and standard deviation
obtained by each group. Observe how the estimates vary about the actual
population parameters.
4. Summarise the observed frequencies of all groups in table 32. On page 51,
draw, to the same scale, the probability histograms for your own results and
for the combined results of all groups. Observe the shapes of the histograms
and comment.
1-10
11-20
21-30
31-40
41-50
Table 4.8 (Table 28 of the laboratory manual)
Summarise these observations into class intervals of width 0.1 unit with the
measured lengths at the mid points using the 'tally-mark' method and table 29.
106 Statistics: Problems and Solutions
Class
interval Class 'Tally-marks' Frequency
(units) mid point
5.35-5.45 5.4
5.45-5.55 5.5
5.55-5.65 5.6
5.65-5.75 5.7
5.75-5.85 5.8
5.85-5.95 5.9
5.95-6.05 6.0
6.05-6.15 6.1
6.15-6.25 6.2
6.25-6.35 6.3
6.35-6.45 6.4
6.45-6.55 6.5
6.55-6.65 6.6
Total frequency
u X-Xo
c
The values of u will be positive or negative integers.
The mean oX of the sample is
- 01 "E,fu
x =xo + . "E,f
=
Normal Distribution 107
Closs Mid
Interval, point ~reQuerlC) Closs
units )( f u fu fu 2
5·;30-5.45 5.4
5.45-5.55 5.5
5.55-5.65 5.6
5.65-5.75 5.7
5.75-5:85 5:8
5.85-5.95 5.9
5.95-6.05 6.0
6.05-6.15 6.1
6.15-6.25 6.2
6.25-6.35 6.3
6.35-6.45 6.4
6.45-6.55 6.5
6.55-6.65 6.6
I~:o~::ms ~ ~
~o::\~~ms ~ ~ R ~
Net totals
~ R
Table 4.10 (Table 30 of the laboratory manual)
=
=
The standard deviation s of the sample is given by
s' =y(variance)
=
=
108 Statistics: Problems and Solutions
Sample
Sample Standard
Group Mean
size deviation
1
2
3
4
5
6
8
Population parameters 6.00 0.2
5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6
1
2
7
8
Total
frequencies
(all groups)
19202122
Figure 5.1 Number of deaths
Table 5.3 gives a full comparison of the probabilities of fmding x defects in the
sample.
Table 5.3
Table 5.4
Relationship Between the Basic Distributions 113
Figure 5.2
~ 25 30.5
= 30.5 - 25 = 5.5 = 1 55
u 3.54 3.54 .
which from table 3* leads to a probability of 0.0606.
Note: Since a continuous distribution is being used to approximate to a
discrete distribution, the value 30.5 and not 30 must be used in calculating the
u value.
3. A machine produces screws 10% of which have defects. What is the probability
that, in a sample of 500
(a) more than 35 defects are found?
(b) between 30 and 35 (inclusive) defects are found?
The binomial law: assuming a sample of 500 from a batch of at least 5000.
The normal approximation can be used since p > 0.10, np = 50.
f.J. = np = 500 x 0.10 = 50
a =V(500 x 0.10 x 0.90) = V45 = 6.7
(a) u_35.5-50=_14.5=_2.16
6.7 6.7
Probability of more than 35 defects from tables* = 1 - 0.01539 = 0.9846
(b) Probability of between 30 and 35 defects, use limits 29.5 and 35.5.
Figure 5.3
30
L
00
J.L= 30
(J =y'30 = 5.48
CT = 5.48
=40.5-30= 192
u 5.48 .
Probability of exceeding 40, p(>40) = 0.0274 from statistical table 3*.
Number of hits j 0 2 3 4 5
Number of areas with j hits 229 211 93 35 7 I
Table 5.5
In statistical logic, as will be seen later, an essential step in testing in the logic
is the setting up of what is called the null hypothesis.
Here the null hypothesis is that the bombs are falling randomly or that there
is no ability to aim at targets of the order of! km 2 in area.
Then if the hypothesis is true, the probability of any given bomb falling in
anyone given area = m.
Probability of x hits in any area
p(x)
= (537)(_1
x 576
)X (575)S37-X
576
from the binomial law.
However, since the probability of success is very small and the number of
attempts is relatively large, the Poisson law can be used as an approximation to
the binomial thus greatly reducing the computation involved.
Thus, for the Poisson calculation
average number of successes m = np = 537 x stt; = 0.93
Number of hits j o I 2 3 4 5
Probability of j hits 0.395 0.367 0.170 0.053 0.012 0.002
Table 5.6
Table 5.7 shows the results obtained by comparing the actual frequency
distribution of number of hits per area with the Poisson expected frequencies if
the hypothesis is true.
Number of hits j 0 2 3 4 5
Actual number of areas with j hits 229 211 93 35 7
Expected number of areas with j hits (Poisson) 227 211 98 31 7
Table 5.7
Relationship Between the Basic Distributions 117
0 2 0.0273
1 6 0.0984 5
2 9 0.1771 9
3 II 0.2125 II
4 8 0.1912 10
5 6 0.1377 7
6 4 0.0826 4.5
7 3 0.0425 2
8 2 0.0191
9 0.0076 0.5
10 0 0.0040 0.2
52 1.0000 51.2
Table 5.9
sPs-s
118 Statistics: Problems and Solutions
However, here again the Poisson law gives an excellent approximation to the
binomial, reducing the computation considerably.
It should be noted that in most attribute quality control tables this Poisson
approximation is used.
Using m = 3.6, table 5.9 gives the comparison of the actual pattern of variation
with the Poisson.
Reference to the table indicates that the defects in the period of 52 shifts did
not show any 'abnormal' deviations from the expected number.
Thus, this comparison gives the basis for determining whether or not a
process is in control, the basic first step in any quality control investigation.
defective. Four resistors are selected at random and installed in a control panel.
What is the probability that no defective resistor is installed?
IT·6.3
u = 14.5-25 = -10.5 = -2 97
3.54 3.54 .
(7= 3.54
From tables*,
Probability of class of 50 having less than 15 boys = 0.0015
Compare this with the correct answer from binomial tables of 0.0013.
4. This by definition is the Poisson law. However, since m > 15, the normal
approximation can be used. Here J.1 = 30, a = y30 = 5.48
40.5-30
u = 5.48 1.92
u=5.48
6. Here, n = 24
Probability of dustcart's being broken down (P) = 0.20. This is the binomial
distribution. Here the normal distribution can be used as an approximation.
Mean p. =np=24 x 0.20 =4.8
Variance a 2 =np(l-p) =24 x 0.20 x 0.80 =3.84
Standard deviation = 1.96
u= 1.96
Table 3* gives the probability of three or less dustcarts being out of service
as 0.2546
Probability of more than three dustcarts being out of service
p(>3) = 1- 0.2546 =0.7454 or 74.5%
7. Here this is the hypergeometric distribution and since the sample size4 is
greater than 10% of population (20) no approximation can be made. Thus the
hypergeometric distribution must be used.
Probability of 0 defects
( 4)(16) 16!
o 4 12!4! 16 15 14 13
P(O) = (20) = 20! =20 x 19 x 18 x 17 = 0.3756
4 16! 4!
Number of persons: 2 or 3.
Object
To demonstrate that the Poisson law may be used as an approximation to the
binomial law for suitable values of n (sample size) and p (proportion of the
population having a given attribute), and that, for a given sample size n, the
approximation improves asp becomes smaller. (Note: for a given value ofp, the
approximation also improves as n incre"ases.)
Method
Using the binomial sampling box, take 100 samples of size 10, recording, in
table 21, the number of red balls in each sample. (proportion of red balls in the
population = 0.02.)
Relationship Between the Basic Distributions 123
Analysis
1. Summarise the data into a frequency distribution of number of red balls per
sample in table 22 and compare the experimental probability distribution with
the theoretical binomial (given) and Poisson probability distributions.
Draw both the theoretical Poisson (mean =0.2) and the experimental
probability histograms on figure 1 below table 22.
2. Using the data of experiment 7 and table 23, compare the observed
probability distribution with the binomial and Poisson (mean = 1.5) probability
distribu tions.
Also, draw both the theoretical Poisson (mean = 1.5) and the experimental
probability histograms on figure 2 below table 23.
Note: Use different colours for drawing the histograms in order that comparison
may be made more easily.
Distribution of linear
6 functions of variables
or the variance of the difference of two variates is the sum of their variances.
Note: It should be noted that while this theorem places no restraint on the
form of distribution of variates the following conditions are of prime importance:
(1) If variates x, y, Z, . .. are normally distributed then w is also normally
distributed.
(2) If variates x, y, Z are Poisson distributed then w is also distributed as
Poisson.
Examples
1. In fitting a shaft into a bore of a housing, the shafts have a mean diameter of
50 mm and standard deviation of 0.12 mm. The bores have a mean diameter of
51 mm and standard deviation of 0.25 mm. What is the clearance of the fit?
The mean clearance = 5 I - 50 = 1 mm
Variance of clearance = 0.122 + 0.25 2 = 0.0769
Standard deviation of clearance = YO.0769 = 0.277 mm
.. 38mm
..
Figure 6.1
3. (a) The time taken to prepare a certain type of component before assembly
is normally distributed with mean 4 min and standard deviation of 0.5 min. The
time taken for its subsequent assembly to another component is independent of
preparation time and again normally distributed with mean 9 min and standard
deviation of 1.0 min.
126 Statistics: Problems and Solutions
What is the distribution of total preparation and assembly time and what
proportion of assemblies will take longer than 15 min to prepare and assemble?
Let w'= total preparation and assembly time for rth unit.
w=4+9= 13 min
o~ = 12 X 0.5 2 + 12 X 1.02 = 1.25
or standard deviation ofw, Ow =v'1.25 = 1.12 min
Figure 6.2
where
x, = preparation time
Y r = assembly time
w= (3 x 4) + 9 = 21 min
o~ = 32 X 0.5 2 + 12 X 12 = 3.25
Standard deviation of w = 1.8 min.
(c) To further clarify the use of constants, consider now example 3(a). Here
the unit has to be sent back through the preparation phase twice before passing
on to. assembly.
Distribution of Linear Functions of Variables 127
Assuming that the individual preparation times are independent, what is the
distribution of the total operation time now?
Here
or
Standard deviation
aw = 1.32
6.2.2 Distribution of Sum of n Variates
The sum of n equally distributed variates has a distribution whose average and
variance are equal to n times the average and variance of the individual variates.
This follows direct from the general theorem in section 6.2.1.
Let
x = y = z . .. and a = b = c ... = 1
then
w=x+x+ ... +x=n.x
a~ =(12 X a~) + (12 X a~) + ... + (12 X a~) =na~
Example
Five resistors from a population whose mean resistance is 2.6 kU and standard
deviation is 0.1 kU are connected in series. What is the mean and standard
deviation of such random assemblies?
Average resistance = 5 x 2.6 = 13 kU
Variance of assembly = 5 x 0.12 = 0.05
Standard deviation = 0.225 kU
(a)
2 3 4 5 6
"):l~
3 4 5 6 7 8 9 10 II 1213 14 15 1617 18
Figure 6.3. Probability distribution (a) the score of 1 die (b) the score of 3 dice.
Distribution of Linear Functions of Variables 129
sampling distribution of means gets closer to normality and similarly the closer
the original distribution to normal the quicker the approach to true normal form.
However the rapidity of the approach is shown in figure 6.3 which shows the
distribution of the total score of three 6-sided dice thrown 50 times. This is
equivalent to sampling three times from a rectangular population and it will
be seen that the distribution of the sum of the variates has already gone a long
way towards normality.
This theorem is most used for testing the difference between two populations,
but this is left until chapter 7.
Example
A firm calculates each period the total value of sales orders received in £.p. The
average value of an order received is approximately £400, and the average number
of orders per period is 100.
What likely maximum error in estimation will be made if in totalling the
orders, they are rounded off to the nearest pound?
Assuming that each fraction of £1 is equally likely (the problem can,
however, be solved without this restriction) the probability distribution of the
error on each order is rectangular as in figure 6.4, showing that each rounding
off error is equally likely.
Consider the total error involved in adding up 100 orders each rounded off.
Statistically this is equivalent to finding the sum of a sample of 100 taken
from the distribution in figure 6.4.
130 Statistics: Problems and Solutions
-50p o +50p
Figure 6.4. Probability distribution of error/order.
From theorem 3 the distribution of this sum will be normal and its mean and
variance are given below.
Average error =0
Variance of sum = 10002 where 02 variance of the distribution of individual
errors
In each case, what is the probability that the plane can take off if 40 kg of
baggage is carried?
3. Two spacer pieces are placed on a bolt to take up some of the slack before
a spring washer and nut are added. The bolt (b) is pushed through a plate (p)
and then two spacers (s) added, as in figure 6.5.
p ___ I-Plate
Bolt (b)
-
.. ..
r---~--~--~--.---
I I
Clearance
Figure 6.5
Given the following data on the production of the components
plate: mean thickness 12 mm, standard deviation of thickness 0.05 mm,
normal distribution
bolt: mean length 25 m, standard deviation of length 0.025 mm, normal
distribution
spacer: mean thickness 3 mm, standard deviation of thickness 0.05 mm,
normal distribution
what is the probability of the clearance being less than 7.2 mm?
4. In a machine fitting caps to bottles, the force (torque) applied is distributed
normally with mean 8 units and standard deviation 1.2 units. The breaking
strength of the caps has a normal distribution with mean 12 units and standard
deviation 1.6 units. What percentage of caps are likely to break on being fitted?
5. Four rods of nominal length 25 mm are placed end to end. If the standard
deviation of each rod is 0.05 mm and they are normally distributed, find the
99% tolerance of the assembled rods.
6. The heights of the men in a certain country have a mean of 1.65 m and
standard deviation of 76 mm.
(a) What proportion will be 1.80 m or over?
132 Statistics: Problems and Solutions
(b) How likely is it that a sample of 100 men will have a mean height as
great as 1.68 m. If the sample does have a mean of 1.68 m, to what extent does
it confirm or discredit the initial statement?
7. A bar is assembled in two parts, one 66 mm ± 0.3 mm and the other
44 mm ± 0.3 mm. These are the 99% tolerances. Assuming normal distribu tions,
find the 99% tolerance of the assembled bar.
8. Plugs are to be machined to go into circular holes of mean diameter 35 mm
and standard deviation of 0.010 mm. The standard deviation of plug diameter is
0.075 mm.
The clearance (difference between diameters) of the fit is required to be at
least 0.05 mm. If plugs and holes are assembled randomly:
(a) Show that, for 95% of assemblies to satisfy the minimum clearance
condition, the mean plug diameter must be 34.74 mm.
(b) Find the mean plug diameter such that 60% of assemblies will have the
required clearance.
In each case find the percentage of plugs that would fit too loosely (clearance
greater than 0.375 mm).
9. Tests show that the individual maximum temperature that a certain type of
capacitor can stand is distributed normally with mean of 130°C and standard
deviation of 3°C. These capacitors are incorporated into units (one capacitor per
unit), each unit being subjected to a maximum temperature which is distributed
normally with a mean of 118°C and standard deviation of 5°C.
What percentage of units will fail due to capacitor failure?
10. It is known that the area covered by 5 litres of a certain type of paint is
normally distributed with a mean of 88 m 2 and a standard deviation of 3 m 2 . An
area of 3500 m 2 is to be painted and the painters are supplied with 40 5-litre tins
of paint. Assuming that they do not adjust their application of paint according
to the area still to be painted, find the probability that they will not have
sufficient paint to complete the job.
11. A salesman has to make 15 calls a day. Including journey time, his time
spent per customer is 30 min on average with a standard deviation of 6 min.
(a) If his working day is of 8 h, what is the chance that he will have to work
overtime on any given day?
(b) In any 5-day week, between what limits is his 'free' time likely to be?
12. A van driver is allowed to work for a maximum of 10 h per day. His
journey time per delivery is 30 min on averag~ with a standard deviation of 8
min.
In order to ensure that he has only a small chance (1 in 1000) of exceeding
the 10 h maximum, how many deliverties should he be scheduled for each day?
Distribution of Linear Functions of Variables 133
,,00.010
0.05 0.05
....~::;f
0.475 0.475+1.645 x 0.01 1.9 1.9+1.645xO.01 4
Individual packets Weight of 4 packs
If individual packages are packed four at a time, the distribution of total net
weight and the probability requirements are shown in figure 6.7.
The mean weight of 4 packages must be
1.9 + 1.645 x 0.OlY4 = 1.9 + 0.033 = 1.933 kg
Thus the process setting must be 1.933/4 = 0.483 kg
The long run proportional saving of butter per nominal !-kg package is
2. (a) The weight of four adult passengers will be normally distributed with
mean of 4 x 75(= 300) kg and standard deviation of y4 x 15(=30) kg. The
shaded area in figure 6.9 gives the probability that the plane is within its
maximum payload.
The standardised normal variate,
= 350 - 300 = 50 = 1 67
u 30 30·
75
Child weight Adult weight
Figure 6.10
u
=35030
- 340 =.!Q =0 33
30·
The probability of safe take-off now becomes 1 - 0.3707 = 0.63
(b) For four adults and one child the weight distribution is shown in
figure 6.11.
As before,
_ 350 - 323 =.1:L= 0 88
u 30.8 30.8 .
Figure 6.13
'£:90 7 7.2
4. A cap will break if the applied force is greater than its breaking strength.
The mean excess of breaking strength is 12 - 8 = 4 units while the standard
deviation of the excess of breaking strength is v'(1.6 2 + 1.22) = v'4.00 = 2.0.
When the excess of cap strength is less than zero the cap will break and the
proportion of caps doing so will be equal to the shaded area of figure 6.14, i.e
the area below
0-4
u =-2- =-2 or 0.0228,
o 4
Figure 6.14 Excess of breaking strength
5. The distribution of the total length of four rods will be normal with a mean
of 4 x 25 = 100 mm and standard deviation of v'4 x 0.05 = 0.10 mm.
Ninety-nine per cent of all assemblies of four rods will have their overall
length within the range
100 ± 2.58 x 0.10 mm i.e. 100 ± 0.26 mm
6. (a) Assuming that heights can be measured to very small fractions of a
136 Statistics: Problems and Solutions
metre, the required answer is equal to the shaded area in figure 6.15.
= 1.80-1.65 = 1 97
u 0.076 .
and area = 0.0244
CT = 0.076
= 1.68 - 1.65 = 3 95
u 0.0076 .
The shaded area is about 0.00004.
Figure 6.16
ill 1.651.68
Mean of 100 heights
Possible alternative conclusions are that this particular sample is a very unusual
one or that the assumed mean height of 1.65 m is wrong (being an underestimate)
or that the standard deviation is actually higher than the assumed value of
76 mm.
7. The standard deviation of each component part is 0.3/2.58. The standard
deviation of an assembly of each part will be 0.3Y2/2.58 about a mean of
Distribution of Linear Functions of Variables 137
66 + 44 = 11 0 mm.
Ninety-nine per cent of assemblies wiIllie within 110 ± 0.3V2 mm, i.e. within
110 ± 0.42 mm.
= 0.375 - 0.256 =0 95
u 0.125 .
0.05 0.375
Figure 6.18 Clearance
= 0.375 - 0.082 = 2 34
u 0.125 .
138 Statistics: Problems and Solutions
118 Y 130 x
Max. applied temperature Capacitor max. temperature
Figure 6.19 Figure 6.20
<Jx _y =V34
3500 3520
Figure 6.22 Area covered by 40 x 5 litres of paint
Distribution of Linear Functions of Variables 139
=3500-3520=_105
u 19.0 .
giving an answer of about 14.7%.
11. (a) The distribution of time spent on a total of 15 calls will be approximately
normal (by the central limit theorem) with a mean of 15 x 30(=450) min and a
standard deviation of V15 x 6(=23.3) min.
The probability that 15 calls take longer than 8 h is represented by the shaded
area in figure 6.23.
480 min (8 h) corresponds to
= 480 - 450 = 1 29
u 6,.115 .
The required probability is 0.0985.
(b) There may be differing interpretations about what is meant by 'free' time
in a week. 'Free' time for the salesman occurs on days when he works less than
8 h. The total of such time is found for five consecutive days, no account being
taken of any 'overtime' that has to be worked. The solution of such a problem is
quite difficult.
In this case, we shall consider 'free' time as the net amount by which his
actual working time is less than his scheduled working time.
/l\ =8Vn
LLkOI
ITn
30 30n 600
Time per delivery Time for n deliveries
Figure 6.25 Figure 6.26
In order that there is only 1 chance in 1000 that n journeys take longer than
10 h (600 min), n must be such that
30n + 3.09 x 8yn .;;; 600
Theilargest value of n that satisfies the inequality can be found by systematic
trial and error. However, a more general approach is to solve the equality as a
quadratic in yn, taking the integral part of the admissible solution as the number
of deliveries to be scheduled.
Thus
30n + 24.72Yn - 600 = 0
inadmissible since the average total journey time would be 12 h, violating the
probability condition.
The number of deliveries to be scheduled is therefore 16.
If 16 deliveries were scheduled, the probability of exceeding 10 h would
actually be less than O.OOI-in fact about 1 in 10 000.
Appendix I-Experiment 12
Object
To demonstrate that the distribution of the means of samples of size n, taken
from a rectangular population, with standard deviation a tends towards the
normal with standard deviation a/Yn.
Method
From the green rod population M6/3 (rectangularly distributed with mean of
6.0 standard deviation of 0.258), take 50 random samples each of size 4,
replacing the rods after each sample and mixing them, before drawing the next
sample of 4 rods.
Measure the lengths of the rods in the sample and record them in table 33.
Analysis
1. Calculate, to 3 places of decimals, the means of the 50 samples and summarise
them into a grouped frequency distribution using table 34.
2. Also in table 34, calculate the mean and standard deviation of the sample
means and record these estimates along with those of other groups in table 35.
Observe how they vary amongst themselves around the theoretically
expected values.
3. In table 36, summarise the frequencies obtained by all groups and draw, on
page 57, the frequency histogram for the combined results. Observe the shape
of the histogram.
142 Statistics: Problems and Solutions
r Sample no. 1 2 3 4 5 6 7 8 9 10
I Total
r Average
r Sample no. 11 12 13 14 15 16 17 18 19 20
r Total
I Average
r Sample no. 21 22 23 24 25 26 27 28 29 30
r Total
I Average
I Sample no. 31 32 33 34 35 36 37 38 39 40
I Total
I Average
r Sample no. 41 42 43 44 45 46 47 48 49 50
I Total
I Average
6.050-6.100 6.075 I
6.350-6.400 6.375 5
-
Totals of +ve terms
~
Total of -ve terms
~~ ~
Net totals
x = 6.000 + 0.075 ~; = =
s' =0.075 [
~fu2 :~~~)2l
~J 'J
=
=
* Strictly the class intervals should read 5.5875-5.6625 and the next 5.6625-5.7375 etc.
but the present tabulation makes summarising simpler.
144 Statistics: Problems and Solutions
7.1 Syllabus
Point and interval estimates; hypothesis testing; risks in sampling; tests for
means and proportions; sample sizes; practical significance; exact and approximate
tests.
p = (n1 p1 + n2 p2)/(n1 + n2)

III  Variables: the 100(1 − α)% confidence interval is
     x̄ − u_α/2 (σ/√n) ≤ μ ≤ x̄ + u_α/2 (σ/√n)
     Proportions: the 100(1 − α)% confidence interval is approximately
     p − u_α/2 √[p(1 − p)/n] ≤ π ≤ p + u_α/2 √[p(1 − p)/n]

IV   Variables: the 100(1 − α)% confidence interval is
     (x̄1 − x̄2) − u_α/2 √(σ1²/n1 + σ2²/n2) ≤ μ1 − μ2 ≤ (x̄1 − x̄2) + u_α/2 √(σ1²/n1 + σ2²/n2)
     Proportions: the 100(1 − α)% confidence interval is approximately
     (p1 − p2) − u_α/2 √[p1(1 − p1)/n1 + p2(1 − p2)/n2] ≤ π1 − π2 ≤ (p1 − p2) + u_α/2 √[p1(1 − p1)/n1 + p2(1 − p2)/n2]

Table 7.1
Estimation and Significance Testing (I) 149
Variables     Population mean μ (μ1 and μ2 for problems II and IV)
              Population standard deviation σ (σ1 and σ2 for problems II and
              IV), assumed known
              Sample mean x̄ (x̄1 and x̄2 for II and IV)
Proportions   Population proportion π (π1 and π2 for problems II and IV)
              Sample proportion p (p1 and p2 for problems II and IV)
u is the standardised normal variate. In the case of variables, σ (and σ1 and σ2
as appropriate) is assumed to be known or calculated from a sample of size n
larger than about 30.
normality for large n (and preferably with π neither small nor large) - see chapter 5.
allowance must be made for sampling fluctuations; this is done by using the
standard error of the sample mean to determine a confidence interval for the
population mean.
For 95% confidence, the interval (conventionally symmetric in tail
probability) is
x̄ − 1.96σ/√n and x̄ + 1.96σ/√n, i.e.
53 - 1.96 x 1.5 and 53 + 1.96 x 1.5
53 - 2.94 and 53 + 2.94
50.06 and 55.94 say 50.1 and 55.9
Notice that the interval does not include the previously assumed mean of 50.0.
In this respect, the two procedures (hypothesis testing and interval estimation)
are equivalent since the test hypothesis will be rejected at the 5% level of
significance if the observed sample mean is more than 1.96 standard errors on
either side of the assumed mean, and if this is the case the 95% confidence
interval cannot include the assumed mean. This argument applies in the two-
sided case for any significance level a and associated confidence probability
(I-a).
Also note that, in this example, the standard deviation was known, so the
test and the confidence interval estimation were perfectly valid for any size of
sample.
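The interval arithmetic above is easy to check by computation. A minimal sketch, assuming σ = 6 and n = 16 so that the standard error σ/√n is the 1.5 used in the example (the function name is ours):

```python
import math

def mean_ci(xbar, sigma, n, u=1.96):
    """Two-sided confidence interval for a population mean, sigma known."""
    se = sigma / math.sqrt(n)   # standard error of the sample mean
    return xbar - u * se, xbar + u * se

# Assumed sigma = 6 and n = 16, giving the example's standard error of 1.5
lo, hi = mean_ci(53.0, 6.0, 16)
print(round(lo, 2), round(hi, 2))   # 50.06 55.94
```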
2. A synthesis of pre-determined motion times gives a nominal time of 0.10
min for the operation of piecing-up on a ring frame, compiled after analysis of
the standard method. 160 observed readings of the piecing-up element had an
average of 0.103 min and standard deviation of 0.009 min. Is the observed
time really different from the nominal time?
Here, the population σ is not known but an estimate based on 160 (random)
readings will be satisfactory.
Set up H0: real mean element time μ0 = 0.100 min; H1: real mean element
time μ1 ≠ 0.100 min

u = (x̄ − μ0)/(σ/√n) = (0.103 − 0.100)/(0.009/√160) = 4.22
This is significant at the 1 in 1000 level (|u| > 3.09), the actual type I error being
less than 6 parts in 100 000 (table 3*).
Ninety-nine per cent confidence limits for the real mean piecing-up time
under the conditions applying during the sampling of the 160 readings are
0.103 ± 2.58 × 0.009/√160 = 0.103 ± 0.0018, i.e. 0.1012 to 0.1048 min
Thus, the evidence suggests that the synthesis of the mean operation time
tends to underestimate the actual time by something between 1% and 5%.
Whether this is of any practical importance depends on what use is going to be
made of the synthetic time. Perhaps the method of synthesising the time may be
worth review in order to bring it into line with reality.
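The u statistic and its two-tailed tail probability can be computed directly rather than read from tables. An illustrative sketch using the example's figures (the helper names are ours):

```python
import math

def u_statistic(xbar, mu0, sigma, n):
    """u = (xbar - mu0)/(sigma/sqrt(n))."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

def two_tail_p(u):
    """Two-tailed standard normal tail probability for an observed u."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(u) / math.sqrt(2))))

u = u_statistic(0.103, 0.100, 0.009, 160)
print(round(u, 2))             # 4.22
print(two_tail_p(u) < 6e-5)    # True: under 6 parts in 100 000, as quoted
```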
3. In special trials carried out on two furnaces each given a different basic mix,
furnace A in 200 trials gave an average output time of 7.10 h while 100 trials
with furnace B gave an average output time of 7.15 h.
Given that, from previous records, the variance of furnace A is 0.09 h² and
that of B is 0.07 h², and an assurance that these variances did not change during the
trials, is furnace A more efficient than B?
First of all, set up the test hypothesis that there is no difference in furnace
efficiencies (i.e. average output times). The test is two-sided since there is no
reason to suppose that if one is more efficient then it is known which one it will
be.
Set up
H0: μA = μB, i.e. μA − μB = 0
H1: μA − μB ≠ 0

u = [(x̄A − x̄B) − (μA − μB)]/√(σA²/nA + σB²/nB)

which becomes, on substituting the observed data and the test assumption
regarding (μA − μB),

u = (7.10 − 7.15)/√(0.09/200 + 0.07/100) = −1.47
both are responsible for differing mean output times would require a properly
designed experiment (this experiment is not designed to answer the question
posed).
In addition, it was assumed that the variances of the output times would be
unchanged during the special trials. This may often be a questionable assumption
and is unnecessary in this example since the sample variances of the 200 and
100 trials respectively could be substituted for ai and a~ with very little effect
on the significance test.
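The furnace comparison can be reproduced as a short computation (a sketch; the function name is ours):

```python
import math

def two_sample_u(x1, x2, var1, var2, n1, n2):
    """u for the difference of two sample means, population variances known."""
    return (x1 - x2) / math.sqrt(var1 / n1 + var2 / n2)

u = two_sample_u(7.10, 7.15, 0.09, 0.07, 200, 100)
print(round(u, 2))   # -1.47, well inside +/-1.96: no evidence of a difference
```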
This is almost significant at the 1% level and suggests that the mean yield for the
whole population of farms is greater in the second year.
As a word of warning, such a conclusion may not really be valid since the two
samples may not cover in the same way the whole range of climatic conditions,
soil fertility, farming methods, etc. The significant result may be due as much
to the samples' being non-representative as to a real change in mean yield for the
whole population. The extent of each would be impossible to determine without
proper design of the survey. There are many methods of overcoming this, one
of which would be to choose a representative sample of farms and use the same
farms in both years.
5. A further test of the types illustrated in examples 3 and 4 can be made
when the population variances are unknown but there is a strong a priori
suggestion that they are equal. In this case, the two sample variances can be
pooled to make the test more efficient, i.e. to reduce {3 for given a and total
sample size (nl + n2).
A group of boys and girls were given an intelligence test by a personnel
officer. The mean scores, standard deviations and the numbers of each sex
are given in table 7.2. Were the boys significantly more intelligent than the
girls?
Boys Girls
Table 7.2
The question as stated is trivial. If the test really does measure that which is
termed 'intelligence', then on average that group of boys was more intelligent
than that group of girls, although as a group they were more variable than the
girls.
However, if the boys are a random sample from some defined population of
boys and similarly for the girls, then any difference in average intelligence
between the populations may be tested for.
Assuming that there is a valid reason for saying that the two populations have
the same variances, the two sample variances can be pooled by taking a weighted
average using the sample sizes as weights (strictly the degrees of freedom-see
chapter 8-are used as weights but this depends on whether the degrees of
freedom were used in calculating the quoted standard deviations; in any case,
since the sample sizes here are large the error introduced will be negligible).
Pooled estimate of variance of individual scores
Thus there is no evidence that the populations of boys and girls differ in average
intelligence. This conclusion does not mean that there is not a difference,
.merely that if there is one we have not sufficient evidence to demonstrate it,
and even if we did, it may be so small as to be of no practical importance at all.
Confidence limits for the difference between two population means can be
set in the same way as in examples (1) and (2) above.
Thus 95% confidence limits for the difference in mean intelligence are given
by
= 3 ± 1.96 √(1123600 × 122)
Since this is not at all a low probability result, there is no evidence that the
new method is any more or less effective than the previous one.
This is not numerically large enough to reject the test hypothesis-the type I
error would correspond to just under 16%.
A slight improvement can be made in the adequacy of the normal approximation
by making the so-called correction for continuity. However, with the large
sample sizes generally required for use of the normality condition, this refinement
will not usually be worth incorporating. It is given here as an example.
36 or fewer germinating seeds can be considered as 36.5 or fewer on a continuous
scale. 36.5 corresponds to 0.73 as a proportion of 50 and the corrected value for
u becomes
u = (0.73 − 0.80)/√(0.8 × 0.2/50) = −0.07√50/0.4 = −1.24
The type I error corresponding to such a value of u (two tails) is about 21.5%,
too high an error rate for most people to contemplate accepting.
Both this example and example (6) could have been done using the number
of occurrences rather than the proportion of occurrences in a sample. The
approaches are identical but for setting confidence limits, the proportion method
is better.
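The continuity correction is easy to automate on the count scale: the count is moved half a unit towards the null mean before standardising. A sketch (the function name is ours):

```python
import math

def proportion_u(x, n, pi0, continuity=True):
    """Normal-approximation test of a count x out of n against proportion pi0."""
    mean = n * pi0
    sd = math.sqrt(n * pi0 * (1 - pi0))
    if continuity:
        x = x + 0.5 if x < mean else x - 0.5   # shift half a unit towards the mean
    return (x - mean) / sd

print(round(proportion_u(36, 50, 0.80), 2))                     # -1.24
print(round(proportion_u(36, 50, 0.80, continuity=False), 2))   # -1.41
```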
Standardising the number of 'successes' x in n trials gives

u = (x − nπ)/√[nπ(1 − π)]
The exact test can be carried out for this example since the appropriate
Set up the null hypothesis that both germination rates are the same, i.e.
An approximately normal test statistic can be set up (see summary table 7.1) as

u ≈ [(pA − pB) − (πA − πB)]/√[πA(1 − πA)/nA + πB(1 − πB)/nB]

Under the null hypothesis, πA = πB = some value π, say, and the test statistic
becomes

u ≈ (pA − pB)/√[π(1 − π)(1/nA + 1/nB)]

The actual value of π, however, is unknown and it is usual to replace it by its
pooled sample estimate, p, obtained as a weighted average of the two sample
proportions pA and pB, the sample sizes being the weights.
Thus

p = (nA pA + nB pB)/(nA + nB)

σ/√n = 6/√16 = 1.5

Figure 7.1 (acceptance region 46.13 to 53.87 about the hypothesised mean 50.0,
with probability 0.005 in each tail; the alternative mean 51 is also marked)
region being given by the shaded area in its two tails. The boundaries of the
acceptance region for a 1% significance level are at
50.0 ± 2.58 × 1.5, i.e. 46.13 and 53.87
The tail areas corresponding to these values are 0.0268 and 0.0006
approximately.
The type II error is therefore equal to
1- (0.0268 + 0.0006) = 0.9726
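The type II error can be computed from the normal integral instead of tables. A sketch (the helper names are ours; exact normal values give about 0.972 against the table-rounded 0.9726):

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def type_II_error(lo, hi, true_mean, se):
    """P(sample mean falls in the acceptance region | the true population mean)."""
    return Phi((hi - true_mean) / se) - Phi((lo - true_mean) / se)

beta = type_II_error(46.13, 53.87, 51.0, 1.5)
print(round(beta, 3))   # about 0.972
```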
(ii) The solution to this part is the same as that for part (i) except that the
actual distribution of the sample mean will be centred around 53.0.
The values of u corresponding to the limits of the acceptance region are
(46.13 − 53.0)/1.5 = −4.58 and (53.87 − 53.0)/1.5 = 0.58
Figure 7.2 (distributions of the sample mean centred on 48.0 and on 50.0, with
0.005 in each tail of the acceptance region and a type II error of 0.10)
The dotted distribution shows how the means of samples of size n will be
distributed when the population average is actually 48.0 units. The extreme
part of the right-hand tail of this distribution will lie above the upper acceptance
boundary, but it will be such a minute proportion in this case as to be negligible.
The following equations can be set up.
would have been determined by the practical aspects of the problem. However,
if the actual population mean were less than 48.0 (or bigger than 52.0), the
probability of committing a type II error with a sample size of 134 would be less
than 10%; and if the population mean were actually between 48.0 and 50.0, this
probability would be greater than 10%.
10. What is the smallest random sample of seeds necessary for it to be asserted,
with a probability of at least 0.95, that the observed sample germination
proportion deviates from the population germination rate by less than 0.03?
The standard error of a sample proportion is √[π(1 − π)/n], where π is the
population proportion and n the sample size. Assuming that n will be large
enough for the approximation of normality to apply reasonably well to the
distribution of p, the problem requires that

1.96 √[π(1 − π)/n] ≤ 0.03, i.e. n = (1.96/0.03)² π(1 − π)

π, the quantity to be estimated, is unknown (if it were known, there would be no
need to estimate it) and this creates a slight difficulty in determining n. However,
π(1 − π) takes its maximum value of ¼ when π = ½, and

n = (1.96/0.03)² × ¼ ≈ 1067
would certainly satisfy the conditions of the problem (whatever the value of π).
Alternatively, if an estimate is available of the likely value of π, this can be
used instead of π as an approximation. Such an estimate may come from previous
experience of the population or perhaps from a pilot random sample; the pilot
sample estimate can be used to determine the total size necessary. If the pilot
sample is at least as big as this, no further sampling is needed. If it was not, the
extra number of observations required can be found approximately. If such
extra sampling is not possible for some reason (too costly, not enough time),
the confidence probabilities of types I and II errors will be modified (adversely).
For this example, if the seed population germination rate is usually about
80%, then the required value of sample size for at most a deviation of 0.03
(i.e. 3%) with probability 0.95 is
n = (1.96/0.03)² × 0.8 × 0.2 ≈ 683
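Both sample-size calculations can be sketched as a small function, rounding up to the next whole observation (the function name is ours):

```python
import math

def sample_size(d, pi, u=1.96):
    """Smallest n with u*sqrt(pi*(1-pi)/n) <= d (normal approximation)."""
    return math.ceil((u / d) ** 2 * pi * (1 - pi))

print(sample_size(0.03, 0.5))   # 1068: the worst case, pi completely unknown
print(sample_size(0.03, 0.8))   # 683: using a prior estimate of about 80%
```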
2. In a dice game, if an odd number appears you pay your opponent 1p and if
an even number turns up, you receive 1p from him. If, after 200 throws, you
are losing 50p and the dice are your opponent's, would you be justified in
feeling cheated?
3. A company, to determine the utilisation of one of its machines, makes
random spot checks to find out for what proportion of time the machine is in
use. It is found to be in use during (a) 49 out of 100 checks, and (b) 280 out of
600 checks.
Find out in each case, the percentage time the machine is in use, stating the
confidence limits.
How many random spot checks would have to be made to be able to estimate
the machine utilisation to within ±2%?
4. In a straight election contest between two candidates, a survey poll of 2000
gave 1100 supporting candidate A. Assuming sample opinion to represent
performance at the election, will candidate A be elected?
5. In connection with its marketing policy, a firm plans a market research
survey in a country area and another survey in a town. A random sample of the
people living in the areas is interviewed and one question they are asked is
whether or not they use a product of the firm concerned. The results of this
question are:
Town: Sample size = 2000, no. of users = 180
Country: Sample size = 2000, no. of users = 200
Does this result show that the firm's product is used more in the country than in
town?
6. In a factory, sub-assemblies are supplied by two sub-contractors. Over a
period, a random sample of 200 from supplier A was 5% defective, while a
sample of 300 from supplier B was 3% defective.
Does this signify that supplier B is better than supplier A?
A further sample of 400 items from B contained eight defective sub-assemblies.
What is the position now?
7. If men's heights are normally distributed with mean of 1.73 m and standard
deviation of 0.076 m and women's heights are normally distributed with mean
of 1.65 m and standard deviation of 0.064 m, and if, in a random sample of 100
married couples, 0.05 m was the average value of the difference between
husband's height and wife's height, is the choice of partner in marriage influenced
by consideration of height?
8. For the data of problem (3)(page 46), chapter 2, estimate 99% confidence
limits for the mean time interval between customer arrivals. Also find the
number of observations necessary to estimate the mean time to within 0.2 min.
9. An investigation of the relative merits of two kinds of electric battery showed
that a random sample of 100 batteries of brand A had an average lifetime of
24.2 h, with a standard deviation of 1.8 h, while a random sample of 80 batteries
of brand B had an average lifetime of 24.5 h with a standard deviation of 1.5 h.
Use a significance level of 0.01 to test whether the observed difference between
the two average lifetimes is significant.
10. Two chemists, A and B, each perform independent repeat analyses on a
homogeneous mixture to estimate the percentage of a given constituent.
The repeatability of measurement has a standard deviation of 0.1 % and is
the same for each analyst. Four determinations by A have a mean of 28.4% and
five readings by B have a mean of 28.2%.
(a) Is there a systematic difference between the analysts?
(b) If each analyst carries out the same number of observations as the other,
what should this number be in order to detect a systematic difference between
the analysts of 0.3% with probability of at least 99%, the level of significance
being 1%?
consistent with a setting of 1.00 kg, although a type II error could be committed
in deciding not to re-set the process.
(b) Confidence limits for the actual current process average are, for two levels
of confidence
2. Losing 50p in 200 throws means that there must have been 125 odd numbers
(losing results) and 75 even numbers (winners) in 200 throws. Set up the null
hypothesis that the dice is unbiased.
H0: π = 0.5 (π = the probability of an odd number)
H1: π ≠ 0.5
The total number of odd numbers will be binomially distributed and since
n = 200 and π = ½ we know that

u ≈ (x − nπ)/√[nπ(1 − π)] = (124.5 − 100)/√(200 × ½ × ½) = 24.5/7.07 = 3.46

making the correction for continuity.
The probability of such a deviation is certainly less than 0.0007 and it therefore
seems likely that the dice is biased towards odd numbers.
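The dice calculation, with the continuity correction, can be reproduced directly:

```python
import math

# Continuity-corrected normal test of 125 odd numbers in 200 throws
n, x, pi = 200, 125, 0.5
u = (x - 0.5 - n * pi) / math.sqrt(n * pi * (1 - pi))
print(round(u, 2))   # 3.46, well beyond 3.09: the dice look biased
```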
3. 95% confidence limits for the proportional utilisation of the machine are
approximately

p ± 1.96 √[p(1 − p)/n]

which gives
(a) 49/100 ± 1.96 √(49 × 51/100³) = 0.490 ± 0.098, i.e. 39.2 to 58.8%
(b) 280/600 ± 1.96 √(280 × 320/600³) = 0.467 ± 0.040, i.e. 42.7 to 50.7%
Also, since π is near to 0.5, for 95% confidence estimation, the required number
of spot checks is given by
n = (1.96/0.02)² × ¼ = 2401
4. 99% confidence limits for the population proportional support for candidate
A are
1100/2000 ± 2.58 √(1100 × 900/2000³) = 0.55 ± 0.0287, i.e. 0.521 to 0.579
Thus

u = [(0.05 − 0.03) − 0]/√[(19/500)(481/500)(1/200 + 1/300)] = 1.145
There is no evidence of a difference between the suppliers.
With the additional evidence, assuming that the underlying conditions remain
unchanged, the test may be carried out again
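Both stages of the supplier comparison can be sketched with one pooled-proportion function (the name is ours); the second call combines B's samples, 9 + 8 = 17 defectives in 300 + 400 = 700 items:

```python
import math

def two_proportion_u(x1, n1, x2, n2):
    """u test for two proportions using the pooled estimate of the common pi."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)   # pooled proportion, sample sizes as weights
    return (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

u = two_proportion_u(10, 200, 9, 300)    # A: 5% of 200; B: 3% of 300
u2 = two_proportion_u(10, 200, 17, 700)  # with B's further sample included
print(round(u, 2), round(u2, 2))         # 1.15 1.88, both short of 1.96
```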
Figure 7.3 Average height excess for 100 couples (standard error
√0.00987/√100 = 0.0099 m; observed mean difference 0.05 m, expected 0.08 m)
The excess of the man's height over the woman's height will be a normal
variable with mean of (1.73 − 1.65) m and variance of (0.076² + 0.064²) m².
The average difference (excess) of height taken over a random sample of 100
such married couples will be normally distributed (i.e. from one sample of 100
to another) with mean of 1.73 − 1.65 = 0.08 m and variance of
(0.076² + 0.064²)/100, and hence a standard error of √0.00987/√100 = 0.0099 m.
The observed average difference was 0.05 m corresponding to a u value of
u = (0.05 − 0.08)/0.0099 = −3.03

(Figure: distributions of the time between successive customers and of the
average of 56 time intervals, with 0.005 in each tail)
99% confidence limits for the mean time between arrivals are
u = [(x̄A − x̄B) − 0]/√(σA²/nA + σB²/nB)

the denominator being the standard error of the difference of two sample
means based on samples of size nA and nB respectively.
Thus, substituting the sample variances for the population variances,

u = [(24.2 − 24.5) − 0]/√(1.8²/100 + 1.5²/80) = −0.3/√0.0605 = −0.3/0.246 = −1.22
Since this value is not numerically larger than 2.58, there is no evidence of a
difference in mean lifetimes between A and B.
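The battery arithmetic can be checked in one line:

```python
import math

# Battery lifetimes: sample variances substituted for the population variances
u = (24.2 - 24.5) / math.sqrt(1.8 ** 2 / 100 + 1.5 ** 2 / 80)
print(round(u, 2))   # -1.22, far short of +/-2.58: not significant at 0.01
```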
10. (a) Assume there is no systematic difference between the analysts, i.e. the
means of an infinitely large number of analyses of the same material would be
equal for A and B.
Under such a null hypothesis we may use the test statistic

u = (28.4 − 28.2)/√[0.1²(¼ + ⅕)] = 0.2/0.067 = 2.98
This is significant at the 1% level (i.e. |u| > 2.58) and we can conclude (with
only a small type I error) that there is a systematic difference between the
analysts, A giving a higher result than B on average. Thus at least one of them,
and possibly both, gives a biased estimate of the actual percentage composition.
99% confidence limits for the extent of this systematic difference are given
by
(28.4 − 28.2) ± 2.58 √[0.1²(¼ + ⅕)] = 0.2 ± 0.173, i.e. 0.027 to 0.373%
(b) Figure 7.6 shows the requirements of this problem.
(7.4)
Figure 7.6 (sampling distributions of the difference, centred on 0 and on 0.3)
Note: In writing down equations (7.3) and (7.4), the minute part of the left-
hand tail of the dotted distribution falling in the lower part of the critical
region has been ignored.
Putting nA = nB = n gives the required number of readings by each analyst as

n = [(2.58 + 2.33)²/0.3²] × 2 × 0.1² = (4.91² × 2)/9 = 5.35
Thus each analyst should do six tests, the probability of detecting a systematic
difference of 0.3% between them (if it exists) being greater than the required
minimum of 99%.
In fact the required minimum power would still be achieved if one analyst
took six tests and the other five in order to reduce the total cost or effort
involved.
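The sample-size formula can be sketched as follows (the function name is ours):

```python
import math

def n_per_group(delta, sigma, u_alpha=2.58, u_beta=2.33):
    """Observations per analyst to detect a difference delta at the stated risks."""
    return (u_alpha + u_beta) ** 2 * 2 * sigma ** 2 / delta ** 2

n = n_per_group(0.3, 0.1)
print(round(n, 2), math.ceil(n))   # about 5.36, so six readings each
```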
Testing the Hypothesis that the Mean of a Normal Population has a Specific
Value μ0 - Population Variance Known
Here, providing the population variance is known (and therefore the sample
estimate of variance is not used), then the 'u' test is appropriate whatever the
sample size.
Thus

u = (x̄ − μ0)/(σ/√n)

is calculated and the significance level is determined.
Example
In an intelligence test on ten pupils the following scores were obtained: 105,
120, 90, 85, 130, 110, 120, 115, 125, 100.
Given that the average score for the class before the special tuition for the
test was 105 with standard deviation 8.0, has the special tuition improved the
performance?
Here since the standard deviation is given and if the assumption is made that
tuition method does not change this variation, then the u test is applicable.
Null hypothesis-tuition has made no improvement
Average score in test x̄ = 110
Here a one-tailed test can be used if it is again assumed that tuition could not
have worsened the performance.
Thus
"£xl _ ("£Xj)2
S2 = ~(xi - x) _ n (8.3)
n-l n-l
The null hypothesis is set up that the sample has come from a normal population
with mean Ilo.
W. S. Gosset under the nom de plume of 'Student' examined the following
statistic
t = (x̄ − μ0)/(s/√n)   (8.4)
and showed that it is not distributed normally but in a form which depends on
the degrees of freedom (v) if the null hypothesis is true. Table 7* sets out the
various percentage points of the 't' distribution for a range of degrees of freedom.
Obviously t tends to the statistic u in the limit where ν → ∞, i.e. t is
approximately normally distributed for large degrees of freedom ν. Reference
to the table* shows that, as most textbooks assert, where the degrees of freedom
exceed 30, the normal approximation can be used or the 't' test can be replaced
by the simpler 'u' test.
Note: For a two-tailed test note that a 5% significance level requires
α = 0.025 in table 7* and for a 1% significance level, α = 0.005 (α is a proportion,
not a percentage). See section 8.2.6.
Sampling Theory and Significance Testing (II) 173
Again, one-tailed tests are only used when a priori logic clearly shows that
the alternative population mean must be on one side of the hypothesised value μ0.
See section 8.2.7.
Testing the Hypothesis that the Means of Two Normal Populations are μx and μy
Respectively - Variances Equal but Unknown
Note: The assumption must hold that the variances of the two populations
are the same (i.e. σx² = σy²) since we are going to pool two sample variances, and
this only makes sense if they are both estimates of the same thing: a common
population variance. If σx² does not equal σy², then the statistic given below is not
distributed like t.
The two sample variances sx² and sy² are pooled to give a best estimate of the
common population variance.
s² = [(nx − 1)sx² + (ny − 1)sy²]/(nx + ny − 2)   (8.5)

t = [(x̄ − ȳ) − (μx − μy)]/[s√(1/nx + 1/ny)]   (8.6)

with (nx + ny − 2) degrees of freedom. The usual test hypothesis is that the
populations have equal means; under this assumption (μx − μy) = 0 and the
test statistic reduces to

t = (x̄ − ȳ)/[s√(1/nx + 1/ny)]   (8.7)
Then

s² = Σ(di − d̄)²/(n − 1), where d̄ = Σdi/n

and

t = (d̄ − μ0)/(s/√n)   (8.8)

The test hypothesis is usually of the null type where there is assumed to be no
difference on average in the paired readings, i.e. μ0 = 0. In this case the test
statistic t is given by

t = d̄/(s/√n)   (8.9)
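Equations (8.8) and (8.9) can be sketched as a small function; the three differences 5, 8, 2 used here for illustration are the typist figures tabulated in a later solution:

```python
import math

def paired_t(diffs, mu0=0.0):
    """Paired t statistic, equations (8.8)/(8.9), from a list of differences."""
    n = len(diffs)
    dbar = sum(diffs) / n
    s2 = sum((d - dbar) ** 2 for d in diffs) / (n - 1)
    return (dbar - mu0) / math.sqrt(s2 / n), n - 1

t, df = paired_t([5, 8, 2])
print(round(t, 2), df)   # 2.89 2
```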
Confidence Limits for Population Mean
Where the degrees of freedom are less than about 30, the confidence limits for
the population mean μ are
x̄ ± t(s/√n)
This is similar to the large-sample case except that t is used instead of u.
Then

F = sx²/sy²   where sx² > sy²   (8.10)
If F is greater than F0.025 (see table 9*) for (nx − 1) degrees of freedom of the
numerator and (ny − 1) degrees of freedom for the denominator, then the
difference is significant at the 5% level (α = 0.05). For F to be significant at the 1%
level, use F0.005 (actually F0.01 will have to be used, giving a 2% significance level
of F).
the suffix n denoting the number of degrees of freedom. Obviously the larger n
is, the larger χ², and the percentage points of the sampling distribution of χ² are
given in table 8*.
For example, where n = 1 the numerical value of the standardised normal
deviate u exceeds 1.96 with 5% probability and 2.58 with 1% probability (i.e.
with half the probabilities in each tail). Consequently χ² with one degree of
freedom has 5% and 1% points of 1.96² and 2.58², or 3.841 and 6.635.
However, for higher degrees of freedom the distribution of χ² is much more
difficult to calculate, but it is fully tabulated in table 8*.
χ² = Σᵢ₌₁ⁿ (Oi − Ei)²/Ei
                         Factor 1
Factor 2        1       2       3     …     a       Row totals
1               O11     O21     O31   …     Oa1     Σi Oi1
2               O12     O22                 Oa2     Σi Oi2
3               O13                         Oa3     Σi Oi3
…
Column totals   Σj O1j  Σj O2j        …     Σj Oaj  Σi Σj Oij
Table 8.1
Σi Σj Oij = total frequency
These tables are generally used to test the hypothesis that the factors are
independent.
If this hypothesis is true then the expected cell frequency is

Eij = (Σi Oij × Σj Oij)/(Σi Σj Oij)

i.e. (row total × column total)/(total frequency).
x̄ = Σx/n = 2.6
Estimated variance of population
s² = [Σx² − (Σx)²/n]/(n − 1), giving s = 1.8
t = (2.6 − 1)/(s/√n) = 2.67
                    Specialised       Traditional
                    training (A)      method (B)       Total
Above average       O: 32  E: 24.1    O: 12  E: 19.9    44
Below average       O: 14  E: 19.7    O: 22  E: 16.3    36
Insufficient data   O: 6   E: 8.2     O: 9   E: 6.8     15
Total               52                43                95
Table 8.3
Result is significant at 1% level.
There is evidence that the training methods differ in their long-term
efficiency.
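The χ² value behind this conclusion can be reproduced from table 8.3. A sketch: exact arithmetic gives about 10.7, significant since χ²0.01 with 2 degrees of freedom is 9.21:

```python
# Chi-squared test of independence for table 8.3
observed = [[32, 12], [14, 22], [6, 9]]
row_totals = [sum(row) for row in observed]          # 44, 36, 15
col_totals = [sum(col) for col in zip(*observed)]    # 52, 43
total = sum(row_totals)                              # 95

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total    # expected frequency Eij
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))   # about 10.71, with (3-1)(2-1) = 2 degrees of freedom
```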
Process 1 Process 2
Sample size 50 60
Mean 10.2 11.1
Standard deviation 2.7 2.1
Table 8.4
Referring to table 9*
Degrees of freedom of greater estimate ν1 = 49, read as 24 (safer than ∞)
= (357 + 951)/148 = 8.83

F = 8.83/4.41 = 2.00
From table 9*
Degrees of freedom of greater estimate = 149, read as ∞
Degrees of freedom of lesser estimate = 59, read as 60
5% significance level = 1.48
0.2% significance level = 1.89
Difference is highly significant or there is strong evidence that process variations
are different.
4. For example (1), page 69, in chapter 3, for goals scored per soccer match,
test whether this distribution agrees with the Poisson law.
Null hypothesis: the distribution agrees with the Poisson law.
Table 8.5
In table 8.5 the last three class intervals must be grouped to give each class
interval an expected value greater than 5. Also, the first two.
Degrees of freedom = 6-1-1 = 4, since the totals are made the same and the
Poisson distribution is fitted with same mean as the actual distribution.
Referring to table 8*
χ²0.05 = 9.488   χ²0.01 = 13.277
Thus, there is no evidence from the data for rejecting the hypothesis, or the
pattern of variation shows no evidence of not having arisen randomly.
5. In a mixed sixth form the marks of eight boys and eight girls in a subject
were
Boys: 25, 30, 42, 44, 59, 73, 82, 85; boys' average = 55
Girls: 32, 36, 40, 41, 46, 47, 54, 72; girls' average = 46
Do these figures support the theory that boys are better than girls in this
subject?
Null hypothesis-that boys and girls are equally good at the subject.
From the sample of boys x̄1 = 55, s1² = 540.57
From the sample of girls x̄2 = 46, s2² = 156.86
Applying the F test to test that the population variances are not different gives
F = 540.57/156.86 = 3.45 (not significant)
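Pooling the two variances and completing the t test can be sketched from the sample figures above:

```python
import math

# Pooled t test for the boys' and girls' marks (8 of each)
n1 = n2 = 8
s2 = ((n1 - 1) * 540.57 + (n2 - 1) * 156.86) / (n1 + n2 - 2)   # pooled variance
t = (55 - 46) / math.sqrt(s2 * (1 / n1 + 1 / n2))
print(round(t, 2))   # 0.96 with 14 degrees of freedom: not significant
```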
Previous machine standard deviation before conversion = 2.8 mm. For the new
process, calculated from sample of 20, standard deviation = 1.7 mm.
What is the significance of this test?
It can be assumed that the process change could not have increased the
variation of the process.
Null hypothesis-that no change has occurred in process variation. Thus, a
one-tailed test can be used
F = 2.8²/1.7² = 2.71, ν2 = 19 (use 18)
Therefore, the result is highly significant and the change can be assumed to have
reduced the process variation.
Average Number of
Department labour force leavers/year
A 60 15
B 184 16
C 162 15
D 56 12
E 30 4
F 166 25
G 182 25
H 204 18
Table 8.6
3. Table 8.7 gives the data obtained on process times of two types of machine.
Is machine A more variable than machine B?
Machine A Machine B
Table 8.7
6. The number of cars per hour passing an intersection, counted from 11 p.m.
to midnight over nine days, was 7, 10, 5, 1, 0, 6, 11, 4, 9.
Does this represent an increase over the previous average per hour of three
cars?
8. A coin is tossed 200 times and heads appear only 83 times. Is the coin
biased?
those of the six areas before the special campaign. The data are given in table 8.8.
Has the new campaign had any effect on the sales?
Table 8.8
Typist      Difference x     x²
1            5               25
2            8               64
3            2                4
            Σx = 15         Σx² = 93
                Average labour   Number of        Expected number    Contribution
Dept.           force/year       leavers/year     of leavers/year    to χ²
                                 (Oi)             (Ei)
A                60              15                7.5               7.5
B               184              16               23.0               2.1
C               162              15               20.0               1.2
D + E       56 + 30 = 86     12 + 4 = 16    7.0 + 3.8 = 10.8         2.5
F               166              25               20.8               0.9
G               182              25               22.8               0.2
H               204              18               25.5               2.2
Table 8.10
Since in department E, the expected number of leavers is less than five, it has
to be grouped with another department. It is logical to group it with a similar
department or one whose effect would be expected to be similar on number of
leavers. Here, haVing no other a priori logic, since there is little difference between
the observed and expected frequency for department E, it has little effect. Here
it is combined with department D, the next smallest.
Average turnover rate = (130/1044) × 100% = 12.5% per year
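The whole χ² calculation of table 8.10 can be sketched in a few lines; exact arithmetic gives about 16.9 (the rounded contributions printed in the table sum to 16.6):

```python
# Chi-squared test for uniform turnover, departments D and E grouped
force = [60, 184, 162, 86, 166, 182, 204]    # D + E combined into 86
leavers = [15, 16, 15, 16, 25, 25, 18]

rate = sum(leavers) / sum(force)             # 130/1044, about 12.5% per year
chi2 = sum((o - n * rate) ** 2 / (n * rate) for o, n in zip(leavers, force))
print(round(100 * rate, 1), round(chi2, 1))  # 12.5 16.9
```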
Dept.   Average labour force/year   Number of leavers/year   Expected number of leavers/year   Contribution to χ²
Table 8.11
4. Null hypothesis- the change to the process has not affected the time.
Let Xl = time of new process
X2 = time of old process
then, mean of new process x̄1 = 35 s
Variance of new process estimated from sample
s1² = Σ(xi − x̄1)²/(n − 1) = 16.44
Mean of old process x̄2 = 37 s
Variance of old process estimated from sample
s2² = 42.67
In order to apply the 't' test, the variances of the two populations must be
the same.
Using the 'F' test to test that the population variances are the same gives
F = 42.67/16.44 = 2.60
Standard deviation
s = 3.82

t = (5.9 − 3)/(3.82/√9) = 2.28
with 8 degrees of freedom.
From table 7*
t o.os /2 = 2.306
Thus the result is not quite significant at the 5% level. On the present data no
real increase in mean traffic flow is shown.
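The traffic-count t test can be reproduced from the raw data; exact arithmetic gives t of about 2.27 (the text's 2.28 comes from the rounded mean 5.9):

```python
import math

cars = [7, 10, 5, 1, 0, 6, 11, 4, 9]   # cars per hour over nine days
n = len(cars)
mean = sum(cars) / n
s = math.sqrt(sum((x - mean) ** 2 for x in cars) / (n - 1))
t = (mean - 3) / (s / math.sqrt(n))
print(round(mean, 1), round(s, 2), round(t, 2))   # 5.9 3.82 2.27
```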
7. Sample mean
x̄ = Σxi/n = 0.136 min
Estimate of population variance gives s = 0.0146 min, so that

0.136 − 2.11(0.0146/√18) < μ < 0.136 + 2.11(0.0146/√18)
8. This problem will be solved using two alternative methods-the 'u' test and
the X2 test.
Figure 8.1 (distribution of the number of heads in 200 tosses, centred on 100,
with the observed value 83.5 marked)
            Heads   Tails
Observed O   83     117
Expected E  100     100
Table 8.12
     Heads   Tails
O    83.5    116.5
E   100      100
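Both versions of the coin test can be sketched together, with the continuity correction applied in each:

```python
import math

# 83 heads in 200 tosses, tested two ways
n, heads = 200, 83
u = (heads + 0.5 - n * 0.5) / math.sqrt(n * 0.25)           # u test on the count
chi2 = (83.5 - 100) ** 2 / 100 + (116.5 - 100) ** 2 / 100   # chi-squared, 1 df
print(round(u, 2), round(chi2, 3))   # -2.33 5.445
```

Note that χ² = u², and 5.445 exceeds the 5% point 3.841 but not the 1% point 6.635.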
9. Here, from a priori knowledge, it can be stated that the new campaign
can only increase the sales rate, so a one-tailed test can be used for extra
'power' in the test.
Again the paired 't' test is applicable. Null hypothesis-new campaign has
not increased the sales.
Average +200
Table 8.14
Code the data as
x' = x/100
Thus x̄' = 2,
s'² = Σ(x' − x̄')²/(n − 1) = 24.4, s' = 4.93
t = (2 − 0)/(4.93/√6) = 0.99
with 5 degrees of freedom
t0.05 = 2.015 (for a one-tailed test)
or result is not significant; there is no evidence of an increase in sales rate.
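The paired 't' calculation can be sketched from the coded summary figures (mean difference 2, s = 4.93, n = 6); the individual sales figures are not needed:

```python
import math

# Coded summary figures from the solution above.
d_bar, s, n = 2.0, 4.93, 6

# Paired t statistic against a null hypothesis of zero mean difference.
t = (d_bar - 0) / (s / math.sqrt(n))
significant = t > 2.015  # one-tailed 5% critical value for 5 d.f.
print(round(t, 2), significant)
```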
Red rod population    Yellow rod population
Table 8.15
The population parameters are given in table 8.15. These parameters are
chosen so that for the first part of the experiment with sample sizes n = 10,
approximately half the groups will establish a significant difference between the
populations while the other half will show no significant difference at the 5%
probability level. Since each group summarises the results of all the groups, this
experiment brings out much more clearly than any lecture could do, the
concept of significance.
In the second part of the experiment where each sample size is increased to
30, the probability is such that all groups generally establish (95% probability)
a significant difference. The experiment demonstrates that there is a connection
between the two types of error inherent in hypothesis testing by sampling and
the amount of sampling carried out. To complete this experiment, including the
full analysis, takes approximately 40 min.
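The experiment can be imitated numerically. In this sketch the population means and standard deviations are assumed values chosen for illustration (the actual table 8.15 parameters are not reproduced here); the point is only that samples of 30 detect the difference far more often than samples of 10:

```python
import math
import random

# Assumed population parameters, for illustration only.
random.seed(1)
MU_RED, MU_YELLOW, SIGMA = 6.0, 6.2, 0.25
T_CRIT = {18: 2.101, 58: 2.002}  # two-tailed 5% points of t (table 7*)

def significant(n):
    """Draw one sample of size n from each population and apply the 't' test."""
    red = [random.gauss(MU_RED, SIGMA) for _ in range(n)]
    yellow = [random.gauss(MU_YELLOW, SIGMA) for _ in range(n)]
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    # Pooled variance, valid here since both populations share SIGMA.
    s2 = ((n - 1) * var(red) + (n - 1) * var(yellow)) / (2 * n - 2)
    t = abs(sum(yellow) / n - sum(red) / n) / math.sqrt(s2 * 2 / n)
    return t > T_CRIT[2 * n - 2]

# Repeat the experiment for many 'groups' and compare detection rates.
hits10 = sum(significant(10) for _ in range(1000))
hits30 = sum(significant(30) for _ in range(1000))
print(hits10, hits30)  # larger samples detect the difference far more often
```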
Appendix 1
Object
To test whether the means of two normal populations are significantly different
and to demonstrate the effect of sample size on the result of the test.
Method
Take a random sample of size 10 from each of the two populations (red and
yellow rods) and record the lengths in table 1. Return the rods to the
appropriate population.
Also take a random sample of size 30 (a few rods at a time) from each of the
two populations (red and yellow rods) and record the lengths in table 2.
Analysis
(1) Code the data, as indicated, in tables 1 and 2.
(2) Calculate the observed value of't' for the two samples of size 10 and again
for the samples of size 30.
(3) Summarise your results with those of other groups in tables 3 and 4.
Observe whether a significant difference is obtained more often with the
samples of size 30 than with the smaller samples.
Notes
The 't' test used is only valid provided the variances of the two populations
are equal. This requirement is, in fact, satisfied in the present experiment.
Table 1
Yellow population: rod lengths x and coded data x′, x′²

In order to reduce the subsequent arithmetic, and to keep all numbers positive,
the coded values, x′, are used in the calculation. The coded data can be obtained
by subtracting from all readings the smallest observed rod length in the sample.
The coded values, y′, may be obtained in a similar way for the sample of red rods.

If a is the length of the shortest yellow rod in the sample, the mean, x̄, of the
sample is

x̄ = a + Σx′/10 = 5.8 + 3.7/10 = 6.17

s_y² = [Σx′² − (Σx′)²/10]/9 = (1.97 − 1.37)/9 = 0.0667

The pooled estimate of variance, s², is

s² = [Σx′² − (Σx′)²/10 + Σy′² − (Σy′)²/10]/18 = 0.0512
The value of |t| which must be exceeded for the observed difference to be significant
at the 5% level = 2.101
Table 4
Summary Table - samples of size 30
9.1 Syllabus
Assumption for use of regression theory; least squares; standard errors;
confidence limits; prediction limits; correlation coefficient and its meaning in
regression analysis; transformations to give linear regression.
a priori knowledge or assumptions that this relationship will hold for other
values of x.
standard error of b,
ε_b = √{ [Σ(yᵢ − ŷᵢ)²/(n − 2)] / Σ(xᵢ − x̄)² }
where ŷᵢ = estimate from regression line.
The significance of a and b can, therefore, be tested by the 't' test (see
chapter 8) or, alternatively, as shown in some textbooks, by an 'F' test, (see for
example Weatherburn, A First Course in Mathematical Statistics, C. U.P., pages
193 and 224, example 8).
The value of
t = (b − 0)/ε_b
given by the data can be referred to table 7* of the Statistical Tables and if it is
significantly large, judged usually on a two-sided basis, there is thus evidence of a
linear relationship between y and x.
100(1 − α)% confidence limits for the precision of estimation of the regression
line are then given by ŷᵢ ± t_{α/2,ν} ε_ŷᵢ for given xᵢ, where ν = n − 2.
Note: The confidence limits are closest together at the average value, x̄, of
the independent variable.
The observed correlation coefficient can be tested for significant departure from
zero but, as in the case of the regression coefficient, b, a significant value does
not necessarily imply any causal relationship between x and y.
The residual variance about the regression line, defined in section 9.2.4 as the
sum of the squared deviations of each observed value of y from its estimated value
using the fitted regression equation, this sum being divided by its degrees of
freedom (n − 2), is related to the correlation coefficient and to the total variance
of y.
Thus
s² = Σ(yᵢ − ŷᵢ)²/(n − 2) = (1 − r²) Σ(yᵢ − ȳ)²/(n − 2) = s_y²(1 − r²)(n − 1)/(n − 2)
−1 ≤ r ≤ +1
b = r × s_y/s_x
Figure 9.1. Scatter diagrams for (a) r = +1.0, (b) r = +0.5, (c) r = 0.
9.2.8 Transformations
In some problems the relationship between the variables, when plotted or from
a priori knowledge, is found not to be linear. In many of these cases it is possible
to transform the variables to make use of linear regression theory.
For example, in his book Statistical Theory with Engineering Applications
(Wiley), Hald discusses the problem of the relationship between the tensile
strength of cement (y) and its curing time (x).
From a priori knowledge a relationship of the form y =Ae-B/X is to be
expected.
The simple logarithmic transformation therefore gives
log₁₀ y = log₁₀ A − (B/x) log₁₀ e
or the logarithm of the tensile strength is a linear function of the reciprocal
value of the curing time and the theory of linear regression can then be applied.
Note: The requirement that the variance of y is constant for all x must, of
course, hold in the transformation and this must be checked. Usually a visual
check is adequate.
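The transformation can be sketched as follows. The data here are artificial, generated from an assumed curve y = A e^(−B/x) with A = 50 and B = 2 (not Hald's cement data); fitting a straight line to log₁₀ y against 1/x recovers the constants:

```python
import math

# Artificial data lying exactly on y = A * exp(-B/x), for illustration.
A_TRUE, B_TRUE = 50.0, 2.0
xs = [1, 2, 4, 7, 14, 28]
ys = [A_TRUE * math.exp(-B_TRUE / x) for x in xs]

# Transform: log10(y) = log10(A) - (B log10(e)) * (1/x), linear in u = 1/x.
us = [1 / x for x in xs]
vs = [math.log10(y) for y in ys]

n = len(xs)
u_bar, v_bar = sum(us) / n, sum(vs) / n
b = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs)) / \
    sum((u - u_bar) ** 2 for u in us)
a = v_bar - b * u_bar

# Back-transform the fitted straight-line coefficients.
A_hat = 10 ** a
B_hat = -b / math.log10(math.e)
print(round(A_hat, 1), round(B_hat, 1))  # recovers 50.0 and 2.0
```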
Student                       1    2    3    4    5    6    7    8    9   10
Numeracy test score (pts)   200  175  385  300  350  125  440  315  275  230
Final exam performance (%)   55   45   71   61   62   50   74   67   65   52
Table 9.1
What is the best relationship between test score and final performance?
Before any analysis is started the scatter diagram must be plotted to test the
assumption that the relationship is linear. This diagram shows no evidence of
non-linearity (figure 9.2).
Figure 9.2. Regression line with 95% confidence limits and prediction limits.
Total variance of x
s_x² = [Σx² − (Σx)²/n]/(n − 1) = [868325 − (2795)²/10]/9 = 87122/9 = 9680.3
Linear Regression Theory 205
Total variance of y
s_y² = [37050 − (602)²/10]/9 = 809.6/9 = 90.0
Correlation Coefficient
r = [175960 − (2795 × 602)/10] / √(87122 × 810) = 7701/8401 = 0.92
Regression Coefficients
a = ȳ = 602/10 = 60.2
b = r × s_y/s_x = (7701/8401) × √(90/9680) = 0.088
The approximate residual variance is
s² ≈ s_y²(1 − r²) = 90.0 [1 − (7701/8401)²] = 14.4
However, in addition, in this example, n (= 10) is not very large and the
approximation used will lead to an underestimate of the actual residual variance.
Using the exact expression gives
s² = s_y²(1 − r²)(n − 1)/(n − 2) = 14.4 × 9/8 = 16.2
and this will be used in the remaining calculations since, without this correction,
the bias of the estimator is −1/9 (11%) of the true value.
The residual standard deviation, s, is √16.2 = 4.02
ε_a = √(s²/n) = √(16.2/10) = 1.27
ε_b = √[s²/Σ(x − x̄)²] = √(16.2/87122) = 0.0136
Significance of b
Assuming E[b] = β = 0, the observed value of t is
t = (0.088 − 0)/0.0136 = 6.47
For any given value of xⱼ, the confidence limits for the regression estimate
(i.e. of the mean value of y for that value of x) are found as
ŷⱼ ± t_{α/2,(n−2)} ε_ŷⱼ
For 95% limits, the appropriate value of t (table 7*) is 2.306; table 9.2 shows
the derivation of the actual limits for a range of values of x.
The scatter diagram (drawn before any computations were carried out, in
order to check that the basic regression assumptions were not obviously violated),
the fitted regression line and 95% confidence limits are shown in figure 9.2.
From figure 9.2 or table 9.2, there is 95% confidence that the average final
examination percentage for all candidates who score 330 points in their initial
numeracy test will lie between 61.3% and 67.9%.
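The slope and correlation coefficient quoted above can be checked directly from the table 9.1 data:

```python
import math

# Table 9.1 data: numeracy test score (x) and final exam percentage (y).
xs = [200, 175, 385, 300, 350, 125, 440, 315, 275, 230]
ys = [55, 45, 71, 61, 62, 50, 74, 67, 65, 52]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx                   # regression coefficient
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
print(round(b, 3), round(r, 2))
```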
Table 9.2
These limits are calculated in table 9.3 and are also drawn in figure 9.2.
Table 9.3
From the figures in table 9.2, it can be expected, for example, that 95% of
candidates scoring 330 points in their numeracy test will achieve a final
examination mark between 55% and 74% inclusive, 5% of candidates gaining
marks outside this range.
Note: Such predictions are only likely to be at all valid if the sampled data
used to calculate the regression relation are representative of the same population
of students (and examination standards) for which the prediction is being made.
In other words, care must be taken to see that inferences really do apply to the
population or conditions for which they are made.
The danger of extrapolation has been mentioned. The regression equation
indicates that students scoring zero in the test, on average, gain a final mark of
35.6%. This may be so but it is very likely that the relation between the two
examination performances is not linear over all values of x. Conclusions on the
given data should only be made for x in the range 125 to 440.
Thickness (x)   Strength (y)
     0.2            102
     0.3            129
     0.4            201
     0.5            342
     0.6            420
     0.7            591
     0.8            694
     0.9            825
     1.0           1014
     1.1           1143
     1.2           1219
Table 9.4
Calculate the linear relationship between strength and thickness and give the
limits of accuracy of the regression line.
Size of farm (ha)   Income ($)
       60               960
      220               830
      180              1260
       80               610
      120               590
      100               900
      170               820
      110               880
      160               860
      230               760
       70              1020
      120              1080
      240               960
      160               700
       90               800
      110              1130
      220               760
      110               740
      160               980
       80               800
Table 9.5
state the limits of error in using this relationship to predict farm income from
farm size.
which will enable the manufacturer to predict the unit cost of these lenses in
terms of the number of lenses contained in each order.
(b) Estimate the unit cost of an order for eight lenses.
5. The work of wrapping parcels of similar boxes was broken down into eight
elements. The sum of the basic seconds per parcel (i.e. of these eight elements)
together with the number of boxes in each parcel is given in table 9.5.
Number of boxes   Sum of basic
   in parcel      seconds per parcel
      (x)               (y)
       1                130
       6                200
      13                150
      19                200
      22                260
      27                190
      34                290
      42                270
Table 9.5
(a) Calculate the constant basic seconds per parcel and the basic seconds for
each additional box in the parcel.
Calculate the linear regression and test its significance.
(b) What would be the best estimate of the basic seconds for wrapping a
parcel of 18 boxes?
Table 9.6
Variance of x
s_x² = [6.49 − (7.7)²/11]/10 = 1.10/10 = 0.110
Total Variance of y
s_y² = [5692958 − (6680)²/11]/10 = 1636376.2/10 = 163637.6
Correlation Coefficient
r = [6008 − (7.7 × 6680)/11] / √(1.10 × 1636376.2) = +0.9928
Regression Line
a = ȳ = 607.3
b = r × s_y/s_x = 0.9928 × √(163637.6/0.110) = 1210.9
Standard Errors
The estimated residual variance about the regression line is
s² = s_y²(1 − r²)(n − 1)/(n − 2) = 163637.6 × (1 − 0.9928²) × 10/9 = 2609
thus
ε_b = √[s²/Σ(x − x̄)²] = √(2609/1.10) = 48.7
Test of Significance of b
From the evidence of the scatter diagram and the high value of r, the observed
value of b is expected to be significant. In confirmation, the test gives
t = (1210.9 − 0)/48.7 = 24.9
a very highly significant value of t for 9 degrees of freedom.
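As a check, b and r can be recomputed from the table 9.4 data:

```python
import math

# Table 9.4 data: thickness (x) and strength (y).
xs = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
ys = [102, 129, 201, 342, 420, 591, 694, 825, 1014, 1143, 1219]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx                   # regression coefficient
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
print(round(b, 1), round(r, 4))
```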
Table 9.7
Figure 9.3. Regression line with 95% confidence limits and prediction limits.
In the following solutions, since the calculations are all similar to that of
problem 1, the detailed computations are not given.
2. Here the scatter diagram (figure 9.4) shows little evidence of a relationship
but, on the other hand, it does not offer any evidence against the linearity
assumption so the computation is as follows.
n = 20        x̄ = 139.5
Σx = 2790     ȳ = 872.0
Σy = 17440    s_x² = 3194.47
Figure 9.4. Scatter diagram of income against size of farm (ha).
Regression Line
Significance of b
From inspection of the scatter diagram (figure 9.4) and the low value of r (the
significance of which can be tested using table 10*), the observed value of b is
not expected to differ significantly from zero.
Residual variance
s² = 28711.58 (1 − 0.0078²) × 19/18 = 30305
Standard error of b
ε_b = √[30305 / (449900 − (2790)²/20)] = √(30305/60695) = 0.71
t = (0.023 − 0)/0.71 = 0.033
which is clearly not significant. (For the slope of the fitted regression line to be
significantly different from zero, at the 5% level, the observed value of t would
have to be numerically larger than 2.101.)
Thus, until further evidence to the contrary is obtained, farm income can be
assumed to be independent of farm size, at least for the population of farms
covered by the sample of 20 farms.
Since the data show no evidence of a relation between farm size and income,
there is little point in retaining the fitted regression equation. The best estimate
of the mean income of farms in the given population is therefore $872.
Ninety-five per cent confidence limits for this mean income are given by
872 ± 2.101 × 169.4/√20 = 872 ± 79.6 = $792.4 to $951.6
Ninety-five per cent prediction limits for the income of an individual farm
are given as
872 ± 2.101 × 169.4 × √(1 + 1/20) = $872 ± 364.7 = $507.3 to $1236.7
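The two half-widths can be sketched from the summary figures alone (ȳ = 872, s = 169.4, n = 20, t = 2.101):

```python
import math

# Summary figures from the solution above.
y_bar, s, n, t_crit = 872.0, 169.4, 20, 2.101

ci_half = t_crit * s / math.sqrt(n)          # limits for the mean income
pi_half = t_crit * s * math.sqrt(1 + 1 / n)  # limits for an individual farm
print(round(ci_half, 1), round(pi_half, 1))
```

The prediction interval is far wider than the confidence interval because it must cover the scatter of individual farms, not just the uncertainty of the mean.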
3. This problem is of interest since the assumption of linearity can be quite
safely rejected after drawing the scatter diagram (figure 9.5). There is therefore
no point in trying to fit a single linear relationship to the data.
Figure 9.5. Scatter diagram: the relationship is not linear, so the analysis
cannot be continued.
4. The scatter diagram (figure 9.6) indicates quite a strong relationship between
unit cost and order size, and a simple linear relation would probably be adequate,
at least in the range of order size considered. Such a simple model would be
inadequate for extrapolation purposes since the cost per unit would be expected
to tend towards a fixed minimum value as order size was increased indefinitely
and therefore some sort of exponential relation would be a better fit for such
purposes.
Figure 9.6. Cost per unit against number of units in order, with fitted
regression line.
Σy = 212    ȳ = 42.4
Regression Line
a = ȳ = 42.4
Significance of b
Residual variance
s² = 213.3 [1 − (−0.9459)²] × 4/3 = 29.94
Standard error of b
Eb
s
=v'[~(X_X)2] = J(29.94)
86.8 = 0.587
Observed value of
-2.97-0
t= 0.587 = -5.06
The standard error of the regression estimate is
ε_ŷᵢ = √{29.94 [1/5 + (xᵢ − x̄)²/86.8]}
Table 9.8
(b) To estimate the unit cost of an order for eight lenses, substitution of
x = 8 can be made in the regression equation giving
y = 60.8 - 2.97 x 8 = £37.0
This figure is the 'best' estimate of the average cost per lens, taken over all
possible orders of eight lenses.
The uncertainty of this figure (£37.0) is given by the interval (at 95%
confidence) £28.50 to £45.50.
If required, the cost per lens for a randomly selected order for eight lenses
is likely to be (95% probability) in the interval, £17.64 to £56.36, a very wide
range indeed.
5. The scatter diagram (figure 9.7) does not show any evidence against the
assumption of linearity and in this example, a priori logic suggests that it would
be a reasonable model of the situation.
Let x = the number of boxes in a parcel and y = the number of basic seconds
per parcel.
Figure 9.7. Basic seconds per parcel against number of boxes in parcel, with
regression line and 95% confidence limits.
The following totals are obtained from the data (without coding)
n = 8
Σx = 164        x̄ = 20.5
Σx² = 4700      s_x² = 191.14
Σy = 1690       ȳ = 211.25
Σy² = 380100    s_y² = 3298.21
Σxy = 39130     r = +0.8069
Regression Line
a = ȳ = 211.25
Significance of b
Residual variance
s² = 3298.21 (1 − 0.8069²) × 7/6 = 1342.5
Standard error of b
ε_b = √(1342.5/1338) = 1.002
Observed value of
t = (3.35 − 0)/1.002 = 3.34
Reference to table 7* shows that this value, having 6 degrees of freedom, falls
between the 2% and 1% levels of significance (3.143 and 3.707 respectively). The
slope of the regression line can therefore be assumed to be different from zero
with b = 3.35 as its best estimate.
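As a check, the regression and the t test can be recomputed from the table 9.5 parcel data (carrying full precision gives t very slightly above the rounded 3.34 quoted above):

```python
import math

# Table 9.5 data: boxes per parcel (x) and basic seconds per parcel (y).
xs = [1, 6, 13, 19, 22, 27, 34, 42]
ys = [130, 200, 150, 200, 260, 190, 290, 270]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx                  # basic seconds per additional box
s2 = (syy - b * sxy) / (n - 2) # exact residual variance
t = b / math.sqrt(s2 / sxx)    # test of slope against zero
print(round(b, 2), round(t, 2))
```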
ε_ŷᵢ = √{1342.5 [1/8 + (xᵢ − 20.5)²/1338]}
Table 9.9 shows values of ε_ŷⱼ for certain xⱼ together with 95% confidence
limits for the regression estimate at that point. The scatter diagram (figure 9.7)
also has 95% confidence limits drawn on it.
Table 9.9
6. Here, in order to reduce the computation slightly, all the basic data have
been coded into units of $100; i.e. $1300 becomes 13 etc.
The scatter diagram (figure 9.8) illustrates the case of 'fliers' or 'outliers', i.e.
readings which do not appear to belong to the bivariate distribution. These
suspect readings are marked as A and B in figure 9.8. Whenever such observations
occur in practice, a decision has to be made as to whether or not to exclude
them. Special tests to assist in this are available but are beyond the level of this
book and all that can be said here is that the source of the readings should be
carefully examined and if any reason is found for their not being homogeneous
with the others, they should then be rejected. In many cases, a commonsense
approach will indicate what should be done.
In this example, the two points, A and B, clearly do not conform and a closer
examination of the situation would probably isolate a reason so that the points
could validly be excluded. However, to demonstrate their strong effect on the
analysis, the points A and B have been retained in fitting the regression line.
Figure 9.8. Scatter diagram against income level of farms ($), with outlying
points A and B marked; the fitted regression line is not significant.
Σx = 117       s_x² = 6.85
Σx² = 1313     ȳ = 23.64
Σy = 260       s_y² = 43.45
Σy² = 6580     r = 0.3566
Σxy = 2827
Regression Line
a = ȳ = 23.64
b = 0.3566 × √(43.45/6.85) = 0.898
Significance of b
The residual variance about the line,
s² = 43.45 (1 − 0.3566²) × 10/9 = 42.14
Observed
t = (b − 0)/ε_b = 0.898/0.784 = 1.15
a value which is not significantly high.
The regression line calculated above could therefore be misleading since the
observed data as a whole show no evidence of a linear relation between y and x.
However, as mentioned above, the analysis can be carried out omitting
readings A and B if a valid reason to do so is found. If this is done, the
calculations give
n = 9
Σx = 95
Σx² = 1053
Σy = 212
Σy² = 5266
Σxy = 2353
leading to
r = 0.9854 and ŷ = −66.15 + 2.29x (in $)
The fact that just two points have obscured the relationship should be noted,
as should the assistance given by the scatter diagram towards interpretation of the
situation.
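The effect of the two outliers can be checked directly from the two sets of totals quoted above, computing r from the raw sums in each case:

```python
import math

def corr(n, sx, sxx, sy, syy, sxy):
    """Correlation coefficient from raw totals, in corrected-sum form."""
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
    return num / den

# All 11 farms (points A and B included), then the 9 farms with A and B omitted.
r_all = corr(11, 117, 1313, 260, 6580, 2827)
r_omit = corr(9, 95, 1053, 212, 5266, 2353)
print(round(r_all, 4), round(r_omit, 4))
```

Removing the two suspect points raises r from about 0.36 (not significant) to about 0.99, which is exactly the obscuring effect described above.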
Figure 9.9. Total sales against time, with fitted regression line.
The scatter diagram shows no strong relationship between the variables
(sales and time), nor is there any apparent evidence of non-linearity, so the
results for the straight line regression are as shown below.
n = 8       Σy² = 975600    ȳ = 342.5
Σx = 36     Σxy = 12880     s_y² = 5307.1
Σy = 2740                   s_x² = 6.0
Regression Line
a = ȳ = 342.5
Significance of b
The residual variance about the line is
s² = 5307.1 [1 − (0.4403)²] × 7/6 = 4991.3
The standard error of b is
ε_b = √[s²/Σ(x − x̄)²] = √(4991.3/42) = 10.90
and
t = (b − 0)/ε_b = 13.09/10.90 = 1.20
with 6 degrees of freedom.
Reference to table 7* shows that this is not significantly different from
zero, that is, there is no evidence of a relationship between sales and time. In
this case there is no point in using the regression equation above to estimate sales
for 1968 (Year 9) or beyond. The average yearly sales figure of 342 is probably
as good a figure as any to use for making a short-term forecast on the basis of the
information given.