2017dec_02402_solution_en
There are 30 questions of the "multiple choice" type included in this exam, divided among 18
exercises. To answer the questions you need to fill in the prepared 30-question multiple choice
form (on 6 separate pages) in CampusNet.
5 points are given for a correct answer and −1 point is given for a wrong answer. ONLY the
following 5 answer options are valid: 1, 2, 3, 4 or 5. If a question is left blank or another answer
is given, then it does not count (i.e. ”0 points”). Hence, if more than one answer option is
given to a single question, which in fact is technically possible in the online system, it will not
count (i.e. ”0 points”). The number of points corresponding to specific marks or needed to
pass the examination is ultimately determined during censoring.
The final answers should be given in the exam module in CampusNet. The
table sheet here is ONLY to be used as an ”emergency” alternative (remember
to provide your study number if you hand in the sheet).
Exercise  I.1     II.1    II.2    III.1   III.2   IV.1    IV.2    V.1     V.2     V.3
Question  (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)
Answer    5       2       4       5       3       3       4       5       3       3

Exercise  VI.1    VI.2    VII.1   VIII.1  VIII.2  IX.1    IX.2    IX.3    X.1     XI.1
Question  (11)    (12)    (13)    (14)    (15)    (16)    (17)    (18)    (19)    (20)
Answer    3       2       5       2       1       2       2       2       5       4

Exercise  XII.1   XIII.1  XIII.2  XIV.1   XV.1    XVI.1   XVI.2   XVII.1  XVII.2  XVIII.1
Question  (21)    (22)    (23)    (24)    (25)    (26)    (27)    (28)    (29)    (30)
Answer    4       1       2       3       5       4       2       1       5       5
Multiple choice questions: Note that not all the suggested answers are necessarily mean-
ingful. In fact, some of them are very wrong but under all circumstances there is one and only
one correct answer to each question.
Exercise I
x 38 35 47 38 42 41 48 35
y 25 21 26 23 28 27 29 18
1 ρ̂ = 0.12
2 ρ̂ = 0.22
3 ρ̂ = 0.64
4 ρ̂ = 0.73
5* ρ̂ = 0.82
See Definition 1.19. The easiest way to do this is to use R. We read the data into R as
two vectors, x and y, and then calculate the estimate of the correlation as the sample correlation
cor(x,y)
## [1] 0.8237548
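The step of reading in the two vectors is not shown above. A minimal sketch, using the values from the data table of this exercise and adding a manual check against the definition (the names `r_builtin` and `r_manual` are only illustrative):

```r
# Data copied from the table in Exercise I
x <- c(38, 35, 47, 38, 42, 41, 48, 35)
y <- c(25, 21, 26, 23, 28, 27, 29, 18)

# Sample correlation using the built-in function
r_builtin <- cor(x, y)

# The same estimate written out from the definition of the sample correlation
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  ((length(x) - 1) * sd(x) * sd(y))

c(r_builtin, r_manual)
```

Both values agree with the printed result, which rounds to answer 5.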
Exercise II
A biologist is evaluating the effect of 3 different diets on weight change in mice. In the experi-
ment 4 different strains (genetically different types) of mice are included, as strain is expected
to have an influence on weight change. Thus, 4 different strains of mice are exposed to 3 dif-
ferent diets, i.e. a total of 12 mice are included. The weight change is measured after 5 weeks
for each diet. The weight change is denoted Yij (in grams). The weight change can be assumed
normally distributed and thus the following model has been applied
Yij = µ + αi + βj + εij .
In this model αi denotes the effect of diet i (i = 1, 2, 3) and βj denotes the effect of mouse
strain j (j = 1, 2, 3, 4). µ is the overall mean and εij are the errors, assumed independent and
normally distributed with mean 0 and constant standard deviation σε .
State the critical value when you want to test whether the mean weight change is the same for
the 3 diets and the significance level is α = 0.05.
1 12.20
2* 5.14
3 1.96
4 3.81
5 4.35
The test we need to carry out is the F -test for a two-way ANOVA, in this case for the effect
of diet which is the treatment following the book (Theorem 8.22). Hence, we have to find the
two degrees of freedom and look up the 1 − α quantile. The degrees of freedom are:
• df1 = k − 1 = 2, since the number of levels for the treatment (number of diets) is k = 3
• df2 = (k − 1)(l − 1) = 6, since the number of blocks (mouse strains) is l = 4
## [1] 5.143253
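The call that produced this value is not shown. A minimal sketch of the lookup, assuming k = 3 diets and l = 4 strains as stated in the exercise (the variable names are illustrative):

```r
k <- 3  # number of diets (treatments)
l <- 4  # number of mouse strains (blocks)

df1 <- k - 1              # degrees of freedom for the treatment
df2 <- (k - 1) * (l - 1)  # degrees of freedom for the residuals

# Critical value: the 1 - alpha quantile of the F-distribution
crit <- qf(0.95, df1 = df1, df2 = df2)
crit
```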
Note that the results don't change if blocks and treatments are switched, i.e. if the diets are
thought of as blocks and strains as treatments.
Assume that we have estimated the model parameters using R and concluded that both diet
and type of mouse strain are statistically significant. Also assume that the model residuals,
ε̂ij , are stored in the vector resi, and that we will use R to further analyze these. Which of
the following claims is not correct?
1. TRUE statement. The residuals are sorted and, where i ∈ {1, 2, . . . , n} denotes the i'th
element in sorted order, ε̂i is plotted versus the (i − 0.5)/n quantile of the standard normal
distribution
2. TRUE statement
3. TRUE statement. We know that MSE is an estimate of the error variance, and this can
be found as $SSE / ((k-1)(l-1))$.
5. TRUE statement. It is the sample mean, which is used as an estimate of the mean
Exercise III
In a study 605 test persons, all with a record of previous heart disease, were randomized to
one of two possible diets (A or B), in order to study the effect of diet on health. After an
observation period of 4 years the test persons were classified according to health status: (I)
dead, (II) cancer, (III) other disease, (IV) well.
                Health status
          I     II    III   IV    Total
Diet A    15    24    25    239   303
Diet B    7     14    8     273   302
Total     22    38    33    512   605
The null hypothesis in the study was that there is no association between diet and health.
State the distribution of the usual test statistics, when assuming that the null hypothesis is
true:
The setup of the data is a multi-sample proportion setup (chapter 7.4). We must test the
hypothesis that the proportions in each group are equal,
H0 : p1 = p2 = p3 = p4 ,
and under this hypothesis the test statistic follows a χ²-distribution with c − 1 degrees of
freedom, and there are 4 groups, so 3 degrees of freedom (Method 7.20).
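As a check of the degrees of freedom, the table can be passed to `chisq.test`. This is a sketch using the counts from the table above; the object names `health` and `res` are illustrative:

```r
# Observed counts: rows are diets, columns are health status I-IV
health <- matrix(c(15, 24, 25, 239,
                    7, 14,  8, 273),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Diet A", "Diet B"),
                                 c("I", "II", "III", "IV")))

res <- chisq.test(health)

# Degrees of freedom of the reference chi-square distribution
res$parameter
```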
We now only consider the proportion of test persons who are healthy at the end of the 4 year
period. We want to estimate a 95% confidence interval for the difference in proportions of test
persons who are healthy for each of the 2 diets. Which of the suggestions below is the correct
code in R to achieve this?
Here we are working with proportions in two populations as described in Chapter 7.3. We need
the observed number of test persons who are well for each diet. On Diet A 239 out of 303 are well and
on Diet B 273 out of 302 are well, and these numbers are passed to prop.test, which then
prints out the estimated confidence interval (same as Example 7.19).
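The answer options themselves are not reproduced here, so the following is only a sketch of what such a call can look like. Turning off the continuity correction with `correct = FALSE` is an assumption, chosen to match the uncorrected interval used elsewhere in this solution:

```r
well  <- c(239, 273)  # persons classified as well on Diet A and Diet B
total <- c(303, 302)  # persons randomized to Diet A and Diet B

# Two-sample proportion test; the confidence interval is for pA - pB
res <- prop.test(x = well, n = total, correct = FALSE)

res$estimate  # observed proportions
res$conf.int  # 95% confidence interval for the difference
```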
Exercise IV
In the production of a consumer product 3 subprocesses are involved, denoted A, B and C. The
time (in hours) it takes to complete each subprocess is represented with a random variable,
which we denote $X_A$, $X_B$ and $X_C$, respectively. It can be assumed that $X_A$, $X_B$ and $X_C$
are all independent and normally distributed, given by $X_A \sim N(12, 2^2)$, $X_B \sim N(25, 3^2)$ and
$X_C \sim N(42, 4^2)$.
State the probability that the total production time, Y , exceeds 85 hours:
1 0.0081
2 0.1080
3* 0.1326
4 0.4180
5 0.6301
We need to find the mean and variance of Y , which we know is normally distributed, since a
linear function of normally distributed random variables is also normally distributed (Theorem
2.56).
k <- 1000000
X_a <- rnorm(k, 12, 2)
X_b <- rnorm(k, 25, 3)
X_c <- rnorm(k, 42, 4)
Y <-X_a + X_b + X_c
var(Y)
## [1] 29.01567
## [1] 0.1326027
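The simulation can be compared with the exact calculation: for independent variables the means and the variances add, so Y ∼ N(79, 29). A minimal sketch (the variable names are illustrative):

```r
# Mean and variance of Y = X_A + X_B + X_C for independent variables
mu_Y  <- 12 + 25 + 42
var_Y <- 2^2 + 3^2 + 4^2

# P(Y > 85) from the normal distribution; note that pnorm takes the sd
p_exceed <- 1 - pnorm(85, mean = mu_Y, sd = sqrt(var_Y))
p_exceed
```

The simulated variance of about 29.02 is close to the exact value 29.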
An engineer is now able to perform some optimization of the process, so that the improved
process time $Y^*$ becomes
$Y^* = 0.9 \cdot X_A + 0.8 \cdot X_B + X_C$,
5 $V(Y^*) = 2^2 + 3^2 + 4^2$
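The question text and the remaining options are not reproduced here. Assuming the question concerns the variance of the improved process time, the rule for a linear combination of independent variables, V(aX + bZ) = a²V(X) + b²V(Z), can be sketched as follows (the names `a`, `sds` and `mus` are illustrative):

```r
a   <- c(0.9, 0.8, 1)   # coefficients of X_A, X_B and X_C
sds <- c(2, 3, 4)       # standard deviations of X_A, X_B and X_C
mus <- c(12, 25, 42)    # means of X_A, X_B and X_C

# The coefficients enter squared in the variance, but not in the mean
var_Ystar  <- sum(a^2 * sds^2)
mean_Ystar <- sum(a * mus)

c(mean = mean_Ystar, variance = var_Ystar)
```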
Exercise V
The yearly rainfall has been registered within a region for the last 100 years. It can be assumed
that the rainfall is independent from year to year. The cumulative distribution for the yearly
rainfall is shown in the figure below:
[Figure: empirical cumulative distribution function Fn(x) of the yearly rainfall; x-axis: rainfall [mm].]
The following summary of the data has been conducted by the use of R, where the yearly
rainfall measurements are stored in the variable rainfall:
> var(rainfall)
[1] 412.7042
> summary(rainfall)
Min. 1st Qu. Median Mean 3rd Qu. Max.
652.8 686.6 701.9 701.3 714.9 749.1
Question V.1 (8)
1 The estimate of the standard deviation of the mean, $\hat{\sigma}_{\bar{X}}$, becomes $\sqrt{412.7042}/10$ mm
1. TRUE statement. The formula for the estimate is $s/\sqrt{n}$ (also called the standard error of
the mean). See Definition 3.7
4. TRUE statement. 686.6 is the first quartile (25% quantile) and 714.9 is the third quartile
(75% quantile), and certainly 50% of the observations lie between the 25% and 75%
quantiles
5. FALSE statement. The estimated coefficient of variation is $\hat{V} = s/\bar{x} = \sqrt{412.7042}/701.3$. See
Definition 1.12.
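The quantities in statements 1 and 5 can be computed from the printed summary alone. This is a sketch, since the raw `rainfall` vector is not available here; the variable names are illustrative:

```r
s2   <- 412.7042  # sample variance from var(rainfall)
xbar <- 701.3     # sample mean from summary(rainfall)
n    <- 100       # number of yearly observations

s       <- sqrt(s2)     # sample standard deviation
se_mean <- s / sqrt(n)  # standard error of the mean
cv      <- s / xbar     # estimated coefficient of variation

c(s = s, se_mean = se_mean, cv = cv)
```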
Provide a 95% confidence interval for the variance of the rainfall based on the 100 observations,
still assumed to be normally distributed:
2 2
1 [ 20.3151299·134.6416 ; 20.3151299·69.22989 ]
2
·99 20.31512 ·992
2 [ 20.31512
134.6416
; 69.22989 ]
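3* $\left[\frac{412.7042 \cdot 99}{128.422};\ \frac{412.7042 \cdot 99}{73.36108}\right]$
4 $\left[\frac{412.7042 \cdot 99}{123.2252};\ \frac{412.7042 \cdot 99}{77.04633}\right]$
5 $\left[\frac{20.31512 \cdot 99}{123.2252};\ \frac{20.31512 \cdot 99}{77.04633}\right]$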
We find the formula for a 1 − α confidence interval for the variance of a normal distributed
population in Method 3.19 and insert the values
$$\left[\frac{s^2 (n-1)}{\chi^2_{1-\alpha/2}},\ \frac{s^2 (n-1)}{\chi^2_{\alpha/2}}\right]$$
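The quantiles in the correct answer come from the χ²-distribution with n − 1 = 99 degrees of freedom. A minimal sketch of the calculation, using only the printed variance (the names `q` and `ci_var` are illustrative):

```r
s2 <- 412.7042  # sample variance
n  <- 100       # number of observations

# Upper and lower chi-square quantiles with n - 1 degrees of freedom
q <- qchisq(c(0.975, 0.025), df = n - 1)

# The larger quantile gives the lower limit, the smaller gives the upper limit
ci_var <- s2 * (n - 1) / q
ci_var
```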
Question V.3 (10)
We continue with the exercise from the previous page. The following code in R has now been
run:
k = 10^5
Q5 = function(x){ quantile(x, 0.95) }
samples = replicate(k, sample(rainfall, replace = TRUE))
simvalues = apply(samples, 2, Q5)
interval = quantile(simvalues, c(0.025,0.975))
> interval
2.5% 97.5%
728.9515 742.0814
1 A 95% confidence interval for the mean of the yearly rainfall (parametric bootstrap)
2 A 95% confidence interval for the 5% quantile of the yearly rainfall (parametric bootstrap)
3* A 95% confidence interval for the 95% quantile of the yearly rainfall (non-parametric
bootstrap)
4 A 95% confidence interval for the 2.5% and 97.5% quantile of the yearly rainfall (non-
parametric bootstrap)
5 A 95% confidence interval for the 2.5% and 97.5% quantile of the yearly rainfall (para-
metric bootstrap)
We look at the R code and see that bootstrapping is carried out by resampling the sample
100000 times without assuming any distribution (since the sample function is used); therefore
it is non-parametric.
The statistic calculated for each simulated sample is the 95% quantile, and since the quantiles
taken of these values are the 2.5% and the 97.5% quantiles, the result is a 95% confidence interval
for the 95% quantile.
Exercise VI
We consider an experiment that can result in one of two possible outcomes, here denoted A
or B. The probability of outcome A is denoted P (A). By definition we get the probability of
outcome B as P (B) = 1 − P (A).
Assume that we observe a random variable, X, which counts the number of times that we
observe the outcome A out of n = 300 independent trials of the experiment. If we assume that
P (A) = 0.40 in a single trial, what is then the expected number E(X) and variance V(X)?
X follows a binomial distribution with n = 300 and p = 0.4, and we have formulas for the mean and variance
in Theorem 2.21, which we use to get E(X) = n · p = 300 · 0.4 = 120 and V(X) = n · p · (1 − p) = 300 · 0.4 · 0.6 = 72.
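The binomial mean and variance formulas can be evaluated directly in R. A minimal sketch (the names `EX` and `VX` are illustrative):

```r
n <- 300   # number of independent trials
p <- 0.40  # probability of outcome A in a single trial

EX <- n * p            # expected number of times A is observed
VX <- n * p * (1 - p)  # variance of the number of times A is observed

c(EX = EX, VX = VX)
```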
Regardless of your answer to the previous question we now want to estimate the probability
P (A) based on the n = 300 trials. From the n = 300 trials we count that in 120 of these the
outcome was A and in the remaining 180 trials the outcome was B. Provide a 95% confidence
interval for the probability P (A):
1 [0.33, 0.48]
2* [0.35, 0.46]
3 [0.35, 0.42]
4 [0.31, 0.53]
5 [0.29, 0.54]
##
## 1-sample proportions test without continuity correction
##
## data: 120 out of 300, null probability 0.5
## X-squared = 12, df = 1, p-value = 0.000532
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3461652 0.4563634
## sample estimates:
## p
## 0.4
whereas using the formula in Method 7.3 gives a slightly different result:
n <- 300
x <- 120
phat <- x/n
phat + c(-1,1) * qnorm(p=0.975) * sqrt(phat*(1-phat)/n)
The difference arises because prop.test computes a score-based (Wilson) interval, whereas Method 7.3
uses the normal approximation around the estimate. The answer is in any case
closest to the answer marked correct, [0.35, 0.46].
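The two intervals can be computed side by side. The `prop.test` call itself is not shown above; the sketch below assumes it was called without continuity correction, as the header of the printed output states:

```r
n <- 300
x <- 120
phat <- x / n

# Interval from the formula in Method 7.3 (normal approximation)
wald <- phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)

# Interval reported by prop.test without continuity correction
score <- prop.test(x, n, correct = FALSE)$conf.int

rbind(wald = wald, score = as.numeric(score))
```

Both intervals are closest to answer 2.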
Exercise VII
An engineer is examining the quality in a batch of raw materials. The quality demand is that
the purity of the raw material is at least 90%. The engineer takes a sample of 10 independent
measurements from the batch and saves the measured values (in %) of the purity in a vector x.
> x <- c(90.6, 90.3, 88.9, 87.5, 87.6, 88.1, 87.5, 88, 88, 89.6)
> n <- length(x)
> tobs <- (mean(x) - 90) / (sd(x) / sqrt(n))
> pt(tobs, df=n-1)
[1] 0.002279236
Based on the calculations listed above, and assuming that the measurements of the purity
are normally distributed and applying a significance level of α = 0.05, what can the engineer
conclude?
1 The engineer can conclude that the purity of the raw material is at least 88.6%
2 The engineer can conclude that the mean purity of the raw material is at most 88.6%
3 The engineer has with probability 99.7% shown that the mean purity of the raw material
is 90%
4 The engineer can assume that the mean purity of the raw material is 90%
5* The engineer can reject that the mean purity of the raw material is 90%
We can see from the way that tobs is calculated that the null hypothesis is µ = 90
(see Method 3.23). The p-value is 2*pt(tobs, df=n-1)=0.0046 and thus much lower than
α = 0.05. This leads to the conclusion that the null hypothesis, that the mean purity is 90%,
must be rejected.
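The two-sided p-value can be reproduced from the data and compared with the built-in `t.test`. A minimal sketch (the names `p_one`, `p_two` and `res` are illustrative):

```r
x <- c(90.6, 90.3, 88.9, 87.5, 87.6, 88.1, 87.5, 88, 88, 89.6)
n <- length(x)

# Observed test statistic for H0: mu = 90
tobs <- (mean(x) - 90) / (sd(x) / sqrt(n))

# One-sided tail probability as printed in the exercise, and the two-sided p-value
p_one <- pt(tobs, df = n - 1)
p_two <- 2 * pt(-abs(tobs), df = n - 1)

# The same two-sided test with the built-in function
res <- t.test(x, mu = 90)
c(p_one = p_one, p_two = p_two, t.test = res$p.value)
```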
Exercise VIII
We consider a random variable W with density function $f(w) = \frac{1}{9\sqrt{2\pi}} e^{-\frac{(w-30)^2}{162}}$.
The density function is shown in the figure below, where the probability
P (30 < W < 33) is shown as the shaded area.
[Figure: the density f(w) plotted for w from 0 to 60, with the area P(30 < W < 33) shaded.]
1 0.09
2* 0.13
3 0.24
4 0.34
5 0.84
The answer is obtained by recognizing the formula for the probability density function
(pdf) of the normal distribution in Definition 2.37,
$$f(w) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(w-\mu)^2}{2\sigma^2}},$$
and thus finding the mean µ = 30 and standard deviation σ = 9 (variance σ² = 81, since 2σ² = 162). These are then used to obtain
P (30 < W < 33) = P (W < 33) − P (W < 30)
in R
## [1] 0.1305587
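The call is not shown above. A minimal sketch of how the value can be obtained with `pnorm`, with a numerical integration of the stated density as a check (the names `p_norm`, `f` and `p_int` are illustrative):

```r
# P(30 < W < 33) for W ~ N(30, 9^2); pnorm takes the standard deviation
p_norm <- pnorm(33, mean = 30, sd = 9) - pnorm(30, mean = 30, sd = 9)

# Check by integrating the density given in the question
f <- function(w) exp(-(w - 30)^2 / 162) / (9 * sqrt(2 * pi))
p_int <- integrate(f, lower = 30, upper = 33)$value

c(p_norm, p_int)
```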
SINCE in the original exam the plot was which indeed was wrong, it was of the normal distri-
[Figure: density f(w) plotted for w from 15 to 45.]
## [1] 0.3413447
Question VIII.2 (15)
We consider a situation where we take 3 different samples denoted A, B, and C. All three
samples are from the population characterized by the density $f(w) = \frac{1}{9\sqrt{2\pi}} e^{-\frac{(w-30)^2}{162}}$ as in the
previous question.
Sample A is of size nA = 10 and the estimated mean is denoted µ̂A . Sample B is of size
nB = 30 and the estimated mean is denoted µ̂B . Sample C is of size nC = 100 and the
estimated mean is denoted µ̂C .
The question is now whether the sample mean will exceed the value 33, even when the popu-
lation mean is equal to 30.
1. TRUE statement. µ̂ is the sample mean, which we know follows the distribution
µ̂ ∼ N (µ, σ²/n) (Theorem 3.3), so we get the following
µ̂A ∼ N (30, 81/10)
µ̂B ∼ N (30, 81/30)
µ̂C ∼ N (30, 81/100)
and we can then realize that the probability of getting an outcome above the
same value must be higher for µ̂A than for the two others, since its pdf has higher variance
than the others. In R we can check it by:
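The original check is not reproduced here. A sketch of one way to do it, computing P(µ̂ > 33) for each of the three sample sizes (the names `n` and `p_exceed` are illustrative):

```r
n <- c(A = 10, B = 30, C = 100)  # sample sizes

# Standard deviation of the sample mean is sigma / sqrt(n) with sigma = 9
p_exceed <- 1 - pnorm(33, mean = 30, sd = 9 / sqrt(n))
names(p_exceed) <- names(n)

p_exceed
```

The probability decreases as the sample size grows, so it is largest for sample A.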
2. FALSE statement. Following same argument as above
3. FALSE statement. Since the variance is different, then they are not equal
## [1] 1.456427e-05
## [1] 0.01697229
Exercise IX
The yield from a chemical process, Yi , is assumed to depend linearly on the temperature, ti ,
measured in degrees. In order to achieve insight about this relation, an experiment has been
conducted where n = 50 pairwise measurements of Yi and ti have been taken. It is assumed that
the following model can give a reasonable description of the relation
Yi = β0 + β1 · ti + εi .
The errors in this model are assumed independent and normally distributed with constant
variance, i.e. εi ∼ N (0, σε²). Relevant output from the analysis in R is given below:
Call:
lm(formula = y ~ t)
Residuals:
Min 1Q Median 3Q Max
-5.0816 -1.4994 -0.2493 1.5175 4.8506
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.4919 2.7757 23.595 <2e-16 ***
t 0.1637 0.1103 1.485 0.144
---
Which of the following statements is correct when the significance level α = 0.05 is applied?
1 The yield increases by 16.37% when the temperature increases by one degree
3 The test statistic for no effect of temperature on yield (i.e. the null hypothesis H0 : β1 =
0) is 23.595
1. FALSE statement. The yield is estimated to increase 0.1637 units (we are not informed
about the units) per degree, which is not the same as 16.37% (increasing some proportion
per degree, would also lead to an exponential relation, not linear)
H0 : β1 = 0
leads to a p-value of 0.144, which is not below the significance level α = 0.05 and since
this is equivalent to testing for correlation equal to zero
H0 : ρ = 0
there is not found a significant linear relation between the yield and the temperature
4. FALSE statement. The lower limit of the CI is 0.1637 − 1.96 ∗ 0.1103 = −0.052 and the
upper is 0.1637 + 1.96 ∗ 0.1103 = 0.380
5. FALSE statement. The correlation is $\sqrt{r^2} = \sqrt{0.04392} = 0.21$
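The numbers used in these statements can be reproduced from the printed coefficient table. This is a sketch: the value r² = 0.04392 is taken from statement 5, and the t-quantile with n − 2 = 48 degrees of freedom is added for comparison with the 1.96 used in statement 4:

```r
est <- 0.1637  # estimated slope for t
se  <- 0.1103  # its standard error

t_val <- est / se                                    # test statistic for H0: beta1 = 0
ci_z  <- est + c(-1, 1) * 1.96 * se                  # interval as computed in statement 4
ci_t  <- est + c(-1, 1) * qt(0.975, df = 50 - 2) * se  # interval with the t-quantile
r_abs <- sqrt(0.04392)                               # absolute value of the correlation

list(t_val = t_val, ci_z = ci_z, ci_t = ci_t, r_abs = r_abs)
```

Both intervals contain 0, in agreement with the p-value of 0.144.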
Question IX.2 (17)
We continue with the exercise from the previous page. It turns out that the pH of the process
may influence the yield, and since pH has been measured, it is decided to include it into the
model, which in its extended form becomes:
Yi = β0 + β1 · ti + β2 · pHi + εi .
Call:
lm(formula = y ~ t + pH)
Residuals:
Min 1Q Median 3Q Max
-3.7253 -1.2818 -0.2978 1.0724 4.4488
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.46756 4.09799 12.071 5.25e-16 ***
t 0.24113 0.09315 2.589 0.0128 *
pH 2.37090 0.50097 4.733 2.06e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimates are read directly from the printed output. See Example 6.3
Question IX.3 (18)
We continue with the exercise from the previous page and the model
Yi = β0 + β1 · ti + β2 · pHi + εi .
Provide a 95% confidence interval for the effect on yield when pH increases one unit:
See Method 6.5. The confidence interval for the effect of pH is found by inserting the printed
values into
$$\hat{\beta}_2 \pm t_{1-\alpha/2} \cdot \hat{\sigma}_{\beta_2},$$
using the t-distribution with n − (p + 1) = 47 degrees of freedom to find the quantile t1−α/2 :
qt(p=0.975, df=47)
## [1] 2.011741
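Inserting the printed estimate and standard error for pH gives the interval. A minimal sketch (the answer options are not reproduced here, and the variable names are illustrative):

```r
est <- 2.37090  # estimated effect of pH
se  <- 0.50097  # its standard error
df  <- 50 - (2 + 1)  # n - (p + 1) with n = 50 observations and p = 2 inputs

# 95% confidence interval for the effect of a one-unit increase in pH
ci <- est + c(-1, 1) * qt(p = 0.975, df = df) * se
ci
```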
Exercise X
Assume there exists a die with 10 sides where the probability for each of the 10 outcomes,
1, 2, . . . , 10, is the same. Consider the discrete random variable X with density f (x) = 0.1 for
x ∈ {1, 2, . . . , 10}.
1 $\frac{1}{10-1} \sum_{i=1}^{10} x_i = 6.11$
2 $\frac{1}{10-6.11} \sum_{i=1}^{10} |x_i - 6.11| = 6.48$
3 $\frac{1}{10} \sum_{i=1}^{10} (x_i - 6.11)^2 = 8.62$
4 $\sum_{i=1}^{10} \frac{10-1}{10}\, x_i \cdot 0.1 = 4.95$
5* $\sum_{i=1}^{10} x_i \cdot 0.1 = 5.50$
See Definition 2.13. We use the formula for calculating the mean value of a discrete random
variable
$$\sum_{i=1}^{n} x_i f(x_i)$$
sum(1:10*0.1)
## [1] 5.5
Exercise XI
The yield of a process is µ = 60 mg/l. Certain changes to the process are being planned and it
is desirable to be able to prove an effect on the mean yield if the change is at least 5 mg/l (i.e.
a two-sided test).
An engineer is now going to plan an experiment to evaluate the effect of the process changes.
He wants to decide how large a sample is needed. The sample size has to be large enough to
detect the relevant effect (5 mg/l) with a power of 0.8 when applying a significance level of
α = 0.05. It can be assumed that the standard deviation is σ = 10 mg/l.
Based on the information above, and by applying the function power.t.test in R, one con-
cludes that, if an equal number of measurements are taken, then the minimum number of
measurements n needed becomes:
3 n ≈ 64 measurements
4* n ≈ 34 measurements
5 n ≈ 27 measurements
Based on the given information the planned test is a one-sample test, since it is not stated that
a sample should be taken before the change, only that the yield before is µ = 60 mg/l. See
Example 3.67.
##
## One-sample t test power calculation
##
## n = 33.3672
## delta = 5
## sd = 10
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
Since it is not completely clear that it should not be a two-sample setup (one could argue
that nothing in the information given prevents it from being a two-sample test), Answer
3 is also taken as correct, since:
##
## Two-sample t test power calculation
##
## n = 63.76576
## delta = 5
## sd = 10
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
Further, since it is also not specified that n is the number of measurements in each group
(and not the total), Answer 2 is also taken as correct.
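The calls behind the two printed outputs are not shown. A sketch of calls that should reproduce them, using the values stated in the exercise (the names `one` and `two` are illustrative):

```r
# One-sample setup: compare the new mean yield with the known value 60 mg/l
one <- power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.8,
                    type = "one.sample")

# Two-sample setup: n is the number of measurements in each group
two <- power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.8,
                    type = "two.sample")

# Round up to get the minimum whole number of measurements
c(one_sample = ceiling(one$n), two_sample = ceiling(two$n))
```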
Exercise XII
In a study the aim is to investigate the possible cholesterol lowering effect of a product. 9 test
persons had their cholesterol level measured (denoted x1). After 3 months, while using the
product, the same 9 test persons had their cholesterol level measured again (denoted x2). Data
is shown in the table below:
Person 1 2 3 4 5 6 7 8 9
x1 63.5 66.7 59.2 57.4 63.9 63.2 60.7 62.6 63.3
x2 51.3 51.9 57.8 50.2 54.6 43.3 51.2 40.4 52.2
The following code is now run in R, in order to test whether the change over time can be
assumed to be zero (H0 : δ = 0):
x1 <- c(63.5, 66.7, 59.2, 57.4, 63.9, 63.2, 60.7, 62.6, 63.3)
x2 <- c(51.3, 51.9, 57.8, 50.2, 54.6, 43.3, 51.2, 40.4, 52.2)
The output from the standard statistical analysis is given below. Please note that some numbers
in the standard output have been replaced by the letters A, B and C.
t = -5.6354, df = A, p-value = B
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-16.847799 C
sample estimates:
mean of the differences
-11.95556
2 We can not show an effect since the upper limit of the confidence interval is 7.063312
3 We can not show an effect since the lower limit of the confidence interval is -7.063312
The standard statistical test for this setup is a paired two-sample t-test. The R output is from
t.test(), and the easiest way to solve this is by copying and running
x1 <- c(63.5, 66.7, 59.2, 57.4, 63.9, 63.2, 60.7, 62.6, 63.3)
x2 <- c(51.3, 51.9, 57.8, 50.2, 54.6, 43.3, 51.2, 40.4, 52.2)
## The call is then either "t.test(x2, x1, paired=TRUE)" or
t.test(x2-x1)
##
## One Sample t-test
##
## data: x2 - x1
## t = -5.6354, df = 8, p-value = 0.0004897
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -16.847799 -7.063312
## sample estimates:
## mean of x
## -11.95556
and from the p-value we can find the correct answer. See section 3.1.7 for more examples.
Exercise XIII
An analysis of variance is performed for the above model and the output is given below. Please
note that the output is incomplete as some numbers are replaced by the symbols A, B and C.
Response: growth
Df Sum Sq Mean Sq F value Pr(>F)
treatment A 281.07 B C 0.0001409 ***
Residuals 28 268.46 9.588
Provide the usual test statistics (denoted by C) in order to test for equal mean effect of the 4
growth inhibitors
1* 9.77
2 7.23
3 2.95
4 4.57
5 16.11
• SS(Tr) is the variation explained by the effect of the treatment (treatment sum of squares)
• SSE is the variation remaining after the model (sum of squared errors)
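The missing entries A, B and C follow from the printed table. A minimal sketch, assuming k = 4 growth inhibitors as stated in the question (the variable names are illustrative):

```r
k      <- 4       # number of growth inhibitors (treatments)
ss_tr  <- 281.07  # treatment sum of squares from the table
mse    <- 9.588   # residual mean square from the table
df_res <- 28      # residual degrees of freedom from the table

A <- k - 1      # treatment degrees of freedom
B <- ss_tr / A  # treatment mean square
C <- B / mse    # F statistic

# p-value, for comparison with the printed Pr(>F)
p_val <- 1 - pf(C, df1 = A, df2 = df_res)
c(A = A, B = B, C = C, p_val = p_val)
```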
Question XIII.2 (23)
We now want to calculate a post hoc 95% confidence interval for a difference in mean between
growth inhibitor V1 and V2 , here denoted I0.95 (V1 − V2 ). From the experiment it is known that
the estimated mean difference between V1 and V2 is 4.5. State the interval I0.95 (V1 − V2 ):
9.588
√
1 I0.95 (V1 − V2 ) = 4.5 ± 2.048 · 12
28 ·
2* $I_{0.95}(V_1 - V_2) = 4.5 \pm 2.048 \cdot \sqrt{9.588} \cdot \sqrt{2/8}$
3 $I_{0.95}(V_1 - V_2) = 4.5 \pm 2.306 \cdot \frac{\sqrt{9.588}}{\sqrt{12}}$
4 $I_{0.95}(V_1 - V_2) = 4.5 \pm 2.306 \cdot \sqrt{9.588^2 \cdot 1/8}$
5 $I_{0.95}(V_1 - V_2) = 4.5 \pm 1.960 \cdot \frac{9.588}{\sqrt{8}}$
See method 8.9. The post hoc confidence interval for the difference is
$$\bar{y}_i - \bar{y}_j \pm t_{1-\alpha/2} \sqrt{\frac{SSE}{n-k}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}.$$
qt(p=0.975, df=28)
## [1] 2.048407
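The interval in the correct answer can be evaluated numerically. The sketch below assumes 8 observations per growth inhibitor, which is inferred from the 28 + 4 = 32 observations in total and from the factor 2/8 in the correct option:

```r
mse    <- 9.588  # SSE / (n - k) from the ANOVA table
df_res <- 28     # residual degrees of freedom
n_i    <- 8      # observations for V1 (assumed, see above)
n_j    <- 8      # observations for V2 (assumed, see above)
d      <- 4.5    # estimated mean difference between V1 and V2

# Half-width of the post hoc 95% confidence interval
half <- qt(p = 0.975, df = df_res) * sqrt(mse * (1 / n_i + 1 / n_j))
ci <- d + c(-1, 1) * half
ci
```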
Exercise XIV
We consider a continuous random variable X, where the well-known cumulative distribution
function F (x) is given by $P(X \le x) = 1 - e^{-x/2}$, where x > 0.
1 1/2
2 1
3* 2
4 3/2
5 4
It is recognized as the cdf of the exponential distribution (Definition 2.48), which is verified by
$$\int_0^x \lambda e^{-\lambda y}\, dy = \left[-e^{-\lambda y}\right]_0^x = -e^{-\lambda x} + e^{0} = 1 - e^{-\lambda x}$$
and it can be seen that λ = 1/2. Using the formula for the mean of an exponential distribution
(Theorem 2.49)
$$\mu = \frac{1}{\lambda} = 2.$$
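The identification of the distribution and the mean can also be checked numerically. This is only a sketch; the evaluation points in `x` are arbitrary:

```r
x <- c(0.5, 1, 2, 5)  # arbitrary positive evaluation points

F_given <- 1 - exp(-x / 2)       # cdf as stated in the exercise
F_exp   <- pexp(x, rate = 1 / 2) # exponential cdf with lambda = 1/2

# Mean by numerical integration of x * f(x) over the positive axis
mean_num <- integrate(function(t) t * dexp(t, rate = 1 / 2),
                      lower = 0, upper = Inf)$value

rbind(F_given, F_exp)
mean_num
```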
Exercise XV
A biologist is examining the bio-diversity within an area and has measured the number of
different type of plants per 10 m2 in different places in the area. She has obtained a total of
30 independent measurements, yi , and these are stored in the vector Yobs in R.
The biologist would like to estimate a 95% confidence interval for the coefficient of variation for
the bio-diversity (number of different type of plants per 10 m2 ) by applying the non-parametric
bootstrap. Which of the following suggestions in R is most suitable to achieve this?
1 samples = replicate(10000,rnorm(30,mean(Yobs),sd(Yobs)))
results = apply(samples,2,sd)/apply(samples,2,mean)
quantile(results, c(0.025,0.975))
2 samples = replicate(10000,sample(Yobs,replace=TRUE))
results = apply(samples,2,var)/apply(samples,2,sd)
quantile(results, c(0.025,0.975))
3 samples = replicate(10000,rnorm(30,mean(Yobs),sd(Yobs)))
results = apply(samples,2,var)/apply(samples,2,median)
quantile(results, c(0.025,0.975))
4 samples = replicate(10000,sample(Yobs,replace=FALSE))
results = apply(samples,2,sd)/apply(samples,2,mean)
quantile(results, c(0.025,0.975))
5* samples = replicate(10000,sample(Yobs,replace=TRUE))
results = apply(samples,2,sd)/apply(samples,2,mean)
quantile(results, c(0.025,0.975))
Exercise XVI
In a study 178 men and 180 women were asked to answer which of 2 political candidates, A or
B, they preferred. Alternatively, they could answer "none of the two". The distribution of the
answers is shown in the figure below.
[Figure: bar chart of the number of answers (count) by sex (men, women) and choice (Candidate A, Candidate B, None of the two). Bar labels: men 60, 78, 40; women 40, 85, 55.]
Question XVI.1 (26)
It is seen from the figure that we observe that 85 out of the 180 women prefer Candidate B. If
we can assume the same distribution of answers by gender, how many women out of the 180
would we expect to prefer Candidate B?
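1 $\frac{163}{358} \cdot \frac{95}{358} \cdot 358$
2 $\frac{100}{358} \cdot \frac{223}{358} \cdot 358$
3 $\frac{95}{358} \cdot \frac{190}{358} \cdot 358$
4* $\frac{163}{358} \cdot \frac{180}{358} \cdot 358$
5 $\frac{95}{358} \cdot \frac{180}{358} \cdot 358$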
See chapter 7.2. The total number of respondents is n = 180 + 178 = 358, and if we assume
the same distribution of answers by gender, i.e. under the hypothesis that the proportions
of men and women preferring B are equal,
H0 : pmen,B = pwomen,B = p,
then
$$p = \frac{\text{Total number for B}}{\text{Total number}} = \frac{78 + 85}{358} = \frac{163}{358}.$$
It is then simply this fraction we expect out of the total number of women,
$$\frac{163}{358} \cdot 180,$$
which is then expressed a little longer by
$$\frac{163}{358} \cdot \frac{180}{358} \cdot 358.$$
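The expected count can be computed from the observed table and compared with the expected counts that `chisq.test` reports. A minimal sketch, using the counts from the figure (the object names are illustrative):

```r
counts <- matrix(c(60, 78, 40,
                   40, 85, 55),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("Men", "Women"), c("A", "B", "None")))

n_total <- sum(counts)

# Expected number of women preferring B: (column total / n) * row total
expected_women_B <- sum(counts[, "B"]) / n_total * sum(counts["Women", ])

# The same number from the table of expected counts
expected_table <- chisq.test(counts, correct = FALSE)$expected

c(expected_women_B, expected_table["Women", "B"])
```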
Provide the usual test statistics when you want to conduct the test of whether the distribution
of answers is the same for men and women:
1 $\chi^2_{obs} = 5.9915$
2* $\chi^2_{obs} = 6.6581$
3 $\chi^2_{obs} = 16.212$
4 $\chi^2_{obs} = 8.3836$
5 $\chi^2_{obs} = 4.5067$
Maybe the easiest is to adapt Example 7.21 from the book, on testing multiple proportions:
prop <- matrix(c(60, 78, 40, 40, 85, 55), ncol = 3, byrow = TRUE)
rownames(prop) <- c("Men", "Women")
colnames(prop) <- c("A", "B", "None")
prop
## A B None
## Men 60 78 40
## Women 40 85 55
chisq.test(prop, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: prop
## X-squared = 6.6581, df = 2, p-value = 0.03583
Exercise XVII
Cloud seeding is a form of weather modification that can be used to increase the amount of
precipitation that falls from the clouds, by dispersing substances (small particles), e.g. aluminium
oxide, into the clouds to modify their development.
In an experiment the aim was to study the effect of cloud seeding by using a new type of
particles. The amount of precipitation (mm precipitation per day) for 35 days with cloud
seeding using the new particles is denoted Xi , (i = 1, 2, . . . , 35). This was compared to the
amount of precipitation on 30 days without cloud seeding, denoted Yj , (j = 1, 2, . . . , 30).
Measurements were only taken on days where there was sufficient humidity in the air to make
the experiment relevant. Data from the experiment is shown in the figure below.
[Figure: precipitation (mm) for the two experiments, X (particle) and Y (control).]
We now want to analyze the data described on the previous page using R. Data xi is stored in
the vector x and data yj is stored in the vector y, and the following code has been run:
k <- 10^4
resultX <- replicate(k, sample(x, replace = TRUE))
resultY <- replicate(k, sample(y, replace = TRUE))
result <- apply(resultX, 2, median) - apply(resultY, 2, median)
quantile(result, c(0.5, 0.025,0.975))
In the R code a 95% non-parametric bootstrap confidence interval for the difference in medians
is calculated, and since 0 is not contained in the interval, the null hypothesis
H0 : q0.5,X = q0.5,Y
is rejected in favour of
H1 : q0.5,X ≠ q0.5,Y ,
and further, since X − Y was calculated and the interval is on the positive side, it can be
concluded that q0.5,X > q0.5,Y .
Question XVII.2 (29)
In a different experiment using cloud seeding a different kind of particles were examined. Also
in this experiment the amount of precipitation was compared when the particles were used to a
situation with no use of particles. In this study, however, it was decided to log transform (the
natural logarithm) the data before comparing the groups. By transforming the data it can be
assumed that data in the two groups follows a normal distribution. The data is summarized in
the table below (unit is log mm precipitation).
                         Particles, X              Control, Y
                         (log mm precipitation)    (log mm precipitation)
Estimated mean           µ̂X = 1.573                µ̂Y = 1.314
Estimated variance       σ̂²X = 0.333               σ̂²Y = 0.171
Number of observations   nX = 35                   nY = 30
We now want to test whether the means of the 2 groups can be assumed equal, i.e.
H0 : µX = µY
H1 : µX ≠ µY
It is given that the usual test statistics assuming the null hypothesis becomes 2.0958 with 61.19
degrees of freedom. State the p-value and conclusion when a significance level of α = 0.05 is
applied:
This is a two-sample t-test and we get the information we need from tobs = 2.0958 and degrees
of freedom is 61.19, so the p-value is calculated by
2 * (1-pt(abs(2.0958), df=61.19))
## [1] 0.04024393
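The given test statistic and degrees of freedom can be cross-checked from the summary table with the Welch formulas. This is a sketch; because the table values are rounded, the recomputed statistic differs slightly from the stated 2.0958:

```r
m_x <- 1.573; v_x <- 0.333; n_x <- 35  # particles, X
m_y <- 1.314; v_y <- 0.171; n_y <- 30  # control, Y

# Welch two-sample t statistic
se2   <- v_x / n_x + v_y / n_y
t_obs <- (m_x - m_y) / sqrt(se2)

# Welch-Satterthwaite degrees of freedom
df_w <- se2^2 / ((v_x / n_x)^2 / (n_x - 1) + (v_y / n_y)^2 / (n_y - 1))

# p-value from the values stated in the question
p_val <- 2 * (1 - pt(abs(2.0958), df = 61.19))

c(t_obs = t_obs, df = df_w, p_val = p_val)
```

The p-value is below α = 0.05, so the hypothesis of equal means is rejected.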
Exercise XVIII
At a Christmas market there is a lottery. 24 balls are placed in a bowl. On each of 4 balls there
is a picture of a star. On each of the remaining 20 balls there is a picture of an elf. The lottery
is now played so that 2 balls are drawn without replacement from the bowl. If both balls show
a picture of a star then you have won a prize!
You participate in the game once. Provide the probability of winning a prize:
1 $\frac{80}{276}$
2 $\frac{56}{276}$
3 $\frac{40}{276}$
4 $\frac{16}{276}$
5* $\frac{6}{276}$
This is drawing without replacement, hence we must use the hypergeometric distribution (Chap-
ter 2.3.2). However, to get most easily to the answer in the presented form, we can use the
basic definition of probability
$$P(\text{success}) = \frac{x}{n},$$
where x is the number of successes in a population of size n. We need the possible successful
combinations, where a ball with a star is drawn in both draws. In the first draw one out of the four must be
drawn and in the second draw one out of the three remaining must be drawn, thus
x = 4 · 3 = 12.
The total number of combinations is
n = 24 · 23 = 552,
since in the first draw there are 24 balls and in the second there is one less. Put together this
gives
$$\frac{12}{552} = \frac{6}{276}.$$
Alternatively, the x number of successful combinations could be calculated by
dhyper(x=2, m=4, n=20, k=2)
## [1] 0.02173913
## [1] 12
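The counting argument and the hypergeometric probability can be compared directly. A minimal sketch (the variable names are illustrative):

```r
x_success <- 4 * 3    # ordered ways to draw two star balls
n_total   <- 24 * 23  # ordered ways to draw any two balls

p_count  <- x_success / n_total               # counting argument
p_hyper  <- dhyper(x = 2, m = 4, n = 20, k = 2)  # hypergeometric distribution
p_choose <- choose(4, 2) / choose(24, 2)      # unordered combinations, 6/276

c(p_count, p_hyper, p_choose)
```

All three calculations give the same probability, 6/276.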