Appendix A

Collection of formulas and R commands


Contents

A Collection of formulas and R commands
    A.1 Introduction, descriptive statistics, R and data visualization
    A.2 Probability and Simulation
        A.2.1 Distributions
    A.3 Statistics for one and two samples
    A.4 Simulation based statistics
    A.5 Simple linear regression
    A.6 Multiple linear regression
    A.7 Inference for proportions
    A.8 Comparing means of multiple groups - ANOVA
Glossaries
Acronyms
This appendix holds a collection of formulas. All the relevant equations from definitions,
methods and theorems are included, along with the associated R functions. They are listed
in the same order as in the book, except for the distributions, which are listed together.

A.1 Introduction, descriptive statistics, R and data visualization

Description Formula R command

1.4 Sample mean
    Description: The mean of a sample.
    Formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    R command: mean(x)

1.5 Sample median
    Description: The value that divides a sample in two halves with an equal number of observations in each.
    Formula: $Q_2 = x_{\left(\frac{n+1}{2}\right)}$ for odd $n$
             $Q_2 = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)}}{2}$ for even $n$
    R command: median(x)

1.7 Sample quantile
    Description: The value that divides a sample such that a proportion p of the observations are less than the value. The 0.5 quantile is the median.
    Formula: $q_p = \frac{x_{(np)} + x_{(np+1)}}{2}$ for $pn$ integer
             $q_p = x_{(\lceil np \rceil)}$ for $pn$ non-integer
    R command: quantile(x, p, type=2)

1.8 Sample quartiles
    Description: The quartiles are the five quantiles dividing the sample in four parts, such that each part holds an equal number of observations.
    Formula: $Q_0 = q_0$ = "minimum"
             $Q_1 = q_{0.25}$ = "lower quartile"
             $Q_2 = q_{0.5}$ = "median"
             $Q_3 = q_{0.75}$ = "upper quartile"
             $Q_4 = q_1$ = "maximum"
    R command: quantile(x, probs, type=2), where probs=p

1.10 Sample variance
    Description: The sum of squared differences from the mean divided by n − 1.
    Formula: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    R command: var(x)

1.11 Sample standard deviation
    Description: The square root of the sample variance.
    Formula: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
    R command: sd(x)

1.12 Sample coefficient of variation
    Description: The sample standard deviation seen relative to the sample mean.
    Formula: $V = \frac{s}{\bar{x}}$
    R command: sd(x)/mean(x)

1.15 Sample Inter Quartile Range
    Description: IQR: the middle 50% range of the data.
    Formula: $IQR = Q_3 - Q_1$
    R command: IQR(x, type=2)
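1.18 Sample covariance
    Description: Measure of the linear strength of the relation between two samples.
    Formula: $s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
    R command: cov(x,y)

1.19 Sample correlation
    Description: Measure of the linear strength of the relation between two samples, between -1 and 1.
    Formula: $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{s_{xy}}{s_x \cdot s_y}$
    R command: cor(x,y)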
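
As a rough check of the definitions in this section, the sketch below computes the statistics "by hand" and compares them with the built-in R functions. The vectors x and y are made-up illustration data, not values from the book.

```r
# Hypothetical sample data (illustrative values only)
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
y <- c(65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9)
n <- length(x)

# Sample mean, variance and standard deviation (1.4, 1.10, 1.11)
x_bar <- sum(x) / n
s2 <- sum((x - x_bar)^2) / (n - 1)
s <- sqrt(s2)

# Sample median from the ordered observations (1.5)
x_sorted <- sort(x)
med <- if (n %% 2 == 1) x_sorted[(n + 1) / 2] else
  (x_sorted[n / 2] + x_sorted[(n + 2) / 2]) / 2

# Quartiles and IQR with the type=2 definition used in the table (1.8, 1.15)
q <- quantile(x, probs = c(0.25, 0.75), type = 2)
iqr_hand <- unname(q[2] - q[1])

# Sample covariance and correlation (1.18, 1.19)
s_xy <- sum((x - x_bar) * (y - mean(y))) / (n - 1)
r <- s_xy / (s * sd(y))

# Side-by-side comparison with the built-in functions
print(rbind(hand  = c(x_bar, s2, s, med, iqr_hand, s_xy, r),
            built = c(mean(x), var(x), sd(x), median(x),
                      IQR(x, type = 2), cov(x, y), cor(x, y))))
```

The two printed rows should agree column by column; type=2 is passed explicitly because R's default quantile type differs from the definition in 1.7.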
A.2 Probability and Simulation

Description Formula R command

2.6 Probability density function (pdf) for a discrete variable
    Description: Fulfills two conditions, $f(x) \geq 0$ and $\sum_{\text{all } x} f(x) = 1$, and gives the probability for one x value.
    Formula: $f(x) = P(X = x)$
    R command: dnorm, dbinom, dhyper, dpois

2.9 Cumulated distribution function (cdf)
    Description: Gives the probability in a range of x values, where $P(a < X \leq b) = F(b) - F(a)$.
    Formula: $F(x) = P(X \leq x)$
    R command: pnorm, pbinom, phyper, ppois

2.13 Mean of a discrete random variable
    Formula: $\mu = E(X) = \sum_{i=1}^{\infty} x_i f(x_i)$

2.16 Variance of a discrete random variable X
    Formula: $\sigma^2 = Var(X) = E[(X - \mu)^2]$

2.32 Pdf of a continuous random variable
    Description: A non-negative function for all possible outcomes, with an area below the function of one.
    Formula: $P(a < X \leq b) = \int_a^b f(x)\,dx$

2.33 Cdf of a continuous random variable
    Description: Is non-decreasing, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
    Formula: $F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\,du$

2.34 Mean and variance for a continuous random variable X
    Formula: $\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
             $\sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$

2.54 Mean and variance of a linear function
    Description: The mean and variance of a linear function of a random variable X.
    Formula: $E(aX + b) = a\,E(X) + b$
             $V(aX + b) = a^2\,V(X)$

2.56 Mean and variance of a linear combination
    Description: The mean and variance of a linear combination of random variables. The variance rule holds for independent random variables.
    Formula: $E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 E(X_1) + a_2 E(X_2) + \cdots + a_n E(X_n)$
             $V(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1^2 V(X_1) + a_2^2 V(X_2) + \cdots + a_n^2 V(X_n)$
2.58 Covariance
    Description: The covariance between two random variables X and Y.
    Formula: $Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$
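
To illustrate 2.6 to 2.16 and 2.54, the following sketch works through a fair six-sided die. The die and the constants a and b are assumptions chosen only for illustration.

```r
# Discrete random variable: outcome of a fair die (illustrative assumption)
x <- 1:6
f <- rep(1 / 6, 6)                 # pdf: f(x) = P(X = x)

# The two pdf conditions from 2.6
stopifnot(all(f >= 0), isTRUE(all.equal(sum(f), 1)))

# cdf F(x) = P(X <= x), and P(2 < X <= 5) = F(5) - F(2)  (2.9)
cdf <- cumsum(f)
p_2_to_5 <- cdf[5] - cdf[2]

# Mean and variance (2.13, 2.16)
mu <- sum(x * f)
sigma2 <- sum((x - mu)^2 * f)

# Mean and variance of the linear function aX + b, computed directly (2.54)
a <- 2
b <- 3
mu_lin <- sum((a * x + b) * f)
var_lin <- sum((a * x + b - mu_lin)^2 * f)

print(c(mu = mu, sigma2 = sigma2, P = p_2_to_5,
        mu_lin = mu_lin, rule_mean = a * mu + b,
        var_lin = var_lin, rule_var = a^2 * sigma2))
```

The directly computed mean and variance of aX + b should coincide with a·µ + b and a²·σ², which is what the rule in 2.54 states.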
A.2.1 Distributions

All the included distributions are listed here, together with some important theorems and
definitions related specifically to each distribution.

Description Formula R command

2.20 Binomial distribution
    Description: n is the number of independent draws and p is the probability of a success in each draw. The binomial pdf describes the probability of x successes.
    Formula: $f(x; n, p) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$, where $\binom{n}{x} = \frac{n!}{x!(n - x)!}$
    R command: dbinom(x, size, prob)
               pbinom(q, size, prob)
               qbinom(p, size, prob)
               rbinom(n, size, prob)
               where size=n, prob=p

2.21 Mean and variance of a binomial distributed random variable
    Formula: $\mu = np$
             $\sigma^2 = np(1 - p)$

2.24 Hypergeometric distribution
    Description: n is the number of draws without replacement, a is the number of successes in the population and N is the population size.
    Formula: $f(x; n, a, N) = P(X = x) = \frac{\binom{a}{x}\binom{N - a}{n - x}}{\binom{N}{n}}$, where $\binom{a}{b} = \frac{a!}{b!(a - b)!}$
    R command: dhyper(x,m,n,k)
               phyper(q,m,n,k)
               qhyper(p,m,n,k)
               rhyper(nn,m,n,k)
               where m=a, n=N − a, k=n

2.25 Mean and variance of a hypergeometric distributed random variable
    Formula: $\mu = n \frac{a}{N}$
             $\sigma^2 = n \frac{a(N - a)}{N^2} \frac{N - n}{N - 1}$

2.27 Poisson distribution
    Description: λ is the rate (or intensity), i.e. the average number of events per interval. The Poisson pdf describes the probability of x events in an interval.
    Formula: $f(x; \lambda) = \frac{\lambda^x}{x!} e^{-\lambda}$
    R command: dpois(x,lambda)
               ppois(q,lambda)
               qpois(p,lambda)
               rpois(n,lambda)
               where lambda=λ

2.28 Mean and variance of a Poisson distributed random variable
    Formula: $\mu = \lambda$
             $\sigma^2 = \lambda$

2.35 Uniform distribution
    Description: α and β define the range of possible outcomes. A random variable following the uniform distribution has equal density at any value within a defined range.
    Formula: $f(x; \alpha, \beta) = 0$ for $x < \alpha$; $\frac{1}{\beta - \alpha}$ for $x \in [\alpha, \beta]$; $0$ for $x > \beta$
             $F(x; \alpha, \beta) = 0$ for $x < \alpha$; $\frac{x - \alpha}{\beta - \alpha}$ for $x \in [\alpha, \beta]$; $1$ for $x > \beta$
    R command: dunif(x,min,max)
               punif(q,min,max)
               qunif(p,min,max)
               runif(n,min,max)
               where min=α, max=β
2.36 Mean and variance of a uniform distributed random variable X
    Formula: $\mu = \frac{1}{2}(\alpha + \beta)$
             $\sigma^2 = \frac{1}{12}(\beta - \alpha)^2$

2.37 Normal distribution
    Description: Often also called the Gaussian distribution.
    Formula: $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
    R command: dnorm(x,mean,sd)
               pnorm(q,mean,sd)
               qnorm(p,mean,sd)
               rnorm(n,mean,sd)
               where mean=µ, sd=σ

2.38 Mean and variance of a normal distributed random variable
    Formula: $\mu$
             $\sigma^2$

2.43 Transformation of a normal distributed random variable X into a standardized normal random variable
    Formula: $Z = \frac{X - \mu}{\sigma}$

2.46 Log-normal distribution
    Description: α is the mean and β² is the variance of the normal distribution obtained when taking the natural logarithm of X.
    Formula: $f(x) = \frac{1}{x\sqrt{2\pi}\,\beta} e^{-\frac{(\ln x - \alpha)^2}{2\beta^2}}$
    R command: dlnorm(x,meanlog,sdlog)
               plnorm(q,meanlog,sdlog)
               qlnorm(p,meanlog,sdlog)
               rlnorm(n,meanlog,sdlog)
               where meanlog=α, sdlog=β

2.47 Mean and variance of a log-normal distributed random variable
    Formula: $\mu = e^{\alpha + \beta^2/2}$
             $\sigma^2 = e^{2\alpha + \beta^2}(e^{\beta^2} - 1)$

2.48 Exponential distribution
    Description: λ is the mean rate of events.
    Formula: $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \geq 0$; $0$ for $x < 0$
    R command: dexp(x,rate)
               pexp(q,rate)
               qexp(p,rate)
               rexp(n,rate)
               where rate=λ

2.49 Mean and variance of an exponential distributed random variable
    Formula: $\mu = \frac{1}{\lambda}$
             $\sigma^2 = \frac{1}{\lambda^2}$

2.78 χ²-distribution
    Description: $\Gamma\left(\frac{\nu}{2}\right)$ is the Γ-function and ν is the degrees of freedom.
    Formula: $f(x) = \frac{1}{2^{\frac{\nu}{2}}\,\Gamma\left(\frac{\nu}{2}\right)} x^{\frac{\nu}{2} - 1} e^{-\frac{x}{2}}$; $x \geq 0$
    R command: dchisq(x,df)
               pchisq(q,df)
               qchisq(p,df)
               rchisq(n,df)
               where df=ν
2.81 Distribution of the sample variance
    Description: Given a sample of size n from the normal distributed random variables $X_i$ with variance σ², the sample variance S² (viewed as a random variable) can be transformed to follow the χ²-distribution with ν = n − 1 degrees of freedom.
    Formula: $\chi^2 = \frac{(n - 1) S^2}{\sigma^2}$

2.83 Mean and variance of a χ² distributed random variable
    Formula: $E(X) = \nu$
             $V(X) = 2\nu$

2.86 t-distribution
    Description: ν is the degrees of freedom and Γ() is the Gamma function.
    Formula: $f_T(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$
    R command: dt(x,df)
               pt(q,df)
               qt(p,df)
               rt(n,df)
               where df=ν

2.87 Relation between normal random variables and χ²-distributed random variables
    Description: $Z \sim N(0, 1)$ and $Y \sim \chi^2(\nu)$.
    Formula: $X = \frac{Z}{\sqrt{Y/\nu}} \sim t(\nu)$

2.89 The t-distributed standardized sample mean
    Description: For normal distributed random variables $X_1, \ldots, X_n$, the random variable T follows the t-distribution, where $\bar{X}$ is the sample mean, µ is the mean of X, n is the sample size and S is the sample standard deviation.
    Formula: $T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1)$

2.93 Mean and variance of a t-distributed variable X
    Formula: $\mu = 0$; $\nu > 1$
             $\sigma^2 = \frac{\nu}{\nu - 2}$; $\nu > 2$

2.95 F-distribution
    Description: ν₁ and ν₂ are the degrees of freedom and B(·, ·) is the Beta function.
    Formula: $f_F(x) = \frac{1}{B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} x^{\frac{\nu_1}{2} - 1} \left(1 + \frac{\nu_1}{\nu_2} x\right)^{-\frac{\nu_1 + \nu_2}{2}}$
    R command: df(x,df1,df2)
               pf(q,df1,df2)
               qf(p,df1,df2)
               rf(n,df1,df2)
               where df1=ν₁, df2=ν₂

2.96 The F-distribution as a ratio
    Description: The F-distribution appears as the ratio between two independent χ²-distributed random variables with $U \sim \chi^2(\nu_1)$ and $V \sim \chi^2(\nu_2)$.
    Formula: $\frac{U/\nu_1}{V/\nu_2} \sim F(\nu_1, \nu_2)$
2.98 Ratio of two sample variances
    Description: $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$, with the means µ₁ and µ₂ and the variances σ₁² and σ₂², are independent and sampled from normal distributions.
    Formula: $\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1)$

2.101 Mean and variance of an F-distributed variable X
    Formula: $\mu = \frac{\nu_2}{\nu_2 - 2}$; $\nu_2 > 2$
             $\sigma^2 = \frac{2\nu_2^2(\nu_1 + \nu_2 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}$; $\nu_2 > 4$
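
The sketch below evaluates a few of the density formulas in this section directly and compares them with the corresponding d- and p-functions in R. All parameter values are arbitrary choices for illustration.

```r
# Binomial pdf (2.20): formula versus dbinom
n <- 10; p <- 0.3; x <- 4
f_binom <- choose(n, x) * p^x * (1 - p)^(n - x)
d_binom <- dbinom(x, size = n, prob = p)

# Hypergeometric pdf (2.24): note the mapping m = a, n = N - a, k = n
N <- 20; a <- 8; n_draw <- 5; x_h <- 2
f_hyper <- choose(a, x_h) * choose(N - a, n_draw - x_h) / choose(N, n_draw)
d_hyper <- dhyper(x_h, m = a, n = N - a, k = n_draw)

# Poisson pdf (2.27)
lambda <- 2.5; x_p <- 3
f_pois <- lambda^x_p / factorial(x_p) * exp(-lambda)
d_pois <- dpois(x_p, lambda)

# Normal pdf (2.37) and standardization (2.43)
mu <- 10; sigma <- 2; x_n <- 13
f_norm <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x_n - mu)^2 / (2 * sigma^2))
d_norm <- dnorm(x_n, mean = mu, sd = sigma)
z <- (x_n - mu) / sigma
p_direct <- pnorm(x_n, mean = mu, sd = sigma)
p_standard <- pnorm(z)             # same probability via Z ~ N(0, 1)

# Uniform cdf (2.35): inside the range and above the upper limit
F_inside <- punif(1.5, min = 1, max = 3)
F_above <- punif(4, min = 1, max = 3)

print(c(f_binom, d_binom, f_hyper, d_hyper, f_pois, d_pois, f_norm, d_norm))
print(c(p_direct, p_standard, F_inside, F_above))
```

Each formula value should match its built-in counterpart; the hypergeometric call is the one where the argument names differ most from the notation in the table.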
A.3 Statistics for one and two samples

Description Formula R command

3.3 The distribution of the mean of normal random variables
    Formula: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

3.5 The distribution of the σ-standardized mean of normal random variables
    Formula: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1^2)$

3.5 The distribution of the S-standardized mean of normal random variables
    Formula: $T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1)$

3.7 Standard Error of the mean
    Formula: $SE_{\bar{x}} = \frac{s}{\sqrt{n}}$

3.9 The one-sample confidence interval for µ
    Formula: $\bar{x} \pm t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$

3.14 Central Limit Theorem (CLT)
    Formula: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, whose distribution approaches $N(0, 1^2)$ as $n \to \infty$

3.19 Confidence interval for the variance and standard deviation
    Formula: $\sigma^2$: $\left[\frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}};\ \frac{(n - 1)s^2}{\chi^2_{\alpha/2}}\right]$
             $\sigma$: $\left[\sqrt{\frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}}};\ \sqrt{\frac{(n - 1)s^2}{\chi^2_{\alpha/2}}}\right]$

3.22 The p-value
    Description: The p-value is the probability of obtaining a test statistic that is at least as extreme as the test statistic that was actually observed. This probability is calculated under the assumption that the null hypothesis is true.
    R command: 2*(1-pt(abs(tobs), df=n-1)) for the two-sided one-sample t-test

3.23 The one-sample t-test statistic and p-value
    Formula: $\text{p-value} = 2 \cdot P(T > |t_{obs}|)$
             $t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
             $H_0: \mu = \mu_0$

3.24 The hypothesis test
    Formula: Rejected: p-value < α
             Accepted: otherwise

3.29 Significant effect
    Description: An effect is significant if the p-value < α.

3.31 The critical values
    Description: The α/2- and (1 − α/2)-quantiles of the t-distribution with n − 1 degrees of freedom.
    Formula: $t_{\alpha/2}$ and $t_{1-\alpha/2}$

3.32 The one-sample hypothesis test by the critical value
    Formula: Reject: $|t_{obs}| > t_{1-\alpha/2}$
             Accept: otherwise
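3.33 Confidence interval for µ
    Formula: $\bar{x} \pm t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$
             Acceptance region/CI: $H_0: \mu = \mu_0$

3.36 The level α one-sample t-test
    Formula: Test: $H_0: \mu = \mu_0$ and $H_1: \mu \neq \mu_0$ by $\text{p-value} = 2 \cdot P(T > |t_{obs}|)$
             Reject: p-value < α or $|t_{obs}| > t_{1-\alpha/2}$
             Accept: otherwise

3.63 The one-sample confidence interval (CI) sample size formula
    Formula: $n = \left(\frac{z_{1-\alpha/2} \cdot \sigma}{ME}\right)^2$

3.65 The one-sample sample size formula
    Formula: $n = \sigma^2 \left(\frac{z_{1-\beta} + z_{1-\alpha/2}}{\mu_0 - \mu_1}\right)^2$

3.42 The Normal q-q plot with n > 10
    Formula: Naive approach: $p_i = \frac{i}{n}$, $i = 1, \ldots, n$
             Commonly used approach: $p_i = \frac{i - 0.5}{n + 1}$, $i = 1, \ldots, n$

3.49 The (Welch) two-sample t-test statistic
    Formula: $\delta = \mu_1 - \mu_2$
             $H_0: \delta = \delta_0$
             $t_{obs} = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$

3.50 The distribution of the (Welch) two-sample statistic
    Description: T approximately follows a t-distribution with ν degrees of freedom.
    Formula: $T = \frac{(\bar{X}_1 - \bar{X}_2) - \delta_0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$
             $\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$

3.51 The level α two-sample t-test
    Formula: Test: $H_0: \mu_1 - \mu_2 = \delta_0$ and $H_1: \mu_1 - \mu_2 \neq \delta_0$ by $\text{p-value} = 2 \cdot P(T > |t_{obs}|)$
             Reject: p-value < α or $|t_{obs}| > t_{1-\alpha/2}$
             Accept: otherwise

3.52 The pooled two-sample estimate of variance
    Formula: $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

3.53 The pooled two-sample t-test statistic
    Formula: $\delta = \mu_1 - \mu_2$
             $H_0: \delta = \delta_0$
             $t_{obs} = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{s_p^2/n_1 + s_p^2/n_2}}$

3.54 The distribution of the pooled two-sample t-test statistic
    Description: Under the null hypothesis, and assuming normal populations with equal variances, T follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.
    Formula: $T = \frac{(\bar{X}_1 - \bar{X}_2) - \delta_0}{\sqrt{S_p^2/n_1 + S_p^2/n_2}}$

3.47 The two-sample confidence interval for µ₁ − µ₂
    Formula: $\bar{x} - \bar{y} \pm t_{1-\alpha/2} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
             $\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$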
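
The one-sample and Welch two-sample formulas in this section can be traced step by step in R and compared with t.test(). The samples x and y and the null value mu0 below are made up for illustration.

```r
# Hypothetical samples and null hypothesis value (illustrative only)
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
y <- c(172, 165, 160, 171, 158, 169, 175, 163)
mu0 <- 180
alpha <- 0.05

# One-sample t-test statistic, p-value and confidence interval (3.23, 3.9)
n <- length(x)
t_obs <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_value <- 2 * (1 - pt(abs(t_obs), df = n - 1))
ci <- mean(x) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)

# The same test with the built-in function
res <- t.test(x, mu = mu0, conf.level = 1 - alpha)

# Welch two-sample statistic and degrees of freedom (3.49, 3.50), delta0 = 0
v1 <- var(x) / length(x)
v2 <- var(y) / length(y)
t_welch <- (mean(x) - mean(y)) / sqrt(v1 + v2)
nu <- (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))

# t.test() uses the Welch version by default (var.equal = FALSE)
res2 <- t.test(x, y)

print(c(t_obs = t_obs, p_value = p_value, ci_low = ci[1], ci_high = ci[2]))
print(c(t_welch = t_welch, nu = nu))
```

Printing res and res2 shows the same statistic, degrees of freedom, p-value and interval as the hand computation, which is a convenient way to confirm which formula a given t.test() call corresponds to.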
A.4 Simulation based statistics

Description Formula R command

4.3 The non-linear approximative error propagation rule
    Formula: $\sigma^2_{f(X_1, \ldots, X_n)} = \sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^2 \sigma_i^2$

4.4 Non-linear error propagation by simulation
    Formula: 1. Simulate k outcomes
             2. Calculate the standard deviation by
                $s^{sim}_{f(X_1, \ldots, X_n)} = \sqrt{\frac{1}{k - 1} \sum_{j=1}^{k} (f_j - \bar{f})^2}$

4.7 Confidence interval for any feature θ by parametric bootstrap
    Formula: 1. Simulate k samples
             2. Calculate the statistic $\hat{\theta}^*_i$ for each sample
             3. Calculate CI: $\left[q^*_{100(\alpha/2)\%},\ q^*_{100(1-\alpha/2)\%}\right]$

4.10 Two-sample confidence interval for any feature comparison θ₁ − θ₂ by parametric bootstrap
    Formula: 1. Simulate k sets of 2 samples
             2. Calculate the statistic $\hat{\theta}^*_{xk} - \hat{\theta}^*_{yk}$
             3. Calculate CI: $\left[q^*_{100(\alpha/2)\%},\ q^*_{100(1-\alpha/2)\%}\right]$
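
The simulation recipes in 4.3, 4.4 and 4.7 might look as follows in R. The function f (an area X·Y), the measurement uncertainties, the sample x, the normal-distribution assumption and the seed are all assumptions made for this illustration.

```r
set.seed(1)                         # fixed seed so the sketch is reproducible
k <- 10000                          # number of simulations

# Error propagation for f(X, Y) = X * Y with X ~ N(2, 0.01^2), Y ~ N(3, 0.02^2)
X <- rnorm(k, mean = 2, sd = 0.01)
Y <- rnorm(k, mean = 3, sd = 0.02)
f_sim <- X * Y
s_sim <- sd(f_sim)                  # 4.4: standard deviation of the k outcomes

# 4.3: partial derivatives are df/dx = y and df/dy = x, evaluated at the means
s_approx <- sqrt(3^2 * 0.01^2 + 2^2 * 0.02^2)

# Parametric bootstrap CI for the median (4.7), assuming a normal distribution
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
sim_samples <- replicate(k, rnorm(length(x), mean = mean(x), sd = sd(x)))
theta_star <- apply(sim_samples, 2, median)     # one statistic per sample
ci <- quantile(theta_star, c(0.025, 0.975))     # 95% interval

print(c(s_sim = s_sim, s_approx = s_approx))
print(ci)
```

With these small uncertainties the simulated and the approximated standard deviation should be close; the bootstrap interval depends on the seed and on the assumed distribution.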
A.5 Simple linear regression

Description Formula R command

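5.4 Least square estimators
    Formula: $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x})}{S_{xx}}$
             $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$
             where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$

5.8 Variance of estimators
    Formula: $V[\hat{\beta}_0] = \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_{xx}}$
             $V[\hat{\beta}_1] = \frac{\sigma^2}{S_{xx}}$
             $Cov[\hat{\beta}_0, \hat{\beta}_1] = -\frac{\bar{x}\sigma^2}{S_{xx}}$

5.12 Test statistics for $H_0: \beta_0 = 0$ and $H_0: \beta_1 = 0$
    Formula: $T_{\beta_0} = \frac{\hat{\beta}_0 - \beta_{0,0}}{\hat{\sigma}_{\beta_0}}$
             $T_{\beta_1} = \frac{\hat{\beta}_1 - \beta_{0,1}}{\hat{\sigma}_{\beta_1}}$

5.14 Level α t-tests for parameter
    Formula: Test $H_{0,i}: \beta_i = \beta_{0,i}$ vs. $H_{1,i}: \beta_i \neq \beta_{0,i}$
             with $\text{p-value} = 2 \cdot P(T > |t_{obs,\beta_i}|)$,
             where $t_{obs,\beta_i} = \frac{\hat{\beta}_i - \beta_{0,i}}{\hat{\sigma}_{\beta_i}}$.
             If p-value < α then reject $H_0$, otherwise accept $H_0$
    R command: D <- data.frame(x=c(), y=c())
               fit <- lm(y~x, data=D)
               summary(fit)

5.15 Parameter confidence intervals
    Formula: $\hat{\beta}_0 \pm t_{1-\alpha/2}\,\hat{\sigma}_{\beta_0}$
             $\hat{\beta}_1 \pm t_{1-\alpha/2}\,\hat{\sigma}_{\beta_1}$
    R command: confint(fit,level=0.95)

5.18 Confidence and prediction interval
    Formula: Confidence interval for the line:
             $\hat{\beta}_0 + \hat{\beta}_1 x_{new} \pm t_{1-\alpha/2}\,\hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$
             Interval for a new point prediction:
             $\hat{\beta}_0 + \hat{\beta}_1 x_{new} \pm t_{1-\alpha/2}\,\hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$
    R command: predict(fit, newdata=data.frame(), interval="confidence", level=0.95)
               predict(fit, newdata=data.frame(), interval="prediction", level=0.95)

5.23 The matrix formulation of the parameter estimators in the simple linear regression model
    Formula: $\hat{\beta} = (X^T X)^{-1} X^T Y$
             $V[\hat{\beta}] = \sigma^2 (X^T X)^{-1}$
             $\hat{\sigma}^2 = \frac{RSS}{n - 2}$

5.25 Coefficient of determination R²
    Formula: $r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$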
5.7 Model validation of assumptions
    Description: Check the normality assumption with a q-q plot of the residuals.
    R command: qqnorm(fit$residuals)
               qqline(fit$residuals)
    Description: Check for systematic behavior by plotting the residuals $e_i$ as a function of the fitted values $\hat{y}_i$.
    R command: plot(fit$fitted.values, fit$residuals)
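
As a sketch of how the estimators in 5.4, 5.8, 5.23 and 5.25 relate to the output of lm(), the code below computes them by hand on a small data set. The data frame D contains made-up values for illustration.

```r
# Hypothetical data (illustrative values only)
D <- data.frame(
  x = c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179),
  y = c(65.5, 58.3, 68.1, 85.7, 80.5, 63.4, 102.6, 91.4, 86.7, 78.9)
)
n <- nrow(D)

# Least squares estimators (5.4)
Sxx <- sum((D$x - mean(D$x))^2)
b1 <- sum((D$y - mean(D$y)) * (D$x - mean(D$x))) / Sxx
b0 <- mean(D$y) - b1 * mean(D$x)

# Residual variance estimate (5.23) and standard errors of the estimators (5.8)
rss <- sum((D$y - (b0 + b1 * D$x))^2)
sigma_hat <- sqrt(rss / (n - 2))
se_b0 <- sigma_hat * sqrt(1 / n + mean(D$x)^2 / Sxx)
se_b1 <- sigma_hat / sqrt(Sxx)

# Coefficient of determination (5.25)
r2 <- 1 - rss / sum((D$y - mean(D$y))^2)

# The same quantities from the fitted model object
fit <- lm(y ~ x, data = D)
print(summary(fit)$coefficients)
print(c(b0 = b0, b1 = b1, se_b0 = se_b0, se_b1 = se_b1,
        sigma_hat = sigma_hat, r2 = r2))
```

The hand-computed values should reappear in the "Estimate" and "Std. Error" columns of the summary, in its residual standard error and in its multiple R-squared.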
A.6 Multiple linear regression

Description Formula R command

6.2 Level α t-tests for parameter
    Formula: Test $H_{0,i}: \beta_i = \beta_{0,i}$ vs. $H_{1,i}: \beta_i \neq \beta_{0,i}$
             with $\text{p-value} = 2 \cdot P(T > |t_{obs,\beta_i}|)$,
             where $t_{obs,\beta_i} = \frac{\hat{\beta}_i - \beta_{0,i}}{\hat{\sigma}_{\beta_i}}$.
             If p-value < α then reject $H_0$, otherwise accept $H_0$
    R command: D <- data.frame(x1=c(), x2=c(), y=c())
               fit <- lm(y~x1+x2, data=D)
               summary(fit)

6.5 Parameter confidence intervals
    Formula: $\hat{\beta}_i \pm t_{1-\alpha/2}\,\hat{\sigma}_{\beta_i}$
    R command: confint(fit,level=0.95)

6.9 Confidence and prediction interval (in R)
    Formula: Confidence interval for the line
             $\hat{\beta}_0 + \hat{\beta}_1 x_{1,new} + \cdots + \hat{\beta}_p x_{p,new}$
             Interval for a new point prediction
             $\hat{\beta}_0 + \hat{\beta}_1 x_{1,new} + \cdots + \hat{\beta}_p x_{p,new} + \varepsilon_{new}$
    R command: predict(fit, newdata=data.frame(), interval="confidence", level=0.95)
               predict(fit, newdata=data.frame(), interval="prediction", level=0.95)

6.17 The matrix formulation of the parameter estimators in the multiple linear regression model
    Formula: $\hat{\beta} = (X^T X)^{-1} X^T Y$
             $V[\hat{\beta}] = \sigma^2 (X^T X)^{-1}$
             $\hat{\sigma}^2 = \frac{RSS}{n - (p + 1)}$

6.16 Model selection procedure
    Description: Backward selection: start with the full model and stepwise remove insignificant terms.
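
The matrix formulation in 6.17 can be tried out directly with R's matrix operators. The data below are invented for illustration, and the design matrix X is built by hand with a leading column of ones for the intercept.

```r
# Hypothetical data with two explanatory variables (illustrative only)
D <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(2, 1, 4, 3, 6, 5, 8, 7),
  y  = c(3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8)
)
n <- nrow(D)
p <- 2                                   # number of explanatory variables

# Design matrix with an intercept column
X <- cbind(1, D$x1, D$x2)
y <- D$y

# beta_hat = (X^T X)^{-1} X^T Y, solved as a linear system
XtX <- t(X) %*% X
beta_hat <- as.numeric(solve(XtX, t(X) %*% y))

# sigma_hat^2 = RSS / (n - (p + 1)) and the estimated covariance matrix
rss <- sum((y - X %*% beta_hat)^2)
sigma2_hat <- rss / (n - (p + 1))
V_hat <- sigma2_hat * solve(XtX)

# Comparison with lm()
fit <- lm(y ~ x1 + x2, data = D)
print(rbind(matrix_formula = beta_hat, lm = unname(coef(fit))))
print(sqrt(diag(V_hat)))                 # standard errors of the estimates
```

The square roots of the diagonal of V_hat are the standard errors reported by summary(fit), and vcov(fit) returns the full estimated covariance matrix.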
A.7 Inference for proportions

Description Formula R command


7.3 Proportion estimate and confidence interval
    Formula: $\hat{p} = \frac{x}{n}$
             $\hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
    R command: prop.test(x=, n=, correct=FALSE)

7.10 Approximate proportion with Z
    Formula: $Z = \frac{X - np_0}{\sqrt{np_0(1 - p_0)}} \sim N(0, 1)$

7.11 The level α one-sample proportion hypothesis test
    Formula: Test: $H_0: p = p_0$ vs. $H_1: p \neq p_0$
             by $\text{p-value} = 2 \cdot P(Z > |z_{obs}|)$, where $Z \sim N(0, 1^2)$
             If p-value < α then reject $H_0$, otherwise accept $H_0$
    R command: prop.test(x=, n=, correct=FALSE)

7.13 Sample size formula for the CI of a proportion
    Formula: Guessed p (with prior knowledge): $n = p(1 - p)\left(\frac{z_{1-\alpha/2}}{ME}\right)^2$
             Unknown p: $n = \frac{1}{4}\left(\frac{z_{1-\alpha/2}}{ME}\right)^2$

7.15 Difference of two proportions: estimator $\hat{p}_1 - \hat{p}_2$ and confidence interval for the difference
    Formula: $\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
             $(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \cdot \hat{\sigma}_{\hat{p}_1 - \hat{p}_2}$

7.18 The level α two-sample proportions hypothesis test
    Formula: Test: $H_0: p_1 = p_2$ vs. $H_1: p_1 \neq p_2$
             by $\text{p-value} = 2 \cdot P(Z > |z_{obs}|)$, where $Z \sim N(0, 1^2)$
             If p-value < α then reject $H_0$, otherwise accept $H_0$
    R command: prop.test(x=, n=, correct=FALSE)

7.20 The multi-sample proportions χ²-test
    Formula: Test: $H_0: p_1 = p_2 = \ldots = p_c = p$
             by $\chi^2_{obs} = \sum_{i=1}^{2} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
    R command: chisq.test(X, correct = FALSE)

7.22 The r × c frequency table χ²-test
    Formula: Test: $H_0: p_{i1} = p_{i2} = \ldots = p_{ic} = p_i$ for all rows $i = 1, 2, \ldots, r$
             by $\chi^2_{obs} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
             Reject if $\chi^2_{obs} > \chi^2_{1-\alpha}\big((r - 1)(c - 1)\big)$, otherwise accept
    R command: chisq.test(X, correct = FALSE)
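
The following sketch applies 7.10, 7.11, 7.13 and 7.20 to made-up counts and compares the test statistics with prop.test() and chisq.test(). One point to be aware of: prop.test() reports the squared z-statistic as "X-squared", and its confidence interval is, as far as its documentation describes, a score-type interval, so it need not coincide exactly with the interval formula in 7.3.

```r
# One-sample proportion test (7.10, 7.11) on hypothetical counts
x <- 110; n <- 200; p0 <- 0.5
z_obs <- (x - n * p0) / sqrt(n * p0 * (1 - p0))
p_value <- 2 * (1 - pnorm(abs(z_obs)))
res <- prop.test(x = x, n = n, p = p0, correct = FALSE)

# Interval from 7.3, computed by hand
p_hat <- x / n
ci_hand <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

# Sample size for a margin of error ME with unknown p (7.13)
ME <- 0.03
n_required <- ceiling(1 / 4 * (qnorm(0.975) / ME)^2)

# Multi-sample proportions chi-square test (7.20) on a hypothetical 2 x 3 table
X <- matrix(c(23, 35, 30,
              77, 65, 70), nrow = 2, byrow = TRUE)
e <- outer(rowSums(X), colSums(X)) / sum(X)   # expected counts e_ij
chi2_obs <- sum((X - e)^2 / e)
res_chi <- chisq.test(X, correct = FALSE)

print(c(z_squared = z_obs^2, p_value = p_value, n_required = n_required))
print(ci_hand)
print(c(chi2_obs = chi2_obs, df = unname(res_chi$parameter)))
```

Comparing ci_hand with res$conf.int on the same counts shows how large the difference between the two interval constructions is in practice.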
A.8 Comparing means of multiple groups - ANOVA

Description Formula R command

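8.2 One-way ANOVA variation decomposition
    Formula: $\underbrace{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{SSE} + \underbrace{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2}_{SS(Tr)}$

8.4 One-way within group variability
    Formula: $MSE = \frac{SSE}{n - k} = \frac{(n_1 - 1)s_1^2 + \cdots + (n_k - 1)s_k^2}{n - k}$
             $s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$

8.6 One-way test for difference in mean for k groups
    Formula: $H_0: \alpha_i = 0$; $i = 1, 2, \ldots, k$
             $F = \frac{SS(Tr)/(k - 1)}{SSE/(n - k)}$
             F-distribution with k − 1 and n − k degrees of freedom
    R command: anova(lm(y~treatm))

8.9 Post hoc pairwise confidence intervals
    Formula: $\bar{y}_i - \bar{y}_j \pm t_{1-\alpha/2} \sqrt{\frac{SSE}{n - k}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$
             If all $M = k(k - 1)/2$ combinations are compared, then use $\alpha_{Bonferroni} = \alpha/M$

8.10 Post hoc pairwise hypothesis tests
    Formula: Test: $H_0: \mu_i = \mu_j$ vs. $H_1: \mu_i \neq \mu_j$
             by $\text{p-value} = 2 \cdot P(T > |t_{obs}|)$,
             where $t_{obs} = \frac{\bar{y}_i - \bar{y}_j}{\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$
             Test $M = k(k - 1)/2$ times, but each time with $\alpha_{Bonferroni} = \alpha/M$

8.13 Least Significant Difference (LSD) values
    Formula: $LSD_\alpha = t_{1-\alpha/2} \sqrt{2 \cdot MSE/m}$

8.20 Two-way ANOVA variation decomposition
    Formula: $\underbrace{\sum_{i=1}^{k} \sum_{j=1}^{l} (y_{ij} - \hat{\mu})^2}_{SST} = \underbrace{\sum_{i=1}^{k} \sum_{j=1}^{l} (y_{ij} - \hat{\alpha}_i - \hat{\beta}_j - \hat{\mu})^2}_{SSE} + \underbrace{l \cdot \sum_{i=1}^{k} \hat{\alpha}_i^2}_{SS(Tr)} + \underbrace{k \cdot \sum_{j=1}^{l} \hat{\beta}_j^2}_{SS(Bl)}$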
8.22 Test for difference in means in two-way ANOVA grouped in treatments and in blocks
    Formula: $H_{0,Tr}: \alpha_i = 0$, $i = 1, 2, \ldots, k$
             $F_{Tr} = \frac{SS(Tr)/(k - 1)}{SSE/((k - 1)(l - 1))}$
             $H_{0,Bl}: \beta_j = 0$, $j = 1, 2, \ldots, l$
             $F_{Bl} = \frac{SS(Bl)/(l - 1)}{SSE/((k - 1)(l - 1))}$
    R command: fit <- lm(y~treatm+block)
               anova(fit)

One-way ANOVA

Source of    Degrees of   Sums of   Mean sum of                 Test statistic F        p-value
variation    freedom      squares   squares
Treatment    k − 1        SS(Tr)    MS(Tr) = SS(Tr)/(k − 1)     F_obs = MS(Tr)/MSE      P(F > F_obs)
Residual     n − k        SSE       MSE = SSE/(n − k)
Total        n − 1        SST

Two-way ANOVA

Source of    Degrees of        Sums of   Mean sums of                     Test statistic F       p-value
variation    freedom           squares   squares
Treatment    k − 1             SS(Tr)    MS(Tr) = SS(Tr)/(k − 1)          F_Tr = MS(Tr)/MSE      P(F > F_Tr)
Block        l − 1             SS(Bl)    MS(Bl) = SS(Bl)/(l − 1)          F_Bl = MS(Bl)/MSE      P(F > F_Bl)
Residual     (l − 1)(k − 1)    SSE       MSE = SSE/((k − 1)(l − 1))
Total        n − 1             SST
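
A one-way ANOVA table can be reproduced from the decomposition in 8.2 and the F statistic in 8.6, as sketched below. The observations y and the grouping factor treatm are made-up illustration data with k = 3 groups.

```r
# Hypothetical data: three treatment groups with four observations each
y <- c(2.8, 3.6, 3.4, 2.3,
       5.5, 6.3, 6.1, 5.7,
       5.8, 8.3, 6.9, 6.1)
treatm <- factor(rep(c("A", "B", "C"), each = 4))
n <- length(y)
k <- nlevels(treatm)

# Group means and group sizes
group_means <- tapply(y, treatm, mean)
n_i <- tapply(y, treatm, length)

# Variation decomposition SST = SSE + SS(Tr)  (8.2)
SST <- sum((y - mean(y))^2)
SSE <- sum((y - ave(y, treatm))^2)       # ave() gives each obs. its group mean
SSTr <- sum(n_i * (group_means - mean(y))^2)

# F statistic and p-value (8.6)
MSE <- SSE / (n - k)
F_obs <- (SSTr / (k - 1)) / MSE
p_value <- 1 - pf(F_obs, df1 = k - 1, df2 = n - k)

# The same table from R
tab <- anova(lm(y ~ treatm))
print(tab)
print(c(SSTr = SSTr, SSE = SSE, SST = SST, F_obs = F_obs, p_value = p_value))
```

The "Sum Sq" column of the printed table corresponds to SS(Tr) and SSE, and their sum equals SST.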


Glossaries

cumulated distribution function [Fordelingsfunktion] The cdf is the function which determines the
probability of observing an outcome of a random variable at or below a given value.

Continuous random variable [Kontinuert stokastisk variabel] If an outcome of an experiment takes
a continuous value, for example a distance, a temperature or a weight, then it is represented
by a continuous random variable.

Correlation [Korrelation] The sample correlation coefficient is a summary statistic that can be
calculated for two (related) sets of observations. It quantifies the (linear) strength of the relation
between the two. See also: Covariance.

Covariance [Kovarians] The sample covariance is a summary statistic that can be
calculated for two (related) sets of observations. It quantifies the (linear) strength of the relation
between the two. See also: Correlation.

F-distribution [F-fordelingen] The F-distribution appears as the ratio between two independent
χ²-distributed random variables.

Inter Quartile Range [Interkvartil bredde] The Inter Quartile Range (IQR) is the middle 50% range
of the data.

Median [Median, stikprøvemedian] The median of a population or sample (note that the text does not
distinguish between population median and sample median).

probability density function The pdf is the function which determines the probability of every
possible outcome of a random variable.

Quantile [Fraktil, stikprøvefraktil] The quantiles of a population or sample (note that the text does not
distinguish between population quantile and sample quantile).

Quartile [Fraktil, stikprøvefraktil] The quartiles of a population or sample (note that the text does not
distinguish between population quartile and sample quartile).

Sample variance [Empirisk varians, stikprøvevarians]

Sample mean [Stikprøvegennemsnit] The average of a sample.

Standard deviation [Standard afvigelse]


Acronyms

ANOVA  Analysis of Variance

cdf    cumulated distribution function

CI     confidence interval

CLT    Central Limit Theorem

IQR    Inter Quartile Range

LSD    Least Significant Difference

pdf    probability density function
