Advanced Biostatistics with R
Md. Mostakim
Session: 2022-23
Department of Statistics
University of Dhaka
Published Date:
Acknowledgements:
I am deeply grateful to Fatima Tuz Zahura Mam for her exceptional guidance in teaching Advanced
Biostatistics. My sincere thanks also go to the seniors for their insightful lectures.
N.B. You may share this pdf book as much as you like but don’t use it for any unethical purpose.
For any kind of feedback, please contact [email protected]. Your feedback will be very inspiring for me.
There are two types of survival models that are commonly used in practice, these are:
$$x = \begin{cases} 1, & \text{if treatment is A} \\ 0, & \text{if treatment is B} \end{cases}$$
a. Compare the survival probabilities of the two treatments at 0, 0.5, 1, 2, 3, 4, 5, and 6 years.
b. Draw survival curves for treatment A and B in the same graph.
c. Find the first, second, and third quartile survival times for treatment A and B.
d. Which treatment is better and why?
Solution:
Here, the random variable 𝑍 ~ 𝑁𝑜𝑟𝑚𝑎𝑙(0,1) ; so, the (Accelerated Failure Time) AFT model is called
Log Normal AFT model.
The probability that a lung cancer patient survives beyond time $t$ is
$$S(t) = 1 - \Phi\left(\frac{\ln t - x'\beta - \mu}{\sigma}\right)$$
The $p$th quantile survival time, denoted by $t_p$, is given by
$$t_p = \exp(\mu + x'\beta + \sigma Z_p)$$
linear.pred.TrA<-0.5*1
linear.pred.TrB<-0.5*0
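The survival probabilities interpreted below can be reproduced with a short sketch. It assumes the same given values as the later problems (mu = 2, sigma = 2, beta = 0.5), which are not restated in this solution, and stores them under the names `location` and `scale` that the quartile code in part c relies on.

```r
# Assumed given values: mu = 2, sigma = 2, beta = 0.5 (as in Problems 2 and 3)
location <- 2                      # mu
scale <- 2                         # sigma
linear.pred.TrA <- 0.5 * 1
linear.pred.TrB <- 0.5 * 0

time <- c(0, 0.5, 1, 2, 3, 4, 5, 6)   # years
t <- 12 * time                         # months

# Log-normal AFT survival: S(t) = 1 - Phi((ln t - x'beta - mu) / sigma)
sur.p.A <- 1 - pnorm((log(t) - linear.pred.TrA - location) / scale)
sur.p.B <- 1 - pnorm((log(t) - linear.pred.TrB - location) / scale)
round(cbind(t, sur.p.A, sur.p.B), 4)
```

At t = 0, log(0) is -Inf, so pnorm() returns 0 and the survival probability is 1, as expected.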
Interpretation:
• At 6 months, 63.84% of the lung cancer patients taking treatment A have not had the disease come back, compared with 54.15% of those taking treatment B.
• At 1 year (12 months), 50.30% of the patients taking treatment A have not had the disease come back, compared with 40.42% of those taking treatment B.
• At 2 years, 36.73% of the patients taking treatment A have not had the disease come back, compared with 27.79% of those taking treatment B.
• At 3 years, 29.40% of the patients taking treatment A have not had the disease come back, compared with 21.43% of those taking treatment B.
• At 4 years, 24.65% of the patients taking treatment A have not had the disease come back, compared with 17.47% of those taking treatment B.
• At 5 years, 21.27% of the patients taking treatment A have not had the disease come back, compared with 14.75% of those taking treatment B.
• At 6 years, 18.72% of the patients taking treatment A have not had the disease come back, compared with 12.75% of those taking treatment B.
Comment:
From the above figure, the survival probability at every time point is higher for patients taking treatment A than for those taking treatment B. For example, under the Log Normal AFT model, (1 − 0.6384) × 100% = 36.16% of the lung cancer patients taking treatment A get the disease back within 6 months, whereas (1 − 0.5415) × 100% = 45.85% of those taking treatment B get the disease back within 6 months.
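The two curves discussed in the comment can be redrawn with a sketch like the one below. The parameter values (mu = 2, sigma = 2, beta = 0.5) and the grid `t.grid` are assumptions made for illustration, since the plotting code for part b is not shown here.

```r
# Assumed given values: mu = 2, sigma = 2, beta = 0.5
location <- 2; scale <- 2; beta <- 0.5
t.grid <- seq(0, 72, by = 0.5)          # months, 0 to 6 years

# Log-normal AFT survival curves for treatment A (x = 1) and B (x = 0)
S.A <- 1 - pnorm((log(t.grid) - beta * 1 - location) / scale)
S.B <- 1 - pnorm((log(t.grid) - beta * 0 - location) / scale)

plot(t.grid, S.A, type = "l", col = "blue", ylim = c(0, 1),
     xlab = "Time (months)", ylab = "Survival Probability")
lines(t.grid, S.B, col = "red", lty = 2)
legend("topright", c("Treatment A", "Treatment B"),
       col = c("blue", "red"), lty = c(1, 2), cex = 0.8)
```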
c. Find the first, second, and third quartile survival times for treatment A and B.
#Quartile times
z.first.quartile<-qnorm(0.25)
time.first.quartile.A<-exp(location+linear.pred.TrA+scale*z.first.quartile)
time.first.quartile.B<-exp(location+linear.pred.TrB+scale*z.first.quartile)
z.second.quartile<-qnorm(0.5)
time.second.quartile.A<-exp(location+linear.pred.TrA+scale*z.second.quartile)
time.second.quartile.B<-exp(location+linear.pred.TrB+scale*z.second.quartile)
z.third.quartile<-qnorm(0.75)
time.third.quartile.A<-exp(location+linear.pred.TrA+scale*z.third.quartile)
time.third.quartile.B<-exp(location+linear.pred.TrB+scale*z.third.quartile)
quartile.A<-cbind(time.first.quartile.A,time.second.quartile.A,time.third.quartile.A)
quartile.B<-cbind(time.first.quartile.B,time.second.quartile.B,time.third.quartile.B)
cbind(quartile.A, quartile.B)
Interpretation:
• Among the lung cancer patients taking treatment A, 25% get the disease back at or before 3.16 months, while this time is 1.92 months for those taking treatment B.
• Among the lung cancer patients taking treatment A, 50% get the disease back at or before 12.18 months, while this time is 7.39 months for those taking treatment B.
• Among the lung cancer patients taking treatment A, 75% get the disease back at or before 46.95 months, while this time is 28.47 months for those taking treatment B.
d. Which treatment is better and why?
The event of interest is relapse of lung cancer. The survival probabilities show that patients taking treatment A are more likely to remain relapse-free at every time point than those taking treatment B. The quartiles show that, under the Log Normal model, patients taking treatment A take more time to get the disease back than those taking treatment B. Thus, patients with lung cancer should take treatment A, with the hope of longer survival and better health, rather than treatment B. Therefore, treatment A is better than treatment B.
$$x = \begin{cases} 1, & \text{if treatment is A} \\ 0, & \text{if treatment is B} \end{cases}$$
a. Compare the survival probabilities of the two treatments at 0, 0.5, 1, 2, 3, 4, 5, and 6 years.
b. Draw survival curves for treatment A and B in the same graph.
c. Find the first, second, and third quartile survival times for treatment A and B.
d. Which treatment is better and why?
Solution:
a.
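Here, the random variable $Z \sim \text{Logistic}(0,1)$; so, the Accelerated Failure Time (AFT) model is called the Log Logistic AFT model.
The parameters for $T$ are: location, $\alpha = e^{\mu} = e^{2}$; scale, $\gamma = \frac{1}{\sigma} = \frac{1}{2}$; regression parameter, $\beta = 0.5$.
The probability that a lung cancer patient survives beyond time $t$ is
$$S(t) = \left[1 + \left(\frac{t\, e^{-x'\beta}}{\alpha}\right)^{\gamma}\right]^{-1}$$
The $p$th quantile survival time is
$$t_p = \exp(\mu + x'\beta + \sigma Z_p)$$
where $Z_p$ is the $p$th quantile of the standard logistic distribution.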
time<-c(0,0.5,1,2,3,4,5,6)
t<-12*time
#Given value
mu<-2
sigma<-2
beta<-0.5
#linear predictor
lin.pre.A<-0.5*1
lin.pre.B<-0.5*0
#parameter values for log logistic
location<-exp(mu) #location, alpha = exp(mu)
gamma<-1/sigma #scale, gamma = 1/sigma
#survival probabilities
sur.p.A<-1/(1+((t*exp(-lin.pre.A))/location)^gamma)
sur.p.B<-1/(1+((t*exp(-lin.pre.B))/location)^gamma)
cbind(t,sur.p.A,sur.p.B)
t sur.p.A sur.p.B
[1,] 0 1.0000000 1.0000000
[2,] 6 0.5876164 0.5260066
[3,] 12 0.5018867 0.4396819
[4,] 24 0.4160459 0.3568582
[5,] 36 0.3677784 0.3117910
[6,] 48 0.3350125 0.2817899
[7,] 60 0.3106307 0.2597685
[8,] 72 0.2914539 0.2426265
The rest of the calculation and interpretation is the same as in Problem 01; try it yourself!
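As a starting point for the remaining parts, here is a minimal sketch of the quartile survival times under the Log Logistic AFT model. It uses the quantile formula with the standard logistic quantiles from `qlogis()`; the object names are chosen for illustration only.

```r
# Given values
mu <- 2; sigma <- 2; beta <- 0.5
p <- c(0.25, 0.5, 0.75)

# Quantiles of the standard logistic distribution
z.p <- qlogis(p)

# t_p = exp(mu + x'beta + sigma * z_p)
quartile.A <- exp(mu + beta * 1 + sigma * z.p)
quartile.B <- exp(mu + beta * 0 + sigma * z.p)

# Check: the survival probability at t_p should equal 1 - p
sur.at.quartile.A <- 1 / (1 + (quartile.A * exp(-beta * 1) / exp(mu))^(1 / sigma))
rbind(quartile.A, quartile.B)
```

The median for treatment A equals exp(mu + beta), because the standard logistic median is 0.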
$$x = \begin{cases} 1, & \text{if treatment is A} \\ 0, & \text{if treatment is B} \end{cases}$$
a. Compare the survival probabilities of the two treatments at 0, 0.5, 1, 2, 3, 4, 5, and 6 years.
b. Draw survival curves for treatment A and B in the same graph.
c. Find the first, second, and third quartile survival times for treatment A and B.
d. Which treatment is better and why?
Solution:
a.
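Here, the random variable $Z \sim \text{Extreme value}(0,1)$; so, the Accelerated Failure Time (AFT) model is called the Weibull Regression model.
The model is given by $Y = \ln T = \mu + x'\beta + \sigma Z = 2.0 + 0.5x + 2.0Z$.
The parameters for $T$ are: location, $\mu = -\ln\lambda$, or $\lambda = \exp(-\mu) = \exp(-2)$; scale, $\alpha = \frac{1}{\sigma} = \frac{1}{2}$.
The probability that a lung cancer patient survives beyond time $t$ is
$$S(t) = \exp\left[-\left(\lambda\, t\, e^{-x'\beta}\right)^{\alpha}\right]$$
The $p$th quantile survival time is
$$t_p = \exp(\mu + x'\beta + \sigma Z_p)$$
where $Z_p = \ln(-\ln(1-p))$ is the $p$th quantile of the standard extreme value distribution.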
time<-c(0,0.5,1,2,3,4,5,6)
t<-12*time
#Given value
mu<-2
sigma<-2
beta<-0.5
#linear predictor
lin.pre.A<-0.5*1
lin.pre.B<-0.5*0
#Parameters for Weibull model
lambda<-exp(-mu) #lambda = exp(-mu)
alpha<-1/sigma #alpha = 1/sigma
sur.p.A<-exp(-(lambda*t*exp(-lin.pre.A))^alpha)
sur.p.B<-exp(-(lambda*t*exp(-lin.pre.B))^alpha)
cbind(t,sur.p.A,sur.p.B)
t sur.p.A sur.p.B
[1,] 0 1.00000000 1.00000000
[2,] 6 0.49569693 0.40611581
[3,] 12 0.37065568 0.27960657
[4,] 24 0.24571545 0.16493005
[5,] 36 0.17924014 0.10999981
[6,] 48 0.13738563 0.07817983
[7,] 60 0.10868988 0.05786851
[8,] 72 0.08794235 0.04408831
The rest of the calculation and interpretation is the same as in Problems 01 and 02; try it yourself!
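For the quartiles under the Weibull model, a minimal sketch is given below. It assumes the standard (minimum) extreme value distribution, whose pth quantile is log(-log(1 - p)); the object names are illustrative.

```r
# Given values
mu <- 2; sigma <- 2; beta <- 0.5
p <- c(0.25, 0.5, 0.75)

# Quantiles of the standard extreme value distribution
z.p <- log(-log(1 - p))

# t_p = exp(mu + x'beta + sigma * z_p)
quartile.A <- exp(mu + beta * 1 + sigma * z.p)
quartile.B <- exp(mu + beta * 0 + sigma * z.p)

# Check: the Weibull survival probability at t_p should equal 1 - p
lambda <- exp(-mu)
alpha <- 1 / sigma
sur.at.quartile.A <- exp(-(lambda * quartile.A * exp(-beta * 1))^alpha)
rbind(quartile.A, quartile.B)
```

The median for treatment A comes out slightly below 6 months, which agrees with the tabulated survival probability of 0.4957 at 6 months.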
b. Estimate the first quartile, median, and third quartile survival times when 𝑥 = 1 and also
find 95% confidence interval for these survival times.
c. Estimate the first quartile, median, and third quartile survival times when 𝑥 = 0 and also
find 95% confidence interval for these survival times.
Solution:
Given that,
$\hat{\beta}_0 = 0.105$, $Var(\hat{\beta}_0) = 0.158$,
$\hat{\beta}_1 = 0.937$, $Var(\hat{\beta}_1) = 0.151$,
$\ln\hat{b} = 0.159$, $Var(\ln\hat{b}) = 0.027$,
$Cov(\hat{\beta}_0, \hat{\beta}_1) = Cov(\hat{\beta}_1, \hat{\beta}_0) = -0.092$,
$Cov(\hat{\beta}_0, \ln\hat{b}) = Cov(\ln\hat{b}, \hat{\beta}_0) = 0.000$,
$Cov(\hat{\beta}_1, \ln\hat{b}) = Cov(\ln\hat{b}, \hat{\beta}_1) = 0.000$.
Hypotheses to be tested:
i) $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$
ii) $H_0: b = 1$ vs $H_1: b \neq 1$
For i)
Test statistic:
$$w = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \sim N(0,1), \text{ under } H_0$$
where $SE(\hat{\beta}_1) = \sqrt{Var(\hat{\beta}_1)} = \sqrt{0.151}$.
beta1.hat<-.937
se.beta1.hat<-sqrt(.151)
wald.beta1<-beta1.hat/se.beta1.hat
pval.beta1<-2*(1-pnorm(abs(wald.beta1)))
pval.beta1
[1] 0.0158958
Here, the p-value is 0.0158958 < 0.05. Thus, null hypothesis 𝐻0 : 𝛽1 = 0 may be rejected at 5% level of
significance.
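For ii)
Test statistic:
$$w = \frac{\hat{b} - 1}{SE(\hat{b})} \sim N(0,1), \text{ under } H_0$$
where $\hat{b} = e^{\ln\hat{b}} = e^{0.159}$ and, by the delta method, $Var(\hat{b}) \approx \hat{b}^2\, Var(\ln\hat{b})$.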
b.hat<-exp(.159)
var.b.hat<-(b.hat^2)*.027
se.b.hat<-sqrt(var.b.hat)
wald.b.hat<-(b.hat-1)/se.b.hat
pval.b.hat<-2*(1-pnorm(abs(wald.b.hat)))
pval.b.hat
[1] 0.3709819
Here, the p-value is 0.3709819 > 0.05. Thus, null hypothesis 𝐻0 : b = 1 may not be rejected at 5% level
of significance.
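Both Wald tests follow the same pattern, so they can be wrapped in one small helper. The function name `wald.test` is hypothetical (it is not part of any package used in these notes); the inputs are the estimates and variances given above.

```r
# Two-sided Wald test of H0: parameter = null.value
# (wald.test is an illustrative helper, not a library function)
wald.test <- function(estimate, se, null.value = 0) {
  w <- (estimate - null.value) / se
  c(statistic = w, p.value = 2 * (1 - pnorm(abs(w))))
}

# i) H0: beta1 = 0
res.beta1 <- wald.test(0.937, sqrt(0.151))

# ii) H0: b = 1, with Var(b.hat) from the delta method
b.hat <- exp(0.159)
res.b <- wald.test(b.hat, sqrt(b.hat^2 * 0.027), null.value = 1)

rbind(res.beta1, res.b)
```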
b. Estimate the first quartile, median, and third quartile survival times when 𝒙 = 𝟏 and also
find 95% confidence interval for these survival times.
Given,
𝑙𝑛𝑡𝑖 |𝑥𝑖 follows a normal distribution [or extreme value or logistic distribution] with location parameter
𝛽0 + 𝛽1 𝑥𝑖 and scale parameter 𝑏.
Thus, the AFT regression model
𝑙𝑛𝑇 = 𝛽0 + 𝛽1 𝑥 + 𝑏𝑧
When $x = 1$, 25% of the observations have a survival time less than or equal to 1.285657, 50% have a survival time less than or equal to 2.834881, and 75% have a survival time less than or equal to 6.250928.
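where,
$$\sqrt{Var(t_p)} = \sqrt{t_p^2\left(Var(\hat{\beta}_0) + 2x\,Cov(\hat{\beta}_0, \hat{\beta}_1) + x^2\,Var(\hat{\beta}_1) + z_p^2\,Var(\hat{b})\right)} \quad \text{(ii)}$$
Since $x = 1$,
$$\sqrt{Var(t_p)} = \sqrt{t_p^2\left(Var(\hat{\beta}_0) + 2\,Cov(\hat{\beta}_0, \hat{\beta}_1) + Var(\hat{\beta}_1) + z_p^2\,Var(\hat{b})\right)}$$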
#confidence intervals
se.t.25<-sqrt((t.25^2)*(.158+.151+(z.25^2)*var.b.hat-2*.092))
CI.t25<-c(t.25-qnorm(.975)*se.t.25,t.25+qnorm(.975)*se.t.25)
CI.t25
[1] 0.3365032 2.2348113
se.t.5<-sqrt((t.5^2)*(.158+.151+(z.5^2)*var.b.hat-2*.092))
CI.t5<-c(t.5-qnorm(.975)*se.t.5,t.5+qnorm(.975)*se.t.5)
CI.t5
[1] 0.8704448 4.7993174
se.t.75<-sqrt((t.75^2)*(.158+.151+(z.75^2)*var.b.hat-2*.092))
CI.t75<-c(t.75-qnorm(.975)*se.t.75,t.75+qnorm(.975)*se.t.75)
CI.t75
[1] 1.636095 10.865761
When $x = 1$, the 95% confidence intervals for the 1st, 2nd and 3rd quartile survival times of the population are (0.3365032, 2.2348113), (0.8704448, 4.7993174) and (1.636095, 10.865761) respectively; that is, we are 95% confident that each interval contains the corresponding true quartile survival time.
c. Estimate the first quartile, median, and third quartile survival times when x=0 and also
find 95% confidence interval for these survival times.
The $p$th quantile survival time is
$$t_p = e^{\beta_0 + b z_p}$$
z.25<-qnorm(.25)
t.25<-exp(.105+b.hat*z.25)
z.5<-qnorm(.5)
t.5<-exp(.105+b.hat*z.5)
z.75<-qnorm(.75)
t.75<-exp(.105+b.hat*z.75)
quartiles<-cbind(t.25,t.5,t.75)
quartiles
t.25 t.5 t.75
[1,] 0.5037224 1.110711 2.449123
When $x = 0$, 25% of the observations have a survival time less than or equal to 0.5037224, 50% have a survival time less than or equal to 1.110711, and 75% have a survival time less than or equal to 2.449123.
where,
$$\sqrt{Var(t_p)} = \sqrt{t_p^2\left(Var(\hat{\beta}_0) + z_p^2\,Var(\hat{b})\right)}$$
#confidence intervals
se.t.25<-sqrt((t.25^2)*(.158+(z.25^2)*var.b.hat))
CI.t25<-c(t.25-qnorm(.975)*se.t.25,t.25+qnorm(.975)*se.t.25)
CI.t25
[1] 0.09085392 0.91659091
se.t.5<-sqrt((t.5^2)*(.158+(z.5^2)*var.b.hat))
CI.t5<-c(t.5-qnorm(.975)*se.t.5,t.5+qnorm(.975)*se.t.5)
CI.t5
[1] 0.245389 1.976032
se.t.75<-sqrt((t.75^2)*(.158+(z.75^2)*var.b.hat))
CI.t75<-c(t.75-qnorm(.975)*se.t.75,t.75+qnorm(.975)*se.t.75)
CI.t75
[1] 0.4417362 4.4565095
When $x = 0$, the 95% confidence intervals for the 1st, 2nd and 3rd quartile survival times of the population are (0.09085392, 0.91659091), (0.245389, 1.976032) and (0.4417362, 4.4565095) respectively; that is, we are 95% confident that each interval contains the corresponding true quartile survival time.
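Parts b and c repeat the same calculation for x = 1 and x = 0, so the steps can be collected into one function. This is a sketch under the assumptions used above (normal errors, delta-method variance of b.hat); `quantile.ci` and the short variable names are hypothetical.

```r
# Estimates, variances and covariance as given in the problem
b0 <- 0.105; b1 <- 0.937; ln.b <- 0.159
v.b0 <- 0.158; v.b1 <- 0.151; v.ln.b <- 0.027; c.b0b1 <- -0.092
b.hat <- exp(ln.b)
var.b.hat <- b.hat^2 * v.ln.b          # delta method

# pth quantile survival time and its Wald-type confidence interval
# (quantile.ci is an illustrative helper, not a library function)
quantile.ci <- function(p, x, level = 0.95) {
  z.p <- qnorm(p)
  t.p <- exp(b0 + b1 * x + b.hat * z.p)
  se <- sqrt(t.p^2 * (v.b0 + 2 * x * c.b0b1 + x^2 * v.b1 + z.p^2 * var.b.hat))
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = t.p, lower = t.p - z * se, upper = t.p + z * se)
}

res.x1 <- t(sapply(c(0.25, 0.5, 0.75), quantile.ci, x = 1))
res.x0 <- t(sapply(c(0.25, 0.5, 0.75), quantile.ci, x = 0))
res.x1
res.x0
```

Each row of the result holds one quartile: the estimate followed by the lower and upper confidence limits.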
a. Fit a Weibull AFT regression model. Identify the potential risk factors associated with infant
mortality and interpret the results.
b. Find the overall survival probabilities and curve.
c. Find the survival probabilities and curves for the variable ‘Wealth Index’.
Solution:
a.
dim(inf.data)
names(inf.data)
attach(inf.data)
# Children who did not survive up to at least 1 month (TIME=0) are assigned TIME=0.5,
# because log(0) is -Inf and cannot be used in the AFT model
TIME<-ifelse(TIME==0, 0.5, TIME)
# Load the library file
library(survival)
aft.weibull<-survreg(Surv(TIME,
CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="weibull")
summary(aft.weibull)
Call:
survreg(formula = Surv(TIME, CHSURV) ~ AGE + AGESQ + RELIGION +
MEDIA + PLACE + NGO + PRIMARY + SECONDAR + HIGHER + POOR +
RICH, dist = "weibull")
Value Std. Error z p
(Intercept) 5.18164 1.87120 2.77 0.00562
AGE 0.33773 0.11104 3.04 0.00235
AGESQ -0.00565 0.00163 -3.46 0.00054
RELIGION -0.23318 0.48098 -0.48 0.62781
MEDIA -0.11664 0.31613 -0.37 0.71215
PLACE 0.40775 0.31807 1.28 0.19985
NGO 0.51291 0.28782 1.78 0.07474
PRIMARY 0.61883 0.32604 1.90 0.05770
SECONDAR 1.45686 0.41865 3.48 0.00050
HIGHER 3.99143 1.13256 3.52 0.00042
POOR 0.26368 0.36947 0.71 0.47542
RICH 0.12798 0.38874 0.33 0.74199
Log(scale) 0.86594 0.05593 15.48 < 2e-16
Scale= 2.38
Weibull distribution
Loglik(model)= -1938.3 Loglik(intercept only)= -1972.5
Chisq= 68.47 on 11 degrees of freedom, p= 2.4e-10
Number of Newton-Raphson Iterations: 10
n= 9845
Output:
Education
Wealth Index
Hypothesis:
Since the p-value of the overall Weibull survival regression model is $2.4 \times 10^{-10} < \alpha = 0.05$, we may reject the null hypothesis that no covariate has an effect at the 5% level of significance. That is, at least one covariate has an effect on the survival time.
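The overall test is a likelihood ratio test comparing the fitted model with the intercept-only model. A minimal sketch using the values printed in the summary above (variable names are illustrative):

```r
# Log-likelihoods as printed (rounded to one decimal place)
loglik.model <- -1938.3
loglik.null <- -1972.5

# LR statistic from the rounded values; the summary reports 68.47 from unrounded ones
lr.stat <- 2 * (loglik.model - loglik.null)

# p-value from the chi-square distribution with 11 degrees of freedom
df <- 11
p.value <- pchisq(68.47, df = df, lower.tail = FALSE)
c(lr.stat = lr.stat, p.value = p.value)
```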
From the p-values we can identify the potential factors associated with infant mortality. Age (AGE and AGESQ) and education (SECONDAR and HIGHER) are potential risk factors associated with infant mortality at the 5% level of significance, whereas NGO and PRIMARY (p-values 0.075 and 0.058) are significant only at the 10% level.
Interpretation: Interpretation has been provided only for covariates which are significant.
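[Figure: inverted-U curve of lnT against maternal age, with the peak marked at 28 years]
With increasing Age, the log of Time first increases, reaches its maximum at about 28 years, and then decreases, keeping all other covariates at a fixed level. The value 28 comes from rounding the AGESQ coefficient to −0.006; with the reported coefficients, 0.33773/(2 × 0.00565) gives a peak closer to 30 years.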
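The turning point of a quadratic age effect b1·Age + b2·Age² is at Age = −b1/(2·b2). The sketch below applies this to the coefficients reported in these notes, to show how sensitive the result is to rounding; `turning.point` is a hypothetical helper name.

```r
# Turning point of b.age * Age + b.agesq * Age^2
# (turning.point is an illustrative helper, not a library function)
turning.point <- function(b.age, b.agesq) -b.age / (2 * b.agesq)

# Weibull AFT coefficients as reported in the summary
tp.aft <- turning.point(0.33773, -0.00565)

# Same coefficients rounded to three decimals
tp.aft.rounded <- turning.point(0.338, -0.006)

# Cox PH coefficients used in the semi parametric problem
tp.cox <- turning.point(-0.1329, 0.0023)

round(c(aft = tp.aft, aft.rounded = tp.aft.rounded, cox = tp.cox), 1)
```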
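b. Find the overall survival probabilities and curve.
# Shape parameter of the baseline Weibull distribution
alpha<-1/aft.weibull$scale
cat("Shape parameter:", "\n")
alpha
# lambda is not defined in the code shown; assumed here to be exp(-intercept), as in Problem 3
lambda<-exp(-aft.weibull$coef[1])
# Overall survival probability (At mean values of covariates)
# cov: mean of each covariate in model order, as defined in Problem 6
cov<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), mean(POOR), mean(RICH))
acf.wei<-exp(sum(cov*aft.weibull$coef[2:12]))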
#Defining the time variable
time1<-TIME[CHSURV==1]
time2<-sort(time1)
time3<-unique(time2)
time<-c(0,time3)
sur.over<-exp(-(lambda*time/acf.wei)^alpha) #Overall survival probability
cbind(time, sur.over)
time sur.over
[1,] 0.0 1.0000000
[2,] 0.5 0.9922538
[3,] 1.0 0.9896450
[4,] 2.0 0.9861639
[5,] 3.0 0.9836120
[6,] 4.0 0.9815234
[7,] 5.0 0.9797236
[8,] 6.0 0.9781251
[9,] 7.0 0.9766769
[10,] 8.0 0.9753461
[11,] 9.0 0.9741101
[12,] 10.0 0.9729529
[13,] 11.0 0.9718622
[14,] 12.0 0.9708287
##Survival Curve
plot(time, sur.over, xlab="Survival Time", ylab="Survival Probabilities", ylim=c(0.96,1.0), type='s',
cex.lab=0.8)
title("Figure 1: Overall Survival Curve for Weibull Regression Model", cex.main=0.8)
The overall survival probabilities are quite high, lying between 0.97 and 1. Over the span of one year the survival probabilities decline gradually, and the decline is not steep.
c. Find the survival probabilities and curves for the variable ‘Wealth Index’.
acf.wei.poor<-exp(sum(cov.poor*aft.weibull$coef[2:12]))
acf.wei.mid<-exp(sum(cov.mid*aft.weibull$coef[2:12]))
acf.wei.rich<-exp(sum(cov.rich*aft.weibull$coef[2:12]))
sur.poor<-exp(-(lambda*time/acf.wei.poor)**alpha)
sur.mid<-exp(-(lambda*time/acf.wei.mid)**alpha)
sur.rich<-exp(-(lambda*time/acf.wei.rich)**alpha)
cbind(time, sur.poor, sur.mid, sur.rich)
[6,] 4.0 0.9823687 0.9803209 0.9813427
[7,] 5.0 0.9806504 0.9784053 0.9795255
[8,] 6.0 0.9791242 0.9767042 0.9779116
[9,] 7.0 0.9777414 0.9751630 0.9764493
[10,] 8.0 0.9764706 0.9737470 0.9751057
[11,] 9.0 0.9752903 0.9724321 0.9738579
[12,] 10.0 0.9741851 0.9712009 0.9726895
[13,] 11.0 0.9731434 0.9700407 0.9715884
[14,] 12.0 0.9721563 0.9689413 0.9705449
#Survival Curves
plot(time, sur.mid, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(0.95 ,1.0),
type='s', cex.lab=0.8)
lines(time, sur.poor, type='s', col="red")
lines(time, sur.rich, type='s', col="blue")
legend(0.75,.965, c("Middle", "Poor", "Rich"), lty=c(1,1,1), col=c("black", "red", "blue"), cex=0.8)
title("Figure 2: Survival Curves for Wealth Index: Weibull Regression Model", cex.main=0.8)
Comment: As observed from the survival curves, children born to mothers who are poor have the highest probability of survival during infancy and those born to mothers who are middle class have the lowest. The differences are small, however, and neither POOR nor RICH is statistically significant in the fitted model (p-values 0.475 and 0.742).
Problem 6: Log-Normal AFT Model with Real Life Data
The SPSS file “infant_sur.dat” is the data set on the infant mortality of Bangladesh with some selected
variables. The variable ‘TIME’ measures the time to death in months and the variable ‘CHSURV’
indicates whether the observation is censored or not. Dataset link: Biostat Data
Using the data set,
a. Fit a Log-Normal AFT regression model. Identify the potential risk factors associated with
infant mortality and interpret the results.
b. Find the overall survival probabilities and curve.
c. Find the survival probabilities and curves for the variables ‘Place of Residence’, ‘Education’,
‘Wealth Index’.
Solution:
The relevant R codes are provided. Try the outputs and interpretation yourself!
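The code for part a (the fit stored in `aft.lognorm`) is not listed, so here is a hedged sketch of the fitting step. Because the infant data set is not available in this text, it uses the `lung` data shipped with the survival package as a stand-in, with two arbitrary covariates; for the actual problem the formula and data would be the ones used for the Weibull fit, with `dist="lognormal"`.

```r
library(survival)

# Stand-in data: lung (status: 1 = censored, 2 = dead); not the infant data
fit <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "lognormal")

mu <- unname(fit$coef[1])        # intercept of the log-normal AFT model
sigma <- fit$scale               # scale parameter

# Linear predictor at the mean values of the covariates
cov.mean <- c(mean(lung$age), mean(lung$sex))
lp <- mu + sum(cov.mean * fit$coef[2:3])

# Survival probabilities S(t) = 1 - Phi((ln t - lp) / sigma) on an illustrative grid
t.grid <- c(100, 200, 400, 800)
sur.over <- 1 - pnorm((log(t.grid) - lp) / sigma)
cbind(t.grid, sur.over)

# Cross-check: the predicted median at the covariate means equals exp(lp)
med.pred <- predict(fit, newdata = data.frame(age = cov.mean[1], sex = cov.mean[2]),
                    type = "quantile", p = 0.5)
```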
#b.
#Overall Survival Probability
#Mean of the baseline Log-Normal distribution
mu<-aft.lognorm$coef[1]
cat("Mean:", "\n")
mu
#Scale parameter of baseline Log-Normal distribution
alpha<-aft.lognorm$scale
cat("Scale:", "\n")
alpha
#Overall survival probability (At mean values of covariates)
cov<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), mean(POOR), mean(RICH))
acf.lognorm<-sum(cov*aft.lognorm$coef[2:12])
time1<-TIME[CHSURV==1]
time2<-sort(time1)
time3<-unique(time2)
time<-c(0,time3)
sur.over.ln<-1-pnorm((log(time)-acf.lognorm-mu)/alpha)
cbind(time,sur.over.ln)
#Survival Curve
plot(time, sur.over.ln, xlab="Survival Time", ylab="Survival Probabilities", ylim=c(0.96,1.0), type='s',
cex.lab=0.8)
title("Figure 1: Overall Survival Curve for Log-Normal Regression Model", cex.main=0.8)
#c.
# Survival probability: Place of Residence
cov.u<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), 1, mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), mean(POOR), mean(RICH))
cov.r<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), 0, mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), mean(POOR), mean(RICH))
acf.ln.u<-sum(cov.u*aft.lognorm$coef[2:12])
acf.ln.r<-sum(cov.r*aft.lognorm$coef[2:12])
sur.ur<-1-pnorm((log(time)-acf.ln.u-mu)/alpha)
sur.ru<-1-pnorm((log(time)-acf.ln.r-mu)/alpha)
#Survival curves: Urban versus Rural
plot(time, sur.ur, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(0.95 ,1.0),
type='s', cex.lab=0.8)
lines(time, sur.ru, type='s', col="red")
legend(6,.99, c("Urban", "Rural"), lty=c(1,1), col=c("black", "red"), cex=0.8)
title("Figure 2: Survival Curves for Place of Residence: Log-Normal Regression Model", cex.main=0.8)
#Survival probability: Education
cov.n<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
0,0,0, mean(POOR), mean(RICH))
cov.p<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
1,0,0, mean(POOR), mean(RICH))
cov.s<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
0,1,0, mean(POOR), mean(RICH))
cov.h<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
0,0,1, mean(POOR), mean(RICH))
acf.ln.n<-sum(cov.n*aft.lognorm$coef[2:12])
acf.ln.p<-sum(cov.p*aft.lognorm$coef[2:12])
acf.ln.s<-sum(cov.s*aft.lognorm$coef[2:12])
acf.ln.h<-sum(cov.h*aft.lognorm$coef[2:12])
sur.ne<-1-pnorm((log(time)-acf.ln.n-mu)/alpha)
sur.pe<-1-pnorm((log(time)-acf.ln.p-mu)/alpha)
sur.se<-1-pnorm((log(time)-acf.ln.s-mu)/alpha)
sur.he<-1-pnorm((log(time)-acf.ln.h-mu)/alpha)
#Survival Curves
plot(time, sur.ne, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(0.95 ,1.0),
type='s', cex.lab=0.8)
lines(time, sur.pe, type='s', col="red")
lines(time, sur.se, type='s', col="blue")
lines(time, sur.he, type='s', col="green3")
legend(0.75,.965, c("No Education", "Primary", "Secondary", "Higher"), lty=c(1,1,1,1), col=c("black",
"red", "blue", "green3"), cex=0.8)
title("Figure 3: Survival Curves for Education: Log-Normal Regression Model", cex.main=0.8)
#Survival probability: Wealth Index
cov.poor<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 1,0)
cov.mid<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 0,0)
cov.rich<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 0,1)
acf.ln.poor<-sum(cov.poor*aft.lognorm$coef[2:12])
acf.ln.mid<-sum(cov.mid*aft.lognorm$coef[2:12])
acf.ln.rich<-sum(cov.rich*aft.lognorm$coef[2:12])
sur.poor<-1-pnorm((log(time)-acf.ln.poor-mu)/alpha)
sur.mid<-1-pnorm((log(time)-acf.ln.mid-mu)/alpha)
sur.rich<-1-pnorm((log(time)-acf.ln.rich-mu)/alpha)
# Survival Curves
plot(time, sur.mid, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(0.95 ,1.0),
type='s', cex.lab=0.8)
lines(time, sur.poor, type='s', col="red")
lines(time, sur.rich, type='s', col="blue")
legend(0.75,.965, c("Middle", "Poor", "Rich"), lty=c(1,1,1), col=c("black", "red", "blue"), cex=0.8)
title("Figure 4: Survival Curves for Wealth Index: Log-Normal Regression Model", cex.main=0.8)
a. Fit a Log-Logistic AFT regression model. Identify the potential risk factors associated with
infant mortality and interpret the results.
b. Estimate the odds ratio of infant mortality prior to a specified time point for NGO members and
interpret the results.
c. Find the survival probabilities and curves for the variable ‘Wealth Index’.
Solution:
a.
attach(inf.data)
# Children who did not survive up to at least 1 month (TIME=0) are assigned TIME=0.5,
# because log(0) is -Inf and cannot be used in the AFT model
TIME<-ifelse(TIME==0, 0.5, TIME)
# AFT Model: Log-Logistic
aft.loglogist<-survreg(Surv(TIME,
CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="loglogistic")
summary(aft.loglogist)
Call:
survreg(formula = Surv(TIME, CHSURV) ~ AGE + AGESQ + RELIGION +
MEDIA + PLACE + NGO + PRIMARY + SECONDAR + HIGHER + POOR +
RICH, dist = "loglogistic")
Value Std. Error z p
(Intercept) 4.93053 1.88492 2.62 0.00890
AGE 0.34469 0.11194 3.08 0.00208
AGESQ -0.00577 0.00165 -3.50 0.00047
RELIGION -0.24180 0.48371 -0.50 0.61716
MEDIA -0.11245 0.31840 -0.35 0.72397
PLACE 0.40632 0.32005 1.27 0.20424
NGO 0.51006 0.28947 1.76 0.07805
PRIMARY 0.62718 0.32834 1.91 0.05611
SECONDAR 1.46941 0.41997 3.50 0.00047
HIGHER 3.98728 1.12373 3.55 0.00039
POOR 0.26597 0.37275 0.71 0.47551
RICH 0.12751 0.39157 0.33 0.74470
Log(scale) 0.85313 0.05556 15.36 < 2e-16
Scale= 2.35
exp(aft.loglogist$coef)
b.
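For a specific covariate $x_j$, keeping all other covariates at a fixed level, the OR becomes:
$$OR = \left[\exp\{-\beta_j(x_{1j} - x_{2j})\}\right]^{\gamma}$$
where the parameters are taken from the fitted model:
location, $\alpha = e^{\mu} = e^{\text{coefficient}}$, with the intercept as the coefficient;
scale, $\gamma = \frac{1}{\sigma} = \frac{1}{\text{scale}}$.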
alpha<-exp(aft.loglogist$coefficient[1])
gamma<-1/aft.loglogist$scale
exp(-aft.loglogist$coef["NGO"])^gamma
NGO
0.8046661
Interpretation: The odds of infant mortality prior to any specified time point are 19.5% lower for mothers who are NGO members than for non-member mothers, keeping all other covariates at a fixed level.
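The odds ratio formula follows directly from the Log Logistic survival function used in Problem 2. A short derivation, in the same notation ($\alpha$ location, $\gamma$ scale), with the numerical check using the fitted values $\hat{\beta}_{NGO} = 0.51006$ and $\log(\text{scale}) = 0.85313$:

```latex
% Odds of failure before time t under the Log Logistic AFT model
S(t \mid x) = \left[1 + \left(\frac{t\,e^{-x'\beta}}{\alpha}\right)^{\gamma}\right]^{-1}
\quad\Longrightarrow\quad
\frac{1 - S(t \mid x)}{S(t \mid x)} = \left(\frac{t\,e^{-x'\beta}}{\alpha}\right)^{\gamma}

% Ratio of the odds for two covariate vectors that differ only in x_j
OR = \frac{\left(t\,e^{-x_1'\beta}/\alpha\right)^{\gamma}}{\left(t\,e^{-x_2'\beta}/\alpha\right)^{\gamma}}
   = \left[\exp\{-\beta_j (x_{1j} - x_{2j})\}\right]^{\gamma}

% NGO members (x_{1j} = 1) versus non-members (x_{2j} = 0)
\gamma = \frac{1}{e^{0.85313}} \approx 0.426, \qquad
OR = \exp(-0.51006 \times 0.426) \approx 0.805
```

The time $t$ cancels in the ratio, which is why the odds ratio applies to mortality prior to any specified time point.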
c.
Try yourself!
Multiplicative hazard models are the most popular because of their computational simplicity and ease of interpretation. We'll do the practical based on the multiplicative hazard model, which is also known as the proportional hazards (PH) model.
In the presence of covariates $x$, the multiplicative hazard (PH) model is given by:
$$h(t) = h_0(t)\exp(x'\beta)$$
where $h(t)$ and $h_0(t)$ are the hazard functions in the presence and absence of covariates respectively.
• When the baseline hazard function is defined parametrically, the PH model is known as
parametric PH model.
• When the baseline hazard function is left as an arbitrary function, the PH model is known as
semi parametric PH model.
a. Find the potential risk factors of infant mortality by fitting Weibull parametric PH model.
b. Find the overall survival probabilities and curve.
c. Find the survival probabilities and curves for variables ‘Type of place of residence’,
‘Education’, and ‘Wealth Index’.
Solution:
a.
POOR 0.347 -0.111 0.895 0.155 0.4772
RICH 0.463 -0.054 0.948 0.163 0.7424
Events 312
Total time at risk 108168
Max. log. likelihood -1938.3
LR test statistic 68.47
Degrees of freedom 11
Overall p-value 2.37972e-10
Since the p-value of the overall Weibull PH model is 2.37972e-10 < 0.05, we may reject the null hypothesis at the 5% level of significance. That is, at least one covariate has an effect on the survival time.
Interpretation:
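Since the covariate age has a quadratic effect, it has a slightly different interpretation.
$$\ln h(T) = \hat{\mu} - 0.142\,Age + 0.002\,Age^2$$
For the level of age at the minimum level of hazard:
$$\frac{\partial \ln h(T)}{\partial Age} = 0 \;\Rightarrow\; -0.142 + 2 \times 0.002\,Age = 0 \;\Rightarrow\; Age \approx 35.5 \text{ years}$$
Since $Age^2$ has a positive sign, the curve is U-shaped:
[Figure: U-shaped curve of lnh(T) against maternal age]
The log hazard of infant mortality is highest at both early and late maternal ages, while it is lowest at middle maternal ages. Specifically, the log hazard reaches its minimum at approximately 35.5 years of age, keeping all other covariates fixed. This value is sensitive to rounding of the AGESQ coefficient: with 0.0024 (that is, 0.4207 × 0.00565 from the Weibull AFT fit) in place of 0.002, the minimum falls near 30 years.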
• Hazard ratio for covariate Secondary Education is 0.542
Therefore, for mothers who have secondary level education compared to mothers having no education, the hazard rate of infant mortality decreases by (1 − 0.542) × 100% = 45.8%, keeping all other covariates at a fixed level.
• Hazard ratio for covariate Higher Education is 0.187
Therefore, for mothers who have higher level education compared to mothers having no education, the hazard rate of infant mortality decreases by (1 − 0.187) × 100% = 81.3%, keeping all other covariates at a fixed level.
time sur.over
[1,] 0.0 1.00000000
[2,] 0.5 0.54434692
[3,] 1.0 0.44305796
[4,] 2.0 0.33633738
[5,] 3.0 0.27464371
[6,] 4.0 0.23257985
[7,] 5.0 0.20148122
[8,] 6.0 0.17732651
[9,] 7.0 0.15792356
[10,] 8.0 0.14194978
[11,] 9.0 0.12854889
[12,] 10.0 0.11713692
[13,] 11.0 0.10729924
[14,] 12.0 0.09873189
##Survival Curve
plot(time, sur.over, xlab="Survival Time", ylab="Survival Probabilities", ylim=c(0.096,1.0), type='s',
cex.lab=0.8)
title("Figure 1: Overall Survival Curve for Weibull PH Regression Model", cex.main=0.8)
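The overall survival probabilities show a sharp declining pattern during the period of infancy. These values are far lower than the Weibull AFT estimates for the same data (between 0.97 and 1), although the two fits are reparameterizations of the same model (both report a maximized log-likelihood of −1938.3), so the calculation behind this table should be rechecked.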
lines(time, sur.ru, type='s', col="red")
legend(6,.99, c("Urban", "Rural"), lty=c(1,1), col=c("black", "red"), cex=0.8)
title("Figure 2: Survival Curves for Place of Residence: Weibull PH Regression Model",cex.main=0.8)
Comment: Infants born to mothers who reside in urban areas have higher probability of survival
compared to infants born to mothers who reside in rural areas as observed from the survival curves.
Solution:
aft.weibull<-survreg(Surv(TIME,
CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="weibull")
aft.rcoef<-aft.weibull$coef[2]
aft.scale<-exp(aft.weibull$coef[1])
aft.shape<-1/aft.weibull$scale
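aft.res<-cbind(aft.rcoef,aft.scale,aft.shape)
library(eha) # phreg() comes from the eha package
ph.weibull<-phreg(Surv(TIME,
CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="weibull")
ph.rcoef<-ph.weibull$coef[1]
# With 11 covariates, positions 2 and 3 hold AGESQ and RELIGION; assuming eha's ordering,
# log(scale) and log(shape) are the last two coefficients
n.coef<-length(ph.weibull$coef)
ph.scale<-exp(ph.weibull$coef[n.coef-1])
ph.shape<-exp(ph.weibull$coef[n.coef])
ph.res<-cbind(ph.rcoef,ph.scale,ph.shape)
cbind(aft.res,ph.res)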
#Here
alpha<-0.4206546
beta_aft<-0.3377305
beta_ph<--alpha*beta_aft
beta_ph
[1] -0.1420679
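The same relationship, beta_PH = −alpha × beta_AFT, converts every AFT coefficient at once. The sketch below types in the estimates printed in the Weibull AFT summary (rounded as shown there), so the results agree with the PH output only up to rounding; the object names are illustrative.

```r
# Weibull AFT estimates copied from the survreg summary
aft.coef <- c(AGE = 0.33773, AGESQ = -0.00565, RELIGION = -0.23318,
              MEDIA = -0.11664, PLACE = 0.40775, NGO = 0.51291,
              PRIMARY = 0.61883, SECONDAR = 1.45686, HIGHER = 3.99143,
              POOR = 0.26368, RICH = 0.12798)
log.scale <- 0.86594

# Weibull shape parameter: alpha = 1 / scale
alpha <- 1 / exp(log.scale)

# PH coefficients and hazard ratios implied by the AFT fit
beta.ph <- -alpha * aft.coef
hazard.ratio <- exp(beta.ph)
round(cbind(beta.ph, hazard.ratio), 3)
```

The hazard ratios for SECONDAR and HIGHER come out at about 0.542 and 0.187, the values used in the Weibull PH interpretation.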
a. Find the potential risk factors of infant mortality by fitting semi parametric PH model.
b. Find the baseline hazard and plot them.
c. Find overall survival probabilities and curve.
d. Find the survival probabilities and curves for variables ‘Type of place of residence’,
‘Education’, and ‘Wealth Index’.
Solution:
a.
attach(inf.data)
TIME<-ifelse(TIME==0, 0.5, TIME)
#Start of Cox PH Model
coxph.model<-coxph(Surv(TIME,CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+
PRIMARY+SECONDAR+HIGHER+RICH+POOR)
summary(coxph.model)
Call:
coxph(formula = Surv(TIME, CHSURV) ~ AGE + AGESQ + RELIGION +
MEDIA + PLACE + NGO + PRIMARY + SECONDAR + HIGHER + RICH +
POOR)
Since the covariate age has a quadratic effect, it has a slightly different interpretation.
$$\ln h(T) = \hat{\mu} - 0.1329\,Age + 0.0023\,Age^2$$
For the level of age at the minimum level of hazard:
$$\frac{\partial \ln h(T)}{\partial Age} = 0 \;\Rightarrow\; -0.1329 + 2 \times 0.0023\,Age = 0 \;\Rightarrow\; Age \approx 29 \text{ years}$$
Since $Age^2$ has a positive sign, the curve is U-shaped:
[Figure: U-shaped curve of lnh(T) against Age, with the minimum marked at 29 years]
The log hazard of infant mortality is highest at both early and late maternal ages, while it is lowest at middle maternal ages. Specifically, the log hazard reaches its minimum at approximately 29 years of age, keeping all other covariates fixed. That means mothers aged around 29 years are more likely to have healthy babies.
c. Find overall survival function and curve.
d. Find the survival probabilities and curves for variables ‘Type of place of residence’,
‘Education’, and ‘Wealth Index’.
data1<-data.frame(AGE=rep(mean(AGE),2),AGESQ=rep(mean(AGESQ),2),
RELIGION=rep(mean(RELIGION),2), MEDIA=rep(mean(MEDIA),2), PLACE=c(0,1),
NGO=rep(mean(NGO),2), PRIMARY=rep(mean(PRIMARY),2),
SECONDAR=rep(mean(SECONDAR),2), HIGHER=rep(mean(HIGHER),2), POOR=rep(mean(POOR),2),
RICH=rep(mean(RICH),2))
sur.place<-survfit(coxph.model, newdata=data1)
summary(sur.place)
plot(sur.place, xlab="Time", ylab="Survival Probability", ylim=c(0.96,1), lty=1:2, col=c("black", "red"))
legend(6, 0.995, c("Rural", "Urban"), lty=1:2, col=c("black", "red"))
title("Survival curves for place of residence")
data2<-data.frame(AGE=rep(mean(AGE),4), AGESQ=rep(mean(AGESQ),4),
RELIGION=rep(mean(RELIGION),4), MEDIA=rep(mean(MEDIA),4), PLACE=rep(mean(PLACE),4),
NGO=rep(mean(NGO),4), PRIMARY=c(0,1,0,0), SECONDAR=c(0,0,1,0), HIGHER=c(0,0,0,1),
POOR=rep(mean(POOR),4),RICH=rep(mean(RICH),4))
sur.educa<-survfit(coxph.model, newdata=data2)
summary(sur.educa)
plot(sur.educa, xlab="Time", ylab="Survival Probability", ylim=c(0.92,1), lty=1:4, col=c("black", "red",
"blue", "green3"))
legend(6, 0.955, c("No Edu", "Primary", "Secondary", "Higher"), lty=1:4, col=c("black", "red", "blue",
"green3"))
title("Survival curves for education")
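# data3 is not shown in the notes; reconstructed here by analogy with data1 and data2
data3<-data.frame(AGE=rep(mean(AGE),3), AGESQ=rep(mean(AGESQ),3),
RELIGION=rep(mean(RELIGION),3), MEDIA=rep(mean(MEDIA),3), PLACE=rep(mean(PLACE),3),
NGO=rep(mean(NGO),3), PRIMARY=rep(mean(PRIMARY),3), SECONDAR=rep(mean(SECONDAR),3),
HIGHER=rep(mean(HIGHER),3), POOR=c(1,0,0), RICH=c(0,0,1))
sur.wealth<-survfit(coxph.model, newdata=data3)
summary(sur.wealth)
# Start of the plot call reconstructed from the earlier plots; ylim is left at its default
plot(sur.wealth, xlab="Time", ylab="Survival Probability", lty=1:3, col=c("black",
"red", "blue"))
legend(6, 0.97, c("Poor", "Middle", "Rich"), lty=1:3, col=c("black", "red", "blue"))
title("Survival curves for wealth index")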
Multiple Modes of Failure
Check Master’s Theory and Practical Lectures of Advanced Biostatistics.
a. Fit a GLMM taking into account the random effect of clusters and interpret the results.
b. Compute 95% confidence intervals for the odds ratios. Hence identify the potential
factors of receiving postnatal care for children.
Solution:
a.
names(data_pnc)
attach(data_pnc)
library(glmmTMB)
# (1|cluster) adds a random intercept for each cluster; zi=~0 specifies no zero-inflation component
model_rem1<-glmmTMB(pnc~factor(edu)+factor(media)+factor(w_index)+
factor(anc)+factor(ngo)+factor(resi)+factor(p_deli)+
factor(sex)+(1|cluster),zi=~0,family=binomial,data=data_pnc)
summary(model_rem1)
## Family: binomial ( logit )
## Formula:
## pnc ~ factor(edu) + factor(media) + factor(w_index) + factor(anc) +
## factor(ngo) + factor(resi) + factor(p_deli) + factor(sex) +
## (1 | cluster)
## Data: data_pnc
##
## AIC BIC logLik -2*log(L) df.resid
## 3790.1 3873.4 -1882.1 3764.1 4454
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## cluster (Intercept) 2.706 1.645
## Number of obs: 4467, groups: cluster, 593
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.30734 0.24750 -1.242 0.21432
## factor(edu)1 0.18566 0.14946 1.242 0.21415
## factor(edu)2 0.24569 0.15340 1.602 0.10924
## factor(edu)3 0.71341 0.25275 2.823 0.00476 **
## factor(media)1 0.37824 0.11955 3.164 0.00156 **
## factor(w_index)1 -0.14011 0.13796 -1.016 0.30982
## factor(w_index)2 0.11920 0.14974 0.796 0.42599
## factor(anc)1 0.58682 0.12282 4.778 1.77e-06 ***
## factor(ngo)1 -0.03300 0.11920 -0.277 0.78192
## factor(resi)1 -0.47236 0.19901 -2.374 0.01762 *
## factor(p_deli)1 3.48441 0.15670 22.236 < 2e-16 ***
## factor(sex)2 0.09783 0.09539 1.026 0.30509
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
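coef<-coef(summary(model_rem1))$cond[,1] # estimated log odds ratios
se<-coef(summary(model_rem1))$cond[,2] # standard errors of the log odds ratios
or<-exp(coef)[-1] # odds ratios, with the intercept dropped
se.or<-or*se[-1] # delta-method standard errors of the odds ratios
cbind(or,se.or)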
## or se.or
## factor(edu)1 1.2040181 0.1799544
## factor(edu)2 1.2785031 0.1961210
## factor(edu)3 2.0409342 0.5158523
## factor(media)1 1.4597085 0.1745142
## factor(w_index)1 0.8692644 0.1199197
## factor(w_index)2 1.1265970 0.1686947
## factor(anc)1 1.7982533 0.2208639
## factor(ngo)1 0.9675413 0.1153317
## factor(resi)1 0.6235307 0.1240889
## factor(p_deli)1 32.6031107 5.1089391
## factor(sex)2 1.1027794 0.1051972
Interpretation of Education:
Mothers with primary education have about 20% higher odds of receiving postnatal care within six
weeks of delivery than mothers with no education, but the difference is not statistically significant
(p-value 0.214 > 0.05). Mothers with higher education have more than double the odds (+104%) of
receiving postnatal care within six weeks of delivery compared with the no-education group, and the
p-value 0.0048 < 0.05 shows that this effect is statistically significant at the 5% level of significance.
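The percentages quoted for education come from exponentiating the corresponding coefficients in the model summary. A short check is sketched below, with the estimates typed in by hand from that output. The labels primary, secondary, and higher are assumed names for levels 1, 2, and 3 of edu.

```r
# Education coefficients (log odds ratios) copied from the summary of model_rem1;
# the level labels are assumed, since the coding of edu is not shown
b_edu <- c(primary = 0.18566, secondary = 0.24569, higher = 0.71341)

# Odds ratios and percentage change in odds relative to the no-education group
or_edu  <- exp(b_edu)
pct_edu <- (or_edu - 1) * 100
round(cbind(or_edu, pct_edu), 2)
```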
b.
lower.ci<-or-1.96*se.or
upper.ci<-or+1.96*se.or
cbind(lower.ci,upper.ci)
## lower.ci upper.ci
## factor(edu)1 0.8513075 1.5567287
## factor(edu)2 0.8941059 1.6629002
## factor(edu)3 1.0298636 3.0520048
## factor(media)1 1.1176607 1.8017563
## factor(w_index)1 0.6342218 1.1043070
## factor(w_index)2 0.7959555 1.4572385
## factor(anc)1 1.3653601 2.2311465
## factor(ngo)1 0.7414913 1.1935914
## factor(resi)1 0.3803165 0.8667449
## factor(p_deli)1 22.5895900 42.6166313
## factor(sex)2 0.8965928 1.3089659
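The intervals above are built directly on the odds-ratio scale. A common alternative, shown here only as a sketch, is to build the Wald interval on the log-odds scale and exponentiate the limits. This keeps the lower limit positive and makes the interval asymmetric around the odds ratio. The estimates and standard errors are copied from the model summary for three of the covariates. The object names est, se_est, lower.log, and upper.log are illustrative.

```r
# Estimates and standard errors copied from the summary of model_rem1
est    <- c(edu3 = 0.71341, resi1 = -0.47236, p_deli1 = 3.48441)
se_est <- c(edu3 = 0.25275, resi1 = 0.19901, p_deli1 = 0.15670)

# 95% Wald limits computed on the log-odds scale, then exponentiated
or_hat    <- exp(est)
lower.log <- exp(est - 1.96 * se_est)
upper.log <- exp(est + 1.96 * se_est)
round(cbind(or_hat, lower.log, upper.log), 3)
```

A covariate is flagged as a potential factor when its interval excludes 1. For these three covariates both constructions lead to the same conclusion.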
c.
\[ ICC = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} \]
where $\sigma_u^2$ is the variance of the cluster random intercept, which is 2.706 in our output,
and $\sigma_e^2$ is the residual variance of the model, which is $\pi^2/3$ for a binary response.
sig_clu<-2.706
sig_resi<-pi^2/3
icc<-sig_clu/(sig_clu+sig_resi)
icc
## [1] 0.4513108
About 45% of the total variation in whether children received postnatal care can be explained by
differences between clusters.
This indicates that cluster-level factors (e.g., local healthcare services, community characteristics) play
a strong role in postnatal care coverage.
You are given a data set on the birth weight of newborns (“birth_weight.sav”) in a country. Descriptions
of the variables are provided in the data set. Identify potential determinants of birth weight by fitting a
GLMM that takes into account the random effect of clusters, and interpret the results. Also, find the
intra-cluster correlation and interpret the result. Dataset link: Biostat Data
Solution:
attach(data)
library(glmmTMB)
model_rem2<-glmmTMB(birth_weight~factor(area)+factor(location)+factor(w_edu)+
factor(w_media)+factor(w_index)+factor(violence)+factor(b_order)+factor(anc)+
Age_year+Age_year_sqr+(1|cluster), zi=~0, family=gaussian, data=data)
summary(model_rem2)
## Family: gaussian ( identity )
## Formula:
## birth_weight ~ factor(area) + factor(location) + factor(w_edu) +
## factor(w_media) + factor(w_index) + factor(violence) + factor(b_order) +
## factor(anc) + Age_year + Age_year_sqr + (1 | cluster)
## Data: data
##
## AIC BIC logLik -2*log(L) df.resid
## 7601.6 7715.2 -3782.8 7565.6 4054
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## cluster (Intercept) 0.007425 0.08617
## Residual 0.368021 0.60665
## Number of obs: 4072, groups: cluster, 2185
##
## Dispersion estimate for gaussian family (sigma^2): 0.368
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.2755416 0.2202667 10.331 < 2e-16 ***
## factor(area)2 0.0717613 0.0237573 3.021 0.002523 **
## factor(location)1 0.1109202 0.0283502 3.912 9.13e-05 ***
## factor(location)2 0.1049401 0.0271760 3.861 0.000113 ***
## factor(location)3 -0.0483205 0.0276711 -1.746 0.080769 .
## factor(w_edu)1 0.0426592 0.0626226 0.681 0.495737
## factor(w_edu)2 0.0680711 0.0592783 1.148 0.250832
## factor(w_edu)3 0.0897342 0.0616742 1.455 0.145677
## factor(w_media)1 -0.0187051 0.0248048 -0.754 0.450793
## factor(w_index)1 0.0360164 0.0302632 1.190 0.234006
## factor(w_index)2 0.0611729 0.0282450 2.166 0.030327 *
## factor(violence)1 0.0241818 0.0249576 0.969 0.332587
## factor(b_order)1 -0.0655412 0.0263496 -2.487 0.012869 *
## factor(anc)1 -0.0249231 0.0203968 -1.222 0.221741
## Age_year 0.0372948 0.0157723 2.365 0.018051 *
## Age_year_sqr -0.0005638 0.0002746 -2.053 0.040064 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
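Because the model contains both Age_year and Age_year_sqr, the effect of maternal age on birth weight is quadratic. Assuming Age_year_sqr is the square of Age_year (the variable description is not reproduced here), the age at which the expected birth weight peaks can be sketched from the two estimates in the output:

```r
# Age coefficients copied from the summary of model_rem2
b_age   <-  0.0372948
b_agesq <- -0.0005638

# Vertex of the quadratic in age; it is a maximum because b_agesq < 0
age_peak <- -b_age / (2 * b_agesq)
age_peak
```

Under that assumption the expected birth weight rises with maternal age up to roughly 33 years and declines afterwards, keeping all other covariates fixed.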
\[ ICC = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} \]
From the output, the variance of the cluster random intercept is $\sigma_u^2 = 0.007425$
and the residual variance of the model is $\sigma_e^2 = 0.368021$.
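Plugging the two variance components into the ICC formula gives the intra-cluster correlation. The short sketch below follows the same steps used for the postnatal care model. The object names are illustrative.

```r
# Variance components copied from the random-effects part of the model_rem2 output
sig_clu  <- 0.007425   # between-cluster variance
sig_resi <- 0.368021   # residual variance

# Intra-cluster correlation
icc <- sig_clu / (sig_clu + sig_resi)
icc
```

The value is close to 0.02, so only about 2% of the total variation in birth weight is attributable to differences between clusters. Most of the variation lies between newborns within the same cluster.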
References
1. Practical and Theory Lectures of Fatima Tuz Zahura Mam
2. Survival Analysis: Techniques for Censored and Truncated Data by John P. Klein and
Melvin L. Moeschberger.
Three Weddings
Anyone who knew Abed in his youth would have told you that he
was destined to end up with a certain someone. But that someone
was not Haifa or Asmahan. It was a girl called Ghazl. They met in
the mid-1980s, when Anata was quiet and rural, more village than
town. Ghazl was a fourteen-year-old freshman at the Anata girls'
school. Abed was a senior at the boys' school across the street. Back
then, everyone knew each other in Anata. More than half the village
came from one of three large families all descended from the same
ancestor, a man named Alawi. Abed's family, the Salamas, was the
largest. Ghazl's, the Hamdans, was the second largest.