Advanced Biostatistics With R

Md. Mostakim

Session: 2022-23

Department of Statistics

University of Dhaka

Published Date:

1st Published: April, 2025

Acknowledgements:

I am deeply grateful to Fatima Tuz Zahura Mam for her exceptional guidance in teaching Advanced
Biostatistics. My sincere thanks also go to the seniors for their insightful lectures.

N.B. You may share this pdf book as much as you like but don’t use it for any unethical purpose.

For any kind of feedback, please contact [email protected]. Your feedback will be very inspiring for me.
Table of Contents

Accelerated Failure Time (AFT) Model
    Problem 1: Log-Normal AFT Model
    Problem 2: Log-Logistic AFT Model
    Problem 3: Weibull AFT Model
    Problem 4: Hypothesis Testing
    Problem 5: Weibull AFT Model with Real Life Data
    Problem 6: Log-Normal AFT Model with Real Life Data
    Problem 7: Log-Logistic AFT Model with Real Life Data

Model Based on Hazard Function
    Problem 8: Weibull Parametric PH Model
    Problem 9: Inter-relationship between Weibull PH and Weibull AFT Model
    Problem 10: Semi-Parametric PH Model (Cox-PH Model)

Multiple Modes of Failure

Generalized Linear Models

Generalized Linear Mixed Models
    Problem 11: Generalized Linear Mixed Model
    Problem 12: Generalized Linear Mixed Model

References

Two types of survival models are commonly used in practice:

i. Accelerated Failure Time (AFT) Model


ii. Model Based on Hazard Function

Accelerated Failure Time (AFT) Model


Problem 1: Log-Normal AFT Model
The time to relapse, in months, for patients on two treatments for lung cancer is compared using the
following accelerated failure time (AFT) regression model: 𝑌 = 𝑙𝑛𝑇 = 2.0 + 0.5𝑥 + 2.0𝑍, where
𝑍~𝑁(0,1) and

$$x = \begin{cases} 1, & \text{if treatment is A} \\ 0, & \text{if treatment is B.} \end{cases}$$

a. Compare the survival probabilities of the two treatments at 0, 0.5, 1, 2, 3, 4, 5, and 6 years.
b. Draw survival curves for treatment A and B in the same graph.
c. Find the first, second, and third quartile survival times for treatment A and B.
d. Which treatment is better and why?

Solution:

a. Compare the survival probabilities

Here, the random variable $Z \sim Normal(0,1)$, so the accelerated failure time (AFT) model is called the
log-normal AFT model.

The model is given by, 𝑌 = 𝑙𝑛𝑇 = 𝜇 + 𝑥 ′ 𝛽 + 𝜎 𝑍 = 2.0 + 0.5𝑥 + 2.0 𝑍

Therefore, location, 𝜇 = 2 ; scale, 𝜎 = 2 ; regression parameter, 𝛽 = 0.5 .

The linear predictor is defined by, 𝜂 = 𝑥′𝛽

For, treatment A, 𝜂 = 1 × 0.5 and for treatment B, 𝜂 = 0 × 0.5.

The probability that a lung cancer patient remains relapse-free beyond time $t$ is

$$S(t) = 1 - \Phi\left(\frac{\ln t - x'\beta - \mu}{\sigma}\right)$$

The $p$th quantile of the relapse time, denoted by $t_p$, is given by

$$t_p = \exp(\mu + x'\beta + \sigma Z_p)$$

where $Z_p$ is the $p$th quantile of the standard normal distribution.

#Defining time variable


t<-c(0,0.5,1,2,3,4,5,6)
time<-12*t #time in months

#defining parameters and linear predictor of Y=lnT


location<-2.0
scale<-2.0
beta<-0.5

linear.pred.TrA<-0.5*1
linear.pred.TrB<-0.5*0

#survival probabilities of treatment A and B


sur.pr.A<-1-pnorm((log(time)-linear.pred.TrA-location)/scale)
sur.pr.B<-1-pnorm((log(time)-linear.pred.TrB-location)/scale)
cbind(time,sur.pr.A,sur.pr.B)
time sur.pr.A sur.pr.B
[1,] 0 1.0000000 1.0000000
[2,] 6 0.6383756 0.5414630
[3,] 12 0.5030107 0.4042145
[4,] 24 0.3672947 0.2779216
[5,] 36 0.2939921 0.2142505
[6,] 48 0.2464825 0.1747395
[7,] 60 0.2126755 0.1475101
[8,] 72 0.1871808 0.1274907
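
As a cross-check on the table above, the same probabilities can be obtained from R's built-in log-normal distribution function. This is only a sketch: the helper name `surv_lnorm` is made up for illustration, and it assumes the parameter values stated in Problem 1.

```r
# Parameters of the AFT model for Y = ln T (values taken from Problem 1)
mu    <- 2.0   # location
sigma <- 2.0   # scale
beta  <- 0.5   # treatment effect

months <- 12 * c(0, 0.5, 1, 2, 3, 4, 5, 6)

# ln T ~ Normal(mu + beta * x, sigma^2), so T is log-normal and
# S(t) = P(T > t) is the upper tail of plnorm().
# surv_lnorm is a hypothetical helper, not part of the original notes.
surv_lnorm <- function(t, x) {
  plnorm(t, meanlog = mu + beta * x, sdlog = sigma, lower.tail = FALSE)
}

check <- cbind(months, A = surv_lnorm(months, 1), B = surv_lnorm(months, 0))
print(round(check, 4))
```

Working with `plnorm()` avoids taking `log(0)` by hand, since the function returns a survival probability of 1 at time 0 directly.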

Interpretation:

From the above table, it is found that:

• At 6 months, 63.84% of the lung cancer patients on treatment A are still relapse-free, compared with
54.15% of those on treatment B.

• At 1 year (12 months), 50.30% of the patients on treatment A are still relapse-free, compared with
40.42% of those on treatment B.

• At 2 years, 36.73% of the patients on treatment A are still relapse-free, compared with 27.79% of
those on treatment B.

• At 3 years, 29.40% of the patients on treatment A are still relapse-free, compared with 21.43% of
those on treatment B.

• At 4 years, 24.65% of the patients on treatment A are still relapse-free, compared with 17.47% of
those on treatment B.

• At 5 years, 21.27% of the patients on treatment A are still relapse-free, compared with 14.75% of
those on treatment B.

• At 6 years, 18.72% of the patients on treatment A are still relapse-free, compared with 12.75% of
those on treatment B.

b. Draw survival curves for treatment A and B in the same graph.

#survival curves of treatment A and B


plot(time,sur.pr.A, type='s', col='black')
lines(time,sur.pr.B, type='s', col='red')

legend(30, 0.8, c("Treatment A", "Treatment B"), lty=c(1,1), col=c("black", "red"))


title("Survival curves for treatment: Log Normal AFT")

Comment:

From the above figure, it is found that the survival probabilities over time are higher for patients taking
treatment A than for those taking treatment B. For example, under the Log Normal AFT model,
(1 − 0.6384) × 100% = 36.16% of the lung cancer patients taking treatment A relapse within 6 months,
whereas (1 − 0.5415) × 100% = 45.85% of those taking treatment B relapse within 6 months.

c. Find the first, second, and third quartile survival times for treatment A and B.

#Quartile times
z.first.quartile<-qnorm(0.25)
time.first.quartile.A<-exp(location+linear.pred.TrA+scale*z.first.quartile)
time.first.quartile.B<-exp(location+linear.pred.TrB+scale*z.first.quartile)

z.second.quartile<-qnorm(0.5)
time.second.quartile.A<-exp(location+linear.pred.TrA+scale*z.second.quartile)
time.second.quartile.B<-exp(location+linear.pred.TrB+scale*z.second.quartile)

z.third.quartile<-qnorm(0.75)

time.third.quartile.A<-exp(location+linear.pred.TrA+scale*z.third.quartile)
time.third.quartile.B<-exp(location+linear.pred.TrB+scale*z.third.quartile)

quartile.A<-cbind(time.first.quartile.A,time.second.quartile.A,time.third.quartile.A)
quartile.B<-cbind(time.first.quartile.B,time.second.quartile.B,time.third.quartile.B)

cbind(quartile.A, quartile.B)

     time.first.quartile.A time.second.quartile.A time.third.quartile.A
[1,]              3.161417               12.18249              46.94513
     time.first.quartile.B time.second.quartile.B time.third.quartile.B
[1,]              1.917497               7.389056              28.47366

Interpretation:

From the above table we can say that:

• Among the lung cancer patients taking treatment A, 25% relapse at or before 3.16 months, while the
corresponding time is 1.92 months for those taking treatment B.
• Among the lung cancer patients taking treatment A, 50% relapse at or before 12.18 months, while the
corresponding time is 7.39 months for those taking treatment B.
• Among the lung cancer patients taking treatment A, 75% relapse at or before 46.95 months, while the
corresponding time is 28.47 months for those taking treatment B.
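
The quartiles above can also be checked with R's log-normal quantile function. The sketch below assumes the Problem 1 parameters; the variable names `q_A` and `q_B` are illustrative only.

```r
# Quartile relapse times under the log-normal AFT model of Problem 1
p <- c(0.25, 0.50, 0.75)

# meanlog = mu + beta * x, sdlog = sigma (mu = 2, beta = 0.5, sigma = 2)
q_A <- qlnorm(p, meanlog = 2.0 + 0.5 * 1, sdlog = 2.0)  # treatment A (x = 1)
q_B <- qlnorm(p, meanlog = 2.0 + 0.5 * 0, sdlog = 2.0)  # treatment B (x = 0)

print(rbind(A = q_A, B = q_B))

# In an AFT model every quantile is stretched by the same factor exp(beta)
print(q_A / q_B)
```

The constant ratio of quantiles is the sense in which treatment A "accelerates" or "decelerates" time relative to treatment B.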

d. Which treatment is better and why?

The event of interest is relapse of lung cancer. The survival probabilities show that patients taking
treatment A are more likely to remain relapse-free at every time point than those taking treatment B.
The quartiles show that, under the Log Normal model, patients taking treatment A take longer to relapse
than those taking treatment B. Thus, patients with lung cancer should take treatment A, with the hope of
a longer relapse-free period and better health. Therefore, treatment A is better than treatment B.

Problem 2: Log-Logistic AFT Model


The time to relapse, in months, for patients on two treatments for lung cancer is compared using the
following accelerated failure time (AFT) regression model: 𝑌 = 𝑙𝑛𝑇 = 2.0 + 0.5𝑥 + 2.0𝑍, where
𝑍~𝐿𝑜𝑔𝑖𝑠𝑡𝑖𝑐(0,1) and

$$x = \begin{cases} 1, & \text{if treatment is A} \\ 0, & \text{if treatment is B.} \end{cases}$$

a. Compare the survival probabilities of the two treatments at 0, 0.5, 1, 2, 3, 4, 5, and 6 years.
b. Draw survival curves for treatment A and B in the same graph.
c. Find the first, second, and third quartile survival times for treatment A and B.
d. Which treatment is better and why?

Solution:

a.

Here, the random variable $Z \sim Logistic(0,1)$, so the accelerated failure time (AFT) model is called the
log-logistic AFT model.

The model is given by, 𝑌 = 𝑙𝑛𝑇 = 𝜇 + 𝑥 ′ 𝛽 + 𝜎 𝑍 = 2.0 + 0.5𝑥 + 2.0 𝑍

So, the parameters for $Y$ are: location, $\mu = 2$; scale, $\sigma = 2$; regression parameter, $\beta = 0.5$.

The corresponding parameters for $T$ are: scale, $\alpha = e^{\mu} = e^{2}$; shape, $\gamma = 1/\sigma = 1/2$; regression parameter, $\beta = 0.5$.

The linear predictor is defined by, 𝜂 = 𝑥′𝛽

For, treatment A, 𝜂 = 1 × 0.5 and for treatment B, 𝜂 = 0 × 0.5.

The probability that a lung cancer patient remains relapse-free beyond time $t$ is

$$S(t) = \left[1 + \left(\frac{t\,e^{-x'\beta}}{\alpha}\right)^{\gamma}\right]^{-1}$$

The $p$th quantile of the relapse time, denoted by $t_p$, is given by

$$t_p = \exp(\mu + x'\beta + \sigma Z_p)$$

where $Z_p$ is the $p$th quantile of the standard logistic distribution.

time<-c(0,0.5,1,2,3,4,5,6)
t<-12*time
#Given value
mu<-2
sigma<-2
beta<-0.5
#linear predictor
lin.pre.A<-beta*1
lin.pre.B<-beta*0
#parameter values for log logistic
location<-exp(mu) #scale parameter (alpha) of T
gamma<-1/sigma #shape parameter (gamma) of T
#survival probabilities
sur.p.A<-1/(1+((t*exp(-lin.pre.A))/location)^gamma)
sur.p.B<-1/(1+((t*exp(-lin.pre.B))/location)^gamma)
cbind(t,sur.p.A,sur.p.B)

t sur.p.A sur.p.B
[1,] 0 1.0000000 1.0000000
[2,] 6 0.5876164 0.5260066
[3,] 12 0.5018867 0.4396819
[4,] 24 0.4160459 0.3568582
[5,] 36 0.3677784 0.3117910
[6,] 48 0.3350125 0.2817899
[7,] 60 0.3106307 0.2597685
[8,] 72 0.2914539 0.2426265

The rest of the calculations and interpretations are the same as in Problem 1; try them yourself!
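
As a starting point for the remaining parts, the quartile formula for the log-logistic model can be sketched as below. This is one possible approach under the Problem 2 parameters, not the author's own solution; `surv_llogis` is a hypothetical helper name.

```r
# Log-logistic AFT model of Problem 2: ln T = mu + beta * x + sigma * Z, Z ~ Logistic(0, 1)
mu <- 2; sigma <- 2; beta <- 0.5

p   <- c(0.25, 0.50, 0.75)
z_p <- qlogis(p)                        # quantiles of the standard logistic distribution

t_A <- exp(mu + beta * 1 + sigma * z_p) # treatment A (x = 1)
t_B <- exp(mu + beta * 0 + sigma * z_p) # treatment B (x = 0)
print(rbind(A = t_A, B = t_B))

# Cross-check: S(t) = P(Z > (ln t - mu - beta * x) / sigma)
# surv_llogis is a hypothetical helper, not part of the original notes.
surv_llogis <- function(t, x) {
  plogis((log(t) - mu - beta * x) / sigma, lower.tail = FALSE)
}
print(surv_llogis(t_A, 1))              # should return 0.75, 0.50, 0.25
```

Because the standard logistic distribution has median 0, the median relapse time for each treatment is simply exp(mu + beta * x).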

Problem 3: Weibull AFT Model


The time to relapse, in months, for patients on two treatments for lung cancer is compared using the
following accelerated failure time (AFT) regression model: 𝑌 = 𝑙𝑛𝑇 = 2.0 + 0.5𝑥 + 2.0𝑍, where
𝑍~𝐸𝑥𝑡𝑟𝑒𝑚𝑒 𝑣𝑎𝑙𝑢𝑒 (0,1) and

$$x = \begin{cases} 1, & \text{if treatment is A} \\ 0, & \text{if treatment is B.} \end{cases}$$

a. Compare the survival probabilities of the two treatments at 0, 0.5, 1, 2, 3, 4, 5, and 6 years.
b. Draw survival curves for treatment A and B in the same graph.
c. Find the first, second, and third quartile survival times for treatment A and B.
d. Which treatment is better and why?

Solution:

a.

Here, the random variable $Z \sim Extreme\ value(0,1)$, so the accelerated failure time (AFT) model
is called the Weibull regression model.

The model is given by, 𝑌 = 𝑙𝑛𝑇 = 𝜇 + 𝑥 ′ 𝛽 + 𝜎 𝑍 = 2.0 + 0.5𝑥 + 2.0 𝑍

So, the parameters for $Y$ are: location, $\mu = 2$; scale, $\sigma = 2$; regression parameter, $\beta = 0.5$.

The corresponding parameters for $T$ are: scale, $\lambda = \exp(-\mu) = \exp(-2)$ (that is, $\mu = -\ln\lambda$); shape, $\alpha = 1/\sigma = 1/2$; regression parameter, $\beta = 0.5$.

The linear predictor is defined by, 𝜂 = 𝑥′𝛽

For, treatment A, 𝜂 = 1 × 0.5 and for treatment B, 𝜂 = 0 × 0.5.

The probability that a lung cancer patient remains relapse-free beyond time $t$ is

$$S(t) = \exp\left[-\{\lambda t \exp(-x'\beta)\}^{\alpha}\right]$$

The $p$th quantile of the relapse time, denoted by $t_p$, is given by

$$t_p = \exp(\mu + x'\beta + \sigma Z_p)$$

where $Z_p$ is the $p$th quantile of the standard extreme value distribution.

time<-c(0,0.5,1,2,3,4,5,6)
t<-12*time
#Given value
mu<-2
sigma<-2
beta<-0.5
#linear predictor
lin.pre.A<-beta*1
lin.pre.B<-beta*0
#Parameters for Weibull model
lambda<-exp(-mu) #scale parameter (lambda) of T
alpha<-1/sigma #shape parameter (alpha) of T
sur.p.A<-exp(-(lambda*t*exp(-lin.pre.A))^alpha)
sur.p.B<-exp(-(lambda*t*exp(-lin.pre.B))^alpha)
cbind(t,sur.p.A,sur.p.B)

t sur.p.A sur.p.B
[1,] 0 1.00000000 1.00000000
[2,] 6 0.49569693 0.40611581
[3,] 12 0.37065568 0.27960657
[4,] 24 0.24571545 0.16493005

[5,] 36 0.17924014 0.10999981
[6,] 48 0.13738563 0.07817983
[7,] 60 0.10868988 0.05786851
[8,] 72 0.08794235 0.04408831

The rest of the calculations and interpretations are the same as in Problems 1 and 2; try them yourself!
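
For the remaining parts, the quartiles of the Weibull model can be sketched as follows. The sketch assumes that the extreme value distribution in Problem 3 is the standard minimum extreme value distribution, whose survival function is exp(-exp(z)); this matches the survival formula used above. The variable names are illustrative only.

```r
# Weibull AFT model of Problem 3: ln T = mu + beta * x + sigma * Z
mu <- 2; sigma <- 2; beta <- 0.5

p   <- c(0.25, 0.50, 0.75)
z_p <- log(-log(1 - p))                 # quantiles of the standard (minimum) extreme value distribution

t_A <- exp(mu + beta * 1 + sigma * z_p) # treatment A (x = 1)
t_B <- exp(mu + beta * 0 + sigma * z_p) # treatment B (x = 0)
print(rbind(A = t_A, B = t_B))

# Cross-check with R's Weibull functions:
# shape = 1 / sigma, scale = exp(mu + beta * x) = 1 / (lambda * exp(-beta * x))
q_A <- qweibull(p, shape = 1 / sigma, scale = exp(mu + beta * 1))
s_A <- pweibull(6, shape = 1 / sigma, scale = exp(mu + beta * 1), lower.tail = FALSE)
print(q_A)
print(s_A)                              # survival probability of treatment A at 6 months
```

Note that R parameterises the Weibull distribution by a scale that is the reciprocal of the rate-type parameter lambda used in the notes.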

Problem 4: Hypothesis Testing


Let $x$ be a binary covariate taking the value 0 or 1. Assume that $\ln t_i \mid x_i$ follows a normal distribution
[or extreme value or logistic distribution] with location parameter $\beta_0 + \beta_1 x_i$ and scale parameter $b$. The
maximum likelihood estimate of $\theta = (\beta_0, \beta_1, \ln b)'$ and the corresponding variance-covariance
matrix are given below:

$$\hat{\theta} = \begin{pmatrix} 0.105 \\ 0.937 \\ 0.159 \end{pmatrix}, \qquad \hat{V}(\hat{\theta}) = \begin{bmatrix} 0.158 & -0.092 & 0.000 \\ -0.092 & 0.151 & 0.000 \\ 0.000 & 0.000 & 0.027 \end{bmatrix}$$

Using the above estimates, answer the following

a. Test the null hypotheses 𝐻0 : 𝛽1 = 0 and 𝐻0 : 𝑏 = 1.

b. Estimate the first quartile, median, and third quartile survival times when 𝑥 = 1 and also
find 95% confidence interval for these survival times.

c. Estimate the first quartile, median, and third quartile survival times when 𝑥 = 0 and also
find 95% confidence interval for these survival times.

Solution:

a. Test the null hypotheses 𝑯𝟎 : 𝜷𝟏 = 𝟎 and 𝑯𝟎 : 𝒃 = 𝟏.

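Given that,

$\hat{\beta}_0 = 0.105$, $Var(\hat{\beta}_0) = 0.158$,
$\hat{\beta}_1 = 0.937$, $Var(\hat{\beta}_1) = 0.151$,
$\ln\hat{b} = 0.159$, $Var(\ln\hat{b}) = 0.027$,
$Cov(\hat{\beta}_0, \hat{\beta}_1) = Cov(\hat{\beta}_1, \hat{\beta}_0) = -0.092$,
$Cov(\hat{\beta}_0, \ln\hat{b}) = Cov(\ln\hat{b}, \hat{\beta}_0) = 0.000$,
$Cov(\hat{\beta}_1, \ln\hat{b}) = Cov(\ln\hat{b}, \hat{\beta}_1) = 0.000$.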

Hypotheses to be tested:

i) $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$
ii) $H_0: b = 1$ vs $H_1: b \neq 1$

For i)
Test statistic:

$$w = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \sim N(0,1), \text{ under } H_0$$

where $SE(\hat{\beta}_1) = \sqrt{Var(\hat{\beta}_1)} = \sqrt{0.151}$.

beta1.hat<-.937
se.beta1.hat<-sqrt(.151)
wald.beta1<-beta1.hat/se.beta1.hat
pval.beta1<-2*(1-pnorm(abs(wald.beta1)))
pval.beta1
[1] 0.0158958

Here, the p-value is 0.0158958 < 0.05. Thus, null hypothesis 𝐻0 : 𝛽1 = 0 may be rejected at 5% level of
significance.

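For ii)
Test statistic:

$$w = \frac{\hat{b} - 1}{SE(\hat{b})} \sim N(0,1), \text{ under } H_0$$

where $\hat{b} = e^{\ln\hat{b}} = e^{0.159}$ and $SE(\hat{b}) = \sqrt{Var(\hat{b})}$.

Using the delta method,

$$Var(\hat{b}) = Var(e^{\ln\hat{b}}) \approx (e^{\ln\hat{b}})^2\, Var(\ln\hat{b}) = \hat{b}^2\, Var(\ln\hat{b})$$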

b.hat<-exp(.159)
var.b.hat<-(b.hat^2)*.027
se.b.hat<-sqrt(var.b.hat)
wald.b.hat<-(b.hat-1)/se.b.hat

pval.b.hat<-2*(1-pnorm(abs(wald.b.hat)))
pval.b.hat

[1] 0.3709819

Here, the p-value is 0.3709819 > 0.05. Thus, null hypothesis 𝐻0 : b = 1 may not be rejected at 5% level
of significance.
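
Since the model is estimated on the ln b scale, the hypothesis b = 1 can equivalently be written as ln b = 0 and tested without the delta method. The sketch below is an alternative to the approach in the notes, not a replacement for it; the variable names are illustrative, and the two Wald statistics need not agree exactly because a Wald test is not invariant to reparameterisation.

```r
# Wald test of H0: ln(b) = 0, which is equivalent to H0: b = 1
lnb.hat     <- 0.159   # estimate of ln(b) given in the problem
var.lnb.hat <- 0.027   # its estimated variance

wald.lnb <- (lnb.hat - 0) / sqrt(var.lnb.hat)
pval.lnb <- 2 * pnorm(abs(wald.lnb), lower.tail = FALSE)

print(wald.lnb)
print(pval.lnb)
```

Both versions lead to the same decision here: the p-value is well above 0.05, so the null hypothesis is not rejected.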

b. Estimate the first quartile, median, and third quartile survival times when 𝒙 = 𝟏 and also
find 95% confidence interval for these survival times.

Given,
$\ln t_i \mid x_i$ follows a normal distribution [or extreme value or logistic distribution] with location parameter
$\beta_0 + \beta_1 x_i$ and scale parameter $b$.
Thus, the AFT regression model is

$$\ln T = \beta_0 + \beta_1 x + bz$$

Then, the $p$-th quantile survival time is

$$t_p = e^{\beta_0 + \beta_1 x + b z_p}$$

Since $x = 1$,

$$t_p = e^{\beta_0 + \beta_1 + b z_p}$$

Here, $p = 0.25$ for the first quartile, $p = 0.50$ for the median, and $p = 0.75$ for the third quartile.
#Quartiles
z.25<-qnorm(.25)
t.25<-exp(.105+.937+b.hat*z.25)
z.5<-qnorm(.5)
t.5<-exp(.105+.937+b.hat*z.5)
z.75<-qnorm(.75)
t.75<-exp(.105+.937+b.hat*z.75)
quartiles<-cbind(t.25,t.5,t.75)
quartiles
t.25 t.5 t.75
[1,] 1.285657 2.834881 6.250928

When $x = 1$, 25% of observations have a survival time less than or equal to 1.285657, 50% of
observations have a survival time less than or equal to 2.834881, and 75% of observations have a
survival time less than or equal to 6.250928.

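95% confidence interval for the above survival times:

$$t_p \pm z_{1-\alpha/2}\sqrt{Var(t_p)}, \quad \alpha = 0.05 \qquad \text{(i)}$$

where,

$$\sqrt{Var(t_p)} = \sqrt{t_p^2\left(Var(\hat{\beta}_0) + 2x\,Cov(\hat{\beta}_0, \hat{\beta}_1) + x^2\,Var(\hat{\beta}_1) + z_p^2\,Var(\hat{b})\right)} \qquad \text{(ii)}$$

Since $x = 1$,

$$\sqrt{Var(t_p)} = \sqrt{t_p^2\left(Var(\hat{\beta}_0) + 2\,Cov(\hat{\beta}_0, \hat{\beta}_1) + Var(\hat{\beta}_1) + z_p^2\,Var(\hat{b})\right)}$$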

#confidence intervals
se.t.25<-sqrt((t.25^2)*(.158+.151+(z.25^2)*var.b.hat-2*.092))
CI.t25<-c(t.25-qnorm(.975)*se.t.25,t.25+qnorm(.975)*se.t.25)
CI.t25

[1] 0.3365032 2.2348113

se.t.5<-sqrt((t.5^2)*(.158+.151+(z.5^2)*var.b.hat-2*.092))
CI.t5<-c(t.5-qnorm(.975)*se.t.5,t.5+qnorm(.975)*se.t.5)
CI.t5
[1] 0.8704448 4.7993174

se.t.75<-sqrt((t.75^2)*(.158+.151+(z.75^2)*var.b.hat-2*.092))
CI.t75<-c(t.75-qnorm(.975)*se.t.75,t.75+qnorm(.975)*se.t.75)
CI.t75
[1] 1.636095 10.865761

When $x = 1$, the intervals (0.3365032, 2.2348113), (0.8704448, 4.7993174) and
(1.636095, 10.865761) contain the true values of the 1st, 2nd and 3rd quartile survival times of the
population, respectively, with 95% confidence.
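
The delta-method variance can also be written in matrix form as g'Vg, where g is the gradient of t_p with respect to (β0, β1, ln b). The sketch below assumes a normal error distribution, as in the code above; the function name `quantile_ci` is hypothetical.

```r
# Estimates and variance-covariance matrix given in Problem 4
theta.hat <- c(beta0 = 0.105, beta1 = 0.937, lnb = 0.159)
V <- matrix(c( 0.158, -0.092, 0.000,
              -0.092,  0.151, 0.000,
               0.000,  0.000, 0.027), nrow = 3, byrow = TRUE)

# quantile_ci is a hypothetical helper: estimate and Wald-type CI for the p-th quantile at covariate x
quantile_ci <- function(p, x, level = 0.95) {
  b  <- exp(theta.hat[["lnb"]])
  zp <- qnorm(p)                                 # assumes normally distributed errors
  tp <- exp(theta.hat[["beta0"]] + theta.hat[["beta1"]] * x + b * zp)
  grad <- tp * c(1, x, zp * b)                   # d t_p / d (beta0, beta1, ln b)
  se <- sqrt(drop(t(grad) %*% V %*% grad))       # delta-method standard error
  zc <- qnorm(1 - (1 - level) / 2)
  c(estimate = tp, lower = tp - zc * se, upper = tp + zc * se)
}

res <- t(sapply(c(0.25, 0.50, 0.75), quantile_ci, x = 1))
print(res)
```

Calling the same function with `x = 0` gives the intervals for the other group, and the matrix form extends directly to models with more covariates or non-zero covariances with ln b.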

c. Estimate the first quartile, median, and third quartile survival times when x=0 and also
find 95% confidence interval for these survival times.

When $x = 0$, the $p$th quantile survival time is

$$t_p = e^{\beta_0 + b z_p}$$
z.25<-qnorm(.25)
t.25<-exp(.105+b.hat*z.25)
z.5<-qnorm(.5)
t.5<-exp(.105+b.hat*z.5)
z.75<-qnorm(.75)
t.75<-exp(.105+b.hat*z.75)
quartiles<-cbind(t.25,t.5,t.75)
quartiles
t.25 t.5 t.75
[1,] 0.5037224 1.110711 2.449123

When $x = 0$, 25% of observations have a survival time less than or equal to 0.5037224,
50% of observations have a survival time less than or equal to 1.110711, and 75% of observations have a
survival time less than or equal to 2.449123.

Now, when $x = 0$, equation (ii) simplifies. The 95% confidence interval for the above survival times is

$$t_p \pm z_{1-\alpha/2}\sqrt{Var(t_p)}, \quad \alpha = 0.05$$

where,

$$\sqrt{Var(t_p)} = \sqrt{t_p^2\left(Var(\hat{\beta}_0) + z_p^2\,Var(\hat{b})\right)}$$

#confidence intervals
se.t.25<-sqrt((t.25^2)*(.158+(z.25^2)*var.b.hat))
CI.t25<-c(t.25-qnorm(.975)*se.t.25,t.25+qnorm(.975)*se.t.25)
CI.t25

[1] 0.09085392 0.91659091

se.t.5<-sqrt((t.5^2)*(.158+(z.5^2)*var.b.hat))
CI.t5<-c(t.5-qnorm(.975)*se.t.5,t.5+qnorm(.975)*se.t.5)
CI.t5

[1] 0.245389 1.976032

se.t.75<-sqrt((t.75^2)*(.158+(z.75^2)*var.b.hat))
CI.t75<-c(t.75-qnorm(.975)*se.t.75,t.75+qnorm(.975)*se.t.75)
CI.t75

[1] 0.4417362 4.4565095

When $x = 0$, the intervals (0.09085392, 0.91659091), (0.245389, 1.976032) and
(0.4417362, 4.4565095) contain the true values of the 1st, 2nd and 3rd quartile survival times of the
population, respectively, with 95% confidence.

Problem 5: Weibull AFT Model with Real Life Data


The SPSS file “infant_sur.dat” is the data set on the infant mortality of Bangladesh with some selected
variables. The variable ‘TIME’ measures the time to death in months and the variable ‘CHSURV’
indicates whether the observation is censored or not. Dataset link: Biostat Data
Using the data set,

a. Fit a Weibull AFT regression model. Identify the potential risk factors associated with infant
mortality and interpret the results.
b. Find the overall survival probabilities and curve.
c. Find the survival probabilities and curves for the variable ‘Wealth Index’.

Solution:

a.

# Read the data (the file path must stay on a single line)
inf.data<-read.table("F:/Mostakim/5th Masters/Stat MS-508; Data Analysis 3 Biostat and Econometrics/Practical_AFT/infant_sur.dat", header=T)

dim(inf.data)
names(inf.data)
attach(inf.data)
# Children who did not survive at least 1 month (TIME=0) are assigned TIME=0.5,
# because log(0) is -infinity
TIME<-ifelse(TIME==0, 0.5, TIME)
# Load the survival package
library(survival)
aft.weibull<-survreg(Surv(TIME, CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="weibull")
summary(aft.weibull)

Call:
survreg(formula = Surv(TIME, CHSURV) ~ AGE + AGESQ + RELIGION +
MEDIA + PLACE + NGO + PRIMARY + SECONDAR + HIGHER + POOR +
RICH, dist = "weibull")
Value Std. Error z p
(Intercept) 5.18164 1.87120 2.77 0.00562
AGE 0.33773 0.11104 3.04 0.00235
AGESQ -0.00565 0.00163 -3.46 0.00054
RELIGION -0.23318 0.48098 -0.48 0.62781
MEDIA -0.11664 0.31613 -0.37 0.71215
PLACE 0.40775 0.31807 1.28 0.19985
NGO 0.51291 0.28782 1.78 0.07474
PRIMARY 0.61883 0.32604 1.90 0.05770
SECONDAR 1.45686 0.41865 3.48 0.00050
HIGHER 3.99143 1.13256 3.52 0.00042
POOR 0.26368 0.36947 0.71 0.47542
RICH 0.12798 0.38874 0.33 0.74199
Log(scale) 0.86594 0.05593 15.48 < 2e-16
Scale= 2.38

Weibull distribution
Loglik(model)= -1938.3 Loglik(intercept only)= -1972.5
Chisq= 68.47 on 11 degrees of freedom, p= 2.4e-10
Number of Newton-Raphson Iterations: 10
n= 9845

exp(aft.weibull$coef) # Exponentiated coefficients: ratios of survival time (not odds ratios)

(Intercept)         AGE       AGESQ    RELIGION       MEDIA       PLACE
177.9745550   1.4017627   0.9943621   0.7920080   0.8899057   1.5034380
        NGO     PRIMARY    SECONDAR      HIGHER        POOR        RICH
  1.6701395   1.8567507   4.2924418  54.1322478   1.3017139   1.1365313

Output:

Covariate       Regression    Standard   p-value   exp(β) (Ratio of mean/quantile
                Coefficient   Error                survival time)

Intercept          5.1816      1.8712     0.0056    177.9746
Age                0.3377      0.1110     0.0024      1.4018
Age Square        -0.0056      0.0016     0.0005      0.9944
Religion          -0.2332      0.4810     0.6278      0.7920
Media             -0.1166      0.3161     0.7121      0.8899
Place              0.4078      0.3181     0.1999      1.5034
NGO                0.5129      0.2878     0.0747      1.6701
Education
  Primary          0.6188      0.3260     0.0577      1.8568
  Secondary        1.4569      0.4187     0.0005      4.2924
  Higher           3.9914      1.1326     0.0004     54.1322
Wealth Index
  Poor             0.2637      0.3695     0.4754      1.3017
  Rich             0.1280      0.3887     0.7420      1.1365

Hypothesis:

$H_0$: No covariate has an effect on the survival time

vs $H_1$: At least one covariate has an effect on the survival time

Since the p-value of the overall Weibull survival regression model is $2.4 \times 10^{-10} < \alpha = 0.05$, we
may reject the null hypothesis at the 5% level of significance. That is, at least one covariate has an effect on
the survival time.
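
The overall test reported by `survreg` is a likelihood ratio test that compares the fitted model with the intercept-only model. A minimal sketch of the arithmetic, using only the rounded values printed in the summary output (so the statistic differs slightly from the reported 68.47):

```r
# Log-likelihoods as printed in the summary output (rounded to one decimal)
loglik.model <- -1938.3
loglik.null  <- -1972.5

# Likelihood ratio statistic: 2 * (logLik of model - logLik of intercept-only model)
lr.stat <- 2 * (loglik.model - loglik.null)
print(lr.stat)   # close to the reported Chisq = 68.47

# p-value from the chi-square distribution with 11 degrees of freedom (one per covariate),
# evaluated at the statistic reported in the output
pval <- pchisq(68.47, df = 11, lower.tail = FALSE)
print(pval)
```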

From the p-values we can identify the potential risk factors associated with infant mortality. Age and
education (secondary and higher levels) are significant at the 5% level of significance, whereas NGO
membership and primary education are significant only at the 10% level of significance.

Interpretation: Interpretation has been provided only for covariates which are significant.

• Regression Coefficient for Age

Since the covariate age has a quadratic effect, its interpretation is slightly different. Using the rounded
coefficients,

$$\ln T = \hat{\mu} + 0.337\,Age - 0.006\,Age^2$$

The age at which the mean survival time is maximum satisfies $\partial \ln T / \partial Age = 0$:

$$0.337 - 2 \times 0.006\,Age = 0, \quad \text{or } Age \approx 28 \text{ years}$$

Since the coefficient of $Age^2$ is negative, the graph is an inverted U:

[Figure: inverted-U curve of lnT, with the maximum marked at 28 years]

Keeping all other covariates at a fixed level, the log of survival time first increases with age, reaches
its maximum at about 28 years, and then decreases as age increases further.
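
The turning point is sensitive to rounding of the Age Square coefficient. As a check, the sketch below recomputes it from the unrounded estimates printed in the summary output (0.33773 and -0.00565); the variable names are illustrative only.

```r
# Unrounded coefficients from the survreg summary output
b.age   <- 0.33773
b.agesq <- -0.00565

# For ln T = mu + b.age * Age + b.agesq * Age^2, the maximum is at Age = -b.age / (2 * b.agesq)
age.max <- -b.age / (2 * b.agesq)
print(age.max)
```

With these values the maximum falls at roughly 30 years rather than 28, so the exact figure should be read as approximate; the inverted-U shape is unaffected.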

• Exponentiated Regression Coefficient for NGO = 1.67
Therefore, for mothers who are members of an NGO compared to mothers with no NGO membership,
the mean survival time of the child increases by (1.67 − 1) × 100% = 67%, keeping all other
covariates at a fixed level.
• Exponentiated Regression Coefficient for Primary = 1.8568
Therefore, for mothers who have primary level education compared to mothers having no
education, the mean survival time of the child increases by (1.8568 − 1) × 100% = 85.68%, keeping
all other covariates at a fixed level.
• Exponentiated Regression Coefficient for Secondary = 4.2924
Therefore, for mothers who have secondary level education compared to mothers having no
education, the mean survival time of the child increases by (4.2924 − 1) × 100% = 329.24%, keeping
all other covariates at a fixed level.
• Exponentiated Regression Coefficient for Higher = 54.1322
Therefore, for mothers who have higher level education compared to mothers having no
education, the mean survival time of the child increases by (54.1322 − 1) × 100% = 5313.22%,
keeping all other covariates at a fixed level.
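
To convey the uncertainty in these ratios, approximate 95% confidence intervals can be obtained by exponentiating the Wald interval of each coefficient. The sketch below uses only the estimates and standard errors printed in the summary output; the object names are illustrative.

```r
# Coefficients and standard errors copied from the survreg summary output
est <- c(NGO = 0.51291, PRIMARY = 0.61883, SECONDAR = 1.45686, HIGHER = 3.99143)
se  <- c(NGO = 0.28782, PRIMARY = 0.32604, SECONDAR = 0.41865, HIGHER = 1.13256)

z <- qnorm(0.975)

# Ratio of survival times with a Wald-type 95% confidence interval
tr <- cbind(ratio = exp(est),
            lower = exp(est - z * se),
            upper = exp(est + z * se))
print(round(tr, 3))
```

An interval that includes 1 corresponds to a covariate that is not significant at the 5% level, which is the case for NGO and Primary here. The interval for Higher is very wide, so the estimated ratio of about 54 should be interpreted cautiously.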

b. Find the overall survival probabilities and curve.

# Scale parameter of the baseline Weibull distribution
lambda<-exp(-aft.weibull$coef[1])
cat("Scale:", lambda, "\n")
# Shape parameter of the baseline Weibull distribution
alpha<-1/aft.weibull$scale
cat("Shape parameter:", alpha, "\n")
# Overall survival probability (At mean values of covariates)

cov<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),


mean(PRIMARY), mean(SECONDAR), mean(HIGHER), mean(POOR), mean(RICH))

acf.wei<-exp(sum(cov*aft.weibull$coef[2:12]))
#Defining the time variable
time1<-TIME[CHSURV==1]
time2<-sort(time1)
time3<-unique(time2)
time<-c(0,time3)
sur.over<-exp(-(lambda*time/acf.wei)^alpha) #Overall survival probability
cbind(time, sur.over)

time sur.over
[1,] 0.0 1.0000000
[2,] 0.5 0.9922538
[3,] 1.0 0.9896450
[4,] 2.0 0.9861639
[5,] 3.0 0.9836120
[6,] 4.0 0.9815234
[7,] 5.0 0.9797236
[8,] 6.0 0.9781251
[9,] 7.0 0.9766769
[10,] 8.0 0.9753461
[11,] 9.0 0.9741101
[12,] 10.0 0.9729529
[13,] 11.0 0.9718622
[14,] 12.0 0.9708287

##Survival Curve
plot(time, sur.over, xlab="Survival Time", ylab="Survival Probabilities", ylim=c(0.96,1.0), type='s',
cex.lab=0.8)
title("Figure 1: Overall Survival Curve for Weibull Regression Model", cex.main=0.8)

The overall survival probabilities are quite high: they lie between 0.97 and 1. Over the span of one
year, the survival probabilities decline gradually, and the decline is not steep.

c. Find the survival probabilities and curves for the variable ‘Wealth Index’.

#Survival probability: Wealth Index


cov.poor<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 1,0)
cov.mid<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 0,0)
cov.rich<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 0,1)

acf.wei.poor<-exp(sum(cov.poor*aft.weibull$coef[2:12]))
acf.wei.mid<-exp(sum(cov.mid*aft.weibull$coef[2:12]))
acf.wei.rich<-exp(sum(cov.rich*aft.weibull$coef[2:12]))

sur.poor<-exp(-(lambda*time/acf.wei.poor)**alpha)
sur.mid<-exp(-(lambda*time/acf.wei.mid)**alpha)
sur.rich<-exp(-(lambda*time/acf.wei.rich)**alpha)
cbind(time, sur.poor, sur.mid, sur.rich)

time sur.poor sur.mid sur.rich


[1,] 0.0 1.0000000 1.0000000 1.0000000
[2,] 0.5 0.9926100 0.9917467 0.9921776
[3,] 1.0 0.9901207 0.9889682 0.9895433
[4,] 2.0 0.9867983 0.9852612 0.9860282
[5,] 3.0 0.9843625 0.9825442 0.9834515

[6,] 4.0 0.9823687 0.9803209 0.9813427
[7,] 5.0 0.9806504 0.9784053 0.9795255
[8,] 6.0 0.9791242 0.9767042 0.9779116
[9,] 7.0 0.9777414 0.9751630 0.9764493
[10,] 8.0 0.9764706 0.9737470 0.9751057
[11,] 9.0 0.9752903 0.9724321 0.9738579
[12,] 10.0 0.9741851 0.9712009 0.9726895
[13,] 11.0 0.9731434 0.9700407 0.9715884
[14,] 12.0 0.9721563 0.9689413 0.9705449

#Survival Curves
plot(time, sur.mid, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(0.95 ,1.0),
type='s', cex.lab=0.8)
lines(time, sur.poor, type='s', col="red")
lines(time, sur.rich, type='s', col="blue")
legend(0.75,.965, c("Middle", "Poor", "Rich"), lty=c(1,1,1), col=c("black", "red", "blue"), cex=0.8)
title("Figure 2: Survival Curves for Wealth Index: Weibull Regression Model", cex.main=0.8)

Comment: Children born to mothers who are poor have the highest probability of survival during
infancy and those born to mothers who are middle class have the least probability of survival during
infancy as observed from the survival curves.

Problem 6: Log-Normal AFT Model with Real Life Data
The SPSS file “infant_sur.dat” is the data set on the infant mortality of Bangladesh with some selected
variables. The variable ‘TIME’ measures the time to death in months and the variable ‘CHSURV’
indicates whether the observation is censored or not. Dataset link: Biostat Data
Using the data set,

a. Fit a Log-Normal AFT regression model. Identify the potential risk factors associated with
infant mortality and interpret the results.
b. Find the overall survival probabilities and curve.
c. Find the survival probabilities and curves for the variables ‘Place of Residence’, ‘Education’,
‘Wealth Index’.

Solution:

The relevant R code is provided. Try producing the outputs and interpretations yourself!

# AFT Model: Log-Normal


#a.

# Read the data (the file path must stay on a single line)
inf.data<-read.table("F:/Mostakim/5th Masters/Stat MS-508; Data Analysis 3 Biostat and Econometrics/Practical_AFT/infant_sur.dat", header=T)
attach(inf.data)
# Children who did not survive at least 1 month (TIME=0) are assigned TIME=0.5,
# because log(0) is -infinity
TIME<-ifelse(TIME==0, 0.5, TIME)
# Load the survival package
library(survival)
aft.lognorm<-survreg(Surv(TIME, CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="lognormal")
summary(aft.lognorm)
exp(aft.lognorm$coef)

#b.
#Overall Survival Probability
#Mean of the baseline Log-Normal distribution
mu<-aft.lognorm$coef[1]
cat("Mean:", "\n")
mu

#Scale parameter of baseline Log-Normal distribution
alpha<-aft.lognorm$scale
cat("Scale:", "\n")
alpha
#Overall survival probability (At mean values of covariates)
cov<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), mean(POOR), mean(RICH))
acf.lognorm<-sum(cov*aft.lognorm$coef[2:12])
time1<-TIME[CHSURV==1]
time2<-sort(time1)
time3<-unique(time2)
time<-c(0,time3)
sur.over.ln<-1-pnorm((log(time)-acf.lognorm-mu)/alpha)
cbind(time,sur.over.ln)
#Survival Curve
plot(time, sur.over.ln, xlab="Survival Time", ylab="Survival Probabilities", ylim=c(0.96,1.0), type='s',
cex.lab=0.8)
title("Figure 1: Overall Survival Curve for Log-Normal Regression Model", cex.main=0.8)

#c.
# Survival probability: Place of Residence
cov.u<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), 1, mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), mean(POOR), mean(RICH))
cov.r<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), 0, mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), mean(POOR), mean(RICH))
acf.ln.u<-sum(cov.u*aft.lognorm$coef[2:12])
acf.ln.r<-sum(cov.r*aft.lognorm$coef[2:12])
sur.ur<-1-pnorm((log(time)-acf.ln.u-mu)/alpha)
sur.ru<-1-pnorm((log(time)-acf.ln.r-mu)/alpha)
#Survival curves: Urban versus Rural
plot(time, sur.ur, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(0.95 ,1.0),
type='s', cex.lab=0.8)
lines(time, sur.ru, type='s', col="red")
legend(6,.99, c("Urban", "Rural"), lty=c(1,1), col=c("black", "red"), cex=0.8)

title("Figure 2: Survival Curves for Place of Residence: Log-Normal Regression Model", cex.main=0.8)
#Survival probability: Education
cov.n<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
0,0,0, mean(POOR), mean(RICH))
cov.p<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
1,0,0, mean(POOR), mean(RICH))
cov.s<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
0,1,0, mean(POOR), mean(RICH))
cov.h<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
0,0,1, mean(POOR), mean(RICH))
acf.ln.n<-sum(cov.n*aft.lognorm$coef[2:12])
acf.ln.p<-sum(cov.p*aft.lognorm$coef[2:12])
acf.ln.s<-sum(cov.s*aft.lognorm$coef[2:12])
acf.ln.h<-sum(cov.h*aft.lognorm$coef[2:12])
sur.ne<-1-pnorm((log(time)-acf.ln.n-mu)/alpha)
sur.pe<-1-pnorm((log(time)-acf.ln.p-mu)/alpha)
sur.se<-1-pnorm((log(time)-acf.ln.s-mu)/alpha)
sur.he<-1-pnorm((log(time)-acf.ln.h-mu)/alpha)
#Survival Curves
plot(time, sur.ne, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(0.95 ,1.0),
type='s', cex.lab=0.8)
lines(time, sur.pe, type='s', col="red")
lines(time, sur.se, type='s', col="blue")
lines(time, sur.he, type='s', col="green3")
legend(0.75,.965, c("No Education", "Primary", "Secondary", "Higher"), lty=c(1,1,1,1), col=c("black",
"red", "blue", "green3"), cex=0.8)
title("Figure 3: Survival Curves for Education: Log-Normal Regression Model", cex.main=0.8)
#Survival probability: Wealth Index
cov.poor<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 1,0)
cov.mid<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 0,0)
cov.rich<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE), mean(NGO),
mean(PRIMARY), mean(SECONDAR), mean(HIGHER), 0,1)

acf.ln.poor<-sum(cov.poor*aft.lognorm$coef[2:12])
acf.ln.mid<-sum(cov.mid*aft.lognorm$coef[2:12])
acf.ln.rich<-sum(cov.rich*aft.lognorm$coef[2:12])
sur.poor<-1-pnorm((log(time)-acf.ln.poor-mu)/alpha)
sur.mid<-1-pnorm((log(time)-acf.ln.mid-mu)/alpha)
sur.rich<-1-pnorm((log(time)-acf.ln.rich-mu)/alpha)
# Survival Curves
plot(time, sur.mid, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(0.95 ,1.0),
type='s', cex.lab=0.8)
lines(time, sur.poor, type='s', col="red")
lines(time, sur.rich, type='s', col="blue")
legend(0.75,.965, c("Middle", "Poor", "Rich"), lty=c(1,1,1), col=c("black", "red", "blue"), cex=0.8)
title("Figure 4: Survival Curves for Wealth Index: Log-Normal Regression Model", cex.main=0.8)
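
All of the survival probabilities in this problem come from the same log-normal formula, S(t | x) = 1 − Φ((log t − μ − x'β)/σ). As a minimal sketch, that formula can be wrapped in one helper. The function name surv_lognormal and the numeric inputs below are hypothetical illustrations, not estimates from infant_sur.dat.

```r
# Survival function of a log-normal AFT model (hypothetical helper).
# lp    = intercept + sum(covariate value * coefficient)
# sigma = scale reported by survreg
surv_lognormal <- function(time, lp, sigma) {
  1 - pnorm((log(time) - lp) / sigma)
}

# Illustrative values only (not fitted values)
time  <- c(0, 0.5, 1:12)
s_hat <- surv_lognormal(time, lp = 9, sigma = 3)
cbind(time, s_hat)
```

With the fitted model, lp would be mu + acf.lognorm (or the group-specific sums) and sigma would be alpha, which reproduces the curves computed step by step above.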

Problem 7: Log-Logistic AFT Model with Real Life Data


The SPSS file “infant_sur.dat” is the data set on the infant mortality of Bangladesh with some selected
variables. The variable ‘TIME’ measures the time to death in months and the variable ‘CHSURV’
indicates whether the observation is censored or not. Dataset link: Biostat Data
Using the data set,

a. Fit a Log-Logistic AFT regression model. Identify the potential risk factors associated with
infant mortality and interpret the results.
b. Estimate the odds ratio of infant mortality prior to a specified time point for NGO members and
interpret the results.
c. Find the survival probabilities and curves for the variable ‘Wealth Index’.

Solution:

a.

# Load the library file


library(survival)
# Read the data
inf.data<-read.table("F:/Mostakim/5th Masters/Stat MS-508; Data Analysis 3 Biostat and Econometrics/Practical_AFT/infant_sur.dat", header=T)
attach(inf.data)
# Children who did not survive to at least 1 month (TIME=0) are assigned TIME=0.5,
# because log(0) = -Inf cannot be used as a log survival time
TIME<-ifelse(TIME==0, 0.5, TIME)
# AFT Model: Log-Logistic
aft.loglogist<-survreg(Surv(TIME,
CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="loglogistic")
summary(aft.loglogist)

Call:
survreg(formula = Surv(TIME, CHSURV) ~ AGE + AGESQ + RELIGION +
MEDIA + PLACE + NGO + PRIMARY + SECONDAR + HIGHER + POOR +
RICH, dist = "loglogistic")
Value Std. Error z p
(Intercept) 4.93053 1.88492 2.62 0.00890
AGE 0.34469 0.11194 3.08 0.00208
AGESQ -0.00577 0.00165 -3.50 0.00047
RELIGION -0.24180 0.48371 -0.50 0.61716
MEDIA -0.11245 0.31840 -0.35 0.72397
PLACE 0.40632 0.32005 1.27 0.20424
NGO 0.51006 0.28947 1.76 0.07805
PRIMARY 0.62718 0.32834 1.91 0.05611
SECONDAR 1.46941 0.41997 3.50 0.00047
HIGHER 3.98728 1.12373 3.55 0.00039
POOR 0.26597 0.37275 0.71 0.47551
RICH 0.12751 0.39157 0.33 0.74470
Log(scale) 0.85313 0.05556 15.36 < 2e-16

Scale= 2.35

Log logistic distribution


Loglik(model)= -1937.2 Loglik(intercept only)= -1971.6
Chisq= 68.8 on 11 degrees of freedom, p= 2.1e-10
Number of Newton-Raphson Iterations: 8
n= 9845

exp(aft.loglogist$coef)

(Intercept) AGE AGESQ RELIGION MEDIA PLACE


138.4533473 1.4115567 0.9942504 0.7852166 0.8936461 1.5012870
NGO PRIMARY SECONDAR HIGHER POOR RICH
1.6653974 1.8723168 4.3466686 53.9082094 1.3047007 1.1359944

Try reproducing the output and interpreting it yourself!

b.

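For a specific covariate $x_j$, keeping all other covariates at a fixed level, the odds ratio (OR) of death prior to a specified time point becomes:

$$OR = \left[\exp\{-\beta_j (x_{1j} - x_{2j})\}\right]^{\gamma}$$

For the log-logistic distribution, the parameters are:

location, $\alpha = e^{\mu} = e^{\text{coefficient}}$ (the exponentiated intercept)

shape, $\gamma = \frac{1}{\sigma} = \frac{1}{\text{scale}}$ (the reciprocal of the scale reported by survreg)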

alpha<-exp(aft.loglogist$coefficient[1])
gamma<-1/aft.loglogist$scale
exp(-aft.loglogist$coef["NGO"])^gamma

NGO
0.8046661

Interpretation: The odds of infant death prior to any specified time point are 19.5% lower for mothers who are NGO members than for non-member mothers, keeping all other covariates at a fixed level.
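
The same odds ratio can be recovered from the printed summary alone. The sketch below copies the NGO coefficient and Log(scale) from the output above; the variable names are illustrative.

```r
# Values copied from the summary(aft.loglogist) output above
beta_ngo  <- 0.51006      # AFT coefficient of NGO
log_scale <- 0.85313      # Log(scale)

gamma      <- 1 / exp(log_scale)        # gamma = 1 / scale
or_ngo     <- exp(-beta_ngo * gamma)    # identical to exp(-beta)^gamma
pct_change <- 100 * (or_ngo - 1)        # percentage change in the odds

round(c(OR = or_ngo, percent = pct_change), 3)
```

A negative percentage change means lower odds of death before time t for NGO members.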

c.

Try yourself!
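
As a starting point for part c, here is a minimal sketch of the log-logistic survival function, S(t | x) = 1 / (1 + (t·exp(−μ − x'β))^γ), evaluated for the three wealth groups. The coefficients are copied from the printed summary. The reference profile x_ref is a hypothetical mother chosen only for illustration; the approach used elsewhere in this document would put the sample means there instead.

```r
# Log-logistic AFT survival function; equivalent to 1 / (1 + (t * exp(-lp))^(1/scale))
surv_loglogistic <- function(time, lp, scale) {
  1 - plogis((log(time) - lp) / scale)
}

# Estimates copied from summary(aft.loglogist)
b0 <- 4.93053
b  <- c(AGE = 0.34469, AGESQ = -0.00577, RELIGION = -0.24180, MEDIA = -0.11245,
        PLACE = 0.40632, NGO = 0.51006, PRIMARY = 0.62718, SECONDAR = 1.46941,
        HIGHER = 3.98728, POOR = 0.26597, RICH = 0.12751)
sc <- 2.35

# Hypothetical reference profile (assumption, not taken from the data)
x_ref <- c(AGE = 30, AGESQ = 900, RELIGION = 1, MEDIA = 1, PLACE = 0, NGO = 0,
           PRIMARY = 1, SECONDAR = 0, HIGHER = 0, POOR = 0, RICH = 0)

# Linear predictor for a given wealth group
lp_for <- function(poor, rich) {
  x <- x_ref
  x["POOR"] <- poor
  x["RICH"] <- rich
  b0 + sum(b * x[names(b)])
}

time <- c(0, 0.5, 1:12)
sur <- cbind(time,
             poor   = surv_loglogistic(time, lp_for(1, 0), sc),
             middle = surv_loglogistic(time, lp_for(0, 0), sc),
             rich   = surv_loglogistic(time, lp_for(0, 1), sc))
round(sur, 4)
```

The three columns can then be drawn with plot() and lines() in the same way as the log-normal curves in Problem 6.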

Model Based on Hazard Function


One can assess the influence of a set of covariates on survival time by constructing regression model
using hazard function. Two types of such survival regression models are available in literature. These
are:

i. Multiplicative hazard model


ii. Additive hazard model

Multiplicative hazard models are the most popular because of their computational simplicity and ease of interpretation. The practical work here is based on the multiplicative hazard model, which is also known as the proportional hazards (PH) model.

In the presence of covariates x, the multiplicative hazard (PH) model is given by:

$$h(t) = h_0(t)\exp(x'\beta)$$

where $h(t)$ and $h_0(t)$ are the hazard functions in the presence and absence of covariates, respectively.

• When the baseline hazard function is defined parametrically, the PH model is known as
parametric PH model.
• When the baseline hazard function is left as an arbitrary function, the PH model is known as
semi parametric PH model.
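
The defining feature of the PH model is that the hazard ratio between two covariate profiles does not depend on time. A small sketch with a Weibull baseline illustrates this; all parameter values below are hypothetical.

```r
# Weibull baseline hazard: h0(t) = (shape / scale) * (t / scale)^(shape - 1)
h0 <- function(t, shape, scale) (shape / scale) * (t / scale)^(shape - 1)

# PH hazard: h(t | x) = h0(t) * exp(x'beta)
h_ph <- function(t, x, beta, shape, scale) {
  h0(t, shape, scale) * exp(sum(x * beta))
}

beta <- c(-0.6, 0.3)   # hypothetical coefficients
x0   <- c(0, 1)        # reference profile
x1   <- c(1, 1)        # profile differing in the first covariate only
t    <- c(1, 3, 6, 12)

# Hazard ratio at each time point
hr <- sapply(t, function(u) {
  h_ph(u, x1, beta, shape = 0.5, scale = 100) /
    h_ph(u, x0, beta, shape = 0.5, scale = 100)
})
hr
exp(sum((x1 - x0) * beta))   # the same value at every time point
```

The baseline hazard cancels in the ratio, which is why exp(coefficient) can be read as a hazard ratio regardless of whether the baseline is parametric or left unspecified.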

Problem 8: Weibull Parametric PH Model


Using the data set “infant_sur.dat”. Dataset link: Biostat Data

a. Find the potential risk factors of infant mortality by fitting Weibull parametric PH model.
b. Find the overall survival probabilities and curve.
c. Find the survival probabilities and curves for variables ‘Type of place of residence’,
‘Education’, and ‘Wealth Index’.

Solution:

a.

# Read the data


inf.data<-read.table("F:/Mostakim/5th Masters/Stat MS-508; Data Analysis 3 Biostat and Econometrics/Practical_AFT/infant_sur.dat", header=T)
attach(inf.data)
TIME<-ifelse(TIME==0, 0.5, TIME)
# Load the library file
library(survival)
library(eha)
#PH Model: Weibull
ph.weibull<-phreg(Surv(TIME,
CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="weibull")
summary(ph.weibull)

Covariate Mean Coef Rel.Risk S.E. LR p


AGE 32.077 -0.142 0.868 0.046 0.0026
AGESQ 1107.987 0.002 1.002 0.001 0.0006
RELIGION 0.904 0.098 1.103 0.202 0.6231
MEDIA 0.635 0.049 1.050 0.133 0.7120
PLACE 0.378 -0.172 0.842 0.133 0.1956
NGO 0.392 -0.216 0.806 0.121 0.0708
PRIMARY 0.305 -0.260 0.771 0.136 0.0546
SECONDAR 0.281 -0.613 0.542 0.173 0.0003
HIGHER 0.071 -1.679 0.187 0.467 0.0000

POOR 0.347 -0.111 0.895 0.155 0.4772
RICH 0.463 -0.054 0.948 0.163 0.7424

Events 312
Total time at risk 108168
Max. log. likelihood -1938.3
LR test statistic 68.47
Degrees of freedom 11
Overall p-value 2.37972e-10

Since the p-value of the overall Weibull PH model is 2.37972e-10 < 0.05, we reject the null hypothesis at the 5% level of significance. That is, at least one covariate has an effect on the survival time.

Interpretation:

• The coefficient for the covariate Age is -0.142 and for $Age^2$ is 0.002.

Since the covariate age has a quadratic effect, it has a slightly different interpretation.

$$\ln h(T) = \hat{\mu} - 0.142\,Age + 0.002\,Age^2$$

For the age at which the hazard is at its minimum:

$$\frac{\partial \ln h(T)}{\partial Age} = 0$$

$$-0.142 + 2 \times 0.002\,Age = 0$$

$$\text{or, } Age \approx 35.5 \text{ years}$$

Since $Age^2$ has a positive sign, the curve is U-shaped:

[Figure: U-shaped curve of ln h(T) against Age, with its minimum at 35.5 years]

The log hazard of infant mortality is highest at both early and late maternal ages, while it is lowest at
middle maternal ages. Specifically, the log hazard reaches its minimum at approximately 35.5 years of
age, keeping all other covariates fixed.
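
The turning point of a quadratic effect b1·Age + b2·Age² is at −b1/(2·b2). A short sketch using the rounded coefficients from the table above:

```r
# Rounded coefficients from the Weibull PH output
b_age   <- -0.142
b_agesq <-  0.002

# Age at which the log hazard is stationary
age_min <- -b_age / (2 * b_agesq)
age_min

# A positive second derivative confirms a minimum (U shape)
2 * b_agesq > 0
```

Because the coefficients are rounded to three decimals, the turning point is approximate; using ph.weibull$coef directly gives a more precise value.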

• Hazard ratio for covariate Secondary Education is 0.542

Therefore, for mothers who have secondary level education compared to mothers having no education,
hazard rate of infant mortality decreases by (1- 0.542) * 100% = 45.8%, keeping all other covariates at
a fixed level.

• Hazard ratio for covariate Higher Education is 0.187

Therefore, for mothers who have higher level education compared to mothers having no education,
hazard rate of infant mortality decreases by (1- 0.187) * 100% = 81.3%, keeping all other covariates at
a fixed level.
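
The percentage statements above follow from 100·(exp(coef) − 1). A brief sketch with the two coefficients copied from the table:

```r
# Coefficients copied from the Weibull PH output
coef_ph <- c(SECONDAR = -0.613, HIGHER = -1.679)

hr  <- exp(coef_ph)        # hazard ratios
pct <- 100 * (hr - 1)      # percentage change in the hazard

round(cbind(hr, pct), 3)
```

A negative percentage is a reduction in the hazard relative to the reference group (mothers with no education).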

b. Find overall survival probabilities and curve.

# Overall survival probability (At mean values of covariates)

# Scale parameter of the baseline Weibull distribution
lambda<-exp(ph.weibull$coef[12])
cat("Scale:", "\n")
lambda

# Shape parameter of baseline Weibull distribution
alpha<-exp(ph.weibull$coef[13])
cat("Shape parameter:", "\n")
alpha

cov<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), mean(PLACE),mean(NGO),
mean(PRIMARY), mean(SECONDAR),mean(HIGHER), mean(POOR), mean(RICH))
mf.wei<-exp(sum(cov*ph.weibull$coef[1:11]))
time1<-TIME[CHSURV==1]
time2<-sort(time1)
time3<-unique(time2)
time<-c(0,time3)
# phreg (eha) parametrizes the Weibull baseline as H0(t) = (t/lambda)^alpha,
# so time is divided by the scale, not multiplied by it
sur.over<-(exp(-(time/lambda)**alpha))**mf.wei
cbind(time,sur.over)

The values below were printed by the original run, which used (lambda*time) in place of (time/lambda). They are far too low for a sample with only 312 deaths among 9845 infants, so rerun the corrected line above to obtain the overall survival probabilities.

time sur.over
[1,] 0.0 1.00000000
[2,] 0.5 0.54434692
[3,] 1.0 0.44305796
[4,] 2.0 0.33633738
[5,] 3.0 0.27464371
[6,] 4.0 0.23257985
[7,] 5.0 0.20148122
[8,] 6.0 0.17732651
[9,] 7.0 0.15792356
[10,] 8.0 0.14194978
[11,] 9.0 0.12854889
[12,] 10.0 0.11713692
[13,] 11.0 0.10729924
[14,] 12.0 0.09873189

##Survival Curve
plot(time, sur.over, xlab="Survival Time", ylab="Survival Probabilities", ylim=c(min(sur.over),1.0), type='s', cex.lab=0.8)
title("Figure 1: Overall Survival Curve for Weibull PH Regression Model", cex.main=0.8)

The overall survival probability decreases throughout infancy, with the steepest decline in the first months of life.
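
For reference, here is a minimal sketch of the Weibull PH survival function. It assumes, as I understand the eha documentation, that phreg reports the baseline with cumulative hazard H0(t) = (t/scale)^shape, which is the same parametrization as base R's pweibull(); please check ?phreg for the installed version. The function name and all numeric values are hypothetical.

```r
# S(t | x) = exp(-(t / scale)^shape * exp(lp)), with lp = sum(covariate * coefficient)
surv_weibull_ph <- function(time, lp, scale, shape) {
  exp(-(time / scale)^shape * exp(lp))
}

# Illustrative values only (not fitted values)
time <- c(0, 0.5, 1:12)
s    <- surv_weibull_ph(time, lp = -2.3, scale = 180, shape = 0.42)

# Cross-check against base R: baseline survival raised to the power exp(lp)
s0 <- pweibull(time, shape = 0.42, scale = 180, lower.tail = FALSE)
all.equal(s, s0^exp(-2.3))

cbind(time, s)
```

Dividing time by the scale matters: multiplying instead inflates the cumulative hazard by a factor of scale^(2·shape) and drives the survival probabilities toward zero.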

c. Find Survival probabilities and curves for place of residence.

#Survival probability: Place of residence
cov.u<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), 1,
mean(NGO),mean(PRIMARY), mean(SECONDAR),mean(HIGHER), mean(POOR), mean(RICH))
cov.r<-c(mean(AGE),mean(AGESQ), mean(RELIGION), mean(MEDIA), 0,
mean(NGO),mean(PRIMARY), mean(SECONDAR),mean(HIGHER), mean(POOR), mean(RICH))
mf.wei.u<-exp(sum(cov.u*ph.weibull$coef[1:11]))
mf.wei.r<-exp(sum(cov.r*ph.weibull$coef[1:11]))
# phreg (eha) baseline: H0(t) = (t/lambda)^alpha, so time is divided by the scale
sur.ur<-(exp(-(time/lambda)**alpha))**mf.wei.u
sur.ru<-(exp(-(time/lambda)**alpha))**mf.wei.r

##Survival curves: Urban versus rural
plot(time, sur.ur, xlab="Survival Time", ylab="Survival Probabilities", col="black", ylim=c(min(sur.ur, sur.ru),1.0),
type='s', cex.lab=0.8)
lines(time, sur.ru, type='s', col="red")
legend("bottomleft", c("Urban", "Rural"), lty=c(1,1), col=c("black", "red"), cex=0.8)
title("Figure 2: Survival Curves for Place of Residence: Weibull PH Regression Model",cex.main=0.8)

Comment: Infants born to mothers who reside in urban areas have higher probability of survival
compared to infants born to mothers who reside in rural areas as observed from the survival curves.

Similarly try for Education and Wealth Index!

Problem 9: Inter-relationship between Weibull PH and Weibull AFT Model


Check the relationship $\hat{\beta}_{PH} = -\alpha \cdot \hat{\beta}_{AFT}$ between the Weibull PH and Weibull AFT models by comparing the results of Problem 5 and Problem 8.

Solution:

aft.weibull<-survreg(Surv(TIME, CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="weibull")
aft.rcoef<-aft.weibull$coef[2]        # AFT coefficient of AGE
aft.scale<-exp(aft.weibull$coef[1])   # exp(intercept)
aft.shape<-1/aft.weibull$scale
aft.res<-cbind(aft.rcoef,aft.scale,aft.shape)

ph.weibull<-phreg(Surv(TIME, CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+PRIMARY+SECONDAR+HIGHER+POOR+RICH,
dist="weibull")
ph.rcoef<-ph.weibull$coef[1]          # PH coefficient of AGE
ph.scale<-exp(ph.weibull$coef[12])    # coef[12] is log(scale)
ph.shape<-exp(ph.weibull$coef[13])    # coef[13] is log(shape)
ph.res<-cbind(ph.rcoef,ph.scale,ph.shape)

cbind(aft.res,ph.res)

    aft.rcoef aft.scale aft.shape   ph.rcoef
AGE 0.3377305  177.9746 0.4206546 -0.1420679

(The ph.scale and ph.shape columns originally printed here, 1.002381 and 1.103062, were computed from coef[2] and coef[3], i.e. they are the exponentiated AGESQ and RELIGION coefficients. Rerun the corrected code to obtain the Weibull scale and shape of the PH fit.)
#Here
alpha<-0.4206546
beta_aft<-0.3377305
beta_ph<--alpha*beta_aft
beta_ph

[1] -0.1420679

Here, our calculated beta_ph equals $\hat{\beta}_{PH} = -0.142$; therefore, the inter-relationship is verified.
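
Why the relationship holds can be seen from a short derivation. The sketch below assumes the usual Weibull AFT form fitted by survreg, with $\sigma$ the reported scale, $\alpha = 1/\sigma$, and $\mu$ the intercept:

```latex
\begin{aligned}
\log T &= \mu + x'\beta_{AFT} + \sigma W, \qquad W \sim \text{standard extreme value (minimum)} \\
S(t \mid x) &= \exp\!\left\{-\left(t\,e^{-\mu - x'\beta_{AFT}}\right)^{\alpha}\right\}, \qquad \alpha = 1/\sigma \\
h(t \mid x) &= -\frac{d}{dt}\log S(t \mid x)
             = \alpha\,t^{\alpha-1}\,e^{-\alpha\mu}\,\exp\!\left(-\alpha\,x'\beta_{AFT}\right) \\
            &= h_0(t)\,\exp\!\left(x'\beta_{PH}\right), \qquad
               h_0(t) = \alpha\,t^{\alpha-1}e^{-\alpha\mu}, \quad
               \beta_{PH} = -\alpha\,\beta_{AFT}
\end{aligned}
```

The baseline $h_0(t)$ is itself a Weibull hazard with shape $\alpha$ and scale $e^{\mu}$, so the Weibull model is both an AFT and a PH model, and the same identity applies to every covariate, not only AGE.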

Problem 10: Semi-Parametric PH Model (Cox-PH Model)


Using the data set “infant_sur.dat”. Dataset link: Biostat Data

a. Find the potential risk factors of infant mortality by fitting semi parametric PH model.
b. Find the baseline hazard and plot them.
c. Find overall survival probabilities and curve.
d. Find the survival probabilities and curves for variables ‘Type of place of residence’,
‘Education’, and ‘Wealth Index’.

Solution:

a.

# Read the data


inf.data<-read.table("F:/Mostakim/5th Masters/Stat MS-508; Data Analysis 3 Biostat and Econometrics/Practical_AFT/infant_sur.dat", header=T)
attach(inf.data)
TIME<-ifelse(TIME==0, 0.5, TIME)
#Start of Cox PH Model
coxph.model<-coxph(Surv(TIME,CHSURV)~AGE+AGESQ+RELIGION+MEDIA+PLACE+NGO+
PRIMARY+SECONDAR+HIGHER+RICH+POOR)
summary(coxph.model)

Call:
coxph(formula = Surv(TIME, CHSURV) ~ AGE + AGESQ + RELIGION +
MEDIA + PLACE + NGO + PRIMARY + SECONDAR + HIGHER + RICH +
POOR)

n= 9845, number of events= 312

coef exp(coef) se(coef) z Pr(>|z|)


AGE -0.132870 0.875579 0.046061 -2.885 0.003918 **
AGESQ 0.002267 1.002270 0.000675 3.359 0.000782 ***
RELIGION 0.103947 1.109542 0.202245 0.514 0.607274
MEDIA 0.052445 1.053845 0.132946 0.394 0.693223
PLACE -0.170816 0.842977 0.133452 -1.280 0.200553
NGO -0.209118 0.811300 0.120553 -1.735 0.082801 .
PRIMARY -0.259219 0.771654 0.136393 -1.901 0.057364 .
SECONDAR -0.612225 0.542143 0.172873 -3.541 0.000398 ***
HIGHER -1.684126 0.185607 0.467115 -3.605 0.000312 ***
RICH -0.053956 0.947473 0.163549 -0.330 0.741468
POOR -0.112970 0.893177 0.155277 -0.728 0.466894
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

exp(coef) exp(-coef) lower .95 upper .95


AGE 0.8756 1.1421 0.8000 0.9583
AGESQ 1.0023 0.9977 1.0009 1.0036
RELIGION 1.1095 0.9013 0.7464 1.6493
MEDIA 1.0538 0.9489 0.8121 1.3675
PLACE 0.8430 1.1863 0.6490 1.0950
NGO 0.8113 1.2326 0.6406 1.0275
PRIMARY 0.7717 1.2959 0.5906 1.0081
SECONDAR 0.5421 1.8445 0.3863 0.7608
HIGHER 0.1856 5.3877 0.0743 0.4637
RICH 0.9475 1.0554 0.6876 1.3055
POOR 0.8932 1.1196 0.6588 1.2109

Concordance= 0.637 (se = 0.015 )


Likelihood ratio test= 69.63 on 11 df, p=1e-10
Wald test = 64.11 on 11 df, p=2e-09
Score (logrank) test = 69.46 on 11 df, p=2e-10

• The coefficient for the covariate Age is -0.1329 and for $Age^2$ is 0.0023.

Since the covariate age has a quadratic effect, it has a slightly different interpretation.

$$\ln h(T) = \hat{\mu} - 0.1329\,Age + 0.0023\,Age^2$$

For the age at which the hazard is at its minimum:

$$\frac{\partial \ln h(T)}{\partial Age} = 0$$

$$-0.1329 + 2 \times 0.0023\,Age = 0$$

$$\text{or, } Age \approx 29 \text{ years}$$

Since $Age^2$ has a positive sign, the curve is U-shaped:

[Figure: U-shaped curve of ln h(T) against Age, with its minimum at 29 years]

The log hazard of infant mortality is highest at both early and late maternal ages and lowest at middle maternal ages. Specifically, the log hazard reaches its minimum at approximately 29 years of age, keeping all other covariates fixed. That is, infants of mothers aged around 29 years have the lowest estimated hazard of death.

The other interpretations are the same as in Problem 8. Try them yourself!
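
With a quadratic term, the hazard ratio between two maternal ages depends on both ages. The sketch below uses the Cox coefficients printed above; the helper name hr_age and the ages compared are illustrative choices.

```r
# Coefficients copied from the coxph output
b_age   <- -0.132870
b_agesq <-  0.002267

# Hazard ratio for maternal age a1 relative to age a0
hr_age <- function(a1, a0) {
  exp(b_age * (a1 - a0) + b_agesq * (a1^2 - a0^2))
}

age_min <- -b_age / (2 * b_agesq)   # age with the lowest log hazard
age_min

# Younger and older mothers compared with mothers aged 29
hr <- hr_age(c(20, 40), 29)
hr
```

Both ratios exceed 1, which matches the U-shaped pattern: the hazard is higher on either side of the minimum.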

b. Find the baseline hazard function and plot them.

# Estimated cumulative baseline hazard
# (basehaz() returns the cumulative hazard H0(t), not the hazard rate)
base.haz<-basehaz(coxph.model)
base.haz
plot(base.haz$time, base.haz$hazard, xlab="Time", ylab="Cumulative Hazard", type="s")
title("Cumulative Baseline Hazard Function for Cox PH Model")
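
Since basehaz() in the survival package returns the cumulative baseline hazard, the hazard contribution at each event time and the baseline survival can be derived from it. The data frame below is a hypothetical stand-in for the basehaz() output, used only to show the arithmetic.

```r
# Hypothetical cumulative baseline hazard at the event times (not real output)
H0 <- data.frame(time   = c(0.5, 1, 2, 3),
                 hazard = c(0.010, 0.014, 0.017, 0.019))

# Hazard increment at each event time
H0$increment <- diff(c(0, H0$hazard))

# Baseline survival: S0(t) = exp(-H0(t))
H0$survival <- exp(-H0$hazard)

# Survival for a profile whose linear predictor differs from the baseline by lp
lp <- -0.6                              # hypothetical value
H0$survival_lp <- H0$survival^exp(lp)   # S(t | x) = S0(t)^exp(lp)

H0
```

Plotting the increment column against time (for example with type="h") shows where the hazard is concentrated, while the cumulative column is what basehaz() reports directly.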

c. Find overall survival function and curve.

# Overall survival function (At mean values of covariates)


cox.sur<-survfit(coxph.model)
summary(cox.sur)
plot(cox.sur, xlab="Time", ylab="Survival Probability", ylim=c(0.92,1))
title("Overall Survival Curve for COX PH Model")

d. Find the survival probabilities and curves for variables ‘Type of place of residence’,
‘Education’, and ‘Wealth Index’.

## Survival curves for the type of place of residence


## Two types 0: Rural and 1: Urban

data1<-data.frame(AGE=rep(mean(AGE),2),AGESQ=rep(mean(AGESQ),2),
RELIGION=rep(mean(RELIGION),2), MEDIA=rep(mean(MEDIA),2), PLACE=c(0,1),
NGO=rep(mean(NGO),2), PRIMARY=rep(mean(PRIMARY),2),
SECONDAR=rep(mean(SECONDAR),2), HIGHER=rep(mean(HIGHER),2), POOR=rep(mean(POOR),2),
RICH=rep(mean(RICH),2))

sur.place<-survfit(coxph.model, newdata=data1)
summary(sur.place)
plot(sur.place, xlab="Time", ylab="Survival Probability", ylim=c(0.96,1), lty=1:2, col=c("black", "red"))
legend(6, 0.995, c("Rural", "Urban"), lty=1:2, col=c("black", "red"))
title("Survival curves for place of residence")

## Survival curves for Education


## No education, primary, secondary, and higher

data2<-data.frame(AGE=rep(mean(AGE),4), AGESQ=rep(mean(AGESQ),4),
RELIGION=rep(mean(RELIGION),4), MEDIA=rep(mean(MEDIA),4), PLACE=rep(mean(PLACE),4),
NGO=rep(mean(NGO),4), PRIMARY=c(0,1,0,0), SECONDAR=c(0,0,1,0), HIGHER=c(0,0,0,1),
POOR=rep(mean(POOR),4),RICH=rep(mean(RICH),4))

sur.educa<-survfit(coxph.model, newdata=data2)
summary(sur.educa)
plot(sur.educa, xlab="Time", ylab="Survival Probability", ylim=c(0.92,1), lty=1:4, col=c("black", "red",
"blue", "green3"))
legend(6, 0.955, c("No Edu", "Primary", "Secondary", "Higher"), lty=1:4, col=c("black", "red", "blue",
"green3"))
title("Survival curves for education")

##Survival curves for wealth index


## Poor, Middle, and Rich
data3<-data.frame(AGE=rep(mean(AGE),3),AGESQ=rep(mean(AGESQ),3),
RELIGION=rep(mean(RELIGION),3), MEDIA=rep(mean(MEDIA),3), PLACE=rep(mean(PLACE),3),
NGO=rep(mean(NGO),3), PRIMARY=rep(mean(PRIMARY),3),SECONDAR=rep(mean(SECONDAR),3),
HIGHER=rep(mean(HIGHER),3), POOR=c(1,0,0), RICH=c(0,0,1))

sur.wealth<-survfit(coxph.model, newdata=data3)
summary(sur.wealth)

plot(sur.wealth, xlab="Time", ylab="Survival Probability", ylim=c(0.95,1), lty=1:3, col=c("black", "red", "blue"))
legend(6, 0.97, c("Poor", "Middle", "Rich"), lty=1:3, col=c("black", "red", "blue"))
title("Survival curves for wealth index")

Multiple Modes of Failure
Check Master’s Theory and Practical Lectures of Advanced Biostatistics.

Generalized Linear Models


Check 4th Year Theory and Practical Lectures of Generalized Linear Models.

Generalized Linear Mixed Models


Problem 11: Generalized Linear Mixed Model
You are given a data set on postnatal care of children (“postnatal.sav”) in Bangladesh where the variable
“pnc” is defined as a binary variable (1: received postnatal care within six weeks of delivery; 0:
otherwise). Description of control variables is provided in the data set. Dataset link: Biostat Data

a. Fit a GLMM taking into account the random effect of clusters and interpret the results.

b. Compute 95% confidence intervals for the odds ratios. Hence identify the potential
factors of receiving postnatal care for children.

c. Find the intra-cluster correlation and interpret the result.

Solution:

a.

data_pnc<-read.csv("F:/Mostakim/5th Masters/Stat MS-508; Data Analysis 3 Biostat and Econometrics/Practical_GLMM/postnatal.csv",header=T)

names(data_pnc)

attach(data_pnc)

library(glmmTMB)

# (1|cluster): random intercept for each cluster; zi=~0: no zero-inflation component
model_rem1<-glmmTMB(pnc~factor(edu)+factor(media)+factor(w_index)+
factor(anc)+factor(ngo)+factor(resi)+factor(p_deli)+
factor(sex)+(1|cluster),zi=~0,family=binomial,data=data_pnc)

summary(model_rem1)

## Family: binomial ( logit )
## Formula:
## pnc ~ factor(edu) + factor(media) + factor(w_index) + factor(anc) +
## factor(ngo) + factor(resi) + factor(p_deli) + factor(sex) +
## (1 | cluster)
## Data: data_pnc
##
## AIC BIC logLik -2*log(L) df.resid
## 3790.1 3873.4 -1882.1 3764.1 4454
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## cluster (Intercept) 2.706 1.645
## Number of obs: 4467, groups: cluster, 593
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.30734 0.24750 -1.242 0.21432
## factor(edu)1 0.18566 0.14946 1.242 0.21415
## factor(edu)2 0.24569 0.15340 1.602 0.10924
## factor(edu)3 0.71341 0.25275 2.823 0.00476 **
## factor(media)1 0.37824 0.11955 3.164 0.00156 **
## factor(w_index)1 -0.14011 0.13796 -1.016 0.30982
## factor(w_index)2 0.11920 0.14974 0.796 0.42599
## factor(anc)1 0.58682 0.12282 4.778 1.77e-06 ***
## factor(ngo)1 -0.03300 0.11920 -0.277 0.78192
## factor(resi)1 -0.47236 0.19901 -2.374 0.01762 *
## factor(p_deli)1 3.48441 0.15670 22.236 < 2e-16 ***
## factor(sex)2 0.09783 0.09539 1.026 0.30509
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coef<-coef(summary(model_rem1))$cond[,1]
se<-coef(summary(model_rem1))$cond[,2]

or<-exp(coef[2:12]) #Odds ratio


se.or<-or*se[2:12] #Standard error of odds ratio

cbind(or,se.or)

## or se.or
## factor(edu)1 1.2040181 0.1799544
## factor(edu)2 1.2785031 0.1961210
## factor(edu)3 2.0409342 0.5158523
## factor(media)1 1.4597085 0.1745142
## factor(w_index)1 0.8692644 0.1199197
## factor(w_index)2 1.1265970 0.1686947
## factor(anc)1 1.7982533 0.2208639
## factor(ngo)1 0.9675413 0.1153317
## factor(resi)1 0.6235307 0.1240889

## factor(p_deli)1 32.6031107 5.1089391
## factor(sex)2 1.1027794 0.1051972

Interpretation of Education:

Mothers with primary education have about 20% higher odds of receiving postnatal care within six weeks of delivery than mothers with no education, but the difference is not statistically significant (p = 0.214). Mothers with higher education have more than double the odds (OR = 2.04, i.e. +104%) of receiving postnatal care within six weeks of delivery compared to the no-education group, and the p-value 0.0048 < 0.05 indicates that this is statistically significant at the 5% level of significance.

In a similar manner, attempt to interpret the remaining variables yourself!

b.

lower.ci<-or-1.96*se.or
upper.ci<-or+1.96*se.or
cbind(lower.ci,upper.ci)

## lower.ci upper.ci
## factor(edu)1 0.8513075 1.5567287
## factor(edu)2 0.8941059 1.6629002
## factor(edu)3 1.0298636 3.0520048
## factor(media)1 1.1176607 1.8017563
## factor(w_index)1 0.6342218 1.1043070
## factor(w_index)2 0.7959555 1.4572385
## factor(anc)1 1.3653601 2.2311465
## factor(ngo)1 0.7414913 1.1935914
## factor(resi)1 0.3803165 0.8667449
## factor(p_deli)1 22.5895900 42.6166313
## factor(sex)2 0.8965928 1.3089659

The relevant hypothesis:

$$H_0: OR = 1$$
If the confidence interval of odds ratio includes the value 1, then it is not statistically significant.
Therefore, from the above table, we can say that Education (especially higher education), media
exposure, antenatal care, residence (urban/rural), and place of delivery are important factors for
postnatal care.
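
The intervals above are symmetric around the odds ratio (delta method). A common alternative, shown here only as a sketch, builds the interval on the log-odds scale and exponentiates the limits, which keeps the lower limit positive. The estimate and standard error are copied from the factor(edu)3 row of the model output.

```r
# factor(edu)3: estimate and standard error on the log-odds scale
est <- 0.71341
se  <- 0.25275
z   <- qnorm(0.975)

or <- exp(est)

# Interval built on the log scale, then exponentiated
ci_log <- exp(est + c(-1, 1) * z * se)

# Delta-method interval, as computed in the text (se of OR = OR * se)
ci_delta <- or + c(-1, 1) * z * or * se

rbind(log_scale = ci_log, delta_method = ci_delta)
```

Both intervals exclude 1 for higher education, so the conclusion is unchanged here; the two approaches can disagree for estimates closer to the null or with larger standard errors.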

c.

The intra-cluster correlation (ICC) is defined as:

$$ICC = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}$$

where $\sigma_u^2$ is the between-cluster variance, which is 2.706 from our output, and $\sigma_e^2$ is the residual variance of the model, which is $\frac{\pi^2}{3}$ for a binary response.

sig_clu<-2.706
sig_resi<-pi^2/3
icc<-sig_clu/(sig_clu+sig_resi)
icc

## [1] 0.4513108

About 45% of the total variation in whether children received postnatal care can be explained by
differences between clusters.

This indicates that cluster-level factors (e.g., local healthcare services, community characteristics) play
a strong role in postnatal care coverage.

Problem 12: Generalized Linear Mixed Model

You are given a data set on birth weight of newborns (“birth_weight.sav”) in a country. Description of
the variables are provided in the data set. Identify potential determinants of birth weight by fitting a
GLMM taking into account the random effect of clusters and interpret the results. Also, find the intra-
cluster correlation and interpret the result. Dataset link: Biostat Data

Solution:

data<-read.csv("F:/Mostakim/5th Masters/Stat MS-508; Data Analysis 3 Biostat and Econometrics/Practical_GLMM/birth_weight.csv",header=T)
names(data)

attach(data)

## The following objects are masked from data_pnc:


##
## anc, cluster, sex, w_index

library(glmmTMB)
model_rem2<-glmmTMB(birth_weight~factor(area)+factor(location)+factor(w_edu)+
factor(w_media)+factor(w_index)+factor(violence)+factor(b_order)+
factor(anc)+Age_year+Age_year_sqr+(1|cluster), zi=~0, family=gaussian, data=data)

summary(model_rem2)

## Family: gaussian ( identity )
## Formula:
## birth_weight ~ factor(area) + factor(location) + factor(w_edu) +
## factor(w_media) + factor(w_index) + factor(violence) + factor(b_order) +
## factor(anc) + Age_year + Age_year_sqr + (1 | cluster)
## Data: data
##
## AIC BIC logLik -2*log(L) df.resid
## 7601.6 7715.2 -3782.8 7565.6 4054
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## cluster (Intercept) 0.007425 0.08617
## Residual 0.368021 0.60665
## Number of obs: 4072, groups: cluster, 2185
##
## Dispersion estimate for gaussian family (sigma^2): 0.368
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.2755416 0.2202667 10.331 < 2e-16 ***
## factor(area)2 0.0717613 0.0237573 3.021 0.002523 **
## factor(location)1 0.1109202 0.0283502 3.912 9.13e-05 ***
## factor(location)2 0.1049401 0.0271760 3.861 0.000113 ***
## factor(location)3 -0.0483205 0.0276711 -1.746 0.080769 .
## factor(w_edu)1 0.0426592 0.0626226 0.681 0.495737
## factor(w_edu)2 0.0680711 0.0592783 1.148 0.250832
## factor(w_edu)3 0.0897342 0.0616742 1.455 0.145677
## factor(w_media)1 -0.0187051 0.0248048 -0.754 0.450793
## factor(w_index)1 0.0360164 0.0302632 1.190 0.234006
## factor(w_index)2 0.0611729 0.0282450 2.166 0.030327 *
## factor(violence)1 0.0241818 0.0249576 0.969 0.332587
## factor(b_order)1 -0.0655412 0.0263496 -2.487 0.012869 *
## factor(anc)1 -0.0249231 0.0203968 -1.222 0.221741
## Age_year 0.0372948 0.0157723 2.365 0.018051 *
## Age_year_sqr -0.0005638 0.0002746 -2.053 0.040064 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The intra-cluster correlation (ICC) is defined as:

$$ICC = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}$$

From the output, the cluster variance is $\sigma_u^2 = 0.007425$ and the residual variance of the model is $\sigma_e^2 = 0.368021$.

Try the calculation and interpretation yourself!
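
As a check on your own calculation, here is a short sketch using the two variance components printed in the output. For a Gaussian response the residual variance is estimated by the model, so no π²/3 term is needed.

```r
# Variance components copied from summary(model_rem2)
sig_clu <- 0.007425   # between-cluster variance
sig_res <- 0.368021   # residual variance

icc <- sig_clu / (sig_clu + sig_res)
icc
```

The result is roughly 0.02, so only about 2% of the variation in birth weight is attributable to differences between clusters, which is a much weaker clustering effect than in Problem 11.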

References
1. Practical and Theory Lectures of Fatima Tuz Zahura Ma'am.

2. Survival Analysis: Techniques for Censored and Truncated Data by John P. Klein and Melvin L. Moeschberger.

3. Statistical Models and Methods for Lifetime Data by Jerald F. Lawless.

Three Weddings

Anyone who knew Abed in his youth would have told you that he
was destined to end up with a certain someone. But that someone
was not Haifa or Asmahan. It was a girl called Ghazl. They met in
the mid-1980s, when Anata was quiet and rural, more village than
town. Ghazl was a fourteen-year-old freshman at the Anata girls'
school. Abed was a senior at the boys' school across the street. Back
then, everyone knew each other in Anata. More than half the village
came from one of three large families all descended from the same
ancestor, a man named Alawi. Abed's family, the Salamas, was the
largest. Ghazl's, the Hamdans, was the second largest.

A Day in the Life of Abed Salama: Anatomy of a Jerusalem Tragedy by Nathan Thrall (Winner of the Pulitzer Prize)
