
SB3.2
Statistical Lifetime-Models
David Steinsaltz, Department of Statistics, University of Oxford

[Cover figure: Lexis diagram with cohort lines P'(1,t), P'(2,t), P'(3,t) plotted against Age and Time.]

HT 2019
SB3.2
Statistical Lifetime-Models
David Steinsaltz – 16 lectures HT 2019
[email protected]

Prerequisites
Part A Probability and Part A Statistics are prerequisites.

Website: http://www.steinsaltz.me.uk/SB3b/SB3b.html
Aims
Event times and event counts appear in many social and medical data contexts, and require
a specialised suite of techniques to handle properly, broadly known as survival analysis. This
course covers the basic definitions and techniques for creating life tables, estimating event-time
distributions, comparing and testing distributions of different populations, and evaluating the
goodness of fit of various models. A focus is on understanding when and why particular models
ought to be chosen, and on using the standard software tools in R to carry out data analysis.
Key Themes
• Life tables

• Kinds of event-time data: Censoring and truncation

• Non-parametric and semi-parametric survival models

• Model diagnostics for survival models

Learning Objectives
At the end of this course students will

• Understand the standard notation of life tables, and be able to make inferences from life
tables;

• Interpret censored and truncated data;

• Derive estimators and confidence intervals for standard parametric survival models;

• Fit survival curves with standard nonparametric techniques using R, and interpret the
results;

• Select appropriate tests for equality of survival distributions, and carry out the tests in R;

• Fit semi-parametric survival models in R and interpret the results;

• Estimate the goodness of fit of survival models using graphical and residual techniques;

• Make individual predictions based on survival data.



Computing
The lectures include material about performing basic survival analysis in R. This is an important
element of the course, since almost any application you will make of the methods you learn here
will be done on the computer. If you have not done any R programming yet, you are encouraged
to review the introductory material provided for the Part A Statistics lectures, and try the
computing exercises from that course. There will be computing exercises in each problem sheet.
These will not be checked, but as with all of the problems on the sheets, you can ask about them
in the classes.
Synopsis
1. Introduction to survival data: hazard rates, survival curves, life tables.

2. Censoring and truncation, introduction through the census approximation.

3. Multiple-decrements model.*

4. Parametric survival models

5. Nonparametric estimation of survival curves

6. Nonparametric model tests (log-rank test and relatives)

7. Semiparametric models:

(a) Proportional hazards;


(b) Additive hazards;
(c) Accelerated failure.

8. Model-fit diagnostics.

9. Dynamic prediction and model information quality.*

10. Repeated events.*

Note: Three topics (marked with stars) are listed in the lecture notes as “optional”. Some of
these may not be covered, depending on the available time.

Reading
The main reading for this course will be a set of lecture notes, which will be available on the course
website. Unless specifically stated, all material in the lecture notes is examinable.
There are lots of good books on survival analysis. Look for one that suits you. Some pointers
will be given in the lecture notes to readings that are connected, but look in the index to find
more explanation of topics that confuse you and/or interest you.

1. Life table concepts


Life tables: Basic notation, life expectancy and remaining life expectancy, curtate lifetimes.
Survival models: general lifetime distributions, force of mortality (hazard rate), survival
function. Periods and cohorts. Lexis diagrams. Census and vital statistics. Multiple
decrements model.
Primary Readings:

• Kenneth W. Wachter. Essential Demographic Methods. Harvard University Press, 2014.

Secondary Readings

• Farhat Yusuf, David Swanson, Jo Martins. Methods of Demographic Analysis. Springer, 2013.
• Subject CT4 Models Core Reading, Faculty & Institute of Actuaries.

2. Survival analysis basics


Censoring and truncation. Maximum likelihood estimation for parametric models. Kaplan-
Meier and Nelson-Aalen estimator with variance estimation (including Greenwood’s for-
mula). Applications in epidemiology. Parametric models and generalised linear regression.
Nonparametric comparison of survival distributions, including log-rank test and serial-
correlations test. Using the survival package in R. (This will be continued through the
course.)
Primary Readings:

• J.P. Klein and M.L. Moeschberger, Survival Analysis, Springer, 2nd ed., 2003: Chapters
3, 4, 7.

Secondary Readings

• Odd O. Aalen et al., Survival and Event History Analysis, Springer, 2008.
• D. F. Moore, Applied Survival Analysis Using R, Springer, 2016.

3. Semiparametric models
Relative risk (proportional hazards) including the Cox model, additive hazards model,
accelerated failure models. Partial likelihood. Efron’s estimator for survival distributions.
Primary Readings:

• J.P. Klein and M.L. Moeschberger, Survival Analysis, Springer, 2nd ed., 2003: Chapters
8 and 10.

Secondary Readings:

• H. C. van Houwelingen and T. Stijnen, “Cox Regression Model”, in J. P. Klein et al. (ed.) Handbook of Survival Analysis, pp. 5–26, CRC Press, 2014.
• T. Martinussen and L. Peng, “Alternatives to the Cox Model”, in J. P. Klein et al.
(ed.) Handbook of Survival Analysis, pp. 49–76, CRC Press, 2014.

4. Survival model diagnostics


Residual tests, including Cox–Snell residuals, martingale residuals, Schoenfeld residuals.
Dynamic prediction and predictive power of models: Cross validation, discriminative and
predictive value, ROC, mean-square prediction error.
Primary Readings:

• J.P. Klein and M.L. Moeschberger, Survival Analysis, Springer, 2nd ed., 2003: Chapter
11.

Secondary Readings

• H. C. van Houwelingen and H. Putter, Dynamic Prediction in Clinical Survival Analysis. CRC Press, 2011.
• Lawless, J. F. and Yuan, Y. (2010). “Estimation of prediction error for survival
models”. Statistics in Medicine, 29(2), 262-274.

5. Repeated events
Andersen–Gill model, Poisson regression, negative binomial model.
Contents

Glossary

1 Introduction: Survival Models
1.1 Early life tables (not examinable)
1.2 Basic statistical methods for lifetime distributions
1.2.1 Plot the data
1.2.2 Fit a model
1.2.3 Significance test
1.3 Goals of the course
1.4 Survival function and hazard rate (force of mortality)
1.5 Residual lifetimes
1.6 Force of mortality
1.7 Defining mortality laws from hazards
1.8 Curtate lifespan

2 Lifetime distributions and life tables
2.1 Mortality laws: Simple or Complex? Parametric or Nonparametric?
2.2 Life Tables
2.3 Notation for life tables
2.4 Continuous and discrete models
2.5 Crude estimation of life tables – discrete method
2.6 Crude life table estimation – direct method for continuous data
2.7 Are life tables continuous or discrete?
2.8 Interpolation for non-integer ages
2.9 The “continuous” method for life-table estimation
2.10 Comparing continuous and discrete methods
2.11 Central exposed to risk and the census approximation
2.11.1 The Principle of Correspondence
2.11.2 Census approximation
2.12 Lexis diagrams
2.13 Life Expectancy
2.13.1 What is life expectancy?
2.13.2 Example
2.13.3 Life expectancy and mortality
2.14 An example of life-table computations
2.15 Cohorts and period life tables
2.16 Comparing Life Tables
2.16.1 Defining the Poisson model
2.16.2 Testing hypotheses for qx and µx
2.16.3 The tests
2.16.4 An example
2.17 Graduation
2.17.1 Parametric models
2.17.2 Reference to a standard table
2.17.3 Nonparametric smoothing
2.17.4 Methods of fitting
2.17.5 Examples

3 Multiple-decrements model (Optional Topic)
3.1 Introduction to the multiple-decrements model
3.1.1 An introductory example
3.1.2 The mathematical model (non-examinable)
3.1.3 Multiple decrements – time-homogeneous rates
3.2 Estimation for general multiple decrements
3.3 Example: Workforce model
3.4 The distribution of the endpoint
3.5 Cohabitation–dissolution model

4 Survival analysis
4.1 Censoring and truncation
4.2 Likelihood and Right Censoring
4.2.1 Random censoring
4.2.2 Non-informative censoring
4.3 Time on test
4.4 Non-parametric survival estimation
4.4.1 Review of basic concepts
4.4.2 Kaplan–Meier estimator
4.4.3 Nelson–Aalen estimator
4.4.4 Invented data set
4.4.5 Example: The AML study
4.5 Left truncation
4.6 Variance estimation: Greenwood’s formula
4.6.1 The cumulative hazard
4.6.2 The survival function and Greenwood’s formula
4.6.3 Reminder of the δ method
4.7 AML study, continued
4.8 Survival to ∞
4.8.1 Example: Time to next birth
4.9 Computing survival estimators in R (non-examinable)
4.9.1 Survival objects with only right-censoring
4.9.2 Other survival objects

5 Regression models
5.1 Introduction to regression models
5.2 How survival functions vary
5.2.1 Graphical tests
5.3 Generalised linear survival models
5.4 Simulation example
5.5 The relative-risk regression model
5.6 Partial likelihood
5.7 Significance testing
5.8 Estimating baseline hazard
5.8.1 Breslow’s estimator
5.8.2 Individual risk ratios
5.9 Dealing with ties
5.10 The AML example
5.11 The Cox model in R
5.12 The additive hazards model
5.12.1 Describing the model
5.12.2 Fitting the model
5.12.3 Examples

6 Non-parametric two-sample hypothesis tests
6.1 Tests in the regression setting
6.2 Non-parametric testing of survival between groups
6.2.1 General principles
6.2.2 Standard tests
6.3 Examples
6.3.1 The AML example
6.3.2 Kidney dialysis example
6.4 Nonparametric tests in R (non-examinable)

7 Testing the fit of survival models
7.1 Graphical tests of the proportional hazards assumption
7.1.1 Log cumulative hazard plot
7.1.2 Andersen plot
7.1.3 Arjas plot
7.1.4 Leukaemia example
7.2 General principles of model selection
7.2.1 The idea of model diagnostics
7.2.2 A simulated example
7.3 Cox–Snell residuals
7.4 Bone marrow transplantation example
7.5 Testing the proportional hazards assumption: Schoenfeld residuals
7.6 Martingale residuals
7.6.1 Definition of martingale residuals
7.6.2 Application of martingale residuals for estimating covariate transforms
7.7 Outliers and influence
7.7.1 Deviance residuals
7.7.2 Delta–beta residuals
7.8 Residuals in R (non-examinable)
7.8.1 Dutch Cancer Institute (NKI) breast cancer data
7.8.2 Complementary log-log plot
7.8.3 Andersen plot
7.8.4 Cox–Snell residuals
7.8.5 Martingale residuals
7.8.6 Schoenfeld residuals

8 Dynamic prediction (Optional Topic)

9 Correlated events and repeated events
9.1 Introduction
9.2 Time-to-first-event analysis
9.3 Clustered data
9.3.1 Stratified baseline
9.3.2 The sandwich estimator for variance
9.4 Multiple events
9.4.1 The Poisson model
9.4.2 The Poisson regression model
9.4.3 The Andersen–Gill model
9.5 Shared frailty model
9.5.1 Negative-binomial model
9.5.2 Frailty in proportional hazards models
9.6 Example: The diabetic retinopathy data set
9.7 Example: Bladder cancer data set

A Problem sheets
A.1 Revision, lifetime distributions, Lexis diagrams and the census approximation
A.2 Life expectancy, graduation, and survival analysis
A.3 Survival regression models and two-sample testing
A.4 Model diagnostics, repeated events

B Solutions
B.1 Revision, lifetime distributions, Lexis diagrams and the census approximation
B.2 Life expectancy, graduation, and survival analysis
B.3 Survival regression models and two-sample testing
B.4 Model diagnostics, repeated events
Glossary

cdf Cumulative distribution function. 8

census approximation Method of estimating Central Exposed To Risk based on observations of curtate age at death. 23

Central Exposed To Risk Total time that individuals are at risk. Under some circumstances,
this is about the number of individuals at risk at the midpoint of the estimation period. 21

cohort A group of individuals of equivalent age (in whatever sense relevant to the study),
observed over a period of time. 8, 16

cohort life table Life table showing mortality of individuals born in the same year (or approx-
imately same year). 32

curtate lifetime The integer part of a real-valued life time. 17

force of mortality Same as mortality rate, but also used in a discrete context. 8

graduation Smoothing for life tables. 39

hazard rate Density divided by survival. Thus, the instantaneous probability of the event
occurring, conditioned on survival to time t. 8

Initial Exposed To Risk Number of individuals at risk at the start of the estimation period.
17, 21

Maximum Likelihood Estimator Estimator for a parameter, chosen to maximise the likeli-
hood function. 17

mortality rate Same as hazard rate, in a mortality context. 8

period life table Life table showing mortality of individuals of a given age living in the same
year (or approximately same year). 32

Radix The initial number of individuals in the nominal cohort described by a life table. 16

Chapter 1

Introduction: Survival Models

1.1 Early life tables (not examinable)


In one of the earliest treatises on probability George Leclerc Buffon considered the problem of
finding the fundamental unit of risk, the smallest discernible probability. He wrote that “all fear
or hope, whose probability equals that which produces the fear of death, in the moral realm may
be taken as unity against which all other fears are to be measured.” [4, p. 56] In other words,
because no healthy man in the prime of life (he argued) attends to the risk that he may die in
the next twenty-four hours, Buffon considered that events with this probability could be treated
as negligible; after all, “since the intensity of the fear of death is a good deal greater than the
intensity of any other fear or hope,” any other risk of equivalent probability of a less troubling
event — such as winning a lottery — would leave a person equally indifferent. He decided that
the appropriate age to consider for a man to be in the prime of health was 56 years. But what is
that probability, that a 56 year old man dies in the next day?
To answer this, Buffon turned to mortality tables. A colleague (one M. Dupré of Saint-Maur)
assembled the registers of 12 rural parishes and 3 parishes of Paris, in which 23,994 deaths were
recorded. The ages at death were all recorded, so that he knew that 174 of the deaths were at
age 56; that is, between the 56th and 57th birthdays.1 Our naïve estimator for the probability of
an event is
    probability of occurrence = (number of occurrences) / (number of opportunities).
The number of occurrences of the event (death of an individual aged 56) is observed to be
174. But what about the denominator? The number of “opportunities” for this event is just
the number of individuals in the population at the appropriate age. The most direct way to
determine this number would be a time-consuming census. Buffon’s approach (and that of other
17th and 18th creators of such life tables) depended upon the following implicit logic: Suppose
the population is stable, so that the same number of people in each age group die each year.
Since every person dies at some time (it is believed), the total number of people in the population
who live to their 56th birthday will be exactly the same as the number of people observed to
have died after their 56th birthday in the particular year under observation, which happens to
be 5031. The probability of dying in one day may then be estimated as

    (1/365) × (174/5031) ≈ 1/10000,

and Buffon proceeds to reason with this estimate.

1 Actually, Buffon’s statistical procedure was a bit more complicated than this. The recorded numbers of deaths
at ages 55, 56, 57, 58, 59, 60 were 280, 130, 129, 182, 90, 534 respectively. Buffon observed that the priests (“particularly
the country priests”) were likely to record round numbers for the age at death, rather than the exact age — which
they may not know anyway. He thus decided that it would make more sense to smooth (as statisticians would call
the procedure today) or graduate (as actuaries call it) the data. We will learn about graduation in section 2.17.
From this elementary exercise we see that:

• Mortality probabilities can be estimated as the ratio of the number of deaths to the number
of individuals “at risk”.

• The numerator (the number of deaths) is usually straightforward to determine.

• The denominator (the number at risk) can be challenging.

• Mortality can serve as a model for thinking about risks (and opportunities) more generally,
for events happening at random times.

• You don’t get very far in thinking about mortality and other risks without some sort of
theoretical model.

The last claim may require a bit more elucidation. What would a naïve, empirical approach
to life tables look like? Given a census of the population by age, and a list of the ages at death
in the following year, we could compute the proportion of individuals aged x who died in the
following year. This is merely a free-floating fact, which could be compared with other facts, such
as the measured proportion of individuals aged x who died in a different year (or at a different
age, or a different place, etc.) If you want to talk about a probability of dying in that year
(for which the proportion would serve as an estimate), this is a theoretical construct, which can
be modelled (as we will see) in different ways. Once you have a probability model, this allows
you to pose (and perhaps answer) questions about the probability of dying in a given day, make
predictions about past and future trends, and isolate the effect of certain medications or life-style
changes on mortality.
There are many different kinds of problems for which the same survival analysis statistics
may be applied. Some examples which we will consider at various points in this course are:

• Time to failure of a machine with multiple internal components.

• Time from infection until a subject shows signs of illness.

• Time from starting to try to conceive a baby until a woman is pregnant.

• Time until a person diagnosed with (and perhaps treated for) a disease has a recurrence.

• Time until an unmarried couple marries or separates.

Often, though, we will use the term “lifetime” to represent any waiting time, along with its
attendant vocabulary: survival probability, mortality rate, cause of death, etc.

1.2 Basic statistical methods for lifetime distributions


In Table 1.1 we see the estimated ages at death for 103 tyrannosaurs, from four different species,
as reported in [9]. Let us treat them here as a single population.

A. sarcophagus    2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28
T. rex            2,6,8,9,11,14,15,16,17,18,18,18,18,18,19,21,21,21,22,22,22,22,22,22,23,23,24,24,28
G. libratus       2,5,5,5,7,9,10,10,10,11,12,12,12,13,13,14,14,14,14,14,15,16,16,17,17,17,18,18,18,19,19,19,20,20,21,21,21,21,22
Daspletosaurus    3,9,10,17,18,21,21,22,23,24,26,26,26

Table 1.1: 103 estimated ages of death (in years) for four different tyrannosaur species.

In Part A Statistics you learned to do the following:


1.2.1 Plot the data
The most basic thing you can do with any data is to sort the observations into bins of some
width ∆, and plot the histogram, as in Figure 1.1). This does not presuppose any model.

(a) Narrow bins (b) Wide bins

Figure 1.1: Histogram of tyrannosaur mortality data from Table 1.1.
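As a minimal sketch, histograms like those in Figure 1.1 can be produced in R as follows. The ages are typed in from Table 1.1; the bin widths of 2 and 5 years are assumptions, since the exact breaks used in the figure are not stated.

# Estimated ages at death (years) for the 103 tyrannosaurs of Table 1.1, pooled
tyrannosaur <- c(
  2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28,                       # A. sarcophagus
  2,6,8,9,11,14,15,16,17,18,18,18,18,18,19,21,21,21,22,22,22,22,22,22,23,23,24,24,28, # T. rex
  2,5,5,5,7,9,10,10,10,11,12,12,12,13,13,14,14,14,14,14,
  15,16,16,17,17,17,18,18,18,19,19,19,20,20,21,21,21,21,22,                           # G. libratus
  3,9,10,17,18,21,21,22,23,24,26,26,26)                                               # Daspletosaurus
length(tyrannosaur)   # 103

# Histograms with narrow and wide bins, in the spirit of Figure 1.1
hist(tyrannosaur, breaks = seq(0, 30, by = 2), xlab = "age (yrs)",
     main = "Histogram of tyrannosaur deaths")   # narrow bins
hist(tyrannosaur, breaks = seq(0, 30, by = 5), xlab = "age (yrs)",
     main = "Histogram of tyrannosaur deaths")   # wide bins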



1.2.2 Fit a model


Suppose we believe the list of lifetimes to be i.i.d. samples from a fixed (unknown) distribution.
We can then use the data to determine which distribution it was that generated the samples.
In Part A statistics you learned parametric maximum likelihood estimation. Suppose the
unknown distribution is believed to be one of a family of distributions that is indexed by a
possibly multivariate (k-dimensional) parameter λ ∈ Λ ⊂ Rk . That is — taking just the case
of data from a continuous distribution — the distribution of the independent observations has
density f (T ; λ) at the point T , if the true value of the parameter is λ. Suppose we have observed
n independent lifetimes T1 , . . . , Tn . We define the log-likelihood function to be the (natural) log
of the density of the observations, considered as a function of the parameter. By the assumption
of independence, this is

    ℓ_{T1,...,Tn}(λ) = ℓ_T := ∑_{i=1}^n log f(Ti ; λ).    (1.1)

(We use T to represent the vector (T1, . . . , Tn).) The maximum likelihood estimator (MLE) is
simply the value of λ that makes this as large as possible:

    λ̂ = λ̂(T) = λ̂(T1, . . . , Tn) := arg max_{λ∈Λ} ∏_{i=1}^n f(Ti ; λ).    (1.2)

Notice the nomenclature: max_{λ∈Λ} f(λ) picks the maximal value in the range of f, while arg max_{λ∈Λ} f(λ)
picks the λ-value in the domain of f for which this maximum is attained.
The most basic model for lifetimes is the exponential. This is the “memoryless” waiting-
time distribution, meaning that the remaining waiting time always has the same distribution,
conditioned on the event not having occurred up to any time t. This distribution has a single
parameter (k = 1) µ, and density
    f(T ; µ) = µ e^{−µT}.

The parameter µ is chosen from the domain Λ = (0, ∞). If we observe independent lifetimes
T1, . . . , Tn from the exponential distribution with parameter µ, and let T̄ := n^{−1} ∑_{i=1}^n Ti be the
average, the log likelihood is

    ℓ_T(µ) = ∑_{i=1}^n log( µ e^{−µTi} ) = n( log µ − T̄ µ ),

which has maximum at µ̂ = 1/T̄ = n / ∑ Ti. This is an example of what we will see to be a
general principle:

    Estimated rate = (# events) / (total time at risk).    (1.3)
In some cases we will be thinking of the time as random, in other cases the number of events,
but the formula (1.3) remains. The challenge will be to estimate the number of events and the
total time in a way that they correspond to the same time period and the same population, since
they are often estimated from different data sources and timed in different ways.
For large n, the estimator λ̂(T1 , . . . , Tn ) is approximately normally distributed, under some
regularity conditions, and it has some other optimality properties (finite-sample and asymptotic).
This allows us to construct approximate confidence intervals/regions to indicate the precision of
maximum likelihood estimates. Specifically,

    λ̂ ∼ N( λ, (I(λ))^{−1} ),   where   I_{j1 j2}(λ) = −E[ ∑_{i=1}^n ∂²/(∂λ_{j1} ∂λ_{j2}) log f(Ti ; λ) ] = −E[ ∂² ℓ_T(λ) / (∂λ_{j1} ∂λ_{j2}) ]

are the entries of the Fisher Information matrix. Of course, we generally don’t know what λ is —
otherwise, we probably would not be bothering to estimate it! — so we may approximate the
Information matrix by computing Ij1 j2 (λ̂) instead. Furthermore, we may not be able to compute
the expectation in any straightforward way; in that case, we use the principle of Monte Carlo
estimation: We approximate the expectation of a random variable by the average of a sample of
observations. We already have the sample T1 , . . . , Tn from the correct distribution, so we define
the observed information matrix

    J_{j1 j2}(λ, T1, . . . , Tn) = − (1/n) ∑_{i=1}^n ∂² log f(Ti ; λ) / (∂λ_{j1} ∂λ_{j2}).

Again, we may substitute Jj1 j2 (λ̂, T1 , . . . , Tn ), since the true value of λ is unknown. Thus, in the
case of a one-dimensional parameter (where the covariance matrix is just the variance and the
matrix inverse (I(λ̂))^{−1} is just the multiplicative inverse in R), we obtain

    ( λ̂ − 1.96 √(1/I(λ̂)),  λ̂ + 1.96 √(1/I(λ̂)) )
as an approximate 95% confidence interval for the unknown parameter λ.


In the case of the exponential model, we have

    ℓ''_T(µ) = −n/µ²,

so that the standard error for µ̂ is µ/√n, which we estimate by µ̂/√n. For the tyrannosaur data
of Table 1.1, we have

    T̄ = 16.03,
    µ̂ = 0.062,
    SE_µ̂ = 0.0061,
    95% confidence interval for µ = (0.050, 0.074).

Aside: In the special case of exponential lifetimes, we can construct exact confidence intervals,
since we know the distribution of n/µ̂ ∼ Γ(n, µ), so that 2nµ/µ̂ ∼ χ²_{2n} allows us to use χ²-tables.
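A sketch of these computations in R, using the pooled vector tyrannosaur defined above; the last line implements the exact interval from the aside, the rest is the asymptotic calculation:

n      <- length(tyrannosaur)
mu.hat <- 1 / mean(tyrannosaur)                  # MLE: mu.hat = n / sum(T_i)
se.mu  <- mu.hat / sqrt(n)                       # estimated standard error mu.hat / sqrt(n)
mu.hat + c(-1.96, 1.96) * se.mu                  # approximate 95% confidence interval
# Exact interval, using 2 n mu / mu.hat ~ chi-squared with 2n degrees of freedom
mu.hat * qchisq(c(0.025, 0.975), df = 2 * n) / (2 * n)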
Is the fit any good? We have various standard methods of testing goodness of fit — we
discuss an example in section 1.2.3 — but it’s pretty easy to see by eye that the histograms in
Figure 1.1 aren’t going to fit an exponential distribution, which is a declining density, very well.
In Figure 1.2 we show the empirical (observed) cumulative distribution of tyrannosaur deaths,
together with the cdf of the best exponential fit, which is obviously not a very good fit at all.
We also show (in green) the fit of a distribution which is an example of a larger class that we
will meet later, called the Weibull distributions. Instead of the exponential cdf F(t) = 1 − e^{−µt},
suppose we take F(t) = 1 − e^{−αt²}. Note that if we define Yi = Ti², we have

    P(Yi ≤ y) = P(Ti ≤ √y) = 1 − e^{−αy},

so Yi is actually exponentially distributed with parameter α. Thus, the MLE for α is

    α̂ = n / ∑ Ti².
We see in Figure 1.2 that this fits much better than the exponential distribution.

Figure 1.2: Empirical cumulative distribution of tyrannosaur deaths (circles), together with cdf
of exponential fit (red) and Weibull fit (green).
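A sketch in R of the two fits and a plot along the lines of Figure 1.2, using the tyrannosaur vector and mu.hat from the earlier sketches:

alpha.hat <- length(tyrannosaur) / sum(tyrannosaur^2)        # MLE for the alpha * t^2 model
plot(ecdf(tyrannosaur), xlab = "Age", ylab = "CDF", main = "")
curve(1 - exp(-mu.hat * x),      add = TRUE, col = "red")    # exponential fit
curve(1 - exp(-alpha.hat * x^2), add = TRUE, col = "green")  # Weibull-type fit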

1.2.3 Significance test


The maximum-likelihood approach is optimal in many respects for picking the correct parameter
based on the observed data, under the assumption that the observed data did actually
come from a distribution in the appointed parametric family. But did they? We already
looked at a plot, Figure 1.2, comparing the fit cdf to the observed cdf. The Weibull fit was clearly
better. But how much better?
One way to answer this question is to apply a significance test. We start with a set of
distributions H1, such that we know that it includes the true distribution (for instance, the set
of all distributions on (0, ∞)), and a null hypothesis H0 ⊂ H1, and we wish to test how plausible
the observations are as a sample from H0 , rather than from the alternative hypothesis H1 \ H0 .
The standard parametric procedure is to use a χ² goodness of fit test, based on the statistic

    X² = ∑_{j=1}^m (Oj − Ej)² / Ej  ∼  χ²_{m−k−1}   approximately, under H0,    (1.4)

where m is the number of bins (e.g. from your histogram), but merged to satisfy size restrictions,
and k the number of parameters estimated. Oj is the random variable modelling the number
observed in bin j, Ej the number expected under maximum likelihood parameters. To justify
the approximate distribution for the test statistic, we require that at most 20% of bins have
Ej ≤ 5, and none has Ej ≤ 1 (the ‘size restriction’).
We then obtain X² = 17.9 for the Weibull model, and X² = 92.2 for the exponential
distribution. The latter produces a p-value on the order of 10^{−18}, but the former has a p-value
around 0.0013. Thus, while the data could not possibly have come from an exponential
distribution, or anything like it, the Weibull distribution, while unlikely to have produced exactly
these data, is a plausible candidate.

Age      Observed    Expected (Exponential)    Expected (Weibull)
0–4        8              22.7                      5.4
5–9       13              21.5                     19.3
10–14     22              15.7                     25.3
15–19     39              11.5                     22.7
20–24     25               8.4                     15.7
25+       15              23.1                     14.6

Table 1.2: χ2 computation for fitting tyrannosaur data.
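As a sketch, such a goodness-of-fit statistic can be computed in R from the fitted models of the previous sections. The bin edges below are an assumption (the notes do not state the ones used for Table 1.2), so the resulting values of X² will differ somewhat from those quoted above:

breaks <- c(0, 5, 10, 15, 20, 25, Inf)                  # assumed bin edges
obs    <- table(cut(tyrannosaur, breaks, right = FALSE))
n      <- length(tyrannosaur)
E.exp  <- n * diff(1 - exp(-mu.hat * breaks))           # expected counts, exponential fit
E.wei  <- n * diff(1 - exp(-alpha.hat * breaks^2))      # expected counts, Weibull fit
X2     <- function(O, E) sum((O - E)^2 / E)
c(exponential = X2(obs, E.exp), weibull = X2(obs, E.wei))
# Compare with chi-squared on m - k - 1 = 6 - 1 - 1 = 4 degrees of freedom
pchisq(c(X2(obs, E.exp), X2(obs, E.wei)), df = 4, lower.tail = FALSE)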

1.3 Goals of the course


Why do we need special statistical methods for lifetime data? Some reasons are:

• Large samples We have looked at some very simple single-parameter models. Other
models, with more complicated time-varying mortality rates, may be closer to the truth.
But no matter how elaborate the multivariate parametric models that we propose, they
are unlikely to be precisely true. A parametric family will eventually be rejected once the
sample size is large enough — and since we may be concerned with statistical surveys of,
for example, the entire population of the UK, the sample sizes will be very large indeed.
Nonparametric or semiparametric methods will be better able to let the data speak for
themselves.

• Small samples While nonparametric models allow the data to speak for themselves,
sometimes we would prefer that they be somewhat muffled. When the number of observed
deaths is small — which can be the case, even in a very large data set, when considering
advanced ages, above 90, and certainly above 100, because of the small number of individuals
who survive to be at risk, but also in children, because of the very low mortality rate — the
estimates are less reliable, being subject to substantial random noise. Also, the mortality
pattern changes over time, and we are often interested in future mortality, but only have
historical data. A non-parametric estimate that precisely reflects the data at hand may
reflect less well the underlying processes, and be ill-suited to projection into the future.
Graduation (smoothing) and extrapolation methods have been developed to address these
issues.

• Incomplete observations Some observations will be incomplete. We may not know the
exact time of a death, but only that it occurred before a given time, or after a given time,
or between two known times, a phenomenon called “censoring”. (When we are informed
only of the year of a death, but not the day or time, this is a kind of censoring.) Or we may
have observed only a sample of the population, with the sample being not entirely random,
but chosen according to being alive at a certain date, or having died before a certain date,
a phenomenon known as “truncation”. We need special techniques to make use of these
partial observations. Since we are observing times, subjects who break off a study midway
through provide partial information in a clearly structured way.

• Successive events A key fact about time is its sequence. A patient is infected, develops
symptoms, has a diagnosis, a treatment, is cured or relapses, at some point dies. Some

or all of these events may be considered as a progression, and we may want to model the
sequence of random times. Some care is needed to carry out joint maximum likelihood
estimation of all transition rates in the model, from one or several individuals observed.
This can be combined with time-varying transition rates.
• Comparing lifetime distributions We may wish to compare the lifetime distributions
of different groups (e.g., smokers and nonsmokers; those receiving a traditional cholesterol
medication and those receiving the new drug) or the effect of a continuous parameter (e.g.,
weight) on the lifetime distribution.
• Changing rates Mortality rates are not static in time, creating disjunction between period
measures — looking at a cross-section of the population by age as it exists at a given
time — and cohort measures — looking at a group of individuals born at a given time,
and following them through life.

1.4 Survival function and hazard rate (force of mortality)


In general, we may describe a lifetime distribution — which is simply the distribution of a
nonnegative random variable — in several different ways:

    cdf                  F(t) = P{L ≤ t};
    survival function    S(t) = F̄(t) = 1 − F(t) = P{L > t};
    density function     f(t) = dF/dt;
    hazard rate          λ(t) = f(t)/F̄(t).

The hazard rate is also called mortality rate in survival contexts. The traditional name in
demography is force of mortality. This may be thought of as the instantaneous rate of dying per
unit time, conditioned on having already survived. The exponential distribution with parameter
λ ∈ (0, ∞) is given by

    cdf                  F(t) = 1 − e^{−λt};
    survival function    F̄(t) = e^{−λt};
    density function     f(t) = λe^{−λt};
    hazard rate          λ(t) = λ.

Thus, the exponential is the distribution with constant force of mortality, which is a formal
statement of the “memoryless” property.
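As a quick numerical check in R of how these four descriptions fit together, here they are for an exponential distribution (the rate 0.1 is an arbitrary choice):

lambda <- 0.1
t  <- seq(0, 50, by = 0.5)
Ft <- pexp(t, rate = lambda)   # cdf F(t)
St <- 1 - Ft                   # survival function
ft <- dexp(t, rate = lambda)   # density
ht <- ft / St                  # hazard rate = density / survival
range(ht)                      # constant, equal to lambda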

1.5 Residual lifetimes


Assume that there is an overall lifetime distribution, and every individual born has a random
lifetime according to this distribution. Then, if we observe somebody now aged x, and we denote
his residual lifetime T − x by Tx, then we have

    F̄_Tx(t) = F̄_{T−x|T>x}(t) = F̄_T(x + t) / F̄_T(x),    f_Tx(t) = f_{T−x|T>x}(t) = f_T(x + t) / F̄_T(x),    t ≥ 0.    (1.5)
So, any distribution of a full lifetime T is naturally associated with a family of conditional
distributions of T given T > x.

1.6 Force of mortality


We now look more closely at the hazard rate, which may be defined as

    h_T(t) = µ_t = lim_{ε↓0} (1/ε) P(T ≤ t + ε | T > t) = lim_{ε↓0} (1/ε) P(t < T ≤ t + ε) / P(T > t) = f_T(t) / F̄_T(t).    (1.6)

The density fT (t) is the (unconditional) infinitesimal probability to die at age t. The hazard
rate hT (t) is the (conditional) infinitesimal probability to die at age t of an individual known to
be alive at age t. It may seem that the hazard rate is a more complicated quantity than the
density, but it is very well suited to modelling mortality. Whereas the density has to integrate to
one and the distribution function (survival function) has boundary values 0 and 1, the force of
mortality has no constraints, other than being nonnegative — though if “death” is certain the
force of mortality has to integrate to infinity. Also, we can read its definition as a differential
equation and solve

    F̄'_T(t) = −µ_t F̄_T(t),   F̄_T(0) = 1   ⇒   F̄_T(t) = exp{ −∫_0^t µ_s ds },    t ≥ 0.    (1.7)

We can now express the distribution of Tx as

    F̄_Tx(t) = F̄_T(x + t) / F̄_T(x) = exp{ −∫_x^{x+t} µ_s ds } = exp{ −∫_0^t µ_{x+r} dr },    t ≥ 0.    (1.8)

Note that this implies that h_Tx(t) = h_T(x + t), so it is really associated with age x + t only, not
with initial age x nor with time t after initial age. Also note that, given a measurable function
µ : [0, ∞) → R, F̄_Tx(0) = 1 always holds, F̄_Tx is decreasing if and only if µ ≥ 0, and F̄_Tx(∞) = 0 if and
only if ∫_0^∞ µ_t dt = ∞. This leaves a lot of modelling freedom via the force of mortality.
Densities can now be obtained from the definition of the force of mortality (and consistency)
as fTx (t) = µt+x F̄Tx (t).
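The relation (1.7) is convenient numerically: given any hazard function we can recover the survival function and density. A minimal R sketch, with an exponentially growing hazard chosen arbitrarily for illustration:

# Survival and density recovered from a hazard rate via (1.7)
mu <- function(t) 0.001 * exp(0.1 * t)          # an arbitrary, exponentially growing hazard
S  <- function(t) sapply(t, function(u) exp(-integrate(mu, 0, u)$value))
f  <- function(t) mu(t) * S(t)                  # f(t) = mu(t) * S(t)
ages <- seq(0, 100, by = 20)
cbind(age = ages, survival = S(ages), density = f(ages))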

1.7 Defining mortality laws from hazards


We are now in the position to model mortality laws via their force of mortality. Clearly, the
Exp(λ) distribution has a constant hazard rate µt ≡ λ, and the uniform distribution on [0, ω]
has a hazard rate

    h_T(t) = 1/(ω − t),    0 ≤ t < ω.    (1.9)

Note that here ∫_0^ω h_T(t) dt = ∞ squares with F̄_T(ω) = 0 and forces the maximal age ω. This is a
general phenomenon: distributions with compact support have a divergent force of mortality at
the supremum of their support, and the singularity is not integrable.
The Gompertz distribution is given by µ_t = B e^{θt}. More generally, Makeham’s law is given by

    µ_t = A + B e^{θt},    F̄_Tx(t) = exp{ −At − m( e^{θ(x+t)} − e^{θx} ) },    x ≥ 0, t ≥ 0,    (1.10)

for parameters A > 0, B > 0, θ > 0; m = B/θ. Note that mortality grows exponentially. If θ is
big enough, the effect is very close to introducing a maximal age ω, as the survival probabilities
decrease very quickly. There are other parametrisations for this family of distributions. The
Gompertz distribution is named for British actuary Benjamin Gompertz, who in 1825 first
published his discovery [12] that human mortality rates over the middle part of life seemed
to double at constant age intervals. It is unusual, among empirical discoveries, for having
been confirmed rather than refuted as data have improved and conditions changed, and it (or
Makeham’s modification) serves as a standard model for mortality rates not only in humans, but
in a wide variety of organisms. As an example, see Figure 1.3, which shows Canadian mortality
rates from life tables produced by Statistics Canada (available at http://www.statcan.ca:80/english/freepub/84-537-XIE/tables.htm).
Notice how close to a perfect line the mid-life mortality rates for both males and females are,
when plotted on a logarithmic scale, showing that the Gompertz model is a very good fit.

(a) Mortality rates (b) Survival function

Figure 1.3: Canadian mortality data, 1995–7.

Figure 1.3(b) shows the corresponding survival curves. It is worth recognising how much more
informative the mortality rates are. In Figure 1.3(a) we see that male mortality is regularly higher
than female mortality at all ages (and by a fairly constant ratio), and we see several phases of mortality
— early decline, a jump in adolescence, then steady increase through midlife, and deceleration in
extreme old age — whereas Figure 1.3(b) shows us only that mortality is accelerating overall,
and that males have accumulated higher mortality by late life.
The Weibull distribution suggests a polynomial rather than exponential growth of mortality:

    µ_t = α ρ^α t^{α−1},    F̄_Tx(t) = exp{ −ρ^α ( (x + t)^α − x^α ) },    x ≥ 0, t ≥ 0,    (1.11)

for rate parameter ρ > 0 and exponent α > 0. The Weibull model is commonly used in engineering
contexts to represent the failure-time distribution for machines. The Weibull distribution arises
naturally as the lifespan of a machine with n redundant components, each of which has constant
failure rate, such that the machine fails only when all components have failed. Later in the
course we will discuss how to fit Weibull and Gompertz models to data.
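A short R sketch of the point made about Figure 1.3(a): on a logarithmic scale a Gompertz hazard is a straight line, while a Weibull hazard is not. The parameter values here are illustrative only, not fitted to any data:

age <- 1:100
gompertz <- 1e-4 * exp(0.085 * age)          # mu_t = B exp(theta t)
weibull  <- 5 * 0.01^5 * age^4               # mu_t = alpha rho^alpha t^(alpha - 1), alpha = 5, rho = 0.01
matplot(age, cbind(gompertz, weibull), type = "l", log = "y", lty = 1:2, col = 1,
        xlab = "age", ylab = "hazard rate")
legend("topleft", c("Gompertz", "Weibull"), lty = 1:2)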
Another class of distributions is obtained by replacing the parameter λ in the exponential dis-
tribution by a (discrete or continuous) random variable M . Then the specification of exponential
conditional densities

    f_{T|M=λ}(t) = λ e^{−λt}    (1.12)

determines the unconditional density of T as

    f_T(t) = ∫_0^∞ f_{T,M}(t, λ) dλ = ∫_0^∞ λ e^{−λt} f_M(λ) dλ    or    f_T(t) = ∑_{λ>0} λ e^{−λt} P(M = λ).    (1.13)

Various special cases of exponential mixtures and other extensions of the exponential distribution
have been suggested in a life insurance context.

1.8 Curtate lifespan


We have implicitly assumed that the lifetime distribution is continuous. However, we can always
pass from a continuous random variable T on [0, ∞) to a discrete random variable K = ⌊T⌋, its
integer part, on N. If T models a lifetime, then K is called the associated curtate lifespan.
Chapter 2

Lifetime distributions and life tables

2.1 Mortality laws: Simple or Complex? Parametric or Nonparametric?
Consider the data for Albertosaurus sarcophagus in Table 1.1. We see here the estimated ages
at death for 22 members of this species. Let us assume, for the sake of discussion, that these
estimates are correct, and that our skeleton collection represents a simple random sample of all
Albertosaurs that ever lived. If we assume that there was a large population of these dinosaurs,
and that they died independently (and not, say, in a Cretaceous suicide pact), then these are
22 independent samples T1 , . . . , T22 of a random variable T whose distribution we would like to
know. Consider the probabilities

    qx := P{ x ≤ T < x + 1 }.

Then the number of individuals observed to have curtate lifespan x has binomial distri-
bution Bin(22, qx ). The MLE for a binomial probability is just the naïve estimate q̂x =
# successes/# trials (where a “success”, in this case, is a death in the age interval under con-
sideration). To compute q̂2 , then, we observe that there were 22 Albertosaurs from our sample
still alive on their second birthdays, of which one unfortunate met its maker in the following year:
q̂2 = 1/22 ≈ 0.046. As for q̂3 , on the other hand, there were 21 Albertosaurs observed alive on
their third birthdays, and all of them arrived safely at their fourth, making q̂3 = 0/21. This
leads us to the peculiar conclusion that our best estimate for the probability of an albertosaur
dying in its third year is 0.046, but that the probability drops to 0 in its fourth year, then
becomes nonzero again in the fifth year, and so on. This violates our intuition that mortality
rates should be fairly smooth as a function of age. This problem becomes even more extreme
when we consider continuous lifetime models. With no constraints, the optimal estimator for the
mortality distribution would put all the mass on just those moments when deaths were observed
in the sample, and no mass elsewhere — in other words, infinite hazard rate at a finite set of
points at which deaths have been observed, and 0 everywhere else.
As we see from Figure 1.1, the mortality distribution for the tyrannosaurs becomes much
smoother and less erratic when we use larger bins for the histogram. This is no surprise, since
we are then sampling from a larger baseline, leading to less random fluctuation. The simplest
way to impose our intuition of regularity upon the estimators is to increase the time-step and
reduce the number of parameters to estimate. An extreme version of this, of course, is to impose
a parametric model with a small number of parameters. This is part of the standard tradeoff in

statistics: a free, nonparametric model is sensitive to random fluctuations, but constraining the
model imposes preconceived notions onto the data.
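A sketch of the q̂x computation in R for the Albertosaurus ages of Table 1.1:

albertosaurus <- c(2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28)
ages   <- 0:max(albertosaurus)
deaths <- sapply(ages, function(x) sum(albertosaurus == x))   # deaths with curtate lifespan x
alive  <- sapply(ages, function(x) sum(albertosaurus >= x))   # number still alive at exact age x
qx.hat <- deaths / alive                                      # naive MLE of q_x
cbind(age = ages, alive, deaths, qx.hat = round(qx.hat, 3))[3:6, ]   # ages 2 to 5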
Notation: When the hazard rate µx is being assumed constant over each year of life, the
continuous mortality rate has been reduced to a discrete set of parameters. What do we call
these parameters? In actuarial contexts the constant hazard rate for T on the interval [k, k + 1)
is called µ_T(k + 1/2), or sometimes µ_{k+1/2} when T is clear from the context. This has the advantage
of including in the notation the assumption of constancy over integer intervals. It is, however, a
very inflexible notation, that causes all kinds of confusion as soon as the problem gets slightly
more general — for instance, when we assume rates constant over intervals of length 5 or 10.
We will instead write µT (k) or µk , and be sure to make clear in context what intervals we are
assuming µ to be constant over.

2.2 Life Tables


Life tables represent a discretised form of the hazard function for a population, often together
with raw mortality data. Apart from an aggregate table subsuming the whole population (of
the UK, say), such tables exist for various groups of people characterized by their sex, smoking
habits, job type, insurance level etc. This immediately raises interesting questions concerning
the interdependence of such tables, but we focus here on some fundamental issues, which are
already present for the single aggregate table.
We begin with a naïve, empirical approach. In Table 2.1 we see a life table for men in the
UK, in the years 1990–2, as provided by the Office of National Statistics. In the column labelled
E_x^c we see the number of years “exposed to risk” in age-class x. Since everyone alive is at risk of
dying, this should be exactly the sum of the number of individuals alive in the age class in years
1990, 1991, and 1992. The 1991 number is obtained from the census of that year, and the other
two years are estimated. The column dx shows the number of men of the given age known to
have died during this three-year period. The final column is mx := dx / E_x^c.
Again, this is an empirical fact, but we find ourselves in a quandary when we try to interpret
it. What is mx ? If the number of deaths is reasonably stable from year to year, then mx should
be close to the fraction of men aged x who died each year. How close? The number of men at
risk changes constantly, with each birthday, each death, each immigration or emigration. We
sense intuitively that the effect of these changes would be small, but how small? And what would
we do to compensate for this in a smaller population, where the effects are not negligible? How
do we make projections about future states of the population?
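As a minimal illustration, the final column of Table 2.1 is just the ratio of the two preceding columns; in R, for the first three rows:

Ex <- c(1066867, 1059343, 1054256)   # central exposed to risk, ages 0, 1, 2
dx <- c(8779, 661, 403)              # deaths at those ages
mx <- dx / Ex                        # m_x = d_x / E_x^c
round(mx * 1e5)                      # 823, 62, 38, matching the table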

AGE x    Ex    dx    mx × 10^5        AGE x    Ex    dx    mx × 10^5


0 1066867 8779 823 52 827414 4781 578
1 1059343 661 62 53 822603 5324 647
2 1054256 403 38 54 810731 5723 706
3 1047298 319 30 55 794930 6411 806
4 1037973 251 24 56 775350 6925 893
5 1022032 229 22 57 759747 7592 999
6 1003486 201 20 58 755475 8477 1122
7 989008 186 19 59 761913 9484 1245
8 976049 180 18 60 764497 10735 1404
9 981422 180 18 61 753706 11880 1576
10 988020 179 18 62 736868 12871 1747
11 984778 179 18 63 725679 14463 1993
12 950853 185 19 64 721743 16094 2230
13 909437 212 23 65 713576 17704 2481
14 891556 259 29 66 700666 19097 2726
15 913423 366 40 67 681977 20930 3069
16 954339 496 52 68 676972 22507 3325
17 1002077 758 76 69 678157 25127 3705
18 1057508 922 87 70 684764 27159 3966
19 1124668 930 83 71 600343 26508 4415
20 1163581 979 84 72 504808 24443 4842
21 1195366 1030 86 73 422817 22792 5391
22 1210521 1073 89 74 422480 24921 5899
23 1238979 1105 89 75 431321 27286 6326
24 1263313 1083 86 76 422822 29712 7027
25 1296300 1068 82 77 399257 30856 7728
26 1313794 1145 87 78 365168 30744 8419
27 1311662 1090 83 79 328386 30334 9237
28 1291017 1110 86 80 293014 29788 10166
29 1259644 1129 90 81 260517 28483 10933
30 1219278 1101 90 82 229149 27399 11957
31 1176120 1144 97 83 197322 25697 13023
32 1135091 1128 99 84 165896 23717 14296
33 1103162 1095 99 85 136103 20930 15378
34 1071474 1142 107 86 110565 18689 16903
35 1035587 1218 118 87 87989 16370 18605
36 1017422 1291 127 88 68443 13571 19828
37 1010544 1399 138 89 52151 11284 21637
38 1006929 1536 153 90 40257 9061 22508
39 1006500 1660 165 91 29000 7032 24248
40 1016727 1662 163 92 20124 5405 26858
41 1046632 1967 188 93 13406 4057 30263
42 1092927 2240 205 94 9392 3069 32677
43 1167798 2543 218 95 6446 2219 34424
44 1134652 2656 234 96 4384 1578 35995
45 1071729 2836 265 97 2795 1091 39034
46 974301 2930 301 98 1761 701 39807
47 955329 3251 340 99 1059 489 46176
48 914107 3354 367 100 624 292 46795
49 848419 3486 411 101 359 178 49582
50 815653 3836 470 102 216 118 54630
51 811134 4251 524 103 107 63 58879

Table 2.1: Male mortality data for England and Wales, 1990–2. From [11] (available online).

AGE x `x qx ex AGE x `x qx ex
0 100000 0.0082 73.4 52 92997 0.0058 24.4
1 99180 0.0006 73.0 53 92461 0.0065 23.6
2 99119 0.0004 72.1 54 91865 0.0070 22.7
3 99081 0.0003 71.1 55 91219 0.0080 21.9
4 99052 0.0002 70.1 56 90486 0.0089 21.0
5 99028 0.0002 69.1 57 89682 0.0099 20.2
6 99006 0.0002 68.2 58 88791 0.0112 19.4
7 98986 0.0002 67.2 59 87800 0.0124 18.6
8 98967 0.0002 66.2 60 86714 0.0139 17.9
9 98950 0.0002 65.2 61 85505 0.0156 17.1
10 98932 0.0002 64.2 62 84168 0.0173 16.4
11 98914 0.0002 63.2 63 82710 0.0197 15.6
12 98896 0.0002 62.2 64 81078 0.0221 14.9
13 98877 0.0002 61.2 65 79290 0.0245 14.3
14 98855 0.0003 60.2 66 77347 0.0269 13.6
15 98826 0.0004 59.3 67 75267 0.0302 13.0
16 98786 0.0005 58.3 68 72992 0.0327 12.4
17 98735 0.0008 57.3 69 70605 0.0364 11.8
18 98660 0.0009 56.4 70 68037 0.0389 11.2
19 98574 0.0008 55.4 71 65391 0.0432 10.6
20 98492 0.0008 54.5 72 62567 0.0473 10.1
21 98410 0.0009 53.5 73 59610 0.0525 9.6
22 98325 0.0009 52.5 74 56481 0.0573 9.1
23 98238 0.0009 51.6 75 53246 0.0613 8.6
24 98150 0.0009 50.6 76 49982 0.0679 8.1
25 98066 0.0008 49.7 77 46590 0.0744 7.7
26 97986 0.0009 48.7 78 43125 0.0807 7.2
27 97900 0.0008 47.8 79 39643 0.0882 6.8
28 97819 0.0009 46.8 80 36145 0.0967 6.5
29 97735 0.0009 45.8 81 32651 0.1036 6.1
30 97647 0.0009 44.9 82 29270 0.1127 5.7
31 97559 0.0010 43.9 83 25971 0.1221 5.4
32 97465 0.0010 43.0 84 22800 0.1332 5.1
33 97368 0.0010 42.0 85 19763 0.1425 4.8
34 97272 0.0011 41.0 86 16946 0.1555 4.5
35 97168 0.0012 40.1 87 14310 0.1698 4.2
36 97053 0.0013 39.1 88 11881 0.1799 4.0
37 96930 0.0014 38.2 89 9744 0.1946 3.8
38 96796 0.0015 37.2 90 7848 0.2016 3.6
39 96648 0.0016 36.3 91 6266 0.2153 3.3
40 96489 0.0016 35.4 92 4917 0.2355 3.1
41 96332 0.0019 34.4 93 3759 0.2611 2.9
42 96151 0.0021 33.5 94 2777 0.2787 2.7
43 95954 0.0022 32.5 95 2003 0.2912 2.6
44 95745 0.0023 31.6 96 1420 0.3023 2.5
45 95521 0.0027 30.7 97 991 0.3232 2.3
46 95269 0.0030 29.8 98 670 0.3284 2.2
47 94982 0.0034 28.9 99 450 0.3698 2.1
48 94660 0.0037 28.0 100 284 0.3737 2.0
49 94313 0.0041 27.1 101 178 0.3909 1.9
50 93926 0.0047 26.2 102 108 0.4209 1.8
51 93486 0.0052 25.3 103 63 0.4450 1.7

Table 2.2: Life table for English men, computed from data in Table 2.1

2.3 Notation for life tables

qx Probability that individual aged x dies before reaching age x + 1


px Probability that individual aged x survives to age x + 1
t qx Probability that individual aged x dies before reaching age x + t
t px Probability that individual aged x survives to age x + t
lx Number of people who survive to age x. Note: This is based
on starting with a fixed number l0 of lives, called the Radix;
most commonly, for human populations the radix is 100,000
dx Number of individuals who die aged x (from the standard population)
t mx Mortality rate between exact age x and exact age x + t
ex Remaining life expectancy at age x

Note the following relationships:


dx = lx − lx+1 ;
lx+1 = lx px = lx (1 − qx );
t px = ∏_{i=0}^{t−1} px+i = px px+1 · · · px+t−1 .

The quantities qx may be thought of as the discrete analogue of the mortality rate — we will
call it the discrete mortality rate or discrete hazard function — since it describes the probability
of dying in the next unit of time, given survival up to age x. In Table 2.2 we show the life table computed from the raw data of Table 2.1. (It differs slightly from the official table, because the
official table added some slight corrections. The differences are on the order of 1% in qx , and
much smaller in lx .) The life table represents the effect of mortality on a nominal population
starting with size l0 called the Radix, and commonly fixed at 100,000 for large-population life
tables. Imagine 100,000 identical individuals — a cohort — born on 1 January, 1900. In the
column qx we give the estimates for the probability of an individual who is alive on his x birthday
dying in the next year, before his x + 1 birthday. (We discuss these estimates later in the chapter.)
Thus, we estimate that 820 of the 100,000 will die before their first birthday. The surviving
l1 = 99, 180 on 1 January, 1901, face a mortality probability of 0.00062 in their next year, so
that we expect 61 of them to die before their second birthday. Thus l2 = 99119. And so it goes.
The final column of this table, labelled ex , gives remaining life expectancy; we will discuss this
in section 2.13.

2.4 Continuous and discrete models


The first decision that needs to be made in setting up a lifetime model is whether to model
lifetimes as continuous or discrete random variables. On first consideration, the discrete approach
may seem to recommend itself: after all, we are commonly concerned with mortality data given in
whole years or, if not years, then whole numbers of months, weeks, or days. Real measurements
are inevitably discrete multiples of some minimal unit of precision. In fact, though, discrete
models for measured quantities are problematic because
• They tie the analysis to one unit of measurement. If you start by measuring lifespans in years, and restrict the model accordingly, you have no way of even posing a question about, for instance, the effect of shifting the reporting date within the year.

• Discrete methods are comfortable only when the numbers are small, whereas moving down
to the smallest measurable unit turns the measurements into large whole numbers. Once
you start measuring an average human lifespan as 30000 days (more or less), real numbers
become easier to work with, as integrals are easier than sums.

• It is relatively straightforward to embed discrete measures within a continuous-time model,


by considering the integer part of the continuous random lifetime, called the curtate lifetime
in actuarial terminology.

(Compare this to the suggestion once made by the physicist Enrico Fermi, that lecturers might
take their listeners’ investment of time more seriously if they thought of the 50-minute span
of a lecture as a “microcentury”.) The discrete model, it is pointed out by A. S. Macdonald
in [21] (and rewritten in [7, Unit 9]), “is not so easily generalised to settings with more than
one decrement. Even the simplest case of two decrements gives rise to difficult problems,” and
involves the unnecessary complication of estimating an Initial Exposed To Risk. We will generally
treat the continuous model as the fundamental object, and treat the discrete data as coarse
representations of an underlying continuous lifetime. However, looking beyond the actuarial
setting, there are models which really do not have an underlying continuous time parameter. For
instance, in studies of human fertility, time is measured in menstrual cycles, and there simply
are no intermediate chances to have the event occur.

2.5 Crude estimation of life tables – discrete method


Since their invention in the 17th century, the basic methodology for life tables has been to collect (from the church registry or whoever kept records of births and deaths) lifetimes, truncate to integer lifetimes, count the numbers dx of deaths between ages x and x + 1, relate this to the numbers ℓx alive at age x — called the Initial Exposed to Risk — and use q̂x^(0) = dx/ℓx, or similar quantities, as an estimate for the one-year death probability qx.
In our model, the deaths are Bernoulli events with probability qx, so we know that the Maximum Likelihood Estimator for qx is q̂x^(0) = # successes/# trials = dx/ℓx for n = ℓ0 independently observed curtate lifetimes k1, . . . , kn, observed from random variables with common probability mass function (m(x))x∈N parameterized by (qx)x∈N. If we write m(x) = (1 − q0) · · · (1 − qx−1)qx, the likelihood is

    ∏_{i=1}^n m(k_i) = ∏_{x∈N} m(x)^{dx} = ∏_{x∈N} (1 − qx)^{ℓx − dx} qx^{dx},    (2.1)

where only max{k1 , . . . , kn } + 1 factors in the infinite product differ from 1, and

dx = dx (k1 , . . . , kn ) = # {1 ≤ i ≤ n : ki = x} ,
`x = `x (k1 , . . . , kn ) = # {1 ≤ i ≤ n : ki ≥ x} .

This product is maximized when its factors are maximal (the xth factor only depending on
parameter qx ). An elementary differentiation shows that q 7→ (1 − q)`−d q d is maximal for q̂ = d/`,
so that
    q̂x^(0) = q̂x^(0)(k1, . . . , kn) = dx(k1, . . . , kn) / ℓx(k1, . . . , kn),    0 ≤ x ≤ max{k1, . . . , kn}.
(The superscript (0) denotes the estimate based on the discrete method.) Note that for x = max{k1, . . . , kn} we have q̂x^(0) = 1, so no survival beyond the highest age observed is possible under the maximum likelihood parameters, so that (q̂x^(0))_{0 ≤ x ≤ max{k1,...,kn}} specifies a unique
distribution. (Varying the unspecified parameters qx , x > max{k1 , . . . , kn }, has no effect.)
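To make the discrete method concrete, here is a minimal R sketch (not part of the original derivation) that computes dx, ℓx and q̂x^(0) from a small made-up vector of curtate lifetimes; the data and variable names are purely illustrative.

    ## Hypothetical curtate lifetimes (whole years lived) for n = l0 lives
    k <- c(0, 2, 2, 3, 3, 3, 5, 6, 6, 7)
    ages <- 0:max(k)
    dx <- sapply(ages, function(x) sum(k == x))   # deaths at curtate age x
    lx <- sapply(ages, function(x) sum(k >= x))   # number alive at exact age x
    qx_hat <- dx / lx                             # discrete MLE of the one-year death probability
    data.frame(age = ages, dx = dx, lx = lx, qx_hat = round(qx_hat, 3))

Note that qx_hat equals 1 at the highest observed age, as discussed above.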

2.6 Crude life table estimation – direct method for continuous data
The estimator described in section 2.5 treats lifetimes as discrete numbers. We will discuss in later sections how to translate between discrete data and a continuous model. If we have continuous lifetime data, though, then we can estimate a continuous model directly. Assume
that you observe n = `0 independent lives t1 , . . . , tn . Then the likelihood function is
    ∏_{i=1}^n fT(ti) = ∏_{i=1}^n µ_{ti} exp{ −∫_0^{ti} µs ds }    (2.2)

Now assume that the force of mortality µs is constant on [x, x + 1), x ∈ N, and denote these values by

    µx = − log(px)    ( remember px = exp{ −∫_x^{x+1} µs ds } ).    (2.3)

Then, the likelihood takes the form


    ∏_{x∈N} µx^{dx} exp{ −µx ℓ̃x }    (2.4)

where only max{t1 , . . . , tn } + 1 factors in the infinite product differ from 1, and

    dx = dx(t1, . . . , tn) = #{ 1 ≤ i ≤ n : ⌊ti⌋ = x },
    ℓ̃x = ℓ̃x(t1, . . . , tn) = Σ_{i=1}^n ∫_x^{x+1} 1{ti > s} ds.

ℓ̃x is the total time exposed to risk. In section 2.9 we define the Central Exposed to Risk, which
is a generalisation of this notion.
The quantities µx , x ∈ N, are the parameters, and we can maximise the product by maximising
each of the factors. An elementary differentiation shows that µ 7→ µd e−µ` has a unique maximum
at µ̂ = d/`, so that

    µ̂x = µ̂x(t1, . . . , tn) = dx(t1, . . . , tn) / ℓ̃x(t1, . . . , tn),    0 ≤ x ≤ max{t1, . . . , tn}.

Since maximum likelihood estimators are invariant under reparameterisation (the range of the
likelihood function remains the same, and the unique parameter where the maximum is obtained
can be traced through the reparameterisation), we obtain
    q̂x = q̂x(t1, . . . , tn) = 1 − p̂x = 1 − exp{−µ̂x} = 1 − exp{ −dx(t1, . . . , tn) / ℓ̃x(t1, . . . , tn) }.    (2.5)

For small dx /`˜x , this is close to dx /`˜x , and therefore also close to dx /`x .
Note that under q̂x , x ∈ N, there is a positive survival probability beyond the highest observed
age, and the maximum likelihood method does not fully specify a lifetime distribution, leaving
free choice beyond the highest observed age.
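For comparison, here is a hedged R sketch of the continuous (piecewise-constant hazard) estimator, again with made-up exact lifetimes; the total time at risk in each year of age plays the role of ℓ̃x.

    ## Hypothetical exact lifetimes
    t <- c(0.4, 2.1, 2.9, 3.5, 3.6, 5.2, 6.1, 6.8, 7.3, 7.9)
    ages <- 0:floor(max(t))
    dx <- sapply(ages, function(x) sum(floor(t) == x))                  # deaths with curtate age x
    time_at_risk <- sapply(ages, function(x) sum(pmin(pmax(t - x, 0), 1)))  # time spent in [x, x+1)
    mu_hat <- dx / time_at_risk               # constant-hazard MLE on each [x, x+1)
    qx_hat <- 1 - exp(-mu_hat)                # implied one-year death probability, as in (2.5)
    data.frame(age = ages, dx = dx, time_at_risk = round(time_at_risk, 2),
               mu_hat = round(mu_hat, 3), qx_hat = round(qx_hat, 3))

Unlike the discrete method, q̂x at the highest observed age comes out strictly less than 1 here.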

2.7 Are life tables continuous or discrete?


The standard approach to life tables mixes the continuous and discrete, in sometimes confusing
ways. The data upon which life tables are based are measured in discrete units, but in most
applications we assume that the risk is actually continuous. If we were to observe a fixed number
of individuals for exactly one year, and count the number of deaths at the end of the year, and
if the number of deaths during the year were a small fraction of the total number at risk, it
would hardly matter whether we chose a discrete or continuous model. As we discuss in section
2.11.2, the distinction becomes significant to the extent that the number of individuals at risk
changes substantially over a single time unit; then we need to distinguish among three traditional
approaches:
• Discrete (or direct) method;
• Continuous method;
• Census method.
By direct method we mean that we treat the given lifetimes as being exact observations. We
will discuss this approach in the context of lifetimes rounded to the nearest integer, but it is the
same if the recording is to several decimal places; the crucial thing is simply that our model does
not admit intermediate values. If lifetimes are reported equal, then they are treated as being
exactly equal.
The connection between discrete and continuous laws is fairly straightforward, at least in
one direction. Suppose T is a lifetime with hazard rate µx at age x, and qx is the probability of dying on or after birthday x, and before the (x + 1)st birthday. Then

    t px = exp{ −∫_x^{x+t} µs ds },    t qx = 1 − t px .
Another way of putting this is to say that the discrete model may be embedded in the
continuous model, by considering the associated curtate lifespan K = bT c. The remainder
(fractional part) S = T − K = {T } can often be treated separately in a simplified way (see
below). Clearly, the probability mass function of K on N is given by
    P(K = n) = P(n ≤ T < n + 1) = ∫_n^{n+1} fT(t) dt = F̄T(n) − F̄T(n + 1)
             = exp{ −∫_0^n µT(t) dt } ( 1 − exp{ −∫_n^{n+1} µT(t) dt } )

and if we denote the one-year death probabilities (discrete hazard function) by


    qk = P(K = k | K ≥ k) = P(K = k)/P(K ≥ k) = 1 − exp{ −∫_k^{k+1} µT(t) dt }

and pk = 1 − qk , k ∈ N, we obtain the probability of success after n independent Bernoulli trials


with varying success probabilities qk :
P(K = n) = p0 . . . pn−1 qn .
Note that qk only depends on the hazard rate between ages k and k + 1. As a consequence, for Kx = ⌊Tx⌋ the probabilities

    P(Kx = n) = px · · · px+n−1 qx+n

are also easily represented in terms of (qk)k∈N.

2.8 Interpolation for non-integer ages


Suppose now that we have modeled the curtate lifetime K. The fractional part S of the lifetime
is a random variable on the interval [0, 1], commonly modeled by assuming Constant force of
mortality: µ(x) is constant on the interval [k, k + 1). Then

    1qk = 1 − e^{−µk};    µk = − log pk .

S has the distribution of an exponential random variable conditioned on S < 1, so it has density

    fS(s) = µk e^{−µk s} / (1 − e^{−µk}).
This assumption thus implies decreasing density of the lifetime through the interval. We also
have, for 0 ≤ s ≤ 1, and k an integer,
    s pk = P(T > k + s | T > k) = exp{ −∫_k^{k+s} µt dt } = exp{−sµk} = (1 − qk)^s.

Note that K and S are not independent, under this assumption. We also may write

    F̄T(k + s) = exp{ −Σ_{i=0}^{k−1} µi − sµk },
    fT(k + s) = F̄T(k + s) · µk,

for s ∈ [0, 1).
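As a quick illustration (not in the original notes) of the constant-force interpolation, the following R lines compute fractional-year survival probabilities from a single life-table qk; the numerical value is borrowed from Table 2.2 purely as an example.

    ## Under a constant force of mortality on [k, k+1): s_p_k = (1 - q_k)^s
    qk <- 0.0082                    # e.g. the first-year death probability from Table 2.2
    s  <- c(0.25, 0.5, 0.75, 1)
    spk <- (1 - qk)^s               # probability of surviving a further s years from age k
    rbind(s = s, surv = round(spk, 5), death = round(1 - spk, 5))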


Other traditional models for the mortality rate between integer times are Uniform (where
the fractional part S is uniform on [0, 1], independent of K), and Balducci (which has force
of mortality decreasing over the time unit, following a convenient pattern). We will not be
considering these in this course.
Once we have made one of these assumptions, we can reconstruct the full distribution of a
lifetime T from the entries (qx )x∈N of a life table. When the force of mortality is small, these
different assumptions are all equivalent to µt = qk for t ∈ [k, k + 1). Notice again that the
choice of a measurement unit for discretisation implies a certain level of smoothing, in continuous
nonparametric life table computations. Taking the evidence at face value, we would have to
say that we have observed zero mortality rate, except at the instants at which deaths were
observed, where the mortality rate jumps to ∞. Of course, we average over a period of time, by imposing the constraint that mortality rates be step functions, constant over a single measurement unit (or over multiple units, if we wish to impose additional smoothing, usually because the number of observations is small).
Moving in the other direction is not so straightforward. The continuous model cannot be
embedded in the discrete model, for obvious reasons: within the framework of the discrete model,
there is no such thing as a death midway through a time period. Traditionally, when the discrete
nature of life-table data has been in the foreground, a model of the fractional part, such as one
of those listed above, has been adjoined to the model. This approach quickly collapses under the
weight of unnecessary complications, which is why we will always treat the continuous lifetime as
the fundamental object, except when the lifetime truly is measured only in discrete units.

2.9 The “continuous” method for life-table estimation


Suppose we observe n identically distributed, independent lives aged x for exactly 1 year, and
record the number dx who die. Using the notation set up for the discrete model, a life dies with
probability qx within the year.
Hence Dx , the random variable representing the numbers dying in the year conditional on n
alive at the beginning of the year, has distribution
Dx ∼ B(n, qx )
giving a maximum likelihood estimator
    q̂x = Dx / n,    with var(q̂x) = qx(1 − qx)/n,
where using previous notation we have set lx = n.
While attractively simple, this approach has significant problems. Ordinarily, failures, deaths, and other events of interest happen continuously, even if we happen to observe or tabulate them
at discrete intervals. While we get a perfectly valid estimate of qx , the probability of an event
happening in this time interval, we have no way of generalising to a question about how many
individuals died in half a year, for example. And real data may be interval censored: That is,
the life is not under observation during the entire year, but only during the interval of ages
(x + a, x + b), where 0 ≤ a < b ≤ 1. If we write Dxi for the indicator of the event that individual
i is observed to die at (curtate) age x, we have
P (Dxi = 1) = bi −ai qx+ai
Hence  
Xn n
X
EDx = E  Dxi  = bi −ai qx+ai
i=1 i=1

There is no way to analyse (or even describe) this intra-interval refinement within the framework
of the binomial model.
Nonetheless, the simplicity and tradition of the binomial model have led actuaries to develop
a kind of continuous prosthetic for the binomial model, in the form of a supplemental (and
hidden) model for the unobserved continuous part of the lifetime. These have been discussed in
section 2.8. In the end, these are applied through the terms Initial Exposed To Risk (Ex0 ) and
Central Exposed To Risk (Exc ). These are defined more by their function than as a particular
quantity: the Initial Exposed to Risk plays the role of n in a binomial model, and Central
Exposed to Risk plays the role of total time at risk in an exponential model. They are linked by
the actuarial estimator
    Ex0 ≈ Exc + ½ dx .
This may be justified from any of our fractional-lifetime models if the number of deaths is small
relative to the number at risk. Thus, the actuarial estimator for qx is
    q̃x = dx / (Exc + ½ dx).

The denominator, Exc + ½ dx, comprises the observed time at risk (also called central exposed to risk) within the interval (x, x + 1), added to half the number of deaths (assuming deaths are spread evenly over the interval). This is an estimator for Ex0, the denominator for the binomial model
estimator.
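Numerically, the actuarial estimator is a one-liner; an illustrative R sketch with made-up values of dx and Exc:

    dx  <- 10                             # observed deaths aged x
    Exc <- 35                             # central exposed to risk (life-years observed)
    qx_actuarial <- dx / (Exc + dx / 2)   # approximates dx / Ex0, with Ex0 ~ Exc + dx/2
    qx_actuarial                          # 0.25 for these illustrative numbers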

2.10 Comparing continuous and discrete methods


There appears to be a contradiction between the discrete life-table estimation of section 2.5 and
the continuous life-table estimation of section 2.6. While the models are different, there are
questions to which both offer an answer, and the answers are different. In the discrete model, we
estimate

    P( T < k + 1 | T ≥ k ) = qk ≈ q̂k = dk / Ek0 .

The continuous model suggests that we estimate the same quantity by

    P( T < k + 1 | T ≥ k ) = 1 − e^{−µk} ≈ 1 − e^{−µ̂k} = 1 − e^{−dk/Ekc} ≤ dk / Ekc .    (2.6)

The direct discrete method treats the curtate lifetimes as the true lifetimes; then Ek0 is the
same as Ekc , so the continuous model gives a strictly smaller answer, unless dk = 0. Why is that?
The difference here is that the continuous model presumes that individuals are dying all through
the year, making Ekc somewhat smaller than Ek0. In fact, the actuarial estimator gives Ekc ≈ Ek0 − dk/2 (essentially presuming that those who died lived on average half a year), and substituting the Taylor series expansion into (2.6) shows that in the continuous model

    P( T < k + 1 | T ≥ k ) = dk/(Ek0 − dk/2) − dk²/(2(Ek0 − dk/2)²) + O( (dk/(Ek0 − dk/2))³ )
                           = dk/Ek0 + O( (dk/(Ek0 − dk/2))³ ).

That is, when the mortality fraction dk /Ek0 is small, the estimates agree up to second order in
dk /Ek0 .
As we already mentioned, one advantage of the continuous model is that it does not tie the estimates to any fixed timespan.
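The agreement of the two estimates when dk/Ek0 is small (and their divergence when it is not) is easy to check numerically; a brief R sketch with arbitrary illustrative numbers:

    E0 <- 1000                                   # initial exposed to risk
    dk <- c(1, 10, 50, 200)                      # a range of death counts
    discrete   <- dk / E0                        # discrete estimate of P(T < k+1 | T >= k)
    continuous <- 1 - exp(-dk / (E0 - dk / 2))   # continuous estimate, with Ekc ~ Ek0 - dk/2
    round(cbind(dk, discrete, continuous, difference = continuous - discrete), 5)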

2.11 Central exposed to risk and the census approximation


2.11.1 The Principle of Correspondence
What would be the complete information that we would need to estimate mortality rates for
a population? Regardless of whether we are considering cohort rates — the rates experienced
by a given population born at about the same time as they progress through different ages —
or period rates — the rates experienced over a period of time by the individuals alive during
that time — we would need exact birth and death dates for everyone in the relevant population.
From these data we would calculate dx (the number of individuals who died aged x) and Exc (the
total number of years lived aged x), from which we may compute

    µ̂x = dx / Exc ,

which is the maximum likelihood estimator under the assumption of a constant force of mortality
on [x, x + 1).
In any given data set we are likely to have only partial information about the population
because:

• Date of birth and date of death may be only approximate (for instance, only calendar year
given, or only curtate age at death);
• Individuals may be “observable” for only part of the time (for instance, because of migration);
• Possible uncertainty about cause of being unobserved, or about identity of individuals;
• We may have access to only a portion of the population (for instance, an insurance company
working from the life records of its customers, or field biologists working with the population
of birds that they have managed to capture and mark).
Because of the limitations of the data, we reinterpret dx and Exc as the number of deaths
observed and number of years of life observed, for which we need to create estimates that
obey the Principle of Correspondence:
An individual alive at time t should be included in the exposure at age x at time t if
and only if, were that individual to die immediately, he or she would be counted in
the death data dx at age x.
The key point is that we can tolerate a substantial amount of uncertainty in the numerator
and the denominator (number of events and total time at risk), but failing to satisfy the Principle
of Correspondence can lead to serious error. For example, [22] analyses the “Hispanic Paradox,”
the observation that Latin American immigrants in the USA seem to have substantially lower
mortality rates than the native population, despite being generally poorer (which is usually
associated with shorter lifespans). This difference is particularly pronounced at more advanced
ages. Part of the explanation seems to be return migration: Some old hispanics return to their
home countries when they become chronically ill or disabled. Thus, there are some members of
this group who count as part of the US hispanic population for most of their lives, but whose
deaths are counted in their home-country statistics.
2.11.2 Census approximation
The task is to approximate Exc (and often also dx ) given census data. There are various forms of
census data. The most common one is
Px,k = Number of individuals in the population aged [x, x + 1) at time k = 0, . . . , n.
The problem is that we do not know when the individuals were actually available to be observed.
If this is a national census, it won’t count people who emigrated before the census, or those who
immigrated the day after the census. If we are doing a census of wildlife, we don’t count the
individuals who were not caught, and again, there will be migration in and out of the observation
area.
The basic assumption of the census approximation is that the number of individuals changes
linearly between any two consecutive census dates. By definition,
    Exc = ∫_0^n Px,t dt    (2.7)

We only know the integrand at integer times, and linear approximation yields

    Exc ≈ Σ_{k=1}^n ½ (Px,k−1 + Px,k).    (2.8)

This allows us to estimate µx if we also know dx , the number of deaths aged x.
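In R, the trapezoidal census approximation (2.8) might be sketched as follows; Px is a hypothetical vector of census counts at times 0, 1, ..., n, and dx a hypothetical death count over the same period.

    Px <- c(820, 805, 790, 770)    # counts of lives aged [x, x+1) at census dates 0, 1, 2, 3
    dx <- 95                       # deaths aged x observed over the whole period
    Exc <- sum((head(Px, -1) + tail(Px, -1)) / 2)   # trapezoidal approximation to (2.7)
    mu_hat <- dx / Exc
    c(Exc = Exc, mu_hat = round(mu_hat, 4))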


Now assume that, instead of dx , you only observe the calendar years of birth and death,
yielding the count

d′x = Number of deaths of individuals whose age on their birthday in the calendar year of death is x.

Then, some of the deaths counted in d′x will be deaths aged x − 1, not x; in fact, we should view d′x as containing deaths aged in the interval (x − 1, x + 1), but not all of them. If we assume that birthdays are uniformly spread over the year, we can also specify that the proportion of deaths counted under d′x changes linearly from 0 up to 1 and back to 0 as the age at death increases from x − 1 to x and then to x + 1.
In order to estimate a force of mortality, we need to identify the corresponding (approximation
to) Central exposed to risk. The Principle of Correspondence requires
    Ex^{c′} = ∫_0^n P′x,t dt,    (2.9)

where
    P′x,t = Number of individuals in the population at time t whose xth birthday falls in calendar year ⌊t⌋.
Again, suppose we know the integrand at integer times. Here the linear approximation requires
some care, since the policy holders do not change age group continuously, but only at census
dates. Therefore, all continuing policy holders counted in P′x,k−1 will be counted in P′x,t for all k − 1 ≤ t < k, but then in P′x+1,k at the next census date. Therefore

    Ex^{c′} ≈ Σ_{k=1}^n ½ (P′x,k−1 + P′x+1,k).    (2.10)

The ratio d′x / Ex^{c′} gives a slightly smoothed (because of the wider age interval) estimate of µx over the age interval (x − 1, x + 1). Note, however, that it is not clear whether this estimate is a maximum likelihood estimate for µx under any suitable model assumptions, such as constancy of the force of mortality between half-integer ages.

2.12 Lexis diagrams


A graphical tool that helps in making sense of estimates like the census approximation is the
Lexis diagram.1 These reduce the three dimensions of demographic data — date, age, and
moment of birth — to two, by an ingenious application of the diagonal.
Consider the diagram in Figure 2.1. The horizontal axis represents calendar time (which we
will take to be in years), while the vertical axis represents age. Lines representing the lifetimes of
individuals start at their birthdate on the horizontal axis, then ascend at a 45◦ angle, reflecting
the fact that individuals age at the rate of one year (of age) per year (of calendar time). Events
during an individual’s life may be represented along the lifeline — for instance, the line might
change colour when the individual buys an insurance policy, or has a child — and the line ends
at death. (Here we have marked the end with a red dot.) The line may begin after birth, or
end before death, if only part of the lifetime is observed, an issue that we will discuss in detail
starting in Chapter 4.
The collection of lifelines in a diagonal strip — individuals born at the same time (more or
less broadly defined) — comprise what demographers call a “cohort”. They start out together
and march out along the diagonal through life, exposed to similar (or at least simultaneous)
1
These diagrams are named for Wilhelm Lexis, a 19th century statistician and demographer of many accomplish-
ments, none of which was the invention of these diagrams, in keeping with Stigler’s law of eponymy, which states that
“No scientific discovery is named after its original discoverer.” (cf. Christophe Vanderschrick, “The Lexis diagram, a
misnomer”, Demographic Research 4:3, pp. 97–124, http://www.demographic-research.org/Volumes/Vol4/3/.)

experiences. (A “cohort” was originally a unit of a Roman legion.) Note that cohorts need not
be birth cohorts, as the horizontal axis of the Lexis diagram need not represent literal birthdates.
For instance, a study of marriage would start “lifelines” at the date of marriage, and would refer
to the “marriage cohort of 2008”, for instance, while a study of student employment prospects
would refer to the “student cohort of 2008”, the collection of all students who completed (or
started) their studies in that year.


Figure 2.1: A Lexis diagram. The red region represents the experience of all individuals (at
whatever time) aged between 3 years and 4 years, 8 months. The green region represents the
experience of all individuals during the period from 15 March 2004 and 31 December 2005. The
blue region represents the experience of the cohort born in 2002, from birth to age 5.

The census approximation involves making estimates for mortality rates in regions of the Lexis
diagram. Vertical lines represent the state of the population, so a census may be represented by
counting (and describing) the lifelines that cross a given vertical line. The goal is to estimate the
hazard rate for a region (in age–time space) by

    (# events) / (total time at risk).

The total time at risk is the total length of lifelines intersecting the region (or, to be geometric about it, the total length divided by √2), while the number of events is a count of the number of dots. The problem is that we do not know the exact total time at risk. Our censuses do tell us, though, the number of individuals at risk.
The count dx described in section 2.11.2 tells us the number of deaths of individuals aged
between x and x + 1 (for integer x), so it is counting events in horizontal strips, such as we have


Figure 2.2: Census at time 3 represented by open circles. The population consists of 7 individuals.
3 are between ages 2 and 3, 2 are between ages 1 and 2, and 2 are between 0 and 1.

shown in Figure 2.3. We are trying to estimate the central exposed to risk Exc := ∫_0^T Px,t dt, where Px,t is the number of individuals alive at time t whose curtate age is x. We can represent this as

    Exc = ∫_0^T Px,t dt    (2.11)

If we assume that Px,t is approximately linear over each interval [k, k + 1], we may approximate the average over [k, k + 1] by ½(Px,k + Px,k+1). Then we get the approximation

    Exc ≈ ½ Px,0 + Σ_{k=1}^{T−1} Px,k + ½ Px,T .

Note that this is just the trapezoid rule for approximating the integral (2.11).
Is this assumption of linearity reasonable? What does it imply? Consider first the individuals
whose lifelines cross a box with lower corner (k, x). (Note that, unfortunately, the order of the
age and time coordinates is reversed in the notation when we go to the geometric picture. This
has no significance except sloppiness which needs to be cleaned up.) They may enter either the
left or the lower border. In the former case (corresponding to individuals born in year x − k)
they will be counted in Px,k ; in the latter (born in x − k + 1) case in Px,k+1 . If the births in year
x − k + 1 differ from those in year x − k by a constant (that is, the difference between January 1 births in the two years is the same as the difference between February 12 births, and so on), then

on average the births in the two years on a given date will contribute 1/2 year to the central
years at risk, and will be counted once in the sum Px,k + Px,k+1 . Important to note:

• This does not actually require that births be evenly distributed through the year.

• When we say births, we mean births that survive to age k. If those born in, say, December
of one year had substantially lowered survival probability relative to a “normal” December,
this would throw the calculation off.

• These assumptions are not about births and deaths in general, but rather about births and
deaths of the population of interest: those who buy insurance, those who join the clinical
trial, etc.

If mortality levels are low, this will suffice, since nearly all lifelines will be counted among
those that cross the box. If mortality rates are high, though, we need to consider the contribution
of years at risk due to those lifelines which end in the box. In this case, we do need to assume that
births and deaths are evenly spread through the year. This assumption implies that conditioned
on a death occurring in a box, it is uniformly distributed through the box. On the one hand, that
implies that it contributes (on average) 1/4 year to the years at risk in the box. On the other
hand, it implies that the probability of it having been counted in our average ½(Px,k + Px,k+1) is ½, since it is counted only if it is in the upper left triangle of the box. On average, then, these
should balance.


Figure 2.3: Census approximation when events are counted by actual curtate age. The vertical
dashed segments represent census counts, carried out on 1 January of years 2, 3, 4, 5, and 6.

What happens when we count births and deaths only by calendar year? Note that P′x,k = Px,k
for integers k and x. One difference is that the regions in question, which are parallelograms,
follow the same lifelines from the beginning of the year to the end. This makes the analysis more
straightforward. Lifelines that pass through the region are counted on both ends. The other
difference is that the region that begins with the census value Px,k ends not with Px,k+1 , but
with Px+1,k+1 . Thus all the lifelines passing through the region will be counted in Px,k and in
Px+1,k+1 , hence also in their average. This requires no further assumptions. For the lifelines that
end in the region to be counted appropriately, on the other hand, requires that the deaths be
evenly distributed throughout the year. (Other, slightly less restrictive assumptions, are also
possible.) In this case, each death will contribute exactly 1/2 to the estimate ½(Px,k + Px+1,k+1) (since it is counted only in Px,k), and it contributes on average 1/2 year of time at risk.


Figure 2.4: Census approximation when events are counted by calendar year of birth and death.
Vertical segments bounding the coloured regions represent census counts. P3,t , for instance, is
the number of red lifelines crossing in the yellow region, at Year t.

2.13 Life Expectancy


2.13.1 What is life expectancy?
One of the most interesting (and most discussed) features of life tables is the life expectancy. It
has an intuitive meaning — the average length of life — and is commonly used as a summary of
the life table, to compare mortality between countries, regions, and subpopulations. For instance,
Table 2.3 shows the estimated life expectancy in some rich and poor countries, ranging from 37.2

years for a man in Angola, to 85.6 years for a woman in Japan. The UK is in between (though,
of course, much closer to Japan), with 76.5 years for men and 81.6 years for women.

Table 2.3: 2009 Life expectancy at birth (LE) in years and infant mortality rate per thousand
live births (IMR) in selected countries, by sex. Data from US Census Bureau. International
Database available at http://www.census.gov/ipc/www/idb/idbprint.html

Country IMR IMR male IMR female LE LE male LE female


Angola 180 192 168 38.2 37.2 39.2
France 3.33 3.66 2.99 81.0 77.8 84.3
India 30.1 34.6 25.2 69.9 67.5 72.6
Japan 2.79 2.99 2.58 82.1 78.8 85.6
Russia 10.6 12.1 8.9 66.0 59.3 73.1
South Africa 44.4 48.7 40.1 49.0 49.8 48.1
United Kingdom 4.85 5.40 4.28 79.0 76.5 81.6
United States 6.26 6.94 5.55 78.1 75.7 80.7

Life expectancies can vary significantly, even within the same country. For example, the UK
Office of National Statistics has published estimates of life expectancy for 432 local areas in
the UK. We see there that, for the period 2005–7, men in Kensington and Chelsea had a life
expectancy of 83.7 years, and women 87.8 years; whereas in Glasgow (the worst-performing area)
the corresponding figures were 70.8 and 77.1 years. Overall, English men live 2.7 years longer on
average than Scottish men, and English women 2.0 years longer.
When we think of lifetimes as random variables, the life expectancy is simply the mathematical
expectation E[T]. By definition,

    E[T] = ∫_0^∞ t fT(t) dt.

Integration by parts, using the fact that fT = −F̄T′, turns this into a much more useful form,

    E[T] = [ −t F̄T(t) ]_0^∞ + ∫_0^∞ F̄T(t) dt = ∫_0^∞ F̄T(t) dt = ∫_0^∞ e^{−∫_0^t µs ds} dt.    (2.12)

That is, the life expectancy may be computed simply by integrating the survival function. The
discrete form of this is

    E[K] = Σ_{k=0}^∞ k P(K = k) = Σ_{k=0}^∞ P(K > k).    (2.13)

Applying this to life tables, we see that the expected curtate lifetime is

    E[K] = Σ_{k=0}^∞ P(K > k) = Σ_{k=1}^∞ lk/l0 = Σ_{k=1}^∞ p0 · · · pk−1 .

Note that expected future lifetimes can be expressed as

    e̊x := E[Tx] = ∫_x^∞ exp{ −∫_x^t µs ds } dt    and    ex := E[Kx] = Σ_{k=1}^∞ px · · · px+k−1 = Σ_{k=1}^∞ lx+k/lx .

(Recall that Tx is the remaining lifetime at age x. It is defined on the event {T ≥ x}, and has
the value T − x. Its distribution is understood to be the conditional distribution of T − x on
{T ≥ x}.) We see that ex ≤ e̊x < ex + 1. For sufficiently smooth lifetime distributions, e̊x ≈ ex
will be a good approximation.
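The discrete formula for remaining life expectancy translates directly into R. The following sketch (illustrative only) computes the curtate ex from an lx column; the lx values used are those of cohort 1 in the worked example of section 2.14.

    ## Curtate remaining life expectancy: e_x = sum_{k >= 1} l_{x+k} / l_x
    curtate_ex <- function(lx) {
      sapply(seq_along(lx), function(i)
        if (i == length(lx)) 0 else sum(lx[(i + 1):length(lx)]) / lx[i])
    }
    lx <- c(300, 200, 180, 90)        # cohort 1 of the worked example in section 2.14
    round(curtate_ex(lx), 2)          # 1.57 1.35 0.50 0.00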
2.13.2 Example
Table 2.4 shows a life table based on the mortality data for tyrannosaurs from Table 1.1. Notice
that the life expectancy at birth e0 = 16.0 years is exactly what we obtain by averaging all the
ages at death in Table 2.4.

age 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
dx 0 0 3 1 1 3 2 1 2 4 4 3 4 3 8
lx 103 103 103 100 99 98 95 93 92 90 86 82 79 75 72
qx 0.00 0.00 0.03 0.01 0.01 0.03 0.02 0.01 0.02 0.04 0.05 0.04 0.05 0.04 0.11
ex 16.0 15.0 14.0 13.5 12.6 11.7 11.1 10.3 9.4 8.7 8.1 7.5 6.7 6.1 5.4
age 15 16 17 18 19 20 21 22 23 24 25 26 27 28
dx 4 4 7 10 6 3 10 8 4 3 0 3 0 2
lx 64 60 56 49 39 33 30 20 12 8 5 5 2 2
qx 0.06 0.07 0.12 0.20 0.15 0.09 0.33 0.40 0.33 0.38 0.00 0.60 0.00 1.00
ex 5.0 4.4 3.7 3.2 3.0 2.6 1.8 1.7 1.8 1.8 1.8 0.8 1.0 0.00

Table 2.4: Life table for tyrannosaurs, based on data from Table 1.1.

2.13.3 Life expectancy and mortality


The connection between life expectancy and mortality is somewhat subtle. It is well known that
life expectancy at birth — e0 — has been rising for well over a century. For males it is 73.4
years on the 1990-2 UK life table, but was only 44.1 years on the life table a century before.
However, it would be a mistake to suppose this means that a typical man was dying at an age
that we now consider active middle-age. This becomes clearer when we look at the remaining
life expectancy at age 44. In 1990 it was 31.6 years; in 1890 it was 22.1 years. Less, to be sure,
but still a substantial number of years remaining. The low average length of life in 1890 was
determined in large part by the number of zeroes being included in the average.
Imagine a population in which everyone dies at exactly age 75. The expectation of life
remaining at age x would then be exactly 75 − x. While that is not, of course, our true situation,
mortality in much of the developed world today is quite close to this extreme: There is almost no
randomness, as witnessed by the fact that the remaining life expectancy column of the lifetable
marches monotonously down by one year per year lived. The only exception is at the beginning
— the newborn has only lost about 0.4 remaining years for the year it has lived. This is because
the mortality in the first year is fairly high, so that overcoming that hurdle gives a significant
boost to one's remaining life expectancy. We can compute

e0 = p0 (1 + e1 ).

This follows either from (2.13), or directly from observing that if someone survives the first year
(which happens with probability p0 ) he will have lived one year, and have (on average) e1 years
remaining. Thus,
    q0 = 1 − e0/(1 + e1) = (1 + e1 − e0)/(1 + e1) = 0.6/74.0 = 0.008,
which is approximately right. On the 1890 life table we see that the life expectancy of a newborn
was 44.1 years, but this rose to 52.3 years for a boy on his first birthday. This can only mean

that a substantial portion of the children died in infancy. We compute the first-year mortality as
q0 = (1 + 52.3 − 44.1)/53.3 = 0.17, so about one in six.
How much would life expectancy have been increased simply by eliminating infant mortality
— that is, mortality in the first year of life? In that case, all newborns would have reached their
first birthday, at which point they would have had 52.3 years remaining on average — thus, 53.3
years in total. Today, with infant mortality almost eliminated, there is only a potential 0.6 years
remaining to be achieved from further reductions.

2.14 An example of life-table computations


Suppose we are studying a population of creatures that live a maximum of 4 years. For simplicity,
we will assume that births all occur on 1 January. (The complications of births going on
throughout the year will be addressed briefly in section 2.11.2.) The entire population is under
observation, and all deaths are recorded. We make the following observations:

Year 1: 300 born, 100 die.

Year 2: 350 born, 150 die. 20 1-year-olds die.

Year 3: 400 born, 100 die. 40 1-year-olds die. 90 2-year-olds die.

Year 4: 300 born, 50 die. 75 1-year-olds die. 100 2-year-olds die. 90 3-year-olds die.

(a) Cohort 1 life table (b) Cohort 2 life table

x dx `x qx ex x dx `x qx ex
0 100 300 0.333 1.57 0 150 350 0.43 1.2
1 20 200 0.10 1.35 1 40 200 0.20 1.1
2 90 180 0.50 0.50 2 100 160 0.625 0.375
3 90 90 1.0 0 3 60 60 1.0 0

(c) Period life table for year 4

x qx `x dx ex
0 0.167 1000 167 1.69
1 0.25 833 208 1.03
2 0.625 625 391 0.375
3 1.0 234 234 0

Table 2.5: Alternative life tables from the same data.

In Table 2.5 we compute different life tables from these data. The two cohort life tables (Tables 2.5(a) and 2.5(b)) are fairly straightforward: We start by writing down ℓ0 (the number
of births in that cohort) and then in the dx column the number of deaths in each year from that
cohort. Subtracting those successively from `0 yields the number of survivors in each age class

`x , and qx = dx /`x . Finally, we compute the remaining life expectancies:

    e0 = ℓ1/ℓ0 + ℓ2/ℓ0 + ℓ3/ℓ0 ,
    e1 = ℓ2/ℓ1 + ℓ3/ℓ1 ,
    e2 = ℓ3/ℓ2 .
The period life table is computed quite differently. We start with the qx numbers, which
come from different cohorts:

q0 comes from cohort 4 newborn deaths;


q1 comes from cohort 3 age 1 deaths;
q2 comes from cohort 2 age 2 deaths;
q3 comes from cohort 1 age 3 deaths.

We then write in the radix `0 = 1000. Of 1000 individuals born, with q0 = 0.167, we expect 167
to die, giving us our d0 . Subtracting that from `0 tells us that `1 = 833 of the 1000 newborns
live to their first birthday. And so it continues. The life expectancies are computed by the same
formula as before, but now the interpretation is somewhat different. The cohort remaining life
expectancies were the same as the actual average number of (whole) years remaining for the
population of individuals from that cohort who reached the given age. The period remaining life
expectancies are fictional, telling us how long individuals would have lived on average if we had a cohort of 1000 that experienced at each age the same mortality rates that were in effect for the
population in year 4.
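The period life table of Table 2.5(c) can be rebuilt in a few lines of R. This is only a sketch; the small discrepancies with the printed table come from rounding at each step.

    qx <- c(0.167, 0.25, 0.625, 1.0)    # year-4 period death probabilities from above
    l0 <- 1000                          # radix
    lx <- l0 * cumprod(c(1, 1 - qx[-length(qx)]))   # survivors at each age: 1000, 833, ...
    dx <- lx * qx                                   # deaths in each age class
    ex <- sapply(seq_along(lx), function(i)
      if (i == length(lx)) 0 else sum(lx[(i + 1):length(lx)]) / lx[i])
    data.frame(x = 0:3, qx = qx, lx = round(lx), dx = round(dx), ex = round(ex, 2))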

2.15 Cohorts and period life tables


You may have noticed a logical fallacy in the arguments of sections 2.5 and 2.6. The life
expectancy at birth should be the average length of life of individuals born in that year. Of
course, we would have to go back to about 1890 to find a birth year whose cohort — the
individuals born in that year — have completed their lives, so that the average lifespan can be
computed as an average.
Consider, for instance, the discrete-time non-homogeneous model. “Time” in the model is
individual age: An individual starts out at age 0, then progresses to age 1 if she survives, and so
on. We estimate the probability of dying aged x by dividing the number of deaths observed age
x by the number of individuals observed to have been at that age.
In our life-tables, called period life tables, these numbers came from a census of the individuals
alive at one particular time, and the count of those who died in the same year, or period of a few
years. No individual experiences those mortality rates. Those born in 2009 will experience the
mortality rates for age 10 in 2019, and the mortality rates for age 80 in 2089. Putting together
those mortality rates would give us a cohort life table. (Actually, this is not precisely true. You
might think about why not. The answer is given in a footnote.2 ) If, as has been the case for the
2
The main difference between a cohort life table and the life table constructed from the corresponding age
classes of successive period life tables is immigration: The cohort life table for 1890 should include, in the row for
(let us say) ages 60–4 the mortality rates of those born in 1890 in the relevant region — England and Wales in
this case — who are still alive at age 60. But these are not identical to the 60 year old men living in England and
Wales in 1950. Some of the original cohort have moved away, and some residing in the country were not born
there.

past 150 years, mortality rates decline in the interval, that means that the survival rates will be
higher than we see in the period table.
We show in Figure 2.5 a picture of how a cohort life table for the 1890 cohort would be
related to the sequence of period life tables from the 1890s through the 2000s. The mortality
rates for ages 0 through 9 (thus 1 q0 , 4 q1 , 5 q5 )3 are on the 1890s period life table, while their
mortality rates for ages 10 through 19 are on the 1900–1909 period life table, and so on. Note
that the mortality rates for the 1890s period life table yield a life expectancy at birth e0 = 44.2
years. That is the average length of life that babies born in those years would have had, if their
mortality in each year of their lives had corresponded to the mortality rates which were realised for the whole population in the year of their birth. Instead, though, those that survived their
early years entered the period of late-life high mortality in the mid- to late 20th century, when
mortality rates were much lower. It may seem surprising, then, that the life expectancy for the
cohort life table only goes up to 44.7 years. Is it true that this cohort only gained 6 months of
life on average, from all the medical and economic progress that took place during their lives?
Yes and no. If we look more carefully at the period and cohort life tables in Table 2.6 we see an interesting story. First of all, a substantial fraction of potential lifespan is lost in the first year, due to the 17% infant mortality, which is obviously the same for the cohort and period life tables. 25% died before age 5. If mortality to age 5 had been reduced to modern levels — close to zero — the period and cohort life expectancies would both be increased by about 14 years. Second,
notice that the difference in life expectancies jumps to over 5 years at age 30. Why is that?
For the 1890 cohort, age 30 was 1920 — after World War I, and after the flu pandemic. The
male mortality rate in this age class was around 0.005 in 1900–9, and less than 0.004 in 1920–9.
Averaged over the intervening decade, though, male mortality was close to 0.02. (Most of the
effect is due to the war, as we see from the fact that it almost exclusively is seen in the male
mortality; female mortality in the same period shows a slight tick upward, but it is on the order
of 0.001.) One way of measuring the horrible cost of that war is to see that for the generation of men born in the 1890s, which was most directly affected, the advances of the 20th century procured them on average about 4 years of additional life, relative to what might have been expected from the mortality rates in the year of their birth. Of these 4 years, 3½ were lost in the
war. Another way of putting this is to see that the approximately 4.5 million boys born in the
UK between 1885 and 1895 lost cumulatively about 16 million years of potential life in the war.
There are, in a sense, three basic kinds of life tables:

1. Cohort life table describing a real population. These make most sense in a biological
context, where there is a small and short-lived population. The `x numbers are actual
counts of individuals alive at each time, and the rest of the table is simply calculated from
these, giving an alternative descriptions of survival and mortality.

2. Period life tables, which describe a notional cohort (usually starting with radix `0 being
a nice round number) that passes through its lifetime with mortality rates given by the
qx . These qx are estimated from data such as those of Table 2.1, giving the number of
individuals alive in the age class during the period (or number of years lived in the age
class) and the number of deaths.

3. Synthetic cohort life tables. These take the qx numbers from a real cohort, but express
them in terms of survival `x starting from a rounded radix.

3
Actually, we have given µx for the intervals [0, 1), [1, 5), and [5, 10). We compute 1q0 = 1 − e^{−µ0}, 4q1 = 1 − e^{−4µ1}, 5q5 = 1 − e^{−5µ5}.


Figure 2.5: Decade period life tables, with the pieces joined that would make up a cohort life
table for individuals born in 1890.

(a) Period life table for men in England and Wales, 1890–9 (left columns); (b) Cohort life table for the 1890 cohort of men in England and Wales (right columns)
x µx `x dx ex x µx `x dx ex
0 0.187 100000 17022 44.2 0 0.187 100000 17022 44.7
1 0.025 82978 7923 51.7 1 0.025 82978 7923 52.3
5 0.004 75055 1655 52.8 5 0.004 75055 1655 53.5
10 0.002 73400 908 48.9 10 0.002 73400 774 49.6
15 0.004 72492 1379 44.5 15 0.003 72626 1167 45.1
20 0.005 71113 1766 40.3 20 0.020 71459 6749 40.8
25 0.006 69347 2087 36.2 25 0.017 64710 5219 39.7
30 0.008 67260 2550 32.2 30 0.004 59491 1257 37.9
35 0.010 64710 3229 28.4 35 0.006 58234 1608 33.6
40 0.013 61481 3970 24.7 40 0.006 56626 1671 29.5
45 0.017 57511 4703 21.2 45 0.009 54955 2384 25.3
50 0.022 52808 5515 17.8 50 0.012 52571 3027 21.3
55 0.030 47293 6508 14.6 55 0.019 49544 4388 17.5
60 0.042 40785 7710 11.6 60 0.028 45156 5956 14.0
65 0.061 33075 8636 8.9 65 0.044 39200 7760 10.8
70 0.086 24439 8511 6.5 70 0.067 31440 8985 8.0
75 0.122 15928 7281 4.5 75 0.102 22455 8940 5.7
80 0.193 8647 5346 2.7 80 0.146 13515 6997 3.8
85 0.262 3301 2410 1.7 85 0.215 6518 4294 2.3
90 0.358 891 742 0.9 90 0.288 2224 1697 1.4
95 0.477 149 135 0.5 95 0.395 527 454 0.8
100 0.590 14 13 0.3 100 0.516 73 67 0.4
105 0.695 1 1 0.2 105 0.645 6 6 0.2
110 0.772 0 0 0.0 110 0.733 0 0 0.0

Table 2.6: Period and cohort tables for England and Wales. The period table is taken directly
from the Human Mortality Database. The cohort table is taken from the period tables of the
HMD, not copied from their cohort tables.

2.16 Comparing Life Tables


2.16.1 Defining the Poisson model
Under the assumption of a constant hazard rate (force of mortality) µx over the year (x, x + 1],
we may view the estimation problem as a chain of separate hazard rate estimation problems, one
for each year of life. Each individual lives some portion of a year in the age interval (x, x + 1],
the portion being 0 (if he dies before birthday x), 1 (if he dies after birthday x + 1), or between
0 and 1 if he dies between the two birthdays. Suppose now we lay these intervals end to end,
with a mark at the end of an interval where an individual died. It is not hard to see that what
results is a Poisson process on the interval [0, Exc ], where Exc is the total observed years at risk.
Suppose we treat Exc as though it were a constant. Then, if Dx represents the number dying in the year, the model uses

    P( Dx = k ) = (µx Exc)^k e^{−µx Exc} / k! ,    k = 0, 1, 2, · · · .

The estimator for the constant force of mortality over the year is
    µ̃x = Dx / Exc ,    with observed value dx / Exc .

Under the Poisson model we therefore have that

    var µ̃x = µx Exc / (Exc)² = µx / Exc .

So the estimate will be

    var µ̃x ≈ dx / (Exc)² .
Intuitive justification for the Poisson model
Are we justified in treating Exc as though it were fixed? Certainly it’s not exactly the same: The
numerator and denominator are both random, and they are not even independent. One way
of looking at this is to ask, how different would our estimate have been in a given realisation,
had we fixed the total time under observation in advance. If we observe m lives from the start
of year x, we see that Dx is approximately normal with mean mqx and variance mqx (1 − qx ),
while Exc is normal with mean m − mqx (1 − e∗ ), where e∗ is the expected remaining length of
a life starting from age x, conditioned on its being less than 1; and variance mσ 2 , where σ 2 is
the variance in time under observation of a single life. (If µx is not very large, e∗ is close to 21 .)
Looking at the first-order Taylor series expansion, we see that the ratio Dx /Exc varies only by a
normal error term times m−1/2 , plus a bias of order m−1 . For large m, then, the estimate on the
basis of fixed m (number of individuals) is almost the same as the estimate we would have made
from observing the Poisson model for the fixed total time at risk m − mqx (1 − e∗ ).
2.16.2 Testing hypotheses for qx and µx
We note the following normal approximations:
Binomial model:

    Dx ∼ B(Ex, qx)    =⇒    Dx ∼ N( Ex qx, Ex qx(1 − qx) )

and

    q̂x = Dx/Ex ∼ N( qx, qx(1 − qx)/Ex ).

Poisson model:

    Dx ∼ N( Exc µx, Exc µx )

and

    µ̂x ∼ N( µx, µx/Exc ).
Tests are often done using comparisons with a published standard life table. These can
be from national tables for England and Wales published every 10 years, or insurance company
data collected by the Continuous Mortality Investigation Bureau, or from other sources.
A superscript "s" denotes "from a standard table", such as qxs and µsx .
Test statistics are generally obtained from the following:

Binomial:

    zx = (dx − Ex qxs) / √( Ex qxs (1 − qxs) )    ≈    (O − E)/√V .

Poisson:

    zx = (dx − Exc µxs) / √( Exc µxs )    ≈    (O − E)/√V .

Both of these are denoted as zx since, under the null hypothesis that the standard table is correct, Zx has approximately a standard normal distribution.
2.16.3 The tests
χ2 test
We take

    X = Σ_{all ages x} zx² .

This gives the sum of squares of standard normal random variables under the null hypothesis, and so is a sum of χ²(1) variables. Therefore

    X ∼ χ²(m),    if m = # ages (years of age) included in the study.

H0 : there is no difference between the standard table and the data,


HA : they are not the same.
It is common to use a 5% significance level, and so the null hypothesis is rejected if X > χ²(m)_{0.95}, the 95% point of the χ²(m) distribution. The test picks up large deviations in either direction from the standard table.
Disadvantages:

1. There may be a few large deviations offset by substantial agreement over part of the table.
The test will not pick this up.

2. There might be bias, that is, although not necessarily large, all the deviations may be of
the same sign.

3. There could be significant groups of consecutive deviations of the same sign, even if there is no overall bias.

Signs test
Test statistic X is given by

    X = #{ zx > 0 }.

Under the null hypothesis X ∼ Binom(m, ½), since the probability of a positive sign should be
1/2. This should be administered as a two-tailed test. It is under-powered since it ignores the
size of the deviations but it will pick up small deviations of consistent sign, positive or negative,
and so it addresses point 2 above.

Cumulative deviations test


This again addresses point 2 and essentially looks very similar to the logrank test between two
survival curves, which we will consider later in the course. If, instead of squaring dx − Ex qxs or dx − Exc µxs, we simply sum, then

    Σ(dx − Ex qxs) / √( Σ Ex qxs (1 − qxs) ) ∼ N(0, 1), approximately,

and

    Σ(dx − Exc µxs) / √( Σ Exc µxs ) ∼ N(0, 1), approximately.

H0: there is no bias,
HA: there is a bias.
This test addresses point 2 again, which is that the chi-square test does not test for consistent
bias.
Other tests
There are tests to deal with consecutive bias/runs of same sign. These are called the groups of
signs test and the serial correlations test. Again, a very large number of years m is required to
render these tests useful.
2.16.4 An example
Table 2.7 presents imaginary data for men aged 90 to 95. The column ℓx lists the initial at
risk, the number of men in the population on the census date, and dx is the number of deaths
from this initial population over the course of the year. Exc is the central at risk, estimated as
ℓx − dx/2. Standard male British mortality for these ages is listed in column µx^s. (The column
µ̊x is a graduated estimate, which will be discussed in section 2.17.)

age   ℓx    dx    Exc    µ̂_{x+½}   µx^s    zx      µ̊x
90    40    10    35     0.29      0.202   1.1     0.25
91    35    8     31     0.258     0.215   0.52    0.28
92    22    4     20     0.20      0.236   −0.33   0.335
93    14    6     11     0.545     0.261   1.85    0.40
94    11    4     9      0.444     0.279   0.94    0.45
95    7     3     5.5    0.545     0.291   1.11    0.48

Table 2.7: Table of mortality rates for an imaginary old-people’s home, with standard British
male mortality given as µsx , and graduated estimate µ̊x .

We note substantial differences between the estimates µ̂x and the standard mortality µsx , but
none of them is extremely large relative to the standard error: The largest zx is 1.85. We test the
two-sided alternative hypothesis, that the mortality rates in the old-people’s home are different
from the standard mortality rates, with a χ2 test, adding up the zx². The observed X² is 7.1,
corresponding to an observed significance level p = 0.31. (Remember that we have 6 degrees of
freedom, not 5, because these zx are independent. This is not an incidence table.)
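This calculation is straightforward to reproduce in R. The sketch below types the Table 2.7 values
in by hand (the variable names are of course my own choices) and computes the Poisson-model
deviations zx and the χ² statistic.

## Data transcribed from Table 2.7
dx   <- c(10, 8, 4, 6, 4, 3)                         # observed deaths at ages 90-95
Exc  <- c(35, 31, 20, 11, 9, 5.5)                    # central exposed to risk
mu.s <- c(0.202, 0.215, 0.236, 0.261, 0.279, 0.291)  # standard mortality rates

zx <- (dx - Exc * mu.s) / sqrt(Exc * mu.s)           # Poisson-model deviations
X  <- sum(zx^2)                                      # chi-squared statistic on 6 d.f.
p  <- pchisq(X, df = length(zx), lower.tail = FALSE)
round(zx, 2); X; p    # should reproduce zx, X ≈ 7.1 and p ≈ 0.31 from the text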
CHAPTER 2. LIFETIME DISTRIBUTIONS AND LIFE TABLES 39

2.17 Graduation
Graduation is a term common in lifetime analysis — particularly in actuarial contexts — for
what is generally called smoothing in statistics. Suppose that a company has collected its own
data, producing estimates for either qx or µx . The estimates may be rather irregular from year to
year and this could be an artefact of the population the company happens to have in a particular
scheme. The underlying model should probably (but not necessarily) be smoother than the raw
estimates. If it is to be considered for future predictions, then smoothing should be considered.
This is called graduation.
There is always a tradeoff in smoothing procedures. Without smoothing, real patterns get
lost in the random noise. Too much smoothing, though, can swamp the data in the model, so
that the final estimate reflects more our choice of model than any truth gleaned from the data.
2.17.1 Parametric models
We may fit a formula to the data. Possible examples are

    µx = µ                (Exponential);
    µx = B e^{θx}         (Gompertz);
    µx = A + B e^{θx}     (Makeham).

The Gompertz model can fit well for populations in the middle to older age groups. The Makeham
model has an extra additive constant, which is sometimes used to model a background component
of mortality that is independent of age. We could use more complicated formulae, for instance
putting in polynomials in x.
2.17.2 Reference to a standard table
Here q′x, µ′x represent the graduated estimates. We could have a linear dependence
    q′x = a + b qx^s,    µ′x = a + b µx^s,
or possibly a translation of years
    q′x = q^s_{x+k},    µ′x = µ^s_{x+k}.

In general there will be some assigned functional dependence of the graduated estimate on
the standard table value. These are connected with the notions of accelerated lifetimes and
proportional hazards, which will be central topics in the second part of the course.
2.17.3 Nonparametric smoothing
We effectively smooth our data when we impose the assumption that mortality rates are constant
over a year. We may tune the strength of smoothing by requiring rates to be constant over longer
intervals. This is a form of local averaging, and there are more and less sophisticated versions of
this. In Matlab or R the methods available include kernel smoothing, orthogonal polynomials,
cubic splines, and LOESS. These are beyond the scope of this course.
In Figure 2.6 we show a very simple example. The mortality rates are estimated by individual
years or by lumping the data in five year intervals. The green line shows a moving average of the
one-year estimates, in a window of width five years.
[Figure: yearly estimates, estimates from 5-year groupings, and a 5-year moving average of the
estimated mortality probability, plotted against age (years).]
Figure 2.6: Different smoothings for A. sarcophagus mortality from Table 1.1.
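The five-year moving average in Figure 2.6 is the simplest kind of local averaging. A minimal
sketch in R, assuming a vector qx.hat of one-year mortality estimates (from Table 1.1, not
reproduced here), might look like this:

## Hypothetical sketch: qx.hat is assumed to hold the one-year mortality estimates
## for ages 0, 1, 2, ...; it is not defined in these notes.
moving.average <- function(q, width = 5) {
  half <- (width - 1) %/% 2
  n <- length(q)
  ## centred moving average, with the window shrinking near the ends
  sapply(seq_len(n), function(i) mean(q[max(1, i - half):min(n, i + half)]))
}
q.smooth <- moving.average(qx.hat)
## plot(seq_along(qx.hat) - 1, qx.hat); lines(seq_along(qx.hat) - 1, q.smooth, col = "green")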

[Figure: tyrannosaur mortality rates on a logarithmic scale, showing the estimates from the data
together with constant-hazard and Gompertz fits, against age (yrs).]
Figure 2.7: Estimated tyrannosaurus mortality rates from Table 2.4, together with exponential
and Gompertz fits.

2.17.4 Methods of fitting


1. Apply the binomial model, and set qx = a + bqxs in the likelihood; find maximum likelihood
estimators for the unknown parameters a, b. Similarly, do the same for µx in the Poisson
model. Other parametric functions of the standard life table than linear may also be used.

2. Use weighted least squares and minimise
       Σ (over all ages x) wx (q̂x − q′x)²   or   Σ (over all ages x) wx (µ̂x − µ′x)²,
   as appropriate. Suitable choices for the weights wx are Ex or Exc respectively. Alternatively
   we can use 1/var, where the variance is that estimated for q̂x or µ̂x, respectively.

The hypothesis tests we have already covered above can be used to test the graduation fit
to the data, replacing qxs , µsx by the graduated estimates. Note that in the χ2 test we must
reduce the degrees of freedom of the χ2 distribution by the number of parameters
estimated in the model for the graduation. For example if qx0 = a + bqxs , then we reduce
the degrees of freedom by 2 as the parameters a, b are estimated.
2.17.5 Examples
Standard life table
We graduate the estimates in Table 2.7, based on the standard mortality rates listed in the
column µx^s, using the parametric model µ̊x = a + b µx^s. The log likelihood is
    ℓ = Σx ( dx log µ̊_{x+½} − µ̊_{x+½} Exc ).

We maximise by solving the equations
    0 = ∂ℓ/∂a = Σx ( dx / (â + b̂ µ^s_{x+½}) − Exc ),
    0 = ∂ℓ/∂b = Σx ( dx µ^s_{x+½} / (â + b̂ µ^s_{x+½}) − µ^s_{x+½} Exc ).

We can solve these equations numerically, to obtain â = −0.279 and b̂ = 2.6. This yields the
graduated estimates µ̊ tabulated in the final column of Table 2.7. Note that these estimates have
the virtue of being, on the one hand, closer to the observed data than the standard mortality
rates; on the other hand smoothly and monotonically increasing.
If we had used ordinary least squares to fit the mortality rates, we would have obtained
very different estimates: ã = −0.472 and b̃ = 3.44, because we would be trying to minimise the
errors in all age classes equally, regardless of the number of observations. Weighted least squares,
with weights proportional to Exc (inverse variance), solves this problem, more or less, and gives us
estimates â∗ = −0.313 and b̂∗ = 2.75, very close to the MLE.
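Both fits are easy to reproduce in R. The sketch below reuses the vectors dx, Exc and mu.s typed
in earlier for Table 2.7; optim maximises the likelihood and lm with weights gives the weighted
least squares fit. The results should land close to the values quoted above.

## Maximum likelihood fit of mu = a + b * mu.s (minimise the negative log-likelihood)
negll <- function(par) {
  mu <- par[1] + par[2] * mu.s
  if (any(mu <= 0)) return(Inf)          # keep the fitted hazard positive
  -sum(dx * log(mu) - mu * Exc)
}
mle <- optim(c(0, 1), negll)$par          # should be close to (-0.279, 2.6)

## Weighted least squares with weights Exc (approximately inverse-variance)
mu.hat <- dx / Exc
wls <- lm(mu.hat ~ mu.s, weights = Exc)   # coefficients close to (-0.313, 2.75)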

In Figure 2.7 we plot the mortality rate estimates for the complete population of tyrannosaurs
described in Table 1.1, on a logarithmic scale, together with two parametric model fits: the
exponential model, with one parameter µ estimated by
    µ̂ = 1/t̄ = n/(t1 + · · · + tn) ≈ n/(k1 + · · · + kn + n/2) = 0.058,
where t1 , . . . , tn are the n lifetimes observed, and ki = bti c the curtate lifetimes; and the Gompertz
model µs = B e^{θs}, estimated by
    θ̂ solves  Q′(θ̂)/(Q(θ̂) − 1) − 1/θ̂ = t̄,
    B̂ := θ̂/(Q(θ̂) − 1),
    where Q(θ) := (1/n) Σi e^{θ ti}.
This yields θ̂ = 0.17 and B̂ = 0.0070. It seems apparent to the eye that the exponential fit is
quite poor, while the Gompertz fit might be pretty good. It is hard to judge the fit by eye,
though, since the quality of the fit depends in part on the number of individuals at risk that go
into the individual mortality-rate estimates, something which does not appear in the plot.
To test the hypothesis, we compute the predicted number of deaths in each age class:
dx^(exp) = lx · qx^(exp) if there is a constant µx = µ̂ = 0.058, meaning that qx^(exp) = 0.057, and
dx^(Gom) = lx · qx^(Gom) if
    qx^(Gom) := 1 − exp{ −(B̂/θ̂) e^{θ̂x} (e^{θ̂} − 1) },
which is obtained by integrating the Gompertz hazard.
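The following sketch shows how the two fits could be computed in R, assuming a vector t.obs
holding the raw individual lifetimes behind Table 1.1 (which is not listed in these notes); uniroot
solves the Gompertz estimating equation.

## Hypothetical sketch: t.obs is assumed to contain the n observed lifetimes.
mu.hat <- 1 / mean(t.obs)                        # exponential fit

Q  <- function(theta) mean(exp(theta * t.obs))
dQ <- function(theta) mean(t.obs * exp(theta * t.obs))
score <- function(theta) dQ(theta) / (Q(theta) - 1) - 1 / theta - mean(t.obs)
theta.hat <- uniroot(score, c(0.01, 1))$root     # adjust the bracket if uniroot complains
B.hat <- theta.hat / (Q(theta.hat) - 1)          # text reports roughly 0.17 and 0.0070

## One-year death probabilities under the fitted Gompertz hazard
qx.Gom <- function(x)
  1 - exp(-(B.hat / theta.hat) * exp(theta.hat * x) * (exp(theta.hat) - 1))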


(exp)
It matters little how we choose to interpret the deviations in the column zx — with values
going up as high as 8.13, it is clear that these could not have come from a normal distribution,
and we must reject the null hypothesis that these lifetimes came from an exponential P distribution.
As for the Gompertz model, the deviations are all quite moderate. We compute zx2 = 28.7.
There are 29 categories, but we have estimated 2 parameters, so this needs to be compared to
the χ2 distribution with 27 degrees of freedom. (You have learned this rule of thumb of reducing
the degrees of freedom by 1 for every parameter estimated in the multinomial setting. What we
are doing here is slightly different, but the principle is the same.) The cutoff for a test at the
0.05 level is 40.1, so we do not reject the null hypothesis.
There are several problems with this χ2 test. As you have already learned, the χ2 approxi-
mation doesn't work very well when the expected numbers in some categories are too low. This
is certainly the case here, with dx^(Gom) as low as 0.78. (That is, we are using a normal
approximation with mean 0.78 and SD 0.88 for a quantity that takes on nonnegative integer
values. That obviously cannot be right.) A better version would lump categories together. If
we replace the first 10 years by a single category, it will have an expected number of deaths
equal to 0.168 · 103 = 17.3, as compared with exactly 17 observed deaths, producing a z value of
0.08. Similarly, we cut off the final three years (since collectively they correspond to the certain
event that all remaining individuals die), leaving us with Σ zx² = 11 on 15 degrees of freedom.
Again, this is a perfectly ordinary value of the χ² variable, and we do not reject the hypothesis
of Gompertz mortality.

age   lx    dx    qx^(exp)   dx^(exp)   zx^(exp)   qx^(Gom)   dx^(Gom)   zx^(Gom)
0 103 0 0.007 5.87 -2.50 0.008 0.78 -0.89
1 103 0 0.008 5.87 -2.50 0.009 0.93 -0.97
2 103 3 0.010 5.87 -1.23 0.011 1.10 1.84
3 100 1 0.012 5.70 -2.03 0.013 1.26 -0.24
4 99 1 0.014 5.64 -2.02 0.015 1.48 -0.40
5 98 3 0.016 5.59 -1.14 0.018 1.73 0.98
6 95 2 0.019 5.42 -1.52 0.021 1.99 0.01
7 93 1 0.023 5.30 -1.93 0.025 2.30 -0.87
8 92 2 0.027 5.24 -1.47 0.029 2.69 -0.43
9 90 4 0.032 5.13 -0.52 0.035 3.12 0.52
10 86 4 0.038 4.90 -0.42 0.041 3.52 0.27
11 82 3 0.044 4.67 -0.80 0.048 3.96 -0.50
12 79 4 0.052 4.50 -0.25 0.057 4.50 -0.25
13 75 3 0.062 4.28 -0.64 0.067 5.04 -0.95
14 72 8 0.073 4.10 2.04 0.079 5.70 1.03
15 64 4 0.086 3.65 0.19 0.093 5.96 -0.86
16 60 4 0.101 3.42 0.33 0.109 6.56 -1.08
17 56 7 0.118 3.19 2.27 0.128 7.18 -0.08
18 49 10 0.139 2.79 4.69 0.150 7.36 1.11
19 39 6 0.162 2.22 2.72 0.175 6.84 -0.37
20 33 3 0.189 1.88 0.86 0.204 6.74 -1.65
21 30 10 0.220 1.71 7.15 0.237 7.12 1.35
22 20 8 0.255 1.14 7.40 0.275 5.49 1.40
23 12 4 0.295 0.68 4.52 0.317 3.80 0.14
24 8 3 0.339 0.46 4.30 0.363 2.91 0.08
25 5 0 0.388 0.29 -0.55 0.414 2.07 -1.88
26 5 3 0.441 0.29 6.26 0.470 2.35 0.70
27 2 0 0.498 0.11 -0.35 0.528 1.06 -1.50
28 2 2 0.558 0.11 8.13 0.590 1.18 1.67

Table 2.8: Life table for tyrannosaurs, with fit to exponential and Gompertz models, and based
on data from Table 1.1.
Chapter 3
Multiple-decrements model (Optional Topic)

3.1 Introduction to the multiple-decrements model


The simplest (and most immediately fruitful) way to generalise the single-decrements model is to
allow transitions to multiple absorbing states. Of course, as demographer Kenneth Wachter has
put it, it may seem peculiar to introduce multiple “dead” states into our models since there is
only one way of being dead; but (as he continues), there are many ways of getting there. Further,
there are many other settings which can be modelled by a single nonabsorbing state transitioning
into one of several possible absorbing states. Some examples are

• A working population insured for disability might transition into multiple different possible
causes of disability, which may be associated with different costs.

• Workers may leave a company through retirement, resignation, or death.

• A model of unmarried cohabitations, which may end either by separation or marriage.

• Unemployed individuals may leave that state either by finding a job, or by giving up looking
for work and so becoming “long-term unemployed”.

An important common element is that calling the states “absorbing” does not have to mean that
it is a deathlike state, from which nothing more happens. Rather, it simply means that our
model does not follow any further developments.
3.1.1 An introductory example
This example is taken from section 8.2 of [29].
According to United Nations statistics, the probability of dying for men in Zimbabwe in
2000 was 5 q30 = 0.1134, with AIDS accounting for approximately 4/5 of the deaths in this age
group. Suppose we wish to answer the question: what would be the effect on mortality rates of a
complete cure for AIDS?
One might immediately be inclined to think that the mortality rate would be reduced to 1/5
of its current rate, so that the probability of dying of some other cause in the absence of
AIDS, which we might write as 5q30^{OTHER∗}, would be 0.02268. On further reflection, though,
this seems too low: it is the proportion of people aged 30 who currently die of causes
other than AIDS. If AIDS were eliminated, surely some of the people who now die of AIDS
would instead die of something else.


Of course, this is not yet a well-defined mathematical problem. To make it such, we need to
impose extra conditions. In particular, we impose the competing risks assumption: Individual
causes of death are assumed to act independently. You might imagine an individual drawing lots
from multiple urns, labelled “AIDS”, “Stroke”, “Plane crash”, to determine whether he will die of
this cause in the next year. The fraction of black lots among the white is precisely qx , when the
individual has age x. If he gets no black lot, he survives the year. If he draws two or more, we
only get to see the one drawn first, since he can only die once. The probability of surviving is
then the product of the survival probabilities:
    t qx = 1 − (1 − t qx^{CAUSE1})(1 − t qx^{CAUSE2}) · · ·    (3.1)

What is the fraction of deaths due to a given cause? Assuming constant mortality rate over the
time interval due to each cause, we have
    1 − t qx^{CAUSE1} = e^{−t λx^{CAUSE1}}.
Given a death, the probability of it being due to a given cause is proportional to the associated
hazard rate. Consequently,
    λx^{CAUSE1} = (fraction of deaths due to CAUSE 1) × λx,
which implies that
    t qx^{CAUSE1} = 1 − (1 − t qx)^{fraction of deaths due to CAUSE 1}.

(Note that this is the same formula that we use for changing lengths of time intervals: t qx =
1 − (1 − 1 qx )t .) This tells us the probability of dying from cause 1 in the absence of any other
cause. The probability of dying of any cause at all is then given by (3.1).
Applying this to our Zimbabwe AIDS example, treating the causes as being either AIDS or
OTHER, we see that the probability of dying of AIDS in the absence of any other cause is
    5q30^{AIDS∗} = 1 − (1 − 5q30)^{4/5} = 1 − 0.8866^{4/5} = 0.0918,
while the probability of dying of any other cause, in the absence of AIDS, is
    5q30^{OTHER∗} = 1 − (1 − 5q30)^{1/5} = 1 − 0.8866^{1/5} = 0.0238.
Appropriately, we recover the total probability of death: 0.1134 ≈ 1 − (1 − 0.0918)(1 − 0.0238).
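Under the competing-risks assumption the whole calculation is a few lines of R:

q      <- 0.1134            # 5q30 for men in Zimbabwe, 2000
f.aids <- 4/5               # fraction of deaths attributed to AIDS
q.aids  <- 1 - (1 - q)^f.aids        # about 0.0918
q.other <- 1 - (1 - q)^(1 - f.aids)  # about 0.0238
1 - (1 - q.aids) * (1 - q.other)     # recovers about 0.1134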


Is the competing risks assumption reasonable? Another way of putting this is to ask, what
circumstances would cause the assumption to be violated? The answer is: Competing risks
is violated when a subpopulation is at higher than average risk for multiple causes of death
simultaneously; or conversely, when those at higher than average risk for one cause of death are
protected from another cause of death. For example, smokers have more than 10 times the risk of
dying from lung cancer that nonsmokers have; but they also have substantially higher mortality
from other cancers, heart disease, stroke, and so on. If a perfect cure for lung cancer were to
be found, it would not save nearly as many lives as one might suppose, from a competing-risks
calculation like the one above, because the lives that would be saved would be almost all those
of smokers, and they would be more likely to die of something else than an equivalent number of
saved lives from the general population.
CHAPTER 3. MULTIPLE-DECREMENTS MODEL (OPTIONAL TOPIC) 46

3.1.2 The mathematical model (non-examinable)


For those familiar with the theory of continuous-time Markov processes, we are considering here
a (1 + m)-state Markov model, with state space S = {0, . . . , m}, that has only one transition,
from 0 to some j, 1 ≤ j ≤ m, with absorption in j. We can write down an – in general
time-dependent – Q-matrix

    Q(t) = [ −λ+(t)   λ1(t)   · · ·   λm(t) ]
           [    0       0     · · ·     0   ]
           [    ⋮       ⋮              ⋮   ]
           [    0       0     · · ·     0   ],   where λ+(t) = λ1(t) + . . . + λm(t).    (3.2)

Such models occur naturally where insurance policies provide different benefits for different
causes of death, or distinguish death and disability, possibly in various different strengths or
forms. This is also clearly a building block (one transition only) for general Markov models,
where states j = 1, . . . , m may not all be absorbing.
Such a model depends upon the assumption that different causes of death act independently
— that is, the probability of dying is the product of what might be understood as the probability
of dying from each individual cause acting alone.
3.1.3 Multiple decrements – time-homogeneous rates
In the time-homogeneous case, we can think of the multiple decrement model as m exponential
clocks Cj with parameters λj , 1 ≤ j ≤ m, and when the first clock goes off, say, clock j, the
only transition takes place, and leads to state j. Alternatively, we can describe the model as
consisting of one L ∼ Exp(λ+ ) holding time in state 0, after which the new state j is chosen
independently with probability λj /λ+ , 1 ≤ j ≤ m. The likelihood for a sample of size 1 consists
of two ingredients, the density λ+ e−tλ+ of the exponential time, and the probability λj /λ+ of
the transition observed. This gives λj e−tλ+ , or, for a sample of size n of lifetimes ti and states
ji , 1 ≤ i ≤ n,
    ∏_{i=1}^n λ_{ji} e^{−ti λ+} = ∏_{j=1}^m λj^{nj} e^{−λj(t1 + ... + tn)},    (3.3)
where nj is the number of transitions to j. Again, this can be solved factor by factor to give
    λ̂j = nj / (t1 + . . . + tn),   1 ≤ j ≤ m.    (3.4)

In particular, we find again λ̂+ = n/(t1 + . . . + tn ), since n1 + . . . + nm = n.


In the competing-clocks description, we can interpret the likelihood as consisting of m
ingredients, namely the density λj e−λj t of clock j to go off at time t, and probabilities e−λk t of
clocks Ck, k ≠ j, to go off after time t.

3.2 Estimation for general multiple decrements


We can deduce from either description in the previous section that the likelihood for a sample of
n independent lifetimes t1 , . . . , tn and respective new states j1 , . . . , jn , each (ti , ji ) sampled from
(L, J), is given by
    ∏_{i=1}^n λ_{ji}(ti) exp{ −∫_0^{ti} λ+(t) dt }.    (3.5)

Let us assume that the forces of decrement λj (t) = λj (x) are constant on x ≤ t < x + 1, for all
x ∈ N and 1 ≤ j ≤ m. Then the likelihood can be given as
    ∏_{x∈N} ∏_{j=1}^m λj(x + ½)^{dj,x} exp{ −ℓ̃x λj(x + ½) },    (3.6)

where dj,x is the number of decrements to state j between ages x and x + 1, and `˜x is the total
time spent alive between ages x and x + 1.
Now the parameters are λj (x), x ∈ N, 1 ≤ j ≤ m, and they are again well separated to
deduce
    λ̂j(x + ½) = dj,x / ℓ̃x,   1 ≤ j ≤ m,   0 ≤ x ≤ max{L1, . . . , Ln}.    (3.7)
Similarly, we can try to adapt the method to get maximum likelihood estimators from the curtate
lifetimes. We can write down the likelihood as
    ∏_{i=1}^n p_{(J,K)}(ji, [ti]) = ∏_{x∈N} (1 − qx)^{ℓx − dx} ∏_{j=1}^m qj,x^{dj,x},    (3.8)
but 1 − qx = 1 − q1,x − . . . − qm,x does not factorise, so we have to maximise simultaneously for
all 1 ≤ j ≤ m expressions of the form
    (1 − q1 − . . . − qm)^{ℓ − d1 − ... − dm} ∏_{j=1}^m qj^{dj}.    (3.9)

(We suppress the indices x.) A zero derivative with respect to qj amounts to

(` − d1 − . . . − dm )qj = dj (1 − q1 − . . . − qm ), 1 ≤ j ≤ m, (3.10)

and summing over j gives


    (ℓ − d)q = d(1 − q)  ⟹  q = d/ℓ,    (3.11)
and then
    (ℓ − d)qj = dj(1 − q)  ⟹  qj = dj(1 − q)/(ℓ − d) = dj/ℓ,    (3.12)
so that, if we display the suppressed indices x again,
    q̂^{(0)}_{j,x} = q̂^{(0)}_{j,x}(t1, j1, . . . , tn, jn) = dj,x / ℓx.    (3.13)
Now we’ve done essentially all maximum likelihood calculations. This one was the only one that
was not totally trivial. At repeated occurrences of the same factors, we have been and will be
less explicit about these calculations. We’ll derive likelihood functions, note that they factorise
and identify the factors as being of one of the three forms

    (1 − q)^{ℓ−d} q^d   ⟹   q̂ = d/ℓ
    µ^d e^{−µℓ}   ⟹   µ̂ = d/ℓ
    (1 − q1 − . . . − qm)^{ℓ − d1 − ... − dm} ∏_{j=1}^m qj^{dj}   ⟹   q̂j = dj/ℓ,  j = 1, . . . , m,

and deduce the estimates.



3.3 Example: Workforce model


A company is modelling its workforce using the model
 
    Q(t) = [ −λ(t) − σ(t) − µ(t)   λ(t)   σ(t)   µ(t) ]
           [          0              0      0      0  ]
           [          0              0      0      0  ]
           [          0              0      0      0  ]    (3.14)

with four states S = {W, V, I, ∆}, where W =‘working’, V =’left the company voluntarily’,
I =’left the company involuntarily’ and ∆ =’left the company through death’.
If we observe nx people aged x, then
    λ̂x = dx,V / ℓ̃x,    σ̂x = dx,I / ℓ̃x,    µ̂x = dx,∆ / ℓ̃x,    (3.15)

where `˜x is the total amount of time spent working aged x, dx,V is the total number of workers
who left the company voluntarily aged x, dx,I is the total number of workers who left the company
involuntarily aged x, dx,∆ is the total number of workers dying aged x.
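A hypothetical sketch of how these estimates might be computed from raw records in R (the data
frame records and its column names are assumptions for illustration, not part of the notes):

## records is assumed to have one row per worker per year of age, with columns
##   age  -- completed age x during that year,
##   expo -- time (in years) spent working at that age,
##   exit -- "none", "V" (voluntary), "I" (involuntary) or "D" (death).
ell.tilde <- tapply(records$expo, records$age, sum)            # total time at risk, by age
d <- function(type) tapply(records$exit == type, records$age, sum)
lambda.hat <- d("V") / ell.tilde
sigma.hat  <- d("I") / ell.tilde
mu.hat     <- d("D") / ell.tilde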

3.4 The distribution of the endpoint


The time-homogeneous multiple-decrement model makes a transition at the minimum of m
exponential clocks as opposed to one clock in the single decrement model. In the same way, we
can construct the time-inhomogeneous multiple-decrement model from m independent clocks Cj
with hazard function λj (t), 1 ≤ j ≤ m. Then the likelihood for a transition at time t to state j
is the product of fCj (t) and F̄Ck (t).
By Exercise A.1.2, the hazard function of L = min{C1 , . . . , Cm } is given by hL (t) = hC1 (t) +
. . . + hCm (t) = λ1 (t) + . . . + λm (t) = λ+ (t), and we can also calculate

 P t ≤ Cj < t + ε, min{Ck : k 6= j} ≥ Cj
P L = Cj L = t = lim 
ε↓0 P t≤L<t+ε

1


ε P t ≤ Cj < t + ε, min{Ck : k 6= j} ≥ t + ε
 ≥ lim


1
 
εP t ≤ L < t + ε
ε↓0
1

ε P t ≤ Cj < t+ ε, min{Ck k 6= j} ≥ t
:



 lim 1
εP t ≤ L < t + ε

 ε↓0
Q
fCj (t) k6=j F̄Ck (t) hCj (t) λj (t)
= = = ,
fL (t) hL (t) λ+ (t)
and we obtain
Z ∞ Z ∞
P(L = Cj ) = P(L = Cj |L = t)fL (t)dt = λj (t)F̄L (t)dt = E(Λj (L)), (3.16)
0 0
Rt
where Λj (t) = 0 λj (s)ds is the integrated hazard function. (For the last step we used that
R∞
E(g(L)) = 0 g 0 (t)F̄L (t)dt for all increasing differentiable g : [0, ∞) → [0, ∞) with g(0) = 0.)
The discrete (curtate) lifetime model: We can also split the curtate lifetime K = [L]
according to the type of decrement J (J = j if L = Tj ) and define

qj,x = P(L < x + 1, J = j|L > x), 1 ≤ j ≤ m, x ∈ N, (3.17)



then clearly for x ∈ N


q1,x + . . . + qm,x = qx (3.18)
and, for 1 ≤ j ≤ m,

p(J,K) (j, x) = P(J = j, K = x) = P(L ≤ x + 1, J = j|L > x)P(L > x) = p0 . . . px−1 qj,x . (3.19)

Note that this bivariate probability mass function is simple, whereas the joint distribution of
(L, J) is conceptually more demanding since L is continuous and J is discrete. We chose to
express the marginal probability density function of L and the conditional probability mass
function of J given L = t. In the assignment questions, you will see an alternative description
in terms of sub-probability densities gj(t) = (d/dt) P(L ≤ t, J = j), which you can normalise –
gj(t)/P(J = j) is the conditional density of L given J = j.

3.5 Cohabitation–dissolution model


There has been considerable interest in the influence of nonmarital birth on the likelihood of a
child growing up without one of its parents. In the paper [17] relevant data are given for nine
different western European countries. We give a summary of some of the UK data in Table 3.1.
We represent the data in terms of a multiple decrement model in which the one nonabsorbing
state is cohabitation, and this leads to the two absorbing states, which are marriage or separation.
(Of course, there is a third absorbing state, corresponding to the death of one of the partners,
but this did not appear in the data. And of course, the marriage state is not actually absorbing,
except in fairy tales. A more complete analysis would splice this model onto a model of the fate of
marriages. There are data in the article on rates of separation after marriage, for those interested
in filling out the model.) Time, in this model, begins with the birth of the first child. Because
of the way the data are given, we treat the hazard rates as constant in the time intervals [0, 1],
[1, 3], and [3, 5]. There are no data about what happens after 5 years. We write dx^M and dx^S for

the number of individuals marrying and separating, respectively, and similar for the estimation
of hazard rates. (For simplicity, we have divided the separation data, which were actually only
given for the periods [0, 3] and [3, 5], as though there were a count for separations in [0, 1].)

Table 3.1: Data from [17] on rates of conversion of cohabitations into marriage or separation, by
years since birth of first child

(a) % of cohabiting couples remaining together (from among those who did not marry):
      n     after 3 years   after 5 years
      106   61              48

(b) % of cohabiting couples who marry within stated time:
      n     1 year   3 years   5 years
      150   18       30        39

Translating the data in Table 3.1 into a multiple-decrement life table requires some interpretive
work.

1. There are only 106 individuals given for the data on separation; this is both because the
individuals who eventually married were excluded from this tabulation, and because the
two tabulations were based on slightly different samples.

2. The data are given in percentages.



3. There is no count of separations in the first year.

4. Note that separations are given by survival percentages, while marriages are given by loss
percentages.

We now construct a combined life table from the data in Table 3.1. The purpose of this
model is to integrate information from the two data sets. This requires some assumptions, to
wit, that the transition rates to the two different absorbing states are the same for everyone, and
that they are constant over the periods 0–1, 1–3, 3–5 (and constant over 0–3 for separation).
The procedure is essentially the same as the construction of the single-decrement life table,
except that the survival is decremented by both loss counts dM x and dx ; and the estimation of
S
˜
years at risk `x now depends on both decrements, so is
    ℓ̃x = ℓ_{x′}(x′ − x) + (dx^M + dx^S)(x′ − x)/2,
where x′ is the next age on the life table. Thus, for example, ℓ̃1, which is the number of years at
risk from age 1 to 3, is 64 · 2 + 41 · 1 = 169.
risk from age 1 to 3, is 64 · 2 + 41 · 1 = 169.
One of the challenges is that we observe transitions to Separation conditional on never being
Married, but the transitions to Married are unconditioned. The data for marriage are more
straightforward: These are absolute decrements, rather than conditional ones. If we set up a life
table on a radix of 1000, we know that the decrements due to marriage should be exactly the
percentages given in Table 3.1(b); that is, 180, 120, and 90. We begin by putting these into our
multiple-decrements life table, Table 3.2.
Of these nominal 1000 individuals, there are 610 who did not marry. Multiplying by the
percentages in Table 3.1(a) we estimate 238 separations in the first 3 years, and 79 in the next 2
years.

Table 3.2: Multiple decrement life table for survival of cohabiting relationships, from time of
birth of first child, computed from data in Table 3.1.

x      ℓx     dx^M   dx^S   ℓ̃x    µ̂x^M   µ̂x^S
0–1    1000   180    ?      ?      ?       ?
1–3    ?      120    ?      ?      ?       ?
3–5    462    90     79     755    ?       ?

At this point, the only thing preventing us from completing the life table is that we don’t
know how to allocate the 238 separations between the first two rows, which makes it impossible
to compute the total years at risk for each of these intervals. The easiest way is to begin by
making a first approximation by assuming the marriage rate is unchanged over the first three
years, and then coming back to correct it. Our first step, then, is the multiple-decrement table
in Table 3.3. According to our usual approximation, we estimate the years at risk for the first
period as 1000 · 3 − 538 · 1.5 = 2193; and in the second period as 462 · 2 − 169 · 1 = 755. The two
decrement rates are then estimated by dividing the number of events by the number of years at
risk.
What correction do we need? We need to estimate the number of separations in the first year.
Our model says that we have two independent times S and M , the former with constant hazard

Table 3.3: First approximation multiple decrement table.

x      ℓx     dx^M   dx^S   ℓ̃x     µ̂x^M   µ̂x^S
0–3    1000   300    238    2193    0.137   0.109
3–5    462    90     79     755     0.119   0.105

rate µ^S_{1.5} on [0, 3], the other with rate µ^M_{1/2} on [0, 1] and µ^M_2 on [1, 3]. Of the 238 separations in
[0, 3], we expect the fraction in [0, 1] to be about
    P(S < 1 | S < min{3, M}) = P{min{S, M} < 1 & S < M}
        / ( P{min{S, M} < 1 & S < M} + P{1 ≤ min{S, M} < 3 & S < M} ).
Since min{S, M} is exponential with rate µ^M_{1/2} + µ^S_{1.5} on [0, 1] and rate µ^M_2 + µ^S_{1.5} on [1, 3], and
independent of {S < M}, we have
    P{min{S, M} < 1 & S < M} = ( µ^S_{1.5} / (µ^M_{1/2} + µ^S_{1.5}) ) ( 1 − e^{−µ^M_{1/2} − µ^S_{1.5}} );
    P{1 < min{S, M} < 3 & S < M} = ( µ^S_{1.5} / (µ^M_2 + µ^S_{1.5}) ) e^{−µ^M_{1/2} − µ^S_{1.5}} ( 1 − e^{−2µ^M_2 − 2µ^S_{1.5}} ).
In Table 3.3 we have taken both values µ^M to be equal, the estimate being 0.137; were we to do
a second round of correction we could take the estimates to be distinct. We thus obtain
    P{min{S, M} < 1 & S < M} ≈ ( 0.109 / (0.137 + 0.109) ) ( 1 − e^{−0.137−0.109} ) = 0.0966;
    P{1 < min{S, M} < 3 & S < M} ≈ ( 0.109 / (0.137 + 0.109) ) e^{−0.137−0.109} ( 1 − e^{−2·0.137−2·0.109} ) = 0.135.
0.137 + 0.109
Thus, we allocate 0.418 ∗ 238 = 99.4 of the separations to the first period, and 138.6 to the
second. This way we can complete the life table by computing the total years at risk for each
period, and hence the hazard rate estimates.
In principle, we could do a second round of approximation, based on the updated hazard rate
estimates. In fact, there is no real need to do this. The approximations are not very sensitive to
the exact value of the hazard rate. If we do substitute the new values in, the estimated fraction
of separations occurring in the first year will shift only from 0.418 to 0.419.

Table 3.4: Second approximation Multiple decrement life table.

x      ℓx      dx^M   dx^S    ℓ̃x      µ̂x^M   µ̂x^S
0–1    1000    180    99.4    860.2    0.209   0.116
1–3    720.6   120    138.6   1183     0.101   0.116
3–5    462     90     79      755      0.119   0.105

We are now in a position to use the model to draw some potentially interesting conclusions.
For instance, we may be interested to know the probability that a cohabitation with children will

end in separation. We need to decide what to do with the lack of observations after 5 years. For
simplicity, let us assume that rates remain constant after that point, so that all cohabitations
would eventually end in one of these fates. Applying the formula (3.16), we see that
    P(separate) = ∫_0^∞ µx^S F̄(x) dx.
We have then
    P(separate) = 0.116 ∫_0^1 e^{−0.325x} dx + 0.116 ∫_1^3 e^{−0.325−0.217(x−1)} dx + 0.105 ∫_3^∞ e^{−0.759−0.224(x−3)} dx
                = (0.116/0.325)[1 − e^{−0.325}] + (0.116/0.217)[e^{−0.325} − e^{−0.759}] + (0.105/0.224)[e^{−0.759}]
                = 0.099 + 0.136 + 0.219
                = 0.454.
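The same number can be checked in R directly from the piecewise-constant hazards of Table 3.4,
working interval by interval in closed form:

muS <- c(0.116, 0.116, 0.105)   # separation hazard on [0,1), [1,3), [3,Inf)
muM <- c(0.209, 0.101, 0.119)   # marriage hazard on the same intervals
S1 <- exp(-(muS[1] + muM[1]))                 # survival of the cohabitation to time 1
S3 <- S1 * exp(-2 * (muS[2] + muM[2]))        # survival to time 3
p.sep <- muS[1]/(muS[1] + muM[1]) * (1 - S1) +
         muS[2]/(muS[2] + muM[2]) * (S1 - S3) +
         muS[3]/(muS[3] + muM[3]) * S3
p.sep    # approximately 0.454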
Chapter 4

Survival analysis

4.1 Incomplete observations: Censoring and truncation


Why do we need a special statistical theory of event times? If we observe a collection of random
times T1 , . . . , Tn , we may study their distribution just as we study any other random variables.
The problem is, event times are often incompletely observed, and the incompleteness of the data
is connected with the one-dimensional structure of time.
The most common incomplete observation for event times is right censoring. Right censoring
occurs when a subject leaves the study before an event occurs, or the study ends before the
event has occurred. For example, we consider patients in a clinical trial to study the effect of
treatments on stroke occurrence. The study ends after 5 years. Those patients who have had no
strokes by the end of the year are censored. If the patient leaves the study at time te , then the
event occurs in (te , ∞) .
Left censoring is when the event of interest has already occurred before enrolment. This is
very rarely encountered.
Truncation is deliberate and due to study design.
Right truncation occurs when the entire study population has already experienced the
event of interest (for example: a historical survey of patients on a cancer registry).
Left truncation occurs when the subjects have been at risk before entering the study (for
example: life insurance policy holders where the study starts on a fixed date, event of interest is
age at death).
Generally we deal with right censoring and left truncation.
Three types of independent right censoring:
Type I: All subjects start and end the study at the same fixed time. All are censored at the
end of the study. If individuals have different but fixed censoring times, this is called progressive
type I censoring.
Type II: study ends when a fixed number of events amongst the subjects has occurred.
Type III or random censoring: Individuals drop out or are lost to followup at times
that are random, rather than predetermined. We generally assume that the censoring times are
independent of the event times, in which case there is no need to distinguish this from Type I.
Skeptical question: Why do we need special techniques to cope with incomplete observa-
tions? Aren’t all observations incomplete? After all, we never see all possible samples from the
distribution. If we did, we wouldn’t need any sophisticated statistical analysis.
The point is that most of the basic techniques that you have learned assume that the observed
values are interchangeable with the unobserved values. The fact that a value has been observed
does not tell us anything about what the value is. In the case of censoring or truncation, there is


dependence between the event of observation and the value that is observed. In right-censoring,
for instance, the fact of observing a time implies that it occurred before the censoring time. The
distribution of a time conditioned on its being observed is thus different from the distribution of
the times that were censored.
There are different levels of independence, of course. In the case of random (type III)
censoring, the censoring time itself is independent of the (potentially) observed time. In Type II
censoring, the censoring time depends in a complicated way on all the observation times.

4.2 Likelihood and Right Censoring


4.2.1 Random censoring
Suppose that Te is the time to event and that C is the time to the censoring event. Assume that
all subjects may have an event or be censored. That is, we get to observe for each individual i
only ti := min{t̃i, ci}, and the censoring indicator
    δi := 1 if t̃i ≤ ci,   0 if t̃i > ci.

It was shown by [27] that it is impossible to reconstruct the joint distribution of (Te, C) from
these data. It is possible to reconstruct the marginal distributions under the assumption that
event times Te and censoring times C are independent.
Define

f (t) and fC (t) to be the densities,


S(t) and SC (t) to be the survival functions,
h(t) and hC (t) to be the hazard functions

of the event time and the censoring time respectively. Then the likelihood is
Y Y
L= f (tei )SC (tei ) S(ci )fC (ci )
δi =1 δi =0
Y Y
= h(ti )δi S(ti ) × hC (ti )1−δi SC (ti ).
i i

Since this factors into an expression involving the distribution of Te and one involving the
distribution of C, we may perform likelihood inference on the event-time distribution without
reference to the censoring time distribution.
4.2.2 Non-informative censoring
As stated above, the random censoring assumption is both too strong and too weak. It is too
strong because it makes highly restrictive assumptions about the nature of the censoring process.
Note that this factorisation depends upon the assumption of independent censoring. In fact, the
same conclusion follows from a weaker assumption, called non-informative censoring.
We generally want to retain the assumption that the event times Tei are jointly independent.
The censoring process is called non-informative if it satisfies the following conditions:
1. For each fixed t, and each i, the distribution of (Ti − t)1{Ti >t} is independent of {Ci > t}.
That is, knowing that an individual was or was not censored by time t gives no information
about the lifetime remaining after time t, if they were still alive at time t.

2. The event-time distribution and the censoring distribution do not depend on the same
parameters.

The latter condition cannot be formalised in a frequentist framework, since we would be taking
the distributions as fixed (but unknown), allowing no formal way to define “dependence”. In a
parametric setting this is simply the assertion that we maximise the likelihood in Te without
regard to the distribution of C. (Note that in the absence of an independence assumption, such
as the first condition above, it would not be possible to define a joint distribution in which the
distribution of Te varies in a way that the distribution of C ignores.) In a Bayesian context we
are asserting that prior distribution on the distribution of (Te, C) makes the distribution of Te
independent of the distribution of C.
Note that this is a convenient modelling assumption, not an assumption about the
underlying process. We are free to impose this assumption. If the assumption has been imposed
incorrectly, the main effect will be only to reduce the efficiency of estimation.
The first assumption, on the other hand — or some substitute assumption — is absolutely
necessary for the procedures we describe — or, indeed, any such procedure — to yield unbiased
estimates. In a regression setting where we consider the covariates xi to be random we weaken
the independence assumptions to be merely independence conditional on the observed covariates.
As an example, Type II censoring is clearly not independent, but it is non-informative.
Warning: The definitions of random and non-informative censoring are not entirely standardised.

4.3 Time on test


Our models include a “time” parameter, whose interpretation can vary. First of all, in population-
level models (for instance, a birth-death model of population growth, where the state represents
the number of individuals) the time is true calendar time, while in individual-level models (such
as our multiple-decrement model of death due to competing risks, or the healthy-sick-dead
process, where there is a single model run for each individual) the time parameter is more likely
to represent individual age. Within the individual category, the time to event can literally be
the age, for instance in a life insurance policy. In a clinical trial it will more typically be time
from admission to the trial, also called “time on test”.
For example, consider the following data from a Sydney hospital pilot study, concerning the
treatment of bladder cancer:

Time to cancer Time to recurrence Time between Recurrence status


0.00 4.97 4.97 1
21.02 22.99 1.97 1
45.03 61.09 16.05 0
52.17 55.03 2.86 1
48.06 65.03 16.97 0

All times are in months. Each patient has their own zero time, the time at which the patient
entered the study (accrual time). For each patient we record time to event of interest or censoring
time, whichever is the smaller, and the status, δ = 1 if the event occurs and δ = 0 if the patient
is censored. If it is the recurrence that is of interest, so in fact the relevant time, the “time
between”, is measured relative to the zero time that is the onset of cancer.

4.4 Non-parametric survival estimation


4.4.1 Review of basic concepts
Consider random variables X1 , . . . , Xn which represent independent observations from a distribu-
tion with cdf F . Given a class F of possibilities for F , an estimator is a choice of the “best”, on
the basis of the data. That is, it is a function from Rn+ to F that maps a collection of observations
(x1 , . . . , xn ) to F.
Estimators for distribution functions may be either parametric or non-parametric, de-
pending on the nature of the class F. The distinction is not always clear-cut. A parametric
estimator is one for which the class F depends on some collection of parameters. For example, it
might be the two-dimensional family of all gamma distributions. A non-parametric estimator is
one that does not impose any such parametric assumptions, but allows the data to “speak for
themselves”. There are intermediate non-parametric approaches as well, where an element of F
is not defined by any small number of parameters, but is still subject to some constraint. For
example, F might be the class of distributions with smooth hazard rate, or it might be the class
of log-concave distribution functions (equivalent to having increasing hazard rate). We will also
be concerned with semi-parametric estimators, where an underlying infinite-dimensional class
of distributions is modified by one or two parameters of special interest.
The disadvantage of parametrisation is always that it distorts the observations; the advantage
is that it allows the data from different observations to be combined into a single parameter
estimate. (Of course, if the data are known to come from some distribution in the parametric
family, the “distortion” is also an advantage, because the real distortion was in the data, due to
random sampling.)
We start by considering nonparametric estimators of the cdf. These have the advantage of
limiting the assumptions imposed upon the data, but the disadvantage of being too strictly
limited by the data. That is, taken literally, the estimator we obtain from a sample of observed
times will imply that only exactly those times actually observed are possible.
If there are observations x1 , . . . , xn from a random sample then we define the empirical
distribution function
    F̂(x) = (1/n) #{xi : xi ≤ x}.
This is an appropriate non-parametric estimator for the cdf — it is consistent, for example,
and F̂ (x) is an unbiased estimator for F (x) for each fixed x — in the absence of censoring. It
needs to be modified when the data are censored.
Suppose that the observations are (Ti , δi ) for i = 1, 2 . . . , n. We consider the component of
the likelihood that relates to the distribution of the survival times:
    L = ∏_i f(Ti)^{δi} S(Ti)^{1−δi}
      = ∏_i f(Ti)^{δi} (1 − F(Ti))^{1−δi}.

4.4.2 Kaplan–Meier estimator


The first nonparametric estimator for the survival function in the presence of non-informative
right censoring was described by Kaplan and Meier in their 1958 paper [16]. The work has
been cited more than 50 thousand times, which probably underestimates its influence, since the
Kaplan–Meier estimator has become so ubiquitous in survival analysis that most applications do
not bother to cite a source.

The basic idea is the following: There is no way to estimate a hazard rate from data without
some kind of smoothing, so the most direct representation of the data comes from estimating the
survival function directly. On any interval [a, b) on which no events have been observed, it is
natural to estimate Ŝ(b) − Ŝ(a) = 0. That is, the estimated probability of an event occurring in
this interval is the empirical probability 0. This leads us to estimate the survival function by a
step function whose jumps are at the points tj where events have been observed.
We conventionally list the event times (not the censoring times) in order as (t0 = 0 <)t1 <
. . . < tj < . . . . If there are ties — several individuals i with Ti = tj and δi = 1 — we represent
this by a count dj , being the number of individuals whose event was at tj . (Thus all dj are at
least 1.)
At each event point tj there is a drop in Ŝ. (Like cdfs, survival functions are taken to be right-
continuous: S(t) = P{T̃ > t}.) The size of the drop at an event time tj is
    S(tj−) − S(tj) = P{T̃ = tj},
since S(tj−) = P{T̃ ≥ tj}. We may rewrite this as
    S(tj) = S(tj−) ( 1 − P{T̃ = tj} / S(tj−) )
          = S(tj−) ( 1 − P(T̃ = tj | T̃ ≥ tj) ).

Estimating the change in S is then a matter of estimating this conditional probability, which is
also known as the discrete hazard hj . We have the natural estimator (which is also the MLE) for
this conditional probability, that is the empirical fraction of those observed to meet the condition
— that is, they survived at least to time tj — who died at time tj . This will be dj /nj , where nj is
the number of individuals at risk at time tj — that is, individuals i such that Ti ≥ tj , meaning
that they have neither been censored, nor have they had their event, before time tj .
The Kaplan-Meier estimator is the result of this process:
    Ŝ(t) = ∏_{tj ≤ t} (1 − ĥj) = ∏_{tj ≤ t} (1 − dj/nj),

where

nj = #{in risk set at tj },


dj = #{events at tj }.

Note
nj+1 + cj + dj = nj
where cj = #{censored in [tj , tj+1 )}. If there are no censored observations before the first failure
time then n0 = n1 = #{in study}. Generally we assume t0 = 0.
4.4.3 Nelson–Aalen estimator
The Nelson–Aalen estimator for the cumulative hazard function is
    Ĥ(t) = Σ_{tj ≤ t} dj/nj = Σ_{tj ≤ t} ĥj.

This is natural for a discrete estimator, as we have simply summed the estimates of the hazards
at each time, instead of integrating, to get the cummulative hazard. This correspondingly gives
an estimator of S of the form
 
S(t)
e = exp −H(t)b
 
X di
= exp − 
ni
ti ≤t

It is not difficult to show by comparing the functions 1 − x, exp(−x) on the interval 0 ≤ x ≤ 1,


that S(t)
e ≥ S(t).b
4.4.4 Invented data set
Suppose that we have 10 observations in the data set with failure times as follows:

2, 5, 5, 6+, 7, 7+, 12, 14+, 14+, 14+ (4.1)

Here + indicates a censored observation. Then we can calculate both estimators for S(t) at all
time points. It is considered unsafe to extrapolate much beyond the last time point, 14, even
with a large data set.

Table 4.1: Computations of survival estimates for invented data set (4.1)

tj    dj   nj   ĥj     Ŝ(tj)   S̃(tj)
2     1    10   0.10   0.90    0.90
5     2    9    0.22   0.70    0.72
7     1    6    0.17   0.58    0.61
12    1    4    0.25   0.44    0.48
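The same computations can be obtained from the survival package; a short sketch follows (the
Nelson–Aalen column is assembled by hand from the survfit components):

library(survival)
time   <- c(2, 5, 5, 6, 7, 7, 12, 14, 14, 14)
status <- c(1, 1, 1, 0, 1, 0, 1,  0,  0,  0)   # 1 = event, 0 = censored ("+")
fit <- survfit(Surv(time, status) ~ 1)
summary(fit)                                   # Kaplan-Meier column of Table 4.1

## Nelson-Aalen estimate and the corresponding survival estimate, by hand:
H.hat   <- cumsum(fit$n.event / fit$n.risk)
S.tilde <- exp(-H.hat)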

4.4.5 Example: The AML study


In the 1970s it was known that individuals who had gone into remission after chemotherapy
for acute lymphatic leukemia would benefit — by longer remission times — from a course of
continuing “maintenance” chemotherapy. A study [8] pointed out that “Despite a lack of conclusive
evidence, it has been assumed that maintenance chemotherapy is useful in the management of
acute myelogenous leukemia (AML).” The study set out to test this assumption, comparing the
duration of remission between an experimental group that received the additional chemotherapy,
and a control group that did not. (This analysis is based on the discussion in [23].)
The data are from a preliminary analysis of the data, before completion of the study. The
duration of complete remission in weeks was given for each patient (11 maintained, 12 non-
maintained controls); those who were still in remission at the time of the analysis are censored
observations. The data are given in Table 4.2. They are included in the survival package of R,
under the name aml.
The first thing we do is to estimate the survival curves. The summary data and computations
are given in Table 4.3. The Kaplan-Meier survival curves are shown in Figure 4.1.
In section 4.6.2 we describe how to compute confidence intervals for the survival probabilities.

Table 4.2: Times of complete remission for preliminary analysis of AML data, in weeks. Censored
observations denoted by +.

maintained 9 13 13+ 18 23 28+ 31 34 45+ 48 161+


non-maintained 5 5 8 8 12 16+ 23 27 30 33 43 45

Table 4.3: Computations for the Kaplan–Meier and Nelson–Aalen survival curve estimates of the
AML data.

           Maintenance                               Non-Maintenance (control)
ti    ni   di   ĥi    Ŝ(ti)   Ĥi    S̃(ti)      ni   di   ĥi    Ŝ(ti)   Ĥi    S̃(ti)
5 11 0 0.00 1.00 0.00 1.00 12 2 0.17 0.83 0.17 0.85
8 11 0 0.00 1.00 0.00 1.00 10 2 0.20 0.67 0.37 0.69
9 11 1 0.09 0.91 0.09 0.91 8 0 0.00 0.67 0.37 0.69
12 10 0 0.00 0.91 0.09 0.91 8 1 0.12 0.58 0.49 0.61
13 10 1 0.10 0.82 0.19 0.83 7 0 0.00 0.58 0.49 0.61
18 8 1 0.12 0.72 0.32 0.73 6 0 0.00 0.58 0.49 0.61
23 7 1 0.14 0.61 0.46 0.63 6 1 0.17 0.49 0.66 0.52
27 6 0 0.00 0.61 0.46 0.63 5 1 0.20 0.39 0.86 0.42
30 5 0 0.00 0.61 0.46 0.63 4 1 0.25 0.29 1.11 0.33
31 5 1 0.20 0.49 0.66 0.52 3 0 0.00 0.29 1.11 0.33
33 4 0 0.00 0.49 0.66 0.52 3 1 0.33 0.19 1.44 0.24
34 4 1 0.25 0.37 0.91 0.40 2 0 0.00 0.19 1.44 0.24
43 3 0 0.00 0.37 0.91 0.40 2 1 0.50 0.10 1.94 0.14
45 3 0 0.00 0.37 0.91 0.40 1 1 1.00 0.00 2.94 0.05
48 2 1 0.50 0.18 1.41 0.24 0 0

Figure 4.1: Kaplan-Meier estimates of survival in maintenance (black) and non-maintenance
groups in the AML study.
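Since the data ship with the survival package under the name aml, the whole analysis can be
reproduced in a few lines (a sketch; the exact plotting options behind Figure 4.1 are not recorded
here):

library(survival)
head(aml)                                    # columns: time, status, x (treatment group)
fit <- survfit(Surv(time, status) ~ x, data = aml)
summary(fit)                                 # the computations of Table 4.3
plot(fit, xlab = "Time (weeks)", ylab = "Survival")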

4.5 Left truncation


Left truncation is easily dealt with in the context of nonparametric survival estimation. Suppose
the invented data set comes from the following hidden process: There is an event time, and an
independent censoring time, and, in addition, a truncation time, which is the time when that
individual becomes available to be studied. For example, suppose this were a nursing home
population, and the time being studied is the number of years after age 80 when the patient first
shows signs of dementia. The censoring time might be the time when the person dies or moves
away, or when the study ends. The study population consists of those who have entered the
nursing home free of dementia. The truncation time would be the age at which the individual
moves into the nursing home.

Table 4.4: Invented data illustrating left truncation. Event times after the censoring time may
be purely nominal, since they may not have occurred at all; these are marked with *. The row
Observation shows what has actually been observed. When the event time comes before the
truncation time the individual is not included in the study; this is marked by a ◦.

Patient ID 5 2 9 0 1 3 7 6 4 8
Event time 2 5 5 * 7 * 12 * * *
Censoring time 10 8 7 8 11 7 14 14 14 14
Truncation time −2 3 6 0 1 0 6 6 −5 1
Observation 2 5 ◦ 8+ 7 7+ 12 14+ 14+ 14+

Table 4.5: Computations of survival estimates for invented data set of Table 4.4.

tj    dj   nj   ĥj     Ŝ(tj)   S̃(tj)
2 1 6 0.17 0.83 0.85
5 1 6 0.17 0.69 0.72
7 1 7 0.14 0.58 0.62
12 1 4 0.25 0.45 0.48

We give a version of these data in Table 4.4. Note that patient number 9 was truncated at
time 6 (i.e., entered the nursing home at age 86) but her event was at time 5 (i.e., she had already
suffered from dementia since age 85), hence was not included in the study. In table 4.5 we give
the computations for the Kaplan–Meier estimate of the survival function. The computations are
exactly the same as those of sectionP 4.4.4, except
P for one important change: The number at risk
nj is not simply the number n − tj <t dj − tj <t kj of individuals who have not yet had their
event or censoring time. Rather, an individual is at risk at time t if her event time and censoring
time are both ≥ t, and if the truncation time is ≤ t. (As usual, we assume that individuals who
have their event or are censored in a given year, were at risk during that year. We are similarly
assuming that those who entered the study at age x are at risk during that year.) At the start of
our invented study there are only 6 individuals at risk, so the estimated hazard for the event at
age 2 becomes 1/6.

In the most common cases of truncation we need do nothing at all, other than be careful
in interpreting the results. For instance, suppose we were simply studying the age after 80 at
which individuals develop dementia by a longitudinal design, where 100 healthy individuals 80
years old are recruited and followed for a period of time. Those who are already impaired at age
80 are truncated. All this means is that we have to understand (as we surely would) that the
results are conditional on the individual not suffering from dementia until age 80.
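Left truncation is handled in the survival package by the counting-process form Surv(start,
stop, event). A sketch for the invented data of Table 4.4 (subjects in the order of the table,
with patient 9 omitted):

library(survival)
entry <- c(-2, 3, 0, 1, 0, 6, 6, -5, 1)      # truncation (entry) times
exit  <- c( 2, 5, 8, 7, 7, 12, 14, 14, 14)   # event or censoring times
event <- c( 1, 1, 0, 1, 0, 1,  0,  0,  0)
## (if negative entry times cause complaints, add a constant to all times;
##  only the ordering matters here)
fit <- survfit(Surv(entry, exit, event) ~ 1)
summary(fit)    # numbers at risk and estimates as in Table 4.5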

4.6 Variance estimation: Greenwood’s formula


We need to find confidence intervals (pointwise) for the estimators of S(t) at each time point.
The estimators are both based on the discrete hazard hj at event time tj , which we estimate by
ĥj = dj /nj :
    Kaplan–Meier:  Ŝ(t) = ∏_{tj ≤ t} (1 − ĥj),      Nelson–Aalen:  S̃(t) = exp{ −Σ_{tj ≤ t} ĥj }.

4.6.1 The cumulative hazard


Suppose the survival function were genuinely discrete, with conditional probability hj (the
“discrete hazard”) of an event occurring at tj for any individual i with Ti ≥ tj. Then ĥj = dj/nj
is an unbiased estimator for hj. The dj are binomially distributed with parameters (nj, hj),
hence variance nj hj(1 − hj). Furthermore, while the numbers of events dj are not independent
— the underlying parameter nj, the number of individuals at risk, depends on the outcomes of
all the survival events preceding tj — the expected value of dj/nj is always hj. If the nj are
large, then, we may write
    ĥj = hj + Zj sqrt( hj(1 − hj)/nj ),
where the Zj are approximately independent standard normal, and approximately independent
of the nj. If we take this to be real independence, then conditioned on nj the variance of the
sum is the sum of the variances, so
    Ĥ(t) − H(t) = Σ_{tj ≤ t} (ĥj − hj) = ( Σ_{tj ≤ t} hj(1 − hj)/nj )^{1/2} Z,
where Z is approximately standard normal. Substituting, as usual, ĥj for the unknown hj we
get the approximation
    Ĥ(t) − H(t) ≈ ( Σ_{tj ≤ t} dj(nj − dj)/nj³ )^{1/2} Z.    (4.2)
This yields
    σ̃²(t) := Σ_{tj ≤ t} dj(nj − dj)/nj³    (4.3)
as an estimator for the variance¹ of Ĥ(t). So an approximate 100(1 − α)% confidence interval
for H(t) is
    Ĥ(t) ± z_{1−α/2} σ̃(t).    (4.4)

¹ Note that the thing σ̃² is estimating is actually a conditional variance, conditioned on nj. The true variance
of Ĥ(t) is the expected value of this quantity. This is much like the distinction between observed and expected
information. And as in that situation you are free to take one of two perspectives: You could think of the
“observed variance” as a good-enough approximation to the expected variance, given that the expected variance
is hard to compute — and actually includes a non-zero probability of a 0/0. But more appropriate is to refer
to (4.2), to see that the observed variance is the thing you really want, since the error of this observation is
approximately normal with SD σ̃(t).

In fact, we don’t want to model the cumulative hazard as a step function with fixed jump
times. We could imagine interpolating extra “jump times”, when no events happen to have
been observed — hence with discrete hazard estimate 0 — and note that the estimate of Ĥ
and its variance is unchanged, allowing us to approximate a continuous cumulative hazard to
any accuracy we wish. To see a more mathematically satisfying derivation for σ̃(t), that uses
martingale methods to avoid discretisation, see [2].
4.6.2 The survival function and Greenwood’s formula
We can apply (4.4) directly to obtain a confidence interval for S(t) = e−H(t)
 n o n o
exp −H(t)b − z1−α/2 σ̃(t) , exp −H(t) b + z1−α/2 σ̃(t)
 n o n o (4.5)
= S(t) exp −z1−α/2 σ̃(t) , S(t) exp z1−α/2 σ̃(t)
e e .

While the confidence interval (4.5) is preferable in some respects, we also need a confidence
interval centred on the traditional Kaplan–Meier estimator. To derive an estimator for the
variance we start with the formula
X
log Ŝ(t) = log(1 − ĥj ).
tj ≤t

The delta method gives us the variance of the increments to be approximately


hj dj
(1 − hj )−2 Var(ĥj ) = ≈ .
nj (1 − hj ) nj (nj − dj )

By the same reasoning as in section 4.6.1 we get an estimator

σ̂(t) := ∑_{tj≤t} dj / (nj(nj − dj))

for the variance of log Ŝ(t). We may use this exactly as in (4.5) to produce confidence intervals
for log S(t), and hence for S(t). Traditionally we apply the delta method again to produce the
estimator
σ²_G(t) := Ŝ(t)² ∑_{tj≤t} dj / (nj(nj − dj))

for the variance of Ŝ(t). This is called Greenwood’s formula.


Note that Greenwood’s formula works perfectly well with any combination of left truncation
and right censoring. The one problem that arises is that individuals may enter into the study
slowly, yielding a small number at risk, and hence very wide error bounds, which of course will
carry through to the end.
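As a rough illustration of how these formulas are used in practice, here is a minimal sketch in R. It assumes only vectors nj and dj of numbers at risk and event counts at the successive event times (the values below are made up), and computes the Kaplan–Meier estimate with the log-scale confidence interval of (4.5) and the Greenwood standard error.

# Sketch: Greenwood-type variance estimates from numbers at risk (nj) and
# event counts (dj) at successive event times (hypothetical inputs).
greenwood <- function(nj, dj, alpha = 0.05) {
  z <- qnorm(1 - alpha/2)
  hj <- dj/nj                              # discrete hazard estimates
  S.km <- cumprod(1 - hj)                  # Kaplan-Meier survival
  var.H <- cumsum(dj*(nj - dj)/nj^3)       # (4.3): estimated variance of H-hat
  var.logS <- cumsum(dj/(nj*(nj - dj)))    # estimated variance of log S-hat
  data.frame(S.km  = S.km,
             se.G  = S.km*sqrt(var.logS),  # Greenwood standard error
             lower = S.km*exp(-z*sqrt(var.logS)),
             upper = pmin(1, S.km*exp(z*sqrt(var.logS))))
}
greenwood(nj = c(12, 10, 8, 6), dj = c(1, 2, 1, 1))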

4.6.3 Reminder of the δ method

If the random variation of Y around µ is small (for example if µ is the mean of Y and Var(Y)
has order 1/n), we use

g(Y) ≈ g(µ) + (Y − µ) g′(µ) + ½ (Y − µ)² g″(µ) + · · · .

Taking expectations,

E[g(Y)] = g(µ) + O(1/n),
Var[g(Y)] = g′(µ)² Var(Y) + o(1/n).

More generally, if Y₁, . . . , Y_k are highly concentrated around their means, and approximately
multivariate normal with covariances Σ_{ij} (where Σ_{ii} = Var Y_i), we may write

Y_i ≈ µ_i + ∑_j A_{ij} Z_j,

where A is the symmetric square root of Σ (that is, A is symmetric and A² = Σ) and Z is a
vector of independent standard normal random variables. If X = g(Y₁, . . . , Y_k) is any function
that is well-behaved near µ, we may use the first-order Taylor expansion to write

g(Y) ≈ g(µ) + ∑_{i,j} A_{ij} (∂g/∂y_i)(µ) Z_j.

To this order,

E[g(Y₁, . . . , Y_k)] ≈ g(µ₁, . . . , µ_k),
Var[g(Y₁, . . . , Y_k)] ≈ ∑_{i,j} (∂g/∂y_i)(µ) (∂g/∂y_j)(µ) Σ_{ij}.   (4.6)

In the special case k = 1 this yields the familiar form

E[g(Y)] ≈ g(µ),
Var[g(Y)] ≈ g′(µ)² σ².   (4.7)

When k = 2 it becomes

E[g(Y₁, Y₂)] ≈ g(µ₁, µ₂),
Var[g(Y₁, Y₂)] ≈ (∂g/∂y₁(µ))² σ₁² + (∂g/∂y₂(µ))² σ₂² + 2 (∂g/∂y₁)(µ)(∂g/∂y₂)(µ) Cov(Y₁, Y₂).   (4.8)
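As a quick numerical sanity check of the one-dimensional form (4.7), one can compare the delta-method approximation with a simulated variance. The sketch below (with arbitrary illustrative values of n and p) does this for g(y) = log y applied to a binomial proportion, for which the approximation gives Var(log Y) ≈ (1 − p)/(np).

# Delta-method check: Y = X/n with X ~ Binomial(n, p), g(y) = log(y).
set.seed(1)
n <- 200; p <- 0.3
Y <- rbinom(1e5, n, p)/n
c(simulated = var(log(Y)), delta = (1 - p)/(n*p))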

4.7 AML study, continued


In Table 4.6 we show the computations for confidence intervals just for the Kaplan–Meier curve
of the maintenance group. The confidence intervals are based on the logarithm of survival, using
the variance estimator σ̂(t) for log Ŝ(t) from section 4.6.2 directly. That is, the bounds on the
confidence interval are

exp{ log Ŝ(t) ± z √( ∑_{tj≤t} dj / (nj(nj − dj)) ) },

where z is the appropriate quantile of the normal distribution. Note that the approximation
cannot be assumed to be very good in this case, since the number of individuals at risk is too
small for the asymptotics to be reliable. We show the confidence intervals in Figure 4.2.

Figure 4.2: Greenwood’s estimate of 95% confidence intervals for survival in maintenance group
of the AML study.

Important: The estimate of the variance is more generally reliable than
the assumption of normality, particularly for small numbers of events.
Thus, the first line in Table 4.6 indicates that the estimate of log Ŝ(9)
is associated with a variance of 0.009. The error in this estimate is on
the order of n^{−3}, so it’s potentially about 10%. On the other hand, the
number of events observed has a binomial distribution, with parameters
around (11, 0.909), so it’s very far from a normal distribution. We could
improve our confidence interval by using the Poisson confidence intervals
worked out in Problem Sheet 3, question 2, or a binomial confidence
interval. We will not go into the details in this course.

Table 4.6: Computations for Greenwood’s estimate of the standard error of the Kaplan-Meier
survival curve from the maintenance population in the AML data. “lower” and “upper” are
bounds for 95% confidence intervals, based on the log-normal distribution.

tj   nj   dj   dj/(nj(nj−dj))   Var(log Ŝ(tj))   lower   upper
 9   11    1        0.009            0.009        0.754   1.000
13   10    1        0.011            0.020        0.619   1.000
18    8    1        0.018            0.038        0.488   1.000
23    7    1        0.024            0.062        0.377   0.999
31    5    1        0.050            0.112        0.255   0.946
34    4    1        0.083            0.195        0.155   0.875
48    2    1        0.500            0.695        0.036   0.944
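The computations in Table 4.6 can be reproduced in a few lines of R. The sketch below simply re-implements the formulas above for the maintenance-group event times, rather than calling any particular package function.

# Sketch reproducing Table 4.6 for the AML maintenance group.
tj <- c(9, 13, 18, 23, 31, 34, 48)
nj <- c(11, 10, 8, 7, 5, 4, 2)
dj <- rep(1, 7)
S <- cumprod(1 - dj/nj)                    # Kaplan-Meier estimate
v <- cumsum(dj/(nj*(nj - dj)))             # estimated Var(log S-hat)
z <- qnorm(0.975)
round(data.frame(tj, nj, dj,
                 increment = dj/(nj*(nj - dj)),
                 var.logS  = v,
                 lower = exp(log(S) - z*sqrt(v)),
                 upper = pmin(1, exp(log(S) + z*sqrt(v)))), 3)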

4.8 Survival to ∞
Let T be a survival time, and define the conditional survival function

S₀(t) := P( T > t | T < ∞ );

that is, the probability of surviving past time t given that the event eventually does occur. We
have

S₀(t) = P{∞ > T > t} / P{∞ > T}.   (4.9)

How can we estimate S₀? Nelson–Aalen estimators will never reach ∞ (which would mean 0
survival); Kaplan–Meier estimators will reach 0 if and only if the last individual at risk actually
has an observed event. In either case, there is no mathematical principle for distinguishing
between the actual survival to ∞ — that is, the probability that the event never occurs — and
simply running out of data. Nonetheless, in many cases there can be good reasons for thinking
that there is a time t∂ such that the event will never happen if it hasn’t happened by that time.
In that case we may use the fact that {T < ∞} = {T < t∂} to estimate

Ŝ₀(t) = ( Ŝ(t) − Ŝ(t∂) ) / ( 1 − Ŝ(t∂) ).   (4.10)

In this case, assuming that S(t) is constant after t = t∂, we need to estimate the variance of
Ŝ₀(t). To compute a confidence interval for S₀(t) we apply the delta method again, in a slightly
more complicated form. Suppose we set Y₁ = Ŝ(t) and Y₂ = Ŝ(t∂)/Ŝ(t). The variance of Y₁ is
approximated just by Greenwood’s formula

σ₁² := Var(Ŝ(t)) ≈ σ²_G(t) = Ŝ(t)² ∑_{tj≤t} dj / (nj(nj − dj)),

and Y₂ is just the estimated survival from t to t∂, so Greenwood’s formula yields

σ₂² := Var( Ŝ(t∂)/Ŝ(t) ) ≈ ( Ŝ(t∂)/Ŝ(t) )² ∑_{t<tj≤t∂} dj / (nj(nj − dj)).

Since Y₁ and Y₂ depend on distinct survival events they are uncorrelated. We may apply the
delta method (4.8) to the two-variable function g(y₁, y₂) = y₁(1 − y₂)/(1 − y₁y₂), obtaining

σ₀²(t) := Var(Ŝ₀(t)) = Var(g(Y₁, Y₂))
  ≈ (∂g/∂y₁(Y₁, Y₂))² σ₁² + (∂g/∂y₂(Y₁, Y₂))² σ₂²
  ≈ [ (Ŝ(t) − Ŝ(t∂))² / (1 − Ŝ(t∂))⁴ ] [ ∑_{tj≤t} dj/(nj(nj − dj)) + (1 − Ŝ(t))² ∑_{t<tj≤t∂} dj/(nj(nj − dj)) ].   (4.11)

If there is no a priori reason to choose a value of t∂, we may estimate it from the data as
max{ti}, as long as there is a significant length of time before that with a significant number of
individuals under observation, during which an event could have been observed.
4.8.1 Example: Time to next birth
This is an example discussed repeatedly in [2]. It has the advantage of being a large data set,
where the asymptotic assumptions may be assumed to hold; it has the corresponding disadvantage
that we cannot write down the data or perform calculations by hand.
The data set at http://folk.uio.no/borgan/abg-2008/data/second_births.txt lists, for
53,558 women listed in Norway’s birth registry, the time (in days) from first to second birth.
(Obviously, many women do not have a second birth, and the observations for these women will
be treated as censored.)
In Figure 4.3(a) we show the Kaplan–Meier estimator computed and automatically plotted
by the survfit command. Figure 4.3(b) shows a crude estimate for the distribution of time-
to-second-birth for those women who actually had a second birth. We see that the last birth
time recorded in the registry was 3677, after which time none of the remaining 131 women had a
recorded second birth. Thus, the second curve is simply the same as the first curve, rescaled to
go between 1 and 0, rather than between 1 and 0.293 as the original curve does.
The code used to generate the plots is in Code 1.

Code 1 Code for Second Birth Figure

library('survival')
sb = read.table('second_births.dat', header=TRUE)
attach(sb)
sb.surv = Surv(time, status)

sb.fit1 = survfit(sb.surv ~ rep(1, 53558))

plot(sb.fit1, mark.time=FALSE, xlab='Time (days)',
     main='Norwegian birth registry time to second birth')

# Condition on last event. Directly changing survival object.
cle = function(SF, alpha=.05){
  z = qnorm(1 - alpha/2)
  minsurv = min(SF$surv)
  maxSE = max(SF$std.err)   ## SE at final event time
  newSE = ((SF$surv - minsurv)^2/(1 - minsurv)^4) * (SF$std.err*(2*SF$surv - SF$surv^2)
            + (1 - SF$surv)^2*minsurv)
  SF$std.err = newSE
  SF$surv = (SF$surv - minsurv)/(1 - minsurv)
  SF$upper = SF$surv + z*newSE
  SF$lower = SF$surv - z*newSE
  SF
}

sb.fit2 = cle(sb.fit1)

plot(sb.fit2, mark.time=FALSE, xlab='Time (days)',
     main='Time to second birth conditioned on occurrence')

(a) Original Kaplan–Meier curve
(b) Kaplan–Meier curve conditioned on second birth occurring

Figure 4.3: Time (in days) between first and second birth from Norwegian registry data.

4.9 Computing survival estimators in R (non-examinable)


The main package for doing survival analysis in R is survival. Once the package is installed
on your computer, you include library(survival) at the start of your R code. This works with
“survival objects”, which are created by the Surv command with the following syntax:
Surv(time, event) or Surv(time, time2, event, type)

4.9.1 Survival objects with only right-censoring


We begin by discussing the first version, which may be applied to right-censored (or uncensored)
survival data. The individual times (whether censoring or events) are entered as a vector time.
The vector event (of the same length) has values 0 or 1, depending on whether the time is a
censoring time or an event time respectively. These alternatives may also be labelled 1 and 2, or
FALSE and TRUE.
For an example, we turn our tiny simulated data set

21, 47, 47, 58+, 71, 71+, 125, 143+, 143+, 143+ (4.12)

into a survival object with

sim.surv = Surv(c(21, 47, 47, 58, 71, 71, 125, 143, 143, 143), c(1, 1, 1, 0, 1, 0, 1, 0, 0, 0)).

Fitting models is done with the survfit command. This is designed for comparing distribu-
tions, so we need to put in some sort of covariate. Then we can write
sim.fit=survfit(sim.surv~1,conf.int=.99)
and then plot(sim.fit), or
plot(sim.fit,main=‘Kaplan-Meier for simulated data set’,
xlab=‘Time’,ylab=‘Survival’)
to plot the Kaplan–Meier estimator of the survival function, as in Figure 4.4. The dashed lines
are the Greenwood estimator of a 99% confidence interval. (The default for conf.int is 0.95.)
The Nelson–Aalen estimator can also be computed with survfit. The associated survival
estimator S̃ = e^{−Ĥ} is called the Fleming–Harrington estimator, and it may be estimated with
fit=survfit(formula, type=’fleming-harrington’). The cumulative hazard — the negative log of
the survival estimator — may be plotted with plot(fit, fun=’cumhaz’).
If you want to compute it more directly, you can extract the information in the survfit
object. If you want to see what’s inside an R object, you can use the str command. The output
is shown in Code 2.
We can then compute the Nelson–Aalen estimator with a function such as the one in Code 3.
This is plotted together with the Kaplan–Meier estimator in Figure 4.5. As you can see, the
two estimators are similar, and the Nelson–Aalen survival estimate is always higher than the Kaplan–Meier.
4.9.2 Other survival objects
Left-censored data are represented with
Surv(time, event,type=’left’).
Here event can be 0/1 or 1/2 or TRUE/FALSE for alive/dead, i.e., censored/not censored.

Left-truncation is represented with


Surv(time,time2, event).
event is as before. With a time2 argument present, the type defaults to ’counting’ (the (start, stop] counting-process form).
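As a small illustration, with entirely made-up entry and exit ages, a left-truncated, right-censored sample might be set up as follows:

# Hypothetical left-truncated data: subjects enter observation at 'entry'
# and are followed until an event or censoring at 'exit'.
entry  <- c(60, 62, 65, 70)
exit   <- c(71, 68, 80, 75)
status <- c(1, 0, 1, 1)
lt.surv <- Surv(entry, exit, status)
survfit(lt.surv ~ 1)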

Code 2 Example of structure of a survfit object.


> str(sim.fit)
List of 13
 $ n        : int 10
 $ time     : num [1:6] 21 47 58 71 125 143
 $ n.risk   : num [1:6] 10 9 7 6 4 3
 $ n.event  : num [1:6] 1 2 0 1 1 0
 $ n.censor : num [1:6] 0 0 1 1 0 3
 $ surv     : num [1:6] 0.9 0.7 0.7 0.583 0.438 ...
 $ type     : chr "right"
 $ std.err  : num [1:6] 0.105 0.207 0.207 0.276 0.399 ...
 $ upper    : num [1:6] 1 1 1 1 0.957 ...
 $ lower    : num [1:6] 0.732 0.467 0.467 0.34 0.2 ...
 $ conf.type: chr "log"
 $ conf.int : num 0.95
 $ call     : language survfit(formula = sim.surv ~ type, conf.int = 0.95)
 - attr(*, "class")= chr "survfit"

Code 3 Function to compute Nelson–Aalen estimator.

# SF is output of survfit
NAest = function(SF){
  times = SF$time[SF$n.event > 0]
  events = SF$n.event[SF$n.event > 0]
  nrisk = SF$n.risk[SF$n.event > 0]
  increment = sapply(seq(length(nrisk)), function(i)
    sum(1/seq(nrisk[i], nrisk[i] - events[i] + 1)))
  varianceincrement = sapply(seq(length(nrisk)), function(i)
    sum(1/seq(nrisk[i], nrisk[i] - events[i] + 1)^2))
  hazard = cumsum(increment)
  variance = cumsum(varianceincrement)
  list(time = times, Hazard = hazard, Var = variance)
}


Figure 4.4: Plot of Kaplan–Meier estimates from data in (4.12). Dashed lines are the 99% confidence
interval from Greenwood’s estimate.

Interval censoring also takes time and time2, with type=’interval’. In this case, the event
can be 0 (right-censored), 1 (event at time), 2 (left-censored), or 3 (interval-censored).


Figure 4.5: Plot of Kaplan–Meier (black) and Nelson–Aalen (red) estimates from data in (4.12).
Dashed lines are pointwise 95% confidence intervals.
Chapter 5

Regression in a survival context

5.1 Introduction to regression models


We learned in section 2.16.2 how to compare observed mortality to a standard life table. In many
settings, though, we are interested to compare observed mortality (or more general event times)
between groups, or between individuals with different values of a quantitative covariate, and in
the presence of censoring. For example, in a clinical trial we wish to compare survival times
between patients who receive two different treatments. A nutrition study might compare the
rate of stroke occurrence among different levels of fat consumption. A demographer may wish to
compare the length of marriages among different levels of education or socioeconomic class.
When considering the effect of a single binary covariate we may produce estimated survival
curves for the two subpopulations and compare them directly. In chapter 6 we describe methods
for testing two populations for equality of survival distribution. When considering the effect
of a quantitative, often continuous, variable on survival this approach is foreclosed, since the
amount of data at any single value of the covariate is so small (often just a single individual).
We need a way to combine the effect of multiple individuals into a
single estimate, and that means we need to make fairly strong modelling assumptions about how
the covariates affect the hazard. The approach of linking covariates to an outcome measure (in
this case, hazards) through strong functional assumptions about the influence is called regression.
Even when we do have a categorical covariate, it may be useful to apply a regression model that
summarises the relationship between covariate and hazard in a more meaningful and compact
way.
There are three general approaches to regression modelling:

Parametric Any parametric model may be turned into a regression model by imposing a functional
assumption that links the covariates to one or more model parameters. As we will discuss
in section 5.4, there are traditional choices of parameters to modify in this way that provide
natural interpretations in a survival context.

Semiparametric In a fully parametric model the parameters defining the survival probabilities are not
separated from the parameters that describe the effect of the covariates. Assumptions
that we impose on the hazard rates may bias the estimate of the parameters measuring
covariate effects. Semiparametric approaches split the estimation of a baseline hazard rate
— which is done nonparametrically — from the estimation of a small number of parameters
that describe how individuals with particular parameter values differ from that baseline.
By far the most popular semiparametric model is the Cox proportional hazards regression
model, described in section 5.5.


Nonparametric A useful alternative approach is the Aalen additive-hazards regression model, described in
section 5.12, which estimates both cumulative hazards and cumulative effects of covariates
in a unified nonparametric way.

5.2 How survival functions vary


It is natural, in a survival context, to define the dependence of survival distributions on covariates
in terms of the influence on cumulative hazards. There are three popular general approaches:

• Additive hazards: It seems natural when comparing two groups to measure the difference
in survival in terms of the cumulative difference in hazard — which is to say, in expected
number of events — and to suppose that each increment in the covariate might produce a
fixed increment in cumulative hazard. In an epidemiological context this means that the
output of the model is the expected number of events caused or prevented by a change
in treatment or risk factors. One disadvantage to this approach — described in detail in
section 5.12 — is that there is the potential for estimated hazards to become negative in some
parameter regimes, which is nonsensical.

• Proportional hazards: Another natural approach is to suppose that the effect of a


treatment or a change in risk factor is proportional to the risk. If we write h0(t) for the
baseline hazard at time t, then we are supposing that the hazard for individual i at time t is

hi(t) = ρi h0(t),

where ρi is a function of the covariates, which may themselves be changing over time.
Equivalently, we have Hi(t) = ρi H0(t) for the cumulative hazard, and

Si(t) = S0(t)^{ρi}.
This sort of model is also called relative-risk regression.

• Accelerated lifetimes: In this approach we say that there is a standard survival function
S0 (t), which applies to everyone, but different individuals run through the function at
different rates. So individual i with acceleration parameter ρi will have survival function

Si (t) = S0 (ρi t).

Equivalently, we have Hi (t) = H0 (ρi t) for the cumulative hazard, or hi (t) = ρi h0 (ρi t)
for the hazard. AL models will not be considered in this course except in the context of
parametric models.

5.2.1 Graphical tests


Suppose we have a categorical covariate — in other words, distinct subpopulations to compare —
and we want to decide whether some kind of accelerated lifetime or proportional hazards model is
worth considering. Starting from the estimated survival functions or cumulative hazard functions
for the subpopulations there are plots that we can make to investigate this.
Suppose the distinct subpopulations differ by an acceleration parameter. If we could plot
Sg against log t, for groups g = 1, 2, . . . , k, then we would see the distinct curves differing by
horizontal shifts, as
Sg(t) = S0( e^{log ρg + log t} ).

Similarly for Hg(t). Thus, the plot of Ŝg(t) or Ĥg(t) against log t may be used as a diagnostic for AL models,
where we accept the AL assumption when we see an approximate agreement between the curves
when shifted horizontally.
To interrogate the PH assumption we plot the log cumulative hazard estimate (or plot the
cumulative hazard on a log scale). If distinct groups differ by a proportionality constant, then

log Hg(t) = log ρg + log H0(t),

so if we plot log Ĥg(t) against either t or log t (where g is, again, a group of individuals) we
expect to see a vertical shift between groups. Note that log Ĥ = log(−log Ŝ) (or log(−log S̃)), as
a consequence of which this plot is known as the complementary log-log plot.
Taking both models together it is clear that we could plot

log( −log Ŝg(t) )   against   log t,

as then we can check for AL and PH in one plot. Generally Ŝg will be calculated as the
Kaplan–Meier estimator for group g.

• If the accelerated life model is plausible we expect to see a horizontal shift between groups.

• If the proportional hazards model is plausible we expect to see a vertical shift between
groups.

Of course, if the data came from a Weibull distribution, with differences in the ρ parameter,
it is simultaneously AL and PH. We see that

log( −log Sg(t) ) = log ρg + α log t.

Thus, survival curve estimates for different groups should appear approximately as parallel lines,
which of course may be viewed as vertical or as horizontal shifts of one another.
In section 5.4 we illustrate this with two simulations of populations with Gompertz mortality.
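These diagnostic plots can be produced directly from a survfit object; a minimal sketch for a two-group comparison, using the aml data from the survival package purely as an example, is the following. The fun='cloglog' option plots log(−log S) against log t.

# Complementary log-log plot, one curve per group.
fit <- survfit(Surv(time, status) ~ x, data = aml)
plot(fit, fun = "cloglog", col = 1:2,
     xlab = "Time (log scale)", ylab = "log(-log(Survival))")
# Roughly constant vertical separation suggests proportional hazards;
# roughly constant horizontal separation suggests an accelerated-life difference.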

5.3 Generalised linear survival models


Any parametric model may be turned into an AL model by replacing t by ρi t. And it may be
turned into a PH model by replacing the hazard h(t) by ρi h(t).
There remains then the question of how to link ρi to the explanatory factors, such as age,
smoking status, blood pressure and so on, that have been observed for the individual. We need
to incorporate these into a model using some sort of generalised regression, making ρ a function
of the explanatory variables. The generalised linear model approach says that a certain
function (the link function) of ρi is determined as a linear combination of the covariates, ρi = ρ(β · xi), where β · x
is called the linear predictor or the linear risk score. The unknown vector β of parameters is the
main goal to estimate. The most common link function is the logarithm, producing

log ρi = β · xi,   equivalently   ρi = e^{β·xi}.

The shape parameter α is assumed to be the same for each observation in the study.
As an example, the row of data for an individual will include

• response

– event time ti ;

– status δi (=1 if failure, =0 if censored);


– possibly a left-truncation time.
• covariates, often a mix of quantitative and categorical variables, typically represented as
a vector xi (or xi(t) if the covariates are changing over time), such as
– age at entry (when age is not the time variable in the analysis);
– sex;
– systolic blood pressure;
– treatment group.
Suppose we think the Weibull distribution is generally a good fit. Then
Si(t) = e^{−(ρt)^α}   and   ρ = e^{β·x},   with
β · x = β0 + β1 age_i + β2 sex_i + β3 sbp_i + β4 trt_i,
where β0 is the intercept and all regression coefficients βj are to be estimated, as well as estimating
α. Note this model assumes that α is the same for each subject. We have not shown interaction
terms such as xage ∗ xtrt , but these could be added as well. This would allow a different effect of
age according to treatment group.
Suppose subject i has covariate vector xi, and so scale parameter ρi = e^{β·xi}. This gives a likelihood

L(α, β) = ∏_i [ α ρi^α ti^{α−1} ]^{δi} e^{−(ρi ti)^α}
        = ∏_i [ α e^{αβ·xi} ti^{α−1} ]^{δi} e^{−(e^{β·xi} ti)^α}.

We can now compute MLEs for α and all components of the vector β using numerical optimisation,
giving estimators α̂, β̂ together with their standard errors (√Var(α̂), √Var(β̂j)), estimated from the
observed information matrix. Of course, the same could have been done for another parametric
model instead of the Weibull.
In general, if we have a parametric model with cumulative hazard Hα(t) (where α represents
the parameters of the model that do not vary between individuals), and hα(t) = Hα′(t) is the
hazard function, we have the likelihood

L(α, β) = ∏_i [ e^{β·xi} hα(e^{β·xi} ti) ]^{δi} e^{−Hα(e^{β·xi} ti)}.

If observations are left-truncated, say at a time si, the survival term includes only the cumulative
hazard from si to ti:

L(α, β) = ∏_i [ e^{β·xi} hα(e^{β·xi} ti) ]^{δi} e^{−Hα(e^{β·xi} ti) + Hα(e^{β·xi} si)}.

For example, if we are fitting a Weibull model, the shape parameter will be α. Recall that
when fitting a Weibull we can test for α = 1 — the null hypothesis that the data actually might
have come from the simpler exponential distribution — using

2 log L̂_weib − 2 log L̂_exp ∼ χ²(1),   asymptotically.
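In R, parametric accelerated-life regression models of this kind are usually fitted with survreg. Note that survreg uses a location–scale (log-time) parameterisation, so its coefficients are not the (α, β) hazard-scale parameters above. A minimal sketch, using the lung data from the survival package purely as an illustration:

# Weibull and exponential AL fits, plus the likelihood-ratio test of the
# exponential null (alpha = 1) against the Weibull alternative.
library(survival)
weib.fit <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "weibull")
exp.fit  <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "exponential")
LR <- 2*(logLik(weib.fit) - logLik(exp.fit))
pchisq(as.numeric(LR), df = 1, lower.tail = FALSE)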

5.4 Simulation example


The Gompertz hazard fits naturally into either the AL or the PH framework. If individual i has
hazard rate Bi eθx at age x, where θ is always the same, this is a PH model. If the hazard rate is
Beθi x , this is an AL model. We can use this to illustrate the graphical approach to identifying
AL or PH variation between different groups.
In Figure 5.1 we show the results of two different simulations where we have simulated 1000
survival times for each of two groups, one of which has Gompertz mortality rates approximating
those of UK women, and the other somewhat higher rates, which we think of as “men”. All are
conditioned on survival to age 20 (since human survival ages before maturity are very different
from Gompertz). In each case we plot the log estimated cumulative hazard of each group against
age on a log scale, and draw the horizontal and vertical differences at various levels.
In Figure 5.1(a) the (B, θ) parameters are (2 × 10−5 , 0.11) and (4 × 10−5 , 0.11). We see,
as expected, very nearly constant vertical differences, but large variation in the horizontal gap
between the curves. Figure 5.1(b), on the other hand, shows the AL version, with parameters
(1 × 10−5 , 0.11) and (1 × 10−5 , 0.14). The horizontal gaps are now similar, while the vertical
differences change.

(a) PH Gompertz difference
(b) AL Gompertz difference

Figure 5.1: Simulations to show effect of PH and AL differences in Gompertz populations.



5.5 The relative-risk regression model


By far the most popular survival model is the proportional hazards semiparametric model
with log link function, generally known as Cox regression, after David Cox, who introduced
the model in 1972. (Cox’s paper [6] was listed in a recent survey by Nature as the 24th most
frequently cited paper of all time, over all sciences.)
The most general relative-risk regression model represents the hazard rate for individual i as

hi(t) = h(t | xi) = h0(t) r(β, xi(t); t).   (5.1)

Here xi is a vector of (possibly time-varying) covariates belonging to individual i, and β is a
vector of parameters. Since this model has a nonparametric piece h0 and a parametric piece β,
it is called semiparametric. In this lecture we will generally be assuming that r is a function
only of β and x (no direct dependence on t), and in that case we will drop t from the notation,
writing r(β, x) or r(β, x(t)).
This model is called relative-risk or proportional hazards because there is an unchanging ratio
of the hazard rate (or risk of the event) between individuals with parameter values xi and xj .
(Of course, if the covariates themselves are time-varying, this looks more complicated when we
look at individual hazard rates over time.)
Different choices for the relative risk function r are possible. We will focus mainly on the choice

r(β, x) = e^{β^T x} = e^{∑_j βj xj},   (5.2)
which assigns a constant proportional change to the hazard rate to each unit change in the
covariate. This regression model with this risk function, which is by far the most commonly used
survival model in medical applications, is called the Cox proportional hazards regression model.
One might naively suppose that the choice of model should be guided by a belief about
what is “really true” about the survival process. But, as George Box famously said, “All models
are wrong, but some are useful.” A wrong model will still summarise the data in a potentially
non-misleading way. (On the other hand, the model can’t be too wrong. We will discuss model
diagnostics later on.)
For example, in medical statistics the excess relative risk of a category is defined as

(observed rate − expected rate) / expected rate.
If we have a single parameter that is supposed to summarise the excess relative risk created per
unit of covariate, we would take the risk function

r(β, x) = 1 + βx.

In the multidimensional-covariate setting we can generalise this to the excess relative risk model
(taking p to be the dimension of the covariate)

r(β, x) = ∏_{j=1}^{p} (1 + βj xj).   (5.3)

This allows each covariate to contribute its own excess relative risk, independent of the others.
Alternatively, we can define the linear relative risk function

r(β, x) = 1 + ∑_{j=1}^{p} βj xj.   (5.4)

A note about coding of covariates: In order for the baseline survival to make sense,
the point where all the covariates are 0 must correspond to a plausible vector of co-
variates for an individual who could be in the sample. Thus, quantitative covariates
should be centred, either at the mean, the median, or some approximately central
value. Note that this normalisation cancels out between numerator and denomina-
tor of the partial likelihood in Cox regression, but it becomes especially important
when we add interaction effects, typically introduced as products of covariates.

5.6 Partial likelihood


A precise mathematical treatment of this material may be found in section 4.1.1 of [2].
We represent the data in the following form:

1. A list of event times t1 < t2 < · · · . (We are assuming no ties, for the moment.)

2. The identity ij of the individual whose event is at time tj .

3. The values of all individuals’ covariates (at times tj , if they are varying).

4. The risk sets Rj = {i : Ti ≥ tj }, the set of individuals who are at risk at time tj .

The most common way to fit a relative-risk model is to split the likelihood into two pieces:
The likelihood of the event times, and the conditional likelihood of the choice of subjects given
the event times. The first piece is assumed to contain relatively little information about the
parameters, and its dependence on the parameters is quite complicated.
We use the second piece to estimate β. Conditioned on some event happening at time t, the
probability that it is individual i is

πi(t) := hi(t)/h(t) = r(β, xi(t); t) 1{i∈R(t)} / ∑_{j∈R(t)} r(β, xj(t); t),

where R(t) is the risk set, the set of individuals at risk at time t. We have the partial likelihood

LP(β) = ∏_{tj} π_{ij}(tj) = ∏_{tj} [ r(β, x_{ij}(tj); tj) / ∑_{l∈Rj} r(β, xl(tj); tj) ].   (5.5)

The partial likelihood is useful because it involves only the parameters β, isolating them from
the nonparametric (and often less interesting) h0. The maximiser of the partial likelihood has
the same essential properties as the MLE.

Theorem 5.1. Let β̂ maximise LP, as given in (5.5). Then β̂ is a consistent estimator of
the true parameter β0, and √n(β̂ − β0) converges to a multivariate normal distribution with
mean 0 and covariance matrix consistently approximated by J(β̂)^{−1}, where J(β̂) is the observed
information matrix, with (i, j) component given by

−∂² log LP(β) / ∂βi ∂βj.
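To make the definition concrete, the sketch below computes the log partial likelihood directly for a single covariate with no tied event times, and maximises it numerically. The simulated data are purely illustrative, and the maximiser should be close to the coefficient returned by coxph.

# Log partial likelihood for one covariate, assuming no tied event times.
coxlogpl <- function(beta, time, delta, x) {
  ll <- 0
  for (j in which(delta == 1)) {
    atrisk <- time >= time[j]                  # risk set at the j-th event time
    ll <- ll + beta*x[j] - log(sum(exp(beta*x[atrisk])))
  }
  ll
}
# Illustrative simulated data (hypothetical):
set.seed(2)
n <- 50
x <- rnorm(n)
T <- rexp(n, rate = exp(0.7*x))                # true log relative risk 0.7
C <- rexp(n, rate = 0.3)
time <- pmin(T, C); delta <- as.numeric(T < C)
optimize(coxlogpl, c(-3, 3), time = time, delta = delta, x = x, maximum = TRUE)$maximum
# compare with coef(coxph(Surv(time, delta) ~ x))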

5.7 Significance testing


As with the MLE, when β is p-dimensional there are three (asymptotically equivalent) conventional
test statistics used to test the null hypothesis β = β0 :

• Wald statistic: ξ²_W := (β̂ − β0)^T J(β0) (β̂ − β0);

• Score statistic: ξ²_SC := U(β0)^T J(β0)^{−1} U(β0);

• Likelihood ratio statistic: ξ²_LR := 2( ℓP(β̂) − ℓP(β0) ), where ℓP := log LP.

Under the null hypothesis these are all asymptotically chi-squared distributed with p degrees of
freedom. Here J(β) is the observed Fisher partial information matrix. There is a computable
estimate for this, which is fairly straightforward, but notationally slightly tricky in the general
(multivariate) case, so we do not include it here. (See equations (4.46) and (4.48) of [2] if you are
interested.) As usual, we can approximate the expected information by the observed information.
Here U (β) is the vector of score functions ∂`P /∂βj .

5.8 Estimating baseline hazard


In relative-risk regression we are usually primarily interested in estimating the coefficients β —
the difference between groups. Thus, if we are performing a clinical trial, we are interested to
know whether the treatment group had better survival than the control. How long the groups
actually survived in an absolute sense is irrelevant to the experimental outcome.
But that doesn’t mean that the hazard rates are merely a nuisance. And even when survival
rates are the primary concern, if hazard is affected by a known covariate there is no way to
estimate the appropriate survival function for individuals without using a regression model to
pool the data over different values of the covariate. Even when the covariate has only a small
number of categories, so that pooling within categories is possible, it may still be worth using a
regression model — as long as it is plausibly accurate — which allows us effectively to pool more
individuals together to estimate a single survival curve. Thus variance is reduced, at the cost of
potentially introducing a bias from modelling error (because the regression assumptions won’t
hold exactly).
5.8.1 Breslow’s estimator
Suppose the baseline survival is given by

Ŝ0(t) = e^{−Ĥ0(t)},

where the discrete hazard estimate ĥ0 is given by

ĥ0(tj) = 1 / ∑_{i∈Rj} r(β, xi(tj)).

Breslow’s estimator for the cumulative baseline hazard is very similar to the Nelson–Aalen
estimator. We start from the constraint that our MLE for the cumulative hazard must confine
the points of increase (which are the times when events might be observed) to the times when

events actually were observed. Conditioning on that constraint¹ we want the baseline hazard
estimator to have the form

Ĥ0(t) = ∑_{tj≤t} ĥ0(tj),

where the problem is to determine the discrete hazard estimates ĥ0(tj):

ĥ0(tj) = 1 / ∑_{i∈Rj} r(β, xi(tj)).   (5.6)

For the special case of Cox regression,

ĥ0(tj) = 1 / ∑_{i∈Rj} e^{β·xi(tj)}.   (5.7)

In some sense the discrete estimates for ĥ0(tj) can be thought of as the maximum likelihood
estimators from the full likelihood, provided we assume that the hazard distribution is discrete
(which of course it generally is not). When β̂ = 0 or when the covariates are all 0, this
reduces simply to the Nelson–Aalen estimator. Otherwise, we see that this is equivalent to a
modified Nelson–Aalen estimator, where the size of the risk set is weighted by the relative risks
of the individuals. In other words, the estimate of ĥ0 is equivalent to the standard estimate
# events/time at risk, but now time at risk is weighted by the relative risk.
The estimator may be loosely derived as follows:

ℓ(h) = ∑_{tj} [ log(1 − e^{−h_{ij}(tj)}) − ∑_{k∈Rj, k≠ij} hk(tj) ]
     = ∑_{tj} [ log(1 − e^{−r̂_{ij} h0(tj)}) − ∑_{i∈Rj, i≠ij} r̂i h0(tj) ].

We estimate h0(tj) by setting the derivative with respect to h0(tj) to zero:

0 = r̂_{ij} e^{−r̂_{ij} ĥ0(tj)} / ( 1 − e^{−r̂_{ij} ĥ0(tj)} ) − ∑_{k∈Rj, k≠ij} r̂k
  ≈ r̂_{ij}( 1 − r̂_{ij} ĥ0(tj) ) / ( r̂_{ij} ĥ0(tj) ) − ∑_{k∈Rj, k≠ij} r̂k
  = ( 1 − r̂_{ij} ĥ0(tj) ) / ĥ0(tj) − ∑_{k∈Rj, k≠ij} r̂k.

1
A likelihood constrained to have certain features of the “parameters” (which, in this case, are an entire hazard
function) already optimised, so that the remaining parameters can be computed, is called a profile likelihood.

(In the second line we have assumed h0(tj) to be small.) Thus

1 ≈ ĥ0(tj) ∑_{i∈Rj} r̂i,

which is the same as (5.7).
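In R, the estimated cumulative baseline hazard of a fitted Cox model can be extracted with basehaz; with centered = FALSE it is reported at covariate value zero, matching the convention above. A minimal sketch using the aml data:

# Cumulative baseline hazard from a fitted Cox model.
cfit <- coxph(Surv(time, status) ~ x, data = aml)
H0 <- basehaz(cfit, centered = FALSE)   # data frame with columns 'hazard' and 'time'
head(H0)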


5.8.2 Individual risk ratios
A benefit of relative risk models is that they give a straightforward interpretation to the question of
individual risk. In the case where covariates are constant in time, an individual with covariates x
has a relative risk of r(β, x) compared with individuals with baseline covariates (with r(β, x) = 1;
typically corresponding to x = 0). Thus, their cumulative hazard to age t is approximated by

Ĥ(t | x) = r(β̂, x) Ĥ0(t).   (5.8)

In case of time-varying covariates we have an individual cumulative hazard

H(t | x) = ∫_0^t r(β, x(u)) h0(u) du,

which we approximate by

Ĥ(t | x) = ∑_{tj≤t} r(β̂, x(tj)) / ∑_{i∈Rj} r(β̂, xi(tj)).   (5.9)

In the special case of Cox regression we have

Ĥ(t | x) = ∑_{tj≤t} e^{β̂·x(tj)} / ∑_{i∈Rj} e^{β̂·xi(tj)}.   (5.10)
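For time-constant covariates, (5.8) is easy to apply once the baseline cumulative hazard has been extracted; in the sketch below, xnew is a hypothetical covariate value for a model with a single scalar covariate.

# Predicted cumulative hazard for a hypothetical individual, as in (5.8).
H0 <- basehaz(cfit, centered = FALSE)
xnew <- 1                                  # hypothetical covariate value
H.new <- exp(coef(cfit)*xnew)*H0$hazard    # H-hat(t | x) = r(beta-hat, x) * H0-hat(t)
S.new <- exp(-H.new)                       # corresponding survival estimate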

5.9 Dealing with ties


We discuss here the methods of dealing with tied event times for the Cox model.
Until now in this section we have been assuming that the times of events are all distinct.
In situations where event times are equal, we can carry out the same computations for Cox
regression, only using a modified version of the partial likelihood. Suppose Rj is the set of
individuals at risk at time tj , and Dj the set of individuals who have their event at that time.
We assume that the ties are not real ties, but only the result of discreteness in the observation.
Then the probability of having precisely those individuals at time tj will depend on the order in
which they actually occurred. For example, suppose there are 5 individuals at risk at the start,
and two of them have their events at time t1 . If the relative risks were {r1 , . . . , r5 }, then the
first term in the partial likelihood would be
[ r1/(r1 + r2 + r3 + r4 + r5) ] · [ r2/(r2 + r3 + r4 + r5) ] + [ r2/(r1 + r2 + r3 + r4 + r5) ] · [ r1/(r1 + r3 + r4 + r5) ].
The number of terms is dj !, so it is easy to see that this computation quickly becomes intractable.
A very good alternative — accurate and easy to compute — was proposed by B. Efron.
Observe that the terms differ in the denominator merely by a small change due to the individuals
lost from the risk set. If the deaths at time tj are not a large proportion of the risk set, then we

can approximate this by deducting the average of the risks that depart. In other words, in the
above example, the first contribution to the partial likelihood becomes
r1 r2 / [ (r1 + r2 + r3 + r4 + r5)( ½(r1 + r2) + r3 + r4 + r5 ) ].

More generally, the log partial likelihood becomes

ℓP(β) = ∑_{tj} { ∑_{i∈Dj} log r(β, xi(tj)) − ∑_{k=0}^{dj−1} log[ ∑_{i∈Rj} r(β, xi(tj)) − (k/dj) ∑_{i∈Dj} r(β, xi(tj)) ] }.

We take the same approach to estimating the baseline cumulative hazard:

Ĥ0(t) = ∑_{tj≤t} ∑_{k=0}^{dj−1} [ ∑_{i∈Rj} r(β, xi(tj)) − (k/dj) ∑_{i∈Dj} r(β, xi(tj)) ]^{−1}.

An alternative approach, due to Breslow, makes no correction for the progressive loss of risk
in the denominator:

ℓP^{Breslow}(β) = ∑_{tj} { ∑_{i∈Dj} log r(β, xi(tj)) − dj log ∑_{i∈Rj} r(β, xi(tj)) }.

This approximation is always too small, and tends to shift the estimates of β toward 0. It is
widely used as a default in software packages (SAS, not R!) for purely historical reasons.
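In coxph the tie-handling method is controlled by the ties argument, with Efron's approximation as the default; a quick comparison on the aml data:

# Efron (default) versus Breslow handling of tied event times.
fit.efron   <- coxph(Surv(time, status) ~ x, data = aml)
fit.breslow <- coxph(Surv(time, status) ~ x, data = aml, ties = "breslow")
c(efron = coef(fit.efron), breslow = coef(fit.breslow))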

5.10 The AML example


We continue looking at the leukemia study that we started to consider in section 4.4.5. First, in
Figure 5.2 we plot the iterated logarithm of survival against time, to test the proportional hazards
assumption. The PH assumption corresponds to the two curves differing by a vertical shift. The
result makes this assumption at least credible. (This method, and others, for examining the PH
assumption are discussed in detail in chapter 7.)
We code the data with covariate x = 0 for the maintained group, and x = 1 for the non-
maintained group. Thus, the baseline hazard will correspond to the maintained group, and eβ
will be the relative risk of the non-maintained group. From Table 4.3 we see that the partial
likelihood is given by
LP(β) = ( e^{2β} / [(12e^β + 11)(11e^β + 11)] ) ( e^{2β} / [(10e^β + 11)(9e^β + 11)] )
      × ( 1/(8e^β + 11) ) ( e^β/(8e^β + 10) ) ( 1/(7e^β + 10) ) ( 1/(6e^β + 8) )
      × ( e^β · 1 / [(6e^β + 7)(5.5e^β + 6.5)] ) ( e^β/(5e^β + 6) ) ( e^β/(4e^β + 5) )
      × ( 1/(3e^β + 5) ) ( e^β/(3e^β + 4) ) ( 1/(2e^β + 4) ) ( e^β/(2e^β + 3) ) ( e^β/(e^β + 3) ) ( 1/2 ).   (5.11)

A plot of LP (β) is shown in Figure 5.3.




Figure 5.2: Iterated log plot of survival of two populations in AML study, to test proportional
hazards assumption.

Figure 5.3: A plot of the partial likelihood from (5.11). Dashed line is at β = 0.9155.

In the one-dimensional setting it is straightforward to estimate β̂ by direct computation. We


see the maximum at β̂ = 0.9155 in the plot of Figure 5.3. In more complicated settings, there
are good maximisation algorithms built in to the coxph function in the survival package of R.
Applying this to the current problem, we obtain:

Table 5.1: Output of the coxph function run on the aml data set.

coxph(formula = Surv(time, status) ~ x, data = aml)

                 coef  exp(coef)  se(coef)    z      p
xNonmaintained  0.916     2.5       0.512   1.79   0.074

Likelihood ratio test=3.38 on 1 df, p=0.0658, n=23

The z is simply the Z-statistic for testing the hypothesis that β = 0, so z = β̂/SE(β̂). We
see that z = 1.79 corresponds to a p-value of 0.074, so we would not reject the null hypothesis at
level 0.05.
We show the estimated baseline hazard in Figure 5.4; the relevant numbers are given in Table
5.2. For example, the first hazard, corresponding to t1 = 5, is given by
ĥ0(5) = 1/(12e^{β̂} + 11) + 1/(11e^{β̂} + 11) = 0.050,

substituting in β̂ = 0.9155.

Figure 5.4: Estimated baseline hazard under the PH assumption. The purple circles show the
baseline hazard; blue crosses show the baseline hazard shifted up proportionally by a multiple
of eβ̂ = 2.5. The dashed green line shows the estimated survival rate for the mixed population
(mixing the two estimates by their proportions in the initial population).

Table 5.2: Computations for the baseline hazard MLE for the AML data, in the proportional
hazards model, with maintained group as baseline, and relative risk e^{β̂} = 2.498. We write Y^M
and Y^N for the number at risk in the maintenance and non-maintenance groups respectively.

 tj   Y^M(tj)   d^M_j   Y^N(tj)   d^N_j   ĥ0(tj)   Â0(tj)   S̃0(tj)
5 11 0 12 2 0.050 0.050 0.951
8 11 0 10 2 0.058 0.108 0.898
9 11 1 8 0 0.032 0.140 0.869
12 10 0 8 1 0.033 0.174 0.841
13 10 1 7 0 0.036 0.210 0.811
18 8 1 6 0 0.043 0.254 0.776
23 7 1 6 1 0.095 0.348 0.706
27 6 0 5 1 0.054 0.403 0.669
30 5 0 4 1 0.067 0.469 0.625
31 5 1 3 0 0.080 0.549 0.577
33 4 0 3 1 0.087 0.636 0.529
34 4 1 2 0 0.111 0.747 0.474
43 3 0 2 1 0.125 0.872 0.418
45 3 0 1 1 0.182 1.054 0.348
48 2 1 0 0 0.500 1.554 0.211

Figure 5.5: Comparing the estimated population survival under the PH assumption (green
dashed line) with the estimated survival for the combined population (blue dashed line), found
by applying the Nelson–Aalen estimator to the population, ignoring the covariate.

5.11 The Cox model in R


The survival package includes a function coxph that computes Cox regression models. To
illustrate this, we simulate 100 individuals with hazard rate t e^{xi}, where the xi are normal with mean
0 and variance 0.25. We also right-censor observations at constant rate 0.5. The simulations
may be carried out with the following commands:
require(’survival’)

n=100
censrate=0.5
covmean=0
covsd=0.5
beta=1

x=rnorm(n,covmean,covsd)
T=sqrt(rexp(n)*2*exp(-beta*x))

# Censoring times
C=rexp(n,censrate)
t=pmin(C,T)
delta=1*(T<C)
Then the Cox model may be fit with the command
cfit=coxph(Surv(t,delta)~x)
> summary(cfit)
Call:
coxph(formula = Surv(t, delta) ~ x)

n= 100, number of events= 50

coef exp(coef) se(coef) z Pr(>|z|)


x 0.8998 2.4590 0.3172 2.836 0.00457 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

exp(coef) exp(-coef) lower .95 upper .95


x 2.459 0.4067 1.32 4.579

Concordance= 0.601 (se = 0.049 )


Rsquare= 0.079 (max possible= 0.966 )
Likelihood ratio test= 8.19 on 1 df, p=0.004203
Wald test = 8.04 on 1 df, p=0.004565
Score (logrank) test = 8.18 on 1 df, p=0.004225
If we give a coxph object to survfit it will automatically compute the hazard estimate,
using the Efron method as default. We need to give it a data frame of “newdata”, with one
column for each covariate. It will output one survival curve for each row of the data frame. In
particular, inputting the new data x = 0 we get the baseline hazard. If we plot this object it
comes by default with a 95% confidence interval. We show the plot in Figure 5.6.

plot(survfit(cfit,data.frame(x=0)),main=’Cox example’)
tt=(0:300)/100
lines(tt,exp(-tt^2/2),col=2)
legend(.1,.2,c(’baseline survival estimate’,’true baseline’),col=1:2,lwd=2)


Figure 5.6: Survival estimated from 100 individuals simulated from the Cox proportional hazards
model. True baseline survival e^{−t²/2} is plotted in red.

Suppose now we have a categorical variable — for example, three different treatment groups,
labelled 0,1,2 — with relative risk 1,2,3, let us say. If we were to use the command
cfit=coxph(Surv(t,delta)~x)
we would get the wrong result:
coef exp(coef) se(coef) z p
x 0.335 1.4 0.0865 3.88 0.00011

Likelihood ratio test=15 on 1 df, p=0.000107 n= 300, number of events= 191


The problem is that this assumes the effects are monotonic: Group 1 has relative risk eβ and
Group 2 has relative risk e2β , which is simply wrong.
The survival for the three different groups may be estimated correctly by defining
new covariates x1=(x==1) and x2=(x==2) — that is, indicator variables for x being 1
or 2 respectively — and we would then use the command
cfit2=coxph(Surv(t,delta)~x1+x2).
Even easier is to use the factor command, which tells R to treat the vector as non-numeric. If
we give the command
cfit2=coxph(Surv(t,delta)~factor(x))

it produces the output

coef exp(coef) se(coef) z p


factor(x)2 0.801 2.23 0.187 4.29 1.8e-05
factor(x)3 0.707 2.03 0.185 3.82 1.3e-04

Likelihood ratio test=23 on 2 df, p=1.01e-05 n= 300, number of events= 191

This comes close to estimating correctly the relative risks 2 and 2.5, which in the first version
were estimated as 1.4 and 1.4² = 1.96.
If you want to include time-varying covariates, this may be done crudely by having multiple
time intervals for each individual, with all having a start and a stop time, and all but (perhaps)
the last being right-censored. (Of course, if individuals have multiple events, then there may be
multiple intervals that end with an event.) This allows the covariate to be changed stepwise.
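A minimal sketch of this interval representation, with entirely made-up times and a treatment indicator trt that changes value mid-follow-up for two of the subjects:

# Hypothetical (start, stop] data: the covariate trt changes during follow-up.
tv <- data.frame(id     = c(1, 1, 2, 3, 4, 4),
                 tstart = c(0, 30, 0, 0, 0, 25),
                 tstop  = c(30, 55, 40, 60, 25, 70),
                 status = c(0, 1, 1, 0, 0, 0),  # events at t = 55 (id 1) and t = 40 (id 2)
                 trt    = c(0, 1, 0, 0, 1, 0))  # id 1 switches 0 -> 1 at 30; id 4 switches 1 -> 0 at 25
coxph(Surv(tstart, tstop, status) ~ trt, data = tv)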

5.12 The additive hazards model


In this section we will consider the Aalen additive-hazards model, which is essentially a non-
parametric regression model. Relative-risk regression is mathematically elegant, and traditionally
is very popular in medical statistics, but it cannot be said to be a universal model. The
assumptions are very strong, and the model may give misleading results when applied to data
that don’t support it. (We describe in section 7.1 methods for evaluating the appropriateness of
the proportional hazards assumption.) It is useful to have alternative models available.
In addition to the general issue of scientific appropriateness of a given model, the additive
hazards model has numerous advantages, detailed at length in section 4.2 of [2]. In particular

• the statistical methods for fitting additive hazards regression models make it relatively
easy to allow for effects that change with time;

• results of the additive model lend themselves to a natural interpretation as “excess mortality”.

5.12.1 Describing the model


The additive-hazards regression model assigns to each individual i a time-varying covariate vector

xi(t) = ( xi1(t), . . . , xip(t) ).

The model parameters (unknown) are the baseline hazard β0(t), and p functions β1(t), . . . , βp(t).
The hazard for individual i at time t is then

β0(t) + β1(t)xi1(t) + · · · + βp(t)xip(t).   (5.12)

Each βj is a trajectory of excess risk attributable to changes in the covariate xij .



5.12.2 Fitting the model


As with other nonparametric estimation procedures, we naturally estimate the cumulative excess
risk rather than the risk itself. We define

Bk(t) := ∫_0^t βk(s) ds.

We want to estimate this cumulative excess risk, as we always do for nonparametric models, by a
function that is piecewise constant, taking jumps only at the times tj when there is an event.
We may represent this² as

Bk(t) = ∑_{tj≤t} dBk(tj),

where dBk (tj ) is simply a notation for the discrete hazard increment at time tj .
Let Yi(t) be the risk indicator for individual i — that is, it is 1 if individual i is at risk at
time t, and 0 otherwise. Consider the vector whose i-th component is

Yi(tj) dB̂0(tj) + ∑_{k=1}^{p} Yi(tj) xik(tj) dB̂k(tj).

In matrix form this is X(tj) dB̂(tj), where X(t) is the n × (p + 1) matrix whose (i, k)
component is Yi(t)xik(t), with xi0(t) ≡ 1. This is the hazard increment at time tj. If we observed
one event for individual i at time tj, we want the hazard increment to be about 1; if we observed
no event (as for most of the at-risk population) we want the increment to be about 0.
Of course, there is no way to choose the p + 1 components of dB̂(tj) to make all n of these
relations come out exactly right. So, as in the case of linear regression, we look for a least-squares
approximation; that is, we solve exactly as in the case of multivariate linear regression. Taking
account of the possibility that the “design matrix” X(t) will eventually have rank below p + 1 —
after enough subjects have dropped out — at which point the estimation process has to stop, we
follow our usual procedure by defining

X⁻(t) := (X(t)^T X(t))^{−1} X(t)^T   if X(t) has full rank,   and   X⁻(t) := 0   otherwise.   (5.13)

In other words, it is the generalised inverse of X whenever this exists. Our usual least-squares
solution for this equation is then

dB̂(t) = X⁻(t) dN(t),

yielding the estimator

B̂(t) = ∑_{tj≤t} X⁻(tj) dN(tj) = ∑_{tj≤t} X⁻(tj)_{·ij},   (5.14)

where dN(tj ) is a vector of all 0’s, except for a 1 in the ij component, where ij is the individual
having an event at time tj . (Here we are assuming no ties, as is conventional. If there are ties,
we could still obtain an unbiased estimator by letting dN(tj ) have a 1 in all the coordinates
corresponding to individuals with events at time tj .)
2
Formally, this may be understood as a stochastic integral with respect to a jump process, but this formalism
is beyond the scope of this course. Those who are interested may find the details in section 4.2 of [2].

A consequence of the approach we have taken is that B̂k(t) is an unbiased estimator³ for
Bk(t) for each t.
Assume now that there is a single individual ij having an event at time tj. (If there are ties,
this would require, as with relative-risk regression, some decision about the order of the events
in order to compute the variance. The R function aareg breaks ties randomly.) Then for large
sample sizes the estimators B̂k(t) are approximately multivariate normally distributed with
means Bk(t) and variance-covariance matrix estimated by

Σ̂_{kℓ}(t) = ∑_{tj≤t} X⁻_{k ij}(tj) X⁻_{ℓ ij}(tj).   (5.15)

Suppose we wish to test the hypothesis that βk(·) is identically 0, for some particular k. This
is equivalent to saying that any weighted integral

Zk = ∫_0^∞ W(s) βk(s) ds

is 0. We estimate Zk by the corresponding sum of increments ∑_{tj} W(tj) dB̂k(tj). We will
discuss nonparametric testing in more detail in chapter 6, in particular the criteria
for choosing a good weight function. The crucial thing is that W(s) has to be computable on
the basis of past information (that is, events and censoring before time s).
Under the null hypothesis this estimate of Zk is approximately normal with mean 0 and variance

Vk := ∑_{tj} ( W(tj) X⁻_{k ij}(tj) )².   (5.16)

5.12.3 Examples
Single covariate
Suppose each individual has a single covariate xi(t) at time t, and the hazard for individual i at
time t is β0(t) + β1(t)xi(t). If we let R(t) be the set of individuals at risk at time t, and define

µk(t) = (1/#R(t)) ∑_{i∈R(t)} xi(t)^k,

we have

X(t)^T X(t) = #R(t) [ 1, µ1(t) ; µ1(t), µ2(t) ],

and the inverse is

( X(t)^T X(t) )^{−1} = 1/( #R(t)(µ2(t) − µ1(t)²) ) [ µ2(t), −µ1(t) ; −µ1(t), 1 ].

If we assume that t is such that there are still multiple individuals whose event is after t, and
that the xi (t) are all distinct, then the Cauchy–Schwarz inequality tells us that the denominator
is always > 0.
³ To be precise, it is an unbiased estimator for ∫_0^t βk(s) 1{X⁻(s) has full rank} ds, since our estimation breaks down
when the population is reduced to the point where X⁻ no longer has full rank. Thus, there is a bias of the order
of the probability of this breakdown occurring. It will be small under most circumstances, where the study is
stopped long before most individuals have had their events.

We also observe that X(t)^T dN(t) is a 2 × 1 vector which is 0 except at times tj, when it is

X(tj)^T dN(tj) = ( dj, ∑_{i∈Dj} xi(tj) )^T,

where Dj is the set of individuals who have events at time tj and dj = #Dj. The estimator
(5.14) then becomes

( B̂0(t), B̂1(t) )^T = ∑_{tj≤t} [ dj / ( #R(tj)(µ2(tj) − µ1(tj)²) ) ] ( µ2(tj) − µ1(tj) x̄j, −µ1(tj) + x̄j )^T,   (5.17)

where x̄j is the mean value of xi(tj) over i ∈ Dj.


Intuitively, this result makes sense: It says that B̂1 (t) increases insofar as the average covariate
value of the individuals having an event at time tj is greater than the average covariate value of
all the individuals at risk at that time. On the other hand, the increments to B̂0 are like dj /#Rj
— the increment in the Nelson–Aalen estimator — modified in proportion as the estimate of β̂1 is
negative or positive.
We estimate the covariance matrix of B̂(t) as

∑_{tj≤t} [ 1 / ( #R(tj)²(µ2(tj) − µ1(tj)²)² ) ] ∑_{i∈Dj} [ (µ2(tj) − µ1(tj)xi)², (µ2(tj) − µ1(tj)xi)(−µ1(tj) + xi) ; (µ2(tj) − µ1(tj)xi)(−µ1(tj) + xi), (−µ1(tj) + xi)² ].
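The estimator (5.14) is simple enough to implement directly. The sketch below does so for a single time-constant covariate with no tied event times, assuming the design matrix retains full rank at each event time; its cumulative estimates should agree (up to tie-breaking) with those plotted by aareg.

# Direct least-squares implementation of (5.14) for one time-constant covariate.
aalen.fit <- function(t, delta, x) {
  ord <- order(t); t <- t[ord]; delta <- delta[ord]; x <- x[ord]
  event.idx <- which(delta == 1)
  B <- matrix(0, length(event.idx), 2,
              dimnames = list(NULL, c("Intercept", "x")))
  inc <- c(0, 0)
  for (k in seq_along(event.idx)) {
    j <- event.idx[k]
    at.risk <- which(t >= t[j])                  # risk set at the k-th event time
    X <- cbind(1, x[at.risk])                    # design matrix restricted to the risk set
    dN <- as.numeric(at.risk == j)               # indicator of the individual with the event
    inc <- inc + solve(t(X) %*% X, t(X) %*% dN)  # least-squares increment dB-hat
    B[k, ] <- inc
  }
  list(time = t[event.idx], B = B)
}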

Simulated data
We consider the single-covariate additive model from section 5.12.3. We consider a population of
n individuals, where the hazard rate for individual i is

λi(t) = 1 + xi/(1 + t),

with the xi being i.i.d. covariates with N(0, 0.25) distribution. So the effect of the covariate decreases
with time. We assume independent right censoring at constant rate 0.5. We consider two cases:
n = 100 and n = 1000. First of all, we need to simulate the times. We use the result of problem
3 from Problem Sheet A.1. The cumulative hazard for individual i is

Λi(t) = t + xi log(1 + t).

We can’t find Λi^{−1} in closed form, but it is easy to write this as a function xtime that computes
it numerically:

censrate=0.5
covmean=0
covsd=0.5
xtime=function(T,x){
u=uniroot(function(t) t+x*log(1+t)-T,c(0,max(T,2*(T-x))))
u$root
}
n=1000

# Censoring times
C=rexp(n,censrate)
xi=rnorm(n,covmean,covsd)

T=rep(0,n)

for (i in 1:n){
T[i]=xtime(rexp(1),xi[i])
}

t=pmin(T,C)
delta=(T<C)

We may compute B̂ by using the function aareg:

afit=aareg(Surv(t,delta)~xi)

plot(afit,xlim=c(0,1),ylim=c(-.5,1.5))
s=(0:120)/100
lines(s,log(1+s),col=2)

The results are in Figure 5.7. Note that the estimates for n = 100 are barely useful even just
for distinguishing the effect of the covariates from 0; on the other hand, bear in mind that these
are pointwise confidence intervals, so interpretations in terms of the entire time-course of B are
more complicated (and beyond the scope of this course). The estimates with n = 1000 are much
more useful.
Applying print() to an aareg object gives useful summary information about the model fit.
Applying it to the n = 100 simulation we get

> print(afit,maxtime=1)
Call:
aareg(formula = Surv(T, delta) ~ xi)

n= 100
70 out of 80 unique event times used

slope coef se(coef) z p


Intercept 1.76 0.00733 0.00433 1.69 0.0909
xi 1.24 0.00847 0.00428 1.98 0.0480

Chisq=3.91 on 1 df, p=0.048; test weights=aalen

The slope is a crude estimate of the rate of increase of B· (t) with t (based on fitting a weighted
least-squares line to the estimates). We use the option maxtime=1 since about 80% of the events
are in [0, 1], so that the estimates become extremely erratic after t = 1. If we leave out this
option, the slope will not make much sense (though we could extend the range significantly
further when n = 1000). In this case, we would get a slope estimate of 2.
Note that the p-value for the covariate coefficient (row “xi”) is based on the SE for the cumulative weighted test statistic for that particular parameter, and has nothing to do with the slope estimate. The chi-squared statistic is based on a joint weighted cumulative test statistic for all effects being 0, and it has a chi-squared distribution with p degrees of freedom. In the case p = 1 it is just the square of the single-variable test statistic.
[Figure 5.7 here: estimates of B̂1(t) ("xi") plotted against time, panel (a) for sample size 100 and panel (b) for sample size 1000.]

(a) Sample size 100 (b) Sample size 1000

Figure 5.7: Estimated cumulative hazard increment per unit of covariate (B̂1(t)) for two different sample sizes, together with pointwise 95% confidence intervals. The true value is B1(t) = log(1 + t), which is plotted in red.
Chapter 6

Non-parametric two-sample hypothesis tests

A common question is whether two (or more) samples of survival times may be considered to have been drawn from the same distribution: that is, whether the populations under observation are subject to the same hazard rate.

6.1 Tests in the regression setting


1) A package will produce a test of whether or not a regression coefficient is 0. It uses properties of MLEs. Let the coefficient of interest be b, say. Then the null hypothesis is $H_0 : b = 0$ and the alternative is $H_A : b \neq 0$. At the 5% significance level, $H_0$ will be accepted if the p-value p > 0.05, and rejected otherwise.
2) In an AL parametric model, if α is the shape parameter then we can test $H_0 : \log\alpha = 0$ against the alternative $H_A : \log\alpha \neq 0$. Again MLE properties are used and a p-value is produced as above. In the case of the Weibull, if we accept $\log\alpha = 0$ then we have the simpler exponential distribution (with α = 1).
3) We have already mentioned that, to test Weibull v. exponential with null hypothesis $H_0$: exponential is an acceptable fit, we can use
$$2\log\hat L_{\text{weib}} - 2\log\hat L_{\text{exp}} \sim \chi^2(1), \quad\text{asymptotically.}$$

6.2 Non-parametric testing of survival between groups


6.2.1 General principles
We will consider only the case where the data splits into two groups. There is a relatively easy
extension to k > 2 groups.
We define the following notation. Event times are $0 < t_1 < t_2 < \cdots < t_m$. For $j = 1, 2, \ldots, m$ and $i = 1, 2$:

$d_{ij}$ = # events at $t_j$ in group $i$,
$n_{ij}$ = # in risk set at $t_j$ from group $i$,
$d_j$ = # events at $t_j$,
$n_j$ = # in risk set at $t_j$.


Thus, when the number of groups k = 2, we have dj = d1j + d2j and nj = n1j + n2j .
Generally we are interested in testing the null hypothesis H0 , that there is no difference
between the hazard rates of the two groups, against the two-sided alternative that there is
a difference in the hazard rates. The guiding principle is quite elementary, quite similar to
our approach to the proportional hazards model: We treat each event time ti as a new and
independent experiment. Under the null hypothesis, the next event is simply a random sample
from the risk set. Thus, the probability of the death at time tj being from group 1 is n1j /nj ,
and the probability of it being from group 2 is n2j /nj .
This describes only the setting where the events all occur at distinct times: That is, dj are all
exactly 1. More generally, the null hypothesis predicts that the group identities of the individuals
whose events are at time tj are like a sample of size dj without replacement from a collection of
n1j ‘1’s and n2j ‘2’s. The distribution of d1j under such sampling is called the hypergeometric
distribution. It has
$$\text{expectation} = d_j\,\frac{n_{1j}}{n_j}, \qquad\text{and}\qquad \text{variance} =: \sigma_j^2 = \frac{n_{1j}n_{2j}(n_j - d_j)d_j}{n_j^2(n_j - 1)}.$$
Note that if $d_j$ is negligible with respect to $n_j$, this variance formula reduces to $d_j\bigl(\tfrac{n_{1j}}{n_j}\bigr)\bigl(\tfrac{n_{2j}}{n_j}\bigr)$, which is just the variance of a binomial distribution.
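As a quick numerical sanity check (not part of the original notes), the hypergeometric moments can be verified in R; the counts n1j = 11, n2j = 12, dj = 2 are taken from the first row of Table 6.1 below:

n1 = 11; n2 = 12; d = 2
nj = n1 + n2
d * n1/nj                                    # expectation of d1j
n1 * n2 * (nj - d) * d / (nj^2 * (nj - 1))   # variance sigma_j^2 (0.476, as in Table 6.1)
x = rhyper(1e5, n1, n2, d)                   # simulated draws: n1 group-1 labels, n2 group-2 labels, d drawn
mean(x); var(x)                              # should agree closely with the formulas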
Conditioned on all the events up to time $t_j$ (hence on $n_j$, $n_{1j}$, $n_{2j}$) and on $d_j$, the random variable $d_{1j} - n_{1j}\frac{d_j}{n_j}$ has expectation 0 and variance $\sigma_j^2$. If we multiply it by an arbitrary weight $W(t_j)$, determined by the data up to time $t_j$, we still have $W(t_j)\bigl(d_{1j} - n_{1j}\frac{d_j}{n_j}\bigr)$ being a random variable with (conditional) expectation 0, but now (conditional) variance $W(t_j)^2\sigma_j^2$. This means that if we define, for $k = 1, \ldots, m$,
$$M_k := \sum_{j=1}^{k} W(t_j)\Bigl(d_{1j} - n_{1j}\frac{d_j}{n_j}\Bigr),$$
these will be random variables with expectation 0 and variance $\sum_{j=1}^{k} W(t_j)^2\sigma_j^2$. While the increments are not independent, we may still apply a version of the Central Limit Theorem to show that $M_k$ is approximately normal when the sample size is large enough. (In technical terms, the sequence of random variables $M_k$ is a martingale, and the appropriate theorem is the Martingale Central Limit Theorem. See [13] for more details.) We then base our tests on the statistic
$$Z := \frac{\sum_{j=1}^{m} W(t_j)\bigl(d_{1j} - n_{1j}\frac{d_j}{n_j}\bigr)}{\sqrt{\sum_{j=1}^{m} W(t_j)^2\,\frac{n_{1j}n_{2j}(n_j - d_j)d_j}{n_j^2(n_j - 1)}}},$$

which should have a standard normal distribution under the null hypothesis.
Note that, as in the Cox regression setting, right censoring and left truncation are automatically
taken care of, by appropriate choice of the risk sets.
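Once the per-event-time counts have been tabulated, the statistic Z is easy to compute directly. A minimal sketch (the function and argument names are mine, not from any package):

logrank.Z = function(d1, n1, d, n, W = rep(1, length(d))) {
  # d1, n1: events and risk-set size in group 1 at each event time;
  # d, n:   totals at each event time;  W: weights (W = 1 gives the log rank test)
  e1 = d * n1/n                                        # hypergeometric expectation
  v  = n1 * (n - n1) * (n - d) * d / (n^2 * (n - 1))   # hypergeometric variance
  sum(W * (d1 - e1)) / sqrt(sum(W^2 * v))
}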

6.2.2 Standard tests


Any choice of weights W (tj ) defines a valid test. Why do we need weights? Since any choice
of weights produces a correct test, there is no canonical choice. Changing the weights changes
the power with respect to different alternatives. Which alternative you choose — hence, which
weights you choose — should depend on what deviations from equality you are most interested
in detecting. As always, the test should be chosen beforehand. Multiple testing makes the
interpretation of test results problematic.
Some common choices are:

1. $W(t_j) = 1$ for all $j$. This is the log rank test, and is the test in most common use. The log
rank test is aimed at detecting a consistent difference between hazards in the two groups
and is best placed to consider this alternative when the proportional hazard assumption
applies. It is maximally asymptotically efficient in the proportional hazards context; in
fact, it is equivalent to the score test for the Cox regression parameter being 0, hence is
asymptotically equivalent to the likelihood ratio test. A criticism is that it can give too
much weight to the later event times when numbers in the risk sets may be relatively small.

2. R. Peto and J. Peto [24] proposed a test that emphasises deviations that occur early on,
when there are more individuals under observation. Petos’ test uses a weight dependent
on a modified estimated survival function, estimated for the whole study. The modified
estimator is
$$\tilde S(t) = \prod_{t_j\le t}\frac{n_j + 1 - d_j}{n_j + 1},$$
and the suggested weight is then
$$W(t_j) = \tilde S(t_{j-1})\,\frac{n_j}{n_j + 1}.$$

This has the advantage of giving more weight to the early events and less to the later ones
where the population remaining is smaller. Much like the cumulative-deviations test, this
avoids giving extra weight to excess deaths that come later.

3. W (tj ) = nj has also been suggested (Gehan, Breslow). This again downgrades the effect of
the later times.

4. D. Harrington and T. Fleming [14] proposed a class of tests that include Petos’ test and the logrank test as special cases. The Fleming–Harrington tests use
$$W(t_j) = \bigl(\hat S(t_{j-1})\bigr)^p\bigl(1 - \hat S(t_{j-1})\bigr)^q,$$
where $\hat S$ is the Kaplan–Meier survival function, estimated for all the data. Then p = q = 0
gives the logrank test and p = 1, q = 0 gives a test very close to Peto’s test. If we were to
set p = 0, q > 0 this would emphasise the later event times if needed for some reason. This
might be the case if we were testing the effect of a medication that we expected to need
some time after initiation of treatment to become fully effective, or if the patients were

All of these tests may be written in the form


$$\frac{\sum_j (O_{1j} - E_{1j})W_j}{\sqrt{\sum_j \sigma_{1j}^2 W_j^2}},$$

where $O_{1j}$ and $E_{1j}$ are the observed and expected numbers of events in group 1. Consequently, positive and
negative fluctuations can cancel each other out. This could conceal a substantial difference
between hazard rates which is not of the proportional hazards form, but where the hazard rates
(for instance) cross over, with group 1 having (say) the higher hazard early, and the lower hazard
later. One way to detect such an effect is with a test statistic to which fluctuations contribute
only their absolute values. For instance, we could use the standard χ2 statistic
$$X := \sum_{i=1}^{k}\sum_{j=1}^{m}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$

Asymptotically, this should have the χ2 distribution with (k − 1)m degrees of freedom. Of course,
if the number of groups k = 2, this is the same as
$$X := \sum_{j=1}^{m}\frac{(O_{1j} - E_{1j})^2}{d_j\frac{n_{1j}}{n_j}\bigl(1 - \frac{n_{1j}}{n_j}\bigr)}.$$

6.3 Examples
6.3.1 The AML example
We can use these tests to compare the survival of the two groups in the AML experiment discussed
in section 4.4.5. The relevant quantities are tabulated in Table 6.1.

tj nj1 nj2 dj1 dj2 σj2 Peto weight


5 11 12 0 2 0.476 0.958
8 11 10 0 2 0.474 0.875
9 11 8 1 0 0.244 0.792
12 10 8 0 1 0.247 0.750
13 10 7 1 0 0.242 0.708
18 8 6 1 0 0.245 0.661
23 7 6 1 1 0.456 0.614
27 6 5 0 1 0.248 0.519
30 5 4 0 1 0.247 0.467
31 5 3 1 0 0.234 0.416
33 4 3 0 1 0.245 0.364
34 4 2 1 0 0.222 0.312
43 3 2 0 1 0.240 0.260
45 3 1 0 1 0.188 0.208

Table 6.1: Data for testing equality of survival in AML experiment.

When the weights are all taken equal, we compute Z = −1.84, whereas the Peto weights —
which reduce the influence of later observations — give us Z = −1.67. This yields one-sided
p-values of 0.033 and 0.048 respectively — a marginally significant difference — or two-sided
p-values of 0.065 and 0.096.
Applying the χ2 test yields X = 16.86, which needs to be compared to χ2 with 14 degrees of
freedom. The resulting p-value is 0.24, which is not at all significant. This should not be seen
as surprising: The differences between the two survival curves are clearly mostly in the same
direction, so we lose power when applying a test that ignores the direction of the difference.
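The same comparison can be run in R; the AML data are distributed with the survival package as the data set aml, and survdiff reports the chi-squared form of the test, which for two groups is the square of the Z statistic. (The rho = 1 option uses a slightly different weighting from the Peto weights tabulated above, so the results will not match exactly.)

library(survival)
survdiff(Surv(time, status) ~ x, data = aml)             # log-rank test
survdiff(Surv(time, status) ~ x, data = aml, rho = 1)    # Peto-type weighting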

6.3.2 Kidney dialysis example


This is example 7.2 from [18]. The data are from a clinical trial of alternative methods of
placing catheters in kidney dialysis patients. The event time is the first occurrence of an exit-site
infection. The data are in the KMsurv package, in the object kidney. (Note: There is a different
data set with the same name in the survival package. To make sure you get the right one, enter
data(kidney,package=’KMsurv’).) The Kaplan–Meier estimator is shown in Figure 6.1. There
are two survival curves, corresponding to the two different methods.

[Figure 6.1 here: "Kaplan–Meier plot for kidney dialysis data", survival probability against time to infection (months).]

Figure 6.1: Plot of Kaplan–Meier survival curves for time to infection of dialysis patients, based on
data described in section 1.4 of [18]. The black curve represents 43 patients with surgically-placed
catheter; the red curve 76 patients with percutaneously placed catheter.

We show the calculations for the nonparametric test of equality of distributions in Table 6.2.
The log-rank test — obtained by simply dividing the sum of all the deviations by the square root
of the sum of terms in the σj2 column — is only 1.59, so not significant. With the Peto weights the
statistic is only 1.12. This is not surprising, because the survival curves are close together (and
actually cross) early on. On the other hand, they diverge later, suggesting that weighting the
later times more heavily would yield a significant result. It would not be responsible statistical
practice to choose a different test after seeing the data. On the other hand, if we had started
with the belief that the benefits of the percutaneous method are cumulative, so that it would
make sense to expect the improved survival to appear later on, we might have planned from the

beginning to use the Harrington–Fleming weights with, for example, p = 0, q = 1, tabulated in the last column. Applying these weights gives us a test statistic $Z_{FH} = 3.11$, implying a highly significant difference.

tj nj1 nj2 dj1 dj2 σj2 Peto wt. H–F (0, 1) wt.
0.5 43 76 0 6 1.326 0.992 0.000
1.5 43 60 1 0 0.243 0.941 0.050
2.5 42 56 0 2 0.485 0.931 0.059
3.5 40 49 1 1 0.489 0.912 0.078
4.5 36 43 2 0 0.490 0.890 0.099
5.5 33 40 1 0 0.248 0.867 0.121
6.5 31 35 0 1 0.249 0.854 0.133
8.5 25 30 2 0 0.487 0.839 0.146
9.5 22 27 1 0 0.247 0.807 0.176
10.5 20 25 1 0 0.247 0.790 0.193
11.5 18 22 1 0 0.247 0.770 0.210
15.5 11 14 1 1 0.472 0.741 0.230
16.5 10 13 1 0 0.246 0.681 0.289
18.5 9 11 1 0 0.247 0.649 0.319
23.5 4 3 1 0 0.245 0.568 0.351
26.5 2 3 1 0 0.240 0.473 0.432

Table 6.2: Data for kidney dialysis study.

6.4 Nonparametric tests in R (non-examinable)


The survival package includes a command survdiff that carries out the tests described in
this lecture. The following code carries out the log-rank test and the Petos’ test (actually, the
Harrington–Fleming (1, 0) test.)
> kS=Surv(kidney$time, kidney$delta)
> survdiff(kS~kidney$type)   # log-rank test
Call:
survdiff(formula = kS ~ kidney$type)

                N Observed Expected (O-E)^2/E (O-E)^2/V
kidney$type=1  43       15       11      1.42      2.53
kidney$type=2  76       11       15      1.05      2.53

Chisq= 2.5  on 1 degrees of freedom, p= 0.112
> survdiff(kS~kidney$type, rho=1)   # Petos test
Call:
survdiff(formula = kS ~ kidney$type, rho = 1)

                N Observed Expected (O-E)^2/E (O-E)^2/V
kidney$type=1  43     12.0     9.48     0.686      1.39
kidney$type=2  76     10.4    12.98     0.501      1.39

Chisq= 1.4  on 1 degrees of freedom, p= 0.239

Note that the output reports the tests in chi-squared form: for the log-rank test the statistic 2.5 is (up to rounding) the square of the Z = 1.59 computed in section 6.3.2, while the rho = 1 version uses a slightly different weighting from the Peto weights tabulated there, so the two do not agree exactly.

We conclude with the R code used to generate Figure 6.1 and Table 6.2.
require('survival')
require('KMsurv')
require('xtable')    # for the xtable() call at the end
data(kidney, package='KMsurv')
attach(kidney)
kid.surv=Surv(time, delta)
kid.fit=survfit(kid.surv~type)
plot(kid.fit, col=1:2, xlab='Time to infection (months)',
     main='Kaplan-Meier plot for kidney dialysis data')

# per-group event counts d and risk-set sizes n at each distinct event time
t=sort(unique(time))
d=lapply(1:2, function(i) sapply(t, function(T) sum((time[type==i]==T)&(delta[type==i]==1))))
n=lapply(1:2, function(i) sapply(t, function(T) sum(time[type==i]>=T)))

# keep only times at which both groups are still under observation
keep=(n[[1]]*n[[2]]>0)
t=t[keep]
d=lapply(d, function(D) D[keep])
n=lapply(n, function(D) D[keep])
ddot=d[[1]]+d[[2]]
ndot=n[[1]]+n[[2]]

si=ddot*n[[1]]*n[[2]]*(ndot-ddot)/ndot/ndot/(ndot-1)   # hypergeometric variances
petos=cumprod((ndot-ddot)/(ndot))
petow=c(1,petos[1:(length(petos)-1)])*ndot/(ndot+1)    # Peto weights
fhw=c(0,(1-petos[1:(length(petos)-1)]))                # Harrington-Fleming (0,1) weights
ei=ddot*n[[1]]/ndot                                    # expected events in group 1
wk=rep(1,length(ei))                                   # log-rank weights

zLR=sum(wk*(d[[1]]-ei))/sqrt(sum(wk^2*si))
zP=sum(petow*(d[[1]]-ei))/sqrt(sum(petow^2*si))
zFH=sum(fhw*(d[[1]]-ei))/sqrt(sum(fhw^2*si))

xt=xtable(cbind(t,n[[1]],n[[2]],d[[1]],d[[2]],si,petow,fhw),
          display=c('d','f','d','d','d','d','f','f','f'),
          digits=c(0,1,0,0,0,0,3,3,3))
print(xt, include.rownames=FALSE, include.colnames=FALSE)
Chapter 7

Testing the fit of survival models

In section 7.1 we look at graphical plots that can be used to test the appropriateness of
the proportional hazards model. These graphical methods work when we have large groups of
individuals whose survival curves may be estimated separately, and then compared with respect
to the proportional-hazards property.
In this lecture we look more generally at the problem of testing model assumptions — mainly,
but not exclusively, the assumptions of the Cox regression model — and describe a suite of tools
that may be used for models with quantitative as well as categorical covariates.

7.1 Graphical tests of the proportional hazards assumption


We describe here several plots that can be made from survival data to determine whether the
proportional hazards assumption is reasonable.
7.1.1 Log cumulative hazard plot
The simplest graphical tests require that the covariate take on a few discrete values, with a substantial number of subjects observed in each category. If the covariate is continuous we stratify it, defining a new categorical covariate by the original covariate being in some fixed region.
The first approach is to consider, for categories $1, \ldots, m$, the Nelson–Aalen estimators $\hat H_i(t)$ of the cumulative hazard for individuals in category $i$. If any relative-risk model holds then $\hat H_i(t) = r(\beta, i)\hat H_0(t)$, so that
$$\log\hat H_i(t) - \log\hat H_j(t) \approx \log r(\beta, i) - \log r(\beta, j)$$
should be approximately constant.
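In R such a plot can be produced directly from a survfit object, using minus the log of the Kaplan-Meier curve as a stand-in for the Nelson-Aalen estimator. A hedged sketch, with dat, time, status and group standing in for whatever data are at hand:

require('survival')
fit = survfit(Surv(time, status) ~ group, data = dat)
plot(fit, fun = 'cloglog', col = 1:2, xlab = 'time (log scale)',
     ylab = 'log cumulative hazard')
# under a relative-risk model the curves should be (roughly) constant vertical shifts of each other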


7.1.2 Andersen plot
In the Andersen plot we plot all the pairs $(\hat H_i(t), \hat H_j(t))$. If the proportional hazards assumption holds then each pair $(i, j)$ should produce (approximately) a straight line through the origin. It is
known (cf. section 11.4 of [18]) that when the ratio of hazard rates αi (t)/αj (t) is increasing, the
corresponding Andersen plot is a convex function; decreasing ratios produce concave Andersen
plots.
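A rough recipe for producing such a plot in R for two groups (a sketch only: dat, time, status and group are illustrative names, and minus the log of the Kaplan-Meier curve again stands in for the Nelson-Aalen estimator):

require('survival')
fit1 = survfit(Surv(time, status) ~ 1, data = subset(dat, group == 1))
fit2 = survfit(Surv(time, status) ~ 1, data = subset(dat, group == 2))
tt = sort(unique(dat$time))
H1 = -log(summary(fit1, times = tt, extend = TRUE)$surv)   # cumulative hazard, group 1
H2 = -log(summary(fit2, times = tt, extend = TRUE)$surv)   # cumulative hazard, group 2
plot(H1, H2, type = 's', xlab = 'cumulative hazard, group 1',
     ylab = 'cumulative hazard, group 2')
abline(0, 1, lty = 2)   # reference line; under proportional hazards the points fall on some line through the origin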


7.1.3 Arjas plot


The Arjas plot is a more sophisticated graphical test, that is capable of testing the proportional
hazards assumption for a single categorical covariate, within a model that includes other covariates
(that might follow a different sort of model).
Suppose we have fit the model $h_i(t) = h(t|x_i)$ for the hazard rate of individual $i$ — for example, it might be $h_i(t) = e^{\hat\beta\cdot x_i}\alpha_0(t)$, but it might be something else — and we are interested to decide whether an additional (categorical) covariate $z_i$ (taking on values 0 and 1) ought to be included as well. For each individual we have an estimated cumulative hazard $\hat H(t|x_i)$. Define the weighted time on test for individual $i$ at event time $t_j$ as $\hat H(t_j\wedge T_i|x_i)$, and the total time on test for level $g$ (of the covariate $z$) as
$$\mathrm{TOT}_g(t_j) = \sum_{i:\, z_i = g}\hat H(t_j\wedge T_i|x_i);$$
the number of events at level $g$ will be
$$N_g(t_j) = \sum_{i:\, z_i = g}\delta_i 1_{\{T_i\le t_j\}}.$$

(a∧b = min{a, b}.) The idea is that if the covariate z has no effect, the difference Ng (tj )−TOTg (tj )
has expectation zero for each tj , so a plot of Ng against TOT would lie close to a straight line
with 45◦ slope. If levels of z have proportional hazards effects, we expect to see lines of different
slopes. If the effects are not proportional, we expect to see curves that are not lines.
We give an example in Figure 7.1. We have simulated a population of 100 males and 100 females, whose hazard rates are $\alpha_i(t) = e^{x_i + \beta_{\text{male}} I_{\text{male}}}\, t$, where $x_i \sim N(0, 0.25)$. In Figure 7.1(a) we show the Arjas plot in the case $\beta_{\text{male}} = 0$; in Figure 7.1(b) we show the plot for the case $\beta_{\text{male}} = 1$.

require('survival')

n=100
#n=1000
censrate=0.5
covmean=0
covsd=0.5
maleeff=1
beta=1

xmale=rnorm(n,covmean,covsd)
xfem=rnorm(n,covmean,covsd)
x=c(xmale,xfem)

# Event times: hazard alpha_i(t) = exp(beta*x_i + maleeff*I_male)*t,
# so T = sqrt(2*E*exp(-(beta*x_i + maleeff*I_male))) with E ~ Exp(1)
Tmale=sqrt(rexp(n)*2*exp(-beta*xmale-maleeff))
Tfem=sqrt(rexp(n)*2*exp(-beta*xfem))

T=c(Tmale,Tfem)

# Censoring times
C=rexp(2*n,censrate)

[Figure 7.1 here: Arjas plots of cumulative hazard (total time on test) against number of events, for males and females, under "Male effect = 0" and "Male effect = 1".]

(a) βmale = 0 (b) βmale = 1

Figure 7.1: Arjas plots for simulated data.

t=pmin(C,T)

delta=1*(T<C)
sex=factor(c(rep('M',n),rep('F',n)))

cfit=coxph(Surv(t,delta)~x)

# Make a plot to see how close the estimated baseline survival is to the correct one
plot(survfit(cfit,data.frame(x=0)))
tt=(0:300)/100
lines(tt,exp(-tt^2/2),col=2)

beta=as.numeric(cfit$coef)
relrisk=exp(beta*x)*exp(maleeff*c(rep(1,n),rep(0,n)))

eventord=order(t)

et=t[eventord]
es=sex[eventord]
ec=x[eventord]
ed=delta[eventord]   # event indicators in event-time order, used for N_g below

cumhaz=-log(survfit(cfit,data.frame(x=ec))$surv)
haztrunc=sapply(1:(2*n),function(i) pmin(cumhaz[i,i],cumhaz[,i]))
haztruncmale=haztrunc[,es=='M']
haztruncfem=haztrunc[,es=='F']

# Maximum cumulative hazard comes for individual i when we've gotten
# to the row corresponding to eventord[i]

TOTmale=apply(haztruncmale,1,sum)
TOTfem=apply(haztruncfem,1,sum)

# cumulative number of events at each level of sex
Nmale=cumsum(ed*(es=='M'))
Nfem=cumsum(ed*(es=='F'))

plot(Nmale,TOTmale,xlab='Number of events',ylab='Cumulative hazard',type='l',
     col=2,lwd=2,main=paste('Male effect=',maleeff))
abline(0,1)
lines(Nfem,TOTfem,col=3,lwd=2)
legend(.1*n,.4*n,c('male','female'),col=2:3,lwd=2)

7.1.4 Leukaemia example


Section 1.9 of [18] describes data on 101 leukaemia patients, comparing the disease-free survival
time between 50 who received allogenic bone marrow transplants and 51 who received autogenic
transplants. We follow section 11.4 of that book in testing the proportional hazards model with
two graphical tests. The data are available in the object alloauto in KMsurv.
Both plots show that the data are badly suited to a proportional hazards model. The
Andersen plot looks clearly convex, suggesting that the hazard ratio of autogenic to allogenic is
increasing. This is also what one would infer from the crossing of the two log cumulative hazard
curves.

[Figure 7.2 here: (a) log cumulative hazard plot for alloauto data, log cumulative hazard against time (weeks); (b) Andersen plot for alloauto data, log cumulative autologous hazard against log cumulative allogenic hazard.]

(a) Cumulative log hazard plot (b) Andersen plot

Figure 7.2: Graphical tests for proportional hazards assumption in alloauto data.

7.2 General principles of model selection


7.2.1 The idea of model diagnostics
A statistical model is a family of probability distributions, whose realisations are data of the sort
that we are trying to analyse. Fitting a model means that we find the member of the family that
is closest to the data in terms of some criterion. If the model is good, then this best-fit model

will be a reasonable proxy — for all sorts of inference purposes — for the process that originally
generated the data. That means that other properties of the data that were not used in choosing
the representative member of the family should also be close to the corresponding properties of
the model fit.
(An alternative approach, that we won’t discuss here, is model averaging, where we accept up
front that no model is really correct, and so give up the search for the “one best”. Instead, we
draw our statistical inferences from all the models in the family simultaneously, appropriately
weighted for how well they suit the data.)
The idea, then, is to look at some deviations of the data from the best-fit model — the
residuals — which may be represented in terms of test statistics or graphical plots whose
properties are known under the null hypothesis that the data came from the family of distributions
in question, and then evaluate the performance. Often, this is done not in the sense of formal
hypothesis testing — after all, we don’t expect the data to have come exactly from the model, so
the difference between rejecting and not rejecting the null hypothesis is really just a matter of
sample size — but of evaluating whether the deviations from the model seem sufficiently crass to
invalidate the analysis. In addition, residuals may show not merely that the model does not fit
the data adequately, but also show what the systematic difference is, pointing the way to an
improved model. This is the main application of martingale residuals, which we will discuss in
section 7.6. Alternatively, it may show that the failure is confined to a few individuals. Together
with other information, this may lead us to analyse these outliers as a separate group, or to
discover inconsistencies in the data collection that would make it appropriate to analyse the
remaining data without these few outliers. The main tool for detecting outliers is the deviance residual, which we discuss in section 7.7.1.
7.2.2 A simulated example
Suppose we have data simulated from a very simple survival model, where individual i has
constant hazard 1 + xi , where xi is an observed positive covariate, with independent right
censoring at constant rate 0.2. Now suppose we choose to fit the data to a proportional hazards
model. What would go wrong? Not surprisingly, for such a simple model, the main conclusion —
that the covariate has a positive effect on the hazard — would still be qualitatively accurate.
But what about the estimate of baseline hazard?
We simulated this process with 1000 individuals, where the covariates were the absolute
values of independent normal random variables. We must first recognise that it is not entirely
clear what it even means to evaluate the accuracy of fit of such a misspecified model. If we plug
the simulated data into the Cox model, we necessarily get an exponential parameter out, telling
us that the hazard rate corresponding to covariate x is eβ̂x . Since the hazard is actually 1 + βx,
it is not clear what it would mean to say that the parameter was well estimated. Certainly,
positive β should be translated into positive β̂. Similarly, the baseline hazard of the Cox model
agrees with the baseline hazard of the additive-hazards model, in that both are supposed to
be the hazard rates for an individual with covariates 0, but their roles in the two models are
sufficiently different that any comparison is on uncertain ground.
Still, the fitted Cox model makes a prediction about the hazard rate of an individual whose
covariates are all 0, and that prediction is wrong. When we fit the data from 1000 individuals to
the Cox model, we get this output:
    coef exp(coef) se(coef)    z p
x  0.581      1.79   0.0559 10.4 0

Likelihood ratio test=98.6 on 1 df, p=0
n= 1000, number of events= 882
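A sketch of how such a simulation and (misspecified) fit might be set up; the variable names are illustrative, and this is not the exact code that produced the output above:

library(survival)
n = 1000
x = abs(rnorm(n))                    # observed positive covariate
T = rexp(n, rate = 1 + x)            # true model: constant hazard 1 + x_i
C = rexp(n, rate = 0.2)              # independent right censoring at rate 0.2
time = pmin(T, C); delta = as.numeric(T < C)
cfit = coxph(Surv(time, delta) ~ x)  # deliberately misspecified proportional hazards fit
summary(cfit)
plot(survfit(cfit, newdata = data.frame(x = 0)))   # estimated baseline survival
tt = (0:400)/100
lines(tt, exp(-tt), col = 2)                       # true baseline survival exp(-t)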



In Figure 7.3(a) we see the baseline survival curve estimated from these data by the Cox
model. The confidence region is quite narrow, but we see that the true survival curve — the red
curve — is nowhere near it. In Figure 7.3(b) we have a smoothed version of the hazard estimate,
and we see that forcing the data into a misspecified Cox model has turned the constant baseline
hazard into an increasing hazard.

[Figure 7.3 here: (a) Cox baseline survival estimate; (b) Cox smoothed hazard estimate, hazard against time.]

(a) Baseline survival (b) Smoothed hazard

Figure 7.3: Baseline survival and hazard estimated from the Cox proportional hazards model for
data simulated from the additive hazards model. Red is the true baseline hazard.

7.3 Cox–Snell residuals


Given a linear regression model $Y_i = \beta X_i + \epsilon_i$, we stress-test the model by looking at the residuals $Y_i - \hat\beta X_i$. If the model is reasonably appropriate, the residuals should look something like a sample from the distribution posited for the errors $\epsilon_i$. There are many ways the residuals can fail
to have the “right” distribution — wrong tails, dependence between different residuals, changing
distributions over time or depending on Xi — and consequently many ways to test them, ranging
from formal test statistics to graphical plots that are evaluated by eye.
In evaluating regression models for survival we are doing something similar, except that the
connection between the individual observations and the parameters being estimated is much
more indirect, thus demanding more ingenuity to even define the residuals, and then to evaluate
them. There is a large family of different residuals that have been defined for survival models,
each of which is useful for different parts of the task of model diagnostics, including:

• Generally evaluating the appropriateness of a regression model (such as Cox proportional


hazards or additive hazards);

• Specifically evaluating assumptions of the regression model (such as the proportional


hazards assumption, or the log-linear action of the covariates);

• Finding specific outlier individuals in an otherwise reasonably well specified model.



The most basic version is called the Cox–Snell residual. It is based on the observation that
if T is a sample from a distribution with cumulative hazard function H, then H(T ) has an
exponential distribution with parameter 1.
Given a parametric model H(T, β) we would then generate and evaluate Cox–Snell residuals
as follows:

1. We use the samples (Ti , δi ) to estimate a best fit β̂;

2. Compute the residuals ri := H(Ti , β̂);

3. If the model is well specified — a good fit to the data — then (ri , δi ) should be like a
right-censored sample from a distribution with constant hazard 1.

4. A standard way of evaluating the residuals is to compute and plot a Nelson–Aalen estimator
for the cumulative hazard rate of the residuals. The null hypothesis — that the data came
from the parametric model under consideration — would predict that this plot should lie
close to the line y = x.

Of course, things aren’t quite so straightforward when evaluating a semi-parametric model


(such as the Cox model) or nonparametric model (such as the Aalen additive hazards model). We
describe here the standard procedure for computing Cox–Snell residuals for the Cox proportional
hazards model (the original application):

1. We use the samples (Ti , xi , δi ) to estimate a best fit β̂;

2. We compute the Breslow estimator $\hat H_0(t)$ for the baseline hazard;

3. Compute the residuals $r_i := e^{\hat\beta^T x_i}\hat H_0(T_i)$.

After this, we proceed as above. Of course, there is nothing special about the Cox log-linear
form of the relative risk. Given any relative risk function r(β̂, x), we may define residuals
$r_i := r(\hat\beta, x_i)\hat H_0(T_i)$.
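As an illustration of the parametric recipe above (a sketch, not taken from the notes), Cox-Snell residuals for a Weibull regression can be computed by hand; the lung data and the covariate age are again just placeholders:

library(survival)
fit = survreg(Surv(time, status) ~ age, data = lung, dist = 'weibull')
shape = 1/fit$scale
lp = predict(fit, type = 'lp')              # log of the fitted Weibull scale for each subject
r = (lung$time/exp(lp))^shape               # fitted cumulative hazard at T_i = Cox-Snell residual
cs = survfit(Surv(r, lung$status == 2) ~ 1)
plot(cs, fun = 'cumhaz', conf.int = FALSE)  # cumulative hazard estimate for the residuals
abline(0, 1, col = 2)                       # should lie close to y = x if the model fits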

7.4 Bone marrow transplantation example


(This is based on Example 11.1 of [18].)
The data here are in the object bmt of the KMsurv package. They describe the disease-free
survival time of leukaemia patients following a bone marrow transplant. We consider the model
with proportional effects for the following variables:

Z1 = patient age at transplant (centred at 28 years);


Z2 = donor age at transplant (centred at 28 years);
Z1 × Z2 ;
Z3 = patient sex (1=male, 0=female);
Z7 = waiting time to transplant.

We fit the model in R as follows:



> bmcox=coxph(Surv(t2,d3)~z1+z2+z1*z2+z3+z7, data=bmt)
> summary(bmcox)
Call:
coxph(formula = Surv(t2, d3) ~ z1 + z2 + z1 * z2 + z3 + z7, data = bmt)

  n= 137, number of events= 83

             coef  exp(coef)  se(coef)      z Pr(>|z|)
z1    -0.1142944  0.8919953  0.0356292 -3.208  0.00134 **
z2    -0.0859570  0.9176336  0.0302892 -2.838  0.00454 **
z3    -0.3062033  0.7362370  0.2298515 -1.332  0.18280
z7     0.0001135  1.0001135  0.0003274  0.347  0.72875
z1:z2  0.0036177  1.0036243  0.0009203  3.931 8.46e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      exp(coef) exp(-coef) lower .95 upper .95
z1       0.8920     1.1211    0.8318    0.9565
z2       0.9176     1.0898    0.8647    0.9738
z3       0.7362     1.3583    0.4692    1.1552
z7       1.0001     0.9999    0.9995    1.0008
z1:z2    1.0036     0.9964    1.0018    1.0054

Concordance= 0.605  (se = 0.033)
Rsquare= 0.104   (max possible= 0.996)
Likelihood ratio test= 15.06  on 5 df,   p=0.01012
Wald test            = 18.02  on 5 df,   p=0.00292
Score (logrank) test = 18.9  on 5 df,   p=0.002009

So we see that patient age and donor age both seem to have strong effects on disease-free
survival time (with increasing age acting negatively — that is, increasing the length of disease-free
survival — as we might expect if we consider that many forms of cancer progress more rapidly
in younger patients). Somewhat surprisingly, the effect of a year of donor age is almost as strong
as the effect of a year of patient age. There is also a strong positive interaction term, suggesting
that the prognosis for an old patient receiving a transplant from an old donor is not as favourable
as we would expect from simply adding their effects. Thus, for example, the oldest patient was 80
years old, while the oldest donor was 84. The youngest patient was 35, the youngest donor just
30. The model suggests that the 80-year-old patient should have a hazard rate for relapse that is
a factor of e−0.1143×45 = 0.006 that of the youngest. Indeed, the youngest patient relapsed after
just 43 days, while the oldest died after 363 days without recurrence of the disease. The patient
with the oldest donor would be predicted to have his or her hazard rate of recurrence reduced by
a factor of e−0.0860×54 = 0.0096. Indeed, the youngest patient did have the youngest donor, and
the oldest patient nearly had the oldest. Had that been the case, the ratio of hazard rate for the
oldest to that of the youngest would have been 0.0096 × 0.006 = 0.00006. Taking account of the
interaction term we see that the actual proportionality factor predicted by the model is
$$\exp\bigl(-0.1143\times 45 - 0.0860\times 54 + 0.0036177\times(45\times 54)\bigr) \approx 0.37.$$

7.5 Testing the proportional hazards assumption: Schoenfeld residuals
At least as important as checking the log-linearity of a covariate’s influence for the Cox proportional
hazards model is checking that the influence is constant in time. We already described how to

make a graphical test of the proportional hazards assumption between two groups of subjects. It
is less obvious how to test the proportional hazards assumption associated with a relative-risk
regression model.
For this purpose the standard tool is the Schoenfeld residuals. The formal derivation is left as an exercise, but the definition of the $j$-th Schoenfeld residual for parameter $\beta_k$ is
$$S_{kj}(t_j) := X_{i_j k}(t_j) - \bar X_k(t_j),$$
where as usual $i_j$ is the individual with event at time $t_j$, and
$$\bar X_k(t_j) = \frac{\sum_{i\in R_j} X_{ik}(t_j)\, e^{\hat\beta^T X_i(t_j)}}{\sum_{i\in R_j} e^{\hat\beta^T X_i(t_j)}}$$
is the weighted mean of covariate $X_k$ at time $t_j$. Thus the Schoenfeld residual measures the
difference between the covariate at time t and the average covariate at time t. If βk is constant
this has expected value 0. If the effect of Xk is increasing we expect the estimated parameter
βk to be an overestimate early on — so the individuals with events then have lower Xk than
we would have expected, producing negative Schoenfeld residuals; at later times the residuals
would tend to be positive. Thus, increasing effect is associated with increasing Schoenfeld
residuals. Likewise decreasing effect is associated with decreasing Schoenfeld residuals. As with
the martingale residuals, we typically make a smoothed plot of the Schoenfeld residuals, to get a
general picture of the time trend.
We can also make a formal test of the hypothesis βk is constant by fitting a linear regression
line to the Schoenfeld residuals as a function of time, and testing the null hypothesis of zero
slope, against the alternative of nonzero slope. Of course, such a test will have little or no power
to detect nonlinear deviations from the hypothesis of constant effect — for instance, threshold
effects, or changing direction.

7.6 Martingale residuals


7.6.1 Definition of martingale residuals
A close analogue of the residuals that we use in the linear-regression setting is provided by the martingale residuals. (A good source for the practicalities of martingale residuals is chapter 11 of [18]. The mathematical details are best presented in section 4.5 of [10].)
Intuitively, the martingale residual for an individual is the difference between the number
of observed events for the individual and the expected number under the model. The expected
number of events from time 0 to t for individual i is Hi (t), so in principle this is
$$\delta_i - \int_0^{T_i} h_0(s)\,e^{\beta\cdot x_i(s)}\,ds.$$
We turn this into a residual — something we can compute from the data — by replacing the integral with respect to the hazard by the differences in the estimated cumulative hazard
$$\widetilde M_i := \delta_i - \hat H_i(T_i) = \delta_i - \sum_{t_j\le T_i} e^{\hat\beta\cdot x_i(t_j)}\hat h_0(t_j) = \delta_i - \sum_{t_j\le T_i}\frac{e^{\hat\beta\cdot x_i(t_j)}}{\sum_{\ell\in R_j} e^{\hat\beta\cdot x_\ell(t_j)}}. \tag{7.1}$$

It differs from the (negative of the) Cox–Snell residual only by the addition of δi . When the
covariates are constant in time,
$$\widetilde M_i = \delta_i - e^{\hat\beta^T x_i}\hat H_0(T_i).$$

An individual martingale residual is a very crude measure of deviation, since for any given t
the only observation here that is relevant to the survival model is δi , which is a binary observation.
If there are no ties, or if we use the Breslow method for resolving ties, the sum of all the martingale residuals is 0.
7.6.2 Application of martingale residuals for estimating covariate transforms
Martingale residuals are not very useful in the way that linear-regression residuals are, because
there is no natural distribution to compare them to. The main application is to estimate
appropriate modifications to the proportional hazards model by way of covariate transformation:

Instead of a relative risk of $e^{\beta x}$, it might be $e^{f(x)}$, where $f(x)$ could be $1_{\{x < x_0\}}$ or $x$, or something else.
We assume that in the population xi and zi are independent. This won’t be exactly true in
reality, but obviously a strong correlation between two variables complicates efforts to disentangle
their effects through a regression model. We derive the formula under the assumption of
independence, understanding that the results will be less reliable the more intertwined the
variable zi is with the others.
Suppose the data (Ti , xi (·), zi , δi ) are sampled from a relative-risk model with two covariates:
a vector xi and an additional one-dimensional covariate zi , with

$$\log r(\beta, x, z) = \beta^T x + f(z);$$
that is, the Cox regression model holds except with regard to the last covariate, which acts as $h(z) := e^{f(z)}$. Let $\hat\beta$ be the p-dimensional vector corresponding to the Cox model fit without the covariate z, and let $\widetilde M_i$ be the corresponding martingale residuals.
Another complication is that we use β̂ instead of β (which we don’t know). We will derive the
relationship under the assumption that they are equal; again, errors in estimating β will make
the conclusions less correct. For large n we may assume that β and β̂ are close.
Let
$$\bar h(s, x) := \frac{E\bigl[1_{\{\text{at risk at time } s\}}\, e^{f(z)} \bigm| x\bigr]}{P\{\text{at risk at time } s \mid x\}} = E\bigl[h(z) \bigm| \text{at risk at time } s;\ x\bigr];\qquad \bar h(s) := E\bigl[\bar h(s, x)\bigr].$$
We assume that $\bar h(s, x)$ is approximately the constant $\bar h(s)$.

Fact.
$$E\bigl[\widetilde M \bigm| z\bigr] \approx \bigl(f(z) - \log\bar h(\infty)\bigr)\,\frac{\sum \delta_i}{n}. \tag{7.2}$$

Thus, we may estimate $f(z_0)$ by estimating the local average of $\widetilde M$, averaged over $z$ close to $z_0$. For instance, we can compute a LOESS smooth curve fit to the scatterplot of points $(z_i, \widetilde M_i)$.
The basic idea is that the martingale residuals measure the excess events, the difference between the observed and expected number of events. If we compute the expectation without taking

account of z, then individuals whose z value has f (z) large positive will seem to have a large
number of excess events; and those whose f (z) is large negative will seem to have fewer events
than expected.
A reasonably formal proof may be found in [25], and also reproduced in chapter 4 of [10].

7.7 Outliers and influence


We don’t expect a model to be exactly right, but what does it mean for it to be “close enough”?
An important way a model can go badly wrong is if the model fit is dominated by a very small
number of the observations. This can happen if either there are extreme values of the covariates,
or if an outcome (the time in a survival model) is exceptionally far off of the predicted value. The
former are called high-leverage observations, the latter are outliers. We need tools for identifying
these observations, and determining how much influence they exert over the model fit.
7.7.1 Deviance residuals
The martingale residual for an individual is
observed # events − expected # events
for individual i. In principle, large values indicate outliers — results that individually are
unexpected if the model is true. The problem is, these are highly-skewed variables — they take
values between 1 and −∞ and it is hard to determine what their distribution should look like.
We define the deviance residual for individual i as
$$d_i := \operatorname{sgn}(\widetilde M_i)\Bigl\{-2\bigl[\widetilde M_i + \delta_i\log(\delta_i - \widetilde M_i)\bigr]\Bigr\}^{1/2}. \tag{7.3}$$

This choice of scaling is inspired by the intuition that each di should represent the contribution
of that individual to the total model deviance.1 Whereas the martingale residuals are between
−∞ and 1, the deviance residuals should have a similar range to a standard normal random
variable. Thus, we treat values outside a range of about −2.5 to +2.5 as outliers, potentially
7.7.2 Delta–beta residuals
Recall that the leverage of an individual observation is a measure of the impact of that observation
on the parameters of interest. The “delta–beta” residual for parameter βk and subject i is defined
as
∆βki := β̂k − β̂k(i) ,
where β̂k(i) is the estimate of β̂k with individual i removed. In principle we may compute this
exactly, by recalculating the model n times for n subjects, but this is computationally expensive.
We can approximate it by a combination of the Fisher information and the Schoenfeld residual
(actually, a variant of the Schoenfeld residual called the score residual). We do not give the
formula here, but it may be found in [5, Section 4.3]. These estimates may be called up easily in
R, as described in section 7.8.
Individuals with high values of ∆β should be looked at more closely. They may reveal
evidence of data-entry errors, interactions between different parameters, or just the influence of
extreme values of the covariate. You should particularly be worried if there are high-influence
individuals pushing the parameters of interest in one direction.
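These approximate delta-beta values are also available through resid (see section 7.8), with one row per subject and one column per coefficient; a brief sketch, with fit again standing for a coxph object:

dfb = resid(fit, type = 'dfbeta')
head(order(abs(dfb[, 1]), decreasing = TRUE))   # subjects with the largest influence on the first coefficient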
1 Recall that deviance is defined as
$$D = 2\bigl[\log\text{likelihood(saturated)} - \log\text{likelihood}(\hat\beta)\bigr].$$
Applied to the Cox model, the saturated model is the one where each individual has an individual parameter $\beta_i^*$. It is possible to derive then that the deviance is $\sum_{i=1}^n d_i^2$.

7.8 Residuals in R (non-examinable)


As an object-oriented language, R has functions in place for any well-written package to produce
standard kinds of outputs in an appropriate way. In particular, if fit is the output of a model-
fitting procedure, resid(fit) should produce sensible residuals. If fit is an object of type coxph,
the residuals can be "martingale", "deviance", "score", "schoenfeld", "dfbeta", "dfbetas",
and "scaledsch" (scaled Schoenfeld).
7.8.1 Dutch Cancer Institute (NKI) breast cancer data
We apply these methods to the data on survival of Dutch breast cancer patients collected in the
nki data set, and discussed in [15]. The data are available in the dynpred package in R. The
most interesting thing about the study that these data come from is that it was one of the first
to relate survival of breast cancer patients to gene expression data. The data and some results
are described in [28].
The data frame includes the following covariates:

patnr Patient identification number

d Survival status; 1 = death; 0 = censored

tyears Time in years until death or last follow-up

diameter Diameter of the primary tumor

posnod Number of positive lymph nodes

age Age of the patient

mlratio oestrogen expression level

chemotherapy Chemotherapy used (yes/no)

hormonaltherapy Hormonal therapy used (yes/no)

typesurgery Type of surgery (excision or mastectomy)

histolgrade Histological grade (Intermediate, poorly, or well differentiated)

vasc.invasion Vascular invasion (-, +, or +/-)

We begin by fitting the Cox model, including all the potentially significant covariates:
nki.surv=with(nki, Surv(tyears, d))
nki.cox=with(nki, coxph(nki.surv~posnodes+chemotherapy+hormonaltherapy
    +histolgrade+age+mlratio+diameter+posnodes+vasc.invasion+typesurgery))
> summary(nki.cox)
Call:
coxph(formula = nki.surv ~ posnodes + chemotherapy + hormonaltherapy
    + histolgrade + age + mlratio + diameter + posnodes + vasc.invasion + typesurgery)

  n= 295, number of events= 79

                            coef exp(coef) se(coef)      z Pr(>|z|)
posnodes                 0.07443   1.07727  0.05284  1.409 0.158971
chemotherapyYes         -0.42295   0.65511  0.29769 -1.421 0.155381
hormonaltherapyYes      -0.17160   0.84232  0.44233 -0.388 0.698062
histolgradePoorly diff   0.26550   1.30409  0.28059  0.946 0.344030
histolgradeWell diff    -1.30782   0.27041  0.54782 -2.387 0.016972 *
age                     -0.03937   0.96139  0.01953 -2.016 0.043816 *
mlratio                 -0.75031   0.47222  0.21138 -3.550 0.000386 ***
diameter                 0.01976   1.01996  0.01323  1.493 0.135334
vasc.invasion+           0.60286   1.82733  0.25315  2.381 0.017245 *
vasc.invasion+/-        -0.14580   0.86433  0.49104 -0.297 0.766530
typesurgerymastectomy    0.15043   1.16233  0.24864  0.605 0.545183

We could remove the insignificant covariates stepwise, or use AIC, or some other model-selection method, but let us suppose we have reduced it to the model including just histological grade, vascular invasion, age, and mlratio (the crucial measure of oestrogen-receptor gene expression). We then get
summary(nki.cox)
Call:
coxph(formula = nki.surv ~ histolgrade + age + mlratio + vasc.invasion)

  n= 295, number of events= 79

                           coef exp(coef) se(coef)      z Pr(>|z|)
histolgradePoorly diff  0.42559   1.53050  0.26914  1.581 0.113810
histolgradeWell diff   -1.31213   0.26925  0.54782 -2.395 0.016611 *
age                    -0.04302   0.95790  0.01961 -2.194 0.028271 *
mlratio                -0.72960   0.48210  0.20754 -3.515 0.000439 ***
vasc.invasion+          0.64816   1.91203  0.24205  2.678 0.007410 **
vasc.invasion+/-       -0.04951   0.95169  0.48593 -0.102 0.918840

7.8.2 Complementary log-log plot


The first model diagnostic is to plot the cumulative hazard on a log scale for different values of the
covariate. We show this in Figure 7.4. Figure 7.4(a) shows the cumulative hazard calculated from
the Cox model fit for the 10th, 50th, and 90th percentiles of mlratio. Since this is calculated from
the model, the curves are exact vertical shifts of each other; this is what the plot would look like
in the ideal case. Figure 7.4(b) shows the Nelson–Aalen estimator for the three sub-populations
formed from the upper, middle, and lower tertiles of mlratio. It does not fit perfectly, but it
is not obviously wrong to see them as vertical shifts of one another. Note that the second and
third tertiles are very similar; the big difference is between the lower tertile and the rest. Figure
7.5 shows a similar plot where the population has been stratified by the vascular invasion status.
We see here that the effect of the vascular invasion variable, reflected in the gap between the log cumulative hazard curves, seems to increase over time. The code to generate this plot is

nki.km3=survfit(nki.surv~factor(nki$vasc.invasion))
# myfun: a transformation to the log cumulative hazard scale (not shown here)
plot(nki.km3, mark.time=FALSE, xlab='Time (Years)', ylab='log(cum hazard)',
     main='Vascular invasion', conf.int=FALSE, col=c(2,1,3), fun=myfun, firstx=1)
legend(10, -3, c('-','+','+/-'), col=c(2,1,3), title='Vascular invasion', lwd=2)

[Figure 7.4 here: log cumulative hazard against time (years); panel (a) "NKI survival (Cox model fit)" shows curves for mlratio values −1.2, −0.1, 0.35, panel (b) "NKI survival (Tertiles)" shows curves for the mlratio tertiles.]

(a) Model fit for different mlratio levels (b) Nelson–Aalen estimator for tertiles of mlratio

Figure 7.4: Log cumulative hazards of NKI data.

7.8.3 Andersen plot


We show an Andersen plot for the vascular invasion covariate in Figure 7.6. We have already
observed that the effect seems to increase with time, and this is reflected in the convex shape of
the curve.

7.8.4 Cox–Snell residuals


We compute the Cox–Snell residuals by simply extracting the martingale residuals automatically
computed by the resid function, and subtracting them from the censoring indicator:
nki.mart=residuals(nki.cox)
nki.CS=nki$d-nki.mart

We then test for the goodness of fit by computing a Nelson–Aalen estimator for the residuals,
and plotting the line y = x for comparison.
nki.CStest=survfit(Surv(nki.CS, nki$d)~1)
plot(nki.CStest, fun='cumhaz', mark.time=FALSE, xmax=.8,
     main='Cox--Snell residual plot for NKI data')
abline(0, 1, col=2)

[Figure 7.5 here: "Vascular invasion" plot of log cumulative hazard against time (years), one curve per vascular invasion level (−, +, +/−).]

Figure 7.5: Log cumulative hazards of NKI data for subpopulations stratified by vascular invasion
status.

7.8.5 Martingale residuals


We know that the effect of the gene activity covariate mlratio is highly significant. But does it
fit the Cox linear model assumption? We test this by fitting the model without mlratio, and
then plot the martingale residuals against mlratio, using a local smoother to get a picture of
how the martingale residuals change, on average, with values of this covariate. The code is:
nki.nogen=with(nki, coxph(nki.surv~histolgrade+age+vasc.invasion))
nki.NGmart=resid(nki.nogen, type='martingale')
ord=order(nki$mlrat)
m1=nki.NGmart[ord]
m2=nki$mlrat[ord]
plot(nki$mlratio, nki.NGmart, xlab='mlratio', ylab='Martingale residual')
lines(lowess(m1~m2), lwd=2, col=2)

This produces the result in Figure 7.9.


We see that the effect declines up to a level of about −0.5, and is 0 after that. The
original paper [28] did not introduce mlratio as a log-linear effect, but created a binary variable,
distinguishing between oestrogen-receptor-negative tumours (mlratio< −0.65) and oestrogen-
receptor-positive tumours. The receptor-negative subjects are about 23% of the total. Our
analysis suggests that this is a reasonable approach. This fits very well with our observation
in section 7.8.2 that there was a clear difference between the subjects with the lowest tertile of
mlratio and all the others, but hardly any difference between the upper two tertiles.
7.8.6 Schoenfeld residuals
Schoenfeld residuals are produced in the survival package by the cox.zph command applied to
the output of coxph. The command z=cox.zph(nki.cox) produces the output
                          rho  chisq     p
histolgradePoorly diff -0.0661 0.3629 0.547
histolgradeWell diff    0.1686 2.3944 0.122
age                     0.0313 0.0919 0.762
mlratio                 0.1357 1.5480 0.213
vasc.invasion+          0.0459 0.1664 0.683
vasc.invasion+/-        0.1312 1.3028 0.254
GLOBAL                      NA 9.7256 0.137

[Figure 7.6 here: Andersen plot, cumulative hazard for the "+" group against cumulative hazard for the "−" group.]

Figure 7.6: Andersen plot of NKI data for subpopulations stratified by vascular invasion status.

The column rho gives the correlation of the scaled Schoenfeld residual with time. (That is, each
Schoenfeld residual is scaled by an estimator of its standard deviation). The chisq and p-value
are for a test of the null hypothesis that the correlation is zero, meaning that the proportional
hazards condition holds (with constant β). GLOBAL gives the result of a chi-squared test for the
hypothesis that all of the coefficients are constant. Thus, we may accept the hypothesis that all
of the proportionality parameters are constant in time.
Plotting the output of cox.zph gives a smoothed picture of the scaled Schoenfeld residuals
as a function of time. For example, plot(z[4]) gives the output in Figure 7.10, showing an
estimate of the parameter for mlratio as a function of time. We see that the plot is not perfectly
constant, but taking into account the uncertainty indicated by the confidence intervals, it is
plausible that the parameter is constant. In particular, the covariate histolgradeWell diff
produces the smallest p-value, but from the plot it is clear that the apparent positive slope is
purely an artefact of there being two points with very high leverage right at the end, where the
variance is particularly high. (The estimation of ρ is quite crude, being simply an ordinary least
squares fit, so does not take account of the increased uncertainty at the end.)

[Figure 7.7 here: scaled Schoenfeld residuals, Beta(t), plotted against time for two covariates.]

(a) histolgradeWell diff (b) mlratio

Figure 7.7: Schoenfeld residuals for some covariates in the Cox model fit for the NKI data.

[Figure 7.8 here: "Cox-Snell residual plot for NKI data".]

Figure 7.8: Nelson–Aalen estimator for the Cox–Snell residuals for the NKI data, together with
the line y = x.

[Figure 7.9 here: martingale residuals plotted against mlratio.]

Figure 7.9: Martingale residuals for the model without mlratio, plotted against mlratio, with
a LOWESS smoother in red.

[Figure 7.10 here: Beta(t) for mlratio plotted against time.]

Figure 7.10: Scaled Schoenfeld residuals for the mlratio parameter as a function of time.
Chapter 8

Dynamic prediction (Optional Topic)

Chapter 9

Correlated events and repeated events

9.1 Introduction
The survival models that we have considered depend upon the fundamental assumption that
event times are independent. There are many settings where this assumption is unreasonable:

• Clustered data: There may be multiple observations for a single individual, or for a
group with correlated times within the group. For example, the diabetes data set (in the
SurvCorr package in R) includes data for 197 patients being treated for diabetic retinopathy
— loss of vision due to diabetes. Each patient has two eyes, and all eyes are at risk until the
event (measured serious vision loss). It would be unreasonable to assume that the loss of
vision is independent between the two eyes.
Below is an excerpt from the data set. One peculiarity is that the comparison was made
within individuals, with each individual having one treated and one untreated eye, with
the variable TRT_EYE recording which eye was treated. For more information about the
data set see section 8.4.2 of [26].
> head(diabetes)
  ID LASER TRT_EYE AGE_DX ADULT TIME1 STATUS1 TIME2 STATUS2
1  5     2       2     28     2 46.23       0 46.23       0
2 14     2       1     12     1 42.50       0 31.30       1
3 16     1       1      9     1 42.27       0 42.27       0
4 25     2       2      9     1 20.60       0 20.60       0
5 29     1       2     13     1 38.77       0  0.30       1
6 46     1       1     12     1 65.23       0 54.27       1

• Multiple events: A study may consider times of events that do not remove the subject
from risk of further events. For example, a study of hospitalisation events for an elderly
population will likely see some individuals being hospitalised multiple times.

• Competing events: Right censoring may be understood as a process with “multiple


endpoints” — an individual leaves the study either through an event or through censoring.
There is an asymmetry because we choose to treat the censoring distribution as a nuisance
parameter. But in some studies there are multiple endpoints equally of interest: A medical
study participant may exit through multiple relevant causes of death, or through the
alternatives death or recovery. A participant in a sociological study of marriage may leave
the “cohabiting” state through marriage or dissolution of the partnership. In each of these


cases, there are distinct outcomes, each with its own hazard rate, and each serves as
“censoring” for observation of the others.

9.2 Time-to-first-event analysis


When we are confronted with multiple events for a single individual, one easy approach to
eliminating the complication is to throw out the correlated data. A common approach, called
time-to-first-event analysis, is to define the event of interest to be simply the first event. Later
events for the same individual are simply ignored.
This trivially produces independent observations, and will yield results that are consistent
and unbiased, but at the expense of losing a substantial amount of relevant information from the
data. Of course, this approach is possible only when the correlated events correspond to a single
individual, so that all covariates are identical. It would not apply to an example such as the
diabetic retinopathy study described above, where the two correlated events for one individual
differ in their treatment group membership.

9.3 Clustered data


9.3.1 Stratified baseline
If there are a small number of large correlated groups of survival times, we may represent the
correlation within the groups by using a semiparametric model and stratifying the baseline
hazard by group. Note that there is no way to distinguish between a cluster of survival times
being “dependent”, and times sharing a group-determined hazard function.
Suppose we have $k$ categories of individuals, $c_i \in \{1, \ldots, k\}$, each with its own baseline hazard $h_0^{(c)}(t)$, so that an individual in category $c_i$ with covariates $x_i(t)$ has hazard
$$h_i(t) = h_0^{(c_i)}(t)\, r(\beta, x_i(t))$$
at time $t$. Then we have a partial likelihood for the observation that individual $i_j$ had the unique event at time $t_j$:
$$L_P(\beta) = \prod_{t_j} \frac{r(\beta, x_{i_j}(t_j))}{\sum_{i \in R_j \colon c_i = c_{i_j}} r(\beta, x_i(t_j))}.$$

The only change, compared with the standard partial likelihood given in (5.5), is that the risk set in the denominator is restricted to individuals in the same stratum as the one who had the event.
As an example, we consider a data set based on the NHANES (National Health and Nutrition
Examination Survey) wave 3, from which we have measures of systolic and diastolic blood pressure and about
15 years of survival follow-up. It is certainly not the case that men and women, or different ethnic
groups, have the same baseline mortality rate. We could analyse the effect of blood pressure
on mortality in the groups separately, but that will produce six different parameter estimates,
each of which will be subject to more random noise. It is at least plausible that we would like to
produce a joint estimate of the influence of blood pressure on mortality, that we suppose acts in
approximately the same way against the background of distinct baseline mortality.
Call:
coxph(formula = with(nhanesC, Surv(age, age + yrsfu, eventhrt)) ~
    meandias + meansys + strata(race) + strata(female), data = nhanesC)

              coef exp(coef)  se(coef)      z      p
meandias -0.006495  0.993527  0.002763 -2.351 0.0187
meansys   0.011528  1.011595  0.001320  8.733 <2e-16

Likelihood ratio test=75.9  on 2 df, p=< 2.2e-16
n= 15295, number of events= 1459

If we plot the output of the survfit function applied to this result, we get the picture in Figure
9.1, showing six different baseline survival functions.
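For reference, the plot can be produced along the following lines (nhanes.fit stands for the stratified coxph fit above; the ordering of the strata in the legend should be checked against the actual fit):

plot(survfit(nhanes.fit), col = 1:6, xlab = "Age", ylab = "Survival")
legend("bottomleft", c("black male", "black female", "white male",
       "white female", "other male", "other female"), col = 1:6, lty = 1)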

Figure 9.1: Baseline survival estimates for different population groups from the NHANES data (black, white, and other, by male and female).

9.3.2 The sandwich estimator for variance


In an example like the diabetic retinopathy study of course there is no possibility of allowing each
“cluster” of two eyes its own baseline hazard. The clustering does not substantially change the
parameter estimates, but it does affect the estimation of variance, and hence of the confidence
interval for the parameters. In the extreme case, imagine that we had a data set in which each
individual's survival time was repeated multiple times. More observations would ordinarily reduce the
variance estimate; but we would want a procedure that recognises that the duplicated
observations do not actually provide additional information, and that therefore returns the
same variance estimate as the data set without duplication.
The standard approach in such cases is to replace the Fisher-information-based estimate for
variance $J_n(\hat\beta)^{-1}$ by the sandwich estimator
$$J_n(\hat\beta)^{-1} V_n(\hat\beta) J_n(\hat\beta)^{-1}, \tag{9.1}$$

where Vn (β̂) is an estimate of the variance-covariance matrix of the score function.



We describe here how to calculate $V_n$ by an explicit formula for the case of Cox regression. We
begin with the case of independent individual observations. (This
is not directly relevant for the case of clustered observations, but it is relevant for making robust
variance estimates for other kinds of model misspecification.) Differentiating the log partial
likelihood yields the score function, which is the sum of the Schoenfeld residuals:
$$U(\beta) = \sum_{j=1}^{k} \bigl( x_{i_j} - \bar{x}(t_j) \bigr).$$

To create a variance estimate we rewrite this as
$$U(\beta) = \sum_{i=1}^{n} \delta_i x_i(T_i) - \sum_{i=1}^{n} \sum_{t_j \le T_i} x_i(t_j)\, e^{\beta \cdot x_i(t_j)}\, d\hat{H}_0(t_j, \beta)$$
$$= \sum_{i=1}^{n} \Bigl[ \delta_i \bigl( x_i(T_i) - \bar{x}(T_i, \beta) \bigr) - \sum_{t_j \le T_i} \bigl( x_i(t_j) - \bar{x}(t_j, \beta) \bigr) e^{\beta \cdot x_i(t_j)}\, d\hat{H}_0(t_j, \beta) \Bigr],$$
where $\bar{x}(t_j, \beta)$ is the weighted mean of the covariates at time $t_j$,
$$\bar{x}(t_j, \beta) = \frac{\sum_{i \in R_j} x_i\, e^{\beta \cdot x_i(t_j)}}{\sum_{i \in R_j} e^{\beta \cdot x_i(t_j)}},$$
and
$$d\hat{H}_0(t_j, \beta) = \frac{1}{\sum_{i \in R_j} e^{\beta \cdot x_i(t_j)}}.$$
For large $n$, if $\beta = \beta_*$ is the limit parameter, $d\hat{H}_0(t, \beta_*)$ will be approximately the increment of the true baseline cumulative hazard, and by the Law of Large Numbers
$$\bar{x}(t, \beta_*) \approx s(t) := \mathbb{E}\bigl[ \bar{x}(t, \beta_*) \bigr].$$
So if $\beta = \hat\beta$, the $i$-th summand will be approximately (to first order in the error $\hat\beta - \beta_*$)
$$U_i(\hat\beta) = \delta_i \bigl( x_i(T_i) - \bar{x}(T_i, \hat\beta) \bigr) - \sum_{t_j \le T_i} \bigl( x_i(t_j) - \bar{x}(t_j, \hat\beta) \bigr) e^{\hat\beta \cdot x_i(t_j)}\, d\hat{H}_0(t_j, \hat\beta) \tag{9.2}$$
$$\approx \delta_i \bigl( x_i(T_i) - s(T_i) \bigr) - \int_0^{T_i} \bigl( x_i(t) - s(t) \bigr) e^{\beta_* \cdot x_i(t)}\, h_0(t)\, dt.$$

The first expression is what we use to calculate $U_i$, but the second makes clear (since it depends
only on the observation $T_i$ and $\delta_i$) that these are (approximately) i.i.d. Since they have expectation
0, it follows that
$$V_n(\hat\beta) = \sum_{i=1}^{n} U_i(\hat\beta)\, U_i(\hat\beta)^T$$
is a consistent estimator for the variance-covariance matrix of the score function. (The mathematical details of the proof may be found in [20].)
The case of clustered observations — $C$ clusters of observations, with $n_c$ individuals in cluster $c$,
where the separate clusters may be interpreted as independent and identically distributed — is
considered in [19]. In this case, there is correlation within a cluster, but the terms corresponding
to individuals in different clusters are independent, so we may take
$$V_n(\hat\beta) = \sum_{c=1}^{C} u_c(\hat\beta)\, u_c(\hat\beta)^T,$$
where
$$u_c(\hat\beta) = \sum_{i=1}^{n_c} \delta_{ic} \bigl\{ x_{ic}(T_{ic}) - \bar{x}(T_{ic}, \hat\beta) \bigr\} - \sum_{i=1}^{n_c} \sum_{t_j \le T_{ic}} \bigl\{ x_{ic}(t_j) - \bar{x}(t_j, \hat\beta) \bigr\} e^{\hat\beta \cdot x_{ic}(t_j)}\, d\hat{H}_0(t_j).$$

(Here we are using the subscript ic to denote the i-th individual in cluster c.) It is not necessary
to memorise this formula (it is not examinable), but it is important to know that such a formula
exists, and needs to be applied whenever one is analysing clustered data. It will automatically
be applied in R if the coxph function is applied with an extra term in the formula of the form
+cluster(group). Commonly the clustering is by individual identifying number, so the term
becomes +cluster(id).
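As a minimal sketch (the data frame dat, its columns, and the grouping variable id are all hypothetical), the robust variance is obtained as follows:

library(survival)
# Hypothetical clustered data: one row per observation, with 'id' identifying
# the cluster (e.g. the patient, when there are several observations per patient)
fit.naive  <- coxph(Surv(time, status) ~ x, data = dat)
fit.robust <- coxph(Surv(time, status) ~ x + cluster(id), data = dat)
summary(fit.robust)   # shows both the model-based se(coef) and the sandwich
                      # "robust se"; only the latter allows for the clustering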

9.4 Multiple events


9.4.1 The Poisson model
The simplest model for repeated events is the Poisson model, taking the intensity for each
individual to be a fixed constant $\alpha$ when at risk. Assuming changes in the at-risk process are
independent of the counting process, the number of events observed for each individual, and in
total, will be Poisson distributed, conditioned on the total time at risk. If we observe individual
$i$ for a total time $T_i = \int_0^\infty Y_i(t)\,dt$, and observe $N = N(\infty)$ events in total, then the log likelihood is
$$\ell(\alpha) = N \log \alpha - \alpha \sum_i T_i,$$
with the MLE
$$\hat\alpha = \frac{N}{\sum_i T_i}.$$
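As a minimal sketch of this calculation in R (the event counts and at-risk times below are made-up numbers), the MLE and an approximate standard error derived from the Fisher information are:

# Hypothetical event counts and total at-risk times for five individuals
N  <- c(0, 2, 1, 0, 3)
Ti <- c(1.2, 2.5, 0.8, 1.0, 3.1)
alpha.hat <- sum(N) / sum(Ti)        # MLE of the common event rate
se.alpha  <- sqrt(sum(N)) / sum(Ti)  # approximate SE from the Poisson Fisher information
c(alpha.hat, se.alpha)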
9.4.2 The Poisson regression model
A simple generalisation would be to say that each individual has a Poisson number of events, with
the intensity being constant during the time when that individual is at risk, and a function — to
be determined — of some measured covariates. If individual $i$ is at risk for total time $T_i$, the
number of events $N_i$ is Poisson distributed with parameter $T_i\,\alpha(x_i)$. Conditioned on the time at
risk, the log likelihood is then
$$\ell(\alpha) = \sum_{i=1}^{n} \bigl( N_i \log \alpha(x_i) - T_i\, \alpha(x_i) \bigr).$$

The most common parametric form is $\alpha = \exp(\beta \cdot x)$, where $\beta = (\beta_0, \ldots, \beta_p)$, and we take
$x_{i0} \equiv 1$. The log likelihood then becomes
$$\ell(\beta) = \sum_{k=0}^{p} \beta_k \sum_{i=1}^{n} N_i x_{ik} - \sum_{i=1}^{n} T_i\, e^{\beta \cdot x_i}. \tag{9.3}$$
The MLE then satisfies the equations
$$\sum_{i=1}^{n} N_i x_{ik} = \sum_{i=1}^{n} x_{ik} T_i\, e^{\hat\beta \cdot x_i}, \qquad k = 0, \ldots, p.$$

This fits into the framework of GLM (generalised linear model), and may be fit in R using
any of the standard GLM functions. Note that we are modelling Ni ∼ Po(µi ), where

log µi = log Ti + β · xi . (9.4)

We call log Ti an offset in the model.


As an example, we consider data from a trial to determine the effect of nutritional supplements
on prisoners’ rate of disciplinary offenses. There were 771 prisoners, observed for anywhere
from two weeks to half a year for a baseline period, after which half were randomly given the
supplements, the others were given placebos, and they were observed (and offences recorded) for
a variable period, mostly at least 10 weeks. We consider here only the treatment (second) period,
and try to model the effect of the treatment, and of the difference in rates between prisons (which
we call here A, B, and C). Thus our model is

log α = β0 + β1 1{treatment} + β2 1{prison B} + β3 1{prison C}.

# sdb0 = start date baseline, sdt0 = start date treatment, edt0 = end date
# base.count = # events baseline, treat.count = # events treatment
# pgroup = indicator of active treatment
risktime = edt0 - sdt0
poisreg = glm(treat.count ~ pris + pgroup, family = poisson, offset = log(risktime))
summary(poisreg)
Call:
glm(formula = treat.count ~ pris + pgroup, family = poisson,
    offset = log(risktime))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.5092  -1.5806  -0.7360   0.4907   8.3629

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.39045    0.05971 -40.031   <2e-16 ***
prisB       -0.99509    0.06611 -15.051   <2e-16 ***
prisC       -1.99417    0.07069 -28.212   <2e-16 ***
pgroup      -0.10239    0.05051  -2.027   0.0426 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2887.0  on 770  degrees of freedom
Residual deviance: 2133.2  on 767  degrees of freedom
AIC: 3349.5

Number of Fisher Scoring iterations: 6

This fitted model gives us a predicted expected number of events for each individual. The
difference between the observed number of events and the expected number predicted by the
model is the residual. In Figure 9.2 we plot the residuals against the fitted values. (This is the
automatic output of the command plot(poisreg), where poisreg is the output of the glm fit
above.) The residuals are far more spread out than the Poisson model would lead us to expect.
This would be obvious from a casual examination of the data: the mean number of events is
about 2, but some individuals have as many as 25, which is not something you would see in a
Poisson distribution. These data are over-dispersed, meaning that their variance is higher than it
would be for a Poisson distribution of the same mean.
We also note that the deviance residuals (which should be approximately standard normal
distributed if the model is correct) range from −4.5 to +8.36. The sum of their squares, called
the residual deviance, is 2133.2, which is much too large for a chi-squared variable on 767 degrees
of freedom.
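A quick check of this in R (poisreg is the fitted model from the listing above):

# Under the Poisson model the residual deviance is roughly chi-squared on its
# residual degrees of freedom; the upper tail probability here is effectively zero
pchisq(deviance(poisreg), df.residual(poisreg), lower.tail = FALSE)
# equivalently, pchisq(2133.2, 767, lower.tail = FALSE)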

Figure 9.2: Residual plot (residuals against fitted values) for the prison-data Poisson regression.

9.4.3 The Andersen–Gill model


The Poisson regression model makes sense if we believe the event intensity is constant while an
individual is at risk. Another popular generalisation of the Poisson model, introduced in 1982 by Andersen
and Gill [3], is a semi-parametric relative-risk regression model, essentially equivalent to the Cox
proportional hazards regression model.
an individual will not, in general, become 0 after an event. Partial likelihood is defined exactly
as in (5.5), and Breslow’s formula still defines an estimate of cumulative intensity (rather than
cumulative hazard).

We can fit the model in R by using the coxph command. All we need to do is to represent
the data appropriately in a Surv object. To do this, the record for an individual gets duplicated,
with one row for each event time or censoring time. An event time will be the “stop” time in one
row, and will then be the “start” time in the next row. The covariates will repeat from row to
row for the same individual.
The model assumes that differences between individuals are completely described by the
relative-risk function determined by their covariates. If we are unsure — as we generally will be
— we can robustify the variance estimates as in section 9.3.2 by adding a +cluster(id) term.
Alternatively, we can add a hidden frailty term to the model, as described below in section 9.5.
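A minimal sketch of the counting-process data layout and the corresponding fit (the data frame ag.dat and its values are hypothetical; compare the bladder2 analysis in section 9.7):

library(survival)
# One row per (start, stop] interval: a subject with events at times 3 and 7,
# censored at 10, contributes the rows (0,3], (3,7], (7,10] with event = 1, 1, 0
ag.dat <- data.frame(id    = c(1, 1, 1, 2, 3, 3),
                     start = c(0, 3, 7, 0, 0, 2),
                     stop  = c(3, 7, 10, 8, 2, 9),
                     event = c(1, 1, 0, 0, 1, 0),
                     x     = c(0.5, 0.5, 0.5, -1.2, 0.3, 0.3))
ag.fit <- coxph(Surv(start, stop, event) ~ x + cluster(id), data = ag.dat)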

9.5 Shared frailty model


One way to deal with the correlation among multiple events for the same individual (or for linked
individuals) is by explicitly modelling the variation in hazard rate with a random effect, generally
called a frailty term in the survival context. The most common version is a relative risk term
eωgroup , where ωgroup is an unobserved covariate with distribution assumed to have a particular
form, usually either gamma or Gaussian.
9.5.1 Negative-binomial model
The simplest version of the frailty model generalises the Poisson model: Individuals accrue events
at a constant rate, but with the unknown constant dependent on the individual. For example,
the Poisson model doesn’t really make much sense in the example discussed in section 9.4.2.
Individuals may be presumed to have differing predispositions to offend. Thus, it is not surprising
that the number of offences is more spread out than you would expect under the Poisson model,
which posits that everyone offended at the same rate.
We may generalise the Poisson regression model to better fit overdispersed data by adding a
frailty term. That is, in place of (9.4) we represent the individual intensity by

log µi = log λi + log Ti + β · xi . (9.5)

The term λi , called a multiplicative frailty, represents the individual relative rate of producing
events. The λi are treated as random effects, meaning that they are not to be estimated
individually — which would not make sense — but rather, they are taken to be i.i.d. samples
from a simple parametric distribution. When the frailty λ has a gamma distribution (with
parameters (θ, θ), because we conventionally take the frailty distribution to have mean 1), and
$N$ is a Poisson count conditioned on $\lambda$ with mean $\lambda\alpha$, then $N$ has probability mass function
$$P(N = n) = \frac{\Gamma(n+\theta)}{n!\,\Gamma(\theta)} \left(\frac{\theta}{\theta+\alpha}\right)^{\theta} \left(\frac{\alpha}{\theta+\alpha}\right)^{n},$$

which is the negative binomial distribution with parameters θ and α/(θ + α). (The calculation is
left as an exercise.) Therefore this is called the negative binomial regression model. We can fit it
with the glm.nb command (from the MASS package) in R. If we apply it to the same data as before, we get the following
output:
> summary(poisreg2)

Call:
glm.nb(formula = treat.count ~ pris + pgroup + offset(log(risktime)),
    init.theta = 0.8418678047, link = log)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1179  -1.2228  -0.4695   0.2785   3.6767

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.28974    0.17977 -12.737  < 2e-16 ***
prisB       -1.08624    0.19047  -5.703 1.18e-08 ***
prisC       -2.08056    0.18716 -11.117  < 2e-16 ***
pgroup      -0.15331    0.09984  -1.536    0.125
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.8419) family taken to be 1)

    Null deviance: 956.27  on 770  degrees of freedom
Residual deviance: 755.61  on 767  degrees of freedom
AIC: 2645.4

Number of Fisher Scoring iterations: 1

              Theta:  0.8419
          Std. Err.:  0.0781

 2 x log-likelihood:  -2635.3550

We note that, while the largest deviance residual of 3.68 suggests a possible outlier, the total
residual deviance is now quite plausible.
9.5.2 Frailty in proportional hazards models
We can use shared frailty to account for correlated times in proportional hazards regression,
whether these are unordered (clustered) times or recurrent events. The model is fit numerically
exactly like any other random-effects model: We treat the individual unknown
frailties as unobserved data, whose expected values given the observed data may be calculated.
Given the individual frailties, we may maximise over the parameters, and so loop through the EM
algorithm. The calculations are carried through automatically by the coxph function in R, as
long as we add a + frailty(id) term (or whichever variable we are grouping by) to the formula.
The output will include a p-value for the frailty term.
Is the frailty term actually appropriate to the data? We may test the null hypothesis that
there is no individual frailty with a likelihood ratio test. The null-hypothesis log-likelihood is
simply the log partial likelihood for a traditional model without a frailty term. The alternative log
likelihood is the log of the integrated partial likelihood — that is, integrated over the distribution
of the frailty — called the I-likelihood in the R output.
Note that the model fit automatically produces estimates of the individual frailties. If desired,
these may be used for individualised survival projections.
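A minimal sketch (the data frame dat and the grouping variable id are hypothetical):

library(survival)
# Shared gamma frailty for each level of id
fr.gamma <- coxph(Surv(time, status) ~ x + frailty(id), data = dat)
# Or a Gaussian frailty instead
fr.gauss <- coxph(Surv(time, status) ~ x + frailty(id, distribution = "gaussian"),
                  data = dat)
print(fr.gamma)   # reports the variance of the random effect and the I-likelihood
fr.gamma$frail    # per-group random effect estimates (the frail component of the
                  # fit), which could be used for individualised projections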

9.6 Example: The diabetic retinopathy data set


We analyse the data set consisting of 197 pairs of eyes, where one of each has been treated, for
the proportional effect of treatment on the hazard rate of vision loss. We first create a new data
set by doubling the original, creating one entry for each eye in the original data set.

Diabetes <- data.frame(id = rep(ID, 2), time = c(TIME1, TIME2),
                       status = c(STATUS1, STATUS2),
                       adult = (rep(ADULT, 2) == 2),
                       trt = c(TRT_EYE == 1, TRT_EYE == 2))

Now we fit the Cox model:


coxph(formula = Surv(time, status) ~ trt + adult + cluster(id),
    data = Diabetes)

  n= 394, number of events= 155

              coef exp(coef) se(coef) robust se      z Pr(>|z|)
trtTRUE   -0.28091   0.75509  0.16167   0.14527 -1.934   0.0531 .
adultTRUE  0.02401   1.02431  0.16196   0.17399  0.138   0.8902
---

          exp(coef) exp(-coef) lower .95 upper .95
trtTRUE      0.7551     1.3243    0.5680     1.004
adultTRUE    1.0243     0.9763    0.7283     1.441

Concordance= 0.534  (se = 0.023)
Rsquare= 0.008  (max possible= 0.988)
Likelihood ratio test= 3.06  on 2 df,   p=0.2
Wald test            = 3.74  on 2 df,   p=0.2
Score (logrank) test = 3.06  on 2 df,   p=0.2,   Robust = 3.72  p=0.2

  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).

Note that in this case, because of the paired design, the robust SE is actually smaller than the
model-based (inverse-information) SE. If we had stratified instead of clustering we would obtain:
coxph(formula = Surv(time, status) ~ trt + adult + strata(id),
    data = Diabetes)

             coef exp(coef) se(coef)     z     p
trtTRUE   -0.2122    0.8088   0.1813 -1.17 0.242
adultTRUE      NA        NA   0.0000    NA    NA

Likelihood ratio test=1.38  on 1 df, p=0.2407
n= 394, number of events= 155

Note that the standard error is increased substantially relative to the model-based estimate, and
that the adult coefficient cannot be estimated at all, since adult is constant within each stratum (pair).
If we instead fit a gamma-frailty model to account for the correlation between the two eyes, we
get a very similar result to that obtained from the clustered model:
coxph(formula = Surv(time, status) ~ trt + adult + frailty(id),
    data = Diabetes)

                coef se(coef) se2     Chisq    DF   p
trtTRUE     -0.31173 0.16501  0.16265  3.56886  1.0 0.059
adultTRUE    0.00239 0.20242  0.16341  0.00014  1.0 0.991
frailty(id)                           81.76565 65.1 0.079

Iterations: 7 outer, 31 Newton-Raphson
     Variance of random effect= 0.587   I-likelihood = -863
Degrees of freedom for terms= 1.0  0.7 65.1
Likelihood ratio test=138  on 66.7 df, p=7e-07
n= 394, number of events= 155

Fitting a Gaussian frailty produces only a slight change:


coxph(formula = Surv(time, status) ~ trt + adult + frailty(id,
    distribution = "gaussian"), data = Diabetes)

                coef se(coef) se2     Chisq    DF   p
trtTRUE     -0.30731 0.16510  0.16268  3.46479  1.0 0.063
adultTRUE    0.01567 0.19626  0.16241  0.00637  1.0 0.936
frailty(id, distribution            78.53744 57.2 0.032

Iterations: 6 outer, 22 Newton-Raphson
     Variance of random effect= 0.547
Degrees of freedom for terms= 1.0  0.7 57.2
Likelihood ratio test=136  on 58.9 df, p=5e-08
n= 394, number of events= 155

9.7 Example: Bladder cancer data set


We follow the discussion in section 8.5.4 of [26] of a famous data set on recurrence of bladder
cancer. The data describe the time until up to four recurrences for 85 patients. The version
we use is included in the survival package as the object bladder2. We wish to apply a Cox
regression analysis to evaluate the effect of the variables rx (treatment: 1=placebo, 2=active),
size (size in cm of largest tumour), and number (number of tumours), on the time (in months)
to recurrence.
The recurrences are numbered successively by the enum variable, so if we include only those
with enum==1 we will have a time-to-first-event analysis:
coxph(formula = Surv(start, stop, event) ~ rx + number + size,
    data = bladder2, subset = (enum == 1))

           coef exp(coef) se(coef)      z      p
rx     -0.52598   0.59097  0.31583 -1.665 0.0958
number  0.23818   1.26894  0.07588  3.139 0.0017
size    0.06961   1.07209  0.10156  0.685 0.4931

Likelihood ratio test=9.92  on 3 df, p=0.01927
n= 85, number of events= 47

Including all the events in an Andersen–Gill model increases the number of events from
47 to 112. Naïvely we might expect the standard errors to be reduced by a factor of about
$\sqrt{47/112} = 0.648$, reducing the SE of the number coefficient from 0.076 to about 0.049, and the SE
of the rx coefficient from 0.316 to about 0.205. Carrying out the calculation yields
coxph(formula = Surv(start, stop, event) ~ rx + number + size +
    cluster(id), data = bladder2)

           coef exp(coef) se(coef) robust se      z       p
rx     -0.46469   0.62833  0.19973   0.26556 -1.750 0.08015
number  0.17496   1.19120  0.04707   0.06304  2.775 0.00551
size   -0.04366   0.95728  0.06905   0.07762 -0.563 0.57376

Likelihood ratio test=17.52  on 3 df, p=0.0005531
n= 178, number of events= 112

The se(coef) output is exactly what we predicted, but the robust se is substantially larger,
due to correlation among the observations. The actual reduction in SE is not the 35% that would
have been produced if the additional recurrences had been independent, but only about 17%. It
is as though we had only about 15 additional independent observations, rather than the 65 that
we might have naïvely supposed.
Applying a gamma frailty again yields a very similar result:
coxph(formula = Surv(start, stop, event) ~ rx + number + size +
    frailty(id), data = bladder2)

               coef se(coef) se2    Chisq   DF   p
rx          -0.6077 0.3330   0.2197  3.3295  1.0 0.06805
number       0.2387 0.0932   0.0570  6.5567  1.0 0.01045
size        -0.0215 0.1140   0.0717  0.0357  1.0 0.85009
frailty(id)                         82.9157 45.4 0.00056

Iterations: 6 outer, 29 Newton-Raphson
     Variance of random effect= 1.08   I-likelihood = -436.8
Degrees of freedom for terms= 0.4 0.4 0.4 45.4
Likelihood ratio test=144  on 46.6 df, p=8e-12
n= 178, number of events= 112

And with Gaussian frailty:


coxph(formula = Surv(start, stop, event) ~ rx + number + size +
    frailty(id, distribution = "gaussian"), data = bladder2)

               coef se(coef) se2    Chisq   DF   p
rx          -0.5612 0.3161   0.2175  3.1517  1.0 0.0758
number       0.2259 0.0835   0.0534  7.3248  1.0 0.0068
size        -0.0202 0.1065   0.0711  0.0358  1.0 0.8499
frailty(id, distribution           87.8815 37.8 7.2e-06

Iterations: 7 outer, 27 Newton-Raphson
     Variance of random effect= 0.89
Degrees of freedom for terms= 0.5 0.4 0.4 37.8
Likelihood ratio test=138  on 39.1 df, p=6e-13
n= 178, number of events= 112
Bibliography

[1] Odd O. Aalen. “Further results on the non-parametric linear regression model in survival
analysis”. In: Statistics in Medicine 12 (1993), pp. 1569–88.
[2] Odd O. Aalen et al. Survival and Event History Analysis: A process point of view. Springer
Verlag, 2008.
[3] Per Kragh Andersen and Richard D. Gill. “Cox’s regression model for counting processes: a
large sample study”. In: Annals of Statistics 10.4 (1982), pp. 1100–1120.
[4] George Leclerc Buffon. Essai d’arithmétique morale. 1777.
[5] David Collett. Modelling survival data in medical research. 3rd. Chapman & Hall/CRC,
2015.
[6] David R. Cox. “Regression Models and Life-Tables”. In: Journal of the Royal Statistical
Society. Series B (Methodological) 34.2 (1972), pp. 187–220.
[7] CT4: Models Core Reading. Faculty & Institute of Actuaries, 2006.
[8] Stephen H. Embury et al. “Remission Maintenance Therapy in Acute Myelogenous Leukemia”.
In: The Western Journal of Medicine 126 (Apr. 1977), pp. 267–72.
[9] Gregory M. Erickson et al. “Tyrannosaur Life Tables: An example of nonavian dinosaur
population biology”. In: Science 313 (2006), pp. 213–7.
[10] Thomas R. Fleming and David P. Harrington. Counting Processes and Survival Analysis.
Wiley, 1991.
[11] A. J. Fox. English Life Tables No. 15. Office of National Statistics. London, 1997.
[12] Benjamin Gompertz. “On the Nature of the function expressive of the law of human mortality
and on a new mode of determining life contingencies”. In: Philosophical transactions of the
Royal Society of London 115 (1825), pp. 513–85.
[13] Peter Hall and Christopher C. Heyde. Martingale Limit Theory and its Application. New
York, London: Academic Press, 1980.
[14] David P. Harrington and Thomas R. Fleming. “A Class of Rank Test Procedures for
Censored Survival Data”. In: Biometrika 69.3 (Dec. 1982), pp. 553–66.
[15] Hans C. van Houwelingen and Theo Stijnen. “Cox Regression Model”. In: Handbook of
Survival Analysis. Ed. by John P. Klein et al. CRC Press, 2014. Chap. 1, pp. 5–26.
[16] Edward L Kaplan and Paul Meier. “Nonparametric estimation from incomplete observations”.
In: Journal of the American Statistical Association 53.282 (1958), pp. 457–481.
[17] Kathleen Kiernan. “The rise of cohabitation and childbearing outside marriage in western
Europe”. In: International Journal of Law, Policy and the Family 15 (2001), pp. 1–21.


[18] John P. Klein and Melvin L. Moeschberger. Survival Analysis: Techniques for Censored
and Truncated Data. 2nd ed. Springer, 2003.
[19] Eric W Lee et al. “Cox-type regression analysis for large numbers of small groups of
correlated failure time observations”. In: Survival analysis: State of the art. Springer, 1992,
pp. 237–247.
[20] Danyu Y Lin and Lee-Jen Wei. “The robust inference for the Cox proportional hazards
model”. In: Journal of the American statistical Association 84.408 (1989), pp. 1074–1078.
[21] A. S. Macdonald. “An actuarial survey of statistical models for decrement and transition
data. I: Multiple state, Poisson and binomial models”. In: British Actuarial Journal 2.1
(1996), pp. 129–55.
[22] Kyriakos S. Markides and Karl Eschbach. “Aging, Migration, and Mortality: Current Status
of Research on the Hispanic Paradox”. In: Journals of Gerontology: Series B 60B (2005),
pp. 68–75.
[23] Rupert G. Miller et al. Survival Analysis. Wiley, 2001.
[24] Richard Peto and Julian Peto. “Asymptotically Efficient Rank Invariant Test Procedures”.
In: Journal of the Royal Statistical Society. Series A (General) 135.2 (1972), pp. 185–207.
[25] Terry M. Therneau et al. “Martingale-based residuals for survival models”. In: Biometrika
77.1 (1990), pp. 147–60.
[26] Terry M Therneau and Patricia M Grambsch. Modeling survival data: Extending the Cox
model. Springer Science & Business Media, 2013.
[27] Anastasios Tsiatis. “A nonidentifiability aspect of the problem of competing risks”. In:
Proceedings of the National Academy of Sciences 72.1 (1975), pp. 20–22.
[28] Marc J Van De Vijver et al. “A gene-expression signature as a predictor of survival in
breast cancer”. In: New England Journal of Medicine 347.25 (2002), pp. 1999–2009.
[29] Kenneth W. Wachter. Essential Demographic Methods. Harvard University Press, 2014.
Appendix A

Problem sheets


A.1 Revision, lifetime distributions, Lexis diagrams and the census approximation
Questions 1–3 are to be done for discussion in class. Questions 4 and 5 are to be handed in for
marking.

1. (a) Let L1 , . . . , Ln be independent Exp(λ) random variables. Show that the maximum
likelihood estimator for λ is given by
n
λ̂ = . (A.1)
L1 + . . . + Ln

(b) The following data resulted from a life test of refrigerator motors (hours to burnout):

Hours to burnout
104.3 158.7 193.7 201.3 206.2
227.8 249.1 307.8 311.5 329.6
358.5 364.3 370.4 380.5 394.6
426.2 434.1 552.6 594.0 691.5
i. Assuming refrigerator motors have Exp(λ) lifetimes, give the maximum likelihood
estimate for λ.
ii. Still assuming Exp(λ) lifetimes, calculate the Fisher information and construct
approximate 95% confidence intervals for λ and 1/λ using the approximate Normal
distribution of the maximum likelihood estimator.
iii. Still assuming Exp(λ) lifetimes, show that $2n\lambda/\hat\lambda \sim \chi^2_{2n}$. Let $a$ be such that
$P(2n\lambda/\hat\lambda \le a) = \alpha/2$ and $b$ such that $P(2n\lambda/\hat\lambda \ge b) = \alpha/2$. Deduce an exact 95%
confidence interval for $1/\lambda$.
iv. Produce a histogram of the data and comment.
v. Merge columns of your histogram appropriately to test whether the hypothesis of
Exp(λ) lifetimes can be rejected. Use a χ2 goodness of fit test.

2. (a) Let T1 , . . . , Tm be independent continuous nonnegative random variables with hazard


functions h1 (·), . . . , hm (·). Prove that T = min(T1 , . . . , Tm ) has hazard function
h1 (·) + . . . + hm (·).
(b) Let T1 , . . . Tm be independent random variables with Weibull distributions with rate
parameters k1 , . . . km and common exponent n. Prove that T = min(T1 , . . . , Tm ) also
has a Weibull distribution with exponent n.
(c) Calculate the hazard function of the truncated exponential distribution with maximal
age ω, and calculate the limit (in distribution) as λ ↓ 0.
3. Let $\lambda$ be any positive function and $\Lambda(t) = \int_0^t \lambda(s)\,ds$. Suppose $X$ is a random variable with
exponential distribution with parameter 1. Show that $\Lambda^{-1}(X)$ is a random variable with
hazard rate $\lambda(t)$.

4. The survival times (in days after transplant) for the original n = 69 members of the
Stanford Heart Transplant Program were as follows:

Survival time after heart transplant (days)


15 3 624 46 127 64 1350 280
23 10 1024 39 730 136 1775 1
836 60 1536 1549 54 47 51 1367
1264 44 994 51 1106 897 253 147
51 875 322 838 65 815 551 66
228 65 660 25 589 592 63 12
499 305 29 456 439 48 297 389
50 339 68 26 30 237 161 14
167 110 13 1 1

The aim of this exercise is to construct the associated lifetable.

(a) Complete the following table of counts $d_x$ of associated curtate residual lifetimes (in
years = 365 days), counts $\ell_x$ of subjects alive exactly $x$ years after their transplant, and
total time $\tilde\ell_x$ spent alive between $x$ and $x+1$ years after their transplant, by all
subjects:

x      0       1       2      3      4
d_x    8       4       3
ℓ_x
ℓ̃_x   19.148  10.203  4.937  1.315
(b) Calculate the maximum likelihood estimators $\hat q_x^{(0)}$ and $\hat q_x$ for $q_x$, $x = 0, \ldots, 4$, based
on the discrete and continuous method, respectively.
(c) Calculate the maximum likelihood estimates. Comment on the differences.
(d) Estimate the probability to survive for 3 months
i. assuming fractional and integer parts of lifetimes are independent, and the
fractional part is uniform;
ii. assuming the force of mortality is constant over the first year;
iii. directly from the data (the total time spent alive until three months after the
transplant is 12.58 years). Hint: You may, of course, guess formulas to test your
intuition, but you should then state your assumptions and apply the discrete
and/or continuous method to justify your estimates as maximum likelihood
estimates.

5. Review the material in on the census approximation (section 2.11.2). For purposes of your
sketches, you may assume the “census date” is the start of the year (1 January). Suppose
we have census counts from the years K, K + 1, . . . , K + N .

(a) Denote by Pk,t the number of lives under observation, aged k (last birthday), at any
time t.
i. On a Lexis diagram, sketch the region where you would find the deaths of
individuals who died in year t aged k years old; and the region where you would
find the deaths of individuals who died in year t + 1 aged k + 1 years old. Also
sketch sample lifelines for such people, indicating when these individuals might
have been born.
ii. Given $n$ individuals at risk and aged $k$ between times $a_i$ and $b_i$, show that the
total time at risk is
$$E_k^c = \sum_{i=1}^{n} (b_i - a_i) = \int_K^{K+N} P_{k,t}\, dt.$$
iii. Assume that $P_{k,t}$ is linear between census dates $t = K, K+1, \ldots, K+N$. Calculate
$E_k^c$ in terms of $P_{k,t}$, $t = K, K+1, \ldots, K+N$. Explain why the assumption cannot
hold exactly.
(b) Depending on the records available, you may not know the exact age of individuals at
death. You may only know the calendar year of birth and the calendar year of death.
Thus, instead of $d_k^{(1)} =$ # deaths aged $k$ at last birthday before death, you will have
$d_k^{(2)} =$ # deaths in the calendar year of the $k$-th birthday.
i. On a Lexis diagram, sketch the region where you would find the deaths of
individuals who died in year t whose k-th birthday was in the same year; and
the region where you would find the deaths of individuals who died in year t + 1
with k + 1-th birthday was in the same year. Also sketch sample lifelines for such
people, indicating when they might have been born.
ii. Describe the resulting estimate of the force of mortality. Explain the definition
that you need to use for Pk,t . State any further assumptions you make.
iii. What function of µt is being estimated? Assuming mortality rates are changing
with age, explain why this calculation is estimating something slightly different
than the calculation in the previous part.

A.2 Life expectancy, graduation, and survival analysis


Questions 1–3 are to be done for discussion in class. Questions 4–7 are to be handed in for
marking.

1. (a) Explain what is meant by right censoring, left censoring, right truncation, left trunca-
tion.
(b) In a study of the elderly, individuals were enrolled in the study, at varying times, if
they had already had one episode of depression. The event of interest was the onset of
a second episode. An individual could be enrolled if at some previous time an episode
of depression had been diagnosed. Which of the above mechanisms are relevant if it is
also known that the study finished after four years?
(c) In 1988 a study was published of the incubation time (waiting time from infection
until symptoms develop) of AIDS. The sample was of 258 adults who were known to
have contracted AIDS from blood transfusion. The data reported were the date of
the transfusion, and the time from infection until the disease was diagnosed. Which
of the above mechanisms are relevant for analysing these data?

2. (a) Suppose you are given estimates for a population of remaining life expectancy $e_x$ and
$e_{x+t}$, corresponding to ages $x$ and $x+t$ (years). You wish to compute the mortality
probability $_tq_x$. Under the assumption that mortality rates are constant over this
interval, show that
$$_tq_x \approx \frac{t + e_{x+t} - e_x}{t/2 + e_{x+t}}. \tag{*}$$
Explain why this equation is only approximate, and what assumption would make it
a good approximation.
(b) The following is an estimated table of ex (in years) in ancient Rome, as computed by
Tim Parkin Demography and Roman Society, available at http://www.utexas.edu/
depts/classics/documents/Life.html.

x 0 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70
ex 25 33 43 41 37 34 32 29 26 23 20 17 14 10 8 6

Assuming these estimates to be correct, and assuming the mortality rates to be


constant over the age intervals, use equation (*) to estimate the annual mortality
rates $_1q_x$ over the age intervals 0–1, 1–5, 6–10.

3. The data set ovarian, included in the survival package, presents data for 26 ovarian
cancer patients, receiving one of two treatments, which we will refer to as the single and
double treatments. (They appear in the data set as the rx variable, taking on values 1 and
2 respectively.)

(a) Create a survival object for the times in this database.


(b) Compute and plot the Kaplan–Meier estimator for the survival curves. (For a
small extra challenge, plot the single-treatment survival curve black, and the double-
treatment curve red.) You may use the survfit function.
(c) Compute the Nelson–Aalen survival curve estimate. Make a table of the relevant data
(time of events, number of events, number at risk).

(d) Compute the standard error for the probability of survival past 400 days in each
group, as estimated by the Nelson–Aalen and Kaplan–Meier estimators.

4. The following is an investigation carried out by a (medium-sized) UK pension scheme into


the mortality of its pensioners between 2000-2002.

(a) Explain why the crude rates are usually graduated.


(b) The data used to produce the crude rates and the proposed graduated rates are as
follows.

Age x    Central ExpRisk $E^c_x$    Deaths $d_x$    crude hazard    graduated hazard    $z_x$
60–64 1388.9 10 0.0072 0.0061 0.5249
65–69 1188.8 17 0.0143 0.0131 0.3615
70–74 880.5 28 0.0318 0.0262 1.0266
75–79 841.6 34 0.0404 0.0487 -1.0912
80–84 402.8 41 0.1018 0.0839 1.2394
85–89 123.9 19 0.1533 0.1338 0.5949
90–94 27.9 7 0.2509 0.1975 0.6346
95–99 10.0 3 0.3000 0.2706 0.1787
100+ 7.5 2 0.2666 0.3455 -0.3673

Assume the Gompertz-Makeham model has been used for graduation. Is this a sensible
choice? Test the proposed graduation for i) Overall goodness of fit; and ii) Bias.

5. Attached is an excerpt from a cohort life table for men in England and Wales born in 1894,
including curtate life expectancies. (Data from the Human Mortality Database.) Using the
given data:

(a) Estimate the change to e0 , the curtate life expectancy at birth, if the mortality rate
in the first two years of life were reduced to modern-day levels (say q0 = 0.005,
q1 = 0.0004).
(b) Make a rough estimate of the change to e0 if the increases in mortality due to the
1914-18 war and the 1918-19 influenza pandemic had not occurred.

Age x lx qx ex
0 100000 0.16134 44.82
1 83866 0.05398 52.39
...      ...        ...       ...
14 74067 0.00220 45.99
15 73904 0.00237 45.09
16 73729 0.00260 44.20
17 73538 0.00301 43.31
18 73316 0.00313 42.44
19 73087 0.00787 41.57
20 72512 0.01836 40.90
21 71181 0.03218 40.65
22 68890 0.04424 40.98
23 65842 0.06194 41.86
24 61764 0.02088 43.59
25 60474 0.00551 43.51
26 60141 0.00385 42.75
27 59910 0.00384 41.91
28 59680 0.00391 41.07
29 59446 0.00377 40.23
30 59222 0.00386 39.38
31 58994 0.00367 38.53
32 58777 0.00380 37.67
33 58554 0.00399 36.81
34 58320 0.00445 35.96
35 58061 0.00460 35.11

6. If $x$ is the observed value of a random variable $X \sim \mathrm{Binom}(n, p)$, with known $n$, find the
maximum-likelihood estimator $\hat p$, and deduce that
$$\mathrm{Var}(\hat p) \approx \frac{x(n-x)}{n^3}.$$
If $\hat S(t)$ is the Kaplan–Meier estimator, an alternative estimator for the variance is
$$\widehat{\mathrm{Var}}\bigl(\hat S(t)\bigr) = \frac{\hat S(t)^2 \bigl(1 - \hat S(t)\bigr)}{n(t)},$$
where $n(t)$ is the number at risk at time $t+$. If $d(t)$ is the number of failures up to and
including time $t$, justify the estimation
$$\hat S(t) \approx \frac{n(t)}{n(t) + d(t)} = \frac{n(t)}{n(0)},$$

making the conservative assumption that all the censoring in the interval [0, t) takes place
at t = 0. What is the distribution of d(t) given this assumption? Explain how this can be
used to justify the expression for Var Ŝ(t) in terms of a binomial proportion estimator (as p̂
above). In the special case of no censoring, what is the connection between this estimator
and Greenwood’s estimator for the variance?

7. We are carrying out a hypothetical study of the survival of Alzheimer patients. We enrol 30
subjects in a clinic, and follow them over five years. We record their age at being enrolled
in the study and the age at which they left, and the cause of exit, whether death (1) or
something else (0).

Entry Exit Death Entry Exit Death


Age Age Indicator Age Age Indicator
67 72 0 69 74 1
70 71 0 69 71 0
70 73 1 66 68 0
65 70 0 73 76 1
65 68 1 67 68 0
73 78 1 66 70 1
69 74 1 69 73 1
76 78 1 66 70 1
66 67 0 78 81 1
72 76 1 66 70 1
65 70 1 68 73 1
71 75 1 70 74 1
69 71 0 66 68 0
71 74 1 89 92 1
68 73 0 68 72 1

(a) What sorts of censoring and/or truncation do we have in this study?


(b) Make a table indicating the number of subjects at risk at ages from 65 to 75.
(c) Estimate the survival curve over the age interval 70 to 75.
(d) Compute a 95% confidence interval for the survival probability from age 70 to 75.
(e) Enter the data into R and use the survival package to estimate and plot the survival
curve.

A.3 Survival regression models and two-sample testing


Questions 1–3 are to be done for discussion in class. Questions 4–7 are to be handed in for
marking.
1. (a) Suppose we have a random sample that includes right-censored data (censoring assumed
non-informative). We wish to decide whether or not a Weibull distribution is appropriate.
Using an estimator of the survival function, how might we graphically investigate the
appropriateness of the model? Given that the model appears to be appropriate, how would
you test whether or not the special case of an exponential model is valid? Supposing that the
Weibull model does not appear to be appropriate, what graph would you use to consider a
log-logistic model?
(b) Now suppose that there are two groups to be considered (eg smokers v. non-smokers). What
graphs would be appropriate for consideration of a proportional hazards model, accelerated
life model respectively?
(c) Gehan (1965) studied 42 leukaemia patients. Some were treated with the drug 6-mercaptopurine
and the rest were controls. The trial was designed as matched pairs, but both members of a
pair were observed until both came out of remission or the study ended. (The data are included
under the name gehan in the R package MASS. The description attached to these data there
says that in each pair both were withdrawn from the trial when either came out of remission.
If you have a look at the data, you can see that this is clearly not true.) The observed times
to recurrence (in months) were:

Controls: 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23


Treatment: 6+, 6, 6, 6, 7, 9+, 10+, 10, 11+, 13, 16, 17+, 19+, 20+, 22, 23, 25+,
32+, 32+, 34+, 35+

Here + indicates censored times. Investigate these data in respect of both a) and b).
2. (a) Describe the proportional hazards model, explaining what is meant by the partial likelihood
and how this can be used to estimate regression coefficients. How might standard errors be
generated?
(b) Drug addicts are treated at two clinics (clinic 0 and clinic 1) on a drug replacement therapy.
The response variables are the time to relapse (to re-taking drugs) and the status relapse
=1 and censored =0. There are three explanatory variables, clinic (0 or 1), previous stay
in prison (no=0, yes=1) and the prescribed amount of the replacement dose. The following
results are obtained using a proportional hazards model, h(t, x) = eβx h0 (t).

Variable Coeff St Err p-value


clinic -1.009 0.215 0.000
prison 0.327 0.167 0.051
dose -0.035 0.006 0.000

What is the estimated hazard ratio for a subject from clinic 1 who has not been in prison as
compared to a subject from clinic 0 who has been in prison, given that they are each assigned
the same dose?
(c) Find a 95% confidence interval for the hazard ratio comparing those who have been in prison
to those who have not, given that clinic and dose are the same.
3. The object tongue in the package KMsurv lists survival or right-censoring times in weeks after
diagnosis for 80 patients with tongue tumours. The type random variable is 1 or 2, depending on
whether the tumour was aneuploid or diploid respectively.
(a) Use the log-rank test to test whether the difference in survival distributions is significant at
the 0.05 level.

(b) Repeat the above with a test that emphasises differences shortly after diagnosis.
4. (a) Sketch the shape of the hazard function in the following cases, paying attention to any changes
of shape due to changes in value of κ where appropriate.
i. Weibull: $S(t) = e^{-(\rho t)^\kappa}$.
ii. Log-logistic: $S(t) = \dfrac{1}{1 + (\rho t)^\kappa}$.

(b) Suppose that it is thought that an accelerated life model is valid and that the hazard function
has a maximum at a non-zero time point. Which parametric models might be appropriate?
(c) Suppose that t1 , . . . , tn are observations from a lifetime distribution with respective vectors of
covariates x1 , . . . , xn . It is thought that an appropriate distribution for lifetime y is Weibull
with parameters ρ, κ, where the link is log ρ = β · x. In the case that there is no censoring
write down the likelihood and, using maximum likelihood, give equations from which the
vector of estimated regression coefficients β (and also the estimate for κ) could be found.
What would be the asymptotic distribution of the vector of estimators? How would the
likelihood differ if some of the observations ti were right censored (assuming independent
censoring)?
5. Coronary Heart Disease (CHD) remains the leading cause of death in many countries. The evidence
is substantial that males are at higher risk than females, but the role of genetic factors versus
the gender factor is still under investigation. A study was performed to assess the gender risk of
death from CHD, controlling for genetic factors. A dataset consisting of non-identical twins was
assembled. The age at which each person died of CHD was recorded. Individuals who either had
not died or had died from other causes had censored survival times (age). A randomly selected
subsample from the data is as follows. (* indicates a censored observation.)

Age male twin Age female twin


50 63*
49* 52
56* 70*
68 75
74* 72
69* 69*
70* 70*
67 70
74* 74*
81* 81*
61 58
75* 73*

(a) Write down the times of events and list the associated risk sets.
(b) Suppose the censoring mechanism is independent of death times due to CHD, and that
the mortality rates for male and female twins satisfy the PH assumption, and let β be the
regression coefficient for the binary covariate that codes gender as 0 or 1 for male or female
respectively. Write down the partial-likelihood function. Using a computer or programmable
calculator, compute and plot the partial-likelihood for a range of values of β. What is the
Cox-regression estimate for β? What does this mean?
(c) Estimate the survival function for male twins.
(d) Suppose now only that the censoring mechanism is independent of death times due to CHD,
perform the log-rank test for equivalence of hazard amongst these two groups. Contrast the
test statistic and associated p-value with the results from the Fleming–Harrington test using
a weight W (ti ) = Ŝ(ti−1 ).

(e) Do you think the assumption of a non-informative censoring mechanism is appropriate? Give
reasons.

6. In section 5.12.3 we describe fitting the Aalen additive hazards model for the special case of a
single (possibly time-varying) covariate. If we assume that xi takes on only the values 0 and 1 —
so it is the indicator of some discrete characteristic — then B1 (t) may be thought of as the excess
cumulative hazard up to time t due to that characteristic. Simplify the formula (5.17) for this case,
and show, in particular, that in the special case where xi is constant over time, the estimate B̂0 (t)
is equivalent to the Nelson–Aalen estimator for the cumulative hazard of the group of individuals
with xi = 0, and that the excess cumulative hazard B̂1 (t) is equivalent to the difference between
the Nelson–Aalen estimator for the cumulative hazard of the group of individuals with xi = 1, and
for the cumulative hazard of the group of individuals with xi = 0.
7. Refer to the AML study, which is described at length in Example 4.4.5 and analysed with the
Cox model in section 5.10. Using the data described in those places, estimate the difference in
cumulative hazard to 20 weeks between the two groups by
(a) The Aalen additive hazards regression model.
(b) The Cox proportional hazards regression model.
(c) Using the proportional hazards method, suppose an individual were to switch from mainte-
nance to non-maintenance after 10 weeks, and suppose the hazard rates change instantaneously.
Estimate the difference in cumulative hazard to 20 weeks between that individual and one
who had always been in the non maintenance group.

A.4 Model diagnostics, repeated events


Questions 1,3,5 are to be done for discussion in class. Questions 2,4,6 are to be handed in for marking.
1. In a relative-risk regression model the hazard rate for individual i when they are at risk is

$$\lambda_i(t) = \alpha_0(t)\, r(\beta, x_i),$$
where $\beta$ is the vector of parameters, $x_i$ is the vector of covariates associated with individual $i$,
and
$$r(\beta, x_i) = e^{\beta^T x_i}.$$
(a) Compute a formula for the k-th component of the score function (with respect to the partial
likelihood);
(b) Compute a formula for the (k, m) component of the observed partial information matrix.
2. (Based on Exercise 11.1 of [18].) The dataset larynx in the package KMsurv includes times of
death (or censoring by the end of the study) of 90 males diagnosed with cancer of the larynx
between 1970 and 1978 at a single hospital. One important covariate is the stage of the cancer,
coded as 1,2,3,4.
(a) Why would it probably not be a good idea to fit the Cox model with relative risk eβ·stage ?
What should be done instead?
(b) Explain how you would use a martingale residual plot to show that stage does not enter as a
linear covariate.
(c) Which residual plot would you use to test whether the proportional-hazards assumption holds
for age or stage, or whether the proportional effect of one of these covariates changes over
time.
(d) Explain how you would use a Cox–Snell residual plot to test whether the Cox model is
appropriate to these data. Describe the calculations you would perform, the plot that you
would create, and describe the visual features you would be looking for to evaluate the
goodness of fit.
3. Carry out these computations in R for the data set described in the previous question:
(a) One way of making R treat the stage variable appropriately is to replace it in the model
definition by factor(stage). Show that this produces the same result as defining separate
binary variables for three different outcomes.
(b) Try adding year of diagnosis or age at diagnosis as a linear covariate (in the exponent of the
relative risk). Is either statistically significant?
(c) Use a martingale residual plot to show that stage does not enter as a linear covariate.
(d) Use a residual plot to test whether one or the other of these covariates might more appropriately
enter the model in a different functional form — for example, as a step function.
(e) Use a Cox–Snell residual plot to test whether the Cox model is appropriate to these data.

4. We observe survival times $T_i$ satisfying an additive hazards model, so the hazard for individual
$i$ is $h_i(t) = \beta_0(t) + \sum_{k=1}^{p} x_{ik}(t)\beta_k(t)$, with $B_k(t) = \int_0^t \beta_k(s)\,ds$. We define $Y_i(t)$ to be the at-risk
indicator for individual $i$ at time $t$, and $X(t)$ the matrix of covariates at time $t$ multiplied by at-risk,
defined as in section 5.12.2. We also define $N(t)$ to be the binary vector giving in position $i$ the
number of events that individual $i$ has had up to time $t$. Let $\hat B(t)$ be the vector of cumulative
regression coefficient estimators.
We assume the process has been observed up to a final time $\tau$ where there is a sufficient range of
subjects remaining that $X(t)$ has full rank. Define the martingale residual vector
$$M_{\mathrm{res}}(t) = N(t) - \sum_{t_j \le t} X(t_j)\, d\hat B(t_j).$$

(a) Show that all components of Mres (t) have expectation 0, for all times 0 ≤ t ≤ τ .
(b) Suppose now that all covariates are fixed and the data are right-censored. Show that

$$X(0)^T M_{\mathrm{res}}(\tau) = 0.$$

(c) How might this fact be used as a model-diagnostic for the additive-hazards assumption?
5. Suppose we have a right-censored survival data set where we have accidentally copied every line of
data twice. We fit a Cox proportional hazards regression model.
(a) Show that the point estimate for β and for the baseline hazard will be the same as for the
correct (undoubled) data, but that the variance estimate will be wrong. (Use the Breslow
method for dealing with the tied observations.) What will the variance estimate be, relative
to the correct estimate?
(b) Show that the sandwich estimate described in section 9.3 will agree asymptotically with the
variance estimate for the correct data set.
(c) Carry out the calculations in R for the bmt (bone marrow transplant) data set in the KMsurv
package, referred to in section 7.4. That is, fit a Cox proportional hazards model as described
in 7.4, and then fit the same model to the data set where every line of the data object
has been duplicated. Compare the conclusions. Then add an id variable (so that the two
lines corresponding to the same patient have the same id) and redo the analysis using a
+cluster(id) term in the formula, and see if the problem is resolved.
6. Suppose n individuals experience events at constant rate λi , i = 1, . . . , n, over the same period of
time [0, T ]. The rate λi for individual i is unknown. Suppose the unknown rates λi have a gamma
distribution with parameters (r, λ), and let Ni be the observed number of events for individual i.
(a) Show that (Ni ) have a negative binomial distribution, and compute the parameters.
(b) Suppose we fit these data to a Poisson model, to obtain an estimate λ̂. How will λ̂ behave as
n → ∞?
(c) How would you test the hypothesis that the Poisson model is correct, against the alternative
that it is negative binomial?
(d) [optional] Test these conclusions with simulated data in R.
Appendix B

Solutions


B.1 Revision, lifetime distributions, Lexis diagrams and the census approximation
1. (a) The log likelihood function is given by
$$\ell(\lambda) = \log\Big(\prod_{k=1}^n \lambda e^{-\lambda L_k}\Big) = n\log\lambda - \lambda\sum_{k=1}^n L_k.$$
We differentiate w.r.t. $\lambda$ to get
$$\ell'(\lambda) = \frac{n}{\lambda} - \sum_{k=1}^n L_k, \qquad \ell''(\lambda) = -\frac{n}{\lambda^2} < 0.$$
$\ell'$ has its unique zero for
$$\lambda = \hat\lambda = \frac{n}{L_1 + \ldots + L_n}$$
and this is a maximum of $\ell$, since $\ell'' < 0$. Therefore $\hat\lambda$ maximizes $\ell$.
(b) i. Just apply (a) to the data and get
$$\hat\lambda = \frac{20}{l_1 + \ldots + l_{20}} = 0.002917.$$
ii. The Fisher information is given by
$$I_n(\lambda) = -E(\ell''(\lambda)) = \frac{n}{\lambda^2} \qquad (B.1)$$
so that, approximately, $\hat\lambda \sim N(\lambda, \lambda^2/n) \approx N(\lambda, \hat\lambda^2/n)$; therefore
$$0.95 = P(|Z| < 1.96) \approx P\Big(\Big|\frac{\hat\lambda - \lambda}{\hat\lambda/\sqrt{n}}\Big| < 1.96\Big)
= P\big(\hat\lambda - 1.96\hat\lambda/\sqrt{n} < \lambda < \hat\lambda + 1.96\hat\lambda/\sqrt{n}\big)
= P\Big(\frac{1}{\hat\lambda + 1.96\hat\lambda/\sqrt{n}} < \frac{1}{\lambda} < \frac{1}{\hat\lambda - 1.96\hat\lambda/\sqrt{n}}\Big),$$
so that $(\hat\lambda - 1.96\hat\lambda/\sqrt{n},\ \hat\lambda + 1.96\hat\lambda/\sqrt{n}) = (0.001638, 0.004195)$ is an approximate 95%
confidence interval for $\lambda$, and $(238.4, 610.3)$ an approximate 95% confidence interval for $1/\lambda$.
iii. Since $L_1 + \ldots + L_n \sim \Gamma(n, \lambda)$, we have $2\lambda(L_1 + \ldots + L_n) \sim \Gamma(n, \tfrac12) = \chi^2_{2n}$ and so for
$2n = 40$, with the lower and upper 2.5% quantiles,
$$0.95 = P(24.43 < \chi^2 < 59.34) = P(24.43 < 2\lambda n/\hat\lambda < 59.34)
= P\Big(\frac{2n}{59.34\,\hat\lambda} < \frac{1}{\lambda} < \frac{2n}{24.43\,\hat\lambda}\Big),$$
so the exact 95% confidence interval for $1/\lambda$ is $(231.1, 561.3)$.
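As a quick check (not part of the original solution), the same estimates can be reproduced in R, assuming the 20 observed burnout times are stored in a numeric vector lb:

n <- length(lb)                                   # here n = 20
lambda.hat <- n / sum(lb)                         # MLE, about 0.002917
approx.ci <- lambda.hat + c(-1, 1) * 1.96 * lambda.hat / sqrt(n)   # approximate 95% CI for lambda
exact.ci <- qchisq(c(0.025, 0.975), df = 2*n) / (2 * sum(lb))      # exact CI from 2*lambda*sum(L) ~ chi^2_{2n}
1 / rev(exact.ci)                                 # corresponding exact interval for the mean 1/lambda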

iv. [Histogram of burnout times: frequency against time, with bins of width 100 from 0 to 700.]
v. Expected numbers under $\mathrm{Exp}(\hat\lambda)$ are $(e^{-100k\hat\lambda} - e^{-100(k+1)\hat\lambda})\,n$:
(5.1, 3.8, 2.8, 2.1, 1.6, 1.2, 0.9) and 2.6 for > 700.
For the $\chi^2$ test we require expected numbers above 5, so we keep the first bin, merge the
next three to get 8.7 and the remainder to get 6.2 (alternatively merge next two and
remainder). The data then is

Bin        0-100  100-400  400+  total
observed       0       15     5     20
expected     5.1      8.7   6.2     20

and we calculate the $\chi^2_{3-2} = \chi^2_1$ test statistic
$$\sum_{i=1}^3 \frac{(O_i - E_i)^2}{E_i} = 9.84 \quad\Rightarrow\quad |Z| = \sqrt{\sum_{i=1}^3 \frac{(O_i - E_i)^2}{E_i}} = 3.14 \gg 1.96,$$
so there is strong evidence against exponentiality.


2. (a) Identify the survival function of T as
$$P(T > t) = P(T_1 > t, \ldots, T_m > t) = P(T_1 > t)\cdots P(T_m > t)
= \exp\Big\{-\int_0^t h_1(s)\,ds\Big\}\cdots\exp\Big\{-\int_0^t h_m(s)\,ds\Big\}
= \exp\Big\{-\int_0^t \big(h_1(s) + \ldots + h_m(s)\big)\,ds\Big\}.$$
(b) By (a), the hazard function of T now is $k_1 t^n + \ldots + k_m t^n$, so T has a Weibull distribution
with rate parameter $k = k_1 + \ldots + k_m$ and exponent n.
(c) We first calculate the survival function and let $\lambda \to 0$ to get
$$\bar F(t) = P(T > t \mid T \le \omega) = \frac{e^{-\lambda t} - e^{-\lambda\omega}}{1 - e^{-\lambda\omega}} \to \frac{\omega - t}{\omega},$$
which is the survival function of the uniform distribution on $[0, \omega]$. This is not surprising
since the exponential density for small $\lambda$ is very flat initially, also after truncation and
renormalisation.
We calculate the hazard function of the truncated exponential distribution via the density
$$f(t) = \frac{\lambda e^{-\lambda t}}{1 - e^{-\lambda\omega}} \quad\Rightarrow\quad h(t) = \frac{\lambda}{1 - e^{-\lambda(\omega - t)}}. \qquad (B.2)$$

3. We have $X \sim \mathrm{Exp}(1)$. We want the distribution of $Y = \Lambda^{-1}(X)$.
If we let $F_X$ and $F_Y$ be the corresponding cdfs, we have $F_X(x) = P(X \le x) = 1 - e^{-x}$, so
$$F_Y(y) = P(Y \le y) = P\big(\Lambda^{-1}(X) \le y\big) = P\big(X \le \Lambda(y)\big) \quad\text{(because $\Lambda$ is strictly increasing)}
= 1 - e^{-\Lambda(y)},$$
which is the cdf of a random variable with hazard rate $\lambda$ (the derivative of $\Lambda$).
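This identity is also how event times are simulated in practice. A minimal R sketch (not part of the original solution), taking as an illustrative assumption the Weibull cumulative hazard $\Lambda(t) = (0.01t)^2$, so that $\Lambda^{-1}(x) = \sqrt{x}/0.01$:

set.seed(1)
x <- rexp(10000)                                  # X ~ Exp(1)
y <- sqrt(x) / 0.01                               # Y = Lambda^{-1}(X)
# The simulated Y should follow the distribution with cumulative hazard Lambda:
ks.test(y, function(t) 1 - exp(-(0.01 * t)^2))    # compare against the implied cdf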


4. (a) The full table of deaths dx , lives at risk `x and total time at risk `˜x aged x is

x 0 1 2 3 4
dx 45 9 8 4 3
`x 69 24 15 7 3
`˜x 35.63 19.15 10.20 4.94 1.32

(b) The discrete method based on curtate lifetimes $K^{(1)}, \ldots, K^{(n)}$, $n = 69$, factorises the likelihood
$$\prod_{j=1}^{69} p_K(K^{(j)}) = \prod_{x=0}^{\infty} (1 - q_x)^{\ell_x - d_x}\, q_x^{d_x} \qquad (B.3)$$
and differentiation of each factor leads to maximum likelihood estimators $\hat q_x^{(0)} = d_x/\ell_x$.
The continuous method based on $T^{(1)}, \ldots, T^{(n)}$, $n = 69$, and the assumption of constant
forces of mortality between integer ages, factorises the likelihood
$$\prod_{j=1}^{69} f_T(T^{(j)}) = \prod_{x=0}^{\infty} \mu_{x+\frac12}^{\,d_x} \exp\big\{-\tilde\ell_x\, \mu_{x+\frac12}\big\} \qquad (B.4)$$
and differentiation of each factor leads to maximum likelihood estimators $\hat q_x = 1 - \exp\{-d_x/\tilde\ell_x\}$.
(c) From the formulas obtained in (b) we calculate

x            0      1      2      3      4
q̂x^(0)    0.65   0.38   0.53   0.57      1
q̂x       0.717  0.375  0.543  0.555  0.898

$\hat q_0^{(0)} < \hat q_0$ since the total time $\tilde\ell_0$ spent at risk is very short. We can see this directly from the
data. Most subjects dying in the first year die very early (e.g. three subjects die the day after
their transplant). This actually suggests that the force of mortality is not constant over the
first year, but much higher initially.
$\hat q_4 < \hat q_4^{(0)} = 1$: the continuous method allows survival beyond the maximal observed age.
The specification of the distribution estimate is not complete, but with no data we get no
estimate. Some methods of graduation will allow extrapolation beyond the maximal age.
By both methods, the one-year death probabilities indicate a bathtub behaviour, decreasing
initially and then increasing.
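As a small sketch (not in the original solution), both sets of estimates can be computed directly in R from the counts in part (a):

dx <- c(45, 9, 8, 4, 3)                    # deaths aged x
lx <- c(69, 24, 15, 7, 3)                  # lives at risk aged x
lt <- c(35.63, 19.15, 10.20, 4.94, 1.32)   # total time at risk aged x
q.discrete <- dx / lx                      # curtate-lifetime (discrete) estimator
q.continuous <- 1 - exp(-dx / lt)          # constant-force-of-mortality estimator
round(rbind(q.discrete, q.continuous), 3)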
(d) i. Under the estimates from curtate lifetimes and the assumption of independent uniform
fractional part,
$$P(T > 0.25) = P(T > 1) + P(K = 0, S > 0.25)$$
is estimated by $(1 - \hat q_0^{(0)}) + \hat q_0^{(0)}\cdot\frac34 = 0.837$.
ii. Under the estimates from continuous lifetimes and the assumption of constant force of
mortality between integer ages,
$$P(T > 0.25) = \exp\Big\{-\int_0^{0.25} \mu_t\, dt\Big\}$$
is estimated by $\exp\{-0.25\,\hat\mu_{0+\frac12}\} = (1 - \hat q_0)^{0.25} = 0.729$.
iii. Again we can apply the discrete or continuous method (formally for units of three
months). The continuous method assumes constancy of forces of mortality over each
three-month period and gives an estimate
$$\exp\Big\{-\frac{d}{4\tilde\ell}\Big\} = \exp\Big\{-\frac{31}{4 \times 12.58}\Big\} = 0.540.$$
Here, $4\tilde\ell$ is the total number of time units (as calculated from years $\tilde\ell$) at risk during the first
three-month unit.
The discrete method is based on one-unit death probabilities and gives $1 - d/\ell =
1 - 31/69 = 0.551$ as an estimate for the first-unit survival probability.
These estimates are much smaller, reflecting a higher risk of dying initially. In fact, this
suggests that neither assumption i. nor ii. is optimal. An initially decreasing force of
mortality would be better.
5. (a) i. The turquoise region corresponds to age x in year t. The same individuals are age x + 1
in year t + 1, and this portion of their lifelines falls in the yellow region.

[Lexis diagram: Age (x − 1 to x + 2) against Year (t − 1 to t + 2), with the corresponding regions shaded.]
ii. Just write the quantities as integrals and sums and interchange the order of integration
and summation:
$$E_x^c = \sum_{i=1}^n \int_{a_i}^{b_i} dt = \int_K^{K+N} \sum_{i=1}^n \mathbf{1}_{\{a_i \le t \le b_i\}}\, dt = \int_K^{K+N} P_{x,t}\, dt. \qquad (B.5)$$

iii. Under the assumption of piecewise linearity we calculate
$$E_x^c = \sum_{k=K}^{K+N-1} \int_0^1 \big(r P_{x,k} + (1 - r) P_{x,k+1}\big)\,dr = \sum_{k=K}^{K+N-1} \frac{P_{x,k} + P_{x,k+1}}{2}. \qquad (B.6)$$

The assumption of piecewise linear Px,t cannot hold exactly since Px,t ∈ N, but for large
n this is negligible.
(b) i. The turquoise region corresponds to age x in year t. The same individuals are age x + 1
in year t + 1, and this portion of their lifelines falls in the yellow region.

[Lexis diagram: Age (x − 1 to x + 2) against Year (t − 1 to t + 2), with the corresponding parallelogram regions shaded.]
ii. We can define
$$P^{(2)}_{x,t} = \#\text{ lives at risk at time } t \text{ with } x\text{th birthday in calendar year } \lfloor t\rfloor, \qquad (B.7)$$
but also ought to adjust
$$E_x^{c,2} = \int_K^{K+N} P^{(2)}_{x,t}\, dt \approx \sum_{k=K}^{K+N-1} \frac{P_{x,k} + P_{x+1,k+1}}{2}, \qquad (B.8)$$
since it is more natural to assume that the cohort of lives $P_{x,k}$ with x-th birthday in
calendar year k changes linearly to $P_{x+1,k+1}$, since this counts the same people (being
age x at the start of year k and age x + 1 at the start of year k + 1).
iii. The first estimate is approximating $\int_x^{x+1} \mu_s\, ds$; that is, the average of $\mu_s$ over the age
interval from x to x + 1, assuming age-specific mortality doesn't change with time. (This
is just $\mu_x$ if $\mu_s$ is assumed constant on $[x, x+1)$.) The second estimate clearly includes
the experience of individuals at ages between x − 1 and x + 1, but weighted toward the
middle (in proportion to the width of the parallelogram in the above figure). In fact, the
estimate $\tilde\mu_x = d_x/E_x^{c,2}$ will be approximating
$$\int_{x-1}^{x+1} \big(1 - |s - x|\big)\, \mu_s\, ds.$$

B.2 Life expectancy, graduation, and survival analysis


1. (a) See lecture notes.
(b) There is right censoring: The depression may not have recurred at the time that the study
ended, or the patient died or dropped out. There is left truncation: The first episode of
depression made the patients eligible for the study, but not immediately. Thus, the event of
interest — the recurrence of depression — could already have happened before the patient
was enrolled in the study.
(c) This study design involves right truncation: The entire study population has already ex-
perienced the event of interest (AIDS diagnosis). Any individual whose incubation period
extended beyond the truncation time would not have appeared in the study.
2. (a) We make the approximation that those who die between x and x + t survive for t/2 years on
average. Then the contribution to the life expectancy $e_x$ from those who die before x + t is
${}_tq_x \cdot t/2$, and that from those who survive to x + t is $(1 - {}_tq_x)(t + e_{x+t})$. We have
$$e_x \approx {}_tq_x \cdot \frac{t}{2} + (1 - {}_tq_x)(t + e_{x+t}),$$
and rearranging leads to the given formula for ${}_tq_x$. The approximation is reasonable if for
example deaths are approximately uniform across the interval (x, x + t), which would occur if
mortality is constant and low.
(b) Applying the approximations we get
$$q_0 \approx (33 - 25 + 1)/(1/2 + 33) = 9/33.5 = 0.269,$$
$${}_4q_1 \approx (43 - 33 + 4)/(2 + 43) = 14/45 = 0.311,$$
$${}_5q_5 \approx (41 - 43 + 5)/(2.5 + 41) = 3/43.5 = 0.069.$$
Under the assumption of constant mortality on these intervals, then for x = 1, 2, 3, 4 we have
$$q_x = 1 - p_x = 1 - ({}_4p_1)^{1/4} = 1 - (1 - {}_4q_1)^{1/4},$$
and similarly for x = 5, 6, 7, 8, 9
$$q_x = 1 - (1 - {}_5q_5)^{1/5},$$
leading to $q_x = 0.089$ for x = 1, 2, 3, 4 and $q_x = 0.014$ for x = 5, 6, 7, 8, 9.
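A short R sketch of this arithmetic (not part of the original solution), using the life expectancies implied by the calculation above, namely e0 = 25, e1 = 33, e5 = 43 and e10 = 41:

tqx <- function(t, ex, ext) (t + ext - ex) / (t/2 + ext)   # rearranged approximation from (a)
q0   <- tqx(1, 25, 33)     # 0.269
q1.4 <- tqx(4, 33, 43)     # 4q1 = 0.311
q5.5 <- tqx(5, 43, 41)     # 5q5 = 0.069
c(q0, 1 - (1 - q1.4)^(1/4), 1 - (1 - q5.5)^(1/5))          # one-year rates under constant mortality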


3.
library(survival)

## a ##

surv_object <- Surv(ovarian$futime, ovarian$fustat)

# To have a look at what has been computed about survival

## b ##
plot(survfit(surv_object ~ ovarian$rx), main="Kaplan-Meier")

> summary(survfit(surv_object ~ ovarian$rx))
Call: survfit(formula = Surv(futime, fustat) ~ rx)

#                 rx=1
# time n.risk n.event survival std.err lower 95% CI upper 95% CI
#   59     13       1    0.923  0.0739        0.789        1.000
#  115     12       1    0.846  0.1001        0.671        1.000
#  156     11       1    0.769  0.1169        0.571        1.000
#  268     10       1    0.692  0.1280        0.482        0.995
#  329      9       1    0.615  0.1349        0.400        0.946
#  431      8       1    0.538  0.1383        0.326        0.891
#  638      5       1    0.431  0.1467        0.221        0.840
#
#                 rx=2
# time n.risk n.event survival std.err lower 95% CI upper 95% CI
#  353     13       1    0.923  0.0739        0.789        1.000
#  365     12       1    0.846  0.1001        0.671        1.000
#  464      9       1    0.752  0.1256        0.542        1.000
#  475      8       1    0.658  0.1407        0.433        1.000
#  563      7       1    0.564  0.1488        0.336        0.946

# for extra challenge:
plot(survfit(surv_object ~ ovarian$rx),
     col=c("black", "red"),
     main="Kaplan-Meier")
legend("bottomright",
       c("single-treatment", "double-treatment"),
       col=c("black", "red"), lty=1)

## c ##

plot(survfit(surv_object ~ ovarian$rx, type='fleming-harrington'), main="Nelson-Aalen")

### The rest is to do this more 'by hand', computing the relevant quantities
### and directly computing the Nelson-Aalen estimator.

attach(ovarian)

x = order(futime)
futime = futime[x]
fustat = fustat[x]
rx = rx[x]

ns = rev(cumsum(rev(rx==1)))          # numbers at risk, single-treatment group
nd = rev(cumsum(rev(rx==2)))          # numbers at risk, double-treatment group
hs = round(fustat*(rx==1)/ns, 2)      # hazard increments, single treatment
hd = round(fustat*(rx==2)/nd, 2)      # hazard increments, double treatment
vars = cumsum(fustat*(rx==1)/ns^2)    # Nelson-Aalen variance estimates (sum of d/n^2)
vard = cumsum(fustat*(rx==2)/nd^2)    #  (these give the vars/vard columns printed below)

NelsonAalenTable =
  subset(data.frame(t_i=futime, n_single=ns, n_double=nd,
    h_single=hs, h_double=hd, A_single=cumsum(hs), A_double=cumsum(hd),
    vars=round(vars,2), vard=round(vard,2)), h_single+h_double > 0)

> NelsonAalenTable
   t_i n_single n_double h_single h_double A_single A_double vars vard
1   59       13       13     0.08     0.00     0.08     0.00 0.01 0.00
2  115       12       13     0.08     0.00     0.16     0.00 0.01 0.00
3  156       11       13     0.09     0.00     0.25     0.00 0.02 0.00
4  268       10       13     0.10     0.00     0.35     0.00 0.03 0.00
5  329        9       13     0.11     0.00     0.46     0.00 0.04 0.00
6  353        8       13     0.00     0.08     0.46     0.08 0.04 0.01
7  365        8       12     0.00     0.08     0.46     0.16 0.04 0.01
10 431        8        9     0.12     0.00     0.58     0.16 0.06 0.01
12 464        6        9     0.00     0.11     0.58     0.27 0.06 0.03
13 475        6        8     0.00     0.12     0.58     0.39 0.06 0.04
15 563        5        7     0.00     0.14     0.58     0.53 0.06 0.06
16 638        5        6     0.20     0.00     0.78     0.53 0.10 0.06

[Kaplan-Meier plot: estimated survival curves for the single-treatment (black) and double-treatment (red) groups against time, 0-1200 days.]

The standard errors are in the code printout above. For type 1 the variance estimate for the
Nelson-Aalen estimator is 0.04 at t = 400; for type 2 it is 0.01. So the corresponding standard
errors for the cumulative hazard are 0.2 and 0.1. The standard errors for survival are obtained
by multiplying these by Ŝ(400) (since Var(Ŝ) ≈ Ŝ² Var(Ĥ)), obtaining 0.13 and 0.097. The standard errors computed by
the survfit function for the Kaplan-Meier estimator are in the printout above. They are 0.135
and 0.100.
4. (a) Crude estimates from the data are subject to stochastic fluctuation. Smoothing (graduating)
the estimates may make more reliable predictions.
(b) $\mu_x = a + be^{\alpha x}$ for Gompertz–Makeham. This is generally considered a reasonable model for
the hazard rate (force of mortality) from middle age onward. Note, though, that the mortality
rate doubling times (which would be approximately constant under Gompertz–Makeham)
lengthen progressively. The parameters a, b, α will have to be fitted from the data.
We apply the chi-squared test. To begin with, we combine the last two rows to have ≥ 5
expected deaths in each row. The last row becomes

99 17.5 5 0.2857 0.3027 − 0.1293


Statistical Lifetime Models: HT 2019 XXIII

(We interpolate by weighting the two rows by their central exposed to risk.) The χ2 statistic
is then 4.96 on 8 observations. Since we have estimated 3 parameters, we compare this to the
table with 5 degrees of freedom, obtaining p-value 0.42.
To test for bias we use the cumulative deviations test, obtaining Z = 0.96, and a p-value
of 0.3375. Thus, the model seems to fit. Notice that graduated hazard is generally lower —
it is strongly affected by the mortality plateau a very late ages — which would lead to an
overestimate of benefits paid. This is a relatively good error to make, though it would be
reversed if the company were selling life insurance!
5. (a) Let us write $e_x$ and $p_x$ for the figures in the table, and $\tilde e_x$ and $\tilde p_x$ for the figures after we
change the rates in the first two years.
We have
$$e_0 = p_0\big(1 + p_1(1 + e_2)\big), \qquad \tilde e_0 = \tilde p_0\big(1 + \tilde p_1(1 + \tilde e_2)\big).$$
Since we change only the rates before year 2, we have $e_2 = \tilde e_2$. Then solving for $\tilde e_0$, we have
$$\tilde e_0 = \tilde p_0\Big(1 + \tilde p_1\,\frac{e_0/p_0 - 1}{p_1}\Big).$$
With $e_0 = 44.83$, $p_0 = 0.839$, $p_1 = 0.946$, $\tilde p_0 = 0.995$, $\tilde p_1 = 0.9996$ we obtain $\tilde e_0 = 56.12$, an
increase of 11.29 years.
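The arithmetic can be checked with a couple of lines of R (a sketch, not part of the original solution):

e0 <- 44.83; p0 <- 0.839; p1 <- 0.946     # original figures
p0.new <- 0.995; p1.new <- 0.9996         # improved first- and second-year survival
e0.new <- p0.new * (1 + p1.new * (e0/p0 - 1) / p1)
c(e0.new, e0.new - e0)                    # about 56.12 and 11.29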
(b) It is clear from the table that the rates qx (x = 19, . . . , 25) are much larger than we would
normally expect. Comparing them with the rates just before and after, it looks like a plausible
first approximation would be to replace all these rates by a rate around 0.0035. So we could
work with a model where the mortality is constant with q = 0.0035 = 1 − 0.9965 for those 7
years. (Of course this is very rough, and there are all sorts of things we ignore including the
various effects of the war on mortality, even after it had finished).
We can represent $e_0$ as a sum A + B + C of three terms:
A = expected number of whole years lived up to age 19,
B = expected number of whole years lived between ages 19 and 26,
C = expected number of whole years lived after age 26.
Let us write $\tilde A$, $\tilde B$, and $\tilde C$ for the new values once we change the rates.
The change to the rates between ages 19 and 26 makes no difference to the first term, so
$\tilde A = A$.
We have
$$C = {}_{26}p_0\, e_{26} = \frac{\ell_{26}}{\ell_0}\, e_{26} = 0.6014 \times 42.75 = 25.71.$$
The change to the rates makes no difference to $e_{26}$, but we have a new value for the probability
of surviving to age 26, giving
$$\tilde C = (0.9965)^7\, \frac{\ell_{19}}{\ell_0}\, e_{26} = (0.9965)^7 \times 0.7309 \times 42.75 = 30.49.$$

Finally, we can find B using B + C = e19 `19 /`0 as in the previous calculation, so
`19
B= e19 − C
`0
= 0.7309 × 41.57 − 25.71
= 4.67.
In the new model with constant rate between age 19 and age 26, we can use
e = 19 p̃0 (1 p̃19 + 2 p̃19 + · · · + 7 p̃19 )
B
7
`19 X
0.9965k
`0
k=1
`19 0.9965 − 0.99658
=
`0 0.0035
= 0.7381 × 6.903
= 5.10.
The total change in life expectancy at birth is B
e+Ce − B − C which comes to 30.49 + 5.10 −
25.71 − 4.67 = 5.21, giving a new life expectancy of around 50.
6. The log likelihood is
$$\ell(p) = \log\binom{n}{x} + x\log p + (n - x)\log(1 - p).$$
This has solution $0 = \ell'(\hat p) = x/\hat p - (n - x)/(1 - \hat p)$, implying $\hat p = x/n$. We know that the variance
of a binomial random variable is $np(1-p)$. Substituting $\hat p$ for p yields the estimate
$$\mathrm{Var}(\hat p) = \mathrm{Var}(x/n) = n^{-2}\,\mathrm{Var}(x) = n^{-1}p(1-p) = n^{-1}\,\frac{x}{n}\,\frac{n-x}{n} = \frac{x(n-x)}{n^3}.$$
If all the censoring occurs at t = 0 then the number of individuals at risk of dying in (0, t) is
actually $n(t) + d(t)$. Thus the number alive at time t is binomial with parameters $n = n(0) = n(t) + d(t)$ and
$p = S(t)$. The MLE for p is thus
$$\hat S(t) = \hat p = \frac{n(t)}{n(t) + d(t)} = \frac{n(t)}{n(0)}.$$
(If the censoring all happens at time 0, then the number at risk at time 0+ will be the same as the
sum of the number who die up to time t, and the number still at risk at time t.) The variance
estimate is
$$\frac{d(t)\,n(t)}{n(0)^3} = n(t)^{-1}\,\frac{d(t)}{n(0)}\,\frac{n(t)}{n(0)}\,\frac{n(t)}{n(0)} = n(t)^{-1}\big(1 - \hat S(t)\big)\hat S(t)^2.$$
Greenwood's estimate in the case of no censoring is
$$\mathrm{Var}\,\hat S(t) \approx \hat S(t)^2 \sum_{t_i\le t} \frac{d_i}{n_i(n_i - d_i)}
= \hat S(t)^2 \sum_{t_i\le t} \frac{n_i - n_{i+1}}{n_i\, n_{i+1}}
= \hat S(t)^2 \sum_{t_i\le t}\Big(\frac{1}{n_{i+1}} - \frac{1}{n_i}\Big)
= \hat S(t)^2\Big(\frac{1}{n_j} - \frac{1}{n_0}\Big)
= \hat S(t)^2\,\frac{d(t)}{n(t)\,n(0)}
= n(t)^{-1}\,\hat S(t)^2\big(1 - \hat S(t)\big)$$
as before.
7. (a) Right censoring and left truncation.
(b) If individuals who enter at age x are considered immediately available to count at risk at age
x, and those who die at age x are also at risk.

Age 65 66 67 68 69 70 71 72 73 74 75
# at risk 3 9 11 13 14 17 14 12 12 8 4
We are planning to use the actuarial estimator — so we count those who are censored or died
as having had half a year at risk, and count those who entered at a given age as having half a
year at risk in that year, we get the following counts:

Age 65 66 67 68 69 70 71 72 73 74 75
# at risk 1.5 6.0 9.5 9.5 11.5 13.0 11.5 10.5 9.0 6.0 3.5

(c) Again, counting whole years at risk for those who enter, die, or are right-censored, we have

Age   ni   di     hi   Ŝ(ti)  S̃(ti)
 70   17    4  0.235   0.765  0.790
 72   12    1  0.083   0.701  0.727
 73   12    3  0.250   0.526  0.566
 74    8    4  0.500   0.263  0.343
 75    4    1  0.250   0.197  0.268

The actuarial estimate gives us

Age    ni   di     hi   Ŝ(ti)  S̃(ti)
 70  13.0    4  0.308   0.692  0.735
 72  10.5    1  0.095   0.626  0.668
 73   9.0    3  0.333   0.418  0.479
 74   6.0    4  0.667   0.139  0.246
 75   3.5    1  0.286   0.099  0.185
Note that we might reasonably suggest that age is not a sensible time variable here, since
mortality is largely determined by time since diagnosis. We see that the estimator of survival
past age 78 is 0, since the single individual who happened to be in the study at that age died.
This despite the fact that there are other individuals who entered later and survived to much
older ages. We might reasonably look instead at the time-on-test as time variable. We would
then get the following calculation:

tj   nj   dj    hj   Ŝ(tj)  S̃(tj)
 2   27    1  0.04   0.96   0.96
 3   22    6  0.27   0.70   0.73
 4   16    8  0.50   0.35   0.44
 5    8    5  0.62   0.13   0.24

(d) We use the whole-year method, rather than the actuarial estimate. Our central estimate
for the probability of surviving from age 70 to age 75 is Ŝ(74) = 0.343. Using Greenwood’s
estimate, we estimate the variance of $\log\hat S(74)$ to be
$$\sum_{t_i\le 74}\frac{d_i}{n_i(n_i - d_i)} = \frac{4}{17\cdot13} + \frac{1}{12\cdot11} + \frac{3}{12\cdot9} + \frac{4}{8\cdot4} = 0.178,$$
so the standard error is $\sqrt{0.178} = 0.422$. Thus an approximate 95% confidence interval for
S(74) is
$$\big(0.343\,e^{-0.422\cdot1.96},\ 0.343\,e^{0.422\cdot1.96}\big) = (0.150, 0.784).$$
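For completeness, a sketch of the same interval computed in R (not part of the original solution):

S.hat <- 0.343
var.log <- 4/(17*13) + 1/(12*11) + 3/(12*9) + 4/(8*4)   # Greenwood sum, about 0.178
se.log <- sqrt(var.log)                                 # about 0.422
S.hat * exp(c(-1, 1) * 1.96 * se.log)                   # approximately (0.150, 0.784)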

(e)
require('survival')
age.entry = c(67,70,70,65,65,73,69,76,66,72,65,71,69,71,68,69,69,66,
              73,67,66,69,66,78,66,68,70,66,89,68)
age.exit = c(72,71,73,70,68,78,74,78,67,76,70,75,71,74,73,74,71,68,76,68,70,73,
             70,81,70,73,74,68,92,72)
delta = c(0,0,1,0,1,1,1,1,0,1,1,1,0,1,0,1,0,0,1,0,1,1,1,1,1,1,1,0,1,1)

clinic.surv = Surv(time=age.entry, time2=age.exit, event=delta)  # left-truncated, right-censored is default
KM.fit = survfit(clinic.surv~1, subset=(age.exit>=70))  # Survival of those present after age 70
plot(KM.fit, firstx=70, xmax=75, ylab='Survival probability', main='Kaplan-Meier estimator', xlab='Age (yrs)')

time.on.test = age.exit - age.entry   # time since entry into the study
TOT.surv = Surv(time=time.on.test, event=delta)
TOT.fit = survfit(TOT.surv~1)
plot(TOT.fit, ylab='Survival probability', main='Kaplan-Meier estimator', xlab='Time on test (yrs)')

[Two Kaplan-Meier plots: (a) Survival by age (70-75 years); (b) Survival by time on test (0-5 years).]



B.3 Survival regression models and two-sample testing


1. (a) In a Weibull model, the survival function is $S(x) = e^{-(\rho x)^\alpha}$. Thus $\log(-\log S(x)) = \alpha\log\rho +
\alpha\log x$, and if we plot $\log(-\log\hat S(x))$ against $\log x$ we should see something close to a straight
line. Since the exponential model is a submodel of the Weibull (with $\alpha = 1$), we can apply
the likelihood ratio test. If $\ell(\rho, \alpha)$ is the log likelihood, we have under the null model (that
the data were sampled from an exponential distribution)
$$2\Big(\sup_{(\rho,\alpha)} \ell(\rho, \alpha) - \sup_{\rho} \ell(\rho, 1)\Big) \sim \chi^2_1.$$
For the log-logistic model, we expect the plot of $\log\big(\tfrac{1}{\hat S(x)} - 1\big)$ against $\log x$ to be approximately
linear.
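A minimal R sketch of these diagnostic plots (not part of the original solution), assuming s is a Surv object for a single sample:

library(survival)
km <- survfit(s ~ 1)
keep <- km$surv > 0 & km$surv < 1        # avoid log(0) at the extremes
t <- km$time[keep]; S <- km$surv[keep]
plot(log(t), log(-log(S)))               # roughly linear under a Weibull model
plot(log(t), log(1/S - 1))               # roughly linear under a log-logistic model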
(b) Let $S_1$ and $S_2$ be the survival curves for the two populations, and $S_0$ the baseline survival.
Under the accelerated lifetime model, $S_i(x) = S_0(\rho_i x)$ for some positive constants $\rho_1, \rho_2$.
Then if we plot $S_i(x)$ against $\log x$, we see that whatever value $S_0$ takes at abscissa $\log x$, $S_i$
will take the same value at an offset of $\log\rho_i$. (The same will be true of any function of
$S_i$.) Thus, the graphs corresponding to $\hat S_1$ and $\hat S_2$ should differ approximately by a uniform
horizontal shift.
The proportional hazards assumption is best tested by plotting $\log(-\log\hat S_i(x))$. Under PH,
$S_i(x) = S_0(x)^{\rho_i}$, which implies that
$$\log(-\log S_i(x)) = \log(-\rho_i\log S_0(x)) = \log(\rho_i) + \log(-\log S_0(x)).$$
Thus, if $\log(-\log\hat S_i(x))$ is plotted against x, the two graphs should differ approximately by a
constant vertical shift if the two groups satisfy the PH assumption. The same is true if we
plot $\log(-\log\hat S_i(x))$ against any function of x. Thus, if we plot $\log(-\log\hat S_i(x))$ against $\log x$,
we will see a constant vertical shift reflecting the PH assumption, and a constant horizontal
shift reflecting the AL assumption.
(c) The computations for the Kaplan–Meier estimator are given in Table B.1. In figure B.1 we
plot the two survival curves (red for control, black for treatment), as log(− log Ŝ) against
log x. Both look reasonably close to lines, so it would be reasonable to suppose that they came
from Weibull models. The lines are approximately parallel, suggesting that the α parameters
are approximately the same. This means that one curve may be obtained from another by a
horizontal or vertical shift, suggesting that PH or AL would be appropriate. (Weibull curves
with the same α parameter, it should be noted, satisfy both hypotheses.)

Control group:                             Treatment group:
tj   dj   nj     ĥj   Ŝ(tj)               tj   dj   nj     ĥj   Ŝ(tj)
 1    2   21  0.095  0.905                 6    3   21  0.143  0.857
 2    2   19  0.105  0.810                 7    1   17  0.059  0.806
 3    1   17  0.059  0.762                10    1   15  0.067  0.752
 4    2   16  0.125  0.667                13    1   12  0.083  0.690
 5    2   14  0.143  0.572                16    1   11  0.091  0.627
 8    4   12  0.333  0.381                22    1    7  0.143  0.537
11    2    8  0.250  0.286                23    1    6  0.167  0.448
12    2    6  0.333  0.191
15    1    4  0.250  0.143
17    1    3  0.333  0.095
22    1    2  0.500  0.048
23    1    1  1.000  0.000

Table B.1: Estimates for control group (left) and treatment group (right) in Gehan study.

We test the hypothesis by finding maximum likelihood estimators. The log likelihood for the
exponential distribution is
$$\ell(\lambda) = \sum_i(-\lambda x_i) + d\log\lambda,$$
where d is the number of uncensored observations. Since the maximum likelihood estimator
is $\hat\lambda = d/\sum x_i$, we get maximum likelihoods of
$$\ell^*_{\exp} = d\Big(\log d - 1 - \log\sum x_i\Big).$$
For the Weibull distribution we have
$$\ell(\rho, \alpha) = -\sum_i(\rho x_i)^\alpha + d(\alpha\log\rho + \log\alpha) + (\alpha - 1)\sum_{\text{uncensored}}\log x_i.$$
There is no closed form solution, but we can optimise numerically, yielding estimates

            Treatment   Control
λ̂              0.025      0.12
ℓ*_exp        -42.17    -66.35
ρ̂              0.030      0.11
α̂               1.35      1.37
ℓ*_weib       -41.66    -64.92

Twice the log likelihood ratio for the treatment group is thus $2\big((-41.66) - (-42.17)\big) = 1.02$, and for
the control group it is 2.86. Comparing these to the $\chi^2$ distribution with 1 degree of freedom,
we see that the cutoff for rejecting the null hypothesis that $\alpha = 1$ at the 0.05 significance
level would be 3.84. Thus, we cannot reject the null hypothesis for either group.
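A sketch of the numerical optimisation in R (illustrative, not the original code; it assumes vectors time and status holding the right-censored observations for one group):

weib.loglik <- function(par, time, status) {
  rho <- exp(par[1]); alpha <- exp(par[2])   # optimise on the log scale so both stay positive
  sum(status * (alpha*log(rho) + log(alpha) + (alpha - 1)*log(time))) - sum((rho*time)^alpha)
}
fit <- optim(c(0, 0), weib.loglik, time = time, status = status,
             control = list(fnscale = -1))   # fnscale = -1 makes optim maximise
exp(fit$par)                                 # (rho.hat, alpha.hat)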
2. (a) Assuming no ties, the partial likelihood is constructed by computing the probability that the
subjects failed in exactly the order observed, conditioned on the times observed.
The proportional hazards (PH) assumption says that subject i has hazard rate hi (x) = ri h0 (x)
at time x, where h0 is an unspecified baseline hazard. In the regression approach, we think
of ri as a function r(yi ) of a vector yi of covariates. The linear approach is to suppose
φ(r(y)) = β · y, where φ is the link function and β is a vector of parameters to estimate. In the
Cox model we use the logarithmic link function, so that r(y) = eβ·y . The partial likelihood is
defined as
$$L_P(\beta; y) := \prod_{t_i}\frac{e^{\beta\cdot y_{(i)}}}{\sum_{j\in R_i}e^{\beta\cdot y_j}},$$
where $y_{(i)}$ represents the covariates of the subject failing at time $t_i$ and $R_i$ is the risk set, of
those subjects at risk at $t_i$.
We use LP as though it were a likelihood. We compute the parameters β̂ that maximise LP .
Under the assumption that the observations came from the distribution given by this model
with some (unknown) parameter β, the estimate β̂ is asymptotically normal, with mean β
and variance matrix that may be estimated by
 !−1
2
∂ `P
E −  , where `P = log LP .
∂β∂β T


[Plot: log(−log(Survival)) against log(Age) for the two Kaplan–Meier survival curves; see caption.]

Figure B.1: Plot of estimated survival for Gehan leukaemia data. The control group is in red,
the treatment group is black.

(b) The hazard ratio is
$$\frac{h(\text{clinic}=1, \text{prison}=0)}{h(\text{clinic}=0, \text{prison}=1)} = \frac{e^{\hat\beta\cdot y_1}}{e^{\hat\beta\cdot y_2}} = \frac{e^{-1.009}}{e^{0.327}} = 0.263.$$
(c) The log hazard ratio for prison/no prison is 0.327, with standard error 0.167. A 95%
confidence interval for the coefficient is $0.327 \pm 1.96\cdot0.167 = (0.0, 0.654)$. Thus a 95%
confidence interval for the hazard ratio is $e^{(0.0,\,0.654)} = (1.00, 1.92)$.
3. We give below R code for computing this in two different ways: Using the function survdiff, which
does the computation automatically, and by extracting the relevant quantities from the survival
object and doing the computation directly.
We get Z = −1.67, which corresponds to a p-value of 0.09.
Using survdiff we get the same result, but it is reported as a chi-squared statistic of 2.8 (which is
1.672 ) on 1 degree of freedom.

SURVDIFF CODE
> require(’survival’)
> require(’KMsurv’)
> data(tongue)
> attach(tongue)

>
> tongue.surv=Surv(time,delta)
> tongue.fit=survfit(tongue.surv~type)
> tdiff=survdiff(tongue.surv~type)
> tdiff
Call:
survdiff(formula = tongue.surv ∼ type)

N Observed Expected (O-E)^2/E (O-E)^2/V


type=1 52 31 36.6 0.843 2.79
type=2 28 22 16.4 1.873 2.79

Chisq= 2.8 on 1 degrees of freedom, p= 0.0949

DIRECT COMPUTATION
# Problem sheet 4, question 1
require(’survival’)
require(’KMsurv’)
data(tongue)
attach(tongue)

tongue.surv=Surv(time,delta)
tongue.fit=survfit(tongue.surv~type)

n1=tongue.fit$strata[1]
n2=tongue.fit$strata[2]

# Input two vectors of times t1,t2, and


# numbers at risk n1,n2 whose length is 1 longer than the t’s
# Output four vectors I1, I2, (of same length as t1,t2) and Y1,Y2
# I1[k] gives an index of I2 corresponding to
# the last time in t2 that precedes t1[k]
# Thus, we have t2[I1[k]]<=t1[k] < t2[I1[k]+1],
# and r2[I1[k]+1] is the number of type 2 individuals at risk
# at the time t1[k] (when there are r1[k] type 1 individuals)
# Y1=r1[I1]

crossrisk=function(t1,t2,r1,r2){
I1=rep(0,length(t1))
I2=rep(0,length(t2))
for(i in seq(length(t1))){
I1[i]=1+sum(t1[i]>t2)
}
for(i in seq(length(t2))){

I2[i]=1+sum(t2[i]>t1)
}
list(I1,I2,r1[I2],r2[I1])
}

r1=tongue.fit$n.risk[seq(n1)]
r2=tongue.fit$n.risk[seq(n1+1,n1+n2)]

r1=c(r1,r1[n1]-tongue.fit$n.event[n1]-tongue.fit$n.censor[n1])
r2=c(r2,r2[n2]-tongue.fit$n.event[n1+n2]-tongue.fit$n.censor[n1+n2])
t1=tongue.fit$time[seq(n1)]
t2=tongue.fit$time[seq(n1+1,n1+n2)]

cr=crossrisk(t1,t2,r1,r2)

Y1=c(r1[-n1],cr[[3]])
Y2=c(cr[[4]],r2[-n2])
# Note: r1 and r2 had an extra count added on to make crossrisk work
d1=c(tongue.fit$n.event[seq(n1)],rep(0,n2))
d2=c(rep(0,n1),tongue.fit$n.event[seq(n1+1,n1+n2)])

t=c(t1,t2)

# We have to deal with the problem of ties between times for the two groups

dup1=which(duplicated(t,fromLast=TRUE))
dup2=which(duplicated(t))
ndup=length(dup1)

# Type 2 Event counts are removed from the second appearance


# and placed in the first appearance
d2[dup1]=d2[dup2]
d2=d2[-dup2]
d1=d1[-dup2]

# Type 2 at-risk counts are removed from the second appearance


# and placed in the first appearance
Y2[dup1]=Y2[dup2]
Y2=Y2[-dup2]
Y1=Y1[-dup2]
t=t[-dup2]

tord=order(t)
t=t[tord] #put times in order
## Now put everything else in the same order
Y1=Y1[tord]
Y2=Y2[tord]
d1=d1[tord]
d2=d2[tord]
# (Y and d are recomputed from the reordered components below)

Y=Y1+Y2

d=d1+d2

# Product of number at risk


atriskprod=Y1*Y2
includes=(atriskprod>0)&(d>0)
# We only get contributions if someone’s at risk and events occurred at that time

Y=Y[includes]
Y1=Y1[includes]
Y2=Y2[includes]
d=d[includes]
d2=d2[includes]
d1=d1[includes]

t=t[includes]

wLR=Y1*Y2/Y
p=1
q=0

S=c(1,cumprod((Y-d)/Y))[-length(Y)] #K-M estimator for survival


wFH=(1-S)^q*S^p*wLR

# Now compute the test statistic

w=wLR

M=w*(d1/Y1-d2/Y2)
sigma=w*w*d*(Y-d)/Y2/Y1/(Y-1)
sK=d*Y1*Y2*(Y-d)/Y^2/(Y-1)

Z=sum(M)/sqrt(sum(sigma))

> Z
[1] -1.670246
4. (a) The plot is:
[Plot for part (a): curves for the Weibull (kappa = 0.5 and 1.5) and log-logistic (kappa = 0.5 and 1.5) models over the range 0-4.]

(b) One could consider using the log-logistic or the log-normal.


(c) The hazard function is $h(t) = \kappa(\rho e^{\beta\cdot x})^\kappa t^{\kappa-1}$ and the survival function is $e^{-(\rho e^{\beta\cdot x} t)^\kappa}$. Hence
the log likelihood is
$$\ell(\rho, \kappa, \beta) = n\log\kappa + n\kappa\log\rho + \kappa\sum\beta\cdot x_i + (\kappa - 1)\sum\log t_i - \sum\big(\rho e^{\beta\cdot x_i} t_i\big)^\kappa.$$
The MLE must satisfy
$$0 = \frac{\partial\ell}{\partial\rho} = \frac{n\kappa}{\rho} - \kappa\rho^{\kappa-1}\sum\big(e^{\beta\cdot x_i} t_i\big)^\kappa,$$
$$0 = \frac{\partial\ell}{\partial\kappa} = n\Big(\frac{1}{\kappa} + \log\rho\Big) + \sum\beta\cdot x_i + \sum\log t_i - \sum\big(\rho e^{\beta\cdot x_i} t_i\big)^\kappa\log\big(\rho e^{\beta\cdot x_i} t_i\big),$$
$$0 = \frac{\partial\ell}{\partial\beta_j} = \kappa\sum_i x_{ij}\Big(1 - \big(\rho e^{\beta\cdot x_i} t_i\big)^\kappa\Big).$$
Asymptotically, the estimators will be normally distributed. If some observations are right-
censored, the log likelihood becomes
$$\ell(\rho, \kappa, \beta) = n_d\log\kappa + n_d\kappa\log\rho + \kappa\sum\delta_i\,\beta\cdot x_i + (\kappa - 1)\sum\delta_i\log t_i - \sum\big(\rho e^{\beta\cdot x_i} t_i\big)^\kappa,$$
where $n_d$ is the number of (uncensored) events observed.
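In practice this Weibull regression is usually fitted with survreg; a sketch, assuming the data sit in a data frame dat with columns time, status, x1 and x2 (these names are illustrative):

library(survival)
fit <- survreg(Surv(time, status) ~ x1 + x2, data = dat, dist = "weibull")
summary(fit)
# survreg uses the accelerated-failure-time parametrisation: kappa corresponds to 1/fit$scale,
# and the proportional-hazards coefficients can be recovered as -coef(fit)[-1]/fit$scale.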


5. (a) The times ti are 50, 52, 58, 61, 67, 68, 70, 72, 75. A full description of the risk sets requires that
we describe exactly which individuals are at risk. We number the males as M 1, . . . , M 12 and
F 1, . . . , F 12. We have then the risk sets

R1 = M 1, M 3, M 4, M 5, M 6, M 7, M 8, M 9, M 10, M 11, M 12, F 1, F 2, F 3, F 4, F 5, F 6,
F 7, F 8, F 9, F 10, F 11, F 12

R2 = M 3, M 4, M 5, M 6, M 7, M 8, M 9, M 10, M 11, M 12, F 1, F 2, F 3, F 4, F 5, F 6, F 7,
F 8, F 9, F 10, F 11, F 12

R3 = M 4, M 5, M 6, M 7, M 8, M 9, M 10, M 11, M 12, F 1, F 3, F 4, F 5, F 6, F 7, F 8, F 9,
F 10, F 11, F 12

R4 = M 4, M 5, M 6, M 7, M 8, M 9, M 10, M 11, M 12, F 1, F 3, F 4, F 5, F 6, F 7, F 8, F 9,
F 10, F 12

R5 = M 4, M 5, M 6, M 7, M 8, M 9, M 10, M 12, F 3, F 4, F 5, F 6, F 7, F 8, F 9, F 10, F 12

R6 = M 4, M 5, M 6, M 7, M 9, M 10, M 12, F 3, F 4, F 5, F 6, F 7, F 8, F 9, F 10, F 12

R7 = M 5, M 7, M 9, M 10, M 12, F 3, F 4, F 5, F 7, F 8, F 9, F 10, F 12

R8 = M 5, M 9, M 10, M 12, F 4, F 5, F 9, F 10, F 12

R9 = M 10, M 12, F 4, F 10 .

Note that there is some ambiguity in breaking ties. When an observation is censored at time
ti we must decide whether to treat the censoring as having occurred just after or just before
ti : that is, was the individual available to have been counted if they had died at time ti or
not? We have chosen the former: Thus, for instance, R9 is the set of individuals at risk at
time 75, and it includes M12, who was censored at age 75. Either one is acceptable — though
details of the study may suggest one or the other interpretation — but it should be specified.
Since we are interested only in the binary covariate of gender, we need only consider the risk
sets as counting the numbers of males and females, coded as Ri = (mi , fi ). We may then
summarise them as

R1 = (11, 12) R2 = (10, 12) R3 = (9, 11) R4 = (9, 10) R5 = (8, 9)


R6 = (7, 9) R7 = (5, 8) R8 = (4, 5) R9 = (2, 2).

(b) Using the notation as above, and setting the vector of covariates to be x = (1, 0, 0, 1, 1, 1, 0, 0, 0)
– coding female as 0 and male as 1 — we have the partial likelihood being
$$L_P = \prod_{i=1}^{9}\frac{e^{\beta x_i}}{f_i + e^\beta m_i} = e^{4\beta}\prod_{i=1}^{9}\big(f_i + e^\beta m_i\big)^{-1}. \qquad (B.9)$$

A plot of this function is in Figure B.2. The maximum likelihood is attained at β = −0.042.
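As a sketch (not part of the original solution), the maximiser of (B.9) can also be found numerically in R from the risk-set counts in (a):

m <- c(11, 10, 9, 9, 8, 7, 5, 4, 2)      # males at risk at each event time
f <- c(12, 12, 11, 10, 9, 9, 8, 5, 2)    # females at risk at each event time
x <- c(1, 0, 0, 1, 1, 1, 0, 0, 0)        # 1 = the death at that time was male
logLP <- function(beta) sum(beta * x - log(f + exp(beta) * m))
optimize(logLP, c(-2, 2), maximum = TRUE)$maximum   # approximately -0.04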


[Plot of the log partial likelihood against β over the range −2 to 2; see caption below.]

Figure B.2: Plot of negative logarithm of partial likelihood given by (B.9).

(c) We need to compute first Ŝ for the combined population. We have

event time      50     52     58     61     67     68     70     72     75
Male    dmi      1      0      0      1      1      1      0      0      0
        mi      11     10      9      9      8      7      5      4      2
Female  dfi      0      1      1      0      0      0      1      1      1
        fi      12     12     11     10      9      9      8      5      2
Total   di       1      1      1      1      1      1      1      1      1
        ni      23     22     20     19     17     16     13      9      4
Ŝ(ti−1)          1  0.957  0.913  0.867  0.822  0.773  0.725  0.669  0.595

Plugging these into the formula
$$Z = \frac{\sum_{i=1}^{9}\big(d_i^m - n_i^m\,\frac{d_i}{n_i}\big)}{\sqrt{\sum_{i=1}^{9}\frac{n_i^m n_i^f (n_i - d_i)\,d_i}{n_i^2(n_i - 1)}}},$$
we get Z = −0.063, which should be like a draw from a standard normal distribution if the male and
female survival times were drawn from the same distribution. In fact, we get a p-value of
$2(1 - \Phi(0.063)) = 0.95$.
(d) For the Fleming-Harrington test we down-weight the later times, when very few are at risk,
substituting
$$Z_{FH} = \frac{\sum_{i=1}^{9}\hat S(t_{i-1})\big(d_i^m - n_i^m\,\frac{d_i}{n_i}\big)}{\sqrt{\sum_{i=1}^{9}\hat S(t_{i-1})^2\,\frac{n_i^m n_i^f (n_i - d_i)\,d_i}{n_i^2(n_i - 1)}}} = 0.105,$$
yielding a p-value for the two-sided test of 0.92. In either case, of course, we would not reject
the null hypothesis. This is not surprising, as the sample is very small.
Note that this analysis could be improved by taking account of the pairing of twins.
(e) Death due to other causes is unlikely to be independent of CHD. Hence, non-informative
censoring is questionable.

6. For clarity, we repeat the derivation in this particular setting. Let $Z(t_j)$ be the $2\times2$ matrix
$X(t_j)^TX(t_j)$. Since $x_i(t_j) = x_i = x_i^2$ is 0 or 1, we have
$$Z_{00} = \sum Y_i(t)^2 = n(t), \qquad Z_{10} = Z_{01} = Z_{11} = \sum Y_i(t)\,x_i = n_1(t),$$
where $n(t)$ is the number of individuals at risk at time t, and $n_j(t)$ is the number of individuals at
risk at time t with $x_i = j$. The determinant is $n_1(t)\,n_0(t)$. Inverting this, we get
$$X^-(t) = \begin{pmatrix} \frac{1}{n_0(t)} & -\frac{1}{n_0(t)} \\ -\frac{1}{n_0(t)} & \frac{1}{n_1(t)} + \frac{1}{n_0(t)} \end{pmatrix}
\begin{pmatrix} Y_1(t) & Y_2(t) & \cdots & Y_n(t) \\ x_1Y_1(t) & x_2Y_2(t) & \cdots & x_nY_n(t) \end{pmatrix},$$
as long as $n_1(t)\,n_0(t) > 0$ (and 0 otherwise). Thus, since $Y_{i_j}(t_j)$ is always 1 (since the individual
who has an event must, by definition, be at risk), and writing $x_0$ for the covariate of that individual,
$$X^-(t_j)_{\cdot\, i_j} = \begin{pmatrix} \dfrac{1 - x_0}{n_0(t_j)} \\[6pt] -\dfrac{1 - x_0}{n_0(t_j)} + \dfrac{x_0}{n_1(t_j)} \end{pmatrix}.$$
Hence the baseline cumulative hazard estimate is
$$\hat B_0(t) = \sum_{t_j\le t}\frac{1 - x_0}{n_0(t_j)} = \sum_{t_j\le t:\ x_{i_j}=0}\frac{1}{n_0(t_j)},$$
which is the definition of the Nelson-Aalen estimator for the cumulative hazard, considering only
the individuals with $x_i = 0$; and the estimated cumulative increment due to $x_i = 1$ is
$$\hat B_1(t) = \sum_{t_j\le t:\ x_{i_j}=1}\frac{1}{n_1(t_j)} - \sum_{t_j\le t:\ x_{i_j}=0}\frac{1}{n_0(t_j)}.$$
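In R, an additive-hazards fit of this kind can be obtained with the aareg function from the survival package; a sketch, assuming vectors time, status and a binary covariate x:

library(survival)
fit <- aareg(Surv(time, status) ~ x)
plot(fit)   # cumulative regression functions: the intercept B0(t) and the increment B1(t) due to x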

7. (a) As described in the previous question, the difference may be estimated by the difference
between the Nelson-Aalen estimators:
$$\hat B_1(t) = \hat H_0(t) - \hat H_1(t) = \sum_{t_j\le t}\Big(\frac{d_{0j}}{n_{0j}} - \frac{d_{1j}}{n_{1j}}\Big).$$
Calling the Maintenance group number 1, and Nonmaintenance number 0, we read off of
Table 4.3 $\hat H_1(20) = \hat H_1(18) = 0.32$, and $\hat H_0(20) = 0.49$, yielding
$$\hat B_1(20) = 0.17.$$
The variance will be the sum of the variances of the two estimators (since they are independent).
As long as there are no ties between events from different groups, this may be estimated by
$$\sum_{t_j\le t}\frac{d_{0j}}{n_{0j}^2} + \sum_{t_j\le t}\frac{d_{1j}}{n_{1j}^2}.$$
From the table we can see that this is
$$\sigma_1^2(20) + \sigma_0^2(20) = \frac{1}{12^2} + \frac{1}{11^2} + \frac{1}{10^2} + \frac{1}{9^2} + \frac{1}{8^2} + \frac{1}{10^2} + \frac{1}{8^2} = 0.0788.$$
Thus, an approximate 95% confidence interval for $\hat B_1(20)$ would be
$$0.17 \pm 0.28\cdot1.96 = 0.17 \pm 0.550.$$



(b) The Cox model fit by coxph produced the outcome

coxph(formula = Surv(time, status) ~ x, data = aml)

                 coef exp(coef) se(coef)    z     p
xNonmaintained  0.916       2.5    0.512 1.79 0.074
Likelihood ratio test=3.38 on 1 df   p=0.0658   n= 23

In Table 5.2 we tabulated the estimators for the baseline hazard, obtaining Ĥ0 (18) = 0.254.
A central estimate for the difference in cumulative hazard between the two groups would be

(1 − eβ̂ )Â0 (18) = 1.5 · 0.254 = 0.38.

We see that this is a substantially larger estimate than we made in the nonparametric model.
This is consistent with the plot in Figure 5.4, where the purple circles and blue crosses
(representing the survival estimates from the proportional hazards model for the two groups)
are further apart at tj = 18 than the black and red lines (representing the Kaplan–Meier
estimators). This reflects that fact that the separate Kaplan–Meier estimators are cruder,
making larger jumps at less frequent intervals.
To estimate the standard error, we begin by assuming (with little justification) that the
estimators β̂ and Ĥ0 (tj ) are approximately independent. Then we can use the delta method
to estimate the variance. Let $\sigma_\beta^2$ be the variance of $\hat\beta$, and $\sigma_H^2$ the variance of $\hat H_0(18)$. So we
can represent
$$\hat\beta \approx \beta_0 + \sigma_\beta Z, \qquad \hat H_0(18) = H_0(18) + \sigma_H Z',$$
where Z and Z' are standard normal (also approximately independent). We already have
the estimate σ̂β ≈ 0.512. We haven’t given a formula for an estimator of σH (18), but we can
easily compute it with R.

require(survival)

cp=coxph(Surv(time,status)~x,data=aml)

aml.fit=survfit(cp)

aml.fit$std.err[aml.fit$time==18]
[1] 0.150247

Then our estimator for the difference in cumulative hazard is
$$(1 - e^{\hat\beta})\hat H_0(18) \approx \big(1 - e^{\beta_0 + \sigma_\beta Z}\big)\big(H_0(18) + \sigma_H Z'\big)
\approx \big(1 - e^{\beta_0}(1 + \sigma_\beta Z)\big)\big(H_0(18) + \sigma_H Z'\big)
\approx \big(1 - e^{\beta_0}\big)H_0(18) - e^{\beta_0}\sigma_\beta H_0(18)\,Z + \big(1 - e^{\beta_0}\big)\sigma_H Z' - e^{\beta_0}\sigma_\beta\sigma_H\,ZZ'.$$
(Note that the approximation in the first line is based on assuming $\sigma_\beta$ is much smaller than
$\beta_0$, which isn't really very true here.) As long as we are assuming independence of Z and Z',
the variance will be approximately
$$\big(e^{\beta_0}\sigma_\beta H_0(18)\big)^2 + \big((1 - e^{\beta_0})\sigma_H\big)^2 = 0.325^2 + 0.225^2 = 0.156,$$

so the standard error is about 0.395.



A better estimate, also taking into account the dependence between β̂ and Ĥ0 , could be
obtained by not using the delta method, but instead treating the normal distribution of β̂
as a Bayesian posterior distribution on β0 . For a range of possible β0 we can compute an
approximate mean and variance for Ĥ0 , and then compute a Monte Carlo estimator of the
variance of Γ̂.
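A crude Monte Carlo sketch in this spirit (not from the notes; for simplicity it keeps the independence assumption rather than modelling the dependence just described, and treats β̂ ≈ N(0.916, 0.512²) and Ĥ0(18) ≈ N(0.254, 0.150²)):

set.seed(1)
beta.sim <- rnorm(10000, 0.916, 0.512)
H0.sim <- rnorm(10000, 0.254, 0.150)
gamma.sim <- (exp(beta.sim) - 1) * H0.sim   # simulated difference in cumulative hazard at t = 18
c(mean(gamma.sim), sd(gamma.sim))           # Monte Carlo estimate of the mean and standard error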
(c) We let $x_0(t)$ be the covariate trajectory for this individual, so recalling that the maintained
group is the baseline this means that
$$x_0(t) = \begin{cases} 0 & \text{if } t \le 10, \\ 1 & \text{if } t > 10. \end{cases}$$
Using the formula (5.9) we estimate for this individual
$$\hat H(20 \mid x_0) = \int_0^{20} e^{\beta x_0(u)}\, d\hat H_0(u)
= \int_0^{10} d\hat H_0(u) + e^\beta\int_{10}^{20} d\hat H_0(u)
\approx \hat H_0(10) + 2.5\big(\hat H_0(20) - \hat H_0(10)\big)
= 0.14 + 2.5(0.114) = 0.425.$$

B.4 Model diagnostics, repeated events


1. (a) For simplicity, we assume no ties. If $t_1 < t_2 < \cdots < t_l$ are the event times, the partial
likelihood is
$$L_P(\beta) = \prod_{j=1}^{l}\frac{\exp\big(\beta^Tx_{i_j}\big)}{\sum_{i\in R_j}\exp\big(\beta^Tx_i\big)},$$
where $i_j$ is the individual who had an event at time $t_j$, and $R_j$ is the set of those at risk at
time $t_j$. The log partial likelihood is
$$\ell_P(\beta) = \sum_{j=1}^{l}\beta^Tx_{i_j} - \sum_{j=1}^{l}\log\Big(\sum_{i\in R_j}e^{\beta^Tx_i}\Big).$$
(The first term could be simplified to $\beta^T\sum_{i=1}^n\delta_ix_i$, but it is perhaps better understood in
the form stated here.) The score function has k-th component
$$\frac{\partial\ell_P}{\partial\beta_k} = \sum_{j=1}^{l}\Big(x_{i_jk} - \sum_{i\in R_j}x_{ik}\,\frac{e^{\beta^Tx_i}}{\sum_{i\in R_j}e^{\beta^Tx_i}}\Big).$$
That is, it is the total difference between the k-th covariate of the individual with event at time
$t_j$ and the average k-th covariate of those at risk, weighted according to the relative risk with
parameter $\beta$.
(b) The observed partial information has (k, m) coordinate given by
$$-\frac{\partial^2\ell_P}{\partial\beta_k\,\partial\beta_m} = \sum_{j=1}^{l}\Bigg[\sum_{i\in R_j}x_{ik}x_{im}\,\frac{e^{\beta^Tx_i}}{\sum_{i\in R_j}e^{\beta^Tx_i}} - \Big(\sum_{i\in R_j}x_{ik}\,\frac{e^{\beta^Tx_i}}{\sum_{i\in R_j}e^{\beta^Tx_i}}\Big)\Big(\sum_{i\in R_j}x_{im}\,\frac{e^{\beta^Tx_i}}{\sum_{i\in R_j}e^{\beta^Tx_i}}\Big)\Bigg].$$
That is, it is the sum of covariances between the k-th and m-th components of individuals at
risk at time $t_j$, where an individual is selected in proportion to relative risk.
2. (a) That would treat the categorical variable as though it were quantitative. That would force
the relative risks into particular proportions that have no empirical basis. There may be
good reason to expect the relative risk to increase with stage, but not to expect particular
proportions.
(b) We could fit the model without any covariates — so just find the Nelson–Aalen estimator— and
use that as a basis for adding in the stage as a covariate and checking the martingale residuals.
Here we will use age as an additional covariate. So we will fit the model αi (t) = α0 (t)eβ·age ,
and check for the behaviour of stage as an additional covariate. We show a box plot in figure
B.3, showing the distributions of martingale residuals for the 4 different stages. What we see
is that the residuals have essentially the same mean for stages 1 and 2, rise substantially for
stage 3, and somewhat less for stage 4.
(c) We would compute the scaled Schoenfeld residuals, and plot them as a function of event time.
If the proportional-hazards assumption holds — that is, if the proportionality parameter
associated with age is effectively constant — this should stay close to 0, with no apparent
patterns or trends.
(d) We compute the Breslow estimator $\hat H_0(t)$ for the baseline hazard. The Cox-Snell residual
is $r_i = \hat H_0(T_i)\,e^{\beta x_i}$, where $T_i$ is the time for individual i. We then compute a Nelson-Aalen
estimator for the right-censored times $(r_i, \delta_i)$. If the Cox model is a good fit, the estimated
cumulative hazard should look approximately like an upward sloping line through the origin
with slope 1.
3. (a) The R computation below shows that the coefficient for stage 2 is clearly not statistically
significant; the coefficient for stage 3 is borderline (p = 0.071); and the coefficient for stage 4
is highly significant (p = 0.000053).

lar.cph=coxph(Surv(time,delta)~factor(stage)+age, data=larynx)

                 coef exp(coef) se(coef)     z       p
factor(stage)2  0.140      1.15   0.4625 0.303   0.762
factor(stage)3  0.642      1.90   0.3561 1.804   0.071
factor(stage)4  1.706      5.51   0.4219 4.043 5.3e-05
age             0.019      1.02   0.0143 1.335   0.182

Likelihood ratio test=18.3 on 4 df, p=0.00107   n= 90, number of events= 50

stage2=(larynx$stage==2)
stage3=(larynx$stage==3)
stage4=(larynx$stage==4)
lar2.cph=coxph(Surv(time,delta)~stage2+stage3+stage4+age, data=larynx)
              coef exp(coef) se(coef)    z       p
stage2TRUE  0.1400    1.1503   0.4625 0.30   0.762
stage3TRUE  0.6424    1.9010   0.3561 1.80   0.071
stage4TRUE  1.7060    5.5068   0.4219 4.04 5.3e-05
age         0.0190    1.0192   0.0143 1.33   0.182

Likelihood ratio test=18.3 on 4 df, p=0.00107
n= 90, number of events= 50

(b)
require(survival)
require(KMsurv)

data(larynx)
lar.cph=coxph(Surv(time,delta)~age, data=larynx)

     coef exp(coef) se(coef)    z    p
age 0.0233      1.02   0.0145 1.61 0.11

Likelihood ratio test=2.63 on 1 df, p=0.105   n= 90, number of events= 50

lar.fit=survfit(lar.cph)

# The coxph object has a list of times.
# We want to find the index of the time corresponding to individual i.
whichtime=sapply(larynx$time, function(t) which(lar.fit$time==t))

cumhaz=-log(lar.fit$surv[whichtime])

beta=lar.cph$coefficients
relrisk=exp(beta*(larynx$age-mean(larynx$age)))
# Baseline hazard is for mean value of covariate

resids=larynx$delta-cumhaz*relrisk
# Note: We could get the same numbers out as lar.cph$residuals
resids.bystage=lapply(1:4, function(i) resids[larynx$stage==i])
boxplot(resids.bystage, xlab='Stage', ylab='Martingale residual')

[Box plot: martingale residual (roughly −1.5 to 1.0) by stage (1-4); see caption.]

Figure B.3: Box plot of martingale residuals for larynx data, stratified by stage.

(c) The plot is shown in Figure B.4. We see that there seems to be no effect of the age variable
until age 70, after which it seems to increase linearly.

[Scatter plot: martingale residual against age (40-80 years), with a lowess smooth; see caption.]

Figure B.4: Plot of martingale residuals against age for larynx data.

########## Residual plot to test age
# (lar.cph2 is assumed to be a previously fitted Cox model, not shown in this listing,
#  whose martingale residuals are examined against age; age refers to larynx$age.)
aord=order(age)
resids=lar.cph2$residuals[aord]
plot(age[aord], resids, xlab='Age (Yrs)', ylab='martingale residual')
lines(lowess(resids~age[aord]), col=2)

########## New model with age starting from 70
newage=pmax(age-70, 0)   # keep newage in the original data order to match time and delta
lar.cph=coxph(Surv(time,delta)~factor(stage)+newage, data=larynx)

(d) The scaled Schoenfeld residuals are plotted with plot(cox.zph(lar.cph)).


Statistical Lifetime Models: HT 2019 XLIII

[Plot: smoothed estimate of Beta(t) for age against time, ranging roughly from −0.2 to 0.2; see caption.]

Figure B.5: Plot of scaled Schoenfeld residuals to test whether age parameter is constant.

The output of the function is


> cox.zph(lar.cph)
                    rho  chisq      p
factor(stage)2  -0.0158 0.0133 0.9083
factor(stage)3  -0.2599 3.2313 0.0722
factor(stage)4  -0.1105 0.5431 0.4611
age              0.1138 0.8647 0.3524
GLOBAL               NA 4.6755 0.3222

This estimates the slope of the change in the various Cox-model parameters over time. None
of the slopes is significantly different from 0.
(e) There seems to be a marked curvature of the residual plot, suggesting that the model is
underestimating the cumulative hazard later on.
lar.cph=coxph(Surv(time,delta)~factor(stage), data=larynx)
lar.fit=survfit(lar.cph)

whichtime=sapply(larynx$time, function(t) which(lar.fit$time==t))

cumhaz=-log(lar.fit$surv[whichtime])

beta=lar.cph$coefficients
# stage2, stage3, stage4 are the binary stage indicators defined in part (a)
relrisk=exp(matrix(beta,1,3) %*% rbind(stage2-mean(stage2), stage3-mean(stage3), stage4-mean(stage4)))
coxsnell=c(relrisk*cumhaz)

CS.surv=Surv(coxsnell, larynx$delta)   # Cox-Snell residuals paired with the event indicators, original order
CS.fit=survfit(CS.surv~1)

plot(CS.fit$time, -log(CS.fit$surv), xlab='Time',
     ylab='Fitted cumulative hazard for Cox-Snell residuals')
abline(0,1,col=2)

[Plot: fitted cumulative hazard for the Cox-Snell residuals (0-3.5) against time (0-2), with a reference line of slope 1; see caption.]

Figure B.6: Cox–Snell residual plot for larynx data.

4. (a) The estimator $\hat{\mathbf{B}}$ is defined in (5.14) as
$$\hat{\mathbf{B}}(t) = \sum_{t_j\le t} X^-(t_j)\, d\mathbf{N}(t_j).$$
In a small interval of time $[t, t + dt)$ in which the cumulative hazard covariates $\mathbf{B}$ are
incremented by $d\mathbf{B}(t) = \beta(t)\,dt$, the expected number of events is incremented by
$$E\big[d\mathbf{N}(t)_i\big] = \big(X(t)\,d\mathbf{B}(t)\big)_i,$$
independent of any past events. If we now condition on an event happening at time $t = t_j$ we
can say that
$$E\big[d\mathbf{N}(t)_i\big] = \frac{\big(X(t_j)\beta\big)_i}{\sum_k\big(X(t_j)\beta\big)_k}.$$
Thus, conditioned on any past information, the expected increment to the martingale residual
at $t = t_j$ is
$$E\big[\mathbf{M}_{\mathrm{res}}(t) - \mathbf{M}_{\mathrm{res}}(t-)\big] = E\big[d\mathbf{N}(t)\big] - X(t_j)X^-(t_j)\,E\big[d\mathbf{N}(t)\big]
= \frac{1}{\sum_k\big(X(t_j)\beta\big)_k}\Big(I - X(t)\big(X(t)^TX(t)\big)^{-1}X(t)^T\Big)X(t)\beta
= \frac{1}{\sum_k\big(X(t_j)\beta\big)_k}\Big(X(t) - X(t)\big(X(t)^TX(t)\big)^{-1}X(t)^TX(t)\Big)\beta
= 0.$$
Of course, the expected increment is 0 conditioned on no event at time t. Thus, the expectation
of $\mathbf{M}_{\mathrm{res}}(t)$ is constant, and since it starts at 0, it is identically 0.

(b) We write Y(s) for the n × n diagonal matrix with the at-risk indicators on the diagonal. We
note that for right-censored data Y(s0 )Y(s) = Y(s) for s0 ≤ s, and Y(s) dN(s) = dN(s)
because Yi (s) = 0 implies that dNi (s) = 0. Thus

Y(s) dMres (s) = Y(s) dN(s) − Y(s)X(s) dB(s) = dMres (s).

Also, X(t) = Y(t)X, where we write X for X(0). Since Mres is constant except at times
s = tj , we consider the increments at s = tj , obtaining

$$X^T\,d\mathbf{M}_{\mathrm{res}}(t_j) = \Big(X^T(t_j) - X^T(t_j)\,Y(t_j)X\,X^-(t_j)\Big)\,d\mathbf{N}(t_j)
= \Big(X^T - X^TY(t_j)X\big(X^TY(t_j)X\big)^{-1}X^TY(t_j)\Big)\,d\mathbf{N}(t_j)
= \Big(X^TY(t_j) - X^TY(t_j)X\big(X^TY(t_j)X\big)^{-1}X^TY(t_j)\Big)\,d\mathbf{N}(t_j)
= 0.$$

Since this is true for all s, and since Mres (0) = 0, it must be true that XT Mres (t) = 0 for all
t ≤ τ.
(c) The equation XT Mres (τ ) = 0 means that the n-dimensional vector Mres (τ ) is orthogonal to
each of the p + 1 distinct n-dimensional vectors of the coefficients. There is no linear trend
with respect to the covariates. (In other words, in the linear regression model predicting
Mres (τ ) as a function of the covariates, the coefficients are all 0.)
If the additive hazards model is true, there should be no nonlinear effect of the covariates
on the martingale residuals. So one possible model test is to plot the martingale residuals
against nonlinear functions of the residuals — for instance, the square of a covariate, or a
product of two covariates — and look for trends. This is described briefly in section 4.2.4 of
[2], and more extensively in [1].
5. (a) The correct log partial likelihood would have been
$$\ell_P(\beta) = \sum_{j=1}^{k}\beta^Tx_{i_j} - \sum_{j=1}^{k}\log\Big(\sum_{i\in R_j}e^{\beta^Tx_i}\Big).$$
The doubled data set will produce two identical events at time $t_j$, and the risk sets $R'_j$ will
have each individual covariate from $R_j$ repeated. This produces a log partial likelihood
$$\ell'_P(\beta) = 2\sum_{j=1}^{k}\beta^Tx_{i_j} - 2\sum_{j=1}^{k}\log\Big(\sum_{i\in R'_j}e^{\beta^Tx_i}\Big)
= 2\sum_{j=1}^{k}\beta^Tx_{i_j} - 2\sum_{j=1}^{k}\log\Big(2\sum_{i\in R_j}e^{\beta^Tx_i}\Big)
= 2\ell_P(\beta) - 2k\log 2.$$
So $\ell'_P$ will be maximised at the same $\beta$ that maximises $\ell_P$.
Since the log likelihood is (up to an additive constant) doubled for the data set with repetition,
so is the partial information. The estimated variance is therefore halved, and the SE divided
by $\sqrt{2}$.
(b) We know that if the model were correctly specified, we would have $V_n(\hat\beta) \approx J_n(\hat\beta)$ for large n.
(This is Fisher's identity, that the variance of the score function is equal to the expected second
derivative, that is, the Fisher information.) As discussed above, the log partial likelihood for
the duplicated data is twice the log partial likelihood for the correctly specified Cox model on
the true data. The variance estimate is made up of squared score functions, so the estimate
derived from the duplicated data will be $V'_n(\hat\beta) = 4V_n(\hat\beta)$, while $J'_n(\hat\beta) = 2J_n(\hat\beta)$. Thus we
obtain the sandwich estimator from the duplicated data
$$(J'_n)^{-1}V'_n(J'_n)^{-1} = (2J_n)^{-1}(4V_n)(2J_n)^{-1} = J_n^{-1}V_nJ_n^{-1} \approx J_n^{-1},$$
which is the correct variance estimate.


(c)
> require(survival)
> require(KMsurv)
> data(bmt)
> bmt$id <- 1:dim(bmt)[1]
> bmt.double <- rbind(bmt, bmt)
> bmcox=coxph(Surv(t2,d3)~z1+z2+z1*z2, data=bmt)
> bmcox.double <- coxph(Surv(t2,d3)~z1+z2+z1*z2, data=bmt.double)
> bmcox.cluster <- coxph(Surv(t2,d3)~z1+z2+z1*z2+cluster(id), data=bmt.double)
> bmcox
Call:
coxph(formula = Surv(t2, d3) ~ z1 + z2 + z1 * z2, data = bmt)

            coef  exp(coef)  se(coef)      z       p
z1    -0.1075108  0.8980669 0.0348843 -3.082 0.00206
z2    -0.0828731  0.9204680 0.0302421 -2.740 0.00614
z1:z2  0.0034405  1.0034464 0.0009076  3.791 0.00015

Likelihood ratio test=13.29 on 3 df, p=0.004048
n= 137, number of events= 83
> bmcox.double
Call:
coxph(formula = Surv(t2, d3) ~ z1 + z2 + z1 * z2, data = bmt.double)

            coef  exp(coef)  se(coef)      z        p
z1    -0.1080799  0.8975559 0.0246902 -4.377 1.20e-05
z2    -0.0833621  0.9200180 0.0214009 -3.895 9.81e-05
z1:z2  0.0034588  1.0034648 0.0006425  5.384 7.30e-08

Likelihood ratio test=26.78 on 3 df, p=6.535e-06
n= 274, number of events= 166
> bmcox.cluster
Call:
coxph(formula = Surv(t2, d3) ~ z1 + z2 + z1 * z2 + cluster(id),
    data = bmt.double)

            coef  exp(coef)  se(coef) robust se      z        p
z1    -0.1080799  0.8975559 0.0246902 0.0319525 -3.383 0.000718
z2    -0.0833621  0.9200180 0.0214009 0.0274344 -3.039 0.002377
z1:z2  0.0034588  1.0034648 0.0006425 0.0007601  4.551 5.35e-06

Likelihood ratio test=26.78 on 3 df, p=6.535e-06
n= 274, number of events= 166

6. (a) By the law of total probability
$$P\{N = n\} = \int_0^\infty \frac{t^n}{n!}e^{-t}\cdot\frac{\lambda^r}{\Gamma(r)}t^{r-1}e^{-\lambda t}\,dt
= \frac{\lambda^r}{n!\,\Gamma(r)}\cdot\frac{\Gamma(n + r)}{(\lambda + 1)^{n+r}}
= \frac{\Gamma(n + r)}{\Gamma(r)\,n!}\Big(\frac{1}{\lambda + 1}\Big)^n\Big(\frac{\lambda}{\lambda + 1}\Big)^r
= \binom{n + r - 1}{n}p^n(1 - p)^r,$$
where $p = 1/(\lambda + 1)$. This is the negative binomial distribution with parameters p and r.
(b) (Note: Apologies, the notation of the statement was somewhat confusing, because $\lambda$ is the true
parameter of the NB distribution, but also the nominal Poisson parameter for the erroneous
model that we are fitting in this section. I will call this Poisson parameter $\lambda_P$.) The MLE
for the Poisson distribution is $\hat\lambda_P = n^{-1}\sum N_i$. This will converge to the expected value
of the negative binomial, which is $r/\lambda$. We would expect this estimator to have variance
given by the variance of the Poisson distribution divided by n; since for a Poisson distribution the variance
equals the expected value, this is $r/(n\lambda) \approx \hat\lambda_P/n$. But the variance of the negative binomial is actually $r(\lambda^{-1} + \lambda^{-2})/n$.
(c) A simple test would be to compare the sample mean to the sample variance. Under the
assumption of a Poisson distribution the sample variance
$$\hat\sigma^2 := \frac{1}{n-1}\sum(N_i - \hat\lambda_P)^2$$
has expected value
$$E[\hat\sigma^2] = \frac{n}{n-1}E\big[N_i^2 - 2\hat\lambda_PN_i + \hat\lambda_P^2\big]
= \frac{n}{n-1}\Big(\lambda_P^2 + \lambda_P - 2\Big(\lambda_P^2 + \frac{\lambda_P}{n}\Big) + \frac{\lambda_P}{n} + \lambda_P^2\Big)
= \lambda_P.$$
The variance will be approximately (see below)
$$\mathrm{Var}(\hat\sigma^2) \approx \frac{\lambda_P + 4\lambda_P^2}{n} + O(n^{-2}).$$
(To compute this, it makes sense to write $N_i = \tilde N + \lambda_P$ and $\hat\lambda_P = \tilde\lambda + \lambda_P$. Then higher powers
of $\tilde\lambda$ have expectations $O(n^{-1})$, and products with $\tilde N$ are still of this sort.) Then we can take
$$Z := \frac{\hat\sigma^2 - \hat\lambda_P}{\sqrt{(\hat\lambda_P + 4\hat\lambda_P^2)/n}}$$
as an approximately standard normal test statistic.


Alternatively, we could perform a likelihood ratio test.
To calculate the variance of $\hat\sigma^2$ we rewrite it as
$$\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}\Big(\frac{1}{n}\sum_{j\ne i}(N_i - N_j)\Big)^2
= \frac{1}{(n-1)n^2}\sum_{i=1}^{n}\Big(\sum_{j\ne i}(N_i - N_j)^2 + \sum_{\substack{j\ne j'\\ j,j'\ne i}}(N_i - N_j)(N_i - N_{j'})\Big).$$

The variance will be the sums of all the variances of the individual terms, plus the sums of all
the covariances of two different terms. We could calculate this exactly, but there would be a lot
of different terms to keep track of. We note, though, that for large n we can approximate this
by calculating only the first-order term in n. Each individual term contributes on the order
of n−6 to the sum. In total the number of terms is on the order of n3 , so their contribution is
on the order of n−3 in total. Covariances between terms involving entirely distinct Ni , Nj
will be 0 (by independence), so we need consider only covariances with duplications. The first
sum contributes only O(n3 ) of these, and pairs between the first and second sum contribute
only O(n4 ). The only contribution on the order of n5 is that made by pairs from the second
sum. These could have the same i or the same j or j'.
In total, there are n(n−1)(n−2) terms. If we are considering pairs of the form $(N_i - N_j)(N_i - N_{j'})$ and
$(N_i - N_{j''})(N_i - N_{j'''})$ with j's all distinct there will be $n(n-1)(n-2)(n-3)(n-4)$
pairs (since we are taking account of order), and each one contributes
$$\mathrm{Cov}\big((N_i - N_j)(N_i - N_{j'}),\ (N_i - N_{j''})(N_i - N_{j'''})\big) = (3\lambda_P^2 + \lambda_P) - \lambda_P^2 = 2\lambda_P^2 + \lambda_P$$
(since the central fourth moment of a Poisson random variable with parameter $\lambda$ is $3\lambda^2 + \lambda$).
Considering pairs of the form $(N_i - N_j)(N_i - N_{j'})$ and $(N_{i'} - N_j)(N_{i'} - N_{j''})$ we see that
there are also $2n(n-1)(n-2)(n-3)(n-4)$ pairs (since, having chosen i, j, j' for the first
term we have two possibilities of which one to repeat for the second), and each contributes
$$\mathrm{Cov}\big((N_i - N_j)(N_i - N_{j'}),\ (N_{i'} - N_j)(N_{i'} - N_{j''})\big) = \lambda_P^2.$$
Putting these together, we get the indicated approximation to the variance.
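For the optional part (d), a simulation sketch in R (the values of n, r and lambda are chosen arbitrarily for illustration):

set.seed(1)
n <- 1000; r <- 2; lambda <- 0.5
Ni <- rpois(n, rgamma(n, shape = r, rate = lambda))   # gamma-mixed Poisson = negative binomial counts
lam.hat <- mean(Ni)
s2 <- var(Ni)
Z <- (s2 - lam.hat) / sqrt((lam.hat + 4*lam.hat^2) / n)
Z   # large positive values are evidence against the Poisson model in favour of the negative binomial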
