Statistical Lifetime Models
David Steinsaltz, Department of Statistics, University of Oxford
[Cover figure: a diagram of P'(1,t), P'(2,t), P'(3,t) on age–time axes.]
HT 2019
SB3.2
Statistical Lifetime-Models
David Steinsaltz – 16 lectures HT 2019
[email protected]
Prerequisites
Part A Probability and Part A Statistics are prerequisites.
Website: http://www.steinsaltz.me.uk/SB3b/SB3b.html
Aims
Event times and event counts appear in many social and medical data contexts, and require
a specialised suite of techniques to handle properly, broadly known as survival analysis. This
course covers the basic definitions and techniques for creating life tables, estimating event-time
distributions, comparing and testing distributions of different populations, and evaluating the
goodness of fit of various models. A focus is on understanding when and why particular models
ought to be chosen, and on using the standard software tools in R to carry out data analysis.
Key Themes
• Life tables
Learning Objectives
At the end of this course students will
• Understand the standard notation of life tables, and be able to make inferences from life
tables;
• Derive estimators and confidence intervals for standard parametric survival models;
• Fit survival curves with standard nonparametric techniques using R, and interpret the
results;
• Select appropriate tests for equality of survival distributions, and carry out the tests in R;
• Evaluate the goodness of fit of survival models using graphical and residual techniques.
Computing
The lectures include material about performing basic survival analysis in R. This is an important
element of the course, since almost any application you will make of the methods you learn here
will be done on the computer. If you have not done any R programming yet, you are encouraged
to review the introductory material provided for the Part A Statistics lectures, and try the
computing exercises from that course. There will be computing exercises in each problem sheet.
These will not be checked, but as with all of the problems on the sheets, you can ask about them
in the classes.
Synopsis
1. Introduction to survival data: hazard rates, survival curves, life tables.
3. Multiple-decrements model.*
7. Semiparametric models:
8. Model-fit diagnostics.
Note: Three topics (marked with stars) are listed in the lecture notes as “optional”. Some of
these may not be covered, depending on the available time.
Reading
The main reading for this course will be a set of lecture notes, which will be available on the course
website. Unless specifically stated, all material in the lecture notes is examinable.
There are lots of good books on survival analysis. Look for one that suits you. Some pointers
will be given in the lecture notes to readings that are connected, but look in the index to find
more explanation of topics that confuse you and/or interest you.
Secondary Readings
• J.P. Klein and M.L. Moeschberger, Survival Analysis, Springer, 2nd ed., 2003: Chapters
3, 4, 7.
• Odd O. Aalen et al., Survival and Event History Analysis, Springer, 2008.
• D. F. Moore, Applied Survival Analysis Using R, Springer, 2016.
3. Semiparametric models
Relative risk (proportional hazards) including the Cox model, additive hazards model,
accelerated failure models. Partial likelihood. Efron’s estimator for survival distributions.
Primary Readings:
• J.P. Klein and M.L. Moeschberger, Survival Analysis, Springer, 2nd ed., 2003: Chapters
8 and 10.
Secondary Readings:
• J.P. Klein and M.L. Moeschberger, Survival Analysis, Springer, 2nd ed., 2003: Chapter
11.
5. Repeated events
Anderson–Gill model, Poisson regression, negative binomial model.
Contents
Glossary xii
4 Survival analysis 53
4.1 Censoring and truncation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Likelihood and Right Censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Random censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.2 Non-informative censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Time on test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Non-parametric survival estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.1 Review of basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.2 Kaplan–Meier estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.3 Nelson–Aalen estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.4 Invented data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.5 Example: The AML study . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Left truncation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Variance estimation: Greenwood’s formula . . . . . . . . . . . . . . . . . . . . . . 62
4.6.1 The cumulative hazard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6.2 The survival function and Greenwood’s formula . . . . . . . . . . . . . . . 63
4.6.3 Reminder of the δ method . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.7 AML study, continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.8 Survival to ∞ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.8.1 Example: Time to next birth . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.9 Computing survival estimators in R (non-examinable) . . . . . . . . . . . . . . . . 70
4.9.1 Survival objects with only right-censoring . . . . . . . . . . . . . . . . . . 70
4.9.2 Other survival objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Regression models 74
5.1 Introduction to regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 How survival functions vary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1 Graphical tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Generalised linear survival models . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4 Simulation example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 The relative-risk regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.6 Partial likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.7 Significance testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.8 Estimating baseline hazard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.8.1 Breslow’s estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.8.2 Individual risk ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.9 Dealing with ties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.10 The AML example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.11 The Cox model in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.12 The additive hazards model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.12.1 Describing the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.12.2 Fitting the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.12.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A Problem sheets I
A.1 Revision, lifetime distributions, Lexis diagrams and the census approximation . . II
A.2 Life expectancy, graduation, and survival analysis . . . . . . . . . . . . . . . . . . V
A.3 Survival regression models and two-sample testing . . . . . . . . . . . . . . . . . IX
A.4 Model diagnostics, repeated events . . . . . . . . . . . . . . . . . . . . . . . . . . XII
B Solutions XIV
B.1 Revision, lifetime distributions, Lexis diagrams and the census approximation . . XV
B.2 Life expectancy, graduation, and survival analysis . . . . . . . . . . . . . . . . . . XX
B.3 Survival regression models and two-sample testing . . . . . . . . . . . . . . . . . XXVII
B.4 Model diagnostics, repeated events . . . . . . . . . . . . . . . . . . . . . . . . . . XXXIX
Glossary
Central Exposed To Risk Total time that individuals are at risk. Under some circumstances,
this is approximately the number of individuals at risk at the midpoint of the estimation period. 21
cohort A group of individuals of equivalent age (in whatever sense relevant to the study),
observed over a period of time. 8, 16
cohort life table Life table showing mortality of individuals born in the same year (or approx-
imately same year). 32
force of mortality Same as mortality rate, but also used in a discrete context. 8
hazard rate Density divided by survival. Thus, the instantaneous rate of the event
occurring at time t, conditioned on survival to time t. 8
Initial Exposed To Risk Number of individuals at risk at the start of the estimation period.
17, 21
Maximum Likelihood Estimator Estimator for a parameter, chosen to maximise the likeli-
hood function. 17
period life table Life table showing mortality of individuals of a given age living in the same
year (or approximately same year). 32
Radix The initial number of individuals in the nominal cohort described by a life table. 16
Chapter 1

Introduction: Survival Models
have died after their 56th birthday in the particular year under observation, which happens to
be 5031. The probability of dying in one day may then be estimated as
$$\frac{1}{365} \times \frac{174}{5031} \approx \frac{1}{10000},$$
and Buffon proceeds to reason with this estimate.
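This back-of-the-envelope calculation is easily checked in R:

    (1 / 365) * (174 / 5031)    # about 9.5e-05, i.e. roughly 1 in 10000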
From this elementary exercise we see that:
• Mortality probabilities can be estimated as the ratio of the number of deaths to the number
of individuals “at risk”.
• Mortality can serve as a model for thinking about risks (and opportunities) more generally,
for events happening at random times.
• You don’t get very far in thinking about mortality and other risks without some sort of
theoretical model.
The last claim may require a bit more elucidation. What would a naïve, empirical approach
to life tables look like? Given a census of the population by age, and a list of the ages at death
in the following year, we could compute the proportion of individuals aged x who died in the
following year. This is merely a free-floating fact, which could be compared with other facts, such
as the measured proportion of individuals aged x who died in a different year (or at a different
age, or a different place, etc.) If you want to talk about a probability of dying in that year
(for which the proportion would serve as an estimate), this is a theoretical construct, which can
be modelled (as we will see) in different ways. Once you have a probability model, this allows
you to pose (and perhaps answer) questions about the probability of dying in a given day, make
predictions about past and future trends, and isolate the effect of certain medications or life-style
changes on mortality.
There are many different kinds of problems for which the same survival analysis statistics
may be applied. Some examples which we will consider at various points in this course are:
• Time until a person diagnosed with (and perhaps treated for) a disease has a recurrence.
Often, though, we will use the term “lifetime” to represent any waiting time, along with its
attendant vocabulary: survival probability, mortality rate, cause of death, etc.
A. sarcophagus 2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28
T. rex 2,6,8,9,11,14,15,16,17,18,18,18,18,18,19,21,21,21,22,22,22,22,22,22,23,23,24,24,28
G. libratus 2,5,5,5,7,9,10,10,10,11,12,12,12,13,13,14,14,14,14,14,15,16,16,17,17,17,18,18,18,19,19,19,20,20,21,21,21,21,22
Daspletosaurus 3,9,10,17,18,21,21,22,23,24,26,26,26
Table 1.1: 103 estimated ages of death (in years) for four different tyrannosaur species.
[Figure 1.1: Histograms (frequency against estimated age at death, 0–30 years) of the tyrannosaur data in Table 1.1, plotted with two different bin widths.]
Notice the nomenclature: $\max_{\lambda \in \Lambda} f(\lambda)$ picks the maximal value in the range of $f$, while $\arg\max_{\lambda \in \Lambda} f(\lambda)$ picks the $\lambda$-value in the domain of $f$ for which this maximum is attained.
The most basic model for lifetimes is the exponential. This is the “memoryless” waiting-
time distribution, meaning that the remaining waiting time always has the same distribution,
conditioned on the event not having occurred up to any time t. This distribution has a single
parameter (k = 1) $\mu$, and density
$$f(\mu; T) = \mu e^{-\mu T}.$$
The parameter $\mu$ is chosen from the domain $\Lambda = (0, \infty)$. If we observe independent lifetimes $T_1, \ldots, T_n$ from the exponential distribution with parameter $\mu$, and let $\bar T := n^{-1}\sum_{i=1}^n T_i$ be the average, the log likelihood is
$$\ell_T(\mu) = \sum_{i=1}^n \log\left(\mu e^{-\mu T_i}\right) = n \log\mu - n\bar T \mu.$$
are the entries of the Fisher Information matrix. Of course, we generally don’t know what λ is —
otherwise, we probably would not be bothering to estimate it! — so we may approximate the
Information matrix by computing Ij1 j2 (λ̂) instead. Furthermore, we may not be able to compute
the expectation in any straightforward way; in that case, we use the principle of Monte Carlo
estimation: We approximate the expectation of a random variable by the average of a sample of
observations. We already have the sample T1 , . . . , Tn from the correct distribution, so we define
the observed information matrix
$$J_{j_1 j_2}(\lambda, T_1, \ldots, T_n) = -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log f(T_i; \lambda)}{\partial \lambda_{j_1} \partial \lambda_{j_2}}.$$
Again, we may substitute Jj1 j2 (λ̂, T1 , . . . , Tn ), since the true value of λ is unknown. Thus, in the
case of a one-dimensional parameter (where the covariance matrix is just the variance and the
matrix inverse (I(λ̂))−1 is just the multiplicative inverse in R), we obtain
$$\left( \hat\lambda - 1.96\sqrt{\frac{1}{I(\hat\lambda)}} \,,\; \hat\lambda + 1.96\sqrt{\frac{1}{I(\hat\lambda)}} \right).$$
For the tyrannosaur data of Table 1.1 this gives
$$\bar T = 16.03, \qquad \hat\mu = 0.062, \qquad \mathrm{SE}_{\hat\mu} = 0.0061,$$
and a 95% confidence interval for $\mu$ of (0.050, 0.074).
Aside: In the special case of exponential lifetimes, we can construct exact confidence intervals,
since we know the distribution of $n/\hat\mu \sim \Gamma(n, \mu)$, so that $2n\mu/\hat\mu \sim \chi^2_{2n}$ allows us to use $\chi^2$-tables.
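The following sketch in R reproduces these numbers from the raw ages of Table 1.1; the vector ages is simply the 103 values copied from that table, and the standard error uses the Fisher information I(µ) = n/µ² for the exponential model.

    ages <- c(2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28,    # A. sarcophagus
              2,6,8,9,11,14,15,16,17,18,18,18,18,18,19,21,21,21,
              22,22,22,22,22,22,23,23,24,24,28,                                # T. rex
              2,5,5,5,7,9,10,10,10,11,12,12,12,13,13,14,14,14,14,14,
              15,16,16,17,17,17,18,18,18,19,19,19,20,20,21,21,21,21,22,        # G. libratus
              3,9,10,17,18,21,21,22,23,24,26,26,26)                            # Daspletosaurus
    n      <- length(ages)                 # 103
    Tbar   <- mean(ages)                   # about 16.03
    mu.hat <- 1 / Tbar                     # MLE, about 0.062
    se.hat <- mu.hat / sqrt(n)             # from I(mu) = n/mu^2, about 0.0061
    mu.hat + c(-1, 1) * 1.96 * se.hat      # approximate 95% CI, about (0.050, 0.074)
    # Exact interval, using 2*n*mu/mu.hat ~ chi-squared with 2n degrees of freedom:
    qchisq(c(0.025, 0.975), df = 2 * n) * mu.hat / (2 * n)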
Is the fit any good? We have various standard methods of testing goodness of fit — we
discuss an example in section 1.2.3 — but it’s pretty easy to see by eye that the histograms in
Figure 1.1 aren’t going to fit an exponential distribution, which is a declining density, very well.
In Figure 1.2 we show the empirical (observed) cumulative distribution of tyrannosaur deaths,
together with the cdf of the best exponential fit, which is obviously not a very good fit at all.
We also show (in green) the fit to a class of distribution which is an example of a larger
class that we will meet later, called the Weibull distributions. Instead of the exponential cdf
$F(t) = 1 - e^{-\mu t}$, suppose we take $F(t) = 1 - e^{-\alpha t^2}$. Note that if we define $Y_i = T_i^2$, we have
$$P(Y_i \le y) = P(T_i \le \sqrt{y}\,) = 1 - e^{-\alpha y},$$
Figure 1.2: Empirical cumulative distribution of tyrannosaur deaths (circles), together with cdf
of exponential fit (red) and Weibull fit (green).
The chi-squared goodness-of-fit statistic is
$$X^2 = \sum_{j=1}^m \frac{(O_j - E_j)^2}{E_j}, \qquad \text{approximately distributed as } \chi^2_{m-k-1},$$
where m is the number of bins (e.g. from your histogram), but merged to satisfy size restrictions,
and k the number of parameters estimated. Oj is the random variable modelling the number
observed in bin j, Ej the number expected under maximum likelihood parameters. To justify
the approximate distribution for the test statistic, we require that at most 20% of bins have
Ej ≤ 5, none Ej ≤ 1 (‘size restriction’).
We obtain then $X^2 = 17.9$ for the Weibull model, and $X^2 = 92.2$ for the exponential
distribution. The latter produces a p-value on the order of $10^{-18}$, but the former has a p-
value around 0.0013. Thus, while the data could not possibly have come from an exponential
distribution, or anything like it, the Weibull distribution, while unlikely to have produced exactly
these data, is a plausible candidate.
Age     Observed   Expected (Exponential)   Expected (Weibull)
0–4 8 22.7 5.4
5–9 13 21.5 19.3
10–14 22 15.7 25.3
15–19 39 11.5 22.7
20–24 25 8.4 15.7
25+ 15 23.1 14.6
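The p-values quoted above can be recovered with R's chi-squared distribution function, assuming the usual χ² reference distribution with m − k − 1 degrees of freedom; here m = 6 bins and k = 1 fitted parameter in each model give 4 degrees of freedom, which is consistent with the figures in the text.

    pchisq(92.2, df = 4, lower.tail = FALSE)   # exponential fit: vanishingly small, order 1e-18
    pchisq(17.9, df = 4, lower.tail = FALSE)   # Weibull fit: about 0.0013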
• Large samples We have looked at some very simple single-parameter models. Other
models, with more complicated time-varying mortality rates, may be closer to the truth.
But no matter how elaborate the multivariate parametric models that we propose, they
are unlikely to be precisely true. A parametric family will eventually be rejected once the
sample size is large enough — and since we may be concerned with statistical surveys of,
for example, the entire population of the UK, the sample sizes will be very large indeed.
Nonparametric or semiparametric methods will be better able to let the data speak for
themselves.
• Small samples While nonparametric models allow the data to speak for themselves,
sometimes we would prefer that they be somewhat muffled. When the number of observed
deaths is small — which can be the case, even in a very large data set, when considering
advanced ages, above 90, and certainly above 100, because of the small number of individuals
who survive to be at risk, but also in children, because of the very low mortality rate — the
estimates are less reliable, being subject to substantial random noise. Also, the mortality
pattern changes over time, and we are often interested in future mortality, but only have
historical data. A non-parametric estimate that precisely reflects the data at hand may
reflect less well the underlying processes, and be ill-suited to projection into the future.
Graduation (smoothing) and extrapolation methods have been developed to address these
issues.
• Incomplete observations Some observations will be incomplete. We may not know the
exact time of a death, but only that it occurred before a given time, or after a given time,
or between two known times, a phenomenon called “censoring”. (When we are informed
only of the year of a death, but not the day or time, this is a kind of censoring.) Or we may
have observed only a sample of the population, with the sample being not entirely random,
but chosen according to being alive at a certain date, or having died before a certain date,
a phenomenon known as “truncation”. We need special techniques to make use of these
partial observations. Since we are observing times, subjects who break off a study midway
through provide partial information in a clearly structured way.
• Successive events A key fact about time is its sequence. A patient is infected, develops
symptoms, has a diagnosis, a treatment, is cured or relapses, at some point dies. Some
or all of these events may be considered as a progression, and we may want to model the
sequence of random times. Some care is needed to carry out joint maximum likelihood
estimation of all transition rates in the model, from one or several individuals observed.
This can be combined with time-varying transition rates.
• Comparing lifetime distributions We may wish to compare the lifetime distributions
of different groups (e.g., smokers and nonsmokers; those receiving a traditional cholesterol
medication and those receiving the new drug) or the effect of a continuous parameter (e.g.,
weight) on the lifetime distribution.
• Changing rates Mortality rates are not static in time, creating a disjunction between period
measures — looking at a cross-section of the population by age as it exists at a given
time — and cohort measures — looking at a group of individuals born at a given time,
and following them through life.
cdf $F(t) = P(L \le t)$;
The hazard rate is also called mortality rate in survival contexts. The traditional name in
demography is force of mortality. This may be thought of as the instantaneous rate of dying per
unit time, conditioned on having already survived. The exponential distribution with parameter
$\lambda \in (0, \infty)$ is given by the constant hazard rate $h_T(t) = \lambda$, that is, $\bar F_T(t) = e^{-\lambda t}$ for $t \ge 0$.
Thus, the exponential is the distribution with constant force of mortality, which is a formal
statement of the "memoryless" property.
The density fT (t) is the (unconditional) infinitesimal probability to die at age t. The hazard
rate hT (t) is the (conditional) infinitesimal probability to die at age t of an individual known to
be alive at age t. It may seem that the hazard rate is a more complicated quantity than the
density, but it is very well suited to modelling mortality. Whereas the density has to integrate to
one and the distribution function (survival function) has boundary values 0 and 1, the force of
mortality has no constraints, other than being nonnegative — though if “death” is certain the
force of mortality has to integrate to infinity. Also, we can read its definition as a differential
equation and solve
$$\bar F_T'(t) = -\mu_t \bar F_T(t), \quad \bar F(0) = 1 \quad \Longrightarrow \quad \bar F_T(t) = \exp\left\{ -\int_0^t \mu_s \, ds \right\}, \quad t \ge 0. \tag{1.7}$$
Note that this implies that $h_{T_x}(t) = h_T(x + t)$, so it is really associated with age x + t only, not with initial age x nor with time t after initial age. Also note that, given a measurable function $\mu : [0, \infty) \to \mathbb{R}$, $\bar F_{T_x}(0) = 1$ always holds, and $\bar F_{T_x}$ is decreasing if and only if $\mu \ge 0$. $\bar F_{T_x}(\infty) = 0$ if and only if $\int_0^\infty \mu_t \, dt = \infty$. This leaves a lot of modelling freedom via the force of mortality.
Densities can now be obtained from the definition of the force of mortality (and consistency) as $f_{T_x}(t) = \mu_{x+t} \bar F_{T_x}(t)$.
for parameters A > 0, B > 0, θ > 0; m = B/θ. Note that mortality grows exponentially. If θ is
big enough, the effect is very close to introducing a maximal age ω, as the survival probabilities
decrease very quickly. There are other parametrisations for this family of distributions. The
Gompertz distribution is named for British actuary Benjamin Gompertz, who in 1825 first
published his discovery [12] that human mortality rates over the middle part of life seemed
[Figure 1.3: (a) Canadian mortality rates (log scale) and (b) survival functions, plotted against age, for males and females.]
to double at constant age intervals. It is unusual, among empirical discoveries, in having
been confirmed rather than refuted as data have improved and conditions changed, and it (or
Makeham’s modification) serves as a standard model for mortality rates not only in humans, but
in a wide variety of organisms. As an example, see Figure 1.3, which shows Canadian mortality
rates from life tables produced by Statistics Canada (available at http://www.statcan.ca:
80/english/freepub/84-537-XIE/tables.htm). Notice how close to a perfect line the mid-life
mortality rates for both males and females is, when plotted on a logarithmic scale, showing that
the Gompertz model is a very good fit.
Figure 1.3(b) shows the corresponding survival curves. It is worth recognising how much more
informative the mortality rates are. In Figure 1.3(a) we see that male mortality is regularly higher
than female mortality at all ages (and by a fairly constant ratio), and we see several phases of mortality
— early decline, jump in adolescence, then steady increase through midlife, and deceleration in
extreme old age — whereas Figure 1.3(b) shows us only that mortality is accelerating overall,
and that males have accumulated higher mortality by late life.
The Weibull distribution suggests a polynomial rather than exponential growth of mortality
$$\mu_t = \alpha \rho^\alpha t^{\alpha - 1}, \qquad \bar F_{T_x}(t) = \exp\left\{ -\rho^\alpha \left( (x + t)^\alpha - x^\alpha \right) \right\}, \qquad x \ge 0, \ t \ge 0, \tag{1.11}$$
for rate parameter ρ > 0 and exponent α > 0. The Weibull model is commonly used in engineering
contexts to represent the failure-time distribution for machines. The Weibull distribution arises
naturally as the lifespan of a machine with n redundant components, each of which has constant
failure rate, such that the machine fails only when all components have failed. Later in the
course we will discuss how to fit Weibull and Gompertz models to data.
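As a preview, and only as a sketch (this is not the method developed later in the course), the one-parameter Weibull of section 1.2 can be fitted by hand via the transform Y = T², and a general two-parameter Weibull can be fitted by maximum likelihood with, for example, MASS::fitdistr. Here ages is assumed to be the vector of tyrannosaur ages from Table 1.1, as in the earlier sketch.

    # One-parameter Weibull F(t) = 1 - exp(-alpha * t^2):
    # Y = T^2 is exponential with rate alpha, so the MLE is
    alpha.hat <- length(ages) / sum(ages^2)

    # General two-parameter Weibull (shape and scale), fitted by maximum likelihood:
    library(MASS)
    fitdistr(ages, densfun = "weibull")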
Another class of distributions is obtained by replacing the parameter λ in the exponential dis-
tribution by a (discrete or continuous) random variable M . Then the specification of exponential
conditional densities
$$f_{T \mid M = \lambda}(t) = \lambda e^{-\lambda t} \tag{1.12}$$
Various special cases of exponential mixtures and other extensions of the exponential distribution
have been suggested in a life insurance context.
Then the number of individuals observed to have curtate lifespan x has binomial distri-
bution Bin(22, qx ). The MLE for a binomial probability is just the naïve estimate q̂x =
# successes/# trials (where a “success”, in this case, is a death in the age interval under con-
sideration). To compute q̂2 , then, we observe that there were 22 Albertosaurs from our sample
still alive on their second birthdays, of which one unfortunate met its maker in the following year:
q̂2 = 1/22 ≈ 0.046. As for q̂3 , on the other hand, there were 21 Albertosaurs observed alive on
their third birthdays, and all of them arrived safely at their fourth, making q̂3 = 0/21. This
leads us to the peculiar conclusion that our best estimate for the probability of an albertosaur
dying in its third year is 0.046, but that the probability drops to 0 in its fourth year, then
becomes nonzero again in the fifth year, and so on. This violates our intuition that mortality
rates should be fairly smooth as a function of age. This problem becomes even more extreme
when we consider continuous lifetime models. With no constraints, the optimal estimator for the
mortality distribution would put all the mass on just those moments when deaths were observed
in the sample, and no mass elsewhere — in other words, infinite hazard rate at a finite set of
points at which deaths have been observed, and 0 everywhere else.
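These estimates are easy to reproduce in R (a small sketch using the A. sarcophagus ages from Table 1.1; the variable names are ours):

    alberto <- c(2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28)
    x  <- 0:max(alberto)
    dx <- sapply(x, function(a) sum(alberto == a))   # deaths at curtate age a
    lx <- sapply(x, function(a) sum(alberto >= a))   # alive on the a-th birthday
    qx.hat <- dx / lx             # qx.hat[x == 2] is 1/22, qx.hat[x == 3] is 0/21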
As we see from Figure 1.1, the mortality distribution for the tyrannosaurs becomes much
smoother and less erratic when we use larger bins for the histogram. This is no surprise, since
we are then sampling from a larger baseline, leading to less random fluctuation. The simplest
way to impose our intuition of regularity upon the estimators is to increase the time-step and
reduce the number of parameters to estimate. An extreme version of this, of course, is to impose
a parametric model with a small number of parameters. This is part of the standard tradeoff in
statistics: a free, nonparametric model is sensitive to random fluctuations, but constraining the
model imposes preconceived notions onto the data.
Notation: When the hazard rate µx is being assumed constant over each year of life, the
continuous mortality rate has been reduced to a discrete set of parameters. What do we call
these parameters? In actuarial contexts the constant hazard rate for T on the interval [k, k + 1)
is called µT (k + 12 ), or sometimes µk+ 1 when T is clear from the context. This has the advantage
2
of including in the notation the assumption of constancy over integer intervals. It is, however, a
very inflexible notation that causes all kinds of confusion as soon as the problem gets slightly
more general — for instance, when we assume rates constant over intervals of length 5 or 10.
We will instead write µT (k) or µk , and be sure to make clear in context what intervals we are
assuming µ to be constant over.
Table 2.1: Male mortality data for England and Wales, 1990–2. From [11] (available online).
AGE x ℓx qx ex AGE x ℓx qx ex
0 100000 0.0082 73.4 52 92997 0.0058 24.4
1 99180 0.0006 73.0 53 92461 0.0065 23.6
2 99119 0.0004 72.1 54 91865 0.0070 22.7
3 99081 0.0003 71.1 55 91219 0.0080 21.9
4 99052 0.0002 70.1 56 90486 0.0089 21.0
5 99028 0.0002 69.1 57 89682 0.0099 20.2
6 99006 0.0002 68.2 58 88791 0.0112 19.4
7 98986 0.0002 67.2 59 87800 0.0124 18.6
8 98967 0.0002 66.2 60 86714 0.0139 17.9
9 98950 0.0002 65.2 61 85505 0.0156 17.1
10 98932 0.0002 64.2 62 84168 0.0173 16.4
11 98914 0.0002 63.2 63 82710 0.0197 15.6
12 98896 0.0002 62.2 64 81078 0.0221 14.9
13 98877 0.0002 61.2 65 79290 0.0245 14.3
14 98855 0.0003 60.2 66 77347 0.0269 13.6
15 98826 0.0004 59.3 67 75267 0.0302 13.0
16 98786 0.0005 58.3 68 72992 0.0327 12.4
17 98735 0.0008 57.3 69 70605 0.0364 11.8
18 98660 0.0009 56.4 70 68037 0.0389 11.2
19 98574 0.0008 55.4 71 65391 0.0432 10.6
20 98492 0.0008 54.5 72 62567 0.0473 10.1
21 98410 0.0009 53.5 73 59610 0.0525 9.6
22 98325 0.0009 52.5 74 56481 0.0573 9.1
23 98238 0.0009 51.6 75 53246 0.0613 8.6
24 98150 0.0009 50.6 76 49982 0.0679 8.1
25 98066 0.0008 49.7 77 46590 0.0744 7.7
26 97986 0.0009 48.7 78 43125 0.0807 7.2
27 97900 0.0008 47.8 79 39643 0.0882 6.8
28 97819 0.0009 46.8 80 36145 0.0967 6.5
29 97735 0.0009 45.8 81 32651 0.1036 6.1
30 97647 0.0009 44.9 82 29270 0.1127 5.7
31 97559 0.0010 43.9 83 25971 0.1221 5.4
32 97465 0.0010 43.0 84 22800 0.1332 5.1
33 97368 0.0010 42.0 85 19763 0.1425 4.8
34 97272 0.0011 41.0 86 16946 0.1555 4.5
35 97168 0.0012 40.1 87 14310 0.1698 4.2
36 97053 0.0013 39.1 88 11881 0.1799 4.0
37 96930 0.0014 38.2 89 9744 0.1946 3.8
38 96796 0.0015 37.2 90 7848 0.2016 3.6
39 96648 0.0016 36.3 91 6266 0.2153 3.3
40 96489 0.0016 35.4 92 4917 0.2355 3.1
41 96332 0.0019 34.4 93 3759 0.2611 2.9
42 96151 0.0021 33.5 94 2777 0.2787 2.7
43 95954 0.0022 32.5 95 2003 0.2912 2.6
44 95745 0.0023 31.6 96 1420 0.3023 2.5
45 95521 0.0027 30.7 97 991 0.3232 2.3
46 95269 0.0030 29.8 98 670 0.3284 2.2
47 94982 0.0034 28.9 99 450 0.3698 2.1
48 94660 0.0037 28.0 100 284 0.3737 2.0
49 94313 0.0041 27.1 101 178 0.3909 1.9
50 93926 0.0047 26.2 102 108 0.4209 1.8
51 93486 0.0052 25.3 103 63 0.4450 1.7
Table 2.2: Life table for English men, computed from data in Table 2.1
The quantities qx may be thought of as the discrete analogue of the mortality rate — we will
call it the discrete mortality rate or discrete hazard function — since it describes the probability
of dying in the next unit of time, given survival up to age x. In Table 2.2 we show the life table
computed from the raw data of Table 2.1. (It differs slightly from the official table, because the
official table added some slight corrections. The differences are on the order of 1% in qx , and
much smaller in lx .) The life table represents the effect of mortality on a nominal population
starting with size l0 called the Radix, and commonly fixed at 100,000 for large-population life
tables. Imagine 100,000 identical individuals — a cohort — born on 1 January, 1900. In the
column qx we give the estimates for the probability of an individual who is alive on his xth birthday
dying in the next year, before his (x + 1)th birthday. (We discuss these estimates later in the chapter.)
Thus, we estimate that 820 of the 100,000 will die before their first birthday. The surviving
l1 = 99, 180 on 1 January, 1901, face a mortality probability of 0.00062 in their next year, so
that we expect 61 of them to die before their second birthday. Thus l2 = 99119. And so it goes.
The final column of this table, labelled ex , gives remaining life expectancy; we will discuss this
in section 2.13.
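A minimal sketch of this construction in R: the qx values below are just the first few (rounded) entries from Table 2.1/2.2, so the resulting ℓx agree with Table 2.2 only approximately.

    qx <- c(0.0082, 0.0006, 0.0004, 0.0003)       # q_0, ..., q_3
    lx <- 100000 * cumprod(c(1, 1 - qx))          # l_0, l_1, ..., starting from the radix
    dx <- lx[-length(lx)] * qx                    # expected deaths in each year of age
    data.frame(x = 0:3, lx = round(lx[-length(lx)]), qx = qx, dx = round(dx))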
• Discrete methods are comfortable only when the numbers are small, whereas moving down
to the smallest measurable unit turns the measurements into large whole numbers. Once
you start measuring an average human lifespan as 30000 days (more or less), real numbers
become easier to work with, as integrals are easier than sums.
(Compare this to the suggestion once made by the physicist Enrico Fermi, that lecturers might
take their listeners’ investment of time more seriously if they thought of the 50-minute span
of a lecture as a “microcentury”.) The discrete model, it is pointed out by A. S. Macdonald
in [21] (and rewritten in [7, Unit 9]), “is not so easily generalised to settings with more than
one decrement. Even the simplest case of two decrements gives rise to difficult problems,” and
involves the unnecessary complication of estimating an Initial Exposed To Risk. We will generally
treat the continuous model as the fundamental object, and treat the discrete data as coarse
representations of an underlying continuous lifetime. However, looking beyond the actuarial
setting, there are models which really do not have an underlying continuous time parameter. For
instance, in studies of human fertility, time is measured in menstrual cycles, and there simply
are no intermediate chances to have the event occur.
where only max{k1 , . . . , kn } + 1 factors in the infinite product differ from 1, and
$$d_x = d_x(k_1, \ldots, k_n) = \#\{1 \le i \le n : k_i = x\}, \qquad \ell_x = \ell_x(k_1, \ldots, k_n) = \#\{1 \le i \le n : k_i \ge x\}.$$
This product is maximized when its factors are maximal (the xth factor only depending on the parameter $q_x$). An elementary differentiation shows that $q \mapsto (1 - q)^{\ell - d} q^d$ is maximal for $\hat q = d/\ell$, so that
$$\hat q_x^{(0)} = \hat q_x^{(0)}(k_1, \ldots, k_n) = \frac{d_x(k_1, \ldots, k_n)}{\ell_x(k_1, \ldots, k_n)}, \qquad 0 \le x \le \max\{k_1, \ldots, k_n\}.$$
(The superscript (0) denotes the estimate based on the discrete method.) Note that for $x = \max\{k_1, \ldots, k_n\}$ we have $\hat q_x^{(0)} = 1$, so no survival beyond the highest age observed is possible under the maximum likelihood parameters, so that $(\hat q_x^{(0)})_{0 \le x \le \max\{k_1, \ldots, k_n\}}$ specifies a unique distribution. (Varying the unspecified parameters $q_x$, $x > \max\{k_1, \ldots, k_n\}$, has no effect.)
Now assume that the force of mortality $\mu_s$ is constant on $[x, x + 1)$, $x \in \mathbb{N}$, and denote these values by
$$\mu_x = -\log(p_x) \qquad \left( \text{remember } p_x = \exp\left\{ -\int_x^{x+1} \mu_s \, ds \right\} \right). \tag{2.3}$$
where only max{t1 , . . . , tn } + 1 factors in the infinite product differ from 1, and
$$d_x = d_x(t_1, \ldots, t_n) = \#\big\{ 1 \le i \le n : [t_i] = x \big\}, \qquad \tilde\ell_x = \tilde\ell_x(t_1, \ldots, t_n) = \sum_{i=1}^n \int_x^{x+1} 1_{\{t_i > s\}} \, ds.$$
$\tilde\ell_x$ is the total time exposed to risk. In section 2.9 we define the Central Exposed to Risk, which
is a generalisation of this notion.
The quantities $\mu_x$, $x \in \mathbb{N}$, are the parameters, and we can maximise the product by maximising each of the factors. An elementary differentiation shows that $\mu \mapsto \mu^d e^{-\mu\ell}$ has a unique maximum at $\hat\mu = d/\ell$, so that
$$\hat\mu_x = \hat\mu_x(t_1, \ldots, t_n) = \frac{d_x(t_1, \ldots, t_n)}{\tilde\ell_x(t_1, \ldots, t_n)}, \qquad 0 \le x \le \max\{t_1, \ldots, t_n\}.$$
Since maximum likelihood estimators are invariant under reparameterisation (the range of the
likelihood function remains the same, and the unique parameter where the maximum is obtained
can be traced through the reparameterisation), we obtain
$$\hat q_x = \hat q_x(t_1, \ldots, t_n) = 1 - \hat p_x = 1 - \exp\{-\hat\mu_x\} = 1 - \exp\left\{ -\frac{d_x(t_1, \ldots, t_n)}{\tilde\ell_x(t_1, \ldots, t_n)} \right\}. \tag{2.5}$$
For small $d_x/\tilde\ell_x$, this is close to $d_x/\tilde\ell_x$, and therefore also close to $d_x/\ell_x$.
Note that under q̂x , x ∈ N, there is a positive survival probability beyond the highest observed
age, and the maximum likelihood method does not fully specify a lifetime distribution, leaving
free choice beyond the highest observed age.
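Once dx and the total time at risk ℓ̃x have been tabulated, these estimators are immediate to compute. A minimal sketch in R for fully observed (uncensored) lifetimes; the data here are invented for illustration.

    t  <- c(0.7, 1.2, 1.9, 2.3, 2.8, 3.5)              # exact lifetimes
    x  <- 0:(ceiling(max(t)) - 1)                       # years of age [a, a+1)
    dx <- sapply(x, function(a) sum(floor(t) == a))     # deaths at curtate age a
    lx.tilde <- sapply(x, function(a)                   # total time at risk in [a, a+1)
      sum(pmax(0, pmin(t, a + 1) - a)))
    mu.hat <- dx / lx.tilde                             # constant-hazard MLE
    qx.hat <- 1 - exp(-mu.hat)                          # equation (2.5)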
1. $q_k = 1 - e^{-\mu_k}$; $\mu_k = -\log p_k$.
S has the distribution of an exponential random variable conditioned on S < 1, so it has density
$$f_S(s) = \mu_k \, \frac{e^{-\mu_k s}}{1 - e^{-\mu_k}}.$$
This assumption thus implies decreasing density of the lifetime through the interval. We also
have, for 0 ≤ s ≤ 1, and k an integer,
$$_s p_k = P(T > k + s \mid T > k) = \exp\left\{ -\int_k^{k+s} \mu_t \, dt \right\} = \exp\{-s \mu_k\} = (1 - q_k)^s.$$
Note that K and S are not independent, under this assumption. We also may write
$$\bar F_T(k + s) = \exp\left\{ -\sum_{i=0}^{k-1} \mu_i - s\mu_k \right\}, \qquad f_T(k + s) = \bar F_T(k + s) \cdot \mu_k.$$
There is no way to analyse (or even describe) this intra-interval refinement within the framework
of the binomial model.
Nonetheless, the simplicity and tradition of the binomial model have led actuaries to develop
a kind of continuous prosthetic for the binomial model, in the form of a supplemental (and
hidden) model for the unobserved continuous part of the lifetime. These have been discussed in
section 2.8. In the end, these are applied through the terms Initial Exposed To Risk (Ex0 ) and
Central Exposed To Risk (Exc ). These are defined more by their function than as a particular
quantity: the Initial Exposed to Risk plays the role of n in a binomial model, and Central
Exposed to Risk plays the role of total time at risk in an exponential model. They are linked by
the actuarial estimator
$$E_x^0 \approx E_x^c + \tfrac{1}{2} d_x.$$
This may be justified from any of our fractional-lifetime models if the number of deaths is small
relative to the number at risk. Thus, the actuarial estimator for qx is
$$\tilde q_x = \frac{d_x}{E_x^c + \frac{1}{2} d_x}.$$
The denominator, $E_x^c + \frac{1}{2} d_x$, comprises the observed time at risk (also called central exposed to
risk) within the interval (x, x + 1), added to half the number of deaths (assuming deaths are evenly
spread over the interval). This is an estimator for $E_x^0$, the denominator for the binomial model
estimator.
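A small numerical sketch in R of the actuarial estimator alongside the constant-hazard estimate; the counts are invented for illustration.

    dx  <- 37                     # deaths observed at age x last birthday
    Exc <- 1520.5                 # central exposed to risk (person-years) in [x, x+1)
    dx / (Exc + dx / 2)           # actuarial estimate of q_x, using E0_x ~ Ec_x + d_x/2
    1 - exp(-dx / Exc)            # q_x from the constant-hazard MLE; agrees to about 4 decimals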
The direct discrete method treats the curtate lifetimes as the true lifetimes; then Ek0 is the
same as Ekc , so the continuous model gives a strictly smaller answer, unless dk = 0. Why is that?
The difference here is that the continuous model presumes that individuals are dying all through
the year, making $E_k^c$ somewhat smaller than $E_k^0$. In fact, the actuarial estimator $E_k^c \approx E_k^0 - d_k/2$
(essentially presuming that those who died lived on average half a year), and substituting the
Taylor series expansion into (2.6) shows that in the continuous model
$$P(T < k + 1 \mid T \ge k) = \frac{d_k}{E_k^0 - d_k/2} - \frac{d_k^2}{2(E_k^0 - d_k/2)^2} + O\left( \left( \frac{d_k}{E_k^0 - d_k/2} \right)^3 \right) = \frac{d_k}{E_k^0} + O\left( \left( \frac{d_k}{E_k^0 - d_k/2} \right)^3 \right).$$
That is, when the mortality fraction dk /Ek0 is small, the estimates agree up to second order in
dk /Ek0 .
As we already mentioned, one advantage of the continuous model is that it does not
tie the estimates to any fixed timespan.
$$\hat\mu_x = \frac{d_x}{E_x^c},$$
which is the maximum likelihood estimator under the assumption of a constant force of mortality
on [x, x + 1).
In any given data set we are likely to have only partial information about the population
because:
• Date of birth and date of death may be only approximate (for instance, only calendar year
given, or only curtate age at death);
• Individuals may be “observable” for only part of the time (for instance, because of migration);
• Possible uncertainty about cause of being unobserved, or about identity of individuals;
• We may have access to only a portion of the population (for instance, an insurance company
working from the life records of its customers, or field biologists working with the population
of birds that they have managed to capture and mark).
Because of the limitations of the data, we reinterpret dx and Exc as the number of deaths
observed and number of years of life observed, for which we need to create estimates that
obey the Principle of Correspondence:
An individual alive at time t should be included in the exposure at age x at time t if
and only if, were that individual to die immediately, he or she would be counted in
the death data dx at age x.
The key point is that we can tolerate a substantial amount of uncertainty in the numerator
and the denominator (number of events and total time at risk), but failing to satisfy the Principle
of Correspondence can lead to serious error. For example, [22] analyses the “Hispanic Paradox,”
the observation that Latin American immigrants in the USA seem to have substantially lower
mortality rates than the native population, despite being generally poorer (which is usually
associated with shorter lifespans). This difference is particularly pronounced at more advanced
ages. Part of the explanation seems to be return migration: some older Hispanics return to their
home countries when they become chronically ill or disabled. Thus, there are some members of
this group who count as part of the US Hispanic population for most of their lives, but whose
deaths are counted in their home-country statistics.
2.11.2 Census approximation
The task is to approximate Exc (and often also dx ) given census data. There are various forms of
census data. The most common one is
Px,k = Number of individuals in the population aged [x, x + 1) at time k = 0, . . . , n.
The problem is that we do not know when the individuals were actually available to be observed.
If this is a national census, it won’t count people who emigrated before the census, or those who
immigrated the day after the census. If we are doing a census of wildlife, we don’t count the
individuals who were not caught, and again, there will be migration in and out of the observation
area.
The basic assumption of the census approximation is that the number of individuals changes
linearly between any two consecutive census dates. By definition,
$$E_x^c = \int_0^n P_{x,t} \, dt \tag{2.7}$$
We only know the integrand at integer times, and linear approximation yields
$$E_x^c \approx \sum_{k=1}^n \frac{1}{2}\left( P_{x,k-1} + P_{x,k} \right). \tag{2.8}$$
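A one-line sketch of the approximation (2.8) in R, with invented census counts:

    P <- c(1040, 1010, 985, 990, 1002)         # census counts P_{x,0}, ..., P_{x,n}
    sum((head(P, -1) + tail(P, -1)) / 2)       # trapezoid estimate of E^c_x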
$d'_x$ = Number of deaths of individuals aged x on their birthday in the calendar year of death.
Then some of the deaths counted in $d'_x$ will be deaths aged x − 1, not x; in fact we should view
$d'_x$ as containing deaths aged in the interval (x − 1, x + 1), but not all of them. If we assume that
birthdays are uniformly spread over the year, we can also specify that the proportion of deaths
counted under $d'_x$ changes linearly from 0 to 1 and back to 0 as the age at death increases from x − 1 to x and on to x + 1.
In order to estimate a force of mortality, we need to identify the corresponding (approximation
to) Central exposed to risk. The Principle of Correspondence requires
$$E_x^{c\,\prime} = \int_0^n P'_{x,t} \, dt, \tag{2.9}$$
where
$P'_{x,t}$ = Number of individuals in the population at time t with xth birthday in calendar year $\lfloor t \rfloor$.
Again, suppose we know the integrand at integer times. Here the linear approximation requires
some care, since the policy holders do not change age group continuously, but only at census
dates. Therefore, all continuing policy holders counted in $P'_{x,k-1}$ will be counted in $P'_{x,t}$ for all $t \in [k-1, k)$.
The ratio $d'_x / E_x^{c\,\prime}$ gives a slightly smoothed (because of the wider age interval) estimate of $\mu_x$ over
the time interval (k − 1, k + 1). Note however that it is not clear whether this estimate is a maximum
likelihood estimate for µx under any suitable model assumptions such as constancy of the force
of mortality between half-integer ages.
experiences. (A “cohort” was originally a unit of a Roman legion.) Note that cohorts need not
be birth cohorts, as the horizontal axis of the Lexis diagram need not represent literal birthdates.
For instance, a study of marriage would start “lifelines” at the date of marriage, and would refer
to the “marriage cohort of 2008”, for instance, while a study of student employment prospects
would refer to the “student cohort of 2008”, the collection of all students who completed (or
started) their studies in that year.
Figure 2.1: A Lexis diagram. The red region represents the experience of all individuals (at
whatever time) aged between 3 years and 4 years, 8 months. The green region represents the
experience of all individuals during the period from 15 March 2004 and 31 December 2005. The
blue region represents the experience of the cohort born in 2002, from birth to age 5.
The census approximation involves making estimates for mortality rates in regions of the Lexis
diagram. Vertical lines represent the state of the population, so a census may be represented by
counting (and describing) the lifelines that cross a given vertical line. The goal is to estimate the
hazard rate for a region (in age-time space) by
$$\frac{\#\text{ events}}{\text{total time at risk}}.$$
The total time at risk is the total length of lifelines intersecting the region (or, to be geometric
about it, the total length divided by $\sqrt{2}$), while the number of events is a count of the number
of dots. The problem is that we do not know the exact total time at risk. Our censuses do tell
us, though, the number of individuals at risk.
The count dx described in section 2.11.2 tells us the number of deaths of individuals aged
between x and x + 1 (for integer x), so it is counting events in horizontal strips, such as we have
Figure 2.2: Census at time 3 represented by open circles. The population consists of 7 individuals.
3 are between ages 2 and 3, 2 are between ages 1 and 2, and 2 are between 0 and 1.
shown in Figure 2.3. We are trying to estimate the central exposed to risk $E_x^c := \int_0^T P_{x,t} \, dt$,
where $P_{x,t}$ is the number of individuals alive at time t whose curtate age is x. We can represent this as
$$E_x^c = \int_0^T P_{x,t} \, dt \tag{2.11}$$
If we assume that Px,t is approximately linear over such an interval, we may approximate the
average over $[k, k+1]$ by $\frac{1}{2}(P_{x,k} + P_{x,k+1})$. Then we get the approximation
$$E_x^c \approx \frac{1}{2} P_{x,0} + \sum_{k=1}^{T-1} P_{x,k} + \frac{1}{2} P_{x,T}.$$
Note that this is just the trapezoid rule for approximating the integral (2.11).
Is this assumption of linearity reasonable? What does it imply? Consider first the individuals
whose lifelines cross a box with lower corner (k, x). (Note that, unfortunately, the order of the
age and time coordinates is reversed in the notation when we go to the geometric picture. This
has no significance except sloppiness which needs to be cleaned up.) They may enter either the
left or the lower border. In the former case (corresponding to individuals born in year x − k)
they will be counted in Px,k ; in the latter (born in x − k + 1) case in Px,k+1 . If the births in year
x − k + 1 differ from those in year x − k by a constant (that is, the difference between January 1
births in the two years is the same as the difference between February 12 births, and so on), then
on average the births in the two years on a given date will contribute 1/2 year to the central
years at risk, and will be counted once in the sum $P_{x,k} + P_{x,k+1}$. Important to note:
• This does not actually require that births be evenly distributed through the year.
• When we say births, we mean births that survive to age k. If those born in, say, December
of one year had substantially lowered survival probability relative to a “normal” December,
this would throw the calculation off.
• These assumptions are not about births and deaths in general, but rather about births and
deaths of the population of interest: those who buy insurance, those who join the clinical
trial, etc.
If mortality levels are low, this will suffice, since nearly all lifelines will be counted among
those that cross the box. If mortality rates are high, though, we need to consider the contribution
of years at risk due to those lifelines which end in the box. In this case, we do need to assume that
births and deaths are evenly spread through the year. This assumption implies that conditioned
on a death occurring in a box, it is uniformly distributed through the box. On the one hand, that
implies that it contributes (on average) 1/4 year to the years at risk in the box. On the other
hand, it implies that the probability of it having been counted in our average $\frac{1}{2}(P_{x,k} + P_{x,k+1})$
is 1/2, since it is counted only if it is in the upper left triangle of the box. On average, then, these
should balance.
Figure 2.3: Census approximation when events are counted by actual curtate age. The vertical
dashed segments represent census counts, carried out on 1 January of years 2, 3, 4, 5, and 6.
What happens when we count births and deaths only by calendar year? Note that $P'_{x,k} = P_{x,k}$
for integers k and x. One difference is that the regions in question, which are parallelograms,
follow the same lifelines from the beginning of the year to the end. This makes the analysis more
straightforward. Lifelines that pass through the region are counted on both ends. The other
difference is that the region that begins with the census value Px,k ends not with Px,k+1 , but
with Px+1,k+1 . Thus all the lifelines passing through the region will be counted in Px,k and in
Px+1,k+1 , hence also in their average. This requires no further assumptions. For the lifelines that
end in the region to be counted appropriately, on the other hand, requires that the deaths be
evenly distributed throughout the year. (Other, slightly less restrictive assumptions, are also
possible.) In this case, each death will contribute exactly 1/2 to the estimate 12 (Px,k + Px+1,k+1 )
(since it is counted only in Px,k ), and it contributes on average 1/2 year of time at risk.
Figure 2.4: Census approximation when events are counted by calendar year of birth and death.
Vertical segments bounding the coloured regions represent census counts. P3,t , for instance, is
the number of red lifelines crossing in the yellow region, at Year t.
years for a man in Angola, to 85.6 years for a woman in Japan. The UK is in between (though,
of course, much closer to Japan), with 76.5 years for men and 81.6 years for women.
Table 2.3: 2009 Life expectancy at birth (LE) in years and infant mortality rate per thousand
live births (IMR) in selected countries, by sex. Data from US Census Bureau. International
Database available at http://www.census.gov/ipc/www/idb/idbprint.html
Life expectancies can vary significantly, even within the same country. For example, the UK
Office of National Statistics has published estimates of life expectancy for 432 local areas in
the UK. We see there that, for the period 2005–7, men in Kensington and Chelsea had a life
expectancy of 83.7 years, and women 87.8 years; whereas in Glasgow (the worst-performing area)
the corresponding figures were 70.8 and 77.1 years. Overall, English men live 2.7 years longer on
average than Scottish men, and English women 2.0 years longer.
When we think of lifetimes as random variables, the life expectancy is simply the mathematical
expectation E[T]. By definition,
$$E[T] = \int_0^\infty t f_T(t) \, dt.$$
Integration by parts, using the fact that $f_T = -\bar F_T'$, turns this into a much more useful form,
$$E[T] = -t \bar F_T(t) \Big|_0^\infty + \int_0^\infty \bar F_T(t) \, dt = \int_0^\infty \bar F_T(t) \, dt = \int_0^\infty e^{-\int_0^t \mu_s \, ds} \, dt. \tag{2.12}$$
That is, the life expectancy may be computed simply by integrating the survival function. The
discrete form of this is
$$E[K] = \sum_{k=0}^\infty k \, P(K = k) = \sum_{k=0}^\infty P(K > k). \tag{2.13}$$
Applying this to life tables, we see that the expected curtate lifetime is
$$E[K] = \sum_{k=0}^\infty P(K > k) = \sum_{k=1}^\infty \frac{l_k}{l_0} = \sum_{k=1}^\infty p_0 \cdots p_{k-1}.$$
(Recall that Tx is the remaining lifetime at age x. It is defined on the event {T ≥ x}, and has
the value T − x. Its distribution is understood to be the conditional distribution of T − x on
{T ≥ x}.) We see that $e_x \le \mathring{e}_x < e_x + 1$. For sufficiently smooth lifetime distributions, $\mathring{e}_x \approx e_x$
will be a good approximation.
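In R these expectancies are one-line computations from the ℓx column (a sketch; lx is assumed to hold the vector ℓ0, ℓ1, . . . of a life table):

    e0 <- sum(lx[-1]) / lx[1]              # expected curtate lifetime E[K]
    ex <- rev(cumsum(rev(lx))) / lx - 1    # remaining curtate life expectancy at each age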
2.13.2 Example
Table 2.4 shows a life table based on the mortality data for tyrannosaurs from Table 1.1. Notice
that the life expectancy at birth e0 = 16.0 years is exactly what we obtain by averaging all the
ages at death in Table 1.1.
age 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
dx 0 0 3 1 1 3 2 1 2 4 4 3 4 3 8
lx 103 103 103 100 99 98 95 93 92 90 86 82 79 75 72
qx 0.00 0.00 0.03 0.01 0.01 0.03 0.02 0.01 0.02 0.04 0.05 0.04 0.05 0.04 0.11
ex 16.0 15.0 14.0 13.5 12.6 11.7 11.1 10.3 9.4 8.7 8.1 7.5 6.7 6.1 5.4
age 15 16 17 18 19 20 21 22 23 24 25 26 27 28
dx 4 4 7 10 6 3 10 8 4 3 0 3 0 2
lx 64 60 56 49 39 33 30 20 12 8 5 5 2 2
qx 0.06 0.07 0.12 0.20 0.15 0.09 0.33 0.40 0.33 0.38 0.00 0.60 0.00 1.00
ex 5.0 4.4 3.7 3.2 3.0 2.6 1.8 1.7 1.8 1.8 1.8 0.8 1.0 0.00
Table 2.4: Life table for tyrannosaurs, based on data from Table 1.1.
e0 = p0 (1 + e1 ).
This follows either from (2.13), or directly from observing that if someone survives the first year
(which happens with probability p0 ) he will have lived one year, and have (on average) e1 years
remaining. Thus,
$$q_0 = 1 - \frac{e_0}{1 + e_1} = \frac{1 + e_1 - e_0}{1 + e_1} = \frac{0.6}{74.4} = 0.008,$$
which is approximately right. On the 1890 life table we see that the life expectancy of a newborn
was 44.1 years, but this rose to 52.3 years for a boy on his first birthday. This can only mean
that a substantial portion of the children died in infancy. We compute the first-year mortality as
q0 = (1 + 52.3 − 44.1)/53.3 = 0.17, so about one in six.
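The same computation in R, as a quick check of the figure just quoted:

    e0 <- 44.1; e1 <- 52.3
    (1 + e1 - e0) / (1 + e1)     # about 0.17, i.e. roughly one in six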
How much would life expectancy have been increased simply by eliminating infant mortality
— that is, mortality in the first year of life? In that case, all newborns would have reached their
first birthday, at which point they would have had 52.2 years remaining on average — thus, 53.2
years in total. Today, with infant mortality almost eliminated, there is only a potential 0.6 years
remaining to be achieved from further reductions.
Year 4: 300 born, 50 die. 75 1-year-olds die. 100 2-year-olds die. 90 3-year-olds die.
x dx ℓx qx ex x dx ℓx qx ex
0 100 300 0.333 1.57 0 150 350 0.43 1.2
1 20 200 0.10 1.35 1 40 200 0.20 1.1
2 90 180 0.50 0.50 2 100 160 0.625 0.375
3 90 90 1.0 0 3 60 60 1.0 0
x qx ℓx dx ex
0 0.167 1000 167 1.69
1 0.25 833 208 1.03
2 0.625 625 391 0.375
3 1.0 235 235 0
In Table 2.5 we compute different life tables from these data. The two cohort life tables
(Tables 2.5(a) and 2.5(b)) are fairly straightforward: We start by writing down ℓ0 (the number
of births in that cohort) and then in the dx column the number of deaths in each year from that
cohort. Subtracting those successively from `0 yields the number of survivors in each age class
The remaining life expectancies are then
$$e_0 = \frac{\ell_1}{\ell_0} + \frac{\ell_2}{\ell_0} + \frac{\ell_3}{\ell_0}, \qquad e_1 = \frac{\ell_2}{\ell_1} + \frac{\ell_3}{\ell_1}, \qquad e_2 = \frac{\ell_3}{\ell_2}.$$
The period life table is computed quite differently. We start with the qx numbers, which
come from different cohorts:
We then write in the radix ℓ0 = 1000. Of 1000 individuals born, with q0 = 0.167, we expect 167
to die, giving us our d0. Subtracting that from ℓ0 tells us that ℓ1 = 833 of the 1000 newborns
live to their first birthday. And so it continues. The life expectancies are computed by the same
formula as before, but now the interpretation is somewhat different. The cohort remaining life
expectancies were the same as the actual average number of (whole) years remaining for the
population of individuals from that cohort who reached the given age. The period remaining life
expectancies are fictional, telling us how many individuals would have remained alive if we had a
cohort of 1000 that experienced in each age the same mortality rates that were in effect for the
population in year 4.
past 150 years, mortality rates decline in the interval, which means that the survival rates will be
higher than we see in the period table.
We show in Figure 2.5 a picture of how a cohort life table for the 1890 cohort would be
related to the sequence of period life tables from the 1890s through the 2000s. The mortality
rates for ages 0 through 9 (thus $_1q_0$, $_4q_1$, $_5q_5$)³ are on the 1890s period life table, while their
mortality rates for ages 10 through 19 are on the 1900–1909 period life table, and so on. Note
that the mortality rates for the 1890s period life table yield a life expectancy at birth e0 = 44.2
years. That is the average length of life that babies born in those years would have had, if their
mortality in each year of their lives had corresponded to the mortality rates which were realised
in for the whole population in the year of their birth. Instead, though, those that survived their
early years entered the period of late-life high mortality in the mid- to late 20th century, when
mortality rates were much lower. It may seem surprising, then, that the life expectancy for the
cohort life table only goes up to 44.7 years. Is it true that this cohort only gained 6 months of
life on average, from all the medical and economic progress that took place during their lives?
Yes and no. If we look more carefully at the period and cohort life tables in Table 2.6 we
see an interesting story. First of all, a substantial fraction of potential lifespan is lost in the first
year, due to the 17% infant mortality, which is obviously the same for the cohort and period life
tables. 25% died before age 5. If mortality to age 5 had been reduced to modern levels — close
to zero — the period and cohort life expectancies would both be increased by about 14 years. Second,
notice that the difference in life expectancies jumps to over 5 years at age 30. Why is that?
For the 1890 cohort, age 30 was 1920 — after World War I, and after the flu pandemic. The
male mortality rate in this age class was around 0.005 in 1900–9, and less than 0.004 in 1920–9.
Averaged over the intervening decade, though, male mortality was close to 0.02. (Most of the
effect is due to the war, as we see from the fact that it almost exclusively is seen in the male
mortality; female mortality in the same period shows a slight tick upward, but it is on the order
of 0.001.) One way of measuring the horrible cost of that war is to see that for the generation
of men born in the 1890s, that was most directly affected, the advances of the 20th century
procured them on average about 4 years of additional life, relative to what might have been
expected from the mortality rates in the year of their birth. Of these 4 years, 3½ were lost in the
war. Another way of putting this is to see that the approximately 4.5 million boys born in the
UK between 1885 and 1895 lost cumulatively about 16 million years of potential life in the war.
There are, in a sense, three basic kinds of life tables:
1. Cohort life table describing a real population. These make most sense in a biological
context, where there is a small and short-lived population. The `x numbers are actual
counts of individuals alive at each time, and the rest of the table is simply calculated from
these, giving alternative descriptions of survival and mortality.
2. Period life tables, which describe a notional cohort (usually starting with radix `0 being
a nice round number) that passes through its lifetime with mortality rates given by the
qx . These qx are estimated from data such as those of Table 2.2, giving the number of
individuals alive in the age class during the period (or number of years lived in the age
class) and the number of deaths.
3. Synthetic cohort life tables. These take the qx numbers from a real cohort, but express
them in terms of survival `x starting from a rounded radix.
³ Actually, we have given $\mu_x$ for the intervals $[0,1)$, $[1,5)$, and $[5,10)$. We compute ${}_1q_0 = 1 - e^{-\mu_0}$, ${}_4q_1 = 1 - e^{-4\mu_1}$, ${}_5q_5 = 1 - e^{-5\mu_5}$.
Figure 2.5: Decade period life tables, with the pieces joined that would make up a cohort life
table for individuals born in 1890.
(a) Period life table for men in England and Wales, 1890–9; (b) cohort life table for the 1890 cohort of men in England and Wales.
x µx ℓx dx ex    x µx ℓx dx ex
0 0.187 100000 17022 44.2 0 0.187 100000 17022 44.7
1 0.025 82978 7923 51.7 1 0.025 82978 7923 52.3
5 0.004 75055 1655 52.8 5 0.004 75055 1655 53.5
10 0.002 73400 908 48.9 10 0.002 73400 774 49.6
15 0.004 72492 1379 44.5 15 0.003 72626 1167 45.1
20 0.005 71113 1766 40.3 20 0.020 71459 6749 40.8
25 0.006 69347 2087 36.2 25 0.017 64710 5219 39.7
30 0.008 67260 2550 32.2 30 0.004 59491 1257 37.9
35 0.010 64710 3229 28.4 35 0.006 58234 1608 33.6
40 0.013 61481 3970 24.7 40 0.006 56626 1671 29.5
45 0.017 57511 4703 21.2 45 0.009 54955 2384 25.3
50 0.022 52808 5515 17.8 50 0.012 52571 3027 21.3
55 0.030 47293 6508 14.6 55 0.019 49544 4388 17.5
60 0.042 40785 7710 11.6 60 0.028 45156 5956 14.0
65 0.061 33075 8636 8.9 65 0.044 39200 7760 10.8
70 0.086 24439 8511 6.5 70 0.067 31440 8985 8.0
75 0.122 15928 7281 4.5 75 0.102 22455 8940 5.7
80 0.193 8647 5346 2.7 80 0.146 13515 6997 3.8
85 0.262 3301 2410 1.7 85 0.215 6518 4294 2.3
90 0.358 891 742 0.9 90 0.288 2224 1697 1.4
95 0.477 149 135 0.5 95 0.395 527 454 0.8
100 0.590 14 13 0.3 100 0.516 73 67 0.4
105 0.695 1 1 0.2 105 0.645 6 6 0.2
110 0.772 0 0 0.0 110 0.733 0 0 0.0
Table 2.6: Period and cohort tables for England and Wales. The period table is taken directly
from the Human Mortality Database. The cohort table is taken from the period tables of the
HMD, not copied from their cohort tables.
The estimator for the constant force of mortality over the year is
\[ \tilde\mu_x = \frac{D_x}{E_x^c}, \qquad \text{with observed value } \frac{d_x}{E_x^c}. \]
\[ D_x \sim B(E_x, q_x) \;\Longrightarrow\; D_x \sim N\big(E_x q_x,\; E_x q_x(1-q_x)\big) \]
and
\[ \hat q_x = \frac{D_x}{E_x} \sim N\Big(q_x,\; \frac{q_x(1-q_x)}{E_x}\Big). \]
Poisson model
\[ D_x \sim N\big(E_x^c\mu_x,\; E_x^c\mu_x\big) \]
and
\[ \hat\mu_x \sim N\Big(\mu_x,\; \frac{\mu_x}{E_x^c}\Big). \]
Tests are often done using comparisons with a published standard life table. These can
be from national tables for England and Wales published every 10 years, or insurance company
data collected by the Continuous Mortality Investigation Bureau, or from other sources.
A superscript "s" denotes "from a standard table", such as $q_x^s$ and $\mu_x^s$.
Test statistics are generally obtained from the following:
Binomial:
\[ z_x = \frac{d_x - E_x q_x^s}{\sqrt{E_x q_x^s(1-q_x^s)}} \approx \frac{O-E}{\sqrt{V}}. \]
Poisson:
\[ z_x = \frac{d_x - E_x^c \mu_x^s}{\sqrt{E_x^c \mu_x^s}} \approx \frac{O-E}{\sqrt{V}}. \]
Both of these are denoted as $z_x$ since, under the null hypothesis that the standard table is correct, $Z_x$ has approximately a standard normal distribution.
2.16.3 The tests
χ2 test
We take
\[ X = \sum_{\text{all ages } x} z_x^2. \]
Under the null hypothesis this is a sum of squares of (approximately) independent standard normal random variables, each contributing a χ²(1), so X is approximately χ²(m), where m is the number of age groups. The χ² test has some weaknesses:
1. There may be a few large deviations offset by substantial agreement over part of the table.
The test will not pick this up.
2. There might be bias, that is, although not necessarily large, all the deviations may be of
the same sign.
3. There could be significant groups of consecutive deviations of the same sign, even if not
overall.
Signs test
The test statistic X is given by
\[ X = \#\{x : z_x > 0\}. \]
Under the null hypothesis $X \sim \mathrm{Binom}(m, \tfrac12)$, since the probability of a positive sign should be 1/2. This should be administered as a two-tailed test. It is under-powered, since it ignores the
size of the deviations but it will pick up small deviations of consistent sign, positive or negative,
and so it addresses point 2 above.
Cumulative deviations test
\[ \frac{\sum_x \big(d_x - E_x q_x^s\big)}{\sqrt{\sum_x E_x q_x^s(1-q_x^s)}} \sim N(0,1) \quad \text{approximately,} \]
and
\[ \frac{\sum_x \big(d_x - E_x^c \mu_x^s\big)}{\sqrt{\sum_x E_x^c \mu_x^s}} \sim N(0,1) \quad \text{approximately.} \]
H0 : there is no bias
HA : there is a bias.
This test addresses point 2 again, which is that the chi-square test does not test for consistent
bias.
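A minimal R sketch of these comparisons, assuming we have vectors dx of observed deaths, Ec of central exposures and mus of standard-table hazards (all names and values hypothetical):

# Hypothetical inputs: observed deaths, central exposure, standard-table hazards
dx  <- c(28, 30, 26, 32, 31, 29)
Ec  <- c(90, 85, 80, 75, 70, 65)
mus <- c(0.28, 0.30, 0.33, 0.36, 0.40, 0.44)

zx <- (dx - Ec * mus) / sqrt(Ec * mus)     # Poisson version of the z-statistics

# Chi-square test: sum of squared deviations, df = number of age groups
X2 <- sum(zx^2)
p.chisq <- pchisq(X2, df = length(zx), lower.tail = FALSE)

# Signs test: number of positive deviations is Binomial(m, 1/2) under H0
p.signs <- binom.test(sum(zx > 0), length(zx), p = 0.5)$p.value

# Cumulative deviations test
z.cum <- sum(dx - Ec * mus) / sqrt(sum(Ec * mus))
p.cum <- 2 * pnorm(-abs(z.cum))

c(X2 = X2, p.chisq = p.chisq, p.signs = p.signs, p.cum = p.cum)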
Other tests
There are tests to deal with consecutive bias/runs of the same sign. These are called the groups of signs test and the serial correlations test. Again, a fairly large number of age groups, m, is required to render these tests useful.
2.16.4 An example
Table 2.7 presents imaginary data for men aged 90 to 95. The column `x lists the initial at
risk, the number of men in the population on the census date, and dx is the number of deaths
from this initial population over the course of the year. Exc is the central at risk, estimated as
$\ell_x - d_x/2$. Standard male British mortality for these ages is listed in column $\mu_x^s$. (The column $\mathring\mu_x$ is a graduated estimate, which will be discussed in section 2.17.)
Table 2.7: Table of mortality rates for an imaginary old-people’s home, with standard British
male mortality given as µsx , and graduated estimate µ̊x .
We note substantial differences between the estimates µ̂x and the standard mortality µsx , but
none of them is extremely large relative to the standard error: The largest zx is 1.85. We test the
two-sided alternative hypothesis, that the mortality rates in the old-people’s home are different
from the standard mortality rates, with a χ2 test, adding up the zx2 . The observed X 2 is 7.1,
corresponding to an observed significance level p = 0.31. (Remember that we have 6 degrees of freedom, not 5, because these $z_x$ are independent; unlike in a contingency table, no degree of freedom is lost to estimated margins.)
2.17 Graduation
Graduation is a term common in lifetime analysis — particularly in actuarial contexts — for
what is generally called smoothing in statistics. Suppose that a company has collected its own
data, producing estimates for either qx or µx . The estimates may be rather irregular from year to
year and this could be an artefact of the population the company happens to have in a particular
scheme. The underlying model should probably (but not necessarily) be smoother than the raw
estimates. If the estimates are to be used for future predictions, then some smoothing is advisable.
This is called graduation.
There is always a tradeoff in smoothing procedures. Without smoothing, real patterns get
lost in the random noise. Too much smoothing, though, can swamp the data in the model, so
that the final estimate reflects more our choice of model than any truth gleaned from the data.
2.17.1 Parametric models
We may fit a formula to the data. Possible examples are
\[ \mu_x = \mu \ \ \text{(exponential)}; \qquad \mu_x = Be^{\theta x} \ \ \text{(Gompertz)}; \qquad \mu_x = A + Be^{\theta x} \ \ \text{(Makeham)}. \]
The Gompertz model can be a good fit for populations in middle to older age groups. The Makeham model has an extra additive constant, which is sometimes used to model “intrinsic mortality”, supposed to be independent of age. We could use more complicated formulae, for instance putting in polynomials in x.
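As a sketch of how such a parametric graduation might be fitted in R, assume we have vectors dx of deaths and Ec of central exposures at ages x (the data below are simulated, not taken from any table in these notes). The Gompertz hazard can then be fitted by maximising the Poisson log-likelihood:

# Poisson log-likelihood for a Gompertz hazard mu_x = B * exp(theta * x),
# with deaths dx and central exposure Ec at ages x (simulated data)
x  <- 60:80
Ec <- rep(500, length(x))
dx <- rpois(length(x), Ec * 0.01 * exp(0.09 * (x - 60)))

negloglik <- function(par) {
  B <- exp(par[1]); theta <- par[2]    # parametrise B on the log scale to keep it positive
  mu <- B * exp(theta * x)
  -sum(dx * log(mu) - Ec * mu)         # negative Poisson log-likelihood, up to a constant
}
fit <- optim(c(log(0.01), 0.1), negloglik, hessian = TRUE)
c(B = exp(fit$par[1]), theta = fit$par[2])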
2.17.2 Reference to a standard table
Here $\mathring q_x$, $\mathring\mu_x$ represent the graduated estimates. We could posit a simple dependence on the standard table, for example an age shift
\[ \mathring q_x = q^s_{x+k}, \qquad \mathring\mu_x = \mu^s_{x+k}, \]
or a linear relationship such as $\mathring q_x = a + b\,q^s_x$.
In general there will be some assigned functional dependence of the graduated estimate on
the standard table value. These are connected with the notions of accelerated lifetimes and
proportional hazards, which will be central topics in the second part of the course.
2.17.3 Nonparametric smoothing
We effectively smooth our data when we impose the assumption that mortality rates are constant
over a year. We may tune the strength of smoothing by requiring rates to be constant over longer
intervals. This is a form of local averaging, and there are more and less sophisticated versions of
this. In Matlab or R the methods available include kernel smoothing, orthogonal polynomials,
cubic splines, and LOESS. These are beyond the scope of this course.
In Figure 2.6 we show a very simple example. The mortality rates are estimated by individual
years or by lumping the data in five year intervals. The green line shows a moving average of the
one-year estimates, in a window of width five years.
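A sketch of this kind of simple smoothing in R, assuming yearly hazard estimates mu.hat at ages age (both names and the simulated values are hypothetical):

# Simple nonparametric smoothing of yearly mortality-rate estimates
age    <- 0:25
mu.hat <- pmax(0.01 * exp(0.15 * age) + rnorm(length(age), sd = 0.02), 0)  # noisy estimates

# Moving average over a window of width five years
ma5 <- stats::filter(mu.hat, rep(1/5, 5), sides = 2)

# Local regression (LOESS) as a more flexible alternative
lo <- loess(mu.hat ~ age, span = 0.4)

plot(age, mu.hat, pch = 16, xlab = "Age (years)", ylab = "Mortality rate")
lines(age, ma5, col = "green")
lines(age, predict(lo, data.frame(age = age)), col = "blue")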
Figure 2.6: Different smoothings for A. sarcophagus mortality from Table 1.1.
Figure 2.7: Estimated tyrannosaurus mortality rates from Table 2.4, together with exponential
and Gompertz fits.
as appropriate. For the weights, suitable choices are either $E_x$ or $E_x^c$ respectively. Alternatively we can use 1/variance, where the variance is estimated for $\hat q_x$ or $\hat\mu_x$, respectively.
The hypothesis tests we have already covered above can be used to test the graduation fit to the data, replacing $q_x^s$, $\mu_x^s$ by the graduated estimates. Note that in the χ² test we must reduce the degrees of freedom of the χ² distribution by the number of parameters estimated in the model for the graduation. For example, if $\mathring q_x = a + b\,q_x^s$, then we reduce the degrees of freedom by 2, as the parameters a, b are estimated.
2.17.5 Examples
Standard life table
We graduate the estimates in Table 2.7, based on the standard mortality rates listed in the
column $\mu_x^s$, using the parametric model $\mathring\mu_x = a + b\mu_x^s$. The log likelihood is
\[ \ell = \sum_x \Big( d_x \log \mathring\mu_{x+\frac12} - \mathring\mu_{x+\frac12}\, E_x^c \Big). \]
Setting the derivatives with respect to a and b to zero and solving numerically, we obtain â = −0.279 and b̂ = 2.6. This yields the
graduated estimates µ̊ tabulated in the final column of Table 2.7. Note that these estimates have
the virtue of being, on the one hand, closer to the observed data than the standard mortality
rates; on the other hand smoothly and monotonically increasing.
If we had used ordinary least squares to fit the mortality rates, we would have obtained
very different estimates: ã = −0.472 and b̃ = 3.44, because we would be trying to minimise the
errors in all classes equally, regardless of the number of observations. Weighted least squares,
with weights proportional to Exc (inverse variance) solves this problem, more or less, and gives us
estimates â∗ = −0.313 and b̂∗ = 2.75 very close to the MLE.
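A sketch of this comparison in R, with hypothetical vectors dx, Ec and mus standing in for the data of Table 2.7 (the numbers below are invented, so they will not reproduce the estimates quoted above):

# Graduation against a standard table: fit mu_ring = a + b * mu_s
# dx = deaths, Ec = central exposure, mus = standard-table hazards (hypothetical values)
mus <- c(0.20, 0.22, 0.25, 0.28, 0.31, 0.35)
Ec  <- c(40, 35, 30, 25, 20, 15)
dx  <- c(10, 11, 12, 11, 10, 9)

# Maximum likelihood (Poisson log-likelihood)
negloglik <- function(par) {
  mu <- par[1] + par[2] * mus
  if (any(mu <= 0)) return(Inf)        # keep the fitted hazards positive
  -sum(dx * log(mu) - Ec * mu)
}
mle <- optim(c(0, 1), negloglik)$par

# Ordinary and weighted least squares on the crude rates dx/Ec
ols <- coef(lm(I(dx/Ec) ~ mus))
wls <- coef(lm(I(dx/Ec) ~ mus, weights = Ec))

rbind(MLE = mle, OLS = ols, WLS = wls)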
In Figure 2.7 we plot the mortality rate estimates for the complete population of tyrannosaurs
described in Table 1.1, on a logarithmic scale, together with two parametric model fits: the
exponential model, with one parameter µ estimated by
\[ \hat\mu = \frac{1}{\bar t} = \frac{n}{t_1 + \dots + t_n} \approx \frac{n}{k_1 + \dots + k_n + n/2} = 0.058, \]
where $t_1, \dots, t_n$ are the n lifetimes observed, and $k_i = \lfloor t_i \rfloor$ the curtate lifetimes; and the Gompertz model $\mu_s = Be^{\theta s}$, estimated by
\[ \hat\theta \text{ solving } \frac{Q'(\hat\theta)}{Q(\hat\theta) - 1} - \frac{1}{\hat\theta} = \bar t, \qquad \hat B := \frac{\hat\theta}{Q(\hat\theta) - 1}, \qquad \text{where } Q(\theta) := \frac{1}{n}\sum_i e^{\theta t_i}. \]
This yields θ̂ = 0.17 and B̂ = 0.0070. It seems apparent to the eye that the exponential fit is
quite poor, while the Gompertz fit might be pretty good. It is hard to judge the fit by eye,
though, since the quality of the fit depends in part on the number of individuals at risk that go
into the individual mortality-rate estimates, something which does not appear in the plot.
To test the hypothesis, we compute the predicted number of deaths in each age class:
\[ d_x^{(\mathrm{exp})} = \ell_x\, q_x^{(\mathrm{exp})} \]
if there is a constant hazard $\mu_x = \hat\mu = 0.058$, meaning that $q_x^{(\mathrm{exp})} = 1 - e^{-\hat\mu} = 0.057$; and
\[ d_x^{(\mathrm{Gom})} = \ell_x\, q_x^{(\mathrm{Gom})}, \qquad \text{where } q_x^{(\mathrm{Gom})} := 1 - \exp\Big\{ -\frac{\hat B}{\hat\theta}\, e^{\hat\theta x}\big(e^{\hat\theta} - 1\big) \Big\}, \]
in the case of Gompertz mortality.
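The calculation above can be sketched in R; here lifetimes is a hypothetical vector of observed lifetimes standing in for the tyrannosaur data of Table 1.1, so the numerical results will differ from those quoted:

# Fit exponential and Gompertz models to a vector of (hypothetical) lifetimes
lifetimes <- c(2, 5, 8, 11, 13, 14, 16, 17, 18, 19, 21, 22, 23, 24, 26)
n    <- length(lifetimes)
tbar <- mean(lifetimes)

# Exponential: hat(mu) = 1 / mean lifetime
mu.hat <- 1 / tbar

# Gompertz: solve Q'(theta)/(Q(theta)-1) - 1/theta = tbar for theta
Q  <- function(theta) mean(exp(theta * lifetimes))
Qp <- function(theta) mean(lifetimes * exp(theta * lifetimes))
score <- function(theta) Qp(theta) / (Q(theta) - 1) - 1 / theta - tbar
theta.hat <- uniroot(score, c(0.01, 1))$root
B.hat     <- theta.hat / (Q(theta.hat) - 1)

# One-year death probability in age class [x, x+1) under each fitted model
qx.exp <- 1 - exp(-mu.hat)
qx.gom <- function(x) 1 - exp(-(B.hat / theta.hat) * exp(theta.hat * x) * (exp(theta.hat) - 1))
c(mu = mu.hat, theta = theta.hat, B = B.hat)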
Table 2.8: Life table for tyrannosaurs, with fit to exponential and Gompertz models, and based
on data from Table 1.1.
Chapter 3

Multiple-decrements model
• A working population insured for disability might transition into multiple different possible
causes of disability, which may be associated with different costs.
• Unemployed individuals may leave that state either by finding a job, or by giving up looking
for work and so becoming “long-term unemployed”.
An important common element is that calling the states “absorbing” does not have to mean that
it is a deathlike state, from which nothing more happens. Rather, it simply means that our
model does not follow any further developments.
3.1.1 An introductory example
This example is taken from section 8.2 of [29].
According to United Nations statistics, the probability of dying for men in Zimbabwe in
2000 was 5 q30 = 0.1134, with AIDS accounting for approximately 4/5 of the deaths in this age
group. Suppose we wish to answer the question: what would be the effect on mortality rates of a
complete cure for AIDS?
One might immediately be inclined to think that the mortality rate would be reduced to 1/5
of its current rate, so that the probability of dying of some other cause in the absence of AIDS, which we might write as ${}_5q_{30}^{OTHER*}$, would be 0.02268. On further reflection, though, it
seems that this is too low: This is the proportion of people aged 30 who currently die of causes
other than AIDS. If AIDS were eliminated, surely some of the people who now die of AIDS
would instead die of something else.
Of course, this is not yet a well-defined mathematical problem. To make it such, we need to
impose extra conditions. In particular, we impose the competing risks assumption: Individual
causes of death are assumed to act independently. You might imagine an individual drawing lots
from multiple urns, labelled “AIDS”, “Stroke”, “Plane crash”, to determine whether he will die of
this cause in the next year. The fraction of black lots in each urn is precisely the corresponding $q_x$, when the
individual has age x. If he gets no black lot, he survives the year. If he draws two or more, we
only get to see the one drawn first, since he can only die once. The probability of surviving is
then the product of the survival probabilities:
\[ {}_tq_x = 1 - \big(1 - {}_tq_x^{CAUSE1}\big)\big(1 - {}_tq_x^{CAUSE2}\big)\cdots \tag{3.1} \]
What is the fraction of deaths due to a given cause? Assuming constant mortality rate over the
time interval due to each cause, we have
\[ 1 - {}_tq_x^{CAUSE1} = e^{-t\lambda_x^{CAUSE1}}. \]
Given a death, the probability of it being due to a given cause is proportional to the associated hazard rate. Consequently, if a fraction p of deaths are due to cause 1, the probability of dying of cause 1 acting alone is
\[ {}_tq_x^{CAUSE1*} = 1 - \big(1 - {}_tq_x\big)^{p}. \]
(Note that this is the same formula that we use for changing lengths of time intervals: ${}_tq_x = 1 - (1 - {}_1q_x)^t$.) This tells us the probability of dying from cause 1 in the absence of any other
cause. The probability of dying of any cause at all is then given by (3.1).
Applying this to our Zimbabwe AIDS example, treating the causes as being either AIDS or
OTHER, we see that the probability of dying of AIDS in the absence of any other cause is
\[ {}_5q_{30}^{AIDS*} = 1 - (1 - {}_5q_{30})^{4/5} = 1 - 0.8866^{4/5} = 0.0918, \]
while the probability of dying of any other cause, in the absence of AIDS, is
\[ {}_5q_{30}^{OTHER*} = 1 - (1 - {}_5q_{30})^{1/5} = 1 - 0.8866^{1/5} = 0.0238. \]
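A two-line R sketch of this calculation (the 4/5 split between AIDS and other causes is the figure quoted above):

# Cause-deleted probabilities under the competing-risks (independence) assumption
q5.30  <- 0.1134                          # 5q30 for men in Zimbabwe, 2000
p.aids <- 4/5                             # fraction of deaths attributed to AIDS
q.aids  <- 1 - (1 - q5.30)^p.aids         # dying of AIDS in the absence of other causes
q.other <- 1 - (1 - q5.30)^(1 - p.aids)   # dying of other causes in the absence of AIDS
round(c(q.aids = q.aids, q.other = q.other), 4)   # approximately 0.0918 and 0.0238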
Such models occur naturally where insurance policies provide different benefits for different
causes of death, or distinguish death and disability, possibly in various different strengths or
forms. This is also clearly a building block (one transition only) for general Markov models,
where states j = 1, . . . , m may not all be absorbing.
Such a model depends upon the assumption that different causes of death act independently — that is, the probability of surviving is the product of what might be understood as the probabilities of surviving each individual cause acting alone.
3.1.3 Multiple decrements – time-homogeneous rates
In the time-homogeneous case, we can think of the multiple decrement model as m exponential
clocks Cj with parameters λj , 1 ≤ j ≤ m, and when the first clock goes off, say, clock j, the
only transition takes place, and leads to state j. Alternatively, we can describe the model as
consisting of one L ∼ Exp(λ+ ) holding time in state 0, after which the new state j is chosen
independently with probability λj /λ+ , 1 ≤ j ≤ m. The likelihood for a sample of size 1 consists
of two ingredients, the density λ+ e−tλ+ of the exponential time, and the probability λj /λ+ of
the transition observed. This gives λj e−tλ+ , or, for a sample of size n of lifetimes ti and states
ji , 1 ≤ i ≤ n,
\[ \prod_{i=1}^n \lambda_{j_i} e^{-t_i\lambda_+} = \prod_{j=1}^m \lambda_j^{n_j} e^{-\lambda_j(t_1 + \dots + t_n)}, \tag{3.3} \]
where $n_j$ is the number of transitions to $j$. Again, this can be solved factor by factor to give
\[ \hat\lambda_j = \frac{n_j}{t_1 + \dots + t_n}, \quad 1 \le j \le m. \tag{3.4} \]
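A small R sketch of the estimator (3.4), with simulated lifetimes and causes (the rates and sample size are hypothetical):

# MLE for a time-homogeneous multiple-decrement model:
# lambda_hat_j = (number of transitions to state j) / (total time observed)
set.seed(1)
lambda <- c(0.3, 0.1, 0.05)                   # hypothetical true decrement rates
n <- 200
L <- rexp(n, sum(lambda))                     # holding times
J <- sample(seq_along(lambda), n, replace = TRUE, prob = lambda / sum(lambda))  # causes
lambda.hat <- tabulate(J, nbins = length(lambda)) / sum(L)
rbind(true = lambda, estimate = round(lambda.hat, 3))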
Let us assume that the forces of decrement λj (t) = λj (x) are constant on x ≤ t < x + 1, for all
x ∈ N and 1 ≤ j ≤ m. Then the likelihood can be given as
\[ \prod_{x\in\mathbb{N}} \prod_{j=1}^m \lambda_j(x+\tfrac12)^{d_{j,x}} \exp\Big\{-\tilde\ell_x\, \lambda_j(x+\tfrac12)\Big\}, \tag{3.6} \]
where $d_{j,x}$ is the number of decrements to state j between ages x and x+1, and $\tilde\ell_x$ is the total time spent alive between ages x and x+1.
Now the parameters are λj (x), x ∈ N, 1 ≤ j ≤ m, and they are again well separated to
deduce
\[ \hat\lambda_j(x+\tfrac12) = \frac{d_{j,x}}{\tilde\ell_x}, \quad 1 \le j \le m,\ 0 \le x \le \max\{L_1, \dots, L_n\}. \tag{3.7} \]
Similarly, we can try to adapt the method to get maximum likelihood estimators from the curtate
lifetimes. We can write down the likelihood as
\[ \prod_{i=1}^n p_{(J,K)}(j_i, [t_i]) = \prod_{x\in\mathbb{N}} \Big( (1 - q_x)^{\ell_x - d_x} \prod_{j=1}^m q_{j,x}^{d_{j,x}} \Big), \tag{3.8} \]
but 1 − qx = 1 − q1,x − . . . − qm,x does not factorise, so we have to maximise simultaneously for
all 1 ≤ j ≤ m expressions of the form
\[ (1 - q_1 - \dots - q_m)^{\ell - d_1 - \dots - d_m} \prod_{j=1}^m q_j^{d_j}. \tag{3.9} \]
(We suppress the indices x.) A zero derivative with respect to qj amounts to
\[ (\ell - d_1 - \dots - d_m)\, q_j = d_j\, (1 - q_1 - \dots - q_m), \quad 1 \le j \le m, \tag{3.10} \]
which is solved by $\hat q_j = d_j/\ell$. To summarise the maximum-likelihood estimators in the three settings:
\[ (1-q)^{\ell-d}\, q^d \;\Rightarrow\; \hat q = d/\ell, \]
\[ \mu^d e^{-\mu\ell} \;\Rightarrow\; \hat\mu = d/\ell, \]
\[ (1 - q_1 - \dots - q_m)^{\ell - d_1 - \dots - d_m} \prod_{j=1}^m q_j^{d_j} \;\Rightarrow\; \hat q_j = d_j/\ell, \quad j = 1, \dots, m. \]
Consider, for example, a model with four states S = {W, V, I, ∆}, where W = ‘working’, V = ‘left the company voluntarily’, I = ‘left the company involuntarily’ and ∆ = ‘left the company through death’. If we observe $n_x$ people aged x, then
\[ \hat\lambda_x = \frac{d_{x,V}}{\tilde\ell_x}, \qquad \hat\sigma_x = \frac{d_{x,I}}{\tilde\ell_x}, \qquad \hat\mu_x = \frac{d_{x,\Delta}}{\tilde\ell_x}, \tag{3.15} \]
where $\tilde\ell_x$ is the total amount of time spent working aged x, $d_{x,V}$ is the total number of workers who left the company voluntarily aged x, $d_{x,I}$ is the total number who left the company involuntarily aged x, and $d_{x,\Delta}$ is the total number of workers dying aged x.
\[ p_{(J,K)}(j, x) = \mathbb{P}(J = j, K = x) = \mathbb{P}(L \le x+1, J = j \mid L > x)\,\mathbb{P}(L > x) = p_0 \cdots p_{x-1}\, q_{j,x}. \tag{3.19} \]
Note that this bivariate probability mass function is simple, whereas the joint distribution of
(L, J) is conceptually more demanding since L is continuous and J is discrete. We chose to
express the marginal probability density function of L and the conditional probability mass
function of J given L = t. In the assignment questions, you will see an alternative description
in terms of sub-probability densities $g_j(t) = \frac{d}{dt}\mathbb{P}(L \le t, J = j)$, which you can normalise: $g_j(t)/\mathbb{P}(J = j)$ is the conditional density of L given J = j.
the number of individuals marrying and separating, respectively, and similar for the estimation
of hazard rates. (For simplicity, we have divided the separation data, which were actually only
given for the periods [0, 3] and [3, 5], as though there were a count for separations in [0, 1].)
Table 3.1: Data from [17] on rates of conversion of cohabitations into marriage or separation, by
years since birth of first child
(a) % of cohabiting couples remaining together (from among those who did not marry): n = 106; after 3 years: 61; after 5 years: 48.
(b) % of cohabiting couples who marry within the stated time: n = 150; 1 year: 18; 3 years: 30; 5 years: 39.
Translating the data in Table 3.1 into a multiple-decrement life table requires some interpretive
work.
1. There are only 106 individuals given for the data on separation; this is both because the
individuals who eventually married were excluded from this tabulation, and because the
two tabulations were based on slightly different samples.
4. Note that separations are given by survival percentages, while marriages are given by loss
percentages.
We now construct a combined life table from the data in Table 3.1. The purpose of this
model is to integrate information from the two data sets. This requires some assumptions, to
wit, that the transition rates to the two different absorbing states are the same for everyone, and
that they are constant over the periods 0–1, 1–3, 3–5 (and constant over 0–3 for separation).
The procedure is essentially the same as the construction of the single-decrement life table,
except that the survival is decremented by both loss counts $d^M_x$ and $d^S_x$; and the estimation of years at risk $\tilde\ell_x$ now depends on both decrements, so is
\[ \tilde\ell_x = \ell_{x'}(x' - x) + (d^M_x + d^S_x)\,\frac{x' - x}{2}, \]
where $x'$ is the next age on the life table. Thus, for example, $\tilde\ell_1$, which is the number of years at
risk from age 1 to 3, is 64 · 2 + 41 · 1 = 169.
One of the challenges is that we observe transitions to Separation conditional on never being
Married, but the transitions to Married are unconditioned. The data for marriage are more
straightforward: These are absolute decrements, rather than conditional ones. If we set up a life
table on a radix of 1000, we know that the decrements due to marriage should be exactly the
percentages given in Table 3.1(b); that is, 180, 120, and 90. We begin by putting these into our
multiple-decrements life table, Table 3.2.
Of these nominal 1000 individuals, there are 610 who did not marry. Multiplying by the
percentages in Table 3.1(a) we estimate 238 separations in the first 3 years, and 79 in the next 2
years.
Table 3.2: Multiple decrement life table for survival of cohabiting relationships, from time of
birth of first child, computed from data in Table 3.1.
$x$   $\ell_x$   $d^M_x$   $d^S_x$   $\tilde\ell_x$   $\hat\mu^M_x$   $\hat\mu^S_x$
0–1 1000 180 ? ? ? ?
1–3 ? 120 ? ? ? ?
3–5 462 90 79 755 ? ?
At this point, the only thing preventing us from completing the life table is that we don’t
know how to allocate the 238 separations between the first two rows, which makes it impossible
to compute the total years at risk for each of these intervals. The easiest way is to begin by
making a first approximation by assuming the marriage rate is unchanged over the first three
years, and then coming back to correct it. Our first step, then, is the multiple-decrement table
in Table 3.3. According to our usual approximation, we estimate the years at risk for the first
period as 1000 · 3 − 538 · 1.5 = 2193; and in the second period as 462 · 2 − 169 · 1 = 755. The two
decrement rates are then estimated by dividing the number of events by the number of years at
risk.
What correction do we need? We need to estimate the number of separations in the first year.
Our model says that we have two independent times S and M, the former with constant hazard $\mu^S_{1.5}$ on the interval [0, 3), and the latter with constant hazards $\mu^M_{1/2}$ on [0, 1) and $\mu^M_2$ on [1, 3).
Table 3.3: First approximation, with both decrement rates assumed constant over ages 0–3.
$x$   $\ell_x$   $d^M_x$   $d^S_x$   $\tilde\ell_x$   $\hat\mu^M_x$   $\hat\mu^S_x$
0–3 1000 300 238 2193 0.137 0.109
3–5 462 90 79 755 0.119 0.105
Under these assumptions,
\[ \mathbb{P}\{\min\{S,M\} < 1 \ \&\ S < M\} = \frac{\mu^S_{1.5}}{\mu^M_{1/2} + \mu^S_{1.5}}\Big(1 - e^{-\mu^M_{1/2} - \mu^S_{1.5}}\Big); \]
\[ \mathbb{P}\{1 < \min\{S,M\} < 3 \ \&\ S < M\} = \frac{\mu^S_{1.5}}{\mu^M_{2} + \mu^S_{1.5}}\, e^{-\mu^M_{1/2} - \mu^S_{1.5}}\Big(1 - e^{-2\mu^M_{2} - 2\mu^S_{1.5}}\Big). \]
In Table 3.3 we have taken both values µM to be equal, the estimate being 0.137; were we to do
a second round of correction we could take the estimates to be distinct. We thus obtain
\[ \mathbb{P}\{\min\{S,M\} < 1 \ \&\ S < M\} \approx \frac{0.109}{0.137 + 0.109}\Big(1 - e^{-0.137-0.109}\Big) = 0.0966; \]
\[ \mathbb{P}\{1 < \min\{S,M\} < 3 \ \&\ S < M\} \approx \frac{0.109}{0.137 + 0.109}\, e^{-0.137-0.109}\Big(1 - e^{-2\cdot 0.137 - 2\cdot 0.109}\Big) = 0.135. \]
Thus, we allocate 0.418 × 238 = 99.4 of the separations to the first period, and 138.6 to the
second. This way we can complete the life table by computing the total years at risk for each
period, and hence the hazard rate estimates.
In principle, we could do a second round of approximation, based on the updated hazard rate
estimates. In fact, there is no real need to do this. The approximations are not very sensitive to
the exact value of the hazard rate. If we do substitute the new values in, the estimated fraction
of separations occurring in the first year will shift only from 0.418 to 0.419.
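A short R sketch of this allocation step, using the first-approximation hazard estimates from Table 3.3:

# Allocate the 238 separations between years [0,1) and [1,3),
# using the first-approximation hazards muM = 0.137, muS = 0.109
muM <- 0.137; muS <- 0.109
p1 <- muS / (muM + muS) * (1 - exp(-muM - muS))                       # separate in [0,1)
p2 <- muS / (muM + muS) * exp(-muM - muS) * (1 - exp(-2*muM - 2*muS)) # separate in [1,3)
frac1 <- p1 / (p1 + p2)         # fraction of separations falling in the first year
round(c(p1 = p1, p2 = p2, dS1 = 238 * frac1, dS2 = 238 * (1 - frac1)), 3)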
$x$   $\ell_x$   $d^M_x$   $d^S_x$   $\tilde\ell_x$   $\hat\mu^M_x$   $\hat\mu^S_x$
0–1 1000 180 99.4 860.2 0.209 0.116
1–3 720.6 120 138.6 1183 0.101 0.116
3–5 462 90 79 755 0.119 0.105
We are now in a position to use the model to draw some potentially interesting conclusions.
For instance, we may be interested to know the probability that a cohabitation with children will
end in separation. We need to decide what to do with the lack of observations after 5 years. For
simplicity, let us assume that rates remain constant after that point, so that all cohabitations
would eventually end in one of these fates. Applying the formula (3.16), we see that
Z ∞
µSx F̄ (x)dx.
P separate =
0
We have then
\[ \mathbb{P}\{\text{separate}\} = 0.116\int_0^1 e^{-0.325x}\,dx + 0.116\int_1^3 e^{-0.325-0.217(x-1)}\,dx + 0.105\int_3^\infty e^{-0.759-0.224(x-3)}\,dx \]
\[ = \frac{0.116}{0.325}\Big[1 - e^{-0.325}\Big] + \frac{0.116}{0.217}\Big[e^{-0.325} - e^{-0.759}\Big] + \frac{0.105}{0.224}\Big[e^{-0.759}\Big] = 0.099 + 0.136 + 0.219 = 0.454. \]
Chapter 4
Survival analysis
dependence between the event of observation and the value that is observed. In right-censoring,
for instance, the fact of observing a time implies that it occurred before the censoring time. The
distribution of a time conditioned on its being observed is thus different from the distribution of
the times that were censored.
There are different levels of independence, of course. In the case of random (type III)
censoring, the censoring time itself is independent of the (potentially) observed time. In Type II
censoring, the censoring time depends in a complicated way on all the observation times.
It was shown by [27] that it is impossible to reconstruct the joint distribution of $(\tilde T, C)$ from these data. It is possible to reconstruct the marginal distributions under the assumption that the event times $\tilde T$ and censoring times $C$ are independent.
Define $f$, $S$ and $f_C$, $S_C$ to be the density and survival functions of the event time and the censoring time respectively. Then the likelihood is
\[ L = \prod_{\delta_i = 1} f(t_i)\, S_C(t_i) \prod_{\delta_i = 0} S(c_i)\, f_C(c_i) = \prod_i h(t_i)^{\delta_i} S(t_i) \times \prod_i h_C(t_i)^{1-\delta_i} S_C(t_i). \]
Since this factors into an expression involving the distribution of Te and one involving the
distribution of C, we may perform likelihood inference on the event-time distribution without
reference to the censoring time distribution.
4.2.2 Non-informative censoring
As stated above, the random censoring assumption is both too strong and too weak. It is too
strong because it makes highly restrictive assumptions about the nature of the censoring process.
Note that this factorisation depends upon the assumption of independent censoring. In fact, the
same conclusion follows from a weaker assumption, called non-informative censoring.
We generally want to retain the assumption that the event times $\tilde T_i$ are jointly independent.
The censoring process is called non-informative if it satisfies the following conditions:
1. For each fixed t, and each i, the distribution of (Ti − t)1{Ti >t} is independent of {Ci > t}.
That is, knowing that an individual was or was not censored by time t gives no information
about the lifetime remaining after time t, if they were still alive at time t.
2. The event-time distribution and the censoring distribution do not depend on the same
parameters.
The latter condition cannot be formalised in a frequentist framework, since we would be taking
the distributions as fixed (but unknown), allowing no formal way to define “dependence”. In a
parametric setting this is simply the assertion that we maximise the likelihood in $\tilde T$ without regard to the distribution of C. (Note that in the absence of an independence assumption, such as the first condition above, it would not be possible to define a joint distribution in which the distribution of $\tilde T$ varies in a way that the distribution of C ignores.) In a Bayesian context we are asserting that the prior distribution on the distribution of $(\tilde T, C)$ makes the distribution of $\tilde T$ independent of the distribution of C.
Note that this is a convenient modelling assumption, not an assumption about the
underlying process. We are free to impose this assumption. If the assumption has been imposed
incorrectly, the main effect will be only to reduce the efficiency of estimation.
The first assumption, on the other hand — or some substitute assumption — is absolutely
necessary for the procedures we describe — or, indeed, any such procedure — to yield unbiased
estimates. In a regression setting where we consider the covariates xi to be random we weaken
the independence assumptions to be merely independence conditional on the observed covariates.
As an example, Type II censoring is clearly not independent, but it is non-informative.
Warning: The definitions of random and non-informative censoring are not entirely standardised.
All times are in months. Each patient has their own zero time, the time at which the patient
entered the study (accrual time). For each patient we record time to event of interest or censoring
time, whichever is the smaller, and the status, δ = 1 if the event occurs and δ = 0 if the patient
is censored. If it is the recurrence that is of interest, then the relevant time, the “time between”, is measured relative to a zero time which is the onset of cancer.
The basic idea is the following: There is no way to estimate a hazard rate from data without
some kind of smoothing, so the most direct representation of the data comes from estimating the
survival function directly. On any interval [a, b) on which no events have been observed, it is
natural to estimate Ŝ(b) − Ŝ(a) = 0. That is, the estimated probability of an event occurring in
this interval is the empirical probability 0. This leads us to estimate the survival function by a
step function whose jumps are at the points tj where events have been observed.
We conventionally list the event times (not the censoring times) in order as (t0 = 0 <)t1 <
. . . < tj < . . . . If there are ties — several individuals i with Ti = tj and δi = 1 — we represent
this by a count dj , being the number of individuals whose event was at tj . (Thus all dj are at
least 1.)
At each point $t_j$ there is a drop in $\hat S$. (Like cdfs, survival functions are taken to be right-continuous: $S(t) = \mathbb{P}\{\tilde T > t\}$.) The size of the drop at an event time $t_j$ is $\hat S(t_j-)\,\mathbb{P}\{\tilde T = t_j \mid \tilde T \ge t_j\}$. Estimating the change in S is then a matter of estimating this conditional probability, which is
also known as the discrete hazard hj . We have the natural estimator (which is also the MLE) for
this conditional probability, that is the empirical fraction of those observed to meet the condition
— that is, they survived at least to time tj — who died at time tj . This will be dj /nj , where nj is
the number of individuals at risk at time tj — that is, individuals i such that Ti ≥ tj , meaning
that they have neither been censored, nor have they had their event, before time tj .
The Kaplan-Meier estimator is the result of this process:
\[ \hat S(t) = \prod_{t_j \le t}\big(1 - \hat h_j\big) = \prod_{t_j \le t}\Big(1 - \frac{d_j}{n_j}\Big), \]
where $n_j$ and $d_j$ are as defined above. Note that
\[ n_{j+1} + c_j + d_j = n_j, \]
where $c_j = \#\{\text{censored in } [t_j, t_{j+1})\}$. If there are no censored observations before the first failure time then $n_0 = n_1 = \#\{\text{in study}\}$. Generally we assume $t_0 = 0$.
4.4.3 Nelson–Aalen estimator
The Nelson–Aalen estimator for the cumulative hazard function is
\[ \hat H(t) = \sum_{t_j \le t} \frac{d_j}{n_j} = \sum_{t_j \le t} \hat h_j. \]
This is natural for a discrete estimator, as we have simply summed the estimates of the hazards
at each time, instead of integrating, to get the cumulative hazard. This correspondingly gives an estimator of S of the form
\[ \tilde S(t) = \exp\big(-\hat H(t)\big) = \exp\Big(-\sum_{t_i \le t} \frac{d_i}{n_i}\Big). \]
Here + indicates a censored observation. Then we can calculate both estimators for S(t) at all
time points. It is considered unsafe to extrapolate much beyond the last time point, 14, even
with a large data set.
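The two estimators can also be computed from scratch, without the survival package; a sketch for vectors of observation times and a 0/1 status indicator (the data passed in at the end are invented):

# Kaplan-Meier and Nelson-Aalen estimates computed directly from (time, status)
km.na <- function(time, status) {
  tj <- sort(unique(time[status == 1]))                        # distinct event times
  nj <- sapply(tj, function(s) sum(time >= s))                 # number at risk at each event time
  dj <- sapply(tj, function(s) sum(time == s & status == 1))   # events at each event time
  data.frame(tj = tj, nj = nj, dj = dj,
             KM = cumprod(1 - dj / nj),                        # product-limit estimate
             NelsonAalen = exp(-cumsum(dj / nj)))              # exp(-Nelson-Aalen)
}
km.na(c(3, 5, 5, 8, 10, 12, 15, 15), c(1, 1, 0, 1, 0, 1, 1, 0))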
Table 4.1: Computations of survival estimates for invented data set (4.1)
Table 4.2: Times of complete remission for preliminary analysis of AML data, in weeks. Censored
observations denoted by +.
Table 4.3: Computations for the Kaplan–Meier and Nelson–Aalen survival curve estimates of the
AML data.
Table 4.4: Invented data illustrating left truncation. Event times after the censoring time may
be purely nominal, since they may not have occurred at all; these are marked with *. The row
Observation shows what has actually been observed. When the event time comes before the
truncation time the individual is not included in the study; this is marked by a ◦.
Patient ID 5 2 9 0 1 3 7 6 4 8
Event time 2 5 5 * 7 * 12 * * *
Censoring time 10 8 7 8 11 7 14 14 14 14
Truncation time −2 3 6 0 1 0 6 6 −5 1
Observation 2 5 ◦ 8+ 7 7+ 12 14+ 14+ 14+
Table 4.5: Computations of survival estimates for invented data set of Table 4.4.
We give a version of these data in Table 4.4. Note that patient number 9 was truncated at
time 6 (i.e., entered the nursing home at age 86) but her event was at time 5 (i.e., she had already
suffered from dementia since age 85), hence was not included in the study. In table 4.5 we give
the computations for the Kaplan–Meier estimate of the survival function. The computations are
exactly the same as those of section 4.4.4, except for one important change: the number at risk $n_j$ is not simply the number $n - \sum_{t_j < t} d_j - \sum_{t_j < t} k_j$ of individuals who have not yet had their
event or censoring time. Rather, an individual is at risk at time t if her event time and censoring
time are both ≥ t, and if the truncation time is ≤ t. (As usual, we assume that individuals who
have their event or are censored in a given year, were at risk during that year. We are similarly
assuming that those who entered the study at age x are at risk during that year.) At the start of
our invented study there are only 6 individuals at risk, so the estimated hazard for the event at
age 2 becomes 1/6.
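Left-truncated data of this kind can be handled in R with the counting-process form of Surv, which takes entry and exit times. A sketch using the invented data of Table 4.4 (patient 9 excluded; negative entry times set to 0):

library(survival)
# Left truncation via (entry, exit, status); the risk set at each time is
# {individuals with entry < t <= exit}
entry  <- pmax(c(-2, 3, 0, 1, 0, 6, 6, -5, 1), 0)
exit   <- c( 2, 5, 8, 7, 7, 12, 14, 14, 14)
status <- c( 1, 1, 0, 1, 0,  1,  0,  0,  0)
fit <- survfit(Surv(entry, exit, status) ~ 1)
summary(fit)   # only 6 individuals are at risk at the first event time (age 2)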
In the most common cases of truncation we need do nothing at all, other than be careful
in interpreting the results. For instance, suppose we were simply studying the age after 80 at
which individuals develop dementia by a longitudinal design, where 100 healthy individuals 80
years old are recruited and followed for a period of time. Those who are already impaired at age
80 are truncated. All this means is that we have to understand (as we surely would) that the
results are conditional on the individual not suffering from dementia until age 80.
where Z is approximately standard normal. Substituting, as usual, ĥj for the unknown hj we
get the approximation
\[ \hat H(t) - H(t) \approx \bigg(\sum_{t_j \le t} \frac{d_j(n_j - d_j)}{n_j^3}\bigg)^{1/2} Z. \tag{4.2} \]
This yields
\[ \tilde\sigma^2(t) := \sum_{t_j \le t} \frac{d_j(n_j - d_j)}{n_j^3} \tag{4.3} \]
as an estimate of the variance of $\hat H(t)$.
In fact, we don’t want to model the cumulative hazard as a step function with fixed jump
times. We could imagine interpolating extra “jump times”, when no events happen to have
been observed — hence with discrete hazard estimate 0 — and note that the estimate of Ĥ
and its variance is unchanged, allowing us to approximate a continuous cumulative hazard to
any accuracy we wish. To see a more mathematically satisfying derivation for σ̃(t), that uses
martingale methods to avoid discretisation, see [2].
4.6.2 The survival function and Greenwood’s formula
We can apply (4.4) directly to obtain a confidence interval for $S(t) = e^{-H(t)}$:
\[ \Big( \exp\big\{-\hat H(t) - z_{1-\alpha/2}\,\tilde\sigma(t)\big\},\ \exp\big\{-\hat H(t) + z_{1-\alpha/2}\,\tilde\sigma(t)\big\} \Big) = \Big( \tilde S(t)\, e^{-z_{1-\alpha/2}\tilde\sigma(t)},\ \tilde S(t)\, e^{z_{1-\alpha/2}\tilde\sigma(t)} \Big). \tag{4.5} \]
While the confidence interval (4.5) is preferable in some respects, we also need a confidence
interval centred on the traditional Kaplan–Meier estimator. To derive an estimator for the
variance we start with the formula
\[ \log \hat S(t) = \sum_{t_j \le t} \log\big(1 - \hat h_j\big). \]
Applying the delta method term by term gives the estimator
\[ \widehat{\operatorname{Var}}\big(\log\hat S(t)\big) = \sum_{t_j \le t} \frac{d_j}{n_j(n_j - d_j)} \]
for the variance of $\log\hat S(t)$. We may use this exactly as in (4.5) to produce confidence intervals for $\log S(t)$, and hence for $S(t)$. Traditionally we apply the delta method again to produce the estimator
\[ \sigma_G^2(t) := \hat S(t)^2 \sum_{t_j \le t} \frac{d_j}{n_j(n_j - d_j)}. \]
To this order,
\[ \mathbb{E}\, g(Y_1, \dots, Y_k) \approx g(\mu_1, \dots, \mu_k), \qquad \operatorname{Var} g(Y_1, \dots, Y_k) \approx \sum_{i,j} \frac{\partial g}{\partial y_i}(\mu)\, \frac{\partial g}{\partial y_j}(\mu)\, \Sigma_{ij}. \tag{4.6} \]
When k = 2 it becomes
\[ \mathbb{E}\, g(Y_1, Y_2) \approx g(\mu_1, \mu_2), \qquad \operatorname{Var} g(Y_1, Y_2) \approx \Big(\frac{\partial g}{\partial y_1}(\mu)\Big)^2 \sigma_1^2 + \Big(\frac{\partial g}{\partial y_2}(\mu)\Big)^2 \sigma_2^2 + 2\,\frac{\partial g}{\partial y_1}(\mu)\,\frac{\partial g}{\partial y_2}(\mu)\operatorname{Cov}(Y_1, Y_2). \tag{4.8} \]
where z is the appropriate quantile of the normal distribution. Note that the approximation
cannot be assumed to be very good in this case, since the number of individuals at risk is too
small for the asymptotics to be reliable. We show the confidence intervals in Figure 4.2.
Figure 4.2: Greenwood’s estimate of 95% confidence intervals for survival in maintenance group
of the AML study.
Table 4.6: Computations for Greenwood’s estimate of the standard error of the Kaplan-Meier
survival curve from the maintenance population in the AML data. “lower” and “upper” are
bounds for 95% confidence intervals, based on the log-normal distribution.
$t_j$   $n_j$   $d_j$   $\frac{d_j}{n_j(n_j-d_j)}$   $\operatorname{Var}(\log\hat S(t_j))$   lower   upper
9 11 1 0.009 0.009 0.754 1.000
13 10 1 0.011 0.020 0.619 1.000
18 8 1 0.018 0.038 0.488 1.000
23 7 1 0.024 0.062 0.377 0.999
31 5 1 0.050 0.112 0.255 0.946
34 4 1 0.083 0.195 0.155 0.875
48 2 1 0.500 0.695 0.036 0.944
4.8 Survival to ∞
Let T be a survival time, and define the conditional survival function
\[ S_0(t) := \mathbb{P}\big( T > t \mid T < \infty \big); \]
that is, the probability of surviving past time t given that the event eventually does occur. We
have
\[ S_0(t) = \frac{\mathbb{P}\{\infty > T > t\}}{\mathbb{P}\{\infty > T\}}. \tag{4.9} \]
How can we estimate S0 ? Nelson–Aalen estimators will never reach ∞ (which would mean 0
survival); Kaplan–Meier estimators will reach 0 if and only if the last individual at risk actually
has an observed event. In either case, there is no mathematical principle for distinguishing
between the actual survival to ∞ — that is, the probability that the event never occurs — and
simply running out of data. Nonetheless, in many cases there can be good reasons for thinking
that there is a time t∂ such that the event will never happen if it hasn’t happened by that time.
In that case we may use the fact that {T < ∞} = {T < t∂ } to estimate
\[ \hat S_0(t) = \frac{\hat S(t) - \hat S(t_\partial)}{1 - \hat S(t_\partial)}. \tag{4.10} \]
In this case, assuming that S(t) is constant after t = t∂ , we need to estimate the variance of
Ŝ0 (t). To compute a confidence interval for S0 (t) we apply the delta method again, in a slightly
more complicated form. Suppose we set Y1 = Ŝ(t) and Y2 = Ŝ(t∂ )/Ŝ(t). The variance of Y1 is
approximated just by Greenwood’s formula
\[ \sigma_1^2 := \operatorname{Var}\hat S(t) \approx \hat\sigma^2(t) = \hat S(t)^2 \sum_{t_j \le t} \frac{d_j}{n_j(n_j - d_j)}, \]
Since Y1 and Y2 depend on distinct survival events they are uncorrelated. We may apply the
delta method (4.8) to the two-variable function g(y1 , y2 ) = y1 (1 − y2 )/(1 − y1 y2 ), obtaining
(4.11)
If there is no a priori reason to choose a value of t∂ , we may estimate it from the data as
max{ti }, as long as there is a significant length of time during which there is a significant number
of individuals under observation, when an event could have been observed.
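A sketch of how $\hat S_0$ might be computed from a fitted Kaplan–Meier curve in R, taking $t_\partial$ to be the largest observed event time (the function name cond.surv is mine, not part of the survival package):

library(survival)
# Conditional survival S0(t) = (S(t) - S(t_del)) / (1 - S(t_del))
cond.surv <- function(time, status) {
  fit <- survfit(Surv(time, status) ~ 1)
  sm  <- summary(fit)       # KM estimate at the event times
  # t_del is taken to be the last observed event time, where the KM curve reaches its minimum
  Sdel <- min(sm$surv)
  data.frame(t = sm$time, S0 = (sm$surv - Sdel) / (1 - Sdel))
}

For the birth-interval example that follows, this amounts to rescaling the Kaplan–Meier curve so that it runs from 1 down to 0.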
4.8.1 Example: Time to next birth
This is an example discussed repeatedly in [2]. It has the advantage of being a large data set,
where the asymptotic assumptions may be assumed to hold; it has the corresponding disadvantage
that we cannot write down the data or perform calculations by hand.
The data set at http://folk.uio.no/borgan/abg-2008/data/second_births.txt lists, for
53,558 women listed in Norway’s birth registry, the time (in days) from first to second birth.
(Obviously, many women do not have a second birth, and the observations for these women will
be treated as censored.)
In Figure 4.3(a) we show the Kaplan–Meier estimator computed and automatically plotted
by the survfit command. Figure 4.3(b) shows a crude estimate for the distribution of time-
to-second-birth for those women who actually had a second birth. We see that the last birth
time recorded in the registry was 3677, after which time none of the remaining 131 women had a
recorded second birth. Thus, the second curve is simply the same as the first curve, rescaled to
go between 1 and 0, rather than between 1 and 0.293 as the original curve does.
The code used to generate the plots is in Code 1.
Figure 4.3: Time (in days) between first and second birth from Norwegian registry data.
21, 47, 47, 58+, 71, 71+, 125, 143+, 143+, 143+ (4.12)
sim.surv = Surv(c(21, 47, 47, 58, 71, 71, 125, 143, 143, 143), c(1, 1, 1, 0, 1, 0, 1, 0, 0, 0)).
Fitting models is done with the survfit command. This is designed for comparing distribu-
tions, so we need to put in some sort of covariate. Then we can write
sim.fit=survfit(sim.surv~1,conf.int=.99)
and then plot(sim.fit), or
plot(sim.fit,main='Kaplan-Meier for simulated data set',
xlab='Time',ylab='Survival')
to plot the Kaplan–Meier estimator of the survival function, as in Figure 4.4. The dashed lines
are the Greenwood estimator of a 99% confidence interval. (The default for conf.int is 0.95.)
The Nelson–Aalen estimator can also be computed with survfit. The associated survival estimator $\tilde S = e^{-\hat A}$ is called the Fleming–Harrington estimator, and it may be estimated with fit=survfit(formula, type='fleming-harrington'). The cumulative hazard — the negative log of the survival estimator — may be plotted with plot(fit, fun='cumhaz').
If you want to compute it more directly, you can extract the information in the survfit
object. If you want to see what’s inside an R object, you can use the str command. The output
is shown in Code 2.
We can then compute the Nelson–Aalen estimator with a function such as the one in Code
3. This is plotted together with the Kaplan–Meier estimator in Figure 4.5. As you can see, the
two estimators are similar, and the Nelson–Aalen survival is always higher than the KM.
4.9.2 Other survival objects
Left-censored data are represented with
Surv(time, event,type=’left’).
Here event can be 0/1 or 1/2 or TRUE/FALSE for alive/dead, i.e., censored/not censored.
Figure 4.4: Plot of Kaplan–Meier estimates from data in (4.12). Dashed lines are 95% confidence
interval from Greenwood’s estimate.
Interval censoring also takes time and time2, with type=’interval’. In this case, the event
can be 0 (right-censored), 1 (event at time), 2 (left-censored), or 3 (interval-censored).
Figure 4.5: Plot of Kaplan–Meier (black) and Nelson–Aalen (red) estimates from data in (4.12).
Dashed lines are pointwise 95% confidence intervals.
Chapter 5

Regression models
Parametric Any parametric model may be turned into a regression model by imposing a functional
assumption that links the covariates to one or more model parameters. As we will discuss
in section 5.4, there are traditional choices of parameters to modify in this way that provide
natural interpretations in a survival context.
Semiparametric In a fully parametric model the parameters defining the survival probabilities are not
separated from the parameters that describe the effect of the covariates. Assumptions
that we impose on the hazard rates may bias the estimate of the parameters measuring
covariate effects. Semiparametric approaches split the estimation of a baseline hazard rate
— which is done nonparametrically — from the estimation of a small number of parameters
that describe how individuals with particular parameter values differ from that baseline.
By far the most popular semiparametric model is the Cox proportional hazards regression
model, described in section 5.5.
Nonparametric A useful alternative approach is the Aalen additive-hazards regression model, described in
section 5.12, which estimates both cumulative hazards and cumulative effects of covariates
in a unified nonparametric way.
• Additive hazards: It seems natural when comparing two groups to measure the difference
in survival in terms of the cumulative difference in hazard — which is to say, in expected
number of events — and to suppose that each increment in the covariate might produce a
fixed increment in cumulative hazard. In an epidemiological context this means that the
output of the model is the expected number of events caused or prevented by a change
in treatment or risk factors. One disadvantage to this approach — described in detail in
section 5.12 — is that there is the potential for estimated hazards to become negative in some
parameter regimes, which is nonsensical.
• Proportional hazards: Here each individual's hazard is a multiple of a common baseline hazard,
\[ h_i(t) = \rho_i\, h_0(t), \]
where $h_0(t)$ is the baseline hazard, and $\rho_i$ is a function of covariates, which may themselves be changing over time. Equivalently, we have $H_i(t) = \rho_i H_0(t)$ for the cumulative hazard, and
\[ S_i(t) = S_0(t)^{\rho_i}. \]
This sort of model is also called relative-risk regression.
• Accelerated lifetimes: In this approach we say that there is a standard survival function $S_0(t)$, which applies to everyone, but different individuals run through the function at different rates. So individual i with acceleration parameter $\rho_i$ will have survival function
\[ S_i(t) = S_0(\rho_i t). \]
Equivalently, we have $H_i(t) = H_0(\rho_i t)$ for the cumulative hazard, or $h_i(t) = \rho_i h_0(\rho_i t)$ for the hazard. AL models will not be considered in this course except in the context of parametric models.
Similarly for Hg (t). Thus, the plot of Ŝg (t) or Ĥg (t) may be used as a diagnostic for AL models,
where we accept the AL assumption when we see an approximate agreement between the curves
when shifted horizontally.
To interrogate the PH assumption we plot the log cumulative hazard estimate (or plot the
cumulative hazard on a log scale). If distinct groups differ by a proportionality constant, then $\log H_g(t) = \log\rho_g + \log H_0(t)$. So if we plot $\log\hat H_g(t)$ against either t or $\log t$ (where g is, again, a group of individuals) we expect to see a vertical shift between groups. Note that $\log\hat H = \log(-\log\tilde S)$ (or we may use $\log(-\log\hat S)$), as a consequence of which this plot is known as the complementary log-log plot.
Taking both models together it is clear that we could plot
\[ \log\big(-\log \hat S_g(t)\big) \quad \text{against} \quad \log t, \]
as then we can check for AL and PH in one plot. Generally Sbg will be calculated as the
Kaplan–Meier estimator for group g.
• If the accelerated life model is plausible we expect to see a horizontal shift between groups.
• If the proportional hazards model is plausible we expect to see a vertical shift between
groups.
Of course, if the data came from a Weibull distribution, with differences in the ρ parameter,
it is simultaneously AL and PH. We see that
\[ \log\big(-\log S_g(t)\big) = \log\rho_g + \alpha\log t. \]
Thus, survival curve estimates for different groups should appear approximately as parallel lines,
which of course may be viewed as vertical or as horizontal shifts of one another.
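In R this diagnostic can be produced directly from a Kaplan–Meier fit: plot.survfit supports the complementary log-log transformation via fun="cloglog". A sketch using the aml data from the survival package, which is also the running example later in this chapter:

library(survival)
# Complementary log-log plot: log(-log S_g(t)) against log(t), one curve per group.
# Roughly parallel curves (a vertical shift) support the PH assumption;
# a horizontal shift supports the accelerated-life assumption.
fit <- survfit(Surv(time, status) ~ x, data = aml)
plot(fit, fun = "cloglog", col = 1:2, xlab = "log(time)",
     ylab = "log(-log(Survival))")
legend("topleft", legend = names(fit$strata), col = 1:2, lty = 1)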
In section 5.4 we illustrate this with two simulations of populations with Gompertz mortality.
The shape parameter α is assumed to be the same for each observation in the study.
As an example, the row of data for an individual will include
• response
– event time ti ;
We can now compute MLEs for α and all components of the vector β using numerical optimisation, giving estimators $\hat\alpha$, $\hat\beta$ together with their standard errors ($\sqrt{\widehat{\operatorname{Var}}\,\hat\alpha}$, $\sqrt{\widehat{\operatorname{Var}}\,\hat\beta_j}$), estimated from the observed information matrix. Of course, the same could have been done for another parametric model instead of the Weibull.
In general, if we have a parametric model with cumulative hazard Hα (t) (where α represents
the parameters of the model that do not vary between individuals), and $h_\alpha(t) = H_\alpha'(t)$ is the hazard function, we have the likelihood
\[ L(\alpha, \beta) = \prod_i \Big( e^{\beta\cdot x_i}\, h_\alpha\big(e^{\beta\cdot x_i} t_i\big) \Big)^{\delta_i} e^{-H_\alpha(e^{\beta\cdot x_i} t_i)}. \]
If observations are left-truncated, say at a time $s_i$, the survival term includes only the cumulative hazard from $s_i$ to $t_i$:
\[ L(\alpha, \beta) = \prod_i \Big( e^{\beta\cdot x_i}\, h_\alpha\big(e^{\beta\cdot x_i} t_i\big) \Big)^{\delta_i} e^{-H_\alpha(e^{\beta\cdot x_i} t_i) + H_\alpha(e^{\beta\cdot x_i} s_i)}. \]
For example, if we are fitting a Weibull model, the shape parameter will be α. Recall that
when fitting a Weibull we can test for α = 1 — the null hypothesis that the data actually might
have come from the simpler exponential distribution — using
\[ 2\log\hat L_{\mathrm{weib}} - 2\log\hat L_{\mathrm{exp}} \sim \chi^2(1), \quad \text{asymptotically.} \]
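A sketch of this comparison in R with survreg, using the aml data as a stand-in (the covariate x is the treatment group):

library(survival)
# Weibull and exponential AFT fits; twice the difference in log-likelihood
# is asymptotically chi-squared(1) under the exponential null
fit.weib <- survreg(Surv(time, status) ~ x, data = aml, dist = "weibull")
fit.exp  <- survreg(Surv(time, status) ~ x, data = aml, dist = "exponential")
lrt <- 2 * (fit.weib$loglik[2] - fit.exp$loglik[2])
c(LRT = lrt, p.value = pchisq(lrt, df = 1, lower.tail = FALSE))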
(Figure: log cumulative mortality plotted against age for two simulated populations, with separate curves for female and male groups.)
\[ h_i(t) = h\big(t \mid x_i\big) = h_0(t)\, r\big(\beta, x_i(t); t\big) \tag{5.1} \]
r(β, x) = 1 + βx.
In the multidimensional-covariate setting we can generalise this to the excess relative risk model
(taking p to be the dimension of the covariate)
\[ r(\beta, x) = \prod_{j=1}^p \big(1 + \beta_j x_j\big). \tag{5.3} \]
This allows each covariate to contribute its own excess relative risk, independent of the others.
Alternatively, we can define the linear relative risk function
\[ r(\beta, x) = 1 + \sum_{j=1}^p \beta_j x_j. \tag{5.4} \]
A note about coding of covariates: In order for the baseline survival to make sense,
the point where all the covariates are 0 must correspond to a plausible vector of co-
variates for an individual who could be in the sample. Thus, quantitative covariates
should be centred, either at the mean, the median, or some approximately central
value. Note that this normalisation cancels out between numerator and denomina-
tor of the partial likelihood in Cox regression, but it becomes especially important
when we add interaction effects, typically introduced as products of covariates.
1. A list of event times t1 < t2 < · · · . (We are assuming no ties, for the moment.)
3. The values of all individuals’ covariates (at times tj , if they are varying).
4. The risk sets Rj = {i : Ti ≥ tj }, the set of individuals who are at risk at time tj .
The most common way to fit a relative-risk model is to split the likelihood into two pieces:
The likelihood of the event times, and the conditional likelihood of the choice of subjects given
the event times. The first piece is assumed to contain relatively little information about the
parameters, and its dependence on the parameters is quite complicated.
We use the second piece to estimate β. Conditioned on some event happening at time t, the
probability that it is individual i is
\[ \pi_i(t) = \frac{r(\beta, x_i(t); t)}{\sum_{l\in R(t)} r(\beta, x_l(t); t)}, \]
where $R(t)$ is the risk set, the set of individuals at risk at time t. We have the partial likelihood
\[ L_P(\beta) = \prod_{t_j} \pi_{i_j}(t_j) = \prod_{t_j} \frac{r(\beta, x_{i_j}(t_j); t_j)}{\sum_{l\in R_j} r(\beta, x_l(t_j); t_j)}. \tag{5.5} \]
The partial likelihood is useful because it involves only the parameters β, isolating them from
the nonparametric (and often less interesting) α0 . The maximiser of the partial likelihood has
the same essential properties as the MLE.
\[ -\frac{\partial^2}{\partial\beta_i\,\partial\beta_j}\log L_P(\beta). \]
• Wald statistic: $\xi^2_W := (\hat\beta - \beta_0)^T J(\beta_0)(\hat\beta - \beta_0)$;
Under the null hypothesis these are all asymptotically chi-squared distributed with p degrees of
freedom. Here J(β) is the observed Fisher partial information matrix. There is a computable
estimate for this, which is fairly straightforward, but notationally slightly tricky in the general
(multivariate) case, so we do not include it here. (See equations (4.46) and (4.48) of [2] if you are
interested.) As usual, we can approximate the expected information by the observed information.
Here $U(\beta)$ is the vector of score functions $\partial\ell_P/\partial\beta_j$.
\[ \hat S_0(t) = e^{-\hat H_0(t)}, \qquad \hat h_0(t_j) = \frac{1}{\sum_{i\in R_j} r(\beta, x_i(t_j))}. \]
Breslow’s estimator for the cumulative baseline hazard is very similar to the Nelson–Aalen
estimator. We start from the constraint that our MLE for the cumulative hazard must confine
the points of increase (which are the times when events might be observed) to the times when
events actually were observed. Conditioning on that constraint¹ we want the baseline hazard estimator to have the form
\[ \hat H_0(t) = \sum_{t_j \le t} \hat h_0(t_j), \]
with
\[ \hat h_0(t_j) = \frac{1}{\sum_{i\in R_j} r(\beta, x_i(t_j))} \tag{5.6} \]
or, in the Cox model where $r(\beta, x) = e^{\beta\cdot x}$,
\[ \hat h_0(t_j) = \frac{1}{\sum_{i\in R_j} e^{\beta\cdot x_i(t_j)}}. \tag{5.7} \]
We estimate h0 (tj ) by
¹ A likelihood constrained to have certain features of the “parameters” (which, in this case, are an entire hazard function) already optimised, so that the remaining parameters can be computed, is called a profile likelihood.
which we approximate by
\[ \hat H(t \mid x) = \sum_{t_j \le t} \frac{r(\hat\beta, x(t_j))}{\sum_{i\in R_j} r(\hat\beta, x_i(t_j))}, \tag{5.9} \]
or, in the Cox model,
\[ \hat H(t \mid x) = \sum_{t_j \le t} \frac{e^{\hat\beta\cdot x(t_j)}}{\sum_{i\in R_j} e^{\hat\beta\cdot x_i(t_j)}}. \tag{5.10} \]
can approximate this by deducting the average of the risks that depart. In other words, in the
above example, the first contribution to the partial likelihood becomes
\[ \frac{r_1 r_2}{(r_1 + r_2 + r_3 + r_4 + r_5)\big(\tfrac12(r_1 + r_2) + r_3 + r_4 + r_5\big)}. \]
An alternative approach, due to Breslow, makes no correction for the progressive loss of risk
in the denominator:
\[ \ell_P^{\mathrm{Breslow}}(\beta) = \sum_{t_j}\bigg( \sum_{i\in D_j} \log r(\beta, x_i(t_j)) - d_j \log \sum_{i\in R_j} r\big(\beta, x_i(t_j)\big) \bigg). \]
This approximation is always too small, and tends to shift the estimates of β toward 0. It is
widely used as a default in software packages (SAS, not R!) for purely historical reasons.
Figure 5.2: Iterated log plot of survival of two populations in AML study, to test proportional
hazards assumption.
Figure 5.3: A plot of the partial likelihood from (5.11). Dashed line is at β = 0.9155.
Table 5.1: Output of the coxph function run on the aml data set.
The z is simply the Z-statistic for testing the hypothesis that β = 0, so z = β̂/SE(β̂). We
see that z = 1.79 corresponds to a p-value of 0.074, so we would not reject the null hypothesis at
level 0.05.
We show the estimated baseline hazard in Figure 5.4; the relevant numbers are given in Table
5.2. For example, the first hazard, corresponding to t1 = 5, is given by
\[ \hat h_0(5) = \frac{1}{12e^{\hat\beta} + 11} + \frac{1}{11e^{\hat\beta} + 11} = 0.050, \]
substituting in β̂ = 0.9155.
Figure 5.4: Estimated baseline hazard under the PH assumption. The purple circles show the
baseline hazard; blue crosses show the baseline hazard shifted up proportionally by a multiple
of eβ̂ = 2.5. The dashed green line shows the estimated survival rate for the mixed population
(mixing the two estimates by their proportions in the initial population).
Table 5.2: Computations for the baseline hazard MLE for the AML data, in the proportional
hazards model, with maintained group as baseline, and relative risk eβ̂ = 2.498. We write Y M
and Y N for the number at risk in the maintenance and non-maintenance groups respectively.
Figure 5.5: Comparing the estimated population survival under the PH assumption (green
dashed line) with the estimated survival for the combined population (blue dashed line), found
by applying the Nelson–Aalen estimator to the population, ignoring the covariate.
library(survival)
n=100
censrate=0.5
covmean=0
covsd=0.5
beta=1
x=rnorm(n,covmean,covsd)
# Event times: baseline cumulative hazard H0(t)=t^2/2, so T=sqrt(2*E*exp(-beta*x)) with E~Exp(1)
T=sqrt(rexp(n)*2*exp(-beta*x))
# Censoring times (one per subject)
C=rexp(n,censrate)
t=pmin(C,T)
delta=1*(T<C)
Then the Cox model may be fit with the command
cfit=coxph(Surv(t,delta)~x)
> summary(cfit)
Call:
coxph(formula = Surv(t, delta) ~ x)
plot(survfit(cfit,data.frame(x=0)),main='Cox example')
tt=(0:300)/100
lines(tt,exp(-tt^2/2),col=2)
legend(.1,.2,c('baseline survival estimate','true baseline'),col=1:2,lwd=2)
Figure 5.6: Survival estimated from 100 individuals simulated from the Cox proportional hazards model. True baseline survival $e^{-t^2/2}$ is plotted in red.
Suppose now we have a categorical variable — for example, three different treatment groups,
labelled 0,1,2 — with relative risk 1,2,3, let us say. If we were to use the command
cfit=coxph(Surv(t,delta)~x)
we would get the wrong result:
coef exp(coef) se(coef) z p
x 0.335 1.4 0.0865 3.88 0.00011
This comes close to estimating correctly the relative risks 2 and 2.5, which in the first version
were estimated as 1.4 and 1.42 = 1.96.
If you want to include time-varying covariates, this may be done crudely by having multiple
time intervals for each individual, with all having a start and a stop time, and all but (perhaps)
the last being right-censored. (Of course, if individuals have multiple events, then there may be
multiple intervals that end with an event.) This allows the covariate to be changed stepwise.
• the statistical methods for fitting additive hazards regression models make it relatively
easy to allow for effects that change with time;
• results of the additive model lend themselves to a natural interpretation as “excess mortality”.
The model parameters (unknown) are the baseline hazard β0 (t), and p functions β1 (t), . . . , βp (t).
The hazard for individual i at time t is then
We want to estimate this cumulative excess risk, as we always do for nonparametric models, by a
function that is piecewise constant, taking jumps only at the times tj when there is an event.
We may represent this2 as X
Bk (t) = dBk (tj ),
tj ≤t
where dBk (tj ) is simply a notation for the discrete hazard increment at time tj .
Let Yi (t) be the risk indicator for individual i — that is, it is 1 if individual i is at risk at
time t, and 0 otherwise. Consider the vector whose i-th component is
p
X
Yi (tj ) dB̂0 (tj ) + Yi (tj )xik (tj ) dB̂k (tj )
k=1
In matrix form this is X(tj ) dB̂(tj ), where Here X(t) is the n × (p + 1) matrix whose (i, k)
component is Yi (t)xik (t), with xi0 (t) ≡ 1. This is the hazard increment at time tj . If we observed
one event for individual i at time tj , we want the hazard increment to be about 1; if we observed
no event (as for most of the at-risk population) we want the increment to be about 0.
Of course, there is no way to choose the p + 1 components of dB̂(tj ) to make all n of these
relations come out exactly right. So, as in the case of linear regression, we look for a least-squares
approximation. That is, we solve exactly as in the case of multivariate linear regression. Taking
account of the possibility that the “design matrix” X(t) will eventually have rank below p + 1 —
after enough subjects have dropped out — at which point the estimation process has to stop, we
follow our usual procedure by defining
( −1
− X(t)T X(t) X(t)T if X(t) has full rank,
X (t) := (5.13)
0 otherwise.
In other words, it is the generalised inverse of X whenever this exists. Our usual least-squares
solution for this equation is then
where dN(tj ) is a vector of all 0’s, except for a 1 in the ij component, where ij is the individual
having an event at time tj . (Here we are assuming no ties, as is conventional. If there are ties,
we could still obtain an unbiased estimator by letting dN(tj ) have a 1 in all the coordinates
corresponding to individuals with events at time tj .)
2
Formally, this may be understood as a stochastic integral with respect to a jump process, but this formalism
is beyond the scope of this course. Those who are interested may find the details in section 4.2 of [2].
CHAPTER 5. REGRESSION MODELS 93
A consequence of the approach we have taken is that B̂k (t) is an unbiased estimator3 for
Bk (t) for each t.
Assume now that there is a single individual ij having event at time tj . (If there are ties,
this would require, as with relative-risk regression, some decision about the order of the events
in order to compute the variance. The R function aareg breaks ties randomly.) Then for large
sample sizes the estimators B̂k (t) are approximately multivariate normally distributed with
means Bk (t) and variance-covariance matrix estimated by
X
Σ̂k` (t) = B̂ − B∗
, B̂ − B ∗
(t) = X− −
kij (tj )X`ij (tj ). (5.15)
k `
tj ≤t
Suppose we wish to test the hypothesis that βk (·) is identically 0, for some particular k. This
is equivalent to saying that any test statistic
Z ∞
Zk = W (s)βk (s) ds
0
is 0. We will discuss nonparametric testing in more detail in chapter 6, in particular the criteria
for choosing a good weight function. The crucial thing is that W (s) has to be computable on
the basis of past information (that is, events and censoring before time s.)
Under the null hypothesis Zk is approximately normal with mean 0 and variance
X 2
Vk := W (tj )X−
kij (t j ) . (5.16)
tj
5.12.3 Examples
Single covariate
Suppose each individual has a single covariate xi (t) at time t, and the hazard for individual i at
time t is β0 (t) + β1 (t)xi (t). If we let R(t) be the set of individuals at risk at time t, and define
1 X
µk (t) = xi (t)k ,
#R(t)
i∈R(t)
we have !
T 1 µ1 (t)
X(t) X(t) = #R(t) ,
µ1 (t) µ2 (t)
If we assume that t is such that there are still multiple individuals whose event is after t, and
that the xi (t) are all distinct, then the Cauchy–Schwarz inequality tells us that the denominator
is always > 0.
Rt
3
To be precise, it is an unbiased estimator for 0 βk (s)1{X− (s) has full rank} ds, since our estimation breaks down
when the population is reduced to the point where X− no longer has full rank. Thus, there is a bias of the order
of the probability of this breakdown occurring. It will be small under most circumstances, where the study is
stopped long before most individuals have had their events.
CHAPTER 5. REGRESSION MODELS 94
We also observe that X(t)T dN(t) is a 2 × 1 vector which is 0 except at times tj , when it is
!
T dj
X(tj ) dN(tj ) = P ,
i∈Dj xi (tj )
where Dj is the set of individuals who have events at time Dj and dj = #Dj . The estimator
(5.14) then becomes
! !
B̂0 (t) X dj µ2 (tj ) − µ1 (tj )x̄j
= , (5.17)
B̂1 (t) #R(tj )(µ2 (tj ) − µ1 (tj )2 ) −µ1 (tj ) + x̄j
tj ≤t
Simulated data
We consider the single-covariate additive model from section 5.12.3. We consider a population of
n individuals, where the hazard rate for individual i is
xi
λi (t) = 1 + ,
1+t
with xi being i.i.d. covariates with N(1, 0.25) distribution. So the effect of the covariate decreases
with time. We assume independent right censoring at constant rate 0.5. We consider two cases:
n = 100 and n = 1000. First of all, we need to simulate the times. We use the result of problem
3 from Problem Sheet A.1. The cumulative hazard for individual i is
censrate=0.5
covmean=0
covsd=0.5
xtime=function(T,x){
u=uniroot(function(t) t+x*log(1+t)-T,c(0,max(T,2*(T-x))))
u$root
}
n=1000
CHAPTER 5. REGRESSION MODELS 95
# Censoring times
C=rexp(n,censrate)
xi=rnorm(n,covmean,covsd)
T=rep(0,n)
for (i in 1:n){
T[i]=xtime(rexp(1),xi[i])
}
t=pmin(T,C)
delta=(T<C)
afit=aareg(Surv(t,delta)~xi)
plot(afit,xlim=c(0,1),ylim=c(-.5,1.5))
s=(0:120)/100
lines(s,log(1+s),col=2)
The results are in Figure 5.7. Note that the estimates for n = 100 are barely useful even just
for distinguishing the effect of the covariates from 0; on the other hand, bear in mind that these
are pointwise confidence intervals, so interpretations in terms of the entire time-course of B are
more complicated (and beyond the scope of this course). The estimates with n = 1000 are much
more useful.
Applying print() to an aareg object gives useful summary information about the model fit.
Applying it to the n = 100 simulation we get
> print(afit,maxtime=1)
Call:
aareg(formula = Surv(T, delta) ~ xi)
n= 100
70 out of 80 unique event times used
The slope is a crude estimate of the rate of increase of B· (t) with t (based on fitting a weighted
least-squares line to the estimates). We use the option maxtime=1 since about 80% of the events
are in [0, 1], so that the estimates become extremely erratic after t = 1. If we leave out this
option, the slope will not make much sense (though we could extend the range significantly
further when n = 1000). In this case, we would get a slope estimate of 2.
Note that the p-value for the covariate coefficient (row “xi”) is based on the SE for the
cumulative weighted test statistic for that particular parameter, and has nothing to do with
the slope estimate. The chi-squared statistic is a joint is based on a weighted cumulative test
CHAPTER 5. REGRESSION MODELS 96
statistic for all effects to be 0, and it has chi-squared distribution with p degrees of freedom. In
the case p = 1 it is just the square of the single-variable test statistic.
1.5
1.5
1.0
1.0
0.5
0.5
xi
xi
0.0
0.0
-0.5
-0.5
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Time Time
Figure 5.7: Estimated cumulative hazard increment per unit of covariate (B̂1 (t)) for two different
sample sizes, together with pointwise 95% confidence intervals. The true value is B1 (t) = log(1+t),
which is plotted in red.
Chapter 6
A common question that we may have is, whether two (or more) samples of survival times may
be considered to have been drawn from the same distribution: That is, whether the populations
under observation are subject to the same hazard rate.
97
Testing hypotheses 98
Thus, when the number of groups k = 2, we have dj = d1j + d2j and nj = n1j + n2j .
Generally we are interested in testing the null hypothesis H0 , that there is no difference
between the hazard rates of the two groups, against the two-sided alternative that there is
a difference in the hazard rates. The guiding principle is quite elementary, quite similar to
our approach to the proportional hazards model: We treat each event time ti as a new and
independent experiment. Under the null hypothesis, the next event is simply a random sample
from the risk set. Thus, the probability of the death at time tj being from group 1 is n1j /nj ,
and the probability of it being from group 2 is n2j /nj .
This describes only the setting where the events all occur at distinct times: That is, dj are all
exactly 1. More generally, the null hypothesis predicts that the group identities of the individuals
whose events are at time tj are like a sample of size dj without replacement from a collection of
n1j ‘1’s and n2j ‘2’s. The distribution of d1j under such sampling is called the hypergeometric
distribution. It has
n1j
expectation = dj , and
nj
n1j n2j (nj − dj )dj
variance =: σj2 = .
n2j (nj − 1)
n n
Note that if dj is negligible with respect to nj , this variance formula reduces to dj ( n1jj )( n2jj ),
which is just the variance of a binomial distribution.
Conditioned on all the events up to time tj (hence on nj , n1j , n2j ) and on dj , the random
d
variable d1j − n1j njj has expectation 0 and variance σj2 . If we multiply it by an arbitrary weight
d
W (tj ), determined by the data up to time tj , we still have W (tj )(d1j − n1j njj ) being a random
variable with (conditional) expectation 0, but now (conditional) variance W (tj )2 σj2 . This means
that if we define for k = 1, . . . , m
m
k
X dj
Mk := W (tj ) dj1 − nj1 ,
nj
j=1
k=1
these will be random variables with expectation 0 and variance ki=1 W (tj )2 σj2 . While the
P
increments are not independent, we may still apply a version of the Central Limit Theorem
to show that Mk is approximately normal when the sample size is large enough. (In technical
terms, the sequence of random variables Mk is a martingale, and the appropriate theorem is the
Martingale Central Limit Theorem. See [13] for more details.) We then base our tests on the
statistic Pm dj
j=1 W (tj ) dj1 − nj1 nj
Z := r
2 nj1 nj2 (nj −dj )dj
Pm
j=1 W (tj ) n2 (n −1)
j j
which should have a standard normal distribution under the null hypothesis.
Note that, as in the Cox regression setting, right censoring and left truncation are automatically
taken care of, by appropriate choice of the risk sets.
Testing hypotheses 99
1. W (tj ) = 1, ∀i. This is the log rank test, and is the test in most common use. The log
rank test is aimed at detecting a consistent difference between hazards in the two groups
and is best placed to consider this alternative when the proportional hazard assumption
applies. It is maximally asymptotically efficient in the proportional hazards context; in
fact, it is equivalent to the score test for the Cox regression parameter being 0, hence is
asymptotically equivalent to the likelihood ratio test. A criticism is that it can give too
much weight to the later event times when numbers in the risk sets may be relatively small.
2. R. Peto and J. Peto [24] proposed a test that emphasises deviations that occur early on,
when there are more individuals under observation. Petos’ test uses a weight dependent
on a modified estimated survival function, estimated for the whole study. The modified
estimator is
Y nj + 1 − dj
S(t)
e =
nj + 1
tj ≤t
e i−1 ) nj
W (tj ) = S(t
nj + 1
This has the advantage of giving more weight to the early events and less to the later ones
where the population remaining is smaller. Much like the cumulative-deviations test, this
avoids giving extra weight to excess deaths that come later.
3. W (tj ) = nj has also been suggested (Gehan, Breslow). This again downgrades the effect of
the later times.
4. D. Harrington and T. Fleming [14] proposed a class of tests that include Petos’ test and
the logrank test as special cases. The Fleming–Harrington tests use
p q
W (tj ) = S(t
b j−1 ) 1 − S(t
b j−1 )
where Sb is the Kaplan-Meier survival function, estimated for all the data. Then p = q = 0
gives the logrank test and p = 1, q = 0 gives a test very close to Peto’s test. If we were to
set p = 0, q > 0 this would emphasise the later event times if needed for some reason. This
might be the case if we were testing the effect of a medication that we expected to need
some time after initiation of treatment to become fully effective, or if the patients were
where Oj and Ej are observed and expected numbers of events. Consequently, positive and
negative fluctuations can cancel each other out. This could conceal a substantial difference
between hazard rates which is not of the proportional hazards form, but where the hazard rates
(for instance) cross over, with group 1 having (say) the higher hazard early, and the lower hazard
later. One way to detect such an effect is with a test statistic to which fluctuations contribute
only their absolute values. For instance, we could use the standard χ2 statistic
m X
k
X (Oij − Eij )2
X := .
Eij
i=1 j=1
Asymptotically, this should have the χ2 distribution with (k − 1)m degrees of freedom. Of course,
if the number of groups k = 2, this is the same as
m
X (O1j − E1j )2
X := n n .
d 1j (1 − n1jj )
i=1 j nj
6.3 Examples
6.3.1 The AML example
We can use these tests to compare the survival of the two groups in the AML experiment discussed
in section 4.4.5. The relevant quantities are tabulated in Table 6.1.
When the weights are all taken equal, we compute Z = −1.84, whereas the Peto weights —
which reduce the influence of later observations — give us Z = −1.67. This yields one-sided
p-values of 0.033 and 0.048 respectively — a marginally significant difference — or two-sided
p-values of 0.065 and 0.096.
Applying the χ2 test yields X = 16.86, which needs to be compared to χ2 with 14 degrees of
freedom. The resulting p-value is 0.24, which is not at all significant. This should not be seen
as surprising: The differences between the two survival curves are clearly mostly in the same
direction, so we lose power when applying a test that ignores the direction of the difference.
Testing hypotheses 101
0 5 10 15 20 25
Time to infection (months)
Figure 6.1: Plot of Kaplan–Meier survival curves for time to infection of dialysis patients, based on
data described in section 1.4 of [18]. The black curve represents 43 patients with surgically-placed
catheter; the red curve 76 patients with percutaneously placed catheter.
We show the calculations for the nonparametric test of equality of distributions in Table 6.2.
The log-rank test — obtained by simply dividing the sum of all the deviations by the square root
of the sum of terms in the σj2 column — is only 1.59, so not significant. With the Peto weights the
statistic is only 1.12. This is not surprising, because the survival curves are close together (and
actually cross) early on. On the other hand, they diverge later, suggesting that weighting the
later times more heavily would yield a significant result. It would not be responsible statistical
practice to choose a different test after seeing the data. On the other hand, if we had started
with the belief that the benefits of the percutaneous method are cumulative, so that it would
make sense to expect the improved survival to appear later on, we might have planned from the
Testing hypotheses 102
tj nj1 nj2 dj1 dj2 σj2 Peto wt. H–F (0, 1) wt.
0.5 43 76 0 6 1.326 0.992 0.000
1.5 43 60 1 0 0.243 0.941 0.050
2.5 42 56 0 2 0.485 0.931 0.059
3.5 40 49 1 1 0.489 0.912 0.078
4.5 36 43 2 0 0.490 0.890 0.099
5.5 33 40 1 0 0.248 0.867 0.121
6.5 31 35 0 1 0.249 0.854 0.133
8.5 25 30 2 0 0.487 0.839 0.146
9.5 22 27 1 0 0.247 0.807 0.176
10.5 20 25 1 0 0.247 0.790 0.193
11.5 18 22 1 0 0.247 0.770 0.210
15.5 11 14 1 1 0.472 0.741 0.230
16.5 10 13 1 0 0.246 0.681 0.289
18.5 9 11 1 0 0.247 0.649 0.319
23.5 4 3 1 0 0.245 0.568 0.351
26.5 2 3 1 0 0.240 0.473 0.432
In section 7.1 we looked at graphical plots that can be used to test the appropriateness of
the proportional hazards model. These graphical methods work when we have large groups of
individuals whose survival curves may be estimated separately, and then compared with respect
to the proportional-hazards property.
In this lecture we look more generally at the problem of testing model assumptions — mainly,
but not exclusively, the assumptions of the Cox regression model — and describe a suite of tools
that may be used for models with quantitative as well as categorical covariates.
b i (t) − log H
log H b j (t) ≈ log r(β, i) − log r(β, j)
104
Model diagnostics 105
(a∧b = min{a, b}.) The idea is that if the covariate z has no effect, the difference Ng (tj )−TOTg (tj )
has expectation zero for each tj , so a plot of Ng against TOT would lie close to a straight line
with 45◦ slope. If levels of z have proportional hazards effects, we expect to see lines of different
slopes. If the effects are not proportional, we expect to see curves that are not lines.
We give an example in Figure 7.1. We have simulated a population of 100 males and 100
females, whose hazard rates are αi (t) = exi +βmale Imale t, where xi ∼ N (0, 0.25). In Figure 7.1(a)
we show the Arjas plot in the case βmale = 0; in Figure 7.1(b) we show the plot for the case
βmale = 1.
40
50
male
female
40
30
male
Cumulative hazard
Cumulative hazard
female
30
20
20
10
10
0
0 10 20 30 40 50 60 0 10 20 30 40 50 60
Number of events Number of events
25 t=pmin (C, T)
26
27 d e l t a =1∗ (T<C)
28 s e x=f a c t o r ( c ( r e p ( 'M' , 1 0 0 ) , r e p ( 'F ' , 1 0 0 ) ) )
29
30 c f i t =coxph ( Surv ( t , d e l t a )∼x )
31
32 #Make a p l o t t o s e e how c l o s e t h e e s t i m a t e d b a s e l i n e s u r v i v a l i s t o t h e c o r r e c t
one
33 p l o t ( s u r v f i t ( c f i t , data . frame ( x=0) ) )
34 t t = ( 0 : 3 0 0 ) / 100
35 l i n e s ( t t , exp(− t t ^2/ 2 ) , c o l =2)
36
37 b e t a=a s . numeric ( c f i t $ c o e f )
38 r e l r i s k =exp ( b e t a ∗x ) ∗ exp ( m a l e e f f ∗ c ( r e p ( 1 , n ) , r e p ( 0 , n ) ) )
39
40
41 e v e n t o r d=o r d e r ( t )
42
43 e t=t [ e v e n t o r d ]
44 e s=s e x [ e v e n t o r d ]
45 e c=x [ e v e n t o r d ]
46
47 cumhaz=−l o g ( s u r v f i t ( c f i t , data . frame ( x=e c ) ) $ s u r v )
48 h a z t r u n c=s a p p l y ( 1 : ( 2 ∗n ) , f u n c t i o n ( i ) pmin ( cumhaz [ i , i ] , cumhaz [ , i ] ) )
49 h a z t r u n c m a l e=h a z t r u n c [ , e s=='M' ]
50 h a z t r u n c f e m=h a z t r u n c [ , e s=='F ' ]
51
52 # Maximum c u m u l a t i v e hazard comes f o r i n d i v i d u a l i when we ' ve g o t t e n
53 # t o t h e row c o r r e s p o n d i n g t o e v e n t o r d [ i ]
Model diagnostics 107
54
55 TOTmale=a pply ( haztruncmale , 1 , sum )
56 TOTfem=a p ply ( haztruncfem , 1 , sum )
57
58 Nmale=cumsum ( cumhazmale $n . e v e n t ∗ ( e s=='M' ) )
59 Nfem=cumsum ( cumhazfem $n . e v e n t ∗ ( e s=='F ' ) )
60
61 p l o t ( Nmale , TOTmale , x l a b= ' Number o f e v e n t s ' , y l a b= ' Cumulative hazard ' , type= ' l ' , c o l
=2 , lwd =2,main=p a s t e ( ' Male e f f e c t= ' , m a l e e f f ) )
62 abline (0 ,1)
63 l i n e s ( Nfem , TOTfem , c o l =3, lwd=2)
64 l e g e n d ( . 1 ∗n , . 4 ∗n , c ( ' male ' , ' f e m a l e ' ) , c o l =2:3 , lwd=2)
1.0
0.8
0.8
log cumulative autologous hazard
log cumulative hazard
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
Figure 7.2: Graphical tests for proportional hazards assumption in alloauto data.
will be a reasonable proxy — for all sorts of inference purposes — for the process that originally
generated the data. That means that other properties of the data that were not used in choosing
the representative member of the family should also be close to the corresponding properties of
the model fit.
(An alternative approach, that we won’t discuss here, is model averaging, where we accept up
front that no model is really correct, and so give up the search for the “one best”. Instead, we
draw our statistical inferences from all the models in the family simultaneously, appropriately
weighted for how well they suit the data.)
The idea, then, is to look at some deviations of the data from the best-fit model — the
residuals — which may be represented in terms of test statistics or graphical plots whose
properties are known under the null hypothesis that the data came from the family of distributions
in question, and then evaluate the performance. Often, this is done not in the sense of formal
hypothesis testing — after all, we don’t expect the data to have come exactly from the model, so
the difference between rejecting and not rejecting the null hypothesis is really just a matter of
sample size — but of evaluating whether the deviations from the model seem sufficiently crass to
invalidate the analysis. In addition, residuals may show not merely that the model does not fit
the data adequately, but also show what the systematic difference is, pointing the way to an
improved model. This is the main application of martingale residuals, which we will discuss in
section 7.6. Alternatively, it may show that the failure is confined to a few individuals. Together
with other information, this may lead us to analyse these outliers as a separate group, or to
discover inconsistencies in the data collection that would make it appropriate to analyse the
remaining data without these few outliers. The main tool for detecting outliers are deviance
residuals, which we discuss in section 7.7.1.
7.2.2 A simulated example
Suppose we have data simulated from a very simple survival model, where individual i has
constant hazard 1 + xi , where xi is an observed positive covariate, with independent right
censoring at constant rate 0.2. Now suppose we choose to fit the data to a proportional hazards
model. What would go wrong? Not surprisingly, for such a simple model, the main conclusion —
that the covariate has a positive effect on the hazard — would still be qualitatively accurate.
But what about the estimate of baseline hazard?
We simulated this process with 1000 individuals, where the covariates were the absolute
values of independent normal random variables. We must first recognise that it is not entirely
clear what it even means to evaluate the accuracy of fit of such a misspecified model. If we plug
the simulated data into the Cox model, we necessarily get an exponential parameter out, telling
us that the hazard rate corresponding to covariate x is eβ̂x . Since the hazard is actually 1 + βx,
it is not clear what it would mean to say that the parameter was well estimated. Certainly,
positive β should be translated into positive β̂. Similarly, the baseline hazard of the Cox model
agrees with the baseline hazard of the additive-hazards model, in that both are supposed to
be the hazard rates for an individual with covariates 0, but their roles in the two models are
sufficiently different that any comparison is on uncertain ground.
Still, the fitted Cox model makes a prediction about the hazard rate of an individual whose
covariates are all 0, and that prediction is wrong. When we fit the data from 1000 individuals to
the Cox model, we get this output:
coef exp(coef) se(coef) z p
x 0.581 1.79 0.0559 10.4 0
In Figure 7.3(a) we see the baseline survival curve estimated from these data by the Cox
model. The confidence region is quite narrow, but we see that the true survival curve — the red
curve — is nowhere near it. In Figure 7.3(b) we have a smoothed version of the hazard estimate,
and we see that forcing the data into a misspecified Cox model has turned the constant baseline
hazard into an increasing hazard.
7
6
0.8
5
0.6
Hazard
4
0.4
3
2
0.2
1
0.0
0 1 2 3 4 0 1 2 3 4
Time
Figure 7.3: Baseline survival and hazard estimated from the Cox proportional hazards model for
data simulated from the additive hazards model. Red is the true baseline hazard.
The most basic version is called the Cox–Snell residual. It is based on the observation that
if T is a sample from a distribution with cumulative hazard function H, then H(T ) has an
exponential distribution with parameter 1.
Given a parametric model H(T, β) we would then generate and evaluate Cox–Snell residuals
as follows:
3. If the model is well specified — a good fit to the data — then (ri , δi ) should be like a
right-censored sample from a distribution with constant hazard 1.
4. A standard way of evaluating the residuals is to compute and plot a Nelson–Aalen estimator
for the cumulative hazard rate of the residuals. The null hypothesis — that the data came
from the parametric model under consideration — would predict that this plot should lie
close to the line y = x.
After this, we proceed as above. Of course, there is nothing special about the Cox log-linear
form of the relative risk. Given any relative risk function r(β̂, x), we may define residuals
ri := r(β̂, xi )H
b 0 (Ti ).
So we see that patient age and donor age both seem to have strong effects on disease-free
survival time (with increasing age acting negatively — that is, increasing the length of disease-free
survival — as we might expect if we consider that many forms of cancer progress more rapidly
in younger patients). Somewhat surprisingly, the effect of a year of donor age is almost as strong
as the effect of a year of patient age. There is also a strong positive interaction term, suggesting
that the prognosis for an old patient receiving a transplant from an old donor is not as favourable
as we would expect from simply adding their effects. Thus, for example, the oldest patient was 80
years old, while the oldest donor was 84. The youngest patient was 35, the youngest donor just
30. The model suggests that the 80-year-old patient should have a hazard rate for relapse that is
a factor of e−0.1143×45 = 0.006 that of the youngest. Indeed, the youngest patient relapsed after
just 43 days, while the oldest died after 363 days without recurrence of the disease. The patient
with the oldest donor would be predicted to have his or her hazard rate of recurrence reduced by
a factor of e−0.0860×54 = 0.0096. Indeed, the youngest patient did have the youngest donor, and
the oldest patient nearly had the oldest. Had that been the case, the ratio of hazard rate for the
oldest to that of the youngest would have been 0.0096 × 0.006 = 0.00006. Taking account of the
interaction term we see that the actual log proportionality term predicted by the model is
exp −0.1143 × 45 − 0.0860 × 54 + 0.036177(45 × 54) = 0.45.
make a graphical test of the proportional hazards assumption between two groups of subjects. It
is less obvious how to test the proportional hazards assumption associated with a relative-risk
regression model.
For this purpose the standard tool is the Schoenfeld residuals. The formal derivation is left
as an exercise, but the definition of the j-th Schoenfeld residual for parameter βk is
X
Skj (tj ) := Xij k − X̄k (tj ),
is the weighted mean of covariate Xk at time t. Thus the Schoenfeld residual measures the
difference between the covariate at time t and the average covariate at time t. If βk is constant
this has expected value 0. If the effect of Xk is increasing we expect the estimated parameter
βk to be an overestimate early on — so the individuals with events then have lower Xk than
we would have expected, producing negative Schoenfeld residuals; at later times the residuals
would tend to be positive. Thus, increasing effect is associated with increasing Schoenfeld
residuals. Likewise decreasing effect is associated with decreasing Schoenfeld residuals. As with
the martingale residuals, we typically make a smoothed plot of the Schoenfeld residuals, to get a
general picture of the time trend.
We can also make a formal test of the hypothesis βk is constant by fitting a linear regression
line to the Schoenfeld residuals as a function of time, and testing the null hypothesis of zero
slope, against the alternative of nonzero slope. Of course, such a test will have little or no power
to detect nonlinear deviations from the hypothesis of constant effect — for instance, threshold
effects, or changing direction.
We turn this into a residual — something we can compute from the data – by replacing the
integral with respect to the hazard by the differences in the estimated cumulative hazard
fi (t) : = δi − H
M b i (Ti )
X
= δi − eβ̂·xi (tj ) ĥ0 (tj )
tj ≤Ti (7.1)
X 1
= δi − eβ̂·xi (tj ) P β·x` (tj )
tj ≤Ti `∈Rj e
Model diagnostics 113
It differs from the (negative of the) Cox–Snell residual only by the addition of δi . When the
covariates are constant in time,
fi = δi − eβ T xi H
M b 0 (Ti ).
An individual martingale residual is a very crude measure of deviation, since for any given t
the only observation here that is relevant to the survival model is δi , which is a binary observation.
If there are no ties, or if we use the Breslow method for resolving ties, sum of all the martingale
residuals is 0.
7.6.2 Application of martingale residuals for estimating covariate transforms
Martingale residuals are not very useful in the way that linear-regression residuals are, because
there is no natural distribution to compare them to. The main application is to estimate
appropriate modifications to the proportional hazards model by way of covariate transformation:
√
Instead of a relative risk of eβx , it might be ef (x) , where f (x) could be 1{x<x0 } or x, or
something else.
We assume that in the population xi and zi are independent. This won’t be exactly true in
reality, but obviously a strong correlation between two variables complicates efforts to disentangle
their effects through a regression model. We derive the formula under the assumption of
independence, understanding that the results will be less reliable the more intertwined the
variable zi is with the others.
Suppose the data (Ti , xi (·), zi , δi ) are sampled from a relative-risk model with two covariates:
a vector xi and an additional one-dimensional covariate zi , with
that is, the Cox regression model holds except with regard to the last covariate, which acts as
h(z) := ef (z) . Let β̂ be the p-dimensional vector corresponding to the Cox model fit without the
covariate z, and let M fi be the corresponding martingale residuals.
Another complication is that we use β̂ instead β (which we don’t know). We will derive the
relationship under the assumption that they are equal; again, errors in estimating β will make
the conclusions less correct. For large n we may assume that β and β̂ are close.
Let
E[1at risk time s ef (z) | x]
= P h(z) at risk time s; x ;
h̄(s, x) :=
P{at risk time s | x}
h̄(s) := E h̄(s, x) .
Fact P
δi
(7.2)
E Mf z ≈ f (z) − log h̄(∞) .
n
account of z, then individuals whose z value has f (z) large positive will seem to have a large
number of excess events; and those whose f (z) is large negative will seem to have fewer events
than expected.
A reasonably formal proof may be found in [25], and also reproduced in chapter 4 of [10].
This choice of scaling is inspired by the intuition that each di should represent the contribution
of that individual to the total model deviance.1 Whereas the martingale residuals are between
−∞ and 1, the deviance residuals should have a similar range to a standard normal random
variable. Thus, we treat values outside a range of about −2.5 to +2.5 as outliers, potentially
7.7.2 Delta–beta residuals
Recall that the leverage of an individual observation is a measure of the impact of that observation
on the parameters of interest. The “delta–beta” residual for parameter βk and subject i is defined
as
∆βki := β̂k − β̂k(i) ,
where β̂k(i) is the estimate of β̂k with individual i removed. In principle we may compute this
exactly, by recalculating the model n times for n subjects, but this is computationally expensive.
We can approximate it by a combination of the Fisher information and the Schoenfeld residual
(actually, a variant of the Schoenfeld residual called the score residual). We do not give the
formula here, but it may be found in [5, Section 4.3]. These estimates may be called up easily in
R, as described in section 7.8.
Individuals with high values of ∆β should be looked at more closely. They may reveal
evidence of data-entry errors, interactions between different parameters, or just the influence of
extreme values of the covariate. You should particularly be worried if there are high-influence
individuals pushing the parameters of interest in one direction.
1
Recall that deviance is defined as
D = 2 log likelihood(saturated) − log likelihood(β̂) .
Applied to the Cox model, the saturated model isPthe one where each individual has an individual parameter βi∗ .
It is possible to derive then that the deviance is n
i=1 di .
2
Model diagnostics 115
We begin by fitting the Cox model, including all the potentially significant covariates:
1 n k i . s u r v=with ( nki , Surv ( t y e a r s , d ) )
2 n k i . cox=with ( nki , coxph ( n k i . s u r v∼p o s n o d e s+chemotherapy+hormonaltherapy
3 +h i s t o l g r a d e+age+m l r a t i o+d i a m e t e r+p o s n o d e s+v a s c . i n v a s i o n+t y p e s u r g e r y ) )
4 > summary ( n k i . cox )
5 Call :
6 coxph ( f o r m u l a = n k i . s u r v ∼ p o s n o d e s + chemotherapy + hormonaltherapy
7 + h i s t o l g r a d e + age + m l r a t i o + d i a m e t e r + p o s n o d e s + v a s c . i n v a s i o n +
typesurgery )
8
9 n= 2 9 5 , number o f e v e n t s= 79
10
11 c o e f exp ( c o e f ) s e ( c o e f ) z Pr ( >| z | )
12 posnodes 0.07443 1.07727 0.05284 1.409 0.158971
13 chemotherapyYes −0.42295 0 . 6 5 5 1 1 0 . 2 9 7 6 9 −1.421 0 . 1 5 5 3 8 1
14 hormonaltherapyYes −0.17160 0 . 8 4 2 3 2 0 . 4 4 2 3 3 −0.388 0 . 6 9 8 0 6 2
Model diagnostics 116
We could remove the insignificant covariates stepwise, or use AIC, or some other model-
selection method, but let us suppose we have reduced it to the model including just histological
grade, vascular invasion, age, and mlratio (the crucial measure of oestrogen-receptor gene
expression. we then get
1 summary ( n k i . cox )
2 Call :
3 coxph ( f o r m u l a = n k i . s u r v ∼ h i s t o l g r a d e + age + m l r a t i o + v a s c . i n v a s i o n )
4
5 n= 2 9 5 , number o f e v e n t s= 79
6
7 c o e f exp ( c o e f ) s e ( c o e f ) z Pr ( >| z | )
8 histolgradePoorly d i f f 0.42559 1.53050 0.26914 1.581 0.113810
9 histolgradeWell d i f f −1.31213 0.26925 0.54782 −2.395 0.016611 ∗
10 age −0.04302 0.95790 0.01961 −2.194 0.028271 ∗
11 mlratio −0.72960 0.48210 0.20754 −3.515 0.000439 ∗∗∗
12 v a s c . i n v a s i o n+ 0.64816 1.91203 0.24205 2.678 0.007410 ∗∗
13 v a s c . i n v a s i o n+/− −0.04951 0.95169 0.48593 −0.102 0.918840
1 n k i . km3=s u r v f i t ( n k i . s u r v∼f a c t o r ( n k i $ v a s c . i n v a s i o n ) )
2 p l o t ( n k i . km3 , mark . time=FALSE, x l a b= ' Time ( Years ) ' , y l a b= ' l o g (cum hazard ) ' ,
3 main= ' V a s c u l a r i n v a s i o n ' , c o n f . i n t=FALSE, c o l=c ( 2 , 1 , 3 ) , fun=myfun , f i r s t x =1)
4 l e g e n d (10 , −3 , c ( '− ' , '+ ' , '+/− ' ) , c o l=c ( 2 , 1 , 3 ) , t i t l e = ' V a s c u l a r i n v a s i o n ' , lwd=2)
Model diagnostics 117
mlratio
−1
−1.2
−2
log(cum hazard)
log(cum hazard)
−0.1
0.35
−2
−3
−4
−3
mlratio tertiles
−5
1
2
−4
3
−6
1 2 5 10 20 5 10 15
Time (Years) Time (Years)
(a) Model fit for different mlratio levels (b) Nelson–Aalen estimator for tertiles of mlratio
We then test for the goodness of fit by computing a Nelson–Aalen estimator for the residuals,
and plotting the line y = x for comparison.
1 n k i . C S t e s t=s u r v f i t ( Surv ( n k i . CS , n k i $d )∼1 )
2 p l o t ( n k i . CStest , fun= ' cumhaz ' , mark . time=FALSE, xmax=.8 ,
3 main= ' Cox−−S n e l l r e s i d u a l p l o t f o r NKI data ' )
4 a b l i n e ( 0 , 1 , c o l =2)
Model diagnostics 118
Vascular invasion
−1
−2
log(cum hazard)
−3
Vascular invasion
−
+
−4
+/−
−5
5 10 15
Time (Years)
Figure 7.5: Log cumulative hazards of NKI data for subpopulations stratified by vascular invasion
status.
0.6
0.5
cumulative hazard +
0.4
0.3
0.2
0.1
0.0
cumulative hazard -
Figure 7.6: Andersen plot of NKI data for subpopulations stratified by vascular invasion status.
The column rho gives the correlation of the scaled Schoenfeld residual with time. (That is, each
Schoenfeld residual is scaled by an estimator of its standard deviation). The chisq and p-value
are for a test of the null hypothesis that the correlation is zero, meaning that the proportional
hazards condition holds (with constant β). GLOBAL gives the result of a chi-squared test for the
hypothesis that all of the coefficients are constant. Thus, we may accept the hypothesis that all
of the proportionality parameters are constant in time.
Plotting the output of cox.zph gives a smoothed picture of the scaled Schoenfeld residuals
as a function of time. For example, plot(z[4]) gives the output in Figure 7.10, showing an
estimate of the parameter for mlratio as a function of time. We see that the plot is not perfectly
constant, but taking into account the uncertainty indicated by the confidence intervals, it is
plausible that the parameter is constant. In particular, the covariate histolgradeWell diff
produces the smallest p-value, but from the plot it is clear that the apparent positive slope is
purely an artefact of there being two points with very high leverage right at the end, where the
variance is particularly high. (The estimation of ρ is quite crude, being simply an ordinary least
squares fit, so does not take account of the increased uncertainty at the end.)
Model diagnostics 120
20
2
15
Beta(t) for histolgradeWell diff
0
5
-2
0
-4
-5
1.6 2.6 3.3 4.8 5.8 8.1 11 13 1.6 2.6 3.3 4.8 5.8 8.1 11 13
Time Time
Figure 7.7: Schoenfeld residuals for some covariates in the Cox model fit for the NKI data.
Figure 7.8: Nelson–Aalen estimator for the Cox–Snell residuals for the NKI data, together with
the line y = x.
Model diagnostics 121
1.0
0.5
Martingale residual
0.0
−0.5
−1.5
Figure 7.9: Martingale residuals for the model without mlratio, plotted against mlratio, with
a LOWESS smoother in red.
Model diagnostics 122
2
Beta(t) for mlratio
0
-2
-4
Time
Figure 7.10: Scaled Schoenfeld residuals for the mlratio parameter as a function of time.
Chapter 8
123
Chapter 9
9.1 Introduction
The survival models that we have considered depend upon the fundamental assumption that
event times are independent. There are many settings where this assumption is unreasonable:
• Clustered data: There may be multiple observations for a single individual, or for a
group with correlated times within the group. For example, the diabetes data set (in the
SurvCorr package in R) includes data for 197 patients being treated for diabetic retinopathy
— loss of vision due to diabetes. Each patient has two eyes, and all eyes are at risk until the
event (measured serious vision loss). It would be unreasonable to assume that the loss of
vision is independent between the two eyes.
Below is an excerpt from the data set. One peculiarity is that the comparison was made
within individuals, with each individual having one treated and one untreated eye, with
the variable TRT_EYE recording which eye was treated. For more information about the
data set see section 8.4.2 of [26].
1 > head ( d i a b e t e s )
2 ID LASER TRT_EYE AGE_DX ADULT TIME1 STATUS1 TIME2 STATUS2
3 1 5 2 2 28 2 46.23 0 46.23 0
4 2 14 2 1 12 1 42.50 0 31.30 1
5 3 16 1 1 9 1 42.27 0 42.27 0
6 4 25 2 2 9 1 20.60 0 20.60 0
7 5 29 1 2 13 1 38.77 0 0.30 1
8 6 46 1 1 12 1 65.23 0 54.27 1
• Multiple events: A study may consider times of events that do not remove the subject
from risk of further events. For example, a study of hospitalisation events for an elderly
population will likely see some individuals being hospitalised multiple times.
124
Correlated events 125
cases, there are distinct outcomes, each with its own hazard rate, that each serves as
“censoring” for observation of the others.
at time t. Then we have a partial likelihood for the observation that individual ij had the unique
event at time tj
Y r(β, xij (tj ))
LP (β) = P .
t i∈Rj : ci =ci r(β, xi (tj ))
j j
The only change, compared with the standard partial likelihood given in (5.5)
As an example, we consider a data set based on the NHANES (National Health and Nutrition
Survey) wave 3, from which we have measures of systolic and diastolic blood pressure and about
15 years of survival follow-up. It is certainly not the case that men and women, or different ethnic
groups, have the same baseline mortality rate. We could analyse the effect of blood pressure
on mortality in the groups separately, but that will produce six different parameter estimates,
each of which will be subject to more random noise. It is at least plausible that we would like to
produce a joint estimate of the influence of blood pressure on mortality, that we suppose acts in
approximately the same way against the background of distinct baseline mortality.
1 Call :
2 coxph ( f o r m u l a = with ( nhanesC , Surv ( age , age + y r s f u , e v e n t h r t ) ) ∼
3 meandias + meansys + s t r a t a ( r a c e ) + s t r a t a ( f e m a l e ) , data = nhanesC )
4
5 c o e f exp ( c o e f ) s e ( c o e f ) z p
6 meandias −0.006495 0 . 9 9 3 5 2 7 0 . 0 0 2 7 6 3 −2.351 0 . 0 1 8 7
7 meansys 0.011528 1.011595 0 . 0 0 1 3 2 0 8 . 7 3 3 <2e −16
8
9 L i k e l i h o o d r a t i o t e s t =75.9 on 2 df , p=< 2 . 2 e −16
Correlated events 126
10 n= 1 5 2 9 5 , number o f e v e n t s= 1459
If we plot the output of the survfit function applied to this result, we get the picture in Figure
9.1, showing six different baseline survival functions.
black male
black female
white male
white female
other male
0.6
other female
0.4
0.2
0.0
30 40 50 60 70 80 90
Figure 9.1: Baseline survival estimates for different population groups from the NHANES data.
We describe here how to calculate Vn for the case of Cox regression for independent observa-
tions an explicit formula. We begin with the case of independent individual observations. (This
is not directly relevant for the case of clustered observations, but it is relevant for making robust
variance estimates for other kinds of model misspecification.) Differentiating the log partial
likelihood yields a score function, which is the same as the Schoenfeld residuals:
k
X
U (β) = xij − x̄(tj ) .
j=1
xi eβ·xi (tj )
P
i∈Rj
x̄(tj , β) = P
i∈Rj eβ·xi (tj ,β)
and
1
dĤ0 (tj ) = P β·xi (tj )
i
i∈Rj e
For large n, if β = β∗ is the limit parameter, dĤ0 (t, β∗ ) will be approximately the increment
with respect to this cumulative hazard, and by the Law of Large Numbers
x̄(t, β∗ ) ≈ s(t) := E x̄(t, β∗ ) .
So if β = β̂, the i-th summand will be approximately (to first order in the error β̂ − β∗ )
X
Ui (β̂) = δi xi (Ti ) − x̄(Ti , β̂) − xi (tj ) − x̄(tj , β̂) eβ̂·xi (tj ) dĤ0 (tj , β̂) (9.2)
tj ≤Ti
Z Ti
xi (t) − s(t) eβ∗ ·xi (t) h0 (t, β∗ ) dt.
≈ δi xi (Ti ) − s(Ti ) −
0
The first expression is what we use to calculate Ui , but the second makes clear (since it depends
only on the observation Ti and δi ) that these are (approximately) i.i.d. Since they have expectation
0, it follows that
Vn (β̂) = Ui (β̂)Ui (β̂)T
is a consistent estimator for the variance-covariance matrix of the score function. (The mathe-
matical details of the proof may be found in [20].)
The case of clustered observations C clusters of observations, with nc individuals in cluster c
— where the separate clusters may be interpreted as independent and identically distributed — are
Correlated events 128
considered in [19]. In this case, there is correlation within a cluster, but the terms corresponding
to individuals in different clusters are independent. if we have we may take
C
X
Vn (β̂) = uc (β̂)uc (β̂)T ,
c=1
where
nc
X n o Xnc X n o
uc (β̂) = δic xic (Tic ) − x̄(Tic , β̂) − xic (tj ) − x̄(tj , β̂) eβ̂·xic (tj ) dĤ0 (tj ).
i=1 i=1 tj ≤Tic
(Here we are using the subscript ic to denote the i-th individual in cluster c.) It is not necessary
to memorise this formula (it is not examinable), but it is important to know that such a formula
exists, and needs to be applied whenever one is analysing clustered data. It will automatically
be applied in R if the coxph function is applied with an extra term in the formula of the form
+cluster(group). Commonly the clustering is by individual identifying number, so the term
becomes +cluster(id).
The most common parametric form is α = exp β · x , where β = (β0 , . . . , βp ), and we take
This fits into the framework of GLM (generalised linear model), and may be fit in R using
any of the standard GLM functions. Note that we are modelling Ni ∼ Po(µi ), where
This fitted model gives us a predicted expected number of events for each individual. The
difference between the observed number of events and the expected number predicted by the
model is the residual. In Figure 9.2 we plot the residuals against the fitted values. (This is the
automatic output of the command plot(poisreg), where poisreg is the output of the glm fit
above.
This would be obvious from a casual examination of the data. The mean number of events is
about 2, but some individuals have as many as 25, which is not something you would see in a
Poisson distribution. These data are over-dispersed, meaning that their variance is higher than it
would be for a Poisson distribution of the same mean.
We also note that the deviance residuals (which should be approximately standard normal
distributed if the model is correct) range from −4.5 to +8.36. The sum of their squares, called
the residual deviance, is 2133.2, which is much too large for a chi-squared variable on 767 degrees
of freedom.
Residuals vs Fitted
374
245 650
5
Residuals
0
−5
−3 −2 −1 0 1 2
Predicted values
glm(treat.count ~ pris + pgroup)
We can fit the model in R by using the coxph command. All we need to do is to represent
the data appropriately in a Surv object. To do this, the record for an individual gets duplicated,
with one row for each event time or censoring time. An event time will be the “stop” time in one
row, and will then be the “start” time in the next row. The covariates will repeat from row to
row for the same individual.
The model assumes that differences between individuals are completely described by the
relative-risk function determined by their covariates. If we are unsure — as we generally will be
— we can robustify the variance estimates as in section 9.3.2 by adding a +cluster(id) term.
Alternatively, we can add a hidden frailty term to the model, as described below in section 9.5.
The term λi , called a multiplicative frailty, represents the individual relative rate of producing
events. The λi are treated as random effects, meaning that they are not to be estimated
individually — which would not make sense — but rather, they are taken to be i.i.d. samples
from a simple parametric distribution. When the frailty λ has a gamma distribution (with
parameters (θ, θ), because we conventionally take the frailty distribution to have mean 1), and
N is a Poisson count conditioned on λ with mean λα, then N has probability mass function
θ n
Γ(n + θ) θ α
P N =n = ,
n!Γ(θ) θ+α θ+α
which is the negative binomial distribution with parameters θ and α/(θ + α). (The calculation is
left as an exercise.) Therefore this is called the negative binomial regression model. We can fit it
with the glm.nb command in R. If we apply it to the same data as before, we get the following
output:
1 > summary ( p o i s r e g 2 )
2
3 Call :
4 glm . nb ( f o r m u l a = t r e a t . count ∼ p r i s + pgroup + o f f s e t ( l o g ( r i s k t i m e ) ) ,
5 i n i t . theta = 0.8418678047 , l i n k = log )
6
7 Deviance R e s i d u a l s :
Correlated events 132
We note that, while the largest deviance residual of 3.68 suggests a possible outlier, the total
residual deviance is now quite plausible.
9.5.2 Frailty in proportional hazards models
We can use shared frailty to account for correlated times in proportional hazards regression,
whether these are unordered (clustered) times, or recurrent events. The model fitting functions
numerically exactly like any other random-effects model: We treat the individual unknown
frailties as unobserved data, whose expected values given the observed data may be calculated.
Given the individual frailties, we may maximise the parameters, and so loop through the EM
algorithm. The calculations are carried through automatically by the coxph function in R, as
long as we add a + frailty(id) (or whichever variable we are grouping by) term to the formula.
The output will include a p-value estimate for the individual
Is the frailty term actually appropriate to the data? We may test the null hypothesis that
there is no individual frailty with a likelihood ratio test. The null-hypothesis log-likelihood is
simply the log partial likelihood for a traditional model without a frailty term. The alternative log
likelihood is the log of the integrated partial likelihood — that is, integrated over the distribution
of the frailty — called the I-likelihood in the R output.
Note that the model fit automatically produces estimates of the individual frailties. If desired,
these may be used for individualised survival projections.
Note that in this case, because of the paired design, the robust SE is actually smaller than the
model-based (inverse-information) SE. If we had stratified instead of clustering we would obtain:
1 coxph ( f o r m u l a = Surv ( time , s t a t u s ) ∼ t r t + a d u l t + s t r a t a ( i d ) ,
2 data = D i a b e t e s )
3
4 c o e f exp ( c o e f ) s e ( c o e f ) z p
5 trtTRUE −0.2122 0.8088 0 . 1 8 1 3 −1.17 0 . 2 4 2
6 adultTRUE NA NA 0.0000 NA NA
7
8 L i k e l i h o o d r a t i o t e s t =1.38 on 1 df , p =0.2407
9 n= 3 9 4 , number o f e v e n t s= 155
Note that the standard error is increased substantially relative to the model-based estimate.
If we instead fit a gamma-frailty model to account for the correlation between two eyes we
get a very similar result to that obtained from the clustered model:
1 coxph ( f o r m u l a = Surv ( time , s t a t u s ) ∼ t r t + a d u l t + f r a i l t y ( i d ) ,
2 data = D i a b e t e s )
3
Correlated events 134
Including all the events in an Anderson–Gill model increases the number of events from
47 to 112. Naïvely we might expect the standard errors to be reduced by a factor of about
Correlated events 135
47/112 = .648, reducing the SE of the number coefficient from .076 to about .049; and the SE
p
of the rx coefficient from .316 to about .205. Carrying out the calculation yields
1 coxph ( f o r m u l a = Surv ( s t a r t , stop , e v e n t ) ∼ rx + number + s i z e +
2 c l u s t e r ( i d ) , data = b l a d d e r 2 )
3
4 c o e f exp ( c o e f ) s e ( c o e f ) r o b u s t s e z p
5 rx −0.46469 0.62833 0.19973 0 . 2 6 5 5 6 −1.750 0 . 0 8 0 1 5
6 number 0 . 1 7 4 9 6 1.19120 0.04707 0.06304 2.775 0.00551
7 size −0.04366 0.95728 0.06905 0 . 0 7 7 6 2 −0.563 0 . 5 7 3 7 6
8
9 L i k e l i h o o d r a t i o t e s t =17.52 on 3 df , p =0.0005531
10 n= 1 7 8 , number o f e v e n t s= 112
The se(coef) output is exactly what we predicted, but the robust se is substantially larger,
due to correlation among the observations. The actual reduction in SE is not the 35% that would
have been produced if the additional recurrences had been independent, but only about 17%. It
is as though we had only about 15 additional independent observations, rather than the 65 that
we might have naïvely supposed.
Applying a gamma frailty again yields a very similar result:
1 coxph ( f o r m u l a = Surv ( s t a r t , stop , e v e n t ) ∼ rx + number + s i z e +
2 f r a i l t y ( i d ) , data = b l a d d e r 2 )
3
4 coef se ( coef ) se2 Chisq DF p
5 rx −0.6077 0.3330 0.2197 3.3295 1.0 0.06805
6 number 0.2387 0.0932 0.0570 6.5567 1.0 0.01045
7 size −0.0215 0.1140 0.0717 0.0357 1.0 0.85009
8 f r a i l t y ( id ) 82.9157 45.4 0.00056
9
10 I t e r a t i o n s : 6 o u t e r , 29 Newton−Raphson
11 V a r i a n c e o f random e f f e c t= 1 . 0 8 I−l i k e l i h o o d = −436.8
12 D e g r e e s o f freedom f o r terms= 0 . 4 0.4 0.4 45.4
13 L i k e l i h o o d r a t i o t e s t =144 on 4 6 . 6 df , p=8e −12
14 n= 1 7 8 , number o f e v e n t s= 112
[1] Odd O. Aalen. “Further results on the non-parametric linear regression model in survival
analysis”. In: Statistics in Medicine 12 (1993), pp. 1569–88.
[2] Odd O. Aalen et al. Survival and Event History Analysis: A process point of view. Springer
Verlag, 2008.
[3] Per Kragh Andersen and Richard D Gill. “Cox’s regression model for counting processes: a
large sample study”. In: AoS (1982), pp. 1100–1120.
[4] George Leclerc Buffon. Essai d’arithmétique morale. 1777.
[5] David Collett. Modelling survival data in medical research. 3rd. Chapman & Hall/CRC,
2015.
[6] David R Cox. “Regression Models and Life-Tables”. In: Journal of the Royal Statistical
Society. Series B (Methodological) 34.2 (1972), pp. 87–22.
[7] CT4: Models Core Reading. Faculty & Institute of Acutaries, 2006.
[8] Stephen H. Embury et al. “Remission Maintenance Therapy in Acute Myelogenous Leukemia”.
In: The Western Journal of Medicine 126 (Apr. 1977), pp. 267–72.
[9] Gregory M. Erickson et al. “Tyrannosaur Life Tables: An example of nonavian dinosaur
population biology”. In: Science 313 (2006), pp. 213–7.
[10] Thomas R. Fleming and David P. Harrington. Counting Processes and Survival Analysis.
Wiley, 1991.
[11] A. J. Fox. English Life Tables No. 15. Office of National Statistics. London, 1997.
[12] Benjamin Gompertz. “On the Nature of the function expressive of the law of human mortality
and on a new mode of determining life contingencies”. In: Philosophical transactions of the
Royal Society of London 115 (1825), pp. 513–85.
[13] Peter Hall and Christopher C. Heyde. Martingale Limit Theory and its Application. New
York, London: Academic Press, 1980.
[14] David P. Harrington and Thomas R. Fleming. “A Class of Rank Test Procedures for
Censored Survival Data”. In: Biometrika 69.3 (Dec. 1982), pp. 553–66.
[15] Hans C. van Houwelingen and Theo Stijnen. “Cox Regression Model”. In: Handbook of
Survival Analysis. Ed. by John P. Klein et al. CRC Press, 2014. Chap. 1, pp. 5–26.
[16] Edward L Kaplan and Paul Meier. “Nonparametric estimation from incomplete observations”.
In: Journal of the American Statistical Association 53.282 (1958), pp. 457–481.
[17] Kathleen Kiernan. “The rise of cohabitation and childbearing outside marriage in western
Europe”. In: International Journal of Law, Policy and the Family 15 (2001), pp. 1–21.
136
Correlated events 137
[18] John P. Klein and Melvin L. Moeschberger. Survival Analysis: Techniques for Censored
and Truncated Data. 2nd. SV, 2003.
[19] Eric W Lee et al. “Cox-type regression analysis for large numbers of small groups of
correlated failure time observations”. In: Survival analysis: State of the art. Springer, 1992,
pp. 237–247.
[20] Danyu Y Lin and Lee-Jen Wei. “The robust inference for the Cox proportional hazards
model”. In: Journal of the American statistical Association 84.408 (1989), pp. 1074–1078.
[21] A. S. Macdonald. “An actuarial survey of statistics models for decrement and transition
data. I: Multiple state, Poisson and binomial models”. In: British Actuarial Journal 2.1
(1996), pp. 129–55.
[22] Kyriakos S. Markides and Karl Eschbach. “Aging, Migration, and Mortality: Current Status
of Research on the Hispanic Paradox”. In: Journals of Gerontology: Series B 60B (2005),
pp. 68–75.
[23] Rupert G. Miller et al. Survival Analysis. Wiley, 2001.
[24] Richard Peto and Julian Peto. “Asymptotically Efficient Rank Invariant Test Procedures”.
In: Journal of the Royal Statistical Society. Series A (General) 135.2 (1972), pp. 185–207.
[25] Terry M. Therneau et al. “Martingale-based residuals for survival models”. In: Biometrika
77.1 (1990), pp. 147–60.
[26] Terry M Therneau and Patricia M Grambsch. Modeling survival data: Extending the Cox
model. Springer Science & Business Media, 2013.
[27] Anastasios Tsiatis. “A nonidentifiability aspect of the problem of competing risks”. In:
Proceedings of the National Academy of Sciences 72.1 (1975), pp. 20–22.
[28] Marc J Van De Vijver et al. “A gene-expression signature as a predictor of survival in
breast cancer”. In: New England Journal of Medicine 347.25 (2002), pp. 1999–2009.
[29] Kenneth W. Wachter. Essential Demographic Methods. Harvard University Press, 2014.
Appendix A
Problem sheets
1. (a) Let L1, . . . , Ln be independent Exp(λ) random variables. Show that the maximum
likelihood estimator for λ is given by

    λ̂ = n / (L1 + · · · + Ln).    (A.1)
(b) The following data resulted from a life test of refrigerator motors (hours to burnout):
Hours to burnout
104.3 158.7 193.7 201.3 206.2
227.8 249.1 307.8 311.5 329.6
358.5 364.3 370.4 380.5 394.6
426.2 434.1 552.6 594.0 691.5
i. Assuming refrigerator motors have Exp(λ) lifetimes, give the maximum likelihood
estimate for λ.
ii. Still assuming Exp(λ) lifetimes, calculate the Fisher information and construct
approximate 95% confidence intervals for λ and 1/λ using the approximate Normal
distribution of the maximum likelihood estimator.
iii. Still assuming Exp(λ) lifetimes, show that 2nλ/λ̂ ∼ χ²_{2n}. Let a be such that
P(2nλ/λ̂ ≤ a) = α/2 and b such that P(2nλ/λ̂ ≥ b) = α/2. Deduce an exact 95%
confidence interval for 1/λ.
iv. Produce a histogram of the data and comment.
v. Merge columns of your histogram appropriately to test whether the hypothesis of
Exp(λ) lifetimes can be rejected. Use a χ2 goodness of fit test.
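A minimal R sketch for parts i–iii (not part of the original sheet); it assumes the burnout times above are entered as the vector hours.

hours <- c(104.3, 158.7, 193.7, 201.3, 206.2, 227.8, 249.1, 307.8, 311.5, 329.6,
           358.5, 364.3, 370.4, 380.5, 394.6, 426.2, 434.1, 552.6, 594.0, 691.5)
n <- length(hours)
lambda.hat <- n / sum(hours)                  # MLE from (A.1)
se.lambda  <- lambda.hat / sqrt(n)            # Fisher information is n / lambda^2
lambda.hat + c(-1, 1) * 1.96 * se.lambda      # approximate 95% CI for lambda
mean(hours) + c(-1, 1) * 1.96 * mean(hours) / sqrt(n)      # approximate 95% CI for 1/lambda
2 * n * mean(hours) / qchisq(c(0.975, 0.025), df = 2 * n)  # exact 95% CI for 1/lambda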
4. The survival times (in days after transplant) for the original n = 69 members of the
Stanford Heart Transplant Program were as follows:
(a) Complete the following table of counts dx of associated curtate residual lifetimes (in
years=365 days), counts `x of subjects alive exactly x years after their transplant,
total time `˜x spent alive between x and x + 1 years after their transplant, by all
subjects:
x        0      1        2        3       4
d_x                      8        4       3
ℓ_x
ℓ̃_x            19.148   10.203   4.937   1.315
(b) Calculate the maximum likelihood estimators q̂_x^(0) and q̂_x for q_x, x = 0, . . . , 4, based
on the discrete and continuous method, respectively.
(c) Calculate the maximum likelihood estimates. Comment on the differences.
(d) Estimate the probability of surviving for 3 months
i. assuming fractional and integer parts of lifetimes are independent, and the
fractional part is uniform;
ii. assuming the force of mortality is constant over the first year;
iii. directly from the data (the total time spent alive until three months after the
transplant is 12.58 years). Hint: You may, of course, guess formulas to test your
intuition, but you should then state your assumptions and apply the discrete
and/or continuous method to justify your estimates as maximum likelihood
estimates.
5. Review the material on the census approximation (section 2.11.2). For purposes of your
sketches, you may assume the “census date” is the start of the year (1 January). Suppose
we have census counts from the years K, K + 1, . . . , K + N .
(a) Denote by Pk,t the number of lives under observation, aged k (last birthday), at any
time t.
i. On a Lexis diagram, sketch the region where you would find the deaths of
individuals who died in year t aged k years old; and the region where you would
find the deaths of individuals who died in year t + 1 aged k + 1 years old. Also
sketch sample lifelines for such people, indicating when these individuals might
have been born.
ii. Given n individuals at risk and aged k between times a_i and b_i, show that the
total time at risk is

    E^c_k = Σ_{i=1}^{n} (b_i − a_i) = ∫_K^{K+N} P_{k,t} dt.

iii. Assume that P_{k,t} is linear between census dates t = K, K + 1, . . . , K + N. Calculate
E^c_k in terms of P_{k,t}, t = K, K + 1, . . . , K + N. Explain why the assumption cannot
hold exactly.
(b) Depending on the records available, you may not know the exact age of individuals at
death. You may only know the calendar year of birth and the calendar year of death.
Thus, instead of d_k^(1) = # deaths aged k at last birthday before death, you will have
d_k^(2) = # deaths in the calendar year of the k-th birthday.
i. On a Lexis diagram, sketch the region where you would find the deaths of
individuals who died in year t whose k-th birthday was in the same year; and
the region where you would find the deaths of individuals who died in year t + 1
whose (k + 1)-th birthday was in the same year. Also sketch sample lifelines for such
people, indicating when they might have been born.
ii. Describe the resulting estimate of the force of mortality. Explain the definition
that you need to use for Pk,t . State any further assumptions you make.
iii. What function of µt is being estimated? Assuming mortality rates are changing
with age, explain why this calculation is estimating something slightly different
than the calculation in the previous part.
1. (a) Explain what is meant by right censoring, left censoring, right truncation, left trunca-
tion.
(b) In a study of the elderly, individuals were enrolled in the study, at varying times, if
they had already had one episode of depression. The event of interest was the onset of
a second episode. An individual could be enrolled if at some previous time an episode
of depression had been diagnosed. Which of the above mechanisms are relevant if it is
also known that the study finished after four years?
(c) In 1988 a study was published of the incubation time (waiting time from infection
until symptoms develop) of AIDS. The sample was of 258 adults who were known to
have contracted AIDS from blood transfusion. The data reported were the date of
the transfusion, and the time from infection until the disease was diagnosed. Which
of the above mechanisms are relevant for analysing these data?
2. (a) Suppose you are given estimates for a population of remaining life expectancy ex and
ex+t , corresponding to ages x and x + t (years). You wish to compute the mortality
probability t qx . Under the assumption that mortality rates are constant over this
interval, show that
    t q_x ≈ (t + e_{x+t} − e_x) / (t/2 + e_{x+t}).    (*)
Explain why this equation is only approximate, and what assumption would make it
a good approximation.
(b) The following is an estimated table of e_x (in years) in ancient Rome, as computed by
Tim Parkin (Demography and Roman Society), available at http://www.utexas.edu/
depts/classics/documents/Life.html.
x 0 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70
ex 25 33 43 41 37 34 32 29 26 23 20 17 14 10 8 6
3. The data set ovarian, included in the survival package, presents data for 26 ovarian
cancer patients, receiving one of two treatments, which we will refer to as the single and
double treatments. (They appear in the data set as the rx variable, taking on values 1 and
2 respectively.)
(d) Compute the standard error for the probability of survival past 400 days in each
group, as estimated by the Nelson–Aalen and Kaplan–Meier estimators.
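A minimal R sketch for this part (not from the original sheet), using the variable names futime, fustat and rx carried by the ovarian data in the survival package; the stype/ctype arguments for the Nelson–Aalen version assume a reasonably recent version of survival.

library(survival)
km <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)     # Kaplan-Meier
na <- survfit(Surv(futime, fustat) ~ rx, data = ovarian,
              stype = 2, ctype = 1)                          # Nelson-Aalen based survival
summary(km, times = 400)$std.err   # standard errors of the survival estimates at 400 days
summary(na, times = 400)$std.err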
Assume the Gompertz-Makeham model has been used for graduation. Is this a sensible
choice? Test the proposed graduation for i) Overall goodness of fit; and ii) Bias.
5. Attached is an excerpt from a cohort life table for men in England and Wales born in 1894,
including curtate life expectancies. (Data from the Human Mortality Database.) Using the
given data:
(a) Estimate the change to e0 , the curtate life expectancy at birth, if the mortality rate
in the first two years of life were reduced to modern-day levels (say q0 = 0.005,
q1 = 0.0004).
(b) Make a rough estimate of the change to e0 if the increases in mortality due to the
1914-18 war and the 1918-19 influenza pandemic had not occurred.
Age x lx qx ex
0 100000 0.16134 44.82
1 83866 0.05398 52.39
...      ...        ...        ...
14 74067 0.00220 45.99
15 73904 0.00237 45.09
16 73729 0.00260 44.20
17 73538 0.00301 43.31
18 73316 0.00313 42.44
19 73087 0.00787 41.57
20 72512 0.01836 40.90
21 71181 0.03218 40.65
22 68890 0.04424 40.98
23 65842 0.06194 41.86
24 61764 0.02088 43.59
25 60474 0.00551 43.51
26 60141 0.00385 42.75
27 59910 0.00384 41.91
28 59680 0.00391 41.07
29 59446 0.00377 40.23
30 59222 0.00386 39.38
31 58994 0.00367 38.53
32 58777 0.00380 37.67
33 58554 0.00399 36.81
34 58320 0.00445 35.96
35 58061 0.00460 35.11
6. If x is the observed value of a random variable X ∼ Binom(n, p), with known n, find the
maximum-likelihood estimator p̂, and deduce that
    Var(p̂) ≈ x(n − x) / n³.
If Ŝ(t) is the Kaplan-Meier estimator, an alternative estimator for the variance is
    Var Ŝ(t) = Ŝ(t)² (1 − Ŝ(t)) / n(t)
where n(t) is the number at risk at time t+. If d(t) is the number of failures up to and
including time t, justify the estimation
    Ŝ(t) ≈ n(t) / (n(t) + d(t)) = n(t) / n(0),
making the conservative assumption that all the censoring in the interval [0, t) takes place
at t = 0. What is the distribution of d(t) given this assumption? Explain how this can be
used to justify the expression for Var Ŝ(t) in terms of a binomial proportion estimator (as p̂
above). In the special case of no censoring, what is the connection between this estimator
and Greenwood’s estimator for the variance?
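As a quick numerical illustration (not part of the sheet), the following R sketch checks, on simulated uncensored data, that the Greenwood standard error returned by survfit agrees with the binomial form n(t)^{-1} Ŝ(t)² (1 − Ŝ(t)):

library(survival)
set.seed(7)
tt  <- rexp(50)
fit <- survfit(Surv(tt, rep(1, 50)) ~ 1)   # no censoring
s   <- summary(fit)
i   <- 25                                  # an arbitrary event time
S   <- s$surv[i]
nt  <- s$n.risk[i] - s$n.event[i]          # n(t): number at risk just after the event
c(greenwood.se = s$std.err[i],             # Greenwood, as computed by survfit
  binomial.se  = sqrt(S^2 * (1 - S) / nt)) # should agree when there is no censoring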
7. We are carrying out a hypothetical study of the survival of Alzheimer patients. We enrol 30
subjects in a clinic, and follow them over five years. We record their age at being enrolled
in the study and the age at which they left, and the cause of exit, whether death (1) or
something else (0).
Here + indicates censored times. Investigate these data in respect of both a) and b).
2. (a) Describe the proportional hazards model, explaining what is meant by the partial likelihood
and how this can be used to estimate regression coefficients. How might standard errors be
generated?
(b) Drug addicts are treated at two clinics (clinic 0 and clinic 1) on a drug replacement therapy.
The response variables are the time to relapse (to re-taking drugs) and the status relapse
=1 and censored =0. There are three explanatory variables, clinic (0 or 1), previous stay
in prison (no=0, yes=1) and the prescribed amount of the replacement dose. The following
results are obtained using a proportional hazards model, h(t, x) = eβx h0 (t).
What is the estimated hazard ratio for a subject from clinic 1 who has not been in prison as
compared to a subject from clinic 0 who has been in prison, given that they are each assigned
the same dose?
(c) Find a 95% confidence interval for the hazard ratio comparing those who have been in prison
to those who have not, given that clinic and dose are the same.
3. The object tongue in the package KMsurv lists survival or right-censoring times in weeks after
diagnosis for 80 patients with tongue tumours. The type variable is 1 or 2, depending on whether
the tumour was aneuploid or diploid respectively.
(a) Use the log-rank test to test whether the difference in survival distributions is significant at
the 0.05 level.
(b) Repeat the above with a test that emphasises differences shortly after diagnosis.
4. (a) Sketch the shape of the hazard function in the following cases, paying attention to any changes
of shape due to changes in value of κ where appropriate.
i. Weibull: S(t) = e^{−(ρt)^κ}.
ii. Log-logistic: S(t) = 1 / (1 + (ρt)^κ).
(b) Suppose that it is thought that an accelerated life model is valid and that the hazard function
has a maximum at a non-zero time point. Which parametric models might be appropriate?
(c) Suppose that t1 , . . . , tn are observations from a lifetime distribution with respective vectors of
covariates x1 , . . . , xn . It is thought that an appropriate distribution for lifetime y is Weibull
with parameters ρ, κ, where the link is log ρ = β · x. In the case that there is no censoring
write down the likelihood and, using maximum likelihood, give equations from which the
vector of estimated regression coefficients β (and also the estimate for κ) could be found.
What would be the asymptotic distribution of the vector of estimators? How would the
likelihood differ if some of the observations ti were right censored (assuming independent
censoring)?
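For the computing side of this question, a minimal, hedged R sketch follows; the data are simulated (the covariate x, the parameter values and the censoring scheme are all invented), and it uses the standard reparametrisation between the (ρ, κ) form above and survreg's accelerated-failure-time form.

library(survival)
set.seed(42)
n <- 500
x <- rbinom(n, 1, 0.5)
kappa <- 1.5; b0 <- -1; b1 <- 0.5                        # assumed values: log rho = b0 + b1 * x
rho <- exp(b0 + b1 * x)
t.true <- rweibull(n, shape = kappa, scale = 1 / rho)    # S(t) = exp(-(rho t)^kappa)
cens   <- runif(n, 0, quantile(t.true, 0.9))             # independent right censoring
time   <- pmin(t.true, cens); status <- as.numeric(t.true <= cens)
fit <- survreg(Surv(time, status) ~ x, dist = "weibull")
# survreg fits log T = lp + sigma*W, so kappa = 1/scale and the coefficients of
# log rho are minus the survreg coefficients.
c(kappa.hat = 1 / fit$scale, b0.hat = -coef(fit)[1], b1.hat = -coef(fit)[2])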
5. Coronary Heart Disease (CHD) remains the leading cause of death in many countries. The evidence
is substantial that males are at higher risk than females, but the role of genetic factors versus
the gender factor is still under investigation. A study was performed to assess the gender risk of
death from CHD, controlling for genetic factors. A dataset consisting of non-identical twins was
assembled. The age at which each person died of CHD was recorded. Individuals who either had
not died or had died from other causes had censored survival times (age). A randomly selected
subsample from the data is as follows. (* indicates a censored observation.)
(a) Write down the times of events and list the associated risk sets.
(b) Suppose the censoring mechanism is independent of death times due to CHD, and that
the mortality rates for male and female twins satisfy the PH assumption, and let β be the
regression coefficient for the binary covariate that codes gender as 0 or 1 for male or female
respectively. Write down the partial-likelihood function. Using a computer or programmable
calculator, compute and plot the partial-likelihood for a range of values of β. What is the
Cox-regression estimate for β? What does this mean?
(c) Estimate the survival function for male twins.
(d) Suppose now only that the censoring mechanism is independent of death times due to CHD,
perform the log-rank test for equivalence of hazard amongst these two groups. Contrast the
test statistic and associated p-value with the results from the Fleming–Harrington test using
a weight W (ti ) = Ŝ(ti−1 ).
(e) Do you think the assumption of a non-informative censoring mechanism is appropriate? Give
reasons.
6. In section 5.12.3 we describe fitting the Aalen additive hazards model for the special case of a
single (possibly time-varying) covariate. If we assume that xi takes on only the values 0 and 1 —
so it is the indicator of some discrete characteristic — then B1 (t) may be thought of as the excess
cumulative hazard up to time t due to that characteristic. Simplify the formula (5.17) for this case,
and show, in particular, that in the special case where xi is constant over time, the estimate B̂0 (t)
is equivalent to the Nelson–Aalen estimator for the cumulative hazard of the group of individuals
with xi = 0, and that the excess cumulative hazard B̂1 (t) is equivalent to the difference between
the Nelson–Aalen estimator for the cumulative hazard of the group of individuals with xi = 1, and
for the cumulative hazard of the group of individuals with xi = 0.
7. Refer to the AML study, which is described at length in Example 4.4.5 and analysed with the
Cox model in section 5.10. Using the data described in those places, estimate the difference in
cumulative hazard to 20 weeks between the two groups by
(a) The Aalen additive hazards regression model.
(b) The Cox proportional hazards regression model.
(c) Using the proportional hazards method, suppose an individual were to switch from mainte-
nance to non-maintenance after 10 weeks, and suppose the hazard rates change instantaneously.
Estimate the difference in cumulative hazard to 20 weeks between that individual and one
who had always been in the non-maintenance group.
    λ_i(t) = α_0(t) r(β, x_i),

where β is the vector of parameters, and x_i is the vector of covariates associated with individual i,
and

    r(β, x_i) = e^{β^T x_i}.
(a) Compute a formula for the k-th component of the score function (with respect to the partial
likelihood);
(b) Compute a formula for the (k, m) component of the observed partial information matrix.
2. (Based on Exercise 11.1 of [18].) The dataset larynx in the package KMsurv includes times of
death (or censoring by the end of the study) of 90 males diagnosed with cancer of the larynx
between 1970 and 1978 at a single hospital. One important covariate is the stage of the cancer,
coded as 1,2,3,4.
(a) Why would it probably not be a good idea to fit the Cox model with relative risk eβ·stage ?
What should be done instead?
(b) Explain how you would use a martingale residual plot to show that stage does not enter as a
linear covariate.
(c) Which residual plot would you use to test whether the proportional-hazards assumption holds
for age or stage, or whether the proportional effect of one of these covariates changes over
time?
(d) Explain how you would use a Cox–Snell residual plot to test whether the Cox model is
appropriate to these data. Describe the calculations you would perform, the plot that you
would create, and describe the visual features you would be looking for to evaluate the
goodness of fit.
3. Carry out these computations in R for the data set described in the previous question:
(a) One way of making R treat the stage variable appropriately is to replace it in the model
definition by factor(stage). Show that this produces the same result as defining separate
binary variables for three different outcomes.
(b) Try adding year of diagnosis or age at diagnosis as a linear covariate (in the exponent of the
relative risk). Is either statistically significant?
(c) Use a martingale residual plot to show that stage does not enter as a linear covariate.
(d) Use a residual plot to test whether one or the other of these covariates might more appropriately
enter the model in a different functional form — for example, as a step function.
(e) Use a Cox–Snell residual plot to test whether the Cox model is appropriate to these data.
4. We observe survival times T_i satisfying an additive hazards model, so the hazard for individual
i is h_i(t) = β_0(t) + Σ_{k=1}^{p} x_{ik}(t) β_k(t), with B_k(t) = ∫_0^t β_k(s) ds. We define Y_i(t) to be the at-risk
indicator for individual i at time t, and X(t) the matrix of covariates at time t multiplied by at-risk,
defined as in section 5.12.2. We also define N(t) to be the binary vector giving in position i the
number of events that individual i has had up to time t. Let B̂(t) be the vector of cumulative
regression coefficient estimators.
We assume the process has been observed up to a final time τ where there is a sufficient range of
subjects remaining that X(t) has full rank. Define the martingale residual vector

    M_res(t) = N(t) − Σ_{t_j ≤ t} X(t_j) dB̂(t_j),
(a) Show that all components of M_res(t) have expectation 0, for all times 0 ≤ t ≤ τ.
(b) Suppose now that all covariates are fixed and the data are right-censored. Show that
X(0)^T M_res(τ) = 0.
(c) How might this fact be used as a model-diagnostic for the additive-hazards assumption?
5. Suppose we have a right-censored survival data set where we have accidentally copied every line of
data twice. We fit a Cox proportional hazards regression model.
(a) Show that the point estimate for β and for the baseline hazard will be the same as for the
correct (undoubled) data, but that the variance estimate will be wrong. (Use the Breslow
method for dealing with the tied observations.) What will the variance estimate be, relative
to the correct estimate?
(b) Show that the sandwich estimate described in section 9.3 will agree asymptotically with the
variance estimate for the correct data set.
(c) Carry out the calculations in R for the bmt (bone marrow transplant) data set in the KMsurv
package, referred to in section 7.4. That is, fit a Cox proportional hazards model as described
in 7.4, and then fit the same model to the data set where every line of the data object
has been duplicated. Compare the conclusions. Then add an id variable (so that the two
lines corresponding to the same patient have the same id) and redo the analysis using a
+cluster(id) term in the formula, and see if the problem is resolved.
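The following is a schematic R sketch of the steps in part (c), not the intended solution: the actual model of section 7.4 is not reproduced here, so an illustrative formula Surv(t2, d3) ~ factor(group) (variable names as in the KMsurv bmt data) stands in for it.

library(survival); library(KMsurv)
data(bmt)
fit1 <- coxph(Surv(t2, d3) ~ factor(group), data = bmt, ties = "breslow")
bmt2 <- rbind(bmt, bmt)                   # every line duplicated
bmt2$id <- rep(seq_len(nrow(bmt)), 2)     # id links the two copies of each patient
fit2 <- coxph(Surv(t2, d3) ~ factor(group), data = bmt2, ties = "breslow")
fit3 <- coxph(Surv(t2, d3) ~ factor(group) + cluster(id), data = bmt2, ties = "breslow")
# Point estimates agree; the naive standard errors in fit2 are too small, while the
# robust (sandwich) errors in fit3 should be close to those of fit1.
rbind(fit1 = coef(summary(fit1))[, "se(coef)"],
      fit2 = coef(summary(fit2))[, "se(coef)"],
      fit3 = coef(summary(fit3))[, "robust se"])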
6. Suppose n individuals experience events at constant rate λi , i = 1, . . . , n, over the same period of
time [0, T ]. The rate λi for individual i is unknown. Suppose the unknown rates λi have a gamma
distribution with parameters (r, λ), and let Ni be the observed number of events for individual i.
(a) Show that (Ni ) have a negative binomial distribution, and compute the parameters.
(b) Suppose we fit these data to a Poisson model, to obtain an estimate λ̂. How will λ̂ behave as
n → ∞?
(c) How would you test the hypothesis that the Poisson model is correct, against the alternative
that it is negative binomial?
(d) [optional] Test these conclusions with simulated data in R.
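A minimal simulation sketch for part (d) (illustrative only: the values n = 1000, r = 2, λ = 1 and observation period [0, 1] are not specified in the problem, and the final likelihood-ratio comparison uses MASS::glm.nb as one convenient way of fitting the negative binomial):

library(MASS)
set.seed(1)
n <- 1000; r <- 2; lambda <- 1
rates <- rgamma(n, shape = r, rate = lambda)   # unobserved individual rates
N <- rpois(n, rates)                           # observed event counts over [0, 1]
lambda.hat <- mean(N)                          # Poisson MLE; converges to r/lambda
c(mean = mean(N), variance = var(N))           # overdispersion is already visible here
fit.nb <- glm.nb(N ~ 1)
loglik.pois <- sum(dpois(N, lambda.hat, log = TRUE))
LR <- 2 * (as.numeric(logLik(fit.nb)) - loglik.pois)
pchisq(LR, df = 1, lower.tail = FALSE)         # approximate p-value (boundary caveat applies)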
Appendix B
Solutions
0.95 = P(24.43 < χ²_{2n} < 59.34) = P(24.43 < 2nλ/λ̂ < 59.34)
     = P( 2n/(59.34 λ̂) < 1/λ < 2n/(24.43 λ̂) ),

so the exact 95% confidence interval for 1/λ is (231.1, 561.3).
iv. [Histogram of the burnout times (frequency against hours to burnout); figure not reproduced.]
v. Expected numbers under Exp(λ̂) are (e^{−100kλ̂} − e^{−100(k+1)λ̂}) n:
(5.1, 3.8, 2.8, 2.1, 1.6, 1.2, 0.9) and 2.6 for > 700.
For the χ2 test we require expected numbers above 5, so we keep the first bin, merge the
next three to get 8.7 and the remainder to get 6.2 (alternatively merge next two and
remainder). The data then is
Bin 0-100 100-400 400+ total
observed 0 15 5 20
expected 5.1 8.7 6.2 20
and we calculate the χ²_{3−2} = χ²_1 test statistic

    Σ_{i=1}^{3} (O_i − E_i)² / E_i = 9.84  ⇒  |Z| = sqrt( Σ_{i=1}^{3} (O_i − E_i)² / E_i ) = 3.14 > 1.96,

so the hypothesis of Exp(λ) lifetimes is rejected at the 5% level.
    f(t) = λ e^{−λt} / (1 − e^{−λω})  ⇒  h(t) = λ / (1 − e^{−λ(ω−t)}).    (B.2)
    F_Y(y) = P(Y ≤ y)
           = P(Λ^{-1}(X) ≤ y)
           = P(X ≤ Λ(y))        because Λ is strictly increasing
           = 1 − e^{−Λ(y)},
x       0       1       2       3      4
d_x     45      9       8       4      3
ℓ_x     69      24      15      7      3
ℓ̃_x    35.63   19.15   10.20   4.94   1.32
(b) The discrete method based on curtate lifetimes K^(1), . . . , K^(n), n = 69, factorises the likelihood

    ∏_{j=1}^{69} p_K(K^(j)) = ∏_{x=0}^{∞} (1 − q_x)^{ℓ_x − d_x} q_x^{d_x}    (B.3)

and differentiation of each factor leads to maximum likelihood estimators q̂_x^(0) = d_x/ℓ_x.
The continuous method based on T^(1), . . . , T^(n), n = 69, and the assumption of constant
forces of mortality between integer ages, factorises the likelihood

    ∏_{j=1}^{69} f_T(T^(j)) = ∏_{x=0}^{∞} μ_{x+1/2}^{d_x} exp{ −ℓ̃_x μ_{x+1/2} }    (B.4)

and differentiation of each factor leads to maximum likelihood estimators q̂_x = 1 − exp{−d_x/ℓ̃_x}.
(c) From the formulas obtained in (b) we calculate
x          0       1       2       3       4
q̂_x^(0)   0.65    0.38    0.53    0.57    1
q̂_x       0.717   0.375   0.543   0.555   0.898

q̂_0^(0) < q̂_0 since the total time ℓ̃_0 spent at risk is very short. We can see this directly from the
data. Most subjects dying in the first year die very early (e.g. three subjects die the day after
their transplant). This actually suggests that the force of mortality is not constant over the
first year, but much higher initially.
q̂_4 < q̂_4^(0) = 1, so the continuous method allows survival beyond the maximal observed age.
The specification of the distribution estimate is not complete, but with no data we get no
estimate. Some methods of graduation allow extrapolation beyond the maximal age.
By both methods, the one-year survival probabilities indicate a bathtub behaviour, decreasing
initially and then increasing.
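A quick R check of the table above (not part of the original solution), using the counts d_x, ℓ_x and exposures ℓ̃_x from part (a):

d  <- c(45, 9, 8, 4, 3)
l  <- c(69, 24, 15, 7, 3)
lt <- c(35.63, 19.15, 10.20, 4.94, 1.32)
round(d / l, 3)              # discrete method:   q-hat_x^(0)
round(1 - exp(-d / lt), 3)   # continuous method: q-hat_x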
(d) i. Under the estimates from curtate lifetimes and the assumption of independent uniform
fractional part,
ii. Under the estimates from continuous lifetimes and the assumption of constant force of
mortality between integer ages
    P(T > 0.25) = exp{ −∫_0^{0.25} μ_t dt }
iii. Again we can apply the discrete or continuous method (formally for units of three
months). The continuous method assumes constancy of forces of mortality over each
three-month period and gives an estimate
    exp(−d/(4ℓ̃)) = exp(−31/(4 × 12.58)) = 0.540.

Here, 4ℓ̃ is the total number of three-month time units at risk during the first unit (converted
from the ℓ̃ = 12.58 years given in the question).
The discrete method is based on one-unit death probabilities and gives 1 − d/ℓ =
1 − 31/69 = 0.551 as an estimate for the first-unit survival probability.
These estimates are much smaller, reflecting a higher risk of dying initially. In fact, this
suggests that neither assumption i. nor ii. is optimal. An initially decreasing force of
mortality would be better.
5. (a) i. The turquoise region corresponds to age x in year t. The same individuals are age x + 1
in year t + 1, and this portion of their lifelines falls in the yellow region.
[Lexis diagram: Age (from x − 1 to x + 2) against calendar time, showing the turquoise and yellow regions and sample lifelines.]
The assumption of piecewise linear Px,t cannot hold exactly since Px,t ∈ N, but for large
n this is negligible.
(b) i. The turquoise region corresponds to age x in year t. The same individuals are age x + 1
in year t + 1, and this portion of their lifelines falls in the yellow region.
[Lexis diagram for the year-of-birth classification: Age (from x − 1 to x + 2) against calendar time, with the corresponding parallelogram regions and sample lifelines.]
since it is more natural to assume that the cohort of lives P_{x,k} with x-th birthday in
calendar year k changes linearly to P_{x+1,k+1}, as this counts the same people (anyone whose
x-th birthday falls in year k has their (x + 1)-th birthday in year k + 1).
iii. The first estimate is approximating ∫_x^{x+1} μ_s ds; that is, the average of μ_s over the age
interval from x to x + 1, assuming age-specific mortality doesn’t change with time. (This
is just μ_x if μ_s is assumed constant on [x, x + 1).) The second estimate clearly includes
the experience of individuals at ages between x − 1 and x + 1, but weighted toward the
middle (in proportion to the width of the parallelogram in the above figure). In fact, the
estimate μ̃_x = d_x^(2) / E_x^{c,2} will be approximating

    ∫_{x−1}^{x+1} (1 − |s − x|) μ_s ds.
    e_x ≈ t q_x · (t/2) + (1 − t q_x)(t + e_{x+t}),
and rearranging leads to the given formula for t qx . The approximation is reasonable if for
example deaths are approximately uniform across the interval (x, x + t), which would occur if
mortality is constant and low.
(b) Applying the approximations we get
Under the assumption of constant mortality on these intervals, for x = 1, 2, 3, 4 we have

    q_x = 1 − p_x = 1 − (_4p_1)^{1/4} = 1 − (1 − _4q_1)^{1/4},
[Kaplan–Meier survival curves for the single-treatment and double-treatment groups.]
The standard errors are in the code printout above. For type 1 the variance estimate for the
Nelson–Aalen estimator is 0.04 at t = 400; for type 2 it is 0.01. So the corresponding standard
errors for the cumulative hazard are 0.2 and 0.1. The standard errors for survival are obtained
by multiplying these by Ŝ(400), obtaining 0.13 and 0.097. The standard errors computed by
the survfit function for the Kaplan–Meier estimator are in the printout above. They are 0.135
and 0.100.
4. (a) Crude estimates from the data are subject to stochastic fluctuation. Smoothing (graduating)
the estimates may make more reliable predictions.
(b) μ_x = a + b e^{αx} for Gompertz–Makeham. This is generally considered a reasonable model for
the hazard rate (force of mortality) from middle age onward. Note, though, that the mortality
rate doubling times (which would be approximately constant under Gompertz–Makeham)
lengthen progressively. The parameters a, b, α will have to be fitted from the data.
We apply the chi-squared test. To begin with, we combine the last two rows to have ≥ 5
expected deaths in each row. The last row becomes
(We interpolate by weighting the two rows by their central exposed to risk.) The χ2 statistic
is then 4.96 on 8 observations. Since we have estimated 3 parameters, we compare this to the
table with 5 degrees of freedom, obtaining p-value 0.42.
To test for bias we use the cumulative deviations test, obtaining Z = 0.96, and a p-value
of 0.3375. Thus, the model seems to fit. Notice that the graduated hazard is generally lower
(it is strongly affected by the mortality plateau at very late ages), which would lead to an
overestimate of benefits paid. This is a relatively good error to make, though it would be
reversed if the company were selling life insurance!
5. (a) Let us write ex and px for the figures in the table, and ẽx and p̃x for the figures after we
change the rates in the first two years.
We have

    e_0 = p_0 (1 + p_1 (1 + e_2)),
    ẽ_0 = p̃_0 (1 + p̃_1 (1 + ẽ_2)).

Since we change only the rates before year 2, we have e_2 = ẽ_2. Then solving for ẽ_0, we have

    ẽ_0 = p̃_0 ( 1 + p̃_1 (e_0/p_0 − 1)/p_1 ).
With e0 = 44.83, p0 = 0.839, p1 = 0.946, p̃0 = 0.995, p̃1 = 0.9996 we obtain ẽ0 = 56.12, an
increase of 11.29 years.
(b) It is clear from the table that the rates qx (x = 19, . . . , 25) are much larger than we would
normally expect. Comparing them with the rates just before and after, it looks like a plausible
first approximation would be to replace all these rates by a rate around 0.0035. So we could
work with a model where the mortality is constant with q = 0.0035 = 1 − 0.9965 for those 7
years. (Of course this is very rough, and there are all sorts of things we ignore including the
various effects of the war on mortality, even after it had finished).
We can represent e0 as a sum A + B + C of three terms:
Let us write Ã, B̃, and C̃ for the new values once we change the rates.
The change to the rates between ages 19 and 26 makes no difference to the first term, so
Ã = A.
We have

    C = _26p_0 · e_26 = (ℓ_26/ℓ_0) e_26 = 0.6014 × 42.75 = 25.71.
The change to the rates makes no difference to e_26, but we have a new value for the probability
of surviving to age 26, giving C̃ = (ℓ_19/ℓ_0) × 0.9965⁷ × e_26 = 30.49.
Finally, we can find B using B + C = e_19 ℓ_19/ℓ_0 as in the previous calculation, so

    B = (ℓ_19/ℓ_0) e_19 − C = 0.7309 × 41.57 − 25.71 = 4.67.
In the new model with constant rate between age 19 and age 26, we can use

    B̃ = _19p̃_0 (_1p̃_19 + _2p̃_19 + · · · + _7p̃_19)
       = (ℓ_19/ℓ_0) Σ_{k=1}^{7} 0.9965^k
       = (ℓ_19/ℓ_0) (0.9965 − 0.9965⁸)/0.0035
       = 0.7381 × 6.903
       = 5.10.
The total change in life expectancy at birth is B̃ + C̃ − B − C, which comes to 30.49 + 5.10 −
25.71 − 4.67 = 5.21, giving a new life expectancy of around 50.
6. The log likelihood is

    ℓ(p) = log (n choose x) + x log p + (n − x) log(1 − p).

This has solution 0 = ℓ′(p̂) = x/p̂ − (n − x)/(1 − p̂), implying p̂ = x/n. We know that the variance
of a binomial random variable is np(1 − p). Substituting p̂ for p yields the estimate

    Var(p̂) = Var(x/n) = n^{-2} Var(x) = n^{-1} p(1 − p) ≈ n^{-1} (x/n)((n − x)/n) = x(n − x)/n³.
If all the censoring occurs at t = 0 then the number of individuals at risk of dying in (0, t) is
actually n(t) + d(t). Thus the number alive at time t is binomial with parameters n = n(0) = n(t) + d(t)
and p = S(t). The MLE for p is thus

    Ŝ(t) = p̂ = n(t)/(n(t) + d(t)) = n(t)/n(0).
(If the censoring all happens at time 0, then the number at risk at time 0+ will be the same as the
sum of the number who die up to time t, and the number still at risk at time t.) The variance
estimate is

    d(t) n(t) / n(0)³ = n(t)^{-1} · (d(t)/n(0)) · (n(t)/n(0)) · (n(t)/n(0)) = n(t)^{-1} (1 − Ŝ(t)) Ŝ(t)².
Greenwood’s estimate in the case of no censoring is

    Var Ŝ(t) ≈ Ŝ(t)² Σ_{t_i ≤ t} d_i / (n_i (n_i − d_i))
             = Ŝ(t)² Σ_{t_i ≤ t} (n_i − n_{i+1}) / (n_i n_{i+1})
             = Ŝ(t)² Σ_{t_i ≤ t} (1/n_{i+1} − 1/n_i)
             = Ŝ(t)² (1/n_j − 1/n_0)        [n_j = n(t), the number at risk after the last event time ≤ t]
             = Ŝ(t)² d(t) / (n(t) n(0))
             = n(t)^{-1} Ŝ(t)² (1 − Ŝ(t))
as before.
7. (a) Right censoring and left truncation.
(b) If individuals who enter at age x are counted as immediately at risk at age x, and those who
die at age x are also counted as at risk, we obtain the following counts:
Age 65 66 67 68 69 70 71 72 73 74 75
# at risk 3 9 11 13 14 17 14 12 12 8 4
We are planning to use the actuarial estimator, so we count those who were censored or died
as having had half a year at risk in their year of exit, and those who entered at a given age as
having half a year at risk in that year; this gives the following counts:
Age 65 66 67 68 69 70 71 72 73 74 75
# at risk 1.5 6.0 9.5 9.5 11.5 13.0 11.5 10.5 9.0 6.0 3.5
(c) Again, counting whole years at risk for those who enter, die, or are right-censored, we have
t_j    n_j    d_j    h_j    Ŝ(t_j)    S̃(t_j)
(d) We use the whole-year method, rather than the actuarial estimate. Our central estimate
for the probability of surviving from age 70 to age 75 is Ŝ(74) = 0.343. Using Greenwood’s
estimate, we estimate the variance of log Ŝ(74) to be
    Σ_{t_i ≤ 74} d_i / (n_i (n_i − d_i)) = 4/(17 · 13) + 1/(12 · 11) + 3/(12 · 9) + 4/(8 · 4)
                                         = 0.178,
so the standard error is √0.178 = 0.422. Thus an approximate 95% confidence interval for
S(74) is

    (0.343 e^{−0.422·1.96}, 0.343 e^{0.422·1.96}) = (0.150, 0.784).
(e)
require('survival')
age.entry=c(67,70,70,65,65,73,69,76,66,72,65,71,69,71,68,69,69,66,
            73,67,66,69,66,78,66,68,70,66,89,68)
age.exit=c(72,71,73,70,68,78,74,78,67,76,70,75,71,74,73,74,71,68,76,68,70,73,
           70,81,70,73,74,68,92,72)
delta=c(0,0,1,0,1,1,1,1,0,1,1,1,0,1,0,1,0,0,1,0,1,1,1,1,1,1,1,0,1,1)

clinic.surv=Surv(time=age.entry, time2=age.exit, event=delta) # left-truncated, right-censored is default
KM.fit=survfit(clinic.surv~1, subset=(age.exit>=70)) # Survival of those present after age 70
plot(KM.fit, firstx=70, xmax=75, ylab='Survival probability', main='Kaplan-Meier estimator', xlab='Age (yrs)')

time.on.test=age.exit-age.entry   # assumed definition; not shown in the original listing
TOT.surv=Surv(time=time.on.test, event=delta)
TOT.fit=survfit(TOT.surv~1)
plot(TOT.fit, ylab='Survival probability', main='Kaplan-Meier estimator', xlab='Time on test (yrs)')
[Two Kaplan–Meier plots: survival probability against age (70–75) for those under observation after age 70, and survival probability against time on test (0–5 years).]
For the log-logistic model, we expect the plot of log(1/Ŝ(x) − 1) against log x to be approximately
linear.
(b) Let S_1 and S_2 be the survival curves for the two populations, and S_0 the baseline survival.
Under the accelerated lifetime model, S_i(x) = S_0(ρ_i x) for some positive constants ρ_1, ρ_2.
Then if we plot S_i(x) against log x, whatever value S_0 takes at abscissa log x, S_i
will take the same value shifted horizontally by log ρ_i. (The same will be true of any function of
S_i.) Thus, the graphs corresponding to Ŝ_1 and Ŝ_2 should differ approximately by a uniform
horizontal shift.
The proportional hazards assumption is best tested by plotting log(− log Ŝi (x)). Under PH,
S_i(x) = S_0(x)^{ρ_i}, which implies that

    log(− log S_i(x)) = log(−ρ_i log S_0(x)) = log(ρ_i) + log(− log S_0(x)).
Thus, if log(− log Ŝi (x)) is plotted against x, the two graphs should differ approximately by a
constant vertical shift if the two groups satisfy the PH assumption. The same is true if we
plot log(− log Ŝi (x)) against any function of x. Thus, if we plot log(− log Ŝi (x)) against log x,
we will see a constant vertical shift reflecting the PH assumption, and a constant horizontal
shift reflecting the AL assumption.
(c) The computations for the Kaplan–Meier estimator are given in Table B.1. In figure B.1 we
plot the two survival curves (red for control, black for treatment), as log(− log Ŝ) against
log x. Both look reasonably close to lines, so it would be reasonable to suppose that they came
from Weibull models. The lines are approximately parallel, suggesting that the α parameters
are approximately the same. This means that one curve may be obtained from another by a
horizontal or vertical shift, suggesting that PH or AL would be appropriate. (Weibull curves
with the same α parameter, it should be noted, satisfy both hypotheses.)
Control group:
t_j    d_j    n_j    ĥ_j      Ŝ(t_j)
1      2      21     0.095    0.905
2      2      19     0.105    0.810
3      1      17     0.059    0.762
4      2      16     0.125    0.667
5      2      14     0.143    0.572
8      4      12     0.333    0.381
11     2      8      0.250    0.286
12     2      6      0.333    0.191
15     1      4      0.250    0.143
17     1      3      0.333    0.095
22     1      2      0.500    0.048
23     1      1      1.000    0.000

Treatment group:
t_j    d_j    n_j    ĥ_j      Ŝ(t_j)
6      3      21     0.143    0.857
7      1      17     0.059    0.806
10     1      15     0.067    0.752
13     1      12     0.083    0.690
16     1      11     0.091    0.627
22     1      7      0.143    0.537
23     1      6      0.167    0.448

Table B.1: Estimates for the control group (top) and treatment group (bottom) in the Gehan study.
We test the hypothesis by finding maximum likelihood estimators. The log likelihood for the
exponential distribution is

    ℓ(λ) = Σ_i (−λ x_i) + d log λ,

where d is the number of uncensored observations. Since the maximum likelihood estimator
is λ̂ = d / Σ_i x_i, we get a maximised log likelihood of

    ℓ*_exp = d (log d − 1 − log Σ_i x_i).
There is no closed form solution, but we can optimise numerically, yielding estimates
            Treatment    Control
λ̂           0.025        0.12
ℓ*_exp     -42.17       -66.35
ρ̂           0.030        0.11
α̂           1.35         1.37
ℓ*_weib    -41.66       -64.92
The likelihood-ratio test statistic (twice the difference in log likelihoods) for the treatment group
is thus 2 × ((−41.66) − (−42.17)) = 1.02, and for the control group it is 2.86. Comparing these to
the χ² distribution with 1 degree of freedom, we see that the cutoff for rejecting the null hypothesis
that α = 1 at the 0.05 significance level would be 3.84. Thus, we cannot reject the null hypothesis
for either group.
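A hedged R sketch of the same comparison, assuming the Gehan data as packaged in MASS (variables time, cens and treat; adjust the names if your copy differs):

library(survival); library(MASS)
data(gehan)
for (grp in levels(gehan$treat)) {
  d  <- subset(gehan, treat == grp)
  fe <- survreg(Surv(time, cens) ~ 1, data = d, dist = "exponential")
  fw <- survreg(Surv(time, cens) ~ 1, data = d, dist = "weibull")
  lrt <- 2 * (as.numeric(logLik(fw)) - as.numeric(logLik(fe)))
  cat(grp, ": LRT statistic", round(lrt, 2),
      ", p =", round(pchisq(lrt, 1, lower.tail = FALSE), 3), "\n")
}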
2. (a) Assuming no ties, the partial likelihood is constructed by computing the probability that the
subjects failed in exactly the order observed, conditioned on the times observed.
The proportional hazards (PH) assumption says that subject i has hazard rate hi (x) = ri h0 (x)
at time x, where h0 is an unspecified baseline hazard. In the regression approach, we think
of ri as a function r(yi ) of a vector yi of covariates. The linear approach is to suppose
φ(r(y)) = β · y, where φ is the link function and β is a vector of parameters to estimate. In the
Cox model we use the logarithmic link function, so that r(y) = eβ·y . The partial likelihood is
defined as
    L_P(β; y) := ∏_{t_i} e^{β·y_(i)} / Σ_{j ∈ R_i} e^{β·y_j},

where y_(i) represents the covariates of the subject failing at time t_i and R_i is the risk set of
those subjects at risk at t_i.
We use LP as though it were a likelihood. We compute the parameters β̂ that maximise LP .
Under the assumption that the observations came from the distribution given by this model
with some (unknown) parameter β, the estimate β̂ is asymptotically normal, with mean β
and variance matrix that may be estimated by
    ( E[ −∂²ℓ_P / ∂β ∂β^T ] )^{-1},   where ℓ_P = log L_P.
[Figure: log(−log(Survival)) plotted against log(Age) for the two groups.]
Figure B.1: Plot of estimated survival for Gehan leukaemia data. The control group is in red,
the treatment group is black.
SURVDIFF CODE
> require(’survival’)
> require(’KMsurv’)
> data(tongue)
> attach(tongue)
>
> tongue.surv=Surv(time,delta)
> tongue.fit=survfit(tongue.surv~type)
> tdiff=survdiff(tongue.surv~type)
> tdiff
Call:
survdiff(formula = tongue.surv ~ type)
DIRECT COMPUTATION
# Problem sheet 4, question 1
require(’survival’)
require(’KMsurv’)
data(tongue)
attach(tongue)
tongue.surv=Surv(time,delta)
tongue.fit=survfit(tongue.surv~type)
n1=tongue.fit$strata[1]
n2=tongue.fit$strata[2]
crossrisk=function(t1,t2,r1,r2){
I1=rep(0,length(t1))
I2=rep(0,length(t2))
for(i in seq(length(t1))){
I1[i]=1+sum(t1[i]>t2)
}
for(i in seq(length(t2))){
I2[i]=1+sum(t2[i]>t1)
}
list(I1,I2,r1[I2],r2[I1])
}
r1=tongue.fit$n.risk[seq(n1)]
r2=tongue.fit$n.risk[seq(n1+1,n1+n2)]
r1=c(r1,r1[n1]-tongue.fit$n.event[n1]-tongue.fit$n.censor[n1])
r2=c(r2,r2[n2]-tongue.fit$n.event[n1+n2]-tongue.fit$n.censor[n1+n2])
t1=tongue.fit$time[seq(n1)]
t2=tongue.fit$time[seq(n1+1,n1+n2)]
cr=crossrisk(t1,t2,r1,r2)
Y1=c(r1[-n1],cr[[3]])
Y2=c(cr[[4]],r2[-n2])
# Note: r1 and r2 had an extra count added on to make crossrisk work
d1=c(tongue.fit$n.event[seq(n1)],rep(0,n2))
d2=c(rep(0,n1),tongue.fit$n.event[seq(n1+1,n1+n2)])
t=c(t1,t2)
# We have to deal with the problem of ties between times for the two groups
dup1=which(duplicated(t,fromLast=TRUE))
dup2=which(duplicated(t))
ndup=length(dup1)
tord=order(t)
t=t[tord] #put times in order
## Totals at risk and total numbers of events at each time
Y=Y1+Y2
d=d1+d2
## Now put everything else in the same order
Y=Y[tord]
Y1=Y1[tord]
Y2=Y2[tord]
d=d[tord]
d1=d1[tord]
d2=d2[tord]
# 'includes' (defined in code not shown here) selects the event times used in the test
Y=Y[includes]
Y1=Y1[includes]
Y2=Y2[includes]
d=d[includes]
d2=d2[includes]
d1=d1[includes]
t=t[includes]
wLR=Y1*Y2/Y
p=1
q=0
w=wLR
M=w*(d1/Y1-d2/Y2)
sigma=w*w*d*(Y-d)/Y2/Y1/(Y-1)
sK=d*Y1*Y2*(Y-d)/Y^2/(Y-1)
Z=sum(M)/sqrt(sum(sigma))
> Z
[1] -1.670246
4. (a) The plot is:
[Sketches of the hazard functions: Weibull with κ = 0.5 and κ = 1.5, and log-logistic with κ = 0.5 and κ = 1.5.]
Asymptotically, the estimators will be normally distributed. If some observations are right-
censored, the log likelihood becomes

    ℓ(κ, β) = n_d log κ + n_d κ log ρ + κ Σ_i δ_i β·x_i + (κ − 1) Σ_i δ_i log t_i − Σ_i (ρ e^{β·x_i} t_i)^κ,

where n_d = Σ_i δ_i is the number of uncensored observations.
Note that there is some ambiguity in breaking ties. When an observation is censored at time
ti we must decide whether to treat the censoring as having occurred just after or just before
ti : that is, was the individual available to have been counted if they had died at time ti or
not? We have chosen the former: Thus, for instance, R9 is the set of individuals at risk at
time 75, and it includes M12, who was censored at age 75. Either one is acceptable — though
details of the study may suggest one or the other interpretation — but it should be specified.
Since we are interested only in the binary covariate of gender, we need only consider the risk
sets as counting the numbers of males and females, coded as Ri = (mi , fi ). We may then
summarise them as
(b) Using the notation as above, and setting the vector of covariates to be x = (1, 0, 0, 1, 1, 1, 0, 0, 0)
(coding female as 0 and male as 1), we have the partial likelihood

    L_P = ∏_{i=1}^{9} e^{β x_i} / (f_i + e^β m_i) = e^{4β} ∏_{i=1}^{9} (f_i + e^β m_i)^{-1}.    (B.9)
A plot of this function is in Figure B.2. The maximum likelihood is attained at β = −0.042.
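A small R sketch (not part of the original solution) that reproduces this maximisation over a grid, using the male death indicators and risk-set counts from the table below:

dm <- c(1, 0, 0, 1, 1, 1, 0, 0, 0)        # 1 if the death at each event time was male
mi <- c(11, 10, 9, 9, 8, 7, 5, 4, 2)      # males at risk
fi <- c(12, 12, 11, 10, 9, 9, 8, 5, 2)    # females at risk
logPL <- function(beta) sum(beta * dm - log(fi + exp(beta) * mi))   # log of (B.9)
beta.grid <- seq(-2, 2, by = 0.001)
lp <- sapply(beta.grid, logPL)
plot(beta.grid, exp(lp), type = 'l', xlab = 'beta', ylab = 'partial likelihood')
beta.grid[which.max(lp)]                  # approximately -0.042, as quoted above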
[Figure B.2: the partial likelihood plotted against β over the range −2 to 2.]
event time          50     52     58     61     67     68     70     72     75
Male    d_i^m       1      0      0      1      1      1      0      0      0
        m_i         11     10     9      9      8      7      5      4      2
Female  d_i^f       0      1      1      0      0      0      1      1      1
        f_i         12     12     11     10     9      9      8      5      2
Total   d_i         1      1      1      1      1      1      1      1      1
        n_i         23     22     20     19     17     16     13     9      4
Ŝ(t_{i−1})          1      0.957  0.913  0.867  0.822  0.773  0.725  0.669  0.595
we get Z = −.063, which should be like a draw from a standard normal distribution if the male and
female survival times were drawn from the same distribution. In fact, we get a p-value of
2(1 − Φ(.063)) = .95.
(d) For the Fleming–Harrington test we down-weight the later times, when very few are at risk,
substituting

    Z_FH = [ Σ_{i=1}^{9} Ŝ(t_{i−1}) (d_i^m − n_i^m d_i/n_i) ] / sqrt( Σ_{i=1}^{9} Ŝ(t_{i−1})² n_i^m n_i^f (n_i − d_i) d_i / (n_i² (n_i − 1)) ) = 0.105,
yielding a p-value for the two-sided test of 0.92. In either case we would not reject
the null hypothesis, which is not surprising, as the sample is very small.
Note that this analysis could be improved by taking account of the pairing of twins.
(e) Death due to other causes is unlikely to be independent of CHD. Hence, non-informative
censoring is questionable.
6. For clarity, we repeat the derivation in this particular setting. Let Z(t_j) be the 2 × 2 matrix
X(t_j)^T X(t_j). Since x_i(t_j) = x_i = x_i² is 0 or 1, we have

    Z_00 = Σ_i Y_i(t)² = n(t),
    Z_10 = Z_01 = Z_11 = Σ_i Y_i(t) x_i = n_1(t),

where n(t) is the number of individuals at risk at time t, and n_j(t) is the number of individuals at
risk at time t with x_i = j. The determinant is n_1(t) n_0(t). Inverting this, we get

    X^-(t) = ( 1/n_0(t)     −1/n_0(t)              )  ( Y_1(t)       Y_2(t)       · · ·   Y_n(t)     )
             ( −1/n_0(t)    1/n_1(t) + 1/n_0(t)    )  ( x_1 Y_1(t)   x_2 Y_2(t)   · · ·   x_n Y_n(t) ),

as long as n_1(t) n_0(t) > 0 (and 0 otherwise). Thus, since Y_{i_j}(t_j) is always 1 (since the individual
who has an event must, by definition, be at risk), the column of X^-(t_j) corresponding to the
individual i_j with the event (writing x_0 for that individual's covariate) is

    X^-(t_j)_{· i_j} = ( (1 − x_0)/n_0(t_j) ,  −(1 − x_0)/n_0(t_j) + x_0/n_1(t_j) )^T.

Summing the first component over event times gives B̂_0(t) = Σ_{t_j ≤ t} (1 − x_{i_j})/n_0(t_j),
which is the definition of the Nelson–Aalen estimator for the cumulative hazard, considering only
the individuals with x_i = 0; and the estimated cumulative increment due to x_i = 1 is

    B̂_1(t) = Σ_{t_j ≤ t: x_{i_j} = 1} 1/n_1(t_j) − Σ_{t_j ≤ t: x_{i_j} = 0} 1/n_0(t_j),

which is the difference between the Nelson–Aalen estimator for the group with x_i = 1 and that
for the group with x_i = 0, as claimed.
7. (a) As described in the previous question, the difference may be estimated by the difference
between the Nelson–Aalen estimators:

    B̂_1(t) = Ĥ_0(t) − Ĥ_1(t) = Σ_{t_j ≤ t} ( d_{0j}/n_{0j} − d_{1j}/n_{1j} ).

Calling the Maintenance group number 1, and Nonmaintenance number 0, we read off of
Table 4.3 Ĥ_1(20) = Ĥ_1(18) = 0.32, and Ĥ_0(20) = 0.49, yielding B̂_1(20) = 0.49 − 0.32 = 0.17.
The variance will be the sum of the variances of the two estimators (since they are independent).
As long as there are no ties between events from different groups, this may be estimated by

    Σ_{t_j ≤ t} d_{0j}/n_{0j}² + Σ_{t_j ≤ t} d_{1j}/n_{1j}².
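A hedged R check of the nonparametric estimate above, computing the Nelson–Aalen cumulative hazards directly from the aml data in the survival package (here x == "Maintained" is group 1):

library(survival)
na.cumhaz <- function(time, status, upto) {
  tj <- sort(unique(time[status == 1 & time <= upto]))   # event times up to 'upto'
  sum(sapply(tj, function(t) sum(status == 1 & time == t) / sum(time >= t)))
}
H1 <- with(subset(aml, x == "Maintained"),    na.cumhaz(time, status, 20))
H0 <- with(subset(aml, x == "Nonmaintained"), na.cumhaz(time, status, 20))
c(H0 = H0, H1 = H1, difference = H0 - H1)   # compare with the 0.49, 0.32 and 0.17 above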
In Table 5.2 we tabulated the estimators for the baseline hazard, obtaining Ĥ0 (18) = 0.254.
A central estimate for the difference in cumulative hazard between the two groups would be
We see that this is a substantially larger estimate than we made in the nonparametric model.
This is consistent with the plot in Figure 5.4, where the purple circles and blue crosses
(representing the survival estimates from the proportional hazards model for the two groups)
are further apart at tj = 18 than the black and red lines (representing the Kaplan–Meier
estimators). This reflects the fact that the separate Kaplan–Meier estimators are cruder,
making larger jumps at less frequent intervals.
To estimate the standard error, we begin by assuming (with little justification) that the
estimators β̂ and Ĥ0 (tj ) are approximately independent. Then we can use the delta method
to estimate the variance. Let σ_β² be the variance of β̂, and σ_H² the variance of Ĥ_0(18). So we
can represent

    β̂ ≈ β_0 + σ_β Z,    Ĥ_0(18) ≈ H_0(18) + σ_H Z′,

where Z and Z′ are standard normal (also approximately independent). We already have
the estimate σ̂_β ≈ 0.512. We haven’t given a formula for an estimator of σ_H, but we can
easily compute it with R.
require(survival)
cp=coxph(Surv(time,status)~x, data=aml)
aml.fit=survfit(cp)
aml.fit$std.err[aml.fit$time==18]
[1] 0.150247
(Note that the approximation in the first line is based on assuming σ_β is much smaller than
β_0, which isn’t really very true here.) As long as we are assuming independence of Z and Z′,
the variance will be approximately

    ( e^{β_0} σ_β H_0(18) )² + ( (e^{β_0} − 1) σ_H )² = 0.325² + 0.225² = 0.156,
A better estimate, also taking into account the dependence between β̂ and Ĥ0 , could be
obtained by not using the delta method, but instead treating the normal distribution of β̂
as a Bayesian posterior distribution on β0 . For a range of possible β0 we can compute an
approximate mean and variance for Ĥ0 , and then compute a Monte Carlo estimator of the
variance of Γ̂.
(c) We let x0 (t) be the covariate trajectory for this individual, so recalling that the maintained
group is the baseline this means that
    x_0(t) = 0 for t ≤ 10,   and   x_0(t) = 1 for t > 10.
where ij is the individual who had an event at time tj , and Rj is the set of those at risk at
time tj . The log partial likelihood is
    ℓ_P(β) = Σ_{j=1}^{l} β^T x_{i_j} − Σ_{j=1}^{l} log Σ_{i ∈ R_j} e^{β^T x_i}.
(The first term could be simplified to β^T Σ_{i=1}^{n} δ_i x_i, but it is perhaps better understood in
the form stated here.) The score function has k-th component
    ∂ℓ_P/∂β_k = Σ_{j=1}^{l} [ x_{i_j,k} − Σ_{i ∈ R_j} x_{ik} e^{β^T x_i} / Σ_{i ∈ R_j} e^{β^T x_i} ].
That is, it is the total difference between k-th covariate of the individual with event at time
tj and the average k-th covariate of those at risk, weighted according to the relative risk with
parameter β.
(b) The observed partial information has (k, m) coordinate given by

    −∂²ℓ_P/∂β_k ∂β_m = Σ_{j=1}^{l} [ Σ_{i ∈ R_j} x_{ik} x_{im} e^{β^T x_i} / Σ_{i ∈ R_j} e^{β^T x_i}
                        − ( Σ_{i ∈ R_j} x_{ik} e^{β^T x_i} / Σ_{i ∈ R_j} e^{β^T x_i} ) ( Σ_{i ∈ R_j} x_{im} e^{β^T x_i} / Σ_{i ∈ R_j} e^{β^T x_i} ) ].
That is, it is the sum of covariances between the k-th and m-th components of individuals at
risk at time tj , where an individual is selected in proportion to relative risk.
2. (a) That would treat the categorical variable as though it were quantitative. That would force
the relative risks into particular proportions that have no empirical basis. There may be
good reason to expect the relative risk to increase with stage, but not to expect particular
proportions.
(b) We could fit the model without any covariates — so just find the Nelson–Aalen estimator— and
use that as a basis for adding in the stage as a covariate and checking the martingale residuals.
Here we will use age as an additional covariate. So we will fit the model αi (t) = α0 (t)eβ·age ,
and check for the behaviour of stage as an additional covariate. We show a box plot in figure
B.3, showing the distributions of martingale residuals for the 4 different stages. What we see
is that the residuals have essentially the same mean for stages 1 and 2, rise substantially for
stage 3, and somewhat less for stage 4.
(c) We would compute the scaled Schoenfeld residuals, and plot them as a function of event time.
If the proportional-hazards assumption holds — that is, if the proportionality parameter
associated with age is effectively constant — this should stay close to 0, with no apparent
patterns or trends.
(d) We compute the Breslow estimator Ĥ_0(t) for the baseline hazard. The Cox–Snell residual
is r_i = Ĥ_0(T_i) e^{β x_i}, where T_i is the time for individual i. We then compute a Nelson–Aalen
estimator for the right-censored times (ri , δi ). If the Cox model is a good fit, the estimated
cumulative hazard should look approximately like an upward sloping line through the origin
with slope 1.
3. (a) The R computation below shows that the coefficient for stage 2 is clearly not statistically
significant; the coefficient for stage 3 is borderline (p = 0.071); and the coefficient for stage 4
is highly significant (p = 0.000053).
(b)
require(survival)
require(KMsurv)

data(larynx)
lar.cph=coxph(Surv(time,delta)~age, data=larynx)

       coef  exp(coef)  se(coef)     z     p
age  0.0233       1.02    0.0145  1.61  0.11

Likelihood ratio test=2.63  on 1 df, p=0.105  n=90, number of events=50

lar.fit=survfit(lar.cph)

# The coxph object has a list of times
# We want to find the index of the time corresponding to individual i.
whichtime=sapply(larynx$time, function(t) which(lar.fit$time==t))

cumhaz=-log(lar.fit$surv[whichtime])

beta=lar.cph$coefficients
relrisk=exp(beta*(larynx$age-mean(larynx$age)))
# Baseline hazard is for mean value of covariate

resids=larynx$delta-cumhaz*relrisk
# Note: We could get the same numbers out as lar.cph$residuals
resids.bystage=lapply(1:4, function(i) resids[larynx$stage==i])
boxplot(resids.bystage, xlab='Stage', ylab='Martingale residual')
Figure B.3: Box plot of martingale residuals for larynx data, stratified by stage.
(c) The plot is shown in Figure B.4. We see that there seems to be no effect of the age variable
until age 70, after which it seems to increase linearly.
Figure B.4: Plot of martingale residuals against age for larynx data.
########## Residual plot to test age
aord=order(age)
resids=lar.cph2$residuals[aord]
plot(age[aord], resids, xlab='Age (Yrs)', ylab='martingale residual')
lines(lowess(resids~age[aord]), col=2)

########## New model with age starting from 70
newage=pmax(age[aord]-70, 0)
lar.cph=coxph(Surv(time,delta)~factor(stage)+newage, data=larynx)
Figure B.5: Plot of scaled Schoenfeld residuals to test whether age parameter is constant.
This estimates the slope of the change in the various Cox-model parameters over time. None
of the slopes is significantly different from 0.
(e) There seems to be a marked curvature of the residual plot, suggesting that the model is
underestimating the cumulative hazard later on.
lar.cph=coxph(Surv(time,delta)~factor(stage), data=larynx)
lar.fit=survfit(lar.cph)

whichtime=sapply(larynx$time, function(t) which(lar.fit$time==t))

cumhaz=-log(lar.fit$surv[whichtime])

beta=lar.cph$coefficients
# st2, st3, st4 are the binary stage indicators defined for part (a)
relrisk=exp(matrix(beta,1,3) %*% rbind(st2-mean(st2), st3-mean(st3), st4-mean(st4)))
coxsnell=c(relrisk*cumhaz)

CS.surv=Surv(coxsnell, delta[aord])
CS.fit=survfit(CS.surv~1)

plot(CS.fit$time, -log(CS.fit$surv), xlab='Time',
     ylab='Fitted cumulative hazard for Cox--Snell residuals')
abline(0,1, col=2)
[Plot of the fitted cumulative hazard for the Cox–Snell residuals against time, compared with the reference line of slope 1.]
In a small interval of time [t, t + dt) in which the cumulative regression coefficients B are
incremented by dB(t) = β(t) dt, the expected number of events is incremented by

    E[dN_i(t)] = (X(t) dB(t))_i.

Thus, conditioned on any past information, the expected increment to the martingale residual
at t = t_j is
Of course, the expected increment is 0 conditioned on no event at time t. Thus, the expectation
of Mres (t) is constant, and since it starts at 0, it is identically 0.
(b) We write Y(s) for the n × n diagonal matrix with the at-risk indicators on the diagonal. We
note that for right-censored data Y(s′)Y(s) = Y(s) for s′ ≤ s, and Y(s) dN(s) = dN(s)
because Y_i(s) = 0 implies that dN_i(s) = 0. Thus
Also, X(t) = Y(t)X, where we write X for X(0). Since M_res is constant except at times
s = t_j, we consider the increments at s = t_j, obtaining
Since this is true for all s, and since M_res(0) = 0, it must be true that X^T M_res(t) = 0 for all
t ≤ τ.
(c) The equation X^T M_res(τ) = 0 means that the n-dimensional vector M_res(τ) is orthogonal to
each of the p + 1 columns of X, that is, to each n-dimensional vector of covariate values. There
is no linear trend with respect to the covariates. (In other words, in the linear regression model
predicting M_res(τ) as a function of the covariates, the coefficients are all 0.)
If the additive hazards model is true, there should be no nonlinear effect of the covariates
on the martingale residuals. So one possible model test is to plot the martingale residuals
against nonlinear functions of the covariates — for instance, the square of a covariate, or a
product of two covariates — and look for trends. This is described briefly in section 4.2.4 of
[2], and more extensively in [1].
5. (a) The correct log partial likelihood would have been

    ℓ_P(β) = Σ_{j=1}^{k} β^T x_{i_j} − Σ_{j=1}^{k} log Σ_{i ∈ R_j} e^{β^T x_i}.
The doubled data set will produce two identical events at time t_j, and the risk sets R′_j will
have each individual covariate from R_j repeated. This produces a log partial likelihood

    ℓ′_P(β) = 2 Σ_{j=1}^{k} β^T x_{i_j} − 2 Σ_{j=1}^{k} log Σ_{i ∈ R′_j} e^{β^T x_i}
            = 2 Σ_{j=1}^{k} β^T x_{i_j} − 2 Σ_{j=1}^{k} log ( 2 Σ_{i ∈ R_j} e^{β^T x_i} )
derived from the duplicated data will be V′_n(β̂) = 4 V_n(β̂), while J′_n(β̂) = 2 J_n(β̂). Thus we
obtain the sandwich estimator from the duplicated data

    (J′_n)^{-1} V′_n (J′_n)^{-1} = (2J_n)^{-1} (4V_n) (2J_n)^{-1} = J_n^{-1} V_n J_n^{-1} ≈ J_n^{-1},
where p = 1/(λ + 1). This is the negative binomial distribution with parameters p and r.
(b) (Note: Apologies, the notation of the statement was somewhat confusing, because λ is the true
parameter of the NB distribution, but also the nominal Poisson parameter for the erroneous
model that we are fitting in this section. I will call this Poisson parameter λP .) The MLE
for the Poisson distribution is λ̂P = n−1 Ni . This will converge to the expected value
P
of the negative binomial, which is r/λ. We would expect this estimator to have variance
given by the variance of the Poisson distribution, which is the same as the expected value
r/nλ ≈ λ̂P /n. But the variance of the negative binomial is actually r(λ−1 + λ−2 )/n.
(c) A simple test would be to compare the sample mean to the sample variance. Under the
assumption of a Poisson distribution the sample variance

    σ̂² := (1/(n−1)) Σ_i (N_i − λ̂_P)²

has expected value

    E[σ̂²] = (n/(n−1)) E[ N_i² − 2 λ̂_P N_i + λ̂_P² ]
          = (n/(n−1)) [ λ_P² + λ_P − 2(λ_P² + λ_P/n) + λ_P/n + λ_P² ]
          = λ_P.
    Var(σ̂²) ≈ (λ_P + 4λ_P²)/n + O(n^{-2}).

(To compute this, it makes sense to write N_i = Ñ + λ_P and λ̂_P = λ̃ + λ_P. Then higher powers
of λ̃ have expectations O(n^{-1}), and products with Ñ are still of this sort.) Then we can take

    Z := (σ̂² − λ̂_P) / sqrt( (λ̂_P + 4λ̂_P²)/n )
The variance will be the sums of all the variances of the individual terms, plus the sums of all
the covariances of two different terms. We could calculate this exactly, but there would be a lot
of different terms to keep track of. We note, though, that for large n we can approximate this
by calculating only the first-order term in n. Each individual term contributes on the order
of n−6 to the sum. In total the number of terms is on the order of n3 , so their contribution is
on the order of n−3 in total. Covariances between terms involving entirely distinct Ni , Nj
will be 0 (by independence), so we need consider only covariances with duplications. The first
sum contributes only O(n3 ) of these, and pairs between the first and second sum contribute
only O(n4 ). The only contribution on the order of n5 is that made by pairs from the second
sum. These could have the same i or the same j or j′.
In total, there are n(n−1)(n−2) terms. If we are considering pairs of the form (N_i − N_j)(N_i −
N_{j′}) and (N_i − N_{j″})(N_i − N_{j‴}) with j’s all distinct there will be n(n − 1)(n − 2)(n − 3)(n − 4)
pairs (since we are taking account of order), and each one contributes
(since the central fourth moment of a Poisson random variable with parameter λ is 3λ² + λ).
Considering pairs of the form (N_i − N_j)(N_i − N_{j′}) and (N_{i′} − N_j)(N_{i′} − N_{j″}) we see that
there are also 2n(n − 1)(n − 2)(n − 3)(n − 4) pairs (since, having chosen i, j, j′ for the first
term we have two possibilities of which one to repeat for the second), and each contributes