GLMM Introduction for Tree Breeders
Table of Contents
Introduction
Linear Models
  Linear regression example
Linear Mixed Model
Generalized Linear Models
  Binary Data Example: Disease incidence probability
  Count Data Example: Number of trees infected
Generalized Linear Mixed Model
Overdispersion in Binomial and Poisson Regression Models
Example 1: Binomial Counts in Randomized Blocks
  Analysis as a GLM
  Analysis as a GLMM: Random block effects
  Analysis with Smooth Spatial Trends
  GLMM with ASReml
  Spatial R structure with ASReml
Example 2: Binary response variable with genetic effects
References
Introduction
Generalized Linear Mixed Models (GLMM) have attracted considerable attention over the last
few years. The word Generalized refers to non-normal distributions for the response variable, and
the word Mixed refers to random effects in addition to the usual fixed effects of regression
analysis. With the development of modern statistical packages such as SAS, R, and ASReml, a
large variety of statistical analyses are available to a larger audience. However, along with being
able to handle more sophisticated models comes a responsibility on the part of the user to be
informed on how these advanced tools work.
The objective of this workshop is to provide an introduction to generalized linear mixed models
by first discussing some of the assumptions and deficiencies of statistical linear models in
general, then giving examples of uses in common situations in the natural sciences.
The first section reviews linear models and regression analysis for simple and multiple
variables. Two numerical examples are solved using the SAS REG software.
The second section presents linear mixed models by adding the random effects to the linear
model. A simple numerical example is presented using the SAS MIXED Procedure.
The third (last) section introduces generalized linear models. Two illustrative examples of
binary and count data are presented using the SAS GLIMMIX procedure and ASReml
software.
Linear Models
Linear models (regression) are often used for modeling the relationship between a single variable
y, called the response or dependent variable, and one or more predictor, independent, or
explanatory variables, X1, ..., Xp. When p = 1 it is called simple regression, but when p > 1 it is
called multiple regression.
Regression analysis can be used to assess the effect of the explanatory variables on the
response variable. It is also a useful tool for predicting future observations or merely describing
the structure of the data.
To start with a simple example, suppose that y is the weight of trees and the predictors are the
height (X1) and the age (X2) of the trees. Typically the data will be available in the form of an
array like the following

(yi, xi1, xi2), i = 1, ..., n

where yi is the observation of the i-th tree and n is the number of observations.
There is an infinite number of ways to model the relationship between the response and the
explanatory variables. However, to keep it simple, the relationship can be modeled through a
linear function in the parameters as follows

y = β0 + β1 X1 + β2 X2 + ε

where βi for i = 0, 1, 2 are unknown parameters and ε is the error term. Thus, the problem is
reduced to the estimation of three parameters.
Notice that in a linear model the parameters enter linearly, but the predictors do not necessarily
have to be linear. For instance, consider the following two functions

y = β0 + β1 log(X1) + ε
y = β0 + exp(β1 X1) + ε

The first one is linear in the parameters, but the second one is not, because β1 enters through
the exponential.
Using matrix representation, the regression equation for the above example can be written as

y = Xβ + ε

where y = (y1, ..., yn)' is the vector of observations, X is the n x 3 matrix whose i-th row is
(1, xi1, xi2), β = (β0, β1, β2)', and ε = (ε1, ..., εn)'.

The estimation of β can be carried out using the least-squares approach. That is, we define β̂ as
the best estimate of β in the sense that it minimizes the sum of squared errors
(y - Xβ)'(y - Xβ). Differentiating with respect to β and setting the derivative equal to zero, it
can be shown that β̂ satisfies the normal equations

X'X β̂ = X'y,  so that  β̂ = (X'X)⁻¹ X'y  provided X'X is invertible.
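As a supplementary sketch (in Python rather than the SAS used in this workshop), the normal equations can be formed and solved directly. The toy data below are invented so that the true coefficients are known; they are not the workshop's tree data.

```python
# Form and solve the normal equations X'X beta = X'y for a
# two-predictor linear model, using only the standard library.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Noise-free data generated from y = 1 + 2*x1 + 3*x2, so least squares
# should recover beta = (1, 2, 3) up to rounding.
X = [[1.0, x1, x2] for x1 in range(5) for x2 in range(5)]
y = [1 + 2 * row[1] + 3 * row[2] for row in X]

XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
beta = solve(XtX, Xty)
```

In practice one would use a numerically safer decomposition (QR) rather than forming X'X explicitly, but the sketch mirrors the normal equations as written above.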
So far, we have not assumed any distributional form for the errors ε. The usual assumption is that
the errors are normally distributed, and in practice this is often, although not always, a reasonable
assumption.
If we assume that the errors are independent and identically normally distributed with mean 0
and variance σ², that is to say

εi ~ NID(0, σ²),

then the expectations of the observations are E(y) = Xβ, with Var(y) = σ²I.
Output from PROC REG for the simple regression (Height as the only predictor):

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       7193.24912    7193.24912     57.08   <.0001
Error             17       2142.48772     126.02869
Corrected Total   18       9335.73684

Root MSE          11.22625    R-Square    0.7705
Dependent Mean   100.02632    Adj R-Sq    0.7570
Coeff Var         11.22330

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1          -143.02692         32.27459     -4.43     0.0004
Height       1             3.89903          0.51609      7.55     <.0001
The intercept is -143.02692 and the slope for Height is 3.89903, so the fitted line is
ŷ = -143.03 + 3.90 · Height.
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2       7574.69595    3787.34798     34.41   <.0001
Error             16       1761.04089     110.06506
Corrected Total   18       9335.73684

Root MSE          10.49119    R-Square    0.8114
Dependent Mean   100.02632    Adj R-Sq    0.7878
Coeff Var         10.48843
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1          -155.29024         30.87234     -5.03     0.0001
Height       1             3.49864          0.52808      6.63     <.0001
Age          1             2.68549          1.44255      1.86     0.0811
Following the notation, the estimated parameters of the multiple linear regression are
β̂0 = -155.29024, β̂1 = 3.49864, and β̂2 = 2.68549. Thus, the relationship among the variables
can be expressed as:

ŷ = -155.29 + 3.50 · Height + 2.69 · Age
Linear Mixed Model
The linear mixed model extends the linear model by adding random effects:

y = Xβ + Zu + ε

where:
y is the n x 1 vector of observations,
β is a p x 1 vector of fixed effects,
u is a q x 1 vector of random effects,
ε is an n x 1 vector of random error terms,
X is the n x p design matrix for the fixed effects relating the observations y to β,
Z is the n x q design matrix for the random effects relating the observations y to u.
We assume that u and ε are uncorrelated random variables with zero means and covariance
matrices G and R, respectively:

E[u] = 0,  Var[u] = G
E[ε] = 0,  Var[ε] = R

Thus, the expectation and variance (V) of the observation vector y are given by:

E[y] = Xβ
V = Var[y] = ZGZ' + R
Understanding the V matrix is a very important component of working with mixed models since
it contains both sources of random variation and defines how these models differ from
computations with Ordinary Least Squares (OLS).
If you only have random-effects models (such as a randomized block design) the G matrix is the
primary focus. On the other hand, for repeated measures or for spatial analysis, the R matrix is
relevant.
If we also assume the random terms are normally distributed, that is

u ~ N(0, G)  and  ε ~ N(0, R),

then the observation vector will be normally distributed: y ~ N(Xβ, V).
For the general linear mixed model described above, Henderson's mixed model equations
(MME) can be used to find β̂ and û, the best linear unbiased estimator (BLUE) of β and the best
linear unbiased predictor (BLUP) of u, respectively:

[ X'R⁻¹X      X'R⁻¹Z        ] [ β̂ ]   [ X'R⁻¹y ]
[ Z'R⁻¹X      Z'R⁻¹Z + G⁻¹  ] [ û ] = [ Z'R⁻¹y ]
If the G and R matrices are known, generalized least squares can be used to estimate any linear
combination of the fixed effects β. However, as is usually the case, these matrices are not known,
so a complex iterative algorithm for fitting linear models must be used to estimate them.
Consider the following example. Suppose that we have collected data on the growth of different
trees measured in two different locations. We can assume that the trees come from a large
population, which is a reasonable assumption, and therefore, we will treat them as random.
Tree (t):      1    2    3    4    5
Location (l):  1    2    2    1    2
Height (y):   87   84   75   90   79
In matrix notation the model for these data is

y = Xβ + Zu + e

where y = (87, 84, 75, 90, 79)', X is the 5 x 2 incidence matrix of locations (its i-th row
indicates the location of tree i), Z is the incidence matrix of trees, β is the vector of location
effects, and u is the vector of random tree effects.
We can compute the solutions using R software or any other software that uses matrix algebra.
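As a hedged illustration of the mixed model equations (in Python rather than R, and with an assumed variance ratio, since the text does not give one), the solutions for this small example can be sketched directly:

```python
import statistics

# A sketch of Henderson's MME solutions for the 5-tree example above.
# Assumptions not stated in the text: each tree contributes a single
# observation (so Z = I), and the variance ratio
# lambda = sigma2_e / sigma2_tree equals 2 (purely illustrative).
y   = [87, 84, 75, 90, 79]
loc = [ 1,  2,  2,  1,  2]
lam = 2.0

# With Z = I the MME decouple: the BLUEs of the location effects are
# simply the location means...
blue = {l: statistics.mean([v for v, li in zip(y, loc) if li == l])
        for l in set(loc)}

# ...and the BLUPs of the tree effects are residuals shrunk toward
# zero by the factor 1 / (1 + lambda).
blup = [(v - blue[li]) / (1 + lam) for v, li in zip(y, loc)]
```

Shrinkage is what distinguishes BLUPs from ordinary residuals: the larger the assumed ratio σ²e/σ²tree, the more strongly the tree predictions are pulled toward zero.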
Generalized Linear Models
The linear (mixed) model rests on three key assumptions:
1. the relationship between the dependent variable and the fixed effects can be modeled
through a linear function,
2. the variance is not a function of the mean, and
3. the random terms follow a normal distribution.
Link       η = g(μ)               Inverse link
Identity   η = μ                  μ = η
Logit      η = ln(π / (1 - π))    π = 1 / (1 + e^(-η))
Log        η = ln(μ)              μ = e^η
Selection of inverse link functions is typically based on the error distribution. The logit link
function, unlike the identity link function, will always yield estimated means in the range of zero
to one. For most univariate link functions, link and inverse link functions are increasing
monotonic functions. In other words, an increase in the linear predictor results in an increase in
the conditional mean, but not at a constant rate.
Variance Function
The variance function is used to model non-systematic variability. Typically, with a generalized
linear model, residual variability arises from two sources. First, variability arises from the
sampling distribution itself. For example, a Poisson random variable with mean λ has a variance
of λ. Second, additional variability, or over-dispersion, is often observed.
The variance function v(μ) models the relationship between the variance of y and its mean μ.
Distribution   Variance function v(μ)
Normal         1
Binomial       μ(1 - μ)
Poisson        μ
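The variance functions in the table can be written as simple callables (a sketch; μ is the mean, a probability in the binomial case):

```python
# Variance functions for the three distributions in the table.
variance_fn = {
    "normal":   lambda mu: 1.0,            # constant; free scale sigma^2
    "binomial": lambda mu: mu * (1 - mu),  # largest at mu = 0.5
    "poisson":  lambda mu: mu,             # variance equals the mean
}

v_bin = variance_fn["binomial"](0.5)   # 0.25
v_poi = variance_fn["poisson"](3.7)    # 3.7
```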
Binary Data Example: Disease incidence probability
Consider the situation where individual seeds are laid on damp soil in different pots. In this
experiment, the pots are kept at different temperatures T for a number of days D. After an
arbitrary number of days, the seeds are inspected and the outcome y = 1 is recorded if a seed has
germinated and y = 0 otherwise. The probability of germination p can be modeled through a
linear function of the form:

logit(p) = ln(p / (1 - p)) = β0 + β1 T + β2 D

It is worth noting that p is the expected value of y for a binomial distribution. Also notice that
the non-linear relationship between the outcome p and the linear predictor is modeled by the
inverse link function. In this particular case, the link function is the logistic link function, or logit,
with inverse

p = 1 / (1 + exp(-(β0 + β1 T + β2 D)))
Figure 4: Disease incidence probability as a function of Temperature and Days. The values of
βi for i = 0, 1, 2 have been chosen for illustrative purposes.
Suppose illustrative values are chosen for β = (β0, β1, β2)'. Then the linear predictor takes the
form η = β0 + β1 T + β2 D, and the inverse link function is p = 1 / (1 + e^(-η)). Using the
model we can estimate how many days are needed for the probability of disease incidence to
exceed 80% at a given temperature. After some simple algebra it can be shown that at
temperature 10 at least 20 days are needed to reach a probability of 0.80.
The above example has no random effects, so it is a generalized linear model (GLM), not a
generalized linear mixed model (GLMM).
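The "simple algebra" can be sketched in Python. The coefficients b0, b1, b2 below are hypothetical placeholders (the document does not report the values used in the figure); they were picked only so that the answer lands near the 20 days quoted above.

```python
import math

def days_needed(p, T, b0, b1, b2):
    # Solve logit(p) = b0 + b1*T + b2*D for the number of days D.
    return (math.log(p / (1 - p)) - b0 - b1 * T) / b2

# b0, b1, b2 are made-up illustrative values, not estimates from data.
D = days_needed(0.80, T=10, b0=-5.0, b1=0.15, b2=0.25)
```

With these assumed coefficients D is about 19.5, so 20 whole days are needed before the modeled probability exceeds 0.80.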
Count Data Example: Number of trees infected
In Figure 5 the intensity λ as a function of age (A) and height (H) is plotted. Again, the values
of βi for i = 0, 1, 2 have been chosen for illustrative purposes only. The linear predictor takes
the form:

η = β0 + β1 A + β2 H = -2 - 0.03 A - 0.01 H,  with  λ = e^η.
Here is the SAS code to reproduce Figure 5:
goptions reset=all cback=white htitle=15pt htext=15pt;
data intensity;
do A=0 to 50 by 0.5;
do H=0 to 30 by 0.5;
I=exp(-2 - 0.03*A - 0.01*H);
output;
end;
end;
run;
proc g3d data=intensity;
title 'Intensity';
plot A*H=I / rotate=160 tilt=80 grid
xticknum=4 yticknum=3 zticknum=5
zmin=0 zmax=0.15
caxis = black ctop=blue cbottom=red;
label A='Age (years)' H='Height (meters)' I='lambda';
run; quit;
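For readers not using SAS, here is a minimal Python sketch of the same intensity surface that the G3D code plots:

```python
import math

# lambda = exp(-2 - 0.03*A - 0.01*H) over a grid of Age and Height,
# matching the constants in the SAS DATA step above.
def intensity(age, height):
    return math.exp(-2.0 - 0.03 * age - 0.01 * height)

grid = [(a, h, intensity(a, h))
        for a in range(0, 51, 10) for h in range(0, 31, 10)]
peak = max(i for _, _, i in grid)   # largest lambda, at A = 0, H = 0
```

Because both coefficients are negative, the intensity is largest (e^(-2) ≈ 0.135) for the youngest, shortest trees and decays smoothly across the grid.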
Figure 5: Intensity as a function of Age and Height. The inverse link function is defined as
λ = e^η, where η = -2 - 0.03 A - 0.01 H.
Notice that the above count data example does not include random effects, therefore, it is a
generalized linear model, not a generalized linear mixed model. In the next section the
generalized linear mixed model is presented.
Generalized Linear Mixed Model
In the generalized linear mixed model,
y represents the (n x 1) response vector,
X the (n x p) design matrix of rank k for the (p x 1) fixed effects and
Z the (n x q) design matrix for the (q x 1) random effects u.
The random effects u are assumed to be normally distributed with mean 0 and variance matrix
G, that is to say

u ~ N(0, G),  i.e.  E[u] = 0 and Var[u] = G.
The fixed and random effects are combined to form a linear predictor

η = Xβ + Zu

The model for the vector of observations y is obtained by adding a vector of residuals, ε, as
follows:

y = g⁻¹(η) + ε

The relationship between the linear predictor and the vector of observations is modeled as

y | u ~ (g⁻¹(Xβ + Zu), R)

The above notation denotes that the conditional distribution of y given u has mean g⁻¹(η) and
variance R. The conditional distribution of y | u is usually referred to as the error distribution.
Note that instead of specifying a distribution for y, as in the case of a GLM, we now specify a
distribution for the conditional response, y | u. This formulation is also known as the conditional
model specification.
Last, the variance matrix of the observations is given by:

Var[y | u] = A^(1/2) R A^(1/2)

where the matrix A is a diagonal matrix that contains the variance functions of the model.
The class of generalized linear mixed models contains several important types of statistical
models. For example,
Linear models: no random effects, identity link function, and normal distribution
Generalized linear models: no random effects present
Linear mixed models: random effects, identity link function, and normal distribution
The generalized linear mixed models have been developed to address the deficiencies of linear
mixed models.
There are many cases when the implied assumptions are not appropriate.
For instance, the linear mixed model assumes that the relationship between the mean of the
dependent variable y and the fixed and random effects can be modeled through a linear function.
This assumption is questionable, for example, in modeling disease incidence.
Another assumption of the linear mixed model is that the variance is not a function of the mean
and that the random effects follow a normal distribution. The assumption of constant variance is
violated when analyzing a zero/one trait, such as diseased (1) or not diseased (0). In this case, the
response variable is binomial, so for a predicted disease incidence p the variance is

Var(y) = p(1 - p),

which is a function of the mean.
The assumption of normality is not valid for a binary trait. The outcome is a random variable that
can only take two values, zero or one. In contrast, the normal distribution is a bell shaped curve
that can take any real number.
Finally, predictions from linear mixed models can take any value, whereas predictions for a
binary variable are bounded between 0 and 1 and predictions for a count variable cannot take
negative values.
           No    Yes    Sum
Sample1    30     70    100
Sample2    20    180    200
Sum        50    250    300
Sample1
Probability of a Yes outcome = 70/100 (0.70)
Probability of a No outcome = 30/100 (0.30)
The odds of the outcome in Sample1: Odds = p / (1 - p) = 0.70 / 0.30 = 2.33
We expect 2.3 times as many occurrences as non-occurrences in Sample1.
Sample2
Probability of a Yes outcome = 180/200 (0.90)
Probability of a No outcome = 20/200 (0.10)
The odds of Outcome in Sample2 Odds = p / (1-p) = 0.90 /0.10 = 9
We expect 9 times as many occurrences as non-occurrences in Sample2.
The odds ratio of Sample2 to Sample1 is OR = 9 / 2.33 = 3.86
The odds of having outcome (Yes) in Sample2 is almost 4 times those in Sample1.
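The odds and odds-ratio arithmetic above can be written as a short sketch:

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

odds1 = odds(70 / 100)    # Sample1: 0.70 / 0.30 ~ 2.33
odds2 = odds(180 / 200)   # Sample2: 0.90 / 0.10 = 9.0
OR = odds2 / odds1        # ~ 3.86
```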
Over-dispersion results when the data appear more dispersed than is expected under some
reference model. It may occur with count data analyzed with binomial or Poisson regression
models, since the variance of both distributions is a function of the mean. That is,

Var[Y] = f(E[Y]) · φ

With both distributions the scale parameter φ (phi) is assigned a value of 1.
To understand what over-dispersion implies, first review the linear regression model (computed
with ordinary least squares). Under the normal distribution, data are never over-dispersed because
the mean and variance are not related. The expectations of a linear model (y = X'β + e) are

e ~ NID(0, σ²e)

The variance of the residuals (σ²e) is assumed constant for all linear combinations of the
covariates. This variance is estimated from the data and can assume any value greater than zero,
no matter what the mean value is. Thus, the response values are assumed to have constant
variance:

Var(y) = σ²e · 1

The normal errors and identity link function (linear regression) have the variance function
v(μ) = 1. This variance is constant for all yi.
For a generalized linear model:

g(μ) = X'β

where μ = E(y) and g is the link function. The variance of y is:

Var(y) = φ · v(μ)

That is, the variance of an observation equals some constant φ (the scale parameter) times a
function of the mean of y.
For a binary variable y, the variance is a multiplicative function of its mean: Var(y) = μ(1 - μ).
Under the Poisson distribution the variance of y is the mean itself: Var(y) = E(y) = μ.
In either case, the observed counts have variances that are functions of the mean. That is, the
variance of y depends on the expectation of y, which is estimated from the data.
When either model is fit under the assumption that the data were generated from a binomial
distribution or by a Poisson process, the scale parameter φ is automatically set equal to 1. That is
why we see 1.00 for the error variance in ASReml output or in SAS GENMOD procedure output
when we fit Poisson or binomial distributions. The value 1 is not an error variance; it is a scale
parameter and should not be used as a variance component to calculate heritability for a binomial
distribution.
For binomial and Poisson regression models, the covariance matrix (and hence the standard errors
of the parameter estimates) is estimated under the assumption that the chosen model is appropriate.
More variation in the data may be present than is expected by the distributional assumption. This is
called over-dispersion (also known as heterogeneity) which typically occurs when the
observations are correlated or are collected from "clusters".
To identify possible over-dispersion in the data for a given model, divide the deviance by its
degrees of freedom; this ratio is called the dispersion parameter. If the deviance is reasonably
"close" to the degrees of freedom (i.e., the scale parameter ≈ 1), then evidence of over-dispersion
is lacking.
Dispersion parameter (or scaled deviance) = Deviance / DF
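The check can be sketched numerically. The chi-square and DF below are the values reported for the Hessian fly GLM in this document (Pearson chi-square 106.74 on 45 DF); the standard error is a made-up number, shown only to illustrate the sqrt(φ) inflation used by quasi-likelihood corrections.

```python
import math

chi_square, df = 106.74, 45
phi = chi_square / df                    # dispersion parameter, ~2.37

se_naive = 0.25                          # hypothetical model-based SE
se_adjusted = se_naive * math.sqrt(phi)  # inflated by ~54%
```

A ratio this far above 1 is the same signal discussed in the GLIMMIX output later in the example: standard errors computed under the nominal binomial variance would be too small.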
A scale parameter that is greater than 1 does not necessarily imply over-dispersion is present. This
can also indicate other problems, such as an incorrectly specified model (omitted variables,
interactions, or non-linear terms), an incorrectly specified functional form (an additive rather than
a multiplicative model may be appropriate), as well as influential or outlying observations.
If you believe you have correctly specified the model and the scale estimate is greater than 1, then
conclude your data are over-dispersed. You should be able to identify possible reasons why your
data are over-dispersed. If you do not correct for over-dispersion, the estimates of the standard
errors are too small which leads to biased inferences (i.e. you will observe smaller p-values than
you should and thus make more Type I errors). As a result, confidence intervals will also be
incorrect.
When you have the "correct" model, outliers are not a problem, and the scaled deviance is large,
there are various choices for SAS procedures (GENMOD, GLIMMIX, NLMIXED) or for
ASReml to correct for over-dispersion.
Example 1: Binomial Counts in Randomized Blocks
The response is the number of damaged plants, Y, out of n plants in each experimental unit. A
SAS DATA step of the following form reads the data:

data HessianFly;
   label Y = 'No. of damaged plants'
         n = 'No. of plants';
   input block entry lat lng n Y @@;
   datalines;
1 16 1 2  9  1
1  6 1 4  9  9
1 15 2 2 14  7
1  5 2 4 11  8
1 12 3 2 11  8
1  3 3 4 12  5
1  9 4 2 15  8
1  1 4 4  8  7
2  3 5 2 11  9
2  2 5 4  9  9
2  7 6 2 10  8
2  6 6 4 10  7
2 13 7 2  6  0
2 16 7 4  9  0
2  1 8 2 13 12
2  4 8 4 14  7
3 13 1 6  7  0
3 14 1 8  9  0
3  4 2 5 15 11
3 10 2 6  9  7
3  3 2 7 15 11
3  9 2 8 13  5
3  6 3 5 16  9
3  1 3 6  8  8
3 15 3 7  7  0
3 12 3 8 12  8
3 11 4 5  8  1
3 16 4 6 15  1
3  5 4 7 12  7
3  2 4 8 16 12
4  9 5 5 15  8
4  4 5 6 10  6
4 12 5 7 13  5
4  1 5 8 15  9
4 15 6 5 17  6
4  6 6 6  8  2
4 14 6 7 12  5
4  7 6 8 15  8
4 13 7 5 13  2
4  8 7 6 13  9
4  3 7 7  9  9
4 10 7 8  6  6
4  2 8 5 12  8
4 11 8 6  9  7
4  5 8 7 11 10
4 16 8 8 15  7
;
If infestations are independent among experimental units, and all plants within a unit have the
same propensity of infestation, then the Yi are binomial random variables.
Figure 1. Data visualization and summary is an important step before any statistical analysis.
The chart shows large differences between varieties for infestation. The horizontal dashed line
shows the overall mean incidence (0.54).
Analysis as a GLM
Let's consider first a standard generalized linear model for independent binomial counts. The
SAS statements would be as follows:

proc glimmix data=HessianFly;
   class block entry;
   model Y/n = block entry;
run;
Model Information
Data Set                     WORK.HESSIANFLY
Response Variable (Events)   Y
Response Variable (Trials)   n
Response Distribution        Binomial
Link Function                Logit
Variance Function            Default
Variance Matrix              Diagonal
Estimation Technique         Maximum Likelihood
Degrees of Freedom Method    Residual
The GLIMMIX procedure recognizes that this is a model for uncorrelated data (variance matrix
is diagonal) and that parameters can be estimated by maximum likelihood.
The Class Level Information table lists the levels of the variables specified in the CLASS
statement and the ordering of the levels.
Class Level Information
Class   Levels   Values
block        4   1 2 3 4
entry       16   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Number of Observations Read   64
Number of Observations Used   64
Number of Events             396
Number of Trials             736

Dimensions
Columns in X                  21
Columns in Z                   0
Subjects (Blocks in V)         1
Max Obs per Subject           64
Because of the absence of random effects in this model, there are no columns in the Z matrix.
The 21 columns in the X matrix comprise the intercept, 4 columns for the block effect and 16
columns for the entry effect.
The Fit Statistics table lists information about the fitted model.
Fit Statistics
-2 Log Likelihood            265.69
AIC (smaller is better)      303.69
AICC (smaller is better)     320.97
BIC (smaller is better)      344.71
CAIC (smaller is better)     363.71
HQIC (smaller is better)     319.85
Pearson Chi-Square           106.74
Pearson Chi-Square / DF        2.37
The -2 Log Likelihood values are useful for comparing nested models, and the information
criteria AIC, AICC, BIC, CAIC, and HQIC are useful for comparing non-nested models. On
average, the ratio between the Pearson Chi-square statistic and its degrees of freedom should
equal one in GLMs. Values larger than one are indicative of over-dispersion. With a ratio of
2.37, these data appear to exhibit more dispersion than expected under a binomial model with
block and varietal effects.
The Type III Tests of Fixed Effect table displays significance tests for the two fixed effects in
the model.
Type III Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
block         3       45      1.42   0.2503
entry        15       45      6.96   <.0001
These tests are Wald-type tests, not likelihood ratio tests. The entry effect is clearly significant in
this model with a p-value of < 0.0001, indicating that the 16 wheat varieties are not equally
susceptible to damage by the Hessian fly.
Analysis as a GLMM: Random block effects
Treating the block effects as random changes the estimates compared to a model with fixed
block effects.
Selected tables of output are given below.
Model Information
Data Set                     WORK.HESSIANFLY
Response Variable (Events)   Y
Response Variable (Trials)   n
Response Distribution        Binomial
Link Function                Logit
Variance Function            Default
Variance Matrix              Not blocked
Estimation Technique         Residual PL
Degrees of Freedom Method    Containment
In the presence of random effects and a conditional binomial distribution, PROC GLIMMIX
does not use maximum likelihood for estimation. Instead, the GLIMMIX procedure applies a
restricted (residual) pseudo-likelihood algorithm.
The Dimensions table has changed from the previous model. The Dimensions table indicates
that there is a single G-side parameter, the variance of the random block effect.
Dimensions
G-side Cov. Parameters        1
Columns in X                 17
Columns in Z                  4
Subjects (Blocks in V)        1
Max Obs per Subject          64
Note that although the block effect has four levels, only a single variance component is
estimated. The Z matrix has four columns, however, corresponding to the four levels of the block
effect. Because no SUBJECT= option is used in the RANDOM statement, the GLIMMIX
procedure treats these data as having arisen from a single subject with 64 observations.
The Optimization Information table indicates that a Quasi-Newton method is used to solve the
optimization problem. This is the default method for GLMM models.
Optimization Information
Optimization Technique        Dual Quasi-Newton
Parameters in Optimization    1
Lower Boundaries              1
Upper Boundaries              0
Fixed Effects                 Profiled
Starting From                 Data
The Fit Statistics table shows information about the fit of the GLMM.
Fit Statistics
-2 Res Log Pseudo-Likelihood   182.21
Generalized Chi-Square         107.96
Gener. Chi-Square / DF           2.25
The generalized chi-square statistic measures the residual sum of squares in the final model and
the ratio with its degrees of freedom is a measure of variability of the observation about the mean
model. The over-dispersion parameter (2.25) is still larger than 1.
The variance of the random block effects in the following table is rather small. The random
block model does not provide a suitable adjustment for dispersion.
Covariance Parameter Estimates
Cov Parm   Estimate   Standard Error
block       0.01116          0.03116
Because the block variance component is small, the Type III test for the variety effect is affected
only very little compared to the standard GLM.
Type III Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
entry        15       45      6.90   <.0001
Analysis with Smooth Spatial Trends
In the experimental design, the researchers recorded row-column (latitude and longitude)
coordinates for each observation, which makes it possible to account for small-scale (micro-site)
spatial variation.
Model Information
Data Set                     WORK.HESSIANFLY
Response Variable (Events)   Y
Response Variable (Trials)   n
Response Distribution        Binomial
Link Function                Logit
Variance Function            Default
Variance Matrix Blocked By   Intercept
Estimation Technique         Residual PL
Degrees of Freedom Method    Containment
Dimensions
R-side Cov. Parameters        2
Columns in X                 17
Columns in Z per Subject      0
Subjects (Blocks in V)        1
Max Obs per Subject          64
Fit Statistics
-2 Res Log Pseudo-Likelihood   158.85
Generalized Chi-Square         121.51
Gener. Chi-Square / DF           2.53
Covariance Parameter Estimates
Cov Parm   Subject     Estimate   Standard Error
SP(EXP)    Intercept     0.9052           0.4404
Residual                 2.5315           0.6974
The sill of the spatial process, the variance of the underlying residual effect, is estimated as
2.5315. The SP(EXP) estimate, 0.9052, is one-third of the practical range of the spatial process.
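Under SAS's SP(EXP) structure the residual correlation at distance d is exp(-d/ρ), so the practical range (where the correlation drops to about 0.05) is roughly 3ρ. A quick sketch with the estimate reported above:

```python
import math

rho = 0.9052   # SP(EXP) estimate from the output above

def corr(d):
    """Residual correlation at distance d under SP(EXP)."""
    return math.exp(-d / rho)

practical_range = 3 * rho            # ~2.72 grid-distance units
low_corr = corr(practical_range)     # exp(-3) ~ 0.0498
```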
Type III Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
entry        15       48      3.60   0.0004
The F value (3.60) for the entry effect has been sharply reduced compared to the previous
analyses; the smooth spatial variation accounts for some of the variation among the varieties.
The following plot compares the LS-means of the varieties. Varieties with negative LS-means
have less infestation with Hessian fly.
Conclusions
In this example three models were considered for the analysis of a randomized block design with
binomial outcomes.
If data are correlated, a standard generalized linear model often will indicate over-dispersion
relative to the binomial distribution. Two courses of action are considered in this example to
address this over-dispersion.
First, the inclusion of G-side random effects models the correlation indirectly; it is induced
through the sharing of random effects among responses from the same block.
Second, the R-side spatial covariance structure models covariation directly.
In generalized linear (mixed) models, these two modeling approaches can lead to different
inferences, because the models have different interpretations. The random block effects are
modeled on the linked (logit) scale, while the spatial effects are modeled on the mean scale.
Only in a linear mixed model are the two scales identical.
GLMM with ASReml
The same analysis can be run in ASReml with a job file of the following form:

Title: Hessianfly.
#Damaged,Plants,block,Entry,lat,lng
#2,8,1,14,1,1
#1,9,1,16,1,2
 y
 N
 block *
 entry *
 lat *
 lng *
 yRatio !=y !/N
Hessianfly.csv !SKIP 1
y !bin !TOTAL=N ~ mu entry

ASReml reports the distribution and link as Binomial; Logit, with Mu=P=1/(1+exp(-XB)) and
variance V=Mu(1-Mu)/N.
Warning: The LogL value is unsuitable for comparing GLM models
Notice: 1 singularities detected in design matrix.

1 LogL=-46.8426   S2= 1.0000   48 df   Dev/DF= 2.671
2 LogL=-46.8446   S2= 1.0000   48 df   Dev/DF= 2.671
3 LogL=-46.8446   S2= 1.0000   48 df   Dev/DF= 2.671
4 LogL=-46.8446   S2= 1.0000   48 df   Dev/DF= 2.671
5 LogL=-46.8446   S2= 1.0000   48 df   Dev/DF= 2.671
The heterogeneity factor [Deviance / DF] gives some indication as how well the discrete
distribution fits the data. A value greater than 1 suggests the data are over-dispersed, that is the
data values are more variable than expected under the chosen distribution.
Source       Model terms   Gamma     Component   Comp/SE   % C
Variance     64   48       1.00000   1.00000     0.00      0 F

Wald F statistics
Source of Variation   NumDF   DenDF   F-inc   P-inc
7 mu                      1    48.0    4.64   0.036
4 entry                  15    48.0    6.87   <.001

LogL Converged
The F-value for the Entry (plant varieties) is large and significant.
We can adjust for heterogeneity (over-dispersion) by using the !DISPERSION qualifier in
ASReml. The dispersion parameter is estimated from the residuals if it is not supplied by the
analyst. Here is the model statement to account for over-dispersion:
y !bin !TOTAL=N !dispersion ~ mu entry
Output:

Wald F statistics
Source of Variation   NumDF   DenDF   F-inc   P-inc
8 mu                      1    48.0    2.05   0.159
4 entry                  15    48.0    3.03   0.002
After adjusting for heterogeneity in the variance we see a much smaller F test for Entry. It is still
significant. The predictions for plant varieties do not change but their standard errors change.
Spatial R structure with ASReml
A two-dimensional first-order autoregressive residual structure over the 8 x 8 field grid can also
be fitted in ASReml. The estimated autocorrelations and the adjusted Wald tests are:

Model terms    Gamma           Component    Comp/SE   % C
AR=AutoR  8    -0.183224       -0.183224    -1.62     0 U
AR=AutoR  8     0.709363E-01    0.0709363    0.70     0 U

Wald F statistics
Source of Variation   NumDF   DenDF   F-inc   P-inc
8 mu                      1     7.1    5.12   0.058
4 entry                  15    43.0    7.42   <.001
Example 2: Binary response variable with genetic effects
Seedlings were grown in a greenhouse and inoculated with Phytophthora cinnamomi, a soil-
borne pathogen. Subsequently, survival or mortality of each seedling was assessed biweekly.
Fraser fir was used as control. The objective of the research was to examine the genetic variation
between and within seed sources of Turkish firs and estimate heritability values for disease
susceptibility.
The first 10 lines of data are presented here
Sort,Rep,Tray,Species,Prov,Family,Tree,Wk2,Wk4,Wk6,Wk8,Wk10,Wk12,Wk14,Wk16
1,1,1,Turkish,SAF,120, 1, 0,0,0,0,0,0,0,0
2,1,1,Turkish,SAF,120, 2, 0,0,0,0,0,0,0,0
3,1,1,Turkish,SAF,120, 3, 0,0,0,0,0,0,0,0
4,1,1,Turkish,SAF,120, 4, 0,0,0,0,0,0,0,0
5,1,1,Turkish,SAF,120, 1, 0,0,0,0,0,0,0,0
6,1,1,Turkish,SAF,120, 6, 0,0,0,0,0,0,0,0
7,1,1,Turkish,SAF,120, 7, 0,0,0,0,0,0,0,0
8,1,1,Turkish,SAF,120, 8, 0,0,0,0,0,0,0,0
9,1,1,Turkish,SAF,120, 9, 0,0,0,0,0,0,0,0
10,1,1,Turkish,SAF,120,10, 0,0,0,0,0,0,0,0
The means were plotted against time (weeks) to examine the trends (linear, quadratic, etc.) in
disease incidence and to visually depict the interactions of species and provenances with time.
[Figure: Mortality (%) from week 2 to week 16 for the four species. Fraser fir (the control)
shows the highest mortality and Momi fir the lowest, with Trojan and Turkish firs intermediate.]
# Data file
!PART 1
Wk_16 !bin !logit ~ mu Rep*Prov ,
      !r Prov.Family Rep.Prov.Family

Note: in the job file there must be a blank field (a leading space) before the field names.
Notice that we define the distribution using the !BIN qualifier (binomial) and the underlying link
function using the !LOGIT qualifier in ASReml. The logit is the default link function. The
variance on the underlying scale is π²/3 = 3.28987 ≈ 3.290 (underlying logistic distribution) for
the logit link (Gilmour et al. 2006).
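The fixed logit-scale residual variance π²/3 is what enters heritability calculations on the underlying scale. The sketch below uses an invented family-variance value, and the half-sib formula shown is one common choice, not necessarily the exact formula used in this study:

```python
import math

# Residual variance on the underlying logit scale (logistic distribution).
logistic_var = math.pi ** 2 / 3        # ~3.290

# Hypothetical family variance component, for illustration only.
sigma2_family = 0.25

# A common half-sib heritability sketch on the underlying scale.
h2 = 4 * sigma2_family / (sigma2_family + logistic_var)
```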
The results are:

 ASReml 3.0 [01 Jan 2009]  Title: Pc inoculation master thru wk16 filtered.
 Build fl [ 2 Sep 2009] 64 bit, 32 Mbyte, Windows x64
 31 Jul 2011 11:52:06.419  Tfir21_Wk_16
 Licensed to: North Carolina State University  30-sep-2011
 ***********************************************************
 * Contact [email protected] for licensing and support *
 ***************************************************** ARG *
 Folder: C:\Users\fisik\Documents\_Research\PROJECTS\Christmas Tree\Phytophthora Inoculations\ASREML
  Rep *
  Species !A
  Prov !A
  Family !I
 QUALIFIERS: !SKIP 1  !DDF 2
 ! Turkish fir
 QUALIFIER: !DOPART 1 is active
 Reading Tfir_dat.csv FREE FORMAT skipping 1 lines
Univariate analysis of Wk_16
Summary of 3662 records retained of 3675 read
 Model term               #miss  #zero  MinNon0    Mean        MaxNon0  StndDevn
   1 Rep                                1           2.4913        4
   2 Tray                               1           4.6182        9
   3 Species                            1           1.0000        1
   4 Prov                               1           2.5800        4
   5 Family                             1          32.8610       66
   6 Tree             18     0      0   1           7.0866       18
   7 Wk_2                    0   3662   0.000       0.000         0.000   0.000
   8 Wk_4                    0   3578   1.000       0.2294E-01    1.000   0.1497
   9 Wk_6                    0   3394   1.000       0.7318E-01    1.000   0.2605
  10 Wk_8                    0   3176   1.000       0.1327        1.000   0.3393
  11 Wk_10                   0   2988   1.000       0.1841        1.000   0.3876
  12 Wk_12                   0   2751   1.000       0.2488        1.000   0.4324
  13 Wk_14                   0   2531   1.000       0.3088        1.000   0.4621
  14 Wk_16  Variate          0   2379   1.000       0.3504        1.000   0.4771
  15 mu                  1
  16 Rep.Prov           16    1 Rep         :   4    4 Prov         :   4
  17 Prov.Family       264    4 Prov        :   4    5 Family       :  66
  18 Rep.Prov.Family  1056    1 Rep         :   4   17 Prov.Family  : 264

 Forming 1345 equations: 25 dense.
 Initial updates will be shrunk by factor 0.010
 Notice: Algebraic Denominator DF calculation is not available.
         Numerical derivatives will be used.
 Distribution and link: Binomial; Logit  Mu=P=1/(1+exp(-XB))  V=Mu(1-Mu)/N
 Warning: The LogL value is unsuitable for comparing GLM models.
 Notice: 9 singularities detected in design matrix.

  1 LogL=-4751.08   S2=  1.0000   3646 df   Dev/DF= 1.126
  2 LogL=-4751.23   S2=  1.0000   3646 df   Dev/DF= 1.126
  3 LogL=-4752.53   S2=  1.0000   3646 df   Dev/DF= 1.124
  4 LogL=-4757.13   S2=  1.0000   3646 df   Dev/DF= 1.119
  5 LogL=-4765.90   S2=  1.0000   3646 df   Dev/DF= 1.114
  6 LogL=-4778.73   S2=  1.0000   3646 df   Dev/DF= 1.109
  7 LogL=-4785.19   S2=  1.0000   3646 df   Dev/DF= 1.106
  8 LogL=-4786.54   S2=  1.0000   3646 df   Dev/DF= 1.106
  9 LogL=-4786.64   S2=  1.0000   3646 df   Dev/DF= 1.106
 10 LogL=-4786.64   S2=  1.0000   3646 df   Dev/DF= 1.106
 11 LogL=-4786.64   S2=  1.0000   3646 df   Dev/DF= 1.106

 Final parameter values   0.45778   0.12291   1.0000
 Deviance from GLM fit    3646      4031.00
 Variance heterogeneity factor [Deviance/DF]   1.11
 - - - Results from analysis of Wk_16 - - -

 Notice: While convergence of the LogL value indicates that the model has
 stabilized, its value CANNOT be used to formally test differences between
 Generalized Linear (Mixed) Models.
 Approximate stratum variance decomposition
 Stratum            Degrees-Freedom   Variance    Component Coefficients
 Prov.Family             39.25        2.06632        4.2   1.0
 Rep.Prov.Family         11.64        0.122911       0.0   1.0

 Source            Model  terms   Gamma      Component   Comp/SE   % C
 Prov.Family         264    264   0.457785   0.457785     4.14     0 P
 Rep.Prov.Family    1056   1056   0.122911   0.122911     2.41     0 P
 Variance           3662   3646   1.00000    1.00000      0.00     0 F

 Wald F statistics
    Source of Variation   NumDF   DenDF   F-inc   P-inc
 15 mu                        1    58.7   60.66   <.001
  1 Rep                       3   158.6    6.09   <.001
  4 Prov                      3    59.1    9.79   <.001
 16 Rep.Prov                  9   167.8    0.85   0.567

 Notice: The DenDF values are calculated ignoring fixed/boundary/singular
 variance parameters using numerical derivatives.
 Warning: These Wald F statistics are based on the working variable and are
 not equivalent to an Analysis of Deviance. Standard errors are scaled by the
 variance of the working variable, not the residual deviance.

 17 Prov.Family        264 effects fitted ( 198 are zero)
 18 Rep.Prov.Family   1056 effects fitted ( 792 are zero)
 Finished: 31 Jul 2011 11:52:09.270   LogL Converged
We are interested in the variance components (the Component column) in this study to understand the
effects of genetics and environment on disease incidence. Heritability will tell us about the effect of
genetics (family differences) on the incidence relative to the phenotypic variance.
The family effect and other random effects are on the logistic scale with a variance of π²/3 = 3.28987
(Gilmour et al. 1985). Because we have wind-pollinated families assumed to be half-siblings, the
variance explained by the family effect is 1/4 of the additive genetic variance (Falconer and Mackay
1996). The total additive genetic variance is therefore 4 × Var(Prov.Family).
We are mostly interested in selection of families, and thus the heritability of interest is the
family-mean heritability:

H²f = σ²f / (σ²f + σ²rf / r + σ²e / n)

where σ²f is the aggregate family variance component across provenances, σ²rf is the
replication-by-family interaction variance, σ²e is the fixed error variance, r is the number of
replications, and n is the number of seedlings per family. The error variance was set to 3.29 in the
calculation of phenotypic variances, as suggested by Gilmour et al. (1985). Standard errors of
heritabilities were estimated using the Delta method (Lynch and Walsh 1998).
The denominator in the above formula is the phenotypic variance of family means; r is the number of
replications (4) and n is the number of seedlings per family (52 on average). Using the numbers from
the Component column of the output given above, the heritability is

H²f = 0.457785 / (0.457785 + 0.1229/4 + 3.29/52) = 0.83
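The arithmetic above is easy to reproduce with a few lines of Python; this is a minimal sketch (variable names are ours) using the values from the Component column:

```python
# Family-mean heritability from the reported variance components
var_family  = 0.457785   # Prov.Family component
var_rep_fam = 0.122911   # Rep.Prov.Family component
var_error   = 3.29       # error variance fixed at pi^2/3 on the logit scale
r, n = 4, 52             # replications; average seedlings per family

h2_family = var_family / (var_family + var_rep_fam / r + var_error / n)
print(round(h2_family, 2))  # 0.83
```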
Output:

   Total         3.871    0.1166
   Pheno         3.871    0.1166
   Pheno_Fam     0.5483   0.1099
   ErrorVar      3.290    0.000
   AddVar        1.831    0.4421
   P_Fam    = Family/Total       = 0.1183   0.0254
   P_RepFam = Rep.Fam/Total      = 0.0318   0.0129
   P_Error  = ErrorVar/Total     = 0.8500   0.0256
   H2I      = AddVar/Pheno       = 0.4731   0.1016
   H2F      = Family/Pheno_Fam   = 0.8349   0.0403

 Notice: The parameter estimates are followed by their approximate
 standard errors.
Additive genetic variance is 1.83 ± 0.442. The family effect explained about 12% of the total
variance (0.118) observed in the study. Family-mean heritability is 0.83, which is high, suggesting
that if we select families with low disease incidence and use them for plantations, we will be able to
control the disease successfully.
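The remaining functions in the output follow from the same three components; this sketch (ours) reproduces the total variance, the additive variance for half-sib families, and the proportions reported above:

```python
# Variance ratios from the .pin output, recomputed from the components
var_family, var_rep_fam, var_error = 0.457785, 0.122911, 3.29

total    = var_family + var_rep_fam + var_error   # total (phenotypic) variance
add_var  = 4 * var_family                         # half-sibs: sigma2_A = 4 * sigma2_family
p_family = var_family / total                     # proportion due to families
h2_ind   = add_var / total                        # individual-tree heritability

print(round(total, 3), round(add_var, 3), round(p_family, 4), round(h2_ind, 4))
# 3.871 1.831 0.1183 0.4731
```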
The following are a few lines from the prediction file of ASReml (.sln). The predictions from the
model are on the logit scale, and they do not include the effect of the provenance from which the
families were sampled.
 EFFECT   LEVEL    BLUP      Stderr
 Family   ULU       0.3549   0.2242
 Family   AKY       0.000    0.000
 Family   BOL      -1.086    0.2838
 Family   SAF      -1.150    0.2553
 Family   1         1.376    0.2915
 Family   2         0.6037   0.3092
 Family   3        -0.9268   0.4280
 Family   4         0.4487   0.3716
In order to rank families across the provenances we need to add the predicted values of the
provenances to the families. Let's assume that families 1, 2 and 3 are from the ULU provenance and
family 4 is from the BOL provenance. The predictions of those families are given in the table below.
It is also more straightforward to interpret probabilities (which range between 0 and 1) than
predictions on the logit scale. To obtain the probabilities, we apply the inverse of the link function:

p = exp(u) / [1 + exp(u)]

where u is the vector of solutions for families (Best Linear Unbiased Predictions, BLUP). Predicted
probability values (p) range between 0.0 and 1.0. A high probability value indicates a high
probability of mortality.
 Family   GCA       Breeding value   Provenance        Sum       Predicted
 ID                 (2*GCA)                                      probability
 1         1.376     2.7520          ULU =  0.3545     3.1065    0.96
 2         0.6073    1.2146          ULU =  0.3545     1.5691    0.83
 3        -0.9268   -1.8536          ULU =  0.3545    -1.4991    0.18
 4         0.4487    0.8974          BOL = -0.1086     1.2519    0.78

Family 1 has a predicted probability of mortality of 0.96, whereas family 3 has a predicted
probability of only 0.18.
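The last column of the table can be reproduced by pushing the Sum column through the inverse logit; a short Python sketch (the function name is ours):

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: p = exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Sum column (breeding value + provenance prediction) from the table above
family_sums = {1: 3.1065, 2: 1.5691, 3: -1.4991, 4: 1.2519}
for fam, eta in family_sums.items():
    print(fam, round(inv_logit(eta), 2))
# 1 0.96, 2 0.83, 3 0.18, 4 0.78
```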
A randomized complete block design was used with 9 replications. Each clone had one copy in a
block (single-tree plot design). The block effect was considered fixed, and the family and clone
effects were random. When the trees were 3 years old in the field, the presence (=1) or absence (=0)
of disease (galls) was recorded. We are interested in partitioning the phenotypic variance into
genetic and environmental components and in predicting genetic values of clones (Isik et al. 2005.
Predicted genetic gains and testing efficiency from two loblolly pine clonal trials. Canadian J.
Forest Research 35: 1754-1766).
The probability of infection (p) of a single tree was modeled with the generalized linear mixed
model using a logit (canonical) link function:

η_ijk = log[p/(1-p)] = μ + r_i + f_j + c(f)_kj + e_ijk

In matrix form the model is

η = Xβ + Zu + e

where η_ijk is the link function g(μ), μ is the conditional mean, p is the proportion of infected
trees, r_i is the fixed effect of the ith block, f_j is the random effect of the jth family with N(0, Iσ²f),
c(f)_kj is the random effect of the kth clone within the jth family with N(0, Iσ²c(f)), and e_ijk is the
random residual with N(0, Iσ²e).
The variance of the observations is

Var(y) = E[Var(y|u)] + Var[E(y|u)] = A^(1/2) R A^(1/2) + Z G Z'

where A is a diagonal matrix containing the variance function of the model, that is A = diag{p(1-p)}
with p = Pr(y_i = 1). R is the variance matrix of the residual random effects. The vector of random
effects u was assumed to be multivariate normal with variance-covariance matrix G = Var(u) (SAS
Institute Inc. 1996). Z and Z' are the design matrix and its transpose for the random effects. The
validity of the model fitted to the rust data and the predicted values of clones are closely related to
the average rust infection. The average infection in the experiment was 0.38, which we assumed to be
within acceptable boundaries. An infection average smaller than 0.2 or greater than 0.8 would be
associated with high environmental variance (error) and would not be suitable for analysis.
Because linear predictors for rust infection were computed on a logit scale, the solutions from the
generalized linear mixed model are difficult to interpret. Therefore, predicted probabilities (p) of
the clones were calculated by applying the inverse of the link function:

p = exp(Xβ + u) / [1 + exp(Xβ + u)]

where X is the design matrix for fixed effects, β is the vector of solutions for fixed effects (i.e., the
intercept), and u is the vector of solutions for random effects, i.e., the Best Linear Unbiased
Predictions (BLUP) of clones. Clone rust infection predicted probability values (p) range between
0.0 and 1.0; a high probability value for a clone indicates a high probability of disease infection.
Using the variance components, we can easily calculate the repeatability (heritability) of clone
means:

H² = σ²c / [σ²c + σ²e / n]

where H² is the repeatability of clone means, σ²c is the variance explained by the clone effects,
σ²e is the variance of the residuals, which is fixed to 3.29, and n is the number of trees per clone in
the experiment.
!r family clone

 Source     Model   Gamma      Component   Comp/SE   % C
 family         4   0.579831   0.579831     1.10     0 P
 clone       1128   2.70983    2.70983      7.94     0 P
 Variance    2369   1.00000    1.00000      0.00     0 F
The variance due to clone differences explained a large proportion of the disease incidence
(2.7098); the family component was 0.5798. The repeatability of clone means is calculated as
follows:

H² = σ²c / [σ²c + σ²e / n] = 2.7 / [2.7 + (3.3 / 4.3)] = 0.78
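Reading the 4.3 in the denominator as the number of trees per clone used in the calculation (our interpretation), the arithmetic checks out; a minimal Python sketch:

```python
# Repeatability of clone means from the rounded components in the text
var_clone = 2.7   # clone variance component (rounded)
var_error = 3.3   # logit-link error variance (pi^2/3, rounded)
n = 4.3           # trees per clone, as used in the text's calculation

H2_clone = var_clone / (var_clone + var_error / n)
print(round(H2_clone, 2))  # 0.78
```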
 Effect    Level     Estimate   Std error
 mu        1         -1.046     0.4251
 family    F          0.2428    0.4295
 family    H          0.0354    0.4307
 family    I         -0.9985    0.4398
 family    K          0.7203    0.4242
 clone     F.F0101   -1.213     1.137
 clone     F.F0103    0.7804    0.7163
 clone     F.F0104   -1.364     1.099
 clone     F.F0107   -1.306     1.114
Best linear unbiased predictions of some clones on the measured scale (inverse link) are given below.
The last column (Inverse link) is the back-transformed predicted probability (BLUP) of each clone
after adding MU to the ESTIMATE as follows:

p = exp[mu + BLUP(clone)] / [1 + exp(mu + BLUP(clone))]
 Clone      Estimate   Std Err   Inverse link
 F.F0101    -1.213     1.137     0.09
 F.F0103     0.7804    0.7163    0.43
 F.F0104    -1.364     1.099     0.08
 F.F0107    -1.306     1.114     0.09
 F.F0108    -1.303     1.115     0.09
 F.F0110    -1.292     1.118     0.09
 F.F0111     0.8226    0.7198    0.44
 F.F0112     1.118     0.6705    0.52
 F.F0113    -0.4338    0.8745    0.19
 F.F0116     0.7804    0.7163    0.43
Clone F0104 had the lowest probability of disease infection ( p = 0.08), whereas clone F0112
had the highest probability of infection. Assuming a probability of infection of 0.50 for a
Checklot family, how much genetic gain can be realized if the top 3 clones are selected over the
Checklot tree?
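The Inverse link column can be reproduced from the intercept (mu = -1.046 in the solutions above) and the clone BLUPs; a small Python sketch (ours):

```python
import math

MU = -1.046  # intercept (mu) estimate from the ASReml solutions above

def infection_prob(blup):
    """p = exp(mu + BLUP) / (1 + exp(mu + BLUP))."""
    eta = MU + blup
    return math.exp(eta) / (1.0 + math.exp(eta))

for clone, blup in [("F.F0104", -1.364), ("F.F0103", 0.7804), ("F.F0112", 1.118)]:
    print(clone, round(infection_prob(blup), 2))
# F.F0104 0.08, F.F0103 0.43, F.F0112 0.52
```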
Reading the data into SAS (the opening lines of the import step are not shown):

   dbms=CSV replace;
   getnames=YES;
   datarow=2;
run;
Fitting the model using SAS GLIMMIX procedure.
proc glimmix data=rust asycov;
class rep fam clone ;
model rust3 (event='1')= rep /s dist=binary link=logit;
random fam clone /s ;
output out=p pred(blup ilink)=predicted lcl=lower ;
ods output solutionr=s_r solutionf=s_f;
ods exclude solutionr solutionf;
run;
1. In the MODEL statement, the probability of the event=1 (infection) is modeled. If you do
not specify the event, the code may choose 0, depending on the order. After the slash in
MODEL, Best Linear Unbiased Estimates of fixed effects (BLUEs) are requested. The
distribution of data is defined as binary (DIST=BINARY) and LOGIT link function is
used for transformation.
2. FAM and CLONE effects are random. Best linear unbiased predictors (BLUP) of random
effects are requested using /S option.
3. In the OUTPUT OUT statement, the inverse link of the BLUP estimates is requested along with
   the lower confidence limit (LCL).
4. The ODS OUTPUT statement creates two data sets: one for the random-effects predictions and
   one for the fixed-effects estimates.
OUTPUT

 The GLIMMIX Procedure

 Model Information
 Data Set                     WORK.RUST
 Response Variable            rust3
 Response Distribution        Binary
 Link Function                Logit
 Variance Function            Default
 Variance Matrix              Not blocked
 Estimation Technique         Residual PL
 Degrees of Freedom Method    Containment

 Class   Levels   Values
 rep          9   1 2 3 4 5 6 7 8 9
 fam          4   F H I K
 clone      282   F0100 F0101 F0103 F0104 F0107 F0108 F0110 ...

 Number of Observations Read   2528
 Number of Observations Used   2369
 Response Profile
 Ordered          Total
 Value    rust3   Frequency
 1        0       1831
 2        1        538
The order of the outcome values (0 or 1) is important. Check the table above to make sure that the
event=1 is being modeled.
 Dimensions
 G-side Cov. Parameters        2
 Columns in X                 10
 Columns in Z                286
 Subjects (Blocks in V)        1
 Max Obs per Subject        2369
 Optimization Information
 Optimization Technique        Dual Quasi-Newton
 Parameters in Optimization    2
 Lower Boundaries              2
 Upper Boundaries              0
 Fixed Effects                 Profiled
 Starting From                 Data
 Iteration History
 Convergence criterion (PCONV=1.11022E-8) satisfied.

 Fit Statistics
 -2 Res Log Pseudo-Likelihood    11916.75
 Generalized Chi-Square           1469.74
 Gener. Chi-Square / DF              0.62
 Covariance Parameter Estimates
 Cov Parm   Estimate   Standard Error
 fam          0.5798   0.5285
 clone        2.7098   0.3439

 Type III Tests of Fixed Effects
 Effect   Num DF   Den DF   F Value   Pr > F
 rep           8     2079      6.39   <.0001
The output includes model information, variance components as well as solutions for fixed
effects (BLUE) and solutions for random effects (BLUPs). The following is a partial output from
the S_R (prediction file for random effects) file.
 Solution for Random Effects
 Effect   fam   clone    Estimate   Std Err
 fam      F              -0.2428    0.4295
 fam      H              -0.03543   0.4307
 fam      I               0.9985    0.4398
 fam      K              -0.7203    0.4242
 clone          F0101     1.2134    1.1374
 clone          F0103    -0.7804    0.7163
 clone          F0104     1.3638    1.0992
 clone          F0107     1.3064    1.1138
 clone          F0108     1.3030    1.1146
The Estimate column displays the BLUP estimates on the logit scale. Since the linear predictors for
rust incidence were computed on the logit scale, the predicted probability (p) for a random effect can
be calculated by applying the inverse link function. For example, the probability of a clone being
infected by the disease can be calculated as follows:

p = exp[mu + BLUP(clone)] / [1 + exp(mu + BLUP(clone))]
 Clone     Estimate   StdErr   Inverse link
 F0101     -1.213     1.137    0.08
 F0103      0.780     0.716    0.37
 F0104     -1.364     1.099    0.07
 F0107     -1.306     1.114    0.07
 F0108     -1.303     1.115    0.07
 F0110     -1.292     1.118    0.07
 F0111      0.823     0.720    0.38
 F0112      1.118     0.671    0.46
 F0113     -0.434     0.874    0.15
 F0116      0.780     0.716    0.37
The predicted probabilities of clones are similar to what we calculated from ASReml.
Acknowledgement
Alfredo Farjat, a PhD student with the Cooperative Tree Improvement Program in the Department
of Forestry and Environmental Resources at NC State University, contributed to this document,
especially the theory parts.
References
[1] SAS System for Mixed Models, July 1996, SAS Publishing. Ramon C. Littell, George A.
Milliken, Walter W. Stroup and Russell Wolfinger.
[2] An Introduction to Generalized Linear Mixed Models. Stephen D. Kachman, Department of
Biometry, University of Nebraska-Lincoln.
[3] GLIMMIX Procedure: http://support.sas.com/rnd/app/papers/glimmix.pdf
[4] Repeated Measures Modeling With PROC MIXED. E. Barry Moser, Louisiana State
University, Baton Rouge, LA. SUGI 29 Proceedings, Paper 188-29.
[5] PROC MIXED: Underlying Ideas with Examples. David A. Dickey, NC State University,
Raleigh, NC. SAS Global Forum 2008, Statistics and Data Analysis, Paper 374-2008.
[6] Ideas and Examples in Generalized Linear Mixed Models. David A. Dickey, N. Carolina
State U., Raleigh, NC. SAS Global Forum 2010, Statistics and Data Analysis, Paper 263-2010.
[7] Introducing the GLIMMIX Procedure for Generalized Linear Mixed Models.
Oliver Schabenberger, SAS Institute Inc., Cary, NC. SUGI 30, Paper 196-30.
[8] Practical Regression and Anova using R. Julian Faraway. http://www.r-project.org/
[9] Falconer, D.S., and Mackay, T.F.C. 1996. Introduction to Quantitative Genetics. Fourth
Edition, Longman Group Ltd., Essex, England, 464 p.
[10] Gilmour, A.R., Anderson, R.D., and Rae, A.L. 1985. The analysis of binomial data by a
generalized linear mixed model. Biometrika, 72: 593-599. doi:10.1093/biomet/72.3.593.
[11] Gilmour, A.R., Gogel, B.J., Cullis, B.R., and Thompson, R. 2009. ASReml User Guide, Release
3.0. VSN International Ltd, Hemel Hempstead, HP1 1ES, UK. 267 p.
[12] Littell RC, Henry PR, CB Ammerman (1998) Statistical analysis of repeated measures data
using SAS procedures. J. Anim. Sci. 76: 1216-1231.
[13] Lynch, M., and Walsh, B. 1998. Genetics and analysis of quantitative traits. Sinauer
Associates, Inc., Sunderland, Mass.
[14] Stephen Kachman's course notes, University of Nebraska:
http://statistics.unl.edu/faculty/steve/index.shtml.