
Journal of Applied Statistics, Vol. 31, No. 7, 799–815, August 2004

Beta Regression for Modelling Rates and Proportions

SILVIA L. P. FERRARI* AND FRANCISCO CRIBARI-NETO**
*Departamento de Estatística/IME, Universidade de São Paulo, Brazil, **Departamento de Estatística, CCEN, Universidade Federal de Pernambuco, Brazil

A This paper proposes a regression model where the response is beta distributed using
a parameterization of the beta law that is indexed by mean and dispersion parameters. The
proposed model is useful for situations where the variable of interest is continuous and restricted
to the interval (0, 1) and is related to other variables through a regression structure. The
regression parameters of the beta regression model are interpretable in terms of the mean of the
response and, when the logit link is used, of an odds ratio, unlike the parameters of a linear
regression that employs a transformed response. Estimation is performed by maximum likelihood.
We provide closed-form expressions for the score function, for Fisher’s information matrix and
its inverse. Hypothesis testing is performed using approximations obtained from the asymptotic
normality of the maximum likelihood estimator. Some diagnostic measures are introduced.
Finally, practical applications that employ real data are presented and discussed.

K W: Beta distribution, maximum likelihood estimation, leverage, proportions,


residuals

Introduction
Practitioners commonly use regression models to analyse data that are perceived
to be related to other variables. The linear regression model, in particular, is
commonly used in applications. It is not, however, appropriate for situations
where the response is restricted to the interval (0, 1) since it may yield fitted
values for the variable of interest that exceed its lower and upper bounds. A
possible solution is to transform the dependent variable so that it assumes values
on the real line, and then to model the mean of the transformed response as a
linear predictor based on a set of exogenous variables. This approach, however,
has drawbacks, one of them being the fact that the model parameters cannot be
easily interpreted in terms of the original response. Another shortcoming is that
measures of proportions typically display asymmetry, and hence inference based
on the normality assumption can be misleading. Our goal is to propose a
regression model that is tailored for situations where the dependent variable (y)
is measured continuously on the standard unit interval, i.e. $0 < y < 1$. The
proposed model is based on the assumption that the response is beta distributed.
The beta distribution, as is well known, is very flexible for modelling proportions

Correspondence Address: Silvia L. P. Ferrari, Departamento de Estatística/IME, Universidade de São Paulo, Caixa Postal 66281, São Paulo/SP, 05311–970, Brazil. Email: [email protected]

0266-4763 Print/ 1360-0532 Online/04/070799-17 © 2004 Taylor & Francis Ltd


DOI: 10.1080/0266476042000214501

since its density can have quite different shapes depending on the values of the
two parameters that index the distribution. The beta density is given by

$$\pi(y; p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\, y^{p-1}(1-y)^{q-1}, \qquad 0 < y < 1 \qquad (1)$$

where $p > 0$, $q > 0$ and $\Gamma(\cdot)$ is the gamma function. The mean and variance of $y$ are, respectively,

$$E(y) = \frac{p}{p+q} \qquad (2)$$

and

$$\operatorname{var}(y) = \frac{pq}{(p+q)^2(p+q+1)} \qquad (3)$$

The mode of the distribution exists when both $p$ and $q$ are greater than one: $\operatorname{mode}(y) = (p-1)/(p+q-2)$. The uniform distribution is the particular case of equation (1) with $p = q = 1$. Estimation of $p$ and $q$ by maximum likelihood and the application of small-sample bias adjustments to the maximum likelihood estimators of these parameters are discussed by Cribari-Neto & Vasconcellos (2002).
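As a quick numerical check of equations (2) and (3), the moments can be evaluated directly; the helper below is ours, written for illustration only.

```python
# Moments of Beta(p, q) from equations (2) and (3); helper written for this note.

def beta_moments(p, q):
    mean = p / (p + q)
    var = p * q / ((p + q) ** 2 * (p + q + 1))
    mode = (p - 1) / (p + q - 2) if p > 1 and q > 1 else None
    return mean, var, mode

# uniform special case p = q = 1: mean 1/2, variance 1/12, no interior mode
assert beta_moments(1.0, 1.0) == (0.5, 1.0 / 12.0, None)
assert beta_moments(2.0, 2.0)[2] == 0.5
```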
‘Beta distributions are very versatile and a variety of uncertainties can be
usefully modelled by them. This flexibility encourages its empirical use in a wide
range of applications’ (Johnson et al., 1995, p. 235). Several applications of the
beta distribution are discussed by Bury (1999) and by Johnson et al. (1995).
These applications, however, do not involve situations where the practitioner is
required to impose a regression structure for the variable of interest. Our interest
lies in situations where the behaviour of the response can be modelled as a
function of a set of exogenous variables. To that end, we shall propose a beta
regression model. We shall also discuss the estimation of the unknown parameters
by maximum likelihood and some diagnostic techniques. Large sample inference
is also considered. The modelling and inferential procedures we propose are
similar to those for generalized linear models (McCullagh & Nelder, 1989),
except that the distribution of the response is not a member of the exponential
family. An alternative to the model we propose is the simplex model in Jørgensen
(1997) which is defined by four parameters. Our model, on the other hand, is
defined by only two parameters, and is flexible enough to handle a wide range
of applications.
It is noteworthy that several empirical applications can be handled using the
proposed class of regression models. As a first illustration, consider the dataset
collected by Prater (1956). The dependent variable is the proportion of crude oil
converted to gasoline after distillation and fractionation, and the potential
covariates are: the crude oil gravity (degrees API), the vapour pressure of the
crude oil (lbf/in²), the crude oil 10% point ASTM (i.e. the temperature at which
10% of the crude oil has become vapour), and the temperature (°F) at which all
the gasoline is vaporized. The dataset contains 32 observations on the response
and on the independent variables. It has been noted (Daniel & Wood, 1971, Ch. 8)

that there are only ten sets of values of the first three explanatory variables that
correspond to ten different crudes and were subjected to experimentally con-
trolled distillation conditions. This dataset was analysed by Atkinson (1985),
who used the linear regression model and noted that there is ‘indication that the
error distribution is not quite symmetrical, giving rise to some unduly large and
small residuals’ (Atkinson, 1985, p. 60). He proceeded to transform the response
so that the transformed dependent variable assumed values on the real line, and
then used it in a linear regression analysis. Our approach will be different: we
shall analyse these data using the beta regression model proposed in the next
section.
The paper unfolds as follows. The next section presents the beta regression
model, and discusses maximum likelihood estimation and large sample inference.
Diagnostic measures are discussed in the section after. The fourth section
contains applications of the proposed regression model, including an analysis of
Prater’s gasoline data. Concluding remarks are given in the final section. Tech-
nical details are presented in two separate appendices.

The Model, Estimation and Testing


Our goal is to define a regression model for beta distributed random variables. The density of the beta distribution is given in equation (1), where it is indexed by $p$ and $q$. For regression purposes, however, it is typically more useful to model the mean of the response. It is also typical to define the model so that it contains a precision (or dispersion) parameter. In order to obtain a regression structure for the mean of the response along with a precision parameter, we shall work with a different parameterization of the beta density. Let $\mu = p/(p+q)$ and $\phi = p+q$, i.e. $p = \mu\phi$ and $q = (1-\mu)\phi$. It follows from equations (2) and (3) that

$$E(y) = \mu$$

and

$$\operatorname{var}(y) = \frac{V(\mu)}{1+\phi}$$

where $V(\mu) = \mu(1-\mu)$, so that $\mu$ is the mean of the response variable and $\phi$ can be interpreted as a precision parameter in the sense that, for fixed $\mu$, the larger the value of $\phi$, the smaller the variance of $y$. The density of $y$ can be written, in the new parameterization, as

$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1}, \qquad 0 < y < 1 \qquad (4)$$

where $0 < \mu < 1$ and $\phi > 0$. Figure 1 shows a few different beta densities along with the corresponding values of $(\mu, \phi)$. It is noteworthy that the densities can display quite different shapes depending on the values of the two parameters. In particular, the density can be symmetric (when $\mu = 1/2$) or asymmetric (when $\mu \neq 1/2$). Additionally, we note that the dispersion of the distribution, for fixed $\mu$, decreases as $\phi$ increases. It is also interesting to note that, in the two upper panels of Figure 1, two densities have ‘J shapes’ and two others have inverted ‘J shapes’. Although we did not plot the uniform case, we note that when $\mu = 1/2$ and $\phi = 2$ the density reduces to that of a standard uniform distribution. The beta density can also be ‘U shaped’ (skewed or not), and this situation is also not displayed in Figure 1.

[Figure 1. Beta densities for different combinations of $(\mu, \phi)$]

Throughout the paper we shall assume that the response is constrained to the standard unit interval (0, 1). The model we shall propose, however, is still useful for situations where the response is restricted to the interval $(a, b)$, where $a$ and $b$ are known scalars, $a < b$. In this case, one would model $(y-a)/(b-a)$ instead of modelling $y$ directly.
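The reparameterization is easy to sketch in code: map $(\mu, \phi)$ back to $(p, q)$ and evaluate the log of the density in equation (4) via log-gamma functions. The helper names are ours, and the spot checks use the uniform special case noted above.

```python
import math

def to_pq(mu, phi):
    """Map (mu, phi) back to (p, q): p = mu*phi, q = (1 - mu)*phi."""
    return mu * phi, (1.0 - mu) * phi

def beta_logpdf(y, mu, phi):
    """Log of the density in equation (4)."""
    p, q = to_pq(mu, phi)
    return (math.lgamma(phi) - math.lgamma(p) - math.lgamma(q)
            + (p - 1.0) * math.log(y) + (q - 1.0) * math.log(1.0 - y))

mu, phi = 0.3, 5.0
p, q = to_pq(mu, phi)
assert abs(p / (p + q) - mu) < 1e-12                              # mean recovered
assert abs(p * q / ((p + q) ** 2 * (p + q + 1))
           - mu * (1 - mu) / (1 + phi)) < 1e-12                   # variance matches
# mu = 1/2, phi = 2 is the standard uniform: log-density 0 on (0, 1)
assert abs(beta_logpdf(0.25, 0.5, 2.0)) < 1e-12
```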

Let $y_1, \ldots, y_n$ be independent random variables, where each $y_t$, $t = 1, \ldots, n$, follows the density in equation (4) with mean $\mu_t$ and unknown precision $\phi$. The model is obtained by assuming that the mean of $y_t$ can be written as

$$g(\mu_t) = \sum_{i=1}^{k} x_{ti}\beta_i = \eta_t \qquad (5)$$

where $\beta = (\beta_1, \ldots, \beta_k)^T$ is a vector of unknown regression parameters ($\beta \in \mathbb{R}^k$) and $x_{t1}, \ldots, x_{tk}$ are observations on $k$ covariates ($k < n$), which are assumed fixed and known. Finally, $g(\cdot)$ is a strictly monotonic and twice differentiable link function that maps $(0, 1)$ into $\mathbb{R}$. Note that the variance of $y_t$ is a function of $\mu_t$ and, as a consequence, of the covariate values. Hence, non-constant response variances are naturally accommodated into the model.
There are several possible choices for the link function $g(\cdot)$. For instance, one can use the logit specification $g(\mu) = \log\{\mu/(1-\mu)\}$, the probit function $g(\mu) = \Phi^{-1}(\mu)$, where $\Phi(\cdot)$ is the cumulative distribution function of a standard normal random variable, the complementary log-log link $g(\mu) = \log\{-\log(1-\mu)\}$, the log-log link $g(\mu) = -\log\{-\log(\mu)\}$, among others. For a comparison of these link functions, see McCullagh & Nelder (1989, section 4.3.1), and for other transformations, see Atkinson (1985, Ch. 7).

A particularly useful link function is the logit link, in which case we can write

$$\mu_t = \frac{e^{x_t^T\beta}}{1 + e^{x_t^T\beta}}$$

where $x_t^T = (x_{t1}, \ldots, x_{tk})$, $t = 1, \ldots, n$. Here, the regression parameters have an important interpretation. Suppose that the value of the $i$th regressor is increased by $c$ units and all other independent variables remain unchanged, and let $\mu^{\dagger}$ denote the mean of $y$ under the new covariate values, whereas $\mu$ denotes the mean of $y$ under the original covariate values. Then, it is easy to show that

$$e^{c\beta_i} = \frac{\mu^{\dagger}/(1-\mu^{\dagger})}{\mu/(1-\mu)}$$

that is, $\exp\{c\beta_i\}$ equals the odds ratio. Consider, for instance, Prater’s gasoline
example introduced in the previous section, and define the odds of converting
crude oil into gasoline as the number of units of crude oil, out of ten units, that
are, on average, converted into gasoline divided by the number of units that are
not converted. As an illustration, if, on average, 20% of the crude oil is
transformed into gasoline, then the odds of conversion equals 2/8. Suppose that
the temperature at which all the gasoline is vaporized increases by 50°F, then 50
times the regression parameter associated with this covariate can be interpreted
as the log of the ratio between the chance of converting crude oil into gasoline
under the new setting relative to the old setting, all other variables remaining
constant.
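The odds-ratio identity can be verified numerically. The coefficient 0.01097 below is the temperature estimate from the gasoline application later in the paper (Table 1); the baseline linear predictor $\eta$ is hypothetical.

```python
import math

# A c-unit increase in regressor i multiplies the odds mu/(1 - mu) by exp(c*beta_i).

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

beta_i, c = 0.01097, 50.0          # +50 degrees F, estimate from Table 1
eta = -1.0                          # hypothetical baseline linear predictor
mu_old = inv_logit(eta)
mu_new = inv_logit(eta + c * beta_i)
odds_ratio = (mu_new / (1.0 - mu_new)) / (mu_old / (1.0 - mu_old))
assert abs(odds_ratio - math.exp(c * beta_i)) < 1e-10
```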

The log-likelihood function based on a sample of $n$ independent observations is

$$\ell(\beta, \phi) = \sum_{t=1}^{n} \ell_t(\mu_t, \phi) \qquad (6)$$

where

$$\ell_t(\mu_t, \phi) = \log\Gamma(\phi) - \log\Gamma(\mu_t\phi) - \log\Gamma((1-\mu_t)\phi) + (\mu_t\phi - 1)\log y_t + \{(1-\mu_t)\phi - 1\}\log(1-y_t) \qquad (7)$$

with $\mu_t$ defined so that equation (5) holds. Let $y_t^* = \log\{y_t/(1-y_t)\}$ and $\mu_t^* = \psi(\mu_t\phi) - \psi((1-\mu_t)\phi)$, where $\psi(\cdot)$ is the digamma function. The score function, obtained by differentiating the log-likelihood function with respect to the unknown parameters (see Appendix A), is given by $(U_\beta(\beta, \phi)^T, U_\phi(\beta, \phi))^T$, where

$$U_\beta(\beta, \phi) = \phi X^T T (y^* - \mu^*) \qquad (8)$$

with $X$ being an $n \times k$ matrix whose $t$th row is $x_t^T$, $T = \operatorname{diag}\{1/g'(\mu_1), \ldots, 1/g'(\mu_n)\}$, $y^* = (y_1^*, \ldots, y_n^*)^T$ and $\mu^* = (\mu_1^*, \ldots, \mu_n^*)^T$, and

$$U_\phi(\beta, \phi) = \sum_{t=1}^{n} \{\mu_t(y_t^* - \mu_t^*) + \log(1-y_t) - \psi((1-\mu_t)\phi) + \psi(\phi)\} \qquad (9)$$
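As a sketch, the score with respect to $\mu_t$ implied by equation (8), namely $\phi(y_t^* - \mu_t^*)$, can be checked against a finite-difference derivative of $\ell_t$ in equation (7); the crude digamma approximation below is ours.

```python
import math

# One observation: check d l_t / d mu_t = phi*(y* - mu*) by finite differences.

def digamma(x, h=1e-5):
    # crude central difference of log-gamma, adequate for this sketch
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def loglik_t(y, mu, phi):
    return (math.lgamma(phi) - math.lgamma(mu * phi) - math.lgamma((1.0 - mu) * phi)
            + (mu * phi - 1.0) * math.log(y)
            + ((1.0 - mu) * phi - 1.0) * math.log(1.0 - y))

y, mu, phi = 0.4, 0.35, 8.0
y_star = math.log(y / (1.0 - y))
mu_star = digamma(mu * phi) - digamma((1.0 - mu) * phi)
score_mu = phi * (y_star - mu_star)

eps = 1e-6
fd = (loglik_t(y, mu + eps, phi) - loglik_t(y, mu - eps, phi)) / (2.0 * eps)
assert abs(score_mu - fd) < 1e-3
```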
The next step is to obtain an expression for Fisher’s information matrix. The notation can be described as follows. Let $W = \operatorname{diag}\{w_1, \ldots, w_n\}$, with

$$w_t = \phi\{\psi'(\mu_t\phi) + \psi'((1-\mu_t)\phi)\}\,\frac{1}{\{g'(\mu_t)\}^2}$$

and $c = (c_1, \ldots, c_n)^T$, with $c_t = \phi\{\psi'(\mu_t\phi)\mu_t - \psi'((1-\mu_t)\phi)(1-\mu_t)\}$, where $\psi'(\cdot)$ is the trigamma function. Also, let $D = \operatorname{diag}\{d_1, \ldots, d_n\}$, with $d_t = \psi'(\mu_t\phi)\mu_t^2 + \psi'((1-\mu_t)\phi)(1-\mu_t)^2 - \psi'(\phi)$. It is shown in Appendix A that Fisher’s information matrix is given by

$$K = K(\beta, \phi) = \begin{pmatrix} K_{\beta\beta} & K_{\beta\phi} \\ K_{\phi\beta} & K_{\phi\phi} \end{pmatrix} \qquad (10)$$

where $K_{\beta\beta} = \phi X^T W X$, $K_{\beta\phi} = K_{\phi\beta}^T = X^T T c$ and $K_{\phi\phi} = \operatorname{tr}(D)$. Note that the parameters $\beta$ and $\phi$ are not orthogonal, in contrast to what is verified in the class of generalized linear regression models (McCullagh & Nelder, 1989).
Under the usual regularity conditions for maximum likelihood estimation, when the sample size is large,

$$\begin{pmatrix} \hat\beta \\ \hat\phi \end{pmatrix} \sim N_{k+1}\!\left(\begin{pmatrix} \beta \\ \phi \end{pmatrix}\!, K^{-1}\right)$$

approximately, where $\hat\beta$ and $\hat\phi$ are the maximum likelihood estimators of $\beta$ and $\phi$, respectively. It is thus useful to obtain an expression for $K^{-1}$, which can be

used to obtain asymptotic standard errors for the maximum likelihood estimates.
Using standard expressions for the inverse of partitioned matrices (e.g. Rao, 1973, p. 33), we obtain

$$K^{-1} = K^{-1}(\beta, \phi) = \begin{pmatrix} K^{\beta\beta} & K^{\beta\phi} \\ K^{\phi\beta} & K^{\phi\phi} \end{pmatrix} \qquad (11)$$

where

$$K^{\beta\beta} = \frac{1}{\phi}(X^T W X)^{-1}\left(I_k + \frac{X^T T c\, c^T T X (X^T W X)^{-1}}{\gamma\phi}\right)$$

with $\gamma = \operatorname{tr}(D) - \phi^{-1} c^T T X (X^T W X)^{-1} X^T T c$,

$$K^{\beta\phi} = (K^{\phi\beta})^T = -\frac{1}{\gamma\phi}(X^T W X)^{-1} X^T T c$$

and $K^{\phi\phi} = \gamma^{-1}$. Here, $I_k$ is the $k \times k$ identity matrix.
k
The maximum likelihood estimators of $\beta$ and $\phi$ are obtained from the equations $U_\beta(\beta, \phi) = 0$ and $U_\phi(\beta, \phi) = 0$, and do not have closed form. Hence, they need to be obtained by numerically maximizing the log-likelihood function using a nonlinear optimization algorithm, such as a Newton algorithm or a quasi-Newton algorithm; for details, see Nocedal & Wright (1999). The optimization algorithms require the specification of initial values to be used in the iterative scheme. Our suggestion is to use as an initial point estimate for $\beta$ the ordinary least squares estimate of this parameter vector obtained from a linear regression of the transformed responses $g(y_1), \ldots, g(y_n)$ on $X$, i.e. $(X^T X)^{-1} X^T z$, where $z = (g(y_1), \ldots, g(y_n))^T$. We also need an initial guess for $\phi$. As noted earlier, $\operatorname{var}(y_t) = \mu_t(1-\mu_t)/(1+\phi)$, which implies that $\phi = \mu_t(1-\mu_t)/\operatorname{var}(y_t) - 1$. Note that

$$\operatorname{var}(g(y_t)) \approx \operatorname{var}\{g(\mu_t) + (y_t - \mu_t)g'(\mu_t)\} = \operatorname{var}(y_t)\{g'(\mu_t)\}^2$$

that is, $\operatorname{var}(y_t) \approx \operatorname{var}\{g(y_t)\}/\{g'(\mu_t)\}^2$. Hence, the initial guess for $\phi$ we suggest is

$$\frac{1}{n}\sum_{t=1}^{n}\left(\frac{\check\mu_t(1-\check\mu_t)}{\check\sigma_t^2} - 1\right)$$

where $\check\mu_t$ is obtained by applying $g^{-1}(\cdot)$ to the $t$th fitted value from the linear regression of $g(y_1), \ldots, g(y_n)$ on $X$, i.e. $\check\mu_t = g^{-1}(x_t^T (X^T X)^{-1} X^T z)$, and $\check\sigma_t^2 = \check e^T \check e/[(n-k)\{g'(\check\mu_t)\}^2]$; here, $\check e = z - X(X^T X)^{-1} X^T z$ is the vector of ordinary least squares residuals from the linear regression that employs the transformed response. These initial guesses worked well in the applications described in the fourth section.
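A minimal sketch of these starting values under the logit link, on simulated toy data (all numbers illustrative): OLS of $g(y_t)$ on the covariates for $\beta$, then the moment-based guess for $\phi$.

```python
import math
import random

# Toy data from a logit-linear mean, then the suggested starting values.

random.seed(1)
n, k = 50, 2
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
y = [min(max(1.0 / (1.0 + math.exp(-(0.5 + 1.2 * xi))) + random.gauss(0.0, 0.05),
             0.01), 0.99) for xi in x]

z = [math.log(yi / (1.0 - yi)) for yi in y]        # transformed responses g(y_t)

# OLS of z on (1, x) via the 2x2 normal equations
sx, sz = sum(x), sum(z)
sxx = sum(xi * xi for xi in x)
sxz = sum(xi * zi for xi, zi in zip(x, z))
b1 = (n * sxz - sx * sz) / (n * sxx - sx * sx)
b0 = (sz - b1 * sx) / n

fitted = [b0 + b1 * xi for xi in x]
sse = sum((zi - fi) ** 2 for zi, fi in zip(z, fitted))

phi0 = 0.0
for fi in fitted:
    mu = 1.0 / (1.0 + math.exp(-fi))               # g^{-1} of the fitted value
    gprime = 1.0 / (mu * (1.0 - mu))               # derivative of the logit link
    sigma2 = sse / ((n - k) * gprime ** 2)         # estimated var(y_t)
    phi0 += mu * (1.0 - mu) / sigma2 - 1.0
phi0 /= n

assert b1 > 0.0 and phi0 > 0.0                     # slope sign and a positive precision
```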
Large sample inference is considered in Appendix B. We have developed
likelihood ratio, score and Wald tests for the regression parameters. In addition,
we have obtained confidence intervals for the precision and for the regression
parameters, for the odds ratio when the logit link is used, and for the mean
response.

Diagnostic Measures
After the fit of the model, it is important to perform diagnostic analyses in order
to check the goodness-of-fit of the estimated model. We shall introduce a global
measure of explained variation and graphical tools for detecting departures from
the postulated model and influential observations.
At the outset, a global measure of explained variation can be obtained by computing the pseudo $R^2$ ($R_p^2$) defined as the square of the sample correlation coefficient between $\hat\eta$ and $g(y)$. Note that $0 \le R_p^2 \le 1$, and perfect agreement between $\hat\eta$ and $g(y)$, and hence between $\hat\mu$ and $y$, yields $R_p^2 = 1$.
The discrepancy of a fit can be measured as twice the difference between the maximum log-likelihood achievable (saturated model) and that achieved by the model under investigation. Let $D(y; \mu, \phi) = \sum_{t=1}^{n} 2(\ell_t(\tilde\mu_t, \phi) - \ell_t(\mu_t, \phi))$, where $\tilde\mu_t$ is the value of $\mu_t$ that solves $\partial\ell_t/\partial\mu_t = 0$, i.e. $\phi(y_t^* - \mu_t^*) = 0$. When $\phi$ is large, $\mu_t^* \approx \log\{\mu_t/(1-\mu_t)\}$, and it then follows that $\tilde\mu_t \approx y_t$; see Appendix B. For known $\phi$, this discrepancy measure is $D(y; \bar\mu, \phi)$, where $\bar\mu$ is the maximum likelihood estimator of $\mu$ under the model being investigated. When $\phi$ is unknown, an approximation to this quantity is $D(y; \hat\mu, \hat\phi)$; it can be named, as usual, the deviance for the current model. Note that $D(y; \hat\mu, \hat\phi) = \sum_{t=1}^{n}(r_t^d)^2$, where

$$r_t^d = \operatorname{sign}(y_t - \hat\mu_t)\{2(\ell_t(\tilde\mu_t, \hat\phi) - \ell_t(\hat\mu_t, \hat\phi))\}^{1/2}$$

Note now that the $t$th observation contributes a quantity $(r_t^d)^2$ to the deviance, and thus an observation with a large absolute value of $r_t^d$ can be viewed as discrepant. We shall call $r_t^d$ the $t$th deviance residual.
It is also possible to define the standardized residuals

$$r_t = \frac{y_t - \hat\mu_t}{\sqrt{\widehat{\operatorname{var}}(y_t)}}$$

where $\hat\mu_t = g^{-1}(x_t^T\hat\beta)$ and $\widehat{\operatorname{var}}(y_t) = \hat\mu_t(1-\hat\mu_t)/(1+\hat\phi)$. A plot of these residuals against the index of the observations ($t$) should show no detectable pattern. Also, a detectable trend in the plot of $r_t$ against $\hat\eta_t$ could be suggestive of link function misspecification.
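For instance, plugging the observed and fitted values of observation 4 of the gasoline application reported in the fourth section (0.457 and 0.508, with $\hat\phi \approx 440.3$) into this definition gives a standardized residual near $-2.1$:

```python
import math

# Standardized residual r_t = (y_t - mu_hat)/sqrt(mu_hat*(1 - mu_hat)/(1 + phi_hat)).

def std_resid(y_t, mu_hat, phi_hat):
    var_hat = mu_hat * (1.0 - mu_hat) / (1.0 + phi_hat)
    return (y_t - mu_hat) / math.sqrt(var_hat)

r4 = std_resid(0.457, 0.508, 440.3)   # observation 4 of the gasoline fit
assert -2.3 < r4 < -2.0               # large enough in absolute value to flag it
```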
Since the distribution of the residuals is not known, half-normal plots with
simulated envelopes are a helpful diagnostic tool (Atkinson, 1985, section 4.2;
Neter et al., 1996, section 14.6). The main idea is to enhance the usual half-
normal plot by adding a simulated envelope that can be used to decide whether
the observed residuals are consistent with the fitted model. Half-normal plots
with a simulated envelope can be produced as follows:

(i) fit the model and generate a simulated sample of n independent observations using the fitted model as if it were the true model;
(ii) fit the model to the generated sample, and compute the ordered absolute values of the residuals;
(iii) repeat steps (i) and (ii) k times;
(iv) consider the n sets of the k order statistics; for each set compute its average, minimum and maximum values;
(v) plot these values and the ordered residuals of the original sample against the half-normal scores $\Phi^{-1}((t + n - 1/8)/(2n + 1/2))$.
The minimum and maximum values of the k order statistics yield the envelope.
Atkinson (1985, p. 36) suggests using $k = 19$, so that the probability that a given
absolute residual will fall beyond the upper band provided by the envelope is
approximately equal to 1/20 = 0.05. Observations corresponding to absolute
residuals outside the limits provided by the simulated envelope are worthy of
further investigation. Additionally, if a considerable proportion of points falls
outside the envelope, then one has evidence against the adequacy of the fitted
model.
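Steps (i)–(v) can be sketched as follows. For brevity the sketch draws residuals from an assumed standard normal null rather than refitting the model at each replication, which is what the actual procedure requires; $k = 19$ follows Atkinson’s suggestion.

```python
import random

# Envelope construction from steps (iii)-(v); the refits of steps (i)-(ii) are
# replaced here by draws from an assumed null, to keep the sketch self-contained.

random.seed(42)
n, k = 30, 19

sims = [sorted(abs(random.gauss(0.0, 1.0)) for _ in range(n)) for _ in range(k)]

# step (iv): per order statistic, minimum / average / maximum across replications
lower = [min(s[t] for s in sims) for t in range(n)]
mean_ = [sum(s[t] for s in sims) / k for t in range(n)]
upper = [max(s[t] for s in sims) for t in range(n)]

# step (v): half-normal plotting positions (t + n - 1/8)/(2n + 1/2), t = 1..n,
# to which the inverse normal CDF would be applied before plotting
positions = [(t + n - 0.125) / (2 * n + 0.5) for t in range(1, n + 1)]

assert all(lo <= m <= up for lo, m, up in zip(lower, mean_, upper))
assert all(0.5 < p < 1.0 for p in positions)
```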
Next, we shall be concerned with the identification of influential observations
and residual analysis. In what follows we shall use the generalized leverage
proposed by Wei et al. (1998), which is defined as
$$GL(\tilde\theta) = \frac{\partial\tilde y}{\partial y^T}$$

where $\theta$ is an $s$-vector such that $E(y) = \mu(\theta)$ and $\tilde\theta$ is an estimator of $\theta$, with $\tilde y = \mu(\tilde\theta)$. Here, the $(t, u)$ element of $GL(\tilde\theta)$, i.e. the generalized leverage of the estimator $\tilde\theta$ at $(t, u)$, is the instantaneous rate of change in the $t$th predicted value with respect to the $u$th response value. As noted by the authors, the generalized leverage is invariant under reparameterization, and observations with large $GL_{tu}$ are leverage points. Let $\hat\theta$ be the maximum likelihood estimator of $\theta$, assumed to exist and to be unique, and assume that the log-likelihood function has second-order continuous derivatives with respect to $\theta$ and $y$. Wei et al. (1998) have shown that the generalized leverage is obtained by evaluating

$$GL(\theta) = D_\theta\left(-\frac{\partial^2\ell}{\partial\theta\,\partial\theta^T}\right)^{-1}\frac{\partial^2\ell}{\partial\theta\,\partial y^T}$$

at $\hat\theta$, where $D_\theta = \partial\mu/\partial\theta^T$.

As a first step, we shall obtain a closed form for $GL(\beta)$ in the beta regression model proposed in the previous section under the assumption that $\phi$ is known. It is easy to show that $D_\beta = TX$. The expression for the elements of $-\partial^2\ell/\partial\beta\,\partial\beta^T$ is given in Appendix A, and it follows that

$$-\frac{\partial^2\ell}{\partial\beta\,\partial\beta^T} = \phi X^T Q X$$

where $Q = \operatorname{diag}\{q_1, \ldots, q_n\}$ with

$$q_t = \left(\phi\{\psi'(\mu_t\phi) + \psi'((1-\mu_t)\phi)\} + (y_t^* - \mu_t^*)\frac{g''(\mu_t)}{g'(\mu_t)}\right)\frac{1}{\{g'(\mu_t)\}^2}, \qquad t = 1, \ldots, n$$

Additionally, it can be shown that $\partial^2\ell/\partial\beta\,\partial y^T = \phi X^T T M$, where $M = \operatorname{diag}\{m_1, \ldots, m_n\}$ with $m_t = 1/\{y_t(1-y_t)\}$, $t = 1, \ldots, n$. Therefore, we obtain

$$GL(\beta) = TX(X^T Q X)^{-1} X^T T M \qquad (12)$$
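A toy evaluation of equation (12) with a single covariate, so that $X^T Q X$ is a scalar (the $\phi$ factors in the two derivative blocks cancel); all inputs below are made-up stand-ins, with $1/g'(\mu_t) = \mu_t(1-\mu_t)$ for the logit link.

```python
# GL(beta) = T X (X^T Q X)^{-1} X^T T M, entrywise for n = 3 and one covariate.

x = [1.0, 2.0, 3.0]                               # single-column X
mu = [0.3, 0.5, 0.7]
y = [0.25, 0.55, 0.65]
q = [4.0, 5.0, 4.5]                               # stand-ins for the q_t above

t_diag = [m_t * (1.0 - m_t) for m_t in mu]        # 1/g'(mu_t), logit link
m_diag = [1.0 / (y_t * (1.0 - y_t)) for y_t in y]

xtqx = sum(q_t * x_t ** 2 for q_t, x_t in zip(q, x))   # scalar X^T Q X

# (t, u) entry: t_t x_t (X^T Q X)^{-1} x_u t_u m_u
GL = [[t_diag[t] * x[t] * x[u] * t_diag[u] * m_diag[u] / xtqx
       for u in range(3)] for t in range(3)]

diag = [GL[t][t] for t in range(3)]               # each observation's own leverage
assert all(d > 0.0 for d in diag)
```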

We note that if we replace the observed information, $-\partial^2\ell/\partial\beta\,\partial\beta^T$, by the expected information, $E(-\partial^2\ell/\partial\beta\,\partial\beta^T)$, the expression for $GL(\beta)$ is as given in equation (12) but with $Q$ replaced by $W$; we shall call this matrix $GL^*(\beta)$. It is noteworthy that the diagonal elements of $GL^*(\beta)$ are the same as those of $M^{1/2}TX(X^T W X)^{-1}X^T T M^{1/2}$, and that $M^{1/2}T$ is a diagonal matrix whose $t$th diagonal element is given by $\{g'(\mu_t)V(y_t)^{1/2}\}^{-1}$. It is important to note that there is a close connection between the diagonal elements of $GL^*(\beta)$ and those of the usual ‘hat matrix’,

$$H = W^{1/2}X(X^T W X)^{-1}X^T W^{1/2}$$

when $\phi$ is large. The relationship stems from the fact that, when the precision parameter is large, the $t$th diagonal element of $W^{1/2}$ is approximately equal to $\{g'(\mu_t)V(\mu_t)^{1/2}\}^{-1}$; see Appendix C.
Now let $\phi$ be unknown, and hence $\theta^T = (\beta^T, \phi)$. Here, $D_\theta = [TX \;\; 0]$, where $0$ is an $n$-vector of zeros. Also, $-\partial^2\ell/\partial\theta\,\partial\theta^T$ is given by equation (10) with $W$ replaced by $Q$ and $c$ replaced by $f$, where $f = (f_1, \ldots, f_n)^T$ with $f_t = c_t - (y_t^* - \mu_t^*)$, $t = 1, \ldots, n$. It is thus clear that the inverse of $-\partial^2\ell/\partial\theta\,\partial\theta^T$ will be given by equation (11) with $W$ and $c$ replaced by $Q$ and $f$, respectively. Additionally,

$$\frac{\partial^2\ell}{\partial\theta\,\partial y^T} = \begin{pmatrix} \phi X^T T M \\ b^T \end{pmatrix}$$

where $b = (b_1, \ldots, b_n)^T$ with $b_t = -(y_t - \mu_t)/\{y_t(1-y_t)\}$, $t = 1, \ldots, n$. It can now be shown that

$$GL(\beta, \phi) = GL(\beta) + \frac{1}{\gamma\phi}\, TX(X^T Q X)^{-1}X^T T f\,\big(f^T T X(X^T Q X)^{-1}X^T T M - b^T\big)$$

where $GL(\beta)$ is given in equation (12). When $\phi$ is large, $GL(\beta, \phi) \approx GL(\beta)$.
A measure of the influence of each observation on the regression parameter estimates is Cook’s distance (Cook, 1977), given by $k^{-1}(\hat\beta - \hat\beta_{(t)})^T X^T W X (\hat\beta - \hat\beta_{(t)})$, where $\hat\beta_{(t)}$ is the parameter estimate without the $t$th observation. It measures the squared distance between $\hat\beta$ and $\hat\beta_{(t)}$. To avoid fitting the model $n+1$ times, we shall use the usual approximation to Cook’s distance given by

$$C_t = \frac{h_{tt}\, r_t^2}{k(1 - h_{tt})^2}$$

where $h_{tt}$ is the $t$th diagonal element of $H$. It combines leverage and residuals. It is common practice to plot $C_t$ against $t$.
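The approximation combines into one line of code; toy values, with the comparison showing that high leverage plus a large residual dominates.

```python
# Approximate Cook's distance C_t from leverage h_tt and standardized residual
# r_t, avoiding n + 1 model fits; the numbers below are toy values.

def cooks_distance(h_tt, r_t, k):
    return h_tt * r_t ** 2 / (k * (1.0 - h_tt) ** 2)

# high leverage combined with a large residual dominates
assert cooks_distance(0.6, 2.5, 3) > cooks_distance(0.1, 2.5, 3)
print(round(cooks_distance(0.5, 2.0, 2), 3))  # → 4.0
```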
Finally, we note that other diagnostic measures can be considered, such as
local influence measures (Cook, 1986).

Applications
This section contains two applications of the beta regression model proposed in the second section. All computations were carried out using the matrix programming language Ox (Doornik, 2001). The computer code and dataset used in the first application are available at http://www.de.ufpe.br/˜cribari/betareg_example.zip.

Table 1. Parameter estimates using Prater’s gasoline data

Parameter    Estimate     Std. error    z stat    p-value
β1           −6.15957       0.18232     −33.78     0.0000
β2            1.72773       0.10123      17.07     0.0000
β3            1.32260       0.11790      11.22     0.0000
β4            1.57231       0.11610      13.54     0.0000
β5            1.05971       0.10236      10.35     0.0000
β6            1.13375       0.10352      10.95     0.0000
β7            1.04016       0.10604       9.81     0.0000
β8            0.54369       0.10913       4.98     0.0000
β9            0.49590       0.10893       4.55     0.0000
β10           0.38579       0.11859       3.25     0.0011
β11           0.01097       0.00041      26.58     0.0000
φ           440.27838     110.02562

Estimation was performed using the quasi-Newton optimization algorithm known as BFGS with analytic first derivatives. The choice of starting values for the unknown parameters followed the suggestion made in the second section.
Consider initially Prater’s gasoline data described in the Introduction. The
interest lies in modelling the proportion of crude oil converted to gasoline after
distillation and fractionation. As noted earlier, there are only ten sets of values
of three of the explanatory variables which correspond to ten different crudes
subjected to experimentally controlled distillation conditions. The data were
ordered according to the ascending order of the covariate that measures the
temperature at which 10% of the crude oil has become vapour. This variable
assumes ten different values and they are used to define the ten batches of crude
oil. The model specification for the mean of the response uses an intercept ($x_1 = 1$), nine dummy variables for the first nine batches of crude oil ($x_2, \ldots, x_{10}$) and the covariate that measures the temperature (°F) at which all the gasoline is vaporized ($x_{11}$). Estimation results using the logit link function are given in Table 1.
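Each z statistic in Table 1 is just the estimate divided by its asymptotic standard error from $K^{-1}$; two rows can be reproduced as a sanity check.

```python
# z statistic = estimate / asymptotic standard error; rows beta_1 and beta_2
# of Table 1 reproduce to the printed two decimals.

z1 = -6.15957 / 0.18232
z2 = 1.72773 / 0.10123
assert round(z1, 2) == -33.78
assert round(z2, 2) == 17.07
```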
The pseudo $R^2$ of the estimated regression was 0.9617. Diagnostic plots are given in Figure 2. An inspection of Figure 2 reveals that the largest standardized and deviance residuals in absolute value correspond to observation 4. Also, $C_4$ is much larger than the remaining Cook’s measures, thus suggesting that the fourth observation is the most influential. Additionally, observation 4 deviates from the pattern shown in the lower right panel (plot of the diagonal elements of $GL(\hat\beta, \hat\phi)$ v. $\hat\mu_t$; observation 29, the one with largest generalized leverage, also displays deviation from the main pattern). On the other hand, it is noteworthy that the generalized leverage for this observation is not large relative to the remaining ones. We note, however, that $y_4$ is the largest value of the response; its observed value is 0.457 and the corresponding fitted value equals 0.508. The analysis of these data carried out by Atkinson (1985, Ch. 7) using a linear regression specification for transformations of the response also singles out observation 4 as influential.
We fitted the beta regression model without the fourth observation and noted

Figure 2. Six diagnostic plots for Prater’s gasoline data. The upper left panel plots the standardized residuals against t, the upper right panel plots the deviance residuals versus t, the middle left panel displays the half-normal plot of absolute deviance residuals with a simulated envelope, the middle right panel plots standardized residuals against $\hat\eta_t$, the lower left panel presents a plot of $C_t$ versus t, and the lower right panel plots the diagonal elements of $GL(\hat\beta, \hat\phi)$ against $\hat\mu_t$

that the point estimates of the βs were not significantly altered, but that the estimate of the precision parameter jumped from 440.3 to 577.8; despite that, however, the reduction in the asymptotic standard errors of the regression parameter estimates was negligible.
The next application uses data on food expenditure, income, and number of
persons in each household from a random sample of 38 households in a large
US city; the source of the data is Griffiths et al. (1993, Table 15.4). The interest
lies in modelling the proportion of income spent on food ($y$) as a function of the level of income ($x_2$) and the number of persons in the household ($x_3$). At the outset, consider a linear regression of the response on the covariates. The estimated regression displayed evidence of heteroskedasticity; the p-value for Koenker’s (1981) homoskedasticity test was 0.0514. If we consider instead the regression of $\log\{y/(1-y)\}$ on the two covariates, the evidence of heteroskedasticity is attenuated, but the residuals become highly asymmetric to the left.

We shall now consider the beta regression model proposed in the second section. As previously mentioned, this model naturally accommodates non-constant variances and skewness. The model is specified as

$$g(\mu_t) = \beta_1 + \beta_2 x_{t2} + \beta_3 x_{t3}$$

Table 2. Parameter estimates using data on food expenditure

Parameter    Estimate    Std. error    z stat    p-value
β1           −0.62255      0.22385     −2.78      0.0054
β2           −0.01230      0.00304     −4.05      0.0001
β3            0.11846      0.03534      3.35      0.0008
φ            35.60975      8.07960

The link function used was logit. The parameter estimates are given in Table 2.
The pseudo $R^2$ of the estimated regression was 0.3878.
The values in Table 2 show that both covariates are statistically significant at
the usual nominal levels. We also note that there is a negative relationship
between the mean response (proportion of income spent on food) and the level
of income, and that there is a positive relationship between the mean response
and the number of persons in the household. Diagnostic plots similar to those
presented in Figure 2 were also produced but for brevity are not presented.

Concluding Remarks
This paper proposed a regression model tailored for responses that are measured
continuously on the standard unit interval, i.e. $y \in (0, 1)$, which is the situation that
practitioners encounter when modelling rates and proportions. The underlying
assumption is that the response follows a beta law. As is well known, the beta
distribution is very flexible for modelling data on the standard unit interval,
since the beta density can display quite different shapes depending on the values
of the parameters that index the distribution. We use a parameterization in
which a function of the mean of the dependent variable is given by a linear
predictor that is defined by regression parameters and explanatory variables. The
proposed parameterization also allows for a precision parameter. When the logit
link function is used to transform the mean response, the regression parameters
can be interpreted in terms of the odds ratio. Parameter estimation is performed
by maximum likelihood, and we provide closed-form expressions for the score
function, for Fisher’s information matrix and its inverse. Interval estimation
for different population quantities (such as regression parameters, precision
parameter, mean response, odds ratio) is discussed. Tests of hypotheses on the
regression parameters can be performed using asymptotic tests, and three tests
are presented: likelihood ratio, score and Wald. We also consider a set of
diagnostic techniques that can be employed to identify departures from the
postulated model and influential observations. These include a measure of the
degree of leverage of the different observations, and a half-normal plot of
residuals with envelopes obtained from a simulation scheme. Applications using
real data sets were presented and discussed.

Acknowledgements
The authors gratefully acknowledge partial financial support from CNPq and
FAPESP. The authors also thank Gilberto Paula and a referee for comments
and suggestions on an earlier draft.

References
Abramowitz, M. & Stegun, I. A. (1965) Handbook of Mathematical Functions with Formulas, Graphs and
Mathematical Tables (New York: Dover).
Atkinson, A. C. (1985) Plots, Transformations and Regression: An Introduction to Graphical Methods of
Diagnostic Regression Analysis (New York: Oxford University Press).
Bury, K. (1999) Statistical Distributions in Engineering (New York: Cambridge University Press).
Cook, R. D. (1977) Detection of influential observations in linear regression, Technometrics, 19, pp. 15–18.
Cook, R. D. (1986) Assessment of local influence (with discussion), Journal of the Royal Statistical Society
B, 48, pp. 133–169.
Cribari-Neto, F. & Vasconcellos, K. L. P. (2002) Nearly unbiased maximum likelihood estimation for the
beta distribution, Journal of Statistical Computation and Simulation, 72, pp. 107–118.
Daniel, C. & Wood, F. S. (1971) Fitting Equations to Data (New York: Wiley).
Doornik, J. A. (2001) Ox: An Object-Oriented Matrix Programming Language, 4th edn (London: Timberlake
Consultants and Oxford: http://www.nuff.ox.ac.uk/Users/Doornik/).
Griffiths, W. E., Hill, R. C. & Judge, G. G. (1993) Learning and Practicing Econometrics (New York:
Wiley).
Johnson, N. L., Kotz, S. & Balakrishnan, N. (1995) Continuous Univariate Distributions, vol. 2, 2nd edn
(New York: Wiley).
Jørgensen, B. (1997) Proper dispersion models (with discussion), Brazilian Journal of Probability and
Statistics, 11, pp. 89–140.
Koenker, R. (1981) A note on studentizing a test for heteroscedasticity, Journal of Econometrics, 17,
pp. 107–112.
McCullagh, P. & Nelder, J. A. (1989) Generalized Linear Models, 2nd edn (London: Chapman and Hall).
Neter, J., Kutner, M. H., Nachtsheim, C. J. & Wasserman, W. (1996) Applied Linear Statistical Models,
4th edn (Chicago, IL: Irwin).
Nocedal, J. & Wright, S. J. (1999) Numerical Optimization (New York: Springer-Verlag).
Prater, N. H. (1956) Estimate gasoline yields from crudes, Petroleum Refiner, 35, pp. 236–238.
Rao, C. R. (1973) Linear Statistical Inference and Its Applications, 2nd edn (New York: Wiley).
Wei, B.-C., Hu, Y.-Q. & Fung, W.-K. (1998) Generalized leverage and its applications, Scandinavian
Journal of Statistics, 25, pp. 25–37.

Appendix A

In this appendix we obtain the score function and the Fisher information matrix for $(\beta, \phi)$. The notation used here is defined in the second section. From equation (6) we get, for $i = 1, \ldots, k$,

$$\frac{\partial \ell(\beta, \phi)}{\partial \beta_i} = \sum_{t=1}^{n} \frac{\partial \ell_t(\mu_t, \phi)}{\partial \mu_t} \frac{d\mu_t}{d\eta_t} \frac{\partial \eta_t}{\partial \beta_i} \qquad \text{(A1)}$$

Note that $d\mu_t/d\eta_t = 1/g'(\mu_t)$. Also, from equation (7),

$$\frac{\partial \ell_t(\mu_t, \phi)}{\partial \mu_t} = \phi \left\{ \log\frac{y_t}{1-y_t} - \psi(\mu_t\phi) + \psi((1-\mu_t)\phi) \right\} \qquad \text{(A2)}$$

where $\psi(\cdot)$ is the digamma function, i.e. $\psi(z) = d \log \Gamma(z)/dz$ for $z > 0$. From regularity conditions, it is known that the expected value of the derivative in equation (7) equals zero, so that $\mu_t^* = E(y_t^*)$, where $y_t^*$ and $\mu_t^*$ are defined in the second section. Hence,

$$\frac{\partial \ell(\beta, \phi)}{\partial \beta_i} = \phi \sum_{t=1}^{n} (y_t^* - \mu_t^*) \frac{1}{g'(\mu_t)} x_{ti} \qquad \text{(A3)}$$

We then arrive at the matrix expression for the score function for $\beta$ given in equation (8). Similarly, it can be shown that the score function for $\phi$ can be written as in equation (9).
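A quick numerical check of the closed-form derivative in equation (A2) can be reassuring. The sketch below is our own illustration (assuming SciPy's `digamma` and `gammaln`): it compares the analytic derivative with a central finite difference of the log-density of a single beta observation.

```python
import numpy as np
from scipy.special import gammaln, digamma

def ell_t(mu, phi, y):
    """Log-density of one beta observation under the mean/precision parameterization."""
    return (gammaln(phi) - gammaln(mu * phi) - gammaln((1 - mu) * phi)
            + (mu * phi - 1) * np.log(y) + ((1 - mu) * phi - 1) * np.log(1 - y))

mu, phi, y = 0.3, 5.0, 0.42  # illustrative values

# Closed form (A2): phi * { log(y/(1-y)) - psi(mu*phi) + psi((1-mu)*phi) }
analytic = phi * (np.log(y / (1 - y)) - digamma(mu * phi) + digamma((1 - mu) * phi))

# Central finite difference in mu
h = 1e-6
numeric = (ell_t(mu + h, phi, y) - ell_t(mu - h, phi, y)) / (2 * h)
```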
From equation (A1), the second derivative of $\ell(\beta, \phi)$ with respect to the elements of $\beta$ is given by

$$\frac{\partial^2 \ell(\beta, \phi)}{\partial \beta_i \, \partial \beta_j} = \sum_{t=1}^{n} \frac{\partial}{\partial \mu_t}\!\left( \frac{\partial \ell_t(\mu_t, \phi)}{\partial \mu_t} \frac{d\mu_t}{d\eta_t} \right) \frac{d\mu_t}{d\eta_t} \frac{\partial \eta_t}{\partial \beta_j} \, x_{ti}$$

$$= \sum_{t=1}^{n} \left\{ \frac{\partial^2 \ell_t(\mu_t, \phi)}{\partial \mu_t^2} \frac{d\mu_t}{d\eta_t} + \frac{\partial \ell_t(\mu_t, \phi)}{\partial \mu_t} \frac{\partial}{\partial \mu_t}\!\left( \frac{d\mu_t}{d\eta_t} \right) \right\} \frac{d\mu_t}{d\eta_t} \, x_{ti} x_{tj}$$

Since $E(\partial \ell_t(\mu_t, \phi)/\partial \mu_t) = 0$, we have

$$E\left( \frac{\partial^2 \ell(\beta, \phi)}{\partial \beta_i \, \partial \beta_j} \right) = \sum_{t=1}^{n} E\left( \frac{\partial^2 \ell_t(\mu_t, \phi)}{\partial \mu_t^2} \right) \left( \frac{d\mu_t}{d\eta_t} \right)^2 x_{ti} x_{tj}$$

Now, from equation (A2) we have

$$\frac{\partial^2 \ell_t(\mu_t, \phi)}{\partial \mu_t^2} = -\phi^2 \{ \psi'(\mu_t\phi) + \psi'((1-\mu_t)\phi) \}$$

and hence

$$E\left( \frac{\partial^2 \ell(\beta, \phi)}{\partial \beta_i \, \partial \beta_j} \right) = -\phi \sum_{t=1}^{n} w_t x_{ti} x_{tj}$$

In matrix form, we have that

$$E\left( \frac{\partial^2 \ell(\beta, \phi)}{\partial \beta \, \partial \beta^T} \right) = -\phi X^T W X$$
From equation (A3), the second derivative of $\ell(\beta, \phi)$ with respect to $\beta_i$ and $\phi$ can be written as

$$\frac{\partial^2 \ell(\beta, \phi)}{\partial \beta_i \, \partial \phi} = \sum_{t=1}^{n} \left\{ (y_t^* - \mu_t^*) - \phi \frac{\partial \mu_t^*}{\partial \phi} \right\} \frac{1}{g'(\mu_t)} x_{ti}$$

Since $E(y_t^*) = \mu_t^*$ and $\partial \mu_t^*/\partial \phi = \psi'(\mu_t\phi)\mu_t - \psi'((1-\mu_t)\phi)(1-\mu_t)$, we arrive at

$$E\left( \frac{\partial^2 \ell(\beta, \phi)}{\partial \beta_i \, \partial \phi} \right) = -\sum_{t=1}^{n} c_t \frac{1}{g'(\mu_t)} x_{ti}$$

In matrix notation, we then have

$$E\left( \frac{\partial^2 \ell(\beta, \phi)}{\partial \beta \, \partial \phi} \right) = -X^T T c$$

Finally, $\partial^2 \ell(\beta, \phi)/\partial \phi^2$ comes by differentiating the expression in equation (9) with respect to $\phi$. We arrive at $E(\partial^2 \ell(\beta, \phi)/\partial \phi^2) = -\sum_{t=1}^{n} d_t$, which, in matrix notation, can be written as

$$E\left( \frac{\partial^2 \ell(\beta, \phi)}{\partial \phi^2} \right) = -\mathrm{tr}(D)$$

It is now easy to obtain the Fisher information matrix for $(\beta, \phi)$ given in equation (10).

Appendix B

In this Appendix, we show how to perform large sample inference in the beta regression model we propose. Consider, for instance, the test of the null hypothesis $H_0\colon \beta_1 = \beta_1^{0}$ versus $H_1\colon \beta_1 \neq \beta_1^{0}$, where $\beta_1 = (\beta_1, \ldots, \beta_m)^T$ and $\beta_1^{0} = (\beta_1^{0}, \ldots, \beta_m^{0})^T$, for $m < k$, and $\beta_1^{0}$ given. The log-likelihood ratio statistic is

$$\omega_1 = 2\{ \ell(\hat\beta, \hat\phi) - \ell(\tilde\beta, \tilde\phi) \}$$

where $\ell(\beta, \phi)$ is the log-likelihood function and $(\tilde\beta^T, \tilde\phi)^T$ is the restricted maximum likelihood estimator of $(\beta^T, \phi)^T$ obtained by imposing the null hypothesis. Under the usual regularity conditions and under $H_0$, $\omega_1 \stackrel{d}{\longrightarrow} \chi^2_m$, so that a test can be performed using approximate critical values from the asymptotic $\chi^2_m$ distribution.
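Given the maximized log-likelihoods of the unrestricted and restricted fits, the likelihood ratio test reduces to two lines of code. In the sketch below the log-likelihood values are made up purely for illustration.

```python
from scipy.stats import chi2

loglik_full, loglik_restricted, m = -100.0, -103.5, 2  # illustrative values
omega1 = 2 * (loglik_full - loglik_restricted)          # likelihood ratio statistic
p_value = chi2.sf(omega1, df=m)                         # asymptotic chi^2_m p-value
# omega1 = 7.0; p_value ≈ 0.0302, so H0 would be rejected at the 5% level
```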
In order to describe the score test, let $U_{1\beta}$ denote the $m$-vector containing the first $m$ elements of the score function for $\beta$ and let $K^{\beta\beta}_{11}$ be the $m \times m$ matrix formed out of the first $m$ rows and the first $m$ columns of $K^{-1}$. It can be shown, using equation (8), that $U_{1\beta} = \phi X_1^T T(y^* - \mu^*)$, where $X$ is partitioned as $[X_1 \; X_2]$ following the partition of $\beta$. Rao's score statistic can be written as

$$\omega_2 = \tilde U_{1\beta}^T \tilde K^{\beta\beta}_{11} \tilde U_{1\beta}$$

where tildes indicate that the quantities are evaluated at the restricted maximum likelihood estimator. Under the usual regularity conditions and under $H_0$, $\omega_2 \stackrel{d}{\longrightarrow} \chi^2_m$.
Asymptotic inference can also be performed using Wald's test. The test statistic for the test of $H_0\colon \beta_1 = \beta_1^{0}$ is

$$\omega_3 = (\hat\beta_1 - \beta_1^{0})^T (\hat K^{\beta\beta}_{11})^{-1} (\hat\beta_1 - \beta_1^{0})$$

where $\hat K^{\beta\beta}_{11}$ equals $K^{\beta\beta}_{11}$ evaluated at the unrestricted maximum likelihood estimator, and $\hat\beta_1$ is the maximum likelihood estimator of $\beta_1$. Under mild regularity conditions and under $H_0$, $\omega_3 \stackrel{d}{\longrightarrow} \chi^2_m$. In particular, for testing the significance of the $i$th regression parameter ($\beta_i$), $i = 1, \ldots, k$, one can use the signed square root of Wald's statistic, i.e. $\hat\beta_i/\mathrm{se}(\hat\beta_i)$, where $\mathrm{se}(\hat\beta_i)$ is the asymptotic standard error of $\hat\beta_i$, obtained from the inverse of Fisher's information matrix evaluated at the maximum likelihood estimates. The limiting null distribution of the test statistic is standard normal.
An approximate $(1-\alpha)\times 100\%$ confidence interval for $\beta_i$, $i = 1, \ldots, k$ and $0 < \alpha < 1/2$, has limits given by $\hat\beta_i \pm \Phi^{-1}(1-\alpha/2)\,\mathrm{se}(\hat\beta_i)$. Additionally, approximate confidence regions for sets of regression parameters can be obtained by inverting one of the three large sample tests described above. Similarly, an asymptotic $(1-\alpha)\times 100\%$ confidence interval for $\phi$ has limits $\hat\phi \pm \Phi^{-1}(1-\alpha/2)\,\mathrm{se}(\hat\phi)$, where $\mathrm{se}(\hat\phi) = \hat\gamma^{1/2}$. Additionally, an approximate $(1-\alpha)\times 100\%$ confidence interval for the odds ratio $e^{c\beta_i}$, when the logit link is used, is

$$\left[ \exp\{ c(\hat\beta_i - \Phi^{-1}(1-\alpha/2)\,\mathrm{se}(\hat\beta_i)) \},\; \exp\{ c(\hat\beta_i + \Phi^{-1}(1-\alpha/2)\,\mathrm{se}(\hat\beta_i)) \} \right]$$

Finally, an approximate $(1-\alpha)\times 100\%$ confidence interval for $\mu$, the mean of the response, for a given vector of covariate values $x$ can be computed as

$$\left[ g^{-1}(\hat\eta - \Phi^{-1}(1-\alpha/2)\,\mathrm{se}(\hat\eta)),\; g^{-1}(\hat\eta + \Phi^{-1}(1-\alpha/2)\,\mathrm{se}(\hat\eta)) \right]$$

where $\hat\eta = x^T\hat\beta$ and $\mathrm{se}(\hat\eta) = \{ x^T \widehat{\mathrm{cov}}(\hat\beta)\, x \}^{1/2}$; here, $\widehat{\mathrm{cov}}(\hat\beta)$ is obtained from the inverse of Fisher's information matrix evaluated at the maximum likelihood estimates by excluding the row and column of this matrix corresponding to the precision parameter. The above interval is valid for strictly increasing link functions.
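These interval formulas translate directly into code. In the sketch below the point estimates and standard errors are invented for illustration, and a logit link is assumed so that $g^{-1}$ is the logistic function.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

alpha = 0.05
z = norm.ppf(1 - alpha / 2)          # Phi^{-1}(1 - alpha/2) ≈ 1.96

# Wald interval for a regression parameter beta_i (illustrative estimate and se)
beta_hat, se_beta = 0.5, 0.1
ci_beta = (beta_hat - z * se_beta, beta_hat + z * se_beta)

# Induced interval for the odds ratio e^{c * beta_i} under the logit link
c = 1.0                               # covariate increment
ci_odds = tuple(np.exp(c * b) for b in ci_beta)

# Interval for the mean response: transform the interval for eta = x^T beta
eta_hat, se_eta = -0.8, 0.2           # illustrative linear predictor and its se
ci_mu = (expit(eta_hat - z * se_eta), expit(eta_hat + z * se_eta))  # g^{-1} = logistic
```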

Appendix C

Here we shall obtain approximations for $w_t$ and $\mu_t^*$, $t = 1, \ldots, n$, when $\mu_t\phi$ and $(1-\mu_t)\phi$ are large. At the outset, note that (Abramowitz & Stegun, 1965, p. 259), as $z \to \infty$,

$$\psi(z) = \log(z) - \frac{1}{2z} - \frac{1}{12z^2} + \frac{1}{120z^4} + \cdots \qquad \text{(C1)}$$

$$\psi'(z) = \frac{1}{z} + \frac{1}{2z^2} + \frac{1}{6z^3} - \frac{1}{30z^5} + \cdots \qquad \text{(C2)}$$

In what follows, we shall drop the subscript $t$ (that indexes observations). When $\mu\phi$ and $(1-\mu)\phi$ are large, it follows from equation (C2) that

$$w \approx \phi \left\{ \frac{1}{\mu\phi} + \frac{1}{(1-\mu)\phi} \right\} \frac{1}{g'(\mu)^2} = \frac{1}{\mu(1-\mu)} \frac{1}{g'(\mu)^2}$$

Also, from equation (C1) we obtain

$$\mu^* \approx \log(\mu\phi) - \log((1-\mu)\phi) = \log\left( \frac{\mu}{1-\mu} \right)$$