
JOURNAL OF CHEMOMETRICS, VOL. 12, 41–54 (1998)

THE OBJECTIVE FUNCTION OF PARTIAL LEAST SQUARES REGRESSION

CAJO J. F. TER BRAAK1* AND SIJMEN DE JONG2

1 Centre for Biometry Wageningen, CPRO-DLO, PO Box 16, NL-6700 AA Wageningen, Netherlands
2 Unilever Research Laboratorium, PO Box 114, NL-3130 AC Vlaardingen, Netherlands

* Correspondence to: C. J. F. ter Braak.

CCC 0886–9383/98/010041–14 $17.50
Received 25 April 1997; Accepted 28 July 1997
© 1998 John Wiley & Sons, Ltd.

SUMMARY
A simple objective function in terms of undeflated X is derived for the latent variables of multivariate PLS
regression. The objective function fits into the basic framework put forward by Burnham et al. (J. Chemometrics,
10, 31–45 (1996)). We show that PLS and SIMPLS differ in the constraint put on the length of the X-weight
vector. It turns out that PLS does not penalize the length of the part of the weight vector that can be expressed
as a linear combination of the preceding weights, whereas SIMPLS does. By using artificial data sets, it is shown
that it depends on the data which of the two methods explains the larger amount of variance in X and Y. The
objective function framework adds insight to the nature of PLS and SIMPLS and how they relate to other
methods. In addition, we present an implicit deflation algorithm for PLS, explain why PLS and SIMPLS become
equivalent when Y changes from multivariate to univariate, and list some geometrical results that may also prove
useful in the study of other latent variable methods. © 1997 John Wiley & Sons, Ltd.

J. Chemometrics, Vol. 12, 41–54 (1998)

KEY WORDS PLS; SIMPLS; multivariate regression; latent variable; covariance maximization; objective
function

1. INTRODUCTION
In many applications, data from two different measurement systems are to be related, where each measurement system produces data on a large number of variables. To gain more insight into the data and the underlying chemical processes, it is often helpful to reduce the number of variables of each set to a smaller number of latent variables or factors. This idea is quite natural, because the variables are often highly correlated. Once variables are aggregated into sums or other linear combinations, the individual variables become essentially redundant. There are, however, a number of rival latent variable methods for relating two sets of variables. Choosing a method that is fit for the job at hand is easy when each method has a clearly delimited domain of application. As a means towards achieving this aim, Burnham et al. [1] developed a theoretical framework for latent variable methods. For convenience the two sets of variables are indicated by X and Y after the $n \times p$ and $n \times m$ data matrices $X$ and $Y$ in which the values of the $p$ predictors and $m$ responses are stored, with $n$ the number of samples. Often the relation between X and Y is asymmetric in that Y is to be predicted from X, but the framework also contains methods for cases in which the relation is symmetric. A unifying framework highlights both the similarities and dissimilarities between different methods.
In the basic framework the latent variables are extracted on the basis of maximization of the
covariance between linear combinations of the columns of X and linear combinations of the columns
of Y, subject to constraints on the weight vectors that define the linear combinations. Canonical correlation-based regression (CCR), reduced rank regression (RRR) and simple partial least squares
regression (SIMPLS) easily fit into this basic framework, but standard partial least squares regression (PLS) does not, except for the extraction of the first factor (or when Y is unidimensional). As in Reference 1, the abbreviation PLS will be used throughout to mean PLS regression for the multivariate case [2]. The problem with PLS is that the second, third and later factors are extracted from residual matrices that are obtained from X and Y by successive deflation with respect to the preceding X-factor scores.

After a careful study of the deflation process, Burnham et al. [1] extended the basic framework to incorporate all the methods mentioned. The objective function of the extended framework is, however, rather complicated. The constraints remain simple. Burnham et al. [1] question the rationale for the deflation process in PLS in an appendix and emphasize their point of view by proposing undeflated PLS (UPLS), but 'more as an illustration of the impact of the deflation process on PLS than as any real improvement over existing methods'. UPLS is equivalent to a singular value decomposition of $X^T Y$. History repeats itself: SIMPLS was also the result of an attempt to better understand PLS. The results of PLS and SIMPLS are often very similar [1,3] and are identical if Y is univariate [3]. SIMPLS and PLS thus have approximately the same predictive power. Both numerically and theoretically, SIMPLS has advantages over standard PLS: the algorithms for SIMPLS are more efficient than those for PLS, and the objective function for SIMPLS is more elegant and easier to understand than that for PLS. The question is therefore whether PLS is better replaced by SIMPLS.

As Burnham et al. [1] write, 'two papers which claim to have an objective function for PLS beyond the first pair of latent variables are actually describing SIMPLS'. This has no consequence for univariate y [4-6] but has, at least potentially, consequences for multivariate Y [7]. For ecological applications one of us attempted to extend PLS in the correspondence analysis (CA) framework (Appendix of Reference 8) but used the wrong definitions for PLS. Compared with SIMPLS, one extra constraint was proposed in which the Y-factors were constrained to be orthogonal to the preceding X-factor scores. This constraint holds in reduced rank regression and canonical correlation analysis but not necessarily in PLS or SIMPLS. The misunderstanding was, at least partially, due to the fact that algorithms that are efficient for ecological data exploit the sparseness of the X- and Y-matrices. Because deflation destroys the sparseness, iterative algorithms have been developed in which deflation is achieved implicitly by orthogonalizing both the X-factor scores and the Y-factor scores with respect to the preceding X-factors (Appendix I). In the algorithm of Appendix I, PLS works directly on the original data matrices X and Y. It therefore comes as a surprise that the objective function of PLS cannot be formulated in a simple way in terms of the original data matrices.
In this paper a simple objective function for the latent variables of PLS is derived which fits into the basic framework of Burnham et al. [1] We show that PLS and SIMPLS differ in the constraint placed on the length of the X-weight vector and resolve some of the questions that Burnham et al. [1] raise about the deflation process in PLS. The paper is organized as follows. In Section 2 we summarize standard PLS and prepare for the derivation of the objective function of PLS in terms of the original matrices. In Section 3 we discuss the objective function frameworks of Burnham et al. [1] and show how a slight generalization allows both standard PLS and SIMPLS to fit naturally into the same framework. Having shown how PLS can be posed as a constrained maximization problem that avoids deflated X-matrices, we proceed in Section 4 to discuss the subtle distinction between SIMPLS and PLS. Some specific points are deferred to appendices.

2. PARTIAL LEAST SQUARES


This section summarizes PLS in terms of deflated matrices and prepares for the derivation of the objective function of PLS in terms of the original matrices.

We use the notation that $A^-$ is a generalized inverse of the matrix $A$ and $A^+$ is the unique Moore–Penrose inverse [9]. If $A$ has full column rank, $A^+ = (A^T A)^{-1} A^T$. Note that $A A^+$ is the orthogonal projector on the column space of $A$, which reduces to the more familiar form $A (A^T A)^{-1} A^T$ if $A$ has full column rank. With $I_n$ we denote the identity matrix of order $n$. We assume throughout that the predictors and response variables in X and Y are centred in a preprocessing step, because this allows us to interpret inner products as covariances (up to a constant factor).
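As a small illustration of this notation, the following MATLAB fragment (a sketch; the matrix A is arbitrary) checks numerically that $A A^+$ is the symmetric, idempotent projector on the column space of $A$ and that it reduces to $A (A^T A)^{-1} A^T$ when $A$ has full column rank:

% Sketch: A*pinv(A) is the orthogonal projector on the column space of A.
A  = randn(8, 3);                      % arbitrary matrix of full column rank
PA = A * pinv(A);                      % A A^+
disp(norm(PA - A * ((A' * A) \ A')))   % ~0: equals A (A'A)^{-1} A'
disp(norm(PA * PA - PA))               % ~0: idempotent
disp(norm(PA - PA'))                   % ~0: symmetric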
The latent variables (factors) of PLS are derived sequentially. This section describes how the $i$th pair of latent variables is derived from the data matrices X and Y and the preceding $i-1$ latent variables. PLS is a regression method; we therefore concentrate on how the latent X-space is constructed. This latent space is spanned by orthogonal factors $t_1, t_2, \ldots, t_i$, which are linear combinations of the X-variables. At this point we are not yet concerned with how the optimal weight vectors are selected. We do not even require at this point that the X-factors are orthogonal. Let $X_1 = X$ and $Y_1 = Y$, and let $X_j$ and $Y_j$ ($j = 2, 3, \ldots$) denote the residual (or deflated) matrices that are obtained from X and Y by successive deflation with respect to the preceding X-factor scores $t_1, t_2, \ldots, t_{j-1}$, with $j$ indicating the number of the factor to be extracted [1,2]. The X-factor $t_j$ can be defined in terms of either the original matrix $X$ or the deflated matrix $X_j$ by

$t_j = X r_j = X_j w_j, \quad j = 1, 2, \ldots, i$    (1)

Each factor therefore has two associated weight vectors ($r_j$ and $w_j$ respectively), which coincide for $j = 1$ because $X_1 = X$. The results for the first $i-1$ factors are collected in the matrix $T = [t_1, t_2, \ldots, t_{i-1}]$, with $i \ge 2$. The corresponding weights $\{r_j\}$ and $\{w_j\}$ ($j = 1, 2, \ldots, i-1$) are collected analogously in the matrices $R$ and $W$. With this notation, for $i \ge 2$, $T = XR$ and

$P = X^T T (T^T T)^{-1} = X^T X R (R^T X^T X R)^{-1}$    (2)

where the columns of $P$ contain the X-loadings. With (2),

$X_i = X - T P^T = X - X R P^T = X (I_p - R P^T)$    (3)

but also

$X_i = X - T P^T = X - T (T^T T)^{-1} T^T X = (I_n - T T^+) X$    (4)

i.e. $X_i$ is the matrix of residuals from the regression of the X-variables on the preceding X-factors in $T$. The Y-matrix is deflated analogously with respect to $T$, i.e. $Y_i = (I_n - T T^+) Y$.

It is now easy to formulate the objective function of PLS regression in terms of the deflated X- and Y-matrices: for the $i$th pair of latent variables, PLS selects the linear combinations of the deflated X-variables and deflated Y-variables, $t = X_i w$ and $u = Y_i c$ respectively, that have maximum covariance:

$t^T u = w^T X_i^T Y_i c \quad \text{subject to} \quad w^T w = c^T c = 1$    (5)

The optimal X-weight vector $w_i$ is the first eigenvector of the eigenvalue problem

$X_i^T Y_i Y_i^T X_i w = \lambda_i w$    (6)

and the optimal Y-weight vector $c_i$ is proportional to $Y_i^T X_i w_i$. Only one of either X or Y needs to be deflated [10], because $X_i^T Y_i = X^T (I_n - T T^+)(I_n - T T^+) Y = X^T (I_n - T T^+) Y$, which is equal to both $X_i^T Y$ and $X^T Y_i$. From (6) and on deflating X only, $w_i$ also maximizes

$w^T X_i^T Y Y^T X_i w \quad \text{subject to} \quad w^T w = 1$    (7)

The weight vectors $\{w_j\}$ derived in this way are mutually orthogonal [2] and so are the X-factors $\{t_j\}$ ($j = 1, 2, \ldots, i$). From (1) and (3),


$t_i = X r_i = X_i w_i = X (I_p - R P^T) w_i$    (8)

so that we can define

$r_i = (I_p - R P^T) w_i$    (9)

This formula makes it easy to calculate $r_i$ from $w_i$ and all preceding $i-1$ weights and loadings in $R$ and $P$ [3,10,11]. To express $w_i$ in terms of $r_i$, first note from (9) that

$r_i = w_i + R d_i$    (10)

where $d_i = -P^T w_i$. With $R^+$ the Moore–Penrose inverse of $R$,

$d_i = R^+ (r_i - w_i) = R^+ r_i$    (11)

because $R^T w_i = 0_{i-1}$, as can be verified by noting that, from (9), $W$ and $R$ have the same column space and, from the orthogonality of $\{w_j\}$ ($j = 1, 2, \ldots, i$), $W^T w_i = 0_{i-1}$. On inserting (11) into (10) and rearranging terms, we obtain

$w_i = (I_p - R R^+) r_i$    (12)

On expanding (6) using (7), (3) and (12), we obtain the generalized eigenvalue problem

$(I_p - P R^T)\, X^T Y Y^T X\, (I_p - R P^T)(I_p - R R^+)\, r = \lambda_i (I_p - R R^+)\, r$    (13)

which is expressed explicitly in terms of the weight vector $r$. Not only is $R R^+$ a projection matrix, but $P R^T$ and $R P^T$ are projection matrices as well (Appendix II). By using (39) and (36) of Appendix II, (13) can be simplified to

$(I_p - P R^T)\, X^T Y Y^T X\, r = \lambda_i (I_p - R R^+)\, r$    (14)

This expression is helpful in determining the objective function of PLS in terms of the original X and Y, as we show in the next section.
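To make the preceding summary concrete, the following MATLAB sketch extracts PLS factors by explicit deflation of X, following (1), (2), (7) and (9). It is an illustrative reconstruction, not the authors' code; the function name and the use of the SVD to obtain the first eigenvector of (7) are our own choices.

function [R, W, T, P] = pls_deflate(X, Y, nfac)
% Multivariate PLS by explicit deflation (illustrative sketch).
% Assumes the columns of X and Y are centred; deflating X alone suffices.
  [n, p] = size(X);
  R = zeros(p, 0); W = zeros(p, 0); T = zeros(n, 0); P = zeros(p, 0);
  Xi = X;                                % X_1 = X
  for j = 1:nfac
    [U, ~, ~] = svd(Xi' * Y, 'econ');    % w_j: first eigenvector of (7)
    w  = U(:, 1);                        % w'w = 1
    t  = Xi * w;                         % t_j = X_j w_j, cf. (1)
    pj = X' * t / (t' * t);              % X-loading, cf. (2)
    r  = w - R * (P' * w);               % r_j = (I_p - R P') w_j, cf. (9)
    Xi = Xi - t * (t' * Xi) / (t' * t);  % deflate X w.r.t. t_j
    R = [R r]; W = [W w]; T = [T t]; P = [P pj];
  end
end
% On exit, norm(X*R - T) is ~0, confirming t_j = X r_j = X_j w_j in (1).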

3. GENERAL OBJECTIVE FUNCTION FRAMEWORK


The frameworks of Burnham et al. [1] specify the objective functions that are being optimized by each pair of latent variables of a latent variable method, except Framework 1, which is suited for the first pair of latent variables only. Framework 3 was developed specially to cater for PLS, and Framework 2 is a special case of Framework 4. We start from a framework, Framework A, that is the same as Framework 4 of Burnham et al. [1] except that the defining matrices are allowed to be singular. This framework is rewritten in a form that has been analysed by Rao [12] under the name of the restricted eigenvalue problem and is then specialized for PLS. We cannot directly use Rao's solution of the problem, because his solution did not cater for singular matrices.

Consider the following objective function for given, possibly singular, matrices $M_1$, $M_2$ and $M_3$:

$\max_{a_i, b_i}\; a_i^T X^T Y b_i \quad \text{subject to} \quad a_i^T M_1 a_i = 1,\; b_i^T M_2 b_i = 1 \;\text{and}\; a_j^T M_3 a_i = 0,\; 1 \le j < i$    (15)

On defining $C_i = [a_1, a_2, \ldots, a_{i-1}]^T M_3$, Framework A can be rewritten in the form of Rao's [12] restricted eigenvalue problem, which is

$\max_{a_i}\; a_i^T X^T Y M_2^- Y^T X a_i \quad \text{subject to} \quad a_i^T M_1 a_i = 1 \;\text{and}\; C_i a_i = 0_{i-1}$    (16)

where the last constraint is dropped when $i = 1$. This can be seen by applying the Lagrange multiplier


method [9] to (15). Zhu and Barnes [13] and Hinkle and Rayens [14] applied the Lagrange method to the objective function of SIMPLS, reproducing the solution found by de Jong [3]. The Lagrange method provides the conditions for stationary points of (15). These conditions are necessary for the maximizer of (15). The conditions obtained by the Lagrange method are

$X^T Y b_i - \lambda_1^* M_1 a_i - C_i^T \mu^* = 0_p \quad \text{and} \quad \lambda_2^* M_2 b_i - Y^T X a_i = 0_m$    (17)

where $\lambda_1^*$, $\lambda_2^*$ and $\mu^*$ are the Lagrange multipliers, one for each constraint. On solving for $b_i$ from the second equation in (17) and inserting the solution in the first, we obtain, with $\lambda_i \equiv \lambda_1^* \lambda_2^*$ and $\mu \equiv \lambda_2^* \mu^*$,

$X^T Y M_2^- Y^T X a_i - \lambda_i M_1 a_i - C_i^T \mu = 0_p$    (18)

This equation is also obtained by applying the Lagrange method directly to framework (16). By premultiplying (18) by a projector $I_p - \mathcal{P}$ (the projector $\mathcal{P}$ is distinct from the loading matrix $P$) that satisfies the equations

$(I_p - \mathcal{P})\, C_i^T = O_{p \times (i-1)} \quad \text{and} \quad (I_p - \mathcal{P})\, M_1 a_i = M_1 a_i$    (19)

the equation can be written as the generalized eigenvalue problem

$(I_p - \mathcal{P})\, X^T Y M_2^- Y^T X a_i = \lambda_i M_1 a_i$    (20)

The eigenvectors of (20) are stationary points of (16) with the eigenvalues as the corresponding values of the maximand. The first eigenvector is therefore the global maximizer of (16). With $b_i$ solved from (17) and given the proper length, i.e.

$b_i = M_2^- Y^T X a_i / (a_i^T X^T Y M_2^- Y^T X a_i)^{1/2}$    (21)

the global maximizers of (15) are obtained. By comparing (20) with (14) and considering the requirements in (19), we derive that PLS fits in framework (16), with $a_i = r$, by defining

$M_1 = I_p - R R^+, \quad M_2 = I_m, \quad \mathcal{P} = P R^T \quad \text{and} \quad C_i = P^T$    (22)

With these definitions the requirements of (19) read

$(I_p - P R^T)\, P = O_{p \times (i-1)} \quad \text{and} \quad (I_p - P R^T)(I_p - R R^+)\, a_i = (I_p - R R^+)\, a_i$    (23)

The requirements of (23) hold true, as follows from (35) and (44).
With $C_i = P^T$ the last constraint in (16), $C_i a_i = 0_{i-1}$, is equivalent to the last set of constraints of Framework A, $a_j^T M_3 a_i = 0$ ($1 \le j < i$), by defining $M_3 = X^T X$; see the definition of $P$ in (2). With the other definitions of (22) this demonstrates that PLS fits in Framework A. Because $M_3 = X^T X$, PLS fits in both Frameworks 2 and 4 of Burnham et al. [1] when $M_1$ is allowed to be singular. Table 1 copies Table 3 from Burnham et al. [1] with a column for PLS added. For clarity, Table 1 also specifies the matrices $C_i$ and $\mathcal{P}$ used in each method.
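That PLS fits in the framework can also be checked numerically. The fragment below (a sketch, assuming the function pls_deflate from the sketch in Section 2 and arbitrary centred data) verifies that the weight vector $r_i$ produced by the deflation algorithm satisfies the generalized eigenvalue problem (14), i.e. (20) with the definitions (22):

% Numerical check of (14): (I_p - P R') X'Y Y'X r = lambda (I_p - R R^+) r.
rng(0);
X = randn(20, 6); X = X - mean(X);       % arbitrary centred data
Y = randn(20, 3); Y = Y - mean(Y);
i = 3;                                   % check the third factor
[R, W, T, P] = pls_deflate(X, Y, i);
r  = R(:, i);  Rp = R(:, 1:i-1);  Pp = P(:, 1:i-1);
S  = X' * Y;
lhs = (eye(6) - Pp * Rp') * (S * (S' * r));
rhs = (eye(6) - Rp * pinv(Rp)) * r;      % M_1 r with M_1 = I_p - R R^+
lambda = norm(lhs) / norm(rhs);          % the generalized eigenvalue
disp(norm(lhs - lambda * rhs))           % ~0 up to rounding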
Before closing this section, we make two more observations on the framework and its solution. The first observation concerns the number of eigenvalue problems that need to be solved. Equation (20) specifies that each latent variable is derived as the first eigenvector of a new eigenvalue problem. The eigenvalue problems solved for different latent variables differ, because the matrix $\mathcal{P}$ depends, via $P$ and $R$, on the latent variables preceding any particular latent variable. This is an essential feature of PLS and SIMPLS. However, if $M_1 = M_3$, as in the other methods of Table 1, we need to solve a single eigenvalue problem only. The second, third and higher latent variables are then the second, third and higher eigenvectors of the eigenvalue problem solved for the first latent variable. This happens because these higher eigenvectors automatically satisfy the orthogonality constraints $a_j^T M_3 a_i = 0$ ($i \ne j$) if $M_3 = M_1$. In this case the $i$th X-factor $t_i = X a_i$ is orthogonal to the $j$th Y-factor $u_j = Y b_j$, and $b_i^T M_2 b_j = 0$ ($i \ne j$). The second observation is that if $X^T X = I_p$, as in some designed experiments, all methods of Table 1 except CCR coincide. This is immediate from the definitions in Table 1, except for PLS. The equivalence of PLS and SIMPLS in this particular case follows from the observation that if $X^T X = I_p$, then $P^T a_i = R^T a_i = 0_{i-1}$ and $a_i^T (I_p - R R^+) a_i = a_i^T a_i = 1$.

Table 1. Parameters $M_1$, $M_2$ and $M_3$ of Framework A that define the $i$th pair of latent variables, and parameter $C_i$ of (16). $R = [a_1, a_2, \ldots, a_{i-1}]$ and $P$ are $p \times (i-1)$ matrices of X-weights and X-loadings of the preceding $i-1$ latent variables. The optimal X-weight vector $a_i$ is obtained from the generalized eigenvalue problem (20) using the projector matrix $\mathcal{P}$ specified in the table. The optimal Y-weight vector $b_i$ is obtained from (21)

                 CCR        RRR        PLS              SIMPLS     UPLS

$M_1$            $X^T X$    $X^T X$    $I_p - R R^+$    $I_p$      $I_p$
$M_2$            $Y^T Y$    $I_m$      $I_m$            $I_m$      $I_m$
$M_3$            $X^T X$    $X^T X$    $X^T X$          $X^T X$    $I_p$
$C_i^T$          $P$        $P$        $P$              $P$        $R$
$\mathcal{P}$    $P R^T$    $P R^T$    $P R^T$          $P P^+$    $R R^+$


4. PLS AND SIMPLS COMPARED


In this section PLS and SIMPLS are compared in the light of the objective function framework. We constructed two small example data sets (Table 2) with no other purpose than to illustrate that (i) the length of the X-weight vector is equal to one in SIMPLS but may be greater than one in PLS, as observed earlier by Burnham et al. [1], and (ii) it depends on the data whether PLS or SIMPLS explains more of the Y-variance. The example data are analysed without autoscaling or any other form of prescaling. The two example data sets use the same matrix X (Table 2), which is orthogonal but not orthonormal (orthonormality would make PLS and SIMPLS equivalent). The example Y-data Y1 and Y2 (Table 2) differ only in the scaling of the first column. Since PLS is invariant under orthogonal transformation of the X-variables and/or the Y-variables [15,16], as is SIMPLS, one may think of the example X-data as the principal component scores of the original (possibly autoscaled) data, whereby the Y-data are rotated also. PLS and SIMPLS results, if non-identical, are distinguished by the superscripts N and S respectively. PLS is not indicated by the letter P, to avoid confusion with the loading matrix; the superscript N acknowledges that this is the normal version of PLS as obtained by the standard NIPALS algorithm. We change the notation for the X- and Y-weights in terms of the undeflated matrices from $a$ and $b$ to the more customary letters $r$ and $q$.

Table 2. Artificial example data sets (X, Y1) and (X, Y2)

        X              Y1           Y2

 −4  −2   1          4   0        2   0
 −4   2  −1         −4  −1       −2  −1
  4  −2  −1         −2   1       −1   1
  4   2   1          2   0        1   0

For the first factor pair both PLS and SIMPLS search for linear combinations $Xr$ and $Yq$ that maximize the covariance:

$r^T X^T Y q \quad \text{subject to} \quad r^T r = q^T q = 1$    (24)

Therefore $r_1 \equiv r_1^N = r_1^S$, $q_1 \equiv q_1^N = q_1^S$, $t_1 \equiv t_1^N = t_1^S$ and $p_1 \equiv p_1^N = p_1^S$, the associated X-loading vector. The
second X-factor $t = Xr$ is required to be orthogonal to $t_1$. This requirement translates into the constraint $p_1^T r = 0$ for the second factor of both PLS and SIMPLS. For the second and later Y-factors of the form $Yq$, PLS and SIMPLS both use the constraint $q^T q = 1$. Note that the second Y-factor in normal NIPALS-PLS uses the deflated Y-matrix $Y_2$ and cannot in general be rewritten as a linear combination of the columns of the undeflated Y (see Appendix I). SIMPLS continues to apply the constraint $r^T r = 1$, but PLS uses a different constraint (Table 1). For the $i$th PLS factor ($i > 1$) the constraint is

$r^T [I_p - R (R^T R)^{-1} R^T]\, r = w^T w = 1$    (25)

where, for the second factor, $R = [r_1]$ and in general $R = [r_1, r_2, \ldots, r_{i-1}]$. Equivalently, with $r = w + Rd$ as in (10) and $d$ defined by (11),

$r^T r = w^T w + d^T R^T R d = 1 + d^T R^T R d \ge 1$    (26)

In contrast with SIMPLS, PLS allows the length of the weight vector to be greater than one (Table 3), as was observed in Appendix II of Burnham et al. [1] If the second factor differs, the third and later factors are expected to differ more, because these must now satisfy different orthogonality requirements ($(P^N)^T r = 0_{i-1}$ for PLS and $(P^S)^T r = 0_{i-1}$ for SIMPLS, with $P^N \ne P^S$), and the difference in the length of $r$ may increase, because $R$ contains more columns and thus allows more freedom when more factors have been extracted.
The interpretation of (26) is that PLS does not penalize the length of $Rd$, i.e. the part of the weight vector $r$ that can be expressed as a linear combination of the preceding weights. Does this make sense? In the predictive model for Y based on $i$ latent variables, the $i$ weight vectors are freely combined in different linear combinations to form the final regression coefficient matrix $B$ of the model $\hat{Y} = XB$. This can be seen as follows. From the regression of Y on $T$, giving $\hat{Y} = T \tilde{Q}^T$, with $\tilde{Q} = Y^T T (T^T T)^{-1}$ the matrix of Y-loadings with respect to $T$, we obtain $\hat{Y} = T \tilde{Q}^T = X R \tilde{Q}^T = XB$, so that $B = R \tilde{Q}^T$. No constraint is applied while estimating $\tilde{Q}$, because $\tilde{Q}$ is obtained by unrestricted least squares regression of Y on $T$. This holds true for both PLS and SIMPLS. Because the predictive model using $i-1$ latent variables uses unconstrained combinations of the first $i-1$ weight vectors in $R$, we argue that the $i$th weight vector may also use the preceding weight vectors in an unconstrained fashion. From this point of view there is no reason to penalize the length of the part of the weight vector that can be expressed as a linear combination of the preceding weights.
Table 3. Main results (X-weights and X-scores) for PLS and SIMPLS regression of (X, Y1) data from Table 2

                     PLS                            SIMPLS
             1        2        3            1        2        3

r         −0·1252   0·6880  −0·1094     −0·1252   0·5232  −0·0646
           0·5929   0·1773  −0·5332      0·5929   0·1612  −0·3660
          −0·7955  −1·2037  −1·3142     −0·7955  −0·8368  −0·9284
‖r‖        1·0000   1·3977   1·4225      1·0000   1·0000   1·0000
r/‖r‖     −0·1252   0·4922  −0·0769     −0·1252   0·5232  −0·0646
           0·5929   0·1269  −0·3748      0·5929   0·1612  −0·3660
          −0·7955  −0·8612  −0·9239     −0·7955  −0·8368  −0·9284
t/‖t‖     −0·4892  −0·7126   0·0542     −0·4892  −0·7142   0·0256
           0·8201  −0·1973   0·1960      0·8201  −0·2050   0·1880
          −0·2944   0·5953   0·5558     −0·2944   0·5726   0·5792
          −0·0365   0·3146  −0·8060     −0·0365   0·3466  −0·7928
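The numbers in Tables 3 and 4 can be reproduced with the sketch below. The SIMPLS loop follows de Jong [3] (unit-length weights; the cross-product matrix is deflated via an orthonormal basis of the loadings), and pls_deflate is the sketch from Section 2; the function names are our own, and individual columns may come out with flipped signs because singular vectors are determined only up to sign.

function [R, T] = simpls_sketch(X, Y, nfac)
% SIMPLS (illustrative sketch after de Jong [3]): r'r = 1 for every factor.
  [n, p] = size(X);
  R = zeros(p, 0); T = zeros(n, 0); V = zeros(p, 0);
  S = X' * Y;
  for a = 1:nfac
    [U, ~, ~] = svd(S, 'econ');          % r maximizes r'S S'r with r'r = 1
    r  = U(:, 1);
    t  = X * r;
    pa = X' * t / (t' * t);              % loading of the new factor
    v  = pa - V * (V' * pa);             % orthogonalize against old loadings
    v  = v / norm(v);
    S  = S - v * (v' * S);               % deflate the cross-product matrix
    R = [R r]; T = [T t]; V = [V v];
  end
end

% Script reproducing Tables 3 and 4:
X  = [-4 -2 1; -4 2 -1; 4 -2 -1; 4 2 1];
Y1 = [4 0; -4 -1; -2 1; 2 0];
[Rn, ~, Tn, ~] = pls_deflate(X, Y1, 3);  % NIPALS-PLS (superscript N)
[Rs, Ts] = simpls_sketch(X, Y1, 3);      % SIMPLS (superscript S)
disp(vecnorm(Rn))                        % 1.0000 1.3977 1.4225, cf. Table 3
disp(vecnorm(Rs))                        % 1.0000 1.0000 1.0000
expl = @(t, Z) 100 * norm(t * (t' * Z) / (t' * t), 'fro')^2 / norm(Z, 'fro')^2;
[expl(Tn(:,2), X) expl(Tn(:,2), Y1); ...
 expl(Ts(:,2), X) expl(Ts(:,2), Y1)]     % rows PLS/SIMPLS, cf. Table 4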


Table 1 shows that PLS uses the same matrix for $\mathcal{P}$ as CCR and RRR, namely $P R^T$, the projector on $P$ in the metric $(X^T X)^-$, whereas SIMPLS uses the orthogonal projector on $P$. In the dual space of objects, $\mathbb{R}^n$, PLS uses the orthogonal projector on $T$, whereas SIMPLS uses an oblique projector, namely the projector on $X X^T T$ in the metric $(X X^T)^-$. Least squares regression problems use the orthogonal projector in $\mathbb{R}^n$. It is unclear to us how to value these geometrical differences between PLS and SIMPLS.

Because PLS constrains the X-weight vector less than SIMPLS does, one might conjecture that PLS explains more of the Y-variance than SIMPLS. We first approach this conjecture mathematically and then prove it false by giving a counter-example. The additional Y-variance explained by an orthogonal factor $t = Xr$ is

$r^T X^T Y Y^T X r / (t^T t)$    (27)

where the numerator is equal to the objective function (16) used by PLS and SIMPLS ($M_2 = I_m$). The maximum of (16) attained is the eigenvalue $\lambda$. One can prove that $\lambda_2^N \ge \lambda_2^S$, but this relation does not necessarily imply that PLS explains more Y-variance than SIMPLS with two factors, because of the denominator of (27). We now present an example and a counter-example of the conjecture. For the simple data examples given in Table 2, we find that in one case (Y1) two-factor PLS, compared with SIMPLS, explains more variance in the Y-data and less in the X-data, whereas in the other case (Y2) these roles are reversed (Table 4). These examples show that the conjecture is false; it depends on the data which method explains the larger amount of variance in X or Y.

With real data we do not readily observe even differences of a few per cent. With artificial data one may generate much larger differences than those of Table 4. These occur when at some stage the deflated $X^T Y$ has almost coinciding singular values. The associated pair of factors may then enter the model in reversed order for one method compared with the other. Hence for a given dimensionality the two models may differ greatly, a difference that may largely disappear after the next factor is introduced.

Because the objective function framework applies irrespective of the dimension of Y, the objective functions of PLS and SIMPLS also differ when Y is univariate. Nevertheless, PLS and SIMPLS are known to be equivalent when Y is univariate [3]. We resolve this apparent paradox in Appendix III and also explain why PLS and SIMPLS become equivalent when Y changes from multivariate to univariate.

Table 4. Variance explained (%) by the second factor in the two regressions, (X, Y1) and (X, Y2), of the data of Table 2

              Y1 regressed on X      Y2 regressed on X
Method          X        Y1            X        Y2

PLS           85·45    73·90         91·56    59·16
SIMPLS        86·75    72·22         90·21    65·53
Difference    −1·30     1·68          1·35    −6·37


5. CONCLUSIONS
Burnham et al. [1] were the first to develop an objective function for all latent variables of multivariate PLS regression that does not involve deflated matrices. This allowed PLS to be placed in an objective function framework among its rival methods such as SIMPLS, undeflated PLS, reduced rank regression and canonical correlation-based regression. That framework was a rather complicated adaptation of a basic framework, made so as to accommodate PLS alongside its rivals. In the basic framework, latent variables are simply chosen by constrained maximization of the covariance between linear combinations of the original variables. Using the Lagrange multiplier method, we show in this paper that PLS can also be placed in the basic framework. This makes it easier to compare and contrast PLS with the other multivariate methods. Table 1 lists the defining constraints of each method.

We use the framework to highlight the similarities and dissimilarities between PLS and SIMPLS. We show that PLS and SIMPLS differ in the constraint put on the length of the X-weight vector. PLS does not penalize the length of the part of the weight vector that can be expressed as a linear combination of the preceding weights, whereas SIMPLS does. In the predictive model for Y based on $A$ latent variables, the $A$ weight vectors are freely combined in different linear combinations for different Y-variables to form the matrix of final regression coefficients $B$ of the model $\hat{Y} = XB$. This holds true for both PLS and SIMPLS. Because the predictive model using $A$ latent variables already uses unconstrained combinations of the first $A$ weight vectors, there is little basis for constraining their usage in forming the $(A+1)$th vector. These results add to the understanding of the deflation process in PLS.

Because PLS constrains the X-weight vector less than SIMPLS does, one might conjecture that PLS explains more of the Y-variance than SIMPLS. However, this conjecture does not hold true, as we prove by giving a counter-example. With two artificial example data sets we demonstrate that it depends on the data which of the two methods explains the larger amount of variance in X or Y. Like Burnham et al. [1], we do not readily observe large differences between PLS and SIMPLS with real data. In our experience the subtle theoretical differences between PLS and SIMPLS do not have important consequences in practical applications.

We hope that the objective function framework and the associated geometrical results in Appendix II contribute to a better understanding of the nature of PLS and SIMPLS and their relation to other latent variable methods.

APPENDIX I: A PLS ALGORITHM WITH IMPLICIT DEFLATION


Table 5 contains an algorithm, as MATLAB code, for extracting PLS factors using implicit deflation. It is the conceptual basis of algorithms in Fortran or C that are efficient when the X- and Y-matrices are sparse. To obtain this efficiency, sparse matrix multiplication and an accelerator must be used [17,18]. In Steps 1 and 4, weight vectors are applied to the original data matrices X and Y. The resulting X-factor $t$ and Y-factor $u$ are orthogonalized with respect to the preceding X-factors in $T$ in Steps 2 and 5. The algorithm wrongly suggests that it solves an eigenproblem phrased in terms of the original data matrices and that, on convergence, the Y-factor $Yc$ is orthogonal to the preceding X-factors in $T$. Although true in similar algorithms for CCR and RRR [19,20], these suggestions do not hold for PLS. The algorithm is in fact mathematically equivalent to algorithms using explicit deflation, as can be seen by combining Steps 1 and 2 into

$t = (I_n - T T^+)\, X w = X_i w$    (28)

and Steps 4 and 5 into

$u = (I_n - T T^+)\, Y c = Y_i c$    (29)

On inserting Step 5 into Step 6 and then, sequentially, Steps 4, 3, 2 and 1 into the result, and on accounting for the rescaling of $w$ in Step 7 by $\ell = \mathrm{norm}(w - w_0) = \|w - w_0\|$, we obtain

$\lambda w = X^T u = X^T (I_n - T T^+)\, Y c = \cdots = X^T (I_n - T T^+)\, Y Y^T (I_n - T T^+)\, X w$    (30)

Upon convergence, $\lambda$ is the eigenvalue. The weight vectors obtained from the algorithm are thus in terms of the deflated matrices and are therefore denoted by $w$ and $c$ as in the main text. Beyond the first dimension the Y-factor obtained from the algorithm is $u = Y_i c$, which is orthogonal to the preceding X-factors in $T$. The Y-factor defined in Framework A in (15), $u^* = Y b_i$ with $b_i$ defined by (21), is in general not orthogonal to $T$. In contrast with $u^*$, $u$ is not in the column space of Y [1]. The two Y-factors are related by $u \propto (I_n - T T^+)\, u^*$.

Table 5. Basic concept of a PLS algorithm with implicit deflation
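A minimal MATLAB sketch of one factor-extraction cycle along these lines follows. The loop mirrors Steps 1–7 as described above; the initialization, the place where c is normalized, the reading of w0 as the component of w along the preceding weights, and the stopping rule are our assumptions rather than details fixed by the text.

function [w, c, t, u, l] = pls_factor_implicit(X, Y, T, W)
% One PLS factor by implicit deflation (sketch). T and W hold the preceding
% X-factors and X-weights; pass zeros(n,0) and zeros(p,0) for the first factor.
  p = size(X, 2);
  w = randn(p, 1); w = w / norm(w);        % arbitrary start (assumption)
  for iter = 1:500
    wold = w;
    t = X * w;                             % Step 1: weights applied to original X
    t = t - T * ((T' * T) \ (T' * t));     % Step 2: orthogonalize t w.r.t. T
    c = Y' * t;                            % Step 3: Y-weights (normalized later)
    u = Y * c;                             % Step 4: weights applied to original Y
    u = u - T * ((T' * T) \ (T' * u));     % Step 5: orthogonalize u w.r.t. T
    w = X' * u;                            % Step 6: new X-weights
    w0 = W * (W' * w);                     % Step 7: w0 = part of w along W (assumption)
    l = norm(w - w0);  w = (w - w0) / l;   %   rescale by l = ||w - w0||; l -> lambda in (30)
    if norm(w - wold) < 1e-12, break, end  % stopping rule (assumption)
  end
  c = c / norm(c);                         % impose c'c = 1 afterwards (assumption)
end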


APPENDIX II: RELATIONS BETWEEN PROJECTORS DEFINED BY THE WEIGHT MATRIX R AND THE ASSOCIATED LOADING MATRIX P

After recalling some elementary properties of projectors, this appendix summarizes the relationships between the weight matrix $R$ and the associated loading matrix $P$ and between the projectors defined by $R$ and $P$. The relationships are in no way special to PLS. It is not assumed that the latent variables in $T$ are orthogonal. For relation (36) to hold, $t$ must be orthogonal to $T$. For a discussion of the geometry of PLS and the paramount role of oblique projections see Reference 21.
We start with some elementary properties of projectors [12]. The projector on $A$ in the metric $B$ (with $B = B^T$) is idempotent, i.e. $P_{A/B} P_{A/B} = P_{A/B}$, and self-adjoint, so that $B P_{A/B} = P_{A/B}^T B$, and has the algebraic form

$P_{A/B} = A (A^T B A)^- A^T B$    (31)

The (oblique) projector is also said to be on $A$ along $(BA)^\perp$. If $C$ is also a metric of a projector on $A$, we obtain by using (31) that

$P_{A/B} P_{A/C} = P_{A/C}$    (32)

The projector on $(BA)^\perp$ along $A$ is $I - P_{A/B}$ and

$(I - P_{A/B})(I - P_{A/C}) = I - P_{A/B}$    (33)

by algebraic expansion of the product and use of (32).
We now establish general relations between weight matrices and their associated loading matrices. Let $T = XR$, with $R$ an arbitrary $p \times (i-1)$ weight matrix ($i > 1$). The order of $R$ is chosen for consistency with the main text. The loading matrix $P$ is then defined by

$P = X^T T (T^T T)^{-1} = X^T X R (R^T X^T X R)^{-1}$    (34)

By premultiplying both sides of this equation by $R^T$, we obtain

$R^T P = I_{i-1}$    (35)

so that the matrices $R$ and $P$ are generalized inverses of one another. It also follows from (35) that the weight vector $r_i$ for the $i$th factor is orthogonal to the loading vectors $\{p_j\}$ of the other factors ($j \ne i$). For any $t = Xr$ that is orthogonal to $T$, i.e. $T^T t = 0_{i-1}$, we have the relation [3]

$P^T r = (T^T T)^{-1} T^T X r = (T^T T)^{-1} T^T t = 0_{i-1}$    (36)

This relation does not immediately follow from (35); equation (35) also holds true for non-orthogonal $T$. However, if $T^T t = 0_{i-1}$, the loading matrix of X with respect to $[T, t]$ is simply the concatenation $[P, p]$, where $p = X^T t / (t^T t)$. From (35) with $R_* = [R, r]$ and $P_* = [P, p]$, $R_*^T P_* = I_i$, hence $[R, r]^T [P, p] = I_i$, so that $P^T r = 0_{i-1}$ as required.

It is important to note that the matrix products $R P^T$ and $P R^T$ are projectors, namely

$R P^T = R (R^T X^T X R)^{-1} R^T X^T X = P_{R / X^T X}$    (37)

the projector on $R$ in the $X^T X$ metric (31), and

$P R^T = X^T X R (R^T X^T X R)^{-1} R^T = P [P^T (X^T X)^- P]^{-1} P^T (X^T X)^- = P_{P / (X^T X)^-}$    (38)

the projector on $P$ in the $(X^T X)^-$ metric. The second equality sign in (38) can be verified by straightforward algebra after expanding $P$ on the right-hand side using (34). The (orthogonal) projectors in the $I_p$ metric on $R$ and $P$ also play a role, $R R^+ = R (R^T R)^{-1} R^T$ and $P P^+ = P (P^T P)^{-1} P^T$ respectively.


From the equality (33) or, alternatively, from (35) and the definitions of $P^+$ and $R^+$ it is now easy to derive that

$(I_p - R P^T)(I_p - R R^+) = I_p - R P^T$    (39)

$(I_p - R R^+)(I_p - R P^T) = I_p - R R^+$    (40)

$(I_p - P R^T)(I_p - P P^+) = I_p - P R^T$    (41)

$(I_p - P P^+)(I_p - P R^T) = I_p - P P^+$    (42)

On transposing both sides of (39)-(42), we obtain

$(I_p - R R^+)(I_p - P R^T) = I_p - P R^T$    (43)

$(I_p - P R^T)(I_p - R R^+) = I_p - R R^+$    (44)

$(I_p - P P^+)(I_p - R P^T) = I_p - R P^T$    (45)

$(I_p - R P^T)(I_p - P P^+) = I_p - P P^+$    (46)

respectively, because $R R^+$ and $P P^+$ are symmetric. From (43) and (41),

$(I_p - P R^T) = (I_p - R R^+)(I_p - P R^T) = (I_p - R R^+)(I_p - P R^T)(I_p - P P^+)$    (47)

and from (44) and (40),

$(I_p - R R^+) = (I_p - P R^T)(I_p - R R^+) = (I_p - P R^T)(I_p - R R^+)(I_p - R P^T)$    (48)
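These identities are easy to verify numerically. The sketch below (arbitrary random data; nothing is special to PLS here, in line with the remark above) checks (35), (39) and (40); the remaining identities can be checked in the same way:

% Numerical spot-check of (35), (39) and (40) for an arbitrary weight matrix R.
rng(1);
p = 6; n = 15; i = 4;
X = randn(n, p); X = X - mean(X);
R = randn(p, i-1);                        % arbitrary weights; T = X R
T = X * R;
P = X' * T / (T' * T);                    % loadings, cf. (34)
disp(norm(R' * P - eye(i-1)))             % ~0: (35)
A = eye(p) - R * P';                      % I_p - R P'
B = eye(p) - R * pinv(R);                 % I_p - R R^+
disp(norm(A * B - A))                     % ~0: (39)
disp(norm(B * A - B))                     % ~0: (40)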

APPENDIX III: EQUIVALENCE OF PLS AND SIMPLS FOR UNIVARIATE Y


De Jong [3] showed that PLS and SIMPLS produce the same sequence of factors, and thus, with any number of factors, the same predictive model for Y, when Y is univariate. In this appendix we present a simple proof that specializes the objective function (16) for univariate Y.

For the first factor, $r_1^N = r_1^S = r_1 = s / \|s\|$, with $s \equiv X^T y$. For the second and later factors we may write the optimization problems for the two methods as follows:

PLS: $\max_r\; r^T X^T y y^T X r / [r^T (I_p - R R^+) r] \quad \text{subject to} \quad P^T r = 0_{i-1}$    (49)

SIMPLS: $\max_r\; r^T X^T y y^T X r / (r^T r) \quad \text{subject to} \quad P^T r = 0_{i-1}$    (50)

Dividing both maximands by $\|X^T y\|^2$ does not affect the solution of the maximization problem, hence we may replace $X^T y$ by $r_1 = X^T y / \|X^T y\|$. We are also free to choose unit-length $r$, since neither criterion depends on the length of $r$. Notice that this choice differs from the treatment in Section 3, where we added the constraint $r^T M_1 r = 1$ rather than absorbing it into the maximand. Finally, we prove below that $r^T R R^+ r = k_i r^T r_1 r_1^T r$ for some constant $k_i > 0$ that is independent of $r$. This result is immediate for the second factor, with $k_2 = 1$, because $r_1 r_1^+ = r_1 (r_1^T r_1)^{-1} r_1^T = r_1 r_1^T$. Applying these substitutions and simplifying gives

PLS: $\max_r\; r^T r_1 r_1^T r / (1 - k_i r^T r_1 r_1^T r) \quad \text{subject to} \quad P^T r = 0_{i-1} \;\text{and}\; r^T r = 1$    (51)

SIMPLS: $\max_r\; r^T r_1 r_1^T r \quad \text{subject to} \quad P^T r = 0_{i-1} \;\text{and}\; r^T r = 1$    (52)


We observe that the maximizing solutions for PLS and SIMPLS coincide, because the minimum of the denominator in the PLS criterion coincides with the maximum of the numerator, both $k_i$ and the denominator being positive. Thus the normalized weight vectors defining the $i$th factor of PLS and of SIMPLS are identical.

It remains to prove that $r^T R R^+ r = k_i r^T r_1 r_1^T r$. Define $R_* = [r_1, p_1, p_2, \ldots, p_{i-2}]$. From the Krylov series properties of the PLS weight vectors and the loading vectors, especially the fact that the $j$th term of the latter series, $(X^T X)^j X^T y$, corresponds to the $(j+1)$st term of the former series [3], it follows that $R_*$ and $R$ have the same column space. Because $P^T r = 0_{i-1}$, $R_*^T r = (r_1^T r, 0, \ldots, 0)^T$. We obtain

$r^T R R^+ r = r^T R_* (R_*^T R_*)^{-1} R_*^T r = k_i r^T r_1 r_1^T r$    (53)

where $k_i$ is the (1,1)th element of $(R_*^T R_*)^{-1}$, which is indeed positive and independent of $r$. This concludes the proof.
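The equivalence is also easy to observe numerically. Using the sketches pls_deflate (Section 2) and simpls_sketch (Section 4), the normalized weight vectors coincide for univariate y, up to the sign indeterminacy of singular vectors:

% Check: for univariate y the normalized PLS and SIMPLS weights coincide.
rng(2);
X = randn(30, 5); X = X - mean(X);
y = randn(30, 1); y = y - mean(y);
[Rn, ~, ~, ~] = pls_deflate(X, y, 3);
[Rs, ~] = simpls_sketch(X, y, 3);
disp(norm(abs(Rn ./ vecnorm(Rn)) - abs(Rs ./ vecnorm(Rs))))   % ~0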
For multivariate Y we may write the optimization problem for the second factor for the two methods as

PLS: $\max_r\; r^T S S^T r / (1 - r^T r_1 r_1^T r) \quad \text{subject to} \quad r^T p_1 = 0 \;\text{and}\; r^T r = 1$    (54)

SIMPLS: $\max_r\; r^T S S^T r \quad \text{subject to} \quad r^T p_1 = 0 \;\text{and}\; r^T r = 1$    (55)

where $S \equiv X^T Y$. Let the SIMPLS criterion (55) be maximal for $r = r_2^S$. This also corresponds to the maximum of the numerator in the PLS criterion (54). There is no reason, however, that it coincides with the minimum of the denominator. The value of the PLS criterion can be increased, departing from $r = r_2^S$, when the denominator is further reduced at the expense of a smaller (relative) reduction of the numerator. Thus, for multivariate Y, PLS does not usually coincide with SIMPLS beyond the first factor. An exception is when $S$ is of rank one, i.e. when $S S^T$ can be written as $s s^T$ for some vector $s$. In that special case one may use the same reasoning as for univariate y, showing PLS and SIMPLS to be equivalent.

REFERENCES
1. A. J. Burnham, R. Viveros and J. F. MacGregor, J. Chemometrics, 10, 31–45 (1996).
2. A. Höskuldsson, J. Chemometrics, 2, 211–228 (1988).
3. S. de Jong, Chemometrics Intell. Lab. Syst., 18, 251–263 (1993).
4. M. Stone and R. J. Brooks, J. R. Stat. Soc. B, 52, 237–269 (1990).
5. I. E. Frank and J. H. Friedman, Technometrics, 35, 109–135 (1993).
6. C. J. F. ter Braak, S. Juggins, H. J. B. Birks and H. van der Voet, in Multivariate Environmental Statistics, ed. by G. P. Patil and C. R. Rao, pp. 525–560, North-Holland, Amsterdam (1993).
7. R. Brooks and M. Stone, J. Am. Stat. Assoc., 89, 1374–1377 (1994).
8. C. J. F. ter Braak and P. F. M. Verdonschot, Aquatic Sci., 57, 255–289 (1995).
9. J. R. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, pp. 32–33, 131–144, Wiley, New York (1988).
10. B. S. Dayal and J. F. MacGregor, J. Chemometrics, 11, 73–85 (1997).
11. A. Höskuldsson, Chemometrics Intell. Lab. Syst., 14, 139–153 (1992).
12. C. R. Rao, Linear Statistical Inference and Its Applications, 2nd edn, p. 50, Wiley, New York (1973).
13. E. Zhu and R. M. Barnes, J. Chemometrics, 9, 363–372 (1995).
14. J. Hinkle and W. Rayens, Chemometrics Intell. Lab. Syst., 30, 159–172 (1995).
15. M. C. Denham, Ph.D. Thesis, University of Liverpool (1991).
16. S. de Jong and C. J. F. ter Braak, J. Chemometrics, 8, 169–174 (1994).
17. A. R. Gourlay and G. A. Watson, Computational Methods for Matrix Eigenproblems, Wiley, New York (1973).
18. M. O. Hill, DECORANA—A FORTRAN Program for Detrended Correspondence Analysis and Reciprocal Averaging, Ecology and Systematics, Cornell University, Ithaca, NY (1979).
19. C. J. F. ter Braak, in Data Analysis in Community and Landscape Ecology, ed. by R. H. G. Jongman, C. J. F. ter Braak and O. F. R. van Tongeren, pp. 91–173, Pudoc, Wageningen (1987) (reprinted 1995, Cambridge University Press).
20. C. J. F. ter Braak, CANOCO—A FORTRAN Program for Canonical Community Ordination by [Partial] [Detrended] [Canonical] Correspondence Analysis, Principal Components Analysis and Redundancy Analysis (Version 2.1), Report LWA-88-02, Agricultural Mathematics Group, Wageningen (1988).
21. A. Phatak and S. de Jong, J. Chemometrics, 11, 311–338 (1997).

