
Tree Traversal

This paper presents fast and accurate algorithms for phylogeny reconstruction based on the Minimum Evolution (ME) principle, demonstrating that a greedy approach can significantly reduce computation time while maintaining topological accuracy. The Greedy Minimum Evolution (GME) algorithm and a balanced minimum evolution scheme are introduced, showing improvements over traditional methods like Neighbor Joining (NJ) in both speed and accuracy, especially for large trees. The authors conclude that their methods are statistically consistent and advantageous for phylogenetic inference using evolutionary distance data.


JOURNAL OF COMPUTATIONAL BIOLOGY

Volume 9, Number 5, 2002


© Mary Ann Liebert, Inc.
Pp. 687–705

Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle

RICHARD DESPER¹ and OLIVIER GASCUEL²

ABSTRACT

The Minimum Evolution (ME) approach to phylogeny estimation has been shown to be
statistically consistent when it is used in conjunction with ordinary least-squares (OLS)
fitting of a metric to a tree structure. The traditional approach to using ME has been to
start with the Neighbor Joining (NJ) topology for a given matrix and then do a topological
search from that starting point. The first stage requires $O(n^3)$ time, where $n$ is the number of
taxa, while the current implementations of the second are in $O(pn^3)$ or more, where $p$ is the
number of swaps performed by the program. In this paper, we examine a greedy approach
to minimum evolution which produces a starting topology in $O(n^2)$ time. Moreover, we
provide an algorithm that searches for the best topology using nearest neighbor interchanges
(NNIs), where the cost of doing $p$ NNIs is $O(n^2 + pn)$, i.e., $O(n^2)$ in practice because $p$
is always much smaller than $n$. The Greedy Minimum Evolution (GME) algorithm, when
used in combination with NNIs, produces trees which are fairly close to NJ trees in terms
of topological accuracy. We also examine ME under a balanced weighting scheme, where
sibling subtrees have equal weight, as opposed to the standard “unweighted” OLS, where
all taxa have the same weight so that the weight of a subtree is equal to the number of its
taxa. The balanced minimum evolution scheme (BME) runs slower than the OLS version,
requiring $O(n^2 \times \mathrm{diam}(T))$ operations to build the starting tree and $O(pn \times \mathrm{diam}(T))$ to
perform the NNIs, where $\mathrm{diam}(T)$ is the topological diameter of the output tree. In the
usual Yule-Harding distribution on phylogenetic trees, the diameter expectation is in $\log(n)$,
so our algorithms are in practice faster than NJ. Moreover, this BME scheme yields a very
significant improvement over NJ and other distance-based algorithms, especially with large
trees, in terms of topological accuracy.

Key words: phylogenetic inference, distance methods, minimum evolution, topological accuracy,
computational speed.

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD, 20892.
² Département Informatique Fondamentale et Applications, LIRMM, 161 rue Ada, 34392 Montpellier, France.


INTRODUCTION

Minimum evolution was proposed by several authors (Kidd and Sgaramella-Zonta, 1971; Saitou
and Nei, 1987; Rzhetsky and Nei, 1993; Swofford et al., 1996) as a basic principle for phylogenetic
inference. Given the matrix of pairwise evolutionary distances between the taxa being studied, this
principle involves first estimating the length of any given topology and then selecting the topology with
shortest length. Minimum evolution is thus conceptually close to character-based parsimony and complies
with Occam’s principle of scientific inference, which essentially maintains that simpler explanations are
preferable to more complicated ones and that ad hoc explanations should be avoided.
Numerous variants of the minimum evolution principle exist, depending on how the branch lengths
are estimated and how the tree length is calculated from these branch lengths. Several definitions of tree
length have been proposed, differing from one another by the treatment of negative branch lengths. The
most common solution (Saitou and Nei, 1987; Rzhetsky and Nei, 1993) simply defines the tree length
as the sum of all branch lengths, regardless of whether they are positive or negative. Branch lengths
are usually estimated within the least-squares framework. If all distance estimates can be assumed to be
independent and to have the same variance, we use the ordinary least-squares (OLS) framework. The
weighted least-squares framework corresponds to the case where distance estimates are independent but
(possibly) with different variances, while the generalized least-squares approach does not impose any
restriction and is able to benefit from the covariances of the distance estimates. It is well known that
distance estimates obtained from sequences do not have the same variance, because the largest distances
are much more variable than the shortest ones (Fitch and Margoliash, 1967) and are mutually dependent
when they share a common history (or path) in the true phylogeny (Nei and Jin, 1989). Therefore,
to estimate branch lengths from evolutionary distances, using generalized least-squares is theoretically
superior to using weighted least-squares, which is in turn more appropriate than ordinary least-squares
(Bulmer, 1991).
The minimum evolution principle has been shown to be statistically consistent when combined with
ordinary least-squares (Rzhetsky and Nei, 1993; Denis and Gascuel, 2002). This important property implies
that the more accurate the distance estimates, as induced by the use of long sequences when a correct
sequence evolution model is chosen, the higher the probability of recovering the true phylogeny. However,
ordinary least-squares poorly fits the features of evolutionary distance data, as explained above. Thus,
it is tempting to combine the minimum-evolution principle with a more reliable estimation of branch
lengths, using weighted least-squares or generalized least-squares. However, we recently demonstrated
that such a combination is not always statistically consistent and, therefore, could represent a dead end
towards obtaining better phylogenetic inference methods, especially in the case of generalized least-squares
(Gascuel et al., 2001).
This paper further investigates the minimum evolution principle, but with a more optimistic perspective.
First, we demonstrate that its usage in combination with ordinary least-squares, even when not fully optimal
in terms of topological accuracy, has the great advantage of leading to very fast algorithms, much faster
than the NJ algorithm (Saitou and Nei, 1987) and fast enough to build very large trees as
envisaged in biodiversity studies. Second, we show that a new version of this principle, first introduced
by Pauplin (2000) to simplify tree length computation, is more appropriate than the OLS version. In this
new version, sibling subtrees have equal weight, as opposed to the standard unweighted OLS, where all
taxa have the same weight and thus the weight of a subtree is equal to the number of its taxa. This
new version can be seen as weighted, just as WPGMA is the weighted version of UPGMA (Sneath and
Sokal, 1973), but we will prefer the term “balanced” to avoid confusion with weighted least-squares.
In addition to the aforementioned fast OLS minimum evolution algorithms, we also present algorithms
to deal with this new balanced version that are also faster than NJ, though not as fast as their OLS
counterparts. Furthermore, the balanced algorithms produce output trees with better topological accuracy
than those from NJ, BIONJ (Gascuel, 1997a) and WEIGHBOR (Bruno et al., 2000). The rest of this paper
is organized as follows: we first provide the notation and definitions, then describe the algorithms for the
OLS version of the minimum evolution principle, explain how these algorithms are modified to deal with the
balanced version, provide simulation results to illustrate the gain in topological accuracy and run times, and
conclude with a brief discussion. The appendix provides the details of the algorithms and some mathematical
proofs.

1. NOTATION, DEFINITIONS, AND FORMULAE

A tree is made of nodes (or vertices) and of edges (or branches). Among the nodes, we distinguish the
internal (or ancestral) nodes and the leaves (or taxa). The leaves are denoted as $i$, $j$, or $k$ and the internal
nodes as $u$, $v$, or $w$, while an edge $e$ is defined by a pair of nodes and a length $l(e)$. We shall be considering
various length assignments of the same underlying shape. In this case, we shall use the word “topology,”
while “tree” will be reserved for an instance of a topology with given edge lengths associated. We use the
letter $\mathcal{T}$ to refer to a topology and $T$ to refer to a tree. A tree is also made of subtrees (or clades), typically
denoted as $A$, $B$, $C$, or $D$. For the sake of simplicity, we shall use the same notation for the subtrees and
for the sets of taxa they contain. Accordingly, $T$ also represents the set of taxa being studied, and $n$ is the
number of these taxa. Moreover, we shall use lowercase letters, e.g., $a$, $b$, $c$, or $d$, to represent the subtree
roots. If $A$ and $B$ are two disjoint subtrees, with roots $a$ and $b$ respectively, we’ll say that $A$ and $B$ are
distant-$k$ subtrees if there are $k$ edges in the path from $a$ to $b$.
The matrix $\Delta$ is the matrix of pairwise evolutionary distance estimates, with $\Delta_{ij}$ being the distance
between taxa $i$ and $j$. Let $A$ and $B$ be two nonintersecting subtrees from a tree $T$. We define the average
distance between $A$ and $B$ as

$$\Delta_{A|B} = \frac{1}{|A||B|} \sum_{i \in A,\, j \in B} \Delta_{ij}. \tag{1}$$

$\Delta_{A|B}$ may also be defined recursively as:

• If $A$ and $B$ are singleton sets, i.e., $A = \{a\}$ and $B = \{b\}$, then $\Delta_{A|B} = \Delta_{ab}$,
• Else, without loss of generality let $B = B_1 \cup B_2$ as shown in Fig. 1, we then have

$$\Delta_{A|B} = \frac{|B_1|}{|B|} \Delta_{A|B_1} + \frac{|B_2|}{|B|} \Delta_{A|B_2}. \tag{2}$$

It is easily seen that Equations (1) and (2) are equivalent. Equation (2) follows the notion that the weight
of a subtree is proportional to the number of its taxa. So every taxon has the same weight, and the same
holds for the distances as shown by Equation (1). Thus, this average is said to be unweighted (Sneath and
Sokal, 1973). It must be noted that the unweighted average distance between subtrees does not depend on
their topologies, but only on the taxa they contain.
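
As an illustration of Equations (1) and (2), the following Python sketch computes the unweighted average distance both ways and checks that the two values agree. It assumes, purely for illustration, that a subtree is written as a nested pair of leaf names and that the distance matrix is a dictionary of dictionaries. The helper names (`symmetric`, `leaves`, `avg_direct`, `avg_recursive`) and the numerical distances are hypothetical and are not taken from the paper.

```python
from itertools import product


def symmetric(pairs):
    """Build a symmetric dict-of-dicts distance matrix from {(i, j): value}."""
    dist = {}
    for (i, j), value in pairs.items():
        dist.setdefault(i, {})[j] = value
        dist.setdefault(j, {})[i] = value
    return dist


def leaves(subtree):
    """Return the leaf names of a subtree (a leaf name or a nested pair)."""
    if isinstance(subtree, str):
        return [subtree]
    return [leaf for child in subtree for leaf in leaves(child)]


def avg_direct(dist, A, B):
    """Equation (1): plain mean over all leaf pairs (i in A, j in B)."""
    la, lb = leaves(A), leaves(B)
    return sum(dist[i][j] for i, j in product(la, lb)) / (len(la) * len(lb))


def avg_recursive(dist, A, B):
    """Equation (2): split a composite subtree, weighting by leaf counts."""
    if isinstance(A, str) and isinstance(B, str):
        return dist[A][B]
    if isinstance(B, str):
        A, B = B, A  # the average is symmetric, so split the composite side
    B1, B2 = B
    n1, n2 = len(leaves(B1)), len(leaves(B2))
    return (n1 * avg_recursive(dist, A, B1)
            + n2 * avg_recursive(dist, A, B2)) / (n1 + n2)


# Hypothetical distances on five taxa (illustrative values only).
DIST = symmetric({
    ("a", "b"): 3, ("a", "c"): 9, ("a", "d"): 10, ("a", "e"): 6,
    ("b", "c"): 10, ("b", "d"): 11, ("b", "e"): 7,
    ("c", "d"): 7, ("c", "e"): 8, ("d", "e"): 9,
})
A = ("a", "b")
B = (("c", "d"), "e")
direct = avg_direct(DIST, A, B)
recursive = avg_recursive(DIST, A, B)
```

Changing the shape of `B`, for example to `("c", ("d", "e"))`, leaves both values unchanged, which is the topology independence noted above.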

FIG. 1. Subtree $B$ is composed of the two sibling trees $B_1$ and $B_2$.



The distance $\Delta^T$ is the distance induced by the tree $T$; i.e., $\Delta^T_{ij}$ is equal to the length of the path
connecting $i$ to $j$ in $T$, for every taxon pair $(i, j)$. Given a topology $\mathcal{T}$ and a distance matrix $\Delta$, the OLS
branch length estimation produces the tree $T$ with topology $\mathcal{T}$ minimizing the sum of squares:

$$\sum_{i,j \in T} \left(\Delta^T_{ij} - \Delta_{ij}\right)^2.$$

Vach (1989), Rzhetsky and Nei (1993), and others showed analytical formulae for the proper OLS edge
length estimation, as functions of the average distances. Suppose $e$ is an internal edge of $T$, with the four
subtrees $A$, $B$, $C$, and $D$ defined as depicted in Fig. 2(a). Then, the OLS length estimate of $e$ is equal to

$$l(e) = \frac{1}{2}\left[\lambda(\Delta_{A|C} + \Delta_{B|D}) + (1 - \lambda)(\Delta_{A|D} + \Delta_{B|C}) - (\Delta_{A|B} + \Delta_{C|D})\right], \tag{3}$$

where

$$\lambda = \frac{|A||D| + |B||C|}{(|A| + |B|)(|C| + |D|)}.$$

Suppose $e$ is an external branch, with $i$, $A$, and $B$ as represented in Fig. 2(b). Then we have

$$l(e) = \frac{1}{2}\left(\Delta_{A|i} + \Delta_{B|i} - \Delta_{A|B}\right). \tag{4}$$

Equations (3) and (4) demonstrate an important property of OLS edge length estimation: the length
estimate of any given edge does not depend on the topology of the “corner” subtrees, i.e., $A$, $B$, $C$, and
$D$ in Equation (3) and $A$ and $B$ in Equation (4), but only on the taxa contained in these subtrees.
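
Equations (3) and (4) can be checked numerically with a minimal Python sketch. The function names and argument order are assumptions made for this illustration. The test distances come from a hypothetical additive four-taxon tree with cherries $\{a, b\}$ and $\{c, d\}$, external edge lengths 1, 2, 3, 4 for $a$, $b$, $c$, $d$, and an internal edge of length 5, so the OLS estimates should recover those lengths.

```python
def ols_internal_edge(nA, nB, nC, nD, dAB, dAC, dAD, dBC, dBD, dCD):
    """Equation (3): OLS length of the internal edge separating A, B from C, D.

    nA..nD are leaf counts; dXY is the unweighted average distance X|Y.
    """
    lam = (nA * nD + nB * nC) / ((nA + nB) * (nC + nD))
    return 0.5 * (lam * (dAC + dBD) + (1 - lam) * (dAD + dBC) - (dAB + dCD))


def ols_external_edge(dAi, dBi, dAB):
    """Equation (4): OLS length of the external edge leading to leaf i."""
    return 0.5 * (dAi + dBi - dAB)


# Path lengths in the hypothetical additive tree:
# ab = 3, cd = 7, ac = 9, ad = 10, bc = 10, bd = 11.
internal = ols_internal_edge(1, 1, 1, 1,
                             dAB=3, dAC=9, dAD=10, dBC=10, dBD=11, dCD=7)

# External edge of leaf a: the two corner subtrees are {b} and {c, d}.
d_b_a = 3
d_cd_a = (9 + 10) / 2    # average of ac and ad
d_b_cd = (10 + 11) / 2   # average of bc and bd
external_a = ols_external_edge(d_b_a, d_cd_a, d_b_cd)
```

Only leaf counts and average distances enter the two functions, which reflects the property stated above: the internal shape of the corner subtrees plays no role.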
Following Saitou and Nei (1987) and Rzhetsky and Nei (1993), we define the tree length $l(T)$ of $T$ to
be the sum of the edge lengths of $T$. The OLS minimum evolution tree is then that tree with topology $\mathcal{T}$
minimizing $l(T)$, where $T$ has the OLS edge length estimates for $\mathcal{T}$, and $\mathcal{T}$ ranges over all possible tree
topologies for the taxa being studied.
Now, suppose that we are interested in the length of the tree $T$ shown in Fig. 2(a), depending on the
configuration of the corner subtrees. We then have (proof in Appendix 1)

$$l(T) = \frac{1}{2}\left[\lambda(\Delta_{A|C} + \Delta_{B|D}) + (1 - \lambda)(\Delta_{A|D} + \Delta_{B|C}) + \Delta_{A|B} + \Delta_{C|D}\right] + l(A) + l(B) + l(C) + l(D) - \Delta_{a|A} - \Delta_{b|B} - \Delta_{c|C} - \Delta_{d|D}, \tag{5}$$

where $\lambda$ is defined as in Equation (3). The advantage of Equation (5) is that the lengths $l(A)$, $l(B)$, $l(C)$,
and $l(D)$ of the corner subtrees, as well as the average root/leaf distances, $\Delta_{a|A}$, $\Delta_{b|B}$, $\Delta_{c|C}$, and $\Delta_{d|D}$,
do not depend on the configuration of $A$, $B$, $C$, and $D$ around $e$. Exchanging $B$ and $C$ or $B$ and $D$ might
change the length of the five edges shown in Fig. 2(a) and hence the length of $T$, but not the lengths of
$A$, $B$, $C$, and $D$. This simply comes from the fact that the edge $e$ is within the corner subtrees associated
with any of the edges of $A$, $B$, $C$, and $D$. As we shall see in the next section, this property is of great help
in designing fast OLS tree-swapping algorithms.

FIG. 2. Corner subtrees used to estimate the length of $e$: (a) for $e$ an internal edge, (b) for $e$ an external edge.

Let us now turn our attention toward the balanced version of minimum evolution, as defined by Pauplin
(2000). The tree length definition is the same. Formulae for edge length estimates are identical to Equations
(3) and (4), with $\lambda$ replaced by $1/2$ and using a different definition of the average distance between subtrees
that depends on the topology under consideration. Letting $\mathcal{T}$ be this topology, the balanced average distance
between two nonintersecting subtrees $A$ and $B$ is then recursively defined by the following:

• If $A$ and $B$ are singleton sets, i.e., $A = \{a\}$ and $B = \{b\}$, then $\Delta^{\mathcal{T}}_{A|B} = \Delta_{ab}$;
• Else, without loss of generality let $B = B_1 \cup B_2$ as shown in Fig. 1, we then have

$$\Delta^{\mathcal{T}}_{A|B} = \frac{1}{2}\left(\Delta^{\mathcal{T}}_{A|B_1} + \Delta^{\mathcal{T}}_{A|B_2}\right). \tag{6}$$

The change from Equation (2) is that the sibling subtrees $B_1$ and $B_2$ now have equal weight, regardless
of the number of taxa they contain. Thus, taxa do not have the same influence depending on whether they
belong to a large clade or are isolated, which can be seen as consistent in the phylogenetic inference context
(Sneath and Sokal, 1973). Moreover, a comparison of variants of the NJ algorithm showed by computer
simulation (Gascuel, 2000) that this “balanced” approach is more appropriate than the unweighted one
for reconstructing phylogenies with evolutionary distances estimated from sequences. Finally, the balanced
minimum evolution principle can be shown to be statistically consistent using a proof (to be published
elsewhere) along the lines of Rzhetsky and Nei’s (1993). Therefore, it was tempting to test the performance
of this balanced version of the minimum evolution principle.
Unfortunately, this new version does not have all of the good properties of the OLS version: the edge
length estimates given by Equations (3) and (4) now depend on the topology of the corner subtrees, simply
because the balanced average distances between these subtrees depend on their topologies. As we shall
see, this makes the algorithms more complicated and more expensive in computing time than with OLS.
However, the same tree length formula, Equation (5), holds with $\Delta$ being replaced by $\Delta^{\mathcal{T}}$ and $\lambda$ by $1/2$,
and, fortunately, we still have the good property that tree lengths $l(A)$, $l(B)$, $l(C)$, and $l(D)$, as well as
average root/leaf distances $\Delta^{\mathcal{T}}_{a|A}$, $\Delta^{\mathcal{T}}_{b|B}$, $\Delta^{\mathcal{T}}_{c|C}$, and $\Delta^{\mathcal{T}}_{d|D}$, remain unchanged when $B$ and $C$ or $B$ and $D$
are swapped. Edge lengths within the corner subtrees may change when performing a swap, but their
(balanced) sums remain identical (proof in Appendix 2).
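
The recursion of Equation (6) can be sketched in a few lines of Python. As an assumption made for this illustration, a subtree is written as a nested pair of leaf names, so that its topology is explicit; the names `symmetric` and `balanced_avg` and the distances are hypothetical.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts distance matrix from {(i, j): value}."""
    dist = {}
    for (i, j), value in pairs.items():
        dist.setdefault(i, {})[j] = value
        dist.setdefault(j, {})[i] = value
    return dist


def balanced_avg(dist, A, B):
    """Equation (6): sibling subtrees get weight 1/2 each, whatever their size."""
    if isinstance(A, str) and isinstance(B, str):
        return dist[A][B]
    if isinstance(B, str):
        A, B = B, A  # the average is symmetric, so split the composite side
    B1, B2 = B
    return 0.5 * (balanced_avg(dist, A, B1) + balanced_avg(dist, A, B2))


# Hypothetical distances (illustrative values only).
DIST = symmetric({
    ("a", "b"): 3, ("a", "c"): 9, ("a", "d"): 10,
    ("b", "c"): 10, ("b", "d"): 11, ("c", "d"): 7,
})

# Same taxa {b, c, d}, two different subtree topologies.
shape_1 = balanced_avg(DIST, "a", (("b", "c"), "d"))  # weights 1/4, 1/4, 1/2
shape_2 = balanced_avg(DIST, "a", ("b", ("c", "d")))  # weights 1/2, 1/4, 1/4
unweighted = (3 + 9 + 10) / 3                         # Equation (1)
```

The two shapes give different values (8.0 and 6.25 here), neither equal to the unweighted mean, which illustrates why balanced averages must be tracked per topology.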

2. ALGORITHMS FOR THE OLS VERSION OF MINIMUM EVOLUTION

This section presents two algorithms for phylogenetic inference. The first constructs an initial tree by
the stepwise addition of taxa to a growing tree, while the second improves this tree by performing local
rearrangements (or swapping) of subtrees. Both follow a greedy approach and tend, at each step, to minimize
the OLS version of the minimum evolution criterion. This approach does not guarantee that the
global optimum will be reached, but only a local optimum. However, this kind of approach has proven to
be effective in many optimization problems (Cormen et al., 2000, 329–356), and we shall see that further
optimizing the minimum evolution criterion would not yield significant improvement in terms of topological
accuracy. Moreover, such a combination of heuristic and optimization algorithms is used in numerous
phylogenetic reconstruction methods, for example those from the PHYLIP package (Felsenstein, 1989).

2.1. The GME greedy addition algorithm


Given an ordering on the taxa, denoted as $(1, 2, 3, \ldots, n)$, for $k = 4$ to $n$, we create a tree $T_k$ on
the taxa set $(1, 2, 3, \ldots, k)$. We do this by testing each edge of $T_{k-1}$ as a possible insertion point for $k$,
and the different insertion points are compared by the minimum evolution criterion. Inserting $k$ on any
edge of $T_{k-1}$ removes that edge, changes the length of every other already existing edge, and requires
the computation of the length of the three newly created edges. Computing the new tree length for every
possible insertion point would seem to be computationally expensive. However, a much simpler approach
exists to determine the best insertion point.

Consider the tree $T$ of Fig. 3, where $k$ is inserted between subtrees $C$ and $A \cup B$, and assume that we
have the length $l(T) = L$ of this new tree. Consider now the tree $T'$ of Fig. 3, which is obtained from $T$
by exchanging $k$ and $A$. Using Equation (5) and our above remarks we have

$$l(T') = L + \frac{1}{2}\left[(\lambda - \lambda')(\Delta_{k|A} + \Delta_{B|C}) + (\lambda' - 1)(\Delta_{A|B} + \Delta_{k|C}) + (1 - \lambda)(\Delta_{A|C} + \Delta_{k|B})\right], \tag{7}$$

where

$$\lambda = \frac{|A| + |B||C|}{(|A| + |B|)(|C| + 1)},$$

and

$$\lambda' = \frac{|A| + |B||C|}{(|A| + |C|)(|B| + 1)}.$$

In other words, the length of $T'$ can be computed from the length of $T$. For this computation to be done
in $O(1)$ (i.e., constant) time, it is sufficient to have previously computed

1. all average distances $\Delta_{k|S}$ between $k$ and any subtree $S$ from $T_{k-1}$;
2. all average distances between subtrees of $T_{k-1}$ separated by two edges; for example, $A$ and $B$ in Fig. 3;
3. the number of leaves of every subtree.

Suppose we now consider the tree $T''$ formed by moving the insertion of $k$ to the edge $e$, where $e$ is a sibling
edge to the insertion point of $T'$. The length of $T''$ is computed by Equation (7) as $l(T'') = L + f(e)$,
where $f(e)$ depends on the computations for both $T'$ and $T''$. We continue, searching every edge $e$ of
$T_{k-1}$ by recursively moving from one edge to its neighboring edges, and we obtain the cost $c(e)$ that
corresponds to the length of the tree $T_{k-1}$ plus $k$ inserted on $e$. Moreover, $c(e)$ can be written as $L + f(e)$.
Because we only seek to determine the best insertion point, we need not calculate the actual value of $L$,
as it is sufficient to minimize $f(e)$ with $f = 0$ for the first insertion edge considered.
The algorithm can be summarized as follows:

• For $k = 3$, initialize the matrix of average distances between distant-2 subtrees and the array counting
the number of taxa per subtree. Form $T_3$ with leaf set $\{1, 2, 3\}$.
• For $k = 4$ to $n$,
1. compute all $\Delta_{k|S}$ average distances;
2. starting from an initial edge $e_0$ of $T_{k-1}$, set $f(e_0) = 0$ and recursively search each edge $e$ to obtain
$f(e)$ from Equation (7);
3. select the best edge by minimizing $f$, insert $k$ on that edge to form $T_k$, and update the average
distance between every pair of distant-2 subtrees as well as the number of taxa per subtree;
• return $T_n$.
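
The constant-time evaluation used in Step 2 can be sketched as a direct transcription of Equation (7). This is only the per-edge update, not the full GME traversal; the function name and argument layout are assumptions made for this illustration, and the distances come from a hypothetical additive tree with cherries $\{a, b\}$ and $\{c, d\}$ (external edges 1, 2, 3, 4 and internal edge 5, total length 15).

```python
def length_after_moving_k(L, nA, nB, nC, dkA, dkB, dkC, dAB, dAC, dBC):
    """Equation (7): length of T' (k exchanged with A), given the length L of T.

    nA, nB, nC are leaf counts; dXY are unweighted average distances X|Y.
    """
    lam = (nA + nB * nC) / ((nA + nB) * (nC + 1))
    lam_prime = (nA + nB * nC) / ((nA + nC) * (nB + 1))
    return L + 0.5 * ((lam - lam_prime) * (dkA + dBC)
                      + (lam_prime - 1) * (dAB + dkC)
                      + (1 - lam) * (dAC + dkB))


# Roles: A = {a}, B = {b}, C = {c}, k = d.  In T, k sits next to C, which is
# the true topology ab|cd, so l(T) = 1 + 2 + 3 + 4 + 5 = 15.
L_true = 15.0
L_moved = length_after_moving_k(L_true, 1, 1, 1,
                                dkA=10, dkB=11, dkC=7,
                                dAB=3, dAC=9, dBC=10)
f_e = L_moved - L_true  # the quantity f(e) that the greedy search minimizes
```

Here `f_e` is positive, so moving $k$ away from its true position lengthens the tree and the greedy search would keep the original insertion edge.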

FIG. 3. $T'$ is obtained from $T$ by swapping $A$ and $k$.



To achieve Step 1, we recursively apply Equation (2), which requires $O(k)$ computing time (see Appendix 3).
Step 2 is also done in $O(k)$ time, as explained above. Finally, to update the average distance
between any pair $A$, $B$ of distant-2 subtrees, if $k$ is inserted in the subtree $A$,

$$\Delta_{\{k\} \cup A|B} = \frac{1}{1 + |A|} \Delta_{k|B} + \frac{|A|}{1 + |A|} \Delta_{A|B}. \tag{8}$$

Step 3 is also done in $O(k)$ time, because there are $O(k)$ pairs of distant-2 subtrees, and all the quantities
in the right-hand side of Equation (8) have already been computed. So we build $T_k$ from $T_{k-1}$ in $O(k)$
computational time, and thus the entire computational cost of the construction of $T$, as we sum over $k$,
is $O(n^2)$. This is much faster than NJ-like algorithms, which require $O(n^3)$ operations, and the FITCH
(Felsenstein, 1997) program, which requires $O(n^4)$ operations. As we shall see, this allows trees with
thousands of taxa to be constructed in a few minutes. This algorithm is called GME (Greedy Minimum
Evolution) and additional details are described in Appendix 3.
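
The update of Equation (8) is a single weighted mean, as the following minimal Python sketch shows. The function name and the sample values are hypothetical; the check compares the update with a direct average over all leaf pairs.

```python
def update_after_insertion(size_A, d_kB, d_AB):
    """Equation (8): average distance between {k} U A and B after k joins A."""
    return (d_kB + size_A * d_AB) / (1 + size_A)


# Hypothetical example: A = {a, b}, B = {c}, new taxon k = d,
# with distances ac = 9, bc = 10 and dc = 7.
d_AB = (9 + 10) / 2              # unweighted average A|B before insertion
d_kB = 7                         # distance between k and the single leaf of B
updated = update_after_insertion(2, d_kB, d_AB)
direct = (9 + 10 + 7) / 3        # Equation (1) applied to {a, b, d} versus {c}
```

Because the update needs only the subtree size and two stored averages, each of the $O(k)$ distant-2 pairs is refreshed in constant time.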

2.2. The FASTNNI tree swapping algorithm


This algorithm iteratively exchanges subtrees of an initial tree in order to minimize its OLS length
estimate. There are many possible definitions of “subtree exchange.” Since the number of combinations is
high, we usually only consider exchanging neighboring subtrees, and, at least initially, we restrict ourselves
to the exchange of subtrees separated by three edges, for example, $B$ and $C$ in Fig. 2(a). Such a procedure
is called a “nearest neighbor interchange,” since exchanging subtrees separated by one or two edges does
not yield any modification of the initial tree. We adopted this approach because it allows a fast algorithm,
and it is sufficient to reach a good topological accuracy.
Consider Fig. 2(a), and assume that the swap between $B$ and $C$ is considered. Let $T$ be the initial tree
and $T'$ the swapped tree. According to Equation (5) and our remarks, we have

$$L(T) - L(T') = \frac{1}{2}\left[(\lambda - 1)(\Delta_{A|C} + \Delta_{B|D}) - (\lambda' - 1)(\Delta_{A|B} + \Delta_{C|D}) - (\lambda - \lambda')(\Delta_{A|D} + \Delta_{B|C})\right], \tag{9}$$

where $\lambda$ is as defined in Section 1, and

$$\lambda' = \frac{|A||D| + |B||C|}{(|A| + |C|)(|B| + |D|)}.$$

The swap has to be performed when $L(T) - L(T') > 0$, and the best among all possible swaps (two
per internal edge) corresponds to the largest difference between $L(T)$ and $L(T')$. Moreover, assuming
that the average distances between the corner subtrees have already been computed, $L(T) - L(T')$ can be
obtained in $O(1)$ time via Equation (9). Instead of computing the average distances between the corner
subtrees (which change when swaps are realized), we compute the average distances between every pair
of nonintersecting subtrees. This takes place before evaluating the swaps and requires $O(n^2)$ time, using
an algorithm that is described in Appendix 4. The whole algorithm can be summarized as follows:

1. precompute the average distances between nonintersecting subtrees;
2. run over all internal edges and select the best swap using Equation (9);
3. if the best swap does not improve the length of the tree, i.e., if $L(T) - L(T') \le 0$, stop and return the
current tree, else perform the swap, compute the average distances between the newly created subtrees
($A \cup C$ and $B \cup D$ in our example above) and the other nonintersecting subtrees using Equation (2),
and go to Step 2.

Step 1 requires $O(n^2)$ time, Step 2 requires $O(n)$ time and Step 3 also requires $O(n)$ time because the
total number of subtrees is $O(n)$. Thus, the total complexity of the algorithm is $O(n^2 + pn)$, where $p$ is
the number of swaps performed. In practice, $p$ is much smaller than $n$, as we shall see in Section 4, so this
algorithm has a practical time complexity of $O(n^2)$. It is very fast, able to improve trees with thousands of
taxa, and we call it FASTNNI (fast nearest neighbor interchanges). More details are given in Appendix 4.
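
The swap test of Step 2 can be sketched as a direct transcription of Equation (9). This covers a single internal edge only, not the full FASTNNI loop; the function name and argument layout are assumptions made for this illustration. The example starts from a deliberately wrong topology for distances generated by a hypothetical additive tree with cherries $\{a, b\}$ and $\{c, d\}$.

```python
def ols_nni_gain(nA, nB, nC, nD, dAB, dAC, dAD, dBC, dBD, dCD):
    """Equation (9): L(T) - L(T') when B and C are exchanged across edge e.

    nA..nD are leaf counts; dXY are unweighted average distances X|Y.
    A positive value means the swap shortens the tree.
    """
    cross = nA * nD + nB * nC
    lam = cross / ((nA + nB) * (nC + nD))
    lam_prime = cross / ((nA + nC) * (nB + nD))
    return 0.5 * ((lam - 1) * (dAC + dBD)
                  - (lam_prime - 1) * (dAB + dCD)
                  - (lam - lam_prime) * (dAD + dBC))


# Current (wrong) topology ac|bd, i.e. roles A = {a}, B = {c}, C = {b}, D = {d}.
# Taxon distances: ab = 3, cd = 7, ac = 9, ad = 10, bc = 10, bd = 11.
gain = ols_nni_gain(1, 1, 1, 1,
                    dAB=9,    # a versus c
                    dAC=3,    # a versus b
                    dAD=10,   # a versus d
                    dBC=10,   # c versus b
                    dBD=7,    # c versus d
                    dCD=11)   # b versus d
```

The gain is 2.5, which matches the difference between the OLS lengths of the wrong quartet (17.5) and the true one (15), so the swap would be accepted.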
Rzhetsky and Nei (1993) describe a procedure that requires $O(n^2)$ to compute every branch length. In
one NNI, five branch lengths are changed, so evaluating a single swap is in $O(n^2)$, searching for the best
swap in $O(n^3)$, and their whole procedure in $O(pn^3)$. This can be improved using Bryant and Waddell’s
(1998) results, but the implementation in the PAUP environment of these ideas is still in progress (David
Bryant, personal communication). In any case, our $O(n^2)$ complexity is optimal since it is identical to the
data size.
Neither FASTNNI nor GME needs to explicitly compute the length of the whole tree until the final
topology is reached. To obtain the length of the final tree, we use Equations (3) and (4), which requires
$O(n)$ time as the average distances between corner subtrees have already been computed during the
execution of FASTNNI.

3. ALGORITHMS FOR THE BALANCED VERSION OF MINIMUM EVOLUTION

The balanced averaging scheme lends itself both to an insertion-based approach and to tree swapping
from an initial topology, and the algorithms are essentially the same as with OLS. The main difference is
that updating can no longer be achieved using a fast method as expressed by Equation (8), because the
balanced average distance between $A \cup \{k\}$ and $B$ now depends on the position of $k$ within $A$.

3.1. The BME addition algorithm


This algorithm is similar to that for OLS. Equation (7) simplifies to

$$L(T') = L + \frac{1}{4}\left[(\Delta^{\mathcal{T}}_{A|C} + \Delta^{\mathcal{T}}_{k|B}) - (\Delta^{\mathcal{T}}_{A|B} + \Delta^{\mathcal{T}}_{k|C})\right]. \tag{10}$$

Step 1 is identical and provides all $\Delta^{\mathcal{T}}_{k|S}$ distances by recursively applying Equation (6). The main difference
is the updating performed in Step 3. Equation (10) requires only the balanced average distances between
distant-2 subtrees, but to iteratively update these distances, we use (and update) a data structure that
contains all distances between every pair of nonintersecting subtrees (as for FASTNNI).
When $k$ is inserted into $T_{k-1}$, we must calculate the average $\Delta^{T_k}_{X|Y \cup \{k\}}$ for any subtree $Y$ of $T_{k-1}$ such
that $Y \cup \{k\}$ is a subtree of $T_k$, and any subtree $X$ disjoint from $Y \cup \{k\}$. We can enumerate all such pairs by
considering their respective roots. Let $x$ be the root of $X$ and $y$ the root of $Y$. Regardless of the position
of $k$, any node of $T_k$ could serve as the root $x$ of $X$. Then, considering a fixed $x$, any node in the path
from $x$ to $k$ could serve as the root $y$. Thus there are $O(k \times \mathrm{diam}(T_k))$ such pairs, where diam is the tree
diameter, i.e., the maximum number of edges between two leaves.
Given such a pair, $X$, $Y$, let us consider how we may quickly calculate $\Delta^{T_k}_{X|Y \cup \{k\}}$ from known quantities.
Consider the situation as depicted in Fig. 4. Suppose $k$ is inserted by creating a new node $w$ which pushes
the subtree $Y_1$ farther away from $B$. Suppose there are $(l - 1)$ edges in the path from $w$ to $y$ and the
subtrees branching off this path are $Y_2, Y_3, \ldots, Y_l$, in order from $w$ to $y$. Then

$$\Delta^{T_k}_{X|Y \cup \{k\}} = 2^{-l}\left(\Delta^{T_k}_{k|X} + \Delta^{T_{k-1}}_{X|Y_1}\right) + \sum_{i=2}^{l} 2^{-(l+1-i)} \Delta^{T_{k-1}}_{X|Y_i}.$$
FIG. 4. Calculating balanced average $\Delta^{T}_{X|Y}$ when $k$ is inserted into $Y$.



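However, we already know the value of

$$\Delta^{T_{k-1}}_{X|Y} = 2^{-(l-1)} \Delta^{T_{k-1}}_{X|Y_1} + \sum_{i=2}^{l} 2^{-(l+1-i)} \Delta^{T_{k-1}}_{X|Y_i}.$$

Thus

$$\Delta^{T_k}_{X|Y \cup \{k\}} = \Delta^{T_{k-1}}_{X|Y} + 2^{-l}\left(\Delta^{T_k}_{k|X} - \Delta^{T_{k-1}}_{X|Y_1}\right). \tag{11}$$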

The upper bound on the number of pairs is worst when $T_k$ is a chain, with $k$ inserted at one end. In
this case, the diameter is proportional to $k$ and both the number of distances to update and the bound
are proportional to $k^2$. However, the diameter is usually much lower. Assuming, as usual, a Yule-Harding
speciation process (Yule, 1925; Harding, 1971), the expected diameter is $O(\log(k))$ (Erdős et al., 1999),
which implies an average complexity of the updating step in $O(k \log(k))$. Other (e.g., uniform) distributions
on phylogenetic trees are discussed by Aldous (2001), and by McKenzie and Steel (2000), with expected
diameter at most $O(\sqrt{k})$.
Therefore, the time complexity of the whole insertion algorithm is $O(n^3)$ in the worst case and
$O(n^2 \log(n))$ (or $O(n^2 \sqrt{n})$) in practice. This is still less than NJ and allows trees with thousands of
taxa to be constructed within a reasonable amount of time. This algorithm is called BME (Balanced
Minimum Evolution) and additional details are given in Appendix 5.
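
The update of Equation (11) can be checked against a full recomputation with Equation (6) in a small Python sketch. It assumes, for illustration only, that subtrees are nested pairs of leaf names; the function names and distances are hypothetical, and the example uses the smallest case, $l = 2$.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts distance matrix from {(i, j): value}."""
    dist = {}
    for (i, j), value in pairs.items():
        dist.setdefault(i, {})[j] = value
        dist.setdefault(j, {})[i] = value
    return dist


def balanced_avg(dist, A, B):
    """Equation (6): balanced average distance between two subtrees."""
    if isinstance(A, str) and isinstance(B, str):
        return dist[A][B]
    if isinstance(B, str):
        A, B = B, A
    B1, B2 = B
    return 0.5 * (balanced_avg(dist, A, B1) + balanced_avg(dist, A, B2))


def bme_update(old_avg, d_kX, d_XY1, l):
    """Equation (11): new X | Y U {k} average from the stored X | Y average."""
    return old_avg + 2.0 ** (-l) * (d_kX - d_XY1)


# Hypothetical distances (illustrative values only).
DIST = symmetric({
    ("a", "b"): 3, ("a", "c"): 9, ("a", "d"): 10,
    ("b", "c"): 10, ("b", "d"): 11, ("c", "d"): 7,
})

X = "a"
Y_before = ("b", "c")        # Y1 = "b", Y2 = "c", hence l = 2
Y_after = (("d", "b"), "c")  # new taxon k = "d" attached next to Y1

old = balanced_avg(DIST, X, Y_before)
updated = bme_update(old, d_kX=DIST["d"]["a"], d_XY1=DIST["a"]["b"], l=2)
recomputed = balanced_avg(DIST, X, Y_after)
```

The constant-time update and the recursive recomputation agree, so the cost of Step 3 is governed by the number of $(X, Y)$ pairs rather than by subtree sizes.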

3.2. The BNNI tree swapping algorithm


This algorithm is again very similar to that for OLS. Equation (9) simplifies to

$$L(T) - L(T') = \frac{1}{4}\left[(\Delta^{\mathcal{T}}_{A|B} + \Delta^{\mathcal{T}}_{C|D}) - (\Delta^{\mathcal{T}}_{A|C} + \Delta^{\mathcal{T}}_{B|D})\right]. \tag{12}$$

Step 1 is identical, but Equation (2) is replaced by Equation (6). Of course, the preliminary step of
the tree swapping algorithm is unnecessary when the initial tree is constructed using the BME algorithm,
because this latter computes (among other things) the average balanced distances between every pair of
nonintersecting subtrees. Finally, the main difference is within Step 3, where average distances between
subtrees are updated. The computations are almost identical to those of Section 3.1 and require $O(n \log(n))$
computations on average. Thus, the total time complexity is $O(n^2 + pn \log(n))$ where $p$ is the number of
performed swaps, or $O(pn \log(n))$ if Step 1 is unnecessary. As with the OLS algorithms, this allows very
large trees to be improved. This algorithm is called BNNI (Balanced Nearest Neighbor Interchanges), and
details are given in Appendix 5.
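Finally, as with the OLS algorithms, neither BME nor BNNI explicitly computes the branch lengths.
This is done in $O(n)$ time by using the edge length formulae from Appendix 2

$$l(e) = \Delta^{\mathcal{T}}_{A \cup B|C \cup D} - \frac{1}{2}\left(\Delta^{\mathcal{T}}_{A|B} + \Delta^{\mathcal{T}}_{C|D}\right)$$

and

$$l(e) = \frac{1}{2}\left(\Delta^{\mathcal{T}}_{i|A} + \Delta^{\mathcal{T}}_{i|B} - \Delta^{\mathcal{T}}_{A|B}\right),$$

for the internal and external branches, respectively (see Fig. 2 for the notation).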
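
The two balanced edge length formulae above can be tried on a small case with the following Python sketch. The function names are assumptions made for this illustration, and the distances come from a hypothetical additive tree with cherries $\{a, b\}$ and $\{c, d\}$, external edges 1, 2, 3, 4 and internal edge 5, so the formulae should return those lengths.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts distance matrix from {(i, j): value}."""
    dist = {}
    for (i, j), value in pairs.items():
        dist.setdefault(i, {})[j] = value
        dist.setdefault(j, {})[i] = value
    return dist


def balanced_avg(dist, A, B):
    """Equation (6): balanced average distance between two subtrees."""
    if isinstance(A, str) and isinstance(B, str):
        return dist[A][B]
    if isinstance(B, str):
        A, B = B, A
    B1, B2 = B
    return 0.5 * (balanced_avg(dist, A, B1) + balanced_avg(dist, A, B2))


def balanced_internal_edge(d_AB_CD, d_AB, d_CD):
    """Balanced length of the internal edge separating A U B from C U D."""
    return d_AB_CD - 0.5 * (d_AB + d_CD)


def balanced_external_edge(d_iA, d_iB, d_AB):
    """Balanced length of the external edge leading to leaf i."""
    return 0.5 * (d_iA + d_iB - d_AB)


DIST = symmetric({
    ("a", "b"): 3, ("a", "c"): 9, ("a", "d"): 10,
    ("b", "c"): 10, ("b", "d"): 11, ("c", "d"): 7,
})

internal = balanced_internal_edge(
    balanced_avg(DIST, ("a", "b"), ("c", "d")),
    DIST["a"]["b"],
    DIST["c"]["d"],
)
# External edge of leaf a: its corner subtrees are {b} and the cherry (c, d).
external_a = balanced_external_edge(
    DIST["a"]["b"],
    balanced_avg(DIST, "a", ("c", "d")),
    balanced_avg(DIST, "b", ("c", "d")),
)
```

With stored balanced averages each edge costs $O(1)$, which is consistent with the $O(n)$ total stated above.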

4. RESULTS

4.1. Protocol
We used simulations based on random trees with parameter values chosen so as to cover the features of
most real data sets, as revealed by the compilation of the phylogenies published in the journal Systematic
Biology during the last few years. This approach induces much smaller contrasts between the tested
methods than those based on model trees (e.g., Gascuel, 1997a; Bruno, Socci, and Halpern, 2000). Indeed,
model trees are generally used to emphasize a given property of the studied methods, for example, their
performance when the molecular clock is strongly violated. Thus, model trees are often extreme and their
use tends to produce strong and possibly misleading differences between the tested methods. On the other
hand, random trees allow comparisons with a large variety of tree shapes and evolutionary rates and provide
a synthetic and more realistic view of the average performances.
We used 24- and 96-taxon trees, and 2,000 trees per size. For each of these trees, a true phylogeny,
denoted as $T$, was first generated using the stochastic speciation process described by Kuhner and
Felsenstein (1994), which corresponds to the usual Yule-Harding distribution on trees (Yule, 1925;
Harding, 1971). Using this generating process makes $T$ ultrametric (or molecular clock–like). This
hypothesis does not hold in most biological data sets, so we created a deviation from the molecular
clock, using a method similar to that of Guindon and Gascuel (2002). Every branch length of $T$ was
multiplied by $1.0 + \mu X$, where $X$ followed the standard exponential distribution
($P(X > \eta) = e^{-\eta}$) and $\mu$ was a tuning factor to adjust the deviation from the molecular
clock; $\mu$ was set to 0.8 with 24 taxa and to 0.6 with 96 taxa. The average ratio between the
mutation rate in the fastest evolving lineage and the rate in the slowest evolving lineage was then
equal to about 2.0 with both tree sizes. With 24 taxa, the smallest value (among 2,000) of this ratio
was equal to about 1.2 and the largest to 5.0 (1.0 corresponds to the strict molecular clock), while
the standard deviation was approximately 0.5. With 96 taxa, the extreme values became 1.3 and 3.5,
while the standard deviation was 0.33.
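The branch-length perturbation described above can be sketched as follows. This is a minimal illustration, not the script used to build the data sets: the function name and the flat list of branch lengths are assumptions made for the example.

```python
import random


def perturb_branch_lengths(lengths, mu, rng=None):
    """Multiply each branch length by 1 + mu * X, with X ~ Exp(1).

    mu = 0 leaves the (clock-like) lengths unchanged; larger values
    of mu produce stronger departures from the molecular clock.
    """
    rng = rng or random.Random()
    # expovariate(1.0) draws from the standard exponential distribution.
    return [length * (1.0 + mu * rng.expovariate(1.0)) for length in lengths]
```

Since $X$ has mean 1, each branch is stretched by a factor of $1 + \mu$ on average, and never shortened.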
These (2 × 2,000) trees were then rescaled to obtain “slow,” “moderate” and “fast” evolutionary rates.
With 24 taxa, the branch length expectation was set to 0.03, 0.06, and 0.15 mutations per site, for the slow,
moderate, and fast conditions, respectively. With 96 taxa, we had 0.02, 0.04, and 0.10 mutations per site,
respectively. For both tree sizes, the average maximum pairwise divergence was about 0.2, 0.4, and 1.0
substitutions per site, with a standard deviation of about 0.07, 0.14, and 0.35 for 24 taxa, and of about 0.04,
0.08, and 0.20 for 96 taxa. These values are in good accordance with real data sets. The maximum pairwise
divergence is rarely above 1.0 because multiple alignment of highly divergent sequences is
simply impossible. Moreover, with such a distance value, any correction formula, e.g., Kimura’s (1980),
becomes very doubtful, due to our ignorance of the real substitution process and to the fact that the larger
the distance, the wider the gap between estimates obtained from different formulae. The moderate condition
(~0.4) corresponds to the most favorable practical setting, while in the slow condition (from ~0.1 to ~0.3)
the phylogenetic signal is only slightly perturbed by multiple substitutions, but it can be too low, with
some short branches not being supported by any substitution.
SeqGen (Rambaut and Grassly, 1997) was used to generate the sequences. For each tree $T$ (among 3 ×
2 × 2,000), these sequences were obtained by simulating an evolving process along $T$ according to the
Kimura (1980) two-parameter model with a transition/transversion ratio of 2.0. The sequence length was
set to 500 sites. Finally, DNADIST from the PHYLIP package (Felsenstein, 1989) was used to compute
the pairwise distance matrices, assuming the Kimura model with known transition/transversion ratio. The
data files are available on our web page ([Link]
Every inferred tree, denoted as $\hat{T}$, was compared to the true phylogeny $T$ (i.e., that used to generate the
sequences and then the distance matrix) with a topological distance equivalent to Robinson and Foulds’
(1981). This distance is defined by the proportion of internal branches (or bipartitions) which are found
in one tree and not in the other one. This distance varies between 0.0 (both topologies are identical) and
1.0 (they do not share any internal branch). The results were averaged over the 2,000 test sets for each
tree size and evolutionary rate. Finally, to compare the various methods to NJ, we measured the relative
error reductions $(P_M - P_{NJ})/P_{NJ}$, where $M$ is any tested method different from NJ and $P_X$ is the average
topological distance between $\hat{T}$ and $T$ when using method $X$.
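The two measures above can be sketched as follows. This is an illustration only: it assumes each tree has already been reduced to the set of bipartitions induced by its internal branches, and the helper names are hypothetical.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes the smallest taxon."""
    side = frozenset(side)
    reference = min(taxa)
    # Both sides describe the same branch; always keep the one without `reference`.
    return frozenset(taxa) - side if reference in side else side


def topological_distance(splits_1, splits_2):
    """Proportion of internal branches found in one tree but not in the other."""
    total = len(splits_1) + len(splits_2)
    if total == 0:
        return 0.0
    # Symmetric difference: bipartitions present in exactly one of the two trees.
    return len(splits_1 ^ splits_2) / total


def relative_error_reduction(p_method, p_nj):
    """(P_M - P_NJ) / P_NJ; negative values mean the method beats NJ."""
    return (p_method - p_nj) / p_nj
```

Canonicalizing each bipartition before storing it ensures that the same internal branch compares equal regardless of which side was recorded.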

4.2. Phylogeny estimation algorithm comparison


We used a variety of different algorithms to try to reconstruct the original tree, given the matrix of
estimated distances. We used the NJ program from the PAUP package (Swofford, 1996), with and without the
BIONJ (Gascuel, 1997a) option; WEIGHBOR version 1.2, available at [Link]/billb/weighbor/;
the FITCH program from the PHYLIP package (Felsenstein, 1989); the Harmonic Greedy Triplet (HGT/FP)
program, provided by Miklos Csűrös (Csűrös, 2002); GME and BME. Also, we used output topologies from
other programs as input for FASTNNI and BNNI. All of GME, BME, FASTNNI, and BNNI are available
at our web page ([Link] and via ftp at [Link]
We also measured how far from the true phylogeny one gets with NNIs. This served as a measure of the
limitation of each of the minimum evolution frameworks, as well as a performance index for evaluating
our algorithms. Ordinarily, the (OLS or balanced) minimum evolution criterion will not, in fact, observe a
minimum value at the true phylogeny. So, starting from the true phylogeny and running FASTNNI or BNNI,
we end up with a tree with a significant proportion of false branches. When this proportion is high, the
corresponding criterion can be seen as poor regarding topological accuracy. Thus this proportion represents
the best possible topological accuracy that can be achieved by optimizing the considered criterion, since
we would not expect any algorithm optimizing this criterion in the whole tree space to find a tree closer
to the true phylogeny than the tree that is obtained by “optimizing” the true phylogeny itself.
Results are displayed in Table 1 and Table 2.
The performances of the basic algorithms are strongly correlated with the number of computations that
they perform. Both $O(n^2)$ algorithms are clearly worse than NJ. BME (in $O(n^2 \log(n))$) is still worse than
NJ, but becomes very close with 96 taxa, which indicates the strength of the balanced minimum evolution
framework. BIONJ, in $O(n^3)$ like NJ and having identical computing times, is slightly better than NJ in
all conditions, while WEIGHBOR, also in $O(n^3)$ but requiring complex numerical calculations, is better
than BIONJ. Finally, FITCH, which is in $O(n^4)$, is the best with 24 taxa, but was simply impossible to
evaluate with 96 taxa (see the computing times below).
After FASTNNI, we observe that the output topology does not depend much on the input topology.
Even the poor HGT topology becomes close to NJ, which indicates the strength of NNIs. However,
except for the true phylogeny in some cases, this output is worse than NJ. This confirms our previous
results (Gascuel, 2000), which indicated that the OLS version of minimum evolution is reasonable but
not excellent for phylogenetic inference. But this phenomenon is much more visible with 24 taxa than
with 96, so we expect the very fast GME + FASTNNI $O(n^2)$ combination to be equivalent to NJ with
large $n$.

Table 1. Topological Accuracy for 24-Taxon Trees at Various Rates of Evolution (a)

                  Without NNIs        + FASTNNI           + BNNI

Slow rate
  True Tree                         .109   −1.6%      .104   −6.2%
  FITCH           .109   −1.9%      .113    2.0%      .107   −3.4%
  WEIGHBOR        .109   −1.8%      .112    1.7%      .107   −3.0%
  BIONJ           .111   −0.3%      .113    2.0%      .107   −3.6%
  NJ              .111      0%      .113    2.0%      .107   −3.5%
  BME             .118    7.1%      .113    1.9%      .107   −2.8%
  GME             .122     10%      .113    2.1%      .107   −3.4%
  HGT/FP          .334    202%      .112    1.1%      .107   −2.9%
Moderate rate
  True Tree                         .092    3.7%      .083   −5.8%
  FITCH           .085   −4.9%      .094    6.0%      .085   −4.0%
  WEIGHBOR        .085   −4.3%      .094    6.2%      .085   −4.0%
  BIONJ           .087   −2.0%      .094    6.5%      .085   −4.2%
  NJ              .088      0%      .094    6.6%      .085   −4.0%
  BME             .100     13%      .094    6.3%      .084   −4.9%
  GME             .107     21%      .095    7.1%      .084   −4.8%
  HGT/FP          .326    268%      .095    7.5%      .088   −0.2%
Fast rate
  True Tree                         .088    6.5%      .076   −8.3%
  FITCH           .076   −7.8%      .090    8.8%      .077   −7.0%
  WEIGHBOR        .077   −6.8%      .089    8.2%      .077   −6.9%
  BIONJ           .079   −3.6%      .090    9.0%      .077   −6.6%
  NJ              .082      0%      .090    9.1%      .077   −6.9%
  BME             .098     19%      .090    9.1%      .076   −7.1%
  GME             .105     28%      .090    9.8%      .076   −7.6%
  HGT/FP          .329    300%      .090    9.8%      .083    0.8%

(a) The first number indicates the average topological distance between the inferred tree and the true phylogeny.
The second number (percentage) provides the relative difference in topological distance between the method
considered and NJ; the more negative this value, the better the method was relative to NJ.

Table 2. Topological Accuracy for 96-Taxon Trees at Various Rates of Evolution (a)

                  Without NNIs        + FASTNNI           + BNNI

Slow rate
  True Tree                         .172   −5.6%      .167   −8.8%
  WEIGHBOR        .178   −2.5%      .181   −0.7%      .173   −5.2%
  BIONJ           .180   −0.9%      .182   −0.3%      .173   −5.1%
  NJ              .183      0%      .182   −0.2%      .173   −5.2%
  BME             .186    1.9%      .181   −0.6%      .173   −5.3%
  GME             .199    8.8%      .183    0.3%      .173   −5.3%
  HGT/FP          .512    185%      .185    1.5%      .175   −4.3%
Moderate rate
  True Tree                         .132   −3.0%      .115  −15.4%
  WEIGHBOR        .129   −5.4%      .137    0.5%      .118  −13.0%
  BIONJ           .134   −1.9%      .138    1.3%      .118  −13.0%
  NJ              .136      0%      .139    1.8%      .119  −12.9%
  BME             .137    1.0%      .138    1.1%      .118  −13.2%
  GME             .158     16%      .140    2.7%      .118  −13.2%
  HGT/FP          .480    253%      .143    5.2%      .123   −9.3%
Fast rate
  True Tree                         .115    0.6%      .088  −23.4%
  WEIGHBOR        .103    −10%      .119    3.8%      .091  −21.0%
  BIONJ           .112   −2.5%      .121    5.1%      .090  −21.7%
  NJ              .115      0%      .121    5.5%      .090  −21.3%
  BME             .117    1.8%      .120    4.4%      .090  −21.4%
  GME             .144     25%      .122    6.3%      .091  −21.1%
  HGT/FP          .465    306%      .126    9.4%      .098  −14.7%

(a) See note to Table 1.
After BNNI, all initial topologies become better than NJ. Moreover, they converge to each other
independently of the starting point (results not shown), which demonstrates, among other things, that it would
be useless to jumble the taxa ordering in GME or BME, as is done optionally in PHYLIP programs to
improve the fitness. With 24 taxa, the performance is equivalent to that of FITCH, while with 96 taxa the
results are far better than those of WEIGHBOR, especially in the Fast condition where BNNI improves
NJ by 21%, against 10% for WEIGHBOR. These results are somewhat unexpected, since BNNI has a
low $O(n^2 + pn \log(n))$ average time complexity. Moreover, as explained above, we do not expect very
high contrast between any inference method and NJ, due to the fact that our data sets represent a large
variety of realistic trees and conditions, but do not contain extreme trees, notably concerning the maximum
pairwise divergence. Preliminary experiments indicate that maximum parsimony is close to BNNI in the
Slow and Moderate conditions, but worse in the Fast condition, the contrast between these methods being
in the same range as those reported in Tables 1 and 2. The combination of BNNI with BME or GME can
thus be seen as remarkably efficient and accurate. Moreover, regarding the true phylogeny after BNNI, it
appears that little gain could be expected by further optimizing this criterion, since our simple optimization
approach is already close to these optimal values.

4.3. Computing times


To compare the actual running speeds of these algorithms, we tested them on a Sun Enterprise E4500/
E5500, with ten 400-MHz processors and 7 GB of memory, running the Solaris 8 operating system. Table 3
summarizes the average computational times (in minutes:seconds) required by the various programs
to build phylogenetic trees. The leftmost two columns were averaged over two thousand 24- and 96-taxon
trees, the third over ten 1,000-taxon trees, and the final column over four 4,000-taxon trees. Stars indicate
entries where the algorithm was deemed to be too slow to bother with that test.

Table 3. Average Computational Time (MM:SS) for Reconstruction Algorithms

Algorithm           24 taxa      96 taxa    1000 taxa    4000 taxa

GME                  .02522       .07088        8.366      3:11.01
GME + FASTNNI        .02590       .08268        9.855      3:58.79
GME + BNNI           .02625       .08416       11.339      6:02.10
BME                  .02534       .08475       19.231     12:14.99
BME + BNNI           .02557       .08809       19.784     12:47.45
HGT/FP               .02518       .13491       13.808      3:33.14
NJ/BIONJ             .06304       .16278       21.250     20:55.89
WEIGHBOR             .42439     26.88176        *****        *****
FITCH               4.37446        *****        *****        *****
FITCH takes approximately 25 minutes to build a 96-taxon tree, which made the simulations impossible
(12,000 trees were considered) and would make its application to data sets of this size impractical, because
real studies often include bootstrapping, which requires the construction of a large number of trees (usually
1,000). In practice, if one wishes to find a tree minimizing the weighted least-squares criterion, it is faster
(~1 minute for 96 taxa) to use PAUP (Swofford, 1996) to perform a weighted least-squares NNI search
from a tree created via some other fast method, i.e., to combine weighted least-squares with a tree building
strategy analogous to ours.
For similar reasons, we did not test the running time of WEIGHBOR for 1,000- or 4,000-taxon trees.
WEIGHBOR’s running time increased more than 60-fold when moving from 24-taxon trees to 96-taxon
trees; thus, we judged it infeasible to run WEIGHBOR on even one 1,000-taxon tree.
The fastest programs in Table 3 were the GME, GME + FASTNNI, and HGT/FP algorithms. The HGT/FP
algorithm was the only one which was able to maintain the fast speed of the GME + FASTNNI
combination, but it did so at a serious cost in terms of topological accuracy. The BME and BME + BNNI
combinations lagged behind their OLS-based counterparts, but were still significantly faster than NJ and
BIONJ. Of particular interest is the GME + BNNI combination, which was not only markedly faster
than NJ, but also produced superior topologies. Unfortunately, computational constraints made a thorough
testing of algorithm performance at the 1,000- and 4,000-taxon levels difficult to achieve, so we cannot
claim statistical significance for the relative speeds for the larger data sets. However, we suspect that
implementation refinements such as those used in PAUP’s NJ could be used to make our algorithms still
much faster.
Table 4 contains the number of NNIs performed by each of the three combinations which appear in
Table 3. Not surprisingly, the largest number of NNIs was consistently required when the initial topology
was made to minimize the OLS ME criterion, but NNIs were chosen to minimize the balanced ME criterion
(i.e., GME + BNNI). This table shows the overall superiority of the BME tree over the GME tree, when
combined with BNNI. In all of the cases considered, the average number of NNIs considered for each
value of $n$ was considerably less than $n$ itself.

Table 4. Average Number of NNIs Performed

Algorithm           24 taxa      96 taxa    1000 taxa    4000 taxa

GME + FASTNNI         1.244        8.446         44.9       336.50
GME + BNNI            1.446       11.177         59.1       343.75
BME + BNNI            1.070        6.933         29.1       116.25

5. DISCUSSION

We have presented a new greedy implementation of minimum evolution tree topology searching that
is considerably faster than most distance algorithms currently in use. The currently most popular fast
algorithm is Neighbor-Joining, an $O(n^3)$ algorithm. Our greedy ordinary least-squares minimum evolution
tree construction algorithm (GME) runs at $O(n^2)$, the size of the input matrix. Although the GME tree is
not quite as accurate as the NJ tree, it is a good starting point for nearest neighbor interchanges (NNIs).
The combination of GME and FASTNNI, which achieves NNIs according to the ordinary least-squares
criterion, also in $O(n^2)$ time, has a topological accuracy very close to that of NJ, especially with large
numbers of taxa.
However, the balanced minimum evolution framework appears much more appropriate for phylogenetic
inference than the ordinary least-squares version. This is likely due to the fact that it gives less weight to the
topologically long distances, i.e., those containing numerous edges, while the ordinary least-squares method
puts the same confidence on each distance, regardless of its length. Even when the usual and topological
lengths are different, they are strongly correlated. The balanced minimum evolution framework is thus
conceptually close to weighted least-squares (Fitch and Margoliash, 1967), which is more appropriate than
ordinary least-squares for evolutionary distances estimated from sequences. A preliminary look at PAUP’s
weighted least-squares topology-searching ability leads us to hypothesize that minimizing the weighted
least-squares criterion or minimizing the balanced minimum evolution criterion would lead to output trees of
roughly the same quality. Studying the formal relationship between weighted least-squares and balanced
minimum evolution is an important direction for further research.
The balanced NNI algorithm (BNNI) achieves outstanding performance, superior to those of NJ, BIONJ,
and WEIGHBOR. BNNI is an $O(n^2 + np\,\mathrm{diam}(T))$ algorithm, where $\mathrm{diam}(T)$ is the diameter of the inferred
tree, and $p$ the number of swaps performed. With $p\,\mathrm{diam}(T) = O(n)$ for most data sets, the combination
of GME and BNNI effectively gives us an $O(n^2)$ algorithm with high topological accuracy.

APPENDIX

A1. Derivation of tree length formula


In this section, we’ll derive Equation (5), which we use throughout our NNI analysis. First, we need the
following lemma (Vach, 1989; Gascuel, 1997b):

Lemma 1.1. If $T$ is the OLS tree for its topology and for the matrix $\Delta$, and $u$ is a node in $T$ which
separates the three subtrees $X$, $Y$, and $Z$, we then have $\Delta_{X|Y} = \Delta^T_{X|Y}$. (And by symmetry this identity also
holds for $X|Z$ and $Y|Z$.)

Now, let’s reconsider Equation (5) and Figure 2(a),

\[
\begin{aligned}
l(T) ={}& \frac{1}{2}\Bigl[\lambda(\Delta_{A|C} + \Delta_{B|D}) + (1-\lambda)(\Delta_{A|D} + \Delta_{B|C}) + \Delta_{A|B} + \Delta_{C|D}\Bigr] \\
& + l(A) + l(B) + l(C) + l(D) - \Delta_{a|A} - \Delta_{b|B} - \Delta_{c|C} - \Delta_{d|D},
\end{aligned}
\]

where

\[ \lambda = \frac{|B||C| + |A||D|}{(|A| + |B|)(|C| + |D|)}. \]

It is clear that

\[ l(T) = l(A) + l(B) + l(C) + l(D) + l(v,a) + l(v,b) + l(w,c) + l(w,d) + l(v,w) \tag{13} \]

and from Lemma 1.1,

\[ \Delta_{A|B} = \Delta^T_{A|B} = \Delta_{a|A} + l(v,a) + l(v,b) + \Delta_{b|B}, \]

and similarly

\[ \Delta_{C|D} = \Delta^T_{C|D} = \Delta_{c|C} + l(w,c) + l(w,d) + \Delta_{d|D}. \]

Thus,

\[ l(v,a) + l(v,b) + l(w,c) + l(w,d) = \Delta_{A|B} + \Delta_{C|D} - (\Delta_{a|A} + \Delta_{b|B} + \Delta_{c|C} + \Delta_{d|D}). \tag{14} \]

We substitute the right-hand side of Equation (14) and the OLS value for $l(v,w)$ from Equation (3) into
Equation (13) to achieve the desired Equation (5).
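For readers who want this final substitution spelled out, the step can be written as below. It assumes that Equation (3), which is stated earlier in the paper and not reproduced in this appendix, gives the OLS internal edge length in the form shown on the second line; that form is consistent with its balanced specialization in Appendix A2.

```latex
\begin{align*}
l(T) &= l(A)+l(B)+l(C)+l(D)
      + \underbrace{\Delta_{A|B}+\Delta_{C|D}
      - \bigl(\Delta_{a|A}+\Delta_{b|B}+\Delta_{c|C}+\Delta_{d|D}\bigr)}_{\text{Equation (14)}}
      + l(v,w), \\
l(v,w) &= \tfrac12\Bigl[\lambda(\Delta_{A|C}+\Delta_{B|D})
      + (1-\lambda)(\Delta_{A|D}+\Delta_{B|C})
      - (\Delta_{A|B}+\Delta_{C|D})\Bigr], \\
\Delta_{A|B}+\Delta_{C|D} - \tfrac12(\Delta_{A|B}+\Delta_{C|D})
     &= \tfrac12(\Delta_{A|B}+\Delta_{C|D}).
\end{align*}
```

The last line is the only cancellation involved: it turns the $-(\Delta_{A|B}+\Delta_{C|D})$ term inside $l(v,w)$ into the $+\Delta_{A|B}+\Delta_{C|D}$ term inside the bracket of Equation (5).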

A2. Constant subtree lengths under tree swapping in the balanced scheme
First, we consider Equation (3) for internal edge length estimation when using balanced weights. Since
$\lambda = 1/2$, the equation simplifies to:

\[
\begin{aligned}
l^T(e) &= \frac{1}{2}\left[\frac{1}{2}\left(\Delta^T_{A|C} + \Delta^T_{B|D} + \Delta^T_{A|D} + \Delta^T_{B|C}\right) - \left(\Delta^T_{A|B} + \Delta^T_{C|D}\right)\right] \\
&= \Delta^T_{A \cup B | C \cup D} - \frac{1}{2}\left(\Delta^T_{A|B} + \Delta^T_{C|D}\right).
\end{aligned}
\tag{15}
\]

The balanced edge length formula for external edges is essentially the same: Equation (4) simplifies to

\[ l^T(e) = \Delta^T_{i|A \cup B} - \frac{1}{2}\Delta^T_{A|B}. \tag{16} \]

Now let’s consider a topology $T$ with three subtrees $A$, $B$, and $C$, which meet at the vertex $v$. Let $a$, $b$,
and $c$ be the roots of these three subtrees, with $b_1$ and $b_2$ the children of $b$ and $c_1$, $c_2$ the children of $c$,
determining leaf subsets $B_1$, $B_2$, $C_1$, and $C_2$, respectively. By Equation (15),

\[
\begin{aligned}
l(v,b) &= \Delta^T_{A \cup C | B} - \frac{1}{2}\left(\Delta^T_{B_1|B_2} + \Delta^T_{A|C}\right) \\
&= \frac{1}{2}\left(\Delta^T_{A|B} + \Delta^T_{B|C} - \Delta^T_{B_1|B_2} - \Delta^T_{A|C}\right).
\end{aligned}
\]

Similarly,

\[ l(v,c) = \frac{1}{2}\left(\Delta^T_{A|C} + \Delta^T_{B|C} - \Delta^T_{C_1|C_2} - \Delta^T_{A|B}\right), \]

and thus

\[ l(v,b) + l(v,c) = \Delta^T_{B|C} - \frac{1}{2}\left(\Delta^T_{B_1|B_2} + \Delta^T_{C_1|C_2}\right). \tag{17} \]

In particular, the right-hand side of Equation (17) is completely independent of the internal structure of $A$.
Thus, if we perform a tree swap internal to $A$, the sum $l(v,b) + l(v,c)$ will remain constant. Analogous
arguments will show the same result if either $b$ or $c$ is a leaf. This indicates that neither the length of a
subtree (here $B \cup C$) nor its average root/leaves distance (here, from $v$ to any leaves of $B \cup C$) is changed
when a swap is performed within a disjoint subtree (here, $A$).
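The addition that leads to Equation (17) can be written out term by term. It uses only the two expressions for $l(v,b)$ and $l(v,c)$ given above and shows where the dependence on $A$ disappears.

```latex
\begin{align*}
l(v,b)+l(v,c)
 &= \tfrac12\bigl(\Delta^T_{A|B}+\Delta^T_{B|C}-\Delta^T_{B_1|B_2}-\Delta^T_{A|C}\bigr)
  + \tfrac12\bigl(\Delta^T_{A|C}+\Delta^T_{B|C}-\Delta^T_{C_1|C_2}-\Delta^T_{A|B}\bigr) \\
 &= \tfrac12\bigl(2\,\Delta^T_{B|C}-\Delta^T_{B_1|B_2}-\Delta^T_{C_1|C_2}\bigr) \\
 &= \Delta^T_{B|C}-\tfrac12\bigl(\Delta^T_{B_1|B_2}+\Delta^T_{C_1|C_2}\bigr).
\end{align*}
```

The terms $\Delta^T_{A|B}$ and $\Delta^T_{A|C}$ each appear once with a plus sign and once with a minus sign, so every quantity involving $A$ cancels.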

A3. Details of the GME algorithm


In this section, we provide the details of the Greedy ME algorithm. Recall that we iteratively form the
tree $T_k$ with leaf set $[k] = \{1, \ldots, k\}$ from $T_{k-1}$ by selecting an insertion point. Step 1 of the insertion
process is to calculate $\Delta_{k|S}$ for every subtree $S$ of $T_{k-1}$. We’ll use the following notation:

• We root $T_{k-1}$ at any taxon $r$ and let $d$ be its unique direct descendant.
• Let DFS-POST (Tarjan, 1983, 14–19) be the depth-first post-order of the vertices of $T_{k-1}$.
• Let DFS-PRE (Tarjan, 1983, 14–19) be the depth-first preorder of the vertices of $T_{k-1}$.
• For any nonleaf node $v$ (or $w$), let $v_1$ and $v_2$ (or $w_1$ and $w_2$, respectively) denote its two children.
• For any node $v$ of $T_{k-1}$, if $v$ is a leaf, let $L(v) = \{v\}$, otherwise let $L(v)$ be the set of all nodes of $T_{k-1}$
  which are descendants of $v$, including $v$ itself. Let $U(v)$ be the complement to $L(v)$ among the nodes of
  $T_{k-1}$.

The details of Step 1 are:

1. To calculate $\Delta_{k|S}$ for all $S$ which are subtrees of $T_{k-1}$:

   (a) Loop for $w$ from DFS-POST $- \{r\}$. IF $w \in [k-1]$, $\Delta_{k|L(w)} = \Delta_{kw}$, ELSE

   \[ \Delta_{k|L(w)} = \frac{|L(w_1)|\,\Delta_{k|L(w_1)} + |L(w_2)|\,\Delta_{k|L(w_2)}}{|L(w)|}. \]

   (b) Set $\Delta_{k|U(d)} = \Delta_{kr}$. Loop over $w$ in DFS-PRE $- \{r, d\}$. Let $s$ be the sibling of $w$ and $p$ be the
   parent of $s$ and $w$. Compute

   \[ \Delta_{k|U(w)} = \frac{|U(p)|\,\Delta_{k|U(p)} + |L(s)|\,\Delta_{k|L(s)}}{|U(w)|}. \]

This achieves the computation of all $\Delta_{k|S}$ average distances. All other steps of the algorithm are described
in Section 2.1.
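The two passes of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree is given as a hypothetical `children` dictionary, the sizes $|L(\cdot)|$ and $|U(\cdot)|$ are taken to be numbers of taxa (leaves) so that the recurrences yield plain averages over taxa, and the bottom-up pass uses a reversed preorder instead of DFS-POST, which also guarantees that children are handled before their parent.

```python
def average_distances_to_new_taxon(children, r, d, dist_k):
    """Return (down, up) with down[w] = Δ_{k|L(w)} and up[w] = Δ_{k|U(w)}.

    children -- maps each internal node to its two children (leaves absent)
    r, d     -- the root taxon and its unique direct descendant
    dist_k   -- dist_k[w] is the input distance between new taxon k and leaf w
    """
    # Depth-first preorder of the nodes below r (iterative, no recursion limit).
    preorder, stack = [], [d]
    while stack:
        w = stack.pop()
        preorder.append(w)
        stack.extend(children.get(w, ()))

    # Step (a): bottom-up pass; reversed preorder visits children before parents.
    size, down = {}, {}
    for w in reversed(preorder):
        if w not in children:                      # w is a leaf (a taxon)
            size[w], down[w] = 1, dist_k[w]
        else:
            w1, w2 = children[w]
            size[w] = size[w1] + size[w2]
            down[w] = (size[w1] * down[w1] + size[w2] * down[w2]) / size[w]

    # Step (b): top-down pass; preorder visits parents before children.
    up_size, up = {d: 1}, {d: dist_k[r]}           # U(d) holds only the taxon r
    for p in preorder:
        if p in children:
            w1, w2 = children[p]
            for w, s in ((w1, w2), (w2, w1)):      # s is the sibling of w
                up_size[w] = up_size[p] + size[s]
                up[w] = (up_size[p] * up[p] + size[s] * down[s]) / up_size[w]
    return down, up
```

Each node is touched a constant number of times, so the whole computation is linear in the size of $T_{k-1}$.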

A4. Details of FASTNNI algorithm


In this appendix, we fill in the details for the OLS tree swapping algorithm. Recall that the first step
is to compute the average distances between nonintersecting subtrees. We use the same notation as in the
previous section.

1. Precalculating average distances:

   (a) We first calculate all the average distances of the form $\Delta_{L(v)|L(w)}$. Loop over $v$ in DFS-POST $- \{r, d\}$;
   loop over $w$ in DFS-POST including all nodes not equal to or below $v$ in this order. For
   any $w$ which is not an ancestor of $v$:
      i. IF $v, w \in [n]$, then $\Delta_{L(v)|L(w)} = \Delta_{vw}$.
      ii. ELSE IF $v \notin [n]$, set

      \[ \Delta_{L(v)|L(w)} = \frac{|L(v_1)|\,\Delta_{L(v_1)|L(w)} + |L(v_2)|\,\Delta_{L(v_2)|L(w)}}{|L(v)|}. \]

      iii. ELSE

      \[ \Delta_{L(v)|L(w)} = \frac{|L(w_1)|\,\Delta_{L(v)|L(w_1)} + |L(w_2)|\,\Delta_{L(v)|L(w_2)}}{|L(w)|}. \]

   (b) To calculate all $\Delta_{L(v)|U(d)}$ distances, loop $v$ over DFS-POST $- \{r, d\}$. IF $v \in [n]$, then
   $\Delta_{L(v)|U(d)} = \Delta_{vr}$, ELSE

   \[ \Delta_{L(v)|U(d)} = \frac{|L(v_1)|\,\Delta_{L(v_1)|U(d)} + |L(v_2)|\,\Delta_{L(v_2)|U(d)}}{|L(v)|}. \]

   (c) We now calculate all distances of the form $\Delta_{L(v)|U(w)}$, where $w$ is an ancestor of $v$. Loop $w$ over
   DFS-PRE $- \{r, d\}$. Let $s$ be the sibling of $w$ and $p$ be the parent of $s$ and $w$. Loop over $v$ from
   $L(w)$ via any manner. For each $v$, set

   \[ \Delta_{L(v)|U(w)} = \frac{|L(s)|\,\Delta_{L(v)|L(s)} + |U(p)|\,\Delta_{L(v)|U(p)}}{|U(w)|}. \]

It is easily seen that every formula above uses already computed terms, and thus requires $O(1)$ time,
due to the computing orchestration based on the DFS-POST and DFS-PRE orders. Each pair of vertices
$v, w$ is the subject of exactly one $\Delta_{L(v)|L(w)}$, $\Delta_{L(v)|U(w)}$, or $\Delta_{L(w)|U(v)}$ calculation, thus the computational
time is $O(n^2)$, and we can store all of these average distances unambiguously in a matrix.

Now to Steps 2 and 3.

2. Create the heap of possible swaps. Loop $e$ over the internal edges of $T$ via any method.
   (a) Using Equation (9), determine the change in total lengths $s_1(e)$, $s_2(e)$ for each of the two possible
   tree swaps across $e$. Let $s(e) = \max(0, s_1(e), s_2(e))$.
   (b) Form a heap containing all the values of $s(e)$ which are positive.
3. Achieve the best swap and update the data.
   (a) Assuming the heap is nonempty, let $e = (v, w)$ be the best edge on the heap. Let $A$, $B$, $C$, and $D$
   denote the four subtrees which meet at $e$, with roots $a$, $b$, $c$, and $d$, as in Figure 2(a), such that $a$
   and $b$ are incident to $v$, while $c$ and $d$ are incident to $w$. Suppose $B \leftrightarrow C$ is the indicated swap.
   Remove edges $(v, b)$ and $(w, c)$ from the topology and add edges $(v, c)$ and $(w, b)$.
   (b) Loop over the subtrees $S$ of $A \cup C$ via any manner. Compute $\Delta_{S|B \cup D}$ by averaging $\Delta_{S|B}$ and $\Delta_{S|D}$
   using Equation (2). Achieve the same for $B \cup D$.
   (c) Set $s(e) = 0$ and remove it from the heap. Let $f$ range over the four edges incident to $e$, recalculate
   $s(f)$ by testing the two possible tree swaps across $f$, and, if $s(f) > 0$, insert $s(f)$ into the heap.
   (d) If the heap is nonempty, return to Step 3(a). Otherwise, use the matrix of average distances and
   Equations (3) and (4) to assign branch lengths to all of the edges of the final tree.
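The control flow of Steps 2 and 3 can be sketched with Python's `heapq` module. Everything tree-specific is passed in as a callable, and all four callables are hypothetical stand-ins: `swap_gain` plays the role of Equation (9), `apply_swap` that of Steps 3(a) and 3(b), and `neighbours` returns the edges incident to the swapped edge. The sketch also swaps in lazy deletion: rather than removing an outdated entry from the heap, it stamps each entry with a version number and skips stale ones when they are popped.

```python
import heapq
import itertools


def run_nni_search(edges, swap_gain, apply_swap, neighbours):
    """Greedy NNI loop: repeatedly apply the swap with the largest positive gain.

    edges      -- iterable of internal edges (hashable identifiers)
    swap_gain  -- callable(edge) -> (gain, swap), the better of the two swaps
    apply_swap -- callable(edge, swap), updates topology and average distances
    neighbours -- callable(edge) -> edges whose gains must be recomputed
    Returns the number of swaps performed.
    """
    version = {e: 0 for e in edges}
    heap = []
    tie = itertools.count()  # unique tie-breaker, so edges are never compared

    def push(e):
        gain, swap = swap_gain(e)
        version[e] = version.get(e, 0) + 1  # invalidates older entries for e
        if gain > 0:
            # heapq is a min-heap, so the negated gain puts the best swap first.
            heapq.heappush(heap, (-gain, next(tie), version[e], e, swap))

    for e in list(version):
        push(e)

    performed = 0
    while heap:
        _, _, stamp, e, swap = heapq.heappop(heap)
        if stamp != version[e]:
            continue  # stale entry: the gain of e was recomputed since
        apply_swap(e, swap)
        performed += 1
        version[e] += 1  # s(e) is reset to 0 once the swap is done
        for f in neighbours(e):
            push(f)
    return performed
```

The loop terminates as long as every applied swap strictly decreases the tree length, which is what a positive gain is meant to express.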

A5. Balanced minimum evolution algorithms


The structures of the BME and BNNI algorithms are identical to their OLS counterparts, except in the
step updating the matrix of average distances. Also, the BME algorithm requires that the full matrix of
average distances be kept throughout tree building, as opposed to the OLS version, which requires only
the distant-2 subtree average distances.
Consider the insertion of the node $k$ in tree $T_{k-1}$. Let $X$ and $Y$ be a pair of disjoint subtrees of $T_{k-1}$,
such that $k$ is inserted in $Y$. We then use Equation (11) to calculate $\Delta^{T_k}_{X|Y \cup \{k\}}$. The updating step of BME
is as follows:

3. Let $(s, u)$ be the edge where $k$ is inserted, with $S$ and $U$ the subtrees having roots $s$ and $u$, respectively,
   and $T_{k-1} = S \cup U$ and $S \cap U = \emptyset$.
   (a) Loop over the subtrees $Z$ of $S$.
      i. Let $Y$ be the complement of $Z$ in $T_{k-1}$.
      ii. Loop over $X \subseteq Z$ and use Equation (11) to calculate $\Delta^{T_k}_{X|Y \cup \{k\}}$.
   (b) Repeat (a) with $U$ in the place of $S$.

A similar adjustment is done for BNNI. Suppose we start with the topology $T$, as shown in Figure 2(a),
and swap subtrees $B$ and $C$ to form the topology $T'$. Suppose $x$ and $y$ are nodes in $A \cup \{v\}$, with $y$ on the
path from $x$ to $v$, and $l$ edges between $y$ and $v$. Let $X$ and $Y$ be the nonintersecting subtrees with roots $x$
and $y$, respectively. We allow for the possibility that $l = 0$, in which case we choose $X \subseteq A$. (See Fig. 5.)
Our updating equation is:

\[ \Delta^{T'}_{X|Y} = \Delta^T_{X|Y} - 2^{-(l+2)}\,\Delta^T_{X|B} + 2^{-(l+2)}\,\Delta^T_{X|C}. \tag{18} \]

The modified step in BNNI is:

3. Update the matrix of average distances.

   (a) Loop over pairs of nodes $x, y$ where $x$ is any node in the subtree $A$ and $y$ is any node in the path
   from $x$ to $v$. Let $X$ and $Y$ denote the nonintersecting subtrees (see Fig. 5) with roots $x$ and $y$,
   respectively. Use Equation (18) to calculate $\Delta^{T'}_{X|Y}$. Repeat the same steps for all analogous pairs
   $x, y$ in each of $B$, $C$, and $D$.
   (b) For any subtree $X \subset A \cup C$, compute

   \[ \Delta^{T'}_{X|B \cup D} = \frac{1}{2}\left(\Delta^T_{X|B} + \Delta^T_{X|D}\right). \]

   Perform analogous computations to compute $\Delta^{T'}_{Y|A \cup C}$ for all subtrees $Y \subset B \cup D$.
   (c) Calculate

   \[ \Delta^{T'}_{A \cup C|B \cup D} = \frac{1}{4}\left(\Delta^T_{A|B} + \Delta^T_{A|D} + \Delta^T_{C|B} + \Delta^T_{C|D}\right). \]

FIG. 5. Recalculating balanced average $\Delta^T_{X|Y}$ after $B \leftrightarrow C$ tree swap.
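Equation (18) and Step 3(c) reduce to simple arithmetic once the relevant balanced averages are at hand. The sketch below assumes they are passed in as plain numbers; the function names are illustrative only.

```python
def update_after_swap(d_xy, d_xb, d_xc, l):
    """Equation (18): balanced average between X and Y after swapping B and C.

    l is the number of edges between y (the root of Y) and v.
    """
    weight = 2.0 ** -(l + 2)
    # B moves one edge further from X (its weight is halved), C one edge closer.
    return d_xy - weight * d_xb + weight * d_xc


def average_across_new_edge(d_ab, d_ad, d_cb, d_cd):
    """Step 3(c): balanced average between A∪C and B∪D in the new topology."""
    return 0.25 * (d_ab + d_ad + d_cb + d_cd)
```

As a check with $l = 0$ and $X \subseteq A$: before the swap $Y$ groups $B$ with $C \cup D$, so $\Delta^T_{X|Y} = \tfrac12\Delta_{X|B} + \tfrac14\Delta_{X|C} + \tfrac14\Delta_{X|D}$; afterwards the roles of $B$ and $C$ are exchanged, and the difference is exactly $-\tfrac14\Delta_{X|B} + \tfrac14\Delta_{X|C}$, as Equation (18) states.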

ACKNOWLEDGMENTS

Special thanks go to Stéphane Guindon, who generated the data sets used in Section 4, and to Mike
Steel for his help and advice.

REFERENCES

Aldous, D.J. 2001. Stochastic models and descriptive statistics for phylogenetic trees from Yule to today. Statist. Sci.
16, 23–34.
Bruno, W.J., Socci, N.D., and Halpern, A.L. 2000. Weighted neighbor joining: A likelihood-based approach to
distance-based phylogeny reconstruction. Mol. Biol. Evol. 17, 189–197.
Bryant, D., and Waddell, P. 1998. Rapid evaluation of least-squares and minimum-evolution criteria on phylogenetic
trees. Mol. Biol. Evol. 15, 1346–1359.
Bulmer, M. 1991. Use of the method of generalized least squares in reconstructing phylogenies from sequence data.
Mol. Biol. Evol. 8, 868–883.
Cormen, T.H., Leiserson, C.E., and Rivest, R.L. 2000. Introduction to Algorithms, MIT Press, Cambridge, MA.
Csűrös, M. 2002. Fast recovery of evolutionary trees with thousands of nodes. J. Comp. Biol. 9, 277–297.
Denis, F., and Gascuel, O. 2002. On the consistency of the minimum evolution principle of phylogenetic inference.
Discr. Appl. Math. In press.
Erdős, P.L., Steel, M., Székely, L., and Warnow, T. 1999. A few logs suffice to build (almost) all trees: Part II. Theo.
Comp. Sci. 221, 77–118.
Felsenstein, J. 1989. PHYLIP—Phylogeny Inference Package (Version 3.2). Cladistics 5, 164–166.
Felsenstein, J. 1997. An alternating least-squares approach to inferring phylogenies from pairwise distances. Syst. Biol.
46, 101–111.
Fitch, W.M., and Margoliash, E. 1967. Construction of phylogenetic trees. Science 155, 279–284.
Gascuel, O. 1997a. BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data.
Mol. Biol. Evol. 14, 685–695.
Gascuel, O. 1997b. Concerning the NJ algorithm and its unweighted version, UNJ. In Mirkin, B., McMorris, F., Roberts,
F., and Rzhetsky, A., eds. Mathematical Hierarchies and Biology, American Mathematical Society, Providence, RI.
Gascuel, O. 2000. On the optimization principle in phylogenetic analysis and the minimum-evolution criterion. Mol.
Biol. Evol. 17, 401–405.
Gascuel, O., Bryant, D., and Denis, F. 2001. Strengths and limitations of the minimum evolution principle. Syst. Biol.
50, 621–627.
Guindon, S., and Gascuel, O. 2002. Efficient biased estimation of evolutionary distances when substitution rates vary
across sites. Mol. Biol. Evol. 19, 534–543.
Harding, E. 1971. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab. 3,
44–77.
Kidd, K., and Sgaramella-Zonta, L. 1971. Phylogenetic analysis: Concepts and methods. Am. J. Human Genet. 23,
235–252.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies
of nucleotide sequences. J. Mol. Evol. 16, 111–120.
Kuhner, M.K., and Felsenstein, J. 1994. A simulation comparison of phylogeny algorithms under equal and unequal
rates. Mol. Biol. Evol. 11, 459–468.
McKenzie, A., and Steel, M. 2000. Distributions of cherries for two models of trees. Math. Biosci. 164, 81–92.
Nei, M., and Jin, L. 1989. Variances of the average numbers of nucleotide substitutions within and between populations.
Mol. Biol. Evol. 6, 290–300.
Pauplin, Y. 2000. Direct calculation of a tree length using a distance matrix. J. Mol. Evol. 51, 41–47.
Rambaut, A., and Grassly, N.C. 1997. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence
evolution along phylogenetic trees. Comput. Appl. Biosci. 13, 235–238.
Robinson, D., and Foulds, L. 1981. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147.
Rzhetsky, A., and Nei, M. 1993. Theoretical foundation of the minimum-evolution method of phylogenetic inference.
Mol. Biol. Evol. 10, 1073–1095.
Saitou, N., and Nei, M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees.
Mol. Biol. Evol. 4, 406–425.
Sneath, P.H.A., and Sokal, R.R. 1973. Numerical Taxonomy, 230–234, W.H. Freeman, San Francisco.
Swofford, D. 1996. PAUP—Phylogenetic Analysis Using Parsimony (and other methods), Version 4.0.
Swofford, D.L., Olsen, G.J., Waddell, P.J., and Hillis, D.M. 1996. Phylogenetic inference. In Hillis, D., Moritz, C.,
and Mable, B., eds., Molecular Systematics, 470–514, Sinauer, Sunderland, MA.
Tarjan, R.E. 1983. Data Structures and Network Algorithms, SIAM, Philadelphia.
Vach, W. 1989. Least squares approximation of additive trees. In Opitz, O., ed. Conceptual and Numerical Analysis
of Data, Springer-Verlag, Berlin.
Yule, G. 1925. A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis. Philos. Trans. Roy.
Soc. London Ser. B, Biological Sciences 213, 21–87.

Address correspondence to:


Olivier Gascuel
Département Informatique Fondamentale et Applications, LIRMM
161 rue Ada
34392, Montpellier, France

E-mail: gascuel@[Link]
