Classification and Regression Tree Construction
Thesis Proposal
Alin Dobra
Department of Computer Science
Cornell University, Ithaca NY
[email protected]
November 25, 2002
1 Introduction
Decision trees, either classification or regression trees, are an especially attractive type of model for
four main reasons. First, they have an intuitive representation; the resulting model is easy for humans to
understand and assimilate [BFOS84]. Second, decision trees are nonparametric models that require no
intervention from the user, and thus they are well suited for exploratory knowledge discovery. Third,
scalable algorithms, in the sense that performance degrades gracefully as the size of the training data
increases, exist for decision tree construction [GRG98]. Last, the accuracy of decision trees is comparable
or superior to that of other models [Mur95, LLS97].
Three subtopics of decision tree construction received our attention:
Bias and Bias Correction in Split Selection Methods: Preferences towards split variables
with particular characteristics (e.g., larger domains) can harm the interpretability and
accuracy of decision trees. Methods to identify and correct such preferences are of significant
interest.
Scalable Linear Regression Trees: Linear regression trees are substantially more accurate
than regression trees with constant regressors in the leaves, but no scalable construction
algorithms for such regressors have been proposed. There is great practical interest in a scalable
algorithm to build such regression trees.
Probabilistic Decision Trees: Decision trees are essentially deterministic functions from
the domain of the predictor attributes to the domain of the predicted attribute (class labels or real
values). Some applications require probabilistic models or require the prediction of the model
to be continuous and differentiable. Probabilistic decision trees generalize decision trees and
have these properties. Due to their decision tree heritage, they can be constructed efficiently.
In Section 2 we give some background information on classification and regression trees and
the EM algorithm for Gaussian mixtures. In Section 3 we present our ongoing and future work on
classification and regression tree construction. Section 3.1 contains details on our work on bias
correction in decision tree construction, followed by scalable linear regression tree construction in
Section 3.2 and probabilistic classification and regression trees in Section 3.3. Section ?? contains
some details on other ongoing work we are conducting on query processing over streaming data.
We conclude in Section 4 with a summary of past and proposed future work that constitutes
our thesis proposal.
2 Preliminaries
In this section we give a short introduction to classification and regression trees and to the EM algorithm
for Gaussian mixtures.
2.1 Classification Trees
Let X_1, . . . , X_m, C be random variables, where X_i has domain Dom(X_i). The random variable C has
domain Dom(C) = {1, . . . , J}. We call X_1, . . . , X_m the predictor attributes (m is the number of predictor
attributes) and C the class label.
A classifier 𝒞 is a function 𝒞 : Dom(X_1) × · · · × Dom(X_m) → Dom(C). Let Ω = Dom(X_1) × · · · ×
Dom(X_m) × Dom(C) (the set of events).
For a given classifier 𝒞 and a given probability measure P over Ω we can introduce the functional
R_P(𝒞) = P[𝒞(X_1, . . . , X_m) ≠ C], called the misclassification rate of the classifier 𝒞.
Classifier Construction Problem: Given a training dataset D of N independent identically distributed
samples from Ω, sampled according to probability distribution P, find a function 𝒞 that
minimizes the functional R_P(𝒞), where P is the probability distribution used to generate D.
A special type of classifier is a decision tree. A decision tree is a directed, acyclic graph T in
the form of a tree. The root of the tree (denoted by Root(T)) does not have any incoming edges.
Every other node has exactly one incoming edge and may have 0, 2 or more outgoing edges. We
call a node T without outgoing edges a leaf node; otherwise T is called an internal node. Each leaf
node is labeled with one class label; each internal node T is labeled with one predictor attribute
X_T, X_T ∈ {X_1, . . . , X_m}, called the split attribute. We denote the label of node T by Label(T).
Each edge (T, T') has a predicate q_(T,T') associated with it, where q_(T,T') involves only the
splitting attribute X_T of node T. The set of predicates Q_T on the outgoing edges of an internal
node T must be non-overlapping and exhaustive. A set of predicates Q is non-overlapping if the
conjunction of any two predicates in Q evaluates to false. A set of predicates Q is exhaustive if the
disjunction of all predicates in Q evaluates to true. We call the set of predicates Q_T on the outgoing
edges of an internal node T the splitting predicates of T; the combined information of splitting
attribute and splitting predicates is called the splitting criterion of T and is denoted by crit(T).
Given a decision tree T, we can define the associated classifier 𝒞 recursively: a leaf node T predicts
its label Label(T); an internal node T passes the input (x_1, . . . , x_m) to the unique child T_j whose
splitting predicate is satisfied by the input and returns the prediction of that subtree.

Gini Gain. This split selection method is based on the gini index as impurity function:

gini(T) def= 1 − Σ_{i=1}^{J} P[i|T]²,    GG(T, X, Q) def= gini(T) − Σ_{j=1}^{n} P[q_j(X)|T] · gini(T_j)    (3)
Information Gain. This popular split selection method is widely used in the machine learning
literature [Mit97].
entropy(T) def= − Σ_{i=1}^{J} P[i|T] · log P[i|T],    IG(T, X, Q) def= entropy(T) − Σ_{j=1}^{n} P[q_j(X)|T] · entropy(T_j)    (4)
Figure 2: Example of a decision tree (splits on Age, Car Type and # Children; leaves labeled YES or NO).
Car type Age Chd Sub
sedan 23 0 yes
sports 31 1 no
sedan 36 1 no
truck 25 2 no
sports 30 0 no
sedan 36 0 no
sedan 25 0 yes
truck 36 1 no
sedan 30 2 yes
sedan 31 1 yes
sports 25 0 no
sedan 45 1 yes
sports 23 2 no
truck 45 0 yes
Figure 3: Example Training Database.
Gain Ratio. Quinlan introduced this adjusted version of the information gain to counteract the
bias introduced by categorical attributes with large domains [Qui86].
GR(T, X, Q) def= IG(T, X, Q) / ( − Σ_{j=1}^{|Dom(X)|} P[X = x_j|T] · log P[X = x_j|T] )    (5)
There are two other popular split selection methods for n-ary splits for categorical attributes
that are not based on an impurity function.
The χ² Statistic.

χ²(T) def= Σ_{i=1}^{|Dom(X)|} Σ_{j=1}^{J} (P[X = x_i|T] · P[C = j|T] − P[X = x_i, C = j|T])² / (P[X = x_i|T] · P[C = j|T]).    (6)
The G Statistic.

G(T) def= 2 · N · IG(T) · log_e 2,    (7)

where N is the number of records at node T.
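To make Equations (3)-(7) concrete, the sketch below computes the gini gain, information gain, gain ratio, χ² statistic and G statistic of a single candidate n-ary split directly from the class counts in each branch. The NumPy layout and function names are our own illustration, not code from any of the cited systems; Equation (6) is evaluated exactly as written above, so the classical χ² test statistic would be N times the returned value.

```python
import numpy as np

def split_statistics(counts):
    """counts[j, c]: number of records in branch j of a candidate split of node T
    that carry class label c. Returns the criteria of Equations (3)-(7)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p_branch = counts.sum(axis=1) / N                      # P[q_j(X) | T]
    p_class = counts.sum(axis=0) / N                       # P[i | T]
    p_cond = counts / counts.sum(axis=1, keepdims=True)    # P[i | T_j]

    def gini(p):
        return 1.0 - np.sum(p ** 2)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    gini_gain = gini(p_class) - np.sum(p_branch * np.array([gini(r) for r in p_cond]))
    info_gain = entropy(p_class) - np.sum(p_branch * np.array([entropy(r) for r in p_cond]))
    gain_ratio = info_gain / entropy(p_branch)             # Eq. (5)
    expected = np.outer(p_branch, p_class)                 # P[X = x_j | T] * P[C = i | T]
    chi2 = np.sum((expected - counts / N) ** 2 / expected) # Eq. (6)
    g_stat = 2.0 * N * info_gain * np.log(2.0)             # Eq. (7), IG measured in bits
    return gini_gain, info_gain, gain_ratio, chi2, g_stat

# Example: a binary split of 100 records with two class labels.
print(split_statistics([[30, 10], [20, 40]]))
```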
2.2 Regression Trees
We start with the formal definition of the regression problem and then present regression trees, a
particular type of regressor.
We have the random variables X_1, . . . , X_m as in the previous section, to which we add the random
variable Y, whose domain is the real line and which we call the predicted attribute or output.
A regressor is a function ℛ : Dom(X_1) × · · · × Dom(X_m) → Dom(Y). If we let Ω = Dom(X_1) × · · · ×
Dom(X_m) × Dom(Y) we can define probability measures P over Ω. Using such a probability measure
and a loss function L (e.g., the square loss L(a, x) = |a − x|²) we can define the error of the regressor
as R_P(ℛ) = E_P[L(Y, ℛ(X_1, . . . , X_m))], where E_P is the expectation with respect to the probability
measure P.
Regressor Construction Problem: Given a training dataset D of N i.i.d. samples from Ω, sampled
according to probability distribution P, find a function ℛ that minimizes the functional R_P(ℛ).
Regression trees, the particular type of regressor we are interested in, are the natural generalization
of decision trees to regression (continuous valued prediction) problems. Instead of associating a class
label with every node, a real value or a functional dependency on some of the inputs is used.
Regression trees were introduced by Breiman et al. in CART [BFOS84]. Regression trees in
CART have a constant numerical value in the leaves and use the variance as the measure of impurity.
Thus the split selection measure is:
Err(T) def= (1/N_T) Σ_{i=1}^{N_T} (y_i − ȳ)²,    ∆Err(T) = Err(T) − Σ_{j=1}^{n} P[q_j(X)|T] · Err(T_j)    (8)

where N_T is the number of training examples at node T and ȳ is the average value of the predicted
attribute over these examples.
Using the variance as the impurity measure is justified by the fact that the best constant predictor
in a node is the average of the value of the predicted variable over the training examples that
correspond to the node; the variance is thus the mean square error of the average used as a predictor.
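As a small illustration of Equation (8), the following sketch (our own, not code from CART) evaluates the decrease in variance impurity achieved by a candidate binary split of the target values at a node.

```python
import numpy as np

def err(y):
    """Err(T): mean square error of predicting the node average, i.e. the variance."""
    return float(np.mean((y - np.mean(y)) ** 2))

def err_reduction(y_left, y_right):
    """Decrease of the variance impurity for a binary split, as in Equation (8)."""
    y = np.concatenate([y_left, y_right])
    p_left = len(y_left) / len(y)
    return err(y) - p_left * err(y_left) - (1 - p_left) * err(y_right)

print(err_reduction(np.array([1.0, 1.2, 0.9]), np.array([3.1, 2.8, 3.3, 2.9])))
```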
As in the case of classification trees, a prediction is made by navigating the tree, following branches
with true predicates, until a leaf is reached. The numerical value associated with the leaf is the
prediction of the model.
Usually a top-down induction algorithm like the one in Figure 1 is used to build regression
trees. Pruning is used to improve the accuracy on unseen examples, as in the classification
tree case. Pruning methods for classification trees can be adapted to regression trees [Tor98].
2.3 EM Algorithm for Gaussian Mixtures
In this section we give a short introduction to the problem of approximating an unknown distribution,
from which a sample is available, with a mixture of Gaussian distributions. An iterative
algorithm to find the parameters of such Gaussian distributions was introduced by Wolfe [Wol70] and
was later generalized by Dempster et al. [DR77]. Our introduction mostly follows the excellent
tutorial [Bil], where complete proofs can be found.
The Gaussian mixture density estimation problem is the following: find the most likely values
of the parameter set Θ = (α_1, . . . , α_M, μ_1, . . . , μ_M, Σ_1, . . . , Σ_M) of the probabilistic model:

p(x, Θ) = Σ_{i=1}^{M} α_i · p_i(x | μ_i, Σ_i)    (9)

p_i(x | μ_i, Σ_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · e^{−(1/2)(x − μ_i)^T Σ_i^{−1} (x − μ_i)}    (10)

given the sample X = (x_1, . . . , x_N) (the training data). In the above formulas p_i is the density of the
Gaussian distribution with mean μ_i and covariance matrix Σ_i, and α_i is the weight of component
i of the mixture, with the normalization property Σ_{i=1}^{M} α_i = 1. M is the number of mixture
components or clusters and is fixed and given, and d is the dimensionality of the space.
The EM algorithm for estimating the parameters of the Gaussian components proceeds by
repeatedly applying the following two steps until the values of the estimates no longer change significantly:
Expectation (E step):

h_{ij} = α_i · p_i(x_j | μ_i, Σ_i) / Σ_{k=1}^{M} α_k · p_k(x_j | μ_k, Σ_k)    (11)

Maximization (M step):

α_i = (1/N) Σ_{j=1}^{N} h_{ij}    (12)

μ_i = Σ_{j=1}^{N} h_{ij} x_j / Σ_{j=1}^{N} h_{ij}    (13)

Σ_i = Σ_{j=1}^{N} h_{ij} (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{N} h_{ij}    (14)
It follows immediately from the above equations that Σ_i is a positive definite matrix if the vectors
x_j are linearly independent.
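The update rules above translate almost line by line into code. The sketch below is a plain, non-incremental NumPy implementation given only for illustration; initializing the means from randomly chosen data points and adding a small ridge to the covariances are our choices for numerical convenience, not part of the algorithm as stated.

```python
import numpy as np

def em_gaussian_mixture(X, M, iters=50, seed=0):
    """Fit an M-component Gaussian mixture to the rows of X with plain EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(M, 1.0 / M)                      # mixture weights
    mu = X[rng.choice(N, M, replace=False)]          # means initialized from data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])

    for _ in range(iters):
        # E step: responsibilities h[i, j] of cluster i for point j, Eq. (11)
        h = np.empty((M, N))
        for i in range(M):
            diff = X - mu[i]
            inv = np.linalg.inv(sigma[i])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[i]))
            h[i] = alpha[i] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        h /= h.sum(axis=0, keepdims=True)

        # M step: re-estimate weights, means and covariances, Eqs. (12)-(14)
        s = h.sum(axis=1)
        alpha = s / N
        mu = (h @ X) / s[:, None]
        for i in range(M):
            diff = X - mu[i]
            sigma[i] = (h[i, :, None] * diff).T @ diff / s[i] + 1e-6 * np.eye(d)
    return alpha, mu, sigma
```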
3 Classification and Regression Tree Construction
As mentioned in the introduction, most of our work addresses issues in decision and regression
tree construction. Even though this space has been thoroughly explored over the last two decades,
it still provides interesting research opportunities.
3.1 Bias Correction in Decision Tree Construction
Split variable selection is one of the main components of classification tree construction. The quality
of the split selection criterion has a major impact on the quality (generalization, interpretability
and accuracy) of the resulting tree. Many popular split criteria suffer from a bias towards predictor
variables with large domains [WL94, Kon95].
Consider two predictor variables X_1 and X_2 whose association with the class label is equally
strong (or weak). Intuitively, a split selection criterion is unbiased if on a random instance the
criterion chooses each of X_1 and X_2 as the split variable with probability 1/2. Unfortunately, this is
usually not the case.
There are two previous meanings associated with the notion of bias in decision tree construction:
an anomaly observed by [Qui86], and a difference in the distribution of the split criterion applied
to different predictor variables [WL94]. The first is very specific and does not extend to the
intuitive definition given above; the second is too strict in the sense that it requires equality of
distributions, which is hard to check.
Figure 4: The bias of the information gain, as a function of the number of training examples N and log10(p1).

Figure 5: The bias of the p-value of the χ²-test (using a χ²-distribution), as a function of N and log10(p1).
In our work we defined the bias of a split criterion s as the logarithmic odds of choosing X_1, a split
attribute, over X_2, some other split attribute, when neither X_1 nor X_2 is correlated with the class
label [DG01]. Formally:

Bias(X_1, X_2) = log_10 ( P_s(X_1, X_2) / (1 − P_s(X_1, X_2)) )    (15)

where

P_s(X_1, X_2) def= P[s(T, X_1) > s(T, X_2)]    (16)

is the probability that the split criterion s chooses X_1 over X_2. T is a dataset distributed according to
the hypothesis that the input attributes are not correlated with the class label (the Null Hypothesis).
When the split criterion is unbiased, Bias(X_1, X_2) = log_10(0.5/(1 − 0.5)) = 0. The bias is positive
if s prefers X_1 over X_2, and negative otherwise. Instead of specifying when a split criterion is
biased or not, this definition gives a measure of the bias. This is important since in practice we are
interested in almost unbiased split criteria, not necessarily strictly unbiased ones.
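The bias of Equation (15) can be estimated by Monte Carlo simulation under the Null Hypothesis: generate datasets in which neither attribute is correlated with the class label, apply the split criterion to both attributes, and count how often X_1 wins. The sketch below does this for the information gain; the concrete setup (domain sizes 10 and 2, class probability p1, Laplace smoothing of the win frequency) is our own illustration of the definition, not the exact experiment of [DG01].

```python
import numpy as np

def information_gain(x, c):
    """Information gain of splitting on the values of attribute x for class labels c."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    n = len(c)
    return entropy(c) - sum((x == v).sum() / n * entropy(c[x == v]) for v in np.unique(x))

def estimate_bias(n_values_1=10, n_values_2=2, N=200, p1=0.5, trials=2000, seed=0):
    """Monte Carlo estimate of Bias(X1, X2) of Eq. (15) under the Null Hypothesis."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(trials):
        c = rng.random(N) < p1                    # class label, independent of X1 and X2
        x1 = rng.integers(0, n_values_1, N)
        x2 = rng.integers(0, n_values_2, N)
        if information_gain(x1, c) > information_gain(x2, c):
            wins += 1
    p_s = (wins + 1) / (trials + 2)               # smoothed estimate of Eq. (16)
    return np.log10(p_s / (1 - p_s))              # Eq. (15)

# Strongly positive: information gain almost always picks the 10-valued X1.
print(estimate_bias())
```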
Using this definition of the bias we showed that a number of popular split criteria have a
pronounced bias towards split attributes with big domains [DG01]. Figure 4 depicts the bias of the
information gain split criterion as a function of the number of training examples N and the probability
p_1 of seeing the first class label. As can easily be noticed, for values of p_1 between 10^{-2} and 1/2 there
is a very strong bias towards the predictor attribute X_1, which has 10 possible values (predictor attribute
X_2 has 2 possible values). Similar results were observed for the gini gain and the gain ratio (but with
smaller bias). Figure 5 shows the result of the same experiment for the χ²-test. For this criterion the
bias is almost nonexistent. The same qualitative result can be observed for the G²-statistic, but with a
more noticeable bias.
These experimental results suggest that there is something inherent in the nature of the statistical
split criteria that makes them have a much smaller bias than the non-statistical criteria. We
were able to prove that the p-value of any split selection criterion (the probability, under the Null
Hypothesis, of obtaining a value of the criterion as good as or better than the one observed) is
virtually unbiased [DG01].
Thus, to get an unbiased criterion, the p-value of any split criterion can be used. The only
difficulty in this method is the computation of the p-value of a given criterion under the Null
Hypothesis. We can distinguish four ways in which this can be accomplished:
Exact computation. Use the exact distribution of the split criterion. The main drawback is
that this is almost always very expensive; it is reasonably efficient only for n = 2 and k = 2
[Mar97].
Bootstrapping. Use Monte Carlo simulations with random instances generated according to
the Null Hypothesis. This method was used by [FW98]; its main drawback is the high cost
of the Monte Carlo simulations.
Asymptotic approximations. Use an asymptotic approximation of the distribution of the split
criterion (e.g., use the χ²-distribution to approximate the χ²-test [Kas80] and the G²-statistic
[Min87]). Approximations often work well in practice, but they can be inaccurate for border
conditions (e.g., small entries in the contingency table).
Tight approximations. Use a tight approximation of the distribution of the criterion with
a nice distribution. While conceptually superior to the previous three methods, such tight
approximations might be hard to find.
Out of the four possible ways to compute the p-value of a split criterion, the last one, tight
approximation, is the most appealing since it is both effective and computationally efficient. We
were able to derive such a tight approximation for the gini index for decision trees that have an
outgoing branch at intermediate nodes for every possible value of the split attribute. The key
ideas of the approximation are: (1) the experimental observation that the distribution of the gini
index under the Null Hypothesis is essentially a gamma distribution, and (2) the expected value
and variance of this split criterion can be computed exactly. Thus, the distribution of the gini
index is approximated by a gamma distribution with the same expected value and variance (two
parameters completely determine a gamma distribution). Details and actual formulas can be found
in [DG01]. Experiments with this criterion revealed a behavior similar to the one in Figure 5.
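Given the exact expected value and variance of the gini gain under the Null Hypothesis (the formulas are in [DG01] and not reproduced here), turning an observed gini gain into an approximately unbiased criterion reduces to matching a gamma distribution to those two moments and taking a p-value. A minimal sketch, assuming the caller supplies the two moments:

```python
from scipy.stats import gamma

def gini_gain_p_value(observed_gain, null_mean, null_var):
    """Approximate p-value of an observed gini gain under the Null Hypothesis,
    using a gamma distribution with the same expected value and variance."""
    shape = null_mean ** 2 / null_var        # gamma shape k:      mean = k * theta
    scale = null_var / null_mean             # gamma scale theta:  var  = k * theta^2
    return gamma.sf(observed_gain, a=shape, scale=scale)   # P[gain >= observed]

# The attribute with the smallest p-value is the preferred split attribute.
```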
Summary: Our prior work on the problem of bias in split selection methods has the following three
contributions:
1. We defined bias in split selection as the log odds of choosing one split variable over another.
This defines a useful measure that captures the intuitive notion of bias. A mathematical
definition of the bias is crucial for any theoretical development.
2. We proved that the p-value of any split selection criterion is essentially unbiased, thus giving
a very general method to obtain unbiased criteria. This result also explains why the statistical
criteria previously used are unbiased.
3. We constructed a tight approximation of the distribution of the gini gain split criterion. We
used this distribution to build an unbiased criterion following our general prescription. The
distribution of the gini gain is interesting in itself since it can be used for pre- or post-pruning,
much in the way the χ²-test is used [Fra00].
Future Work and Research Directions: The work we just presented on the bias in split selection is
only an initial step. A thorough theoretical and experimental study of the proposed correction of
the gini index is in order. We have made some progress in the direction of estimating the distribution of
the gini index in general situations, not only under the Null Hypothesis. Determining the distribution
is crucial for a theoretical characterization of the correction.
3.2 Scalable Linear Regression Trees
We introduced regression trees in Section 2.2. We mention here only some further developments
that are related to our work.
Even though regression trees were introduced early in the development of decision trees (CART, Breiman
et al. [BFOS84]), they have received far less attention from the research community. Quinlan generalized
the regression trees in CART by using a linear model in the leaves to improve the accuracy of the
prediction [Qui92]. The impurity measure used to choose the split variable and the split point is
the standard deviation of the predicted variable over the training examples at the node. Karalic argues that
the mean square error of the linear model in a node is a more appropriate impurity measure for
linear regression trees, since it is possible to have data with big variance that is nevertheless well predicted by
a linear model [Kar92]. This is a crucial observation since evaluating the variance is much easier
than evaluating the error of a linear model (which requires solving a linear system). Moreover, if discrete
attributes are present among the predictor attributes and CART-type trees are built (at most
two children for each node), the problem of finding the best split attribute becomes intractable for
linear regression trees, since the theorem that justifies a linear algorithm for finding the best split
(Theorem 9.4 in [BFOS84]) does not seem to apply.
Torgo proposed the use of even more sophisticated functional models in the leaves (e.g., kernel
regressors) [Tor97]. For such regression trees both the construction and the use of the model are expensive,
but the accuracy is superior to that of linear regression trees.
There are a number of contributions coming from the statistics community. Chaudhuri et al.
proposed the use of statistical tests for split variable selection instead of error-of-fit methods
[CHLY94]. The main idea is to fit a model (constant, linear or higher order polynomial) at every
node in the tree and to partition the data at this node into two classes: datapoints with positive
residuals and datapoints with negative residuals (residuals are the differences between the true values
and the values predicted by the regression model). In this manner the regression problem is locally
reduced to a classification problem, so it becomes much simpler. Statistical tests used in decision
tree construction can be used from this point on; in this work Student's t-test was used. Unfortunately,
it is not very clear why differences in the distributions of the signs of the residuals are
a good criterion for deciding which attribute to split on and which split point to use. A further
enhancement was proposed recently by Loh [Loh02]. It consists mostly in the use of the χ²-test instead
of the t-test in order to accommodate discrete attributes, the detection of interactions between pairs of
predictor attributes, and a sophisticated calibration mechanism to ensure the unbiasedness of the
split attribute selection criterion.
Motivation: We are interested mostly in designing a scalable regression tree algorithm with linear
regressors in the leaves. The choice of linear regression trees as the model for the scalable regression problem
is motivated by the following properties:
Due to their hierarchical structure, regression trees are easy to understand and interpret.
Divide-and-conquer methods can be used to build regression trees, so the computational
effort can be made linear in the size of the training data and logarithmic in the size of the
model. This is the key to scalable regression methods.
Regression trees, especially with linear models in the leaves, are capable of achieving great
accuracy, and the models are non-parametric (there is no a priori limit on the number of
parameters).
Predictions using a regression tree model can be made very efficiently (in time proportional to the
logarithm of the size of the model).
For constant regression trees, algorithms for scalable classification trees can be straightforwardly
adapted [GRG98]. The main obstacle to doing the same thing for linear regression trees is the
observation made previously that the problem of partitioning the domain of a discrete variable into
two parts is intractable. Also, the amount of sufficient statistics that has to be maintained goes from
two real numbers for constant regressors (the mean and the mean of squares) to quadratic in the number of
regression attributes (to maintain the matrix A^T A that defines the linear system). This can also be a
problem.
Following [Loh02], we distinguish between three kinds of predictor attributes:
Discrete attributes: used only in the split predicates of intermediate nodes in the regression
tree.
Split continuous attributes: continuous attributes used only for splitting.
Regression attributes: continuous attributes used in the linear combination that specifies the
linear regressors in the leaves, as well as for specifying split predicates.
Proposed Solution: The main idea is to locally transform the regression problem into a classification
problem. This is done by first identifying two general Gaussian distributions in the
regression attributes × output space, using the EM algorithm for Gaussian mixtures, and then classifying
the datapoints based on their probability of belonging to each of the two distributions. Classification tree
techniques are then used to select the split attribute and the split point.
Figure 6: Projection of the training data on the X_r, Y space.

Figure 7: Projection on the X_d, X_r, Y space of the same training data as in Figure 6.
The role of EM is to identify two natural classes in the data that have an approximately linear
organization. The role of the classification step is to identify predictor attributes that can make the
difference between these two classes in the input space. To see this more clearly, suppose we are
in the process of building a linear regression tree and that we have a hypothetical set of
training examples with three components: a regression attribute X_r, a discrete attribute X_d and
the predicted attribute Y. The projection of the training data on the X_r, Y space might look
like Figure 6. The data is approximately organized in two clusters with Gaussian distributions,
marked as ellipsoids. Differentiating between the two clusters is crucial for prediction, but the
information in the regression attribute is not sufficient to make this distinction. The information
in the discrete attribute X_d can make this distinction, as can be observed from Figure 7, where the
projection is made on the X_d, X_r, Y space. If more split attributes had been present, a split
on X_d would still have been preferred, since the classes after the split are pure.
Observe that the use of the EM algorithm for Gaussian mixtures is very limited here, since we have only two
mixture components and thus the likelihood function has a simpler form, which means fewer local maxima. Since
EM is sensitive to distances, before running the algorithm the training data has to be normalized by
performing a linear transformation that makes the data look as close as possible to a unit sphere
centered at the origin. Experimentally we observed that, with this transformation and in
this restricted scenario, the EM algorithm with randomly initialized clusters works well.
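The sketch below illustrates the local transformation at a single node: normalize the data, fit a two-component Gaussian mixture in the regression attributes × output space, label every training example with its most likely component, and score a candidate split attribute with a classification criterion (gini gain here). It uses scikit-learn's GaussianMixture as a stand-in for the specialized EM implementation described below; all function names are our own illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def local_classes(X_reg, y, seed=0):
    """Turn the regression problem at a node into a two-class problem."""
    Z = np.column_stack([X_reg, y])
    Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)   # normalize before running EM
    gm = GaussianMixture(n_components=2, covariance_type="full", random_state=seed)
    return gm.fit_predict(Z)                             # class of each training example

def gini_gain_of_split(split_attr, classes):
    """Score an n-ary split on a discrete predictor attribute."""
    n = len(classes)
    return gini(classes) - sum(
        (split_attr == v).sum() / n * gini(classes[split_attr == v])
        for v in np.unique(split_attr))
```

The attribute (and split point) with the largest gain would then be chosen, exactly as in classification tree construction.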
Benefits of Proposed Solution:
The regression algorithm is very intuitive and simple, since it consists mainly of a straightforward
combination of known and well studied techniques, namely split selection in classification
trees and the EM algorithm for Gaussian mixtures.
By employing scalable versions of variable selection in classification tree construction algorithms
[GRG98] and of the EM algorithm for Gaussian mixtures [BFR98], a scalable linear regression
tree construction algorithm can be built.
Good oblique splits (inequalities involving linear combinations of the split and regression continuous
attributes) can be found for regression trees built with our algorithm. The projections
of the datapoints corresponding to the two classes, induced by the two Gaussian clusters,
onto the continuous predictor attributes are approximated with two Gaussian distributions (one
for each class). The hyperplane that best separates these distributions determines the optimal
oblique split. The construction of these oblique splits incurs minimal computational overhead
and can substantially improve the interpretability and accuracy of the regression trees.
Efficient Implementation of the EM Algorithm: The following two ideas are used to implement the
EM algorithm efficiently:
The E and M steps are performed simultaneously. This means that the quantities h_{ij} do not have to
be stored explicitly.
All the operations are expressed in terms of the Cholesky decomposition G_i of the covariance matrix,
Σ_i = G_i G_i^T. G_i has the useful property of being lower triangular, so solving linear systems takes
quadratic effort in the number of dimensions and computing the determinant is linear in the
number of dimensions.
Using the Cholesky decomposition we immediately have Σ_i^{−1} = G_i^{−T} G_i^{−1} and |Σ_i| = |G_i|².
Substituting in Equation 10 we get:

p_i(x | μ_i, G_i) = 1 / ((2π)^{d/2} |G_i|) · e^{−(1/2) ||G_i^{−1}(x − μ_i)||²}    (17)
The quantity x' = G_i^{−1}(x − μ_i) can be computed by solving the linear system G_i x' = x − μ_i, which
takes quadratic effort. For this reason the inverse of G_i need not be precomputed, since solving
the linear system takes as much time as a matrix-vector multiplication.
The following quantities have to be maintained incrementally for each cluster:

s_i = Σ_{j=1}^{N} h_{ij}    (18)

s_{x,i} = Σ_{j=1}^{N} h_{ij} x_j    (19)

S_i = Σ_{j=1}^{N} h_{ij} x_j x_j^T    (20)
and for every training example x_j the quantities h_{ij} are computed with the formula in Equation 11
and are discarded after updating s_i, s_{x,i} and S_i for every cluster i.
After all the training examples have been seen, the new parameters of the M distributions are
computed with the formulas:

α_i = s_i / N    (21)

μ_i = s_{x,i} / s_i    (22)

Σ_i = S_i / s_i − μ_i μ_i^T    (23)

G_i = Chol(Σ_i)    (24)
Moreover, if the datapoints come from a Gaussian distribution with mean μ_i and covariance
matrix G_i G_i^T, the transformation x' = G_i^{−1}(x − μ_i) results in datapoints with a Gaussian
distribution with mean 0 and identity covariance matrix. This means that this transformation can
be used for data normalization in the tree growing phase.
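A single data pass of the efficient EM variant just described might look like the following sketch: the responsibilities are computed from the Cholesky factors through a triangular solve (Equation 17), folded immediately into the sufficient statistics s_i, s_{x,i}, S_i (Equations 18-20) and discarded, and the parameter update of Equations (21)-(24) closes the pass. The variable names and NumPy/SciPy calls are our own illustration.

```python
import numpy as np
from scipy.linalg import solve_triangular

def em_pass(X, alpha, mu, G):
    """One combined E/M pass over the data. alpha[i], mu[i], G[i] are the current
    weight, mean and lower-triangular Cholesky factor of cluster i (Sigma_i = G_i G_i^T)."""
    N, d = X.shape
    M = len(alpha)
    s = np.zeros(M)                  # Eq. (18)
    s_x = np.zeros((M, d))           # Eq. (19)
    S = np.zeros((M, d, d))          # Eq. (20)

    for x in X:
        h = np.empty(M)
        for i in range(M):
            # x' = G_i^{-1}(x - mu_i) via a triangular solve, Eq. (17)
            xp = solve_triangular(G[i], x - mu[i], lower=True)
            det = np.prod(np.diag(G[i]))             # |G_i|, linear in d
            h[i] = alpha[i] * np.exp(-0.5 * xp @ xp) / ((2 * np.pi) ** (d / 2) * det)
        h /= h.sum()                                  # Eq. (11); h is then discarded
        s += h
        s_x += np.outer(h, x)
        S += h[:, None, None] * np.outer(x, x)

    alpha_new = s / N                                           # Eq. (21)
    mu_new = s_x / s[:, None]                                   # Eq. (22)
    Sigma = S / s[:, None, None] - np.einsum('ij,ik->ijk', mu_new, mu_new)   # Eq. (23)
    G_new = np.array([np.linalg.cholesky(Si) for Si in Sigma])  # Eq. (24)
    return alpha_new, mu_new, G_new
```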
The Gaussian mixture problem is solved in the regressor-output space, and for each mixture component we
have to find the least squares estimate (LSE) of the output as a function of the regressors. It can be
shown that the LSE regressor is:

y = c_i^T (x − μ_{I,i}) + μ_{O,i}    (25)

where x is the vector of regression attribute values and c_i is the solution of the linear equation
c_i^T G_{II,i} = G_{OI,i}. The subscript I is used to denote the first d − 1 components (the ones referring to
the regressors) and O refers to the last component (which corresponds to the output). Thus, for example,
G_{II,i} refers to the (d − 1) × (d − 1) upper part of G_i.
Prior Work: We have fully implemented the algorithm described above for the construction of linear
regression trees. The only pruning method implemented is Quinlan's resubstitution error pruning.
We conducted some preliminary experiments in which we learned smooth nonlinear three-dimensional
functions, with and without normal noise added. In these experiments we observed
improvements of up to 4 in accuracy, with the resulting trees having up to 20% fewer nodes, when
compared with the GUIDE regression tree learner [Loh02].
Future Work:
Implement more pruning methods, for example complexity pruning.
Conduct extensive experiments with the proposed regression method on synthetic and real-life
datasets, comparing it with other well established regression methods in terms of accuracy,
size of the resulting model and computational effort.
3.3 Probabilistic Decision and Regression Trees
Decision and regression trees have proved to be an excellent class of models for practical applications.
Nevertheless, they have a number of shortcomings for some applications:
Only one path in the tree is used to make a prediction. This is the case since any path in the
tree defines a partition of the input space and a particular value of the input can fall in only
one such partition. As a consequence, the prediction lacks precision when the
input is close to the border of a partition. For regression trees it also means that the model,
seen as a function, has discontinuities, so it does not have a derivative everywhere.
In the learning process each training example influences the construction of only a subpart of the
tree. After a splitting strategy is chosen, only the training examples that satisfy a split predicate
are used to construct the corresponding child node, so the information contained in a datapoint is used in
only one child. This leads to a severe reduction in the number of datapoints on which decisions are
made as the tree construction progresses, and consequently to a reduction in the precision
of the estimates.
Mistakes in the choice of the variable to split on at an intermediate node not only
increase the size of the model, but also decrease the precision of subsequent decisions
due to data fragmentation.
For some applications it is desirable for decision trees to give a probabilistic answer instead
of a deterministic one. Estimating such probabilities only from the information at the
leaves incurs big errors due to data fragmentation.
In order to address these shortcomings we propose a generalization of decision and regression
trees that makes extensive use of probability distributions. To simplify the exposition, we
first present probabilistic decision trees (PDTs) and afterwards indicate how probabilistic
regression trees differ.
First we introduce some notation. Let k be the number of class labels (the size of the domain of
the predicted variable), and let x be the current input (the vector with values for all the input variables).
Like decision trees, PDTs consist of a set of nodes that are linked in a tree structure. There are
two types of nodes:
Intermediate Nodes: nodes with exactly k children. For each child i a function f_{T,i},
defined on the space of all inputs, is specified. This function is interpreted as the density of
a probability distribution, and its application to the input x gives the probability that x influences
branch i of the subtree rooted at the current node. The functions f_{T,i} satisfy the normalization
condition Σ_{i=1}^{k} f_{T,i}(x) = 1 for any input x. The edge between nodes T and T_i is labeled by
the function f_{T,i}.
Leaf Nodes: nodes without children. The only information attached to them is the
predicted class label.
Figure 8: Example of a probabilistic decision tree: the edges leaving node T_1 are labeled with normal densities N(μ_1, σ_1)(X_1) and N(μ_2, σ_2)(X_1), the edges leaving node T_2 with binomial densities on X_2, and the leaves carry the class labels 1 and 2.
An example of a PDT is depicted in Figure 8. There are two class labels, 1 and 2, so each of
the intermediate nodes T_1 and T_2 has two children. For the leaves we represent only the class
label. For node T_1 the two functions that induce the probability of getting to each of its children are
densities of normal distributions on the predictor variable X_1 (which is assumed to be continuous). For T_2
the two functions are densities of binomial distributions on the predictor variable X_2, which is assumed to
be discrete.
Since the function f_{T,i}(x) is interpreted as the probability that the input x goes on branch i at node
T, the product of the values, on input x, of the functions that label the edges on a path from
the root node T_R to a leaf gives the probability of reaching that particular leaf. By summing the
probabilities of reaching the leaves with class label c we obtain the probability that the output has class
label c. This can be done for all the class labels, and thus a probability distribution for the prediction
is obtained. If a deterministic prediction is required, the class label with the highest probability
can be returned.
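The inference procedure just described is a short recursion over the tree. In the sketch below the node representation (plain dictionaries, with a list 'f' of edge density functions at intermediate nodes) is our own illustration, not a prescribed data structure.

```python
def class_distribution(node, x, path_prob=1.0, dist=None):
    """Probability distribution over class labels predicted by a PDT for input x.
    A leaf is {'label': c}; an intermediate node is {'children': [...], 'f': [f_1, ..., f_k]}
    with sum_i f_i(x) = 1 for every input x."""
    if dist is None:
        dist = {}
    if 'label' in node:                               # leaf: accumulate the path probability
        dist[node['label']] = dist.get(node['label'], 0.0) + path_prob
        return dist
    for child, f in zip(node['children'], node['f']):
        class_distribution(child, x, path_prob * f(x), dist)
    return dist

def predict(node, x):
    """Deterministic prediction: the class label with the highest probability."""
    dist = class_distribution(node, x)
    return max(dist, key=dist.get)
```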
If the functions f_{T,i} of every node T are chosen so that f_{T,i}(x) has value 1 if x is in partition P_i
and 0 otherwise, and the partitions P_i are simple in the sense that they refer to only one component of
x, then PDTs degenerate into ordinary decision trees.
Learning PDTs: The main idea in constructing the functions f_{T,i} of a node T during the learning
of a PDT is to use all the available training examples, but to weight each of them by the product of the
values taken on its input part by the functions on the path from the root node to the current node.
Thus the already constructed part of the PDT acts as a magnifying glass that
changes the importance of each training example in the construction of the current node.
Since the training examples are labeled, for each child i a distribution is determined that approximates
well the distribution of the weighted datapoints with label i. Only particular classes of
parametric distributions are considered. Choosing the parameters of a distribution when
its class is known is equivalent to the split point selection problem in decision tree construction;
choosing the class is equivalent to the split attribute selection problem. Examples of classes of
distributions are multinomial distributions on discrete predictor attributes and normal distributions
on continuous predictor attributes, one class for each predictor attribute.
The definitions of impurity measures like the gini index and the information gain use only prior and
conditional probabilities, so they can also be used to select the class of distributions to use in a node of
a PDT.
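As a small illustration of how this weighting interacts with the impurity measures, the sketch below (names and representation are ours) computes the weight of each training example at a node and a gini index based on weighted relative class frequencies.

```python
import numpy as np

def example_weights(path_functions, X):
    """Weight of each training example at a node: the product of the edge functions
    on the root-to-node path, applied to the input part of the example."""
    w = np.ones(len(X))
    for f in path_functions:
        w *= np.array([f(x) for x in X])
    return w

def weighted_gini(weights, labels):
    """Gini index computed from weighted relative class frequencies."""
    total = weights.sum()
    p = np.array([weights[labels == c].sum() / total for c in np.unique(labels)])
    return 1.0 - np.sum(p ** 2)
```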
Probabilistic Regression Trees (PRTs): They differ from PDTs only in that regression models
(e.g., constant or linear) label the leaves instead of class labels. The prediction of a PRT is the weighted
sum of the predictions of the regressors in the leaves, the weight of a leaf being the probability of reaching
that leaf from the root node for the particular input. We see no reason to have more than two
children per node in a PRT. The technique described in Section 3.2 can be used to locally reduce
the regression problem to a classification problem.
Future work: Some of the ideas mentioned above require further refinement. Pruning of
PDTs and PRTs also has to be considered. Once the learning algorithms are perfected, we plan to
conduct an extensive empirical evaluation with synthetic and real-life data sets.
Related work: In terms of architecture, our proposal is close to hierarchical mixtures of experts
[JJ93, WR94] and Bayesian network classifiers [FG96]. One particular technique proposed to achieve
continuity in the resulting decision tree models is smoothing by averaging [CHLY94, Bun93]. By
fitting a naive Bayesian classifier in the leaves of a decision tree, probabilistic predictions can be
made [BFOS84]. These types of models are usually called class probability trees. Buntine combined
smoothing by averaging with class probability trees [Bun93].
4 Summary of Thesis Proposal
We now summarize our present and future work on classification and regression tree construction:
Bias in Split Selection: We defined bias in split selection as the log odds of choosing
one split attribute over another. Using this definition we showed that popular split selection
criteria have significant bias. We then proved that the p-value of any split selection criterion is
essentially an unbiased split criterion, and we used this result to correct the bias of the gini index.
To build this correction we found a good approximation of the distribution of the gini index under
the Null Hypothesis. We will continue this work with a theoretical and experimental
study of the proposed correction of the gini index.
Scalable Linear Regression Trees: We proposed a novel method to learn regression trees
with linear regressors in the leaves. The method consists in determining at every node, during
tree construction and using the EM algorithm for Gaussian mixtures, two Gaussian
clusters with general covariance matrices that fit the training data well. The clusters are
then used to classify the training data into two classes based on closeness to these clusters. In
this way a classification problem results at each node, and classification tree construction
techniques can be used to make good decisions. A scalable version of the proposed algorithm
can be obtained by employing scalable EM and classification tree construction algorithms.
We implemented the proposed method with resubstitution error pruning and obtained
good initial results. We plan to implement more pruning methods and to conduct a detailed
experimental study.
Probabilistic Decision Trees: We proposed a generalization of classification and regression
trees that consists in using probability distributions instead of split predicates at intermediate
nodes. Some of the benefits of such a model are: probabilistic predictions are made in a
natural way, unnecessary fragmentation of the training data is avoided, the resulting models have
continuity properties, and more insight can be gained from a probabilistic model. We proposed
inference and learning methods for such trees. We are planning to adapt decision tree pruning
methods to these types of models and to develop scalable versions of the construction algorithms.
References
[BFOS84] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression
Trees. Wadsworth, Belmont, 1984.
[BFR98] P. S. Bradley, Usama M. Fayyad, and Cory Reina. Scaling clustering algorithms to large
databases. In Knowledge Discovery and Data Mining, pages 9-15, 1998.
[Bil] Jeff Bilmes. A gentle tutorial of the EM algorithm and its application to parameter
estimation for Gaussian mixture and hidden Markov models.
[Bun93] W. Buntine. Learning classification trees. In D. J. Hand, editor, Artificial Intelligence
Frontiers in Statistics, pages 182-201. Chapman & Hall, London, 1993.
[CHLY94] P. Chaudhuri, M.-C. Huang, W.-Y. Loh, and R. Yao. Piecewise-polynomial regression
trees. Statistica Sinica, 4:143-167, 1994.
[DG01] Alin Dobra and Johannes Gehrke. Bias correction in classification tree construction.
In Proceedings of the Eighteenth International Conference on Machine Learning, pages
90-97. Morgan Kaufmann, 2001.
[DR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete
data via the EM algorithm. J. R. Statist. Soc. B, 39:185-197, 1977.
[FG96] Nir Friedman and Moises Goldszmidt. Building classifiers using Bayesian networks. In
AAAI/IAAI, Vol. 2, pages 1277-1284, 1996.
[Fra00] Eibe Frank. Pruning Decision Trees and Lists. PhD thesis, Department of Computer
Science, University of Waikato, Hamilton, New Zealand, 2000.
[FW98] Eibe Frank and Ian H. Witten. Using a permutation test for attribute selection in
decision trees. In International Conference on Machine Learning, 1998.
[GRG98] Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RainForest - a framework
for fast decision tree construction of large datasets. In Proceedings of the 24th
International Conference on Very Large Databases, pages 416-427. Morgan Kaufmann,
August 1998.
[JJ93] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM
algorithm. Technical Report AIM-1440, 1993.
[Kar92] A. Karalic. Linear regression in regression tree leaves, 1992.
[Kas80] G. Kass. An exploratory technique for investigating large quantities of categorical data.
Applied Statistics, 29:119-127, 1980.
[Kon95] I. Kononenko. On biases in estimating multivalued attributes, 1995.
[LLS97] Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih. An empirical comparison of decision
trees and other classification methods. Technical Report 979, Department of Statistics,
University of Wisconsin, Madison, June 1997.
[Loh02] Wei-Yin Loh. Regression trees with unbiased variable selection and interaction detection.
Statistica Sinica, 2002. To appear.
[MAR96] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. SLIQ: A fast scalable classifier for
data mining. In Proc. of the Fifth Int'l Conference on Extending Database Technology
(EDBT), Avignon, France, March 1996.
[Mar97] J. Kent Martin. An exact probability metric for decision tree splitting. Machine Learning,
28(2-3):257-291, 1997.
[Min87] John Mingers. Expert systems - rule induction with statistical data. J. Opl. Res. Soc.,
38(1):39-47, 1987.
[Mit97] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[Mur95] Sreerama K. Murthy. On Growing Better Decision Trees from Data. PhD thesis, Department
of Computer Science, Johns Hopkins University, Baltimore, Maryland, 1995.
[Mur97] Sreerama K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary
survey. Data Mining and Knowledge Discovery, 1997.
[Qui86] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
[Qui92] J. R. Quinlan. Learning with continuous classes. In 5th Australian Joint Conference
on Artificial Intelligence, pages 343-348, 1992.
[RS98] Rajeev Rastogi and Kyuseok Shim. PUBLIC: A decision tree classifier that integrates
building and pruning. In Proceedings of the 24th International Conference on Very Large
Databases, pages 404-415, New York City, New York, August 24-27 1998.
[SAM96] John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier
for data mining. In Proc. of the 22nd Int'l Conference on Very Large Databases, Bombay,
India, September 1996.
[Tor97] Luís Torgo. Functional models for regression tree leaves. In Proc. 14th International
Conference on Machine Learning, pages 385-393. Morgan Kaufmann, 1997.
[Tor98] L. Torgo. A comparative study of reliable error estimators for pruning regression trees,
1998.
[WL94] Allan P. White and Wei Zhong Liu. Bias in information-based measures in decision tree
induction. Machine Learning, 15:321-329, 1994.
[Wol70] John H. Wolfe. Pattern clustering by multivariate mixture analysis. Multivariate Behavioral
Research, 5:329-350, July 1970.
[WR94] S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of
experts. In Proceedings of the 1994 IEEE Workshop on Neural Networks for Signal
Processing IV, pages 177-186, Long Beach, CA, 1994. IEEE Press.