Session 17-Decision Tree
Note: Content used in this PPT has been copied from various sources.
Why decision trees?
• Decision trees require relatively little effort from users for data
preparation. Missing values do not prevent splitting the data to
build the trees, and decision trees are also not sensitive to the presence of outliers.
Application areas of decision trees
In the ctree() function, the formula argument defines a symbolic description of the model to be fit using the "~"
symbol; the data argument defines the data frame that contains the variables in the selected
model; the controls argument is an optional argument that contains an object of class
TreeControl, obtained using ctree_control(); and the dots "…" define other optional
arguments.
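A minimal sketch of a ctree() call that uses these arguments (the dataset and the mincriterion value are illustrative, not from the slides):

library(party)

# formula:  Species modelled as a function of all other variables, written with "~"
# data:     the data frame that supplies those variables
# controls: an object of class TreeControl, obtained with ctree_control()
ct <- ctree(Species ~ .,
            data     = iris,
            controls = ctree_control(mincriterion = 0.99))
print(ct)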
Decision Tree Representation in R
The ID3 algorithm is one of the most widely used basic decision tree algorithms. Ross
Quinlan developed this algorithm in 1983. The basic concept of ID3 is to construct a tree by
following a top-down, greedy search methodology. Construction starts from the root of the
tree and moves down the tree, and a greedy method is used to test each attribute at every
node. The ID3 algorithm does not require any backtracking for creating the tree.
The R language provides the package "data.tree" for implementing the ID3 algorithm. The
"data.tree" package creates a tree from hierarchical data and provides many methods
for traversing the tree in different orders. After converting the tree into a data
frame, operations such as printing and aggregation can be applied to it. For this reason, the
package is used in many applications such as machine learning and financial data analysis.
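A minimal sketch of building and traversing a tree with data.tree (the node names are illustrative; in an ID3 implementation they would come from the chosen attributes and their values):

library(data.tree)

# Build a small tree from hierarchical data: a root attribute with two branches
tree <- Node$new("Outlook")
sunny    <- tree$AddChild("Sunny")
overcast <- tree$AddChild("Overcast")
sunny$AddChild("No")
overcast$AddChild("Yes")

print(tree)                                # pretty-prints the hierarchy
tree$Get("name", traversal = "pre-order")  # traverse the tree in a chosen order
as.data.frame(tree)                        # convert to a data frame for printing/aggregation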
Measuring Features
Entropy—Measures Homogeneity
Entropy measures the impurity of a collection of samples that contains positive and negative
labels. A dataset is pure if it contains only a single class; otherwise, the dataset is impure.
Entropy is used to calculate the information gain for an attribute
of the tree. In simple words, entropy measures the homogeneity of the dataset. The ID3
algorithm uses entropy to calculate the homogeneity of a sample. The entropy is zero if
the sample is completely homogeneous, and if the sample is equally divided (50% - 50%)
it has an entropy of one.
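A minimal R sketch of this definition (the function name entropy is illustrative, not from a package): for class proportions p_i, Entropy(S) = -sum_i p_i log2(p_i).

# Entropy of a vector of class labels: Entropy(S) = -sum_i p_i * log2(p_i)
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  p <- p[p > 0]                         # drop empty classes: 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}

entropy(c("yes", "yes", "no", "no"))    # equally divided sample: entropy 1
entropy(c("yes", "yes", "yes"))         # completely homogeneous sample: entropy 0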
Information Gain—Measures the Expected Reduction in Entropy
The expected reduction of the entropy that is related to the specified attribute during
the splitting of decision tree node is called the information gain. Let the Gain(S, A) be
the information gain of an attribute A. Then the information gain is defined by the
following formula:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v), where Values(A) is the set of possible values of attribute A and S_v is the subset of S for which A has value v.
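A minimal R sketch of this formula and of ID3's greedy attribute selection, reusing the entropy() helper defined above (the toy data values are illustrative):

# Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) * Entropy(S_v)
information_gain <- function(labels, attribute) {
  subsets <- split(labels, attribute)                  # partition S by the values of A
  weights <- sapply(subsets, length) / length(labels)  # |S_v| / |S|
  entropy(labels) - sum(weights * sapply(subsets, entropy))
}

# Greedy step of ID3: choose the attribute with the highest information gain
play    <- c("no", "no", "yes", "yes", "yes", "no")
outlook <- c("sunny", "sunny", "overcast", "rain", "overcast", "rain")
windy   <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
sapply(list(outlook = outlook, windy = windy),
       function(a) information_gain(play, a))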
Inductive Bias In Decision Tree Learning
• The inductive bias of ID3 decision tree learning is a preference for shorter trees:
when ID3 or any other decision tree learner builds a tree for classification,
shorter trees are preferred over larger trees. Trees that place high-information-gain
attributes close to the root are also preferred over those that do not;
together these preferences form the inductive bias.
Issues in Decision Tree Learning
• To avoid overfitting, stop growing the tree earlier, before it reaches the point where
it perfectly fits the training data.
• Another method uses a separate set of examples that does not include any training
data; for this, the training and validation set approach can be used. This method works
even if the training set is misled by random errors, because the validation set is unlikely
to exhibit the same random fluctuations; a common split is 2/3 of the data for training
and 1/3 for validation (see the sketch after this list).
• The next method for avoiding overfitting is to use a statistical test to estimate
whether expanding a node of the tree is likely to improve performance beyond the
training set.
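A minimal sketch of the 2/3 training and 1/3 validation split (the dataset and seed are illustrative, not from the slides):

library(rpart)

set.seed(42)
idx   <- sample(nrow(iris), size = round(2/3 * nrow(iris)))
train <- iris[idx, ]    # 2/3 of the data for growing the tree
valid <- iris[-idx, ]   # 1/3 held out for validation

fit <- rpart(Species ~ ., data = train, method = "class")

# Accuracy on the held-out validation set, used to judge generalisation
pred <- predict(fit, valid, type = "class")
mean(pred == valid$Species)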
Issues in Decision Tree Learning
Reduced Error Pruning: Pruning, or reduced error pruning, is another method for resolving
overfitting problems. The simple concept of pruning is to remove subtrees from a tree. The
reduced error pruning algorithm goes through the entire tree and removes any node, together with
the subtree rooted at that node, whose removal has no negative effect on the accuracy of the
decision tree. It turns the removed subtree into a leaf node labelled with the most common class.
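rpart does not implement reduced error pruning itself, but it offers cost-complexity pruning, a related pruning technique. A minimal sketch, reusing the fit and valid objects from the split sketch above:

# Cost-complexity pruning (a related technique, not reduced error pruning):
# pick the complexity parameter with the lowest cross-validated error and
# prune the tree back to that size.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Compare validation accuracy before and after pruning
mean(predict(pruned, valid, type = "class") == valid$Species)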
Rule Post-Pruning
• Rule post-pruning is one of the most effective methods for resolving the overfitting problem and gives
high-accuracy hypotheses. This method prunes the tree and reduces the overfitting problem. The
steps of the rule post-pruning method are as follows:
• Infer the decision tree from the training set and grow the tree until the training data is fitted as
well as possible, allowing overfitting to occur.
• Then convert the learned tree into an equivalent set of rules by creating one rule for each path
from the root node to a leaf node (see the sketch after this list).
• Prune each rule by removing any precondition whose removal improves the rule's estimated accuracy.
• Finally, sort the pruned rules by their estimated accuracy and consider them in this sequence
when classifying subsequent instances.
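The tree-to-rules conversion step can be illustrated with the rpart.plot package (an assumption, not part of the slides); a minimal sketch, reusing the fit object from the split sketch above:

# Convert the learned tree into an equivalent set of rules, one rule per
# root-to-leaf path; pruning and re-ordering would then operate on these rules.
library(rpart.plot)
rpart.rules(fit)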
Decision Tree in R (1 of 2)
library(party)
ct <- ctree(speed ~ dist, data = cars)   # fit a conditional inference tree: speed as a function of dist
plot(ct)                                 # output on the slide: a plot of the fitted tree
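A short usage note (the newdata values are illustrative): once fitted, the tree can predict speed for new distances.

# Predict the response for new observations with the fitted ctree model
predict(ct, newdata = data.frame(dist = c(10, 60, 120)))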
Decision Tree in R (2 of 2)
Classify a Case
library(rpart)
nativeSpeaker_find <- data.frame("age" = 11, "shoeSize" = 30.63692,
                                 "score" = 55.721149)   # new case to classify
# Assumed completion: readingSkills comes from the party package (loaded earlier)
fit <- rpart(nativeSpeaker ~ age + shoeSize + score, data = readingSkills, method = "class")
predict(fit, nativeSpeaker_find, type = "class")   # predicted class for the new case