Decision Trees Explained — Entropy, Information Gain, Gini Index, CCP Pruning

Decision Trees for Data Scientists
Though Decision Trees look simple and intuitive, there is nothing very simple about
how the algorithm goes about deciding on splits and how tree pruning occurs. In this
post I take you through a simple example to understand the inner workings of
Decision Trees.
Decision Trees are a popular and surprisingly effective technique, particularly for
classification problems. But, the seemingly intuitive interface hides complexities.
The criteria for selecting variables and their hierarchy can be tricky to grasp, not to
mention the Gini index, Entropy (wait, isn't that physics?) and information gain (isn't
that information theory?). As you can see, there are lots of tricky points on which
you can get stuck. The best way to understand Decision Trees is to work through
a small example which has sufficient complexity to demonstrate some of
the common points where one suddenly goes, 'not sure what happens here…?'.
This post is therefore more like a tutorial or a demo where I will work through a toy
dataset that I have created to understand the following:
3. How do sub nodes split?
Let’s begin at the real beginning, with the core problem. Suppose we are trying to
classify whether a patient is diabetic or not based on various predictor variables
such as fasting blood sugar, BMI, BP, etc. This is obviously a prediction problem for
a new patient. We also have 1000 patient records to help us develop an
understanding of which features are most useful in predicting. Unlike other
classification algorithms such as Logistic Regression, Decision Trees have a
somewhat different way of functioning and identifying which variables are
important.
The first thing to understand about Decision Trees is that they split the predictor
space into different sub groups which are relatively more homogeneous from the
perspective of the target variable. For example, if the target variable is binary, with
categories 1 and 0 (shown by green and red dots in the image below), then the
decision tree works to split the predictor space into sub groups that are more
homogeneous in terms of having either 1’s or 0’s.
That is the overall concept. Let us begin with understanding the various elements of
a decision tree.
A root node: this is the node that begins the splitting process by finding the
variable that best splits the target variable
Node purity: Decision nodes are typically impure, or a mixture of both classes of
the target variable (0,1 or green and red dots in the image). Pure nodes are those
that have one class — hence the term pure. They either have green or red dots
only in the image.
Decision nodes: these are subsequent or intermediate nodes, where the target
variable is again split further by other variables
Leaf nodes or terminal nodes: these are the final nodes, which are not split any
further and where the prediction of a numerical value or class is made.
Arrows leading away from a node indicate a condition which is being applied to the
node. Arrows pointing to a node indicate a condition that is being satisfied.
This is the first level of the Decision Tree — understanding the flow of splitting the
decision space into smaller spaces which ultimately become more and more
homogenous in the target variable. This ultimately leads to a prediction.
Decision Trees offer tremendous flexibility in that we can use both numeric and
categorical variables for splitting the target data. Categorical data is split along the
different classes in the variable. Numeric data is a little trickier, as we have to split
it into thresholds for the condition being tested, such as <18 and ≥18, for example. A
numeric variable can appear multiple times in the tree with different cut-offs or
thresholds. Also, the same final classification can appear in more than one leaf.
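As a rough sketch of how such thresholds are typically enumerated (the feature values below are purely illustrative), candidate cut-offs are usually taken as midpoints between consecutive distinct sorted values, which is how CART-style implementations such as scikit-learn's generate the "< t / ≥ t" conditions:

```python
import numpy as np

# Hypothetical numeric feature (e.g., age); values are purely illustrative
ages = np.array([15, 17, 18, 22, 22, 30])

# Candidate thresholds: midpoints between consecutive distinct sorted values
values = np.unique(ages)                      # [15, 17, 18, 22, 30]
thresholds = (values[:-1] + values[1:]) / 2   # [16. , 17.5, 20. , 26. ]

# Each threshold t defines one candidate split: "< t" vs ">= t"
print(thresholds)
```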
2. How do Decision Trees select which variable to split on at decision nodes?
3. How does it decide that the tree has enough branches and that it should stop
splitting?
Now let us look at a simplified toy example to understand the above process more
concretely.
We have data for 15 students on whether they passed or failed an online ML exam.
To understand the basic process we begin with a dataset which comprises a binary
target variable (Pass/Fail) and various binary or categorical predictor variables
such as:
Notice that only one variable, ‘Student Background’, has more than 2 levels or
categories — Maths, CS, Others. It is one of the advantages of Decision Trees,
compared to other classification models such as Logistic Regression or SVM, that
we do not need to carry out one-hot encoding to turn these into dummy variables.
Let us first look at the flow of how a decision tree works and then we will dive into
the complexities of how the decisions are actually made…
A decision tree begins with the target variable. This is usually called the parent
node. The Decision Tree then makes a sequence of splits, in hierarchical order of
impact on this target variable. From the analysis perspective, the first node is the
root node, which is the first variable that splits the target variable.
To identify the root node we would evaluate the impact of all the variables that we
have currently on the target variable to identify the variable that splits the exam
Pass/Fail classes into the most homogenous groups. Our candidates for splitting this
are: Background, Working Status and Other Online Courses.
What do we hope to achieve with this split? Suppose we begin with Working Status
as the root node. This splits into 2 sub nodes, one each for Working and Not
Working. The Pass/Fail counts are then tallied within each sub node.
So, this is the basic flow of the Decision Tree. As long as there is a mixture of Pass
and Fail in a sub node, there is scope to split further to try to get it down to only one
category. This is termed the purity of the node. For example, Not Working has 5 Pass
and 1 Fail, hence it is purer than the Working node, which has 5 Pass and 4 Fail. A
leaf node would be one which contains either the Pass or the Fail class only.
A node which is impure can be branched further for improving purity. However,
most of the time we do not necessarily go down to the point where each leaf is
‘pure’. It is also important to understand that each node is standalone and hence the
attribute that best splits the ‘Working’ node may not be the one that best splits the
‘Not Working’ node.
Now, let us move on to learning the core part of Decision Trees - key questions:
How does the Tree decide which variable to branch out at each level?
The algorithm evaluates all variables on a statistical criterion and then chooses the
variable that performs best on that criterion.
Here is where the true complexity and sophistication of Decision Trees lies. Variables
are selected on a statistical criterion which is applied at each decision node.
Variable selection in Decision Trees can be done via two approaches:
1. Entropy / Information Gain

2. Gini Index
Both criteria are broadly similar and seek to determine which variable would split
the data to lead to the underlying child nodes being most homogenous or pure. Both
are used in different Decision Tree algorithms. To add to the confusion, it is not
clear which one is the preferred approach. So, one has to have an understanding of
both.
What is Entropy?
Entropy is a term that comes from physics and means a measure of disorder. More
specifically, we can define it as:
The term and the concept are used in diverse fields, from classical thermodynamics,
where it was first recognized, to the microscopic description of nature in statistical
physics, and to the principles of information theory.
https://en.wikipedia.org/wiki/Entropy
https://en.wikipedia.org/wiki/Entropy_(information_theory)
Leaf nodes which have all instances belonging to 1 class would have an entropy of 0.
Whereas, the entropy for a node where the classes are divided equally would be 1.
Now essentially what a Decision Tree does to determine the root node is to calculate
the entropy for each variable and its potential splits. For this we have to calculate a
potential split from each variable, calculate the average entropy across both or all
the nodes and then the change in entropy vis a vis the parent node. This change in
entropy is termed Information Gain and represents how much information a
feature provides for the target variable.
1. Work Status

2. Other Online Courses

3. Student Background
To calculate entropy, first let us put our formulas for Entropy and Information Gain
in terms of variables in our dataset:
2. Entropy:
Note, average Entropy is the weighted average of all the sub nodes that a parent
node splits into. Thus, in our example this would be 2 sub nodes for Working Status
and 3 sub nodes for Student Background.
4. Information Gain:
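In standard notation (which is what these calculations assume), with $p_j$ the share of class $j$ in a node and $n_k/n$ the share of observations falling into child node $k$:

$$E(S) = -\sum_{j} p_j \log_2 p_j$$

$$E_{avg}(S, A) = \sum_{k} \frac{n_k}{n}\, E(S_k)$$

$$IG(S, A) = E(S) - E_{avg}(S, A)$$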
First we will calculate parent node entropy using the formula above. Use any log2
calculator online to calculate the log values. In our case they work out to:
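If you would rather compute the logs programmatically than with an online calculator, here is a minimal sketch; the parent counts of 10 Pass and 5 Fail are the totals implied by the Working Status split described shortly:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()            # drop zero counts to avoid log2(0)
    return -(p * np.log2(p)).sum()

# Parent node: 10 Pass and 5 Fail in total
print(round(entropy([10, 5]), 4))     # 0.9183
```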
So far, this is just the entropy of the parent node. Now we have to decide which
attribute or variable to use to split this to get the root node.
For this we have to calculate a potential split from each variable, calculate the
average entropy across the resulting sub nodes and then the change in entropy vis a
vis the parent node.
Let us begin with the Working Status variable and calculate the entropy of the split.
We then calculate the average entropy for the Working Status split as a weighted
average, with weights equal to the share of the total observations that fall in each
sub node.
Information Gain for Working Status = 0.9183 − 0.8119 = 0.1064
In a similar fashion we can evaluate the entropy and information gain for Student
Background and Online Courses variables. The results are provided in the table
below:
We will set Working Status aside for the time being and move on to evaluate the
other variables. The spreadsheet below shows the Entropy calculations for all the
variables:
To find the root node attribute we look at the Information Gain from Student
Background vis a vis the initial parent entropy. This shows the maximum reduction,
of 0.5370. Hence, this is the attribute that would be selected as the root node. The
other variables, ‘Working Status’ and ‘Online Courses’, show a much lower decrease
in entropy vis a vis the parent node.
So, on the basis of the above calculations, we have determined what the root node
would be. The tree would now look as follows:
Student Background splits the target variable into 3 groups. Everyone from CS
background clearly passes and hence this is a terminal or leaf node. Everyone from
Other backgrounds fails and this is also a terminal node. Maths background is split
into 3 Pass and 4 Fail and hence it is impure and there is some scope for further
splitting to attain greater purity.
Now to split the Maths background sub node, we need to calculate Entropy and
Information Gain for the remaining variables, i.e., Working Status and Online
Courses. We would then select the variable that shows the highest Information
Gain.
The Entropy and Information Gain Calculations for the Maths Background node can
be seen in the table below. Notice, we now have the Maths Background as the node
that is being split, hence average Entropy for the splits is calculated using it as a
base.
Note: putting in log(0) throws an error. However, mathematically we can use the
limit, which gives 0 for the p·log(p) term. Normally we just don’t include the pj = 0
case in the calculation; I have included it here just to show the complete calculation.
https://www.icg.isy.liu.se/courses/infotheory/lect1.pdf
As we can see Information Gain is higher for the Working Status variable. Hence
this is the variable used to continue branching.
We now see that the Maths node has split into 1 terminal node on the right and one
node which is still impure. Notice, now almost all our nodes are terminal nodes.
There is only one node which is not terminal. We can try splitting it further using
Other Online Courses. Anyway, you get the picture. In any case most Decision Trees
do not necessarily split to the point where every node is a terminal node. Most
algorithms have built in stops which we will discuss a little further down. Further, if
the Decision Tree continues to split we have another problem which is that of
overfitting. Again we shall discuss that below after we have briefly reviewed an
alternative approach to developing a Decision Tree using the Gini Index.
Gini Index
The other way of splitting a decision tree is via the Gini Index. The Entropy and
Information Gain method focuses on purity and impurity in a node. The Gini Index,
or Gini Impurity, measures the probability of a randomly chosen instance being
misclassified if it were labelled randomly according to the class distribution in the
node. The lower the Gini Index, the lower the likelihood of misclassification.
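In standard notation, the Gini Index of a node is computed as

$$Gini = 1 - \sum_{j=1}^{J} p_j^{2}$$

with $p_j$ the share of class $j$ in the node.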
Where j represents the no. of classes in the target variable — Pass and Fail in our
example
So, let’s take an example from the decision tree above. Let’s begin with the root
node and calculate the Gini Index for each of the splits. The Gini Index has a
minimum of 0 (the highest level of purity) and, for a binary target, a maximum value
of 0.5, which indicates an even, random mix of the two classes.
Now let us calculate the Gini Index of the Student Background attribute for the root
node split. In this case we have 3 sub nodes. The Gini formula requires us to
calculate the Gini Index for each sub node and then take a weighted average to get
the overall Gini Index for the split.
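A minimal sketch of the per-node calculation (the 3 Pass / 4 Fail counts are the Maths sub node from the example; the pure-node counts are just illustrative):

```python
def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([3, 4]), 4))   # 0.4898 -> the Maths sub node is highly impure
print(gini([4, 0]))             # 0.0    -> any pure node (e.g. CS or Others) has Gini = 0
```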
The overall Gini Index for this split is calculated similarly to the entropy as weighted
average of the distribution across the 3 nodes.
Similarly, we can also compute the Gini Index for Working Status and Online
Courses. These are given below:
Working/Not working
Online Courses
The Gini Index is lowest for the Student Background variable. Hence, similar to the
Entropy and Information Gain criteria, we pick this variable for the root node. In a
similar fashion we would again proceed to move down the tree, carrying out splits
where node purity is lower.
The Gini Index has a maximum impurity of 0.5 and a maximum purity of 0, whereas
Entropy has a maximum impurity of 1 and a maximum purity of 0.
Now that we have understood, hopefully in detail, how Decision Trees carry out
splitting and variable selection, we can move on to how they do prediction. Actually,
once a tree is trained and tested, prediction is easy. The tree basically provides a
flow chart based on various predictor variables. Suppose we have a new instance
entering the flow along with its values of the different predictor variables. Unlike the
training and test data, it will not have a class for the target attribute. We are trying
to predict this class by moving down the tree, testing its values of different predictor
variables at different branches. Ultimately, the new instance will move into a leaf
node and will be classified according to the class prevailing in the leaf node.
Based on our tree, we would first check the Maths branch, then the Working = Yes
branch. That, as we have seen, is a leaf node, and the new observation would be
classified on the basis of the majority vote in this node, i.e., since the majority is
Pass, this new observation would also be predicted to be Pass.
In practice, when the algorithm is evaluating a new example and reaches a leaf
node, the prediction is based on the modal value of categories in the leaf node. As
seen in the above case, the Working node is not fully pure. However, we go with the
prediction of the modal value, which is Pass. In general, most leaf nodes are not
pure and, hence for categorical prediction, we use the modal value for prediction. If
it is a numerical prediction (regression tree), we predict the mean value of the target
values at each leaf node.
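As a sketch of what this looks like with scikit-learn (the arrays below are purely illustrative random stand-ins, just to make the snippet self-contained; in practice they would be one-hot encoded versions of the toy data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Purely illustrative one-hot style data: 3 binary features, binary target
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(15, 3))
y_train = rng.integers(0, 2, size=15)
X_new = rng.integers(0, 2, size=(2, 3))

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

print(clf.predict(X_new))        # modal class of the leaf each new instance lands in
print(clf.predict_proba(X_new))  # class fractions in that leaf, useful when the leaf is impure
```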
Overfitting can be a big challenge with Decision Trees. Even in our toy example, we
can see the algorithm continues to split till it reaches a leaf node. Often the leaf
node may just have one or two instances. This will clearly lead to a complex tree
structure which may not generalize well to a test scenario. This is because each leaf
will represent a very specific set of attribute combinations that are seen in the
training data, and the tree will not be able to classify attribute combinations not
seen in the training data. There are several ways we can prevent the decision tree
from becoming too unwieldy; three broad approaches to avoiding overfitting are
distinguished:
1. Pre pruning or Early stopping: Preventing the tree from growing too big or deep
2. Post Pruning: Allowing a tree to grow to its full depth and then getting rid of
various branches based on various criteria

3. Ensemble methods such as Random Forest, which combine many trees
We will only briefly overview pre and post pruning techniques here. Ensemble
techniques such as Random Forest require more explanation and hence will be
covered separately.
Pruning
Pre pruning
The pre-pruning technique refers to the early stopping of the growth of the decision
tree. The pre-pruning technique involves tuning the hyperparameters of the
decision tree model prior to the training pipeline. The hyperparameters of the
DecisionTreeClassifier in SkLearn include max_depth, min_samples_leaf and
min_samples_split, which can be tuned to stop the growth of the tree early and
prevent the model from overfitting. The best way is to use the sklearn
implementation of the GridSearchCV technique to find the best set of
hyperparameters for a Decision Tree model.
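A minimal pre-pruning sketch along those lines (the dataset is a stand-in and the candidate grid values are illustrative, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; in practice use your own training data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative candidate values for the pre-pruning hyperparameters
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
```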
A challenge with the early stopping approach is that it faces a ‘horizon’ problem:
stopping early may rule out some more fruitful splits further down the tree.
Post Pruning
In this technique, we allow the tree to grow to its maximum depth. Then we remove
parts of the tree to prevent overfitting. We effectively consider subtrees of the full
tree, which are evaluated on a criterion and then removed. Hence we are effectively
going ‘up’ the tree, collapsing subtrees back into leaves. The criterion for whether a
particular consolidation goes through or not is usually MSE for regression trees and
classification error for classification trees.
A challenge with post pruning is that a decision tree can grow very deep and large
and hence evaluating every branch can be computationally expensive. An
important post pruning technique is Cost Complexity Pruning (CCP), which provides
a more efficient solution in this regard. CCP is a complex and advanced technique
which is parametrized by the parameter α (ccp_alpha) in the Scikit Learn
DecisionTreeClassifier module.
The basic problem that CCP addresses is: How to determine the best way to prune a
tree? Intuitively, we would select a sub tree to prune such that its removal leads to a
lower test error rate. This can be done using cross validation or, if we have
sufficient sample, the validation set approach. However, given the number of sub
trees in a fully grown tree for even a small sample, this is likely to be a very
computationally and time intensive process.
Cost complexity pruning — also known as weakest link pruning — gives us a way to do
just this. Rather than considering every possible subtree, we consider a sequence of trees
indexed by a nonnegative tuning parameter α. …
The tuning parameter α controls a trade-off between the subtree’s complexity and its fit to
the training data. When α = 0, then the subtree T will simply equal T0, because then (8.4)
just measures the training error. However, as α increases, there is a price to pay for having
a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for
a smaller subtree.
Essentially the parameter α is very similar to the penalty term in Lasso regression.
The basic equation for CCP is given below:
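Following the regression-tree formulation in James et al. (2014), their equation (8.4), for each value of α we look for the subtree T that minimizes

$$\sum_{m=1}^{|T|} \;\sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 \;+\; \alpha \,|T|$$

where $|T|$ is the number of terminal nodes of the subtree $T$ and $\hat{y}_{R_m}$ is the mean of the training responses in region $R_m$.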
This is one complex equation. But let’s try and understand a bit further. Some
definitions:
Rm: the rectangle (i.e. the subset of predictor space) corresponding to the mth
terminal node

(yi − ŷRm)²: the squared error for the observations falling in terminal node m (we are
using the regression tree version of the equation; for simplicity I am following the
equation and approach in James et al. (2014)).
Now let’s see what the equation is doing. We are essentially minimizing the cost or
loss given by the squared errors (yi − ŷRm)² summed across all terminal nodes. The
term α multiplies the total number of terminal nodes in the tree, |T|. If α = 0, then we
are minimizing only the training loss and the tree will be the same as the original
tree. However, with α > 0, we add a penalty term which increases with the number of
terminal nodes |T|. This means the overall cost gets minimized for a smaller subtree.
Minimal cost complexity pruning recursively finds the node with the “weakest link”. The
weakest link is characterized by an effective alpha, where the nodes with the smallest
effective alpha are pruned first.
https://scikit-
learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
What does this mean? It means that the algorithm is hunting out nodes where the
training loss is already high and hence can only be minimized with a small α. On
the other hand, nodes where the training loss is smaller can accommodate a larger
penalty term as part of the minimization.
As α increases, more of the tree is pruned. We then have a tradeoff between bias
and variance: by increasing ccp_alpha we effectively increase the bias of the model,
i.e., we simplify it. On the con side, this means we have to tolerate increasing levels
of impurity in the terminal nodes. We see that as α increases, both the number of
nodes and the tree depth reduce.
Plotting ccp_alpha vs train and test accuracy, we see that when α = 0, keeping the
other default parameters of DecisionTreeClassifier, the tree overfits, leading to 100%
training accuracy and 88% testing accuracy. As alpha increases, more of the tree is
pruned, creating a decision tree that generalizes better, until at some point too much
is pruned away and test accuracy starts to fall as well.
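A sketch of how that comparison is typically produced with scikit-learn's cost-complexity pruning API (this mirrors the scikit-learn example linked above, which uses the breast cancer dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas at which nodes would be pruned, from the weakest link upwards
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# One pruned tree per alpha: watch train accuracy fall and test accuracy peak
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={clf.get_n_leaves()}  "
          f"train={clf.score(X_train, y_train):.3f}  test={clf.score(X_test, y_test):.3f}")
```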
Advantages
1. Trees give a visual schema of the relationship of variables used for classification
and hence are more explainable. The hierarchy of the tree provides insight into
variable importance.

3. A white box model which is explainable: we can trace back each result of the
model. This is in contrast to black box models such as neural networks.

4. In general there is less need to prepare and clean data, e.g., normalization,
one-hot encoding of categorical variables, or handling missing values.
Note: the Sklearn implementation currently does not support categorical variables,
so we do need to create dummy variables. Similarly, it does not support missing
values. But both can be handled by the algorithm in theory.
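For example, a one-line encoding step of the kind needed before fitting (the rows, the level names other than Student Background, and the Result column name are hypothetical, just to make the snippet self-contained):

```python
import pandas as pd

# A few illustrative rows in the shape of the toy dataset
df = pd.DataFrame({
    "Student Background": ["Maths", "CS", "Others"],
    "Working Status": ["Working", "Not Working", "Working"],
    "Other Online Courses": ["Yes", "No", "Yes"],
    "Result": ["Pass", "Pass", "Fail"],
})

X = pd.get_dummies(df.drop(columns="Result"))   # one-hot encode the categorical predictors
y = df["Result"]
print(X.columns.tolist())
```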
Disadvantages
2. Decision trees can be unstable because small variations in the data might result
in a completely different tree being generated. This problem can be mitigated by
using decision trees within an ensemble.

3. Can be non-robust, i.e., a small change in the data can cause a large change in the
final estimated tree.

5. Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the dataset prior to fitting the decision tree.
So, that’s it for Decision Trees, from start to at least two thirds of the way (there are
a lot of complexities, hence I cannot say the end). I hope you liked this blog on the
inner workings of Decision Trees. One thing is clear: this is far from a simple
technique. I have so far reviewed only how the variable hierarchy is chosen, how the
tree structure is built up and how pruning is done. There are many types of Decision
Tree algorithms, including ID3, C4.5, C5.0 and CART; Scikit Learn implements an
optimised version of CART. Exploration of these models is for another blog.
References
3. https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning

4. https://towardsdatascience.com/learn-how-decision-trees-are-grown-22bc3d22fb51

5. https://www.sciencedirect.com/topics/computer-science/cost-complexity

6. https://en.wikipedia.org/wiki/Entropy_(information_theory)