Decision Trees Explained — Entropy, Information Gain, Gini Index, CCP Pruning

Decision Trees for Data Scientists
Though Decision Trees look simple and intuitive, there is nothing very simple about
how the algorithm goes about deciding on splits and how tree pruning occurs. In this
post I take you through a simple example to understand the inner workings of
Decision Trees.
Decision Trees are a popular and surprisingly effective technique, particularly for
classification problems. But, the seemingly intuitive interface hides complexities.
The criteria for selecting variables and their hierarchy can be tricky to grasp, not to
mention the Gini index, Entropy (wait, isn't that physics?) and information gain (isn't
that information theory?). As you can see, there are lots of tricky points on which
you can get stuck. The best way to understand Decision Trees is to work through
a small example which has sufficient complexity to demonstrate some of
the common points where one suddenly goes, 'not sure what happens here…?'.
This post is therefore more like a tutorial or a demo where I will work through a toy
dataset that I have created to understand the following:
3. How do sub nodes split?
Let’s begin at the real beginning, with the core problem. Suppose we are trying to
classify whether a patient is diabetic or not based on various predictor variables
such as fasting blood sugar, BMI, BP, etc. This is obviously a prediction problem for
a new patient. We also have 1000 patient records to help us develop an
understanding of which features are most useful in predicting. Unlike other
classification algorithms such as Logistic Regression, Decision Trees have a
somewhat different way of functioning and identifying which variables are
important.
The first thing to understand about Decision Trees is that they split the predictor
space into different sub groups which are relatively more homogeneous from the
perspective of the target variable. For example, if the target variable is binary, with
categories 1 and 0 (shown by green and red dots in the image below), then the
decision tree works to split the predictor space into sub groups that are more
homogeneous in terms of having either 1’s or 0’s.
That is the overall concept. Let us begin with understanding the various elements of
a decision tree.
A root node: this is the node that begins the splitting process by finding the
variable that best splits the target variable
Node purity: Decision nodes are typically impure, or a mixture of both classes of
the target variable (0,1 or green and red dots in the image). Pure nodes are those
that have one class — hence the term pure. They either have green or red dots
only in the image.
Decision nodes: these are subsequent or intermediate nodes, where the target
variable is again split further by other variables
Leaf nodes or terminal nodes: these are the final nodes, which are not split any
further and where the prediction of a numerical value or class is made.
Arrows leading away from a node indicate a condition which is being applied to the
node. Arrows pointing to a node indicate a condition that is being satisfied.
This is the first level of the Decision Tree — understanding the flow of splitting the
decision space into smaller spaces which ultimately become more and more
homogenous in the target variable. This ultimately leads to a prediction.
Decision Trees offer tremendous flexibility in that we can use both numeric and
categorical variables for splitting the target data. Categorical data is split along the
different classes in the variable. Numeric data is a little trickier, as we have to split
it into thresholds for the condition being tested, such as <18 and ≥18, for example. A
numeric variable can appear multiple times in the tree with different cut-offs or
thresholds. Also, the same final classification can appear in more than one leaf.
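As a rough sketch of how such thresholds are typically enumerated (the feature values below are purely illustrative), candidate cut-offs are usually taken as midpoints between consecutive distinct sorted values, which is how CART-style implementations such as scikit-learn's generate the "< t / ≥ t" conditions:

```python
import numpy as np

# Hypothetical numeric feature (e.g., age); values are purely illustrative
ages = np.array([15, 17, 18, 22, 22, 30])

# Candidate thresholds: midpoints between consecutive distinct sorted values
values = np.unique(ages)                      # [15, 17, 18, 22, 30]
thresholds = (values[:-1] + values[1:]) / 2   # [16. , 17.5, 20. , 26. ]

# Each threshold t defines one candidate split: "< t" vs ">= t"
print(thresholds)
```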
2. How do Decision Trees select which variable to split on at decision nodes?
3. How does it decide that the tree has enough branches and that it should stop
splitting?
Now let us look at a simplified toy example to understand the above process more
concretely.
We have data for 15 students on whether they passed or failed an online ML exam.
To understand the basic process we begin with a dataset which comprises a binary
target variable (Pass/Fail) and various binary or categorical predictor variables
such as:
Notice that only one variable, ‘Student Background’, has more than 2 levels or
categories — Maths, CS, Others. It is one of the advantages of Decision Trees,
compared to other classification models such as Logistic Regression or SVM, that
we do not need to carry out one-hot encoding to turn these into dummy variables.
Let us first look at the flow of how a decision tree works and then we will dive into
the complexities of how the decisions are actually made…
A decision tree begins with the target variable. This is usually called the parent
node. The Decision Tree then makes a sequence of splits, in hierarchical order of
impact on this target variable. From the analysis perspective, the first node is the
root node, which is the first variable that splits the target variable.
To identify the root node we would evaluate the impact of all the variables that we
have currently on the target variable to identify the variable that splits the exam
Pass/Fail classes into the most homogenous groups. Our candidates for splitting this
are: Background, Working Status and Other Online Courses.
What do we hope to achieve with this split? Suppose we begin with Working Status
as the root node. This splits into 2 sub nodes, one each for Working and Not
Working. The Pass/Fail counts are then tallied within each sub node.
So, this is the basic flow of the Decision Tree. As long as there is a mixture of Pass
and Fail in a sub node, there is scope to split further to try to get it down to only one
category. This is termed the purity of the node. For example, Not Working has 5 Pass
and 1 Fail, hence it is purer than the Working node, which has 5 Pass and 4 Fail. A
leaf node would be one which contains either the Pass or the Fail class only.
A node which is impure can be branched further for improving purity. However,
most of the time we do not necessarily go down to the point where each leaf is
‘pure’. It is also important to understand that each node is standalone and hence the
attribute that best splits the ‘Working’ node may not be the one that best splits the
‘Not Working’ node.
Now, let us move on to learning the core part of Decision Trees - key questions:
How does the Tree decide which variable to branch out at each level?
The algorithm evaluates all variables on a statistical criterion and then chooses the
variable that performs best on that criterion.
Here is where the true complexity and sophistication of Decision Trees lies. Variables
are selected on a statistical criterion which is applied at each decision node.
Variable selection in Decision Trees can be done via two approaches:
1. Entropy / Information Gain

2. Gini Index
Both criteria are broadly similar and seek to determine which variable would split
the data to lead to the underlying child nodes being most homogenous or pure. Both
are used in different Decision Tree algorithms. To add to the confusion, it is not
clear which one is the preferred approach. So, one has to have an understanding of
both.
What is Entropy?
Entropy is a term that comes from physics and means a measure of disorder. More
specifically, we can define it as:
The term and the concept are used in diverse fields, from classical thermodynamics,
where it was first recognized, to the microscopic description of nature in statistical
physics, and to the principles of information theory.
https://en.wikipedia.org/wiki/Entropy
https://en.wikipedia.org/wiki/Entropy_(information_theory)
Leaf nodes which have all instances belonging to 1 class would have an entropy of 0.
Whereas, the entropy for a node where the classes are divided equally would be 1.
Now essentially what a Decision Tree does to determine the root node is to calculate
the entropy for each variable and its potential splits. For this we have to calculate a
potential split from each variable, calculate the average entropy across both or all
the nodes and then the change in entropy vis a vis the parent node. This change in
entropy is termed Information Gain and represents how much information a
feature provides for the target variable.
1. Work Status

2. Other Online Courses

3. Student Background
To calculate entropy, first let us put our formulas for Entropy and Information Gain
in terms of variables in our dataset:
2. Entropy:
Note, average Entropy is the weighted average of all the sub nodes that a parent
node splits into. Thus, in our example this would be 2 sub nodes for Working Status
and 3 sub nodes for Student Background.
4. Information Gain:
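In standard notation (which is what these calculations assume), with $p_j$ the share of class $j$ in a node and $n_k/n$ the share of observations falling into child node $k$:

$$E(S) = -\sum_{j} p_j \log_2 p_j$$

$$E_{avg}(S, A) = \sum_{k} \frac{n_k}{n}\, E(S_k)$$

$$IG(S, A) = E(S) - E_{avg}(S, A)$$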
First we will calculate parent node entropy using the formula above. Use any log2
calculator online to calculate the log values. In our case they work out to:
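If you would rather compute the logs programmatically than with an online calculator, here is a minimal sketch; the parent counts of 10 Pass and 5 Fail are the totals implied by the Working Status split described shortly:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()            # drop zero counts to avoid log2(0)
    return -(p * np.log2(p)).sum()

# Parent node: 10 Pass and 5 Fail in total
print(round(entropy([10, 5]), 4))     # 0.9183
```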
So far, this is just the entropy of the parent node. Now we have to decide which
attribute or variable to use to split this to get the root node.
For this we have to calculate a potential split from each variable, calculate the
average entropy across the resulting sub nodes and then the change in entropy vis a
vis the parent node.
Let us begin with the Working Status variable and calculate the entropy of the split.
We then calculate the average entropy for the Working Status split as a weighted
average, with weights equal to the share of the total observations that fall in each
sub node.
Information Gain for Working Status = 0.9183 − 0.8119 = 0.1064
In a similar fashion we can evaluate the entropy and information gain for Student
Background and Online Courses variables. The results are provided in the table
below:
We will set Working Status aside for the time being and move on to evaluate the
other variables. The spreadsheet below shows the Entropy calculations for all the
variables:
To find the root node attribute we look at the Information Gain from Student
Background vis a vis the initial parent entropy. This shows the maximum reduction,
of 0.5370. Hence, this is the attribute that would be selected as the root node. The
other variables, ‘Working Status’ and ‘Online Courses’, show a much lower decrease
in entropy vis a vis the parent node.
So, on the basis of the above calculations, we have determined what the root node
would be. The tree would now look as follows:
Student Background splits the target variable into 3 groups. Everyone from CS
background clearly passes and hence this is a terminal or leaf node. Everyone from
Other backgrounds fails and this is also a terminal node. Maths background is split
into 3 Pass and 4 Fail and hence it is impure and there is some scope for further
splitting to attain greater purity.
Now to split the Maths background sub node, we need to calculate Entropy and
Information Gain for the remaining variables, i.e., Working Status and Online
Courses. We would then select the variable that shows the highest Information
Gain.
The Entropy and Information Gain Calculations for the Maths Background node can
be seen in the table below. Notice, we now have the Maths Background as the node
that is being split, hence average Entropy for the splits is calculated using it as a
base.
Note: putting in log(0) throws an error. However, mathematically we can use the
limit, which gives 0 for the p·log(p) term. Normally we just don’t include the pj = 0
case in the calculation; I have included it here just to show the complete calculation.
https://www.icg.isy.liu.se/courses/infotheory/lect1.pdf
As we can see Information Gain is higher for the Working Status variable. Hence
this is the variable used to continue branching.
We now see that the Maths node has split into 1 terminal node on the right and one
node which is still impure. Notice, now almost all our nodes are terminal nodes.
There is only one node which is not terminal. We can try splitting it further using
Other Online Courses. Anyway, you get the picture. In any case most Decision Trees
do not necessarily split to the point where every node is a terminal node. Most
algorithms have built in stops which we will discuss a little further down. Further, if
the Decision Tree continues to split we have another problem which is that of
overfitting. Again we shall discuss that below after we have briefly reviewed an
alternative approach to developing a Decision Tree using the Gini Index.
Gini Index
The other way of splitting a decision tree is via the Gini Index. The Entropy and
Information Gain method focuses on purity and impurity in a node. The Gini Index,
or Gini Impurity, measures the probability of a randomly chosen instance being
misclassified if it were labelled randomly according to the class distribution in the
node. The lower the Gini Index, the lower the likelihood of misclassification.
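In standard notation, the Gini Index of a node is computed as

$$Gini = 1 - \sum_{j=1}^{J} p_j^{2}$$

with $p_j$ the share of class $j$ in the node.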
Where j represents the no. of classes in the target variable — Pass and Fail in our
example
So, let’s take an example from the decision tree above. Let’s begin with the root
node and calculate the Gini Index for each of the splits. The Gini Index has a
minimum of 0 (the highest level of purity) and, for a binary target, a maximum value
of 0.5, which indicates an even, random mix of the two classes.
Now let us calculate the Gini Index of the Student Background attribute for the root
node split. In this case we have 3 sub nodes. The Gini formula requires us to
calculate the Gini Index for each sub node and then take a weighted average to get
the overall Gini Index for the split.
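A minimal sketch of the per-node calculation (the 3 Pass / 4 Fail counts are the Maths sub node from the example; the pure-node counts are just illustrative):

```python
def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([3, 4]), 4))   # 0.4898 -> the Maths sub node is highly impure
print(gini([4, 0]))             # 0.0    -> any pure node (e.g. CS or Others) has Gini = 0
```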
The overall Gini Index for this split is calculated similarly to the entropy as weighted
average of the distribution across the 3 nodes.
Similarly, we can also compute the Gini Index for Working Status and Online
Courses. These are given below:
Working/Not working
Online Courses
The Gini Index is lowest for the Student Background variable. Hence, similar to the
Entropy and Information Gain criteria, we pick this variable for the root node. In a
similar fashion we would again proceed to move down the tree, carrying out splits
where node purity is lower.
The Gini Index has a maximum impurity of 0.5 and a maximum purity of 0, whereas
Entropy has a maximum impurity of 1 and a maximum purity of 0.
Now that we have understood, hopefully in detail, how Decision Trees carry out
splitting and variable selection, we can move on to how they do prediction. Actually,
once a tree is trained and tested, prediction is easy. The tree basically provides a
flow chart based on various predictor variables. Suppose we have a new instance
entering the flow along with its values of the different predictor variables. Unlike the
training and test data, it will not have a class for the target attribute. We are trying
to predict this class by moving down the tree, testing its values of different predictor
variables at different branches. Ultimately, the new instance will move into a leaf
node and will be classified according to the class prevailing in the leaf node.
Based on our tree, we would first check the Maths branch, then the Working = Yes
branch. That, as we have seen, is a leaf node, and the new observation would be
classified on the basis of the majority vote in this node, i.e., since the majority is
Pass, this new observation would also be predicted to be Pass.
In practice, when the algorithm is evaluating a new example and reaches a leaf
node, the prediction is based on the modal value of categories in the leaf node. As
seen in the above case, the Working node is not fully pure. However, we go with the
prediction of the modal value, which is Pass. In general, most leaf nodes are not
pure and, hence for categorical prediction, we use the modal value for prediction. If
it is a numerical prediction (regression tree), we predict the mean value of the target
values at each leaf node.
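As a sketch of what this looks like with scikit-learn (the arrays below are purely illustrative random stand-ins, just to make the snippet self-contained; in practice they would be one-hot encoded versions of the toy data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Purely illustrative one-hot style data: 3 binary features, binary target
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(15, 3))
y_train = rng.integers(0, 2, size=15)
X_new = rng.integers(0, 2, size=(2, 3))

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

print(clf.predict(X_new))        # modal class of the leaf each new instance lands in
print(clf.predict_proba(X_new))  # class fractions in that leaf, useful when the leaf is impure
```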
Overfitting can be a big challenge with Decision Trees. Even in our toy example, we
can see the algorithm continues to split till it reaches a leaf node. Often the leaf
node may just have one or two instances. This will clearly lead to a complex tree
structure which may not generalize well to a test scenario. This is because each leaf
will represent a very specific set of attribute combinations that are seen in the
training data, and the tree will not be able to classify attribute combinations not
seen in the training data. There are several ways we can prevent the decision tree
from becoming too unwieldy; three broad approaches to avoiding overfitting are
distinguished:
1. Pre pruning or Early stopping: Preventing the tree from growing too big or deep
2. Post Pruning: Allowing a tree to grow to its full depth and then getting rid of
various branches based on various criteria

3. Ensemble methods such as Random Forest, which combine many trees
We will only briefly overview pre and post pruning techniques here. Ensemble
techniques such as Random Forest require more explanation and hence will be
covered separately.
Pruning
Pre pruning
The pre-pruning technique refers to the early stopping of the growth of the decision
tree. The pre-pruning technique involves tuning the hyperparameters of the
decision tree model prior to the training pipeline. The hyperparameters of the
DecisionTreeClassifier in SkLearn include max_depth, min_samples_leaf and
min_samples_split, which can be tuned to stop the growth of the tree early and
prevent the model from overfitting. The best way is to use the sklearn
implementation of the GridSearchCV technique to find the best set of
hyperparameters for a Decision Tree model.
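A minimal pre-pruning sketch along those lines (the dataset is a stand-in and the candidate grid values are illustrative, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; in practice use your own training data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative candidate values for the pre-pruning hyperparameters
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
```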
A challenge with the early stopping approach is that it faces a ‘horizon’ problem:
stopping early may rule out some more fruitful splits further down the tree.
Post Pruning
In this technique, we allow the tree to grow to its maximum depth. Then we remove
parts of the tree to prevent overfitting. We effectively consider subtrees of the full
tree, which are evaluated on a criterion and then removed. Hence we are effectively
going ‘up’ the tree, collapsing subtrees back into leaves. The criterion for whether a
particular consolidation goes through or not is usually MSE for regression trees and
classification error for classification trees.
A challenge with post pruning is that a decision tree can grow very deep and large
and hence evaluating every branch can be computationally expensive. An
important post pruning technique is Cost Complexity Pruning (CCP), which provides
a more efficient solution in this regard. CCP is a complex and advanced technique
which is parametrized by the parameter α (ccp_alpha) in the Scikit Learn
DecisionTreeClassifier module.
The basic problem that CCP addresses is: How to determine the best way to prune a
tree? Intuitively, we would select a sub tree to prune such that its removal leads to a
lower test error rate. This can be done using cross validation or, if we have
sufficient sample, the validation set approach. However, given the number of sub
trees in a fully grown tree for even a small sample, this is likely to be a very
computationally and time intensive process.
Cost complexity pruning — also known as weakest link pruning — gives us a way to do
just this. Rather than considering every possible subtree, we consider a sequence of trees
indexed by a nonnegative tuning parameter α. …
The tuning parameter α controls a trade-off between the subtree’s complexity and its fit to
the training data. When α = 0, then the subtree T will simply equal T0, because then (8.4)
just measures the training error. However, as α increases, there is a price to pay for having
a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for
a smaller subtree.
Essentially the parameter α is very similar to the penalty term in Lasso regression.
The basic equation for CCP is given below:
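Following the regression-tree formulation in James et al. (2014), their equation (8.4), for each value of α we look for the subtree T that minimizes

$$\sum_{m=1}^{|T|} \;\sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 \;+\; \alpha \,|T|$$

where $|T|$ is the number of terminal nodes of the subtree $T$ and $\hat{y}_{R_m}$ is the mean of the training responses in region $R_m$.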
This is one complex equation. But let’s try and understand a bit further. Some
definitions:
Rm: the rectangle (i.e. the subset of predictor space) corresponding to the mth
terminal node

(yi − ŷRm)²: the squared error for the observations falling in terminal node m (we are
using the regression tree version of the equation; for simplicity I am following the
equation and approach in James et al. (2014)).
Now let’s see what the equation is doing. We are essentially minimizing the cost or
loss given by the squared errors (yi − ŷRm)² summed across all terminal nodes. The
term α multiplies the total number of terminal nodes in the tree, |T|. If α = 0, then we
are minimizing only the training loss and the tree will be the same as the original
tree. However, with α > 0, we add a penalty term which increases with the number of
terminal nodes |T|. This means the overall cost gets minimized for a smaller subtree.
Minimal cost complexity pruning recursively finds the node with the “weakest link”. The
weakest link is characterized by an effective alpha, where the nodes with the smallest
effective alpha are pruned first.
https://scikit-
learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
What does this mean? It means that the algorithm is hunting out nodes where the
training loss is already high and hence can only be minimized with a small α. On
the other hand, nodes where the training loss is smaller can accommodate a larger
penalty term as part of the minimization.
As α increases, more of the tree is pruned. We then have a tradeoff between bias
and variance: by increasing ccp_alpha we effectively increase the bias of the model,
i.e., we simplify it. On the con side, this means we have to tolerate increasing levels
of impurity in the terminal nodes. We see that as α increases, both the number of
nodes and the tree depth reduce.
Plotting ccp_alpha vs train and test accuracy, we see that when α = 0, keeping the
other default parameters of DecisionTreeClassifier, the tree overfits, leading to 100%
training accuracy and 88% testing accuracy. As alpha increases, more of the tree is
pruned, creating a decision tree that generalizes better, until at some point too much
is pruned away and test accuracy starts to fall as well.
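A sketch of how that comparison is typically produced with scikit-learn's cost-complexity pruning API (this mirrors the scikit-learn example linked above, which uses the breast cancer dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas at which nodes would be pruned, from the weakest link upwards
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# One pruned tree per alpha: watch train accuracy fall and test accuracy peak
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={clf.get_n_leaves()}  "
          f"train={clf.score(X_train, y_train):.3f}  test={clf.score(X_test, y_test):.3f}")
```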
Advantages
1. Trees give a visual schema of the relationship of variables used for classification
and hence are more explainable. The hierarchy of the tree provides insight into
variable importance.

3. A white box model which is explainable: we can trace back each result of the
model. This is in contrast to black box models such as neural networks.

4. In general there is less need to prepare and clean data, e.g., normalization,
one-hot encoding of categorical variables, or handling missing values.
Note: the Sklearn implementation currently does not support categorical variables,
so we do need to create dummy variables. Similarly, it does not support missing
values. But both can be handled by the algorithm in theory.
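For example, a one-line encoding step of the kind needed before fitting (the rows, the level names other than Student Background, and the Result column name are hypothetical, just to make the snippet self-contained):

```python
import pandas as pd

# A few illustrative rows in the shape of the toy dataset
df = pd.DataFrame({
    "Student Background": ["Maths", "CS", "Others"],
    "Working Status": ["Working", "Not Working", "Working"],
    "Other Online Courses": ["Yes", "No", "Yes"],
    "Result": ["Pass", "Pass", "Fail"],
})

X = pd.get_dummies(df.drop(columns="Result"))   # one-hot encode the categorical predictors
y = df["Result"]
print(X.columns.tolist())
```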
Disadvantages
2. Decision trees can be unstable because small variations in the data might result
in a completely different tree being generated. This problem can be mitigated by
using decision trees within an ensemble.

3. Can be non-robust, i.e., a small change in the data can cause a large change in the
final estimated tree.

5. Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the dataset prior to fitting the decision tree.
So, that’s it for Decision Trees, from start to at least two thirds of the way (there are
a lot of complexities, hence I cannot say the end). I hope you liked this blog on the
inner workings of Decision Trees. One thing is clear: this is far from a simple
technique. I have so far reviewed only how the variable hierarchy is chosen, how the
tree structure is built up and how pruning is done. There are many types of Decision
Tree algorithms, including ID3, C4.5, C5.0 and CART; Scikit Learn implements an
optimised version of CART. Exploration of these models is for another blog.
References
3. https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning

4. https://towardsdatascience.com/learn-how-decision-trees-are-grown-22bc3d22fb51

5. https://www.sciencedirect.com/topics/computer-science/cost-complexity

6. https://en.wikipedia.org/wiki/Entropy_(information_theory)