Unit-3
Decision Tree Learning
Prepared By: Deepti Singh
Decision Tree
• It is a popular supervised learning algorithm used for
classification and regression tasks.
• It is used to model and predict outcomes based on input
data.
• It is a tree-like structure where
- each internal node tests an attribute,
- each branch corresponds to an attribute value,
- and each leaf node represents a final decision or prediction.
Cont…
• Working:
1. Splitting: The dataset splits into subsets based on
feature values.
2. Decision Rule: At each node, a decision rule is
applied to split the data.
3. Predictions: The process continues until leaf nodes
are reached, which provide the final predictions.
* Decision trees are easy to interpret and visualize, making them
valuable tools for various applications.
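To make the splitting/prediction workflow above concrete, here is a minimal sketch using scikit-learn; the tiny feature matrix and class labels are invented purely for illustration.

```python
# A minimal sketch of the decision-tree workflow above, using scikit-learn.
# The toy dataset (features and labels) is hypothetical, for illustration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [outlook (0=sunny, 1=overcast, 2=rainy), windy (0/1)]
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = [0, 0, 1, 1, 0, 1]           # 0 = "don't play", 1 = "play"

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)                   # splitting: decision rules are chosen at each node

print(export_text(tree, feature_names=["outlook", "windy"]))
print(tree.predict([[1, 0]]))    # prediction: follow the branches to a leaf
```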
Inductive Bias
• Inductive bias refers to the set of assumptions which
a learning algorithm makes to generalize from
specific training data to unseen data.
• It guides the learning process and helps the
algorithm to make predictions.
Decision Tree Learning Algorithm
• It performs a greedy, top-down search through the
space of possible decision trees.
• Two basic algorithms are :
- Iterative Dichotomiser 3 (ID3) algorithm
- C4.5 algorithm
Cont…
1. Attribute Selection Measures:
• It is a heuristic for selecting the splitting criterion that best
separates a given dataset of class-labelled training tuples into
individual classes.
• It is also known as a splitting rule, because it determines how the
tuples at a given node are to be split.
• The attribute having the best score for the measure is chosen as the
splitting attribute for the given tuples.
• The three popular attribute selection measures are:-
- Information Gain
- Gain Ratio
- Gini Index
Attribute Selection Measures
1.1 Information Gain
• In decision tree learning, the main selection measure that is used is
information gain.
• The algorithm always tries to maximize information gain, which measures
how well a given attribute separates the training examples according to their
target classification.
• When the DT is constructed, the attribute with the highest information gain is
tested first.
• This attribute becomes the root node, and the split is made based on the values of the
root node.
• Entropy is the quantity that controls the split in the data.
• It measures the homogeneity (impurity) of a set of examples.
Cont…
• The formula is:
  Entropy(S) = − Σ_i p_i log2(p_i)
• p_i = proportion of instances in S that belong to class i.
• The entropy is 0 if all members of S belong to the same class.
• It is 1 if there are equal numbers of positive and negative examples.
• Entropy lies between 0 and 1 if there are unequal numbers of positive and
negative examples.
• Information gain quantifies the reduction in entropy after splitting the
dataset on a feature:
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
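As a concrete illustration, here is a small, self-contained sketch (plain Python; the toy labels and split are hypothetical) that computes entropy and the information gain of one candidate split:

```python
# Minimal sketch: entropy and information gain for a candidate split.
# The toy labels and the split below are invented for illustration only.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, partitions):
    """Gain = Entropy(S) - weighted sum of entropies of the partitions of S."""
    total = len(labels)
    remainder = sum((len(p) / total) * entropy(p) for p in partitions)
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no", "yes", "no"]      # whole node S
split  = [["yes", "yes", "yes"], ["no", "no", "no"]]  # S split on some attribute
print(entropy(labels))                                # 1.0 (balanced classes)
print(information_gain(labels, split))                # 1.0 (perfect split)
```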
Attribute Selection Measures
1.2 Gain Ratio
• Information gain is biased towards tests with many outcomes.
• It prefers to select attributes having a large number of values.
• Ex: Consider an attribute that acts as a unique identifier, such as a
product id.
- A split on product id would result in a large number of partitions, each one
containing just one tuple.
- The information gain obtained by partitioning on this attribute is
maximal, yet such a partitioning is clearly useless for classification.
• The C4.5 algorithm, a successor of ID3, uses an extension of information gain
known as gain ratio, which attempts to overcome this bias.
Cont…
• It applies a kind of normalization to information gain using split
information:
  SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) · log2(|D_j| / |D|)
• Gain Ratio:
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
• The attribute with the maximum gain ratio is selected as the splitting
attribute.
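Continuing the sketch above, a hedged illustration of split information and gain ratio (reusing the hypothetical `entropy`/`information_gain` helpers defined earlier):

```python
# Sketch of split information and gain ratio, reusing the hypothetical
# entropy()/information_gain() helpers from the previous example.
from math import log2

def split_info(partitions):
    """SplitInfo = -sum((|Dj|/|D|) * log2(|Dj|/|D|)) over the partitions."""
    total = sum(len(p) for p in partitions)
    return -sum((len(p) / total) * log2(len(p) / total) for p in partitions)

def gain_ratio(labels, partitions):
    """GainRatio = information gain normalized by split information."""
    si = split_info(partitions)
    return information_gain(labels, partitions) / si if si > 0 else 0.0

labels = ["yes", "yes", "no", "no", "yes", "no"]
split  = [["yes", "yes", "yes"], ["no", "no", "no"]]
print(gain_ratio(labels, split))   # 1.0 / 1.0 = 1.0 for this balanced split
```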
Attribute Selection Measures
1.3 Gini Index
• The Gini Index is used in Classification and Regression Trees (CART). This
index measures the impurity of a set of training tuples. Mathematically,
  Gini(D) = 1 − Σ_i p_i²
where p_i = the probability that a tuple in D belongs to class C_i.
• The Gini index considers a binary split for each attribute. Consider the case
where A is a discrete-valued attribute having v distinct values,
{a1, a2, …, av}, occurring in a training set.
• To determine the best binary split on A, we examine all the possible
subsets that can be formed using known values of A.
Cont…
• Ex: if income has three possible values namely: {low, medium , high}, then
the possible sets are:
- {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low},
{medium}, {high}, {}.
• We exclude the full set and the empty set, because they do not represent a
binary split. Therefore, there are (2^v − 2) possible ways to form two partitions of
the data based on a binary split on A.
• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition. Ex: if a binary split on A partitions D into
D1 and D2, the Gini index of D given that partitioning is:
  Gini_A(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)
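A minimal sketch of the Gini computation for a binary split (the toy class labels are hypothetical):

```python
# Sketch: Gini index of a node and the weighted Gini of a binary split.
# The toy class labels are invented for illustration only.
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2) over the classes present in D."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(d1, d2):
    """Weighted Gini of a binary split of D into D1 and D2."""
    total = len(d1) + len(d2)
    return (len(d1) / total) * gini(d1) + (len(d2) / total) * gini(d2)

d = ["yes", "yes", "no", "no", "yes", "no"]
d1, d2 = ["yes", "yes", "yes"], ["no", "no", "no"]
print(gini(d))            # 0.5 for a balanced two-class node
print(gini_split(d1, d2)) # 0.0 for a pure binary split
```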
ID3 Algorithm
Steps to Create a Decision Tree using the ID3
Algorithm:
• Step 1: Data Preprocessing:
Clean and preprocess the data. Handle missing values and convert categorical variables into
numerical representations if needed.
• Step 2: Selecting the Root Node:
Calculate the entropy of the target variable (class labels) based on the dataset. The formula
for entropy is:
  Entropy(S) = − Σ_i p_i log2(p_i)
• Step 3: Calculating Information Gain:
For each attribute in the dataset, calculate the information gain when the dataset is split on
that attribute. The formula for information gain is:
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Cont…
• Step 4: Selecting the Best Attribute:
Choose the attribute with the highest information gain as the decision node for the tree.
• Step 5: Splitting the Dataset:
Split the dataset based on the values of the selected attribute.
• Step 6: Repeat the Process:
Recursively repeat steps 2 to 5 for each subset until a stopping criterion is met (e.g., the tree
depth reaches a maximum limit or all instances in a subset belong to the same class).
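Putting Steps 2–6 together, the following is a hedged sketch of a recursive ID3 learner over categorical attributes; the helper names and the tiny dataset at the bottom are my own hypothetical illustrations, not the lecture's exact example.

```python
# A compact sketch of ID3 over categorical attributes (Steps 2-6 above).
# Helper names and the tiny example dataset are hypothetical illustrations.
from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    classes = [r[target] for r in rows]
    # Stopping criteria: pure node, or no attributes left -> majority-class leaf.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Step 4: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    # Step 5: split the dataset on the chosen attribute, then recurse (Step 6).
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Hypothetical toy dataset in the spirit of the weather example.
data = [
    {"Weather": "Sunny", "Windy": "No", "Play": "No"},
    {"Weather": "Sunny", "Windy": "Yes", "Play": "No"},
    {"Weather": "Overcast", "Windy": "No", "Play": "Yes"},
    {"Weather": "Rainy", "Windy": "No", "Play": "Yes"},
    {"Weather": "Rainy", "Windy": "Yes", "Play": "No"},
]
print(id3(data, ["Weather", "Windy"], "Play"))
```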
Solved Example
Step 1: Data Preprocessing: The dataset
does not require any preprocessing, as it
is already in a suitable format.
Step 2: Calculating Entropy:
To calculate entropy, we first determine
the proportion of positive and negative
instances in the dataset:
Step 3: Calculating Information Gain:
We calculate the information gain for
each attribute (Weather, Temperature,
Humidity, Windy) and choose the
attribute with the highest information
gain as the root node.
Cont…
• Step 4: Selecting the Best Attribute:
The “Weather” attribute has the highest
information gain, so we select it as the
root node for our decision tree.
• Step 5: Splitting the Dataset:
We split the dataset based on the values
of the “Weather” attribute into three
subsets (Sunny, Overcast, Rainy).
• Step 6: Repeat the Process:
Since the subsets produced by the
“Weather” attribute require no further
splitting, we stop and label each leaf
node with the majority class in that
subset.
Radial Basis Functions (RBFs)
• One approach to function approximation that is closely related to distance-weighted
regression and also to artificial neural networks is learning with radial basis functions.
• In this approach, the learned hypothesis is a function of the form:
  f̂(x) = w0 + Σ_{u=1..k} wu · Ku(d(xu, x))    … (eqn. 1)
• where each xu is an instance from X and where the kernel function Ku(d(xu, x)) is defined so
that it decreases as the distance d(xu, x) increases. Here k is a user-provided constant that
specifies the number of kernel functions to be included.
• Even though f̂(x) is a global approximation to f(x), the contribution from each of the
Ku(d(xu, x)) terms is localized to a region near the point xu. It is common to choose each
function Ku(d(xu, x)) to be a Gaussian function centered at the point xu with some variance
σu², i.e. Ku(d(xu, x)) = exp( − d²(xu, x) / (2σu²) ).
Cont…
• We will restrict our discussion here to this common Gaussian kernel function. As shown by
Hartman et al. (1990), the functional form of Equation (1) can approximate any function with
arbitrarily small error, provided a sufficiently large number k of such Gaussian kernels and
provided the width of each kernel can be separately specified.
• The function given by Equation (1) can be viewed as describing a two-layer network where
the first layer of units computes the values of the various Ku(d(xu, x)) and where the second
layer computes a linear combination of these first-layer unit values.
* Each hidden unit produces an activation
determined by a Gaussian function centered at
some instance xu. Therefore, its activation will be
close to zero unless the input x is near xu. The
output unit produces a linear combination of the
hidden unit activations. Although the network
shown here has just one output, multiple output
units can also be included.
Cont.
Summary:
• Radial basis function networks provide a global approximation to the target function,
represented by a linear combination of many local kernel functions. The value for any given
kernel function is non-negligible only when the input x falls into the region defined by its
particular center and width. Thus, the network can be viewed as a smooth linear
combination of many local approximations to the target function.
• One key advantage of RBF networks is that they can be trained much more efficiently than
feed-forward networks trained with BACKPROPAGATION. This follows from the fact that the
input layer and the output layer of an RBF are trained separately.
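To illustrate the two-stage training mentioned above, here is a hedged NumPy sketch of an RBF network: the Gaussian centers are fixed first (here simply sampled from the training inputs, an assumption on my part), then the output-layer weights are fitted by linear least squares. All names and data below are hypothetical.

```python
# Sketch of a two-layer RBF network trained in two separate stages:
# (1) fix the Gaussian centers, (2) fit the output weights by least squares.
# Sampling centers from the data and the toy target are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))          # toy 1-D inputs
y = np.sin(X[:, 0])                            # toy target function f(x)

k, sigma = 10, 0.7                             # number of kernels, kernel width
centers = X[rng.choice(len(X), size=k, replace=False)]   # stage 1: pick centers

def design_matrix(X, centers, sigma):
    """Column u holds Ku(d(xu, x)) = exp(-||x - xu||^2 / (2*sigma^2)); plus a bias column."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), phi])   # implements w0 + sum_u wu * Ku(...)

Phi = design_matrix(X, centers, sigma)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # stage 2: linear output layer

y_hat = design_matrix(X, centers, sigma) @ w
print("mean squared error:", np.mean((y_hat - y) ** 2))
```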
Case Based Reasoning
• Instance-based methods such as k-NEAREST NEIGHBOR and locally weighted regression share
three key properties.
• First, they are lazy learning methods in that they defer the decision of how to generalize
beyond the training data until a new query instance is observed.
• Second, they classify new query instances by analyzing similar instances while ignoring
instances that are very different from the query.
• Third, they represent instances as real-valued points in an n-dimensional Euclidean space.
• In CBR, instances are typically represented using richer symbolic descriptions, and the
methods used to retrieve similar instances are correspondingly more elaborate.
Cont.
• CBR has been applied to problems such as
- conceptual design of mechanical devices based on a stored library of
previous designs,
- reasoning about new legal cases based on previous rulings, and solving
planning and scheduling problems by reusing and combining portions of
previous solutions to similar problems.
• Case-based reasoning consists of a cycle, as shown below:
1. Retrieve: Given a new case, retrieve similar cases from the case base.
2. Reuse: Adapt the retrieved cases to fit the new case.
3. Revise: Evaluate the solution and revise it based on how well it works.
4. Retain: Decide whether to retain this new case in the case base.
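A minimal sketch of this retrieve–reuse–revise–retain cycle in Python; the case structure, similarity measure, and adaptation rule are simplified assumptions, not a prescribed CBR implementation.

```python
# Toy sketch of the 4-step CBR cycle; the case structure, similarity measure,
# and adaptation rule are simplified assumptions for illustration only.
case_base = [
    {"problem": {"load": 5, "span": 10}, "solution": "beam-A"},
    {"problem": {"load": 9, "span": 12}, "solution": "beam-B"},
]

def similarity(p, q):
    # Simple negative distance over shared numeric features.
    return -sum(abs(p[k] - q[k]) for k in p)

def retrieve(problem):
    return max(case_base, key=lambda c: similarity(problem, c["problem"]))

def reuse(case, problem):
    # Adapt the retrieved solution to the new problem (trivial adaptation here).
    return {"problem": problem, "solution": case["solution"]}

def revise(candidate, works_well):
    if not works_well:
        candidate["solution"] += "-revised"
    return candidate

def retain(case, is_novel):
    if is_novel:
        case_base.append(case)

new_problem = {"load": 6, "span": 11}
candidate = reuse(retrieve(new_problem), new_problem)   # Retrieve + Reuse
candidate = revise(candidate, works_well=True)          # Revise
retain(candidate, is_novel=True)                        # Retain
print(candidate)
```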