Random Forest

What is Decision Tree?

A Decision Tree is one of the most popular machine learning algorithms. It uses a tree-like structure of decisions and their possible outcomes to solve a particular problem.

It belongs to the class of supervised learning algorithms and can be used for both classification and regression purposes.
What is Decision Tree?
A decision tree is a structure that includes a root node, branches, and leaf nodes.

Each internal node denotes a test on an attribute, each branch denotes the outcome of a
test, and each leaf node holds a class label. The topmost node in the tree is the root
node.
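
As a minimal sketch of a decision tree in code (assuming scikit-learn and the Iris dataset, which are illustrative choices, not part of the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small example dataset (illustrative choice)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each internal node tests one attribute; each leaf holds a class label
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))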
Terminology
• Root Node: It represents the entire population or sample. This further gets divided into two or more homogeneous sets.

• Splitting: It is a process of dividing a node into two or more sub-nodes.

• Decision / Internal Node: When a sub-node splits into further sub-nodes, then it is called a decision node.

• Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Terminology
• Pruning: When we remove sub-nodes of a decision node, this process is called pruning. It is the opposite process of splitting.

• Branch/Sub-Tree: A sub-section of an entire tree is called a branch or sub-tree.

• Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
Random Forest
A Random Forest is an extremely popular supervised machine learning algorithm that is used for classification and regression problems in machine learning.

A forest becomes more robust as it comprises more trees. Similarly, the greater the number of trees in a Random Forest, the higher its accuracy and problem-solving ability.
Random Forest
Random Forest is a classifier that builds several decision trees on various subsets of the given dataset and averages their predictions to improve predictive accuracy. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and improve the performance of the model.
Why Random Forest?
• Reduces overfitting compared to individual decision trees (illustrated in the sketch below).
• Handles both classification and regression tasks.
• Robust to noise in the data.
• High performance with minimal parameter tuning.
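
A quick sketch of the first point on noisy synthetic data (the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Noisy synthetic data where a single deep tree tends to overfit
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100,
                                random_state=42).fit(X_train, y_train)

# The forest typically generalizes better than the single tree
print("Single tree test accuracy:", tree.score(X_test, y_test))
print("Random forest test accuracy:", forest.score(X_test, y_test))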
How does Random Forest work
1. Random Forest - Different Trees with Different Splits

• Random Forest is an ensemble of multiple decision trees, where each tree is built using a random
subset of the features and random sampling of data points (with bootstrapping).

• While all trees use the same dataset, the randomness involved in their construction can lead to
different splits at each level. For example:
• One tree might split first on Variable 1 because it provides the best impurity reduction for that
specific subset of the data.
• Another tree might split first on Variable 2 because it provides the best impurity reduction for
that subset of the data.

• Even though both features may have high importance, the decision to split on one or the other
depends on the data samples and the random feature subset chosen during tree construction.
How does Random Forest work
2. Why Different First Splits?

• Random Selection of Features: When building each tree, Random Forest randomly selects
a subset of features for each split. This means that Variable 1 might be more useful in one
tree, while Variable 2 could be the best split in another.

• Impurity Reduction at Each Node: The first split is chosen based on the feature that
maximizes impurity reduction (e.g., Gini index or entropy). Even if Variable 2 has the highest
importance across the entire forest, Variable 1 might be the best split for that specific
tree’s data, leading to the observed difference.
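
One way to see this in practice is to inspect the root split of each fitted tree (a sketch; tree_.feature is scikit-learn's per-node array of split feature indices, and index 0 is the root node):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Different trees often choose different first splits
for i, est in enumerate(forest.estimators_):
    print(f"Tree {i} splits first on feature {est.tree_.feature[0]}")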
How does Random Forest work
Step 1: Create Multiple Decision Trees
• For each tree, a random subset of the training data is chosen (with replacement).
• At each node in the tree, only a random subset of features is considered for splitting.

Step 2: Train Each Tree
• Each tree is independently trained using its subset of data and random features.
How does Random Forest work
Step 3: Make Predictions
• For Classification: The majority vote from all trees determines the final class label.
• For Regression: The average of the predictions from all trees gives the final predicted value.

Step 4: Model Evaluation
• Out-of-Bag (OOB) scoring is used to assess model performance during training, without the need for a separate validation set.
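
These four steps map onto a short scikit-learn sketch (dataset and parameter values are illustrative; note that scikit-learn's classifier averages class probabilities across trees, which amounts to a soft majority vote):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Steps 1-2: bootstrap samples and random feature subsets happen inside fit()
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=42).fit(X, y)

# Step 3: predict() aggregates the trees' outputs into one answer
# (majority vote for classification, mean for regression)
print("Prediction:", forest.predict(X[:1]))

# Step 4: the OOB estimate, computed during training
print("OOB score:", forest.oob_score_)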
How does Random Forest work
(diagram)
Understanding the code of Random Forest

For Classification:
RandomForestClassifier(
    random_state=42,
    n_jobs=-1,
    max_depth=5,
    n_estimators=100,
    oob_score=True)

For Regression:
RandomForestRegressor(
    random_state=42,
    n_jobs=-1,
    max_depth=5,
    n_estimators=100,
    oob_score=True)
Understanding the code of Random Forest
RandomForestClassifier( random_state is used to set the seed for the random
random_state=42, number generator, ensuring that the results are reproducible.
n_jobs=-1, When you set random_state=42, it ensures that every time
max_depth=5, you run the model with this seed, you will get the same
n_estimators=100, results.

oob_score=True) •42 is just a random number; any integer can be used here.
The important thing is that using the same number
guarantees the same split of the data when training.
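
A small sketch of reproducibility in action (the dataset choice is illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Two models with the same seed are trained identically
a = RandomForestClassifier(random_state=42).fit(X, y)
b = RandomForestClassifier(random_state=42).fit(X, y)
print((a.predict(X) == b.predict(X)).all())  # True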
Understanding the code of Random Forest

n_jobs=-1

n_jobs specifies the number of CPU cores to use for parallel processing.

• If n_jobs=-1, all available cores in the system are used to speed up training and prediction. This is useful for large datasets, where training multiple trees in parallel can save time.

• If you set n_jobs=1, only one core is used; with n_jobs=2, two cores are used, and so on.
Understanding the code of Random Forest
RandomForestClassifier( max_depth limits the maximum depth of the individual
random_state=42, trees in the forest.
n_jobs=-1, •A depth of 5 means that each tree will have a maximum
max_depth=5, of 5 levels from the root to the leaf nodes. Limiting the
n_estimators=100, depth helps in preventing overfitting, making the model

oob_score=True) more generalizable.


•If max_depth=None, the trees will grow until they
completely fit the data, which can lead to overfitting.
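
A sketch confirming the depth cap on the fitted trees (dataset choice is illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(max_depth=5, random_state=42).fit(X, y)

# No fitted tree exceeds the cap
print(max(est.get_depth() for est in forest.estimators_))  # <= 5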
Understanding the code of Random Forest
RandomForestClassifier( n_estimators specifies the number of decision trees in
random_state=42, the random forest. In this case, the random forest will
n_jobs=-1, consist of 100 individual decision trees.
max_depth=5, • Increasing the number of trees can improve the
n_estimators=100, model's performance, but it also increases
oob_score=True) computational time and resources. Usually, a larger
number of trees helps the model to be more accurate
by reducing variance, but after a certain point, the
improvement may plateau.
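
A sketch of the plateau effect (dataset, tree counts, and cross-validation setup are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# Accuracy typically rises with more trees, then levels off
for n in (1, 10, 100, 300):
    clf = RandomForestClassifier(n_estimators=n, random_state=42)
    print(n, cross_val_score(clf, X, y, cv=5).mean())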
Understanding the code of Random Forest
RandomForestClassifier( oob_score stands for Out-Of-Bag score. It is used to
random_state=42, estimate the accuracy of the random forest model without
using a separate validation set.
n_jobs=-1,
•When building a random forest, each tree is trained on a
max_depth=5, different bootstrapped sample of the data, meaning some
n_estimators=100, data points are left out (out-of-bag) for each tree.
•By setting oob_score=True, the RandomForestClassifier
oob_score=True)
will calculate the accuracy based on these out-of-bag
samples. This can serve as an internal cross-validation and
give an estimate of the model's performance without the
need for a separate test set.
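
A sketch comparing the OOB estimate with a held-out test score (dataset and split are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=42).fit(X_train, y_train)

# The OOB estimate is usually close to the held-out test accuracy
print("OOB estimate:     ", forest.oob_score_)
print("Held-out accuracy:", forest.score(X_test, y_test))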
Advantages of Random Forest

• Can handle both classification and regression tasks.
• Less prone to overfitting than decision trees.
• Can handle missing data and outliers well.
• Provides feature importance estimation (see the sketch below).
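
A sketch of feature importance estimation (Iris is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(random_state=42).fit(data.data, data.target)

# Impurity-based importance of each input feature, summing to 1
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")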
Disadvantages of Random Forest

• Slower to predict than a single decision tree (because multiple trees must be evaluated).
• Less interpretable than a single decision tree.
• Memory-intensive (especially with a large number of trees).