Random Forest
Each internal node denotes a test on an attribute, each branch denotes the outcome of a
test, and each leaf node holds a class label. The topmost node in the tree is the root
node.
Terminology
• Root Node: It represents the entire population or
sample. This further gets divided into two or more
homogeneous sets.
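The anatomy above can be seen directly in code. This is a minimal sketch (the iris dataset and a depth limit are assumptions for illustration): it trains a small decision tree and prints its structure, where the first line is the test at the root node, indented lines are branches, and "class:" lines are the leaf nodes.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small tree so the printed structure stays readable.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Root node test first, then branches, then leaf class labels.
print(export_text(tree, feature_names=load_iris().feature_names))
```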
• Random Forest is an ensemble of multiple decision trees, where each tree is built using a random
subset of the features and random sampling of data points (with bootstrapping).
• While all trees are built from the same underlying dataset, the randomness involved in
their construction can lead to different splits at each level. For example:
• One tree might split first on Variable 1 because it provides the best impurity reduction for that
specific subset of the data.
• Another tree might split first on Variable 2 because it provides the best impurity reduction for
that subset of the data.
• Even though both features may have high importance, the decision to split on one or the other
depends on the data samples and the random feature subset chosen during tree construction.
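This can be checked empirically. The sketch below (iris dataset and seed are assumptions for illustration) fits a small forest and lists which feature index each tree tests at its root node; because every tree sees a different bootstrap sample and feature subset, the root splits typically differ across trees.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# tree_.feature[0] is the index of the feature tested at the root node.
root_features = [est.tree_.feature[0] for est in forest.estimators_]
print(root_features)
```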
How does Random Forest work?
2. Why Different First Splits?
• Random Selection of Features: When building each tree, Random Forest randomly selects
a subset of features for each split. This means that Variable 1 might be more useful in one
tree, while Variable 2 could be the best split in another.
• Impurity Reduction at Each Node: The first split is chosen based on the feature that
maximizes impurity reduction (e.g., Gini index or entropy). Even if Variable 2 has the highest
importance across the entire forest, Variable 1 might be the best split for that specific
tree’s data, leading to the observed difference.
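The distinction between forest-wide importance and a single tree's first split can be illustrated as follows (iris dataset and seed are assumptions): the importances are aggregated over all trees, while the root split shown is specific to one tree and need not be the most important feature overall.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importance, aggregated over all trees in the forest.
print(forest.feature_importances_)

# Root split of one individual tree; may differ from the top-ranked
# feature above, because this tree saw its own bootstrap sample and
# random feature subsets.
print(forest.estimators_[0].tree_.feature[0])
```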
Step 1: Create Multiple Decision Trees
• For each tree, a random subset of the training data is chosen (with replacement).
• At each node in the tree, only a random subset of features is considered for splitting.
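Step 1 can be sketched with NumPy alone (sizes and the sqrt feature rule are assumptions for illustration): draw a bootstrap sample with replacement, then pick a random feature subset as a tree would at one node.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 150, 4

# Sampling WITH replacement: some rows appear several times, others not at all.
row_idx = rng.integers(0, n_samples, size=n_samples)

# Only a random subset of features is considered at each node; a common
# choice for classification is sqrt(n_features).
k = max(1, int(np.sqrt(n_features)))
feature_idx = rng.choice(n_features, size=k, replace=False)

print(len(set(row_idx.tolist())), "unique rows out of", n_samples)
print("candidate features at this node:", feature_idx)
```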
Understanding the code of Random Forest

RandomForestClassifier(
    random_state=42,
    n_jobs=-1,
    max_depth=5,
    n_estimators=100,
    oob_score=True)

• random_state=42: 42 is just a random number; any integer can be used here.
The important thing is that using the same number guarantees the same split of
the data when training.
• n_jobs specifies the number of CPU cores to use for parallel processing. If
n_jobs=-1, it means use all available cores in the system to speed up training
and prediction. This is useful for large datasets, where training multiple
trees in parallel saves time.
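A usage sketch of this configuration (the iris dataset is an assumption for illustration): fit the classifier and read the out-of-bag accuracy that oob_score=True makes available.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(
    random_state=42,   # fixed seed for reproducible results
    n_jobs=-1,         # use all available CPU cores
    max_depth=5,       # limit the depth of each tree
    n_estimators=100,  # number of trees in the forest
    oob_score=True,    # estimate accuracy on out-of-bag samples
)
clf.fit(X, y)

# Accuracy estimated on samples each tree did NOT see during training.
print(clf.oob_score_)
```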