Introduction to Random Forest
& R Packages for RF
Shuma Ishigami
2/2/2018
Agenda
• Random Forest Algorithm
– Decision Tree
– Bootstrapping
– Random Forest
• R packages for RF
– Sample codes
– Comparison
What is the Random Forest Algorithm ?
Random Forest =
[Something “Randomized”] + [a “Forest” consisting of trees] =
[Randomly chosen samples +
Randomly selected feature variables] +
[Many Decision Trees]
Random Forest
To put it simply, a random forest is
a set of many Decision Trees, where each tree uses
a randomly drawn bootstrap sample and
randomly selected predictors
Decision Tree
• A supervised* learning algorithm for both
classification and regression
• Has “nodes” and “branches” (like a tree)
• A DT is a set of simple rules that split the data into subgroups
Notes: * By “supervised” learning we mean that we give the computer training
data containing feature variables together with the “answers”, so that it can
learn a rule from these training examples.
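As a concrete illustration only (the rpart package is not one of the RF packages compared later), a tree like the ones sketched on the next slides can be grown in R; the toy data below is invented for this example.

library(rpart)

# Toy training data (made up for this illustration): the class follows two simple rules
set.seed(1)
train <- data.frame(x1 = runif(200), x2 = runif(200))
train$class <- factor(ifelse(train$x1 > 0.5 & train$x2 > 0.5, "A", "B"))

# Grow a classification tree; printing it lists the learned split rules (nodes and branches)
dt <- rpart(class ~ x1 + x2, data = train, method = "class")
print(dt)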
Simple rule ?
This rule involves only one feature variable, 𝑋1.
Rule: Is 𝑋1 > a ?
[Figure: two 𝑋1 vs 𝑋2 scatter plots; the left panel is split by a vertical line at 𝑋1 = a, the right panel by lines at 𝑋1 = b and 𝑋2 = c]
This rule involves two feature variables, 𝑋1 and 𝑋2.
Rule: Is 𝑋1 > b AND 𝑋2 > c ?
[Figure: scatter plot of training samples from two categories in the 𝑋1-𝑋2 plane]
Example: two categories and two feature variables (𝑋1 and 𝑋2)
[Figure: the scatter plot split by the lines 𝑋1 = a and 𝑋2 = b]
Try to divide the data into the two categories by simple rules
[Figure: decision tree with a root split “𝑋1 > a ?” and a second split “𝑋2 > b ?”, each with Yes/No branches leading to class leaves]
The set of rules on the previous slide can be represented in tree form
[Figure: the partitioned 𝑋1-𝑋2 plane with each region labelled by its predicted class]
Now our trained DT categorizes any new input falling in the lower-left area as the class of that region
[Figure: a new input point placed in the partitioned plane]
Let's try a new input whose true class we already know.
The DT classifies the new input correctly. Good job!
Issue with Decision Trees
• Sensitive to noise
• A few noisy points in the training data can greatly reduce the predictive
capability of the tree
[Figure: the training scatter plot with two noisy points added]
Two noisy points in the training data
[Figure: the much more finely partitioned plane produced once the tree fits the two noisy points]
Adding only two noisy points results in a very complex tree
[Figure: a new input point falls in a region distorted by the noisy points]
Because of those few noisy points, the prediction for a new input goes wrong.
Bootstrap sampling
• Bootstrap sampling method
– Randomly draw samples with replacement from the original data
– Sampling with replacement means that every time we draw a sample from
the data, we put it back, so we may draw the same sample more than once
– Ex. Assume we have original data {A,B,C,D,E} and draw from it 3 times.
First we pick B from {A,B,C,D,E}, second D from {A,B,C,D,E}, and third
B again from {A,B,C,D,E}. In contrast to sampling without replacement,
sampling with replacement can produce duplicates (see the code sketch below).
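A quick check of this in R (the seed and the exact draws are arbitrary):

original <- c("A", "B", "C", "D", "E")
set.seed(2)
sample(original, size = 3, replace = TRUE)  # with replacement: the same element may be drawn more than once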
Bagging (Bootstrap AGGregatING)
• Train many decision trees, each on its own bootstrap sample
• Then let each tree independently classify a new input and decide the
predicted class by majority vote (sketched in code below)
• The essential idea is that any given noisy point appears in only some of
the bootstrap samples, so the trees it distorts are outvoted by the rest and
its effect on the ensemble is small
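A minimal bagging sketch, reusing the toy train data frame from the earlier rpart example; the number of trees and the new input point are arbitrary choices for illustration.

library(rpart)

# Grow 25 trees, each on its own bootstrap sample of the training data
bagged_trees <- lapply(1:25, function(b) {
  idx <- sample(nrow(train), replace = TRUE)   # bootstrap sample (drawn with replacement)
  rpart(class ~ x1 + x2, data = train[idx, ], method = "class")
})

# Each tree votes on a new input; the majority class is the bagged prediction
new_input <- data.frame(x1 = 0.7, x2 = 0.8)
votes <- sapply(bagged_trees, function(tree)
  as.character(predict(tree, newdata = new_input, type = "class")))
names(which.max(table(votes)))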
[Figure: a bootstrap sample drawn from the original dataset]
Randomly sample from the original dataset. This bootstrap sample luckily contains none of the noisy points.
[Figure: the decision-tree partition grown on this bootstrap sample]
Let's grow a decision tree on this bootstrap sample.
[Figure: a second bootstrap sample and its decision-tree partition]
Another bootstrap sample and DT. This time one of the noisy points is included.
[Figure: a third bootstrap sample and its decision-tree partition]
Third sample and DT
[Figure: a fourth bootstrap sample and its decision-tree partition]
Fourth sample and DT
[Figure: the decision boundaries of the 4 DTs overlaid on one plot]
The 4 DTs overlapped
[Figure: a new input point shown against the four overlaid decision boundaries]
Determine the class of the new input by letting the trees vote. Our forest (4 trees) now categorizes the new input correctly.
Why Random Forest?
• [Bootstrap samples] + [Many Decision Trees] alone (i.e., plain bagging)
– Every tree uses the same set of feature variables. This leads to high
correlation among the trees, so the improvement in prediction
capability is limited
• Random Forest
– To reduce the correlation among the decision trees, RF assigns a
randomly chosen subset of the feature variables to each tree, and each
tree categorizes the training data using only that subset
– So in RF, the decision trees use different bootstrap samples AND
different subsets of the feature variables (see the randomForest sketch below)
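In the randomForest package the random feature subset is actually re-drawn at every split rather than fixed once per tree; its size is controlled by the mtry argument. A minimal run on the built-in iris data (iris is used here only as a stand-in dataset):

library(randomForest)

set.seed(3)
# ntree: number of trees; mtry: number of randomly chosen candidate features per split
rf <- randomForest(Species ~ ., data = iris, ntree = 100, mtry = 2)
print(rf)  # shows the confusion matrix and the OOB error estimate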
[Figure: a tree whose bootstrap sample is split using only 𝑋1]
The algorithm assigns 𝑋1 to this tree as its feature variable, so the tree tries to categorize its sample using 𝑋1 alone.
[Figure: the 2nd tree, also restricted to 𝑋1]
2nd tree
[Figure: the 3rd tree, restricted to 𝑋2]
This time, 𝑋2 is chosen as the feature to consider.
[Figure: the 4th tree, restricted to 𝑋2]
4th tree
Classification in Random Forest
Let each tree in the forest predict the class of the new input, and then take a vote.
[Figure: a new input point whose class is to be decided between the two candidate categories]
[Figures: the four trees classify the new input one at a time and each casts its vote; the first two trees split on 𝑋1, the last two on 𝑋2]
[Figure: the vote tally for the new input, 1 vote vs 3 votes]
Taking a vote on the class of the new input, the majority (3 of the 4 trees) favors one class.
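With the randomForest package, the votes behind a prediction can be inspected directly (continuing the rf object fitted on iris in the earlier sketch):

predict(rf, newdata = iris[1, ], type = "response")  # majority-vote class
predict(rf, newdata = iris[1, ], type = "vote")      # fraction of trees voting for each class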
Out-Of-Bag(OOB) Error
• A built-in form of cross validation in RF
• A sample is called an Out-Of-Bag (OOB) sample for a tree
if it was NOT drawn into the bootstrap sample that grew that tree
• For each sample, we can collect the set of trees for which it is OOB
and predict its class using only those trees
• The OOB error is defined as the average over all samples of
(# of wrong predictions) / (# of trees for which the sample is OOB)
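The randomForest package computes this OOB estimate automatically while the forest grows (again continuing the rf object from the iris sketch):

print(rf)                      # reports the "OOB estimate of error rate"
rf$err.rate[rf$ntree, "OOB"]   # OOB error rate after the last tree is grown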
[Figure: one tree's data; dark-colored points = the bootstrap samples used to grow the tree, light-colored points = the Out-Of-Bag samples]
The highlighted OOB sample still receives a predicted class from this tree, even though the tree was not grown with it.
R Packages for Random Forest
• randomForest
• party
• partykit
• randomForestSRC
• ranger
• Rborist
• grf
“randomForest”: Sample Codes
randomForest(x = X, y = Y,
na.action = na.fail,
ntree = 100)
X is a data frame of feature variables and Y is the response vector (a factor for classification)
“party”: Sample Codes
cforest(formula = Y ~ X,
data = Data,
controls = cforest_unbiased(ntree = 10))
Data is a data frame containing Y and X
“partykit”: Sample Codes
cforest(formula = Y ~ X,
data = Data,
ntree = 100)
Data is a data frame containing Y and X
“randomForestSRC”: Sample Codes
rfsrc(formula = Y ~ X,
data = as.data.frame(Data),
na.action = "na.impute",
ntree = 100)
Data is a data frame containing Y and X
“ranger”: Sample Codes
ranger(formula = Y ~ X,
data = as.data.frame(Data),
num.trees = 100)
Data is a data frame containing Y and X
“Rborist”: Sample Codes
Rborist(x = X,
y = Y,
nTree = 100)
X is a data frame (or matrix) of feature variables and Y is the response vector
“grf”: Sample Codes
custom_forest(X = X, Y = Y,
num.trees = 100)
X and Y are data frames or matrices
Attributes
a. Can factor variables be used as feature variables?
b. Can numerical variables be used as feature variables?
c. Can the package handle missing values in feature variables?
d. Can a factor variable be used as the target variable?
e. Can a numerical variable be used as the target variable?
f. Computation time
g. Parallel processing
Comparison Table
a. Factor feature variables:
randomForest: Yes, but the number of levels must be less than 53; errors with NA.
party: Yes.
partykit: Yes; number of levels must be less than 31.
randomForestSRC: Yes.
ranger: Yes; errors with NA.
Rborist: Yes; errors with NA.
grf: No; cannot handle factor-type feature variables.

b. Numerical feature variables:
Yes for all seven packages; randomForest, ranger, and Rborist error with NA.

c. Missing values in feature variables:
randomForest: has a function for imputing NA.
party: No.
partykit: has an option for how to handle NA.
randomForestSRC: has a function for imputing NA.
ranger: No. Rborist: No. grf: No.

d. Factor target variable: Yes for all seven packages.

e. Numerical target variable: Yes for all packages except grf (No).

f. Computation time:
randomForest: 3.96 sec; party: 331.79 sec; partykit: did not finish in a reasonable time;
randomForestSRC: 8.44 sec; ranger: 5.07 sec; Rborist: 2.79 sec; grf: NA.

g. Parallel processing:
randomForest: via external packages.
party: No.
partykit: via mclapply.
randomForestSRC: via OpenMP.
ranger: the number of threads can be set.
Rborist: uses all cores by default.
grf: the number of threads can be set.

Notes: For the time comparison, I generated data with a binary factor target variable and 10 numerical feature variables, without missing values. The sample size is 100,000, and I measured the time to grow a forest with 10 trees, with default settings for the other options. The reported times are the average of three runs with different random-number seeds.
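As an illustration of rows f and g, the thread count in ranger is set directly in the call and system.time() gives a rough timing; the values below are arbitrary and use iris only as a stand-in dataset.

library(ranger)

# Time the training of 10 trees while requesting 4 threads
system.time(
  ranger(Species ~ ., data = iris, num.trees = 10, num.threads = 4)
)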