Ensemble
Introduction
• We are almost at the end of the semester, and the final competition is coming.
• https://inclass.kaggle.com/c/ml2016-cyber-security-attack-defender/leaderboard
• https://www.kaggle.com/c/outbrain-click-prediction/leaderboard
• https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags/leaderboard
• You have already developed several algorithms and pieces of code, and you may not want to modify them much.
• Ensemble: improving your machine with little modification.
Framework of Ensemble
• Get a set of classifiers
  • f1(x), f2(x), f3(x), ……
  • They should be diverse, like the tank, healer, and damage dealer in a team.
• Aggregate the classifiers (properly)
  • When fighting a boss, every player has a role and a position to stand in.
Ensemble: Bagging
Review: Bias v.s. Variance
• The observed error comes from both bias and variance.
  • Underfitting: large bias, small variance. Overfitting: small bias, large variance.
• A complex model trained in different "universes" (different training sets) gives very different f*, so it has large variance.
• But if we average all the f*, is the average close to the target f̂? For a complex (low-bias) model, E[f*] = f̂.
• Idea: we can average many complex models to reduce variance.
Bagging
• From the N training examples, sample N′ examples with replacement to form a new set (usually N′ = N).
• Repeat this to get Set 1, Set 2, Set 3, Set 4, and train Function 1, 2, 3, 4 on them separately.
• This approach is helpful when your model is complex and easy to overfit, e.g. a decision tree.
• Testing: feed the testing data x into Function 1–4 to get y1, y2, y3, y4, then average (regression) or vote (classification).
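A minimal sketch of this procedure, assuming scikit-learn and NumPy are available; the dataset here (make_moons) is synthetic and purely illustrative.

```python
# Minimal bagging sketch (illustrative; assumes scikit-learn and NumPy).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

n_models = 10
models = []
for _ in range(n_models):
    # Sample N' = N examples with replacement (a bootstrap set).
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier()          # complex model, easy to overfit
    tree.fit(X[idx], y[idx])
    models.append(tree)

# Aggregate by voting: average the predictions and threshold at 0.5.
votes = np.mean([m.predict(X) for m in models], axis=0)
y_bagged = (votes > 0.5).astype(int)
print("training accuracy of the bagged ensemble:", np.mean(y_bagged == y))
```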
Decision Tree
• Assume each object x is represented by a 2-dim vector (x1, x2).
• Example tree: if x1 < 0.5, ask whether x2 < 0.3; otherwise ask whether x2 < 0.7. The four leaves output Class 1, Class 2, Class 2, Class 1.
• The corresponding decision boundaries are the axis-aligned lines x1 = 0.5, x2 = 0.3, and x2 = 0.7 in the (x1, x2) plane.
• There are many design choices in training: the questions asked at each node, the number of branches, the branching criterion, when to stop splitting, …
• The questions can also be more complex than single-feature thresholds.
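To make the split structure concrete, here is the example tree above written out by hand (a sketch, not a learned model; the thresholds 0.5, 0.3, 0.7 and leaf labels come from the slide).

```python
# The example decision tree above, hard-coded rather than learned from data.
def example_tree(x1: float, x2: float) -> int:
    """Return the predicted class (1 or 2) for a 2-dim input (x1, x2)."""
    if x1 < 0.5:
        return 1 if x2 < 0.3 else 2   # leaves: Class 1 / Class 2
    else:
        return 2 if x2 < 0.7 else 1   # leaves: Class 2 / Class 1

print(example_tree(0.2, 0.1))  # x1 < 0.5 and x2 < 0.3  -> Class 1
print(example_tree(0.8, 0.9))  # x1 >= 0.5 and x2 >= 0.7 -> Class 1
```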
Experiment: Function of Miku
• Data: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/theano/miku
  (1st column: x, 2nd column: y, 3rd column: output (1 or 0))
• [Figure: results of a single decision tree with depth = 5, 10, 15, and 20.]
Random Forest
• Decision tree:
  • Easy to achieve 0% error rate on training data
  • e.g. if each training example has its own leaf ……
• Random forest: bagging of decision trees
  • Resampling the training data alone is not sufficient (the trees are still too similar)
  • Also randomly restrict the features/questions used in each split
• Out-of-bag (OOB) validation for bagging
  • Suppose the bootstrap sets are as below (O: the example was used to train that function, X: it was not):

        trained on   f1   f2   f3   f4
        x1           O    X    O    X
        x2           O    X    X    O
        x3           X    O    O    X
        x4           X    O    X    O

  • Use RF = f2 + f4 to test x1
  • Use RF = f2 + f3 to test x2
  • Use RF = f1 + f4 to test x3
  • Use RF = f1 + f3 to test x4
  • The out-of-bag (OOB) error is a good error estimation of the testing set, without needing a held-out validation set.
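A short sketch of a random forest with out-of-bag validation, assuming scikit-learn (which implements both the bootstrap resampling and the random feature restriction at each split); the dataset is again synthetic.

```python
# Random forest with out-of-bag validation (illustrative; assumes scikit-learn).
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,     # bagging: 100 bootstrap-resampled trees
    max_features="sqrt",  # randomly restrict the features considered at each split
    oob_score=True,       # estimate test accuracy from out-of-bag examples
    random_state=0,
)
rf.fit(X, y)

print("training accuracy:", rf.score(X, y))
print("out-of-bag accuracy (estimate of testing accuracy):", rf.oob_score_)
```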
Experiment: Function of Miku
• [Figure: results of a random forest (100 trees) with depth = 5, 10, 15, and 20.]
Ensemble: Boosting
Boosting: Improving Weak Classifiers
• Training data: {(x^1, ŷ^1), …, (x^N, ŷ^N)}, ŷ ∈ {+1, −1} (binary classification)
• Guarantee:
  • If your ML algorithm can produce classifiers with error rate smaller than 50% on the training data,
  • then you can obtain a 0% error rate classifier after boosting.
• Framework of boosting:
  • Obtain the first classifier f1(x)
  • Find another function f2(x) to help f1(x)
    • However, if f2(x) is similar to f1(x), it will not help a lot.
    • We want f2(x) to be complementary to f1(x). (How?)
  • Obtain the second classifier f2(x)
  • …… Finally, combine all the classifiers.
• Unlike bagging, the classifiers are learned sequentially.
How to obtain different classifiers?
• Train on different training data sets.
• How to have different training data sets?
  • Re-sampling your training data to form a new set
  • Re-weighting your training data to form a new set
  • In real implementations, you only have to change the cost/objective function:
• Example: three training examples with weights u^n, each initially 1, re-weighted to new values
  • (x^1, ŷ^1, u^1):  u^1 = 1  →  0.4
  • (x^2, ŷ^2, u^2):  u^2 = 1  →  2.1
  • (x^3, ŷ^3, u^3):  u^3 = 1  →  0.7
• Unweighted objective:  L(f) = Σ_n l( f(x^n), ŷ^n )
• Weighted objective:    L(f) = Σ_n u^n l( f(x^n), ŷ^n )
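A small sketch, assuming scikit-learn, of how this re-weighting is done in practice: most learners accept per-example weights directly (here via sample_weight), which is exactly the weighted objective above. The feature values of x^1, x^2, x^3 below are made up for illustration.

```python
# Re-weighting training data = minimizing a weighted objective.
# Illustrative sketch; assumes scikit-learn and NumPy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.1, 0.2], [0.8, 0.5], [0.4, 0.9]])  # x^1, x^2, x^3 (toy values)
y = np.array([1, -1, 1])                            # labels ŷ^1, ŷ^2, ŷ^3

u = np.array([0.4, 2.1, 0.7])   # example weights u^1, u^2, u^3

# The classifier now minimizes sum_n u^n * l(f(x^n), ŷ^n)
# instead of the unweighted sum_n l(f(x^n), ŷ^n).
clf = DecisionTreeClassifier(max_depth=1)
clf.fit(X, y, sample_weight=u)
```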
Idea of Adaboost
• Idea: train f2(x) on a new training set on which f1(x) fails.
• How to find a new training set that fails f1(x)?
• ε1: the (weighted) error rate of f1(x) on its training data:

  ε1 = Σ_n u1^n δ( f1(x^n) ≠ ŷ^n ) / Z1,   where Z1 = Σ_n u1^n   and ε1 < 0.5

• Change the example weights from u1^n to u2^n such that

  Σ_n u2^n δ( f1(x^n) ≠ ŷ^n ) / Z2 = 0.5

  i.e. on the re-weighted data the performance of f1 is like random guessing.
• Then train f2(x) based on the new weights u2^n.
Re-weighting Training Data
• Idea: train f2(x) on a new training set on which f1(x) fails.
• How to find a new training set that fails f1(x)?
• Toy example with four training examples, all starting with weight 1:
  • (x^1, ŷ^1, u^1): u^1 = 1 → 1/√3   (correctly classified by f1)
  • (x^2, ŷ^2, u^2): u^2 = 1 → √3     (misclassified by f1)
  • (x^3, ŷ^3, u^3): u^3 = 1 → 1/√3   (correctly classified by f1)
  • (x^4, ŷ^4, u^4): u^4 = 1 → 1/√3   (correctly classified by f1)
• With the original weights the error rate of f1(x) is 0.25; with the new weights it becomes 0.5.
• f2(x) is then trained on the re-weighted data.
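A quick numeric check of this toy example (pure NumPy): with d1 = √((1−ε1)/ε1) = √3, the re-weighted error rate of f1 is exactly 0.5.

```python
# Numeric check of the re-weighting toy example (pure NumPy).
import numpy as np

u1 = np.array([1.0, 1.0, 1.0, 1.0])             # original weights
wrong = np.array([False, True, False, False])   # f1 misclassifies only x^2

eps1 = u1[wrong].sum() / u1.sum()   # 0.25
d1 = np.sqrt((1 - eps1) / eps1)     # sqrt(3)

u2 = np.where(wrong, u1 * d1, u1 / d1)   # multiply if wrong, divide if right
print(u2)                                # [0.577..., 1.732..., 0.577..., 0.577...]
print(u2[wrong].sum() / u2.sum())        # 0.5
```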
Re-weighting Training Data
• Idea: train f2(x) on a new training set on which f1(x) fails.
• How to find a new training set that fails f1(x)?
• If x^n is misclassified by f1, i.e. f1(x^n) ≠ ŷ^n:
  • u2^n ← u1^n multiplied by d1   (the weight increases)
• If x^n is correctly classified by f1, i.e. f1(x^n) = ŷ^n:
  • u2^n ← u1^n divided by d1      (the weight decreases)
• f2(x) will be learned based on the new example weights u2^n.
• What is the value of d1?
Re-weighting Training Data
• Definitions:

  ε1 = Σ_n u1^n δ( f1(x^n) ≠ ŷ^n ) / Z1,   Z1 = Σ_n u1^n

• New weights: u2^n = u1^n d1 if f1(x^n) ≠ ŷ^n (multiplied by d1), and u2^n = u1^n / d1 if f1(x^n) = ŷ^n (divided by d1).
• We want

  Σ_n u2^n δ( f1(x^n) ≠ ŷ^n ) / Z2 = 0.5

• The numerator is

  Σ_{f1(x^n) ≠ ŷ^n} u2^n = Σ_{f1(x^n) ≠ ŷ^n} u1^n d1

• The denominator is

  Z2 = Σ_{f1(x^n) ≠ ŷ^n} u2^n + Σ_{f1(x^n) = ŷ^n} u2^n
     = Σ_{f1(x^n) ≠ ŷ^n} u1^n d1 + Σ_{f1(x^n) = ŷ^n} u1^n / d1

• Therefore the condition becomes

  Σ_{f1(x^n) ≠ ŷ^n} u1^n d1 / ( Σ_{f1(x^n) ≠ ŷ^n} u1^n d1 + Σ_{f1(x^n) = ŷ^n} u1^n / d1 ) = 0.5
Re-weighting Training Data
• The condition above implies

  Σ_{f1(x^n) = ŷ^n} u1^n / d1 = Σ_{f1(x^n) ≠ ŷ^n} u1^n d1

  i.e.  (1/d1) Σ_{f1(x^n) = ŷ^n} u1^n = d1 Σ_{f1(x^n) ≠ ŷ^n} u1^n

• Since ε1 = Σ_{f1(x^n) ≠ ŷ^n} u1^n / Z1, we have

  Σ_{f1(x^n) ≠ ŷ^n} u1^n = Z1 ε1,   Σ_{f1(x^n) = ŷ^n} u1^n = Z1 (1 − ε1)

• Therefore  Z1 (1 − ε1) / d1 = Z1 ε1 d1, which gives

  d1 = √( (1 − ε1) / ε1 )  > 1   (because ε1 < 0.5)
Algorithm for AdaBoost
• Given training data {(x^1, ŷ^1, u1^1), …, (x^N, ŷ^N, u1^N)}
  • ŷ ∈ {+1, −1} (binary classification), u1^n = 1 (equal initial weights)
• For t = 1, …, T:
  • Train weak classifier f_t(x) with weights {u_t^1, …, u_t^N}
  • ε_t is the error rate of f_t(x) with weights {u_t^1, …, u_t^N}
  • For n = 1, …, N:
    • If x^n is misclassified by f_t(x), i.e. ŷ^n ≠ f_t(x^n):
      u_{t+1}^n = u_t^n × d_t = u_t^n × exp(α_t)
    • Else:
      u_{t+1}^n = u_t^n / d_t = u_t^n × exp(−α_t)
  • where d_t = √((1 − ε_t)/ε_t) and α_t = ln √((1 − ε_t)/ε_t)
• Both cases can be written as a single update:  u_{t+1}^n ← u_t^n × exp( −ŷ^n f_t(x^n) α_t )
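A compact from-scratch sketch of this algorithm in NumPy, using depth-1 decision trees (decision stumps) from scikit-learn as the weak classifiers; variable names follow the slide (u, eps, alpha) and the dataset is synthetic.

```python
# AdaBoost from scratch (illustrative sketch; assumes NumPy and scikit-learn).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
y = 2 * y - 1                       # labels in {+1, -1}

N, T = len(X), 20
u = np.ones(N)                      # u_1^n = 1 (equal initial weights)
classifiers, alphas = [], []

for t in range(T):
    # Weak classifier: a decision stump trained with the current weights.
    f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=u)
    pred = f_t.predict(X)

    eps_t = u[pred != y].sum() / u.sum()            # weighted error rate
    alpha_t = np.log(np.sqrt((1 - eps_t) / eps_t))

    # Unified weight update: u_{t+1}^n = u_t^n * exp(-y^n f_t(x^n) alpha_t)
    u = u * np.exp(-y * pred * alpha_t)

    classifiers.append(f_t)
    alphas.append(alpha_t)

# Final classifier: H(x) = sign(sum_t alpha_t f_t(x))
g = sum(a * f.predict(X) for a, f in zip(alphas, classifiers))
print("training accuracy:", np.mean(np.sign(g) == y))
```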
Algorithm for AdaBoost
• We obtain a set of functions: f1(x), …, fT(x)
• How to aggregate them?
  • Uniform weight:  H(x) = sign( Σ_{t=1}^T f_t(x) )
  • Non-uniform weight:  H(x) = sign( Σ_{t=1}^T α_t f_t(x) ),  with α_t = ln √((1 − ε_t)/ε_t)
• A smaller error ε_t gives a larger weight α_t in the final voting:
  • ε_t = 0.1 ⇒ α_t = 1.10;  ε_t = 0.4 ⇒ α_t = 0.20
• (The weight update u_{t+1}^n = u_t^n × exp(−ŷ^n f_t(x^n) α_t) uses the same α_t.)
Toy Example (T = 3, weak classifier = decision stump)
• t = 1: train the first stump f1(x) on the equally weighted data.
  • ε1 = 0.30, d1 = 1.53, α1 = 0.42
  • Examples misclassified by f1 have their weights increased from 1.0 to 1.53 (×d1); correctly classified examples have their weights decreased from 1.0 to 0.65 (÷d1).
• [Figure: the +/− examples with their weights before and after re-weighting, and the decision boundary of f1(x).]
Toy Example (T = 3, weak classifier = decision stump)
• t = 2: train the second stump f2(x) on the data re-weighted by f1 (α1 = 0.42).
  • ε2 = 0.21, d2 = 1.94, α2 = 0.66
  • Examples misclassified by f2 have their weights multiplied by 1.94; the others are divided by 1.94 (e.g. 1.53 → 0.78, 0.65 → 0.33 or 1.26).
• [Figure: the example weights before and after re-weighting, and the decision boundary of f2(x).]
Toy Example (T = 3, weak classifier = decision stump)
• t = 3: train the third stump f3(x) on the data re-weighted by f1 and f2 (α1 = 0.42, α2 = 0.66).
  • ε3 = 0.13, d3 = 2.59, α3 = 0.95
• [Figure: the example weights (0.78, 0.33, 1.26, …) and the decision boundary of f3(x).]
Toy Example
• Final classifier:  H(x) = sign( 0.42 f1(x) + 0.66 f2(x) + 0.95 f3(x) )
• [Figure: the three stump boundaries combined; the weighted vote of the three stumps classifies all training examples (+/−) correctly.]
Warning of Math
• Claim: as we add more and more f_t (as T increases), H(x) achieves a smaller and smaller error rate on the training data, where

  H(x) = sign( Σ_{t=1}^T α_t f_t(x) ),   α_t = ln √((1 − ε_t)/ε_t)
Error Rate of Final Classifier
• Final classifier: H(x) = sign( g(x) ), where g(x) = Σ_{t=1}^T α_t f_t(x) and α_t = ln √((1 − ε_t)/ε_t)
• Training data error rate:

  (1/N) Σ_n δ( H(x^n) ≠ ŷ^n )
  = (1/N) Σ_n δ( ŷ^n g(x^n) < 0 )
  ≤ (1/N) Σ_n exp( −ŷ^n g(x^n) )

  (the exponential exp(−ŷ^n g(x^n)) is an upper bound of the 0/1 loss δ(ŷ^n g(x^n) < 0))
• We will show that this upper bound equals (1/N) Z_{T+1}.
• Z_{T+1}: the summation of the weights of the training data that would be used to train f_{T+1}:

  Z_{T+1} = Σ_n u_{T+1}^n

• Since u_1^n = 1 and u_{t+1}^n = u_t^n × exp( −ŷ^n f_t(x^n) α_t ),

  u_{T+1}^n = Π_{t=1}^T exp( −ŷ^n f_t(x^n) α_t )

• Therefore

  Z_{T+1} = Σ_n Π_{t=1}^T exp( −ŷ^n f_t(x^n) α_t )
          = Σ_n exp( −ŷ^n Σ_{t=1}^T α_t f_t(x^n) )
          = Σ_n exp( −ŷ^n g(x^n) )

  which is exactly N times the upper bound above, so the training error ≤ (1/N) Z_{T+1}.
• Training data error rate ≤ (1/N) Z_{T+1}, with g(x) = Σ_{t=1}^T α_t f_t(x) and α_t = ln √((1 − ε_t)/ε_t)
• How does Z_t evolve?  Z_1 = N (equal weights), and

  Z_{t+1} = Z_t ε_t exp(α_t) + Z_t (1 − ε_t) exp(−α_t)

  (the first term is the misclassified portion of Z_t, the second the correctly classified portion)

        = Z_t ε_t √((1 − ε_t)/ε_t) + Z_t (1 − ε_t) √(ε_t/(1 − ε_t))
        = Z_t × 2 √( ε_t (1 − ε_t) )

• Therefore  Z_{T+1} = N Π_{t=1}^T 2 √( ε_t (1 − ε_t) )
• Training data error rate ≤ Π_{t=1}^T 2 √( ε_t (1 − ε_t) ); each factor 2√(ε_t(1 − ε_t)) < 1 (since ε_t < 0.5), so the bound gets smaller and smaller as T increases.
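A quick numeric check of this bound (plain Python), plugging in the error rates from the toy example above (ε = 0.30, 0.21, 0.13); the product is an upper bound on the training error of H(x), not the error itself.

```python
# Numeric check of the training-error bound prod_t 2*sqrt(eps_t*(1-eps_t)).
import math

eps = [0.30, 0.21, 0.13]   # error rates from the toy example above
bound = 1.0
for e in eps:
    bound *= 2 * math.sqrt(e * (1 - e))
print(bound)  # ~0.50; the bound keeps shrinking toward 0 as T grows
```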
End of Warning
Large Margin?
• Even when the training error of H(x) is already 0, the testing error still decreases as T grows. Why?
• Define the margin of an example as ŷ g(x), where H(x) = sign( g(x) ).
• Training data error rate
  = (1/N) Σ_n δ( H(x^n) ≠ ŷ^n )
  ≤ (1/N) Σ_n exp( −ŷ^n g(x^n) )
  = Π_{t=1}^T 2 √( ε_t (1 − ε_t) )
• This upper bound keeps getting smaller as T increases even after the 0/1 training error reaches 0: Adaboost keeps pushing the margins ŷ^n g(x^n) to be larger.
• [Figure: the 0/1 loss and its upper bounds, the exponential loss (Adaboost), the logistic loss (logistic regression), and the hinge loss (SVM), plotted against the margin ŷ^n g(x^n).]
Experiment: Function of Miku
• [Figure: Adaboost + decision tree (depth = 5) with T = 10, 20, 50, and 100 trees.]
To learn more …
• Introduction of Adaboost:
  • Freund, Y.; Schapire, R. (1999). "A Short Introduction to Boosting".
• Multiclass/Regression:
  • Freund, Y.; Schapire, R. (1995). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting".
  • Schapire, Robert E.; Singer, Yoram (1998). "Improved Boosting Algorithms Using Confidence-rated Predictions". In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80–91.
• Gentle Boost:
  • Schapire, Robert; Singer, Yoram (1999). "Improved Boosting Algorithms Using Confidence-rated Predictions".
General Formulation of Boosting
• Initial function: g_0(x) = 0
• For t = 1 to T:
  • Find a function f_t(x) and a weight α_t to improve g_{t−1}(x), where g_{t−1}(x) = Σ_{i=1}^{t−1} α_i f_i(x)
  • g_t(x) = g_{t−1}(x) + α_t f_t(x)
• Output: H(x) = sign( g_T(x) )
• What is the learning target of g(x)?
  • Minimize  L(g) = Σ_n l( ŷ^n, g(x^n) ) = Σ_n exp( −ŷ^n g(x^n) )   (here l is the exponential loss)
Gradient Boosting
• Find g(x) that minimizes L(g) = Σ_n exp( −ŷ^n g(x^n) ).
• If we already have g_{t−1}(x), how do we update it?
• Gradient descent in function space:

  g_t(x) = g_{t−1}(x) − η ∂L(g)/∂g(x) |_{g(x) = g_{t−1}(x)}

  where the negative gradient −∂L(g)/∂g(x) |_{g = g_{t−1}} corresponds to  Σ_n exp( −ŷ^n g_{t−1}(x^n) ) ŷ^n

• Boosting instead updates  g_t(x) = g_{t−1}(x) + α_t f_t(x); we want α_t f_t(x) to point in the same direction as the negative gradient.
Gradient Boosting
• We want f_t(x) to have the same direction as the negative gradient, i.e. to find the f_t(x) maximizing

  Σ_n exp( −ŷ^n g_{t−1}(x^n) ) ŷ^n f_t(x^n)

• So ŷ^n and f_t(x^n) should have the same sign on as many examples as possible, each example weighted by

  u_t^n = exp( −ŷ^n g_{t−1}(x^n) )
        = exp( −ŷ^n Σ_{i=1}^{t−1} α_i f_i(x^n) )
        = Π_{i=1}^{t−1} exp( −ŷ^n α_i f_i(x^n) )

• That is, f_t(x) minimizes the weighted error with the weights u_t^n, and these are exactly the weights we obtain in Adaboost.
Gradient Boosting
• Given f_t(x), find the α_t that minimizes L(g).
  • α_t plays the role of a learning rate, but instead of fixing it we optimize it (like a line search along f_t).
• g_t(x) = g_{t−1}(x) + α_t f_t(x)
• Find α_t minimizing

  L(g) = Σ_n exp( −ŷ^n ( g_{t−1}(x^n) + α_t f_t(x^n) ) )
       = Σ_n exp( −ŷ^n g_{t−1}(x^n) ) exp( −ŷ^n α_t f_t(x^n) )
       = Σ_{ŷ^n ≠ f_t(x^n)} exp( −ŷ^n g_{t−1}(x^n) ) exp( α_t )
         + Σ_{ŷ^n = f_t(x^n)} exp( −ŷ^n g_{t−1}(x^n) ) exp( −α_t )

• Setting ∂L(g)/∂α_t = 0 gives

  α_t = ln √( (1 − ε_t)/ε_t )

• This is exactly Adaboost: Adaboost is gradient boosting with the exponential loss!
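A short NumPy sketch of this exponential-loss formulation: it builds g_t(x) by adding stumps weighted by the closed-form α_t, and the per-example weights u = exp(−ŷ g_{t−1}) reproduce the AdaBoost sketch earlier. Names such as g_train and ensemble are just illustrative.

```python
# Gradient boosting with the exponential loss (illustrative sketch).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
y = 2 * y - 1                       # labels in {+1, -1}

g_train = np.zeros(len(X))          # g_0(x) = 0 on the training points
ensemble = []                       # list of (alpha_t, f_t)

for t in range(20):
    u = np.exp(-y * g_train)        # u_t^n = exp(-y^n g_{t-1}(x^n))
    f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=u)
    pred = f_t.predict(X)

    eps_t = u[pred != y].sum() / u.sum()            # weighted error
    alpha_t = np.log(np.sqrt((1 - eps_t) / eps_t))  # optimal step size

    g_train += alpha_t * pred       # g_t(x) = g_{t-1}(x) + alpha_t f_t(x)
    ensemble.append((alpha_t, f_t))

print("training accuracy:", np.mean(np.sign(g_train) == y))
```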
Cool Demo
• http://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html
Ensemble: Stacking
Voting
• Four systems from different people: Xiao-Ming's system, Lao-Wang's system, Lao-Li's system, and Xiao-Mao's system, each mapping x to a prediction y.
• Feed x into all four systems and take the majority vote of their outputs.

Stacking
• Split the data into: training data for the four systems, separate training data for the final classifier, validation data, and testing data.
• Feed x into the four systems and treat their outputs y as a new feature vector.
• Train a final classifier on this new feature vector, using data the four front-end systems did not see, so it can learn how much to trust each system.
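A minimal stacking sketch, assuming scikit-learn: three different front-end classifiers stand in for the individual systems, their predictions on a held-out split become the new features, and a logistic regression serves as the final classifier.

```python
# Minimal stacking sketch (illustrative; assumes scikit-learn and NumPy).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
# Front-end systems and the final classifier are trained on different splits.
X_front, X_stack, y_front, y_stack = train_test_split(
    X, y, test_size=0.5, random_state=0)

# The individual "systems" (stand-ins for each person's classifier).
systems = [
    DecisionTreeClassifier(max_depth=3).fit(X_front, y_front),
    KNeighborsClassifier(n_neighbors=5).fit(X_front, y_front),
    LogisticRegression().fit(X_front, y_front),
]

# Their outputs become the new features for the final classifier.
new_features = np.column_stack([s.predict(X_stack) for s in systems])
final_clf = LogisticRegression().fit(new_features, y_stack)

# At test time, run the systems first, then the final classifier.
x_test = X[:5]
test_features = np.column_stack([s.predict(x_test) for s in systems])
print(final_clf.predict(test_features))
```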
Happy New Year 2017! (新年快樂)