Foundations of Machine Learning
DSA 5102 • Lecture 3
Li Qianxiao
Department of Mathematics
Last time
So far, our hypothesis spaces have consisted of smooth functions, or a
simple sign function composed with a smooth function.
Today, we are going to look at another class of supervised
learning hypothesis spaces, consisting of piecewise constant
functions. We will also discuss how to combine them to form strong
classifiers and regressors.
Decision Trees
Should I go to DSA5102?
Is it a Tuesday evening?
• No → Do DSA5102 homework instead
• Yes → Go out with friends?
  • No friends → Go to DSA5102
  • Have friends → Do they wanna learn ML?
    • Yes → Go to DSA5102
    • No → Reconsider friendship options
Decision Tree Basics
(Figure: a tree of depth 3 illustrating the terminology: the root node at the top, internal nodes, leaf nodes, and the branches connecting them.)
Decision Trees
Decision trees are very simple and useful ways to build models.
Key ideas
• Stratify the input space into distinct, non-overlapping
regions
• Assign a chosen, constant prediction to each region
A One-Dimensional Example
Suppose we want to approximate some oracle function of a single input \( x \).
A depth-1 decision tree is the piecewise constant function
\[ f(x) = \begin{cases} a, & x \le \theta_0 \\ b, & x > \theta_0 \end{cases} \]
(Plot: \( f \) takes the value \( a \) to the left of \( x = \theta_0 \) and jumps to the value \( b \) to the right of it.)
This corresponds to the following decision tree: the root node tests \( x \le \theta_0 \); the branch \( x \le \theta_0 \) leads to the leaf \( a \), and the branch \( x > \theta_0 \) leads to the leaf \( b \).
We can also further split the input space to form deeper trees
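As a concrete illustration, here is a minimal sketch in Python/NumPy (our own toy code, not from the lecture; the stand-in oracle function and the threshold value are arbitrary choices) of evaluating a depth-1 tree and choosing its leaf values as label averages:

```python
import numpy as np

def fit_leaf_values(x, y, theta0):
    """Given a fixed threshold theta0, the best constants are the label means on each side."""
    a = y[x <= theta0].mean()   # left leaf value
    b = y[x > theta0].mean()    # right leaf value
    return a, b

def depth1_tree(x, theta0, a, b):
    """The piecewise constant function f(x) = a if x <= theta0, else b."""
    return np.where(x <= theta0, a, b)

# toy usage: approximate a stand-in oracle function on [0, 1]
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)                      # stand-in oracle, not from the slides
a, b = fit_leaf_values(x, y, theta0=0.5)
print(a, b, depth1_tree(np.array([0.2, 0.8]), 0.5, a, b))
```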
Classification and Regression Trees
Suppose that the input space is \( \mathcal{X} \subset \mathbb{R}^d \). A partition of \( \mathcal{X} \) is a collection of
subsets \( R_1, \dots, R_M \subset \mathcal{X} \) such that
\[ R_i \cap R_j = \emptyset \ \text{ for } i \ne j, \qquad \bigcup_{j=1}^{M} R_j = \mathcal{X}. \]
The general decision tree hypothesis space is
\[ \mathcal{H} = \Big\{ f(x) = \sum_{j=1}^{M} c_j \, \mathbb{1}[x \in R_j] \ :\ \{R_j\} \text{ a partition of } \mathcal{X},\ c_j \in \mathbb{R} \Big\}. \]
In theory, we can consider very general partitions, but in practice
it is convenient to restrict to high-dimensional rectangles
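As a small illustration (our own representation, not code from the lecture), a tree over axis-aligned rectangles can be stored as a list of (rectangle, constant) pairs and evaluated as a piecewise constant function:

```python
import numpy as np

# Each region R_j is an axis-aligned rectangle (lower/upper bounds per dimension)
# with a constant prediction c_j; together the rectangles should partition the input space.
regions = [
    {"low": np.array([0.0, 0.0]), "high": np.array([0.5, 1.0]), "c": 1.0},
    {"low": np.array([0.5, 0.0]), "high": np.array([1.0, 1.0]), "c": -1.0},
]

def piecewise_predict(x, regions):
    """Return the constant c_j of the rectangle R_j containing x (half-open boxes)."""
    for r in regions:
        if np.all(r["low"] <= x) and np.all(x < r["high"]):
            return r["c"]
    raise ValueError("x is not covered by the partition")

print(piecewise_predict(np.array([0.25, 0.7]), regions))   # -> 1.0
print(piecewise_predict(np.array([0.75, 0.2]), regions))   # -> -1.0
```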
Learning Decision Trees
A decision tree model
\[ f(x) = \sum_{j=1}^{M} c_j \, \mathbb{1}[x \in R_j] \]
depends on both the partition \( \{R_j\} \) and the constants \( \{c_j\} \).
Given the partition \( \{R_j\} \), the constants \( \{c_j\} \) are easy to fix:
• Regression: we take the average label value in each region, \( c_j = \text{mean}\{\, y_i : x_i \in R_j \,\} \)
• Classification: we take the modal (most frequent) label value in each region
Suppose we are dealing with regression. Then we can fix \( \{c_j\} \) as
before and solve the following empirical risk minimization over partitions:
\[ \min_{\{R_j\}} \ \sum_{j=1}^{M} \sum_{i:\, x_i \in R_j} \big( y_i - c_j \big)^2 \]
Even restricting to rectangular partitions, this is very hard to solve!
(Figure: the Bell numbers, which count the number of partitions of a set, grow extremely quickly. Source: https://en.wikipedia.org/wiki/Bell_number)
Recursive Binary Splitting
Instead, we can resort to the following greedy algorithm, which
essentially repeats the following two steps:
1. Pick a dimension of the input space (randomly or …)
2. Find the best value to split this input dimension into
two parts and assign new constant values to these new
regions
This grows the tree by adding two leaf nodes at a time, and hence
the name recursive binary splitting
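A from-scratch sketch of recursive binary splitting for regression (a minimal implementation under our own choices of squared-error loss, a depth cap, and a minimum leaf size; not the exact code used in the course):

```python
import numpy as np

def best_split(X, y):
    """Search every input dimension and threshold for the split minimizing squared error."""
    best = None
    for j in range(X.shape[1]):                    # step 1: pick a dimension (here: try all)
        for theta in np.unique(X[:, j])[:-1]:      # step 2: candidate threshold values
            left = X[:, j] <= theta
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, theta)
    return best  # None if no valid split exists

def grow_tree(X, y, depth=0, max_depth=3, min_samples=5):
    """Recursive binary splitting: each leaf predicts the mean label of its region."""
    split = None if (depth == max_depth or len(y) < min_samples) else best_split(X, y)
    if split is None:
        return {"leaf": y.mean()}
    _, j, theta = split
    left = X[:, j] <= theta
    return {"dim": j, "theta": theta,
            "left": grow_tree(X[left], y[left], depth + 1, max_depth, min_samples),
            "right": grow_tree(X[~left], y[~left], depth + 1, max_depth, min_samples)}

def tree_predict(node, x):
    """Route a single input x down the tree to its leaf value."""
    if "leaf" in node:
        return node["leaf"]
    child = node["left"] if x[node["dim"]] <= node["theta"] else node["right"]
    return tree_predict(child, x)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1]
tree = grow_tree(X, y, max_depth=3)
print(tree_predict(tree, X[0]), y[0])
```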
Does this find the optimal solution?
Greedy vs Optimal Solution
Decision Trees for Classification
The greedy algorithm can be carried out analogously, except that
we need to define a proper loss function based on the class proportions, e.g. an impurity
measure such as the misclassification error, the Gini index \( \sum_k \hat p_{jk}(1 - \hat p_{jk}) \), or the cross-entropy,
where \( \hat p_{jk} \) is the proportion of samples in region \( R_j \) belonging to class \( k \).
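For instance, a sketch of the Gini impurity and the resulting size-weighted splitting criterion (one common choice, not necessarily the exact one used in the lecture):

```python
import numpy as np

def gini(labels):
    """Gini impurity of one region: sum_k p_hat_k * (1 - p_hat_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

def split_impurity(y_left, y_right):
    """Splitting criterion: impurity of the two child regions, weighted by their sizes."""
    n = len(y_left) + len(y_right)
    return len(y_left) / n * gini(y_left) + len(y_right) / n * gini(y_right)

print(gini(np.array([0, 0, 1, 1])))   # 0.5  (maximally mixed, two classes)
print(gini(np.array([1, 1, 1, 1])))   # 0.0  (pure region)
```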
Advantages and Disadvantages
Advantages:
• Can readily visualize and understand predictions
• Implicit feature selection via analyzing contribution of splits to
reduction of error/impurity
• Flexible across data types, supervised learning tasks (regression and
classification), and nonlinear relationships
Disadvantages:
• Greedy algorithms may find sub-optimal solutions
• Sensitive to data variation and balancing
• Prone to overfitting
Overfitting
The biggest drawback of decision trees is overfitting
Model Ensembling
Ensemble Methods
An effective way to reduce overfitting and increase approximation
power is to combine models. This is called model ensembling.
We will now introduce two classes of such methods
1. Bagging
2. Boosting
Bootstrap Aggregating (Bagging)
The first method for combining models is also the simplest: we
simply train models on random subsamples of the training data.
We can then combine them in the obvious way to make
predictions:
1. Regression: average the individual predictions, \( \hat f(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f_b(x) \)
2. Classification: take a majority vote among the individual predictions \( \hat f_1(x), \dots, \hat f_B(x) \)
Example
Dataset: a training set of labelled samples
Subsample and train:
1. Draw a random subsample, train a model \( \hat f_1 \)
2. Draw another random subsample, train a model \( \hat f_2 \)
3. Draw another random subsample, train a model \( \hat f_3 \)
Aggregate: combine \( \hat f_1, \hat f_2, \hat f_3 \) as above (average for regression, majority vote for classification)
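A minimal bagging sketch (bootstrap subsampling with replacement; the base learner, its depth, and `n_models` are our own illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def bagging_fit(X, y, n_models=10):
    """Train each base model on a bootstrap subsample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))          # random subsample
        models.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Regression: average the individual predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)

# toy usage
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(200)
models = bagging_fit(X, y)
print(bagging_predict(models, X[:5]))
```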
What does bagging do?
Consider a simple model where each trained predictor errs from the truth by a noise term:
\[ \hat f_b(x) = f^*(x) + \epsilon_b(x), \qquad b = 1, \dots, B \]
Assume the noise satisfies
• Zero mean and common variance: \( \mathbb{E}[\epsilon_b(x)] = 0 \) and \( \mathbb{E}[\epsilon_b(x)^2] = \sigma^2 \)
• Uncorrelated: \( \mathbb{E}[\epsilon_b(x)\,\epsilon_{b'}(x)] = 0 \) for \( b \ne b' \)
Form the aggregate model \( \hat f(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f_b(x) \)
Define the errors \( E_{\text{single}} = \mathbb{E}[\epsilon_b(x)^2] \) and \( E_{\text{bag}} = \mathbb{E}[(\hat f(x) - f^*(x))^2] \)
We can show that \( E_{\text{bag}} = E_{\text{single}} / B \)
A significant reduction! But…
• What is the most unrealistic assumption?
• What happens when there is bias, so that \( \mathbb{E}[\epsilon_b(x)] \ne 0 \)?
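The \( \sigma^2 / B \) reduction can be checked numerically under exactly these assumptions (zero-mean, uncorrelated, equal-variance noise); a small simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma, n_trials = 10, 1.0, 100_000

# each "model" errs by independent zero-mean noise eps_b with variance sigma^2
eps = sigma * rng.standard_normal((n_trials, B))

single_model_error = np.mean(eps[:, 0] ** 2)     # ~ sigma^2
bagged_error = np.mean(eps.mean(axis=1) ** 2)    # ~ sigma^2 / B

print(single_model_error, bagged_error)          # approx 1.0 and 0.1
```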
Bias and Variance
(Figure: the error of \( \hat f \) relative to the oracle \( f^* \) decomposes into a variance part and a bias / approximation error part; bagging targets the variance part. What targets the bias part?)
Boosting
Unlike bagging whose purpose is to reduce variance, boosting
aims to reduce bias.
It answers an important question in the affirmative:
Can weak learners be combined to form a strong learner?
We will introduce the simplest setting of the Adaptive Boosting or
AdaBoost algorithm
Key Ideas of AdaBoost
1. Initialize with uniform weights across all training samples
2. Train a classifier/regressor
3. Identify the samples that the model got wrong (classification) or on which it has large errors (regression)
4. Weight these samples more heavily and train again on this reweighted dataset
5. Repeat steps 3–4
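A minimal sketch of the classification version with decision stumps as weak learners (labels in {−1, +1}; this is our own simplified implementation of a standard AdaBoost update, not necessarily the exact variant presented in class):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """AdaBoost with depth-1 trees (stumps); labels y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # step 1: uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)             # step 2: train on the weighted data
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)   # step 3: weighted error
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-alpha * y * pred)            # step 4: upweight misclassified samples
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """Strong classifier: sign of the alpha-weighted vote of the weak learners."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)

# toy usage with labels in {-1, +1}
X = np.array([[0.1], [0.35], [0.4], [0.6], [0.8], [0.9]])
y = np.array([-1, -1, -1, 1, 1, 1])
stumps, alphas = adaboost_fit(X, y, n_rounds=5)
print(adaboost_predict(stumps, alphas, X))
```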
Bagging reduces variance
Boosting reduces bias
Demo: Decision Trees with Ensembling
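The live demo itself is not reproduced here; a hedged sketch of what such a comparison might look like with scikit-learn (the dataset, models, and hyperparameters below are our own placeholder choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "single deep tree": DecisionTreeClassifier(random_state=0),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "AdaBoost (stumps)": AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:18s} train acc = {model.score(X_tr, y_tr):.2f}, "
          f"test acc = {model.score(X_te, y_te):.2f}")
```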
Cross Validation
Recall: test data is the ultimate test of our model performance
But, should we always rely on test data for model evaluation and
selection?
1. It is not always available
2. A single test set gives no averaged error estimate
3. We might “overfit” on test data
A more robust idea: cross validation
Given a training set, we can further split it into training and
validation datasets. In fact, we can do this \( K \) times:
K-Fold Cross-Validation
Illustration with K = 4:
Split 1: Validation | Training | Training | Training → Score 1
Split 2: Training | Validation | Training | Training → Score 2
Split 3: Training | Training | Validation | Training → Score 3
Split 4: Training | Training | Training | Validation → Score 4
Average the K scores to obtain the cross-validation score.
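A K-fold cross-validation sketch with scikit-learn (K = 4 to match the picture; the dataset and model are placeholder choices):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# each fold serves as validation data exactly once; train on the other K-1 folds
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=kfold)

print("fold scores:  ", np.round(scores, 3))
print("average score:", scores.mean())
```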
Summary
Decision trees
• Piecewise constant predictors
• Learned by greedy algorithms (recursive binary splitting)
Model Ensembling
• Bagging: reduce variance
• Boosting: reduce bias
Evaluate models using cross validation
Homework and Project
Homework 2 is online (Due in 2 weeks)
Project
• Incorporate cross-validation as a more robust model
evaluation technique
• Ensembling methods
• Tune hyperparameters using cross validation
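For hyperparameter tuning with cross-validation, one convenient option is scikit-learn's `GridSearchCV` (a sketch only; the model, grid, and number of folds are our own example choices, not project requirements):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# try every combination in the grid, scoring each by 5-fold cross-validation
param_grid = {"n_estimators": [50, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```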
Test
• Week 7