Foundations of Machine Learning

DSA 5102 • Lecture 3

Li Qianxiao
Department of Mathematics
Last time
So far, our hypothesis space has consisted of smooth functions, or
the sign function composed with a smooth function.

Today, we are going to look at another class of supervised
learning hypothesis spaces, consisting of piece-wise constant
functions. We also discuss how to combine them to form strong
classifiers and regressors.
Decision Trees
Should I go to DSA5102?

Is it a Tuesday evening?
• No → Do DSA5102 homework instead
• Yes → Go out with friends?
  • No friends → Go to DSA5102
  • Have friends → Do they wanna learn ML?
    • Yes → Go to DSA5102
    • No → Reconsider friendship options
Decision Tree Basics

[Figure: a decision tree of depth 3. The root node sits at the top; branches lead from the root to internal nodes, and from internal nodes to leaf nodes at the bottom.]
Decision Trees
Decision trees are a very simple and useful way to build models.

Key ideas
• Stratify the input space into distinct, non-overlapping regions
• Assign a chosen, constant prediction to each region
A One-Dimensional Example
Suppose we want to approximate some oracle function f: ℝ → ℝ.

A depth-1 decision tree is the piecewise constant function

    f̂(x) = a if x ≤ θ₀,   f̂(x) = b if x > θ₀

This corresponds to a decision tree with a single split at x = θ₀:
the branch x ≤ θ₀ leads to the leaf value a, and the branch x > θ₀
leads to the leaf value b.
We can also further split the input space to form deeper trees
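
As a concrete illustration, here is a minimal sketch of fitting such a depth-1 tree (a regression "stump") by brute-force search over thresholds. The helper name fit_depth1_tree and the toy sine data are illustrative assumptions, not part of the lecture.

import numpy as np

def fit_depth1_tree(x, y):
    # Brute-force fit of a depth-1 regression tree ("stump"):
    # try every midpoint between sorted sample positions as the threshold
    # theta_0, set the leaf values a, b to the mean label on each side,
    # and keep the split with the smallest squared error.
    xs = np.sort(x)
    best = None
    for theta in (xs[:-1] + xs[1:]) / 2:
        left, right = y[x <= theta], y[x > theta]
        if len(left) == 0 or len(right) == 0:
            continue
        a, b = left.mean(), right.mean()
        err = ((left - a) ** 2).sum() + ((right - b) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, theta, a, b)
    _, theta0, a, b = best
    return theta0, a, b

# Toy usage: approximate a smooth oracle with a single split
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
theta0, a, b = fit_depth1_tree(x, y)
print(theta0, a, b)   # prediction is a for x <= theta0, b otherwise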
Classification and Regression Trees
Suppose that the input space is X ⊆ ℝᵈ. A partition of X is a collection of
subsets R₁, …, R_M such that

    R_i ∩ R_j = ∅ for i ≠ j,   and   R₁ ∪ ⋯ ∪ R_M = X

The general decision tree hypothesis space is

    H = { f : f(x) = Σ_{m=1}^{M} c_m 1[x ∈ R_m],  c_m ∈ ℝ, {R_m} a partition of X }

In theory, we can consider very general partitions, but in practice
it is convenient to restrict to high-dimensional rectangles
Learning Decision Trees
A decision tree model

    f(x) = Σ_{m=1}^{M} c_m 1[x ∈ R_m]

depends on both the partition {R_m} and the leaf values {c_m}.

Given {R_m}, the {c_m} are easy to fix:
• Regression: we take the average label value in each region,
    c_m = average of y_i over the samples with x_i ∈ R_m
• Classification: we take the modal (most common) label value in each region

Suppose we are dealing with regression. Then we can fix {c_m} as
above and solve the following empirical risk minimization over partitions:

    min over {R_m} of  Σ_{m=1}^{M} Σ_{i: x_i ∈ R_m} ( y_i − c_m )²

Even restricting to rectangular partitions, this is very hard to solve!

[Figure: growth of the Bell numbers, which count the number of ways to partition a set. Source: https://en.wikipedia.org/wiki/Bell_number]


Recursive Binary Splitting
Instead, we can resort to the following greedy algorithm, which
essentially repeats the following two steps:

1. Pick a dimension of the input space (randomly or …)

2. Find the best value at which to split this input dimension into
two parts, and assign new constant values to these new regions

This grows the tree by adding two leaf nodes at a time, and hence
the name recursive binary splitting. A sketch of this procedure is
given below.
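
A compact sketch of recursive binary splitting for regression, assuming numpy arrays X of shape (n, d) and y of shape (n,). The function names grow_tree and best_split, and the stopping parameters max_depth and min_samples, are illustrative choices rather than anything specified in the lecture.

import numpy as np

def best_split(X, y, dim):
    # Best threshold on feature `dim` for a two-leaf (piecewise constant)
    # fit, measured by the total squared error. Returns (error, threshold).
    values = np.unique(X[:, dim])
    best_err, best_theta = np.inf, None
    for theta in (values[:-1] + values[1:]) / 2:   # candidate midpoints
        left, right = y[X[:, dim] <= theta], y[X[:, dim] > theta]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_err, best_theta = err, theta
    return best_err, best_theta

def grow_tree(X, y, depth=0, max_depth=3, min_samples=5):
    # Recursive binary splitting: greedily pick the dimension and threshold
    # that reduce the squared error the most, then recurse on the two
    # child regions. Leaves store the mean label of their region.
    if depth >= max_depth or len(y) < min_samples:
        return {"leaf": y.mean()}
    candidates = [(*best_split(X, y, d), d) for d in range(X.shape[1])]
    err, theta, dim = min(candidates, key=lambda c: c[0])
    if theta is None:                      # no valid split found
        return {"leaf": y.mean()}
    mask = X[:, dim] <= theta
    return {"dim": dim, "theta": theta,
            "left": grow_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}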

Does this find the optimal solution?


Greedy vs Optimal Solution
Decision Trees for Classification
The greedy algorithm can be carried out analogously, except that
we need to define a proper loss function for classification. Common
choices are the misclassification error, the Gini index and the
cross-entropy, e.g.

    G_m = Σ_k p_mk (1 − p_mk)   (Gini index)

where p_mk is the proportion of samples in region R_m belonging to class k.
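
For instance, a small sketch of the Gini index of a region and of a binary split (the function names are illustrative):

import numpy as np

def gini(labels):
    # Gini index of a set of class labels: sum_k p_k * (1 - p_k).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p * (1 - p)).sum())

def split_impurity(labels_left, labels_right):
    # Impurity of a binary split: sample-weighted average of the two leaves.
    n_l, n_r = len(labels_left), len(labels_right)
    return (n_l * gini(labels_left) + n_r * gini(labels_right)) / (n_l + n_r)

# Example: a pure split has impurity 0, a fully mixed region has impurity 0.5
print(split_impurity(np.array([0, 0, 0]), np.array([1, 1])))   # 0.0
print(gini(np.array([0, 1, 0, 1])))                            # 0.5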


Advantages and Disadvantages
Advantages:
• Can readily visualize and understand predictions
• Implicit feature selection via analyzing contribution of splits to
reduction of error/impurity
• Robust to data types, supervised learning tasks and nonlinear
relationships
Disadvantages:
• Greedy algorithms may find sub-optimal solutions
• Sensitive to data variation and balancing
• Prone to overfitting
Overfitting
The biggest drawback of decision trees is overfitting.
Model Ensembling
Ensemble Methods
An effective way to reduce overfitting and increase approximation
power is to combine models. This is called model ensembling.

We will now introduce two classes of such methods


1. Bagging
2. Boosting
Bootstrap Aggregating (Bagging)
The first method for combining models is also the simplest: we
simply train M models on random subsamples of the training data.
We can then combine them in the obvious way to make
predictions:
1. Regression: average the individual predictions, f̂(x) = (1/M) Σ_m f̂_m(x)

2. Classification: take a majority vote over the M predicted classes
Example
Dataset:

Subsample and train:


1.
2.
3.

Aggregate:
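
Below is a minimal sketch of bagging for regression, assuming scikit-learn is available for the base trees. The function names bagging_fit and bagging_predict, the bootstrap-by-index sampling, and the toy data are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_models=50, max_depth=3, seed=0):
    # Train n_models trees, each on a bootstrap sample (drawn with
    # replacement) of the training data, and return the list of models.
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap indices
        tree = DecisionTreeRegressor(max_depth=max_depth)
        models.append(tree.fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Regression aggregation: average the individual predictions.
    return np.mean([m.predict(X) for m in models], axis=0)

# Toy usage on noisy 1D data
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(200)
models = bagging_fit(X, y, n_models=25)
print(bagging_predict(models, X[:5]))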
What does bagging do?
Consider a simple model where each trained predictor equals the truth plus noise,

    f̂_m(x) = f(x) + ε_m(x),   m = 1, …, M

Assume the noise satisfies

• E[ε_m(x)] = 0 and E[ε_m(x)²] = σ²
• Uncorrelated: E[ε_m(x) ε_{m'}(x)] = 0 for m ≠ m'

Form the aggregate model

    f̂_bag(x) = (1/M) Σ_{m=1}^{M} f̂_m(x)

Define the errors as the expected squared deviations from f(x).
Each individual model has error E[ε_m(x)²] = σ², while we can show that

    E[ ( f̂_bag(x) − f(x) )² ] = E[ ( (1/M) Σ_m ε_m(x) )² ] = σ² / M

A significant reduction! But…

• What is the most unrealistic assumption?
• What happens when there is bias, so that E[ε_m(x)] ≠ 0?
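
A quick Monte Carlo sanity check of this σ²/M claim, using independent Gaussian noise to stand in for the uncorrelated errors (a sketch, not from the lecture):

import numpy as np

# Averaging M uncorrelated, zero-mean noise terms of variance sigma^2
# should give an aggregate squared error of sigma^2 / M.
rng = np.random.default_rng(0)
M, sigma, trials = 10, 1.0, 100_000
eps = rng.normal(0.0, sigma, size=(trials, M))     # independent noise per model
agg_error = (eps.mean(axis=1) ** 2).mean()         # estimate of E[(f_bag - f)^2]
print(agg_error, sigma**2 / M)                     # both approximately 0.1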
Bias and Variance
[Figure: the error of the model f̂ relative to the target f is decomposed into a variance part and a bias / approximation error part. Bagging addresses the variance part; the question mark asks what can address the bias.]
Boosting
Unlike bagging, whose purpose is to reduce variance, boosting
aims to reduce bias.

It answers an important question in the affirmative:

    Can weak learners be combined to form a strong learner?

We will introduce the simplest setting of the Adaptive Boosting, or
AdaBoost, algorithm.
Key Ideas of AdaBoost
1. Initialize with uniform weights across all training samples
2. Train a classifier/regressor on the weighted dataset
3. Identify the samples that it got wrong (classification) or on
which it has large errors (regression)
4. Weight these samples more heavily and train on this
reweighted dataset
5. Repeat steps 3–4

A minimal sketch of this loop for binary classification is given below.
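
This sketch implements the discrete AdaBoost variant for labels in {−1, +1}, using depth-1 scikit-learn trees ("stumps") as the weak learners. The function names and the specific weight-update formula follow the standard AdaBoost recipe and are not taken from the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    # Discrete AdaBoost for labels y in {-1, +1}, with depth-1 trees
    # ("stumps") as the weak learners.
    n = len(y)
    w = np.full(n, 1.0 / n)                         # 1. uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)            # 2. train on weighted data
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)   # 3. weighted error
        alpha = 0.5 * np.log((1 - err) / err)       # this weak learner's vote
        w *= np.exp(-alpha * y * pred)              # 4. upweight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Strong classifier: sign of the weighted vote of the weak learners.
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)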
Bagging reduces variance
Boosting reduces bias
Demo: Decision Trees with Ensembling
Cross Validation
Recall: test data is the ultimate test of our model performance.

But should we always rely on test data for model evaluation and
selection?
1. It is not always available
2. It gives no average error estimate
3. We might “overfit” on the test data

A more robust idea: cross validation

Given a training set, we can further split it into training and
validation datasets. In fact, we can do this K times.

K-Fold Cross-Validation

Fold 1:  Validation | Training   | Training   | Training    → Score 1
Fold 2:  Training   | Validation | Training   | Training    → Score 2
Fold 3:  Training   | Training   | Validation | Training    → Score 3
Fold 4:  Training   | Training   | Training   | Validation  → Score 4

The final estimate is the average of the K fold scores.
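
A minimal sketch of the K-fold loop, assuming generic user-supplied fit and score callables (these names are illustrative; libraries such as scikit-learn provide the same functionality via KFold and cross_val_score):

import numpy as np

def k_fold_scores(X, y, fit, score, k=4, seed=0):
    # K-fold cross-validation: split the training set into k folds, train on
    # k-1 of them, score on the held-out fold, and average the k scores.
    # `fit(X_train, y_train)` returns a model; `score(model, X_val, y_val)`
    # returns a number (e.g. validation accuracy or negative MSE).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                  # shuffle before splitting
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[val], y[val]))
    return float(np.mean(scores)), scores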


Summary
Decision trees
• Piece-wise constant predictors
• Learn by greedy algorithms

Model Ensembling
• Bagging: reduce variance
• Boosting: reduce bias

Evaluate models using cross validation


Homework and Project
Homework 2 is online (Due in 2 weeks)

Project
• Incorporate cross-validation as a more robust model
evaluation technique
• Ensembling methods
• Tune hyperparameters using cross validation

Test
• Week 7
