© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Introduction to ML and Decision Tree
Suman Debnath
Principal Developer Advocate | AWS
“AI is the new ‘Electricity’.”
Before We Begin...
• Artificial Intelligence – Basically a computer program doing something “smart”
  – A bunch of if-then statements
  – Machine Learning
• Machine Learning (subset of AI) – A broad umbrella term for technology that finds patterns in your existing data and uses them to make predictions on new data points
  – Fraud detection
  – Deep Learning
• Deep Learning (subset of ML) – Uses deep neural networks (a shallow network has one hidden layer, a deep network has more than one) to learn features of the data in a hierarchical manner (e.g. pixels from one layer recombine to form a line in the next layer)
  – Computer vision
  – Speech recognition
  – Natural language processing
AI | ML | DL – Maybe a picture is better?
Great Resource:
The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
by Pedro Domingos
Timeline Of Machine Learning
1950 – The Learning Machine (Alan Turing)
1952 – Machine playing checkers (Arthur Samuel)
1957 – Perceptron (Frank Rosenblatt)
1979 – Stanford Cart
1986 – Backpropagation (D. Rumelhart, G. Hinton, R. Williams)
1997 – Deep Blue beats Kasparov
2011 – Watson wins Jeopardy
2012 – Google NN recognizes a cat on YouTube
2014 – Facebook DeepFace, Amazon Echo, Turing Test passed
2016 – DeepMind wins at Go
Explosion in AI and ML Use Cases
Image recognition and tagging for photo organization
Object detection, tracking and navigation for Autonomous Vehicles
Speech recognition & synthesis in Intelligent Voice Assistants
Algorithmic trading strategy performance improvement
Sentiment analysis for targeted advertisements
43,252,003,274,489,856,000
43 QUINTILLION UNIQUE COMBINATIONS
A scrambled cube (e.g. F2 U' R' L F2 R L' U') goes into a learning function, which proposes a solving sequence.
At first the learning function is only 1–2% accurate; with feedback it improves through 20%, 40%, 60%, 80%, and finally 95% accuracy.
Given a new scramble (F2 R F R′ B′ D F D′ B D F):
SOLVED IN 0.9 SECONDS
Don’t code the patterns
Let the system
Learn Through Data
We Call This Approach Machine Learning
Types Of Machine Learning
Supervised Learning (depends on labeled datasets)
“Baby?” – “No, it’s a Labrador.”
Supervised Learning – How Machines Learn
Human intervention and validation required (e.g. photo classification and tagging)
Training data: Input + Label (“Labrador”) → Machine Learning Algorithm → Prediction (“Cat”) → compared against the label (“Labrador”) → Adjust Model
Unsupervised Learning (learning without labels)
No human intervention required (e.g. customer segmentation)
Input → Machine Learning Algorithm → Prediction
Machine Learning Use Cases
Supervised Learning
• Classification
  – Spam detection
  – Customer churn prediction
• Regression
  – House price prediction
  – Demand forecasting
Unsupervised Learning
• Clustering
  – Customer segmentation
There are other types as well
(Reinforcement Learning, for example)
but these two are the primary areas today
There are Lots of Machine Learning Algorithms
machinelearningmastery.com
Color Size Fruit
Red Big Apple
Red Small Apple
Yellow Small Lemon
Red Big Apple
Green Big Apple
Yellow Big Lemon
Green Small Lemon
Red Big Apple
Yellow Big Lemon
Green Big Apple
Input features: Color, Size | Target label: Fruit
Some Dataset
A Decision Tree might look like this:
Root: Color of the fruit?
  – Red → Apple (Leaf)
  – Yellow → Lemon (Leaf)
  – Green → Size of the fruit? (another split)
      – Big → Apple (Leaf)
      – Small → Lemon (Leaf)
The top node is the Root, the outgoing edges are Branches, each question is a Splitting node, and the terminal nodes are Leaves.
But the question is… given a dataset, how can we build a tree like this?
General DT structure
A Root node at the top, Interior (decision) nodes beneath it, and Leaf nodes terminating every path.
Training flow of a Decision Tree
• Prepare the labelled dataset
• Pick the best feature as the root node
• Grow the tree until a stopping criterion is met
• Pass the prediction query through the tree until we arrive at a leaf
• Once we reach the leaf node, we have the prediction!! :) (a minimal code sketch follows)
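As a concrete illustration of that flow, here is a minimal sketch using pandas and scikit-learn on the fruit example above; the library choice and the one-hot encoding step are assumptions on top of the slides, not the speaker's exact code.

```python
# Minimal sketch: train a decision tree on the fruit dataset, then query it.
# Assumes pandas and scikit-learn are installed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Color": ["Red", "Red", "Yellow", "Red", "Green",
              "Yellow", "Green", "Red", "Yellow", "Green"],
    "Size":  ["Big", "Small", "Small", "Big", "Big",
              "Big", "Small", "Big", "Big", "Big"],
    "Fruit": ["Apple", "Apple", "Lemon", "Apple", "Apple",
              "Lemon", "Lemon", "Apple", "Lemon", "Apple"],
})

X = pd.get_dummies(data[["Color", "Size"]])   # one-hot encode the categorical features
y = data["Fruit"]                             # target label

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)   # grow the tree

# Pass a prediction query (a green, small fruit) through the tree.
query = pd.get_dummies(pd.DataFrame([{"Color": "Green", "Size": "Small"}]))
query = query.reindex(columns=X.columns, fill_value=0)
print(tree.predict(query))   # expected to be 'Lemon' for this training data
```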
Training data: Feature 1, Feature 2, Feature 3, Feature 4, Target Label – everything is known
Prediction data: Feature 1 to Feature 4 are known, the Target Label is UNKNOWN (???)
Send the query/inference through the trained tree (Root → Interior nodes → Leaf) and get the prediction.
Math behind Decision Tree
• Entropy
• Information Gain (IG)
Entropy
Entropy is the notion of the impurity of the data – how mixed the class labels are. A set with only one class is pure, a set dominated by one class is less pure, and an evenly mixed set is impure.
Entropy
H(x) = − ∑ P(k) · log2(P(k)), with k ranging from 1 through n
H(x) = entropy of x
P(k) = probability of the random variable x when x = k
Outlook Temperature Humidity Windy Play ball
Rainy Hot High FALSE No
Rainy Hot High TRUE No
Overcast Hot High FALSE Yes
Sunny Mild High FALSE Yes
Sunny Cool Normal FALSE Yes
Sunny Cool Normal TRUE No
Overcast Cool Normal TRUE Yes
Rainy Mild High FALSE No
Rainy Cool Normal FALSE Yes
Sunny Mild Normal FALSE Yes
Rainy Mild Normal TRUE Yes
Overcast Mild High TRUE Yes
Overcast Hot Normal FALSE Yes
Sunny Mild High TRUE No
Dataset – D
X = “Play Ball”
P(k=Yes) => 9/14 = 0.64
P(k=No) => 5/14 = 0.36
log2 (0.64) = -0.64
log2 (0.36) = -1.47
H(x) = - ∑ P(k) * log2(P(k))
H(x) = - [P(k=Yes) * log2(P(k=Yes)) + P(k=No) * log2(P(k=No))]
H(x) = - [(0.64 * log2 (0.64) + 0.36 * log2(0.36))]
H(x) = 0.94
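A quick numeric check of the same calculation in Python (a minimal sketch; the counts 9 Yes / 5 No come from the table above):

```python
# Entropy of the "Play Ball" column in dataset D (9 Yes, 5 No).
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

print(round(entropy(["Yes"] * 9 + ["No"] * 5), 2))  # 0.94
```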
Information Gain (IG)
Split Dataset D on Outlook:

Sub-Dataset – D1 (Outlook = Rainy)
Outlook Temperature Humidity Windy Play ball
Rainy Hot High FALSE No
Rainy Hot High TRUE No
Rainy Mild High FALSE No
Rainy Cool Normal FALSE Yes
Rainy Mild Normal TRUE Yes
HD1(“Play Ball”) = 0.97 (weight 5/14)

Sub-Dataset – D2 (Outlook = Overcast)
Outlook Temperature Humidity Windy Play ball
Overcast Hot High FALSE Yes
Overcast Cool Normal TRUE Yes
Overcast Mild High TRUE Yes
Overcast Hot Normal FALSE Yes
HD2(“Play Ball”) = 0 (weight 4/14)

Sub-Dataset – D3 (Outlook = Sunny)
Outlook Temperature Humidity Windy Play ball
Sunny Mild High FALSE Yes
Sunny Cool Normal FALSE Yes
Sunny Cool Normal TRUE No
Sunny Mild Normal FALSE Yes
Sunny Mild High TRUE No
HD3(“Play Ball”) = 0.97 (weight 5/14)

Weighted Entropy = (5/14) × 0.97 + (4/14) × 0 + (5/14) × 0.97 = 0.69

IG(Outlook) = Entropy(D) − Weighted Entropy
            = 0.94 − 0.69
            = 0.25
Maximum IG? → Outlook

IG(Outlook)     = HD(“Play Ball”) − Weighted Entropy after splitting the dataset on Outlook
                = 0.94 − 0.69 = 0.25
IG(Temperature) = HD(“Play Ball”) − Weighted Entropy after splitting the dataset on Temperature
                = 0.94 − 0.91 = 0.03
IG(Humidity)    = HD(“Play Ball”) − Weighted Entropy after splitting the dataset on Humidity
                = 0.94 − 0.79 = 0.15
IG(Windy)       = HD(“Play Ball”) − Weighted Entropy after splitting the dataset on Windy
                = 0.94 − 0.90 = 0.04
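The same weighted-entropy and IG arithmetic can be checked in code; a minimal sketch over dataset D (the helper names are illustrative, not from the slides):

```python
# Minimal sketch: weighted entropy and Information Gain over dataset D.
# The rows below are the "Play Ball" table from the slides.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# (Outlook, Temperature, Humidity, Windy, Play ball)
D = [
    ("Rainy", "Hot", "High", False, "No"),       ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Sunny", "Mild", "High", True, "No"),
]

def information_gain(rows, feature_idx):
    parent = entropy([r[-1] for r in rows])           # H_D("Play Ball") = 0.94
    weighted = 0.0
    for value in set(r[feature_idx] for r in rows):
        subset = [r[-1] for r in rows if r[feature_idx] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted

for idx, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(name, round(information_gain(D, idx), 2))   # Outlook is the largest, ~0.25
```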
Here are the algorithmic steps:
1. First, the entropy of the total dataset is calculated for the target label/class.
2. The dataset is then split on each feature:
   a) The entropy of each branch is calculated, then added proportionally to get the total weighted entropy for the split.
   b) The resulting entropy is subtracted from the entropy before the split.
   c) The result is the Information Gain.
3. The feature that yields the largest IG is chosen for the decision node.
4. Steps #2 and #3 are repeated for each subset of the data (for each internal node) until:
   a) all the features are exhausted, or
   b) the stopping criteria are met (see the sketch below).
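These steps translate into a compact recursive sketch (ID3-style, categorical features only; an illustration of the steps above, not a production implementation):

```python
# Recursive sketch of the steps above: pick the highest-IG feature, split, repeat.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, idx):
    """IG of splitting `rows` (tuples ending in the label) on column `idx`."""
    parent = entropy([r[-1] for r in rows])
    weighted = sum(
        len(sub) / len(rows) * entropy([r[-1] for r in sub])
        for v in set(r[idx] for r in rows)
        for sub in [[r for r in rows if r[idx] == v]]
    )
    return parent - weighted

def build_tree(rows, features):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not features:          # step 4: stopping criteria
        return Counter(labels).most_common(1)[0][0]    # leaf = majority label
    best = max(features, key=lambda idx: information_gain(rows, idx))  # steps 2-3
    rest = [f for f in features if f != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v], rest)
                   for v in set(r[best] for r in rows)}}   # recurse per branch
```

Calling build_tree on the Play Ball rows with feature indices [0, 1, 2, 3] would put Outlook (index 0) at the root, matching the IG comparison above.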
Thankfully, we do not have to do all of this by hand (calculating entropy, IG, etc.); there are plenty of Python libraries/packages we can use to solve a problem with a decision tree.
Can you please show the CODE?
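For example, with scikit-learn (a minimal sketch; pandas/scikit-learn availability and the one-hot encoding are assumptions on top of the slides):

```python
# Minimal sketch: a scikit-learn decision tree on the "Play Ball" dataset.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame(
    [("Rainy", "Hot", "High", False, "No"),
     ("Rainy", "Hot", "High", True, "No"),
     ("Overcast", "Hot", "High", False, "Yes"),
     ("Sunny", "Mild", "High", False, "Yes"),
     ("Sunny", "Cool", "Normal", False, "Yes"),
     ("Sunny", "Cool", "Normal", True, "No"),
     ("Overcast", "Cool", "Normal", True, "Yes"),
     ("Rainy", "Mild", "High", False, "No"),
     ("Rainy", "Cool", "Normal", False, "Yes"),
     ("Sunny", "Mild", "Normal", False, "Yes"),
     ("Rainy", "Mild", "Normal", True, "Yes"),
     ("Overcast", "Mild", "High", True, "Yes"),
     ("Overcast", "Hot", "Normal", False, "Yes"),
     ("Sunny", "Mild", "High", True, "No")],
    columns=["Outlook", "Temperature", "Humidity", "Windy", "PlayBall"],
)

X = pd.get_dummies(df.drop(columns="PlayBall"))    # one-hot encode the features
y = df["PlayBall"]

model = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(model, feature_names=list(X.columns)))   # inspect the learned splits
```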
AWS ML Stack
Broadest and most complete set of Machine Learning capabilities

AI SERVICES
  Vision: Amazon Rekognition
  Speech: Amazon Polly, Amazon Transcribe (+ Medical)
  Text: Amazon Textract, Amazon Comprehend, Amazon Translate
  Chatbots: Amazon Lex
  Personalization: Amazon Personalize
  Forecasting: Amazon Forecast

AMAZON SAGEMAKER
  Ground Truth data labelling, ML Marketplace, SageMaker Studio IDE
  SageMaker Notebooks, SageMaker Experiments, SageMaker Debugger, SageMaker Autopilot, SageMaker Model Monitor
  Model training, Model tuning, Model hosting, Built-in algorithms, SageMaker Neo

ML FRAMEWORKS & INFRASTRUCTURE
  Deep Learning AMIs & Containers, GPUs and CPUs, Inferentia, Elastic Inference, FPGA
Amazon SageMaker
• Jupyter notebooks
• Supports JupyterLab
• Multiple built-in kernels
• Bring your own kernels
• Integration with Git
• Sample notebooks
Reference
Blog: An Introduction to Decision Tree and Ensemble Methods – Part 1
Code: Repository
Stay Connected
/suman-d /_sumand
Suman Debnath
Principal Developer Advocate
ml.aws
Backup
Amazon SageMaker
Accessing the Internet and other resources from your SageMaker Notebook
(Architecture diagram: the notebook’s ENI sits in a customer VPC with public and private subnets, reaching the Internet through an Internet Gateway/NAT, reaching on-premises resources through a VPN Gateway, and reaching the SageMaker endpoint inside the Amazon VPC.)
Amazon SageMaker
Permissions Model
(Diagram: a User with an IAM Role calls CreateNotebookInstance, CreateHyperParameterTuningJob, CreateTrainingJob, and CreateModel; SageMaker passes an execution role and works with services such as Amazon EC2, Amazon Elastic Container Registry, Amazon CloudWatch, AWS CodeCommit, Amazon Simple Storage Service, AWS RoboMaker, AWS Lambda, AWS Key Management, and AWS Identity Management.)
Model Training
1. Split the labelled dataset: 70% training data, 30% test data
2. Train with the training data → trial model
3. Hold out the test data
4. Model evaluation: run the trial model on the test data → evaluation result
5. Performance measurement: compare predictions with the known labels → model accuracy
(a minimal code sketch follows)
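A minimal sketch of that flow with scikit-learn (the iris dataset stands in for “all labelled data”; it is not part of the original slides):

```python
# Minimal sketch of the train/evaluate flow: 70/30 split, fit, then measure accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                    # any labelled dataset works here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)           # 70% training, 30% test

trial_model = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
predictions = trial_model.predict(X_test)            # evaluate on the held-out test data
print("Accuracy:", accuracy_score(y_test, predictions))
```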
