Artificial Intelligence: Learning from Examples
AIMA Chapter 19
Slides by Michael Hahsler
Based on slides by Dan Klein, Pieter Abbeel, Sergey Levine and A. Farhadi (http://ai.berkeley.edu)
with figures from the AIMA textbook.
Outline:
• ML & Agents
• Types of Data
• Supervised Learning
• Supervised ML Models
• Training & Testing
• Use in AI
Learning from Examples: Machine Learning
Up until now in this course:
• Hand-craft algorithms to make rational/optimal or at least good decisions.
Examples: Search strategies, heuristics.
Issues
• Designer cannot anticipate all possible future situations.
• Designer may have examples but does not know how to program a solution.
Machine Learning
• Learning = Improve performance after making observations about the world.
That is, learn what works and what doesn’t.
• We learn a model that decides on the actions to take. This is called the
“performance element.”
• The goal is to get closer to optimal decisions. I.e., it is an optimization problem.
From Chapter 2: Agents that Learn
The learning element modifies the performance element to improve its
performance.
• Supervised Learning: Uses a data set with correct answers. Learn a function (model) to map an input (e.g., state) to an output (e.g., action or utility). We focus here on supervised learning.
  Examples:
  • Use a naïve Bayesian classifier to distinguish between spam/no spam.
  • Learn a playout policy to simulate games (current board -> good move).
• Reinforcement Learning: Learn from rewards/punishment (e.g., winning a game) obtained via interaction with the environment over time.
Supervised Learning
• Examples
  • We assume there exists a target function y = f(x) that produces iid (independent and identically distributed) examples, possibly with noise and errors.
  • Examples are observed input-output pairs E = (x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N), where x is a vector called the feature vector.
• Learning problem
  • Given a hypothesis space H of representable models (a subset of the set of all functions).
  • Find a hypothesis h ∈ H such that ŷ_i = h(x_i) ≈ y_i for all i.
  • That is, we want to approximate f by h using E.
• Supervised learning includes
  • Classification (outputs = class labels). E.g., x is an email and f(x) is spam/ham.
  • Regression (outputs = real numbers). E.g., x is a house and f(x) is its selling price.
Consistency vs. Simplicity
Example: Univariate curve fitting (regression, function approximation)
[Figure: examples (points) and learned models h(x) approximating the target function f(x). A straight line is very simple, but not very consistent with the data!]
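The consistency/simplicity trade-off can be illustrated with a small curve-fitting sketch (illustrative only; the data, noise level, and polynomial degrees below are assumptions, not from the slides). A degree-1 polynomial is simple but not very consistent with the examples; a degree-9 polynomial fits them almost perfectly but chases the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target function f(x) observed with noise
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Two hypotheses of different complexity (polynomial degree)
h_simple = np.polynomial.Polynomial.fit(x, y, deg=1)   # simple: a straight line
h_complex = np.polynomial.Polynomial.fit(x, y, deg=9)  # complex: interpolates the noise

# Training error: the complex model is far more "consistent" with the examples,
# but it will generalize worse to new samples from f
print("line     MSE:", np.mean((h_simple(x) - y) ** 2))
print("degree-9 MSE:", np.mean((h_complex(x) - y) ** 2))
```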
Bayes Classifier

h*(x) = argmax_y P(Y = y | X = x) = argmax_y P(x | y) P(y) / P(x) = argmax_y P(x | y) P(y)

Optimality: The Bayes classifier is optimal for 0/1 loss. It is the most consistent classifier possible, with the lowest possible error, called the Bayes error rate. No better classifier is possible!

Issue: The classifier requires learning P(x | y) P(y) = P(x, y) from the examples.
• It needs the complete joint probability, which in the general case requires a probability table with one entry for each possible value of the feature vector x.
• This is impractical (unless a simple Bayes network exists), and most classifiers try to approximate the Bayes classifier using a simpler model with fewer parameters.
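As a minimal sketch of what the Bayes classifier computes, consider a made-up joint probability table over one binary feature and two classes (all numbers are assumptions for illustration; with many features such a table becomes impractically large):

```python
# Made-up joint probability table P(x, y) for one binary feature and two classes
P_joint = {
    ("contains 'free'", "spam"): 0.20,
    ("contains 'free'", "ham"):  0.05,
    ("no 'free'",       "spam"): 0.10,
    ("no 'free'",       "ham"):  0.65,
}

def bayes_classify(x, classes=("spam", "ham")):
    # argmax_y P(Y=y | X=x) equals argmax_y P(x, y) because P(x) does not depend on y
    return max(classes, key=lambda y: P_joint[(x, y)])

print(bayes_classify("contains 'free'"))  # -> spam
print(bayes_classify("no 'free'"))        # -> ham
```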
Simplicity
• Ease of use: Simpler hypotheses have fewer model parameters to estimate and store.
[Figure: points show two samples drawn from the same function f to illustrate variance.]
Examples (also called instances or observations):
Find a hypothesis (called “model”) to predict the class given the features.
Feature Engineering
• Add information sources as new variables to the model.
• Add derived features that help the classifier (e.g., x1*x2, x1^2); see the sketch below.
• Embedding: E.g., convert words to vectors such that similarity between vectors reflects semantic similarity.
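A sketch of adding derived features with scikit-learn (the use of PolynomialFeatures and the toy values are assumptions; any way of computing x1*x2 and x1^2 works):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy data: two original features x1, x2 for three examples
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Add all degree-2 derived features (x1^2, x1*x2, x2^2) to help a linear classifier
poly = PolynomialFeatures(degree=2, include_bias=False)
X_derived = poly.fit_transform(X)

print(poly.get_feature_names_out(["x1", "x2"]))  # ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
print(X_derived)
```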
• Testing loss: Calculate the empirical loss for predictions on a testing data set T that is different from the data used for training.

  EmpLoss_{L,T}(h) = 1/|T| * Σ_{(x,y) ∈ T} L(y, h(x))

• For classification we often use the accuracy measure, the proportion of correctly classified test examples.

  accuracy(h, T) = 1/|T| * Σ_{(x,y) ∈ T} [h(x) = y] = 1 − EmpLoss_{L0/1,T}(h)
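A minimal sketch of these two quantities in code (the toy model h and the test set T below are made up for illustration):

```python
# Empirical loss of model h on a test set T of (x, y) pairs
def empirical_loss(h, T, loss):
    return sum(loss(y, h(x)) for x, y in T) / len(T)

def zero_one_loss(y_true, y_pred):
    return 0 if y_pred == y_true else 1

def accuracy(h, T):
    return sum(h(x) == y for x, y in T) / len(T)

# Toy classifier: predict spam if the message contains the word "free"
h = lambda x: "spam" if "free" in x else "ham"
T = [("free money now", "spam"), ("meeting at noon", "ham"), ("free lunch", "ham")]

print(accuracy(h, T))                           # 2/3
print(1 - empirical_loss(h, T, zero_one_loss))  # same value: accuracy = 1 - 0/1 loss
```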
• Notes:
  • The validation set was not used for training, so we get generalization accuracy for the different hyperparameter settings.
  • If no model selection is necessary, then no validation set is used.
Testing a Model
• After the model is selected, the final model is evaluated against the test set to estimate the final model accuracy.
• Very important: never "peek" at the test set during training!
How to Split the Dataset
• Random splits: Split the data randomly into, e.g., 60% training, 20% validation, and 20% testing.
• Stratified splits: Like random splits, but balance classes and other properties of the examples.
• k-fold cross validation: Uses the training & validation data better (see the sketch below).
  • Split the training & validation data randomly into k folds.
  • For k rounds, hold one fold back for testing and use the remaining k − 1 folds for training.
  • Use the average error/accuracy as a better estimate.
  • Some algorithms/tools do this internally.
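A sketch of a stratified split plus k-fold cross-validation using scikit-learn (the library, dataset, and hyperparameter values are assumptions made for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a stratified 20% test set; never use it for training or model selection
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Model selection: 5-fold cross-validation on the training & validation data
for k in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_trainval, y_trainval, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")

# Only at the very end: evaluate the selected model once on the test set
best = KNeighborsClassifier(n_neighbors=5).fit(X_trainval, y_trainval)
print("test accuracy:", best.score(X_test, y_test))
```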
Linear Regression

Analytical solution: w* = (X^T X)^{-1} X^T y, where (X^T X)^{-1} X^T is the pseudoinverse of the data matrix X.
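A minimal NumPy sketch of this analytical solution (the data is made up; in practice np.linalg.lstsq or np.linalg.pinv is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=50)   # assumed true weights (3, 2) plus noise

X = np.column_stack([np.ones_like(x), x])            # design matrix with a bias column

w_star = np.linalg.inv(X.T @ X) @ X.T @ y            # w* = (X^T X)^{-1} X^T y
w_pinv = np.linalg.pinv(X) @ y                       # same result via the pseudoinverse

print(w_star)   # approximately [3., 2.]
print(w_pinv)
```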
Naïve Bayes Classifier
• Approximates a Bayes classifier with the naïve independence assumption that all n features are conditionally independent given the class:

  h(x) = argmax_y P(y) ∏_{i=1}^{n} P(x_i | y)

• For continuous features, the parameters of the normal distribution N(μ_y, σ_y) are estimated from the data.
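A sketch using scikit-learn's Gaussian naïve Bayes (the choice of library and dataset are assumptions; it estimates a per-class normal distribution for each feature):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)   # estimates mu_y, sigma_y per class and feature
print("per-class feature means:", nb.theta_.shape)   # (n_classes, n_features)
print("test accuracy:", nb.score(X_test, y_test))
```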
Decision Trees

k-Nearest Neighbors Classifier
• The class is predicted by looking at the majority in the set of the k nearest neighbors. k is a hyperparameter; larger k smooths the decision boundary.
• Neighbors are found using a distance measure (e.g., Euclidean distance between points).
• Approximates a Bayes classifier by h(x) = argmax_y P(Y = y | neighborhood(x))
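A sketch of a kNN classifier with scikit-learn (library, dataset, and k = 5 are assumptions for illustration):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)        # "training" essentially just stores the examples
print("test accuracy:", knn.score(X_test, y_test))
```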
Support Vector Machine (SVM)
[Figure: maximum-margin decision boundary; the margin and the support vectors are marked.]
• Linear classifier that finds the maximum margin separator using only the points
that are “support vectors” and quadratic optimization.
• The kernel trick can be used to learn non-linear decision boundaries.
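A sketch of an SVM with scikit-learn (library, dataset, and the RBF kernel are assumptions; the kernel gives a non-linear decision boundary):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("number of support vectors:", svm.n_support_.sum())
print("test accuracy:", svm.score(X_test, y_test))
```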
Artificial Neural Networks / Deep Learning

[Figure: computational graph with an input layer, a hidden layer, and an output layer. Each unit (perceptron) computes a weighted sum with a bias term followed by a non-linear activation function; for classification, the output layer typically uses a softmax activation function returning P(y|x).]

• Represent ŷ = h(x) as a network of weighted sums with non-linear activation functions g (e.g., logistic, ReLU).
• Learn the weights w from examples using backpropagation of the prediction errors L(ŷ, y) (gradient descent).
• ANNs are universal approximators: large networks can approximate any function (no bias). Regularization is typically used to avoid overfitting.
• Deep learning adds more hidden layers and layer types (e.g., convolution layers) for better learning.
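A minimal NumPy sketch of what the forward pass of such a network computes (sizes and random weights are made up; learning the weights by backpropagation is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Assumed sizes: 4 input features, 8 hidden units, 3 classes
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

def h(x):
    hidden = relu(W1 @ x + b1)           # weighted sums + non-linear activation
    return softmax(W2 @ hidden + b2)     # output read as P(y | x) over 3 classes

x = rng.normal(size=4)                   # one feature vector
print(h(x), h(x).sum())                  # class probabilities summing to 1
```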
Many other models exist

Use in AI
• Learn a policy: Directly learn the best action from examples, action = h(state). This model can also be used as a playout policy for Monte Carlo tree search with data from self-play.
• Learn an evaluation function for states, eval = h(state). Can learn a heuristic for minimax search from examples.
• Natural language processing: Use deep learning / word embeddings / language models to understand concepts, translate between languages, or generate text. Speech recognition: identify the most likely sequence of words. Vision: object recognition in images/videos; generate images/videos.
• Neural networks can be used as a compact representation of tables that do not fit in memory, e.g., a joint probability table or a state utility table. The tables can be learned from data.