19_ML_intro

The document discusses machine learning, particularly supervised learning, which involves improving decision-making through learning from examples. It covers concepts such as the performance element, types of learning (supervised, unsupervised, reinforcement), and the importance of model simplicity and generalization. Additionally, it outlines the training and testing processes, including model evaluation, hyperparameter tuning, and the significance of dataset splitting.


CS 5/7320

Artificial Intelligence

Learning
from Examples
AIMA Chapter 19
Slides by Michael Hahsler
Based on slides by Dan Klein, Pieter Abbeel, Sergey
Levine and A. Farhadi (http://ai.berkeley.edu)
with figures from the AIMA textbook.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Topics

• ML & Agents
• Supervised Learning
• Data
• Types of supervised ML models
• Training & Testing
• Use in AI
Learning from Examples: Machine Learning
Up until now in this course:
• We hand-crafted algorithms to make rational/optimal, or at least good, decisions.
  Examples: search strategies, heuristics.

Issues
• The designer cannot anticipate all possible future situations.
• The designer may have examples but does not know how to program a solution.

Machine Learning
• Learning = improving performance after making observations about the world.
  That is, learn what works and what doesn't.
• We learn a model that decides on the actions to take. This model is called the
  "performance element."
• The goal is to get closer to optimal decisions, i.e., learning is an optimization problem.
From Chapter 2: Agents that Learn
The learning element modifies the performance element to improve its performance.
(Figure: the learning agent from AIMA Chapter 2.)

• How is the agent currently performing?
• Update the performance element and change how it selects actions
  (e.g., by adding rules or changing weights).
• Exploration: select actions that lead to better information.
Ways of Using Machine Learning
• What component of the performance element is learned?
  E.g., how to select an action, how to estimate the utility of a state, …
• What representation (model) is used in that component?
  E.g., linear regression, rules, trees, neural nets, …
• What feedback is available for learning?
  • Unsupervised learning: no feedback; just organize the data (e.g., clustering, embedding).
  • Supervised learning: uses a data set with correct answers. Learn a function (model) that
    maps an input (e.g., a state) to an output (e.g., an action or a utility). We focus on
    supervised learning here. Examples:
    – Use a naïve Bayes classifier to distinguish between spam/no spam.
    – Learn a playout policy to simulate games (current board → good move).
  • Reinforcement learning: learn from rewards/punishments (e.g., winning a game) obtained
    via interaction with the environment over time.
Supervised Learning
• Examples
  • We assume there exists a target function y = f(x) that produces iid (independent
    and identically distributed) examples, possibly with noise and errors.
  • Examples are observed input-output pairs E = {(x_1, y_1), …, (x_i, y_i), …, (x_N, y_N)},
    where x is a vector called the feature vector.
• Learning problem
  • Given a hypothesis space H of representable models (a subset of the set of all functions).
  • Find a hypothesis h ∈ H such that ŷ_i = h(x_i) ≈ y_i for all i.
  • That is, we want to approximate f by h using E.
• Supervised learning includes
  • Classification (outputs = class labels). E.g., x is an email and f(x) is spam/ham.
  • Regression (outputs = real numbers). E.g., x is a house and f(x) is its selling price.
Consistency vs. Simplicity
Example: univariate curve fitting (regression, function approximation).
(Figure: example points y vs. x drawn from f(x), with learned models h(x) fit to them;
a straight line is very simple but not very consistent with the data.)

• Consistency: h(x_i) ≈ y_i
• Simplicity: a small number of model parameters
Measuring Consistency using Loss
Goal of learning: find a hypothesis that makes predictions that are consistent with
the examples E = {(x_1, y_1), …, (x_i, y_i), …, (x_N, y_N)}.
That is, ŷ = h(x) ≈ y.

• Measure mistakes with a loss function L(y, ŷ):
  • Absolute-value loss: L_1(y, ŷ) = |y − ŷ|                 (for regression)
  • Squared-error loss: L_2(y, ŷ) = (y − ŷ)²                 (for regression)
  • 0/1 loss: L_0/1(y, ŷ) = 0 if y = ŷ, else 1               (for classification)
  • Log loss, cross-entropy loss, and many others.

• Empirical loss: the average loss over the N examples in the dataset:
  EmpLoss_{L,E}(h) = (1/|E|) Σ_{(x,y)∈E} L(y, h(x))
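To make these definitions concrete, here is a minimal Python sketch (the function names are our own, not from the slides) of the three losses and the empirical loss:

```python
# Minimal sketch of the losses above and the empirical loss over a dataset E.
def l1_loss(y, y_hat):            # absolute-value loss (regression)
    return abs(y - y_hat)

def l2_loss(y, y_hat):            # squared-error loss (regression)
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):      # 0/1 loss (classification)
    return 0 if y == y_hat else 1

def empirical_loss(loss, h, E):
    """Average loss of hypothesis h over examples E = [(x, y), ...]."""
    return sum(loss(y, h(x)) for x, y in E) / len(E)

# Toy example: evaluate the hypothesis h(x) = 2x on three noisy examples.
E = [(1.0, 2.1), (2.0, 3.8), (3.0, 6.3)]
print(empirical_loss(l2_loss, lambda x: 2 * x, E))   # mean squared error
```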
Learning a Consistent h by Minimizing the Loss
• Empirical loss:
  EmpLoss_{L,E}(h) = (1/|E|) Σ_{(x,y)∈E} L(y, h(x))

• Find the best hypothesis, the one that minimizes the loss:
  h* = argmin_{h∈H} EmpLoss_{L,E}(h)

• Reasons why h* ≠ f:
  a) Realizability: f ∉ H.
  b) f is nondeterministic or the examples are noisy.
  c) It is computationally intractable to search all of H, so we use a non-optimal heuristic.
The Bayes Classifier
For 0/1 loss, the empirical loss is minimized by the model that predicts for each x the
most likely class y using the MAP (maximum a posteriori) estimate. This is called the
Bayes classifier:

  h*(x) = argmax_y P(Y = y | X = x) = argmax_y P(x | y) P(y) / P(x) = argmax_y P(x | y) P(y)

Optimality: the Bayes classifier is optimal for 0/1 loss. It is the most consistent
classifier possible, with the lowest possible error, called the Bayes error rate.
No better classifier is possible!

Issue: the classifier requires learning P(x | y) P(y) = P(x, y) from the examples.
• It needs the complete joint probability, which in the general case requires a probability
  table with one entry for each possible value of the feature vector x.
• This is impractical (unless a simple Bayes network exists), so most classifiers try to
  approximate the Bayes classifier using a simpler model with fewer parameters.
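As a toy illustration (the joint probability table below is invented for the sketch), a Bayes classifier over a full joint table P(x, y) just looks up argmax_y P(x, y), because P(x) is constant for a fixed x:

```python
# Toy Bayes classifier from a full joint probability table P(x, y).
# The table is invented for illustration; note that it needs one entry per
# (feature value, class) combination, which is what makes this impractical.
joint = {
    ("red",   "apple"): 0.30, ("red",   "cherry"): 0.20,
    ("green", "apple"): 0.45, ("green", "cherry"): 0.05,
}
classes = ("apple", "cherry")

def bayes_classify(x):
    # argmax_y P(y | x) = argmax_y P(x, y), since P(x) does not depend on y
    return max(classes, key=lambda y: joint.get((x, y), 0.0))

print(bayes_classify("red"))     # -> 'apple'
```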
Simplicity
• Ease of use: simpler hypotheses have fewer model parameters to estimate and store.

• Generalization: how well does the hypothesis perform on new data?
  • We do not want the model to be too specific to the training examples (an issue
    called overfitting).
  • Simpler models typically generalize better to new examples.

• How to achieve simplicity? (A small sketch of option (c) follows below.)
  a) Model bias: restrict H to simpler models (e.g., assumptions like independence;
     only consider linear models).
  b) Feature selection: use fewer variables from the feature vector x.
  c) Regularization: penalize the model for its complexity (e.g., its number of
     parameters) with a penalty term:
     h* = argmin_{h∈H} EmpLoss_{L,E}(h) + λ Complexity(h)
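A small sketch of option (c) for a linear model, with Complexity(h) measured by the squared weights (this particular choice gives ridge regression; the data below is invented):

```python
import numpy as np

# Regularized loss: EmpLoss_{L2,E}(h) + lam * Complexity(h), with
# Complexity(h) = ||w||^2 for a linear model h(x) = w . x (ridge regression).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=20)

for lam in (0.0, 0.1, 1.0):
    # Closed-form minimizer of (1/N)||Xw - y||^2 + lam * ||w||^2
    w = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(3), X.T @ y)
    print(lam, w.round(2))   # larger lam shrinks the weights (simpler model)
```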
Overfitting

Model Selection: Bias vs. Variance
(Figure: two samples of points drawn from the same function f, to show variance; the
lines are the learned functions h. Models range from simpler to more consistent.)

• Bias: restrictions imposed by the model class. Simpler models have high bias.
• Variance: differences in the learned model due to slightly different data. More
  consistent (more complex) models have high variance.
• This is a tradeoff: low bias typically comes with high variance, and vice versa.
Data

The Dataset
(Figure: a table where rows are the examples (instances, observations), columns are
the feature vector x (features, variables, attributes), and the last column is the
class label y.)

Find a hypothesis (called a "model") to predict the class given the features.
Feature Engineering
• Add information sources as new variables to the model.
• Add derived features that help the classifier (e.g., x_1 x_2, x_1²).
• Embedding: e.g., convert words to vectors such that similarity between vectors
  reflects semantic similarity.

• Example for spam detection, in addition to the words themselves:
  • Have you emailed the sender before?
  • Have 1000+ other people just gotten the same email?
  • Is the header information consistent?
  • Is the email in ALL CAPS?
  • Do inline URLs point where they say they point?
  • Does the email address you by (your) name?

• Feature selection: deciding which features should be used in the model is a model
  selection problem (choose between models with different features).
Training and Testing

Model Evaluation (Testing)
The model was trained on the training examples E. We want to test how well the model
performs on new examples T (i.e., how well it generalizes to new data).

• Testing loss: calculate the empirical loss for predictions on a testing data set T
  that is different from the data used for training:
  EmpLoss_{L,T}(h) = (1/|T|) Σ_{(x,y)∈T} L(y, h(x))

• For classification we often use the accuracy measure, the proportion of correctly
  classified test examples:
  accuracy(h, T) = (1/|T|) Σ_{(x,y)∈T} [h(x) = y] = 1 − EmpLoss_{L_0/1,T}(h)
  where [c] is an indicator function that returns 1 if c is true and 0 otherwise.
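A minimal evaluation sketch with scikit-learn (assuming it is installed; the bundled iris data and a naïve Bayes model stand in for E, T, and h):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
# Hold out a test set T that is disjoint from the training examples E.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

h = GaussianNB().fit(X_train, y_train)            # train on E only
print(accuracy_score(y_test, h.predict(X_test)))  # = 1 - EmpLoss_{L0/1,T}(h)
```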


Training a Model
• Models are "trained" (learned) on the training data. This involves estimating:
  1. Model parameters (the model itself): e.g., probabilities, weights, factors.
  2. Hyperparameters: many learning algorithms have choices such as the learning
     rate, the regularization weight λ, the maximal decision tree depth, the
     selected features, … The algorithm optimizes the model parameters given
     user-specified hyperparameters.

• We need to tune the hyperparameters!
(Figure: the dataset split into training data and test data.)
Hyperparameter Tuning/Model Selection
1. Hold a validation data set back from the training data.
2. Learn models on the training set with different hyperparameters. Often a grid of
   possible hyperparameter combinations or some greedy search is used.
3. Evaluate the models on the validation data and choose the model with the best
   accuracy. Selecting the right type of model, hyperparameters, and features is
   called model selection.
4. Learn the final model with the chosen hyperparameters using all training data
   (including the validation data).
(Figure: the training data split further into training data and validation data,
with the test data held out.)

• Notes (a sketch of the procedure follows below):
  • The validation set was not used for training, so it gives us generalization
    accuracy for the different hyperparameter settings.
  • If no model selection is necessary, then no validation set is used.
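A sketch of steps 1-4 with an explicit validation split (scikit-learn's GridSearchCV automates this with cross-validation; here it is spelled out by hand for clarity):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# 1. Hold a validation set back from the training data.
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

# 2./3. Learn models over a grid of hyperparameters; pick the best on validation data.
best_depth, best_acc = None, -1.0
for depth in (1, 2, 3, 5, 10):                  # grid for one hyperparameter
    h = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    acc = h.score(X_val, y_val)                 # generalization accuracy estimate
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# 4. Refit the final model on all training data (training + validation).
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_trval, y_trval)
```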
Testing a Model
• After the model is selected, the final model is evaluated against the test set to
  estimate the final model accuracy.
• Very important: never "peek" at the test set during training!
How to Split the Dataset
• Random splits: split the data randomly into, e.g., 60% training, 20% validation,
  and 20% testing.

• Stratified splits: like random splits, but balance the classes and other properties
  of the examples.

• k-fold cross-validation: makes better use of the training & validation data.
  • Split the training & validation data randomly into k folds.
  • For k rounds, hold one fold back for validation and use the remaining k − 1 folds
    for training.
  • Use the average error/accuracy as a better estimate.
  • Some algorithms/tools do this internally.

• LOOCV (leave-one-out cross-validation): k = n; used if very little data is available.
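A k-fold cross-validation sketch with scikit-learn (k = 5 here):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
# 5 rounds: each fold is held back for validation exactly once.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores.mean(), scores.std())   # the average is the better estimate
```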
Learning Curve: The Effect of Training Data Size
(Figure: accuracy of a classifier as the amount of available training data increases.)

• More data is better!
• At some point the learning curve flattens out, and more data does not contribute much.
Comparing to Baselines
• First step: get a baseline.
  • Baselines are very simple straw-man models.
  • They help determine how hard the task is.
  • They help find out what a good accuracy is.

• Weak baseline: the most-frequent-label classifier (see the sketch below).
  • Gives every test instance whatever label was most common in the training set.
  • Example: for spam filtering, give every message the label "ham."
  • Accuracy might be very high if the problem is skewed (called class imbalance).
  • Example: if calling everything "ham" already gets 66% right, then a classifier
    that gets 70% isn't very good.

• Strong baseline: for research, we typically compare to the previously published
  state of the art as a baseline.
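scikit-learn ships the weak baseline directly as DummyClassifier; a quick sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)     # a class-imbalanced toy dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Most-frequent-label classifier: predicts the majority class for everything.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print(baseline.score(X_te, y_te))   # any real classifier must beat this accuracy
```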
Types of Models
• Regression: predict a number.
• Classification: predict a label.
Regression: Linear Regression
Model: h_w(x_j) = w_0 + w_1 x_{j,1} + ⋯ + w_n x_{j,n} = Σ_i w_i x_{j,i} = wᵀ x_j

Empirical loss: the squared-error loss over the whole data matrix X,
  L(w) = ‖Xw − y‖²

Gradient: a vector of partial derivatives,
  ∇L(w) = (∂L/∂w_1, ∂L/∂w_2, …, ∂L/∂w_n)ᵀ = 2 Xᵀ(Xw − y)

Find w with ∇L(w) = 0:
• Gradient descent: repeat w ← w − α ∇L(w)
• Analytical solution: w* = (XᵀX)⁻¹ Xᵀ y, where (XᵀX)⁻¹ Xᵀ is the pseudoinverse of X.
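Both solution strategies in a few lines of NumPy (the toy data is invented; the first column of 1s supplies the intercept w_0):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # column of 1s for w_0
true_w = np.array([3.0, 1.5, -2.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)    # noisy targets

# Analytical solution via the pseudoinverse: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.pinv(X) @ y

# Gradient descent on L(w) = ||Xw - y||^2 with gradient 2 X^T (Xw - y)
w, alpha = np.zeros(3), 0.005
for _ in range(2000):
    w = w - alpha * 2 * X.T @ (X @ w - y)

print(w_star.round(3), w.round(3))   # both approach [3, 1.5, -2]
```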
Naïve Bayes Classifier
• Approximates the Bayes classifier with the naïve independence assumption that all n
  features are conditionally independent given the class:

  h(x) = argmax_y P(y) Π_{i=1..n} P(x_i | y)

  The P(y)s and the P(x_i | y)s are estimated from the data by counting.

• Gaussian naïve Bayes classifiers extend the approach to continuous features by
  assuming

  P(x_i | y) ~ N(μ_y, σ_y)

  The parameters of the normal distribution N(μ_y, σ_y) are estimated from the data.
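A counting-based naïve Bayes sketch for categorical features (the tiny spam dataset is invented, and Laplace smoothing is omitted for brevity):

```python
from collections import Counter, defaultdict

# Invented toy data: each example is ([feature values], label).
E = [(["free", "caps"], "spam"), (["free", "lower"], "spam"),
     (["work", "lower"], "ham"), (["work", "caps"], "ham"),
     (["work", "lower"], "ham")]

prior = Counter(y for _, y in E)          # counts for P(y)
cond = defaultdict(Counter)               # counts for P(x_i | y)
for x, y in E:
    for i, v in enumerate(x):
        cond[(i, y)][v] += 1

def nb_classify(x):
    def score(y):                         # P(y) * prod_i P(x_i | y)
        p = prior[y] / len(E)
        for i, v in enumerate(x):
            p *= cond[(i, y)][v] / prior[y]
        return p
    return max(prior, key=score)

print(nb_classify(["free", "caps"]))      # -> 'spam'
```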
Decision Trees
• A sequence of decisions represented as a tree.
• Many implementations exist that differ by:
  • How are features selected for splitting?
  • When does splitting stop?
  • Is the tree pruned?
• Approximates a Bayes classifier by
  h(x) = argmax_y P(Y = y | leafNodeMatching(x))
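A quick scikit-learn sketch; export_text prints the learned sequence of decisions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# max_depth stops splitting early: a simplicity bias against overfitting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # the tree as nested if/else decisions on features
```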
K-Nearest Neighbors Classifier
• The class is predicted by taking the majority among the set of the k nearest
  neighbors. k is a hyperparameter; larger k smooths the decision boundary.
• Neighbors are found using a distance measure (e.g., the Euclidean distance between
  points).
• Approximates a Bayes classifier by
  h(x) = argmax_y P(Y = y | neighborhood(x))
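k-NN is simple enough to sketch directly in NumPy (majority vote over the k nearest training points; the 2-D data is invented):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Majority vote among the k training points nearest to x (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every example
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_classify(np.array([0.5, 0.5]), X_train, y_train))   # -> 'a'
```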
Support Vector Machine (SVM)
(Figure: a linear decision boundary with its margin; the examples on the margin are
the support vectors.)

• A linear classifier that finds the maximum-margin separator, using only the points
  that are "support vectors" and quadratic optimization.
• The kernel trick can be used to learn non-linear decision boundaries.
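A scikit-learn sketch showing both the linear maximum-margin classifier and the kernel trick on data that is not linearly separable:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no linear boundary separates the two classes in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)   # maximum-margin linear separator
rbf = SVC(kernel="rbf").fit(X, y)         # kernel trick: non-linear boundary
print(linear.score(X, y), rbf.score(X, y))   # the RBF kernel fits far better
print(len(rbf.support_))                     # the support vectors found
```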
Artificial Neural Networks/Deep Learning
(Figure: a computational graph with a hidden layer, built from perceptrons: weighted
sums with a bias term and a non-linear activation function. For classification, the
output layer typically uses a softmax activation function returning P(y|x).)

• Represent ŷ = h(x) as a network of weighted sums with non-linear activation
  functions g (e.g., logistic, ReLU).
• Learn the weights w from examples using backpropagation of the prediction errors
  L(ŷ, y) (gradient descent).
• ANNs are universal approximators: large networks can approximate any function
  (no bias). Regularization is typically used to avoid overfitting.
• Deep learning adds more hidden layers and layer types (e.g., convolution layers)
  for better learning.
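A forward pass of a one-hidden-layer network with a softmax output, sketched in NumPy (the weights are random placeholders; training by backpropagation is omitted):

```python
import numpy as np

def relu(z):                  # non-linear activation function g
    return np.maximum(0, z)

def softmax(z):               # output layer: returns P(y | x)
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """h(x): hidden layer of weighted sums + ReLU, then softmax over classes."""
    hidden = relu(W1 @ x + b1)
    return softmax(W2 @ hidden + b2)

rng = np.random.default_rng(0)                  # random placeholder weights
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden -> 2 classes
print(forward(np.array([1.0, -0.5, 2.0]), W1, b1, W2, b2))   # sums to 1
```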
Other Popular Models and Methods
Many other models exist:
• Generalized linear model (GLM): this important model family includes linear
  regression and the classification method logistic regression.

Often-used methods:
• Regularization: enforce simplicity by using a penalty for complexity.
• Kernel trick: let a linear classifier learn non-linear decision boundaries
  (= a linear boundary in a high-dimensional space).
• Ensemble learning: use many models and combine the results (e.g., random forest,
  boosting).
• Embedding and dimensionality reduction: learn how to represent data in a simpler way.
Some Use Cases of ML for Intelligent Agents

• Learn actions: directly learn the best action from examples, action = h(state).
  This model can also be used as a playout policy for Monte Carlo tree search, with
  data from self-play.
• Learn heuristics: learn evaluation functions for states, eval = h(state). E.g., a
  heuristic for minimax search can be learned from examples.
• Perception:
  • Natural language processing: use deep learning / word embeddings / language
    models to understand concepts, translate between languages, or generate text.
  • Speech recognition: identify the most likely sequence of words.
  • Vision: object recognition in images/videos; generating images/video.
• Compressing tables: neural networks can be used as a compact representation of
  tables that do not fit in memory, e.g., a joint probability table or a state
  utility table. The tables can be learned from data.

Bottom line: learning a function is often more effective than hard-coding it.
However, we do not always know how it performs in very rare cases!
