Week 09, Lesson 1: Introduction to Machine Learning

The document provides an overview of machine learning (ML) concepts, including terminology, data types, and the data science process. It distinguishes between supervised and unsupervised learning, outlining various methods and applications of ML, such as spam filtering and credit card fraud detection. Additionally, it discusses model training, testing, overfitting, and the importance of regularization and validation in building effective ML models.


Machine Learning

Basic Concepts

[Figure: examples plotted by Feature 1 and Feature 2, separated by a decision boundary.]
Terminology
Machine Learning, Data Science, Data Mining, Data Analysis, Statistical Learning, Knowledge Discovery in Databases, Pattern Discovery.
Data everywhere!
1. Google: processes 24 petabytes of data per day.
2. Facebook: 10 million photos uploaded every hour.
3. YouTube: 1 hour of video uploaded every second.
4. Twitter: 400 million tweets per day.
5. Astronomy: satellite data is in hundreds of PB.
6. . . .
7. "By 2020 the digital universe will reach 44 zettabytes..."
The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, April 2014.
That's 44 trillion gigabytes!

Data types
Data comes in different sizes and also flavors (types):
Texts

Numbers

Clickstreams
Graphs

Tables

Images

Transactions

Videos

Some or all of the above!

Smile, we are ’DATAFIED’ !


• Wherever we go, we are “datafied”.

• Smartphones are tracking our locations.

• We leave a data trail in our web browsing.

• We interact in social networks.


• Privacy is an important issue in Data Science.

The Data Science process

[Diagram: the data science pipeline. (1) Data collection (databases, static data); (2) Data preparation: data cleaning, feature/variable engineering; (3) Exploratory data analysis (EDA): descriptive statistics, clustering, visualization, research questions, domain expertise; (4) Machine learning: classification, scoring, predictive models, clustering, density estimation, etc.; (5) Application deployment: dashboards, a model f producing a predicted class/risk, data-driven decisions.]

Applications of ML
We all use it on a daily basis. Examples:
• Spam filtering
• Credit card fraud detection
• Digit recognition on checks, zip codes
• Detecting faces in images
• MRI image analysis
• Recommendation systems
• Search engines
• Handwriting recognition
• Scene classification
• etc.

Interdisciplinary field
ML sits at the intersection of statistics, biology, engineering, economics, visualization, databases, and signal processing.

ML versus Statistics
Statistics:
• Hypothesis testing
• Experimental design
• ANOVA
• Linear regression
• Logistic regression
• GLM
• PCA

Machine Learning:
• Decision trees
• Rule induction
• Neural networks
• SVMs
• Clustering methods
• Association rules
• Feature selection
• Visualization
• Graphical models
• Genetic algorithms

http://statweb.stanford.edu/~jhf/ftp/dm-stat.pdf

Machine Learning definition

"How do we create computer programs that improve with experience?"
Tom Mitchell
http://videolectures.net/mlas06_mitchell_itm/

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Tom Mitchell. Machine Learning, 1997.

Supervised vs. Unsupervised

Given: Training data (x1, y1), . . . , (xn, yn), where xi ∈ R^d and yi is the label.

example x1 →  x11  x12  . . .  x1d    y1 ← label
  ...
example xi →  xi1  xi2  . . .  xid    yi ← label
  ...
example xn →  xn1  xn2  . . .  xnd    yn ← label

Unsupervised learning:
Learning a model from unlabeled data.

Supervised learning:
Learning a model from labeled data.
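As a concrete illustration (my addition, not from the slides), labeled training data can be stored as an n × d feature matrix together with a length-n label vector; a minimal sketch assuming NumPy is available:

```python
import numpy as np

# n = 4 examples, d = 3 features: each row is one example x_i,
# and y holds the corresponding labels y_i (here +1 / -1 classes).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.3],
    [6.7, 3.1, 4.7],
    [6.3, 2.5, 5.0],
])
y = np.array([+1, +1, -1, -1])

n, d = X.shape
print(f"{n} examples, {d} features; labels: {y}")
```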
Unsupervised Learning
Training data: "examples" x.
x1, . . . , xn, with xi ∈ X ⊂ R^n

• Clustering/segmentation:
f : R^d −→ {C1, . . . , Ck} (set of clusters).

Example: Find clusters in the population, fruits, species.


Unsupervised learning

[Figure: unlabeled points plotted by Feature 1 and Feature 2, progressively grouped into clusters.]

Methods: K-means, Gaussian mixtures, hierarchical clustering, spectral clustering, etc.
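To make the clustering step concrete, here is a minimal sketch (my illustration, assuming scikit-learn is installed; the slides do not prescribe a library) that runs K-means on a tiny two-feature dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups in the (Feature 1, Feature 2) plane.
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # points around (1, 1)
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],   # points around (5, 5)
])

# Fit K-means with k = 2 clusters and report each point's cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster ids are arbitrary)
```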

Supervised learning
Training data: "examples" x with "labels" y.
(x1, y1), . . . , (xn, yn), where xi ∈ R^d

• Classification: y is discrete. To simplify, y ∈ {−1, +1}.
f : R^d −→ {−1, +1}; f is called a binary classifier.
Example: approve credit yes/no, spam/ham, banana/orange.

Supervised learning

[Figure: labeled examples plotted by Feature 1 and Feature 2, separated by a decision boundary.]

Methods: Support Vector Machines, neural networks, decision trees, K-nearest neighbors, naive Bayes, etc.

Supervised learning
Classification:

[Figure: several two-class datasets in the Feature 1 / Feature 2 plane, each with its own decision boundary.]

Supervised learning
Non-linear classification
Supervised learning
Training data: "examples" x with "labels" y.
(x1, y1), . . . , (xn, yn), where xi ∈ R^d

• Regression: y is a real value, y ∈ R.
f : R^d −→ R; f is called a regressor.
Example: amount of credit, weight of fruit.

Supervised learning
Regression:

[Figure: a real-valued target y plotted against Feature 1, with a fitted regression curve.]

Example: income as a function of age, weight of the fruit as a function of its length.
Training and Testing

[Diagram: a training set is fed to an ML algorithm, which produces a model f; a new input (income, gender, age, family status, zipcode) is then passed through f to predict the credit decision (credit yes/no, credit amount).]
K-nearest neighbors
• Not every ML method builds a model!
• Our first ML method: KNN.
• Main idea: uses the similarity between examples.
• Assumption: two similar examples should have the same label.
• Assumes all examples (instances) are points in the d-dimensional space R^d.

K-nearest neighbors
• KNN uses the standard Euclidean distance to define nearest neighbors. Given two examples xi and xj:

d(xi, xj) = sqrt( Σ_{k=1}^{d} (xik − xjk)^2 )

K-nearest neighbors
Training algorithm:
Add each training example (x, y) to the dataset D, where x ∈ R^d and y ∈ {+1, −1}.

Classification algorithm:
Given an example xq to be classified, let Nk(xq) be the set of the K nearest neighbors of xq. Then:

ŷq = sign( Σ_{xi ∈ Nk(xq)} yi )
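The two routines above fit in a few lines of code. Below is a minimal NumPy sketch of this K-NN rule (my illustration; the slides specify the algorithm but no implementation), using the Euclidean distance defined earlier and the sign-of-summed-labels prediction:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label (+1/-1) of x_query from its k nearest neighbors."""
    # Euclidean distance from x_query to every training example.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training examples.
    neighbors = np.argsort(dists)[:k]
    # Majority vote via the sign of the summed neighbor labels (ties -> +1).
    vote = y_train[neighbors].sum()
    return 1 if vote >= 0 else -1

# Tiny labeled training set in R^2.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array([+1, +1, +1, -1, -1, -1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # +1
print(knn_predict(X_train, y_train, np.array([4.0, 4.0]), k=3))  # -1
```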

K-nearest neighbors

[Figure: 3-NN example. Credit: Introduction to Statistical Learning.]

Question: Draw an approximate decision boundary for K = 3.

[Figure: K-NN decision boundary. Credit: Introduction to Statistical Learning.]

K-nearest neighbors
Question: What are the pros and cons of K-NN?

Pros:
+ Simple to implement.
+ Works well in practice.
+ Does not require building a model, making assumptions, or tuning parameters.
+ Can be extended easily with new examples.

Cons:
- Requires large space to store the entire training dataset.
- Slow! Given n examples and d features, the method takes O(n × d) time per query.
- Suffers from the curse of dimensionality.

Applications of K-NN
1. Information retrieval.
2. Handwritten character classification using nearest neighbors in large databases.
3. Recommender systems (users like you may like similar movies).
4. Breast cancer diagnosis.
5. Medical data mining (similar patient symptoms).
6. Pattern recognition in general.

Training and Testing

[Diagram: a training set is fed to an ML algorithm, which produces a model f; a new input (income, gender, age, family status, zipcode) is passed through f to predict the credit decision (credit yes/no, credit amount).]

Question: How can we be confident about f?

Training and Testing

• We calculate E_train(f), the in-sample error (training error, or empirical error/risk):

E_train(f) = Σ_{i=1}^{n} loss(yi, f(xi))

• Examples of loss functions:
– Classification error: loss(yi, f(xi)) = 1 if sign(yi) ≠ sign(f(xi)), 0 otherwise.
– Least squares loss: loss(yi, f(xi)) = (yi − f(xi))^2

• We aim to have E_train(f) small, i.e., we minimize E_train(f).

• We hope that E_test(f), the out-of-sample error (test/true error), will be small too.
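To make the definitions concrete, here is a small NumPy sketch (my addition, not from the slides) that computes the training error under both loss functions for a given set of predictions f(xi):

```python
import numpy as np

def classification_error(y, y_pred):
    """0/1 loss summed over the training set: count of sign mismatches."""
    return int(np.sum(np.sign(y) != np.sign(y_pred)))

def squared_error(y, y_pred):
    """Least squares loss summed over the training set."""
    return float(np.sum((y - y_pred) ** 2))

# Toy labels and raw predictions f(x_i) for n = 5 training examples.
y      = np.array([+1, -1, +1, +1, -1], dtype=float)
f_of_x = np.array([+0.8, -0.3, -0.2, +1.5, -0.9])

print("E_train (0/1 loss)    :", classification_error(y, f_of_x))  # 1 mismatch
print("E_train (squared loss):", squared_error(y, f_of_x))
```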
Overfitting/underfitting: an intuitive example

Structural Risk Minimization

[Figure: prediction error versus model complexity. The training error keeps decreasing as complexity grows, while the test error is U-shaped: underfitting (high bias, low variance) on the left, good models in the middle, overfitting (low bias, high variance) on the right.]

Training and Testing

[Figure: three models fit to the same data: a too-simple fit with high bias (underfitting), a fit that is just right, and a too-complex fit with high variance (overfitting).]

Avoid overfitting
In general, use simple models!

• Reduce the number of features manually or do feature selection.
• Do model selection (ML course).
• Use regularization (keep the features but reduce their importance by setting small parameter values) (ML course).
• Do cross-validation to estimate the test error.

Regularization: Intuition
We want to minimize:

Classification term + C × Regularization term

Σ_{i=1}^{n} loss(yi, f(xi)) + C × R(f)

Regularization: Intuition

[Figure: the same dataset fit with the three models below, from a straight line to a high-degree polynomial.]

f(x) = λ0 + λ1·x                                  (1)
f(x) = λ0 + λ1·x + λ2·x^2                         (2)
f(x) = λ0 + λ1·x + λ2·x^2 + λ3·x^3 + λ4·x^4       (3)

Hint: Avoid high-degree polynomials.
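A short sketch of that hint (my own example, with made-up data): fitting models (1)–(3) to a few noisy points shows the training error always falling as the degree grows, which is exactly why training error alone cannot be trusted:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)  # roughly linear data plus noise

for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, deg=degree)   # least-squares fit of that degree
    y_hat = np.polyval(coeffs, x)
    train_err = np.sum((y - y_hat) ** 2)
    print(f"degree {degree}: training error = {train_err:.4f}")

# The degree-4 model gets the lowest training error, but it is the one most
# likely to overfit; regularization would shrink its higher-order coefficients.
```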

Train, Validation and Test

TRAIN | VALIDATION | TEST

Example: Split the data randomly into 60% for training, 20% for validation, and 20% for testing.

1. The training set is a set of examples used for learning a model (e.g., a classification model).

2. The validation set is a set of examples that cannot be used for learning the model but can help tune model parameters (e.g., selecting K in K-NN). Validation helps control overfitting.

3. The test set is used to assess the performance of the final model and provide an estimate of the test error.

Note: Never use the test set in any way to further tune the parameters or revise the model.
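One way to realize that 60/20/20 split (an illustrative sketch assuming NumPy; the slides do not mandate a specific recipe):

```python
import numpy as np

def train_val_test_split(X, y, seed=0):
    """Randomly split (X, y) into 60% train, 20% validation, 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffle example indices
    n_train = int(0.6 * len(X))
    n_val = int(0.2 * len(X))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

X = np.arange(20).reshape(10, 2)             # 10 toy examples, 2 features
y = np.array([+1, -1] * 5)
(train, val, test) = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 6 2 2
```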

K-fold Cross Validation

A method for estimating the test error using training data.

Algorithm:
Given a learning algorithm A and a dataset D:

Step 1: Randomly partition D into k equal-size subsets D1, . . . , Dk.

Step 2: For j = 1 to k:
    Train A on all Di with i ∈ {1, . . . , k} and i ≠ j, and get fj.
    Apply fj to Dj and compute the error E_Dj.

Step 3: Average the error over all folds:

(1/k) Σ_{j=1}^{k} E_Dj
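A minimal NumPy sketch of Steps 1–3 (my illustration; train_fn stands in for any learning algorithm A and error_fn for the chosen loss, both placeholder names not taken from the slides):

```python
import numpy as np

def k_fold_cv(X, y, train_fn, error_fn, k=5, seed=0):
    """Estimate the test error by k-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # Step 1: random partition
    errors = []
    for j in range(k):                                    # Step 2: hold out fold j
        test_idx = folds[j]
        train_idx = np.concatenate([folds[i] for i in range(k) if i != j])
        f_j = train_fn(X[train_idx], y[train_idx])        # train on the other folds
        errors.append(error_fn(f_j, X[test_idx], y[test_idx]))
    return np.mean(errors)                                # Step 3: average over folds

# Example with a trivial "predict the majority training label" learner.
train_fn = lambda X, y: (1 if y.sum() >= 0 else -1)
error_fn = lambda f, X, y: np.mean(y != f)                # misclassification rate

X = np.arange(24).reshape(12, 2)
y = np.array([+1] * 7 + [-1] * 5)
print(k_fold_cv(X, y, train_fn, error_fn, k=4))
```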

Confusion matrix

                                   Actual label
                           Positive                 Negative
Predicted   Positive       True Positive (TP)       False Positive (FP)
label       Negative       False Negative (FN)      True Negative (TN)

Evaluation metrics

Accuracy             = (TP + TN) / (TP + TN + FP + FN)   The percentage of predictions that are correct.
Precision            = TP / (TP + FP)                    The percentage of positive predictions that are correct.
Sensitivity (Recall) = TP / (TP + FN)                    The percentage of positive cases that were predicted as positive.
Specificity          = TN / (TN + FP)                    The percentage of negative cases that were predicted as negative.
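For completeness, a tiny sketch (my addition, not from the slides) that derives these metrics from true and predicted +1/−1 labels:

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Compute TP/FP/FN/TN counts and the four metrics for +1/-1 labels."""
    tp = int(np.sum((y_pred == +1) & (y_true == +1)))
    fp = int(np.sum((y_pred == +1) & (y_true == -1)))
    fn = int(np.sum((y_pred == -1) & (y_true == +1)))
    tn = int(np.sum((y_pred == -1) & (y_true == -1)))
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),        # sensitivity
        "specificity": tn / (tn + fp),
    }

y_true = np.array([+1, +1, +1, -1, -1, -1, -1, -1])
y_pred = np.array([+1, +1, -1, -1, -1, -1, +1, -1])
print(confusion_metrics(y_true, y_pred))
```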

Terminology review
Review the concepts and terminology:

Instance, example, feature, label, supervised learning, unsupervised learning, classification, regression, clustering, prediction, training set, validation set, test set, K-fold cross validation, classification error, loss function, overfitting, underfitting, regularization.

Machine Learning Books

1. Tom Mitchell. Machine Learning.
2. Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning From Data. AMLBook.
3. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
4. Christopher Bishop. Pattern Recognition and Machine Learning.
5. Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley.

Machine Learning Resources

• Major journals/conferences: ICML, NIPS, UAI, ECML/PKDD, JMLR, MLJ, etc.
• Machine learning video lectures:
http://videolectures.net/Top/Computer_Science/Machine_Learning/
• Machine Learning (Theory):
http://hunch.net/
• LinkedIn ML groups: "Big Data" Scientist, etc.
• Women in Machine Learning:
https://groups.google.com/forum/#!forum/women-in-machine-learning
• KDnuggets: http://www.kdnuggets.com/

Credit
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 10th Edition, 2009.
• Tom Mitchell. Machine Learning. 1997.
