
Natural Language Processing

Lecture 02: Machine Learning Basics; Text Classification

Qun Liu, Valentin Malykh


Huawei Noah’s Ark Lab

Spring 2020
A course delivered at MIPT, Moscow

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 1 / 135
Content

1 Machine Learning basics

2 Classification and logistic regression

3 Text Classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 2 / 135
Machine Learning basics

Content

1 Machine Learning basics

2 Classification and logistic regression

3 Text Classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 3 / 135
Machine Learning basics What is machine learning?

Content

1 Machine Learning basics


What is machine learning?
Machine learning – an example
Model spaces and inductive bias
Classification and regression
Overfitting and underfitting
Unsupervised learning and semi-supervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 4 / 135
Machine Learning basics What is machine learning?

What is machine learning?


— Wikipedia definition

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
It is seen as a subset of artificial intelligence.
Machine learning algorithms build a mathematical model based
on sample data, known as "training data", in order to make
predictions or decisions without being explicitly programmed to
perform the task.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 5 / 135
Machine Learning basics What is machine learning?

Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.
Machine learning is closely related to computational statistics,
which focuses on making predictions using computers.
The study of mathematical optimization delivers methods, theory
and application domains to the field of machine learning.
Data mining is a field of study within machine learning, and
focuses on exploratory data analysis through unsupervised
learning.
In its application across business problems, machine learning is
also referred to as predictive analytics.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 6 / 135
Machine Learning basics What is machine learning?

Supervised machine learning

(Supervised) Machine Learning techniques automatically learn a model of the relationship between a set of descriptive features and a target feature from a set of historical examples.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 7 / 135
Machine Learning basics What is machine learning?

Supervised machine learning

Figure: Using machine learning to induce a prediction model from a training dataset.
John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 8 / 135
Machine Learning basics What is machine learning?

Supervised machine learning

Figure: Using the model to make predictions for new query instances.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 9 / 135
Machine Learning basics What is machine learning?

ID  Occupation    Age  Loan-Salary Ratio  Outcome
1   industrial    34   2.96               repaid
2   professional  41   4.64               default
3   professional  36   3.22               default
4   professional  41   3.11               default
5   industrial    48   3.80               default
6   industrial    61   2.52               repaid
7   professional  37   1.50               repaid
8   professional  40   1.93               repaid
9   industrial    33   5.25               default
10  industrial    32   4.15               default

What is the relationship between the descriptive features (Occupation, Age, Loan-Salary Ratio) and the target feature (Outcome)?
John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 10 / 135
Machine Learning basics Machine learning – an example

Content

1 Machine Learning basics


What is machine learning?
Machine learning – an example
Model spaces and inductive bias
Classification and regression
Overfitting and underfitting
Unsupervised learning and semi-supervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 11 / 135
Machine Learning basics Machine learning – an example

ID  Occupation    Age  Loan-Salary Ratio  Outcome
1   industrial    34   2.96               repaid
2   professional  41   4.64               default
3   professional  36   3.22               default
4   professional  41   3.11               default
5   industrial    48   3.80               default
6   industrial    61   2.52               repaid
7   professional  37   1.50               repaid
8   professional  40   1.93               repaid
9   industrial    33   5.25               default
10  industrial    32   4.15               default

What is the relationship between the descriptive features (Occupation, Age, Loan-Salary Ratio) and the target feature (Outcome)?
John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 12 / 135
Machine Learning basics Machine learning – an example

Machine learning – an example

if Loan-Salary Ratio > 3 then
    Outcome = 'default'
else
    Outcome = 'repay'
end if

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 13 / 135
Machine Learning basics Machine learning – an example

Machine learning – an example

if Loan-Salary Ratio > 3 then
    Outcome = 'default'
else
    Outcome = 'repay'
end if

This is an example of a prediction model

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 14 / 135
Machine Learning basics Machine learning – an example

Machine learning – an example

if Loan-Salary Ratio > 3 then
    Outcome = 'default'
else
    Outcome = 'repay'
end if

This is an example of a prediction model


This is also an example of a consistent prediction model

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 15 / 135
Machine Learning basics Machine learning – an example

Machine learning – an example

if Loan-Salary Ratio > 3 then
    Outcome = 'default'
else
    Outcome = 'repay'
end if

This is an example of a prediction model


This is also an example of a consistent prediction model
Notice that this model does not use all the features and the
feature that it uses is a derived feature (in this case a
ratio): feature design and feature selection are two
important topics that we will return to again and again.
John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 16 / 135
Machine Learning basics Machine learning – an example

Machine learning – an example

What is the relationship between the descriptive features and the target feature (Outcome) in the following dataset?

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 17 / 135
Machine Learning basics Machine learning – an example

ID  Amount   Salary  Loan-Salary Ratio  Age  Occupation    House      Type  Outcome
1 245,100 66,400 3.69 44 industrial farm stb repaid
2 90,600 75,300 1.2 41 industrial farm stb repaid
3 195,600 52,100 3.75 37 industrial farm ftb default
4 157,800 67,600 2.33 44 industrial apartment ftb repaid
5 150,800 35,800 4.21 39 professional apartment stb default
6 133,000 45,300 2.94 29 industrial farm ftb default
7 193,100 73,200 2.64 38 professional house ftb repaid
8 215,000 77,600 2.77 17 professional farm ftb repaid
9 83,000 62,500 1.33 30 professional house ftb repaid
10 186,100 49,200 3.78 30 industrial house ftb default
11 161,500 53,300 3.03 28 professional apartment stb repaid
12 157,400 63,900 2.46 30 professional farm stb repaid
13 210,000 54,200 3.87 43 professional apartment ftb repaid
14 209,700 53,000 3.96 39 industrial farm ftb default
15 143,200 65,300 2.19 32 industrial apartment ftb default
16 203,000 64,400 3.15 44 industrial farm ftb repaid
17 247,800 63,800 3.88 46 industrial house stb repaid
18 162,700 77,400 2.1 37 professional house ftb repaid
19 213,300 61,100 3.49 21 industrial apartment ftb default
20 284,100 32,300 8.8 51 industrial farm ftb default
21 154,000 48,900 3.15 49 professional house stb repaid
22 112,800 79,700 1.42 41 professional house ftb repaid
23 252,000 59,700 4.22 27 professional house stb default
24 175,200 39,900 4.39 37 professional apartment stb default
25 149,700 58,600 2.55 35 industrial farm stb default

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 18 / 135
Machine Learning basics Machine learning – an example

Machine learning – an example

if Loan-Salary Ratio < 1.5 then
    Outcome = 'repay'
else if Loan-Salary Ratio > 4 then
    Outcome = 'default'
else if Age < 40 and Occupation = 'industrial' then
    Outcome = 'default'
else
    Outcome = 'repay'
end if

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 19 / 135
Machine Learning basics Machine learning – an example

Machine learning – an example

if Loan-Salary Ratio < 1.5 then
    Outcome = 'repay'
else if Loan-Salary Ratio > 4 then
    Outcome = 'default'
else if Age < 40 and Occupation = 'industrial' then
    Outcome = 'default'
else
    Outcome = 'repay'
end if

The real value of machine learning becomes apparent in situations like this, when we want to build prediction models from large datasets with multiple features.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 20 / 135
Machine Learning basics Model spaces and inductive bias

Content

1 Machine Learning basics


What is machine learning?
Machine learning – an example
Model spaces and inductive bias
Classification and regression
Overfitting and underfitting
Unsupervised learning and semi-supervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 21 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Machine learning algorithms work by searching through a set of possible prediction models for the model that best captures the relationship between the descriptive features and the target feature.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 22 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Machine learning algorithms work by searching through a set of possible prediction models for the model that best captures the relationship between the descriptive features and the target feature.

An obvious criterion to drive this search is to look for models that are consistent with the data.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 23 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Machine learning algorithms work by searching through a set of possible prediction models for the model that best captures the relationship between the descriptive features and the target feature.

An obvious criterion to drive this search is to look for models that are consistent with the data.

However, because a training dataset is only a sample, ML is an ill-posed problem.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 24 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Table: A simple retail dataset


ID  BBY  ALC  ORG  GRP
1   no   no   no   couple
2   yes  no   yes  family
3   yes  yes  no   family
4   no   no   yes  couple
5   no   yes  yes  single

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 25 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Table: A full set of potential prediction models before any training data becomes available.

BBY  ALC  ORG  GRP  M1      M2      M3      M4      M5      ...  M6561
no no no ? couple couple single couple couple couple
no no yes ? single couple single couple couple single
no yes no ? family family single single single family
no yes yes ? single single single single single couple
...
yes no no ? couple couple family family family family
yes no yes ? couple family family family family couple
yes yes no ? single family family family family single
yes yes yes ? single single family family couple family

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 26 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Table: A sample of the models that are consistent with the training
data

BBY  ALC  ORG  GRP  M1      M2      M3      M4      M5      ...  M6561
no no no couple couple couple single couple couple couple
no no yes couple single couple single couple couple single
no yes no ? family family single single single family
no yes yes single single single single single single couple
...
yes no no ? couple couple family family family family
yes no yes family couple family family family family couple
yes yes no family single family family family family single
yes yes yes ? single single family family couple family

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 27 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Table: A sample of the models that are consistent with the training
data

BBY  ALC  ORG  GRP  M1      M2      M3      M4      M5      ...  M6561
no no no couple couple couple single couple couple couple
no no yes couple single couple single couple couple single
no yes no ? family family single single single family
no yes yes single single single single single single couple
...
yes no no ? couple couple family family family family
yes no yes family couple family family family family couple
yes yes no family single family family family family single
yes yes yes ? single single family family couple family

Notice that there is more than one candidate model left! It


is because a single consistent model cannot be found
based on a sample training dataset that ML is ill-posed.
John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 28 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Consistency ≈ memorizing the dataset.


Consistency with noise in the data isn’t desirable.
Goal: a model that generalises beyond the dataset and
that isn’t influenced by the noise in the dataset.
So what criteria should we use for choosing between
models?

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 29 / 135
Machine Learning basics Model spaces and inductive bias

Model spaces and inductive bias

Inductive bias: the set of assumptions that define the model selection criteria of an ML algorithm.
There are two types of bias that we can use:
1 restriction bias
2 preference bias
Inductive bias is necessary for learning (beyond the
dataset).

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 30 / 135
Machine Learning basics Classification and regression

Content

1 Machine Learning basics


What is machine learning?
Machine learning – an example
Model spaces and inductive bias
Classification and regression
Overfitting and underfitting
Unsupervised learning and semi-supervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 31 / 135
Machine Learning basics Classification and regression

Classification

Table: A simple retail dataset


ID  BBY  ALC  ORG  GRP
1   no   no   no   couple
2   yes  no   yes  family
3   yes  yes  no   family
4   no   no   yes  couple
5   no   yes  yes  single

To predict a target feature with categorical values.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 32 / 135
Machine Learning basics Classification and regression

Regression

Table: The age-income dataset.


ID  Age  Income
1 21 24,000
2 32 48,000
3 62 83,000
4 72 61,000
5 84 52,000

To predict a target feature with numerical values.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 33 / 135
Machine Learning basics Overfitting and underfitting

Content

1 Machine Learning basics


What is machine learning?
Machine learning – an example
Model spaces and inductive bias
Classification and regression
Overfitting and underfitting
Unsupervised learning and semi-supervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 34 / 135
Machine Learning basics Overfitting and underfitting

Overfitting and underfitting

Table: The age-income dataset.


ID  Age  Income
1 21 24,000
2 32 48,000
3 62 83,000
4 72 61,000
5 84 52,000

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 35 / 135
Machine Learning basics Overfitting and underfitting

[Figure: the age-income dataset plotted as Income against Age.]

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 36 / 135
Machine Learning basics Overfitting and underfitting

[Figure: Income plotted against Age.]

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 37 / 135
Machine Learning basics Overfitting and underfitting

[Figure: Income plotted against Age.]

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 38 / 135
Machine Learning basics Overfitting and underfitting

[Figure: Income plotted against Age.]

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 39 / 135
Machine Learning basics Overfitting and underfitting

Overfitting and underfitting


[Figure: four Income-vs-Age panels: (a) Dataset, (b) Underfitting, (c) Overfitting, (d) Just right.]

Figure: Striking a balance between overfitting and underfitting when trying to predict income from age.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 40 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Content

1 Machine Learning basics


What is machine learning?
Machine learning – an example
Model spaces and inductive bias
Classification and regression
Overfitting and underfitting
Unsupervised learning and semi-supervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 41 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Unsupervised learning

Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data.
Since the examples given to the learner are unlabeled, there is no
error or reward signal to evaluate a potential solution.
This distinguishes unsupervised learning from supervised
learning.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 42 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Supervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 43 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Unsupervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 44 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Semi-supervised learning

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 45 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
Clustering is a typical unsupervised learning task.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 46 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Clustering – An example

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 47 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Clustering – An example

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 48 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Clustering – An example

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 49 / 135
Machine Learning basics Unsupervised learning and semi-supervised learning

Clustering – An example

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 50 / 135
Classification and logistic regression

Content

1 Machine Learning basics

2 Classification and logistic regression

3 Text Classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 51 / 135
Classification and logistic regression Classification - an example

Content

2 Classification and logistic regression


Classification - an example
Decision boundary
Model definition
Cost function
Stochastic gradient descent
Multiclass classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 52 / 135
Classification and logistic regression Classification - an example

Classification - an example

A power generator

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 53 / 135
Classification and logistic regression Classification - an example

Table: A dataset listing features for a number of generators.

ID  RPM  Vibration  Status      ID  RPM  Vibration  Status


1 568 585 good 29 562 309 faulty
2 586 565 good 30 578 346 faulty
3 609 536 good 31 593 357 faulty
4 616 492 good 32 626 341 faulty
5 632 465 good 33 635 252 faulty
6 652 528 good 34 658 235 faulty
7 655 496 good 35 663 299 faulty
8 660 471 good 36 677 223 faulty
9 688 408 good 37 685 303 faulty
10 696 399 good 38 698 197 faulty
11 708 387 good 39 699 311 faulty
12 701 434 good 40 712 257 faulty
13 715 506 good 41 722 193 faulty
14 732 485 good 42 735 259 faulty
15 731 395 good 43 738 314 faulty
16 749 398 good 44 753 113 faulty
17 759 512 good 45 767 286 faulty
18 773 431 good 46 771 264 faulty
19 782 456 good 47 780 137 faulty
20 797 476 good 48 784 131 faulty
21 794 421 good 49 798 132 faulty
22 824 452 good 50 820 152 faulty
23 835 441 good 51 834 157 faulty
24 862 372 good 52 858 163 faulty
25 879 340 good 53 888 91 faulty
26 892 370 good 54 891 156 faulty
27 913 373 good 55 911 79 faulty
28 933 330 good 56 939 99 faulty

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 54 / 135
Classification and logistic regression Classification - an example

Classification - an example

Figure: A scatter plot of the RPM and Vibration descriptive features from the generators dataset shown in Table 4 [18], where 'good' generators are shown as crosses and 'faulty' generators are shown as triangles.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics
Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 55 / 135
Classification and logistic regression Decision boundary

Content

2 Classification and logistic regression


Classification - an example
Decision boundary
Model definition
Cost function
Stochastic gradient descent
Multiclass classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 56 / 135
Classification and logistic regression Decision boundary

Decision boundary

Figure: A scatter plot of the RPM and Vibration descriptive features from the generators dataset shown in Table 4 [18]. A decision boundary separating 'good' generators (crosses) from 'faulty' generators (triangles) is also shown.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics
Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 57 / 135
Classification and logistic regression Decision boundary

Decision boundary

830 − 0.667 × RPM − Vibration = 0

Equivalently: θ0 + θ1 x1 + x2 = 0

For all "good" generators, we have: θ0 + θ1 x1 + x2 ≥ 0
For all "faulty" generators, we have: θ0 + θ1 x1 + x2 < 0
Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 58 / 135
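A quick numerical check of this boundary (a minimal Python sketch, not part of the original slides): rewriting 830 − 0.667 × RPM − Vibration = 0 as θ0 + θ1 x1 + x2 with x1 = RPM, x2 = Vibration, θ0 = −830 and θ1 = 0.667 (the same line multiplied by −1 so that "good" comes out non-negative), rows from the generators table fall on the expected sides:

def d(rpm, vibration, theta0=-830.0, theta1=0.667):
    # theta0 + theta1*x1 + x2 with x1 = RPM, x2 = Vibration
    return theta0 + theta1 * rpm + vibration

print(d(568, 585))  # ID 1 (good):    +133.9 >= 0
print(d(562, 309))  # ID 29 (faulty): -146.1 <  0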
Classification and logistic regression Decision boundary

Notation

Let θ = (1, θ1, θ0)ᵀ and x = (x2, x1, 1)ᵀ.

Then we have the decision boundary:

dθ(x) = θᵀx = 0

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 59 / 135
Classification and logistic regression Model definition

Content

2 Classification and logistic regression


Classification - an example
Decision boundary
Model definition
Cost function
Stochastic gradient descent
Multiclass classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 60 / 135
Classification and logistic regression Model definition

Heaviside step function

Heaviside(x) = 1 if x ≥ 0, 0 if x < 0

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 61 / 135
Classification and logistic regression Model definition

Heaviside step function

By Omegatron - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=801382

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 62 / 135
Classification and logistic regression Model definition

Model as a Heaviside step function

hθ(x) = Heaviside(dθ(x))
      = 1 if dθ(x) = θᵀx ≥ 0 (good generators)
      = 0 if dθ(x) = θᵀx < 0 (faulty generators)

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 63 / 135
Classification and logistic regression Model definition

Model as a Heaviside step function


Figure: (a) A surface showing the value of hθ(x) = Heaviside(dθ(x)) (Equation (6)[21]) for all values of RPM and Vibration. The decision boundary given in Equation (6)[21] is highlighted. (b) The same surface linearly thresholded at zero to operate as a predictor.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 64 / 135
Classification and logistic regression Model definition

Problem of the Heaviside step function

The Heaviside step function is not differentiable, which makes it hard to optimize.

An alternative is the logistic function (sigmoid function):

Logistic(x) = 1 / (1 + e^(−x))

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 65 / 135
Classification and logistic regression Model definition

Logistic function
[Figure: the logistic function logistic(x), rising from 0 to 1 as x goes from −10 to 10.]

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 66 / 135
Classification and logistic regression Model definition

Model as a logistic function

hθ(x) = Logistic(dθ(x)) = 1 / (1 + e^(−dθ(x))) = 1 / (1 + e^(−θᵀx))

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 67 / 135
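Putting the pieces together (an illustrative Python sketch; the θ values are taken from the earlier decision boundary, not learned by training): with the augmented vectors θ = (1, θ1, θ0)ᵀ and x = (x2, x1, 1)ᵀ from the notation slide, hθ(x) comes out close to 1 for a 'good' generator and close to 0 for a 'faulty' one.

import numpy as np

def h(x, theta):
    # h_theta(x) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-theta @ x))

theta = np.array([1.0, 0.667, -830.0])    # (1, theta1, theta0)
x_good = np.array([585.0, 568.0, 1.0])    # (Vibration, RPM, 1), ID 1
x_faulty = np.array([309.0, 562.0, 1.0])  # ID 29

print(h(x_good, theta))    # ~1.0 -> 'good'
print(h(x_faulty, theta))  # ~0.0 -> 'faulty'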
Classification and logistic regression Model definition

Model as a logistic function



John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

The decision surface for the example logistic regression


model.
Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 68 / 135
Classification and logistic regression Cost function

Content

2 Classification and logistic regression


Classification - an example
Decision boundary
Model definition
Cost function
Stochastic gradient descent
Multiclass classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 69 / 135
Classification and logistic regression Cost function

Cost function

The objective of a learning algorithm is to search for a model that best predicts the target feature on the training data, given the inductive bias.
One way to achieve this goal is to minimize the errors of the
prediction over the training data.
A cost function (loss function) is defined as the sum of the errors
over the training samples.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 70 / 135
Classification and logistic regression Cost function

Cost function

A possible cost function:

J(θ) = (1/n) Σ_{i=1}^{n} (1/2) (hθ(x^{i}) − y^{i})²

It is not a good cost function for logistic regression because it is non-convex when hθ is a logistic function.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 71 / 135
Classification and logistic regression Cost function

Convex vs. non-convex functions

figure source: https://www.fromthegenesis.com/artificial-neural-network-part-7/

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 72 / 135
Classification and logistic regression Cost function

Cost function for Logistic regression

J(θ) = (1/n) Σ_{i=1}^{n} Cost(hθ(x^{i}), y^{i})

where:

Cost(hθ(x), y) = −log(hθ(x)) if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x)) if y = 0

or, equivalently:

Cost(hθ(x), y) = −[y log(hθ(x)) + (1 − y) log(1 − hθ(x))], with y^{∗} ∈ {0, 1}

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 73 / 135
Classification and logistic regression Cost function

Cost function for Logistic regression

Cost(hθ(x), y) = −[y log(hθ(x)) + (1 − y) log(1 − hθ(x))]

figure source: https://www.geeksforgeeks.org/ml-cost-function-in-logistic-regression/

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 74 / 135
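The same cost in code (a minimal Python sketch with made-up predictions, assuming 0/1 labels); the clipping guards against log(0) for saturated predictions:

import numpy as np

def cost(h, y, eps=1e-12):
    # J = -(1/n) sum_i [ y_i log(h_i) + (1 - y_i) log(1 - h_i) ]
    h = np.clip(h, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

h = np.array([0.9, 0.2, 0.7])  # model outputs h_theta(x^{i})
y = np.array([1, 0, 1])        # true labels
print(cost(h, y))              # low cost: predictions mostly match labels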
Classification and logistic regression Stochastic gradient descent

Content

2 Classification and logistic regression


Classification - an example
Decision boundary
Model definition
Cost function
Stochastic gradient descent
Multiclass classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 75 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent

Now we have a cost function which defines the errors of a model over the training data.
Next we need to search all the possible models to find a model with the minimal cost.
We use a gradient descent algorithm for this purpose.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 76 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent

Have some function J(θ0, θ1).
Want min over θ0, θ1 of J(θ0, θ1).

Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum

Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 77 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent

[Figure: shown over a sequence of slides, gradient descent takes successive downhill steps on the error surface J(θ0, θ1) until it reaches a minimum; starting from a different point can lead to a different local minimum.]

Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 87 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent algorithm

Repeat until convergence {
    θj := θj − α ∂J(θ0, θ1)/∂θj    (for j = 0 and j = 1)
}

Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 88 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent algorithm

Correct: simultaneous update (compute the new value of every θj from the current parameter values before overwriting any of them).

Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 89 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent algorithm

Correct: simultaneous update. Incorrect: updating θ0 first and then using the updated θ0 when computing the new θ1.

Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 90 / 135
Classification and logistic regression Stochastic gradient descent

The derivative ∂J(θ)/∂θ gives the direction of the movement.
The learning rate α is used to adjust the size of each step.
figure source: https://machinelearningmedium.com/2017/08/15/gradient-descent/

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 91 / 135
Classification and logistic regression Stochastic gradient descent

The influence of the learning rate.


figure source: https://ithelp.ithome.com.tw/m/articles/10204032

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 92 / 135
Classification and logistic regression Stochastic gradient descent

A non-convex error surface may lead to local optima.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 93 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent for logistic regression

Model:
hθ(x) = 1 / (1 + e^(−θᵀx))

Parameters:
θ0, θ1

Cost function:
J(θ) = −(1/n) Σ_{i=1}^{n} [ y^{i} log(hθ(x^{i})) + (1 − y^{i}) log(1 − hθ(x^{i})) ]

Goal:
minimize J(θ) over θ0, θ1

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 94 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent for logistic regression

Gradient descent:

J(θ) = −(1/n) Σ_{i=1}^{n} [ y^{i} log(hθ(x^{i})) + (1 − y^{i}) log(1 − hθ(x^{i})) ]

Want min over θ of J(θ):

Repeat {
    θj := θj − α ∂J(θ)/∂θj
    (simultaneously update all θj)
}

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 95 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent for logistic regression

J(θ) = −(1/n) Σ_{i=1}^{n} [ y^{i} log(hθ(x^{i})) + (1 − y^{i}) log(1 − hθ(x^{i})) ]

∂J(θ)/∂θj = −(1/n) ∂/∂θj Σ_{i=1}^{n} [ y^{i} log(hθ(x^{i})) + (1 − y^{i}) log(1 − hθ(x^{i})) ]
          = ······
          = (1/n) Σ_{i=1}^{n} (hθ(x^{i}) − y^{i}) xj^{i}

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 96 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent for logistic regression

Gradient descent:

J(θ) = −(1/n) Σ_{i=1}^{n} [ y^{i} log(hθ(x^{i})) + (1 − y^{i}) log(1 − hθ(x^{i})) ]

Want min over θ of J(θ):

Repeat {
    θj := θj − α (1/n) Σ_{i=1}^{n} (hθ(x^{i}) − y^{i}) xj^{i}
    (simultaneously update all θj)
}

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 97 / 135
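The whole procedure fits in a few lines (a minimal Python sketch on a toy 1-D dataset, not the generators data): X carries one row per example with a trailing column of ones for the intercept, and each iteration applies the simultaneous update above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, alpha=0.1, iters=5000):
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / n  # (1/n) sum_i (h - y) x_j
        theta -= alpha * grad                      # simultaneous update of all theta_j
    return theta

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])  # feature, intercept
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit(X, y)
print(sigmoid(X @ theta))  # probabilities increase with the feature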
Classification and logistic regression Stochastic gradient descent

Gradient descent for logistic regression

Fortunately, the error surface for logistic regression is convex.

Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 98 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent for logistic regression

Figure: A selection of the logistic regression models developed during the gradient descent process for the extended generators dataset in Table 35 [35]. The bottom-right panel shows the sum of squared errors falling over the training iterations.

John Kelleher and Brian Mac Namee and Aoife D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 99 / 135
Classification and logistic regression Stochastic gradient descent

Gradient descent variants

• There are three variants of gradient descent:
  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent
• These algorithms differ in the amount of data used to compute each update.

Update equation (the gradient term differs with each method):

θ = θ − η ∗ ∇θ J(θ)

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).
Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 100 / 135
Classification and logistic regression Stochastic gradient descent

Batch gradient descent

This method computes the gradient of the cost function with the entire training dataset.

Update equation:

θ = θ − η ∗ ∇θ J(θ)

We need to calculate the gradients for the whole dataset to perform just one update. (A code sketch follows below.)

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 101 / 135
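A self-contained Python sketch of the batch update (toy logistic-regression data assumed, not the Ruder paper's exact listing): every pass over the data produces exactly one parameter update.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta, eta = np.zeros(2), 0.1

for epoch in range(100):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)  # gradient over ALL examples
    theta = theta - eta * grad                      # one update per epoch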
Classification and logistic regression Stochastic gradient descent

Batch gradient descent


• Advantage
• It is guaranteed to converge to the global minimum for
convex error surfaces and to a local minimum for non-
convex surfaces.

• Disadvantages
• It can be very slow.
• It is intractable for datasets that do not fit in memory.
• It does not allow us to update our model online.

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 102 / 135
Classification and logistic regression Stochastic gradient descent

Stochastic gradient descent

This method performs a parameter update for each training example x^{i} and label y^{i}.

Update equation:

θ = θ − η ∗ ∇θ J(θ; x^{i}; y^{i})

In contrast to batch gradient descent, each update is computed from a single training example. Note: we shuffle the training data at every epoch. (A code sketch follows below.)

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 103 / 135
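The corresponding SGD sketch in Python (same toy data as the batch sketch; one update per example, with the data shuffled every epoch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta, eta = np.zeros(2), 0.1
rng = np.random.default_rng(0)

for epoch in range(100):
    for i in rng.permutation(len(y)):                 # shuffle at every epoch
        grad = (sigmoid(X[i] @ theta) - y[i]) * X[i]  # single-example gradient
        theta = theta - eta * grad                    # one update per example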
Classification and logistic regression Stochastic gradient descent

Stochastic gradient descent


• Advantage
• It is usually much faster than batch gradient descent.
• It can be used to learn online.

• Disadvantages
• It performs frequent updates with a high variance that
cause the objective function to fluctuate heavily.

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 104 / 135
Classification and logistic regression Stochastic gradient descent

The fluctuation: Batch vs SGD

Batch gradient descent converges to the minimum of the basin the parameters are placed in, and the fluctuation is small.

SGD's fluctuation is large, but it enables jumping to new and potentially better local minima.

However, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.

https://wikidocs.net/3413

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 105 / 135
Classification and logistic regression Stochastic gradient descent

Learning rate of SGD

• When we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent.
• It almost certainly converges to a local or the global minimum, for non-convex and convex optimization respectively.

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 106 / 135
Classification and logistic regression Stochastic gradient descent

Mini-batch gradient descent

This method takes the best of both batch and SGD, and performs an update for every mini-batch of n examples.

Update equation:

θ = θ − η ∗ ∇θ J(θ; x^{i:i+n}; y^{i:i+n})

(A code sketch follows below.)

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 107 / 135
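And the mini-batch sketch in Python (same toy data; batch size 2 only because the dataset is tiny, 50-256 is more typical):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta, eta, b = np.zeros(2), 0.1, 2
rng = np.random.default_rng(0)

for epoch in range(100):
    order = rng.permutation(len(y))
    for start in range(0, len(y), b):
        idx = order[start:start + b]  # one mini-batch
        grad = X[idx].T @ (sigmoid(X[idx] @ theta) - y[idx]) / len(idx)
        theta = theta - eta * grad    # one update per mini-batch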
Classification and logistic regression Stochastic gradient descent

Mini-batch gradient descent

• Advantages:
  • It reduces the variance of the parameter updates, which can lead to more stable convergence.
  • It can make use of highly optimized matrix operations common to deep learning libraries, which make computing the gradient very efficient.
• Disadvantage:
  • We have to set the mini-batch size. Common mini-batch sizes range between 50 and 256, but can vary for different applications.

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 108 / 135
Classification and logistic regression Stochastic gradient descent

Trade-off

• Depending on the amount of data, the methods trade off:
  • the accuracy of the parameter update
  • the time it takes to perform an update

Method                       | Accuracy | Time   | Memory Usage | Online Learning
Batch gradient descent       | ○        | Slow   | High         | ×
Stochastic gradient descent  | △        | High   | Low          | ○
Mini-batch gradient descent  | ○        | Medium | Medium       | ○

Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv:1609.04747 (2016).

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 109 / 135
Classification and logistic regression Multiclass classification

Content

2 Classification and logistic regression


Classification - an example
Decision boundary
Model definition
Cost function
Stochastic gradient descent
Multiclass classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 110 / 135
Classification and logistic regression Multiclass classification

Multiclass classification

Email foldering/tagging: Work, Friends, Family, Hobby

y = 1 (Work), 2 (Friends), 3 (Family), 4 (Hobby)

Medical diagnosis: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 111 / 135
Classification and logistic regression Multiclass classification

Multiclass classification

Binary classification:

[Figure: a two-class dataset in the (x1, x2) plane]
Andrew Ng
Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 112 / 135
Classification and logistic regression Multiclass classification

Multiclass classification

Binary classification vs. multi-class classification:

[Figure: a two-class dataset and a three-class dataset, side by side, in the (x1, x2) plane]
Andrew Ng
Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 113 / 135
Classification and logistic regression Multiclass classification

Multiclass classification

One-vs-all (one-vs-rest):

[Figure: shown over a sequence of slides, a three-class dataset in the (x1, x2) plane is split into three binary problems, one per class (Class 1, Class 2, Class 3), and a decision boundary is fit for each.]

Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 119 / 135
Classification and logistic regression Multiclass classification

Multiclass classification

One-vs-all

Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i.

Andrew Ng
Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 120 / 135
Classification and logistic regression Multiclass classification

Multiclass classification

One-vs-all

Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes hθ^(i)(x).

Andrew Ng
Andrew Ng, Machine Learning, Coursera course

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 121 / 135
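In code, one-vs-all is just an argmax over the per-class classifiers (a minimal Python sketch; the three θ vectors are illustrative stand-ins, not trained values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

thetas = np.array([[ 2.0, -1.0, 0.5],   # class 1 vs rest
                   [-1.0,  2.0, 0.0],   # class 2 vs rest
                   [-0.5, -0.5, 1.0]])  # class 3 vs rest

def predict(x):
    probs = sigmoid(thetas @ x)        # h_theta^(i)(x) for each class i
    return int(np.argmax(probs)) + 1   # pick the most confident classifier

x = np.array([1.0, 0.2, 1.0])  # (x1, x2, 1), with intercept term
print(predict(x))              # -> 1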
Text Classification

Content

1 Machine Learning basics

2 Classification and logistic regression

3 Text Classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 122 / 135
Text Classification

Text classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 123 / 135
Text Classification

Applications

Junk email filtering


News topic classification
Authorship attribution
Sentiment analysis
Genre classification
Offensive language identification
Language identification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 124 / 135
Text Classification

Huffpost News Category Dataset

This dataset contains around 200k news headlines from 2012 to 2018, obtained from HuffPost.

https://www.kaggle.com/rmisra/news-category-dataset

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 125 / 135
Text Classification

Huffpost News Category Dataset

Kavita Ganesan, Build Your First Text Classifier in Python with Logistic Regression
https://kavita-ganesan.com/news-classifier-with-logistic-regression-in-python

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 126 / 135
Text Classification

Procedure

Text preprocessing
Feature extraction
Model training
Model application
Evaluation

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 127 / 135
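All of these steps fit in a short scikit-learn script (a hedged end-to-end sketch, not the tutorial's exact code; the file name and the 'headline'/'category' field names are assumptions about the Kaggle JSON, adjust them to your copy):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_json("News_Category_Dataset_v2.json", lines=True)  # assumed file name
X_train, X_test, y_train, y_test = train_test_split(
    df["headline"], df["category"], test_size=0.2, random_state=0)

vec = TfidfVectorizer(stop_words="english")   # preprocessing + feature extraction
clf = LogisticRegression(max_iter=1000)       # model training
clf.fit(vec.fit_transform(X_train), y_train)

pred = clf.predict(vec.transform(X_test))     # model application
print(accuracy_score(y_test, pred))           # evaluation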
Text Classification

Text preprocessing

Text cleaning (removing HTML/XML tags, figures, formulas, etc.)


Removing stop words
Tokenization
Stemming

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 128 / 135
Text Classification

Stop words

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”).
Stop words are not helpful for text classification because they occur in almost all documents.
Stop words are normally removed before applying a text
classification algorithm.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 129 / 135
Text Classification

Feature extraction

Each document is represented as a vector in order to apply a classification algorithm.
Each dimension of the input vector is called a feature.
In text classification, the most straightforward idea is to use words
as features.

doc_id book read music go


doc1 3 1 0 5
doc2 2 5 3 0
doc3 0 0 7 2

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 130 / 135
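The table above can be reproduced with plain word counts (a minimal Python sketch; the two toy documents are contrived so that their counts match the doc1 and doc2 rows):

from collections import Counter

docs = {"doc1": "book book book read go go go go go",
        "doc2": "book book read read read read read music music music"}
vocab = ["book", "read", "music", "go"]

for doc_id, text in docs.items():
    counts = Counter(text.split())              # word frequencies in the document
    print(doc_id, [counts[w] for w in vocab])   # doc1 [3, 1, 0, 5], doc2 [2, 5, 3, 0]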
Text Classification

Weighting of words in document vectors

Term - a word or a collocation.


Document - a sequence of terms.
Corpus - a set of documents.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 131 / 135
Text Classification

Weighting of words in document vectors

Boolean weighting: wik = 1 if fik > 0, 0 otherwise
Word frequency weighting: wik = fik
TF-IDF weighting: wik = fik × log(N / ni)

i: word index
k: document index
fik: frequency of word i in document k
N: number of documents in the corpus
ni: number of documents containing word i

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 132 / 135
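The three schemes in code (a minimal Python sketch with made-up counts); note how TF-IDF shrinks the weight of a word that appears in most documents:

import math

def boolean_weight(f_ik):
    return 1 if f_ik > 0 else 0

def tf_weight(f_ik):
    return f_ik

def tfidf_weight(f_ik, N, n_i):
    # w_ik = f_ik * log(N / n_i)
    return f_ik * math.log(N / n_i)

print(tfidf_weight(f_ik=3, N=1000, n_i=10))   # rare word: weight ~13.8
print(tfidf_weight(f_ik=3, N=1000, n_i=900))  # common word: weight ~0.32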
Text Classification

Weighting of words in document vectors

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
The assumption behind the use of the inverse document frequency is: the more documents a word (term) occurs in, the less important the word is to any one of those documents.
TF-IDF was proposed in information retrieval but is also used in other areas, including NLP.
Which word weighting method is best for text classification? There is no universal answer; it depends empirically on the data and the classification algorithm you use.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 133 / 135
Text Classification

Algorithms

Logistic regression
Nearest neighbor
Decision trees
Support vector machines
Neural networks

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 134 / 135
Text Classification

Further topics

Feature selection
Dimension reduction
Document embeddings

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 135 / 135
Summary

Content

1 Machine Learning basics

2 Classification and logistic regression

3 Text Classification

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Spring 2020 136 / 135
