05/04/2024
MDAN 54233
Text Classification
Supunmali Ahangama

Outline
• What is text classification?
• Applications
• Types of classification
• Steps of text classification
Text Classification
• Also known as document classification.
– A document is represented as textual data such as sentences or paragraphs belonging to the English language
• Text classification is assigning text documents into one or more classes or categories, assuming that there is a predefined set of classes.

Applications
• News articles categorization
• Spam filtering
• Music or movie genre categorization
• Customer support request categorization
• Sentiment analysis
• Language detection
News articles categorization
Spam classification
https://www.datacamp.com/tutorial/text-classification-python
Categorize customer support requests
https://www.datacamp.com/tutorial/text-classification-python

Two major types
• Content based classification
– Analyzing the actual content of the information to determine its category or class.
– E.g., Spam/Ham by analyzing the words and phrases in emails
• Request based classification
– Classifying based on the requests or queries made by users.
Automated Text Classification
• Learning Algorithms
– Supervised machine learning
– Unsupervised machine learning
– Semi-supervised learning
– Reinforcement learning

Supervised learning
• Classification:
– the outcomes to be predicted are distinct categories (outcome variable is a categorical variable).
• Regression:
– the outcome to be predicted is a continuous numeric variable.
Example: "I am happy as I am doing the Text Analytics Module" → Positive: 1
Types of Classification
Based on the number of classes that can be predicted on
any data point:
• Binary classification
• Multi-class classification (multinomial classification)
• Multi-label classification
https://www.datacamp.com/tutorial/text-classification-python
Text Classification - Workflow
1. Prepare train and test datasets
2. Text normalization
3. Feature extraction
4. Model training
5. Model prediction and evaluation
6. Model deployment
Text Normalization
Some of the commonly used steps
• Expanding contractions
• Text standardization through lemmatization
• Removing special characters and symbols
• Removing stopwords
Refer to the previous lesson “Pre-processing” for more details.

Feature Extraction
• In ML terminology, features are unique, measurable attributes (properties) for each data point (observation) in a dataset.
• Features are usually numeric in nature and can be absolute numeric values or categorical features encoded as binary features.
Feature Extraction Techniques
• Bag of Words model
• TF-IDF model
• Advanced word vectorization models
Refer to the previous lesson “Text Vectorization” for more details.

Vocabulary
Documents: [doc_1, doc_2, ….., doc_m]
I am happy as I am following this course
….
….
I hated the book
V = [I, am, happy, as, following, this, course, ….., hated, the, book]
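The vocabulary V above is the list of distinct words across all documents. A minimal sketch of how it could be built, assuming plain whitespace tokenization and case-sensitive matching (neither is specified on the slide). Only the two documents that are spelled out are used; the elided ones are left out, and `build_vocabulary` is an illustrative name.

```python
def build_vocabulary(documents):
    """Return the distinct words of all documents, in order of first appearance."""
    vocabulary = []
    seen = set()
    for doc in documents:
        for word in doc.split():  # assumption: whitespace tokenization
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


# Only the two documents written out on the slide; the rest are elided there.
documents = [
    "I am happy as I am following this course",
    "I hated the book",
]

vocabulary = build_vocabulary(documents)
print(vocabulary)
```

In practice the documents would first go through the text normalization steps, so that variants such as "Happy" and "happy" map to the same vocabulary entry.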
Feature Extraction
Binary code based on availability
I am happy as I am following this course
[I, am, happy, as, following, this, course, ….., hated, the, book]
[1, 1, 1, 1, 1, 1, 1, ….., 0, 0, 0]
• A lot of zeros. It is known as Sparse representation.

Feature Extraction
Absolute term frequency.
I am happy as I am following this course
[I, am, happy, as, following, this, course, ….., hated, the, book]
[2, 2, 1, 1, 1, 1, 1, ….., 0, 0, 0]
• A lot of zeros. It is known as Sparse representation.
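The two encodings above differ only in what is stored per vocabulary term: presence (0/1) or the number of occurrences. A minimal sketch, assuming whitespace tokenization and a shortened vocabulary that drops the elided middle terms; `binary_vector` and `count_vector` are illustrative names.

```python
from collections import Counter

# Assumption: the vocabulary from the slide without its elided ("…..") terms.
vocabulary = ["I", "am", "happy", "as", "following", "this", "course",
              "hated", "the", "book"]


def binary_vector(document, vocabulary):
    """1 if the vocabulary term occurs in the document, else 0."""
    words = set(document.split())
    return [1 if term in words else 0 for term in vocabulary]


def count_vector(document, vocabulary):
    """Absolute term frequency of each vocabulary term in the document."""
    counts = Counter(document.split())
    return [counts[term] for term in vocabulary]


sentence = "I am happy as I am following this course"
binary = binary_vector(sentence, vocabulary)
counts = count_vector(sentence, vocabulary)
print(binary)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(counts)  # [2, 2, 1, 1, 1, 1, 1, 0, 0, 0]
```

With a realistic vocabulary of many thousands of terms, almost every entry is zero, which is why such vectors are usually stored in a sparse format.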
Feature Extraction with Frequencies - Example
Positive:
– I am happy as I am following this course
– I am happy
Negative:
– I am sad, I am not good in coding
– I am sad

Positive:
– I am happy as I am following this course
– I am happy
Negative:
– I am sad as I am not good in this course
– I am sad
Get the frequency for each distinct word.

Vocabulary   PosFreq(1)   NegFreq(0)
I            3            3
am           3            3
happy        2            0
as           1            1
following    1            0
this         1            1
course       1            1
sad          0            2
not          0            1
good         0            1
in           0            1
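The frequency table can be reproduced by counting, per class, how often each distinct word occurs. A minimal sketch, assuming whitespace tokenization and the corpus whose negative example is "I am sad as I am not good in this course"; `class_frequencies` is an illustrative name.

```python
from collections import Counter

positive_docs = ["I am happy as I am following this course", "I am happy"]
negative_docs = ["I am sad as I am not good in this course", "I am sad"]


def class_frequencies(documents):
    """Count how often each word occurs across all documents of one class."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts


pos_freq = class_frequencies(positive_docs)
neg_freq = class_frequencies(negative_docs)

# Distinct words in order of first appearance, positive documents first.
vocabulary = []
for doc in positive_docs + negative_docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

print(f"{'Vocabulary':<12}{'PosFreq(1)':>12}{'NegFreq(0)':>12}")
for word in vocabulary:
    # A Counter returns 0 for words that never occur in that class.
    print(f"{word:<12}{pos_freq[word]:>12}{neg_freq[word]:>12}")
```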
Vocabulary   PosFreq(1)
I            3
am           3
happy        2
as           1
following    1
this         1
course       1
sad          0
not          0
good         0
in           0
I am happy as I am following this course
Sum of PosFreq(1) over the distinct words in the sentence: 3 + 3 + 2 + 1 + 1 + 1 + 1 = 12
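Vocabulary   NegFreq(0)
I            3
am           3
happy        0
as           1
following    0
this         1
course       1
sad          2
not          1
good         1
in           1
I am sad as I am not good in this course
Sum of NegFreq(0) over the distinct words in the sentence: 3 + 3 + 2 + 1 + 1 + 1 + 1 + 1 + 1 = 14

Vocabulary   PosFreq(1)   NegFreq(0)
I            3            3
am           3            3
happy        2            0
as           1            1
following    1            0
this         1            1
course       1            1
sad          0            2
not          0            1
good         0            1
in           0            1
I am happy as I am following this course
Sum of PosFreq(1) over the distinct words in the sentence: 12
Sum of NegFreq(0) over the distinct words in the sentence: 9
Feature vector is [1, 12, 9]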
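The feature vector [1, 12, 9] can be reproduced by summing, over the distinct words of the sentence, their positive and negative frequencies. A minimal sketch under two assumptions: each distinct word is counted once (this is what yields 12 and 9), and the leading 1 is a constant, commonly read as a bias term, which the slide does not state. `frequency_features` is an illustrative name.

```python
# Frequencies copied from the table above.
pos_freq = {"I": 3, "am": 3, "happy": 2, "as": 1, "following": 1, "this": 1,
            "course": 1, "sad": 0, "not": 0, "good": 0, "in": 0}
neg_freq = {"I": 3, "am": 3, "happy": 0, "as": 1, "following": 0, "this": 1,
            "course": 1, "sad": 2, "not": 1, "good": 1, "in": 1}


def frequency_features(sentence, pos_freq, neg_freq):
    """Return [1, sum of positive frequencies, sum of negative frequencies]."""
    distinct_words = set(sentence.split())  # each word counted once
    pos_sum = sum(pos_freq.get(word, 0) for word in distinct_words)
    neg_sum = sum(neg_freq.get(word, 0) for word in distinct_words)
    return [1, pos_sum, neg_sum]  # assumption: leading 1 is a constant term


features = frequency_features("I am happy as I am following this course",
                              pos_freq, neg_freq)
print(features)  # [1, 12, 9]
```

Every sentence is reduced to three numbers regardless of vocabulary size, in contrast to the sparse bag-of-words vectors.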
Classification Algorithms
There are many algorithms. The algorithms covered in this lesson are:
• Multinomial Naïve Bayes
• Support vector machines
• Logistic regression
• Random forest

Processes in Classification
• Training
• Evaluation (test)
– Evaluate performance with the ground truth (actual class labels)
• Tuning (hyperparameter tuning or optimization)
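To make training and prediction concrete, here is a minimal sketch of the first listed algorithm, multinomial Naïve Bayes, in plain Python. It assumes whitespace tokenization, add-one (Laplace) smoothing via a hypothetical `alpha` parameter, and a four-sentence toy corpus with labels 1 (positive) and 0 (negative); it is an illustration, not a substitute for a library implementation.

```python
import math
from collections import Counter


def train_multinomial_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocabulary = {word for doc in documents for word in doc.split()}
    docs_per_class = Counter(labels)
    word_counts = {label: Counter() for label in docs_per_class}
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())

    model = {"vocabulary": vocabulary, "log_prior": {}, "log_likelihood": {}}
    for label, n_docs in docs_per_class.items():
        model["log_prior"][label] = math.log(n_docs / len(documents))
        # Denominator: total word count of the class plus smoothing mass.
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocabulary
        }
    return model


def predict(model, document):
    """Return the class with the highest posterior log score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for word in document.split():
            if word in model["vocabulary"]:  # words unseen in training are ignored
                score += model["log_likelihood"][label][word]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy corpus (assumption): 1 = positive, 0 = negative.
train_docs = ["I am happy as I am following this course", "I am happy",
              "I am sad as I am not good in this course", "I am sad"]
train_labels = [1, 1, 0, 0]

model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, "I am happy"))
print(predict(model, "I am sad"))
```

The smoothing strength `alpha` is an example of a hyperparameter that could be adjusted in the tuning step.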
Evaluating Classification Models
• To check how well these models are performing
• E.g.
– Accuracy
– Precision
– Recall
– F1 score

Evaluation
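The four metrics listed above can all be derived from the counts of true/false positives and negatives. A minimal sketch for the binary case, using the standard definitions; the label lists `y_true` and `y_pred` are made-up illustrative values, not results from this lesson, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard: nothing predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # guard: no actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four metrics come out as 0.75.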
Activity
• Calculate the accuracy, recall, precision, and F score

Summary
• Text classification is a natural language processing task that involves assigning predefined categories or labels to textual documents.
• In data preprocessing, the raw text data is cleaned, normalized, and transformed into a numerical format suitable for machine learning algorithms.
• Feature extraction involves selecting relevant features that can represent the documents effectively.
• Model selection is the process of choosing the best algorithm or approach to classify the text data.
• Finally, evaluation is done to measure the performance of the classification model using various metrics such as accuracy, precision, recall, and F1-score.