Text Classification

Text classification is the process of assigning predefined categories to textual documents, with applications including spam filtering and sentiment analysis. The workflow involves data preparation, text normalization, feature extraction, model training, and evaluation using metrics like accuracy and F1 score. Various algorithms such as Multinomial Naïve Bayes and Support Vector Machines are used for classification.

05/04/2024

MDAN 54233

Text Classification
Supunmali Ahangama

Outline

• What is text classification?
• Applications
• Types of classification
• Steps of text classification

Text Classification

• Also known as document classification.
– A document is represented as textual data such as sentences or paragraphs belonging to the English language
• Text classification is assigning text documents into one or more classes or categories, assuming that there is a predefined set of classes.

Applications

• News articles categorization
• Spam filtering
• Music or movie genre categorization
• Customer support request categorization
• Sentiment analysis
• Language detection


News articles categorization
[Figure omitted]

Spam classification
[Figure omitted]

Source: https://www.datacamp.com/tutorial/text-classification-python

Categorize customer support requests
[Figure omitted. Source: https://www.datacamp.com/tutorial/text-classification-python]

Two major types

• Content based classification
– Analyzing the actual content of the information to determine its category or class.
– E.g., Spam/Ham by analyzing the words and phrases in emails
• Request based classification
– Classifying based on the requests or queries made by users.


Automated Text Classification

• Learning Algorithms
– Supervised machine learning
– Unsupervised machine learning
– Semi-supervised learning
– Reinforcement learning

Supervised learning

• Classification:
– the outcomes to be predicted are distinct categories (outcome variable is a categorical variable).
• Regression
– the outcome to be predicted is a continuous numeric variable.


Example: "I am happy as I am doing the Text Analytics Module" → Positive: 1


Types of Classification

Based on the number of classes that can be predicted on any data point:
• Binary classification
• Multi-class classification (multinomial classification)
• Multi-label classification
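
The three types differ mainly in the shape of the label attached to each document. The snippet below is a minimal sketch of that difference; the category names and the helper `to_indicator` are made up for illustration and do not come from the slides.

```python
# Illustrative label shapes for the three classification types.
# All category names below are hypothetical examples.

# Binary: exactly one of two classes per document.
binary_label = "spam"                      # the other class would be "ham"

# Multi-class: exactly one class out of more than two.
multiclass_label = "sports"                # one of: politics, sports, tech

# Multi-label: any number of classes per document.
multilabel_labels = {"action", "comedy"}   # a movie can have several genres


def to_indicator(labels, all_classes):
    """Encode a set of labels as a 0/1 indicator vector over all_classes."""
    return [1 if c in labels else 0 for c in all_classes]


genres = ["action", "comedy", "drama"]
indicator = to_indicator(multilabel_labels, genres)
print(indicator)  # [1, 1, 0]
```

A multi-label target is commonly stored as such an indicator vector, so that each class can be predicted independently.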

Source: https://www.datacamp.com/tutorial/text-classification-python

Text Classification - Workflow

1. Prepare train and test datasets
2. Text normalization
3. Feature extraction
4. Model training
5. Model prediction and evaluation
6. Model deployment
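
Step 1 of the workflow above can be sketched in a few lines. This is only an illustration under assumptions: the helper name `train_test_split`, the 25% test ratio, and the fixed seed are choices made here, and the four labelled sentences are the example sentences used elsewhere in these slides.

```python
import random


def train_test_split(docs, labels, test_ratio=0.25, seed=42):
    """Shuffle the (document, label) pairs and split them into train and test."""
    pairs = list(zip(docs, labels))
    random.Random(seed).shuffle(pairs)          # seeded for repeatability
    n_test = max(1, int(len(pairs) * test_ratio))
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test


docs = [
    "I am happy as I am following this course",
    "I am happy",
    "I am sad as I am not good in this course",
    "I am sad",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

train, test = train_test_split(docs, labels)
print(len(train), len(test))  # 3 1
```

The test set is held back until evaluation so that the reported metrics reflect documents the model has not seen during training.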


Text Normalization

Some of the commonly used steps
• Expanding contractions
• Text standardization through lemmatization
• Removing special characters and symbols
• Removing stopwords

Refer to the previous lesson “Pre-processing” for more details.

Feature Extraction

• In ML terminology, features are unique, measurable attributes (properties) for each data point (observation) in a dataset.
• Features are usually numeric in nature and can be absolute numeric values or categorical features encoded as binary features.
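
The normalization steps listed above (expanding contractions, lemmatization, removing special characters, removing stopwords) can be sketched as follows. The lookup tables are tiny, hypothetical stand-ins written for this example; a real pipeline would normally use a full contraction map, a proper lemmatizer, and a complete stopword list.

```python
import re

# Tiny illustrative lookup tables (assumptions, far from complete).
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "it's": "it is"}
LEMMAS = {"following": "follow", "courses": "course", "hated": "hate"}
STOPWORDS = {"i", "am", "as", "the", "this", "is", "it", "do"}


def normalize(text):
    """Apply the four normalization steps and return a list of tokens."""
    text = text.lower()
    # 1. Expand contractions.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # 2. Remove special characters and symbols (keep letters, digits, spaces).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    # 3. Lemmatize with the lookup table (stand-in for a real lemmatizer).
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # 4. Remove stopwords.
    return [t for t in tokens if t not in STOPWORDS]


print(normalize("I'm happy as I am following this course!"))
# ['happy', 'follow', 'course']
```

Contractions are expanded before special characters are removed, because stripping the apostrophe first would leave fragments that no longer match the contraction table.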



Feature Extraction Techniques

• Bag of Words model
• TF-IDF model
• Advanced word vectorization models

Refer to the previous lesson “Text Vectorization” for more details.

Vocabulary

Documents: [doc_1, doc_2, ….., doc_m]
I am happy as I am following this course
….
….
I hated the book

V = [I, am, happy, as, following, this, course, ….., hated, the, book]
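
To make the TF-IDF model mentioned above concrete, here is a minimal sketch over the two documents that are spelled out on the vocabulary slide. It assumes one common variant of the weighting, raw term count times log(N / df), with whitespace tokenization; other variants add smoothing or normalization, and the name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter

docs = [
    "I am happy as I am following this course",
    "I hated the book",
]
tokenized = [d.split() for d in docs]

# Vocabulary: distinct words in first-seen order.
vocab = list(dict.fromkeys(w for doc in tokenized for w in doc))

# Document frequency: number of documents that contain each word.
df = Counter(w for doc in tokenized for w in set(doc))
n_docs = len(docs)


def tfidf_vector(tokens):
    """Raw term count times log(N / df), one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] * math.log(n_docs / df[w]) for w in vocab]


vectors = [tfidf_vector(doc) for doc in tokenized]
for word, weight in zip(vocab, vectors[0]):
    print(f"{word:<10} {weight:.3f}")
```

The word "I" occurs in both documents, so its weight is zero under this variant, while words that appear in only one document receive a positive weight.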


Feature Extraction

Binary code based on availability

I am happy as I am following this course

[I, am, happy, as, following, this, course, ….., hated, the, book]
[1, 1, 1, 1, 1, 1, 1, ….., 0, 0, 0]

• A lot of zeros. It is known as Sparse representation.

Feature Extraction

Absolute term frequency

I am happy as I am following this course

[I, am, happy, as, following, this, course, ….., hated, the, book]
[2, 2, 1, 1, 1, 1, 1, ….., 0, 0, 0]

• A lot of zeros. It is known as Sparse representation.
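
The binary and term-frequency encodings above can be reproduced with a short sketch. It assumes whitespace tokenization and uses only the vocabulary words that are written out on the slide; the elided middle of V is left out, and the function names are hypothetical.

```python
from collections import Counter

# Only the words spelled out on the slide; the elided part of V is omitted.
vocab = ["I", "am", "happy", "as", "following", "this", "course",
         "hated", "the", "book"]


def binary_vector(text, vocab):
    """1 if the vocabulary word occurs in the text, else 0."""
    present = set(text.split())
    return [1 if w in present else 0 for w in vocab]


def count_vector(text, vocab):
    """Number of times each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]


doc = "I am happy as I am following this course"
print(binary_vector(doc, vocab))  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(count_vector(doc, vocab))   # [2, 2, 1, 1, 1, 1, 1, 0, 0, 0]
```

With a realistic vocabulary of many thousands of words, almost every entry is zero, which is why such vectors are usually stored in a sparse format.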


Feature Extraction with Frequencies - Example

Positive:
I am happy as I am following this course
I am happy

Negative:
I am sad, I am not good in coding
I am sad

Positive:
I am happy as I am following this course
I am happy

Negative:
I am sad as I am not good in this course
I am sad

Get the frequency for each distinct word.

Vocabulary   PosFreq(1)   NegFreq(0)
I            3            3
am           3            3
happy        2            0
as           1            1
following    1            0
this         1            1
course       1            1
sad          0            2
not          0            1
good         0            1
in           0            1
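
The frequency table can be rebuilt from the two sets of sentences. This is a minimal sketch assuming whitespace tokenization with no further normalization; it uses the negative sentences "I am sad as I am not good in this course" and "I am sad", which are the ones that match the table, and the helper name `word_freqs` is hypothetical.

```python
from collections import Counter

positive = ["I am happy as I am following this course", "I am happy"]
negative = ["I am sad as I am not good in this course", "I am sad"]


def word_freqs(sentences):
    """Count how often each word occurs across all sentences of one class."""
    return Counter(w for s in sentences for w in s.split())


pos_freq = word_freqs(positive)
neg_freq = word_freqs(negative)

# Vocabulary: distinct words in first-seen order across both classes.
vocab = list(dict.fromkeys(
    w for s in positive + negative for w in s.split()))

print(f"{'Vocabulary':<10} {'Pos':>3} {'Neg':>3}")
for w in vocab:
    print(f"{w:<10} {pos_freq[w]:>3} {neg_freq[w]:>3}")
```

A `Counter` returns 0 for words it has never seen, which is what produces the zero entries such as the negative frequency of "happy".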


Vocabulary   PosFreq(1)
I            3
am           3
happy        2
as           1
following    1
this         1
course       1
sad          0
not          0
good         0
in           0

I am happy as I am following this course: 12
(sum of PosFreq(1) over the distinct words of the sentence: 3 + 3 + 2 + 1 + 1 + 1 + 1 = 12)

Vocabulary   NegFreq(0)
I            3
am           3
happy        0
as           1
following    0
this         1
course       1
sad          2
not          1
good         1
in           1

I am sad as I am not good in this course: 13

Vocabulary   PosFreq(1)   NegFreq(0)
I            3            3
am           3            3
happy        2            0
as           1            1
following    1            0
this         1            1
course       1            1
sad          0            2
not          0            1
good         0            1
in           0            1

I am happy as I am following this course: sum of PosFreq(1) = 12, sum of NegFreq(0) = 9

feature vector is [1, 12, 9]


Classification Algorithms

There are many algorithms. The algorithms that would be covered in this lesson are:
• Multinomial Naïve Bayes
• Support vector machines
• Logistic regression
• Random forest

Processes in Classification

• Training
• Evaluation (test)
– Evaluate performance with the Ground truth (actual class labels)
• Tuning (hyperparameter tuning or optimization)


Evaluating Classification Models

• To check how well these models are performing
• E.g.
– Accuracy
– Precision
– Recall
– F1 score
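
The four metrics can be computed from the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN): accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of precision and recall. The sketch below assumes a binary task with class 1 as the positive class; the labels and predictions are made up for illustration and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up ground truth and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example TP = 3, FP = 1, FN = 1 and TN = 3, so all four metrics come out to 0.75.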


Activity

• Calculate the accuracy, recall, precision, and F score

Summary

• Text classification is a natural language processing task that involves assigning predefined categories or labels to textual documents.
• In data preprocessing, the raw text data is cleaned, normalized, and transformed into a numerical format suitable for machine learning algorithms.
• Feature extraction involves selecting relevant features that can represent the documents effectively.
• Model selection is the process of choosing the best algorithm or approach to classify the text data.
• Finally, evaluation is done to measure the performance of the classification model using various metrics such as accuracy, precision, recall, and F1-score.