Text Classification using scikit-learn in NLP
Last Updated: 21 Jun, 2024
Text classification, a key task in natural language processing (NLP), categorizes text into predefined groups. It underpins applications such as topic categorization, sentiment analysis, and spam detection. In this article, we will use scikit-learn, a Python machine learning library, to build a simple text classification pipeline.
What is Text Classification?
Text classification is a fundamental task in natural language processing (NLP) that involves assigning predefined categories or labels to text documents. This process enables the automated sorting and organization of textual data, facilitating the extraction of valuable information and insights from large volumes of text. Text classification is widely used in various applications, including sentiment analysis, spam detection, topic labelling, and document categorization.
Why Use Scikit-learn for Text Classification?
- Ease of Use: User-friendly API and comprehensive documentation make it accessible for beginners and experts alike.
- Performance: Optimized for large datasets and efficient computation with robust model evaluation tools.
- Integration: Seamless integration with NumPy, SciPy, and pandas, plus support for creating streamlined workflows with pipelines (see the short sketch after this list).
- Community Support: Large, active community and frequent updates ensure continuous improvement and extensive resources for troubleshooting.
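To illustrate the pipeline support mentioned above, here is a minimal sketch of how a TF-IDF vectorizer and a linear SVM could be chained into a single estimator. The names train_texts and train_labels are placeholders, not part of the dataset used later in this article.
Python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Chain vectorization and classification into a single estimator,
# so fit() and predict() handle both steps in one call.
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('svm', SVC(kernel='linear')),
])
# text_clf.fit(train_texts, train_labels)        # placeholder names
# predictions = text_clf.predict(["some new document"])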
Implementation of Text Classification with Scikit-Learn
We'll walk through a straightforward example, classifying newsgroup posts from two categories: baseball (rec.sport.baseball) and space (sci.space).
Step 1: Import Necessary Libraries and Load Dataset
For this example, we'll use the 'sklearn.datasets.fetch_20newsgroups' dataset, which is a collection of newsgroup documents.
Python
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
# Load dataset
newsgroups = fetch_20newsgroups(subset='all', categories=['rec.sport.baseball', 'sci.space'], shuffle=True, random_state=42)
data = newsgroups.data
target = newsgroups.target
# Create a DataFrame for easy manipulation
df = pd.DataFrame({'text': data, 'label': target})
df.head()
Output:
text label
0 From: [email protected] (Mark Singer)\nSubject: R... 0
1 From: [email protected] (Cousin It)\nS... 0
2 From: [email protected]\nSubj... 0
3 From: [email protected] (Edward [Ted] Fis... 0
4 From: [email protected] (Sherri Nichols)\nSub... 0
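Before vectorizing, it can be useful to confirm how many documents fall into each class. A small optional check, assuming the df and newsgroups objects created above:
Python
# Count the documents per numeric label (0 and 1 here)
print(df['label'].value_counts())

# Map the numeric labels back to the category names
print(dict(enumerate(newsgroups.target_names)))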
Step 2: Preprocess the Data
We will use term frequency-inverse document frequency (TF-IDF) to convert the text into numerical feature vectors.
Python
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
# Transform the text data to feature vectors
X = vectorizer.fit_transform(df['text'])
# Labels
y = df['label']
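As an optional check, the shape of the resulting matrix and a few learned vocabulary terms can be inspected. This sketch assumes the vectorizer and X from the code above, and a recent scikit-learn version (1.0+) for get_feature_names_out.
Python
# X is a sparse matrix: one row per document, one column per vocabulary term
print(X.shape)

# Peek at a few of the learned vocabulary terms
print(vectorizer.get_feature_names_out()[:10])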
Step 3: Fit the Model for Classification
We'll use a Support Vector Machine (SVM) for classification.
Python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the classifier
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
Output:
SVC(kernel='linear')
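As a side note, for larger text corpora one could also try LinearSVC, which is usually much faster on high-dimensional sparse TF-IDF features and typically performs similarly to SVC with a linear kernel. A minimal sketch, reusing the same train/test split; lin_clf is just an illustrative name.
Python
from sklearn.svm import LinearSVC

# A faster linear SVM variant, well suited to sparse TF-IDF features
lin_clf = LinearSVC()
lin_clf.fit(X_train, y_train)
print(lin_clf.score(X_test, y_test))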
Step 4: Model Evaluation
Evaluate the model using accuracy score and classification report.
Python
from sklearn.metrics import accuracy_score, classification_report
# Predict on the test set
y_pred = clf.predict(X_test)
# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=newsgroups.target_names)
print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(report)
Output:
Accuracy: 0.9966
Classification Report:
                    precision    recall  f1-score   support

rec.sport.baseball       0.99      1.00      1.00       286
         sci.space       1.00      0.99      1.00       309

          accuracy                           1.00       595
         macro avg       1.00      1.00      1.00       595
      weighted avg       1.00      1.00      1.00       595
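Beyond accuracy and the per-class report, a confusion matrix shows exactly which class is mistaken for which. A short optional sketch, assuming y_test and y_pred from the code above:
Python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(newsgroups.target_names)
print(cm)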
Step 5: Define a Function to Predict Class for New Text
This code defines a function, predict_category, that takes a text input, vectorizes it with the fitted TF-IDF vectorizer, and predicts its category using the trained classifier. The predicted label is then mapped back to its category name from the newsgroups dataset. Finally, an example call demonstrates the prediction for a sample sentence about exoplanets.
Python
def predict_category(text):
    """
    Predict the category of a given text using the trained classifier.
    """
    text_vec = vectorizer.transform([text])
    prediction = clf.predict(text_vec)
    return newsgroups.target_names[prediction[0]]
# Example usage
sample_text = "NASA announced the discovery of new exoplanets."
predicted_category = predict_category(sample_text)
print(f'The predicted category is: {predicted_category}')
Output:
The predicted category is: sci.space
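The same fitted vectorizer and classifier can also label several new texts in one call. A short usage sketch; the sample sentences below are made up for illustration.
Python
# Predict categories for a small batch of new texts in one call
new_texts = [
    "The pitcher threw a complete game shutout last night.",
    "The rover sent back new images from the Martian surface."
]
new_vecs = vectorizer.transform(new_texts)
for text, label in zip(new_texts, clf.predict(new_vecs)):
    print(f'{newsgroups.target_names[label]}: {text}')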
Conclusion
In this article, we showed how to use scikit-learn to build a simple text classification pipeline. We imported and prepared the dataset, converted the text data into numerical representations with TF-IDF, and trained an SVM classifier. Finally, we evaluated the model's performance and provided a function for classifying new text input. Depending on the dataset and the requirements, this approach can be adapted to a variety of text classification tasks, including topic categorization, sentiment analysis, and spam detection.