
Sentiment Analysis using NLTK

Last Updated : 11 Sep, 2025

Sentiment analysis using the Natural Language Toolkit (NLTK) is a computational process that determines the emotional tone of a text, such as positive, negative or neutral. NLTK is a comprehensive Python library that supports a wide range of NLP tasks. Key points:

  • NLTK offers efficient text preprocessing functions such as tokenization, stop word removal, stemming and lemmatization, as well as text classification.
  • Sentiment analysis can be performed using machine learning (Naive Bayes, SVM) or lexicon-based methods like VADER.
  • VADER in NLTK produces scores for positive, negative and neutral sentiment, plus a compound measure ranging from -1 (most negative) to 1 (most positive); a quick sketch follows this list.
  • Typical applications include analyzing customer feedback, social media posts and product reviews for business insights.
  • Sentiment analysis in NLTK is scalable and user-friendly, making it suitable for prototyping and production NLP systems.
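As a quick illustration of the VADER scores described above, the minimal sketch below prints the raw polarity dictionary for one sample sentence (the exact numbers depend on the lexicon version installed).

Python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()

# polarity_scores returns a dict with 'neg', 'neu' and 'pos' proportions
# plus a 'compound' score normalized to the range [-1, 1]
print(sia.polarity_scores("The movie was absolutely wonderful!"))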

Implementing Sentiment Analysis using NLTK

In this section, we perform sentiment analysis with NLTK on a Twitter samples dataset, following the steps below:

Step 1: Installing NLTK

We need to install NLTK and support packages for our model.

Python
!pip install nltk scikit-learn matplotlib seaborn

Then download the required NLTK data packages:

Python
import nltk
nltk.download('punkt', force=True)
nltk.download('stopwords', force=True)
nltk.download('wordnet', force=True)
nltk.download('vader_lexicon', force=True)
nltk.download('averaged_perceptron_tagger', force=True)
nltk.download('punkt_tab', force=True)

Step 2: Loading and Preprocessing Data

The sample documents can be downloaded from here.

To prepare the data, this step:

  • Loads positive and negative sentence lists from JSON files using Python’s json.load() for structured access.
  • Combines and shuffles data for randomized training and testing splits, preventing bias in model training.
Python
import json
import random
with open('positive_data.json', 'r') as f:
    positive_data = json.load(f)

with open('negative_data.json', 'r') as f:
    negative_data = json.load(f)

data = positive_data + negative_data
random.shuffle(data)

print(f"Total samples loaded: {len(data)}")

Output:


Total samples loaded: 196
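The article does not show the schema of the JSON files; based on the fields used in later steps (sentence and label), each file is assumed to be a JSON array of records like the following.

Python
# Assumed record structure of positive_data.json / negative_data.json
# (hypothetical example; the actual files come from the download link above)
sample_records = [
    {"sentence": "I love this product!", "label": "positive"},
    {"sentence": "This was a terrible experience.", "label": "negative"},
]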

Step 3: Exploratory Data Analysis (EDA)

Take a quick look at the data distribution:

  • Uses pandas to visualize and count label distribution for data balance.
  • Displays sample entries to inspect text format and sentiment labeling.
Python
import pandas as pd

df = pd.DataFrame(data)
print(df['label'].value_counts())

df.head()

Output:

[Screenshot: label counts and sample rows (EDA)]

Step 4: Text Preprocessing Pipeline

  • Defines a function to tokenize sentences into words with word_tokenize.
  • Converts words to lowercase, removes stop words and punctuation for noise reduction.
  • Lemmatizes tokens using WordNetLemmatizer to reduce inflectional forms to base words, improving consistency.
  • Applies preprocessing to all sentences, producing cleaned text for feature extraction.
Python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def preprocess_text(text):
    # Split the sentence into word tokens
    tokens = word_tokenize(text)
    # Keep alphabetic tokens only, lowercase them and drop stop words
    filtered_tokens = [w.lower() for w in tokens if w.isalpha()
                       and w.lower() not in stop_words]
    # Reduce each token to its base (lemma) form
    lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    return ' '.join(lemmas)


df['processed_text'] = df['sentence'].apply(preprocess_text)
print(df[['sentence', 'processed_text']].head())

Output:

[Screenshot: original vs processed sentences]
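To see the pipeline on a single sentence, the short check below (illustrative only) shows stop words and punctuation being dropped and a plural reduced to its lemma.

Python
# 'The' and 'were' are stop words, '!' is non-alphabetic,
# and 'movies' is lemmatized to 'movie'
print(preprocess_text("The movies were absolutely wonderful!"))
# Expected output: movie absolutely wonderful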

Step 5: Feature Extraction (Bag of Words)

  • Transforms processed text into numerical vectors using CountVectorizer, creating a word-occurrence matrix.
  • Maps sentiment labels to binary values (positive = 1, negative = 0) for model compatibility.
  • Outputs matrix shape for verifying feature dimensionality.
Python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['processed_text'])
y = df['label'].map({'positive': 1, 'negative': 0}).values

print(f"Feature matrix shape: {X.shape}")

Output:

Feature matrix shape: (196, 347)
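To verify what the vectorizer learned, you can inspect its vocabulary; get_feature_names_out is available in scikit-learn 1.0 and later.

Python
# Each word in the vocabulary maps to one of the 347 feature columns
feature_names = vectorizer.get_feature_names_out()
print(len(feature_names))   # matches the number of columns in X
print(feature_names[:10])   # first few words (sorted alphabetically)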

Step 6: Train-Test Split

This step:

  • Splits features and labels into training and test sets with train_test_split, preserving class proportions (stratify=y).
  • Enables objective evaluation of the model on held-out data.
Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
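As an optional sanity check, the snippet below confirms that stratification kept the positive/negative ratio (nearly) identical in both splits.

Python
import numpy as np

# With stratify=y the class counts in each split mirror the full dataset
print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))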

Step 7: Model Training (Naive Bayes)

This step uses Multinomial Naive Bayes:

  • Trains a MultinomialNB classifier, which is efficient for word frequency-based text data.
  • Fits the model on training data to learn text-sentiment relationships.
Python
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)

Output:

[Screenshot: fitted MultinomialNB model]
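After fitting, MultinomialNB exposes per-class log probabilities for every word via feature_log_prob_; the small sketch below uses them to surface the words most indicative of the positive class (an optional diagnostic, not part of the original tutorial).

Python
import numpy as np

# Row 1 is the positive class because labels were mapped positive -> 1;
# a large (row 1 - row 0) difference means the word favours 'positive'
log_odds = model.feature_log_prob_[1] - model.feature_log_prob_[0]
top_positive = np.argsort(log_odds)[-10:]
print(vectorizer.get_feature_names_out()[top_positive])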

Step 8: Model Evaluation

  • Predicts sentiments on the test set and computes accuracy score.
  • Prints classification report detailing precision, recall, and F1-score for each class.
  • Generates confusion matrix to visualize prediction correctness and misclassifications.
Python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Output:

[Screenshot: accuracy, classification report and confusion matrix]

Step 9: Visualization of Confusion Matrix

We will visualize the confusion matrix,

  • Use matplotlib and seaborn to plot a heatmap for the confusion matrix.
  • Visualizes true vs predicted labels for diagnostic insights.
Python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[
            'Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

Output:

[Plot: confusion matrix heatmap]

Step 10: Use NLTK’s VADER Sentiment Analyzer (Lexicon-based)

This step:

  • Instantiates SentimentIntensityAnalyzer to compute sentiment polarity.
  • Defines a function returning 'positive', 'negative', or 'neutral' based on the compound score.
  • Applies VADER to example and dataset sentences to compare lexicon-based vs ML-based approaches.
Python
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()


def get_vader_sentiment(text):
    scores = sia.polarity_scores(text)
    compound = scores['compound']
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    else:
        return 'neutral'


sample_text = "I love this product! It's amazing."
print(get_vader_sentiment(sample_text))

df['vader_sentiment'] = df['sentence'].apply(get_vader_sentiment)
print(df[['sentence', 'vader_sentiment']].head())

Output:

positive

Step 11: Compare VADER and ML Model Predictions

  • Maps VADER sentiment outputs to binary for accuracy comparison against labels.
  • Calculates and prints VADER-based accuracy for benchmark evaluation.
Python
from sklearn.metrics import accuracy_score

# Recomputed here so this cell also runs standalone (same as Step 10)
df['vader_sentiment'] = df['sentence'].apply(get_vader_sentiment)

vader_labels = df['vader_sentiment'].map(
    {'positive': 1, 'negative': 0, 'neutral': -1})
valid_indices = vader_labels != -1

print("VADER Sentiment Accuracy compared to labels:")
print(accuracy_score(y[valid_indices], vader_labels[valid_indices]))

Output:

VADER Sentiment Accuracy compared to labels:
0.9548387096774194
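Sentences that VADER scores as neutral are excluded from this accuracy figure; the one-line addition below (not in the original code) shows how many were dropped.

Python
# Number of sentences VADER labeled neutral and therefore excluded
print("Neutral (excluded) sentences:", (~valid_indices).sum())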

Step 12: Predict Results on New Sentences

We will predict the results of more sentences using the model,

  • Preprocesses new input sentences and predicts sentiment using the trained model.
  • Demonstrates real-world use for unseen text.
Python
def predict_sentiment(text):
    processed = preprocess_text(text)
    vectorized = vectorizer.transform([processed])
    pred = model.predict(vectorized)[0]
    return 'positive' if pred == 1 else 'negative'


test_sentences = [
    "This is an amazing product with great quality.",
    "I did not like the service at all, very poor."
]

for text in test_sentences:
    print(f"Sentence: {text} => Sentiment: {predict_sentiment(text)}")

Output:

Sentence: This is an amazing product with great quality. => Sentiment: positive
Sentence: I did not like the service at all, very poor. => Sentiment: negative

Step 13: Save the Model and Vectorizer

This step:

  • Serializes the trained model and vectorizer with joblib.dump, enabling future predictions without retraining.
  • Facilitates deployment and reproducibility.
Python
import joblib

joblib.dump(model, 'sentiment_nb_model.joblib')
joblib.dump(vectorizer, 'count_vectorizer.joblib')

Output:

['count_vectorizer.joblib']
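To reuse the saved artifacts later, load them back with joblib.load. The snippet below is a minimal sketch of inference in a fresh session; it assumes preprocess_text is defined (or re-imported) there.

Python
import joblib

# Reload the persisted model and vectorizer
loaded_model = joblib.load('sentiment_nb_model.joblib')
loaded_vectorizer = joblib.load('count_vectorizer.joblib')

text = "Absolutely fantastic experience!"
vec = loaded_vectorizer.transform([preprocess_text(text)])
print('positive' if loaded_model.predict(vec)[0] == 1 else 'negative')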

As we can see, the model is able to predict whether a given sentence expresses positive or negative emotion.

