TF-IDF Representations in TensorFlow
Last Updated: 23 Jul, 2025
Text data is one of the most common forms of unstructured data, and converting it into a numerical representation is essential for machine learning models.
Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used text vectorization technique that helps represent text in a way that captures word importance. It evaluates the importance of a word in a document relative to a collection (corpus) of documents. It consists of two components:
- Term Frequency (TF): Measures how often a word appears in a document.
TF(w) = \frac{\text{Number of times word w appears in the document}}{\text{Total number of words in the document}}
- Inverse Document Frequency (IDF): Measures the significance of a word across multiple documents.
IDF(w) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing the word w}} + 1 \right)
The final TF-IDF score is calculated as:
TF-IDF(w) = TF(w) \times IDF(w)
Words that appear frequently in a document but are rare across the corpus will have higher TF-IDF scores.
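The formulas above can be checked with a few lines of plain Python. This is a minimal sketch over a tiny illustrative corpus; the helper functions `tf`, `idf`, and `tfidf` are our own names for the definitions above, not a library API.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

def tf(word, doc):
    # Term frequency: occurrences of the word divided by total words in the document
    words = doc.split()
    return words.count(word) / len(words)

def idf(word, docs):
    # Smoothed inverse document frequency, following the formula above
    df = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / df + 1)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "cat" appears once among the 6 words of the first document and in no other document:
# TF = 1/6, IDF = log(2/1 + 1) = log(3)
print(round(tfidf("cat", docs[0], docs), 4))  # -> 0.1831
```

Note that "the", which occurs in both documents, gets the smaller IDF of log(2), illustrating how words common across the corpus are down-weighted.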
Implementing TF-IDF in TensorFlow
TensorFlow provides efficient ways to handle text preprocessing, including TF-IDF representation. We will use the tf.keras.layers.TextVectorization layer to compute TF-IDF features.
Step 1: Import Required Libraries
Python
import tensorflow as tf
import numpy as np
Step 2: Prepare the Dataset
Python
corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]
Step 3: Create a TextVectorization Layer with TF-IDF Mode
TensorFlow’s TextVectorization layer can be used to automatically compute TF-IDF values.
Python
vectorizer = tf.keras.layers.TextVectorization(
    output_mode="tf_idf",
    ngrams=None
)

# Adapt the vectorizer to the corpus
vectorizer.adapt(corpus)
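After `adapt`, the layer has learned a vocabulary from the corpus, which you can inspect with `get_vocabulary()`. This is a quick sanity check; by default the layer lowercases text and strips punctuation, and the exact token ordering can vary between TensorFlow versions.

```python
import tensorflow as tf

corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]

vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
vectorizer.adapt(corpus)

vocab = vectorizer.get_vocabulary()
print(len(vocab))   # vocabulary size (includes the out-of-vocabulary bucket)
print(vocab[:5])    # more frequent tokens appear earlier
```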
Step 4: Convert Text to TF-IDF Representation
Python
tfidf_matrix = vectorizer(corpus)
tfidf_matrix_np = tfidf_matrix.numpy()
# Print the TF-IDF matrix
print(tfidf_matrix_np)
Output:
Each row in the TF-IDF matrix corresponds to a document in the corpus, and each column to a word in the learned vocabulary. Higher values indicate words that are more important within that document.
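A quick shape check confirms that layout: the matrix has one row per document and one column per vocabulary token. This sketch rebuilds the same corpus and vectorizer as in the steps above.

```python
import tensorflow as tf

corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]

vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
vectorizer.adapt(corpus)

tfidf_matrix = vectorizer(corpus).numpy()
# Rows = documents, columns = vocabulary tokens
print(tfidf_matrix.shape)
```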
Advantages of Using TensorFlow for TF-IDF
- Scalability: TensorFlow handles large text datasets efficiently using GPU acceleration.
- Ease of Integration: Works seamlessly with other TensorFlow components like tf.data pipelines.
- Customization: Allows users to apply preprocessing (lowercasing, tokenization) and integrate TF-IDF with deep learning models.
TF-IDF is a fundamental technique for representing text in a way that emphasizes important words. TensorFlow’s TextVectorization layer simplifies TF-IDF computation, making it a great choice for NLP applications. With this approach, you can efficiently preprocess text and feed it into machine learning models for tasks like classification, clustering, and information retrieval.
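As a sketch of that last point, the TF-IDF features can feed directly into a small Keras classifier. The architecture below is an illustrative assumption, not a recommendation, and the model is untrained, so its outputs are arbitrary probabilities.

```python
import tensorflow as tf

corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]

vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
vectorizer.adapt(corpus)

features = vectorizer(corpus)
vocab_size = features.shape[1]

# A minimal binary classifier over the TF-IDF feature vectors
model = tf.keras.Sequential([
    tf.keras.Input(shape=(vocab_size,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

predictions = model(features)
print(predictions.shape)  # one probability per document
```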