Semantic search allows computers to understand the meaning behind user queries rather than relying only on exact keyword matching. Using FAISS (Facebook AI Similarity Search), we can build a high-performance system that searches through hundreds or even thousands of documents by meaning and not just by text overlap. This approach enables smarter, faster and more context-aware information retrieval.
Implementation
The pipeline works as follows:
- Document Ingestion: PDF, DOCX and TXT files are read and converted into text.
- Chunking: Each document is split into smaller, meaningful parts.
- Embedding Generation: Text chunks are converted into dense vector representations.
- FAISS Indexing: Embeddings are stored in FAISS for efficient similarity search.
- Query Encoding: A user query is embedded into the same vector space.
- Similarity Search: FAISS finds top matches based on cosine similarity.
- Results Display: The most relevant document snippets are shown.
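The flow above can be sketched end to end without any external libraries. This is only an illustration of the stages: the `embed` function here is a toy bag-of-words stand-in (an assumption for the sketch, not the SentenceTransformer model used later), so it matches on shared words rather than on meaning.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}

def cosine(a, b):
    # Both vectors are unit length, so their dot product is the cosine similarity.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

# Ingestion and chunking are skipped here: these strings play the role of chunks.
chunks = [
    "neural networks learn patterns from data",
    "relational databases store rows in tables",
    "transformers are neural networks for text",
]
index = [embed(c) for c in chunks]              # "indexing": one vector per chunk

query_vec = embed("neural networks for text")   # the query goes into the same space
scores = [cosine(query_vec, v) for v in index]  # similarity search over every chunk
best = max(range(len(chunks)), key=lambda i: scores[i])
print(chunks[best], round(scores[best], 3))
```

In the real pipeline, the embedding step is replaced by a sentence-embedding model and the scoring loop by a FAISS index.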
Let's implement this pipeline step by step.
The sample documents used can be downloaded from here.
Step 1: Install Dependencies
We need to install the required dependencies: faiss-cpu, sentence-transformers, python-docx and PyPDF2 (the PDF reader imported in Step 2).
Python
!pip install faiss-cpu sentence-transformers python-docx PyPDF2
Step 2: Import Libraries
We will import the necessary libraries: os, numpy, faiss, SentenceTransformer, PdfReader (from PyPDF2) and Document (from python-docx).
Python
import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfReader
from docx import Document
Step 3: Extract Text from Documents
We define a function for document loading that:
- Reads different file formats like .pdf, .docx and .txt.
- Ensures content is extracted as plain text for embedding generation.
Python
def extract_text_from_file(file_path):
    """Return the plain text of a .pdf, .docx or .txt file."""
    text = ""
    if file_path.endswith(".pdf"):
        reader = PdfReader(file_path)
        for page in reader.pages:
            # Guard against pages that yield no text (for example, scanned images)
            text += (page.extract_text() or "") + "\n"
    elif file_path.endswith(".docx"):
        doc = Document(file_path)
        for para in doc.paragraphs:
            text += para.text + "\n"
    elif file_path.endswith(".txt"):
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
    return text.strip()
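Note that the loader compares extensions with `endswith`, which is case-sensitive, so a file named `report.PDF` would produce no text. A minimal sketch of a case-insensitive check is shown below; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names introduced for this illustration, not part of the tutorial's code.

```python
from pathlib import Path

# Hypothetical helper names, introduced only for this sketch.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}

def is_supported(file_path):
    """Return True when the file extension, compared case-insensitively, is supported."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS

for name in ["report.PDF", "notes.txt", "slides.pptx", "archive.tar.gz"]:
    print(name, is_supported(name))
```

The same normalized suffix could also drive the branch selection inside the loader.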
Step 4: Split Text into Chunks
- Divides long documents into smaller segments (chunks).
- Improves search accuracy and performance by focusing on smaller text units.
Python
def chunk_text(text, chunk_size=300):
    # Split on whitespace and group the words into chunks of chunk_size words
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
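Fixed-size chunks can cut a sentence in half at a chunk boundary. One common variation, shown here as a sketch rather than as part of the original implementation, is to let consecutive chunks share a few words. The function name `chunk_text_with_overlap` and the `overlap` parameter are assumptions made for this example.

```python
def chunk_text_with_overlap(text, chunk_size=300, overlap=50):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text_with_overlap(sample, chunk_size=4, overlap=1))
```

With `overlap=0` this behaves like the `chunk_text` function above. Larger overlaps keep more context at the boundaries but produce more chunks to embed.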
Step 5: Load and Process Documents
Here:
- Reads all files in the documents/ folder.
- Splits them into manageable text chunks and stores the source file name for each.
Python
folder_path = "documents/"
documents = []    # text chunks from all files
doc_sources = []  # source file name for each chunk, aligned with documents
files_read = 0    # count only the files that were actually processed
for file in os.listdir(folder_path):
    if file.endswith((".pdf", ".docx", ".txt")):
        path = os.path.join(folder_path, file)
        print(f"Reading file: {file}")
        content = extract_text_from_file(path)
        chunks = chunk_text(content)
        documents.extend(chunks)
        doc_sources.extend([file] * len(chunks))
        files_read += 1
print(f"\nLoaded {len(documents)} text chunks from {files_read} files.")
Step 6: Generate Text Embeddings
Here we:
- Convert text chunks into vector representations using SentenceTransformer.
- Normalize the vectors so that the inner product computed by FAISS equals cosine similarity.
- Show a progress bar while the embeddings are generated.
Python
model = SentenceTransformer('all-MiniLM-L6-v2')
print("\nGenerating embeddings... (this may take a minute)")
embeddings = model.encode(
    documents, convert_to_numpy=True, show_progress_bar=True)
embeddings = embeddings.astype('float32')
faiss.normalize_L2(embeddings)
print(f"Embeddings shape: {embeddings.shape}")
Step 7: Create FAISS Index
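- Initializes a FAISS IndexFlatIP (inner product) index. Because the embeddings were L2-normalized in Step 6, the inner product equals cosine similarity.
- Adds all text embeddings to the index. IndexFlatIP compares the query against every stored vector, so results are exact.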
Python
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)
print(f"FAISS index created with {index.ntotal} vectors.")
Output:
FAISS index created with 3 vectors.
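To see why an inner-product index gives cosine similarity on normalized vectors, the sketch below reproduces the idea in plain Python: scale every vector to unit length, then score each stored vector by its dot product with the query. The helper names `l2_normalize` and `flat_ip_search` are made up for this illustration; FAISS performs the same kind of exhaustive comparison far more efficiently.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def flat_ip_search(query, vectors, top_k):
    """Exhaustive inner-product search: score every vector, return the best top_k."""
    scored = [(sum(q * v for q, v in zip(query, vec)), i)
              for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]  # list of (score, position) pairs

stored = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0], [-3.0, -4.0])]
query = l2_normalize([6.0, 8.0])  # same direction as the first vector, twice as long
results = flat_ip_search(query, stored, top_k=2)
print(results)
```

The query and the first stored vector differ in length but point the same way, so after normalization their score is approximately 1, the maximum possible cosine similarity.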
Step 8: Define Cleaning and Search Functions
1. clean_text(): Removes unwanted formatting and extra spaces.
2. semantic_search_best():
- Converts the query into a vector.
- Searches the FAISS index for similar embeddings.
- Displays the best matches with readable snippets.
Python
import re
import textwrap

def clean_text(text):
    # Unwrap **bold** spans before the markup characters themselves are stripped
    text = re.sub(r'\*\*(.*?)\*\*', r'\1', text)
    # Replace runs of markup characters with a space so neighbouring words stay separated
    text = re.sub(r'[#=*`~_-]+', ' ', text)
    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def semantic_search_best(query, top_k=1, wrap_width=100, similarity_threshold=0.35, snippet_length=300):
    # Embed the query and normalize it so the inner product equals cosine similarity
    query_embedding = model.encode([query]).astype('float32')
    faiss.normalize_L2(query_embedding)
    # D holds similarity scores, I holds the positions of the matching chunks
    D, I = index.search(query_embedding, top_k)
    print("\nTop Semantic Search Result(s):")
    print("=" * 120)
    results_shown = 0
    for rank, idx in enumerate(I[0]):
        score = D[0][rank]
        # FAISS returns -1 when the index holds fewer than top_k vectors
        if idx < 0 or score < similarity_threshold:
            continue
        snippet = clean_text(documents[idx])[:snippet_length]
        wrapped_snippet = textwrap.fill(snippet, width=wrap_width)
        print(f"\nRank {rank + 1}")
        print(f"Source File : {doc_sources[idx]}")
        print(f"Similarity Score: {score:.4f}")
        print("-" * 120)
        print(f"Preview Snippet:\n{wrapped_snippet}")
        print("=" * 120)
        results_shown += 1
    if results_shown == 0:
        print("No strong semantic matches found for your query.")
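The search function prints its results directly. If the matches are needed as data, for example to feed another step, the filtering logic could be separated out as sketched below. The helper `collect_hits` and the example values are hypothetical; the inputs are shaped like one row of the scores and ids returned by `index.search`, where FAISS uses an id of -1 for slots it could not fill.

```python
def collect_hits(scores, ids, sources, threshold=0.35):
    """Turn one row of search output into a list of dicts, dropping weak or empty slots."""
    hits = []
    for score, idx in zip(scores, ids):
        if idx < 0:              # empty slot: fewer stored vectors than top_k
            continue
        if score < threshold:    # match too weak to be worth showing
            continue
        hits.append({"source": sources[idx], "score": float(score), "chunk_id": int(idx)})
    return hits

# Hypothetical values shaped like D[0] and I[0]
example_scores = [0.82, 0.31, -3.4e38]
example_ids = [2, 0, -1]
example_sources = ["a.txt", "b.pdf", "c.docx"]  # one source name per chunk position
print(collect_hits(example_scores, example_ids, example_sources))
```

Only the first entry survives here: the second falls below the threshold and the third is an empty slot.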
Step 9: Run Semantic Search
- Retrieves top semantically relevant chunks for each query.
- Displays source document name, similarity score and wrapped text preview.
Python
semantic_search_best("applications of artificial intelligence")
Python
semantic_search_best("database systems and AI", top_k=3)
The source code can be downloaded from here.
Advantages
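- High Performance: FAISS retrieves the nearest vectors quickly, even across thousands of embeddings.
- Semantic Understanding: Captures meaning beyond keywords.
- Multi-format Support: Works with PDF, DOCX and TXT files.
- Scalable: The same pipeline extends to larger datasets; FAISS also provides approximate indexes such as IndexIVFFlat and IndexHNSWFlat for cases where exhaustive search becomes too slow.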
What is the main goal of semantic search?
- To match keywords exactly
- To understand the meaning behind user queries
- To count word frequencies
- To extract document metadata
Explanation:
Semantic search focuses on understanding context and meaning, not just keyword matches.
Which library is used for similarity search in the implementation?
Explanation:
FAISS (Facebook AI Similarity Search) is used for high-speed vector similarity search.
What is the role of the SentenceTransformer model in the pipeline?
- It generates embeddings for documents and queries
- It visualizes similarity scores
Explanation:
SentenceTransformer converts text into numerical vector embeddings for semantic comparison.
Why is text chunking performed before embedding generation?
- To improve retrieval accuracy and performance
Explanation:
Splitting text into chunks improves semantic accuracy and makes search results more context-specific.
What type of FAISS index is used in this implementation?
Explanation:
IndexFlatIP is used for cosine similarity-based searches by normalizing vectors.
What does the semantic_search_best() function do?
- Encodes documents into embeddings
- Searches and displays top semantically relevant document snippets
- Cleans and tokenizes input text
Explanation:
The function converts a query into an embedding, searches the FAISS index and displays the most semantically similar results.