
Unit 3

RAG and Fine Tuning


What we will cover…

• Vector Databases
• High-dimensional Data Storage
• Vector Embeddings
• High-Dimensional Semantic Similarity
• Retrieval-Augmented Generation (RAG)
• Hyperparameter Tuning and Optimization
• Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning (PEFT) Techniques
• LoRA
• QLoRA
Introduction
• We are in the midst of the AI revolution.
• It’s upending any industry it touches, promising great innovations - but it also introduces new
challenges.
• Efficient data processing has become more crucial than ever for applications that involve
large language models, generative AI, and semantic search.
• All of these new applications rely on vector embeddings, a type of vector data representation
that carries semantic information that is critical for AI systems to gain understanding and
maintain a long-term memory they can draw upon when executing complex tasks.
• Embeddings are generated by AI models (such as Large Language Models) and have many
attributes or features, making their representation challenging to manage.
• In the context of AI and machine learning, these features represent different dimensions of
the data that are essential for understanding patterns, relationships, and underlying
structures.
[Link]
Introduction
• What are vector embeddings?
• Vector embeddings are numerical representations of data points that convert
various types of data—including nonmathematical data such as words, audio
or images—into arrays of numbers that ML models can process.
• Artificial intelligence (AI) models, from simple linear regression algorithms to
the intricate neural networks used in deep learning, operate through
mathematical logic.
• Any data that an AI model uses, including unstructured data, needs to be
recorded numerically.
• Vector embedding is a way to convert an unstructured data point into an array
of numbers that expresses that data’s original meaning.

[Link]
Introduction
• Here's a simplified example of word embeddings for a very small corpus (2 words),
where each word is represented as a 3-dimensional vector:
• cat [0.2, -0.4, 0.7]
• dog [0.6, 0.1, 0.5]
• In this example, each word ("cat") is associated with a unique vector ([0.2, -0.4, 0.7]).
• The values in the vector represent the word's position in a continuous 3-dimensional
vector space.
• Words with similar meanings or contexts are expected to have similar vector
representations.
• For instance, the vectors for "cat" and "dog" are close together, reflecting their semantic
relationship.

[Link]
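As a minimal sketch of how this closeness is usually measured, the snippet below computes the cosine similarity of the two toy vectors above (the numbers are illustrative, not from a real embedding model):

import numpy as np

# Toy 3-dimensional embeddings from the example above (illustrative values only).
cat = np.array([0.2, -0.4, 0.7])
dog = np.array([0.6, 0.1, 0.5])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, dog))  # a single score expressing how close "cat" and "dog" are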
Introduction
• Embedding models are trained to convert data points into vectors.
• Vector databases store and index the outputs of these embedding models.
• Within the database, vectors can be grouped together or identified as opposites based
on semantic meaning or features across virtually any data type.
• Vector embeddings are the backbone of recommendations, chatbots and generative
apps such as ChatGPT.
• For example, take the words “car” and “vehicle.”
• They have similar meanings but are spelled differently.
• For an AI application to enable effective semantic search, the vector representations of
“car” and “vehicle” must capture their semantic similarity.
• In machine learning, embeddings represent high-dimensional vectors that encode this
semantic information.

[Link]
Vector Databases
• A vector database is a collection of data stored as mathematical representations.
• Vector databases make it easier for machine learning models to remember previous inputs,
allowing machine learning to be used to power search, recommendations, and text generation use-cases.
• Data can be identified based on similarity metrics instead of exact matches, making it possible for a
computer model to understand data contextually.
• E.g. when one visits a shoe store a salesperson may suggest shoes that are similar to the pair one prefers.
• Likewise, when shopping in an ecommerce store, the store may suggest similar items under a header like
"Customers also bought..."
• Vector databases enable machine learning models to identify similar objects, just as the salesperson can
find comparable shoes and the ecommerce store can suggest related products.
• Thus vector databases make it possible for computer programs to draw comparisons, identify relationships,
and understand context.
• This enables the creation of advanced artificial intelligence (AI) programs like large language models (LLMs).
[Link]
Vector Databases
• A vector is an array of numerical values that expresses the location
of a point along several dimensions.
• In more everyday language, a vector is a list of numbers, like: {12,
13, 19, 8, 9}.
• These numbers indicate a location within a space, just as a row and
column number indicates a certain cell in a spreadsheet (e.g. "B7").

In this simple vector database, the documents in the upper right are likely similar
to each other. [Link]
Vector Databases
• Traditional Databases
• Data Type: Store structured data (rows, columns, tables).
• Query Mechanism: Rely on exact match or relational logic (e.g., SQL).
• Indexing: B-trees, hash indexes optimized for structured queries.
• Use Cases: Banking transactions, inventory systems, CRM, ERP.
• Strengths:
• ACID compliance (Atomicity, Consistency, Isolation, Durability): properties that ensure
database transactions are processed reliably and leave the database in a valid, consistent
state.
• Great for well-defined schema
• High efficiency for deterministic queries
• Limitations:
• Cannot handle unstructured data (images, text, audio) effectively
• No semantic understanding — “apple” the fruit ≠ “Apple” the company
Vector Databases
• Vector Databases
• Data Type: Store embeddings (high-dimensional vectors) from ML/AI models.
• Query Mechanism: Similarity search (cosine similarity, Euclidean distance, dot product).
• Indexing: Approximate Nearest Neighbor (ANN) techniques (HNSW, IVF, PQ).
• Use Cases: Semantic search, recommendation engines, Retrieval-Augmented Generation (RAG),
image/video/audio similarity, anomaly detection.
• Strengths:
• Handles unstructured data effectively
• Supports semantic and fuzzy queries (meaning-based)
• Scales to billions of embeddings
• Limitations:
• Not optimized for transactions
• Emerging ecosystem — fewer mature tools than traditional DBs
• Compute-intensive indexing/search
Vector Databases

• Quick Analogy
• Traditional DB = Library Catalog: find exact book ID or author
name.
• Vector DB = Google Search: find content by meaning, even if you
don’t use exact words.
Vector Databases
• How are vector databases used?
• Vector databases serve three key functions in AI and ML applications:
• Vector storage
• Vector indexing
• Similarity search based on querying or prompting
• In operation, vector databases work by using multiple algorithms to conduct an approximate
nearest neighbor (ANN) search.
• The algorithms are then gathered in a pipeline to quickly and accurately retrieve and deliver data
neighboring the vector that is queried.
• For example, an ANN search could look for products that are visually similar in an e-commerce
catalog.
• Additional uses include anomaly detection, classification and semantic search. Because a dataset
runs through a model just once, results are returned within milliseconds.

[Link]
Vector Databases
• Vector storage
• Vector databases store the outputs of an embedding model algorithm, the
vector embeddings.
• They also store each vector’s metadata—including title, description and data
type—which can be queried by using metadata filters.
• By ingesting and storing these embeddings, the database can facilitate fast
retrieval of a similarity search, matching the user’s prompt with a similar
vector embedding.

[Link]
Vector Databases
• Vector indexing
• Vectors need to be indexed to accelerate searches within high-dimensional data spaces.
• Vector databases create indexes on vector embeddings for search functions.
• The vector database indexes vectors by using an ML algorithm.
• Indexing maps the vectors to new data structures that enable faster similarity or distance
searches, such as nearest neighbor searches, between vectors.
• Vectors can be indexed by using algorithms such as hierarchical navigable small world
(HNSW), locality-sensitive hashing (LSH) or product quantization (PQ).
• HNSW is popular as it creates a layered graph structure. Each layer contains a subset of the vectors,
and similarities between vectors are represented by the edges between nodes.
• LSH indexes content by using an approximate nearest-neighbor search. For extra speed, the index can be
optimized by returning an approximate, but nonexhaustive result.
• PQ converts each dataset into a short, memory-efficient representation. Only the short representations
are stored, rather than all of the vectors.
[Link]
Vector Databases
• Similarity search based on querying or prompting
• Query vectors are vector representations of search queries.
• When a user queries or prompts an AI model, the model computes an embedding of the query or
prompt.
• The database then calculates distances between query vectors and vectors stored in the index to
return similar results.
• Databases can measure the distance between vectors with various algorithms, such as nearest
neighbor search.
• Measurements can also be based on various similarity metrics, such as cosine similarity.
• The database returns the most similar vectors or nearest neighbors to the query vector according to
the similarity ranking.
• These calculations support various machine learning tasks, such as recommendation systems,
semantic search, image recognition and other natural language processing tasks.

[Link]
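A minimal sketch of this query flow, using plain NumPy and a brute-force ranking (real vector databases replace the exhaustive scan with an ANN index; all names and values below are hypothetical):

import numpy as np

# A tiny in-memory "vector store": rows are stored embeddings, ids label them.
ids = ["doc_car", "doc_vehicle", "doc_banana"]
stored = np.array([
    [0.9, 0.1, 0.3],   # hypothetical embedding of a text about cars
    [0.8, 0.2, 0.4],   # hypothetical embedding of a text about vehicles
    [0.1, 0.9, 0.2],   # hypothetical embedding of an unrelated text
])

def top_k(query, vectors, labels, k=2):
    # Rank stored vectors by cosine similarity to the query and return the k best.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # cosine similarity of every stored vector to the query
    order = np.argsort(-sims)[:k]  # indices of the k most similar vectors
    return [(labels[i], float(sims[i])) for i in order]

query_vec = np.array([0.85, 0.15, 0.35])  # hypothetical embedding of the user's query
print(top_k(query_vec, stored, ids, k=2))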
Vector Databases
• Advantages of vector databases
• Vector databases are a popular way to power enterprise AI-based applications because they can deliver many benefits:
• Speed and performance
• Vector databases use various indexing techniques to enable faster searching.
• Vector indexing and distance-calculating algorithms such as nearest neighbor search can help optimize performance
when searching for relevant results across large datasets with millions, if not billions, of data points.
• Scalability
• Vector databases can store and manage massive amounts of unstructured data by scaling horizontally with additional
nodes, maintaining performance as query demands and data volumes increase.
• Lower cost of ownership
• Because they enable faster data retrieval, vector databases speed the training of foundation models.
• Data management
• Vector databases typically provide built-in features to easily update and insert new unstructured data.
• Flexibility
• Vector databases are built to handle the added complexity of using images, videos or other multidimensional data.
• Given the multiple use cases ranging from semantic search to conversational AI applications, vector databases can be
customized to meet business and AI requirements.
[Link]
Vector Databases
• How does a vector database work?
• A vector database uses a combination of different algorithms that all
participate in Approximate Nearest Neighbor (ANN) search.
• These algorithms optimize the search through hashing, quantization, or
graph-based search.
• These algorithms are assembled into a pipeline that provides fast and
accurate retrieval of the neighbors of a queried vector.
• Since the vector database provides approximate results, the main trade-
offs we consider are between accuracy and speed.
• The more accurate the result, the slower the query will be. However, a
good system can provide ultra-fast search with near-perfect accuracy.

[Link]
Vector Databases

• Unstructured data, such as text, images, and audio, lacks a predefined format, posing challenges for
traditional databases. To leverage this data in artificial intelligence and machine learning applications, it's
transformed into numerical representations using embeddings.
• Embedding is like giving each item, whether it's a word, image, or something else, a unique code that
captures its meaning or essence. This code helps computers understand and compare these items in a more
efficient and meaningful way. Think of it as turning a complicated book into a short summary that still
captures the main points. [Link]
Vector Databases

• This embedding process is typically achieved using a special kind of neural network designed for the task. For
example, word embeddings convert words into vectors in such a way that words with similar meanings are
closer in the vector space.
• This transformation allows algorithms to understand relationships and similarities between items.
• Essentially, embeddings serve as a bridge, converting non-numeric data into a form that machine learning
models can work with, enabling them to discern patterns and relationships in the data more effectively.
[Link]
Vector Databases
• Pipeline for a vector database

1. Indexing: The vector database indexes vectors using an algorithm such as PQ, LSH, or
HNSW. This step maps the vectors to a data structure that will enable faster searching.
2. Querying: The vector database compares the indexed query vector to the indexed vectors
in the dataset to find the nearest neighbors (applying a similarity metric used by that
index).
3. Post-Processing: In some cases, the vector database retrieves the final nearest neighbors
from the dataset and post-processes them to return the final results. This step can include
re-ranking the nearest neighbors using a different similarity measure.
[Link]
Vector Databases
• Popular vector databases
• Chroma

• Chroma is an open-source embedding database.


• Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.
• You can easily manage text documents, convert text to embeddings, and do similarity searches.
• ChromaDB features:
• LangChain (Python and JavaScript) and LlamaIndex support available
• The same API that runs in Python notebook scales to the production cluster
[Link]
Vector Databases
• Popular vector databases
• Pinecone

Pinecone is a managed vector database platform that has been purpose-built to tackle the unique challenges associated with high-
dimensional data.
Equipped with cutting-edge indexing and search capabilities, Pinecone empowers data engineers and data scientists to construct and
implement large-scale machine learning applications that effectively process and analyze high-dimensional data.
Key features of Pinecone include:
• Fully managed service
• Highly scalable
• Real-time data ingestion
• Low-latency search
• Integration with LangChain
Notably, Pinecone was the only vector database included in the inaugural Fortune 2023 50 AI Innovator list.
[Link]
Vector Databases
• Popular vector databases
• Weaviate

• Weaviate is an open-source vector database.


• It allows you to store data objects and vector embeddings from your favorite ML models and scale seamlessly into
billions of data objects.
• Some of the key features of Weaviate are:
• Weaviate can quickly search the nearest neighbors from millions of objects in just a few milliseconds.
• With Weaviate, either vectorize data during import or upload your own, leveraging modules that integrate with
platforms like OpenAI, Cohere, HuggingFace, and more.
• From prototypes to large-scale production, Weaviate emphasizes scalability, replication, and security.
• Apart from fast vector searches, Weaviate offers recommendations, summarizations, and neural search
framework integrations.
[Link]
Vector Databases
• Popular vector databases
• Faiss

• Faiss is an open-source library for the swift search of similarities and the clustering of dense vectors.
• It houses algorithms capable of searching within vector sets of varying sizes, even those that might exceed RAM
capacity.
• Additionally, Faiss offers auxiliary code for assessment and adjusting parameters.
• While it's primarily coded in C++, it fully supports Python/NumPy integration.
• Some of its key algorithms are also available for GPU execution.
• The primary development of Faiss is undertaken by the Fundamental AI Research group at Meta.
[Link]
Vector Databases
• Popular vector databases
• Qdrant

• Qdrant is a vector database and a tool for conducting vector similarity searches. It operates as an API service, enabling
searches for the closest high-dimensional vectors.
• Using Qdrant, you can transform embeddings or neural network encoders into comprehensive applications for tasks
like matching, searching, making recommendations, and much more.
• Here are some key features of Qdrant:
• Offers OpenAPI v3 specs and ready-made clients for various languages.
• Uses a custom HNSW algorithm for rapid and accurate searches.
• Allows results filtering based on associated vector payloads.
• Supports string matching, numerical ranges, geo-locations, and more.
• Cloud-native design with horizontal scaling capabilities.
• Built in Rust, optimizing resource use with dynamic query planning. [Link]
Indexing in Vector Databases
• Indexing refers to the process of organizing high-dimensional vectors in a way that provides efficient
querying of nearest-neighbor vectors.
• This is the most crucial part of building any vector database.
• These indexes enable fast and efficient querying of high-dimensional embeddings.
• There are multiple indexing methods to create vector indices, such as:
• Linear search Algorithm (Flat Indexing):
• This is a linear search algorithm, which means it will compare the query vector with every other vector stored
in the database.
• This is the simplest method out there and works well with small datasets.
• Cluster-based algorithm (IVF):
• Inverted File is a cluster-based indexing technique.
• It uses k-means clustering to cluster all the vectors.
• When a query vector is provided, it calculates the distance between the query vector and the centroids of each
cluster.
• And starts searching for the nearest neighbors in the cluster with the centroid closest to the query vector.
• This significantly reduces query time.
[Link]
Indexing in Vector Databases
• Quantization (Scalar and Product Quantization):
• The quantization technique involves reducing the memory footprint of large
embeddings by reducing their precision.
• Graph-based (HNSW):
• The most common indexing method.
• It uses hierarchical graph architecture to index vectors.

[Link]
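As a sketch of the flat (linear) and cluster-based (IVF) indexing methods above, here is how they can be built with the Faiss library mentioned earlier in this unit; the data is random and the parameter values are illustrative:

import numpy as np
import faiss  # pip install faiss-cpu

d = 64                                                  # embedding dimensionality
xb = np.random.random((10_000, d)).astype("float32")    # vectors to index
xq = np.random.random((5, d)).astype("float32")         # query vectors

# Flat index: exhaustive (linear) search, exact but slow for very large datasets.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
d_flat, i_flat = flat.search(xq, 5)

# IVF index: k-means clusters the vectors; at query time only the closest
# nprobe clusters are scanned, which is approximate but much faster.
nlist = 100                                             # number of clusters
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                                           # learn the cluster centroids
ivf.add(xb)
ivf.nprobe = 10                                         # clusters to scan per query
d_ivf, i_ivf = ivf.search(xq, 5)

print(i_flat[0])
print(i_ivf[0])                                         # usually overlaps heavily with the flat result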
Hierarchical Navigable Small World
• Hierarchical Navigable Small World (HNSW) is a
state-of-the-art algorithm used for an approximate
search of nearest neighbours.
• HNSW constructs optimized graph structures.
• The main idea of HNSW is to construct such a
graph where a path between any pair of vertices
could be traversed in a small number of steps.
• HNSW graphs are among the top-performing
indexes for vector similarity search.
• It is a hugely popular technology that time and
time again produces state-of-the-art performance
with super fast search speeds and fantastic recall.
Foundations of HNSW
• Approximate Neighbor Search
• To reduce the computational complexity of an exhaustive search such as k-NN, approximate
nearest neighbor search is the preferred technique.
• ANN allows us to get a massive performance boost on similarity search when dealing with huge
datasets.
• In approximate nearest neighbors (ANN), we build index structures that narrow down the search
space and improve lookup times.
• Apart from that, most ML models produce vectors that have high dimensionality which is
another hurdle to overcome.
• Approximate search relies on the fact that even though data is represented in a large number of
dimensions, their actual complexity is low.
• It tries to work with the true intrinsic dimensionality of data.
• There are various algorithms to solve the approximate search problem and to actually dive into how
approximate search works warrants another article of its own.
Foundations of HNSW
• We can split ANN algorithms into three distinct categories; trees, hashes,
and graphs.
• HNSW slots into the graph category.
• More specifically, it is a proximity graph, in which two vertices are linked
based on their proximity (closer vertices are linked) — often defined in
Euclidean distance.
• There is a significant leap in complexity from a ‘proximity’ graph
to ‘hierarchical navigable small world’ graph.
• The two fundamental techniques that contributed most heavily to HNSW:
• The probability skip list
• Navigable small world graphs
Foundations of HNSW
• Probability Skip List
• The probability skip list was introduced way back in 1990 by William Pugh [2].
• Skip list is a probabilistic data structure that allows inserting and searching elements within a sorted list for O(logn) on
average.
• It allows fast search like a sorted array, while using a linked list structure for easy (and fast) insertion of new elements
(something that is not possible with sorted arrays).
• Skip lists work by building several layers of linked lists.
• The lowest layer has the original linked list with all the elements in it.
• When moving to higher levels, the number of skipped elements increases, thus decreasing the number of connections.

[Link]
Foundations of HNSW
• Probability Skip List
• On the first layer, we find links that skip many intermediate nodes/vertices.
• As we move down the layers, the number of ‘skips’ by each link is decreased.
• The search procedure for a certain value starts from the highest level and compares its next element with the value.
• If the value is greater than or equal to the next element, the algorithm proceeds to that element.
• Otherwise, the search procedure descends to the lower layer with more connections and repeats the same process.
• At the end, the algorithm descends to the lowest layer and finds the desired node.

[Link]
Foundations of HNSW
• Probability Skip List
• To search a skip list, we start at the highest layer with the longest ‘skips’ and move along the
edges towards the right (below).
• If we find that the current node ‘key’ is greater than the key we are searching for, we know
we have overshot our target, so we move down to the previous node in the layer below.
• HNSW inherits the same layered format with longer edges in the highest layers (for fast
search) and shorter edges in the lower layers (for accurate search).

[Link]
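A minimal sketch of this search procedure, assuming each layer is simply a sorted Python list (a real skip list uses linked nodes and randomised insertion, as described above):

# Layers of a pre-built skip list: the top layer skips the most elements,
# the bottom layer is the complete sorted list.
layers = [
    [1, 20],                      # top layer: longest "skips"
    [1, 9, 20, 30],               # middle layer
    [1, 4, 9, 13, 20, 25, 30],    # bottom layer: all elements
]

def skiplist_search(layers, target):
    pos = 0                                   # index reached so far, carried down the layers
    for i, layer in enumerate(layers):
        # move right while the next element is still <= target
        while pos + 1 < len(layer) and layer[pos + 1] <= target:
            pos += 1
        if layer[pos] == target:
            return True
        if i + 1 < len(layers):
            # descend: locate the current element in the denser layer below
            pos = layers[i + 1].index(layer[pos])
    return False

print(skiplist_search(layers, 20))  # True
print(skiplist_search(layers, 14))  # False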
Foundations of HNSW
• Probability Skip List

• Example of searching in probability skip list


• Search for index no. 20

[Link]
Foundations of HNSW
• Probability Skip List
• Example of searching in probability skip list
Foundations of HNSW
• Probability Skip List
• While lookups in a skip list are fast, insertion and deletion are slower, as they add
additional overhead for updating and deleting on multiple layers.
• During insertion, we start from the bottom list and add the node at the
appropriate position.
• As skip lists maintain a hierarchical structure, we need to determine if the
node appears at a higher level.
• The process is random, like a coin toss.
• The probability of a node appearing in its immediate upper layer is 0.5.

[Link]
Foundations of HNSW
• Probability Skip List
• In an ideal skip list, the number of nodes on layer+1 will be ~n/2, and in layer+2 ~n/4,
where n is the number of nodes on the bottom-most layer or the complete linked list.
• Consider the following example:

• We find the ideal place for insertion and insert the node at the bottom level.
• We then decide if the node appears on the upper level based on a random binary outcome (heads or tails).
• In a perfect skip list, we get a balanced distribution of nodes in each level.
• Deletion happens similarly.
• Find the target number and delete the node.
• If the element is there on a higher layer, delete
it and update the linked list.
[Link]
Foundations of HNSW
• Small world experiment
• Small worlds were famously studied in Stanley Milgram’s small-world experiment.
• Participants were given a letter containing the address and other basic details of a
randomly chosen target individual, along with the experiment’s instructions.
• In the unlikely event that they personally knew the target, they were instructed to
send them the letter; otherwise, they were told to think of someone they knew
who was more likely to know the target, and send the letter on to them.
• The surprising conclusion was that the letters were typically only sent around six
times before reaching the target, demonstrating the famous idea of "six degrees of
separation" – any two people can usually be connected by a small chain of friends.
• In the mathematical field of graph theory, a graph is a set of points, some of which are connected.
• For e.g., think of a social network as a graph, with people as points and friendships as connections.
• The small-world experiment found that most pairs of points in this graph are connected by short paths that have a small number of steps.

Illustration of a small world: most connections (grey) are local, but there are also long-range connections (green), which create short paths between points, such as the three-step path between points A and B indicated with arrows.
[Link]
Foundations of HNSW
• Small world
• Most graphs have the property of short paths, including graphs
created by just connecting random pairs of points.
• Consider the example of social networks:
• They are not connected randomly, they are highly local
• Friends tend to live close to each other, and if you know two people, it’s quite
likely they know each other too.
• This is described technically as the graph having a high clustering coefficient.
• The surprising thing about the small-world experiment is that two distant points
are only separated by a short path despite connections typically being short-
range.
• In cases like these when a graph has lots of local connections, but also has short
paths, we say the graph is a small world.

[Link]
Foundations of HNSW
• Small world
• Most graphs have the property of short paths, including graphs
created by just connecting random pairs of points.
• Another example of a small world is the global airport network.
• Airports in the same region are highly connected to one another, but it’s
possible to make a long journey in only a few stops by making use of
major hub airports.
• For example, a journey from Nagpur, India to Osaka, Japan typically
starts with a local flight from Nagpur to Mumbai, then a long distance
flight from Mumbai to Tokyo, and finally another local flight from Tokyo
to Osaka.
• Long-range hubs are a common way of achieving the small world
property.
• A final interesting example of graphs with the small world
property is biological neural networks such as the human brain.
[Link]
Foundations of HNSW
• Navigable Small Worlds
• The idea of small world can be adopted for nearest neighbour search
• if we create connections between our vectors in such a way that it forms a small world
graph, we can quickly find the vectors near a target by starting from an arbitrary "entry
point" vector and then navigating through the graph towards the target.
• This possibility was explored by Kleinberg.
• He noted that
• The existence of short paths wasn’t the only interesting thing about small world
experiment:
• it was also surprising that people were able to find these short paths, without using any
global knowledge about the graph.
• Rather, the people were following a simple greedy algorithm.
• At each step, they examined each of their immediate connections, and sent it to the one
they thought was closest to the target.
• We can use a similar algorithm to search a graph that connects vectors.
[Link]
Foundations of HNSW
• Navigable Small Worlds

Illustration of the greedy search algorithm.


• We are searching for the vector that is nearest the target X.
• Starting at the entry point E, we check the distance to X of each vector connected to E (indicated by the
arrows from E), and go to the closest one (indicated by the red arrow from E).
• We repeat this procedure at successive vectors until we reach Y.
• As Y has no connections that are closer to X than Y itself, we stop and return Y.

[Link]
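A minimal sketch of this greedy procedure on a made-up toy graph (node names, coordinates and edges are all illustrative; E is the entry point and Y is the vector nearest the target):

import math

coords = {"E": (0, 0), "A": (1, 2), "B": (2, 1), "C": (3, 3), "Y": (4, 2)}
graph = {                         # adjacency list of the proximity graph
    "E": ["A", "B"],
    "A": ["E", "C"],
    "B": ["E", "Y"],
    "C": ["A", "Y"],
    "Y": ["B", "C"],
}

def greedy_search(entry, target_xy):
    current = entry
    while True:
        d_current = math.dist(coords[current], target_xy)
        # neighbour of the current node that is closest to the target
        best = min(graph[current], key=lambda n: math.dist(coords[n], target_xy))
        if math.dist(coords[best], target_xy) >= d_current:
            return current        # no neighbour is closer: stop and return the current node
        current = best

print(greedy_search("E", (4.2, 2.1)))  # ends at "Y", the node nearest this target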
Foundations of HNSW
• Navigable Small Worlds
• Kleinberg performed simple simulations of small worlds in which all of the points were connected to their immediate
neighbours, with additional longer connections created between random points.

• He discovered that the greedy algorithm would only find a short path in specific conditions, depending on the lengths of the
long-range connections.
• If the long-range connections were too long (as was the case when they connected pairs of points in completely random
locations), the greedy algorithm could follow a long-range connection to quickly reach the rough area of the target, but after
that the long-range connections were of no use, and the path had to step through the local connections to get closer.
• On the other hand, if the long-range connections were too short, it would simply take too many steps to reach the area of the
target.

• If, however, the lengths of the long-range connections were just right (if they were uniformly distributed, so that all lengths
were equally likely), the greedy algorithm would typically reach the neighbourhood of the target in an especially small number
of steps (a number proportional to log(n), where n is the number of points in the graph).
• In cases like this where the greedy algorithm can find the target in a small number of steps, we say the small world is
a navigable small world (NSW).
[Link]
Foundations of HNSW
• Navigable Small Worlds
• A method proposed to build NSW for vectors in a complex, high-
dimensional space :
• we insert one randomly chosen vector at a time to the graph, and connect it to a
small number m of nearest neighbours that were already inserted.

Illustration of building an NSW.


• Vectors are inserted in a random order and connected to the
nearest m = 2 inserted vectors.
• Note how the first vectors to be inserted form long-range
connections while later vectors form local connections.

[Link]
Foundations of HNSW
Hierarchical Navigable Small Worlds
• A typical path through an NSW from the entry point towards the target went through two
phases: a "zoom-out" phase, in which connection lengths increase from short to long, and a
"zoom-in" phase, in which the reverse happens.
• To traverse such a path efficiently, we want to limit the number of connections that must be checked at each hub.
• This leads to the main idea of HNSW: explicitly distinguishing between short-range and long-range
connections.
• In the initial stage of a search, we will only consider the long-range connections between hubs.
• Once the greedy search has found a hub near the target, we then switch to using the short-range
connections.

[Link]
Foundations of HNSW
Hierarchical Navigable Small Worlds

• Illustration of a search through an HNSW.


• We are searching for the vector nearest the target
X.
• Long-range connections and hubs are green; short-
range connections are grey.
• Arrows show the search path.
• Starting at the entry point E1, we perform a greedy
search among the long-range connections,
reaching E2, which is the nearest long-range hub
to X.
• From there we continue the greedy search among
the short-range connections, ending at Y, the
nearest vector to X.
[Link]
Foundations of HNSW
Hierarchical Navigable Small Worlds

• Illustration of a search through an HNSW.


• We can also explicitly impose a maximum number of long-
range and short-range connections of each vector when we
build the index.
• This results in a fast search time (proportional to log(n)).
• The idea of separate short and long connections can be
generalized to include several intermediate levels of
connection lengths.
• We can visualize this as a hierarchy of layers of connected
vectors, with each layer only using a fraction of the vectors
in the layer below.

[Link]
Foundations of HNSW
Hierarchical Navigable Small Worlds

• Illustration of a search through an HNSW.

Left: illustration of an HNSW with three levels of connection length – short connections are grey,
longer connections are green, and the longest connections are red. E is the entry point. Right:
visualising the HNSW as a stack of three layers. Dotted lines indicate the location of the same vector
in the layer below.
[Link]
Foundations of HNSW
Hierarchical Navigable Small Worlds

• Illustration of a search through an HNSW.


Foundations of HNSW
Hierarchical Navigable Small Worlds

• HNSW (Hierarchical Navigable Small World graphs) is one of the most popular
Approximate Nearest Neighbor (ANN) indexing algorithms used in vector
databases.
• What HNSW Does
• Instead of exhaustively comparing a query vector with all stored vectors (which
is expensive at scale),
• HNSW builds a graph structure where each vector is connected to its
“neighbors” in multiple layers.
• Search starts from the top layer (sparser connections) and navigates down to
denser layers.
• This allows fast approximate nearest neighbor search.
Foundations of HNSW
Hierarchical Navigable Small Worlds

Top-k Retrieval with HNSW


• When you query with a vector, HNSW traverses the graph to find candidate neighbors.
• Then, it ranks them by similarity (cosine similarity, dot product, or Euclidean distance).
• Finally, it returns the top-k closest matches.
• In vector DBs like Chroma, Weaviate, Milvus, Qdrant, you typically specify k (or
n_results) in your query:
• Example of ChromaDB
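The ChromaDB example referred to above is not reproduced here; the sketch below shows what such a top-k query typically looks like with the Chroma Python client (the collection name and documents are made up; check the current Chroma docs for the exact API of your version):

import chromadb

client = chromadb.Client()                        # in-memory client
collection = client.create_collection(name="articles")

# Chroma embeds the documents with its default embedding function.
collection.add(
    ids=["d1", "d2", "d3"],
    documents=[
        "Vector databases store embeddings for similarity search.",
        "HNSW is a graph-based approximate nearest neighbor index.",
        "Bananas are rich in potassium.",
    ],
)

# n_results is the 'k' in top-k retrieval.
results = collection.query(query_texts=["how do ANN indexes work?"], n_results=2)
print(results["ids"])
print(results["distances"])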
Foundations of HNSW

• Key HNSW Parameters


• M (max connections per node)
• Controls how many neighbors each vector connects to in the
graph.
• Impact:
• Higher M → denser graph → better recall (higher accuracy in top-k results).
• Lower M → sparser graph → faster, less memory, but reduced accuracy.
• Typical values: 8–64 (default in many DBs is ~16).
Foundations of HNSW

• Key HNSW Parameters


• efConstruction (size of dynamic candidate list during index building)
• Determines how much effort the algorithm spends finding good neighbour candidates for each vector while building the index.
• Impact:
• Higher efConstruction → more accurate graph structure → better search
quality.
• Lower efConstruction → faster index build but weaker graph.
• Trade-off: This affects indexing time (not query time).
Foundations of HNSW

• Key HNSW Parameters


• efSearch (size of dynamic candidate list during querying)
• Controls how many candidate neighbors are explored at query
time.
• Impact:
• Higher efSearch → higher recall (top-k results closer to exact nearest
neighbors).
• Lower efSearch → faster queries but possibly lower accuracy.
• Typical values: 50–500.
Foundations of HNSW

• Key HNSW Parameters


• Effect on Top-k Retrieval
• Recall (accuracy): If efSearch is too low, you might miss the true
nearest neighbors in your top-k.
• Latency (speed): Larger efSearch = slower search because more
nodes are visited.
• Memory usage: Higher M increases memory usage, since more
connections per node are stored.
Foundations of HNSW

Example: Top-5 Matches


Suppose we want top-5 (k=5) nearest neighbors.

•If M=16, efConstruction=200, efSearch=50 → fast search, ~90% recall.


•If M=32, efConstruction=400, efSearch=200 → slower but ~99% recall.

•For production (real-time queries): Keep efSearch smaller (faster).


•For research/accuracy-critical tasks: Increase efSearch (more accurate).
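As a concrete illustration of these parameters, here is a small sketch using the hnswlib library (random data; the values mirror the example above and are not tuned for any real workload):

import numpy as np
import hnswlib  # pip install hnswlib

dim, num = 128, 10_000
data = np.random.random((num, dim)).astype(np.float32)

# Build the index: M and ef_construction are fixed at build time.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num, ef_construction=200, M=16)
index.add_items(data, np.arange(num))

# efSearch is set at query time and trades recall for speed; it must be >= k.
index.set_ef(50)
labels, distances = index.knn_query(data[:1], k=5)   # top-5 neighbours of the first vector
print(labels)
print(distances)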
Tutorial 1
Step 1: Data Setup

We use a simplified greedy insertion rule:


• For each new point, start from an entry point (the first node inserted — A),
greedily move to closer neighbors, collect up to efConstruction candidate nodes,
then connect the new node to the up to M nearest among those candidates.
• We also update neighbor links (symmetrically) but cap neighbors per node to M.
Tutorial
Step 2: Building a simplified HNSW graph (insertion order A → B → C → D → E → F)

• We use 6 points in 2D space:

• A = (0,0)
• B = (1,0)
• C = (0,1)
• D = (1,1)
• E = (2,0)
• F = (2,1)

• Query = (1.5, 0.2)


• k = 2 (top-2 neighbors)
• M = 2, efConstruction = 2, efSearch = 3
Tutorial
Step 2: Simplified HNSW Graph
• Edges after simplified construction (M=2):

• A: {B, C}
• B: {A, E}
• C: {A, D}
• D: {C, F}
• E: {B, F}
• F: {E, D}
Tutorial
Step 3: Distance Calculations (Part 1)
• Query q = (1.5, 0.2)

• dist(q,A) = sqrt((1.5-0)^2 + (0.2-0)^2) ≈ 1.513


• dist(q,B) = sqrt((1.5-1)^2 + (0.2-0)^2) ≈ 0.539
• dist(q,C) = sqrt((1.5-0)^2 + (0.2-1)^2) = 1.700
Tutorial
Step 3: Distance Calculations (Part 2)

• dist(q,D) = sqrt((1.5-1)^2 + (0.2-1)^2) ≈ 0.943


• dist(q,E) = sqrt((1.5-2)^2 + (0.2-0)^2) ≈ 0.539
• dist(q,F) = sqrt((1.5-2)^2 + (0.2-1)^2) ≈ 0.943

• True nearest neighbors: B and E (≈0.539)


Tutorial
Step 4: HNSW Search Walkthrough (efSearch=3)
• Start at A → dist=1.513
• Candidates: {A}

• Expand A: add B (0.539), C (1.700)


• Candidates: {B, A, C}

• Expand B: add E (0.539)


• Candidates: {E, A, C}

• Expand E: add F (0.943)


• Candidates: {F, A, C}

• Best found: B and E → Top-2 matches


Tutorial
Step 5: Key Takeaways
• HNSW finds nearest neighbors by graph navigation.
• Top-2 neighbors found: B and E.
• efSearch controls accuracy vs speed.
• M controls graph density.
• Efficient: not all distances need checking.
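The walkthrough above can be reproduced with a short Python sketch (the bookkeeping is simplified compared to real HNSW implementations, but it follows the same candidate-list idea):

import math

points = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1), "E": (2, 0), "F": (2, 1)}
edges  = {"A": {"B", "C"}, "B": {"A", "E"}, "C": {"A", "D"},
          "D": {"C", "F"}, "E": {"B", "F"}, "F": {"E", "D"}}
query  = (1.5, 0.2)

def d(name):
    return math.dist(points[name], query)

def search(entry, k=2, ef_search=3):
    visited = {entry}
    candidates = [entry]              # nodes that may still be expanded
    found = [entry]                   # best ef_search nodes seen so far
    while candidates:
        candidates.sort(key=d)
        node = candidates.pop(0)      # expand the closest unexpanded node
        # stop once the next candidate is worse than the worst of the current best
        if len(found) >= ef_search and d(node) > d(found[-1]):
            break
        for nb in edges[node]:
            if nb not in visited:
                visited.add(nb)
                candidates.append(nb)
                found.append(nb)
        found = sorted(found, key=d)[:ef_search]
    return sorted(found, key=d)[:k]

print(search("A"))  # ['B', 'E'] -> the true top-2 neighbours of the query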
Locality Sensitive hashing

• Locality Sensitive Hashing (LSH) is a technique used to efficiently find


similar items in high-dimensional data.
• Instead of comparing every pair of items (which is very costly when you
have millions of vectors), LSH uses special hash functions that map
similar items to the same “bucket” with high probability.

• Traditional hashing tries to minimize collisions.


• LSH does the opposite: it designs hash functions so that similar items are more
likely to collide (land in the same bucket).
• This allows fast approximate nearest neighbor (ANN) search.
Locality Sensitive hashing

A typical hash function used in dictionaries in Python aims to place different values (no
matter how similar) into separate buckets.
Locality Sensitive hashing
• However, there is a key difference between this type of hash function and that used in LSH.
• With dictionaries, our goal is to minimize the chances of multiple key-values being mapped to the
same bucket — we minimize collisions.
• LSH is almost the opposite. In LSH, we want to maximize collisions — although ideally only
for similar inputs.
Locality Sensitive hashing
• There is no single approach to hashing in LSH.
• They all share the same ‘bucket similar samples through a hash function’ logic, but they can vary a lot beyond this.
• There are various approaches, such as the traditional approach using shingling, MinHashing, and banding.
• There are several other techniques as well, such as Random Projection.
Locality Sensitive hashing
• How LSH Works
• The LSH algorithm comprises three main steps:
• Step 1: Hashing
• LSH involves creating multiple hash functions, each generating a hash code for every data point.
• These hash codes typically consist of binary strings of equal length.
• Step 2: Binning
• The hash codes are used to distribute the data points into hash buckets.
• Points with the same hash code are grouped together in the same bucket.
• Step 3: Querying
• When a query item is presented, it undergoes the same hashing process to generate a hash code.
• Then, we look for the bucket containing the hash code of the query item.
• All items within this bucket are potential candidates for being similar to the query item.
• We then perform a refined similarity check on these candidates using the original high-dimensional data to identify true similarities.
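The three steps above can be illustrated with a small random-projection sketch (Random Projection is one of the LSH families mentioned earlier; the data, dimensions, and number of bits are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def make_hasher(dim, n_bits):
    # Each hash function is a random hyperplane; each bit records which side of
    # the hyperplane a vector falls on. Similar vectors tend to share bits.
    planes = rng.normal(size=(n_bits, dim))
    def hash_vector(v):
        return tuple(bool(x) for x in (planes @ v > 0))   # n_bits-long binary hash code
    return hash_vector

dim, n_bits = 8, 6
hasher = make_hasher(dim, n_bits)

# Step 1 and 2: hash every data point and bin it into a bucket.
data = {f"item{i}": rng.normal(size=dim) for i in range(100)}
buckets = {}
for name, vec in data.items():
    buckets.setdefault(hasher(vec), []).append(name)

# Step 3: hash the query, inspect only its bucket, then refine with exact distances.
query = data["item7"] + 0.05 * rng.normal(size=dim)       # a slightly perturbed copy of item7
candidates = buckets.get(hasher(query), [])
print("candidates in the query's bucket:", candidates)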
Locality Sensitive hashing
• Key factors and parameters in LSH:
• The effectiveness of LSH depends on several important parameters:
1. Number of Hash Functions (K):
• More hash functions lead to a higher probability of similar items being hashed to the same bucket.
• However, increasing the number of hash functions also increases computational overhead.
2. Length of Hash Codes (L):
• Longer hash codes provide more accurate results but also require more space to store.
• The trade-off between length and accuracy needs to be considered depending on the application.
3. Number of Hash Tables (M):
• Using multiple hash tables (M) allows for a more robust search, as we can look for similar items across multiple hash tables.
Locality Sensitive hashing
DISTANCE MEASURES
• Goal: Find near-neighbors in high-dimensional space.
• We formally define “near neighbors” as points that are a “small distance” apart.
• For each application, we first need to define what “distance” means.
• Today: Jaccard distance/similarity
• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
  sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
• Jaccard distance: d(C1, C2) = 1 - |C1 ∩ C2| / |C1 ∪ C2|
• Example: 3 in intersection, 8 in union → Jaccard similarity = 3/8, Jaccard distance = 5/8
Locality Sensitive hashing
TASK: FINDING SIMILAR DOCUMENTS
Locality Sensitive hashing
ENCODING SETS AS BIT VECTORS
Locality Sensitive hashing
FROM SETS TO BOOLEAN MATRICES
• Rows = elements (shingles)
• Columns = sets (documents)
• 1 in row e and column s if and only if e is a member of s
• Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
• Typical matrix is sparse!
• Each document is a column. Example matrix (shingles × documents C1–C4):

      C1 C2 C3 C4
       1  1  1  0
       1  1  0  1
       0  1  0  1
       0  0  0  1
       1  0  0  1
       1  1  1  0
       1  0  1  0

• Example: sim(C1, C2) = ?
  • Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6
  • d(C1, C2) = 1 – (Jaccard similarity) = 3/6
Locality Sensitive hashing
HASHING COLUMNS (SIGNATURES)
Locality Sensitive hashing
MIN-HASHING
Locality Sensitive hashing
THE MIN-HASH PROPERTY
• Choose a random permutation π
• Claim: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
• Why?
  • Let X be a doc (set of shingles), y ∈ X is a shingle
  • Then: Pr[π(y) = min(π(X))] = 1/|X|
    • It is equally likely that any y ∈ X is mapped to the min element
  • Let y be such that π(y) = min(π(C1 ∪ C2))
  • Then either: π(y) = min(π(C1)) if y ∈ C1, or
    π(y) = min(π(C2)) if y ∈ C2
    (one of the two columns had to have a 1 at position y)
  • So the probability that both are true is the probability that y ∈ C1 ∩ C2
  • Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
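A small empirical check of this property (the universe and the two sets are chosen arbitrarily):

import random

universe = list(range(20))
C1 = {0, 1, 4, 5, 6}
C2 = {0, 1, 2, 5}

jaccard = len(C1 & C2) / len(C1 | C2)   # = 3/6 = 0.5 for these sets

random.seed(0)
trials, matches = 100_000, 0
for _ in range(trials):
    pi = universe[:]
    random.shuffle(pi)                  # a random permutation of the rows
    h1 = min(pi[x] for x in C1)         # min-hash of column C1 under pi
    h2 = min(pi[x] for x in C2)         # min-hash of column C2 under pi
    matches += (h1 == h2)

print("Jaccard similarity:", jaccard)
print("Fraction of permutations with equal min-hash:", matches / trials)  # close to 0.5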
Locality Sensitive hashing
FOUR TYPES OF ROWS
• Given cols C1 and C2, rows may be classified as:
        C1  C2
    A    1   1
    B    1   0
    C    0   1
    D    0   0
• a = # rows of type A, etc.
• Note: sim(C1, C2) = a / (a + b + c)
• Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
  • Look down the cols C1 and C2 until we see a 1
  • If it’s a type-A row, then h(C1) = h(C2)
  • If a type-B or type-C row, then not
Locality Sensitive hashing
SIMILARITY FOR SIGNATURES
Tutorial 2

We have two short documents:


•Doc1 = "data science"
•Doc2 = "science of data"
Detect their similarity using the LSH pipeline.
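One possible walk through that pipeline, using single-word shingles (character k-shingles are another common choice); the final banding step is only noted in a comment:

import random

doc1 = "data science"
doc2 = "science of data"

S1 = set(doc1.split())        # {'data', 'science'}
S2 = set(doc2.split())        # {'data', 'of', 'science'}

# Exact Jaccard similarity of the shingle sets.
exact = len(S1 & S2) / len(S1 | S2)
print("exact Jaccard similarity:", round(exact, 3))       # 2/3 ~= 0.667

# MinHash signatures: one min-hash per random permutation of the vocabulary.
vocab = sorted(S1 | S2)
random.seed(1)
n_hashes = 200
sig1, sig2 = [], []
for _ in range(n_hashes):
    order = random.sample(vocab, len(vocab))              # a random permutation
    rank = {word: r for r, word in enumerate(order)}
    sig1.append(min(rank[w] for w in S1))
    sig2.append(min(rank[w] for w in S2))

# The fraction of matching signature positions estimates the Jaccard similarity.
estimate = sum(a == b for a, b in zip(sig1, sig2)) / n_hashes
print("MinHash estimate:", round(estimate, 3))
# In a full LSH pipeline the signatures would be split into bands and hashed to
# buckets, so that sufficiently similar documents become candidate pairs.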
RAG
RAG
• LLMs are powerful but come with inherent limitations:
• Limited knowledge:
• Limited to providing generic answers based on their training data.
• Cannot provide accurate answers to specific domain related questions.
• The training data of these models have a cutoff date, limiting their ability to provide up-to-date responses.

• Hallucinations:
• They tend to confidently generate false responses based on imagined facts.
• Can provide responses that are off-topic if they don’t have an accurate answer to the user’s query.
• These models sometimes generate plausible-sounding but incorrect information.
• Generic responses:
• Often provide generic responses that aren’t tailored to specific contexts.
• Cannot provide a personalized customer experience.
• Without access to external sources, LLMs may provide vague or imprecise answers.
[Link]
RAG
Ways to optimize LLMs:
• Prompt Engineering
• Retrieval-Augmented Generation
• Instruct / Fine-tuning

A typical case of RAG

Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University


RAG
RAG Applications
• Scenarios where RAG is applicable:
  • Long-tail distribution of data
  • Frequent knowledge updates
  • Answers requiring verification and traceability
  • Specialized domain knowledge
  • Data privacy preservation
• Representative retrieval-augmented systems by task:
  • Q&A: RETRO (Borgeaud et al., 2021), REALM (Guu et al., 2020), ATLAS (Izacard et al., 2023)
  • Fact Checking: RAG (Lewis et al., 2020), ATLAS (Izacard et al., 2022), Evi. Generator (Asai et al., 2022a)
  • Dialog: BlenderBot 3 (Shuster et al., 2022), Internet-augmented generation (Komeili et al., 2022)
  • Summary: FLARE (Jiang et al., 2023)
  • Machine Translation: kNN-MT (Khandelwal et al., 2020), TRIME-MT (Zhong et al., 2022)
  • Code Generation: DocPrompting (Zhou et al., 2023), NaturalProver (Welleck et al., 2022)
  • Natural Language Inference: kNN-Prompt (Shi et al., 2022), NPM (Min et al., 2023)
  • Sentiment analysis: kNN-Prompt (Shi et al., 2022), NPM (Min et al., 2023)
  • Commonsense reasoning: RACo (Yu et al., 2022)
Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University


RAG
• Retrieval-Augmented Generation (RAG) has
emerged as a powerful paradigm in the field of AI
and Large Language Models (LLMs).
• Retrieval Augmented Generation (RAG) is a
technique that enhances LLMs by integrating them
with external data sources.
• By combining the generative capabilities of models
like GPT-4 with precise information retrieval
mechanisms, RAG enables AI systems to produce
more accurate and contextually relevant responses.
RAG
The retrieval-augmented generation (RAG) approach helps solve several challenges in natural language processing
(NLP) and AI applications:
1. Factual Inaccuracies and Hallucinations:
• Traditional generative models can produce plausible but incorrect information.
• RAG reduces this risk by retrieving verified, external data to ground responses in factual knowledge.
2. Outdated Information:
• Static models rely on training data that may become obsolete.
• RAG dynamically retrieves up-to-date information, ensuring relevance and accuracy in real-time.
3. Contextual Relevance:
• Generative models often struggle with maintaining context in complex or multi-turn conversations.
• RAG retrieves relevant documents to enrich the context, improving coherence and relevance.
4. Domain-Specific Knowledge:
• Generic models may lack expertise in specialized fields.
• RAG integrates domain-specific external knowledge for tailored and precise responses.
5. Cost and Efficiency:
• Fine-tuning large models for specific tasks is expensive.
• RAG eliminates the need for retraining by dynamically retrieving relevant data, reducing costs and computational
load.
6. Scalability Across Domains:
• RAG is adaptable to diverse industries, from healthcare to finance, without extensive retraining, making it highly
scalable.
[Link]
RAG
• How Does RAG Work?
• Step 1: Data collection
• You must first gather all the data that is needed for your application.
• Step 2: Data chunking
• Data chunking is the process of breaking your data down into smaller, more
manageable pieces.
• For instance, if you have a lengthy 100-page user manual, you might break it down
into different sections, each potentially answering different customer questions.
• This way, each chunk of data is focused on a specific topic. When a piece of
information is retrieved from the source dataset, it is more likely to be directly
applicable to the user’s query, since we avoid including irrelevant information from
entire documents.
• This also improves efficiency, since the system can quickly obtain the most relevant
pieces of information instead of processing entire documents.
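A minimal sketch of the chunking step (word-based, fixed size with a small overlap; the sizes here are arbitrary, and real systems often chunk by sentences, paragraphs, or token counts instead):

def chunk_text(text, chunk_size=50, overlap=10):
    # Split text into chunks of `chunk_size` words, with `overlap` words shared
    # between consecutive chunks so context is not cut off mid-thought.
    words = text.split()
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

sample = ("Vector databases store embeddings. They power semantic search, "
          "recommendations, and RAG systems. Chunking keeps retrieved context focused.")
print(chunk_text(sample, chunk_size=8, overlap=2))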
RAG
• How Does RAG Work?
• Step 3: Document embeddings
• Now that the source data has been broken down into smaller parts, it needs to be
converted into a vector representation.
• This involves transforming text data into embeddings, which are numeric
representations that capture the semantic meaning behind text.
• In simple words, document embeddings allow the system to understand user queries
and match them with relevant information in the source dataset based on the
meaning of the text, instead of a simple word-to-word comparison.
• This method ensures that the responses are relevant and aligned with the user’s
query.
RAG
• How Does RAG Work?
• Step 4: Handling user queries
• When a user query enters the system, it must also be converted into an embedding
or vector representation.
• The same model must be used for both the document and query embedding to
ensure uniformity between the two.
• Once the query is converted into an embedding, the system compares the query
embedding with the document embeddings.
• It identifies and retrieves chunks whose embeddings are most similar to the query
embedding, using measures such as cosine similarity and Euclidean distance.
• These chunks are considered to be the most relevant to the user’s query.
• Step 5: Generating responses with an LLM
• The retrieved text chunks, along with the initial user query, are fed into a language
model.
• The algorithm will use this information to generate a coherent response to the user’s
questions through a chat interface.
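Steps 3 to 5 can be sketched end to end as follows. The embed function below is a stand-in for a real embedding model (for example a sentence-transformer); it only produces deterministic pseudo-random vectors so the example runs on its own, and with a real model the refund chunk would rank first for this query:

import hashlib
import numpy as np

def embed(text):
    # Stand-in for a real embedding model: a deterministic pseudo-random unit vector.
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=384)
    return v / np.linalg.norm(v)

chunks = [
    "Refunds are processed within 5 business days.",
    "The warranty covers manufacturing defects for 2 years.",
    "Devices can be paired over Bluetooth from the settings menu.",
]
chunk_vecs = np.stack([embed(c) for c in chunks])     # Step 3: embed the document chunks

query = "How long does a refund take?"
q_vec = embed(query)                                  # Step 4: embed the query with the same model

sims = chunk_vecs @ q_vec                             # cosine similarity (all vectors are unit length)
top = np.argsort(-sims)[:2]                           # indices of the 2 most similar chunks
context = "\n".join(chunks[i] for i in top)

# Step 5: feed the retrieved chunks plus the question to the language model.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)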
Naïve or Simple RAG

Step 1: Indexing
1. Divide the document into even chunks, each chunk being a piece of the original text.
2. Use the encoding model to generate an embedding for each chunk.
3. Store the embedding of each chunk in the vector database.

Step 2: Retrieval
Retrieve the k most relevant documents using vector similarity search.

Step 3: Generation
The original query and the retrieved text are combined and input into an LLM to get the final answer.

Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University


RAG
Simple RAG
• Embedding model:
• A pre-trained language model that converts input text
into embeddings - vector representations that capture
semantic meaning.
• These vectors will be used to search for relevant
information in the dataset.
• Vector database:
• A storage system for knowledge and its corresponding
embedding vectors.
• Chatbot:
• A language model that generates responses based on
retrieved knowledge.
• This can be any language model, such as Llama,
Gemma, or GPT.
RAG
Indexing phase
• The indexing phase is the first step in creating
a RAG system.
• It involves breaking the dataset (or
documents) into small chunks and
calculating a vector representation for each
chunk that can be efficiently searched during
generation.
• The size of each chunk can vary depending
on the dataset and the application.
• For example, in a document retrieval system,
each chunk can be a paragraph or a
sentence. In a dialogue system, each chunk
can be a conversation turn.
• After the indexing phase, each chunk with its
corresponding embedding vector will be
stored in the vector database.
RAG
Indexing phase

Consider the following paragraph:


"Vector databases are a type of database designed to handle vector embeddings efficiently. They
are widely used in applications like semantic search, recommendation systems, and Retrieval-
Augmented Generation (RAG). Unlike traditional databases, vector databases can store high-
dimensional vectors and perform fast similarity searches using algorithms such as HNSW."

Chunking strategies and the chunks they produce:

1. Fixed-size (20 words per chunk):
   1️⃣ "Vector databases are a type of database designed to handle vector embeddings efficiently. They are widely used in"
   2️⃣ "applications like semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG). Unlike traditional databases,"
   3️⃣ "vector databases can store high-dimensional vectors and perform fast similarity searches using algorithms such as HNSW."
2. Sentence-based:
   1️⃣ "Vector databases are a type of database designed to handle vector embeddings efficiently."
   2️⃣ "They are widely used in applications like semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG)."
   3️⃣ "Unlike traditional databases, vector databases can store high-dimensional vectors and perform fast similarity searches using algorithms such as HNSW."
3. Paragraph-based:
   1️⃣ The whole paragraph as one chunk
RAG
Indexing phase
Consider the following paragraph:
"Vector databases are a type of database designed to handle vector embeddings efficiently. They
are widely used in applications like semantic search, recommendation systems, and Retrieval-
Augmented Generation (RAG). Unlike traditional databases, vector databases can store high-
dimensional vectors and perform fast similarity searches using algorithms such as HNSW."

Chunking strategies and the chunks they produce (continued):

4. Sliding Window (20 words, 5-word overlap):
   1️⃣ "Vector databases are a type of database designed to handle vector embeddings efficiently. They are widely used in"
   2️⃣ "used in applications like semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG). Unlike traditional"
   3️⃣ "traditional databases, vector databases can store high-dimensional vectors and perform fast similarity searches using algorithms such as HNSW."
5. Heading-based:
   (If this were under a heading “Vector Databases” → entire paragraph stored as one chunk)
6. Semantic/Embedding-based:
   1️⃣ "Vector databases are a type of database designed to handle vector embeddings efficiently."
   2️⃣ "They are widely used in applications like semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG)."
   3️⃣ "Unlike traditional databases, vector databases can store high-dimensional vectors and perform fast similarity searches using algorithms such as HNSW."
   (similar to sentence-based, but grouping depends on embedding similarity)
RAG
Retrieval phase

• Consider an example of a given Input Query from User.


• We then calculate the Query Vector to represent the query,
and compare it against the vectors in the database to find
the most relevant chunks.
• The result returned by the vector database will contain the
top N most relevant chunks for the query.
• These chunks will be used by the Chatbot to generate a
response.
RAG
Advanced RAG
• Pipeline: Index Optimization → Pre-Retrieval Process → Retrieval → Post-Retrieval Process → Generation
• Optimizing Data Indexing: sliding window, fine-grained segmentation, adding metadata
• Pre-Retrieval Process: retrieval routes, summaries, rewriting, and confidence judgment
• Post-Retrieval Process: reorder, filter retrieved content

Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University
RAG
Advanced RAG
• Advanced RAG combines the power of better retrieval and generation by using
sophisticated algorithms—a series of ideas, such as rerankers, fine-tuned LLMs
and feedback loops.
• These improvements bring enhancements in accuracy, adaptability and
performance that make these models the better choices for more complex and
production-grade applications.

[Link]
RAG
Advanced RAG
Advanced RAG works as a sequential, step-based process:
1. Query processing:
• When a user query is received, it is transformed into a high-dimensional vector by an embedding
model that captures the semantic meaning of the query.
2. Document retrieval:
• The encoded query is run against a large knowledge base using hybrid retrieval, combining dense
vector search (semantic similarity) with sparse retrieval (keyword-based search).
• The retrieved documents therefore reflect both semantic and keyword matches.
3. Reranking retrieved documents:
• A reranker assigns each retrieved document a final score based on its content and its relevance
to the query, and reorders the documents accordingly.

[Link]
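Step 2's hybrid retrieval has to merge the results of the dense (semantic) and sparse (keyword) searches. One common way to do this is reciprocal rank fusion; the slide does not prescribe a fusion method, so the sketch below is only an illustrative choice.

```python
def reciprocal_rank_fusion(dense_ranking, sparse_ranking, k=60):
    """Fuse two rankings of document ids into a single hybrid ranking.

    dense_ranking  -- doc ids ordered by dense (embedding) similarity
    sparse_ranking -- doc ids ordered by sparse (keyword, e.g. BM25) relevance
    k              -- smoothing constant of the RRF formula
    """
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first; this list is what the reranker (step 3) refines.
    return sorted(scores, key=scores.get, reverse=True)

# Example: the two retrievers return overlapping documents in different orders.
print(reciprocal_rank_fusion(["d3", "d1", "d7"], ["d1", "d9", "d3"]))
```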
RAG
Advanced RAG
Advanced RAG works as a sequential, step-based process (continued):
4. Contextual fusion for generation:
• Because each document is encoded separately, the decoder fuses all encoded contexts so that the
generated response remains coherent with the encoded query.
5. Response generation:
• The generator of advanced RAG, usually an LLM, produces the answer based on the retrieved
documents.
6. Feedback loop:
• Advanced RAG uses techniques such as active learning, reinforcement learning and
retriever-generator co-training to continuously improve its performance.
• During this phase the system collects implicit signals (such as clicks on retrieved documents,
which imply relevance) and explicit feedback (such as corrections or ratings) for use in later
retrieval and generation.
[Link]
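A minimal sketch of steps 4 to 6: fusing the retrieved chunks with the query, generating the answer, and logging feedback signals. The `llm.generate` call and the feedback store are assumptions for illustration only.

```python
def generate_answer(query, retrieved_chunks, llm):
    """Contextual fusion and response generation (steps 4 and 5)."""
    context = "\n\n".join(retrieved_chunks)            # fuse the retrieved contexts
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm.generate(prompt)                        # hypothetical LLM call

def record_feedback(feedback_store, query, answer, clicked_docs=None, rating=None):
    """Feedback loop (step 6): keep implicit signals (clicks) and explicit
    feedback (ratings, corrections) so retriever and generator can be improved."""
    feedback_store.append({
        "query": query,
        "answer": answer,
        "clicks": clicked_docs or [],
        "rating": rating,
    })
```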
RAG
Advanced RAG

• Advanced RAG is highly versatile across a variety of industries because it enables real-time
information retrieval and dynamic, accurate, context-based responses.
• Its applications range from powering customer service to surfacing relevant information that
improves decision making and enhances personalized learning experiences.
• The improved retrieval and generation make advanced RAG practical for real-time applications,
although its scalability and usability still fall short for some production-level use cases.
RAG
Modular RAG
(Figure from the source: the modules of Modular RAG and the orchestration patterns built from them.)
Modules: Search, Retrieve, Rewrite, Rerank, Filter, Read, Predict, Demonstrate, Reflect, Generate, Aggregation
Patterns:
• Naive RAG: Retrieve → Read
• DSP, Demonstrate-Search-Predict (2022): Demonstrate → Search → Predict
• Rewrite-Retrieve-Read (2023): Rewrite → Retrieve → Read
• Retrieve-then-read (2023): Retrieve → Read
The figure places these patterns along the evolution from Naive RAG to Advanced RAG to Modular RAG.
Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University


RAG
Modular RAG
• Modular RAG is the most advanced variant of RAG, in which information retrieval and the
generative model work together in an open, composable, pipeline-like architecture.
• This approach lets different use cases perform better through greater customizability and scalability.
• By decomposing the RAG workflow into modules, each component can be adapted, debugged and
optimized independently.
• The steps in modular RAG:
1. User query processing:
• The first step is the user submitting a query, such as "What is the most trending book in the
market these days?"
• A query processing module then transforms the input, which might include rephrasing the query,
removing ambiguities and performing semantic parsing, to provide richer context before retrieval.
2. Retrieval module:
• The retrieval module runs the query against the vector database or knowledge base to obtain
relevant documents.
• Retrieval is performed using the embedding-based similarity paradigm.
RAG
Modular RAG
3. Filtering and ranking module:
• The retrieved documents are then filtered by using metadata, recency or relevance.
• A reranking model scores and prioritizes the most useful information.
4. Context augmentation module:
• This module enriches the retrieved information with knowledge graphs, incorporates structured data
from databases and APIs, and applies retrieval compression so that only the most useful content is
passed on.
5. Response generation:
• The LLM processes the user query along with the retrieved context to generate a coherent
and accurate response, minimizing hallucinations and ensuring relevance.
6. Post-processing module:
• This module ensures accuracy through fact-checking, improves readability with structured
formatting and enhances credibility by generating citations.
7. Output and the feedback loop:
• The final response is presented to the user, while a feedback loop built from their interactions
helps refine retrieval and model performance over time.
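Because every stage above is a separate module, the whole flow can be written as a composition of interchangeable functions. The sketch below only fixes the interfaces; each callable is a hypothetical placeholder for the corresponding module.

```python
from typing import Callable, List

def modular_rag_pipeline(
    query: str,
    process_query: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    filter_and_rank: Callable[[str, List[str]], List[str]],
    augment_context: Callable[[List[str]], str],
    generate: Callable[[str, str], str],
    post_process: Callable[[str], str],
) -> str:
    """Run the modular RAG stages in order; any module can be swapped independently."""
    q = process_query(query)             # 1. user query processing
    docs = retrieve(q)                   # 2. retrieval module
    ranked = filter_and_rank(q, docs)    # 3. filtering and ranking module
    context = augment_context(ranked)    # 4. context augmentation module
    draft = generate(q, context)         # 5. response generation
    return post_process(draft)           # 6. post-processing (fact-check, format, cite)
```

The feedback loop of step 7 would sit outside this function, feeding user interactions back into the retrieval and ranking modules.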
RAG
Modular RAG

Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University


RAG
Comparison of RAG paradigms

Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University


RAG
The three key questions of RAG
• What to retrieve? Token, phrase, chunk, paragraph, entity, knowledge graph
• When to retrieve? Single search, each token, every N tokens, adaptive search
• How to use the retrieved information? Input/Data layer, Model/Intermediate layer, Output/Prediction layer

Other issues
• Augmentation stage: Pre-training, Fine-tuning, Inference
• Retrieval model choice: BERT, RoBERTa, BGE, ......
• Generation model choice: GPT, Llama, T5, ......
• Model collaboration and scale selection
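These design choices can be made explicit as a small configuration object; the field names and defaults below are illustrative, not taken from the slide or any library.

```python
from dataclasses import dataclass

@dataclass
class RAGConfig:
    # What to retrieve?
    retrieval_unit: str = "chunk"       # token | phrase | chunk | paragraph | entity | knowledge_graph
    # When to retrieve?
    retrieval_trigger: str = "single"   # single | each_token | every_n_tokens | adaptive
    trigger_interval: int = 0           # only used when retrieval_trigger == "every_n_tokens"
    # How to use the retrieved information?
    injection_layer: str = "input"      # input | intermediate | output

# Example: retrieve paragraphs adaptively and inject them at the input layer.
config = RAGConfig(retrieval_unit="paragraph", retrieval_trigger="adaptive")
```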
RAG

PART 03
Key Technologies and Evaluation

Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University


RAG
RAG vs Fine-tuning

Retrieval-Augmented Generation (RAG): Paradigms, Technologies, and Trends, Tongji University


Hyperparameter Tuning and Optimization
What are hyperparameters?
• Fixed before training, not learned from data.
• Contrast with parameters (weights, biases) which the model learns.

Why tuning matters for LLMs?


• Training/fine-tuning LLMs is computationally expensive → poor choices waste millions
in GPU-hours.
• Hyperparameters strongly affect generalization, stability, and convergence.
• Examples in LLM training: learning rate, batch size, sequence length, optimizer choice,
number of training epochs.
• Small models (BERT-base, GPT-2) can be tuned easily, but LLMs (LLaMA, Falcon,
GPT-J) require careful, resource-aware tuning.
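A minimal sketch of a resource-aware random search over a few of the hyperparameters listed above; `train_and_evaluate` is a hypothetical callable standing in for one fine-tuning run that returns a validation score.

```python
import random

SEARCH_SPACE = {
    "learning_rate": [1e-5, 3e-5, 5e-5, 1e-4],
    "batch_size": [8, 16, 32],
    "num_epochs": [1, 2, 3],
}

def random_search(train_and_evaluate, n_trials=5, seed=0):
    """Sample hyperparameter settings at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = train_and_evaluate(config)   # one (expensive) training run
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

For LLM-scale models each trial is expensive, so in practice the search is usually narrowed further, stopped early, or run first on a smaller proxy model.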
THANK YOU
