Unit 3
• Vector Databases
• High-dimensional Data Storage
• Vector Embeddings
• High-Dimensional Semantic Similarity
• Retrieval-Augmented Generation (RAG)
• Hyperparameter Tuning and Optimization
• Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning (PEFT) Techniques
• LoRA
• QLoRA
What we will cover…
• Vector Databases
• High-dimensional Data Storage
• Vector Embeddings
• High-Dimensional Semantic Similarity
• Retrieval-Augmented Generation (RAG)
• Hyperparameter Tuning and Optimization
• Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning (PEFT) Techniques
• LoRA
• QLoRA
Introduction
• We are in the midst of the AI revolution.
• It’s upending any industry it touches, promising great innovations - but it also introduces new
challenges.
• Efficient data processing has become more crucial than ever for applications that involve
large language models, generative AI, and semantic search.
• All of these new applications rely on vector embeddings, a type of vector data representation
that carries semantic information critical for the AI to gain understanding and maintain a
long-term memory it can draw upon when executing complex tasks.
• Embeddings are generated by AI models (such as Large Language Models) and have many
attributes or features, making their representation challenging to manage.
• In the context of AI and machine learning, these features represent different dimensions of
the data that are essential for understanding patterns, relationships, and underlying
structures.
[Link]
Introduction
• What are vector embeddings?
• Vector embeddings are numerical representations of data points that convert
various types of data—including nonmathematical data such as words, audio
or images—into arrays of numbers that ML models can process.
• Artificial intelligence (AI) models, from simple linear regression algorithms to
the intricate neural networks used in deep learning, operate through
mathematical logic.
• Any data that an AI model uses, including unstructured data, needs to be
recorded numerically.
• Vector embedding is a way to convert an unstructured data point into an array
of numbers that expresses that data’s original meaning.
[Link]
Introduction
• Here's a simplified example of word embeddings for a very small corpus (2 words),
where each word is represented as a 3-dimensional vector:
• cat [0.2, -0.4, 0.7]
• dog [0.6, 0.1, 0.5]
• In this example, each word ("cat") is associated with a unique vector ([0.2, -0.4, 0.7]).
• The values in the vector represent the word's position in a continuous 3-dimensional
vector space.
• Words with similar meanings or contexts are expected to have similar vector
representations.
• For instance, the vectors for "cat" and "dog" are close together, reflecting their semantic
relationship.
[Link]
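• To make "close together" concrete, the two toy vectors above can be compared with cosine similarity. A minimal sketch, assuming NumPy is available; the numbers are the illustrative values from this slide, not the output of a real embedding model.

# Comparing the toy 3-dimensional embeddings for "cat" and "dog".
import numpy as np

cat = np.array([0.2, -0.4, 0.7])
dog = np.array([0.6, 0.1, 0.5])

# Cosine similarity = dot product divided by the product of the vector norms.
cos_sim = np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(f"cosine similarity(cat, dog) = {cos_sim:.3f}")  # higher values = more similar directions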
Introduction
• Embedding models are trained to convert data points into vectors.
• Vector databases store and index the outputs of these embedding models.
• Within the database, vectors can be grouped together or identified as opposites based
on semantic meaning or features across virtually any data type.
• Vector embeddings are the backbone of recommendations, chatbots and generative
apps such as ChatGPT.
• For example, take the words “car” and “vehicle.”
• They have similar meanings but are spelled differently.
• For an AI application to enable effective semantic search, the vector representations of
“car” and “vehicle” must capture their semantic similarity.
• In machine learning, embeddings represent high-dimensional vectors that encode this
semantic information.
[Link]
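• As a hedged illustration of how an embedding model produces such vectors, the sketch below uses the open-source sentence-transformers library with the all-MiniLM-L6-v2 model (one common choice, not prescribed by this slide); any embedding model plays the same role.

# Turning text into embeddings (assumes `pip install sentence-transformers`).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# "car" and "vehicle" are spelled differently but should land close together
# in the embedding space; an unrelated word should not.
embeddings = model.encode(["car", "vehicle", "banana"])

print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # noticeably lower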
Vector Databases
• A vector database is a collection of data stored as mathematical representations.
• Vector databases make it easier for machine learning models to remember previous inputs,
allowing machine learning to be used to power search, recommendations, and text generation use-cases.
• Data can be identified based on similarity metrics instead of exact matches, making it possible for a
computer model to understand data contextually.
• E.g. when one visits a shoe store a salesperson may suggest shoes that are similar to the pair one prefers.
• Likewise, when shopping in an ecommerce store, the store may suggest similar items under a header like
"Customers also bought..."
• Vector databases enable machine learning models to identify similar objects, just as the salesperson can
find comparable shoes and the ecommerce store can suggest related products.
• Thus vector databases make it possible for computer programs to draw comparisons, identify relationships,
and understand context.
• This enables the creation of advanced artificial intelligence (AI) programs like large language models (LLMs).
[Link]
Vector Databases
• A vector is an array of numerical (floating-point) values that expresses a location
along several dimensions.
• In more everyday language, a vector is a list of numbers, like: {12,
13, 19, 8, 9}.
• These numbers indicate a location within a space, just as a row and
column number indicates a certain cell in a spreadsheet (e.g. "B7").
In this simple vector database, the documents in the upper right are likely similar
to each other. [Link]
Vector Databases
• Traditional Databases
• Data Type: Store structured data (rows, columns, tables).
• Query Mechanism: Rely on exact match or relational logic (e.g., SQL).
• Indexing: B-trees, hash indexes optimized for structured queries.
• Use Cases: Banking transactions, inventory systems, CRM, ERP.
• Strengths:
• ACID compliance (Atomicity, Consistency, Isolation, Durability), the properties that ensure
database transactions are processed reliably and leave the database in a valid, consistent
state.
• Great for well-defined schema
• High efficiency for deterministic queries
• Limitations:
• Cannot handle unstructured data (images, text, audio) effectively
• No semantic understanding — “apple” the fruit ≠ “Apple” the company
Vector Databases
• Vector Databases
• Data Type: Store embeddings (high-dimensional vectors) from ML/AI models.
• Query Mechanism: Similarity search (cosine similarity, Euclidean distance, dot product).
• Indexing: Approximate Nearest Neighbor (ANN) techniques (HNSW, IVF, PQ).
• Use Cases: Semantic search, recommendation engines, Retrieval-Augmented Generation (RAG),
image/video/audio similarity, anomaly detection.
• Strengths:
• Handles unstructured data effectively
• Supports semantic and fuzzy queries (meaning-based)
• Scales to billions of embeddings
• Limitations:
• Not optimized for transactions
• Emerging ecosystem — fewer mature tools than traditional DBs
• Compute-intensive indexing/search
Vector Databases
• Quick Analogy
• Traditional DB = Library Catalog: find exact book ID or author
name.
• Vector DB = Google Search: find content by meaning, even if you
don’t use exact words.
Vector Databases
• How are vector databases used?
• Vector databases serve three key functions in AI and ML applications:
• Vector storage
• Vector indexing
• Similarity search based on querying or prompting
• In operation, vector databases work by using multiple algorithms to conduct an approximate
nearest neighbor (ANN) search.
• The algorithms are then gathered in a pipeline to quickly and accurately retrieve and deliver data
neighboring the vector that is queried.
• For example, an ANN search could look for products that are visually similar in an e-commerce
catalog.
• Additional uses include anomaly detection, classification and semantic search. Because a dataset
runs through a model just once, results are returned within milliseconds.
[Link]
Vector Databases
• Vector storage
• Vector databases store the outputs of an embedding model algorithm, the
vector embeddings.
• They also store each vector’s metadata—including title, description and data
type—which can be queried by using metadata filters.
• By ingesting and storing these embeddings, the database can facilitate fast
retrieval of a similarity search, matching the user’s prompt with a similar
vector embedding.
[Link]
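• A minimal sketch of storing embeddings with metadata and filtering at query time, using Chroma (covered later in this unit); the document texts, ids and metadata are invented, and the exact client API may differ between versions.

# Storing documents with metadata and querying with a metadata filter.
# Assumes `pip install chromadb`; Chroma embeds the texts with its default model.
import chromadb

client = chromadb.Client()                       # in-memory client
collection = client.create_collection(name="articles")

collection.add(
    ids=["doc1", "doc2"],
    documents=["How to change a car tyre", "A history of electric vehicles"],
    metadatas=[{"type": "how-to"}, {"type": "history"}],
)

# The query text is embedded, matched against stored vectors, and the
# metadata filter restricts which vectors are eligible.
results = collection.query(
    query_texts=["vehicle maintenance"],
    n_results=1,
    where={"type": "how-to"},
)
print(results["documents"])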
Vector Databases
• Vector indexing
• Vectors need to be indexed to accelerate searches within high-dimensional data spaces.
• Vector databases create indexes on vector embeddings for search functions.
• The vector database indexes vectors by using an ML algorithm.
• Indexing maps the vectors to new data structures that enable faster similarity or distance
searches, such as nearest neighbor searches, between vectors.
• Vectors can be indexed by using algorithms such as hierarchical navigable small world
(HNSW), locality-sensitive hashing (LSH) or product quantization (PQ).
• HNSW is popular as it creates a hierarchical, layered graph structure. Each layer holds a subset of the
vectors, and similarities between vectors are represented by the edges between nodes.
• LSH indexes content by using an approximate nearest-neighbor search. For extra speed, the index can be
optimized by returning an approximate, but nonexhaustive result.
• PQ converts each vector into a short, memory-efficient representation. Only the short representations
are stored, rather than the full vectors.
[Link]
Vector Databases
• Similarity search based on querying or prompting
• Query vectors are vector representations of search queries.
• When a user queries or prompts an AI model, the model computes an embedding of the query or
prompt.
• The database then calculates distances between query vectors and vectors stored in the index to
return similar results.
• Databases can measure the distance between vectors with various algorithms, such as nearest
neighbor search.
• Measurements can also be based on various similarity metrics, such as cosine similarity.
• The database returns the most similar vectors or nearest neighbors to the query vector according to
the similarity ranking.
• These calculations support various machine learning tasks, such as recommendation systems,
semantic search, image recognition and other natural language processing tasks.
[Link]
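• The similarity and distance measures mentioned above can be written out directly. A minimal NumPy sketch with invented vectors:

# Common similarity/distance measures used for vector search.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def dot_product(a, b):
    return np.dot(a, b)

query = np.array([0.9, 0.1, 0.3])
stored = np.array([[0.8, 0.2, 0.3],   # likely the nearest neighbour
                   [0.1, 0.9, 0.7]])

# Rank stored vectors by cosine similarity to the query (highest first).
scores = [cosine_similarity(query, v) for v in stored]
print(np.argsort(scores)[::-1])       # indices of the most similar vectors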
Vector Databases
• Advantages of vector databases
• Vector databases are a popular way to power enterprise AI-based applications because they can deliver many benefits:
• Speed and performance
• Vector databases use various indexing techniques to enable faster searching.
• Vector indexing and distance-calculating algorithms such as nearest neighbor search can help optimize performance
when searching for relevant results across large datasets with millions, if not billions, of data points.
• Scalability
• Vector databases can store and manage massive amounts of unstructured data by scaling horizontally with additional
nodes, maintaining performance as query demands and data volumes increase.
• Lower cost of ownership
• Because they enable faster data retrieval, vector databases speed the training of foundation models.
• Data management
• Vector databases typically provide built-in features to easily update and insert new unstructured data.
• Flexibility
• Vector databases are built to handle the added complexity of using images, videos or other multidimensional data.
• Given the multiple use cases ranging from semantic search to conversational AI applications, vector databases can be
customized to meet business and AI requirements.
[Link]
Vector Databases
• How does a vector database work?
• A vector database uses a combination of different algorithms that all
participate in Approximate Nearest Neighbor (ANN) search.
• These algorithms optimize the search through hashing, quantization, or
graph-based search.
• These algorithms are assembled into a pipeline that provides fast and
accurate retrieval of the neighbors of a queried vector.
• Since the vector database provides approximate results, the main trade-
offs we consider are between accuracy and speed.
• The more accurate the result, the slower the query will be. However, a
good system can provide ultra-fast search with near-perfect accuracy.
[Link]
Vector Databases
• Unstructured data, such as text, images, and audio, lacks a predefined format, posing challenges for
traditional databases. To leverage this data in artificial intelligence and machine learning applications, it's
transformed into numerical representations using embeddings.
• Embedding is like giving each item, whether it's a word, image, or something else, a unique code that
captures its meaning or essence. This code helps computers understand and compare these items in a more
efficient and meaningful way. Think of it as turning a complicated book into a short summary that still
captures the main points. [Link]
Vector Databases
• This embedding process is typically achieved using a special kind of neural network designed for the task. For
example, word embeddings convert words into vectors in such a way that words with similar meanings are
closer in the vector space.
• This transformation allows algorithms to understand relationships and similarities between items.
• Essentially, embeddings serve as a bridge, converting non-numeric data into a form that machine learning
models can work with, enabling them to discern patterns and relationships in the data more effectively.
[Link]
Vector Databases
• Pipeline for a vector database
1. Indexing: The vector database indexes vectors using an algorithm such as PQ, LSH, or
HNSW. This step maps the vectors to a data structure that will enable faster searching.
2. Querying: The vector database compares the indexed query vector to the indexed vectors
in the dataset to find the nearest neighbors (applying the similarity metric used by that
index).
3. Post-processing: In some cases, the vector database retrieves the final nearest neighbors
from the dataset and post-processes them to return the final results. This step can include
re-ranking the nearest neighbors using a different similarity measure (a toy sketch of the pipeline follows below).
[Link]
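• A toy end-to-end version of this three-step pipeline, with brute-force "indexing" standing in for PQ/LSH/HNSW and a rerank with a different measure as the post-processing step; all data is random and only illustrative.

# Index -> query -> post-process, sketched with NumPy.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))            # stored embeddings

# 1. Indexing: real systems build PQ/LSH/HNSW structures; here we simply
#    pre-normalise the vectors so a dot product equals cosine similarity.
index = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# 2. Querying: find candidate nearest neighbours for a query vector.
query = rng.normal(size=64)
query = query / np.linalg.norm(query)
candidates = np.argsort(index @ query)[::-1][:20]      # top-20 by cosine

# 3. Post-processing: rerank the candidates with a different similarity
#    measure (here, plain Euclidean distance on the original vectors).
reranked = sorted(candidates, key=lambda i: np.linalg.norm(vectors[i] - query))
print(reranked[:5])                              # final top-5 results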
Vector Databases
• Popular vector databases
• Pinecone
Pinecone is a managed vector database platform that has been purpose-built to tackle the unique challenges associated with high-
dimensional data.
Equipped with cutting-edge indexing and search capabilities, Pinecone empowers data engineers and data scientists to construct and
implement large-scale machine learning applications that effectively process and analyze high-dimensional data.
Key features of Pinecone include:
• Fully managed service
• Highly scalable
• Real-time data ingestion
• Low-latency search
• Integration with LangChain
Notably, Pinecone was the only vector database included in the inaugural Fortune 2023 50 AI Innovator list.
[Link]
Vector Databases
• Popular vector databases
• Faiss
• Faiss is an open-source library for the swift search of similarities and the clustering of dense vectors.
• It houses algorithms capable of searching within vector sets of varying sizes, even those that might exceed RAM
capacity.
• Additionally, Faiss offers auxiliary code for assessment and adjusting parameters.
• While it's primarily coded in C++, it fully supports Python/NumPy integration.
• Some of its key algorithms are also available for GPU execution.
• The primary development of Faiss is undertaken by the Fundamental AI Research group at Meta.
[Link]
Vector Databases
• Popular vector databases
• Qdrant
• Qdrant is a vector database and a tool for conducting vector similarity searches. It operates as an API service, enabling
searches for the closest high-dimensional vectors.
• Using Qdrant, you can transform embeddings or neural network encoders into comprehensive applications for tasks
like matching, searching, making recommendations, and much more.
• Here are some key features of Qdrant:
• Offers OpenAPI v3 specs and ready-made clients for various languages.
• Uses a custom HNSW algorithm for rapid and accurate searches.
• Allows results filtering based on associated vector payloads.
• Supports string matching, numerical ranges, geo-locations, and more.
• Cloud-native design with horizontal scaling capabilities.
• Built in Rust, optimizing resource use with dynamic query planning. [Link]
Indexing in Vector Databases
• Indexing refers to the process of organizing high-dimensional vectors in a way that provides efficient
querying of nearest-neighbor vectors.
• This is the most crucial part of building any vector database.
• These indexes enable fast and efficient querying of high-dimensional embeddings.
• There are multiple indexing methods to create vector indices, such as:
• Linear search Algorithm (Flat Indexing):
• This is a linear search algorithm, which means it will compare the query vector with every other vector stored
in the database.
• This is the simplest method out there and works well with small datasets.
• Cluster-based algorithm (IVF):
• Inverted File is a cluster-based indexing technique.
• It uses k-means clustering to cluster all the vectors.
• When a query vector is provided, it calculates the distance between the query vector and the centroids of each
cluster.
• And starts searching for the nearest neighbors in the cluster with the centroid closest to the query vector.
• This significantly reduces query time.
[Link]
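• A hedged sketch of both index types using Faiss (introduced earlier); the data is random and only illustrative, and the parameter values are arbitrary.

# Flat (exhaustive) vs. IVF (cluster-based) indexing with Faiss.
# Assumes `pip install faiss-cpu`.
import numpy as np
import faiss

d, n = 64, 10_000
xb = np.random.rand(n, d).astype("float32")      # database vectors
xq = np.random.rand(5, d).astype("float32")      # query vectors

# Flat index: compares each query against every stored vector (exact search).
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_flat, I_flat = flat.search(xq, 5)

# IVF index: k-means clusters the vectors into nlist cells; at query time only
# the nprobe closest cells are scanned, which greatly reduces query time.
nlist = 100
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                                    # learns the cluster centroids
ivf.add(xb)
ivf.nprobe = 10                                  # how many clusters to scan
D_ivf, I_ivf = ivf.search(xq, 5)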
Indexing in Vector Databases
• Quantization (Scalar and Product Quantization):
• The quantization technique involves reducing the memory footprint of large
embeddings by reducing their precision.
• Graph-based (HNSW):
• The most common indexing method.
• It uses hierarchical graph architecture to index vectors.
[Link]
Hierarchical Navigable Small World
• Hierarchical Navigable Small World (HNSW) is a
state-of-the-art algorithm used for an approximate
search of nearest neighbours.
• HNSW constructs optimized graph structures.
• The main idea of HNSW is to construct such a
graph where a path between any pair of vertices
could be traversed in a small number of steps.
• HNSW graphs are among the top-performing
indexes for vector similarity search.
• It is a hugely popular technology that time and
time again produces state-of-the-art performance
with super fast search speeds and fantastic recall.
Foundations of HNSW
• Approximate Neighbor Search
• To reduce the computational complexity of an exhaustive search such as KNN, approximate
nearest neighbor search is the preferred technique.
• ANN allows us to get a massive performance boost on similarity search when dealing with huge
datasets.
• In approximate nearest neighbors (ANN), we build index structures that narrow down the search
space and improve lookup times.
• Apart from that, most ML models produce vectors that have high dimensionality which is
another hurdle to overcome.
• Approximate search relies on the fact that even though data is represented in a large number of
dimensions, their actual complexity is low.
• It tries to work with the true intrinsic dimensionality of data.
• There are various algorithms to solve the approximate search problem; a detailed treatment of how
approximate search works is beyond the scope of this unit.
Foundations of HNSW
• We can split ANN algorithms into three distinct categories: trees, hashes,
and graphs.
• HNSW slots into the graph category.
• More specifically, it is a proximity graph, in which two vertices are linked
based on their proximity (closer vertices are linked) — often defined in
Euclidean distance.
• There is a significant leap in complexity from a ‘proximity’ graph
to ‘hierarchical navigable small world’ graph.
• The two fundamental techniques that contributed most heavily to HNSW:
• The probability skip list
• Navigable small world graphs
Foundations of HNSW
• Probability Skip List
• The probability skip list was introduced way back in 1990 by William Pugh [2].
• A skip list is a probabilistic data structure that allows inserting and searching elements within a sorted list in O(log n) time on
average.
• It allows fast search like a sorted array, while using a linked list structure for easy (and fast) insertion of new elements
(something that is not possible with sorted arrays).
• Skip lists work by building several layers of linked lists.
• The lowest layer has the original linked list with all the elements in it.
• When moving to higher levels, the number of skipped elements increases, thus decreasing the number of connections.
[Link]
Foundations of HNSW
• Probability Skip List
• On the first layer, we find links that skip many intermediate nodes/vertices.
• As we move down the layers, the number of ‘skips’ by each link is decreased.
• The search procedure for a certain value starts from the highest level and compares its next element with the value.
• If the value is less or equal to the element, then the algorithm proceeds to its next element.
• Otherwise, the search procedure descends to the lower layer with more connections and repeats the same process.
• At the end, the algorithm descends to the lowest layer and finds the desired node.
[Link]
Foundations of HNSW
• Probability Skip List
• To search a skip list, we start at the highest layer with the longest ‘skips’ and move along the
edges towards the right (below).
• If we find that the current node 'key' is greater than the key we are searching for — we know
we have overshot our target, so we move back to the previous node and drop down to the next (lower) level.
• HNSW inherits the same layered format with longer edges in the highest layers (for fast
search) and shorter edges in the lower layers (for accurate search).
[Link]
Foundations of HNSW
• Probability Skip List
[Link]
Foundations of HNSW
• Probability Skip List
• Example of searching in probability skip list
Foundations of HNSW
• Probability Skip List
• While search in a skip list is fast, insertion and deletion are slower as they add
additional overhead for updating nodes on multiple layers.
• During insertion, we start from the bottom list and add the node at the
appropriate position.
• As skip lists maintain a hierarchical structure, we need to determine if the
node appears at a higher level.
• The process is random, like a coin toss.
• The probability of a node appearing in its immediate upper layer is 0.5.
[Link]
Foundations of HNSW
• Probability Skip List
• In an ideal skip list, the number of nodes one layer above the bottom will be ~n/2, and two layers above ~n/4,
where n is the number of nodes on the bottom-most layer (the complete linked list).
• Consider the following example:
[Link]
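• A small sketch of the coin-toss level assignment described above (plain Python; the counts printed are illustrative):

# Assigning skip-list levels by repeated coin tosses (promotion probability 0.5).
import random

def random_level(max_level=4, p=0.5):
    """Each node is promoted to the next layer with probability p."""
    level = 0
    while random.random() < p and level < max_level:
        level += 1
    return level

# With many nodes, roughly n/2 reach layer 1, n/4 reach layer 2, and so on.
levels = [random_level() for _ in range(10_000)]
for layer in range(5):
    print(layer, sum(1 for lvl in levels if lvl >= layer))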
Foundations of HNSW
• Small world
• Most graphs have the property of short paths, including graphs
created by just connecting random pairs of points.
• Another example of a small world is the global airport network.
• Airports in the same region are highly connected to one another, but it’s
possible to make a long journey in only a few stops by making use of
major hub airports.
• For example, a journey from Nagpur, India to Osaka, Japan typically
starts with a local flight from Nagpur to Mumbai, then a long distance
flight from Mumbai to Tokyo, and finally another local flight from Tokyo
to Osaka.
• Long-range hubs are a common way of achieving the small world
property.
• A final interesting example of graphs with the small world
property is biological neural networks such as the human brain.
[Link]
Foundations of HNSW
• Navigable Small Worlds
• The idea of small worlds can be adopted for nearest neighbour search:
• if we create connections between our vectors in such a way that they form a small world
graph, we can quickly find the vectors near a target by starting from an arbitrary "entry
point" vector and then navigating through the graph towards the target.
• This possibility was explored by Kleinberg.
• He noted that
• The existence of short paths wasn't the only interesting thing about the small world
experiment:
• it was also surprising that people were able to find these short paths, without using any
global knowledge about the graph.
• Rather, the people were following a simple greedy algorithm: at each step, they examined
each of their immediate connections and forwarded the message to the one they thought
was closest to the target.
• We can use a similar algorithm to search a graph that connects vectors.
[Link]
Foundations of HNSW
• Navigable Small Worlds
[Link]
Foundations of HNSW
• Navigable Small Worlds
• Kleinberg performed simple simulations of small worlds in which all of the points were connected to their immediate
neighbours, with additional longer connections created between random points.
• He discovered that the greedy algorithm would only find a short path in specific conditions, depending on the lengths of the
long-range connections.
• If the long-range connections were too long (as was the case when they connected pairs of points in completely random
locations), the greedy algorithm could follow a long-range connection to quickly reach the rough area of the target, but after
that the long-range connections were of no use, and the path had to step through the local connections to get closer.
• On the other hand, if the long-range connections were too short, it would simply take too many steps to reach the area of the
target.
• If, however, the lengths of the long-range connections were just right (if they were uniformly distributed, so that all lengths
were equally likely), the greedy algorithm would typically reach the neighbourhood of the target in an especially small number
of steps (a number proportional to log(n), where n is the number of points in the graph).
• In cases like this where the greedy algorithm can find the target in a small number of steps, we say the small world is
a navigable small world (NSW).
[Link]
Foundations of HNSW
• Navigable Small Worlds
• A method proposed to build an NSW for vectors in a complex, high-
dimensional space:
• we insert one randomly chosen vector at a time into the graph, and connect it to a
small number m of nearest neighbours that were already inserted (a sketch of this rule follows below).
[Link]
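• A minimal sketch of this insertion rule, using a brute-force nearest-neighbour lookup for simplicity; real NSW implementations reuse the partially built graph itself to find the neighbours.

# Building a (non-hierarchical) NSW-style graph: insert vectors one at a time
# and link each to its m nearest already-inserted vectors.
import numpy as np

def build_nsw(vectors, m=3):
    graph = {0: set()}                           # node id -> set of neighbours
    for i in range(1, len(vectors)):
        inserted = list(graph.keys())
        dists = [np.linalg.norm(vectors[i] - vectors[j]) for j in inserted]
        nearest = [inserted[j] for j in np.argsort(dists)[:m]]
        graph[i] = set(nearest)
        for j in nearest:                        # links are bidirectional
            graph[j].add(i)
    return graph

rng = np.random.default_rng(1)
graph = build_nsw(rng.normal(size=(50, 8)), m=3)
print(len(graph[0]))                             # early nodes tend to accumulate many links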
Foundations of HNSW
Hierarchical Navigable Small Worlds
• A typical path through an NSW from the entry point towards the target went through two
phases: a "zoom-out" phase, in which connection lengths increase from short to long, and a
"zoom-in" phase, in which the reverse happens.
• To achieve this efficiently, we would like to limit the number of connections that need to be checked at each hub.
• This leads to the main idea of HNSW: explicitly distinguishing between short-range and long-range
connections.
• In the initial stage of a search, we will only consider the long-range connections between hubs.
• Once the greedy search has found a hub near the target, we then switch to using the short-range
connections.
[Link]
Foundations of HNSW
Hierarchical Navigable Small Worlds
[Link]
Foundations of HNSW
Hierarchical Navigable Small Worlds
Left: illustration of an HNSW with three levels of connection length – short connections are grey,
longer connections are green, and the longest connections are red. E is the entry point. Right:
visualising the HNSW as a stack of three layers. Dotted lines indicate the location of the same vector
in the layer below.
[Link]
Foundations of HNSW
Hierarchical Navigable Small Worlds
• HNSW (Hierarchical Navigable Small World graphs) is one of the most popular
Approximate Nearest Neighbor (ANN) indexing algorithms used in vector
databases.
• What HNSW Does
• Instead of exhaustively comparing a query vector with all stored vectors (which
is expensive at scale),
• HNSW builds a graph structure where each vector is connected to its
“neighbors” in multiple layers.
• Search starts from the top layer (sparser connections) and navigates down to
denser layers.
• This allows fast approximate nearest neighbor search.
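• Again a hedged Faiss sketch (IndexHNSWFlat) showing the two parameters that matter most in practice: the number of graph links per node (M) and the query-time beam width (efSearch). The data is random and only illustrative.

# HNSW index in Faiss (assumes `pip install faiss-cpu`).
import numpy as np
import faiss

d = 64
xb = np.random.rand(10_000, d).astype("float32")
xq = np.random.rand(3, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)        # 32 = M, links per node per layer
index.hnsw.efConstruction = 200           # build-time beam width (graph quality)
index.add(xb)

index.hnsw.efSearch = 64                  # query-time beam width (recall vs. speed)
D, I = index.search(xq, 5)                # approximate 5 nearest neighbours
print(I)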
Foundations of HNSW
Hierarchical Navigable Small Worlds
• Points: A = (0,0), B = (1,0), C = (0,1), D = (1,1), E = (2,0), F = (2,1)
• Neighbour lists: A: {B, C}; B: {A, E}; C: {A, D}; D: {C, F}; E: {B, F}; F: {E, D}
Tutorial
Step 3: Distance Calculations (Part 1)
• Query q = (1.5, 0.2)
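• The distance calculations the tutorial refers to can be reproduced as below; the greedy walk at the end is an assumed illustration of how a search would proceed on this toy graph, not the tutorial's own worked solution.

# Euclidean distances from the query to each point of the toy graph,
# plus a greedy NSW-style walk starting from A.
import math

points = {"A": (0, 0), "B": (1, 0), "C": (0, 1),
          "D": (1, 1), "E": (2, 0), "F": (2, 1)}
neighbours = {"A": {"B", "C"}, "B": {"A", "E"}, "C": {"A", "D"},
              "D": {"C", "F"}, "E": {"B", "F"}, "F": {"E", "D"}}
q = (1.5, 0.2)

def dist(p, target):
    return math.hypot(p[0] - target[0], p[1] - target[1])

for name, p in points.items():
    print(name, round(dist(p, q), 3))     # B and E come out closest to q

# Greedy search: move to whichever neighbour is closer to q; stop when
# no neighbour improves on the current node.
current = "A"
while True:
    best = min(neighbours[current], key=lambda n: dist(points[n], q))
    if dist(points[best], q) >= dist(points[current], q):
        break
    current = best
print("greedy search ends at:", current)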
Locality Sensitive hashing
• A typical hash function used in dictionaries in Python aims to place different values (no
matter how similar) into separate buckets.
• However, there is a key difference between this type of hash function and that used in LSH.
• With dictionaries, our goal is to minimize the chances of multiple key-values being mapped to the
same bucket — we minimize collisions.
• LSH is almost the opposite. In LSH, we want to maximize collisions — although ideally only
for similar inputs.
Locality Sensitive hashing
• There is no single approach to hashing in LSH.
• All approaches share the same 'bucket similar samples through a hash function' logic, but
they can vary a lot beyond this.
• There are various approaches, such as the traditional approach using shingling, MinHashing,
and banding.
• There are several other techniques, such as random projection.
Locality Sensitive hashing
• How LSH Works
• The LSH algorithm comprises three main steps:
• Step 1: Hashing
• LSH involves creating multiple hash functions, each generating a hash code for every data
point.
• These hash codes typically consist of binary strings of equal length.
• Step 2: Binning
• The hash codes are used to distribute the data points into hash buckets.
• Points with the same hash code are grouped together in the same bucket.
• Step 3: Querying
• When a query item is presented, it undergoes the same hashing process to
generate a hash code.
• Then, we look for the bucket containing the hash code of the query item.
• All items within this bucket are potential candidates for being similar to the query item.
• We then perform a refined similarity check on these candidates using the original high-
dimensional data to identify true similarities (see the sketch after this list).
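• One concrete instance of these three steps is random-projection (random hyperplane) LSH. A minimal NumPy sketch with invented data:

# Random-hyperplane LSH: each hash function is the sign of a dot product
# with a random vector; similar inputs tend to share hash codes.
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 64, 8
planes = rng.normal(size=(n_planes, dim))        # Step 1: the hash functions

def hash_code(v):
    return tuple((planes @ v > 0).astype(int))   # an 8-bit binary code

# Step 2: binning — vectors with identical codes fall into the same bucket.
data = rng.normal(size=(1000, dim))
buckets = {}
for i, v in enumerate(data):
    buckets.setdefault(hash_code(v), []).append(i)

# Step 3: querying — hash the query, then refine only within its bucket.
query = data[0] + 0.05 * rng.normal(size=dim)    # a slightly perturbed copy of data[0]
candidates = buckets.get(hash_code(query), [])
if candidates:
    best = max(candidates, key=lambda i: np.dot(data[i], query)
               / (np.linalg.norm(data[i]) * np.linalg.norm(query)))
    print(best)                                  # very likely 0, the original vector
else:
    print("empty bucket; real LSH uses several hash tables to avoid such misses")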
Locality Sensitive hashing
• Key factors and parameters in LSH:
• The effectiveness of LSH depends on several important parameters:
1. Number of Hash Functions (K):
• More hash functions lead to a higher probability of similar items being hashed to
the same bucket.
• However, increasing the number of hash functions also increases computational
overhead.
2. Length of Hash Codes (L):
• Longer hash codes provide more accurate results but also require more space to
store.
• The trade-off between length and accuracy needs to be considered depending on
the application.
3. Number of Hash Tables (M):
• Using multiple hash tables (M) allows for a more robust search, as we can look for
similar items across multiple hash tables.
Locality Sensitive hashing
DISTANCE MEASURES
• Goal: find near-neighbors in high-dimensional space.
• We formally define "near neighbors" as points that are a "small distance" apart.
• For each application, we first need to define what "distance" means.
• Today: Jaccard distance/similarity.
• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
  sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
• Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
• Example: 3 elements in the intersection, 8 in the union, so
  Jaccard similarity = 3/8 and Jaccard distance = 5/8 (see the Python example below).
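• The same computation in Python, using two invented shingle sets with 3 elements in common and 8 in their union (matching the example above):

# Jaccard similarity and distance for two small sets.
c1 = {"a", "b", "c", "d", "e"}
c2 = {"c", "d", "e", "f", "g", "h"}

intersection = len(c1 & c2)          # 3
union = len(c1 | c2)                 # 8
print(intersection / union)          # Jaccard similarity = 3/8
print(1 - intersection / union)      # Jaccard distance   = 5/8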
Locality Sensitive hashing
TASK: FINDING SIMILAR DOCUMENTS
Locality Sensitive hashing
ENCODING SETS AS BIT VECTORS
Locality Sensitive hashing
FROM SETS TO BOOLEAN MATRICES
• Rows = elements (shingles)
• Columns = sets (documents)
• 1 in row e and column s if and only if e is a member of s
• Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
• Typical matrix is sparse!
• Each document is a column. Example matrix (shingles × documents):
  1 1 1 0
  1 1 0 1
  0 1 0 1
  0 0 0 1
  1 0 0 1
  1 1 1 0
  1 0 1 0
• Example: sim(C1, C2) = ? Size of intersection = 3; size of union = 6,
  so Jaccard similarity (not distance) = 3/6
• d(C1, C2) = 1 − (Jaccard similarity) = 3/6
Locality Sensitive hashing
HASHING COLUMNS (SIGNATURES)
Locality Sensitive hashing
MIN-HASHING
Locality Sensitive hashing
THE MIN-HASH PROPERTY
• Choose a random permutation π.
• Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
• Why?
• Let X be a document (a set of shingles), and let y ∈ X be a shingle.
• Then: Pr[π(y) = min(π(X))] = 1/|X|
  • It is equally likely that any y ∈ X is mapped to the minimum element.
• Let y be such that π(y) = min(π(C1 ∪ C2)).
• Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
  (one of the two columns had to have a 1 at position y).
• So the probability that both are true is the probability that y ∈ C1 ∩ C2.
• Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
Locality Sensitive hashing
FOUR TYPES OF ROWS
• Given columns C1 and C2, rows may be classified as:
  Type  C1  C2
  A     1   1
  B     1   0
  C     0   1
  D     0   0
• a = # rows of type A, etc.
• Note: sim(C1, C2) = a / (a + b + c)
• Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
  • Look down the columns C1 and C2 until we see a 1.
  • If it's a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not.
Locality Sensitive hashing
SIMILARITY FOR SIGNATURES
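• The original slide content for signature similarity is not reproduced here, but the min-hash property above can be checked with a small simulation: over many random permutations, the fraction of matching min-hashes approximates the Jaccard similarity. The sets below are invented.

# Estimating Jaccard similarity from min-hash signatures.
import random

universe = list(range(20))               # the universe of shingle ids
c1 = set(range(0, 12))                   # two invented sets
c2 = set(range(6, 18))
true_sim = len(c1 & c2) / len(c1 | c2)   # 6 / 18 = 1/3

def minhash(s, perm):
    return min(perm[x] for x in s)       # smallest rank of any member under the permutation

matches, n_perms = 0, 5000
for _ in range(n_perms):
    order = random.sample(universe, len(universe))
    perm = {x: rank for rank, x in enumerate(order)}
    if minhash(c1, perm) == minhash(c2, perm):
        matches += 1

print(true_sim, matches / n_perms)       # the estimate converges to the true similarity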
Tutorial 2
• Hallucinations:
• They tend to confidently generate false responses based on imagined facts.
• Can provide responses that are off-topic if they don’t have an accurate answer to the user’s query.
• These models sometimes generate plausible-sounding but incorrect information.
• Generic responses:
• Often provide generic responses that aren’t tailored to specific contexts.
• Cannot provide a personalized customer experience.
• Without access to external sources, LLMs may provide vague or imprecise answers.
[Link]
RAG
Ways to optimize LLMs:
• Prompt Engineering
• Retrieval-Augmented Generation
• Instruction / Fine-tuning
Figure: the retrieval step (Step 2) retrieves the k most relevant documents using vector
similarity search; Modular RAG adds a pre-retrieval process such as a retrieval confidence judgment.
[Link]
RAG
Advanced RAG
Advanced RAG works as a sequential step-based process as follows:
1. Query processing:
• When a user query is received, it is transformed into a high-dimensional vector
by an embedding model that captures the semantic meaning of the query.
2. Document retrieval:
• The encoded query traverses a huge knowledge database that provides hybrid
retrieval, using both dense vector search and sparse retrieval, that is, semantic
similarity combined with keyword-based search.
• The retrieved documents therefore combine semantic and keyword matches.
3. Reranking retrieved documents:
• A reranker assigns each retrieved document a final score based on its context
and its relevance to the query.
[Link]
RAG
Advanced RAG
Advanced RAG works as a sequential step-based process as follows:
4. Contextual fusion for generation:
• Because each document is encoded differently, the decoder fuses all encoded contexts
to ensure that the generated responses are coherent with the encoded query.
5. Response generation:
• The generator of advanced RAG, usually an LLM, produces the answer based on the
retrieved documents.
6. Feedback loop:
• Advanced RAG uses various techniques, such as active learning, reinforcement learning
and retriever-generator co-training, to continuously enhance its performance.
• During this phase, implicit signals (such as clicks on retrieved documents, which imply
relevance) and explicit feedback (such as corrections or ratings) are collected and applied
to improve subsequent retrieval and generation (a minimal sketch of the retrieve-then-generate
loop follows below).
[Link]
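• A hedged end-to-end sketch of the retrieve-then-generate loop described above: the documents are invented, the embedding model (all-MiniLM-L6-v2 via sentence-transformers) is one common choice rather than a prescription, and the final LLM call is left as a printed prompt.

# Minimal retrieval step of a RAG pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Vector databases index embeddings for fast similarity search.",
    "HNSW builds a layered graph for approximate nearest neighbour search.",
    "LoRA fine-tunes large models by learning low-rank weight updates.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def retrieve(query, k=2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ q                   # cosine similarity (vectors are normalised)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "How does HNSW speed up vector search?"
context = "\n".join(retrieve(query))

# The retrieved context is prepended to the user query and sent to an LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)                                     # a real system would now call an LLM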
RAG
Advanced RAG
Figure: Evolution of RAG paradigms. Naive RAG follows a simple Retrieve → Read → Generate flow;
Advanced RAG adds steps around retrieval such as Rewrite, Rerank and Filter (Rewrite → Retrieve →
Rerank → Read); Modular RAG composes interchangeable modules such as Search, Retrieve, Predict,
Rewrite, Filter, Demonstrate, Reflect and Aggregation/Generate. Related frameworks include DSP
(Demonstrate-Search-Predict, 2022), Rewrite-Retrieve-Read (2023) and Retrieve-then-Read (2023).
PART 03
Key Technologies and Evaluation