
Module 2

Embeddings, Vector Databases,


and Search

© Databricks Inc. — All rights reserved


Learning Objectives

By the end of this module you will:


• Understand vector search strategies and how to evaluate search results

• Understand the utility of vector databases

• Differentiate between vector databases, vector libraries, and vector plugins

• Learn best practices for when to use vector stores and how to improve
search-retrieval performance



How do language models learn knowledge?

Through model training or fine-tuning


• Via model weights
• More on fine-tuning in a later module

Through model inputs


• Insert knowledge or context into the input
• Ask the LM to incorporate the context in its output

This is what we will cover:


• How do we use vectors to search and provide relevant context to LMs?



Passing context to LMs helps factual recall

• Fine-tuning is usually better-suited to teach a model specialized tasks


• Analogy: Studying for an exam weeks away

• Passing context as model inputs improves factual recall


• Analogy: Take an exam with open notes
• Downsides:
• Context length limitation
• E.g., OpenAI’s gpt-3.5-turbo accepts a maximum of ~4,096 tokens (a few pages) as context
• Common mitigation method: pass document summaries instead
• Anthropic’s Claude: 100k token limit
• An ongoing research area (Pope et al., Fu et al.)
• Longer context = higher API costs = longer processing times

Source: OpenAI


Refresher: We represent words with vectors

We can project these vectors onto 2D to see how they relate graphically.

Source: “Word Embedding: Basics. Create a vector from a word” by Hariom Gautam (Medium)
Turn images and audio into vectors too
Data objects → Vectors → Tasks

• Images → [ … ] → object recognition, scene detection, product search
• Text → [ … ] → translation, question answering, semantic search
• Audio → [ … ] → speech to text, music transcription, machinery malfunction detection
Use cases of vector databases
• Similarity search: text, images, audio
• Semantic match, rather than keyword match!
• De-duplication
• Example on enhancing product search
• Very useful for knowledge-based Q/A

• Recommendation engines
• Example blog post: Spotify uses vector search to recommend podcast episodes

• Finding security threats
• Vectorizing virus binaries and finding anomalies

[Figure: shared embedding space for queries and podcast episodes. Related queries and episodes cluster together, e.g. “Are electric cars better for the environment?”, “electric cars climate impact”, “Environmental impact of electric vehicles”; likewise “How to cope with the pandemic”, “dealing with covid ptsd”, “Dealing with covid anxiety”.]

Source: Spotify



Search and Retrieval-Augmented Generation
The RAG workflow
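A minimal sketch of that workflow: retrieve the most relevant documents, then stuff them into the prompt. The toy bag-of-words embedding below stands in for a real embedding model, and the documents are made up for illustration.

```python
# Minimal RAG sketch: embed docs, retrieve the closest one, build a prompt.
# The embedding here is a toy bag-of-words vector, NOT a real model.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

docs = [
    "Electric vehicles reduce tailpipe emissions.",
    "Podcasts about coping with pandemic anxiety.",
]
index = [(doc, embed(doc)) for doc in docs]

def retrieve(query, k=1):
    # Rank stored documents by similarity to the query embedding.
    q = embed(query)
    return sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)[:k]

def build_prompt(query):
    # The retrieved documents become the context passed to the LLM.
    context = "\n".join(doc for doc, _ in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("Are electric cars better for the environment?")
```

The resulting `prompt` string would then be sent to the LLM, which answers using the retrieved context rather than only its weights.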



How does
vector search work?



Vector search strategies
• K-nearest neighbors (KNN)

• Approximate nearest neighbors (ANN)


• Trade accuracy for speed gains
• Examples of indexing algorithms:
• Tree-based: ANNOY by Spotify
• Proximity graphs: HNSW
• Clustering: FAISS by Facebook
• Hashing: LSH
• Vector compression: ScaNN by Google

Source: Weaviate

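The exhaustive KNN baseline that the ANN methods above approximate can be sketched in a few lines; note that it must scan every stored vector per query, which is what the ANN indexes avoid.

```python
# Exact k-nearest-neighbor search by brute force: O(N * d) per query.
# ANN indexes (ANNOY, HNSW, FAISS, LSH, ScaNN) avoid scanning every vector,
# trading a little recall for large speedups.
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, vectors, k=2):
    # Scan every stored vector and keep the indices of the k closest.
    scored = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return scored[:k]

vectors = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (5.0, 5.0)]
print(knn((0.0, 0.1), vectors, k=2))  # [0, 2]
```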


How to measure if 2 vectors are similar?
L2 (Euclidean) and cosine are most popular

Distance metrics: the higher the metric, the less similar.
Similarity metrics: the higher the metric, the more similar.

Source: buildin.com

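The two most popular metrics can be computed directly from their definitions; this small sketch shows how a distance and a similarity behave on the same pair of vectors.

```python
# L2 (Euclidean) distance: higher = less similar.
# Cosine similarity: higher = more similar (1.0 = same direction).
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

a, b = [1.0, 0.0], [0.6, 0.8]
print(l2_distance(a, b))        # ≈ 0.894 (small distance = similar)
print(cosine_similarity(a, b))  # 0.6 (cosine of the angle between them)
```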


Compressing vectors with Product Quantization
PQ stores vectors with fewer bytes

Quantization = representing vectors with a smaller set of reference vectors


• Naive example: round(8.954521346) = 9

Trade-off between recall and memory savings

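The rounding example above is scalar quantization; product quantization applies the same idea per sub-vector. A toy sketch follows, with tiny hand-picked codebooks where real PQ would learn them with k-means:

```python
# Product Quantization sketch: split each vector into sub-vectors and store
# only the ID of the nearest codebook centroid for each sub-vector.
# Codebooks here are tiny and hand-picked; real PQ learns them via k-means.
import math

codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # codebook for sub-vector 1 (dims 0-1)
    [(0.0, 1.0), (1.0, 0.0)],   # codebook for sub-vector 2 (dims 2-3)
]

def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def pq_encode(vec):
    subs = [vec[0:2], vec[2:4]]
    return [nearest(s, cb) for s, cb in zip(subs, codebooks)]

def pq_decode(codes):
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out

v = [0.9, 1.1, 0.1, 0.9]
codes = pq_encode(v)        # 2 small integers instead of 4 floats
approx = pq_decode(codes)   # lossy reconstruction: recall vs. memory trade-off
```

Storing two centroid IDs instead of four floats is exactly the recall-for-memory trade-off the slide describes: `approx` is close to `v`, but not equal to it.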


FAISS: Facebook AI Similarity Search
Forms clusters of dense vectors and conducts Product Quantization

• Given a query vector, identify which cell it belongs to
• Find all other vectors belonging to that cell
• Compute the Euclidean distance between those vectors and the query vector
• Limitation: not good with sparse vectors (refer to the GitHub issue)

Source: Pinecone
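The cell-based search described above (the inverted-file idea behind FAISS's IVF indexes) can be sketched without the library; centroids are hand-picked here, whereas FAISS learns them with k-means.

```python
# Inverted-file (IVF) search sketch: vectors are clustered into cells, and a
# query is compared only against the vectors in its own cell instead of the
# whole collection.
import math

centroids = [(0.0, 0.0), (10.0, 10.0)]
cells = {0: [], 1: []}          # inverted lists: cell id -> member vectors

def cell_of(vec):
    # Assign a vector to the cell of its nearest centroid.
    return min(range(len(centroids)),
               key=lambda i: math.dist(vec, centroids[i]))

for vec in [(0.5, 0.2), (0.1, 0.9), (9.5, 10.2), (10.5, 9.8)]:
    cells[cell_of(vec)].append(vec)

def ivf_search(query):
    # Search only the query's own cell, not all vectors.
    candidates = cells[cell_of(query)]
    return min(candidates, key=lambda v: math.dist(query, v))

result = ivf_search((0.4, 0.3))
```

Because only one cell is scanned, a true nearest neighbor sitting just across a cell boundary can be missed; that is the accuracy-for-speed trade ANN accepts.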


HNSW: Hierarchical Navigable Small Worlds
Builds proximity graphs based on Euclidean (L2) distance

• Uses skip-list-style linked lists to find an element, e.g. x = “11”
• Traverses from the entry node toward the query vector to find the nearest neighbor
• What happens if there are too many nodes? Use hierarchy!

Source: Pinecone
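The traversal step can be sketched as greedy search on a single proximity graph; full HNSW stacks such graphs into a hierarchy of layers. The graph below is hand-built for illustration.

```python
# Greedy search on a (single-layer) proximity graph, the core move in HNSW:
# from an entry node, repeatedly hop to the neighbor closest to the query,
# stopping when no neighbor improves on the current node.
import math

points = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (4.0, 0.0), 3: (4.0, 2.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # hand-built neighbor lists

def greedy_search(query, entry=0):
    current = entry
    while True:
        best = min(graph[current] + [current],
                   key=lambda n: math.dist(points[n], query))
        if best == current:          # no neighbor is closer: local minimum
            return current
        current = best

found = greedy_search((4.1, 1.9))
```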
The ability to search for similar objects is not limited to fuzzy text or exact matching rules.



Filtering



Adding filtering function is hard
I want Nike-only: need an additional metadata index for “Nike”

Types (Source: Pinecone):

• Post-query
• In-query
• Pre-query

No one-sized shoe fits all


Different vector databases implement this differently
Post-query filtering
Applies filters to top-k results after user queries

• Leverages ANN speed

• # of results is highly unpredictable
• Maybe no products meet the requirements



In-query filtering
Compute both product similarity and filters simultaneously

• Product similarity as vectors

• Branding as a scalar

• Leverages ANN speed

• May hit system OOM!


• Especially when many filters are applied

• Suitable for row-based data



Pre-query filtering
Search for products within a limited scope

• All data needs to be filtered first == brute-force search!
• Slows down the search

• Not as performant as
post- or in-query filtering

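The contrast between the filtering strategies above can be sketched over a toy product index (the products and brand names are made up for illustration):

```python
# Post-query vs. pre-query filtering sketch over a toy product index.
import math

products = [
    {"vec": (0.1, 0.2), "brand": "Nike"},
    {"vec": (0.2, 0.1), "brand": "Adidas"},
    {"vec": (0.9, 0.9), "brand": "Nike"},
]

def post_filter(query, brand, k=2):
    # Filter AFTER taking the top-k: keeps ANN speed, but the result count
    # is unpredictable and can even be empty.
    top_k = sorted(products, key=lambda p: math.dist(p["vec"], query))[:k]
    return [p for p in top_k if p["brand"] == brand]

def pre_filter(query, brand, k=2):
    # Filter FIRST, then brute-force search the reduced set: predictable
    # result count, but no ANN index speedup on the filtered subset.
    subset = [p for p in products if p["brand"] == brand]
    return sorted(subset, key=lambda p: math.dist(p["vec"], query))[:k]

q = (0.0, 0.0)
nike_post = post_filter(q, "Nike")   # top-2 contains an Adidas item, so < k
nike_pre = pre_filter(q, "Nike")     # always up to k Nike items
```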


Vector stores
Databases, libraries, plugins



Why are vector databases (VDBs) so hot?
Query time and scalability

• Specialized, full-fledged databases for unstructured data
• Inherit database properties, i.e. Create-Read-Update-Delete (CRUD)

• Speed up query search for the closest vectors
• Rely on ANN algorithms
• Organize embeddings into indices

Image Source: Weaviate


What about vector libraries or plugins?
Many don’t support filter queries, i.e. “WHERE” clauses

Libraries create vector indices:
• Approximate Nearest Neighbor (ANN) search algorithms
• Sufficient for small, static data
• Do not have CRUD support: need to rebuild the index, and need to wait for a full import to finish before querying
• Stored in-memory (RAM)
• No data replication

Plugins provide architectural enhancements:
• Relational databases or search systems may offer vector search plugins, e.g., Elasticsearch, pgvector
• Generally less rich features: fewer metric choices, fewer ANN choices, less user-friendly APIs

Caveat: things are moving fast! These weaknesses could improve soon!


Do I need a vector database?
Best practice: start without one; scale out as necessary.

Pros:
• Scalability: millions/billions of records
• Speed: fast query time (low latency)
• Full-fledged database properties
• If you use vector libraries instead, you need to devise your own way to store the objects and do filtering
• If data changes frequently, a vector database is cheaper than using an online model to compute embeddings dynamically

Cons:
• One more system to learn and integrate
• Added cost



Popular vector database comparisons

                 Billion-scale vector   Approximate Nearest    LangChain
                 support                Neighbor algorithm     integration

Open-sourced
  Chroma         No                     HNSW                   Yes
  Milvus         Yes                    FAISS, ANNOY, HNSW
  Qdrant         No                     HNSW
  Redis          No                     HNSW
  Weaviate       No                     HNSW
  Vespa          Yes                    Modified HNSW

Not open-sourced
  Pinecone       Yes                    Proprietary            Yes

*Note: the information is collected from public documentation. It is accurate as of May 2023.



Best practices



Do I always need a vector store?
Vector store includes vector databases, libraries or plugins

• Vector stores extend LLMs with knowledge

• The returned relevant documents become the LLM context
• Context can reduce hallucination

• Which use cases do not need context augmentation?


• Summarization
• Text classification
• Translation



How to improve retrieval performance?
This means users get better responses

• Embedding model selection


• Do I have the right embedding model for my data?
• Do my embeddings capture BOTH my documents and queries?

• Document storage strategy


• Should I store the whole document as one? Or split it up into chunks?



Tip 1: Choose your embedding model wisely
The embedding model should represent BOTH your queries and documents



Tip 2: Ensure embedding space is the same
for both queries and documents

• Use the same embedding model for indexing and querying


• OR, if you use different embedding models, make sure they are trained on similar data (and therefore produce the same embedding space!)
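A sketch of the rule in code: one embedding function, used for both indexing and querying. The character-frequency `embed` below is a toy stand-in for a real embedding model.

```python
# Sketch: the SAME embedding function must be used at indexing time and at
# query time, otherwise the distances between queries and documents are
# meaningless. `embed` is a toy character-frequency vector here.
def embed(text):
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# Index documents with `embed`...
doc_vectors = [embed(d) for d in ["vector search", "fine tuning"]]

# ...and embed queries with the very same function, so both live in the
# same 26-dimensional space.
query_vector = embed("search vectors")
```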



Chunking strategy: Should I split my docs?
Split into paragraphs? Sections?

• Chunking strategy determines


• How relevant is the context to the prompt?
• How much context (how many chunks) can I fit within the model’s token limit?
• Do I need to pass this output to the next LLM? (See the module on chaining LLMs into a workflow)

• Splitting a doc into smaller chunks means one doc can produce N vectors of M tokens each

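A naive fixed-size chunker with overlap shows the N-vectors-per-document idea; real splitters (e.g. LangChain's text splitters) also respect sentence and paragraph boundaries.

```python
# Naive fixed-size chunker: split one document into chunks of at most `size`
# words, with `overlap` words shared between consecutive chunks so that
# context straddling a boundary is not lost.
def chunk(text, size=20, overlap=5):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

doc = " ".join(f"w{i}" for i in range(50))
chunks = chunk(doc)   # each chunk becomes one vector in the index
```

With 50 words, size 20 and overlap 5, this yields 4 chunks; each becomes its own embedding vector.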


Chunking strategy is use-case specific
Another iterative step! Experiment with different chunk sizes and approaches

• How long are our documents?


• One sentence?
• N sentences?

• If chunk = sentence, embeddings focus on specific meaning

• If chunk = multiple paragraphs, embeddings capture broader theme


• How about splitting by headers?

• Do we know user behavior? How long are the queries?


• Long queries may have embeddings more aligned with the chunks returned
• Short queries can be more precise



Chunking best practices are not yet well-defined
It’s still a very new field!

Existing resources:
• Text Splitters by LangChain
• Blog post on semantic search by Vespa - light mention of chunking
• Chunking Strategies by Pinecone



Preventing silent failures and undesired behavior
• For users: include explicit instructions in prompts
• "Tell me the top 3 hikes in California. If you do not know the answer, do not
make it up. Say 'I don’t have information for that.'"
• Helpful when upstream embedding model selection is incorrect

• For software engineers


• Add failover logic
• If distance x exceeds threshold y, show a canned response rather than showing nothing
• Add a basic toxicity classification model on top
• Prevent users from submitting offensive inputs
• Discard offensive content to avoid training on it or saving it to the VDB
• Configure the VDB to time out if a query takes too long to return a response

Source: BBC
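The failover logic above can be sketched as a distance threshold check; the threshold value and the tiny index here are made up for illustration and would need tuning per embedding model.

```python
# Failover sketch: if even the best match is too far from the query, return
# a canned response instead of irrelevant context (a silent-failure guard).
import math

CANNED = "I don't have information for that."
THRESHOLD = 1.0   # hypothetical distance cutoff; tune per embedding model

index = [("hiking doc", (0.1, 0.1)), ("cooking doc", (0.2, 0.0))]

def answer(query_vec):
    doc, vec = min(index, key=lambda item: math.dist(item[1], query_vec))
    if math.dist(vec, query_vec) > THRESHOLD:
        return CANNED          # nothing relevant: fail over, don't fabricate
    return f"Context: {doc}"

answer((0.1, 0.2))   # close to the index: returns context
answer((9.0, 9.0))   # far from everything: canned response
```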



Module Summary
Embeddings, Vector Databases and Search - What have we learned?

• Vector stores are useful when you need context augmentation.


• Vector search is all about calculating vector similarities or distances.
• A vector database is a regular database with out-of-the-box search
capabilities.
• Vector databases are useful if you need database properties, have big
data, and need low latency.
• Select the right embedding model for your data.
• Iterate on your document splitting/chunking strategy.



Time for some code!

