Building Blocks of RAG with Intel


Dive into Retrieval Augmented Generation (RAG), the innovative approach
that defines how organizations harness the value of their data with large
language models (LLMs). Explore some of the Intel hardware and software
building blocks that optimize RAG applications, enabling contextual, real-
time responses while simplifying deployment and enabling scale.

› Tailoring GenAI for Your Application
› What is Retrieval Augmented Generation (RAG)?
› Standard RAG Solution Architecture
› Technologies for RAG
› Accelerating RAG in Production
› Opportunities for RAG in Enterprise
› Take the Next Step

Authors: Eduardo Alvarez, Senior AI Solutions Engineer, Intel • Sancha Norris, AI Product Marketing, Intel
Tailoring GenAI for Your Application
The public debut of ChatGPT has changed the AI landscape. Enterprises are rushing to take advantage of
this new technology to give them a competitive edge with new products, improved productivity and more
cost-efficient operations.
Generative AI (GenAI) models, like Grok-1 (300B+ parameters) and GPT-4 (trillions+), are trained on
massive amounts of data from the internet and other text sources. These 3rd party large language models
are good for general-purpose use cases. However, most use cases for enterprises will require AI models to
be trained and/or augmented with your data so the results can be more relevant to your business. Here are
some examples of how generative AI can be applied within various industries.

Consumer Goods and Retail
• Virtual fitting rooms
• Delivery and installation
• In-store product-finding assistance
• Demand prediction and inventory planning
• Novel product designs

Healthcare & Medicine
• Assist busy front-line staff
• Transcribe and summarize medical notes
• Chatbots to answer medical questions
• Predictive analytics to inform diagnosis and treatments

Manufacturing
• Expert copilot for technicians
• Conversational interactions with machines
• Prescriptive and proactive field service
• Natural language troubleshooting
• Warranty status and documentation
• Understanding process bottlenecks, devising recovery strategies

Media & Entertainment
• Intelligent search, tailored content discovery
• Headline and copy development
• Real-time feedback on content quality
• Personalized playlists, news digests, recommendations
• Interactive storytelling via viewer choices
• Targeted offers, subscription plans

Financial Services
• Uncovering trading signals, alerting traders to vulnerable positions
• Accelerating underwriting decisions
• Optimizing and rebuilding legacy systems
• Reverse-engineering banking and insurance models
• Monitoring for potential financial crimes and fraud
• Automating data gathering for regulatory compliance
• Extracting insights from corporate disclosures

Source: Compiled by MIT Technology Review Insights, based on data from "Retail in the Age of Generative AI," "The Great Unlock: Large Language Models in Manufacturing," "Generative AI Is Everything Everywhere, All at Once," and "Large Language Models in Media & Entertainment," Databricks, April-June 2023.

While you can use your data to fine-tune a model, retraining a model takes additional time and resources. An alternative popular technique, retrieval augmented generation (RAG), creates a domain-specific LLM by augmenting open-source pre-trained models with your proprietary data to develop business-specific results. RAG allows you to keep your data safe and secure without sharing it with third-party large foundation models.

In this introductory guide, we will explain how RAG can be paired with various Intel optimizations and platforms to yield incredible value and performance for production GenAI systems.

What is Retrieval Augmented Generation (RAG)?

The RAG technique adds dynamic, query-dependent data into the model's prompt stream. Relevant
data is retrieved from a custom-built knowledge base stored in a vector database. The prompt and the
retrieved context enrich the model's output, delivering more relevant and accurate results. RAG allows
you to leverage your data with an LLM while keeping the integrity of your data private, as it's not sent to a
third party managing the model. The key components of the RAG workflow can be captured in four simple
steps: user query processing, retrieval, context incorporation and output generation. The diagram below
illustrates this basic flow.

[Diagram: Basic RAG flow. A user prompt is passed to the retrieval mechanism, which runs a vector search over the private knowledge base (vector database) to find relevant context. The retrieved context is combined with the user prompt and sent to a pre-trained LLM, which generates a response based on the retrieved context plus the user prompt.]
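To make the flow above concrete, here is a minimal, illustrative sketch of the four steps in Python. It is not part of the original guide: the libraries (sentence-transformers, FAISS, Hugging Face Transformers), the model names and the tiny in-memory knowledge base are placeholder assumptions chosen for brevity.

```python
# Minimal RAG sketch: (1) user query processing, (2) retrieval, (3) context
# incorporation, (4) output generation.
# Assumes: pip install sentence-transformers faiss-cpu transformers
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Toy "private knowledge base" standing in for chunked PDFs, logs, transcripts, etc.
documents = [
    "Our warranty covers manufacturing defects for 24 months from purchase.",
    "Returns are accepted within 30 days with the original receipt.",
    "Support hours are 8am-6pm CST, Monday through Friday.",
]

# Embed the knowledge base once and load it into a vector index.
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # placeholder embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vectors.shape[1])           # inner-product similarity
index.add(np.asarray(doc_vectors, dtype="float32"))

# 1) User query processing: embed the prompt with the same model.
query = "How long is the warranty?"
query_vector = embedder.encode([query], normalize_embeddings=True)

# 2) Retrieval: vector search for the most similar chunks.
_, top_ids = index.search(np.asarray(query_vector, dtype="float32"), 2)
retrieved_context = "\n".join(documents[i] for i in top_ids[0])

# 3) Context incorporation: enrich the prompt with the retrieved chunks.
prompt = (
    "Answer the question using only the context below.\n"
    f"Context:\n{retrieved_context}\n\nQuestion: {query}\nAnswer:"
)

# 4) Output generation with a pre-trained LLM (placeholder model for brevity).
generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```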

RAG's utility is not confined to text; it can revolutionize video search and interactive document exploration, even enabling a chatbot to draw on PDF content for answers.

RAG applications are often called "RAG pipelines" due to their consistent data process flow, starting with the user prompt. The prompt is passed through the core component, the retrieval mechanism, which converts it into a vector embedding and uses vector search to find similar content in a pre-constructed vector database (e.g., from PDFs, logs, transcripts). The most relevant data is retrieved, incorporated with the user's prompt, and passed to a model for inference and final output generation. This context incorporation provides models with additional information unavailable during pre-training, better aligning them with the user's task or domain of interest. Because RAG does not require retraining or fine-tuning the model, it can be an efficient way to add your data to provide context to an LLM.

The next section will explore the RAG solution architecture and stack.

Standard RAG Solution Architecture
The following RAG solution architecture provides an overview of the building blocks of a standard RAG implementation. Core components of the flow include (1) building the knowledge base, (2) query and context retrieval, (3) response generation and (4) production monitoring across applications.

[Diagram: RAG LLM Architecture. (1) Build knowledge base: a private knowledge base flows through a pre-processing pipeline; processed objects are chunked, run through an embedding model and stored in a vector DB. (2) Query and context retrieval: user input passes through authentication and query guardrailing, then vector search returns top-K results that are re-ranked into the retrieved context. (3) Response generation: a prompt template feeds LLM inference, followed by output guardrailing. (4) Production monitoring spans the entire pipeline.]

Let's expand on some of these components (a code sketch of the knowledge-base and retrieval steps follows this list):

1. Build the knowledge base:
• Data Collection: Assemble a private knowledge base from text-based sources such as transcripts, PDFs and digitized documents.
• Data Processing Pipeline: Utilize a RAG-specific pipeline for extracting text, formatting content for processing, and chunking data into manageable sizes.
• Vectorization: Process chunks through an embedding model to convert text into vectors, optionally including metadata for richer context.
• Vector Database Storage: Store vectorized data in a scalable vector database, ready for efficient retrieval.

2. Query and context retrieval:
• Query Submission: Users or a subsystem submit queries through a chat-like interface or API calls, authenticated by a secure service.
• Query Processing: Implement input safeguards for security and compliance, followed by query vectorization.
• Vector Search and Re-ranking: Conduct an initial vector search to retrieve relevant vectors, followed by a re-ranking process to refine results using a more complex model.

3. Response generation:
• LLM Inference and Response Generation: Combine top context with the user query, process through a pre-trained or fine-tuned LLM and post-process for quality and safety.
• Response Delivery: Return the final response to the user or subsystem through the interface, ensuring a coherent and contextually accurate answer.

4. Production monitoring:
• Retrieval Performance: Monitor latency and accuracy of the retrieval process, keeping records for auditing purposes.
• Re-ranking Efficiency: Track re-ranking performance, ensuring contextual relevance and speed.
• Inference Service Quality: Observe latency and quality of LLM inference, maintaining logs for auditing and improvement.
• Guardrail Effectiveness: Monitor guardrails for input and output processing, ensuring compliance and content safety.
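To illustrate steps 1 and 2 above, here is a hedged sketch of building a small knowledge base and retrieving from it. The chunking strategy, the Chroma in-memory vector database, the BGE embedding model and the toy document text are illustrative assumptions, not components prescribed by this guide.

```python
# Sketch of "build the knowledge base" and "query and context retrieval": chunk raw
# text, embed the chunks, store them with metadata, then fetch top-k chunks for a query.
# Assumes: pip install sentence-transformers chromadb (both are illustrative choices).
import chromadb
from sentence_transformers import SentenceTransformer


def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap; real pipelines often split
    on sentences, headings or tokens instead."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # placeholder embedding model
client = chromadb.Client()                                # in-memory vector DB for illustration
collection = client.create_collection(name="private_kb")

# Stand-in for the output of the data-processing pipeline (PDF/transcript extraction).
document_text = (
    "Refunds are issued to the original payment method within 10 business days. "
    "Items must be returned within 30 days of delivery in their original packaging. "
    "Warranty claims require proof of purchase and cover defects for 24 months. "
) * 3
chunks = chunk_text(document_text, chunk_size=120, overlap=20)

collection.add(
    ids=[f"doc-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks, normalize_embeddings=True).tolist(),
    metadatas=[{"source": "policy_manual", "chunk": i} for i in range(len(chunks))],
)

# Query and context retrieval: vectorize the query and fetch the top-k chunks,
# which would then go through re-ranking and into the prompt template.
query = "What is the refund policy?"
results = collection.query(
    query_embeddings=embedder.encode([query], normalize_embeddings=True).tolist(),
    n_results=3,
)
print(results["documents"][0])
```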
Technologies for RAG
Developing RAG applications usually starts with integrated RAG frameworks, such as Haystack,
LlamaIndex, LangChain and Intel Labs' fastRAG. These frameworks streamline development by offering
optimizations and integrating essential AI toolchains.
Let's consider the RAG toolchain in the three familiar components: knowledge base construction, query
and context retrieval and response generation. Often, RAG frameworks provide APIs that encompass the
entire toolchain. Choosing between using these abstractions and leveraging independent components is a
thoughtful engineering decision that should be considered carefully.

[Diagram: RAG toolchain. Across the top sit the three components: Building Knowledge Base, Query and Context Retrieval, and Response Generation. These are served by RAG frameworks, data processing tools, vector databases and search algorithms (Faiss, SVS, HNSW, Vamana), which run on Intel optimizations (oneMKL, oneDAL, oneDNN, oneCCL) and the underlying compute platform.]

Intel optimizations bridge the gap between the toolchain and hardware, enhancing performance across the chain while ensuring compatibility
and improved functionality on Intel® Xeon® CPUs and Intel® Gaudi® accelerators. These optimizations are integrated into stock frameworks
or distributed as add-on extensions, with the goal of decreasing the need for extensive low-level programming. This abstraction enables
developers to focus on building RAG applications efficiently and effectively, leveraging enhanced performance and tailored solutions for
their specific use cases.
Let’s explore the various components of the toolchain in more detail.

Building the knowledge base + context retrieval:
• Integrated Frameworks: Haystack and LangChain are notable RAG frameworks that offer high-level abstractions for vector databases and search algorithms, enabling developers to manage complex processes within Python-based environments.
• Vector Database Technologies: Pinecone, Redis and Chroma are some key vector database solutions that support popular search algorithms. Scalable Vector Search (SVS) from Intel Labs is another promising addition, expected to integrate with major vector databases by early 2024.
• Embedding and Model Accessibility: Embedding models, which are often integrated via Hugging Face APIs, can be seamlessly incorporated into RAG frameworks, making it easier to include advanced natural language processing (NLP) models.

Response generation:
• Low-Level Optimizations: oneAPI performance libraries optimize popular AI frameworks like PyTorch, TensorFlow and ONNX so you can use your familiar open source tools knowing they are optimized for Intel® hardware.
• Advanced Inference Optimization: Extensions such as Intel® Extension for PyTorch add advanced quantized inference techniques, boosting performance for large language models.

As you can see, RAG involves several interconnected components, and managing them on a single platform, like Intel Xeon CPUs, streamlines configuration, deployment and maintenance. For larger LLMs or high-throughput LLM inference, integrating Gaudi accelerators becomes an optimal solution for fulfilling application needs. A brief framework-level sketch follows.
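As a taste of the framework-level abstractions described above, the sketch below uses LangChain (one of the frameworks named earlier) to stand up a FAISS-backed retriever in a few lines. Treat it as an assumption-laden example: LangChain import paths and package names change between releases, and the embedding model is a placeholder.

```python
# Hedged sketch of the same retrieval flow expressed through a RAG framework
# (LangChain here; Haystack or fastRAG expose similar abstractions).
# Assumes: pip install langchain-community langchain-huggingface faiss-cpu
# Note: LangChain import paths move between releases; adjust to your installed version.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

texts = [
    "Our warranty covers manufacturing defects for 24 months from purchase.",
    "Returns are accepted within 30 days with the original receipt.",
]

# The framework hides chunk embedding, index construction and similarity search
# behind a vector-store object and a retriever interface.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")  # placeholder model
vector_store = FAISS.from_texts(texts, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 1})

docs = retriever.invoke("How long is the warranty?")  # older releases: get_relevant_documents()
print(docs[0].page_content)
```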
The following section dives into the complexities of RAG in production,
addressing various considerations and technologies that help teams
achieve successful deployment.
Accelerating RAG in Production
Many components of the RAG pipeline are computationally intense, while at the same time, end
users require low-latency responses. Additionally, because RAG is often used for confidential data, the
entire pipeline must be secure. Intel technologies can power the RAG pipeline, contributing to secure
performance across compute platforms and helping enable the full power of generative AI tailored to
specific domains and industries.

Computational demand

Generally, LLM inference is the most computationally intensive phase of a RAG pipeline, particularly in a live production environment. However, creating the initial knowledge base — processing data and generating embeddings — can be equally demanding, depending on the data's complexity and volume. Intel's advancements in general compute technology, AI accelerators and confidential computing provide the essential building blocks for addressing the compute challenges of the entire RAG pipeline while ensuring data privacy and security.

Like most software applications, RAG benefits from a scalable infrastructure tailored to meet end-users' transactional demands. As transaction demand increases, developers may experience increased latency due to the load on compute infrastructure, which becomes saturated by vector database queries and inference calculations. For this reason, it's crucial to have access to readily available compute resources to scale up systems to quickly handle increased demand. Equally important is the need to implement critical optimizations to boost the performance of key components such as embedding generation, vector search and inference.

Data privacy and security

• Secure AI Processing: Intel® Software Guard Extensions (Intel® SGX) and Intel® Trust Domain Extensions (Intel® TDX) boost data security via confidential computing and data encryption in CPU memory during processing. These technologies are crucial for handling sensitive information, contributing to the creation of secure RAG applications with encrypted data throughout the pipeline's various parts. This is an essential feature for RAG applications that require secure processing of sensitive data during vector embedding generation, retrieval or inference.
• Implement Proper Guardrailing: In RAG applications, guardrailing involves implementing measures to manage the behavior of the LLM within the RAG system. This includes monitoring the model's responses, helping to adhere to guidelines and best practices, and controlling its output to decrease the risks of toxicity, unfair bias and privacy breaches. Guardrailing in RAG applications helps maintain the trust and responsible usage of the LLM while ensuring it aligns with the system's overall goals and requirements.

Open-source optimizations

Embedding optimizations

• Quantized Embedding Models: Intel Xeon processors can take advantage of quantized embedding models to optimize the generation of vector embeddings from documents. A great example is bge-small-en-v1.5-rag-int8-static, a version of BAAI/bge-small-en-v1.5 quantized with Intel® Neural Compressor and compatible with Optimum-Intel. Retrieval and re-ranking tasks on the Massive Text Embedding Benchmark (MTEB) reveal a less than 2% difference between the floating-point (FP32) and quantized INT8 versions on MTEB performance metrics, while enhancing throughput (see footnotes 1, 3).

In a recent study with Hugging Face, we evaluated throughput for peak encoding performance in terms of documents per second. Overall, for all model sizes, the quantized model shows up to 4x improvement compared to the baseline bfloat16 (BF16) model across various batch sizes. Read more here: https://huggingface.co/blog/intel-fast-embedding

Figure 1. BGE-small throughput (examples/sec) versus batch size (1-256) for IPEX int8, IPEX bf16 (torchscript) and bf16 (pytorch) configurations. Source: https://huggingface.co/blog/intel-fast-embedding

Vector search optimizations

• CPU-Optimized Workloads: Vector search operations are highly optimized on Intel Xeon processors, particularly with the introduction of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) in 3rd generation processors or later. Intel AVX-512 leverages the fused multiply-add (FMA) instruction, which combines multiplication and addition in a single operation, enhancing inner product calculations — a fundamental operation in vector search (see the sketch after Figure 2). This capability significantly improves throughput and performance by reducing the number of instructions needed for computation.
• Scalable Vector Search (SVS): Scalable Vector Search (SVS) technology delivers fast vector search capabilities, optimizing retrieval times and improving overall system performance. SVS optimizes graph-based similarity search using locally-adaptive vector quantization (LVQ), which minimizes memory bandwidth requirements while maintaining accuracy. The result is significantly reduced distance calculation latency and higher performance in throughput and memory requirements, as demonstrated in the figure below.
Figure 2. Queries per second (throughput, log scale) of SVS compared to other well-adopted implementations, HNSWlib and Faiss-IVFPQfs, for query batch sizes of 10 and 1,000. The figure shows the QPS vs. recall@10 curves for the rqa-768-10M-OOD dataset (10M 768-dimensional embeddings generated with the dense passage retriever model RocketQA [QDLL21] with out-of-distribution queries). (Footnotes 2, 3)
Source: https://intellabs.github.io/ScalableVectorSearch/benchs/static/latest.html
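For intuition about why FMA-friendly inner products matter, the following NumPy sketch performs the brute-force version of vector search: one inner product between the query and every stored embedding, followed by a top-k selection. The data is random and purely illustrative; production engines such as SVS or Faiss use graph- or quantization-based indexes rather than scanning every vector, but the underlying distance primitive is the same.

```python
# Brute-force vector search in NumPy: the core cost is one inner product between the
# query and every stored embedding, exactly the multiply-accumulate pattern that
# AVX-512 FMA accelerates. Graph-based engines like SVS avoid scanning every vector,
# but the distance computation itself is the same primitive. Data here is random.
import numpy as np

rng = np.random.default_rng(0)
database = rng.standard_normal((100_000, 768)).astype(np.float32)  # stored embeddings
database /= np.linalg.norm(database, axis=1, keepdims=True)        # normalize once

query = rng.standard_normal(768).astype(np.float32)
query /= np.linalg.norm(query)

scores = database @ query                  # 100K inner products (cosine similarity here)
k = 10
top_k = np.argpartition(-scores, k)[:k]    # k best matches, unordered
top_k = top_k[np.argsort(-scores[top_k])]  # order them by similarity
print(top_k, scores[top_k])
```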

Inference optimizations

RAG primarily involves inference operations, which Intel Xeon processors support through advanced model compression techniques. These techniques enable operations at reduced precisions (BF16 and INT8) without significant performance loss. For larger models and high throughput requirements, Intel Gaudi accelerators offer excellent price/performance benefits and can replace CPUs and other accelerators for RAG inference. In this section, we'll outline various inference-specific optimizations and opportunities.

• Intel Advanced Matrix Extensions (Intel AMX): The 4th and 5th Gen Intel Xeon Scalable processors incorporate Intel AMX, enabling more efficient matrix operations and improved memory management.
• Open Source SOTA Inference Optimization Tools: Intel contributes to and extends popular deep learning frameworks like PyTorch, TensorFlow, Hugging Face, DeepSpeed, etc. Of interest for the RAG workflow are the opportunities to optimize LLMs by implementing model compression techniques like quantization. Intel® Extension for PyTorch currently provides a variety of state-of-the-art (SOTA) LLM quantization recipes such as SmoothQuant, weight-only quantization and mixed precision (FP32/BF16). The figure below showcases the inference latency performance of an INT8-quantized Llama 2 model running on a single-socket 4th Gen Intel Xeon platform; a brief sketch of applying these optimizations follows Figure 3.

Figure 3. Llama 2 13B and GPT-J 6B performance on 5th Gen Intel® Xeon® Scalable processors: next-token latency measured on a single node, two-socket 5th Gen Xeon 8592+ (64 cores per socket) across a range of LLM workloads, plotted against the <100 ms market requirement for next-token latency. See backup configuration for workloads and configurations. Results may vary. (Footnote 3)
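Below is a hedged sketch of applying Intel Extension for PyTorch to a Hugging Face causal LLM for BF16 inference on Xeon. The model checkpoint is a placeholder (Llama 2 weights are gated), and the INT8 SmoothQuant and weight-only quantization recipes mentioned above use additional IPEX APIs whose exact entry points vary by release, so only the basic optimize-and-generate path is shown.

```python
# Hedged sketch: BF16 inference for a causal LLM on Xeon with Intel Extension for PyTorch.
# The INT8 SmoothQuant / weight-only quantization recipes use additional IPEX APIs whose
# entry points vary by release; only the basic optimize-and-generate path is shown here.
# Assumes: pip install torch intel-extension-for-pytorch transformers
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint (gated; any causal LM works)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Fuse ops and prepare weights for Xeon (AMX / AVX-512) execution at BF16 precision.
model = ipex.optimize(model, dtype=torch.bfloat16)

prompt = "Answer the question using only the retrieved context: ..."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```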
Figure 4. LLM inference performance on Intel Gaudi 3: relative speedup (throughput in tokens per second, higher is better) projected for the Intel® Gaudi® 3 accelerator versus Nvidia H100 running common large language models (LLAMA2-7B, LLAMA2-70B and Falcon 180B) across 128/2048 input and output token configurations, for an average projection of 1.5x faster inferencing.
*Source: NV H100 comparison based on https://nvidia.github.io/TensorRT-LLM/performance.html#h100-gpus-fp8, Mar 28th, 2024. Reported numbers are per GPU, vs. Intel® Gaudi® 3 projections for LLAMA2-7B, LLAMA2-70B and Falcon 180B. Results may vary.

Inference complexity and Intel Gaudi


One of the benefits of RAG is that since you depend less on the LLM’s “knowledge” and more on its “language modeling capabilities,” you can
use models with a much lower parameter count. In many cases, a smaller 7B parameter model with RAG can beat larger models with tens of
billions of parameters on domain-specific tasks associated with the RAG model’s knowledge base.
Highly specialized tasks may sometimes require larger models and, consequently, specialized accelerators like Intel Gaudi processors. For
RAG applications requiring the highest throughput or lowest latency, run LLM inference on the highest-performing AI accelerator available,
such as an Intel Gaudi 3 processor.

Explore Intel Gaudi RAG resources to learn more


• Multi-Modal RAG Demo at Intel® Vision 2024 from Intel Labs Cognitive AI team
• A scalable Retrieval Augmented Generation (RAG) application using Hugging Face tools as a way of deploying optimized applications utilizing the Intel Gaudi 2 accelerator

Opportunities for RAG in Enterprise
Retail

Retailers face the challenge of recommending products that match their customers' diverse and changing preferences. Traditional recommendation systems may not effectively account for the latest trends or individual customer feedback, leading to less relevant suggestions.

Implementing a RAG-based recommendation system enables retailers to dynamically incorporate the latest trends and individual customer feedback into personalized product suggestions. This system enriches the shopping experience by offering relevant, timely and personalized product recommendations, driving sales and customer loyalty.

LEARN MORE

Manufacturing

In manufacturing, unexpected downtime due to equipment failure is a significant cost driver. Traditional predictive maintenance models may miss subtle anomalies that precede a failure, especially in complex machinery where historical failure data may be limited or nonexistent.

A RAG-based anomaly detection system for predictive maintenance can analyze vast amounts of operational data in real time, comparing it against an extensive knowledge base of equipment performance to identify potential failures before they occur. This approach minimizes downtime and maintenance costs while extending equipment life.

LEARN MORE

Financial services

Providing personalized financial advice at scale can be challenging due to the vast amount of ever-changing financial data and regulations. Customers expect quick, relevant and personalized financial advice that traditional chatbots cannot always accurately provide.

A RAG model enhances a financial advice chatbot by dynamically pulling the most current financial data and regulations to generate personalized advice. By leveraging a vast knowledge base, the chatbot can provide clients with tailored investment strategies, real-time market insights and regulatory advice, enhancing customer satisfaction and engagement.

LEARN MORE

Take the next step

When you are ready to kick-start your implementation, Intel provides a suite of resources to help you get started, from hardware access in the Intel® Tiber™ Developer Cloud to ubiquitous compute in major cloud providers like Google Cloud Platform, Amazon Web Services and Microsoft Azure. For developers seeking code samples, walkthroughs, training and more, please visit Intel Developer Zone.

Intel® Tiber™ Developer Cloud: Accelerate AI development using Intel®-optimized software on the latest Intel® Xeon® processors, Intel® Gaudi® accelerators, GPUs and other Intel platforms.

Cloud providers: Access Intel hardware and start building RAG applications on cloud providers like Amazon Web Services, Google Cloud Platform and Microsoft Azure.

Intel Developer Zone: Your official source for developing on Intel® hardware and software. Explore Intel's most popular development areas and resources, including Intel GenAI Development Resources.

1. Performance claims based on 4th Gen Intel Xeon 8480+ with 2 sockets, 56 cores per socket. The PyTorch model was evaluated with 56 cores on 1 CPU socket. IPEX/Optimum setups were evaluated with ipexrun, 1 CPU socket, and cores ranging from 22-56. TCMalloc was installed and defined as an environment variable in all runs. See www.intel.com/performanceindex for details. Results may vary.
2. Performance claims based on a 2-socket 4th Generation Intel® Xeon® Platinum 8480L CPU with 56 cores per socket, equipped with 512GB DDR4 memory per socket @4800MT/s speed, running Ubuntu 22.04. For the deep-96-1B dataset, we use a server with the same characteristics except that it is equipped with 1TB DDR4 memory per socket @4400MT/s speed. See www.intel.com/performanceindex for details. Results may vary.
3. Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
0424/SN/MESH/PDF 358260-001US
