Building Blocks of RAG
Authors: Eduardo Alvarez, Senior AI Solutions Engineer, Intel • Sancha Norris, AI Product Marketing, Intel
Tailoring GenAI for Your Application
The public debut of ChatGPT has changed the AI landscape. Enterprises are rushing to take advantage of this new technology to gain a competitive edge through new products, improved productivity, and more cost-efficient operations.
Generative AI (GenAI) models, like Grok-1 (300B+ parameters) and GPT-4 (reportedly trillions of parameters), are trained on massive amounts of data from the internet and other text sources. These third-party large language models are good for general-purpose use cases. However, most enterprise use cases will require AI models to be trained and/or augmented with your data so the results are more relevant to your business. Here are some examples of how generative AI can be applied within various industries.
Retail
• Virtual fitting rooms
• In-store product-finding assistance
• Delivery and installation
• Demand prediction and inventory planning
• Novel product designs

Healthcare
• Assist busy front-line staff
• Transcribe and summarize medical notes
• Chatbots to answer medical questions
• Predictive analytics to inform diagnosis and treatments

Manufacturing
• Expert copilot for technicians
• Conversational interactions with machines
• Prescriptive and proactive field service
• Natural language troubleshooting
• Warranty status and documentation
• Understanding process bottlenecks, devising recovery strategies

Media and Entertainment
• Intelligent search, tailored content discovery
• Headline and copy development
• Real-time feedback on content quality
• Personalized playlists, news digests, recommendations
• Interactive storytelling via viewer choices
• Targeted offers, subscription plans

Financial Services
• Uncovering trading signals, alerting traders to vulnerable positions
• Accelerating underwriting decisions
• Optimizing and rebuilding legacy systems
• Reverse-engineering banking and insurance models
• Monitoring for potential financial crimes and fraud
• Automating data gathering for regulatory compliance
• Extracting insights from corporate disclosures
Source: Compiled by MIT Technology Review Insights, based on data from "Retail in the Age of Generative AI,"9 "The Great Unlock: Large Language Models in Manufacturing,"10 "Generative AI Is Everything Everywhere, All at Once," and "Large Language Models in Media & Entertainment,"12 Databricks, April-June 2023.
While you can use your data to fine-tune a model, retraining a model takes additional time and resources. An alternative popular technique, retrieval augmented generation (RAG), creates a domain-specific LLM by augmenting open-source pre-trained models with your proprietary data to develop business-specific results. RAG allows you to keep your data safe and secure without sharing it with third-party large foundation models.

In this introductory guide, we will explain how RAG can be paired with various Intel optimizations and platforms to yield incredible value and performance for production GenAI systems.
What is retrieval augmented generation (RAG)?
The RAG technique adds dynamic, query-dependent data into the model's prompt stream. Relevant
data is retrieved from a custom-built knowledge base stored in a vector database. The prompt and the
retrieved context enrich the model's output, delivering more relevant and accurate results. RAG allows
you to leverage your data with an LLM while keeping the integrity of your data private, as it's not sent to a
third party managing the model. The key components of the RAG workflow can be captured in four simple
steps: user query processing, retrieval, context incorporation and output generation. The diagram below
illustrates this basic flow.
Figure 1. Basic RAG flow: a user's question triggers a vector search over the private knowledge base (vector database); the relevant retrieved context is combined with the prompt and used to generate the answer.
RAG's utility is not confined to text; it can revolutionize video search and interactive document exploration, even enabling a chatbot to draw on PDF content for answers.

RAG applications are often called "RAG pipelines" due to their consistent data process flow, starting with the user prompt. The prompt is passed through the core component, the retrieval mechanism, which converts it into a vector embedding and uses vector search to find similar content in a pre-constructed vector database (e.g., from PDFs, logs, transcripts). The most relevant data is retrieved, incorporated with the user's prompt, and passed to a model for inference service and final output generation. This context incorporation provides models with additional information unavailable during pre-training, better aligning them with the user's task or domain of interest. Because RAG does not require retraining or fine-tuning the model, it can be an efficient way to add your data to provide context to an LLM.
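To make this flow concrete, here is a minimal Python sketch of a RAG pipeline. The embedding model, the toy document set, and the generate() stub are illustrative assumptions; a production system would use a managed vector database and a real LLM inference service.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def generate(prompt: str) -> str:
    # Stand-in for an LLM inference service; swap in your model endpoint.
    return f"[LLM response conditioned on:\n{prompt}]"

# 1. Build the knowledge base: embed private documents into vectors.
docs = [
    "Standard warranty covers parts and labor for 24 months.",
    "Returns are accepted within 30 days with proof of purchase.",
]
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def answer(query: str, top_k: int = 1) -> str:
    # 2. Retrieval: embed the query and vector-search the knowledge base.
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity on normalized vectors
    context = "\n".join(docs[i] for i in np.argsort(-scores)[:top_k])

    # 3. Context incorporation: enrich the prompt with retrieved text.
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # 4. Output generation: pass the augmented prompt to the LLM.
    return generate(prompt)

print(answer("How long does the warranty last?"))
```

The same four steps apply whatever the stack; only the index and inference backends change.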
The next section will explore the RAG solution architecture and stack.
Standard RAG Solution Architecture
The following RAG solution architecture provides an overview of the building blocks of a standard RAG
implementation. Core components of the flow include (1) building the knowledge base, (2) query and context retrieval, (3) response generation, and (4) production monitoring across applications.
Figure: Standard RAG solution architecture. (1) Building the knowledge base: an embedding model populates the vector database. (2) Query and context retrieval: vector search plus reranking. (3) Response generation: prompt template, LLM inference, and output guardrailing. (4) Production monitoring. All layers run on Intel optimizations and the underlying compute platform.
Intel optimizations bridge the gap between the toolchain and hardware, enhancing performance across the chain while ensuring compatibility
and improved functionality on Intel® Xeon® CPUs and Intel® Gaudi® accelerators. These optimizations are integrated into stock frameworks
or distributed as add-on extensions, with the goal of decreasing the need for extensive low-level programming. This abstraction enables
developers to focus on building RAG applications efficiently and effectively, leveraging enhanced performance and tailored solutions for
their specific use cases.
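To illustrate the add-on extension pattern, the sketch below applies Intel® Extension for PyTorch (IPEX) to a Hugging Face causal LLM. The model choice and generation settings are assumptions for the example, not a prescribed configuration.

```python
import torch
import intel_extension_for_pytorch as ipex  # pip install intel-extension-for-pytorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"  # example model; any Hugging Face causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# One-line optimization: IPEX applies Xeon-tuned kernels (e.g., AMX bfloat16)
# without changing how the model is called afterwards.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("What is retrieval augmented generation?", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```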
Let’s explore the various components of the toolchain in more detail.
Computational demand

Generally, LLM inference is the most computationally intensive phase of a RAG pipeline, particularly in a live production environment. However, creating the initial knowledge base (processing data and generating embeddings) can be equally demanding, depending on the data's complexity and volume. Intel's advancements in general compute technology, AI accelerators and confidential computing provide the essential building blocks for addressing the compute challenges of the entire RAG pipeline while ensuring data privacy and security.

Like most software applications, RAG benefits from a scalable infrastructure tailored to meet end-users' transactional demands. As transaction demand increases, developers may need to scale the retrieval and inference infrastructure accordingly.

In a recent study with Hugging Face, we evaluated throughput for peak encoding performance in terms of documents per second. Overall, for all model sizes, the quantized model shows up to a 4x improvement compared to the baseline bfloat16 (BF16) model across various batch sizes. Read more here: https://huggingface.co/blog/intel-fast-embedding

[Figure: BGE-small encoding throughput (examples/sec) across batch sizes.]
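As a rough way to reproduce the study's metric, documents per second can be measured with a simple timing harness like the sketch below. The corpus and batch sizes are toy assumptions, and a quantized variant from the study would be loaded in place of the baseline model for comparison.

```python
import time
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # baseline; swap in a quantized variant
docs = ["A sample passage about warranty terms and return policies."] * 4096

for batch_size in (32, 128, 512):
    start = time.perf_counter()
    model.encode(docs, batch_size=batch_size, show_progress_bar=False)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size}: {len(docs) / elapsed:,.0f} docs/sec")
```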
Figure 2. Queries-per-second (throughput) performance of SVS compared to other well-adopted implementations, HNSWlib and FAISS. The figure shows the QPS vs. recall curves for the rqa-768-10M-OOD dataset (10M 768-dimensional embeddings generated with the dense passage retriever model RocketQA [QDLL21]), with out-of-distribution queries (footnotes 2, 3).
Source: https://intellabs.github.io/ScalableVectorSearch/benchs/static/latest.html
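Benchmarks like Figure 2 sweep an index's search parameter to trade queries per second against recall@10, i.e., the overlap between the approximate top 10 and the exact top 10. Below is a minimal sketch of that measurement using FAISS's HNSW index as a stand-in, with random toy data; SVS ships its own benchmarking harness.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n_db, n_q, k = 768, 100_000, 1_000, 10
rng = np.random.default_rng(0)
xb = rng.standard_normal((n_db, d)).astype("float32")  # database vectors
xq = rng.standard_normal((n_q, d)).astype("float32")   # query vectors

# Ground truth from an exact (brute-force) index.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# Approximate index: an HNSW graph, one of the baselines in Figure 2.
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)
hnsw.hnsw.efSearch = 64  # raising efSearch trades QPS for higher recall
_, approx = hnsw.search(xq, k)

recall = np.mean([len(set(gt[i]) & set(approx[i])) / k for i in range(n_q)])
print(f"recall@{k} = {recall:.3f}")
```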
[Figure: Next-token latency for Llama 2 13B and GPT-J 6B across common generative workloads (chatbot, search, classification, text generation, code generation, content creation), measured against a <100 ms market requirement.]
See backup configuration for workload and configurations. Results may vary.
Figure 3. Llama 2 13B and GPT-J 6B performance on 5th Gen Intel® Xeon® Scalable processors (footnote 3)
[Figure: Average projected inference speedup (higher is better) for the Intel® Gaudi® 3 accelerator vs. an Nvidia H100 baseline (1.0), running common large language models (Llama 7B, Llama 70B, Falcon 180B) across 128- and 2048-token input/output configurations; projections range from roughly 0.9x to 1.7x.*]
Opportunities for RAG in Enterprise
Retail

Retailers face the challenge of recommending products that match their customers' diverse and changing preferences. Traditional recommendation systems may not effectively account for the latest trends or individual customer feedback, leading to less relevant suggestions.

Implementing a RAG-based recommendation system enables retailers to dynamically incorporate the latest trends and individual customer feedback into personalized product suggestions. This system enriches the shopping experience by offering relevant, timely and personalized product recommendations, driving sales and customer loyalty.

LEARN MORE

Financial services

Providing personalized financial advice at scale can be challenging due to the vast amount of ever-changing financial data and regulations. Customers expect quick, relevant and personalized financial advice that traditional chatbots cannot always accurately provide.

A RAG model enhances a financial advice chatbot by dynamically pulling the most current financial data and regulations to generate personalized advice. By leveraging a vast knowledge base, the chatbot can provide clients with tailored investment strategies, real-time market insights and regulatory advice, enhancing customer satisfaction and engagement.

LEARN MORE

Take the next step

When you are ready to kick-start your implementation, Intel provides a suite of resources to help you get started, from hardware access in the Intel® Tiber™ Developer Cloud to ubiquitous compute in major cloud providers like Google Cloud Platform, Amazon Web Services, and Microsoft Azure. For developers seeking code samples, walkthroughs, training and more, please visit Intel Developer Zone.

Intel® Tiber™ Developer Cloud
Accelerate AI development using Intel®-optimized software on the latest Intel® Xeon® processors and GPU compute.
LEARN MORE

Your Official Source for Developing on Intel® Hardware and Software
Explore Intel's most popular development areas and resources.
Intel GenAI Development Resources
1 Performance claims based on 4th Gen Intel® Xeon® 8480+ with 2 sockets, 56 cores per socket. The PyTorch model was evaluated with 56 cores on 1 CPU socket. IPEX/Optimum setups were evaluated with ipexrun, 1 CPU socket, and cores ranging from 22-56. TCMalloc was installed and defined as an environment variable in all runs. See www.intel.com/performanceindex for details. Results may vary.
2 Performance claims based on a 2-socket 4th Generation Intel® Xeon® Platinum 8480L CPU with 56 cores per socket, equipped with 512GB DDR4 memory per socket @4800MT/s speed, running Ubuntu 22.04. For the deep-96-1B dataset, we use a server with the same characteristics except that it is equipped with 1TB DDR4 memory per socket @4400MT/s speed. See www.intel.com/performanceindex for details. Results may vary.
3 Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
0424/SN/MESH/PDF 358260-001US