Building Blocks of RAG with Intel


Dive into Retrieval Augmented Generation (RAG), the innovative approach
that defines how organizations harness the value of their data with large
language models (LLMs). Explore some of the Intel hardware and software
building blocks that optimize RAG applications, enabling contextual, real-
time responses while simplifying deployment and enabling scale.

› Tailoring GenAI for Your Application
› What is Retrieval Augmented Generation (RAG)?
› Standard RAG Solution Architecture
› Technologies for RAG
› Accelerating RAG in Production
› Opportunities for RAG in Enterprise
› Take the Next Step

Authors: Eduardo Alvarez, Senior AI Solutions Engineer, Intel • Sancha Norris, AI Product Marketing, Intel
Tailoring GenAI for Your Application
The public debut of ChatGPT has changed the AI landscape. Enterprises are rushing to take advantage of
this new technology to give them a competitive edge with new products, improved productivity and more
cost-efficient operations.
Generative AI (GenAI) models, like Grok-1 (300B+ parameters) and GPT-4 (trillions+), are trained on
massive amounts of data from the internet and other text sources. These 3rd party large language models
are good for general-purpose use cases. However, most use cases for enterprises will require AI models to
be trained and/or augmented with your data so the results can be more relevant to your business. Here are
some examples of how generative AI can be applied within various industries.

Consumer Goods and Retail
• Virtual fitting rooms
• Delivery and installation
• In-store product-finding assistance
• Demand prediction and inventory planning
• Novel product designs

Healthcare & Medicine
• Assist busy front-line staff
• Transcribe and summarize medical notes
• Chatbots to answer medical questions
• Predictive analytics to inform diagnosis and treatments

Manufacturing
• Expert copilot for technicians
• Conversational interactions with machines
• Prescriptive and proactive field service
• Natural language troubleshooting
• Warranty status and documentation
• Understanding process bottlenecks, devising recovery strategies

Media & Entertainment
• Intelligent search, tailored content discovery
• Headline and copy development
• Real-time feedback on content quality
• Personalized playlists, news digests, recommendations
• Interactive storytelling via viewer choices
• Targeted offers, subscription plans

Financial Services
• Uncovering trading signals, alerting traders to vulnerable positions
• Accelerating underwriting decisions
• Optimizing and rebuilding legacy systems
• Reverse-engineering banking and insurance models
• Monitoring for potential financial crimes and fraud
• Automating data gathering for regulatory compliance
• Extracting insights from corporate disclosures

Source: Compiled by MIT Technology Review Insights, based on data from "Retail in the Age of Generative AI," "The Great Unlock: Large Language Models in Manufacturing," "Generative AI Is Everything Everywhere, All at Once," and "Large Language Models in Media & Entertainment," Databricks, April-June 2023.

While you can use your data to fine-tune a model, retraining a model takes additional time and resources. An alternative popular technique, retrieval augmented generation (RAG), creates a domain-specific LLM by augmenting open-source pre-trained models with your proprietary data to develop business-specific results. RAG allows you to keep your data safe and secure without sharing it with third-party large foundation models.

In this introductory guide, we will explain how RAG can be paired with various Intel optimizations and platforms to yield incredible value and performance for production GenAI systems.

What is Retrieval Augmented Generation (RAG)?

The RAG technique adds dynamic, query-dependent data into the model's prompt stream. Relevant
data is retrieved from a custom-built knowledge base stored in a vector database. The prompt and the
retrieved context enrich the model's output, delivering more relevant and accurate results. RAG allows
you to leverage your data with an LLM while keeping the integrity of your data private, as it's not sent to a
third party managing the model. The key components of the RAG workflow can be captured in four simple
steps: user query processing, retrieval, context incorporation and output generation. The diagram below
illustrates this basic flow.

[Diagram: Basic RAG flow. A user prompt is passed to the retrieval mechanism, which runs a vector search over the private knowledge base (vector database) to find relevant context. The retrieved context is combined with the user prompt and sent to a pre-trained LLM, which generates a response based on the retrieved context plus the user prompt.]
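To make the flow above concrete, here is a minimal, illustrative sketch of the four steps in Python. It is not part of the original guide: the libraries (sentence-transformers, FAISS, Hugging Face Transformers), the model names and the tiny in-memory knowledge base are placeholder assumptions chosen for brevity.

```python
# Minimal RAG sketch: (1) user query processing, (2) retrieval, (3) context
# incorporation, (4) output generation.
# Assumes: pip install sentence-transformers faiss-cpu transformers
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Toy "private knowledge base" standing in for chunked PDFs, logs, transcripts, etc.
documents = [
    "Our warranty covers manufacturing defects for 24 months from purchase.",
    "Returns are accepted within 30 days with the original receipt.",
    "Support hours are 8am-6pm CST, Monday through Friday.",
]

# Embed the knowledge base once and load it into a vector index.
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # placeholder embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vectors.shape[1])           # inner-product similarity
index.add(np.asarray(doc_vectors, dtype="float32"))

# 1) User query processing: embed the prompt with the same model.
query = "How long is the warranty?"
query_vector = embedder.encode([query], normalize_embeddings=True)

# 2) Retrieval: vector search for the most similar chunks.
_, top_ids = index.search(np.asarray(query_vector, dtype="float32"), 2)
retrieved_context = "\n".join(documents[i] for i in top_ids[0])

# 3) Context incorporation: enrich the prompt with the retrieved chunks.
prompt = (
    "Answer the question using only the context below.\n"
    f"Context:\n{retrieved_context}\n\nQuestion: {query}\nAnswer:"
)

# 4) Output generation with a pre-trained LLM (placeholder model for brevity).
generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```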

RAG's utility is not confined to text; it can revolutionize video search and interactive document exploration, even enabling a chatbot to draw on PDF content for answers.

RAG applications are often called "RAG pipelines" due to their consistent data process flow, starting with the user prompt. The prompt is passed through the core component, the retrieval mechanism, which converts it into a vector embedding and uses vector search to find similar content in a pre-constructed vector database (e.g., from PDFs, logs, transcripts). The most relevant data is retrieved, incorporated with the user's prompt, and passed to a model for inference and final output generation. This context incorporation provides models with additional information unavailable during pre-training, better aligning them with the user's task or domain of interest. Because RAG does not require retraining or fine-tuning the model, it can be an efficient way to add your data to provide context to an LLM.

The next section will explore the RAG solution architecture and stack.

Standard RAG Solution Architecture
The following RAG solution architecture provides an overview of the building blocks of a standard RAG implementation. Core components of the flow include (1) building the knowledge base, (2) query and context retrieval, (3) response generation and (4) production monitoring across applications.

[Diagram: RAG LLM Architecture. (1) Build knowledge base: a private knowledge base flows through a pre-processing pipeline; processed objects are chunked, run through an embedding model and stored in a vector DB. (2) Query and context retrieval: user input passes through authentication and query guardrailing, then vector search returns top-K results that are re-ranked into the retrieved context. (3) Response generation: a prompt template feeds LLM inference, followed by output guardrailing. (4) Production monitoring spans the entire pipeline.]

Let's expand on some of these components (a code sketch of the knowledge-base and retrieval steps follows this list):

1. Build the knowledge base:
• Data Collection: Assemble a private knowledge base from text-based sources such as transcripts, PDFs and digitized documents.
• Data Processing Pipeline: Utilize a RAG-specific pipeline for extracting text, formatting content for processing, and chunking data into manageable sizes.
• Vectorization: Process chunks through an embedding model to convert text into vectors, optionally including metadata for richer context.
• Vector Database Storage: Store vectorized data in a scalable vector database, ready for efficient retrieval.

2. Query and context retrieval:
• Query Submission: Users or a subsystem submit queries through a chat-like interface or API calls, authenticated by a secure service.
• Query Processing: Implement input safeguards for security and compliance, followed by query vectorization.
• Vector Search and Re-ranking: Conduct an initial vector search to retrieve relevant vectors, followed by a re-ranking process to refine results using a more complex model.

3. Response generation:
• LLM Inference and Response Generation: Combine top context with the user query, process through a pre-trained or fine-tuned LLM and post-process for quality and safety.
• Response Delivery: Return the final response to the user or subsystem through the interface, ensuring a coherent and contextually accurate answer.

4. Production monitoring:
• Retrieval Performance: Monitor latency and accuracy of the retrieval process, keeping records for auditing purposes.
• Re-ranking Efficiency: Track re-ranking performance, ensuring contextual relevance and speed.
• Inference Service Quality: Observe latency and quality of LLM inference, maintaining logs for auditing and improvement.
• Guardrail Effectiveness: Monitor guardrails for input and output processing, ensuring compliance and content safety.
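To illustrate steps 1 and 2 above, here is a hedged sketch of building a small knowledge base and retrieving from it. The chunking strategy, the Chroma in-memory vector database, the BGE embedding model and the toy document text are illustrative assumptions, not components prescribed by this guide.

```python
# Sketch of "build the knowledge base" and "query and context retrieval": chunk raw
# text, embed the chunks, store them with metadata, then fetch top-k chunks for a query.
# Assumes: pip install sentence-transformers chromadb (both are illustrative choices).
import chromadb
from sentence_transformers import SentenceTransformer


def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap; real pipelines often split
    on sentences, headings or tokens instead."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # placeholder embedding model
client = chromadb.Client()                                # in-memory vector DB for illustration
collection = client.create_collection(name="private_kb")

# Stand-in for the output of the data-processing pipeline (PDF/transcript extraction).
document_text = (
    "Refunds are issued to the original payment method within 10 business days. "
    "Items must be returned within 30 days of delivery in their original packaging. "
    "Warranty claims require proof of purchase and cover defects for 24 months. "
) * 3
chunks = chunk_text(document_text, chunk_size=120, overlap=20)

collection.add(
    ids=[f"doc-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks, normalize_embeddings=True).tolist(),
    metadatas=[{"source": "policy_manual", "chunk": i} for i in range(len(chunks))],
)

# Query and context retrieval: vectorize the query and fetch the top-k chunks,
# which would then go through re-ranking and into the prompt template.
query = "What is the refund policy?"
results = collection.query(
    query_embeddings=embedder.encode([query], normalize_embeddings=True).tolist(),
    n_results=3,
)
print(results["documents"][0])
```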
Technologies for RAG
Developing RAG applications usually starts with integrated RAG frameworks, such as Haystack,
LlamaIndex, LangChain and Intel Labs' fastRAG. These frameworks streamline development by offering
optimizations and integrating essential AI toolchains.
Let's consider the RAG toolchain in the three familiar components: knowledge base construction, query
and context retrieval and response generation. Often, RAG frameworks provide APIs that encompass the
entire toolchain. Choosing between using these abstractions and leveraging independent components is a
thoughtful engineering decision that should be considered carefully.

[Diagram: RAG toolchain. Across the top sit the three components: Building Knowledge Base, Query and Context Retrieval, and Response Generation. These are served by RAG frameworks, data processing tools, vector databases and search algorithms (Faiss, SVS, HNSW, Vamana), which run on Intel optimizations (oneMKL, oneDAL, oneDNN, oneCCL) and the underlying compute platform.]

Intel optimizations bridge the gap between the toolchain and hardware, enhancing performance across the chain while ensuring compatibility
and improved functionality on Intel® Xeon® CPUs and Intel® Gaudi® accelerators. These optimizations are integrated into stock frameworks
or distributed as add-on extensions, with the goal of decreasing the need for extensive low-level programming. This abstraction enables
developers to focus on building RAG applications efficiently and effectively, leveraging enhanced performance and tailored solutions for
their specific use cases.
Let’s explore the various components of the toolchain in more detail.

Building the knowledge base + context retrieval:
• Integrated Frameworks: Haystack and LangChain are notable RAG frameworks that offer high-level abstractions for vector databases and search algorithms, enabling developers to manage complex processes within Python-based environments.
• Vector Database Technologies: Pinecone, Redis and Chroma are some key vector database solutions that support popular search algorithms. Scalable Vector Search (SVS) from Intel Labs is another promising addition, expected to integrate with major vector databases by early 2024.
• Embedding and Model Accessibility: Embedding models, which are often integrated via Hugging Face APIs, can be seamlessly incorporated into RAG frameworks, making it easier to include advanced natural language processing (NLP) models.

Response generation:
• Low-Level Optimizations: oneAPI performance libraries optimize popular AI frameworks like PyTorch, TensorFlow and ONNX so you can use your familiar open source tools knowing they are optimized for Intel® hardware.
• Advanced Inference Optimization: Extensions such as Intel® Extension for PyTorch add advanced quantized inference techniques, boosting performance for large language models.

As you can see, RAG involves several interconnected components, and managing them on a single platform, like Intel Xeon CPUs, streamlines configuration, deployment and maintenance. For larger LLMs or high-throughput LLM inference, integrating Gaudi accelerators becomes an optimal solution for fulfilling application needs. A brief framework-level sketch follows.
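As a taste of the framework-level abstractions described above, the sketch below uses LangChain (one of the frameworks named earlier) to stand up a FAISS-backed retriever in a few lines. Treat it as an assumption-laden example: LangChain import paths and package names change between releases, and the embedding model is a placeholder.

```python
# Hedged sketch of the same retrieval flow expressed through a RAG framework
# (LangChain here; Haystack or fastRAG expose similar abstractions).
# Assumes: pip install langchain-community langchain-huggingface faiss-cpu
# Note: LangChain import paths move between releases; adjust to your installed version.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

texts = [
    "Our warranty covers manufacturing defects for 24 months from purchase.",
    "Returns are accepted within 30 days with the original receipt.",
]

# The framework hides chunk embedding, index construction and similarity search
# behind a vector-store object and a retriever interface.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")  # placeholder model
vector_store = FAISS.from_texts(texts, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 1})

docs = retriever.invoke("How long is the warranty?")  # older releases: get_relevant_documents()
print(docs[0].page_content)
```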
The following section dives into the complexities of RAG in production,
addressing various considerations and technologies that help teams
achieve successful deployment.
Accelerating RAG in Production
Many components of the RAG pipeline are computationally intense, while at the same time, end
users require low-latency responses. Additionally, because RAG is often used for confidential data, the
entire pipeline must be secure. Intel technologies can power the RAG pipeline, contributing to secure
performance across compute platforms and helping enable the full power of generative AI tailored to
specific domains and industries.

Computational demand

Generally, LLM inference is the most computationally intensive phase of a RAG pipeline, particularly in a live production environment. However, creating the initial knowledge base — processing data and generating embeddings — can be equally demanding, depending on the data's complexity and volume. Intel's advancements in general compute technology, AI accelerators and confidential computing provide the essential building blocks for addressing the compute challenges of the entire RAG pipeline while ensuring data privacy and security.

Like most software applications, RAG benefits from a scalable infrastructure tailored to meet end-users' transactional demands. As transaction demand increases, developers may experience increased latency due to the load on compute infrastructure, which becomes saturated by vector database queries and inference calculations. For this reason, it's crucial to have access to readily available compute resources to scale up systems to quickly handle increased demand. Equally important is the need to implement critical optimizations to boost the performance of key components such as embedding generation, vector search and inference.

Data privacy and security

• Secure AI Processing: Intel® Software Guard Extensions (Intel® SGX) and Intel® Trust Domain Extensions (Intel® TDX) boost data security via confidential computing and data encryption in CPU memory during processing. These technologies are crucial for handling sensitive information, contributing to the creation of secure RAG applications with encrypted data throughout the pipeline's various parts. This is an essential feature for RAG applications that require secure processing of sensitive data during vector embedding generation, retrieval or inference.
• Implement Proper Guardrailing: In RAG applications, guardrailing involves implementing measures to manage the behavior of the LLM within the RAG system. This includes monitoring the model's responses, helping to adhere to guidelines and best practices, and controlling its output to decrease the risks of toxicity, unfair bias and privacy breaches. Guardrailing in RAG applications helps maintain the trust and responsible usage of the LLM while ensuring it aligns with the system's overall goals and requirements.

Open-source optimizations

Embedding optimizations

• Quantized Embedding Models: Intel Xeon processors can take advantage of quantized embedding models to optimize the generation of vector embeddings from documents. A great example is bge-small-en-v1.5-rag-int8-static, a version of BAAI/bge-small-en-v1.5 quantized with Intel® Neural Compressor and compatible with Optimum-Intel. Retrieval and re-ranking tasks on the Massive Text Embedding Benchmark (MTEB) reveal a less than 2% difference between the floating-point (FP32) and quantized INT8 versions on MTEB performance metrics, while enhancing throughput (see footnotes 1, 3).

In a recent study with Hugging Face, we evaluated throughput for peak encoding performance in terms of documents per second. Overall, for all model sizes, the quantized model shows up to 4x improvement compared to the baseline bfloat16 (BF16) model across various batch sizes. Read more here: https://huggingface.co/blog/intel-fast-embedding

Figure 1. BGE-small throughput (examples/sec) versus batch size (1-256) for IPEX int8, IPEX bf16 (torchscript) and bf16 (pytorch) configurations. Source: https://huggingface.co/blog/intel-fast-embedding

Vector search optimizations

• CPU-Optimized Workloads: Vector search operations are highly optimized on Intel Xeon processors, particularly with the introduction of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) in 3rd generation processors or later. Intel AVX-512 leverages the fused multiply-add (FMA) instruction, which combines multiplication and addition in a single operation, enhancing inner product calculations — a fundamental operation in vector search (see the sketch after Figure 2). This capability significantly improves throughput and performance by reducing the number of instructions needed for computation.
• Scalable Vector Search (SVS): Scalable Vector Search (SVS) technology delivers fast vector search capabilities, optimizing retrieval times and improving overall system performance. SVS optimizes graph-based similarity search using locally-adaptive vector quantization (LVQ), which minimizes memory bandwidth requirements while maintaining accuracy. The result is significantly reduced distance calculation latency and higher performance in throughput and memory requirements, as demonstrated in the figure below.
Figure 2. Queries per second (throughput, log scale) of SVS compared to other well-adopted implementations, HNSWlib and Faiss-IVFPQfs, for query batch sizes of 10 and 1,000. The figure shows the QPS vs. recall@10 curves for the rqa-768-10M-OOD dataset (10M 768-dimensional embeddings generated with the dense passage retriever model RocketQA [QDLL21] with out-of-distribution queries). (Footnotes 2, 3)
Source: https://intellabs.github.io/ScalableVectorSearch/benchs/static/latest.html
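For intuition about why FMA-friendly inner products matter, the following NumPy sketch performs the brute-force version of vector search: one inner product between the query and every stored embedding, followed by a top-k selection. The data is random and purely illustrative; production engines such as SVS or Faiss use graph- or quantization-based indexes rather than scanning every vector, but the underlying distance primitive is the same.

```python
# Brute-force vector search in NumPy: the core cost is one inner product between the
# query and every stored embedding, exactly the multiply-accumulate pattern that
# AVX-512 FMA accelerates. Graph-based engines like SVS avoid scanning every vector,
# but the distance computation itself is the same primitive. Data here is random.
import numpy as np

rng = np.random.default_rng(0)
database = rng.standard_normal((100_000, 768)).astype(np.float32)  # stored embeddings
database /= np.linalg.norm(database, axis=1, keepdims=True)        # normalize once

query = rng.standard_normal(768).astype(np.float32)
query /= np.linalg.norm(query)

scores = database @ query                  # 100K inner products (cosine similarity here)
k = 10
top_k = np.argpartition(-scores, k)[:k]    # k best matches, unordered
top_k = top_k[np.argsort(-scores[top_k])]  # order them by similarity
print(top_k, scores[top_k])
```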

Inference optimizations

RAG primarily involves inference operations, which Intel Xeon processors support through advanced model compression techniques. These techniques enable operations at reduced precisions (BF16 and INT8) without significant performance loss. For larger models and high throughput requirements, Intel Gaudi accelerators offer excellent price/performance benefits and can replace CPUs and other accelerators for RAG inference. In this section, we'll outline various inference-specific optimizations and opportunities.

• Intel Advanced Matrix Extensions (Intel AMX): The 4th and 5th Gen Intel Xeon Scalable processors incorporate Intel AMX, enabling more efficient matrix operations and improved memory management.
• Open Source SOTA Inference Optimization Tools: Intel contributes to and extends popular deep learning frameworks like PyTorch, TensorFlow, Hugging Face, DeepSpeed, etc. Of interest for the RAG workflow are the opportunities to optimize LLMs by implementing model compression techniques like quantization. Intel® Extension for PyTorch currently provides a variety of state-of-the-art (SOTA) LLM quantization recipes such as SmoothQuant, weight-only quantization and mixed precision (FP32/BF16). The figure below showcases the inference latency performance of an INT8-quantized Llama 2 model running on a single-socket 4th Gen Intel Xeon platform; a brief sketch of applying these optimizations follows Figure 3.

Figure 3. Llama 2 13B and GPT-J 6B performance on 5th Gen Intel® Xeon® Scalable processors: next-token latency measured on a single node, two-socket 5th Gen Xeon 8592+ (64 cores per socket) across a range of LLM workloads, plotted against the <100 ms market requirement for next-token latency. See backup configuration for workloads and configurations. Results may vary. (Footnote 3)
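Below is a hedged sketch of applying Intel Extension for PyTorch to a Hugging Face causal LLM for BF16 inference on Xeon. The model checkpoint is a placeholder (Llama 2 weights are gated), and the INT8 SmoothQuant and weight-only quantization recipes mentioned above use additional IPEX APIs whose exact entry points vary by release, so only the basic optimize-and-generate path is shown.

```python
# Hedged sketch: BF16 inference for a causal LLM on Xeon with Intel Extension for PyTorch.
# The INT8 SmoothQuant / weight-only quantization recipes use additional IPEX APIs whose
# entry points vary by release; only the basic optimize-and-generate path is shown here.
# Assumes: pip install torch intel-extension-for-pytorch transformers
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint (gated; any causal LM works)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Fuse ops and prepare weights for Xeon (AMX / AVX-512) execution at BF16 precision.
model = ipex.optimize(model, dtype=torch.bfloat16)

prompt = "Answer the question using only the retrieved context: ..."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```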
Figure 4. LLM inference performance on Intel Gaudi 3: relative speedup (throughput in tokens per second, higher is better) projected for the Intel® Gaudi® 3 accelerator versus Nvidia H100 running common large language models (LLAMA2-7B, LLAMA2-70B and Falcon 180B) across 128/2048 input and output token configurations, for an average projection of 1.5x faster inferencing.
*Source: NV H100 comparison based on https://nvidia.github.io/TensorRT-LLM/performance.html#h100-gpus-fp8, Mar 28th, 2024. Reported numbers are per GPU, vs. Intel® Gaudi® 3 projections for LLAMA2-7B, LLAMA2-70B and Falcon 180B. Results may vary.

Inference complexity and Intel Gaudi


One of the benefits of RAG is that since you depend less on the LLM’s “knowledge” and more on its “language modeling capabilities,” you can
use models with a much lower parameter count. In many cases, a smaller 7B parameter model with RAG can beat larger models with tens of
billions of parameters on domain-specific tasks associated with the RAG model’s knowledge base.
Highly specialized tasks may sometimes require larger models and, consequently, specialized accelerators like Intel Gaudi processors. For
RAG applications requiring the highest throughput or lowest latency, run LLM inference on the highest-performing AI accelerator available,
such as an Intel Gaudi 3 processor.

Explore Intel Gaudi RAG resources to learn more


• Multi-Modal RAG Demo at Intel® Vision 2024 from Intel Labs Cognitive AI team
• A scalable Retrieval Augmented Generation (RAG) application using Hugging Face tools as a way of deploying optimized applications utilizing the Intel Gaudi 2 accelerator

Opportunities for RAG in Enterprise
Retail

Retailers face the challenge of recommending products that match their customers' diverse and changing preferences. Traditional recommendation systems may not effectively account for the latest trends or individual customer feedback, leading to less relevant suggestions.

Implementing a RAG-based recommendation system enables retailers to dynamically incorporate the latest trends and individual customer feedback into personalized product suggestions. This system enriches the shopping experience by offering relevant, timely and personalized product recommendations, driving sales and customer loyalty.

LEARN MORE

Manufacturing

In manufacturing, unexpected downtime due to equipment failure is a significant cost driver. Traditional predictive maintenance models may miss subtle anomalies that precede a failure, especially in complex machinery where historical failure data may be limited or nonexistent.

A RAG-based anomaly detection system for predictive maintenance can analyze vast amounts of operational data in real time, comparing it against an extensive knowledge base of equipment performance to identify potential failures before they occur. This approach minimizes downtime and maintenance costs while extending equipment life.

LEARN MORE

Financial services

Providing personalized financial advice at scale can be challenging due to the vast amount of ever-changing financial data and regulations. Customers expect quick, relevant and personalized financial advice that traditional chatbots cannot always accurately provide.

A RAG model enhances a financial advice chatbot by dynamically pulling the most current financial data and regulations to generate personalized advice. By leveraging a vast knowledge base, the chatbot can provide clients with tailored investment strategies, real-time market insights and regulatory advice, enhancing customer satisfaction and engagement.

LEARN MORE

Take the next step

When you are ready to kick-start your implementation, Intel provides a suite of resources to help you get started, from hardware access in the Intel® Tiber™ Developer Cloud to ubiquitous compute in major cloud providers like Google Cloud Platform, Amazon Web Services and Microsoft Azure. For developers seeking code samples, walkthroughs, training and more, please visit Intel Developer Zone.

Intel® Tiber™ Developer Cloud: Accelerate AI development using Intel®-optimized software on the latest Intel® Xeon® processors, Intel® Gaudi® accelerators, GPUs and other Intel platforms.

Cloud providers: Access Intel hardware and start building RAG applications on cloud providers like Amazon Web Services, Google Cloud Platform and Microsoft Azure.

Intel Developer Zone: Your official source for developing on Intel® hardware and software. Explore Intel's most popular development areas and resources, including Intel GenAI Development Resources.

1. Performance claims based on 4th Gen Intel Xeon 8480+ with 2 sockets, 56 cores per socket. The PyTorch model was evaluated with 56 cores on 1 CPU socket. IPEX/Optimum setups were evaluated with ipexrun, 1 CPU socket, and cores ranging from 22-56. TCMalloc was installed and defined as an environment variable in all runs. See www.intel.com/performanceindex for details. Results may vary.
2. Performance claims based on a 2-socket 4th Generation Intel® Xeon® Platinum 8480L CPU with 56 cores per socket, equipped with 512GB DDR4 memory per socket @4800MT/s speed, running Ubuntu 22.04. For the deep-96-1B dataset, we use a server with the same characteristics except that it is equipped with 1TB DDR4 memory per socket @4400MT/s speed. See www.intel.com/performanceindex for details. Results may vary.
3. Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
0424/SN/MESH/PDF 358260-001US
