Why RAG Fails in Production: A Solution

Syed Sherjeel

Sr. ML Engineer | Building AI That Works

Your RAG works in demos but fails in production. Here's the one capability you're missing.

The problem isn't your embeddings or your vector database. It's treating RAG like a pipeline instead of a reasoning system. Here's what actually works.

Traditional RAG (What Everyone Builds First)

1. Split documents into chunks
2. Create embeddings
3. Store in a vector database
4. User asks a question → retrieve top 5 results
5. Send to LLM

Simple. Clean. Breaks on real questions.

Why It Fails:

Single retrieval pass:
"Compare Q3 to Q4 revenue" → system gets Q3 OR Q4, not both → LLM guesses the rest

No way to refine:
→ First search misses? Done.
→ Can't run follow-up searches
→ Can't course-correct

Agentic RAG (What Actually Works)

Give your LLM search tools. Let it decide the strategy.

Tools:
- vector_search (semantic)
- keyword_search (exact match)
- metadata_filter (date, category, source)
- rerank (relevance scoring)

Example Flow:

User: "Compare Q3 to Q4 revenue"
Search 1: vector_search("Q3 2024 revenue")
Agent: "Got Q3, need Q4"
Search 2: vector_search("Q4 2024 revenue")
Agent: "Have both, ready to compare"

The agent decides when to stop searching.

The Metadata Trick:

User: "Latest engineering docs"
Agent applies filters first:
- department = "engineering"
- date > last_30_days

Then it searches 500 docs instead of 100K.

Results:
Traditional: 1 search, 65% accuracy, hallucinations
Agentic: 3-5 searches, 89% accuracy, cited sources

The Insight:
RAG needs multiple retrieval passes with adaptation. Pipelines can't do this. Agents can.

Building RAG? What's breaking for you?
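The example flow above can be sketched as a loop: the model proposes the next search, inspects what it has, and stops when the evidence is sufficient. This is a minimal illustration, not a production implementation: the tiny corpus, the rule-based decide() policy standing in for a tool-calling LLM, and the max_searches cap are all assumptions made for the sketch.

```python
# Minimal sketch of an agentic retrieval loop.
# CORPUS, decide(), and the overlap-based vector_search() are illustrative
# stand-ins; in a real system the LLM issues the tool calls.

CORPUS = {
    "q3_report": "Q3 2024 revenue was $4.2M, up 8% quarter over quarter.",
    "q4_report": "Q4 2024 revenue was $4.9M, driven by enterprise deals.",
}

def vector_search(query: str) -> list[str]:
    """Stand-in for semantic search: naive word-overlap scoring."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), text)
              for text in CORPUS.values()]
    scored.sort(reverse=True)
    return [text for score, text in scored if score > 0][:1]

def decide(question: str, evidence: list[str]) -> str | None:
    """Stand-in for the LLM's reasoning step: return the next search
    query, or None once the evidence covers the question."""
    for quarter in ("Q3", "Q4"):
        if quarter in question and not any(quarter in e for e in evidence):
            return f"{quarter} 2024 revenue"
    return None  # the agent, not the pipeline, decides to stop

def agentic_rag(question: str, max_searches: int = 5) -> list[str]:
    evidence: list[str] = []
    for _ in range(max_searches):
        query = decide(question, evidence)
        if query is None:
            break
        evidence.extend(vector_search(query))
    return evidence

evidence = agentic_rag("Compare Q3 to Q4 revenue")
```

Note the contrast with a single-pass pipeline: the loop takes two searches here, one per quarter, because the stop condition lives with the agent rather than being hard-coded at "retrieve top 5 once."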
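The metadata trick is just pre-filtering before any semantic search runs. A sketch, assuming documents carry department and date fields as in the post; the sample docs and the fixed "today" used for the 30-day cutoff are made up for the example.

```python
# Sketch of metadata pre-filtering: shrink the candidate set first,
# then run vector search over the survivors only.
from datetime import date, timedelta

DOCS = [
    {"text": "Deploy guide v2", "department": "engineering", "date": date(2025, 1, 20)},
    {"text": "Q4 sales deck",   "department": "sales",       "date": date(2025, 1, 22)},
    {"text": "Old runbook",     "department": "engineering", "date": date(2024, 6, 1)},
]

def metadata_filter(docs, department=None, after=None):
    """Apply exact-match filters before any embedding lookup."""
    out = docs
    if department is not None:
        out = [d for d in out if d["department"] == department]
    if after is not None:
        out = [d for d in out if d["date"] > after]
    return out

# "Latest engineering docs": filter on department and a 30-day window.
today = date(2025, 1, 25)  # fixed for reproducibility
candidates = metadata_filter(DOCS, department="engineering",
                             after=today - timedelta(days=30))
# a vector search would now score only `candidates`
```

In a real vector store this is usually a native filter parameter on the query rather than a Python list comprehension, but the effect is the same: the expensive similarity search runs over hundreds of documents instead of the full corpus.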

[diagram]
Muhammad Talha Shahid

Backend Developer | MERN Stack | Learning System Design & DSA | Microservices & Scalable Architecture

2w

Agentic RAG changes everything.

Ved Vekhande

I build AI Agents | n8n | LangGraph | Langchain | IIIT

2w

perfect analogy to explain

Sameer Khan

Follow me to learn AI and Visual Communication Design | Entrepreneur | Video Production Artist | Branding Expert | Content Creator | Trainer | Youtuber

2w

One-pass retrieval makes the model guess; multi-pass, agentic search makes it actually reason.

I love the way you explain. Everything looks so easy

Mohammad Syed

Founder & Principal Architect | AI/ML Architecture - AI Security - Cybersecurity | Securing AWS/Azure/GCP

2w

Syed Sherjeel, Agentic RAG wins.

Touseef Ullah

Chief of Staff @ Futureproof Labs | xSocial Champ | xShape Global | xEcoanalytics | xGul Ahmed

2w

Love the charts you attach with your posts

Will Scardino

AI Product Leader 🔥 Driving $168M+ for 100M users | Agentic AI @Verizon | Top 6% Voice | ✨AI PM BY DESIGN | Ex-Grubhub, Acxiom, Humana, FEMA

2w

Your excalidraw skills are off the charts ;) no pun intended

Abdul Raheem

Operations Associate at Superhuman | Financial Operations, Sales Growth, Sponsor Relationships

2w

These are such good tips Syed Sherjeel

Om Nalinde

Building & Teaching AI Agents | CS @ IIIT

2w

solid tips

Arsala Shinwari

Senior Account Manager @ Superhuman AI I Startups I Growth

2w

So true! Most RAG setups fall apart after one retrieval. Letting the system reason and refine makes all the difference.


