RAG: ground prompts in real data, kill hallucinations
Retrieval-Augmented Generation (RAG) injects relevant documents into the prompt at query time so the model answers from real data instead of guessing. Learn the core loop, the failure modes, and what to build vs. buy.
A customer asks your support bot: "What's your refund policy for the Pro plan if I cancel mid-cycle?"
Without RAG, the model has two options. (1) Refuse to answer because it doesn't know your policy. (2) Make something up that sounds reasonable. Both are bad. Option 2 is worse — it sounds authoritative and might commit you to a refund policy you don't actually have.
With RAG, the model retrieves your actual policy doc, injects it into the prompt, and answers from real text. Suddenly the bot is reasoning over your business, not guessing from training memory. This is why RAG — Retrieval-Augmented Generation — is one of the highest-impact techniques in production LLM applications, and one of the most widely deployed LLM architectures today.
The whole idea in one line: retrieve the documents relevant to the question, inject them into the prompt, and generate the answer using only those documents.
The mental model: open-book vs. closed-book#
Direct prompting is a closed-book exam. The model answers from whatever it remembers from training. Sometimes brilliant, sometimes confidently wrong, never up-to-date past its training cutoff.
RAG turns it into an open-book exam. The relevant pages are placed on the desk before the question is asked. The model still has to read, reason, and synthesize — but it's reading from your actual source-of-truth, not its memory of similar topics.
Three problems this solves at once:
- Knowledge cutoff. Anything post-training (recent docs, today's data, your private corpus) is invisible to direct prompting. RAG makes it visible.
- Hallucination. When asked something it half-knows, the model fabricates plausible details. With retrieved context, the answer is in front of it — no need to invent.
- Auditability. RAG outputs can cite source documents. Without RAG, "where did that fact come from" has no answer.
The four-step loop#
Every RAG system, no matter how complex, runs the same four-step loop. Understanding this loop is enough to build a working RAG system; the rest is optimization.
Step 1 — Index (offline, runs once)#
Take your corpus (documents, FAQs, product specs, customer records). Split each document into chunks — passages of 200-500 tokens. Use an embedding model to convert each chunk into a vector that represents its meaning. Store the vectors in a search index.
This step runs once when you first ingest your corpus, then incrementally when documents are added or changed. It's the "putting books on the shelf" phase.
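Here's a minimal sketch of the indexing step, assuming a hypothetical `embed_texts()` wrapper around whichever embedding model you use and an in-memory list standing in for a real vector store:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str          # e.g. "refund-policy.md"
    vector: list[float]  # embedding of `text`

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Hypothetical wrapper around your embedding model of choice.
    Returns one vector per input text."""
    raise NotImplementedError

def index_corpus(docs: dict[str, str], max_chars: int = 1500) -> list[Chunk]:
    """Split each document into rough paragraph-based chunks and embed them.
    Real systems count tokens rather than characters and persist to a vector DB."""
    chunks: list[Chunk] = []
    for source, text in docs.items():
        buf = ""
        for para in text.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append(Chunk(buf.strip(), source, vector=[]))
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(Chunk(buf.strip(), source, vector=[]))
    vectors = embed_texts([c.text for c in chunks])
    for chunk, vec in zip(chunks, vectors):
        chunk.vector = vec
    return chunks  # in production: upsert into pgvector, Pinecone, Qdrant, etc.
```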
Step 2 — Retrieve (at query time)#
The user asks a question. Embed the question into the same vector space as your chunks. Find the top K most similar chunks via vector search (or hybrid vector + keyword search — more on this below).
Typical K is 3-10. Too few and you might miss the answer; too many and the model drowns in irrelevant context.
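A sketch of the query-time lookup, reusing the hypothetical `embed_texts()` and `Chunk` from the indexing sketch; plain cosine similarity over an in-memory list stands in for a real vector index:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[Chunk], k: int = 5) -> list[Chunk]:
    """Embed the question into the same space as the chunks and
    return the top-k most similar ones. K of 3-10 is typical."""
    [query_vec] = embed_texts([question])
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return ranked[:k]
```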
Step 3 — Augment (at query time)#
Take the retrieved chunks and inject them into your prompt as context — labeled clearly so the model can cite them. The structure looks like this:
Answer the user's question using ONLY the documents below.
Rules:
- If the answer is not in the documents, say "I don't have that information."
- Cite the source document for every factual claim, like [Doc 2].
- Do not use information outside the provided documents.
Documents:
[Doc 1] (source: refund-policy.md)
{{retrieved_passage_1}}
[Doc 2] (source: pricing.md)
{{retrieved_passage_2}}
[Doc 3] (source: pro-plan-faq.md)
{{retrieved_passage_3}}
Question: {{user_question}}
Answer:

Three rules in the prompt do most of the work: answer only from documents, refuse if missing, cite sources. The citation rule is what turns a black-box answer into one you can audit.
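Assembling that prompt in code is a straightforward formatting step. A sketch, using the `Chunk` objects from the retrieval sketch above and the template shown here:

```python
def build_prompt(question: str, retrieved: list[Chunk]) -> str:
    """Inject retrieved chunks into the RAG prompt, labeled so the model can cite them."""
    doc_blocks = [
        f"[Doc {i}] (source: {chunk.source})\n{chunk.text}"
        for i, chunk in enumerate(retrieved, start=1)
    ]
    return (
        "Answer the user's question using ONLY the documents below.\n\n"
        "Rules:\n"
        '- If the answer is not in the documents, say "I don\'t have that information."\n'
        "- Cite the source document for every factual claim, like [Doc 2].\n"
        "- Do not use information outside the provided documents.\n\n"
        "Documents:\n\n" + "\n\n".join(doc_blocks) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
```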
Step 4 — Generate#
The model produces an answer using the retrieved context. With well-tuned retrieval and a tight prompt, this step rarely fails — most RAG quality issues come from steps 1 and 2 (chunking and retrieval), not from generation.
Chunking: the unsexy decision that decides RAG quality#
If you remember nothing else from this guide, remember this: chunking decides RAG quality more than the prompt does. Spend most of your iteration budget here.
- Sweet spot: 200-500 tokens per chunk for most text-heavy content. Smaller for dense reference docs (200-300); larger for narrative or instructional content (400-600).
- Overlap: 10-20% overlap between consecutive chunks so a fact at a chunk boundary isn't lost. Without overlap, an answer that spans the boundary becomes invisible.
- Respect structure: split on paragraphs, sections, or semantic boundaries — not on character counts. A chunk that ends mid-sentence confuses both retrieval and generation.
- Tag chunks with metadata: source URL, section title, last-modified date, document type. The model can use these to filter and cite; you can use them to update or expire individual chunks without re-indexing.
The 'put a whole document in one chunk' anti-pattern: indexing each document as a single chunk gives you one embedding that averages every topic the document covers, so queries match it only weakly and the model gets pages of mostly irrelevant text for every question.
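A minimal sketch of structure-aware chunking with overlap, assuming a hypothetical `count_tokens()` helper (in practice, your tokenizer of choice): split on paragraph boundaries, aim for a target token budget, and carry a tail of the previous chunk forward as overlap.

```python
def chunk_document(text: str, source: str, target_tokens: int = 400,
                   overlap_tokens: int = 60) -> list[dict]:
    """Split on paragraph boundaries, aiming for ~target_tokens per chunk,
    with ~overlap_tokens carried over from the previous chunk."""
    def count_tokens(s: str) -> int:
        # Hypothetical stand-in; use a real tokenizer in practice.
        return len(s.split())

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        current.append(para)
        if count_tokens("\n\n".join(current)) >= target_tokens:
            body = "\n\n".join(current)
            chunks.append({"text": body, "source": source})
            # Start the next chunk with the tail of this one as overlap.
            tail = body.split()[-overlap_tokens:]
            current = [" ".join(tail)]
    if current and count_tokens("\n\n".join(current)) > overlap_tokens:
        chunks.append({"text": "\n\n".join(current), "source": source})
    return chunks
```

The metadata dict is where source URL, section title, and last-modified date would also go, so individual chunks can be filtered, cited, or expired later.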
Retrieval: hybrid beats pure vector search#
Vector search captures semantic similarity — "how do refunds work" matches a paragraph titled "Money-back guarantees" even though they share no exact words. That's the magic.
But vector search misses exact-match cases: product codes, specific names, error strings. A user searching ERR_CONN_REFUSED wants exactly that string, not semantically-similar errors.
Three retrieval styles to know, plus a reranking pass on top:
Choosing a retrieval style
| Your queries look like… | Use… | Why |
|---|---|---|
| Concept-level questions ("how do refunds work?") | Vector (dense) | Captures semantic meaning across different wordings |
| Exact-match terms (codes, names, error strings) | Keyword (BM25) | Embeddings dilute exact strings; BM25 doesn't |
| A mix (most production apps) | Hybrid (vector + keyword) | Run both, combine rankings — almost always beats either alone |
| After hybrid retrieval | Reranker on top 50 | Cross-encoder reranker scores query+chunk together — catches relevance the embedder missed |
Reranking is the unsung hero: a cross-encoder reranker reads the query and each candidate chunk together, so it catches relevance the embedding model missed. Rerank the top ~50 candidates and keep only the best few before building the prompt.
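One common way to combine the vector and keyword rankings is reciprocal rank fusion (RRF). A sketch, assuming you already have the two result lists as chunk IDs in ranked order:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs. Each chunk scores
    1 / (k + rank) in every list it appears in; higher total wins.
    k=60 is the commonly used default from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse vector and BM25 results, then hand the top ~50 to a reranker.
# fused = reciprocal_rank_fusion([vector_ids, bm25_ids])[:50]
```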
The four classic RAG failure modes#
When a RAG system produces bad output, it's almost always one of these four:
1. Retrieval missed the answer#
Your chunks didn't contain the relevant info, or embedding-space mismatch hid it. Fix: better chunking, hybrid search, query rewriting before retrieval (rewrite vague queries into search-optimized form).
2. The model ignored the documents#
Retrieval worked; the model answered from training memory anyway, sometimes contradicting the retrieved context. Fix: tighter prompt rules ("use ONLY the documents"), prefilling on Claude, lower temperature.
3. Hallucinated citations#
The model cites [Doc 4] when only [Doc 1]-[Doc 3] exist. Fix: Structured Outputs with citation as an enum from valid IDs, or post-process to validate citations against actually-retrieved IDs.
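The post-processing check is a few lines: extract every [Doc N] citation from the answer and verify it against the number of documents you actually retrieved. A sketch:

```python
import re

def invalid_citations(answer: str, num_retrieved: int) -> list[str]:
    """Return citations like '[Doc 4]' that point outside the retrieved set."""
    cited = re.findall(r"\[Doc (\d+)\]", answer)
    return [f"[Doc {n}]" for n in cited if not (1 <= int(n) <= num_retrieved)]

# With 3 retrieved docs, a "[Doc 4]" citation gets flagged:
# invalid_citations("See [Doc 1] and [Doc 4].", num_retrieved=3) -> ["[Doc 4]"]
```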
4. Context window stuffed with junk#
You retrieved 20 passages, most irrelevant. The model drowns in noise and either ignores everything or picks wrong details. Fix: reranking, lower top-K, summarize or filter before injecting.
RAG vs. long context: when each wins#
Long-context models (Gemini at 1M+ tokens, Claude at 200K) can hold huge documents directly. The honest question is when that's simpler than RAG.
RAG vs. long context
| If your situation is… | Reach for… | Why |
|---|---|---|
| Corpus is huge (millions of tokens) | RAG | Long context can't hold it; cost prohibitive even when it can |
| Corpus updates frequently (daily, hourly) | RAG | Re-embedding new chunks beats stuffing fresh content into every prompt |
| Need source attribution | RAG | Citation by chunk ID is structural; long context citations are reconstructed |
| Cost-per-query at scale | RAG | Smaller prompts; embeddings are cheap to query |
| Corpus is static and bounded | Long context | Skip the pipeline; let the model find the answer |
| Single-document Q&A | Long context | No retrieval to mess up; just dump the doc and ask |
| Latency is critical | Depends — measure | RAG adds retrieval latency; long context adds processing latency |
Going further: advanced RAG patterns#
Query rewriting#
Users ask questions that look different from the docs that contain the answer. "How do I get my money back?" matches poorly to a doc titled "Refund Process." Run a small LLM call first that rewrites the user's question into 1-3 search-optimized variants, then retrieve over all of them. Cheap; often 10-15% accuracy gain.
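A sketch of query rewriting with a hypothetical `call_llm()` helper (swap in whichever client you use); retrieve over the original question plus each rewrite and merge the results:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM client of choice."""
    raise NotImplementedError

def rewrite_query(question: str, n_variants: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following user question as {n_variants} short, "
        "search-optimized queries using likely documentation wording. "
        "One query per line, no numbering.\n\n"
        f"Question: {question}"
    )
    variants = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    return [question] + variants[:n_variants]

# "How do I get my money back?" might expand to queries like
# "refund process" or "cancel subscription refund", which match the docs better.
```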
Multi-hop retrieval#
Some questions require chaining lookups: "What was our highest-revenue customer's onboarding flow?" needs (1) find highest-revenue customer, then (2) look up their onboarding. A single retrieval round won't handle this — the model retrieves, reasons, retrieves again. This is where RAG starts looking like an agent.
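A rough sketch of a two-round loop, again using the hypothetical `call_llm()` plus the `retrieve()` and `build_prompt()` sketches from earlier; the model proposes the next lookup before answering:

```python
def multi_hop_answer(question: str, chunks: list[Chunk], hops: int = 2) -> str:
    """Retrieve, let the model propose a follow-up lookup, retrieve again, then answer."""
    context: list[Chunk] = []
    query = question
    for hop in range(hops):
        context.extend(retrieve(query, chunks, k=3))
        if hop == hops - 1:
            break
        gathered = "\n\n".join(c.text for c in context)
        query = call_llm(
            "Given the question and the context gathered so far, reply with one "
            "follow-up search query that would help answer it (query only).\n\n"
            f"Question: {question}\n\nContext so far:\n{gathered}"
        )
    return call_llm(build_prompt(question, context))
```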
Generation-time validators#
After generation, a second prompt verifies that every claim in the answer maps to a retrieved chunk. Flag claims with no supporting source for human review (or strip them automatically). Worth the latency cost on high-stakes outputs. This is the prompt chaining pattern applied to grounding.
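A sketch of such a validator as a second call with the same hypothetical `call_llm()` helper; it asks the model to label each claim as supported or unsupported by the retrieved text:

```python
def validate_grounding(answer: str, retrieved: list[Chunk]) -> str:
    """Second-pass check: which claims in the answer lack support in the chunks?"""
    sources = "\n\n".join(f"[Doc {i}] {c.text}" for i, c in enumerate(retrieved, 1))
    return call_llm(
        "For each factual claim in the ANSWER, state whether it is supported by the "
        "DOCUMENTS. List any unsupported claims; if all are supported, reply 'OK'.\n\n"
        f"DOCUMENTS:\n{sources}\n\nANSWER:\n{answer}"
    )
```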
Evaluation: precision, recall, faithfulness#
Three metrics to track for production RAG:
- Retrieval precision — of retrieved chunks, how many were relevant?
- Retrieval recall — of chunks-that-could-have-helped, how many did we retrieve?
- Faithfulness — does the generated answer correspond to the retrieved chunks, or did the model freelance?
Build a test set of 50-100 question/answer pairs early — see A/B testing prompts for the workflow.
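Once each test question is labeled with the chunk IDs that actually contain its answer, retrieval precision and recall are a few lines. A sketch:

```python
def retrieval_precision_recall(retrieved_ids: list[str],
                               relevant_ids: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved chunks that were relevant.
    Recall: share of relevant chunks that were retrieved."""
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Average these over the 50-100 test questions and track them across every
# chunking or retrieval change, so "better" is a number rather than a feeling.
```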
Common mistakes#
- Skipping evaluation entirely. Without measurement, "the new chunking is better" is a feeling, not a fact. The eval test set is non-negotiable past the first week.
- Same prompt for retrieval and generation. Search queries and user questions look different. Often worth rewriting the user's question into a search-optimized form before retrieval (see advanced patterns above).
- Trusting "say I don't know" instructions. Even with explicit rules, models sometimes invent when context is sparse. Pair the instruction with output validation that flags claims missing citations.
- Forgetting to version your prompt. The RAG generation prompt is mission-critical infrastructure. Use version control so you can roll back regressions.
- Using a single embedding model for everything. Different embedding models suit different content types. Code search benefits from code-tuned embedders; multilingual content needs multilingual models. Don't default-pick.
Quick reference#
The 60-second summary
What it is: retrieve relevant documents → inject into the prompt → generate the answer using only those documents. Open-book exam, not closed-book.
Where quality lives: chunking (200-500 tokens, 10-20% overlap, respect structure) and retrieval (hybrid vector + keyword, plus a reranker on top).
Where it fails: retrieval misses, model ignores docs, hallucinated citations, context drowned in junk.
When to use: huge or fast-changing corpora, citation requirements, cost-sensitive scale. When to skip: static bounded docs that fit in long context.
What to read next#
For a single-prompt approximation when you don't want to build a pipeline, Generated Knowledge prompting. To make a RAG system safe against attackers planting malicious content in your corpus, prompt injection is essential reading. To turn a one-shot RAG into an iterative lookup loop, ReAct generalizes the pattern. For tools that implement retrieval (Pinecone, Weaviate, Qdrant, pgvector), see our tools list. For the foundational paper (Lewis et al., 2020) and follow-up research, see papers.