RAG: ground prompts in real data, kill hallucinations
Retrieval-Augmented Generation (RAG) injects relevant documents into the prompt at query time so the model answers from real data instead of guessing. Learn the core loop, the failure modes, and what to build vs. buy.
A customer asks your support bot: "What's your refund policy for the Pro plan if I cancel mid-cycle?"
Without RAG, the model has two options. (1) Refuse to answer because it doesn't know your policy. (2) Make something up that sounds reasonable. Both are bad. Option 2 is worse — it sounds authoritative and might commit you to a refund policy you don't actually have.
With RAG, the model retrieves your actual policy doc, injects it into the prompt, and answers from real text. Suddenly the bot is reasoning over your business, not guessing from training memory. This is why RAG — Retrieval-Augmented Generation — is one of the highest-impact techniques in production LLM applications, and one of the most widely deployed LLM architectures today.
The whole idea in one line: retrieve the documents relevant to the question, inject them into the prompt, and generate the answer using only those documents.
The mental model: open-book vs. closed-book#
Direct prompting is a closed-book exam. The model answers from whatever it remembers from training. Sometimes brilliant, sometimes confidently wrong, never up-to-date past its training cutoff.
RAG turns it into an open-book exam. The relevant pages are placed on the desk before the question is asked. The model still has to read, reason, and synthesize — but it's reading from your actual source-of-truth, not its memory of similar topics.
Three problems this solves at once:
- Knowledge cutoff. Anything post-training (recent docs, today's data, your private corpus) is invisible to direct prompting. RAG makes it visible.
- Hallucination. When asked something it half-knows, the model fabricates plausible details. With retrieved context, the answer is in front of it — no need to invent.
- Auditability. RAG outputs can cite source documents. Without RAG, "where did that fact come from" has no answer.
The four-step loop#
Every RAG system, no matter how complex, runs the same four-step loop. Understanding this loop is enough to build a working RAG system; the rest is optimization.
Step 1 — Index (offline, runs once)#
Take your corpus (documents, FAQs, product specs, customer records). Split each document into chunks — passages of 200-500 tokens. Use an embedding model to convert each chunk into a vector that represents its meaning. Store the vectors in a search index.
This step runs once when you first ingest your corpus, then incrementally when documents are added or changed. It's the "putting books on the shelf" phase.
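Here's a minimal sketch of the indexing step, assuming a hypothetical `embed_texts()` wrapper around whichever embedding model you use and an in-memory list standing in for a real vector store:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str          # e.g. "refund-policy.md"
    vector: list[float]  # embedding of `text`

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Hypothetical wrapper around your embedding model of choice.
    Returns one vector per input text."""
    raise NotImplementedError

def index_corpus(docs: dict[str, str], max_chars: int = 1500) -> list[Chunk]:
    """Split each document into rough paragraph-based chunks and embed them.
    Real systems count tokens rather than characters and persist to a vector DB."""
    chunks: list[Chunk] = []
    for source, text in docs.items():
        buf = ""
        for para in text.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append(Chunk(buf.strip(), source, vector=[]))
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(Chunk(buf.strip(), source, vector=[]))
    vectors = embed_texts([c.text for c in chunks])
    for chunk, vec in zip(chunks, vectors):
        chunk.vector = vec
    return chunks  # in production: upsert into pgvector, Pinecone, Qdrant, etc.
```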
Step 2 — Retrieve (at query time)#
The user asks a question. Embed the question into the same vector space as your chunks. Find the top K most similar chunks via vector search (or hybrid vector + keyword search — more on this below).
Typical K is 3-10. Too few and you might miss the answer; too many and the model drowns in irrelevant context.
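A sketch of the query-time lookup, reusing the hypothetical `embed_texts()` and `Chunk` from the indexing sketch; plain cosine similarity over an in-memory list stands in for a real vector index:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[Chunk], k: int = 5) -> list[Chunk]:
    """Embed the question into the same space as the chunks and
    return the top-k most similar ones. K of 3-10 is typical."""
    [query_vec] = embed_texts([question])
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return ranked[:k]
```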
Step 3 — Augment (at query time)#
Take the retrieved chunks and inject them into your prompt as context — labeled clearly so the model can cite them. The structure looks like this:
Answer the user's question using ONLY the documents below.
Rules:
- If the answer is not in the documents, say "I don't have that information."
- Cite the source document for every factual claim, like [Doc 2].
- Do not use information outside the provided documents.
Documents:
[Doc 1] (source: refund-policy.md)
{{retrieved_passage_1}}
[Doc 2] (source: pricing.md)
{{retrieved_passage_2}}
[Doc 3] (source: pro-plan-faq.md)
{{retrieved_passage_3}}
Question: {{user_question}}
Answer:

Three rules in the prompt do most of the work: answer only from documents, refuse if missing, cite sources. The citation rule is what turns a black-box answer into one you can audit.
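Assembling that prompt in code is a straightforward formatting step. A sketch, using the `Chunk` objects from the retrieval sketch above and the template shown here:

```python
def build_prompt(question: str, retrieved: list[Chunk]) -> str:
    """Inject retrieved chunks into the RAG prompt, labeled so the model can cite them."""
    doc_blocks = [
        f"[Doc {i}] (source: {chunk.source})\n{chunk.text}"
        for i, chunk in enumerate(retrieved, start=1)
    ]
    return (
        "Answer the user's question using ONLY the documents below.\n\n"
        "Rules:\n"
        '- If the answer is not in the documents, say "I don\'t have that information."\n'
        "- Cite the source document for every factual claim, like [Doc 2].\n"
        "- Do not use information outside the provided documents.\n\n"
        "Documents:\n\n" + "\n\n".join(doc_blocks) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
```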
Step 4 — Generate#
The model produces an answer using the retrieved context. With well-tuned retrieval and a tight prompt, this step rarely fails — most RAG quality issues come from steps 1 and 2 (chunking and retrieval), not from generation.
Chunking: the unsexy decision that decides RAG quality#
If you remember nothing else from this guide, remember this: chunking decides RAG quality more than the prompt does. Spend most of your iteration budget here.
- Sweet spot: 200-500 tokens per chunk for most text-heavy content. Smaller for dense reference docs (200-300); larger for narrative or instructional content (400-600).
- Overlap: 10-20% overlap between consecutive chunks so a fact at a chunk boundary isn't lost. Without overlap, an answer that spans the boundary becomes invisible.
- Respect structure: split on paragraphs, sections, or semantic boundaries — not on character counts. A chunk that ends mid-sentence confuses both retrieval and generation.
- Tag chunks with metadata: source URL, section title, last-modified date, document type. The model can use these to filter and cite; you can use them to update or expire individual chunks without re-indexing.
The 'put a whole document in one chunk' anti-pattern: indexing each document as a single chunk gives you one embedding that averages every topic the document covers, so queries match it only weakly and the model gets pages of mostly irrelevant text for every question.
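A minimal sketch of structure-aware chunking with overlap, assuming a hypothetical `count_tokens()` helper (in practice, your tokenizer of choice): split on paragraph boundaries, aim for a target token budget, and carry a tail of the previous chunk forward as overlap.

```python
def chunk_document(text: str, source: str, target_tokens: int = 400,
                   overlap_tokens: int = 60) -> list[dict]:
    """Split on paragraph boundaries, aiming for ~target_tokens per chunk,
    with ~overlap_tokens carried over from the previous chunk."""
    def count_tokens(s: str) -> int:
        # Hypothetical stand-in; use a real tokenizer in practice.
        return len(s.split())

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        current.append(para)
        if count_tokens("\n\n".join(current)) >= target_tokens:
            body = "\n\n".join(current)
            chunks.append({"text": body, "source": source})
            # Start the next chunk with the tail of this one as overlap.
            tail = body.split()[-overlap_tokens:]
            current = [" ".join(tail)]
    if current and count_tokens("\n\n".join(current)) > overlap_tokens:
        chunks.append({"text": "\n\n".join(current), "source": source})
    return chunks
```

The metadata dict is where source URL, section title, and last-modified date would also go, so individual chunks can be filtered, cited, or expired later.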
Retrieval: hybrid beats pure vector search#
Vector search captures semantic similarity — "how do refunds work" matches a paragraph titled "Money-back guarantees" even though they share no exact words. That's the magic.
But vector search misses exact-match cases: product codes, specific names, error strings. A user searching ERR_CONN_REFUSED wants exactly that string, not semantically-similar errors.
Three retrieval styles to know, plus a reranking pass on top:
Choosing a retrieval style
| Your queries look like… | Use… | Why |
|---|---|---|
| Concept-level questions ("how do refunds work?") | Vector (dense) | Captures semantic meaning across different wordings |
| Exact-match terms (codes, names, error strings) | Keyword (BM25) | Embeddings dilute exact strings; BM25 doesn't |
| A mix (most production apps) | Hybrid (vector + keyword) | Run both, combine rankings — almost always beats either alone |
| After hybrid retrieval | Reranker on top 50 | Cross-encoder reranker scores query+chunk together — catches relevance the embedder missed |
Reranking is the unsung hero: a cross-encoder reranker reads the query and each candidate chunk together, so it catches relevance the embedding model missed. Rerank the top ~50 candidates and keep only the best few before building the prompt.
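One common way to combine the vector and keyword rankings is reciprocal rank fusion (RRF). A sketch, assuming you already have the two result lists as chunk IDs in ranked order:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs. Each chunk scores
    1 / (k + rank) in every list it appears in; higher total wins.
    k=60 is the commonly used default from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse vector and BM25 results, then hand the top ~50 to a reranker.
# fused = reciprocal_rank_fusion([vector_ids, bm25_ids])[:50]
```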
The four classic RAG failure modes#
When a RAG system produces bad output, it's almost always one of these four:
1. Retrieval missed the answer#
Your chunks didn't contain the relevant info, or embedding-space mismatch hid it. Fix: better chunking, hybrid search, query rewriting before retrieval (rewrite vague queries into search-optimized form).
2. The model ignored the documents#
Retrieval worked; the model answered from training memory anyway, sometimes contradicting the retrieved context. Fix: tighter prompt rules ("use ONLY the documents"), prefilling on Claude, lower temperature.
3. Hallucinated citations#
The model cites [Doc 4] when only [Doc 1]-[Doc 3] exist. Fix: Structured Outputs with citation as an enum from valid IDs, or post-process to validate citations against actually-retrieved IDs.
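The post-processing check is a few lines: extract every [Doc N] citation from the answer and verify it against the number of documents you actually retrieved. A sketch:

```python
import re

def invalid_citations(answer: str, num_retrieved: int) -> list[str]:
    """Return citations like '[Doc 4]' that point outside the retrieved set."""
    cited = re.findall(r"\[Doc (\d+)\]", answer)
    return [f"[Doc {n}]" for n in cited if not (1 <= int(n) <= num_retrieved)]

# With 3 retrieved docs, a "[Doc 4]" citation gets flagged:
# invalid_citations("See [Doc 1] and [Doc 4].", num_retrieved=3) -> ["[Doc 4]"]
```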
4. Context window stuffed with junk#
You retrieved 20 passages, most irrelevant. The model drowns in noise and either ignores everything or picks wrong details. Fix: reranking, lower top-K, summarize or filter before injecting.
RAG vs. long context: when each wins#
Long-context models (Gemini at 1M+ tokens, Claude at 200K) can hold huge documents directly. The honest question is when that's simpler than RAG.
RAG vs. long context
| If your situation is… | Reach for… | Why |
|---|---|---|
| Corpus is huge (millions of tokens) | RAG | Long context can't hold it; cost prohibitive even when it can |
| Corpus updates frequently (daily, hourly) | RAG | Re-embedding new chunks beats stuffing fresh content into every prompt |
| Need source attribution | RAG | Citation by chunk ID is structural; long context citations are reconstructed |
| Cost-per-query at scale | RAG | Smaller prompts; embeddings are cheap to query |
| Corpus is static and bounded | Long context | Skip the pipeline; let the model find the answer |
| Single-document Q&A | Long context | No retrieval to mess up; just dump the doc and ask |
| Latency is critical | Depends — measure | RAG adds retrieval latency; long context adds processing latency |
Going further: advanced RAG patterns#
Query rewriting#
Users ask questions that look different from the docs that contain the answer. "How do I get my money back?" matches poorly to a doc titled "Refund Process." Run a small LLM call first that rewrites the user's question into 1-3 search-optimized variants, then retrieve over all of them. Cheap; often 10-15% accuracy gain.
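A sketch of query rewriting with a hypothetical `call_llm()` helper (swap in whichever client you use); retrieve over the original question plus each rewrite and merge the results:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM client of choice."""
    raise NotImplementedError

def rewrite_query(question: str, n_variants: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following user question as {n_variants} short, "
        "search-optimized queries using likely documentation wording. "
        "One query per line, no numbering.\n\n"
        f"Question: {question}"
    )
    variants = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    return [question] + variants[:n_variants]

# "How do I get my money back?" might expand to queries like
# "refund process" or "cancel subscription refund", which match the docs better.
```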
Multi-hop retrieval#
Some questions require chaining lookups: "What was our highest-revenue customer's onboarding flow?" needs (1) find highest-revenue customer, then (2) look up their onboarding. A single retrieval round won't handle this — the model retrieves, reasons, retrieves again. This is where RAG starts looking like an agent.
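A rough sketch of a two-round loop, again using the hypothetical `call_llm()` plus the `retrieve()` and `build_prompt()` sketches from earlier; the model proposes the next lookup before answering:

```python
def multi_hop_answer(question: str, chunks: list[Chunk], hops: int = 2) -> str:
    """Retrieve, let the model propose a follow-up lookup, retrieve again, then answer."""
    context: list[Chunk] = []
    query = question
    for hop in range(hops):
        context.extend(retrieve(query, chunks, k=3))
        if hop == hops - 1:
            break
        gathered = "\n\n".join(c.text for c in context)
        query = call_llm(
            "Given the question and the context gathered so far, reply with one "
            "follow-up search query that would help answer it (query only).\n\n"
            f"Question: {question}\n\nContext so far:\n{gathered}"
        )
    return call_llm(build_prompt(question, context))
```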
Generation-time validators#
After generation, a second prompt verifies that every claim in the answer maps to a retrieved chunk. Flag claims with no supporting source for human review (or strip them automatically). Worth the latency cost on high-stakes outputs. This is the prompt chaining pattern applied to grounding.
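A sketch of such a validator as a second call with the same hypothetical `call_llm()` helper; it asks the model to label each claim as supported or unsupported by the retrieved text:

```python
def validate_grounding(answer: str, retrieved: list[Chunk]) -> str:
    """Second-pass check: which claims in the answer lack support in the chunks?"""
    sources = "\n\n".join(f"[Doc {i}] {c.text}" for i, c in enumerate(retrieved, 1))
    return call_llm(
        "For each factual claim in the ANSWER, state whether it is supported by the "
        "DOCUMENTS. List any unsupported claims; if all are supported, reply 'OK'.\n\n"
        f"DOCUMENTS:\n{sources}\n\nANSWER:\n{answer}"
    )
```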
Evaluation: precision, recall, faithfulness#
Three metrics to track for production RAG:
- Retrieval precision — of retrieved chunks, how many were relevant?
- Retrieval recall — of chunks-that-could-have-helped, how many did we retrieve?
- Faithfulness — does the generated answer correspond to the retrieved chunks, or did the model freelance?
Build a test set of 50-100 question/answer pairs early — see A/B testing prompts for the workflow.
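Once each test question is labeled with the chunk IDs that actually contain its answer, retrieval precision and recall are a few lines. A sketch:

```python
def retrieval_precision_recall(retrieved_ids: list[str],
                               relevant_ids: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved chunks that were relevant.
    Recall: share of relevant chunks that were retrieved."""
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Average these over the 50-100 test questions and track them across every
# chunking or retrieval change, so "better" is a number rather than a feeling.
```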
Common mistakes#
- Skipping evaluation entirely. Without measurement, "the new chunking is better" is a feeling, not a fact. The eval test set is non-negotiable past the first week.
- Same prompt for retrieval and generation. Search queries and user questions look different. Often worth rewriting the user's question into a search-optimized form before retrieval (see advanced patterns above).
- Trusting "say I don't know" instructions. Even with explicit rules, models sometimes invent when context is sparse. Pair the instruction with output validation that flags claims missing citations.
- Forgetting to version your prompt. The RAG generation prompt is mission-critical infrastructure. Use version control so you can roll back regressions.
- Using a single embedding model for everything. Different embedding models suit different content types. Code search benefits from code-tuned embedders; multilingual content needs multilingual models. Don't default-pick.
Quick reference#
The 60-second summary
What it is: retrieve relevant documents → inject into the prompt → generate the answer using only those documents. Open-book exam, not closed-book.
Where quality lives: chunking (200-500 tokens, 10-20% overlap, respect structure) and retrieval (hybrid vector + keyword, plus a reranker on top).
Where it fails: retrieval misses, model ignores docs, hallucinated citations, context drowned in junk.
When to use: huge or fast-changing corpora, citation requirements, cost-sensitive scale. When to skip: static bounded docs that fit in long context.
What to read next#
For a single-prompt approximation when you don't want to build a pipeline, Generated Knowledge prompting. To make a RAG system safe against attackers planting malicious content in your corpus, prompt injection is essential reading. To turn a one-shot RAG into an iterative lookup loop, ReAct generalizes the pattern. For tools that implement retrieval (Pinecone, Weaviate, Qdrant, pgvector), see our tools list. For the foundational paper (Lewis et al., 2020) and follow-up research, see papers.