Prompt chaining: split big tasks into reliable steps
One huge prompt fails in unpredictable ways. Three small chained prompts succeed reliably. Learn the patterns: linear chains, fan-out/fan-in, and validators that catch errors mid-chain.
You wrote a beautiful prompt. It does five things: extract entities from a customer email, classify their intent, generate a personalized reply, check the reply for compliance violations, and translate it into the customer's language. It works 70% of the time. The other 30% of the time it fails in a new and creative way, and you can never tell which of the five things went wrong.
Now you split it into five smaller prompts. Each does one job. Each succeeds 95% of the time. End-to-end, your pipeline now succeeds 77% of the time — and when it fails, you know exactly which step failed because every step's output is inspectable.
Better still: each step is reusable. The entity extractor from this pipeline becomes the entity extractor for three other workflows. Suddenly you're not writing prompts; you're building a library of composable building blocks. That's prompt chaining — and once you start, you stop writing monoliths forever.
The whole idea in one line
Split one big prompt into a pipeline of small prompts: each does one job, passes a structured output to the next, and can be validated, retried, and versioned on its own.
The mental model: from monolith to pipeline#
Software engineering learned this lesson decades ago. A 500-line function that does "all the things" is harder to maintain, harder to test, and more likely to break than five 100-line functions that compose. Same logic applies to prompts.
The intuition behind why chains beat monoliths comes down to two compounding effects:
- Independent failure modes. A single prompt with five instructions makes a five-way bet. Each instruction is a chance to fail. Compounding 95%-reliable steps beats a single 70%-reliable mega-prompt.
- Visible intermediate state. When a chain fails, you can inspect each step's output and identify the failure. When a monolith fails, the failure is somewhere in a black box.
The reliability math, made concrete#
Three approaches to the same task, compared:
| Approach | Per-step reliability | End-to-end reliability |
|---|---|---|
| One mega-prompt doing 5 things | unknown (failures compound silently) | ~70% |
| 5-step chain | 95% | 0.95⁵ ≈ 77% |
| 5-step chain with retries on each step | 99% (after retry) | 0.99⁵ ≈ 95% |
The third row is where chains really earn their keep. With independent steps, you can wrap each in a retry-on-validation-failure loop. With a monolith, you can't — you have no signal that anything went wrong until the final output looks weird.
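A minimal sketch of that retry loop, assuming a hypothetical `call_llm(prompt)` helper and a per-step `validate(output)` function that returns `(ok, error)`:

```python
def run_step(prompt: str, call_llm, validate, max_retries: int = 2) -> str:
    """Run one chain step, retrying until its output passes validation."""
    last_error = None
    for _ in range(max_retries + 1):
        output = call_llm(prompt)
        ok, error = validate(output)
        if ok:
            return output
        # Feed the validation error back so the retry can correct it.
        last_error = error
        prompt = f"{prompt}\n\nYour previous answer failed validation: {error}\nFix this and answer again."
    raise RuntimeError(f"Step still failing after {max_retries + 1} attempts: {last_error}")
```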
The basic linear chain#
Simplest pattern: step 1 → step 2 → step 3. The output of one prompt becomes the input of the next. Example: an email auto-responder that needs to (1) classify intent, (2) draft a reply, (3) translate it.
Step 1 (classify intent):

```
Classify the customer email below. Output exactly one of:
billing, technical, sales, general
Email: """{{email}}"""
Classification:
```

Step 2 (draft a reply):

```
Draft a customer support reply for a {{classification}} email.
Tone: warm, specific, under 80 words. Acknowledge the customer's
issue in the first sentence.
Email: """{{email}}"""
Reply:
```

Step 3 (translate):

```
Translate the reply below into {{language}}. Preserve the warm,
specific tone. Do not add or remove information.
Reply: """{{reply}}"""
Translated:
```

Each prompt is short and focused. If translation breaks, you fix step 3 in isolation; the rest of the chain is unaffected. If the classifier miscategorizes, you can swap it out without touching the drafter.
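Wiring the three prompts into a chain is ordinary function composition. A minimal sketch, assuming a hypothetical `call_llm(prompt)` helper that returns the model's text; the templates mirror the prompts above:

```python
CLASSIFY_PROMPT = (
    "Classify the customer email below. Output exactly one of:\n"
    "billing, technical, sales, general\n"
    'Email: """{email}"""\n'
    "Classification:"
)

DRAFT_PROMPT = (
    "Draft a customer support reply for a {classification} email.\n"
    "Tone: warm, specific, under 80 words. Acknowledge the customer's\n"
    "issue in the first sentence.\n"
    'Email: """{email}"""\n'
    "Reply:"
)

TRANSLATE_PROMPT = (
    "Translate the reply below into {language}. Preserve the warm,\n"
    "specific tone. Do not add or remove information.\n"
    'Reply: """{reply}"""\n'
    "Translated:"
)


def auto_respond(email: str, language: str, call_llm) -> str:
    """Three-step linear chain: classify -> draft -> translate."""
    classification = call_llm(CLASSIFY_PROMPT.format(email=email)).strip().lower()
    reply = call_llm(DRAFT_PROMPT.format(classification=classification, email=email))
    return call_llm(TRANSLATE_PROMPT.format(language=language, reply=reply))
```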
Three chaining patterns to know#
Linear: A → B → C#
Each step depends on the one before. The simplest pattern and the right default. Use this for sequential processing where order matters: extract → analyze → synthesize.
Fan-out / fan-in: A → (B, C, D in parallel) → E#
Step A produces a structured output. B, C, and D each process a different facet of A's output in parallel. Step E synthesizes their results.
Example: extract topics from a long article (A), then generate a quote, a summary, and an X thread for each topic in parallel (B, C, D), then assemble the final content package (E).
Latency win: B, C, D run concurrently. Cost win: if you don't need all three branches for some inputs, you can skip them.
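A minimal concurrency sketch with `asyncio`, assuming a hypothetical async `call_llm(prompt)` helper; the three branch prompts run concurrently and a final call synthesizes:

```python
import asyncio


async def content_package(article: str, call_llm) -> str:
    """Fan-out / fan-in: A -> (quote, summary, thread in parallel) -> assembled package."""
    # Step A: extract topics as a structured (JSON) intermediate.
    topics = await call_llm(f"List the 3 main topics of this article as a JSON array.\nArticle:\n{article}")

    # Fan-out: each branch needs only the topics plus the article, so all three run concurrently.
    quote, summary, thread = await asyncio.gather(
        call_llm(f"Pick one striking quote related to these topics: {topics}\nArticle:\n{article}"),
        call_llm(f"Summarize the article in 3 sentences, focused on: {topics}\nArticle:\n{article}"),
        call_llm(f"Write a 5-post thread covering: {topics}\nArticle:\n{article}"),
    )

    # Fan-in: step E assembles the final package from the three branch outputs.
    return await call_llm(
        "Assemble a content package from these pieces.\n"
        f"QUOTE:\n{quote}\n\nSUMMARY:\n{summary}\n\nTHREAD:\n{thread}"
    )
```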
Generator → validator (with retry)#
Step A generates a candidate answer. Step B is a separate prompt that checks the candidate against a rubric and either approves or returns a structured critique. If rejected, you re-run A with the critique appended to the input.
This pattern catches errors the original generator wouldn't catch on its own — the validator has a fresh context and a focused job.
Self-validation has limits
A model asked to check its own answer inside the same context tends to rubber-stamp it. Keep the validator as a separate prompt with a fresh context and a narrow rubric; that separation is what makes the pattern work.
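A minimal generator-validator loop, assuming a hypothetical `call_llm(prompt)` helper and a validator prompt that replies either `APPROVED` or a short critique:

```python
def generate_with_validation(task: str, call_llm, max_rounds: int = 3) -> str:
    """Generator -> validator -> retry-with-critique loop."""
    critique = ""
    candidate = ""
    for _ in range(max_rounds):
        # Generator: produce a candidate, folding in any critique from the previous round.
        prompt = task if not critique else f"{task}\n\nReviewer feedback to address:\n{critique}"
        candidate = call_llm(prompt)

        # Validator: a separate prompt with a fresh context and one focused job.
        verdict = call_llm(
            "Review the answer against the task. If it fully satisfies the task, reply with "
            "exactly APPROVED. Otherwise reply with a short, specific critique.\n"
            f"Task: {task}\n\nAnswer: {candidate}"
        )
        if verdict.strip() == "APPROVED":
            return candidate
        critique = verdict
    return candidate  # best effort after max_rounds
```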
How to pass data cleanly between steps#
The biggest source of chain bugs is messy hand-offs between steps. Three rules keep them clean (a minimal hand-off sketch follows the list):
- Use structured outputs at every boundary. JSON between steps, not free text. Parsing prose is where chains break.
- Validate the structure before passing it on. If step 1 returns invalid JSON, don't hand it to step 2 — retry, fall back, or fail fast.
- Don't pass the entire upstream output. Each downstream step needs only what it needs. Pruning intermediate outputs reduces token cost and prevents irrelevant context from confusing later steps.
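Putting the three rules together; the required field names below are hypothetical, stand-ins for whatever the next step actually consumes:

```python
import json

REQUIRED_KEYS = {"classification", "customer_name"}  # hypothetical: only what step 2 needs


def handoff(raw_output: str) -> dict:
    """Validate and prune one step's output before passing it downstream."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        # Don't hand bad data to the next step: retry, fall back, or fail fast here.
        raise ValueError(f"Step returned invalid JSON: {exc}")

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Step output missing required fields: {missing}")

    # Pass only the fields the next step needs, not the whole upstream output.
    return {key: data[key] for key in REQUIRED_KEYS}
```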
When to split, when to keep monolithic#
Chain vs. single prompt vs. agent
| If your situation is… | Reach for… | Why |
|---|---|---|
| Multiple distinct phases (extract, then check, then write) | Chain | Each phase has its own success criteria |
| Different parts need different models | Chain | Cheap classification on Haiku, expensive generation on Opus |
| Need to validate / retry intermediate outputs | Chain | Validators only work when intermediate outputs exist |
| Steps need external tools or retrieval | Chain (with ReAct sub-loops) | Tool-using steps fit inside chain boundaries cleanly |
| Single coherent task (summarize, classify, translate) | Single prompt | Splitting adds latency without a quality gain |
| Unknown number of steps; depends on input | Agent (ReAct) | Chains are static; agents are dynamic |
| Latency-sensitive interactive use | Single prompt or fan-out | Sequential chains cost roughly N× the latency of a single call |
Going further: production-grade chain patterns#
Fallback chains#
Try the cheap-fast chain first. If it fails validation, fall through to the expensive-careful chain. Most production traffic hits the cheap path; only edge cases pay the expensive cost. Example: GPT-4o-mini classifier first, GPT-4o full analysis only on ambiguous cases.
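A minimal sketch of that fallback, assuming hypothetical `call_cheap` and `call_expensive` helpers bound to the two models:

```python
VALID_LABELS = {"billing", "technical", "sales", "general"}


def classify_with_fallback(email: str, call_cheap, call_expensive) -> str:
    """Cheap-fast path first; escalate to the expensive-careful path only when validation fails."""
    prompt = (
        "Classify the customer email below. Output exactly one of:\n"
        "billing, technical, sales, general\n"
        f'Email: """{email}"""\nClassification:'
    )
    label = call_cheap(prompt).strip().lower()
    if label in VALID_LABELS:
        return label  # most production traffic stops here

    # Ambiguous or malformed output: pay for the careful model.
    return call_expensive(prompt).strip().lower()
```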
Strict-typed boundaries#
Use Structured Outputs (OpenAI), tool use (Anthropic), or function calling (Gemini) at every chain boundary. The schema is enforced by the model provider, so step 2 receives well-typed input. Eliminates an entire class of parsing-related failures. Worth it for production chains.
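Provider-side schema enforcement APIs differ, so here is a provider-agnostic sketch of the same idea using Pydantic (v2 assumed): the model class doubles as the schema you would register with the provider and as a local check at the boundary:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class Classification(BaseModel):
    label: Literal["billing", "technical", "sales", "general"]
    confidence: float


def parse_boundary(raw_output: str) -> Classification:
    """Enforce the schema at a chain boundary; reject anything malformed or mistyped."""
    try:
        return Classification.model_validate_json(raw_output)
    except ValidationError as exc:
        # Malformed output never reaches step 2: retry or fall back upstream instead.
        raise ValueError(f"Boundary schema violation: {exc}")
```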
Observability and tracing#
Log every step's input, output, latency, token usage, and any retries. When chains misbehave in production, you need to see which step regressed and when. Tools like LangSmith and Helicone (see tools) handle this with minimal code.
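Dedicated tools do this better, but even a hand-rolled wrapper covers the basics. A minimal sketch (token counts would come from the provider's response object and are omitted here):

```python
import json
import logging
import time

logger = logging.getLogger("chain")


def traced_step(name: str, prompt: str, call_llm) -> str:
    """Run one chain step and log its input, output, latency, and any failure."""
    record = {"step": name, "prompt": prompt[:500]}  # truncate to keep logs readable
    start = time.perf_counter()
    try:
        output = call_llm(prompt)
        record["output"] = output[:500]
        return output
    except Exception as exc:
        record["error"] = str(exc)
        raise
    finally:
        record["latency_s"] = round(time.perf_counter() - start, 3)
        logger.info(json.dumps(record))
```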
Per-step versioning#
Each step in your chain is its own prompt with its own version history. When you improve the classifier, the drafter and translator stay frozen. When you swap models, you swap one step at a time. See version control for prompts for the workflow.
Common mistakes#
- Over-decomposition. A 7-step chain where 3 would do is harder to maintain and slower. Aim for the minimum number of steps that each do one job reliably.
- No validation between steps. Garbage in, garbage out — and you won't notice until the chain's final output looks wrong. Validate intermediate outputs the way you'd validate API responses.
- Versioning the chain as one unit instead of per-step. Each step is its own prompt with its own history. Treat them independently.
- Skipping observability. Log inputs and outputs at every step. When chains misbehave, you will need the trace.
- Passing the full upstream output downstream. Wasteful and confusing. Project to only the fields the next step needs.
Quick reference#
The 60-second summary
What it is: split a complex task into multiple sequential prompts where each one's output feeds the next.
Three patterns: linear (A → B → C), fan-out (A → B, C, D in parallel → E), generator-validator (A → B reviews A, retries on failure).
The reliability math: compounding 95%-reliable steps with retries beats one 70%-reliable mega-prompt.
The discipline: structured outputs at every boundary, validation before handoff, observability from day one, per-step versioning.
What to read next#
Chains are the substrate of every real LLM application. For chains that need to interact with the outside world (search, code execution, APIs), see ReAct. To pick which prompt in your chain is actually the best version, A/B testing prompts. And for team-scale chain maintenance, build a team prompt library.