Automatic Prompt Engineer (APE): let the model optimize its own prompts
APE generates candidate prompts, scores them against an eval set, and keeps the winner. The result: prompts you didn't hand-write that often outperform the ones you did. The research foundation behind production meta-prompting workflows.
You spend an afternoon hand-tuning a prompt. Try a phrasing. Run on 10 inputs. Tweak. Re-run. After three hours you have something acceptable. A teammate spends 20 minutes writing a script that generates 50 candidate prompts, scores each on a 50-input eval set, keeps the winner. Their winner outperforms your hand-tuned version by 12%.
That's Automatic Prompt Engineer (APE). Effectively gradient descent over prompts — the model proposes candidates, an evaluator scores them, the best survives. Modern production "meta-prompting" tools are applied APE under the hood. Knowing the pattern lets you build your own when no tool fits, and lets you understand what the tools are doing when one does.
The whole idea in one line
Generate many candidate prompts, score each against an eval set, keep the winner — search over prompt-space instead of hand-tweaking a single draft.
The mental model: search over prompt-space#
Hand-tuning a prompt is local search — you wiggle in the neighborhood of your current draft. APE is broader search: generate many candidates that vary substantially, then evaluate them objectively to find the winner. It's how you escape local optima you can't see from where you're standing.
The catch: search is only as good as the evaluation. A noisy or biased evaluator produces a winner that's well-tuned for the noise, not for production reality. The eval set is the entire game.
The APE loop#
- Define the task. A one-paragraph description of what the prompt needs to do.
- Build an eval set. 20-100 input/expected-output pairs. Mix easy and adversarial cases.
- Generate candidate prompts. Use an LLM to propose 10-50 variations. Candidates should differ in structure, phrasing, and examples — not just wording.
- Score each candidate. Run on the eval set, score outputs against expected. Sum scores.
- Keep the winner. Promote the top scorer. Optionally iterate: generate variations of the winner and rerun.
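A minimal sketch of that loop in Python. Everything named here is an assumption, not a fixed API: `call_llm` stands in for whatever model client you use, the inline generator prompt is a compressed version of the one in the next section, and scoring is exact match against expected outputs.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your model client (OpenAI, Anthropic, local, ...)."""
    raise NotImplementedError

def score_exact(output: str, expected: str) -> float:
    """Exact-match scoring: 1 if the output matches the expected answer, else 0."""
    return float(output.strip() == expected.strip())

def ape_round(task: str, eval_set: list[tuple[str, str]], n_candidates: int = 20) -> str:
    """One APE round: generate candidates, score each on the eval set, keep the winner."""
    gen_prompt = (
        f"Generate {n_candidates} different candidate prompts for the task below, "
        "varying structure, framing, and examples. Output a JSON array of strings.\n"
        f'Task: """{task}"""\nJSON:'
    )
    candidates = json.loads(call_llm(gen_prompt))

    def total_score(candidate: str) -> float:
        # Run the candidate prompt on every eval input and sum the per-item scores.
        return sum(
            score_exact(call_llm(f"{candidate}\n\nInput: {inp}"), expected)
            for inp, expected in eval_set
        )

    return max(candidates, key=total_score)
```

Swap `score_exact` for whatever scorer fits your output type — the scoring section below covers the options.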
The candidate-generation prompt#
APE's generator step is the most important prompt to get right. It's meta-prompting at scale.
I need to optimize a prompt for the task below. Generate
{{n}} different candidate prompts. Each should approach the
task differently — different structure, different framing,
different examples. Don't just rephrase one prompt {{n}} times.
Task:
"""
{{task_description}}
"""
Examples of inputs the prompt will receive:
"""
{{example_inputs}}
"""
Output as a JSON array. Each element is a complete prompt that
could be run as-is on the inputs above.
JSON:
Two non-obvious things this prompt does: it explicitly demands diversity (otherwise models produce variations of the same idea), and it shows the inputs the candidates will operate on (so candidates are written with the actual data shape in mind).
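One practical wrinkle worth a few lines of code: models sometimes wrap the JSON array in prose or code fences, so parse the generator's output defensively. A sketch — the greedy regex grab is a heuristic, not a robust parser:

```python
import json
import re

def parse_candidates(raw: str) -> list[str]:
    """Extract the JSON array even if the model wraps it in prose or code fences."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)  # outermost [...] span
    if match is None:
        raise ValueError("no JSON array found in generator output")
    candidates = json.loads(match.group(0))
    return [c for c in candidates if isinstance(c, str) and c.strip()]
```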
The scoring step#
The scoring approach depends on what the output looks like:
- Structured outputs (JSON, classifications): exact match against expected. Cheap and reliable.
- Free-text outputs (writing, summaries): LLM-as-judge. A separate prompt grades each output 1-5 against a rubric.
- Code outputs: run the code, check tests pass. Like PAL in reverse — the code is what you're evaluating, not the result.
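As rough sketches, here's what the three modes can look like as scoring functions — `call_llm` is the same hypothetical stand-in as in the loop sketch, and the pytest call assumes your tests live in a file you can point at:

```python
import json
import subprocess

def call_llm(prompt: str) -> str:
    """Same hypothetical model-client stand-in as in the loop sketch."""
    raise NotImplementedError

def score_structured(output: str, expected: str) -> float:
    """Structured outputs: compare parsed JSON so key order and whitespace don't matter."""
    try:
        return float(json.loads(output) == json.loads(expected))
    except json.JSONDecodeError:
        return 0.0

def score_free_text(inp: str, output: str, rubric: str) -> float:
    """Free text: a separate LLM call grades the output 1-5 against a rubric."""
    judge = (
        f"Rubric:\n{rubric}\n\nInput:\n{inp}\n\nOutput:\n{output}\n\n"
        "Grade the output 1-5 against the rubric. Reply with the number only."
    )
    return float(call_llm(judge))  # assumes the judge obeys "number only"

def score_code(code: str, test_file: str) -> float:
    """Code: write the candidate's output to disk and check the tests pass."""
    with open("candidate.py", "w") as f:
        f.write(code)
    result = subprocess.run(["pytest", test_file], capture_output=True)
    return float(result.returncode == 0)
```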
The eval set IS the product
The winner APE hands you is only ever as good as the eval set it was scored against. Representative inputs, correct expected outputs, a mix of easy and adversarial cases — that's where the engineering effort belongs, because a flawed eval set just produces a prompt tuned to its flaws.
When APE earns its keep#
APE vs. alternatives
| If your situation is… | Reach for… | Why |
|---|---|---|
| Production prompt running thousands of times | APE | 5-10% accuracy lift compounds across all those runs |
| Objective scoring possible (exact match, schema check) | APE | Scoring is fast and cheap — search becomes practical |
| Hand-tuning has plateaued | APE | Local search exhausted; broader search finds new optima |
| Subjective tasks (brand voice, creative writing) | Skip — APE struggles here | Without objective scoring, APE produces well-scoring slop |
| One-off prompts you'll run twice | Skip | Eval-set construction cost > the lift on rare-use prompts |
| Tight token budget at scale | APE for the prompt, then deploy | Search is expensive; the deployed prompt isn't. Pay once, save forever. |
Going further: production patterns#
Iterative APE#
Run APE once, get a winner. Use that winner as the seed for a second round — generate variations of it, score, keep the new winner. Each round narrows in further. Diminishing returns past 3-4 rounds; budget accordingly.
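A sketch of the iteration, reusing the hypothetical `ape_round` from the loop sketch above — each round folds the previous winner into the generator's task description:

```python
def iterative_ape(task: str, eval_set: list[tuple[str, str]], rounds: int = 3) -> str:
    """Seed each round with the previous round's winner."""
    seed_note = ""
    winner = ""
    for _ in range(rounds):
        winner = ape_round(task + seed_note, eval_set)
        seed_note = f'\n\nA strong existing prompt to improve on:\n"""{winner}"""'
    return winner
```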
Evolutionary APE (genetic operators)#
Beyond simple iteration: take the top K candidates and produce next-generation candidates by "crossover" — combining elements of two parents. More exploration than pure iteration. Useful when the best ideas come from combining different prompts' strengths.
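The crossover step can itself be a second generator prompt. A sketch of the idea — this template isn't from the original paper:

Here are two prompts that both score well on the same task.
Generate {{n}} new candidate prompts that combine their
strengths — take the structure from one, the framing or
examples from the other. Output as a JSON array of complete
prompts.

Prompt A:
"""
{{parent_a}}
"""

Prompt B:
"""
{{parent_b}}
"""

JSON: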
LLM-as-judge for subjective scoring#
For free-text tasks where exact match doesn't apply, use a separate LLM call as the scorer. The judge prompt takes (input, candidate output, rubric) and returns a score. Calibrate the judge against human ratings on a small sample first to make sure scores correlate with what you actually care about.
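A judge prompt in the same shape as the generator template — a sketch; adapt the rubric and scale to your task:

Grade the output below for the task, from 1 (unusable) to 5
(excellent), against the rubric. Reply with the number only.

Task:
"""
{{task_description}}
"""

Rubric:
"""
{{rubric}}
"""

Input:
"""
{{input}}
"""

Output to grade:
"""
{{candidate_output}}
"""

Score: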
When to use a tool instead of building APE yourself#
Production tools that implement APE workflows: Promptfoo, Braintrust, LangSmith's eval features, Humanloop. See tools. Build APE from scratch when your eval logic is unusual or proprietary; use a tool when standard eval (exact match, LLM-as-judge) is enough.
Common mistakes#
- Skipping the eval set. APE without one is just generating prompts and picking your favorite. No real signal.
- Generating candidates that all say the same thing. Diversity is what makes search work. If your candidates are 10 rephrasings of the same idea, the winner is barely better than any of them.
- Overfitting to the eval set. The winner is great on the 50 inputs you tested; in production it might fail on unseen patterns. Hold out a portion of the eval as a true test set and never train (or APE) on it — see the split sketch after this list.
- Using on subjective tasks. APE rewards well-scoring outputs. On subjective tasks (brand voice, creative writing), the score is biased and the winner is the prompt that best games the judge — not the best prompt for users.
- Running once and shipping. Production data drifts. The APE winner from Q1 may not be the winner from Q4. Re-run periodically or whenever the underlying input distribution shifts.
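The holdout split from the overfitting bullet is cheap to implement; the only discipline is never letting APE's scoring loop touch the test portion. A minimal sketch:

```python
import random

def split_eval(pairs: list[tuple[str, str]], holdout: float = 0.2, seed: int = 0):
    """Shuffle once, then carve off a test set that APE never scores against."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout))
    return shuffled[:cut], shuffled[cut:]  # (ape_eval_set, held_out_test_set)
```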
Quick reference#
The 60-second summary
What it is: generate prompt candidates → score against an eval set → keep the winner. Search over prompt-space.
Why it works: hand-tuning is local search; APE is broader search. Often finds optima you wouldn't reach by tweaking.
When it shines: high-volume production prompts, objective scoring possible, hand-tuning has plateaued.
The gating factor: eval set quality. APE without a real eval is noise generation.
The non-negotiables: diverse candidates (not 10 rephrasings), held-out test set (don't overfit to APE's eval), re-run as production drifts.
What to read next#
APE is the research foundation for modern production meta-prompting workflows — meta-prompting covers the applied side. For the eval methodology that makes APE work, A/B testing prompts. For example-selection problems APE pairs naturally with, Active-Prompt. Original Zhou et al. (2022) paper in our papers list.