LLM settings: temperature, top-p, and the knobs that actually matter
Model parameters control how creative, deterministic, or constrained your outputs are. Learn what temperature, top-p, max tokens, and stop sequences actually do — and which to tune for which task.
You ship a feature in production. A user submits a prompt, gets a great answer. Submits again with the exact same input — gets a wildly different answer. Submits a third time, gets one with a typo and a hallucinated fact. Same prompt. Same model. Same everything you can see. What changed?
Probably one knob: temperature. The model didn't regress; it just sampled differently. Most teams ship LLM features without knowing what their temperature is set to, what top-p does, or whether their max-tokens cap is the reason 1% of outputs cut off mid-sentence. This guide is the operating manual: the five parameters that matter, what each does, and the right defaults for production tasks.
The whole idea in one line: settings don't make the model smarter or change what it knows; they only change how it picks the next token.
The mental model: knobs, not magic#
A language model produces text by sampling. At each step, it has a probability distribution over possible next tokens — "the" might be 8% likely, "a" 6%, "is" 4%, and so on across a vocabulary of ~50,000 to ~200,000 tokens.
LLM settings are knobs that shape how the model picks from that distribution. They don't make the model smarter. They don't change what it knows. They just control which tokens get selected — more deterministic vs. more creative, tighter vocabulary vs. wider, shorter vs. longer.
Five knobs to know:
- Temperature — how flat or peaked the distribution is.
- Top-p (nucleus sampling) — how much of the distribution's mass to consider.
- Max tokens — hard cap on output length.
- Stop sequences — strings that terminate output when generated.
- Frequency / presence penalty — discourage repetition.
Temperature#
Temperature scales the probability distribution before sampling. Low temperatures sharpen the distribution toward the most-probable token; high temperatures flatten it so unlikely tokens get chosen too.
- Temperature 0 — fully deterministic (greedy decoding). Same input, same output, every time.
- Temperature 0.1–0.3 — mostly deterministic with occasional variance. Production default for structured tasks.
- Temperature 0.7 — balanced creativity. Default for chat / general use.
- Temperature 1.0+ — high creativity. Useful for brainstorming; risky for factual content (more hallucination).
- Temperature 1.5+ — chaos. Mostly useful for testing how the model fails.
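To make that concrete, here's a minimal sketch of the scaling math, with made-up logits over a toy four-token vocabulary: logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution and high ones flatten it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits for ["the", "a", "is", "cat"] (values invented).
logits = [2.0, 1.5, 1.0, 0.2]

# t=0 is special-cased as greedy decoding in real APIs (it would divide by zero here).
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# At t=0.2 almost all the mass piles onto "the" (near-greedy);
# at t=1.5 the distribution flattens and tail tokens become live options.
```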
Top-p (nucleus sampling)#
Top-p caps the cumulative probability mass the model samples from. Top-p of 0.9 means: ignore the long tail of unlikely tokens and sample only from the smallest set of top tokens whose probabilities sum to at least 0.9.
- Top-p 1.0 — consider every token. The default.
- Top-p 0.9 — drop the long tail holding the last 10% of probability mass. Often a good middle ground.
- Top-p 0.5 — only sample from the most probable half. Conservative, repetitive.
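Here's a minimal sketch of nucleus sampling itself, over a toy probability list (real implementations work on full vocabulary distributions): rank tokens by probability, keep the smallest top set that reaches top_p, renormalize, and sample only from that set.

```python
import random

def nucleus_sample(probs, top_p, rng=random):
    """Sample a token index from the smallest set of top tokens
    whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for idx, p in ranked:
        nucleus.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break                          # tokens past this point never get sampled
    total = sum(p for _, p in nucleus)     # renormalize within the nucleus
    return rng.choices(
        [idx for idx, _ in nucleus],
        weights=[p / total for _, p in nucleus],
    )[0]

# Probabilities for ["the", "a", "is", "zqx"]; with top_p=0.9 the
# nucleus is the first three tokens, so "zqx" can never be sampled.
print(nucleus_sample([0.55, 0.25, 0.15, 0.05], top_p=0.9))
```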
Set temperature OR top-p — never both. Both shape randomness; when both are non-default, you can't attribute any effect to either one. Pick one to tune and leave the other at its default.
Max tokens#
The hard cap on output length, measured in tokens (~0.75 words each in English). When the model hits this limit, generation stops mid-sentence if necessary.
Three reasons to always set this in production:
- Cost control. Without a cap, a runaway response can hit the context-window maximum and bill accordingly.
- Latency control. Longer outputs take longer to generate. A cap prevents the "why is this taking 30 seconds?" tail.
- Predictability. Knowing your output is at most N tokens lets you size downstream UI and processing.
Set this to roughly 2× your expected output length in tokens. For a prompt that should produce ~100 words (about 133 tokens), set max-tokens to ~250.
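The arithmetic behind that rule of thumb, as a tiny helper (the 0.75 words-per-token ratio is the English-language approximation from above; actual ratios vary by tokenizer and language):

```python
def max_tokens_for(expected_words: int, words_per_token: float = 0.75,
                   headroom: float = 2.0) -> int:
    """Estimate a max_tokens cap: expected token count times a safety factor."""
    expected_tokens = expected_words / words_per_token  # ~133 tokens for 100 words
    return int(expected_tokens * headroom)

print(max_tokens_for(100))  # 266 -> a cap of ~250-300 is reasonable
```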
Stop sequences#
Strings that, when generated, immediately stop output. Useful for enforcing output boundaries deterministically.
Common patterns:
- End-of-message marker. Set the stop sequence to `###` if your prompt expects output ending with `###`.
- Multi-turn delimiter. In few-shot prompts that show alternating user/assistant turns, set stop to `User:` so the model doesn't hallucinate the next user turn (see the sketch below).
- JSON close. For structured outputs, stop after the first `}` at the top level (though Structured Outputs APIs handle this natively now).
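As a concrete example of the multi-turn pattern, here's a sketch using the OpenAI Python SDK; the model name and prompt are illustrative, and most provider APIs expose the same `stop` parameter under a similar name.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A few-shot transcript packed into one prompt. Without a stop sequence,
# the model often continues the pattern and invents the next "User:" turn.
few_shot = (
    "User: Translate 'bonjour' to English.\n"
    "Assistant: hello\n"
    "User: Translate 'merci' to English.\n"
    "Assistant:"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": few_shot}],
    stop=["User:"],   # generation halts the moment this string appears
    max_tokens=50,
)
print(response.choices[0].message.content)  # "thank you", with no extra turns
```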
Frequency and presence penalty#
Two related parameters that discourage repetition:
- Frequency penalty — penalizes tokens proportional to how often they've already appeared. Reduces repetition of specific words.
- Presence penalty — flat penalty for any token that's appeared at all. Encourages exploring new topics.
Both range -2.0 to 2.0. Default 0. Useful when you notice the model getting stuck in loops or rehashing the same words. Most production prompts leave these at 0; the rare task that needs them tends to need careful tuning.
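Mechanically, the two penalties are simple logit adjustments. This sketch follows the formula OpenAI documents; the counts, logits, and penalty values are invented for illustration, and other providers may implement repetition penalties differently.

```python
def apply_penalties(logits, counts, frequency_penalty=0.0, presence_penalty=0.0):
    """Adjust next-token logits based on what has already been generated.

    counts[i] = how many times token i appears in the output so far.
    Frequency penalty scales with the count; presence penalty is a
    one-time hit for any token that has appeared at all.
    """
    return [
        logit
        - counts[i] * frequency_penalty                    # grows with each repeat
        - (1 if counts[i] > 0 else 0) * presence_penalty   # flat, applied once
        for i, logit in enumerate(logits)
    ]

# Token 0 already generated 3x, token 1 once, token 2 never.
print(apply_penalties([2.0, 2.0, 2.0], [3, 1, 0],
                      frequency_penalty=0.5, presence_penalty=0.4))
# [0.1, 1.1, 2.0] -> repeated tokens get progressively less likely
```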
Putting the five knobs together, a typical production request for a structured task looks like this:

```json
{
  "model": "gpt-4o",
  "messages": [...],
  "temperature": 0.1,
  "top_p": 1.0,
  "max_tokens": 500,
  "stop": ["###"],
  "frequency_penalty": 0,
  "presence_penalty": 0
}
```

Picking settings by task#
Settings by task type
| If your situation is… | Reach for… | Why |
|---|---|---|
| Classification, extraction, structured output | temp=0 to 0.2 | Determinism prevents drift; same input → same output |
| Customer-facing replies, brand-voice writing | temp=0.3 to 0.5 | Some variance feels natural; too much breaks consistency |
| General chat, Q&A | temp=0.7 | The default — balanced creativity and reliability |
| Brainstorming, creative writing, ideation | temp=0.9 to 1.2 | You want diversity; mediocre ideas are the cost of finding novel ones |
| Self-consistency (sample N, take majority) | temp=0.7+ | Need diverse reasoning paths; temp=0 produces N identical samples |
| Reasoning models (o1, Claude thinking, Gemini thinking) | Default | These models manage temperature internally; let them |
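The self-consistency row is worth a sketch because it inverts the usual production advice: you deliberately sample at a higher temperature. This one assumes the OpenAI SDK; the function name and the majority vote by exact string match are simplifications, since real implementations usually extract and normalize a final answer before voting.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(prompt: str, n: int = 5) -> str:
    """Sample n answers at temperature 0.7 and return the majority vote.
    At temperature 0 this would collapse into n identical samples."""
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,   # diverse reasoning paths are the point
            max_tokens=300,
        )
        answers.append(response.choices[0].message.content.strip())
    return Counter(answers).most_common(1)[0][0]
```

Some APIs, including OpenAI's chat completions, also accept an `n` parameter to draw several samples in one request, which is cheaper than looping.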
Going further: production patterns#
Different settings per step in a chain#
In a prompt chain, each step has different settings needs. The classifier runs at temperature 0 (deterministic routing), the generator at 0.5 (some variety), the validator back at 0 (consistent scoring). Don't use one global setting; tune per step.
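A minimal sketch of that pattern, again assuming the OpenAI SDK (the step names and values are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Each step in the chain carries its own settings instead of one global default.
CHAIN_SETTINGS = {
    "classify": {"temperature": 0.0, "max_tokens": 10},   # deterministic routing
    "generate": {"temperature": 0.5, "max_tokens": 500},  # some variety
    "validate": {"temperature": 0.0, "max_tokens": 50},   # consistent scoring
}

def run_step(step: str, messages: list[dict]) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        **CHAIN_SETTINGS[step],   # settings travel with the step, not globally
    )
    return response.choices[0].message.content
```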
Determinism in production isn't actually free#
Even at temperature 0, modern models have small non-determinism from batch processing on the serving side. If true determinism matters (regulated industries, audit trails), you need additional safeguards: pinned model versions, hash-based output verification, and accepting that vendor-side updates may shift behavior anyway. Don't assume temp=0 means bit-identical forever.
Run an A/B test before production rollouts#
Settings that work in dev sometimes fail at production traffic shapes. Before promoting a change (especially temperature), run an A/B test across your real eval set. Cheap insurance.
Common mistakes#
- Setting temperature AND top-p. They both control the same thing. Tuning becomes a guessing game when both are non-default.
- No max-tokens cap in production. Cost and latency surprises waiting to happen.
- Using high temperature for factual tasks. Higher temperature = higher hallucination rate. See hallucinations.
- Using temperature 0 for self-consistency prompts. The whole point of self-consistency is diverse reasoning paths. Temp=0 produces N identical samples.
- Tuning settings before fixing the prompt. Settings tuning is the LAST 5% of quality. A bad prompt with perfect settings still produces bad output.
Quick reference#
The 60-second summary
Five knobs: temperature (randomness), top-p (vocabulary breadth), max-tokens (length cap), stop sequences (boundary enforcement), frequency/presence penalty (anti-repetition).
The cardinal rule: set temperature OR top-p, never both.
Production defaults: temp 0–0.2 for structured tasks, 0.7 for general chat, 0.9+ for creative work.
Always set max-tokens in production. Cost and latency depend on it.
Reasoning models manage these internally — leave them at default.
What to read next#
Now that you can tune the knobs, learn how to build the prompt itself — Anatomy of a prompt. For the universal habits that lift output quality regardless of settings, general prompting tips. For per-model defaults that actually differ, prompting ChatGPT, Claude, and Gemini.
Put this guide to work
Save your prompts, version every change, and share them with your team — free for up to 200 prompts.