LLM settings: temperature, top-p, and the knobs that actually matter
Model parameters control how creative, deterministic, or constrained your outputs are. Learn what temperature, top-p, max tokens, and stop sequences actually do — and which to tune for which task.
You ship a feature in production. A user submits a prompt, gets a great answer. Submits again with the exact same input — gets a wildly different answer. Submits a third time, gets one with a typo and a hallucinated fact. Same prompt. Same model. Same everything you can see. What changed?
Probably one knob: temperature. The model didn't regress; it just sampled differently. Most teams ship LLM features without knowing what their temperature is set to, what top-p does, or whether their max-tokens cap is the reason 1% of outputs cut off mid-sentence. This guide is the operating manual: the five parameters that matter, what each does, and the right defaults for production tasks.
The whole idea in one line: settings don't make the model smarter or change what it knows; they only change how it picks the next token.
The mental model: knobs, not magic#
A language model produces text by sampling. At each step, it has a probability distribution over possible next tokens — "the" might be 8% likely, "a" 6%, "is" 4%, and so on across a vocabulary of ~50,000 to ~200,000 tokens.
LLM settings are knobs that shape how the model picks from that distribution. They don't make the model smarter. They don't change what it knows. They just control which tokens get selected — more deterministic vs. more creative, tighter vocabulary vs. wider, shorter vs. longer.
Five knobs to know:
- Temperature — how flat or peaked the distribution is.
- Top-p (nucleus sampling) — how much of the distribution's mass to consider.
- Max tokens — hard cap on output length.
- Stop sequences — strings that terminate output when generated.
- Frequency / presence penalty — discourage repetition.
Temperature#
Temperature scales the probability distribution before sampling. Low temperatures sharpen the distribution toward the most-probable token; high temperatures flatten it so unlikely tokens get chosen too.
- Temperature 0 — fully deterministic (greedy decoding). Same input, same output, every time.
- Temperature 0.1–0.3 — mostly deterministic with occasional variance. Production default for structured tasks.
- Temperature 0.7 — balanced creativity. Default for chat / general use.
- Temperature 1.0+ — high creativity. Useful for brainstorming; risky for factual content (more hallucination).
- Temperature 1.5+ — chaos. Mostly useful for testing how the model fails.
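To make that concrete, here's a minimal sketch of the scaling math, with made-up logits over a toy four-token vocabulary: logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution and high ones flatten it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits for ["the", "a", "is", "cat"] (values invented).
logits = [2.0, 1.5, 1.0, 0.2]

# t=0 is special-cased as greedy decoding in real APIs (it would divide by zero here).
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# At t=0.2 almost all the mass piles onto "the" (near-greedy);
# at t=1.5 the distribution flattens and tail tokens become live options.
```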
Top-p (nucleus sampling)#
Top-p caps the cumulative probability mass the model samples from. Top-p of 0.9 means: ignore the long tail of unlikely tokens and sample only from the smallest set of top tokens whose probabilities sum to at least 0.9.
- Top-p 1.0 — consider every token. The default.
- Top-p 0.9 — drop the long tail holding the last 10% of probability mass. Often a good middle ground.
- Top-p 0.5 — only sample from the most probable half. Conservative, repetitive.
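Here's a minimal sketch of nucleus sampling itself, over a toy probability list (real implementations work on full vocabulary distributions): rank tokens by probability, keep the smallest top set that reaches top_p, renormalize, and sample only from that set.

```python
import random

def nucleus_sample(probs, top_p, rng=random):
    """Sample a token index from the smallest set of top tokens
    whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for idx, p in ranked:
        nucleus.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break                          # tokens past this point never get sampled
    total = sum(p for _, p in nucleus)     # renormalize within the nucleus
    return rng.choices(
        [idx for idx, _ in nucleus],
        weights=[p / total for _, p in nucleus],
    )[0]

# Probabilities for ["the", "a", "is", "zqx"]; with top_p=0.9 the
# nucleus is the first three tokens, so "zqx" can never be sampled.
print(nucleus_sample([0.55, 0.25, 0.15, 0.05], top_p=0.9))
```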
Set temperature OR top-p — never both. Both shape randomness; when both are non-default, you can't attribute any effect to either one. Pick one to tune and leave the other at its default.
Max tokens#
The hard cap on output length, measured in tokens (~0.75 words each in English). When the model hits this limit, generation stops mid-sentence if necessary.
Three reasons to always set this in production:
- Cost control. Without a cap, a runaway response can hit the context-window maximum and bill accordingly.
- Latency control. Longer outputs take longer to generate. A cap prevents the "why is this taking 30 seconds?" tail.
- Predictability. Knowing your output is at most N tokens lets you size downstream UI and processing.
Set this to roughly 2× your expected output length in tokens. For a prompt that should produce ~100 words (about 133 tokens), set max-tokens to ~250.
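The arithmetic behind that rule of thumb, as a tiny helper (the 0.75 words-per-token ratio is the English-language approximation from above; actual ratios vary by tokenizer and language):

```python
def max_tokens_for(expected_words: int, words_per_token: float = 0.75,
                   headroom: float = 2.0) -> int:
    """Estimate a max_tokens cap: expected token count times a safety factor."""
    expected_tokens = expected_words / words_per_token  # ~133 tokens for 100 words
    return int(expected_tokens * headroom)

print(max_tokens_for(100))  # 266 -> a cap of ~250-300 is reasonable
```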
Stop sequences#
Strings that, when generated, immediately stop output. Useful for enforcing output boundaries deterministically.
Common patterns:
- End-of-message marker. Set the stop sequence to `###` if your prompt expects output ending with `###`.
- Multi-turn delimiter. In few-shot prompts that show alternating user/assistant turns, set stop to `User:` so the model doesn't hallucinate the next user turn (see the sketch below).
- JSON close. For structured outputs, stop after the first `}` at the top level (though Structured Outputs APIs handle this natively now).
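As a concrete example of the multi-turn pattern, here's a sketch using the OpenAI Python SDK; the model name and prompt are illustrative, and most provider APIs expose the same `stop` parameter under a similar name.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A few-shot transcript packed into one prompt. Without a stop sequence,
# the model often continues the pattern and invents the next "User:" turn.
few_shot = (
    "User: Translate 'bonjour' to English.\n"
    "Assistant: hello\n"
    "User: Translate 'merci' to English.\n"
    "Assistant:"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": few_shot}],
    stop=["User:"],   # generation halts the moment this string appears
    max_tokens=50,
)
print(response.choices[0].message.content)  # "thank you", with no extra turns
```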
Frequency and presence penalty#
Two related parameters that discourage repetition:
- Frequency penalty — penalizes tokens proportional to how often they've already appeared. Reduces repetition of specific words.
- Presence penalty — flat penalty for any token that's appeared at all. Encourages exploring new topics.
Both range -2.0 to 2.0. Default 0. Useful when you notice the model getting stuck in loops or rehashing the same words. Most production prompts leave these at 0; the rare task that needs them tends to need careful tuning.
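Mechanically, the two penalties are simple logit adjustments. This sketch follows the formula OpenAI documents; the counts, logits, and penalty values are invented for illustration, and other providers may implement repetition penalties differently.

```python
def apply_penalties(logits, counts, frequency_penalty=0.0, presence_penalty=0.0):
    """Adjust next-token logits based on what has already been generated.

    counts[i] = how many times token i appears in the output so far.
    Frequency penalty scales with the count; presence penalty is a
    one-time hit for any token that has appeared at all.
    """
    return [
        logit
        - counts[i] * frequency_penalty                    # grows with each repeat
        - (1 if counts[i] > 0 else 0) * presence_penalty   # flat, applied once
        for i, logit in enumerate(logits)
    ]

# Token 0 already generated 3x, token 1 once, token 2 never.
print(apply_penalties([2.0, 2.0, 2.0], [3, 1, 0],
                      frequency_penalty=0.5, presence_penalty=0.4))
# [0.1, 1.1, 2.0] -> repeated tokens get progressively less likely
```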
Putting the five knobs together, a typical production request for a structured task looks like this:

```json
{
  "model": "gpt-4o",
  "messages": [...],
  "temperature": 0.1,
  "top_p": 1.0,
  "max_tokens": 500,
  "stop": ["###"],
  "frequency_penalty": 0,
  "presence_penalty": 0
}
```

Picking settings by task#
Settings by task type
| If your situation is… | Reach for… | Why |
|---|---|---|
| Classification, extraction, structured output | temp=0 to 0.2 | Determinism prevents drift; same input → same output |
| Customer-facing replies, brand-voice writing | temp=0.3 to 0.5 | Some variance feels natural; too much breaks consistency |
| General chat, Q&A | temp=0.7 | The default — balanced creativity and reliability |
| Brainstorming, creative writing, ideation | temp=0.9 to 1.2 | You want diversity; mediocre ideas are the cost of finding novel ones |
| Self-consistency (sample N, take majority) | temp=0.7+ | Need diverse reasoning paths; temp=0 produces N identical samples |
| Reasoning models (o1, Claude thinking, Gemini thinking) | Default | These models manage temperature internally; let them |
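The self-consistency row is worth a sketch because it inverts the usual production advice: you deliberately sample at a higher temperature. This one assumes the OpenAI SDK; the function name and the majority vote by exact string match are simplifications, since real implementations usually extract and normalize a final answer before voting.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(prompt: str, n: int = 5) -> str:
    """Sample n answers at temperature 0.7 and return the majority vote.
    At temperature 0 this would collapse into n identical samples."""
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,   # diverse reasoning paths are the point
            max_tokens=300,
        )
        answers.append(response.choices[0].message.content.strip())
    return Counter(answers).most_common(1)[0][0]
```

Some APIs, including OpenAI's chat completions, also accept an `n` parameter to draw several samples in one request, which is cheaper than looping.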
Going further: production patterns#
Different settings per step in a chain#
In a prompt chain, each step has different settings needs. The classifier runs at temperature 0 (deterministic routing), the generator at 0.5 (some variety), the validator back at 0 (consistent scoring). Don't use one global setting; tune per step.
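A minimal sketch of that pattern, again assuming the OpenAI SDK (the step names and values are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Each step in the chain carries its own settings instead of one global default.
CHAIN_SETTINGS = {
    "classify": {"temperature": 0.0, "max_tokens": 10},   # deterministic routing
    "generate": {"temperature": 0.5, "max_tokens": 500},  # some variety
    "validate": {"temperature": 0.0, "max_tokens": 50},   # consistent scoring
}

def run_step(step: str, messages: list[dict]) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        **CHAIN_SETTINGS[step],   # settings travel with the step, not globally
    )
    return response.choices[0].message.content
```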
Determinism in production isn't actually free#
Even at temperature 0, modern models have small non-determinism from batch processing on the serving side. If true determinism matters (regulated industries, audit trails), you need additional safeguards: pinned model versions, hash-based output verification, and accepting that vendor-side updates may shift behavior anyway. Don't assume temp=0 means bit-identical forever.
Run an A/B test before production rollouts#
Settings that work in dev sometimes fail at production traffic shapes. Before promoting a change (especially temperature), run an A/B test across your real eval set. Cheap insurance.
Common mistakes#
- Setting temperature AND top-p. They both control the same thing. Tuning becomes a guessing game when both are non-default.
- No max-tokens cap in production. Cost and latency surprises waiting to happen.
- Using high temperature for factual tasks. Higher temperature = higher hallucination rate. See hallucinations.
- Using temperature 0 for self-consistency prompts. The whole point of self-consistency is diverse reasoning paths. Temp=0 produces N identical samples.
- Tuning settings before fixing the prompt. Settings tuning is the LAST 5% of quality. A bad prompt with perfect settings still produces bad output.
Quick reference#
The 60-second summary
Five knobs: temperature (randomness), top-p (vocabulary breadth), max-tokens (length cap), stop sequences (boundary enforcement), frequency/presence penalty (anti-repetition).
The cardinal rule: set temperature OR top-p, never both.
Production defaults: temp 0–0.2 for structured tasks, 0.7 for general chat, 0.9+ for creative work.
Always set max-tokens in production. Cost and latency depend on it.
Reasoning models manage these internally — leave them at default.
What to read next#
Now that you can tune the knobs, learn how to build the prompt itself — Anatomy of a prompt. For the universal habits that lift output quality regardless of settings, general prompting tips. For per-model defaults that actually differ, prompting ChatGPT, Claude, and Gemini.
Put this guide to work
Save your prompts, version every change, and share them with your team — free for up to 200 prompts.