LLM benchmarks and datasets you should know
A curated list of the benchmarks and datasets used to evaluate LLMs and prompts — MMLU, HumanEval, GSM8K, TruthfulQA, BIG-Bench, and more — with what each measures and when to use it.
Every model release post says "state of the art on X" for some benchmark X. Most readers nod along without asking which benchmark, what it measures, or whether being best at it would matter for their actual work.
This page closes that gap. The benchmarks LLM teams cite, what each actually measures, and when you should care. Plus the punchline most engineers learn the hard way: public benchmarks tell you how a model compares to others. They don't tell you whether the model works for your task. For that, you need a test set built from your own data.
Public benchmarks vs. your own
General capability#
Broad benchmarks that summarize a model's overall capability. Useful as a single number for tracking progress across releases. They are mostly scored as multiple-choice accuracy; a minimal scoring sketch follows the list.
- MMLU (Massive Multitask Language Understanding)
What it measures: General knowledge across 57 academic subjects via multiple-choice questions, from basic math to professional medicine.
When to use: When you want a single number that summarizes a model's breadth. The most-cited benchmark in model release notes.
- BIG-Bench Hard (BBH)
What it measures: A subset of BIG-Bench tasks that current models still struggle with. Useful for distinguishing frontier models from each other.
When to use: When MMLU has saturated and you need a harder discriminator between top models.
- HellaSwag
What it measures: Commonsense reasoning: pick the most plausible continuation of a short everyday scene, drawn from video captions and how-to text.
When to use: For commonsense and grounded-reasoning tasks. Older benchmark; mostly saturated by frontier models.
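A minimal sketch of that multiple-choice scoring loop. `ask_model` is a placeholder for whatever client you use, and the `question`/`choices`/`answer` field names are assumptions about how you have loaded the data, not part of any official harness:

```python
LETTERS = "ABCD"

def format_item(question: str, choices: list[str]) -> str:
    # Render one multiple-choice item as a prompt.
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def multiple_choice_accuracy(items: list[dict], ask_model) -> float:
    correct = 0
    for item in items:
        reply = ask_model(format_item(item["question"], item["choices"]))
        # Take the first A-D the model emits as its prediction.
        predicted = next((ch for ch in reply.upper() if ch in LETTERS), None)
        correct += predicted == item["answer"]
    return correct / len(items)
```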
Reasoning and math#
Benchmarks for the reasoning techniques covered in the Chain-of-Thought and self-consistency guides.
- GSM8K
What it measures: Grade-school math word problems requiring multi-step reasoning. The classic Chain-of-Thought benchmark.
When to use: When evaluating reasoning ability or testing whether your CoT prompt works as intended. A sketch of the usual exact-match scoring follows this list.
- MATH
What it measures: Competition-level math problems (AMC, AIME). Much harder than GSM8K — frontier models are still well under 100%.
When to use: For stress-testing reasoning models. Cleanly separates o1/o3-style reasoning models from standard instruction-tuned ones.
- ARC (AI2 Reasoning Challenge)
What it measures: Grade-school science questions requiring reasoning. Two splits: easy and challenge.
When to use: When evaluating science-flavored reasoning. The challenge split is still meaningful for non-frontier models.
- DROP
What it measures: Discrete reasoning over paragraphs — questions that require reading a paragraph and computing a numeric or extractive answer.
When to use: For reading-comprehension-with-reasoning tasks. Tests whether the model can combine extraction with arithmetic.
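Exact-match scoring for GSM8K-style sets works like this: let the model reason, extract the last number it wrote, and compare it to the gold answer (GSM8K reference solutions end with a "#### <number>" line). A rough sketch, with `ask_model` standing in for your client and the `question`/`answer` field names assumed:

```python
import re

def gold_answer(solution: str) -> str:
    # GSM8K reference solutions end with a line like "#### 72".
    return solution.split("####")[-1].strip().replace(",", "")

def last_number(text: str) -> str | None:
    # Take the last number in the model's chain-of-thought output.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match_accuracy(items: list[dict], ask_model) -> float:
    hits = 0
    for item in items:
        response = ask_model(item["question"] + "\nThink step by step, then give the final number.")
        prediction = last_number(response)
        hits += prediction is not None and float(prediction) == float(gold_answer(item["answer"]))
    return hits / len(items)
```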
Factuality and hallucinations#
Benchmarks specifically for measuring how often models produce false information. See the hallucinations guide for context.
- TruthfulQA
What it measures: Questions where humans frequently hold misconceptions. Tests whether the model repeats common falsehoods or pushes back.
When to use: When evaluating hallucination resistance; the standard test set for misconception-style falsehoods.
- HaluEval
What it measures: Hallucination detection across QA, dialogue, and summarization tasks. Both reference-based and reference-free splits.
When to use: For systematic hallucination evaluation when you have ground-truth references. A reference-based judge sketch follows this list.
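For reference-based checks, one common setup is an LLM judge: show it the reference, the question, and the model's answer, and ask whether the answer makes unsupported claims. A minimal sketch; the prompt wording, the `ask_judge` placeholder, and the field names are assumptions rather than HaluEval's official protocol:

```python
JUDGE_PROMPT = """Reference:
{reference}

Question: {question}
Answer: {answer}

Does the answer make any claim that the reference does not support?
Reply with exactly one word: YES or NO."""

def hallucination_rate(items: list[dict], ask_judge) -> float:
    # Fraction of answers the judge flags as unsupported by the reference.
    flagged = 0
    for item in items:
        verdict = ask_judge(JUDGE_PROMPT.format(**item)).strip().upper()
        flagged += verdict.startswith("YES")
    return flagged / len(items)
```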
Code generation#
Programming benchmarks. HumanEval is the default; SWE-bench is where modern coding agents are evaluated.
- HumanEval
What it measures: 164 Python programming problems with hidden test cases. The default code-generation benchmark.
When to use: When evaluating code-generation models or prompts. Heavily Python-biased; complement with multilingual benchmarks if needed. Results are typically reported as pass@k; a sketch of the estimator follows this list.
- MBPP (Mostly Basic Python Problems)
What it measures: Around 1,000 entry-level Python programming problems with test cases.
When to use: For broader code-generation eval than HumanEval. Easier on average but covers more breadth.
- SWE-bench
What it measures: Real GitHub issues from 12 popular Python repositories. Models must produce a patch that passes the project's tests.
When to use: For evaluating coding agents on realistic, multi-file engineering tasks. The hardest credible coding benchmark today.
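pass@k means: sample n completions per problem, run the hidden tests, and estimate the probability that at least one of k samples passes. The estimator below is the standard unbiased one from the HumanEval paper; generating the samples and sandboxing test execution are left to your own plumbing:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.
    n = samples generated per problem, c = samples that passed the tests,
    k = the budget you want to report."""
    if n - c < k:
        # Enough passing samples that any draw of k must contain one.
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 20 samples, 3 passed the tests -> pass@1 is roughly 0.15.
print(round(pass_at_k(n=20, c=3, k=1), 2))
```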
Retrieval and RAG#
For evaluating retrievers, rerankers, and end-to-end RAG pipelines. See the RAG guide for how these components fit together.
- MS MARCO
What it measures: Real Bing search queries paired with relevant passages. The standard for evaluating retrievers and rerankers.
When to use: When training or evaluating embedding models, hybrid retrievers, or rerankers for RAG. The headline metric is MRR@10; a sketch follows this list.
- Natural Questions (NQ)
What it measures: Real Google search queries paired with Wikipedia answers. Tests open-domain question answering with retrieval.
When to use: For end-to-end RAG evaluation when Wikipedia is a reasonable proxy for your corpus.
- BEIR
What it measures: A heterogeneous benchmark of 18 retrieval tasks across diverse domains and query types.
When to use: When you want to know how a retriever generalizes beyond its training distribution. Critical for production RAG systems.
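Retrieval results are usually reported as recall@k plus MRR@10 (MS MARCO) or nDCG@10 (BEIR). The first two fit in a few lines; `ranked_ids` is whatever your retriever returns for a query and `relevant_ids` the labelled positives for it, both assumed shapes rather than any specific library's API:

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Fraction of the labelled relevant documents that appear in the top k.
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    # Reciprocal rank of the first relevant hit within the top k, else 0.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaging mrr_at_k over all queries gives the MRR@10 number that
# MS MARCO passage-ranking leaderboards report.
```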
Alignment and instruction-tuning#
Datasets for training models to follow instructions and align with human preferences. Mostly research-side — included for completeness.
- HH-RLHF (Anthropic Helpful and Harmless)
What it contains: Pairs of model responses with human preference labels for helpfulness and harmlessness. Foundational dataset for RLHF research.
When to use: When training reward models or fine-tuning for safety-helpfulness trade-offs. A sketch of the pairwise reward objective follows this list.
- Alpaca / Dolly / OpenAssistant
What it contains: Instruction/response pairs for instruction-tuning. Dolly and OpenAssistant are open-license alternatives to Alpaca.
When to use: When fine-tuning a base model into an instruction-follower without commercial licensing constraints.
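Preference data like HH-RLHF comes as chosen/rejected pairs, and reward models are usually trained with a pairwise Bradley-Terry style objective: the chosen response should score higher than the rejected one. A minimal sketch; the `score` callable is a placeholder for your reward model, not part of the dataset:

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    # Bradley-Terry style objective used for RLHF reward models:
    # -log(sigmoid(r_chosen - r_rejected)) = log(1 + exp(-margin)).
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

def mean_pairwise_loss(rows: list[dict], score) -> float:
    # `rows` are HH-RLHF-style records with "chosen" and "rejected" texts;
    # `score` is a placeholder reward model returning a scalar per text.
    losses = [pairwise_reward_loss(score(r["chosen"]), score(r["rejected"])) for r in rows]
    return sum(losses) / len(losses)
```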
Why you still need your own eval set#
Public benchmarks measure what matters to the average user of a general-purpose model. Your application has specific tasks, edge cases, brand voice, regulatory constraints, and quality bars that no public benchmark covers. The general guidance:
- Start small. 50 hand-curated examples beat 5,000 noisy ones.
- Mix easy, average, and adversarial cases. Easy cases catch regressions; adversarial cases catch failures.
- Keep eval data separate from training data. The instinct to fix every failure case by adding it to a few-shot prompt corrupts your eval set. Hold out test cases.
- Re-run on every prompt change. See A/B testing prompts for the workflow; a minimal harness sketch follows below.
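That loop can stay small. The sketch below assumes a JSONL test set with `inputs` and `expected` fields plus placeholder `ask_model` and `grade` callables; none of the names come from a particular framework. Run it on every prompt version and compare scores:

```python
import json

def load_cases(path: str) -> list[dict]:
    # One JSON object per line: {"inputs": {...}, "expected": ...}
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def run_eval(prompt_template: str, cases: list[dict], ask_model, grade) -> float:
    # Score one prompt version over the held-out set. `grade` compares the
    # model output to the expected value (exact match, regex, rubric, or an
    # LLM judge, whatever fits the task) and returns truthy on a pass.
    passed = 0
    for case in cases:
        output = ask_model(prompt_template.format(**case["inputs"]))
        passed += bool(grade(output, case["expected"]))
    return passed / len(cases)

# Typical usage: keep the score per prompt version and watch for regressions.
# cases = load_cases("evals/holdout.jsonl")
# print(run_eval(PROMPT_V2, cases, ask_model, grade))
```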
Quick reference#
The 60-second summary
Six categories: general capability, reasoning, factuality, coding, RAG/retrieval, alignment.
The most-cited: MMLU (general), GSM8K and MATH (reasoning), HumanEval and SWE-bench (coding), TruthfulQA (hallucination), MS MARCO and BEIR (retrieval).
The truth: public benchmarks tell you how a model compares to others, not whether it works for your task. Always have your own eval set.
Building yours: 50 hand-curated examples, mix of easy and adversarial, hold-out discipline, re-run on every change.
What to read next#
For tools that run evaluations against these datasets (and your own), see tools. For the papers that introduced many of these benchmarks, see papers. For the eval workflow itself, see A/B testing prompts.
Put this guide to work
Save your prompts, version every change, and share them with your team — free for up to 200 prompts.