Prompt engineering papers worth reading
A curated reading list of the most important prompt engineering and LLM papers — Chain-of-Thought, ReAct, RAG, InstructGPT, and more — with one-paragraph summaries.
Most prompt engineering content is downstream of about 15 papers. Read those papers and the rest of the field clicks into place — you can see which ideas are foundational, which are refinements, and which are dead ends people still cite. Skip them and every blog post feels like a fresh discovery.
This is the reading list. Curated, not exhaustive. Each paper has a one-paragraph summary that tells you whether to read it, and a link to the source. Cross-references tie each paper to the corresponding guide on this site.
Where to start (read these three first)
If you only have time for three, read Wei et al. on Chain-of-Thought, Yao et al. on ReAct, and Lewis et al. on RAG. Together they cover most of the theory behind production LLM work; all three are summarized in the sections below.
How to read papers without losing your day#
ML papers can feel intimidating. They don't have to be, if you read them strategically:
- Read the abstract. Tells you what the paper claims.
- Look at the figures. Most papers compress the core idea into one diagram and one results table.
- Read the introduction and conclusion. Skip the methods section unless you need to implement something.
- Skim related work. Tells you what other papers to read next.
This gets you the core idea in 15 minutes. Deep-read the methods only if you're actually going to build something based on the paper.
Foundations#
The papers that explain what modern LLMs are and how they were trained.
- Attention Is All You Need, Vaswani et al., 2017
The Transformer architecture paper. Introduced self-attention as a replacement for recurrence — every modern LLM is a descendant of this design.
- Language Models are Few-Shot Learners (GPT-3), Brown et al., 2020
The GPT-3 paper. Demonstrated that scale alone enables in-context learning — the foundation that makes few-shot prompting work.
- Training language models to follow instructions with human feedback (InstructGPT), Ouyang et al., 2022
How RLHF turns base models into instruction-followers. The method behind ChatGPT, Claude, and every modern instruction-tuned model.
- Constitutional AI: Harmlessness from AI Feedback, Bai et al., 2022
Anthropic's alternative to RLHF. Train models to follow a written constitution instead of fitting to individual human ratings.
Prompting techniques#
Papers introducing the core prompting techniques. Each ties to a corresponding guide on this site: Chain-of-Thought, self-consistency, Tree of Thoughts, ReAct.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al., 2022
The original Chain-of-Thought paper. Showed that asking the model to reason step by step dramatically improves performance on math and logic.
- Large Language Models are Zero-Shot Reasoners, Kojima et al., 2022
The "Let's think step by step" paper. Demonstrated that a single phrase can unlock CoT-quality reasoning in zero-shot settings.
- Self-Consistency Improves Chain of Thought Reasoning, Wang et al., 2022
Run CoT multiple times at temperature > 0 and take the majority answer. Significant accuracy gains for the cost of extra samples.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al., 2023
Generalizes CoT into a search tree. The model branches on difficult decisions, evaluates partial solutions, and backtracks from dead ends.
- ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al., 2022
The Thought-Action-Observation loop. Foundation for almost every modern LLM agent — interleave reasoning steps with tool calls.
- Reflexion: Language Agents with Verbal Reinforcement Learning, Shinn et al., 2023
Agents that reflect on past failures and self-correct. A practical pattern for iteratively improving agent performance without weight updates.
- Automatic Prompt Engineer (APE), Zhou et al., 2022
Use the model to generate, score, and select prompts automatically. The research foundation behind meta-prompting workflows.
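Some of these techniques reduce to a few lines of orchestration code. Wang et al.'s self-consistency, for example, is just sampling plus a majority vote. A minimal Python sketch, where `sample_answer` is a hypothetical stand-in for an LLM call made at temperature > 0 that returns the parsed final answer of one CoT completion:

```python
from collections import Counter

def self_consistent_answer(sample_answer, prompt, n=5):
    """Sample n chain-of-thought completions; return (majority answer, vote share)."""
    answers = [sample_answer(prompt) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n

# Toy stand-in whose samples occasionally disagree, as a
# temperature > 0 model's would.
samples = iter(["42", "42", "41", "42", "42"])
answer, share = self_consistent_answer(lambda _: next(samples), "Q: ...")
# answer == "42", share == 0.8
```

The vote share doubles as a crude confidence signal: a low share means the model's reasoning paths disagree, which is often worth surfacing.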
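The ReAct loop is similarly compact. A sketch of the Thought-Action-Observation cycle, assuming a hypothetical `llm_step` callable that, given the transcript so far, returns either a final answer or a tool call:

```python
def react_loop(llm_step, tools, question, max_steps=8):
    """Run a Thought-Action-Observation loop until the model emits an answer.

    llm_step(transcript) -> ("final", answer) or ("act", tool_name, tool_input)
    tools: mapping of tool name -> callable(tool_input) -> observation string
    """
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm_step("\n".join(transcript))
        if step[0] == "final":
            return step[1]
        _, name, arg = step
        observation = tools[name](arg)  # execute the chosen tool
        transcript.append(f"Action: {name}[{arg}]")
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("no answer within max_steps")

# Scripted stand-in: look something up once, then answer from the observation.
def scripted_llm(transcript):
    if "Observation:" not in transcript:
        return ("act", "search", "capital of France")
    return ("final", "Paris")

result = react_loop(scripted_llm, {"search": lambda q: "Paris"}, "Capital of France?")
# result == "Paris"
```

The `max_steps` cap matters in practice: without it, an agent that never emits a final answer loops forever.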
RAG and grounding#
Retrieval, long context, and the limits of stuffing everything into the prompt. See also the RAG guide.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Lewis et al., 2020
The original RAG paper. Combine a parametric LLM with a non-parametric retrieval index over an external corpus.
- Lost in the Middle: How Language Models Use Long Contexts, Liu et al., 2023
The empirical case for placing critical information at the start or end of long contexts. Quality drops noticeably for content in the middle.
- In-Context Retrieval-Augmented Language Models, Ram et al., 2023
Practical patterns for using retrieved passages effectively in the prompt — what to include, where to put it, when to truncate.
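To make the retrieval patterns concrete, here is a toy sketch that ranks passages by word overlap (a hypothetical stand-in for real embedding similarity) and orders the context edge-first, following Lost in the Middle's finding that models attend best to the start and end of a long prompt:

```python
def overlap(question, passage):
    """Crude relevance score: count of shared lowercase words.
    A real system would use embedding similarity instead."""
    return len(set(question.lower().split()) & set(passage.lower().split()))

def build_rag_prompt(question, passages, top_k=3):
    ranked = sorted(passages, key=lambda p: overlap(question, p), reverse=True)[:top_k]
    # Edge-first ordering: strongest passage at the start, second-strongest
    # at the end, weaker ones buried in the middle.
    ordered = ranked[::2] + ranked[1::2][::-1]
    context = "\n\n".join(ordered)
    return f"Use only the context below.\n\n{context}\n\nQuestion: {question}"

docs = [
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
    "Llamas are native to South America.",
]
prompt = build_rag_prompt("What is the capital of France?", docs, top_k=2)
```

Restating the question after the context, as above, is another placement trick from the same line of work: the instruction the model must act on sits at the end, where attention is strongest.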
Safety, bias, and adversarial behavior#
The threat surface and what we know about defending it. See the prompt injection and hallucinations guides for practical guidance.
- Universal and Transferable Adversarial Attacks on Aligned Language Models, Zou et al., 2023
Demonstrated that automated optimization can find adversarial prompts that transfer across models. Sobering reading on the limits of current defenses.
- Prompt Injection Attacks Against Application-Integrated Large Language Models, Greshake et al., 2023
Maps the indirect prompt injection threat surface — how documents, search results, and tool outputs can hijack LLM applications.
- TruthfulQA: Measuring How Models Mimic Human Falsehoods, Lin et al., 2021
A benchmark for hallucination — questions where models tend to repeat common misconceptions. Foundation for measuring factuality.
Recent and notable#
Capability surveys and recent technical reports worth reading even if you're not a researcher.
- Sparks of Artificial General Intelligence: Early experiments with GPT-4, Bubeck et al. (Microsoft), 2023
A wide-ranging probe of GPT-4's capabilities. Useful as a survey of what frontier models can and cannot do, with concrete examples.
- The Llama 3 Herd of Models, Meta, 2024
Meta's technical report for the Llama 3 family. Detailed training methodology and ablations — the most thorough open-source model paper to date.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs, DeepSeek, 2025
How reasoning models are trained. Shows that reinforcement learning on verifiable math and code tasks produces strong CoT-style reasoning; the R1-Zero variant gets there with no supervised fine-tuning at all.
Quick reference#
The 60-second summary
The three to start: Wei et al. (CoT), Yao et al. (ReAct), Lewis et al. (RAG). Together they cover most of the theory behind production LLM work.
How to read: abstract + figures + intro + conclusion in 15 minutes. Skip methods unless you're implementing.
Five sections here: foundations, techniques, RAG, safety, recent.
The discipline: save links to papers behind the techniques you actually use. Re-read them when models or techniques shift.
What to read next#
For practical applications of these papers, Prompting techniques walks through how to apply each in production. For tools that implement these patterns, see tools. For external blogs and write-ups, readings.