AI agents: what they actually are (and aren't)
An agent is an LLM in a loop with the ability to act and observe. Learn the core anatomy, the difference between an agent and a chain, and the signal that you actually need one.
Two engineering teams built a customer-research tool last quarter. Same goal: scrape competitor announcements, summarize them, file them in Notion. Team A built "an agent" — a single LLM with web-search, summarize, and Notion-write tools, looping autonomously until the job was done. Team B built a chain — three fixed prompts running in sequence with a tiny orchestrator on top.
Team A's agent worked spectacularly when it worked. It also burned $47 in one runaway loop, forgot to file two of the summaries, and once opened 30 browser tabs trying to disambiguate a competitor name. Team B's chain was boring — and shipped on schedule with predictable cost.
The word "agent" has become wildly overloaded. Some people mean a chatbot. Others mean a fully autonomous system that books your flights unsupervised. Most production wins are in between. This guide is about getting clear on what an agent really is, when you actually need one, and the patterns that separate Team A's expensive failure from Team B's reliable ship.
The whole idea in one line: an agent is an LLM in a loop with tools, deciding what to do next based on what just happened.
The mental model: an LLM in a loop with arms
A vanilla LLM call is a single forward pass: input in, output out, no awareness of what happened before. A chain is a fixed sequence of calls stitched together by your code. An agent is the next step: an LLM running in a loop where each step decides what to do based on what just happened.
Three things make it different:
- It can act. Tools let it affect the world — search, query a database, send an email, run code.
- It can perceive. Tool returns are observations the model attends to before deciding what comes next.
- It decides what comes next. Unlike a chain, the sequence isn't known in advance. The model picks tools and arguments based on the goal and the current state.
That third property — dynamic step selection — is the whole reason agents exist. It's also the reason they're harder to operate than chains. Loss of determinism is the agent tax.
Anatomy of an agent
Every working agent has four moving parts. Get each one right and most production agent work takes care of itself.
The model
An LLM that does the reasoning and decides what to do at each step. Frontier models (Claude Sonnet, GPT-4o, Gemini Pro) handle agent loops well. Smaller, cheaper models can work for tightly scoped agents but degrade fast when the task gets open-ended.
The loop
An outer process (your application code) that calls the model, parses its output, runs any actions, feeds results back, and repeats until the model produces a finish signal. Almost every modern agent runs some variant of the ReAct loop: Thought → Action → Observation → repeat.
Tools
A set of functions the model can call — search, code execution, database queries, API calls. The agent's arms and eyes. Most quality issues in production agents trace back to tool design, not prompt design. Read Agent tools next.
Memory
A way to retain state across turns — short-term within a single task, long-term across tasks. Without memory, every conversation starts from zero and the agent feels broken. See agent memory for the patterns.
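As a concrete sketch of the short-term half, here is a minimal rolling message buffer. The class and its message-count budget are illustrative; production systems usually trim by token count instead:

```python
from collections import deque

class ShortTermMemory:
    """Rolling buffer of conversation messages for a single task."""

    def __init__(self, max_messages: int = 50):
        # Oldest turns fall off automatically once the budget is hit.
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def as_context(self) -> list:
        return list(self.messages)
```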
Agent vs. chain: when do you actually need one?
The most common — and most expensive — mistake in this space: building an agent when a chain would do. Agents add complexity, cost, and unpredictability for capabilities most products don't need.
Chain vs. agent
| If your situation is… | Reach for… | Why |
|---|---|---|
| Steps are known in advance | Chain | No reason to pay for dynamic selection |
| Iteration count varies by input | Agent | You can't hardcode the path |
| You need predictable cost / latency | Chain | Agents have variable token usage and runtime |
| Open-ended tasks (research, data exploration) | Agent | The next step depends on what you just learned |
| Customer-facing with reliability requirements | Chain (or agent + heavy guardrails) | Agents fail in surprising ways; chains fail predictably |
| Internal tool, technical users, exploratory work | Agent | Users tolerate agent quirks for the open-endedness |
| Single-prompt task with no follow-up | Just a prompt | No loop = no agent |
A practical rule: start with a chain; promote to an agent only when you actually need the dynamism. Most production LLM apps that succeed are chains; most that fail spectacularly were agents that didn't need to be.
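To make the contrast concrete, here is Team B's style of chain in outline. The `call_model` helper is a hypothetical wrapper over your model API; the point is that the three steps are fixed in code, so cost and latency are bounded by construction:

```python
def run_chain(call_model, announcement: str) -> str:
    # Three fixed prompts in sequence: no loop, no dynamic step selection.
    summary = call_model(f"Summarize this competitor announcement:\n{announcement}")
    entities = call_model(f"List the companies and products mentioned:\n{summary}")
    return call_model(f"Write a Notion-ready digest from:\n{summary}\n{entities}")
```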
A minimal agent in 15 lines
Here it is in runnable Python shape. The `call_model` and `run_tool` parameters are stand-ins for your model API wrapper and tool registry (assumptions about your stack, not a specific SDK):

```python
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful agent."}

def tool_result(result):
    """Wrap a tool's return value as an observation message."""
    return {"role": "tool", "content": str(result)}

def run_agent(user_message, tools, call_model, run_tool, max_iters=10):
    messages = [SYSTEM_PROMPT, {"role": "user", "content": user_message}]
    for _ in range(max_iters):
        response = call_model(messages, tools=tools)
        if response.kind == "final":        # the model decided it's done
            return response.text
        if response.kind == "tool_call":    # the model picked a tool to run
            result = run_tool(response.name, response.args)
            messages.append(response)             # the tool call
            messages.append(tool_result(result))  # the observation
    raise RuntimeError("Hit iteration budget without finishing")
```

That's it. Every agent framework — LangChain, LlamaIndex, vendor SDKs — is implementing some version of this loop with extras (memory, retries, parallel tool calls, observability). Knowing the core loop lets you debug whatever framework you end up using.
Three architectural patterns to know
Single-loop ReAct
One model, one loop, full freedom to call any tool. Simplest pattern. Works for many tasks — research, lookup-heavy Q&A, exploratory data queries. Becomes unreliable past 5-7 steps for hard problems because the model loses sight of the goal.
Planner-executor
One model (the planner) breaks the goal into steps. Another model (the executor) carries them out, often in tighter ReAct loops. Better for complex tasks because the planner holds the goal while the executor handles tactics. More reliable on long-horizon work; more expensive (two model calls minimum).
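In outline, assuming a `call_model` wrapper and a `run_react` single-loop agent like the one above, the pattern is a sketch like this:

```python
def run_planner_executor(goal: str, call_model, run_react) -> list:
    # The planner sees only the goal and emits an ordered step list.
    plan = call_model(f"Break this goal into short numbered steps:\n{goal}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    results = []
    for step in steps:
        # Each step runs in its own tight ReAct loop; the executor sees
        # the step, not the user's original phrasing.
        results.append(run_react(step))
    return results
```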
Multi-agent
Multiple specialized agents (a researcher, a writer, a fact-checker) coordinated by a supervisor. Hot research area. In production, often over-engineered. Try a single-loop or planner-executor first. Reach for multi-agent only when you have measured evidence that the simpler patterns are insufficient.
The five failure modes of production agents
1. Infinite loops
The model keeps calling tools and never decides to finish. Often happens when the task is genuinely impossible and the agent won't admit it. Fix: always set a max-iteration budget; on the final iteration, inject a system message that forces a finish.
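One way to implement the forced finish, sketched with a hypothetical `call_model` wrapper:

```python
def call_with_forced_finish(call_model, messages, tools, i, max_iters):
    if i == max_iters - 1:
        # Last allowed step: withhold the tools and demand an answer.
        messages = messages + [{
            "role": "system",
            "content": ("Iteration budget reached. Give your best final answer "
                        "now, and say plainly if the task cannot be completed."),
        }]
        return call_model(messages, tools=[])
    return call_model(messages, tools=tools)
```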
2. Compounding errors
Each step has a failure rate, and errors compound across steps. A 95%-reliable step run 10 times succeeds end-to-end only about 60% of the time (0.95^10 ≈ 0.60). Fix: keep loops shallow when you can; add validators between steps; design tools with retry-friendly error responses.
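A sketch of a between-steps validator, which doubles as an example of the retry-friendly error shape (field names are illustrative):

```python
def validate_observation(result: dict) -> dict:
    """Check a tool's output before it re-enters the loop."""
    if "error" in result:
        # A structured, actionable failure beats a raw stack trace: the
        # model can read the hint and retry with corrected arguments.
        return {"ok": False,
                "hint": f"Tool failed: {result['error']}. Check the arguments and retry."}
    if not result.get("data"):
        return {"ok": False, "hint": "Tool returned nothing. Try a different query."}
    return {"ok": True, "data": result["data"]}
```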
3. Goal drift
The model loses track of what the user asked for after many turns. Fix: reinject the goal periodically, or use a planner-executor architecture where a planner LLM holds the goal and the executor never sees the user's original phrasing.
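The reinjection fix can be a few lines inside the loop (the cadence is illustrative):

```python
REINJECT_EVERY = 4  # every N loop iterations; tune per task

def maybe_reinject_goal(messages: list, goal: str, i: int) -> list:
    # Restate the user's goal as a fresh system message so it sits in
    # recent context instead of fading behind tool observations.
    if i > 0 and i % REINJECT_EVERY == 0:
        messages.append({"role": "system",
                         "content": f"Reminder of the user's goal: {goal}"})
    return messages
```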
4. Tool overload
Giving the agent 30 tools is worse than 5. The model picks wrong tools more often when similar-sounding ones blur into each other. The token cost of all those tool descriptions is real too. Fix: curate; use a tool router for large catalogs.
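A tool router can be a small similarity search over tool descriptions. This sketch assumes an `embed` function that maps text to a vector; only the top-k tools ever reach the model:

```python
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def route_tools(task: str, catalog: list, embed, k: int = 5) -> list:
    # Rank the full catalog by similarity to the task; expose only k tools.
    task_vec = embed(task)
    ranked = sorted(catalog,
                    key=lambda tool: cosine(embed(tool["description"]), task_vec),
                    reverse=True)
    return ranked[:k]
```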
5. Black-box failures
When an agent misbehaves in production, you need the trace — every prompt, every tool call, every observation. Without observability, debugging is guesswork. Fix: build trace logging from day one. Tools like LangSmith and Helicone (see our tools list) make this near-zero effort.
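Even a homegrown version is small. This sketch appends one JSON line per loop event, keyed by a trace id, so a failed run can be replayed step by step:

```python
import json
import os
import time

def log_event(trace_id: str, kind: str, payload: dict) -> None:
    """Append one structured event to the task's trace file."""
    os.makedirs("traces", exist_ok=True)
    event = {"trace_id": trace_id, "ts": time.time(), "kind": kind, **payload}
    with open(f"traces/{trace_id}.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

# Inside the loop, roughly:
#   log_event(trace_id, "tool_call", {"name": response.name, "args": response.args})
#   log_event(trace_id, "observation", {"result": str(result)})
```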
Evaluating an agent
Agents are harder to evaluate than chains because outputs aren't deterministic and trajectories vary. Use two complementary evaluation styles, and run both:
- End-to-end. Did the agent achieve the goal? A binary or graded score per task. Use a fixed test set and score against a rubric.
- Trajectory. How did the agent get there? Number of steps, tool-call accuracy, time to completion. Catches degradations that end-to-end metrics miss.
End-to-end tells you if it works. Trajectory tells you why. See A/B testing prompts for a workflow you can adapt to agent evals.
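A sketch of a harness that scores both layers in one pass. `run_agent_traced` (returns the answer plus the event list) and `grade` (scores an answer against a rubric) are assumed helpers, not a real API:

```python
def evaluate(test_set: list, run_agent_traced, grade) -> dict:
    passed, total_steps = 0, 0
    for task in test_set:
        answer, trace = run_agent_traced(task["input"])
        if grade(answer, task["rubric"]):  # end-to-end: did it achieve the goal?
            passed += 1
        total_steps += len(trace)          # trajectory: how much work did it take?
    return {
        "pass_rate": passed / len(test_set),
        "avg_steps": total_steps / len(test_set),
    }
```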
Going further: production-grade agent patterns
Iteration and cost budgets
Every agent should have a hard ceiling on iterations and a separate ceiling on tool calls or tokens. A runaway agent can hit your rate limits and your wallet hard. Track both per-task and per-user budgets.
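A sketch of the double ceiling; the limits shown are illustrative defaults, not recommendations:

```python
class BudgetExceeded(Exception):
    pass

class Budget:
    def __init__(self, max_iters=10, max_tool_calls=25, max_tokens=100_000):
        self.max_iters, self.max_tool_calls, self.max_tokens = (
            max_iters, max_tool_calls, max_tokens)
        self.iters = self.tool_calls = self.tokens = 0

    def charge(self, iters=0, tool_calls=0, tokens=0) -> None:
        # Called once per loop step; raising is what stops a runaway agent.
        self.iters += iters
        self.tool_calls += tool_calls
        self.tokens += tokens
        if (self.iters > self.max_iters
                or self.tool_calls > self.max_tool_calls
                or self.tokens > self.max_tokens):
            raise BudgetExceeded("agent exceeded its budget; aborting the task")
```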
Human-in-the-loop checkpoints
For agents with destructive tool access (file deletion, money movement, public posting), insert explicit confirmation steps. Either the agent proposes an action and a human approves before commit, or the agent surfaces a preview the user can edit. Slows the agent slightly; prevents incidents.
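The simplest gate wraps tool dispatch. The tool names and the `ask_human` callback are illustrative:

```python
DESTRUCTIVE_TOOLS = {"delete_file", "send_payment", "publish_post"}

def run_tool_gated(name: str, args: dict, run_tool, ask_human) -> dict:
    if name in DESTRUCTIVE_TOOLS:
        approved = ask_human(f"Agent wants to call {name} with {args}. Approve?")
        if not approved:
            # Fed back to the model as an observation it can react to.
            return {"error": "action rejected by human reviewer"}
    return run_tool(name, args)
```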
Self-reflection and self-correction
After a task, run a separate prompt that critiques the agent's trace: did it miss anything? Were the tool calls efficient? If the critique flags an issue, re-run with the critique injected as context. This catches a class of errors that pure ReAct misses. The Reflexion paper (Shinn et al., 2023 — in our papers list) formalizes this pattern.
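In outline, assuming a `run_agent` that returns both the answer and the trace, the pattern is a sketch like this:

```python
def run_with_reflection(task: str, run_agent, call_model) -> str:
    answer, trace = run_agent(task)
    critique = call_model(
        f"Task: {task}\nTrace: {trace}\n"
        "Did the agent miss anything? Were the tool calls efficient? "
        "Reply OK if the run was fine, otherwise describe the problem."
    )
    if critique.strip() != "OK":
        # One corrective retry with the critique injected as context.
        answer, _ = run_agent(f"{task}\n\nCritique of the previous attempt: {critique}")
    return answer
```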
Structured handoffs in multi-agent setups
If you really do need multiple agents, define strict handoff schemas: agent A produces a structured object, agent B consumes that object, no free-text in between. Same discipline as prompt chaining: structured outputs at every boundary.
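As a sketch, the handoff object can be a plain dataclass validated at the boundary (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ResearchHandoff:
    topic: str
    findings: list                 # one claim per entry
    sources: list                  # one supporting URL per claim
    open_questions: list = field(default_factory=list)

def validate_handoff(handoff: ResearchHandoff) -> ResearchHandoff:
    # Agent B rejects malformed input instead of guessing at free text.
    if not handoff.findings or len(handoff.sources) != len(handoff.findings):
        raise ValueError("every finding needs exactly one source before handoff")
    return handoff
```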
Common mistakes
- Building an agent when a chain would do. The most common mistake. Agents add complexity, cost, and unpredictability for capabilities most products don't need.
- No iteration budget. A runaway agent can burn dollars or call APIs thousands of times. Always cap iterations and tool calls.
- No retry / fallback for tool errors. Tools fail. Networks blip. Return structured error observations the model can react to, not raw stack traces.
- Skipping observability. Without traces, debugging is impossible. Log every prompt, tool call, observation, and final answer.
- Treating "works on the demo input" as ready to ship. Demos hit the happy path. Build a real eval set with adversarial inputs before any production traffic.
Quick reference
The 60-second summary
What it is: an LLM in a loop with tools, deciding what to do next based on what just happened. Four parts: model, loop, tools, memory.
When you need one: iteration count varies, next step depends on intermediate results, task is genuinely open-ended.
Three patterns: single-loop ReAct (default), planner-executor (complex tasks), multi-agent (rarely needed).
The non-negotiables: iteration budget, tool-error handling, observability, eval set with adversarial inputs.
The discipline: start with a chain; promote to agent only when iteration is dynamic.
What to read next#
Two deeper guides: Agent memory covers how agents retain state across turns, and Agent tools is where most agent quality comes from. For the underlying loop pattern, ReAct is essential reading. And for prompts that are production-ready, version control is non-negotiable.
Put this guide to work
Save your prompts, version every change, and share them with your team — free for up to 200 prompts.