Agent tools: how to design tools the model uses well
Tools are how an agent affects the world. Designing them well is half of building a reliable agent. Learn the principles: descriptive names, narrow scopes, structured arguments, helpful errors.
Your support agent has 23 tools registered: get_user, list_users, search_users_by_email, find_user_by_phone, lookup_account, get_account_info... A user asks "what plan is sam@acme.com on?" The agent picks list_users, gets back 10,000 results, panics, picks find_user_by_phone, fails because there's no phone, gives up.
Same agent, three well-designed tools: find_user(query) with hybrid search, get_user_by_id(id), and get_account(user_id). The agent solves the same query in two calls, every time.
Most agent quality issues trace to tool design, not prompt design. The model isn't bad at using tools — your tools are bad at being used. This guide is the playbook: six rules that consistently separate agents that work from agents that don't.
The whole idea in one line
Tools are an API surface designed for the model, not a human developer; most agent quality problems are tool-design problems in disguise.
The mental model: an API for the model, not a human#
When you design an API for a human developer, you optimize for documentation lookup and IntelliSense. When you design tools for a model, you optimize for in-context decision-making. The model sees all your tool descriptions every turn and picks one (or none) based on the immediate situation.
That changes the design priorities:
- Names matter more. The name is the model's primary signal for when to use a tool. `search` is ambiguous; `search_company_docs` isn't.
- Descriptions teach selection. A human reads a description once. The model reads it every turn while comparing it to other descriptions. Descriptions need to explain WHEN to use the tool, not just WHAT it does.
- Errors are observations. A failed tool returns an Observation the model reads. A useless error wastes a turn; a helpful one lets the model recover.
- Tool count is a tax. Every tool description costs context tokens AND decision attention. More tools make the model worse at picking, not better.
Rule 1: descriptive names#
The tool's name is the model's primary signal for when to use it. Treat naming with the same care you'd treat a public API surface.
- Lead with a verb: `get_user_by_id`, `send_slack_message`, `execute_sql_query`.
- Be domain-specific: `list_open_pull_requests` beats `list_items`.
- Avoid abbreviations the model might misinterpret: `get_pr` vs `get_pull_request` — the second wins.
- Pick an object noun that hints at the return type: `find_customer_by_email` returns a customer; `list_customers_by_plan` returns a list.
Rule 2: narrow scope per tool#
A single tool that does "manage user" (create, update, delete, list, get) is harder for the model to use correctly than five tools each named exactly what they do. Narrow scopes reduce argument-construction errors and make tool selection deterministic.
Bad: one mega-tool with an action argument.

```json
{
  "name": "manage_user",
  "description": "Create, update, delete, list, or get users",
  "parameters": {
    "action": "create | update | delete | list | get",
    "user_id": "(optional)",
    "fields": "(optional, depends on action)"
  }
}
```

Good: narrow tools, each named for exactly what it does.

```json
{
  "name": "create_user",
  "description": "Create a new user. Returns the created user object.",
  "parameters": { "name": "string", "email": "string", "role": "admin|member|viewer" }
}
{
  "name": "get_user_by_id",
  "description": "Look up a single user by their UUID.",
  "parameters": { "user_id": "uuid" }
}
{
  "name": "list_users",
  "description": "Page through users. Filter optional. Returns up to 50 per call.",
  "parameters": { "filter": "string?", "page": "integer?" }
}
```

Rule 3: descriptions that explain WHEN to use#
The tool description is the second-strongest signal (after the name). The most common mistake: descriptions that say WHAT the tool is, not WHEN to use it.
Bad: "A search tool."
Good: "Use this to find content from internal company documentation. Best for questions about company policies, products, or procedures. NOT for general web facts — use search_web for those."
When you have multiple similar tools, describe how to pick between them in each tool's description. That's what stops the model from picking the wrong one.
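To make it concrete, here is a sketch of two overlapping search tools whose descriptions tell the model how to pick between them. The schema shape assumes the JSON-Schema-style parameters most function-calling APIs accept; adapt the field names to your provider.

```python
# Two similar tools; each description says when to use it AND when to use the other.
SEARCH_TOOLS = [
    {
        "name": "search_company_docs",
        "description": (
            "Search internal company documentation. Best for questions about "
            "company policies, products, or procedures. NOT for general web facts; "
            "use search_web for those."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Natural-language search query."}
            },
            "required": ["query"],
        },
    },
    {
        "name": "search_web",
        "description": (
            "Search the public web. Best for general facts and current events. "
            "NOT for company policies or internal material; use search_company_docs for those."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Natural-language search query."}
            },
            "required": ["query"],
        },
    },
]
```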
Rule 4: structured arguments with explicit types#
Use the strongest typing the API supports — JSON Schema for OpenAI/Anthropic, TypedDicts for Gemini. Three specifics that compound:
- Enums for fixed choices. `priority: "low" | "medium" | "high"` instead of free text. Eliminates a class of errors.
- Required vs optional. Mark optional params as optional. The model otherwise invents values for them.
- Argument descriptions. Each parameter gets its own one-line description. The model reads these.
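A single schema can apply all three. The tool below is illustrative (the name `create_ticket` and its fields are made up), written in the JSON-Schema style most providers accept:

```python
# Enum for the fixed choice, explicit required list, one-line description per parameter.
CREATE_TICKET = {
    "name": "create_ticket",
    "description": "Create a support ticket for an existing user. Returns the new ticket's id.",
    "parameters": {
        "type": "object",
        "properties": {
            "user_id": {"type": "string", "description": "UUID of the user the ticket belongs to."},
            "title": {"type": "string", "description": "One-line summary of the issue."},
            "priority": {
                "type": "string",
                "enum": ["low", "medium", "high"],
                "description": "Urgency of the ticket. Defaults to 'medium' if omitted.",
            },
        },
        "required": ["user_id", "title"],  # priority is genuinely optional
    },
}
```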
Rule 5: helpful errors over silent failures#
When a tool fails, the agent needs to know why. A stack trace is useless; a structured, readable error lets the model recover.
Bad: a raw stack trace.

```json
{
  "error": "TypeError: Cannot read property 'id' of undefined at line 47..."
}
```

Good: structured, with a recovery hint.

```json
{
  "error": "User not found",
  "user_id_provided": "u_abc123",
  "hint": "If you don't have the exact user_id, try list_users with a filter."
}
```

The model treats the error as another observation. Make that observation actionable.
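One way to enforce this is a thin wrapper at the tool boundary that turns exceptions into structured observations. A sketch, with the error wording and hints as placeholders you'd tailor per tool:

```python
import json
import traceback

def run_tool(tool_fn, **kwargs):
    """Run a tool and always return a string observation the model can act on."""
    try:
        return json.dumps(tool_fn(**kwargs), default=str)
    except KeyError as exc:
        # Domain error: say what was wrong and how to recover.
        return json.dumps({
            "error": f"Not found: {exc}",
            "arguments_provided": kwargs,
            "hint": "If you don't have an exact id, try a list or search tool first.",
        }, default=str)
    except Exception:
        # Unexpected error: keep the stack trace in your logs, not in the model's context.
        traceback.print_exc()
        return json.dumps({
            "error": "The tool failed unexpectedly.",
            "hint": "Retry once; if it fails again, tell the user you couldn't complete this step.",
        })
```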
Rule 6: fewer tools, better tools#
Each tool definition costs context tokens — typically 100-300 tokens of schema and description per tool. Ten tools is fine. Thirty is a problem. The model spends attention reading tool descriptions instead of solving the user's task.
More importantly, the model picks wrong tools more often when there are many similar ones. Three tools that each do something distinct outperform fifteen that blur into each other.
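It's worth measuring what your catalog actually costs every turn. This sketch uses the rough four-characters-per-token heuristic; swap in your provider's tokenizer for real numbers:

```python
import json

def estimate_catalog_tokens(tools, chars_per_token=4):
    """Rough per-tool context cost: serialized schema length divided by ~4 chars per token."""
    per_tool = {t["name"]: len(json.dumps(t)) // chars_per_token for t in tools}
    return per_tool, sum(per_tool.values())

# per_tool, total = estimate_catalog_tokens(MY_TOOLS)
# Sort per_tool by size to find the schemas that cost the most to carry on every turn.
```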
Curation strategies by tool count
| You have… | Strategy | Why |
|---|---|---|
| 1–10 tools | Just expose all of them | Below the noise threshold; no curation needed |
| 11–30 tools | Group + carefully name | Naming patterns help the model navigate |
| 30–100 tools | Tool router pattern | A first-pass classifier picks 5–7 relevant ones per task |
| 100+ tools | Hierarchy + retrieval over tool descriptions | Treat tool catalog as a RAG corpus |
| Any count where most aren't used in production | Prune ruthlessly | Unused tools hurt without helping; cut on a cadence |
The output shape matters as much as the input#
A tool that returns a JSON blob with 50 fields the agent doesn't need is worse than one that returns only the 3 fields it does need. Token cost compounds across iterations. Strip outputs to what's actually useful.
Two patterns help:
- Pagination. List tools should return a page, not the entire dataset.
- Field projections. Let the agent request only the fields it needs: `get_user(id, fields=["name", "email"])`.
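A field projection can be a few lines on the tool side. A sketch, where `fetch_user_from_db` stands in for whatever data access you already have:

```python
def get_user(user_id, fields=None):
    """Fetch a user and return only the requested fields (all fields if none are given)."""
    user = fetch_user_from_db(user_id)  # assumed helper; returns a dict
    if fields is None:
        return user
    return {k: v for k, v in user.items() if k in fields}

# The agent asks for just what it needs, so the observation stays small:
# get_user("u_abc123", fields=["name", "email"])
```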
Confirmation for side-effecting tools#
For tools that mutate state — sending messages, making purchases, deleting data — consider a two-step pattern:
- Tool 1: dry-run. `preview_send_email` returns what would be sent, with no side effect.
- Tool 2: commit. `commit_email_send(preview_id)` actually sends.
Slows the agent slightly; saves you from incidents. For high-stakes side effects, also surface the dry-run output to a human for approval. See prompt injection for the broader case for least-privilege tool design.
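A minimal sketch of the two-step pattern, with `send_email` standing in for your real email client and an in-memory preview store you'd replace with something persistent:

```python
import uuid

_pending_previews = {}  # preview_id -> email payload; in-memory for the sketch only

def preview_send_email(to, subject, body):
    """Dry run: show the model (and optionally a human) exactly what would be sent."""
    preview_id = str(uuid.uuid4())
    _pending_previews[preview_id] = {"to": to, "subject": subject, "body": body}
    return {"preview_id": preview_id, "would_send": _pending_previews[preview_id]}

def commit_email_send(preview_id):
    """Commit: only sends something that was previously previewed."""
    payload = _pending_previews.pop(preview_id, None)
    if payload is None:
        return {"error": "Unknown preview_id", "hint": "Call preview_send_email first."}
    send_email(**payload)  # assumed helper that talks to your email provider
    return {"status": "sent", "to": payload["to"]}
```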
Going further: production-grade tool patterns#
Tool routers#
For agents with large tool catalogs, run a lightweight first-pass classifier (or embedding similarity over tool descriptions) that selects 5-7 tools relevant to the current user message. The agent then runs with that subset. Keeps the visible tool list small at decision time without sacrificing breadth.
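A dependency-free sketch of the idea, scoring tools by word overlap with the user message; in practice you'd likely swap in embedding similarity over the same name-plus-description text, or a small classifier:

```python
def route_tools(user_message, tools, k=7):
    """First-pass router: rank tools by overlap between the message and each tool's name + description."""
    message_words = set(user_message.lower().split())

    def score(tool):
        text = f"{tool['name']} {tool['description']}".lower().replace("_", " ")
        return len(message_words & set(text.split()))

    ranked = sorted(tools, key=score, reverse=True)
    return ranked[:k]

# The agent then runs its normal loop with only route_tools(msg, FULL_CATALOG) visible.
```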
Sandboxing tool execution#
For tools that execute code, run shell commands, or hit external services — run every action in an isolated environment: ephemeral container, network-restricted, file-system-restricted. Even if a prompt-injection attack succeeds, blast radius is contained.
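One way to get that isolation is a throwaway container per execution. This sketch assumes Docker is available on the host and that a plain `python:3.12-slim` image is enough for the code being run:

```python
import subprocess

def run_python_in_sandbox(code, timeout=30):
    """Execute model-written Python in a disposable, network-less, read-only container."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",   # no outbound network, even if the code tries
            "--read-only",         # no writes to the container filesystem
            "--memory", "256m", "--cpus", "0.5",
            "python:3.12-slim", "python", "-c", code,
        ],
        capture_output=True, text=True, timeout=timeout,
    )
    return {"stdout": result.stdout, "stderr": result.stderr, "exit_code": result.returncode}
```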
Per-tool rate limiting and budgets#
Set per-tool, per-user rate limits inside your agent loop. An agent stuck in a retry loop on a paginated list can hit your rate limits before the iteration budget catches it. Budget at the tool level, not just the agent level.
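A sketch of a per-tool, per-user budget you'd check inside the loop before every call; the limits shown are arbitrary:

```python
from collections import defaultdict

class ToolBudget:
    """Per-tool, per-user call budgets checked inside the agent loop."""

    def __init__(self, limits):
        self.limits = limits            # e.g. {"list_users": 5, "send_email": 1}
        self.counts = defaultdict(int)  # (user_id, tool_name) -> calls so far

    def allow(self, user_id, tool_name):
        key = (user_id, tool_name)
        if self.counts[key] >= self.limits.get(tool_name, 20):  # default cap is arbitrary
            return False
        self.counts[key] += 1
        return True

# In the loop: if not budget.allow(user_id, tool_name), return a structured error
# observation ("budget exhausted, summarize what you have") instead of calling the tool.
```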
Evaluating tool design#
Build a test set where each test case has a known "correct tool" for the user input. Run the agent and score whether the right tool was called first. Catches naming and description issues that subtler eval methods miss. Re-run on every tool catalog change.
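A sketch of that check, where `agent.first_tool_call` is a stand-in for however you capture the first tool the agent chooses:

```python
def eval_first_tool_choice(agent, test_cases):
    """Score how often the agent's FIRST tool call matches the known correct tool.

    Each test case: {"input": <user message>, "expected_tool": <tool name>}.
    """
    correct = 0
    for case in test_cases:
        chosen = agent.first_tool_call(case["input"])  # stand-in for your harness
        correct += chosen == case["expected_tool"]
    return correct / len(test_cases)
```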
Tool-call observability#
Log every tool call with: input arguments, output (or error), latency, and whether it was the "right" choice (if you have ground truth). Aggregate to surface tools that get called incorrectly often, tools that error out regularly, tools that no one uses. The data is your tool-design debug log.
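A sketch of the logging wrapper, writing one JSON line per call to a local file; in production you'd point this at your existing logging or tracing pipeline:

```python
import json
import time

def logged_tool_call(tool_name, tool_fn, arguments, log_file="tool_calls.jsonl"):
    """Run a tool and append one JSON line per call: arguments, output or error, latency."""
    start = time.monotonic()
    record = {"tool": tool_name, "arguments": arguments}
    try:
        record["output"] = tool_fn(**arguments)
    except Exception as exc:
        record["error"] = str(exc)
    record["latency_ms"] = round((time.monotonic() - start) * 1000)
    with open(log_file, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return record.get("output", {"error": record.get("error")})
```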
Common mistakes#
- One mega-tool with an action argument. Almost always worse than multiple narrow tools.
- Tool descriptions that don't differentiate from each other. The model can't tell when to use which.
- Returning raw API responses. Strip to what the agent needs. Token cost is real.
- No iteration budget on tool calls. An agent in a tool-call loop can hit your rate limits and your wallet hard.
- Skipping observability. Log every tool call with input, output, latency, and error. You will need this when debugging.
- Treating tool design as the prompt-engineer's job. Tools are an interface; designing them well is API design. Apply the same rigor.
Quick reference#
The 60-second summary
Six rules: descriptive names, narrow scope, descriptions that explain WHEN, structured args with types, helpful errors, few tools.
Output discipline: paginate lists, allow field projections, strip irrelevant fields.
Side effects: two-step preview/commit pattern; human-in-the-loop for high-stakes actions.
The mindset: tools are an API surface designed for the model. The same care you'd apply to a public REST API applies to your tool catalog.
What to read next#
Tools live inside the agent loop — Introduction to agents covers the loop. The state agents retain across tool calls is agent memory. For the underlying interleaved-reasoning pattern, ReAct. For the security implications of tool design, prompt injection.