How agents pick the right tool from many options — from LLM-native selection via descriptions, to RAG-over-tools for large catalogues, to hierarchical routing that avoids overloading the context window.
When you pass tools to an LLM, it selects which to call (and with what arguments) based entirely on the tool descriptions and the current conversation. There's no explicit "tool selection algorithm" in the model — it's all inference from text.
This means tool selection quality is a prompt engineering problem: write clearer, more specific descriptions and the model selects tools more reliably. The model asks itself (implicitly): "Given the user's request and these tool descriptions, which tool, if any, should I call right now?"
For small tool sets (under ~20 tools), this works well. The challenge comes with scale: a production agent system might have hundreds of tools, and including all of them in every prompt is expensive and confusing.
Tool descriptions are the primary lever for improving selection accuracy. Effective descriptions answer four questions:
What does this tool do? ("Searches the product catalogue for items matching a query")
When should it be used? ("Use when the user asks about product availability, pricing, or specifications")
When should it NOT be used? ("Do not use for order status — use get_order_status instead")
What does it return? ("Returns a list of up to 10 matching products with name, price, and stock level")
```python
# Bad description — too vague
{
    "name": "search",
    "description": "Search for information.",
    ...
}

# Good description — specific, actionable, with negative examples
{
    "name": "search_products",
    "description": (
        "Search the product catalogue by keyword. "
        "Use when the user asks about product availability, pricing, features, or specifications. "
        "Returns up to 10 matching products. "
        "Do NOT use for order status, returns, or account issues — use the appropriate tools for those."
    ),
    ...
}
```
For large tool catalogues (50+ tools), including all descriptions in every prompt is wasteful and degrades selection quality. Instead, treat tool selection as a retrieval problem: embed all tool descriptions, then retrieve only the most relevant ones for each user query.
```python
import numpy as np
from sentence_transformers import SentenceTransformer
import anthropic

# Tool catalogue
ALL_TOOLS = [
    {"name": "search_products", "description": "Search product catalogue...", "schema": {...}},
    {"name": "get_order_status", "description": "Check order delivery status...", "schema": {...}},
    {"name": "process_return", "description": "Initiate a product return...", "schema": {...}},
    # ... 100+ more tools
]

# Embed all tool descriptions at startup
embedder = SentenceTransformer("all-MiniLM-L6-v2")
tool_descriptions = [t["description"] for t in ALL_TOOLS]
tool_embeddings = embedder.encode(tool_descriptions)  # shape: (N, 384)

def retrieve_relevant_tools(query: str, k: int = 8) -> list[dict]:
    '''Retrieve top-K most relevant tools for the user query.'''
    query_embedding = embedder.encode([query])  # shape: (1, 384)
    # Cosine similarity between the query and every tool description
    norms = np.linalg.norm(tool_embeddings, axis=1) * np.linalg.norm(query_embedding)
    similarities = (tool_embeddings @ query_embedding.T).flatten() / (norms + 1e-8)
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return [ALL_TOOLS[i] for i in top_k_indices]

client = anthropic.Anthropic()

def agent_with_tool_retrieval(user_query: str) -> str:
    # Retrieve only the tools relevant to this query
    relevant_tools = retrieve_relevant_tools(user_query, k=8)
    api_tools = [{"name": t["name"], "description": t["description"],
                  "input_schema": t["schema"]} for t in relevant_tools]
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        tools=api_tools,
        messages=[{"role": "user", "content": user_query}],
    )
    # Note: if the model decides to call a tool, the first content block is a
    # tool_use block rather than text; a full agent loop would execute the tool
    # and continue the conversation.
    return response.content[0].text
```
For very large catalogues (hundreds to thousands of tools), even RAG retrieval can be slow. Hierarchical routing adds a fast first-stage classifier that narrows down the category before retrieving specific tools:
```python
TOOL_CATEGORIES = {
    "orders": ["get_order_status", "cancel_order", "update_shipping"],
    "products": ["search_products", "get_product_details", "check_stock"],
    "returns": ["process_return", "get_return_status", "refund_timeline"],
    "account": ["update_account", "reset_password", "billing_history"],
}

def classify_query_category(query: str) -> str:
    categories = list(TOOL_CATEGORIES.keys())
    response = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=32,
        messages=[{"role": "user", "content": (
            f"Classify into one category: {categories}\n"
            f"Query: {query}\n"
            "Category:"
        )}]
    )
    return response.content[0].text.strip().lower()

def route_to_tools(query: str) -> list[str]:
    category = classify_query_category(query)
    # Fall back to a default category if the classifier returns an unknown label
    return TOOL_CATEGORIES.get(category, TOOL_CATEGORIES["products"])

# Now load only the 3-5 tools in the matched category
tool_names = route_to_tools("Where is my order #12345?")
# → ["get_order_status", "cancel_order", "update_shipping"]
```
The classification call uses a cheap, fast model (Haiku). The second call uses the right tool subset with a more capable model. Total latency is lower than loading all tools, and accuracy improves because the model has fewer tools to confuse.
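The two stages can be wired together so the routing logic stays unit-testable: the sketch below injects the classifier as a plain function, so a stub can stand in for the Haiku call in tests. Names like `select_tools` and `fake_classifier` are illustrative, not part of the code above, and `TOOL_CATEGORIES` is abridged to keep the sketch self-contained.

```python
from typing import Callable

# Abridged copy of TOOL_CATEGORIES so this sketch is self-contained
TOOL_CATEGORIES = {
    "orders": ["get_order_status", "cancel_order", "update_shipping"],
    "products": ["search_products", "get_product_details", "check_stock"],
}

def select_tools(query: str,
                 classify: Callable[[str], str],
                 categories: dict[str, list[str]],
                 fallback: str = "products") -> list[str]:
    '''Two-stage routing: classify the query, then return that category's tools.

    `classify` is injected so the routing logic can be tested with a stub
    instead of a live model call.
    '''
    category = classify(query).strip().lower()
    # Guard against the classifier returning a category that does not exist
    return categories.get(category, categories[fallback])

# Stub classifier standing in for the Haiku call
fake_classifier = lambda q: "orders" if "order" in q.lower() else "UNKNOWN"

print(select_tools("Where is my order #12345?", fake_classifier, TOOL_CATEGORIES))
# → ['get_order_status', 'cancel_order', 'update_shipping']
```

In production, `classify` would be the `classify_query_category` function above; in tests, a stub keeps the routing path deterministic and free of API calls.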
Measure tool selection quality before optimising it. Key metrics:
Selection accuracy: what % of the time does the model call the correct tool for a given query? Build a test set of (query, expected_tool) pairs and measure against it.
False positive rate: how often does the model call a tool when no tool is needed (the query could be answered from knowledge)?
Wrong-tool rate: how often does the model call a plausible-but-wrong tool (e.g., calling search_products instead of get_order_status for an order query)?
```python
# get_selected_tool(query) is assumed here: it runs the agent on the query and
# returns the name of the tool the model called, or None if no tool was called.
test_cases = [
    {"query": "Where is my order 123?", "expected_tool": "get_order_status"},
    {"query": "Do you have red sneakers in size 10?", "expected_tool": "search_products"},
    {"query": "What is your return policy?", "expected_tool": None},  # No tool needed
]

correct = sum(1 for case in test_cases
              if get_selected_tool(case["query"]) == case["expected_tool"])
accuracy = correct / len(test_cases)
print(f"Tool selection accuracy: {accuracy:.1%}")
```
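The false-positive and wrong-tool rates can be measured from the same test set. A minimal sketch, assuming a `get_selected_tool`-style callable that returns the called tool's name or `None` (here replaced by a dict-backed stub; `selection_metrics` is an illustrative name):

```python
def selection_metrics(eval_set: list[dict], get_selected_tool) -> dict:
    '''Compute accuracy, false-positive rate, and wrong-tool rate.

    `get_selected_tool(query)` returns the tool name the model called, or None.
    '''
    correct = false_positives = wrong_tool = 0
    no_tool_cases = sum(1 for c in eval_set if c["expected_tool"] is None)
    tool_cases = len(eval_set) - no_tool_cases
    for case in eval_set:
        selected = get_selected_tool(case["query"])
        expected = case["expected_tool"]
        if selected == expected:
            correct += 1
        elif expected is None:        # called a tool when none was needed
            false_positives += 1
        elif selected is not None:    # called a plausible-but-wrong tool
            wrong_tool += 1
    return {
        "accuracy": correct / len(eval_set),
        "false_positive_rate": false_positives / no_tool_cases if no_tool_cases else 0.0,
        "wrong_tool_rate": wrong_tool / tool_cases if tool_cases else 0.0,
    }

# Stub selector standing in for a live agent call; dict.get returns None on miss
stub = {"Where is my order 123?": "get_order_status",
        "Do you have red sneakers in size 10?": "search_products",
        "What is your return policy?": "search_products"}  # a false positive

eval_set = [
    {"query": "Where is my order 123?", "expected_tool": "get_order_status"},
    {"query": "Do you have red sneakers in size 10?", "expected_tool": "search_products"},
    {"query": "What is your return policy?", "expected_tool": None},
]
metrics = selection_metrics(eval_set, stub.get)
print(metrics)
```

The denominators differ deliberately: false positives are rated against queries that need no tool, wrong-tool calls against queries that do.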
Instead of defining tools statically, load them from a registry at runtime. This enables tool versioning, A/B testing tool descriptions, and feature flags for tool availability:
```python
import json

class ToolRegistry:
    def __init__(self, registry_path: str):
        with open(registry_path) as f:
            self._tools = json.load(f)  # loaded from DB or config file

    def get_tools_for_user(self, user_id: str, tier: str) -> list[dict]:
        '''Return tools available for this user tier.'''
        return [t for t in self._tools
                if tier in t.get("available_tiers", ["free", "pro", "enterprise"])]

    def get_tools_for_context(self, context: str) -> list[dict]:
        '''Return tools tagged for a specific context.'''
        return [t for t in self._tools if context in t.get("contexts", [])]

registry = ToolRegistry("tools_registry.json")
tools = registry.get_tools_for_user(user_id="user_123", tier="pro")
```
Store tool descriptions in a database with versioning. When descriptions are improved, update the registry and all agents automatically benefit — no code deploy required.
Similar tools confuse the model. If you have search_products and search_catalogue that do the same thing, the model will pick randomly between them. Audit your tool set for redundancy — merge or rename overlapping tools.
Tool count degrades selection. Empirically, model tool selection accuracy drops meaningfully above ~20 tools in a single prompt. If you have more tools, use RAG retrieval or hierarchical routing — don't just stuff all tools into the prompt and hope the model handles it.
The model can call no tool when it should. If the model can't find a good tool match, it will sometimes answer from its parametric knowledge instead of calling a tool. Add explicit instructions: "Always use the available tools to answer user questions. Do not use general knowledge for questions about orders, products, or account information."
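Beyond the system-prompt instruction, the Anthropic Messages API exposes a `tool_choice` parameter that can force the model to call some tool rather than answer from parametric knowledge. A sketch, assuming the `tool_choice={"type": "any"}` option; `build_request` is an illustrative helper, not part of the SDK:

```python
SYSTEM = (
    "Always use the available tools to answer user questions. "
    "Do not use general knowledge for questions about orders, products, "
    "or account information."
)

def build_request(user_query: str, tools: list[dict], force_tool_use: bool = True) -> dict:
    '''Assemble keyword arguments for client.messages.create(**kwargs).'''
    kwargs = {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": SYSTEM,
        "tools": tools,
        "messages": [{"role": "user", "content": user_query}],
    }
    if force_tool_use:
        # "any" requires the model to call one of the provided tools;
        # omit it (the default, "auto") to let the model answer directly.
        kwargs["tool_choice"] = {"type": "any"}
    return kwargs

tools = [{"name": "get_order_status",
          "description": "Check order delivery status.",
          "input_schema": {"type": "object",
                           "properties": {"order_id": {"type": "string"}}}}]
request = build_request("Where is my order 123?", tools)
```

Forcing tool use is a blunt instrument: reserve it for routes where answering from knowledge is never acceptable, and rely on the system instruction elsewhere.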
| Pattern | Scale | Selection Method | Latency Overhead |
|---|---|---|---|
| Flat list | 1–20 tools | In-context description | None (in prompt) |
| Semantic routing | 20–200 tools | Embedding similarity | +20–50ms |
| Hierarchical (category → tool) | 50–500 tools | Two-stage LLM routing | +100–300ms |
| Agent-selects-tools (meta-agent) | Any | LLM reasons over tool catalogue | +200–500ms |
For semantic tool routing, embed both the tool descriptions and the incoming query, then retrieve the top-k most similar tools by cosine similarity before passing them to the LLM. Use a retrieval threshold of 0.75 minimum similarity — tools below this threshold are unlikely to be relevant and including them wastes context. Maintain a tool embedding index that updates automatically when tool descriptions change, and log tool selection decisions to detect cases where the wrong tool was retrieved.
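A minimal sketch of the threshold step, using toy 2-d embeddings and a pure-Python cosine similarity (function names are illustrative; real embeddings would come from the sentence-transformers index above):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    '''Cosine similarity between two vectors.'''
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def retrieve_with_threshold(query_emb, tool_embs, tools, k=8, min_sim=0.75):
    '''Top-k retrieval that also drops tools below the similarity floor.'''
    scored = sorted(
        ((cosine(query_emb, emb), tool) for emb, tool in zip(tool_embs, tools)),
        key=lambda pair: pair[0], reverse=True,
    )
    return [tool for sim, tool in scored[:k] if sim >= min_sim]

# Toy 2-d embeddings: one tool aligned with the query, one orthogonal to it
tools = [{"name": "search_products"}, {"name": "reset_password"}]
embs = [[1.0, 0.1], [0.0, 1.0]]
result = retrieve_with_threshold([1.0, 0.0], embs, tools)
print(result)
# → [{'name': 'search_products'}]
```

The threshold acts as a second filter after top-k: on a quiet query, fewer than k tools may survive, which is the intended behaviour since irrelevant tools only waste context.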
Tool description quality is the most impactful variable in selection accuracy. Include the tool name, a one-sentence description of what it does, what input format it expects, and a concrete example of when to use it versus similar tools. Descriptions under 50 words typically underperform; descriptions over 200 words slow down context processing without quality gain. Aim for 80–120 words per tool description.
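These length guidelines are easy to enforce mechanically. A small sketch of a description linter; the word-count bands mirror the numbers above, and `lint_description` is an illustrative name:

```python
def lint_description(tool: dict, min_words: int = 50,
                     target: tuple = (80, 120), max_words: int = 200) -> list[str]:
    '''Flag tool descriptions outside the recommended length bands.'''
    words = len(tool["description"].split())
    warnings = []
    if words < min_words:
        warnings.append(f"{tool['name']}: {words} words (too short; likely underperforms)")
    elif words > max_words:
        warnings.append(f"{tool['name']}: {words} words (too long; slows context processing)")
    elif not (target[0] <= words <= target[1]):
        warnings.append(f"{tool['name']}: {words} words (outside the 80-120 word sweet spot)")
    return warnings

print(lint_description({"name": "search", "description": "Search for information."}))
```

Running this in CI over the tool registry catches regressions before a vague or bloated description reaches production.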
Monitor tool selection accuracy in production by logging every tool call alongside the user intent category. Compute a confusion matrix monthly: which tools are being selected for which intent categories? Systematic misrouting (e.g. a search tool being called for calculation tasks) reveals gaps in tool descriptions that a simple prompt update can fix. Set up an alert when a tool's call volume drops by more than 50% week-over-week — this often indicates the tool has silently broken or been incorrectly deprioritised by the model.
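A sketch of both monitoring checks, assuming a simple call log of `{"intent": ..., "tool": ...}` entries and weekly per-tool call counts (all names here are illustrative):

```python
from collections import Counter

def confusion_matrix(call_log: list[dict]) -> dict:
    '''Count (intent_category, selected_tool) pairs from production logs.

    Off-diagonal mass, i.e. a tool repeatedly selected for an intent it was
    not designed for, points at a description gap.
    '''
    return dict(Counter((e["intent"], e["tool"]) for e in call_log))

def volume_drop_alerts(last_week: dict, this_week: dict,
                       threshold: float = 0.5) -> list[str]:
    '''Tools whose call volume fell by more than `threshold` week-over-week.'''
    return [tool for tool, prev in last_week.items()
            if prev > 0 and this_week.get(tool, 0) < prev * (1 - threshold)]

log = [
    {"intent": "orders", "tool": "get_order_status"},
    {"intent": "orders", "tool": "get_order_status"},
    {"intent": "calculation", "tool": "search_products"},  # a misroute
]
matrix = confusion_matrix(log)
print(matrix)

alerts = volume_drop_alerts({"get_order_status": 100, "check_stock": 10},
                            {"get_order_status": 40, "check_stock": 9})
print(alerts)
# → ['get_order_status']
```

The matrix is cheap to compute from logs, and the alert catches the silent-failure case the text describes: a tool whose volume collapses has usually broken or been deprioritised by a description change.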
When the tool set changes frequently, implement a tool registry with versioning: each tool carries a unique ID, a semantic version, and a deprecation date. Route queries to the correct tool version based on when the conversation started, not the current registry state. This prevents mid-conversation behaviour changes. Archive deprecated definitions for at least 90 days to support debugging of past conversations.
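One way to sketch this version pinning: give each registry entry an `effective_from` date (an assumption for illustration; the text specifies IDs, semantic versions, and deprecation dates, not this exact field) and resolve against the conversation's start date rather than today:

```python
from datetime import date

# Hypothetical registry: two versions of the same tool ID
REGISTRY = [
    {"id": "get_order_status", "version": "1.0.0",
     "effective_from": date(2024, 6, 1), "deprecated_on": date(2025, 3, 1)},
    {"id": "get_order_status", "version": "2.0.0",
     "effective_from": date(2025, 3, 1), "deprecated_on": None},
]

def tool_for_conversation(tool_id: str, conversation_started: date,
                          registry: list[dict] = REGISTRY) -> dict:
    '''Pick the tool version that was live when the conversation started.

    Resolving against the conversation start date, not the current registry
    state, prevents mid-conversation behaviour changes.
    '''
    candidates = [t for t in registry
                  if t["id"] == tool_id
                  and t["effective_from"] <= conversation_started]
    # Of the versions that existed at start time, take the most recent
    return max(candidates, key=lambda t: t["effective_from"])

old = tool_for_conversation("get_order_status", date(2025, 2, 1))
new = tool_for_conversation("get_order_status", date(2025, 4, 1))
print(old["version"], new["version"])
# → 1.0.0 2.0.0
```

Deprecated entries stay in the registry (and the 90-day archive) so that conversations, and debugging sessions, that began before the cutover keep resolving to the version they started with.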