I've already written about context engineering as the core discipline of building AI systems, and I've been experimenting with my own AI tools for coding, research, and automation. I'm noticing that tool calling consumes more and more context, so I need strategies to scale it.
My stack is Rust-based, using Rig for LLM abstraction, Restate for durable execution, Postgres, and a hypermedia architecture with Maud and HTMX. It works well. But as I've added more tools and connected more MCP servers, context usage is creeping up.
Every tool definition (name, description, JSON schema) eats tokens before the conversation even starts. A modest setup with a few MCP servers can consume 50,000+ tokens just on tool schemas.
Also, each tool call is a full inference round-trip. The model calls a tool, waits for the result, processes it, calls the next one. A workflow that touches five tools means five round-trips, plus all the intermediate reasoning. It's slow and eats up tokens.
Tool search: load what you need, when you need it
Instead of stuffing every tool definition into the context upfront, you load only a small set of frequently used tools plus a special tool search tool. Everything else is deferred. When the agent needs a capability it doesn't have, it searches for it, gets back lightweight summaries, and then the full schema of the selected tool gets loaded for the rest of the conversation.
Anthropic's research shows an 85% reduction in context usage, with tool-selection accuracy improving from 49% to 74% on Opus 4 and from 79.5% to 88.1% on Opus 4.5.
Anthropic offers a server-side implementation where you mark tools with defer_loading: true in the API request and they handle the search internally. But the more interesting version, for my purposes, is client-side. You build a tool registry that indexes tool names and descriptions, expose a tool_search tool that returns lightweight summaries, and on selection inject the full schema into context. This is model-agnostic: it's just a tool that returns tool definitions.
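As a minimal sketch of that client-side registry (the tool names and schemas here are made up, and real retrieval would be smarter than substring matching), the pattern reduces to a map of full definitions that only ever hands summaries to the model until it picks one:

```rust
use std::collections::HashMap;

/// Lightweight summary returned by the tool_search tool. The full JSON
/// schema stays out of context until the model actually selects the tool.
#[derive(Debug, Clone)]
struct ToolSummary {
    name: String,
    description: String,
}

struct ToolEntry {
    summary: ToolSummary,
    full_schema: String, // full JSON schema, deferred
}

struct ToolRegistry {
    tools: HashMap<String, ToolEntry>,
}

impl ToolRegistry {
    fn new() -> Self {
        Self { tools: HashMap::new() }
    }

    fn register(&mut self, name: &str, description: &str, schema: &str) {
        self.tools.insert(
            name.to_string(),
            ToolEntry {
                summary: ToolSummary {
                    name: name.to_string(),
                    description: description.to_string(),
                },
                full_schema: schema.to_string(),
            },
        );
    }

    /// The tool_search tool: naive keyword match over names and
    /// descriptions, returning summaries only.
    fn search(&self, query: &str) -> Vec<ToolSummary> {
        let q = query.to_lowercase();
        self.tools
            .values()
            .filter(|t| {
                t.summary.name.to_lowercase().contains(&q)
                    || t.summary.description.to_lowercase().contains(&q)
            })
            .map(|t| t.summary.clone())
            .collect()
    }

    /// On selection, the full schema is injected into the conversation.
    fn load_schema(&self, name: &str) -> Option<&str> {
        self.tools.get(name).map(|t| t.full_schema.as_str())
    }
}

fn main() {
    let mut reg = ToolRegistry::new();
    reg.register("get_weather", "Fetch current weather for a city", "{\"type\":\"object\"}");
    reg.register("send_email", "Send an email to a recipient", "{\"type\":\"object\"}");
    for hit in reg.search("weather") {
        println!("{}: {}", hit.name, hit.description);
    }
}
```

The point is that the search results are a handful of tokens per tool, while the schemas stay parked on the client until needed.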
It turns out Rig, the Rust LLM framework I'm already using, has a version of this built in.
Rig's "RAG-enabled tools" let you implement a ToolEmbedding trait on your tools, store them in a vector store, and retrieve the most relevant ones at query time using .dynamic_tools(n, vector_store_index, toolset). It's the client-side tool search pattern, using embedding-based semantic retrieval rather than keyword matching. The mechanism is the same as document RAG, applied to tool definitions instead of documents. I hadn't realised the utility of this before, but the infrastructure for tool search is already in my stack.
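Stripped of the framework, the embedding-based retrieval step is just nearest-neighbour search over tool vectors. Here's a toy illustration with hand-written vectors standing in for real embeddings; top_n_tools and the numbers are my invention and mirror only the shape of .dynamic_tools, not Rig's actual internals:

```rust
/// Cosine similarity between two vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Return the names of the n tools whose embeddings are closest to the
/// query embedding, highest similarity first.
fn top_n_tools(tools: &[(&str, Vec<f64>)], query: &[f64], n: usize) -> Vec<String> {
    let mut scored: Vec<(f64, &str)> = tools
        .iter()
        .map(|(name, emb)| (cosine(emb, query), *name))
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    scored
        .into_iter()
        .take(n)
        .map(|(_, name)| name.to_string())
        .collect()
}

fn main() {
    // In a real setup these vectors come from an embedding model over the
    // tool descriptions; here they are toy values.
    let tools = vec![
        ("get_weather", vec![1.0, 0.0, 0.1]),
        ("send_email", vec![0.0, 1.0, 0.0]),
        ("search_docs", vec![0.2, 0.1, 1.0]),
    ];
    let query = vec![0.9, 0.1, 0.2]; // "what's the weather in Berlin?"
    println!("{:?}", top_n_tools(&tools, &query, 1));
}
```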
I'll probably take a hybrid approach by keeping a few core tools always loaded and deferring everything else.
Programmatic tool calling: let the LLM write code
Instead of calling tools one at a time through the standard tool-calling protocol, the LLM writes code that orchestrates multiple tool calls, processes results with proper programming constructs (loops, conditionals, aggregation), and returns only the final output. The code runs in a sandbox with no direct network access. Tool calls inside the generated code go through a bridge back to the host application, which handles authentication and routing.
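The host-side bridge can be sketched as a plain dispatch table. This is an illustration of the pattern rather than any particular framework's API, and the tool names and handlers are made up; the point is that routing, allowlisting, and auth live on the host, never in the generated code:

```rust
use std::collections::HashMap;

/// Host-side bridge for programmatic tool calling. Generated code runs in
/// a sandbox with no network access, so every tool call crosses this
/// boundary, where the host enforces an allowlist and does the real work.
struct ToolBridge {
    handlers: HashMap<String, Box<dyn Fn(&str) -> String>>,
    allowed: Vec<String>,
}

impl ToolBridge {
    fn new() -> Self {
        Self {
            handlers: HashMap::new(),
            allowed: Vec::new(),
        }
    }

    /// Register a handler and permit the sandbox to call it.
    fn register(&mut self, name: &str, handler: Box<dyn Fn(&str) -> String>) {
        self.handlers.insert(name.to_string(), handler);
        self.allowed.push(name.to_string());
    }

    /// The single entry point exposed to the sandbox: a (tool name, JSON
    /// args) pair comes in, a result or a refusal goes back.
    fn call(&self, name: &str, args_json: &str) -> Result<String, String> {
        if !self.allowed.iter().any(|a| a == name) {
            return Err(format!("tool '{}' not permitted", name));
        }
        self.handlers
            .get(name)
            .map(|h| h(args_json))
            .ok_or_else(|| format!("unknown tool '{}'", name))
    }
}

fn main() {
    let mut bridge = ToolBridge::new();
    bridge.register("get_weather", Box::new(|args| format!("sunny in {}", args)));
    println!("{:?}", bridge.call("get_weather", "Berlin"));
    println!("{:?}", bridge.call("drop_database", "{}"));
}
```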
This approach can achieve higher accuracy with much lower token usage. Anthropic reports average token usage dropping from 43,588 to 27,297 (a 37% reduction) on complex research tasks, and accuracy improvements on the GAIA benchmark from 46.5% to 51.2%. A third-party test by The AI Automators backed this up: a budget compliance check across 20 team members took 56 tool calls and 76,000 tokens with traditional calling and still missed a result. The same task with programmatic calling took 4 to 12 tool calls, used fewer tokens, and got all results correct.
Cloudflare has two takes on this. Their original Code Mode converts MCP tool schemas into TypeScript type definitions and runs generated code in V8 isolates. Their newer Code Mode MCP server takes it further, working against Cloudflare's OpenAPI spec rather than MCP schemas. The model writes JavaScript to call search() and execute(), exposing the entire Cloudflare API through just two tools and consuming around 1,000 tokens regardless of how many API endpoints sit behind it. When I first saw this approach, I joked it was RCE-as-a-Service, but it actually looks quite cool if you can get the sandboxing and permissions worked out.
For my Rust stack, the sandbox question is still open. Pydantic's Monty is appealing because it's a Rust-based Python interpreter that boots in single-digit microseconds, but it only supports a subset of Python. I'm also curious about what could be achieved with something like Rhai, a pure Rust embeddable scripting language. There's a lot to think about and get right here, including sandboxing, expressiveness, how well LLMs can actually generate code for the target language, security, and performance.
I still think that for recurring, well-defined tasks, it's better to use pre-written scripts (a "skills" system) rather than having the LLM generate code every time. Programmatic tool calling is most valuable for novel, ad-hoc queries where the specific combination of tools and logic can't be predicted in advance. I want to experiment with this, but I don't have a specific use case right now.
Tool use examples: few-shot prompting for tools
The third pattern is simpler. JSON schemas define structure but can't express usage patterns. Tool use examples provide concrete input/output demonstrations that show the LLM exactly how to call a tool correctly.
Anthropic's testing showed parameter accuracy improved from 72% to 90% with examples. The best practices are to add one to five examples per tool, use realistic data, show variety in how the tool can be called, and focus on cases where correct usage isn't obvious from the schema alone.
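As an illustration, a tool definition with examples might look like the following. The create_task tool and its fields are invented for this sketch, and the input_examples field name follows Anthropic's advanced tool use beta, so check the current docs before relying on it:

```json
{
  "name": "create_task",
  "description": "Create a task in the project tracker",
  "input_schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "due": { "type": "string", "description": "ISO 8601 date" },
      "priority": { "type": "string", "enum": ["low", "normal", "high"] }
    },
    "required": ["title"]
  },
  "input_examples": [
    { "title": "Renew TLS certificate", "due": "2026-03-01", "priority": "high" },
    { "title": "Write retro notes" }
  ]
}
```

The second example earns its keep by showing that optional fields really can be omitted, which a schema states but rarely demonstrates.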
Tool search and tool use examples aren't compatible in Anthropic's current API. If you need examples for a specific tool, that tool needs to stay in standard (non-deferred) mode. A skills-based approach can serve a similar purpose, though. When the agent loads a skill file, it gets instructions and example invocations as part of the context, achieving the same effect through context engineering rather than a separate API feature.
What I'm building next
First, I'm going to try the client-side tool registry with search: it's low-effort, high-impact, and works with any model. Second, I want to add sandboxed code execution once I've figured out the right sandbox approach for a Rust host.
I also still think the skills-based approach offers the best value. This means using skill descriptions and providing a CLI or scripts to access additional capabilities. The Skill + CLI combination is hard to beat because it's powerful and understandable.
I'll write more as I build this out. If you're working on similar problems, or if you've already implemented any of these patterns, I'd love to hear what you've found. Drop me a line.
Sources
- Advanced Tool Use (Anthropic)
- Code Execution with MCP (Anthropic)
- Effective Context Engineering for AI Agents (Anthropic)
- Tool Search Docs (Anthropic)
- Code Mode (Cloudflare)
- Code Mode MCP (Cloudflare)
- Pydantic Monty (Pydantic)
- Context Engineering for AI Agents (Manus)