In my previous post on AI engineering, I argued that much of the work comes down to context management. Keep the context clean. Stay in the smart zone. Don't let the model guess.
I've been researching this more, and a lot of my recent insight has come from listening to Jeff Huber. He's the CEO of Chroma, the company behind the context rot research I referenced in that post. He's been making a case across several podcasts that I find compelling: context engineering isn't just a technique. It's the discipline of building AI systems.
Huber comes at this from the search and retrieval side as he's building infrastructure for agentic search. But the principles he's articulating extend well beyond search. I've been finding them just as applicable to agentic coding, and I suspect they hold for any system where an LLM needs the right information at the right time.
I spent some time pulling together his key ideas from a Vanishing Gradients episode and a few other appearances. Here's what stuck with me.
Stop saying RAG
Huber refuses to use the term "RAG." His argument is that it conflates three separate things (retrieval, augmentation, and generation) into one. The term that's becoming standard instead is context engineering: the discipline of figuring out what should be in the context window for any given LLM generation step. It's a better name because it describes the actual job. And it gives the work the status it deserves. This isn't prompt fiddling, it's engineering.
In a traditional MVC CRUD app, your business logic is encoded in controllers. In an AI app, your business logic is encoded in context.
— Jeff Huber
The key architectural decisions in an AI system are about what the model sees and when. This follows from the insight that an LLM is stateless: its output depends entirely on its input. The performance of the system comes from what we build around the model to feed it the right thing. I'm starting to think about agentic AI systems as having four key concerns: model choice, the agentic harness, context engineering, and orchestration. Of those four, context engineering is what we're talking about here.
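To make that concrete, here's a minimal sketch of what "business logic in the context" can look like. Everything in it is illustrative rather than prescriptive: the section headings, the character budget, and the choice to keep only the last few turns of history are my assumptions, not anything Huber specifies.

```python
def build_context(task: str, retrieved: list[str], history: list[str],
                  budget_chars: int = 8_000) -> str:
    """Assemble the context for a single generation step.

    The policy encoded here (what gets included, in what order, under
    what budget) is where the system's behaviour actually lives.
    """
    sections = [
        "## Instructions\nAnswer using only the material provided below.",
        "## Relevant material\n" + "\n---\n".join(retrieved),
        "## Recent conversation\n" + "\n".join(history[-5:]),  # keep only the tail
        "## Task\n" + task,
    ]
    context = "\n\n".join(sections)
    return context[:budget_chars]  # crude cut-off; a real system counts tokens
```

Change the retrieval policy, the ordering, or the budget and you've changed the application's behaviour, without touching a model.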
Two loops
Huber breaks context engineering into an inner loop and an outer loop.
The inner loop is what goes into the context window right now, for this specific generation step. You have N candidate chunks of information and Y available slots. The job is to curate from potentially millions of candidates down to the handful that matter for this exact moment.
The outer loop is how you get better at the inner loop over time. Build, test, deploy, monitor, iterate. The classic software development cycle, applied to context quality.
This framing is useful because it separates two different kinds of work. The inner loop is the mechanics of assembling context, including retrieval, filtering, reranking, prompt construction. The outer loop is about measurement, feedback, and systematic improvement. It's easy to focus almost entirely on the inner loop and barely touch the outer.
Gather, then glean
For the inner loop, Huber describes a two-stage process:
Stage one: gather. Cast a wide net. Maximise recall. Use semantic search, keyword search, metadata filters, API calls, conversation history. You'll grab irrelevant things. That's fine.
Stage two: glean. Cull the candidates to the minimal set that actually matters. Rerank using cross-encoders, reciprocal rank fusion, or increasingly just LLMs directly. Go from a few hundred down to the 20 or so that belong in the context window.
The two stages optimise for different things. Gather optimises for not missing anything important; glean optimises for not including anything distracting. You need both.
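Here's a minimal sketch of the two stages in Python. The candidate lists stand in for real semantic, keyword, and metadata retrievers, and the glean step uses plain reciprocal rank fusion; a production system might glean with a cross-encoder or an LLM reranker instead.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one; k damps the effect of rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Gather: cast a wide net. These lists stand in for real retriever output.
semantic_hits = ["doc_12", "doc_07", "doc_31", "doc_02"]
keyword_hits  = ["doc_07", "doc_12", "doc_44", "doc_09"]
metadata_hits = ["doc_31", "doc_07", "doc_50"]

# Glean: fuse the candidates, then keep only the handful that matter.
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits, metadata_hits])
context_docs = fused[:3]
print(context_docs)  # ['doc_07', 'doc_12', 'doc_31']
```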
Huber's framing here is search-specific, but the underlying problem applies everywhere. It's about context assembly and selecting the right parts from a larger pool. For agentic coding, I'm still doing this fairly manually as I learn what works. It's something I'm actively working on improving and automating.
Huber also makes an important point here that the most dangerous information isn't the obviously irrelevant stuff. It's the information that looks relevant but isn't, for some subtle reason. That's what causes the model to confidently go down the wrong path. Tight gleaning protects against this.
The outer loop is key
The outer loop is where the real leverage is. You observe what your system actually does, compare it to what it should have done, and feed that back into how you build context next time. Without this, every change is a guess. With it, you're doing engineering.
Huber's version of this, coming from search, is the golden dataset. He recommends a spreadsheet of query-information pairs that define what your system should retrieve for given inputs. His advice for creating one is disarmingly simple: get your team together for an evening, buy some pizzas, spend a few hours writing pairs for every use case you can think of. Then improve it over time by studying what users actually query, analysing what succeeded and what failed, and wiring the results into CI.
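Even a tiny golden dataset gives you something CI can enforce. Here's a sketch of what that might look like; the queries, document IDs, and recall threshold are all invented, and the stub retriever stands in for whatever gather stage you're actually iterating on.

```python
# Golden dataset: for each query, the documents the system should surface.
GOLDEN = {
    "how do I reset my password?": {"kb_auth_04", "kb_auth_11"},
    "which plans include SSO?": {"kb_billing_02"},
}

def recall_at_k(retrieve, k: int = 20) -> float:
    """Average fraction of golden documents found in the top-k results."""
    per_query = []
    for query, expected in GOLDEN.items():
        found = set(retrieve(query)[:k])
        per_query.append(len(expected & found) / len(expected))
    return sum(per_query) / len(per_query)

def fake_retrieve(query: str) -> list[str]:
    """Stand-in retriever so the sketch runs; swap in the real gather stage."""
    return ["kb_auth_04", "kb_auth_11", "kb_billing_02", "kb_misc_99"]

def test_retrieval_quality():
    """Wire this into CI so regressions in context quality fail the build."""
    assert recall_at_k(fake_retrieve) >= 0.9

test_retrieval_quality()
```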
For agentic coding, I'm finding the outer loop looks different but follows the same shape. It's about studying where the agent followed the plan and where it diverged, what context was missing when it made a bad decision, what assumptions it hallucinated because the right information wasn't in the window. Each of those failure cases becomes a lesson that feeds back into how I structure research, write plans, and assemble context for the next session. The research-plan-implement cycle I described previously is really an inner loop. The outer loop is how that cycle gets refined through experience.
The underlying principle is the same regardless of domain: you need a way to measure whether your context engineering is actually getting better. Huber calls the gap between demo and production "alchemy." The outer loop is what turns it into engineering.
Keeping context under control
Agentic workflows pile up tokens through multi-step interactions. You need strategies for keeping context windows clean. In my experience, there are two: summarise and delegate. They look similar but work at different points.
Summarising deals with context that's already accumulated. As a conversation grows, you extract what matters and discard the rest. This is what Dex called intentional compaction. The research-plan-implement cycle I've written about is essentially this. Each phase produces a compressed artefact that replaces the sprawl of the previous phase. It's reactive. When the context has grown, you compact it.
Delegating prevents the tokens from entering the main context in the first place. You hand work to a sub-agent that operates in its own isolated context window. It does the messy, token-heavy exploration, and only a concise result crosses back into the parent. Huber frames this as encapsulation, borrowing from software engineering, and I think that's exactly right. The same principle as keeping functions small and interfaces narrow, applied to context windows. The sprawl never reaches the main agent at all.
I use both. Sub-agents explore different parts of a codebase in parallel, each in a fresh context. Only their compressed summaries come back. And within a conversation, I compact between phases rather than letting history accumulate.
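To make the distinction concrete, here's a rough sketch of both strategies. The prompts and the `llm` callable are placeholders; the point is only that compaction shrinks history that already exists, while delegation keeps the sub-agent's tokens out of the parent context entirely.

```python
from typing import Callable

LLM = Callable[[str], str]  # stand-in for any model client

def compact(history: list[str], llm: LLM) -> list[str]:
    """Summarise: replace accumulated history with one compressed artefact."""
    summary = llm(
        "Summarise the key decisions, constraints and open questions so far:\n\n"
        + "\n".join(history)
    )
    return ["Summary of earlier work:\n" + summary]

def delegate(task: str, llm: LLM) -> str:
    """Delegate: run exploration in an isolated context; only this concise
    result crosses back into the parent conversation."""
    return llm(
        "You are a research sub-agent. Investigate the task below and reply "
        "with a short, self-contained summary of what you found.\n\n" + task
    )

# Stub model so the sketch runs end to end; swap in a real client.
stub_llm: LLM = lambda prompt: f"(model response to {len(prompt)} chars of input)"
history = compact(["user: explore the auth module", "agent: found three entry points"], stub_llm)
```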
Scaffolding has a shelf life
Huber makes a strong argument that the scaffolding around LLMs should get simpler as models improve, not more complex. Teams that build elaborate workarounds for model weaknesses end up maintaining dead weight when the next model doesn't have those weaknesses. He points out that Manus has been re-architected five times since March 2024. Anthropic regularly strips out Claude Code's agent scaffolding as models get more capable.
I can relate to this directly. A few years ago at Peppy, I wasn't building the RAG system itself, but I was building components around it and could see what was going on. There was a lot of scaffolding in place to compensate for model limitations. Looking back, much of that could be dramatically simplified now. I've always aimed to build things out of smaller, replaceable parts. I haven't always managed to achieve that in practice. But that instinct serves you well here. If you expect the scaffolding to have a shelf life, composability isn't just good engineering, it's mandatory.
As models improve, it's tempting to lean on what they already know and let them fill the gaps from their training data. But I don't really want the model generating information. I want it synthesising from what's been provided. Which brings it right back to context engineering: make sure the right information is in the window.
Huber also argues that the cost of rebuilding is dramatically lower now, so teams should lean into impermanence. I've done some experiments with natural language specs and rebuilding parts of systems, and I can see the direction of travel. But I still think we're in the early days of learning how building with these tools actually works. I don't want to claim more confidence than I have on that one.
What I'm taking from this
These are the main insights I'm taking from Huber that are influencing my own work now:
Name the primitives. Don't say "RAG." Be explicit about the components that make up context engineering. Retrieval, filtering, reranking, context assembly, and evaluation are separate concerns you can reason about, measure, and improve independently.
Close the outer loop. Find a way to measure context quality over time. "Does this feel better?" isn't good enough. Instrumentation matters, and so does evaluation against known data.
Respect context rot. I was already doing this for coding, but it applies to every AI system. Tight, structured contexts beat maximal windows. Always.
Embrace the rebuild. Stop trying to build permanent AI infrastructure. Build for the current model generation, keep things simple enough to rip out, and accept that the next model might change everything.
Start simple, stay simple. Exhaust prompt engineering and basic workflows before reaching for agents and complex retrieval. The premature complexity trap is real, and it's expensive.
There's a lot more in the full episode — Huber goes deep on hybrid search tradeoffs, evaluation practices, and the demo-to-production gap. Worth the listen if you're building anything that puts information in front of an LLM.
I'm curious whether others are finding the same things. Is context engineering the frame you're using, or something different? Drop me a line. I'd love to hear what's working for you.
Sources
- Vanishing Gradients Ep. 65: The Rise of Agentic Search (Jeff Huber with Hugo Bowne-Anderson)
- Context Rot: How Increasing Input Tokens Impacts LLM Performance (Chroma Research)
- Latent Space: RAG is Dead, Context Engineering is King (Jeff Huber)