42 posts tagged with "context-engineering"

The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped

· 11 min read
Tian Pan
Software Engineer

A long-running agent generates a plan at step 3. The plan reads something like: "if the order returned by get_order in step 1 has status shipped, send the customer a tracking email; otherwise open a refund ticket." The agent confidently picks the email branch. The customer never received a tracking number, because the order was actually still pending. You go to the trace expecting to find a hallucination. What you find is worse: the step-1 tool result is no longer in context. The pruner evicted it between step 2 and step 3: it ranked low on recency, and a 12KB transcript needed the room. The plan still ran. The branch was still chosen. The decision now points at evidence that does not exist.

This is not a model failure in the usual sense. The model produced a syntactically valid plan, executed it in order, and made a branch decision. The branch was made against a fact that used to be in context and is not anymore. The chain of thought encoded the condition (if status == "shipped"); the actual status got dropped on the way to the step that needed it. The plan looks deterministic, but it has been quietly cut loose from its evidence.
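One way to make this failure visible is to have each plan step declare which tool results its branch condition reads, and to check those references before branching. The sketch below is a minimal illustration under that assumption; `PlanStep`, `depends_on`, `missing_evidence`, and `prune` are hypothetical names, not part of any particular agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    description: str
    # IDs of the tool results this step's branch condition reads (hypothetical field).
    depends_on: list[str] = field(default_factory=list)


def missing_evidence(step: PlanStep, context_ids: set[str]) -> list[str]:
    """Return the dependency IDs that are no longer present in context."""
    return [dep for dep in step.depends_on if dep not in context_ids]


def prune(context: dict[str, str], evict: list[str], pinned: set[str]) -> dict[str, str]:
    """Drop the entries listed in `evict`, except those a pending step has pinned."""
    return {k: v for k, v in context.items() if k not in evict or k in pinned}


context = {
    "step1:get_order": '{"status": "pending"}',
    "step2:transcript": "12KB transcript placeholder",
}
branch = PlanStep("if status == 'shipped', send tracking email", ["step1:get_order"])

# Naive pruning evicts the step-1 result; the check catches it before the branch runs.
naive = prune(context, evict=["step1:get_order"], pinned=set())
print(missing_evidence(branch, set(naive)))  # ['step1:get_order']

# Pinning the dependency keeps the evidence available to the branch.
safe = prune(context, evict=["step1:get_order"], pinned=set(branch.depends_on))
print(missing_evidence(branch, set(safe)))  # []
```

A non-empty result is a signal to re-fetch the tool result or to refuse the branch, rather than to let the model decide against evidence it can no longer see.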

The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed

· 9 min read
Tian Pan
Software Engineer

A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.

Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.
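A pruner tuned only on age and length has no way to know that turn 3 is a standing constraint. A minimal sketch of the alternative, assuming some upstream step marks constraint-bearing turns with a hypothetical `pinned` flag, keeps pinned turns first and spends the remaining budget on the newest turns:

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int
    text: str
    tokens: int
    pinned: bool = False  # hypothetical flag: the turn states a standing constraint


def prune_to_budget(turns: list[Turn], budget: int) -> list[Turn]:
    """Keep all pinned turns, then fill what is left of the budget newest-first."""
    kept = [t for t in turns if t.pinned]
    remaining = budget - sum(t.tokens for t in kept)
    unpinned_newest_first = sorted(
        (t for t in turns if not t.pinned), key=lambda t: -t.index
    )
    for turn in unpinned_newest_first:
        if turn.tokens <= remaining:
            kept.append(turn)
            remaining -= turn.tokens
    return sorted(kept, key=lambda t: t.index)


turns = [Turn(3, "I'm vegetarian and on a tight budget.", 12, pinned=True)]
turns += [Turn(i, f"filler turn {i}", 100) for i in range(4, 14)]
turns.append(Turn(14, "what should I cook tonight?", 10))

kept = prune_to_budget(turns, budget=350)
print([t.index for t in kept])  # [3, 11, 12, 13, 14]
```

The token counts and the budget of 350 are made up for illustration. The sketch also does not handle the case where pinned turns alone exceed the budget, which a real pruner would need to.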

The Conversation Tree Your Server Stored As A Log

· 10 min read
Tian Pan
Software Engineer

A user types "actually, I meant fifty, not fifteen," hits the pencil icon on their last message, and edits it. The UI does what good UIs do: it shows them the corrected message, fades out the old one, scrolls the assistant's stale reply into a struck-through ghost, and presents a clean conversation that reads as if the original mistake never happened. The user, satisfied, sends the next turn. The agent answers using fifteen.

The bug is not in the model. The model received exactly what the server sent it, and the server sent it the original message, the original assistant response, the regret, the edited message, and the new request — all concatenated, all in order, all live. The user is having a conversation they edited. The agent is having a conversation that was never edited. The two transcripts diverge at turn three and never reconcile, and every subsequent turn pays interest on the gap.
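The divergence can be sketched as a data-model problem: if each message stores a parent pointer, an edit becomes a new sibling branch, and the prompt is built from the active path only. The following is an illustrative sketch; `Message`, `parent_id`, and `active_path` are assumed names, and the conversation is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    id: str
    parent_id: Optional[str]
    role: str
    text: str


def active_path(messages: dict[str, Message], leaf_id: str) -> list[Message]:
    """Walk parent links from the current leaf up to the root, then reverse."""
    path = []
    current: Optional[str] = leaf_id
    while current is not None:
        msg = messages[current]
        path.append(msg)
        current = msg.parent_id
    return list(reversed(path))


log = [
    Message("u1", None, "user", "I need a quote."),
    Message("a1", "u1", "assistant", "For how many units?"),
    Message("u2", "a1", "user", "fifteen"),
    Message("a2", "u2", "assistant", "Quote for fifteen units."),
    Message("u2-edit", "a1", "user", "fifty"),  # the edit is a sibling of u2
    Message("a2-edit", "u2-edit", "assistant", "Quote for fifty units."),
    Message("u3", "a2-edit", "user", "Add shipping."),
]
messages = {m.id: m for m in log}

path = active_path(messages, "u3")
print([m.id for m in path])  # ['u1', 'a1', 'u2-edit', 'a2-edit', 'u3']
```

Sending `log` in order reproduces the bug described above. Sending `path` gives the model the conversation the user actually sees, while the abandoned branch stays in storage for audit.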

The OOO Auto-Reply Your Agent Did Not Read

· 8 min read
Tian Pan
Software Engineer

Your support agent pages a human at 2 a.m. The human has been out for a week. The OOO message lives in the same inbox the agent is reading. The agent pings the human anyway. The auto-reply lands. The agent thanks it politely and pings again, because the reply did not contain the resolution code it was waiting on. Twelve cycles in, somebody on a different team notices the unread thread is now sixty messages deep and goes to wake up the on-call manually.

The agent did exactly what the prompt told it to do. The prompt told it to escalate to a person. The person was a string, not a role. The string did not know about PTO.
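The difference between a string and a role can be made concrete: a role is resolved at escalation time against availability, while a string is not. A minimal sketch, assuming a hypothetical rotation list with an `out_until` date per responder:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Responder:
    name: str
    out_until: Optional[date] = None  # last day out of office, inclusive; None = available


def resolve_on_call(rotation: list[Responder], today: date) -> Optional[Responder]:
    """Return the first responder in rotation order who is not out of office today."""
    for responder in rotation:
        if responder.out_until is None or today > responder.out_until:
            return responder
    return None  # nobody available: the caller should fall back to another channel


rotation = [
    Responder("primary", out_until=date(2025, 1, 10)),
    Responder("secondary"),
]
print(resolve_on_call(rotation, date(2025, 1, 8)).name)   # secondary
print(resolve_on_call(rotation, date(2025, 1, 11)).name)  # primary
```

The names and dates are invented. In practice the availability data would come from whatever calendar or paging system the team already uses.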

MCP Server Sprawl: The Unbounded Tool Surface Nobody Owns

· 9 min read
Tian Pan
Software Engineer

The Model Context Protocol did exactly what it set out to do: it made giving an agent a new capability almost free. Wiring in a calendar server, a database server, an internal company server, or one of the 30,000-tool catalogs that vendors now publish is a config change, not a project. That frictionlessness is the feature. It is also the problem.

Because adding a tool is cheap, every team adds tools. The data team wires in a warehouse server. The support team adds a ticketing server. Someone connects a filesystem server for a one-off task and never removes it. None of these decisions is wrong. But there is no decision that owns their sum — the aggregate tool surface your agent now carries on every single request. The tool list has become a dependency graph with a real carrying cost, and in most organizations it is the one dependency graph nobody is responsible for.

The result is sprawl: a tool catalog that grows monotonically, gets reviewed by no one, costs more every quarter, and quietly makes the agent worse. This is the unowned surface, and it deserves the same scrutiny you already give your API surface and your npm tree.

The Token Budget Is a Product Decision, Not a Config Value

· 10 min read
Tian Pan
Software Engineer

Somewhere in your codebase there is a line that looks like retriever.search(query, top_k=8). An engineer wrote that 8 in an afternoon. It was never reviewed by anyone outside the team, never appeared in a spec, and has never been revisited. That single integer decides how much of your context window goes to retrieved documents instead of conversation history, how much each request costs, how slow the response feels, and — because of how language models actually behave at length — how accurate the answer is.

That is a product decision. It is sitting in a keyword argument.

The Context Window Is a Commons, and Every Team Is Grazing It

· 10 min read
Tian Pan
Software Engineer

Open a production agent and count what is in the context window before the user has typed a single character. There is a system prompt the platform team owns. There are tool definitions — forty of them, maybe more — each carrying a name, a description, a JSON schema, field-level docs, and a handful of enums. There is a block of retrieved examples that the search team added because few-shot helped one eval. There are six lines of safety instructions from trust and safety, four lines of formatting rules from the design team, and a paragraph of domain glossary that someone added during an incident and nobody removed.

Add it up and the agent boots with 30,000 tokens of overhead. On a connected setup with three MCP servers, that number is routinely far worse — one widely cited measurement put three servers at 143,000 of a 200,000-token budget, 72% of the window consumed before the conversation starts. None of it is wrong. Every line was added by someone solving a real problem. And that is exactly why the context window is being destroyed.

Onboarding an Agent Like a Junior Engineer Is a Category Error

· 9 min read
Tian Pan
Software Engineer

When an agent joins your team, the nearest analogy in every engineering manager's head is the new hire. So the playbook writes itself: give it a sandbox and read-only logs, scope the first tasks small, pair with it, expect a ramp-up period, and grow it into bigger work as trust accumulates. It feels responsible. It feels like the same patient management that turned your last junior into a senior.

It is also a category error — not a slightly imperfect analogy, but a wrong one. A junior engineer is a person who does not yet know your system. An agent is a stateless function that will never know your system, no matter how many times it touches it. Those are different kinds of things, and the management instincts that work for one quietly misallocate your attention on the other.

The reason this matters is that the metaphor doesn't just mislead — it tells you to invest in the wrong place. "Grow the agent" is not a strategy. The agent is fixed. Everything you can actually change lives outside of it.

Token Budgets Are a Scheduling Problem, Not a Prompt Problem

· 9 min read
Tian Pan
Software Engineer

When an agent gives a worse answer than it did last week, the first instinct is to blame the prompt. Someone reworks the system instructions, trims a few sentences, adds an example, and ships. Sometimes it helps. Often it does nothing, because the prompt was never the problem. The problem is that a single verbose tool result quietly consumed 18,000 tokens, pushed the actual task instructions into the low-attention middle of the context window, and left the model reasoning over a transcript that is 70% noise.

That is not a wording problem. That is a resource-allocation problem. And resource allocation has a name in systems engineering: scheduling. The context window is a fixed-size resource, multiple consumers compete for it, and right now most agent stacks "schedule" it the way a 1960s batch system scheduled memory — first come, first served, until it runs out.
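As a rough illustration of the alternative to first come, first served, the sketch below grants window space in priority order, so a verbose tool result absorbs the shortfall instead of the task instructions. `Request`, `schedule`, the priority convention, and the 20,000-token window are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    consumer: str
    tokens: int    # tokens the consumer wants to place in the window
    priority: int  # lower number = more important (assumed convention)


def schedule(requests: list[Request], window: int) -> dict[str, int]:
    """Grant tokens in priority order; each request gets at most what is left."""
    grants: dict[str, int] = {}
    remaining = window
    for req in sorted(requests, key=lambda r: r.priority):
        granted = min(req.tokens, remaining)
        grants[req.consumer] = granted
        remaining -= granted
    return grants


requests = [
    Request("tool_result", 18_000, priority=2),
    Request("task_instructions", 2_000, priority=0),
    Request("history", 6_000, priority=1),
]
print(schedule(requests, window=20_000))
# {'task_instructions': 2000, 'history': 6000, 'tool_result': 12000}
```

A grant smaller than the request tells the caller to truncate or summarize that consumer's content. How to do that well is a separate problem this sketch does not address.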

The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window

· 11 min read
Tian Pan
Software Engineer

Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Connect a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and a heavy desktop configuration has been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.

This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way teams once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.
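Attributing the cost back to the connector can start with something very simple. The sketch below estimates, per server, how many tokens its tool definitions add to every planning turn. The four-characters-per-token ratio is a crude assumption rather than a real tokenizer, and the server names and tool definitions are invented for illustration.

```python
import json


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def disclosure_cost(servers: dict[str, list[dict]]) -> dict[str, int]:
    """Estimated tokens each server's tool definitions add to every planning turn."""
    return {
        name: sum(estimate_tokens(json.dumps(tool)) for tool in tools)
        for name, tools in servers.items()
    }


# Hypothetical tool definitions, shaped loosely like name/description/schema entries.
servers = {
    "github": [
        {"name": "create_issue", "description": "Create an issue in a repository.",
         "schema": {"repo": "string", "title": "string", "body": "string"}},
        {"name": "list_pull_requests", "description": "List open pull requests.",
         "schema": {"repo": "string", "state": "string"}},
    ],
    "calendar": [
        {"name": "list_events", "description": "List events.", "schema": {"day": "string"}},
    ],
}

costs = disclosure_cost(servers)
for name, tokens in sorted(costs.items(), key=lambda item: -item[1]):
    print(f"{name}: ~{tokens} tokens per planning turn")
```

Swapping `estimate_tokens` for the model vendor's actual token counter would turn the estimate into a measurement; the per-server breakdown is the part that matters.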

Context Bloat: The AI Memory Leak You Cannot Grep For

· 12 min read
Tian Pan
Software Engineer

A long-running agent session that opened with a 2K context is now paying for 40K tokens of mostly-dead state. The retrieval results from turn three, the directory listing the agent already navigated past, the JSON dump from a tool call whose answer was a single integer — all of it is still riding shotgun on every subsequent inference call, billed in full, dragging on attention. The pattern is structurally identical to a memory leak: unbounded growth of unreferenced data. But no profiler will surface it, because the leak does not live in process memory. It lives inside the conversation history, and most agent frameworks ship without a collector.

The cost shows up in two places at once. The token bill grows quadratically — a 20-step loop where each step contributes 1,000 tokens produces roughly 210,000 cumulative input tokens, not 20,000, because every prior turn is rebilled on every subsequent call. And the model itself starts to degrade: by 50K tokens of accumulated noise, even a model with a 1M-token window has already lost double-digit points of accuracy on the actual task. You are paying more, to think worse, about a problem the model was already past three turns ago.
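The 210,000 figure follows from summing the context size at each call: step k resends k steps' worth of tokens, so the total is 1,000 × (1 + 2 + … + 20). A short check, assuming every step adds the same number of tokens and nothing is pruned or cached:

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens billed when every call resends all prior steps."""
    return sum(step * tokens_per_step for step in range(1, steps + 1))


total = cumulative_input_tokens(20, 1_000)
print(total)  # 210000

# Closed form: tokens_per_step * n * (n + 1) / 2, which grows with n squared.
assert total == 1_000 * 20 * 21 // 2
```

Doubling the loop to 40 steps gives 820,000 tokens, roughly four times the cost for twice the work, which is what quadratic growth means in practice.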

The First Token Lies: Why Context Loading—Not Inference—Controls Your AI Feature's Latency

· 9 min read
Tian Pan
Software Engineer

Most AI latency conversations focus on the wrong thing. Teams obsess over GPU utilization, model quantization, and batch sizes. Meanwhile, the latency that actually annoys users—the pause before the AI says anything at all—is determined almost entirely by what happens before inference starts. The bottleneck is context, not compute.

Time-to-first-token (TTFT) is the metric that determines whether your AI feature feels responsive or sluggish. And TTFT is dominated by the prefill phase: the time it takes to process the full input context before a single output token is generated. On a 128K-token context, prefill can take seconds. The GPU is working hard, but the user sees nothing.

The solution isn't a better GPU. It's pre-loading the context before the user asks anything.