Skip to main content

45 posts tagged with "context-engineering"

View all tags

The Conversation Summary Your Agent Regenerated Each Turn Because the Cache Key Included a Timestamp

· 11 min read
Tian Pan
Software Engineer

A cache that is being written to but never read from is not a cache. It is a logging system with extra latency, billed by the kilobyte. And the cruelest version of this failure mode is the one where the cache looks healthy from every angle except the one that matters: the set calls succeed, the get calls return quickly, the keys are well-formed, the values are valid, the TTLs are sensible. The only thing wrong is that no get call ever finds the key a previous set call wrote, because a single field in the key changes every time it is computed.

This is the story of a debugging session that added a timestamp to a cache key "so I can tell which cache entry I'm looking at," and the system that quietly paid for fourteen extra LLM calls per conversation for two weeks before anyone noticed.

The Summarizer That Paraphrased Away the User's Literal Question

· 8 min read
Tian Pan
Software Engineer

A user asks: "Does this qualify as a 'transfer' under article 28?" Forty turns later, the model gives an answer to a different question. The transcript shows the model answered the question it was given. The user is reading a complaint that reads like a hallucination. Both are right. The model never saw the user's question — it saw your summarizer's polite translation of it: "user asked about article 28 applicability."

The word "transfer" was the question. The summarizer threw it away because the summarizer's loss function was tuned to preserve facts, not wording, and the rubric never learned the difference between paraphrasing the topic and paraphrasing the constraint. Topic was preserved. Constraint became fog.

This failure mode is structural, not anecdotal. Any application that compresses long conversations with a model-generated summary has a second model in the critical path — one whose quality contract is usually treated as a token-budget knob rather than as a piece of product logic. That asymmetry is where the bug lives.

The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped

· 11 min read
Tian Pan
Software Engineer

A long-running agent generates a plan at step 3. The plan reads something like: "if the order returned by get_order in step 1 has status shipped, send the customer a tracking email; otherwise open a refund ticket." The agent confidently picks the email branch. The customer never received a tracking number, because the order was actually in pending. You go to the trace expecting to find a hallucination. What you find is worse: the step-1 tool result is no longer in context. The pruner evicted it between step 2 and step 3 — it ranked low on recency and there was a 12KB transcript to make room for. The plan still ran. The branch was still chosen. The decision now points at evidence that does not exist.

This is not a model failure in the usual sense. The model produced a syntactically valid plan, executed it in order, and made a branch decision. The branch was made against a fact that used to be in context and is not anymore. The chain of thought encoded the condition (if status == "shipped"); the actual status got dropped on the way to the step that needed it. The plan looks deterministic, but it has been quietly cut loose from its evidence.

The Compaction Strategy That Summarized Away the User's Original Question

· 10 min read
Tian Pan
Software Engineer

A user asked our support agent: "Why was invoice INV-2025-08-44719 charged twice on April 3rd?" Forty-five minutes and eighteen tool calls later, the agent confidently reported back: there was no evidence of any duplicate billing on the account that quarter. The user, understandably, escalated. When we replayed the trace, the answer became obvious. The agent had compacted its conversation at turn nine. The summary said the user was "asking about a duplicate charge in early April." It did not contain the string "INV-2025-08-44719." Every subsequent tool call — the ledger lookup, the chargeback API query, the audit log scan — was issued against a paraphrased intent, not the literal invoice number the user typed.

The bug was not in the tools. It was not in the model's reasoning. It was that our context manager had a contract with every downstream component, and nobody had written it down. The contract said: "I will preserve meaning." The components needed: "I will preserve strings."

The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed

· 9 min read
Tian Pan
Software Engineer

A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.

Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.

The Conversation Tree Your Server Stored As A Log

· 10 min read
Tian Pan
Software Engineer

A user types "actually, I meant fifty, not fifteen," hits the pencil icon on their last message, and edits it. The UI does what good UIs do: it shows them the corrected message, fades out the old one, scrolls the assistant's stale reply into a struck-through ghost, and presents a clean conversation that reads as if the original mistake never happened. The user, satisfied, sends the next turn. The agent answers using fifteen.

The bug is not in the model. The model received exactly what the server sent it, and the server sent it the original message, the original assistant response, the regret, the edited message, and the new request — all concatenated, all in order, all live. The user is having a conversation they edited. The agent is having a conversation that was never edited. The two transcripts diverge at turn three and never reconcile, and every subsequent turn pays interest on the gap.

The OOO Auto-Reply Your Agent Did Not Read

· 8 min read
Tian Pan
Software Engineer

Your support agent pages a human at 2 a.m. The human has been out for a week. The OOO message lives in the same inbox the agent is reading. The agent pings the human anyway. The auto-reply lands. The agent thanks it politely and pings again, because the reply did not contain the resolution code it was waiting on. Twelve cycles in, somebody on a different team notices the unread thread is now sixty messages deep and goes manually wake up the on-call.

The agent did exactly what the prompt told it to do. The prompt told it to escalate to a person. The person was a string, not a role. The string did not know about PTO.

MCP Server Sprawl: The Unbounded Tool Surface Nobody Owns

· 9 min read
Tian Pan
Software Engineer

The Model Context Protocol did exactly what it set out to do: it made giving an agent a new capability almost free. Wiring in a calendar server, a database server, an internal company server, or one of the 30,000-tool catalogs that vendors now publish is a config change, not a project. That frictionlessness is the feature. It is also the problem.

Because adding a tool is cheap, every team adds tools. The data team wires in a warehouse server. The support team adds a ticketing server. Someone connects a filesystem server for a one-off task and never removes it. None of these decisions is wrong. But there is no decision that owns their sum — the aggregate tool surface your agent now carries on every single request. The tool list has become a dependency graph with a real carrying cost, and in most organizations it is the one dependency graph nobody is responsible for.

The result is sprawl: a tool catalog that grows monotonically, gets reviewed by no one, costs more every quarter, and quietly makes the agent worse. This is the unowned surface, and it deserves the same scrutiny you already give your API surface and your npm tree.

The Token Budget Is a Product Decision, Not a Config Value

· 10 min read
Tian Pan
Software Engineer

Somewhere in your codebase there is a line that looks like retriever.search(query, top_k=8). An engineer wrote that 8 in an afternoon. It was never reviewed by anyone outside the team, never appeared in a spec, and has never been revisited. That single integer decides how much of your context window goes to retrieved documents instead of conversation history, how much each request costs, how slow the response feels, and — because of how language models actually behave at length — how accurate the answer is.

That is a product decision. It is sitting in an f-string.

The Context Window Is a Commons, and Every Team Is Grazing It

· 10 min read
Tian Pan
Software Engineer

Open a production agent and count what is in the context window before the user has typed a single character. There is a system prompt the platform team owns. There are tool definitions — forty of them, maybe more — each carrying a name, a description, a JSON schema, field-level docs, and a handful of enums. There is a block of retrieved examples that the search team added because few-shot helped one eval. There are six lines of safety instructions from trust and safety, four lines of formatting rules from the design team, and a paragraph of domain glossary that someone added during an incident and nobody removed.

Add it up and the agent boots with 30,000 tokens of overhead. On a connected setup with three MCP servers, that number is routinely far worse — one widely cited measurement put three servers at 143,000 of a 200,000-token budget, 72% of the window consumed before the conversation starts. None of it is wrong. Every line was added by someone solving a real problem. And that is exactly why the context window is being destroyed.

Onboarding an Agent Like a Junior Engineer Is a Category Error

· 9 min read
Tian Pan
Software Engineer

When an agent joins your team, the nearest analogy in every engineering manager's head is the new hire. So the playbook writes itself: give it a sandbox and read-only logs, scope the first tasks small, pair with it, expect a ramp-up period, and grow it into bigger work as trust accumulates. It feels responsible. It feels like the same patient management that turned your last junior into a senior.

It is also a category error — not a slightly imperfect analogy, but a wrong one. A junior engineer is a person who does not yet know your system. An agent is a stateless function that will never know your system, no matter how many times it touches it. Those are different kinds of things, and the management instincts that work for one quietly misallocate your attention on the other.

The reason this matters is that the metaphor doesn't just mislead — it tells you to invest in the wrong place. "Grow the agent" is not a strategy. The agent is fixed. Everything you can actually change lives outside of it.

Token Budgets Are a Scheduling Problem, Not a Prompt Problem

· 9 min read
Tian Pan
Software Engineer

When an agent gives a worse answer than it did last week, the first instinct is to blame the prompt. Someone reworks the system instructions, trims a few sentences, adds an example, and ships. Sometimes it helps. Often it does nothing, because the prompt was never the problem. The problem is that a single verbose tool result quietly consumed 18,000 tokens, pushed the actual task instructions into the low-attention middle of the context window, and left the model reasoning over a transcript that is 70% noise.

That is not a wording problem. That is a resource-allocation problem. And resource allocation has a name in systems engineering: scheduling. The context window is a fixed-size resource, multiple consumers compete for it, and right now most agent stacks "schedule" it the way a 1960s batch system scheduled memory — first come, first served, until it runs out.