161 posts tagged with "agents"

The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.
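To make the compounding concrete, here is a minimal back-of-envelope sketch. It assumes tool calls run sequentially and that a warm pool has already paid the startup cost before the turn begins. The function name and the 500 ms startup figure are hypothetical, chosen only for illustration; they are not measurements from any MCP implementation.

```python
def turn_latency_ms(model_ms, tool_calls, call_ms, startup_ms, warm_pool):
    """Estimate one agent turn with sequential tool calls.

    All inputs are illustrative assumptions, not measurements.
    """
    # Cold: every call re-pays the startup cost.
    # Warm: the pool paid it before the turn began, so the turn pays none.
    startups_paid = 0 if warm_pool else tool_calls
    return model_ms + tool_calls * call_ms + startups_paid * startup_ms


# Hypothetical numbers: 800 ms of model time, seven 200 ms calls,
# and an assumed 500 ms startup per cold invocation.
cold = turn_latency_ms(800, 7, 200, 500, warm_pool=False)
warm = turn_latency_ms(800, 7, 200, 500, warm_pool=True)
print(cold, warm)  # 5700 2200
```

The startup term is the only one multiplied by the step count without buying any work, which is why it dominates as the loop gets longer.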

Multi-Axis Agent Bisection: When the Regression Lives in the Interaction

· 11 min read
Tian Pan
Software Engineer

Quality regressed overnight. The on-call engineer pulls up the dashboard, traces a few bad sessions, and starts the obvious bisection: the model provider rotated to a new snapshot at 02:00 UTC, so revert to the pinned older alias. Eval suite still red. Roll back yesterday's prompt change. Still red. Pin the retrieval index back to last week's version. Still red. Each owning team rolls back their own axis in isolation and reports "not us." Three hours in, nobody owns the diagnosis because nobody owns the interaction surface where the regression actually lives — the new model interpreting the new tool description in a way the old model never would have.

This is the failure mode single-axis tooling can't solve. git bisect works because the search space is one-dimensional: a linear sequence of commits. An agent doesn't have one timeline. It has four or five timelines running in parallel — model snapshot, system prompt, tool catalog, retrieval index, sampling config — each with its own owner, its own deploy cadence, and its own "rollback" button that returns just its axis to a known state. The regression you're chasing is often a two-factor interaction, and bisecting along any single axis returns false negatives because the bug only fires on the cross-product cell where the new model meets the new tool description.
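One way to picture the cross-product search is the sketch below. It assumes each axis has exactly one known-good and one suspect value, and that you can re-run the eval for any combination. The names `find_interaction` and `is_regressed`, and the axis labels, are hypothetical stand-ins rather than any existing tool's API.

```python
from itertools import combinations


def find_interaction(axes, is_regressed):
    """Return the smallest set of axes whose new values reproduce the regression.

    axes: dict mapping axis name -> (known_good_value, suspect_value)
    is_regressed: callable taking a config dict, returning True if evals fail
    """
    baseline = {name: good for name, (good, _suspect) in axes.items()}
    # Try single axes first, then pairs, then triples, and so on.
    for size in range(1, len(axes) + 1):
        for subset in combinations(axes, size):
            config = dict(baseline)
            for name in subset:
                config[name] = axes[name][1]
            if is_regressed(config):
                return subset
    return None


axes = {
    "model": ("snapshot-old", "snapshot-new"),
    "prompt": ("v1", "v2"),
    "tools": ("catalog-old", "catalog-new"),
}


def is_regressed(config):
    # Hypothetical bug: fires only when the new model meets the new tool catalog.
    return config["model"] == "snapshot-new" and config["tools"] == "catalog-new"


print(find_interaction(axes, is_regressed))  # ('model', 'tools')
```

Every single-axis flip comes back green here, which is exactly the false negative described above. The search is exponential in the number of axes in the worst case, so with four or five axes it is affordable, but it does require that someone can set all axes at once.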

Multimodal Channel Disagreement: When One Model Contradicts Itself Across Vision and Text

· 11 min read
Tian Pan
Software Engineer

The image is a photograph of a red octagonal stop sign. Someone has stuck a small sticker over the word in the middle that reads "YIELD." You ask the multimodal model: "What does this sign say?" The model answers: "The sign instructs drivers to yield to oncoming traffic at the intersection." Confident, fluent, and loyal to neither the visual evidence nor the textual evidence. It is a hybrid that splits the difference between channels that disagreed about what was true.

This failure mode does not have a settled name yet. Researchers studying multimodal hallucination call it "semantic hallucination," or "cross-modal bias," or "modality dominance," depending on which subfield is writing the paper. Practitioners shipping document AI, screenshot agents, and defect inspection systems run into it every week and describe it in their incident retros as "the model just made something up." It is not made up. It is the predictable output of an architecture that fuses two channels in its final layers without any primitive for representing the case where the channels say different things.

Silent Tool Truncation: The Default Cap Your Agent Reasons Over Without Knowing

· 11 min read
Tian Pan
Software Engineer

A tool call returns a 142 KB JSON blob. Your agent framework drops everything past byte 8,192, hands the prefix to the model, and the model writes a confident answer based on a fragment it never knew was a fragment. Three weeks later a customer escalates. You scroll the trace, see "tool returned successfully," and the post-mortem turns into a hunt for which step "ignored" the evidence — except no step ignored it. The evidence was clipped before it ever reached the reasoner.

This isn't a hypothetical. Codex hardcodes tool output truncation at 10 KiB or 256 lines. Claude Code defaults to 25,000 tokens for tool results, with a separate display-layer cap that briefly clipped MCP responses at around 700 characters in 2025. OpenAI's tool-output submission caps at 512 KB. Each framework picked a number that seemed safe, and for short tool calls it is. The failure mode arrives when a single step's output crosses the line — quietly, without an exception, without a flag the model can see.
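A minimal sketch of the alternative: clip if you must, but tell the model. The 8,192-byte limit comes from the example above; the function name, the result fields, and the notice wording are assumptions for illustration, not any framework's actual behavior.

```python
def clip_tool_output(payload: str, limit_bytes: int = 8192) -> dict:
    """Clip a tool result to a byte budget and make the clipping visible."""
    raw = payload.encode("utf-8")
    if len(raw) <= limit_bytes:
        return {"content": payload, "truncated": False, "original_bytes": len(raw)}

    # Drop any multi-byte character that the cut split in half.
    kept = raw[:limit_bytes].decode("utf-8", errors="ignore")
    notice = (
        f"\n[TRUNCATED: showing first {len(kept.encode('utf-8'))} "
        f"of {len(raw)} bytes]"
    )
    return {
        "content": kept + notice,  # the model sees the marker inline
        "truncated": True,         # the trace sees a structured flag
        "original_bytes": len(raw),
    }


result = clip_tool_output("x" * 142_000)
print(result["truncated"], result["original_bytes"])  # True 142000
```

The inline marker is for the reasoner and the flag is for the trace. Either one alone leaves half of the audience in the dark.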

When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default

· 10 min read
Tian Pan
Software Engineer

The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.

This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.
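One hedged way to sketch the distinction between "accepted" and "done": pair each side-effecting tool with a read-back check and report three states instead of two. The function, the state names, and the fake mail tool below are all hypothetical; a real system would poll a delivery-status or ledger endpoint of its own.

```python
def call_with_verification(tool, verify):
    """Run a tool, then confirm its effect instead of trusting the response.

    tool:   callable returning a dict with a "success" key
    verify: callable returning True only if the effect is observable
    """
    response = tool()
    if not response.get("success"):
        return "failed"
    # success=True only means the request was accepted.
    return "confirmed" if verify() else "accepted_unverified"


# Hypothetical mail tool: the message is queued but never delivered.
outbox = {"queued": 1, "delivered": 0}


def send_email():
    return {"success": True}


def email_delivered():
    return outbox["delivered"] > 0


print(call_with_verification(send_email, email_delivered))  # accepted_unverified
```

With a third state available, the agent's summary can say "the email is queued" rather than "the email was sent", which is the claim the evidence actually supports.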

Wall-Clock Deadline Drift: Why Your Agent Thinks It Has Time It Doesn't

· 9 min read
Tian Pan
Software Engineer

A user clicks send. The agent is configured with a thirty-second budget. The planner inspects the task, sees a deep-research path that takes about twelve seconds and a quick lookup that takes three, and confidently picks the deep path because "we have plenty of time." Twenty-eight seconds later the response lands, two seconds past the SLA the team published last quarter. The dashboard says the agent's reasoning was correct. The retry logic was correct. The tool calls succeeded. Nobody can explain why the user's spinner sat for forty-six seconds.

The bug is not in any single component. It is in the seam between them, in a value the system never thought to refresh: the agent's belief about how much time is left. Somewhere between request acceptance and the model's next planning step, a transparent retry happened, the wall clock advanced, and the deadline metadata didn't. The model is now reasoning about a budget it cashed out fifteen seconds ago and doesn't know it.
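A minimal sketch of the fix, assuming the deadline is stored as an absolute expiry and the remaining budget is recomputed at every planning step. The `Deadline` class and `pick_path` helper are hypothetical names; the twelve-second and three-second estimates come from the scenario above, and the twenty-second retry is an assumed value.

```python
import time


class Deadline:
    """Absolute expiry on a monotonic clock; never cache the remaining time."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())


def pick_path(deadline, paths):
    """Choose the most thorough path whose estimate still fits the budget.

    paths: dict mapping path name -> estimated seconds
    """
    left = deadline.remaining()  # re-read the clock at decision time
    fitting = {name: cost for name, cost in paths.items() if cost <= left}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Fake clock so the example is deterministic.
now = [100.0]
deadline = Deadline(30.0, clock=lambda: now[0])
paths = {"deep": 12.0, "quick": 3.0}

first_choice = pick_path(deadline, paths)
now[0] += 20.0  # a transparent retry silently eats twenty seconds
second_choice = pick_path(deadline, paths)
print(first_choice, second_choice)  # deep quick
```

The design choice that matters is passing the expiry, not the remaining duration, across component boundaries. A duration goes stale the moment it is computed; an expiry does not.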

Capacity Math for Agent Loops: Why Your Provisioned Throughput Is Half of What You Think

· 11 min read
Tian Pan
Software Engineer

A team I worked with launched what they called a "modest" feature: an internal research assistant for a few hundred analysts. Their capacity model said one user request equals one model call, so they sized provisioned throughput against peak user QPS with the standard 30 percent burst headroom. On launch day they hit 429s within an hour. Traffic that should have used 40 percent of their reserved capacity saturated all of it, and the postmortem revealed a number nobody had multiplied in: the average request triggered 11 model calls, not one.

This is the most common capacity miss I see in agent rollouts. The math is not subtle and the failure mode is not exotic. The team asked the wrong unit question — they planned in user requests when the meter ticks in model calls — and the reservation they paid real money for evaporated under a load they would have called light if it had been a chat product.
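The unit conversion fits in one function. This is a sketch under stated assumptions: the 40 percent planned utilization and the 11-call fan-out come from the story above, while the absolute QPS and capacity figures are hypothetical placeholders chosen to reproduce that 40 percent.

```python
def utilization(peak_user_qps, model_calls_per_request, capacity_calls_per_s):
    """Fraction of reserved capacity consumed, measured in model calls."""
    return peak_user_qps * model_calls_per_request / capacity_calls_per_s


# Hypothetical sizing: 4 user requests/s against 10 reserved model calls/s.
planned = utilization(4, 1, 10)   # what the capacity model assumed
actual = utilization(4, 11, 10)   # what the agent loop actually did
print(f"planned {planned:.0%}, actual {actual:.0%}")  # planned 40%, actual 440%
```

Anything above 100 percent is demand the reservation cannot serve, which is where the 429s come from.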

Conversational REST: When Your Chat UI Needs Pagination, Filters, and Sort

· 11 min read
Tian Pan
Software Engineer

A user asks your shopping agent for "running shoes under $150 with good arch support." The model dutifully returns twelve options as a wall of bulleted text inside a single chat bubble that overflows the viewport. The user scrolls, loses their place, and types "show me only Asics" — at which point your agent re-runs the entire search instead of filtering the result set it already has. Three turns later, the user is inventing a query language one prompt at a time, and your product feels like a command line wearing a chat-bubble costume.

This is the failure mode I keep watching teams ship. They built a chat product when what users actually wanted was a faceted-search product. The model is fine. The retrieval is fine. The UI is the problem, and it's the wrong shape for the task.

The shortest way I can put it: chat is an input modality, not an output one. The agent's job is to translate user intent into a structured query. The moment the result set is more than three items, the right answer is to render UI, not to keep talking.
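As a rough sketch of what "translate intent into a structured query" can mean: the model emits a small query object, and a follow-up like "show me only Asics" becomes a refinement of that object applied to results already in hand. The `ProductQuery` fields and the sample catalog are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ProductQuery:
    category: str
    max_price: Optional[float] = None
    brand: Optional[str] = None


def apply_query(items, query):
    """Filter a list of product dicts by the structured query."""
    return [
        item for item in items
        if item["category"] == query.category
        and (query.max_price is None or item["price"] <= query.max_price)
        and (query.brand is None or item["brand"].lower() == query.brand.lower())
    ]


catalog = [
    {"name": "Shoe A", "brand": "Asics", "category": "running", "price": 130.0},
    {"name": "Shoe B", "brand": "Brooks", "category": "running", "price": 140.0},
    {"name": "Shoe C", "brand": "Asics", "category": "running", "price": 180.0},
]

first = ProductQuery(category="running", max_price=150.0)
results = apply_query(catalog, first)

# "show me only Asics": refine the query and filter the cached result set.
refined = replace(first, brand="Asics")
narrowed = apply_query(results, refined)
print([item["name"] for item in narrowed])  # ['Shoe A']
```

Once the query is an object, the UI can render it as chips, filters, and a sort control, and the follow-up turn no longer needs a fresh search.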

OAuth in MCP: Threading User Identity Through Tool Servers

· 10 min read
Tian Pan
Software Engineer

The first time you wire an MCP server into a real production system, you discover something the tutorials gloss over: the protocol gives the agent capabilities, but it does not give the tool server an answer to the question every audit log requires — which human is this acting on behalf of? You can ship a working demo without resolving that question. You cannot ship to a regulated enterprise without resolving it. And the gap between those two states is almost entirely a distributed-systems problem dressed up as an OAuth problem.

What teams reach for in that gap, in roughly the order they reach for it, is a tour of every anti-pattern the OAuth working group has spent fifteen years warning against. A shared service account in the MCP server's environment. A long-lived per-user token pasted into a config. A cheerful "we'll just forward the user's session cookie and let the downstream service figure it out." Each one works in staging. Each one breaks in a different way the first time security review actually looks at it.

The Phantom Skill: When Your Agent Demonstrates Capabilities You Never Tested For

· 11 min read
Tian Pan
Software Engineer

A customer posts a screenshot in your support channel. They've been using your scheduling agent to negotiate three-way meeting times across timezones in mixed English and Japanese, with the agent producing suggested slots in both languages and reasoning about Japanese business etiquette. It works. Leadership shares it on Slack with a fire emoji. The PM updates the marketing copy.

Nobody on the team wrote that capability. No eval covers it. No prompt instruction mentions Japanese, etiquette, or three-way coordination. The behavior is real, but it was never engineered, never measured, and is now in your product surface area.

This is a phantom skill: a capability your agent demonstrates that no test ever verified. It isn't a bug. It isn't quite a feature either. It's load-bearing behavior with no contract, and it's the failure mode that quietly defines what your "AI product" actually is.

The Snapshot Trace Test: Production Traces as Your Regression Suite

· 10 min read
Tian Pan
Software Engineer

The eval set most teams run as their regression suite was hand-curated by an engineer in week three of the project, frozen by week six because nobody wanted to touch it before launch, and is now being used in month nine to gate deploys. The product has shifted twice. The user base has tripled. The cases the LLM actually sees in production overlap with that frozen suite by maybe forty percent. When the suite passes, nobody trusts it; when it fails, nobody knows whether the failure is real or whether the case is just stale. The team writes a doc proposing a "v2 eval set" and never gets around to it.

Meanwhile, every request the system has handled in production has been recorded in a tracing backend. Every prompt, every tool call, every intermediate output, every refusal, every retry — all of it sitting in object storage, time-indexed and span-tagged, ready to be replayed. The highest-fidelity test corpus the team will ever have is already on disk. They built an eval suite from scratch instead of reading from it.

The Summary Tax: When Compaction Eats More Tokens Than It Saves

· 10 min read
Tian Pan
Software Engineer

A long-running agent crosses its compaction threshold every twelve turns. Each pass costs an LLM call sized to the running window — first 8K tokens, then 14K, then 22K — because the span being summarized grows with every trigger. By turn sixty, the user has spent more tokens watching the agent re-summarize itself than they spent on the actual reasoning that mattered. The cost dashboard reads "user inference cost" as a single number, blissfully unaware that half of it paid for compression of context the user will never look at again.

This is the summary tax: a class of overhead that scales with conversation length, fires invisibly between user turns, and shows up as a single line item that conflates the work the user paid for with the bookkeeping the system did to manage itself. It is the closest thing modern agent architectures have to garbage-collection pause time — and most teams are running production with -verbose:gc turned off.
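Separating the two line items is simple arithmetic once compaction calls are tagged. The sketch below uses the first three pass sizes quoted above (8K, 14K, 22K); the reasoning-token figure is a hypothetical value chosen for illustration, and the function name is an assumption.

```python
def summary_tax_share(compaction_tokens, reasoning_tokens):
    """Fraction of total spend that went to compaction rather than reasoning."""
    overhead = sum(compaction_tokens)
    total = overhead + reasoning_tokens
    return overhead / total if total else 0.0


# Three compaction passes from the example; 44K reasoning tokens is assumed.
share = summary_tax_share([8_000, 14_000, 22_000], reasoning_tokens=44_000)
print(f"{share:.0%} of tokens paid for compaction")  # 50% of tokens paid for compaction
```

Reporting this share next to the headline cost is the agent equivalent of turning GC logging on.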