
861 posts tagged with "insider"


The Acknowledgment-Action Gap: Your Agent's 'Got It' Is Not a Commitment

· 11 min read
Tian Pan
Software Engineer

An agent tells a customer: "Got it — I've submitted your refund request. You should see it in 5–7 business days." The customer closes the chat. No refund was ever submitted. There is no ticket, no API call, no row in the refunds table. Just a paragraph of polite, confident English, followed by a successful session termination.

This is the acknowledgment-action gap, and it is the single most expensive class of bug in production agent systems. The gap exists because the fluent prose that makes instruction-tuned models feel competent is a different output channel than the structured tool calls that actually change the world — and most teams wire their business logic to the wrong one.

Everyone who ships an agent eventually learns this the hard way. The model produces a polished confirmation that reads like a commitment, the downstream system interprets it as a commitment, and weeks later a support ticket arrives asking where the refund went. The embarrassing part is not that the model lied. The embarrassing part is that the system was designed to trust what it said.
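
One way to wire business logic to the tool-call channel rather than the prose channel is sketched below. It is a minimal illustration under stated assumptions: `AgentTurn`, `ToolCall`, `refund_committed`, and the tool name `submit_refund` are hypothetical names invented for this example, not part of any particular agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str        # hypothetical tool name, e.g. "submit_refund"
    succeeded: bool  # True only if the tool returned a confirmed result


@dataclass
class AgentTurn:
    text: str  # what the model said to the user
    tool_calls: list[ToolCall] = field(default_factory=list)


def refund_committed(turn: AgentTurn) -> bool:
    """Treat a refund as committed only if a refund tool call succeeded.

    The model's prose (turn.text) is deliberately never inspected.
    """
    return any(c.name == "submit_refund" and c.succeeded for c in turn.tool_calls)


# Same confident sentence, with and without an action behind it.
prose_only = AgentTurn(text="Got it, I've submitted your refund request.")
backed = AgentTurn(
    text="Got it, I've submitted your refund request.",
    tool_calls=[ToolCall(name="submit_refund", succeeded=True)],
)
```

Here `refund_committed(prose_only)` is false and `refund_committed(backed)` is true, even though the user-facing text is identical in both turns.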

The Agent Backfill Problem: Your Model Upgrade Is a Trial of the Last 90 Days

· 12 min read
Tian Pan
Software Engineer

Here is a Tuesday-morning conversation that nobody on your AI team is prepared for. The new model lands in shadow mode. Within an hour the eval dashboard lights up: it categorizes 4% of refund requests differently than the model you have been running for the last quarter. Most of those flips look like the new model is right. Someone in the room — usually the one with the most lawyers in their reporting line — asks the question that ends the celebration: so what are we doing about the ninety days of decisions the old model already shipped?

That is the agent backfill problem. The moment a smarter model starts producing outputs that look more correct than your previous model's, every durable decision the previous model made becomes a contested record. You did not intend to indict the past. The new model did it for you, automatically, the first time you compared traces. And now you have an engineering question (can we replay history?), a legal question (do we have to disclose corrected outcomes?), and a product question (do users see retroactive changes?), and they collide.

The Agent Capability Cliff: Why Your Model Upgrade Made the Easy 95% Perfect and the Hard 5% Your Worst Quarter

· 11 min read
Tian Pan
Software Engineer

You shipped the new model. Aggregate eval pass rate went from 91% to 96%. Product declared it a win in the all-hands. Six weeks later, the reliability team is having their worst quarter on record — not because there are more incidents, but because every single incident is now the kind that takes three engineers and two days to resolve.

This is the agent capability cliff, and it is one of the most counterintuitive failure modes in production AI. Model upgrades do not raise all tasks uniformly. They concentrate their gains on the bulk of your traffic — the easy and medium cases where the previous model was already correct most of the time — while the long tail of genuinely hard inputs sees only marginal improvement. Your failure surface narrows, but every remaining failure is a capability-frontier case that the previous model also missed and that no cheap prompt engineering will fix.

The cliff is not a flaw in the new model. It is a mismatch between how we measure model improvement (average pass rate on a mixed-difficulty eval set) and what actually lands in on-call rotations (the residual set of the hardest traffic, now unpadded by the easier failures that used to dominate the signal).
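
The arithmetic behind that mismatch can be sketched with a two-bucket split. The per-bucket numbers below are invented for illustration and chosen only so that they reproduce the 91% and 96% aggregates mentioned above; they are not measurements.

```python
def aggregate(segments):
    """Overall pass rate from (traffic_share, pass_rate) pairs."""
    return sum(share * rate for share, rate in segments)


def hard_share_of_failures(easy, hard):
    """Fraction of all failures that come from the hard bucket."""
    easy_fail = easy[0] * (1 - easy[1])
    hard_fail = hard[0] * (1 - hard[1])
    return hard_fail / (easy_fail + hard_fail)


# Hypothetical split: 95% of traffic is easy, 5% is hard.
old_easy, old_hard = (0.95, 0.95), (0.05, 0.15)
new_easy, new_hard = (0.95, 1.00), (0.05, 0.20)

old_overall = aggregate([old_easy, old_hard])  # about 0.91
new_overall = aggregate([new_easy, new_hard])  # about 0.96

old_hard_share = hard_share_of_failures(old_easy, old_hard)  # under half
new_hard_share = hard_share_of_failures(new_easy, new_hard)  # every failure
```

Under these assumed numbers the aggregate improves by five points while the hard bucket goes from just under half of all failures to all of them, which is the shift the on-call rotation feels.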

Agent Memory Schema Evolution Is Protobuf on Hard Mode

· 11 min read
Tian Pan
Software Engineer

The first painful agent-memory migration always teaches the same lesson: there were two schemas, and you only migrated one of them. The storage layer is fine — every row was rewritten, every key is in its new shape, the backfill job logged success. The agent is broken anyway. It keeps writing to user.preferences.theme, retrieves nothing, then helpfully synthesizes a default from context as if the key never existed. The migration runbook reports green. Users report stale memory.

The asymmetry is structural. A traditional service that depends on a renamed column gets a hard error and you fix it. An agent that depends on a renamed memory key gets a soft miss and confabulates around it. The schema lives in two places — your store and the model's context — and you can only migrate one of them with a SQL script.

Protobuf solved a version of this problem twenty years ago by codifying an additive-only discipline: fields are forever, numbers are forever, wire types never change, and removal is replaced with deprecation. That discipline is the right starting point for agent memory, with one extra constraint that makes it harder. Protobuf receivers ignore unknown fields by design. Agents don't.
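
A minimal sketch of what that discipline might look like for a memory store follows. It makes two assumptions: renamed keys stay readable and writable through a permanent alias, and a key that is neither current nor aliased raises a hard error instead of returning a soft miss. `MemoryStore`, `UnknownMemoryKey`, and the new key name `user.settings.theme` are hypothetical.

```python
class UnknownMemoryKey(KeyError):
    """Raised instead of returning a soft miss the agent could confabulate around."""


class MemoryStore:
    def __init__(self, known_keys, deprecated_aliases):
        self._known = set(known_keys)
        self._aliases = dict(deprecated_aliases)  # old key -> current key
        self._data = {}

    def _resolve(self, key):
        key = self._aliases.get(key, key)  # deprecated names keep working
        if key not in self._known:
            raise UnknownMemoryKey(key)    # loud failure, not a silent default
        return key

    def write(self, key, value):
        self._data[self._resolve(key)] = value

    def read(self, key, default=None):
        return self._data.get(self._resolve(key), default)


store = MemoryStore(
    known_keys={"user.settings.theme"},
    deprecated_aliases={"user.preferences.theme": "user.settings.theme"},
)
store.write("user.preferences.theme", "dark")  # agent still uses the old name
```

After the write, both `store.read("user.settings.theme")` and `store.read("user.preferences.theme")` return `"dark"`, so the copy of the schema that lives in the model's context can lag behind the store without losing data.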

Silent Success: When Your Agent Says Done and Nothing Actually Happened

· 10 min read
Tian Pan
Software Engineer

The most dangerous line in an agent transcript is the confident one. "I've updated the record." "The invite is sent." "Permissions are applied." Every one of those sentences is a claim, not a fact, and when the tool call behind it rate-limited, timed out, or returned a 500 that the summarization step over-compressed into something reassuring, the claim is all you have. Your telemetry logs the turn as successful because success is whatever the model typed at the top of its final message. The downstream write never committed. Nobody notices for three weeks.

This is the failure class that separates agents from every system that came before them. A traditional service fails with a status code. A traditional batch job fails with a stack trace. An agent fails by continuing to talk. It absorbs the error into its running narrative, rounds it off to make the story coherent, and hands you a paragraph that reads like completion. The user reads the paragraph. Your observability platform indexes the paragraph. The record in the database does not change.

The Agent Paged Me at 3 AM: Blast-Radius Policy for Tools That Reach Humans

· 12 min read
Tian Pan
Software Engineer

The first time an agent pages your on-call four times in an hour because it's looping on a malformed alert signal, leadership learns something the security team already knew: "tool access" and "ability to create human work" were the same permission, and you granted it without either a safety review or a product-ownership review. Nobody owned the question of who's allowed to interrupt a human at 3 AM, because nobody framed it as a question. It was framed as a Slack integration.

The 2026 agent stack has made this failure mode cheap to reach. Anthropic's MCP servers, OpenAI's Agents SDK, and the whole class of vendor-shipped action tools have collapsed the distance between "the model decided to do a thing" and "a human got woken up." Most teams ship those integrations the same way they ship a database client: scope a token, drop in the SDK, write a system prompt, ship. The blast radius of a database client is a row count. The blast radius of a PagerDuty client is a person's sleep.
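
One possible guardrail is a per-target page budget enforced outside the model. The sketch below assumes a sliding-window limit; `PageBudget`, its parameters, and the target name `"on-call"` are hypothetical and not taken from any paging vendor's SDK.

```python
from collections import defaultdict, deque


class PageBudget:
    """Allow at most `max_pages` pages per target within a sliding window."""

    def __init__(self, max_pages, window_seconds):
        self.max_pages = max_pages
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # target -> timestamps of allowed pages

    def allow(self, target, now):
        sent = self._sent[target]
        # Forget pages that have aged out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_pages:
            return False  # budget spent: suppress or reroute the page
        sent.append(now)
        return True


# A looping agent tries to page the same target five times.
budget = PageBudget(max_pages=1, window_seconds=3600)
decisions = [budget.allow("on-call", now=t) for t in (0, 600, 1200, 1800, 3600)]
```

With a budget of one page per hour, only the first attempt and the one a full hour later get through; the three attempts in between are refused no matter what the model decided.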

Your AI Chat Transcripts Are Evidence: Retention Design for LLM Products Under Legal Hold

· 11 min read
Tian Pan
Software Engineer

On May 13, 2025, a federal magistrate judge in the Southern District of New York signed a preservation order that replaced a consumer AI company's retention policy with a single word: forever. OpenAI was directed to preserve and segregate every output log across Free, Plus, Pro, and Team tiers, including conversations users had explicitly deleted and conversations privacy law would otherwise require to be erased. The indefinite retention obligation lasted until September 26 of that year: more than four months of "delete" meaning "keep, in a segregated vault, for an opposing party to read later." By November, the same court had ordered 20 million of those de-identified transcripts produced to the New York Times and co-plaintiffs as sampled discovery.

That order is the warning shot for every team building on top of LLMs. If your product stores chat, your retention policy is one plausible lawsuit away from being replaced by whatever the court thinks is reasonable. The engineering question is not whether this happens to you. It is whether your storage architecture can absorb it without turning your product into a liability engine for the legal department.

Email retention playbooks do not carry over cleanly. AI conversations contain more than what the user typed, and the "more" is where the discovery fights are starting.

Async Agents Need an Inbox, Not a Chat

· 11 min read
Tian Pan
Software Engineer

The chat metaphor has a fuse, and it burns out around thirty seconds. Past that, the spinner stops being a progress indicator and becomes a commitment device. The one making the commitment is your user, and most of them bail. You can watch it in session replays: the typing indicator appears, the user waits and tabs away at about twelve seconds, and half never come back. The product team sees a completed agent run with no human on the other end and files it as a success. It is not a success. It is an abandoned artifact that happened to finish.

This is the first contact with a structural problem that most agent products paper over with spinners and streaming text: the chat interface was designed for turn-taking humans and fast models, and it fails silently when either assumption breaks. If your agent takes minutes, you are not shipping a chat feature with a longer wait. You are shipping a different product, and it needs a different UI primitive.

The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95

· 10 min read
Tian Pan
Software Engineer

The cost dashboard is a beautiful green. Spend per request is down 62% since the cascade router shipped. The CFO is happy. The platform team is celebrating. Meanwhile, your p95 latency has crept up 40%, your hardest customer just churned because "the bot got dumber on the queries that matter," and the experimentation team has spent two weeks chasing a phantom regression.

This is the cascade router reliability trap. It is the quiet failure mode of every "try the cheap model first, escalate if it doesn't work" architecture, and it is one of the most under-discussed second-order effects in production LLM systems. The cost wins are real, measurable, and easy to attribute. The reliability losses are diffuse, statistical, and almost impossible to attribute back to the router that caused them. So the cost wins get celebrated, the reliability losses get blamed on "the model getting worse," and the team optimizes itself into a hole.
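
One plausible mechanism for the latency side, assumed here rather than taken from any measurement, is that an escalated request pays for the cheap attempt and then for the strong model on top. The latencies and the 10% escalation rate below are hypothetical, chosen to make the arithmetic easy to follow.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


CHEAP_MS, STRONG_MS = 400, 1000  # hypothetical per-call latencies
ESCALATED = 10                   # hypothetical: 10 of every 100 requests escalate

# Before the router: every request goes straight to the strong model.
baseline = [STRONG_MS] * 100

# After the router: escalated requests pay for both attempts.
cascade = [CHEAP_MS] * (100 - ESCALATED) + [CHEAP_MS + STRONG_MS] * ESCALATED

mean_baseline = sum(baseline) / len(baseline)
mean_cascade = sum(cascade) / len(cascade)
p95_baseline = p95(baseline)
p95_cascade = p95(cascade)
```

In this toy setup the mean latency halves, from 1000 ms to 500 ms, while the p95 rises from 1000 ms to 1400 ms. The average looks like a win and the tail gets worse, and the tail is where the escalated, harder queries live.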

Your Chain-of-Thought Is a Story, Not an Audit Log

· 11 min read
Tian Pan
Software Engineer

An agent tells you, in clean prose, that it checked the user's permission, looked up the policy, confirmed the request was in scope, and executed the action. Legal reads the trace. Auditors read the trace. Your incident review reads the trace. Everyone reads the same paragraph and everyone comes away satisfied.

None of them know whether the permission check actually ran. The paragraph is evidence of narration, not evidence of execution, and those two things get confused precisely because the narration is fluent enough to feel like proof. Anthropic's own reasoning-model faithfulness research found that when Claude 3.7 Sonnet was fed a hint about the correct answer, it admitted using the hint only about 25% of the time on average, with rates of 19–41% reported for the problematic categories (grader hacks, unethical cues). In other words, the stated reasoning left out the hint that shaped the answer in most of the tested cases, and this is true even for models explicitly trained to show their work.

The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans

· 11 min read
Tian Pan
Software Engineer

A planner agent emits seven steps. Each looks reasonable. The orchestrator dispatches them, the first three return values, the fourth waits on the fifth, the fifth waits on the seventh, and the seventh — buried three lines deep in the planner's prose — quietly waits on the fourth. Nothing is locked. No EDEADLK ever fires. The agent burns 40,000 tokens reasoning about why the fourth step "is taking longer than expected" and ultimately gives up with a soft, plausible apology to the user.

This is the deadlock your agent can't see. It is not the textbook deadlock from operating systems class — there are no mutexes, no resource graphs the kernel can introspect, no holders or waiters anyone in your stack would recognize. The dependencies live inside English sentences that the planner produced, the cycles form in latent semantics rather than in any data structure, and the failure mode looks indistinguishable from "the model is thinking hard." Classic deadlock detection is useless here, but the cost is identical: the workflow halts, tokens evaporate, and your trace tells you nothing.
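
If the dependencies can be pulled out of the prose into a structured form before dispatch, the cycle becomes an ordinary graph problem again. The sketch below assumes that extraction has already happened and uses Python's standard `graphlib` module; the step numbers mirror the seven-step example above, and `find_cycle` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(depends_on):
    """Return a list of steps forming a cycle, or None if the plan is acyclic.

    `depends_on` maps each step to the set of steps it waits on.
    """
    try:
        TopologicalSorter(depends_on).prepare()
    except CycleError as err:
        return err.args[1]  # the detected cycle; first and last node are the same
    return None


# Step 4 waits on 5, 5 waits on 7, and 7 quietly waits on 4.
plan = {1: set(), 2: set(), 3: set(), 4: {5}, 5: {7}, 6: set(), 7: {4}}
cycle = find_cycle(plan)
```

Running this check before the orchestrator dispatches anything turns the silent stall into an explicit error that names steps 4, 5, and 7. The hard part that remains is the extraction itself, which this sketch does not attempt.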

Your Clock-in-Prompt Is a Correctness Boundary, Not a Log Field

· 10 min read
Tian Pan
Software Engineer

A scheduling agent booked a customer's onboarding call for Tuesday instead of Wednesday. The investigation took two days. The prompt was fine. The model was fine. The calendar tool was fine. The bug was that the system prompt carried a current_time field stamped an hour earlier, when the request routed through a cached prefix built just before midnight UTC. By the time the agent parsed "tomorrow at 10 AM" and called the booking tool, "tomorrow" referred to a day that was already "today" for the user in Tokyo.

The agent had no way to notice. It had nothing to notice with. LLMs do not have clocks. They have whatever string you handed them in the prompt, and they treat that string as authoritative the same way they treat the user's question as authoritative — which is to say, completely, without skepticism, without a second source to cross-check against.

Most teams know this in the abstract and still treat the timestamp they inject like a log field: something nice to have, rendered into the system prompt for context, nobody's explicit responsibility, nobody's correctness boundary. That framing is wrong. The timestamp is a correctness boundary. Every agent behavior that depends on "now" — scheduling, expiration, retry windows, "recently," "tomorrow," "in five minutes," freshness checks on retrieved documents — runs on whatever your time plumbing produced, and inherits every bug that plumbing has.
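
A minimal sketch of treating the timestamp as a boundary: resolve relative dates from a timezone-aware instant captured at request time and converted to the user's zone, rather than from whatever string a cached prefix carries. The dates and the `resolve_tomorrow` helper are hypothetical. A fixed UTC+9 offset stands in for Tokyo to keep the example dependency-free; a real system would more likely use IANA zone names via `zoneinfo`.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo

JST = timezone(timedelta(hours=9), "JST")  # Tokyo, which has no daylight saving time


def resolve_tomorrow(now: datetime, user_tz: tzinfo) -> date:
    """Resolve "tomorrow" in the user's time zone from a timezone-aware instant."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    return now.astimezone(user_tz).date() + timedelta(days=1)


# Hypothetical instants; 2025-01-06 is a Monday.
stale_prefix_time = datetime(2025, 1, 6, 23, 30, tzinfo=timezone.utc)  # cached before midnight UTC
request_time = datetime(2025, 1, 7, 0, 30, tzinfo=timezone.utc)        # when the user actually asked

# The buggy path: UTC calendar date taken from the stale stamp.
stale_answer = stale_prefix_time.date() + timedelta(days=1)

# The boundary-respecting path: fresh instant, user's zone.
fresh_answer = resolve_tomorrow(request_time, JST)
```

With these inputs the stale path lands on Tuesday, January 7, and the fresh path lands on Wednesday, January 8, the same one-day slip as the booking described above.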