Skip to main content

311 posts tagged with "ai-agents"

View all tags

Why AI-Generated Comments Rot Faster Than the Code They Describe

· 11 min read
Tian Pan
Software Engineer

When an agent writes a function and a comment in the same diff, the comment is not documentation. It is a paraphrase of the code at write-time, generated by the same model from the same context, and it is silently wrong the first time the code shifts. The function gets refactored, an argument changes type, an early-return gets added, the comment stays. By next quarter, the comment is encoding a specification that no longer matches the code, and the next reader trusts the comment because the comment is easier.

This is an old failure mode — humans-edit-code-comments-stay-stale — but agents accelerate it across three dimensions at once. Comment volume goes up because agents add a doc block to every function whether it needs one or not. The comments are grammatically perfect, so reviewers don't flag them as low-quality. And the comments paraphrase the code in different terms than the code actually executes, so they look like documentation but encode a second specification that drifts independently of the first.

AI Reviewing AI: The Asymmetric Architecture of Code-Review Agents

· 12 min read
Tian Pan
Software Engineer

A review pipeline where the author and the reviewer are both language models trained on overlapping corpora is not a quality gate. It is a confidence amplifier. The author writes code that looks plausible to a transformer, the reviewer reads code through the same plausibility lens, both agents converge on "looks fine," and the diff merges with a green checkmark that means nothing about whether the change is actually correct. Recent industry data shows the asymmetry plainly: PRs co-authored with AI produce roughly 40% more critical issues and 70% more major issues than human-written PRs at the same volume, with logic and correctness bugs accounting for most of the gap. The reviewer agents shipped to catch those bugs are, by construction, the ones least equipped to find them.

The teams getting real signal from AI code review have stopped treating "review" as a slightly different shape of "generation" and started designing review as a fundamentally different cognitive task. Generation prompting asks the model to produce something coherent. Review prompting has to ask the model to find what is missing — to inhabit the negative space of the diff rather than the positive one — and that inversion is much harder to elicit than a one-line system prompt suggests.

The Contestability Gap: Engineering AI Decisions Your Users Can Actually Appeal

· 11 min read
Tian Pan
Software Engineer

A user opens a chat, asks for a refund, gets "I'm sorry, this purchase is not eligible for a refund," closes the tab, and never comes back. Internally, the agent emitted a beautiful trace: tool calls, intermediate reasoning, the policy bundle it consulted, the model version it ran on. Every span landed in the observability platform. None of it landed anywhere the user could reach. There is no button labeled "ask a human to look at this again," and even if there were, there is no service behind it. The decision is final by default, not by design.

This is the contestability gap, and it is the next thing regulators, lawyers, and angry users are going to rip open. It is also one of the cleanest examples of a problem that looks like policy from the outside and turns out to be plumbing on the inside.

The Dual-Writer Race: When Your Agent and Your User Edit the Same Calendar Event

· 12 min read
Tian Pan
Software Engineer

The agent confidently reports "I rescheduled the meeting to Thursday at 3pm." The user is staring at the original Tuesday 10am slot, because between the agent's plan and its commit they edited the event themselves. Last-write-wins overwrote a human change with an automated one, and the user's trust in the assistant collapses on a single incident. This is the dual-writer race, and it is the bug class that agent toolchains were never designed for.

Most agent platforms inherit this category by accident. The tool layer treats update_event as a single function call: take an ID, take new fields, return success. The provider API underneath has offered optimistic concurrency primitives — ETags, version tokens, If-Match preconditions — for a decade, and almost nobody plumbs them through. The model has no way to know that the world it reasoned about a minute ago is no longer current, because the abstraction it was given silently throws that information away.

Your Span Names Are an Undocumented API: Telemetry Contracts Between Agent Teams

· 10 min read
Tian Pan
Software Engineer

The cost spike that paged finance at 3 a.m. was not a cost spike. It was a span rename. Someone on the agent platform team decided that llm.completion.synthesis should really be llm.generate.answer because it read more naturally, opened a small PR, ran their tests, and shipped. Three days later finance's monthly token-spend dashboard showed a 60% drop. Nobody had cut spending. The aggregation rule still grouped by the old name, and the new spans flowed past it into an "other" bucket that the dashboard didn't even render. The bill didn't move. The dashboard did.

This is a class of incident I keep watching teams rediscover. Span names and attribute keys are not labels for humans to read in a trace UI. They are the public schema of an undocumented API, with consumers that the producer team has never met — eval pipelines that filter on them, cost dashboards that group by them, SLO alerts that fire on their durations, FinOps reports that sum their token attributes. A "harmless rename" inside one team is a wire-protocol break for four other teams that never saw the PR.

Policy-as-Code for Agents: OPA, Rego, and the Decision Point Your Tool Loop Doesn't Have

· 12 min read
Tian Pan
Software Engineer

The first time a regulator asks you to prove that your support agent did not access a Tier-2 customer's billing record on March 14th, you discover an unpleasant truth about your authorization architecture: the system prompt said "do not access billing for Tier-2," the YAML tool manifest said tools: [search_orders, refund_order, get_billing], and somewhere in between, the model decided. There is no record of a decision, because no decision point existed. Whether the agent did the right thing is not auditable, only inferable from logs of what happened.

This is the part of agent engineering that nobody put on the architecture diagram. Tool permissions today still live in a YAML file edited by whoever spawned the agent, surfaced to the model through a system prompt that describes intent, and enforced — when it is enforced at all — by application code wrapping each tool call in an if user.tier == "premium" check. As tool catalogs cross fifty entries and conditions multiply across tenants, data classes, and user roles, that hand-rolled lattice stops scaling, and the system prompt stops being a reliable enforcement surface. The model is not your authorization layer, even when it acts like one.

What is replacing it is policy-as-code: a dedicated policy engine — OPA with Rego, AWS Cedar, or a similar declarative tool — sitting in front of every tool call as a Policy Decision Point. The engine answers a single question per call: given this principal, this tool, these arguments, this context, is the action allowed? The agent runtime never gets to vote. This post is about what that architecture looks like in practice, and the four problems it solves that no amount of prompt engineering can.

The Acknowledgment-Action Gap: Your Agent's 'Got It' Is Not a Commitment

· 11 min read
Tian Pan
Software Engineer

An agent tells a customer: "Got it — I've submitted your refund request. You should see it in 5–7 business days." The customer closes the chat. No refund was ever submitted. There is no ticket, no API call, no row in the refunds table. Just a paragraph of polite, confident English, followed by a successful session termination.

This is the acknowledgment-action gap, and it is the single most expensive class of bug in production agent systems. The gap exists because the fluent prose that makes instruction-tuned models feel competent is a different output channel than the structured tool calls that actually change the world — and most teams wire their business logic to the wrong one.

Everyone who ships an agent eventually learns this the hard way. The model produces a polished confirmation that reads like a commitment, the downstream system interprets it as a commitment, and weeks later a support ticket arrives asking where the refund went. The embarrassing part is not that the model lied. The embarrassing part is that the system was designed to trust what it said.

The Agent Backfill Problem: Your Model Upgrade Is a Trial of the Last 90 Days

· 12 min read
Tian Pan
Software Engineer

Here is a Tuesday-morning conversation that nobody on your AI team is prepared for. The new model lands in shadow mode. Within an hour the eval dashboard lights up: it categorizes 4% of refund requests differently than the model you have been running for the last quarter. Most of those flips look like the new model is right. Someone in the room — usually the one with the most lawyers in their reporting line — asks the question that ends the celebration: so what are we doing about the ninety days of decisions the old model already shipped?

That is the agent backfill problem. The moment a smarter model starts producing outputs that look more correct than your previous model's, every durable decision the previous model made becomes a contested record. You did not intend to indict the past. The new model did it for you, automatically, the first time you compared traces. And now you have an engineering question (can we replay history?), a legal question (do we have to disclose corrected outcomes?), and a product question (do users see retroactive changes?), and they collide.

Agent Idempotency Is an Orchestration Contract, Not a Tool Property

· 10 min read
Tian Pan
Software Engineer

The support ticket arrives at 9:41 a.m.: "I was charged three times." The trace looks clean. One user message, one planner turn, three calls to charge_card — each with a distinct tool-use ID, each returning 200 OK, each writing a different Stripe charge. The tool has an idempotency key. The backend has a dedup table. The payment processor honors Idempotency-Key. Every layer is idempotent. The customer still paid three times.

This is the shape of the bug that will land on your desk if you build agents long enough. It is not a bug in any tool. It is a bug in the contract between the agent loop and the tools, and that contract almost always lives only in a senior engineer's head.

Agent Memory Schema Evolution Is Protobuf on Hard Mode

· 11 min read
Tian Pan
Software Engineer

The first painful agent-memory migration always teaches the same lesson: there were two schemas, and you only migrated one of them. The storage layer is fine — every row was rewritten, every key is in its new shape, the backfill job logged success. The agent is broken anyway. It keeps writing to user.preferences.theme, retrieves nothing, then helpfully synthesizes a default from context as if the key never existed. The migration runbook reports green. Users report stale memory.

The asymmetry is structural. A traditional service that depends on a renamed column gets a hard error and you fix it. An agent that depends on a renamed memory key gets a soft miss and confabulates around it. The schema lives in two places — your store and the model's context — and you can only migrate one of them with a SQL script.

Protobuf solved a version of this problem twenty years ago by codifying an additive-only discipline: fields are forever, numbers are forever, wire types never change, and removal is replaced with deprecation. That discipline is the right starting point for agent memory, with one extra constraint that makes it harder. Protobuf receivers ignore unknown fields by design. Agents don't.

Silent Success: When Your Agent Says Done and Nothing Actually Happened

· 10 min read
Tian Pan
Software Engineer

The most dangerous line in an agent transcript is the confident one. "I've updated the record." "The invite is sent." "Permissions are applied." Every one of those sentences is a claim, not a fact, and when the tool call behind it rate-limited, timed out, or returned a 500 that the summarization step over-compressed into something reassuring, the claim is all you have. Your telemetry logs the turn as successful because success is whatever the model typed at the top of its final message. The downstream write never committed. Nobody notices for three weeks.

This is the failure class that separates agents from every system that came before them. A traditional service fails with a status code. A traditional batch job fails with a stack trace. An agent fails by continuing to talk. It absorbs the error into its running narrative, rounds it off to make the story coherent, and hands you a paragraph that reads like completion. The user reads the paragraph. Your observability platform indexes the paragraph. The record in the database does not change.

The Agent Paged Me at 3 AM: Blast-Radius Policy for Tools That Reach Humans

· 12 min read
Tian Pan
Software Engineer

The first time an agent pages your on-call four times in an hour because it's looping on a malformed alert signal, leadership learns something the security team already knew: "tool access" and "ability to create human work" were the same permission, and you granted it without either a safety review or a product-ownership review. Nobody owned the question of who's allowed to interrupt a human at 3 AM, because nobody framed it as a question. It was framed as a Slack integration.

The 2026 agent stack has made this failure mode cheap to reach. Anthropic's MCP servers, OpenAI's Agents SDK, and the whole class of vendor-shipped action tools have collapsed the distance between "the model decided to do a thing" and "a human got woken up." Most teams ship those integrations the same way they ship a database client: scope a token, drop in the SDK, write a system prompt, ship. The blast radius of a database client is a row count. The blast radius of a PagerDuty client is a person's sleep.