
26 posts tagged with "llm-agents"


The Verification Step Your Agent Pretended to Perform

· 8 min read
Tian Pan
Software Engineer

Your prompt says "verify X before returning." The trace shows the string "verified X." A week later you discover X was never verified — not once, not for any request, not in any environment. The model learned that emitting the phrase satisfies the rubric. The verification it claimed to do is a sentence in a text generator's output, not an action taken in the world.

This is a different failure than hallucination. Hallucination is the model fabricating a fact about the world. Self-attested verification is the model fabricating a fact about its own process. The first is a knowledge problem. The second is a substrate problem — you asked a string-producing system to perform an action it has no mechanism to perform, and it produced a string that looks like the action would have looked.
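One way to turn verification back into an action is to have the harness run the check and gate the response on its result, ignoring whatever the model says about its own process. The sketch below is only an illustration: `VerificationReceipt`, `run_check`, and `finalize` are hypothetical names, not an API from any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class VerificationReceipt:
    """Evidence produced by the harness, never by the model (hypothetical type)."""
    check_name: str
    passed: bool


def run_check(check_name: str, check: Callable[[], bool]) -> VerificationReceipt:
    # The harness executes the check itself and records the outcome.
    return VerificationReceipt(check_name=check_name, passed=bool(check()))


def finalize(model_output: str, receipt: Optional[VerificationReceipt]) -> str:
    # The phrase "verified X" inside model_output is deliberately ignored:
    # only a passing harness-issued receipt lets the answer through.
    if receipt is None or not receipt.passed:
        raise RuntimeError("no passing verification receipt; refusing to return")
    return model_output
```

The design choice is that the model's text never counts as evidence; only the receipt does.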

The Tool Version Bump Your Agent Quietly Adapted To

· 10 min read
Tian Pan
Software Engineer

A downstream search service ships v2.3.2 on a Tuesday afternoon. The release notes mention a renamed status field, a new nullable confidence value, and a reordered array in the result envelope. Nothing in the CHANGELOG is marked breaking. The provider's own client libraries absorb the change in a point release. Your team's HTTP integrations would have logged a deserialization error inside an hour. Your agent — the one routing customer questions through that search tool — does not. It keeps answering. The questions still resolve. The dashboards stay green.

Six weeks later, someone notices that "out of stock" replies have crept up from two percent of queries to eleven. The root cause is the v2.3.2 bump. The renamed status string changed from in_stock to available, and the agent — being a flexible reasoner over text rather than a schema-strict client — interpreted the absence of the old token as "not available," then phrased that finding into helpful, confident, wrong customer messages. The contract regression was absorbed on the consumer side, where no test suite was watching.

This is the failure mode that conventional API hygiene was never designed to catch. Strict clients break loudly. Agents break quietly. And the longer you treat your agent like a normal HTTP consumer, the longer this class of bug hides inside metrics that look fine.
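A minimal sketch of restoring the loud failure: validate the tool response against a pinned vocabulary before the agent ever reads it. The `status` field and the `in_stock` value come from the scenario above; `out_of_stock`, `ToolContractError`, and `validate_search_result` are assumptions made for illustration.

```python
# Assumed pre-bump vocabulary; "out_of_stock" is a guess at the other value.
KNOWN_STATUSES = {"in_stock", "out_of_stock"}


class ToolContractError(ValueError):
    """Raised when a tool response drifts from the pinned contract."""


def validate_search_result(result: dict) -> dict:
    # Fail loudly on a missing field or an unknown status value instead of
    # letting the agent reinterpret the new shape on its own.
    status = result.get("status")
    if status not in KNOWN_STATUSES:
        raise ToolContractError(f"unexpected status value: {status!r}")
    return result
```

With this in front of the agent, a response carrying `available` raises an error on the first call rather than surfacing weeks later as a shifted metric.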

The Account Number Your LLM Could Not Actually Copy

· 10 min read
Tian Pan
Software Engineer

A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.

This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It changed the final character of an existing identifier, and that is a different failure mode, one your eval suite probably never caught, because the synthetic identifiers in your test fixtures were unique by construction. Two account numbers in the same context, sharing a prefix, and the language model (a token predictor that was never trained specifically to copy arbitrary strings with fidelity) picked the wrong one.

The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.
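One structural guard, sketched under assumptions: the harness keeps the identifier it received from the search call and compares it byte for byte with whatever the model passes to the refund call. `guard_refund_target` is a hypothetical helper, not part of any real SDK.

```python
def guard_refund_target(model_supplied_id: str, authoritative_id: str) -> str:
    """Return the harness-held identifier, or refuse if the model's copy differs."""
    # Exact string comparison: a one-character drift is a hard failure,
    # not something to be fuzzy-matched or auto-corrected.
    if model_supplied_id != authoritative_id:
        raise ValueError(
            f"identifier mismatch: model wrote {model_supplied_id!r}, "
            f"record says {authoritative_id!r}"
        )
    return authoritative_id
```

A stricter variant would never let the model write the identifier at all and would have it pass an opaque handle that the harness resolves.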

The Backfill Problem: Why Agent Memory Needs Migrations Like a Database

· 11 min read
Tian Pan
Software Engineer

You shipped a better memory format on a Tuesday. The new schema splits a freeform summary string into structured fields — entities, preferences, last_verified_at — because the old blob was hard to retrieve against and impossible to update cleanly. The change is obviously correct. It passes review. It ships.

What you did not notice is that every memory written before Tuesday is now subtly wrong. Some records still have the old summary field and no entities, so the retrieval code that now keys on entities skips them. A few have a summary that the new parser interprets as an empty preference set. The agent didn't crash. It just quietly forgot a year of accumulated context, and nobody filed a bug because nothing looked broken — the agent still answered, just worse.
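A database-style fix is to version every record and upgrade old ones explicitly instead of letting the new parser guess. The sketch below assumes a `schema_version` field, a `legacy_summary` fallback, and a `needs_backfill` flag, none of which appear in the original schema; only `summary`, `entities`, `preferences`, and `last_verified_at` do.

```python
CURRENT_SCHEMA_VERSION = 2  # assumed numbering: 1 = freeform summary, 2 = structured


def migrate_memory(record: dict) -> dict:
    """Upgrade a memory record to the current schema without dropping data."""
    if record.get("schema_version", 1) >= CURRENT_SCHEMA_VERSION:
        return record
    return {
        "schema_version": CURRENT_SCHEMA_VERSION,
        "entities": [],
        "preferences": [],
        "last_verified_at": None,
        # Keep the old blob so retrieval can fall back to it until a
        # backfill job extracts real entities and preferences.
        "legacy_summary": record.get("summary", ""),
        "needs_backfill": True,
    }
```

The `needs_backfill` flag keeps "empty because nothing is known" distinguishable from "empty because nobody has migrated this yet".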

The Degradation Signals Your Agent Never Receives

· 9 min read
Tian Pan
Software Engineer

When a downstream API starts to wobble, a human operator finds out a dozen ways before anything actually breaks. The status page flips to yellow. A changelog email lands in the inbox. A warning banner appears in the provider's dashboard. The on-call channel lights up with a 429 someone spotted in the logs. A teammate posts "anyone else seeing slow writes?" None of these are responses to a request. They are the ambient operational signal that surrounds the API, and a human absorbs it almost passively.

An agent calling the same API receives exactly one thing: the response to the request it just made. Status code, headers, body. That is the entire channel. It has no inbox, no dashboard, no Slack, no peripheral vision. It cannot notice that the last ten calls each took twice as long as the ten before. It cannot read the status page, because nobody handed it the URL and it has no standing instruction to look. When the dependency degrades, the agent is the last party in the system to find out — and it usually finds out by failing.

This asymmetry is not a model capability problem. A smarter model does not fix it. The agent is blind to operational signals because the plumbing never delivers them, and most agent stacks ship without anyone noticing the plumbing is missing.
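As a rough sketch of the missing plumbing, the harness can track call latency and hand the agent a plain-text note when recent calls slow down. `LatencyWatch`, its window size, and the 2x threshold are illustrative assumptions, not values from the post.

```python
from collections import deque
from statistics import mean
from typing import Optional


class LatencyWatch:
    """Compares the most recent window of call latencies with the window before it."""

    def __init__(self, window: int = 10, slowdown_ratio: float = 2.0) -> None:
        self.window = window
        self.slowdown_ratio = slowdown_ratio
        self.samples: deque = deque(maxlen=2 * window)

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def degradation_note(self) -> Optional[str]:
        # Not enough history yet to compare two full windows.
        if len(self.samples) < 2 * self.window:
            return None
        samples = list(self.samples)
        older, recent = samples[: self.window], samples[self.window:]
        baseline = mean(older)
        if baseline > 0 and mean(recent) >= self.slowdown_ratio * baseline:
            return "Dependency latency has at least doubled over recent calls."
        return None
```

The returned note is meant to be injected into the agent's context alongside the tool result, which gives it the signal a human would have picked up from a dashboard.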

The Eval Suite That Became the Spec Nobody Agreed To

· 8 min read
Tian Pan
Software Engineer

Open any mature agent codebase and ask a simple question: where is the requirements document? Not the pitch deck, not the launch doc, not the Notion page that was last touched in Q3. Where is the artifact that says, concretely and unambiguously, what this agent is supposed to do?

For most teams, the honest answer is the eval suite. There is a folder of test cases — inputs paired with expected outputs, rubrics, judge prompts — and a CI gate that says pass or fail. That folder is the only place where "correct" is defined precisely enough to be executed. Everything else is prose, and prose drifts.

This is not inherently bad. An executable spec is more honest than a PRD that nobody reads. The problem is that almost nobody treats the eval suite as a spec. It was assembled by one engineer, under deadline, to make a release gate go green. It encodes a hundred judgment calls that were never written down, never reviewed, and never agreed to. And the model is now optimized precisely to it.

You Can't Email a Changelog to a Model: Why API Deprecation Breaks When the Caller Is an LLM

· 10 min read
Tian Pan
Software Engineer

API deprecation is a communication protocol that assumes the receiver can read. You publish a changelog, send an email to registered developers, add a Deprecation header, give six months of notice, and trust that a human on the other end will see the warning, file a ticket, and migrate before the sunset date. That entire workflow quietly stopped working the moment your most active caller became a language model.

An LLM does not subscribe to your developer newsletter. It does not have a Slack channel where someone pastes your migration guide. It rediscovers your API on every single call — from a tool description it was handed, a documentation page that may be eighteen months stale, or a memory of how your API looked in its training data. There is no persistent client you can version, notify, or page. Each request is a fresh negotiation with an entity that has no memory of your last announcement and no obligation to read your next one.

This is not a hypothetical. As agents become the dominant consumers of internal and external APIs, the deprecation playbook every backend team has used for fifteen years is failing in a specific, diagnosable way — and most teams discover it only when a "deprecated for six months" endpoint is still serving an agent in production with no path to make it stop.
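Since the model will not read the warning, the tool wrapper has to. A minimal sketch, assuming the wrapper can see response headers as a mapping: it looks for the `Deprecation` header mentioned above and the standard `Sunset` header, and returns a notice that the harness can log or page on. The function name and message wording are hypothetical.

```python
from typing import Mapping, Optional


def deprecation_notice(headers: Mapping[str, str]) -> Optional[str]:
    """Return a human-routable notice if the response marks the endpoint deprecated."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "deprecation" not in lowered:
        return None
    sunset = lowered.get("sunset", "unknown")
    return f"Tool endpoint is deprecated (sunset: {sunset}); notify the owning team."
```

The notice is aimed at the humans who own the tool description, because they are the only party that can actually migrate the caller.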

The Prompt Surface Area Problem: Why Adding a Tool Is Never Just Adding a Tool

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped an LLM-powered agent has been tempted by a simple mental model: a tool is a function. Adding a tool means the agent can do one more thing. The cost is a few lines of documentation in the system prompt, maybe a schema definition, maybe one new entry in a tool registry. It feels additive — linear.

It isn't. Each new tool doesn't expand what the agent can do in isolation; it expands what the agent can do in combination with every tool already there. That distinction is the source of a class of production failures that no amount of prompt tweaking can fix after the fact, because the problem is architectural. The prompt surface area problem is real, it compounds quickly, and most teams don't see it until they're already deep in it.
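To put a number on "in combination", here is a back-of-the-envelope sketch that counts only unordered pairs of tools. The post gives no formula, so treat pairwise counting as an assumed lower bound; ordered sequences and longer chains grow faster.

```python
from math import comb


def pairwise_interactions(tool_count: int) -> int:
    """Number of distinct tool pairs the agent could combine."""
    return comb(tool_count, 2)


def added_by_next_tool(tool_count: int) -> int:
    """New pairs introduced by adding one more tool to an existing set."""
    return pairwise_interactions(tool_count + 1) - pairwise_interactions(tool_count)
```

Under this count, ten tools give 45 pairs, and the eleventh tool adds ten more rather than one.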

Agent Blast Radius: Bounding Worst-Case Impact Before Your Agent Misfires in Production

· 10 min read
Tian Pan
Software Engineer

Nine seconds. That's how long it took a Cursor AI agent to delete an entire production database, including all volume-level backups, while attempting to fix a credential mismatch. The agent had deletion permissions it never needed for any legitimate task. The blast radius was total because nobody had bounded it before deployment.

This isn't a story about model failure. It's a story about permission scope. The model did exactly what it calculated it should do. The engineering team just never asked: what's the worst this agent could do if it reasons incorrectly?

That question — answered systematically before deployment — is blast radius analysis.
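A minimal sketch of bounding the answer in code: each task gets an explicit allowlist of actions, and anything outside it is refused before it reaches the tool. The task and action names here are invented for illustration.

```python
# Hypothetical per-task allowlist; destructive actions are simply absent.
ALLOWED_ACTIONS = {
    "fix_credential_mismatch": frozenset({"read_config", "rotate_secret"}),
}


def authorize(task: str, action: str) -> None:
    """Raise PermissionError unless the action is allowlisted for the task."""
    allowed = ALLOWED_ACTIONS.get(task, frozenset())
    if action not in allowed:
        raise PermissionError(f"{action!r} is not permitted for task {task!r}")
```

Unknown tasks get an empty allowlist, so the default is deny rather than allow.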

Tool Output Schema Design: How Your Tool Responses Shape Agent Reasoning

· 9 min read
Tian Pan
Software Engineer

Most teams designing LLM agents spend considerable effort on tool selection and system prompt wording. Almost none of them think carefully about what their tools return. That's a mistake with compounding consequences — because the shape of a tool response determines how well the agent can reason about it, how much context window it consumes, and how often it hallucinates an interpretation the tool never intended.

Tool output schema design is infrastructure, not plumbing. Get it wrong and your agent will fail in ways that look like reasoning problems when they're actually schema problems.

The Privacy Boundary No One Tests: Why 'Stateless' Tools Are the AI-Era IDOR

· 10 min read
Tian Pan
Software Engineer

A tool labeled "stateless" is a promise the runtime cannot keep. Behind the function signature sits a Redis cache, a vector index, an embedding store, a rate-limit table, a memoization layer, an LRU on the hot path — any one of which is a shared substrate where one user's data can land on another user's response. The function is stateless. The system is not. And in 2026, this is the most common privacy bug I see in agentic systems, because almost no one tests for it.

The shape of the bug is depressingly familiar to anyone who has worked on classic web apps. Insecure Direct Object Reference — IDOR — was the bread and butter of bug bounty for a decade: a request handler that accepts a record ID and returns the record without checking whether the caller is allowed to see it. The AI-era version is the same bug with a worse blast radius: a tool call that accepts a query and returns data without checking whether the caller's tenant owns that data. The query is in natural language. The cache key is a hash. The retrieval is approximate. None of those things absolve you of authorization, but each of them makes the bug harder to spot in code review.
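For the cache case specifically, one common mitigation is to make the tenant part of the key, so that a hit can never cross a tenant boundary. The sketch below is a toy in-memory cache with hypothetical names; it does not replace an authorization check on the underlying data.

```python
import hashlib
from typing import Optional


class TenantScopedCache:
    """Toy cache whose keys always include the caller's tenant."""

    def __init__(self) -> None:
        self._store: dict = {}

    def _key(self, tenant_id: str, query: str) -> tuple:
        # Hash only the query; keep the tenant as a separate key component
        # so two tenants asking the same question never share an entry.
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return (tenant_id, digest)

    def put(self, tenant_id: str, query: str, value: str) -> None:
        self._store[self._key(tenant_id, query)] = value

    def get(self, tenant_id: str, query: str) -> Optional[str]:
        return self._store.get(self._key(tenant_id, query))
```

A test that writes as one tenant and reads as another is the cheapest regression check for this class of bug.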

Contract Tests for LLM Tool Surfaces: When the Vendor Changes a Field and Your Agent Silently Adapts

· 11 min read
Tian Pan
Software Engineer

A vendor flipped "items" to "results" in a tool response last Tuesday. The agent didn't crash. It re-planned around the new shape, returned a confident-looking answer that was missing two-thirds of the rows, and the on-call engineer found out three days later when a customer asked why their export was short. No exception fired. No alert tripped. The eval suite, which runs against a frozen fixture from before the vendor change, was green the whole time.

This is the failure mode that contract testing was invented to catch in microservices a decade ago, and the one that almost no agent stack has any equivalent for today. HTTP services have Pact, schemathesis, and oasdiff to sit between consumer and provider and refuse to let breaking changes ship. The tools you hand to your agent — REST endpoints, internal RPCs, vendor SDKs, MCP servers — have nothing comparable. The model absorbs the change, adapts gracefully, and produces a degraded answer that looks correct.
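In the absence of a dedicated tool, a hand-rolled contract check can run against the live tool response instead of the frozen fixture. This sketch pins only the top-level keys; `EXPECTED_KEYS` and `contract_violations` are hypothetical names, and the `items` and `results` keys are taken from the incident above.

```python
# Pinned from the consumer's expectations before the vendor change.
EXPECTED_KEYS = frozenset({"items"})


def contract_violations(response: dict) -> list[str]:
    """List every way the response's top-level keys differ from the pinned set."""
    actual = set(response)
    missing = sorted(EXPECTED_KEYS - actual)
    unexpected = sorted(actual - EXPECTED_KEYS)
    return [f"missing: {key}" for key in missing] + [
        f"unexpected: {key}" for key in unexpected
    ]
```

Run on a schedule against the real endpoint, a non-empty result is the alert that never fired in the story above.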