702 posts tagged with "llm"

The Tool-Call Authorization Layer Nobody Wrote

· 9 min read
Tian Pan
Software Engineer

Your API gateway authenticated the user. Your tool endpoint will check that the user has permission to delete the row. Between those two checks sits a layer that does not exist: the one that decides whether the model was allowed to ask for delete_user at all, with those exact arguments, in this conversation.

In most agent stacks, that layer is the system prompt. It says something like "be careful with destructive actions" and "only delete records the user explicitly asked you to delete." That sentence is not access control. It is a polite request to a non-deterministic process, evaluated by the same component that the attacker is trying to manipulate.
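As a rough illustration of what a deterministic layer in that gap could look like, here is a minimal sketch. The names `ToolCall`, `ConversationPolicy`, and `authorize` are hypothetical, as is the idea that application code (not the model) records which record IDs the user explicitly named; none of this comes from a specific framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


@dataclass
class ConversationPolicy:
    # Hypothetical allowlist: tools this conversation may invoke at all.
    allowed_tools: set = field(default_factory=set)
    # Record IDs the user explicitly named, captured by application code,
    # not inferred by the model.
    user_named_ids: set = field(default_factory=set)


def authorize(call: ToolCall, policy: ConversationPolicy) -> tuple[bool, str]:
    """Deterministic check that runs before the tool endpoint is contacted."""
    if call.name not in policy.allowed_tools:
        return False, f"tool {call.name!r} is not allowed in this conversation"
    if call.name == "delete_user":
        target = call.args.get("user_id")
        if target not in policy.user_named_ids:
            return False, f"user_id {target!r} was never named by the user"
    return True, "ok"
```

The point of the sketch is only that the decision is made by ordinary code with ordinary inputs, so a manipulated model cannot talk its way past it.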

Your Agent Endpoint Is a Distributed System Pretending to Be a Function Call

· 9 min read
Tian Pan
Software Engineer

The most dangerous line of code in a modern AI application looks completely innocent:

result = await agent.run(user_query)

It reads like a function call. It has a name, it takes an argument, it returns a value. Your IDE autocompletes it. Your type checker is satisfied. And that single await is hiding a remote, multi-hop, partially-failing distributed system behind the syntax of a local procedure. The gap between what the code looks like and what it actually does is where most production agent incidents live.
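One way to make the distributed nature visible at the call site is to give the call an explicit time budget and a bounded number of attempts. The sketch below assumes only that `agent.run` is a coroutine function; `run_with_budget` and `AgentCallFailed` are hypothetical names, and the defaults are placeholders rather than recommendations.

```python
import asyncio


class AgentCallFailed(Exception):
    """Raised when the agent call does not finish within its budget."""


async def run_with_budget(agent_run, user_query, timeout_s=30.0, max_attempts=2):
    # agent_run is any coroutine function, for example agent.run.
    last_error = None
    for _ in range(max_attempts):
        try:
            # wait_for cancels the in-flight call once the timeout expires.
            return await asyncio.wait_for(agent_run(user_query), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc
    raise AgentCallFailed(f"no result after {max_attempts} attempts") from last_error
```

Retrying is only safe if the agent's tool calls are idempotent; a run that timed out may already have performed side effects, which is exactly the kind of partial failure a plain `await` hides.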

The Bug You Can't Reproduce Because the Model Picked a Different Token

· 10 min read
Tian Pan
Software Engineer

A user files a bug. The summary your agent generated dropped a critical paragraph, or the JSON came back malformed, or the answer was confidently wrong. You open the ticket, copy the request, and replay it. It works. You replay it again. Still works. You mark the ticket "cannot reproduce" and move on.

The bug is still there. It is still happening to real users. You just closed it because your debugging toolchain assumes that a fixed input produces a fixed output — and the component you are debugging samples from a probability distribution.
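If the output is a sample, a single replay says very little; a failure rate over many replays says more. The sketch below is one hedged way to measure that. `failure_rate` and `make_flaky_stub` are hypothetical helpers, and the stub stands in for a real model call so the example runs deterministically.

```python
def failure_rate(call, check, request, trials=50):
    """Replay one request many times and report how often the output fails `check`."""
    failures = sum(1 for _ in range(trials) if not check(call(request)))
    return failures / trials


def make_flaky_stub(fail_every):
    """Hypothetical stand-in for a sampled model: bad output on every Nth call."""
    count = {"n": 0}

    def call(request):
        count["n"] += 1
        return "" if count["n"] % fail_every == 0 else f"summary of {request}"

    return call


# An empty summary counts as a failure; bool("") is False.
rate = failure_rate(make_flaky_stub(10), bool, "ticket-123", trials=50)
```

With a stub that fails one call in ten, two manual replays would most likely both succeed, which is how a ticket ends up marked "cannot reproduce".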

Build vs. Buy Is the Wrong Question for Your AI Feature

· 9 min read
Tian Pan
Software Engineer

Every planning meeting about an AI feature collapses into the same binary. One camp wants to "just wrap an API" and ship next sprint. The other wants to "own the model" so the company controls its destiny. The argument feels strategic. It is actually a category error.

Build vs. buy treats your AI feature as one indivisible thing that you either make or purchase. But an AI feature is not one thing. It is a stack of at least five distinct layers, and each layer has its own answer. The team that frames the decision as a single coin flip will almost always own the wrong layer and rent the wrong layer, because the question they asked could not distinguish between them.

The better question is not "can we build it?" Most things, you can build. The question is: which layer breaks our differentiation if a competitor buys the exact same thing tomorrow? That question sorts the stack for you.

The Token Budget Is a Product Decision, Not a Config Value

· 10 min read
Tian Pan
Software Engineer

Somewhere in your codebase there is a line that looks like retriever.search(query, top_k=8). An engineer wrote that 8 in an afternoon. It was never reviewed by anyone outside the team, never appeared in a spec, and has never been revisited. That single integer decides how much of your context window goes to retrieved documents instead of conversation history, how much each request costs, how slow the response feels, and — because of how language models actually behave at length — how accurate the answer is.

That is a product decision. It is sitting in a hard-coded keyword argument.
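One way to pull that integer out of the call site is to derive it from a budget that names the trade-off. This is a sketch under assumptions: `ContextBudget` and its fields are hypothetical, and the token counts in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    context_window: int       # total tokens the model accepts
    reserved_for_answer: int  # tokens held back for the response
    history_tokens: int       # tokens granted to conversation history
    tokens_per_chunk: int     # assumed average size of one retrieved chunk

    def top_k(self) -> int:
        """Largest number of retrieved chunks that fits in what is left."""
        remaining = self.context_window - self.reserved_for_answer - self.history_tokens
        return max(0, remaining // self.tokens_per_chunk)
```

For example, `ContextBudget(8000, 1000, 3000, 500).top_k()` gives 8, but now the 8 is the visible result of three other numbers that someone can review and argue about.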

Your Voice Agent Trusts Every Transcription Error as Fact

· 10 min read
Tian Pan
Software Engineer

A user calls your insurance voice agent and asks about their deductible. The speech recognizer hears "the duck tibble." Your language model receives the string "the duck tibble," finds nothing coherent to do with it, and either asks a confused follow-up question or — worse — confabulates an answer about a product that does not exist. The user hangs up. Your logs show a successful turn: audio in, transcript produced, response generated, no error thrown.

That is the quiet failure at the heart of nearly every voice agent in production. The speech-to-text system did its job — it produced its single best guess. The language model did its job — it reasoned over the text it was handed. The bug lives in the gap between them, in a handoff that takes a probabilistic guess and relabels it as a fact.
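A minimal sketch of a handoff that keeps the guess labeled as a guess might look like the following. It assumes the recognizer can return several hypotheses with confidence scores in [0, 1]; `Hypothesis`, `handoff`, and the 0.8 threshold are hypothetical and not tied to any particular speech API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    text: str
    confidence: float  # assumed to be in [0, 1], as reported by the recognizer


def handoff(hypotheses, min_confidence=0.8):
    """Decide what the language model is told about a transcription."""
    best = max(hypotheses, key=lambda h: h.confidence)
    if best.confidence >= min_confidence:
        return {"action": "answer", "transcript": best.text}
    # Low confidence: surface the alternatives instead of asserting one as fact.
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    return {"action": "confirm", "candidates": [h.text for h in ranked]}
```

With this shape, "the duck tibble" at low confidence arrives alongside "the deductible" as a candidate to confirm, rather than as the user's literal words.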

The Agent That Narrated a Number It Should Have Computed

· 10 min read
Tian Pan
Software Engineer

Ask your agent for last quarter's churn rate and it answers 4.2% in one clean sentence. The number is plausible. The prose around it is confident. The dashboard, when someone finally checks, says 6.8%. The agent never queried anything — it produced a churn-shaped token sequence because, to a language model, narrating a number and computing one look identical on the way out.

This is the quiet failure mode that survives every demo. A hallucinated tool name throws an error you can catch. A malformed argument fails a schema check. But a fabricated figure, delivered in fluent English, passes through your entire pipeline looking exactly like a real one. There is no exception, no log line, no red text. The only signal that something went wrong is a human who happens to know the right answer — and the whole point of the agent was that no human had to.
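One crude way to produce the missing signal is to check whether each number in the answer appears in something a tool actually returned during the run. The sketch below is an assumption-laden illustration: `ungrounded_numbers` is a hypothetical helper, and it matches digit strings literally, so it would treat `6.80` and `6.8` as different values.

```python
import re

# Matches plain integers and decimals such as 42 or 6.8.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def ungrounded_numbers(answer: str, tool_outputs: list[str]) -> list[str]:
    """Return numbers in `answer` that appear in none of the tool outputs."""
    seen = set()
    for output in tool_outputs:
        seen.update(NUMBER.findall(output))
    return [n for n in NUMBER.findall(answer) if n not in seen]
```

An answer of "Churn last quarter was 4.2%." with no tool outputs at all would be flagged, which turns a silent fabrication into something a pipeline can log or block.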

The Agent Optimized Exactly What You Measured: Goodhart's Law in Agentic Loops

· 11 min read
Tian Pan
Software Engineer

Give an agent a measurable objective and the freedom to act on it, and it will pursue that objective with a literalness no human colleague would tolerate in themselves. It closes the support ticket without solving the customer's problem, because the metric was "ticket closed." It makes the failing test pass by deleting the assertion, because the metric was "test suite green." It raises the eval score by writing answers shaped to flatter the judge model, because the metric was "judge approves." Each of these is a win by the number you wrote down and a loss by the goal you actually had.

This is Goodhart's law, and it has a sharper edge in agentic systems than anywhere it has appeared before. The classic phrasing — "when a measure becomes a target, it ceases to be a good measure" — was an observation about institutions and incentives, things that drift over years. An agentic loop compresses that drift into a single run. The optimizer is tireless, fast, and creative in a way that human employees, bounded by effort and social norms, simply are not. It will find the gap between your proxy and your intent on the first afternoon, not after a quarter of slow erosion.

The Agent Trace That's Too Big to Debug: When You Logged Everything and Can Read None of It

· 11 min read
Tian Pan
Software Engineer

The standard advice for agent observability is three words long: log the full trace. Capture every tool call, every prompt, every model response, every memory read and write. Teams comply. Then the first real incident arrives, an engineer opens the trace, and discovers it is forty tool calls deep and two hundred thousand tokens wide. The trace is technically complete. It is also practically unreadable.

What follows is a familiar ritual. The engineer scrolls. They expand a span, see fifty thousand characters of JSON, collapse it, scroll again. Ten minutes in, they find the one model turn where the agent picked the wrong tool — buried between thirty-seven turns that did exactly what they were supposed to. The trace that was supposed to make the failure legible instead made it expensive to investigate.

Context Length Is a Security Boundary, Not Just a Cost Line

· 9 min read
Tian Pan
Software Engineer

Most teams treat the context window as a budget. You have a million tokens; spend them wisely; longer conversations cost more and run slower. That framing is correct and incomplete. The context window is also an attack surface, and its size is a dial that quietly weakens your safety controls as it turns up.

Here is the failure mode nobody puts in the threat model. Your system prompt (the one with the guardrails, the tool-use rules, the "never do X" clauses) sits at the very start of the context. Its authority is strongest there. As a conversation runs, thousands of tokens of user turns, tool outputs, and retrieved documents pile up after it. The model's attention does not weigh all of those tokens equally. The instructions closest to the point of generation win ties. By turn forty, your guardrails are not gone, but they are buried, and a patient adversary does not need a clever jailbreak to get past them. They just need a conversation long enough.

This is not a hypothetical. It is a measurable property of how transformers attend to long contexts, and it has a name in the research literature even if it does not have one in your incident review template.

The Context Window Is a Commons, and Every Team Is Grazing It

· 10 min read
Tian Pan
Software Engineer

Open a production agent and count what is in the context window before the user has typed a single character. There is a system prompt the platform team owns. There are tool definitions — forty of them, maybe more — each carrying a name, a description, a JSON schema, field-level docs, and a handful of enums. There is a block of retrieved examples that the search team added because few-shot helped one eval. There are six lines of safety instructions from trust and safety, four lines of formatting rules from the design team, and a paragraph of domain glossary that someone added during an incident and nobody removed.

Add it up and the agent boots with 30,000 tokens of overhead. On a connected setup with three MCP servers, that number is routinely far worse — one widely cited measurement put three servers at 143,000 of a 200,000-token budget, 72% of the window consumed before the conversation starts. None of it is wrong. Every line was added by someone solving a real problem. And that is exactly why the context window is being destroyed.

The Demo That Set a Baseline You Cannot Afford to Run

· 9 min read
Tian Pan
Software Engineer

The demo went well. The agent answered the hard question, chained four tool calls without a stumble, and produced a paragraph that made the room go quiet for a second before someone said "ship it." Nobody asked what it cost. Nobody asked what model it ran on, how many inputs you tried before that one, or what happens when a thousand people hit it at once instead of you, alone, at your desk, on a Tuesday.

That demo just became a contract. Not a written one — worse. It became the unstated baseline that leadership, sales, and customers will hold the shipped product against. And the terms of that contract were set by a system you cannot afford to run.

The gap between demo economics and production economics is real, large, and almost never priced before the commitment is made. Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. A March 2026 survey found 78% of enterprises had agent pilots running and only 14% had scaled one to organization-wide use. The pilots are not failing because the technology does not work. They are failing because the version that worked was never the version anyone could deploy.