Skip to main content

702 posts tagged with "llm"

View all tags

The Eval Set Is a Lagging Indicator: Your Green Dashboard Only Knows Last Quarter's Failures

· 8 min read
Tian Pan
Software Engineer

Every mature AI team builds its eval suite the same way, and almost nobody says the quiet part out loud. A failure shows up in production. Someone writes a postmortem. An engineer distills the incident into a test case, adds it to the eval suite, and the dashboard goes green again. Repeat this loop for a year and you have a few hundred cases, a satisfying pass rate, and a deeply comforting number to put on a slide.

Here is the quiet part: that suite is a museum. Every exhibit is a failure class the team has already survived. A 98% pass rate certifies your system against the past — against the specific ways it has already broken — and says almost nothing about the novel failure mode that a model migration, a prompt edit, or a shift in user behavior is about to introduce. The eval set is a lagging indicator wearing the costume of a leading one.

The Fallback Model You Never Load-Tested

· 8 min read
Tian Pan
Software Engineer

Every resilient LLM design has a line in the config that names a secondary model. It is there because someone, during a design review, asked the right question — "what happens when the primary is down?" — and someone else answered it with a fallback: key. Everyone nodded. The architecture diagram got a second box with a dotted arrow. The compliance doc got a sentence about graceful degradation.

And then nobody touched it again.

The fallback model is the most confidently asserted, least exercised component in most production AI systems. It is named, documented, and diagrammed — and on the day it actually carries traffic, it is also the day it has its first encounter with a real request. You did not build a safety net. You built a second model with an unknown breaking strain, and you will discover that strain at the worst possible moment.

Your Fallback Path Is the Only Untested Code in Production

· 9 min read
Tian Pan
Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.

The Happy Path Is the Only Path Your Agent Eval Ever Tested

· 10 min read
Tian Pan
Software Engineer

Look at where most agent eval sets come from. Someone builds the agent, demos it to the team, the demo works, and the demo script becomes the eval suite. The cases that pass review are the cases someone already watched pass. The eval set is, almost by construction, a recording of the happy path — the one tool sequence that worked the day the screenshot was taken.

So when the dashboard says the agent scores 94%, what it actually says is: it passes the cases we imagined. It says nothing about the case where the search API returns a 429 in the middle of a multi-step plan, where the user contradicts a constraint they stated two turns ago, or where retrieval comes back empty and the agent has to decide between guessing and admitting it doesn't know. Those cases aren't failing your eval. They were never in it.

This is golden-path bias, and it is the default shape of an agent eval suite unless you fight it deliberately. The fix is not more cases. It is different cases — chosen by failure mode, harvested from production, and stress-tested with deliberate faults.

Your Internal API Became a Public API the Day an Agent Called It

· 10 min read
Tian Pan
Software Engineer

Internal APIs survive on a quiet arrangement: nobody writes the contract down because everybody already knows it. The fields that happen to be there, the error you throw that a caller secretly parses, the endpoint that returns 200 with an empty list instead of 404 — these are load-bearing behaviors held together by the fact that you can name every caller and Slack them before you change anything. That arrangement works right up until it doesn't.

It stops working the day you wire an agent to that API. Not because the agent is malicious or careless, but because the agent is a caller you cannot reach. It has no Slack handle. It did not read your migration note. It depends on response shapes it absorbed from an example payload or a schema snapshot, and it will keep depending on them long after you've moved on.

The uncomfortable truth is that "internal" was never a property of the API. It was a property of the caller list. Shorten that list to people you know and the API is internal; add one participant you can't coordinate with and the API is public — with all the discipline that word implies, and none of the infrastructure you'd have built if you'd known.

The Model Reached End of Life and Took Your Prompt With It

· 10 min read
Tian Pan
Software Engineer

A deprecation notice looks harmless. It arrives as a calm paragraph in a changelog or an email: this model snapshot will be removed from the API on a date a few months out, here is the recommended replacement, thank you for building with us. The implied work is a one-line change — swap the model string, redeploy, done.

That framing is wrong, and it is wrong in an expensive way. The model string is the smallest thing you are losing. The thing that actually leaves with the old model is the prompt you spent six months tuning — every edge-case patch, every reordered instruction, every "respond only with valid JSON, do not wrap it in markdown" you added because that specific model did that specific annoying thing. None of that was portable. It was fitted, in the statistical sense, to one model's behavior. The replacement is not bug-for-bug compatible, so the fit no longer holds.

A model end-of-life is a migration project. Treat it as a config change and you will discover the difference in production, on the new model, with real traffic.

Onboarding an Agent Like a Junior Engineer Is a Category Error

· 9 min read
Tian Pan
Software Engineer

When an agent joins your team, the nearest analogy in every engineering manager's head is the new hire. So the playbook writes itself: give it a sandbox and read-only logs, scope the first tasks small, pair with it, expect a ramp-up period, and grow it into bigger work as trust accumulates. It feels responsible. It feels like the same patient management that turned your last junior into a senior.

It is also a category error — not a slightly imperfect analogy, but a wrong one. A junior engineer is a person who does not yet know your system. An agent is a stateless function that will never know your system, no matter how many times it touches it. Those are different kinds of things, and the management instincts that work for one quietly misallocate your attention on the other.

The reason this matters is that the metaphor doesn't just mislead — it tells you to invest in the wrong place. "Grow the agent" is not a strategy. The agent is fixed. Everything you can actually change lives outside of it.

Prompt Caching's Hidden Tax: When a Cache Hit Serves the Wrong User's Context

· 11 min read
Tian Pan
Software Engineer

Prompt caching is sold as a free win. Cache the long shared prefix — your system prompt, your tool definitions, your retrieved context — pay full price only for the short tail that changes, and watch the bill drop. The numbers are real: a cache read costs roughly a tenth of a fresh input token, so a workload with a heavy stable prefix can see its input cost fall by 80% or more. Teams adopt it for that reason, tune it for that reason, and report on it with a single metric: cache hit rate, trending up.

What that framing hides is that the boundary you just drew — the line between the cached prefix and the uncached tail — is not a billing knob. It is a correctness boundary. Everything above the cache breakpoint is content the system has decided is interchangeable across requests. If you draw that line to maximize hit rate, you are letting a finance metric decide which facts in your prompt are allowed to be shared between users, between tenants, and across time. That is an isolation decision, and it deserves to be made on purpose.

The failure mode is quiet because it never throws. A cache hit that serves one user's context shaped by another user's profile returns a perfectly well-formed response. A cache hit that serves personalization that was true when the prefix was warmed and false by the time it is reused returns a confident, coherent, wrong answer. Nothing in your latency graph or your error rate moves. The only signal is a hit rate that looks great — because the key is too coarse.

Prompt Injection Is a Confused Deputy, Not a Content-Filtering Problem

· 10 min read
Tian Pan
Software Engineer

The most common post-incident finding for a prompt injection breach is some variation of "the model got tricked." A retrieved document contained hidden instructions, the agent followed them, customer data left the building. The fix that follows is almost always a content filter: scan the input, classify the malicious instruction, strip it out before it reaches the model. Ship the filter, close the ticket.

That finding is wrong, and the filter is a treadmill. "The model got tricked" describes the symptom, not the vulnerability. The vulnerability is that an agent holding real privileges — a database token, a send-email capability, filesystem write — accepted instructions from a source that should never have been allowed to command those privileges. That is not a new class of bug. It is a confused deputy, and operating systems named and largely solved it almost forty years ago.

If you treat prompt injection as a detection problem, you are signing up for an arms race against every attacker who can phrase a sentence. If you treat it as an authority problem, you get to reuse decades of security engineering that already works.

The Quadratic Cost of a Conversation: Why AI Chat Spend Grows Superlinearly

· 8 min read
Tian Pan
Software Engineer

A ten-turn conversation does not cost ten times a single turn. It costs closer to fifty-five times. This is the first thing most teams get wrong when they model the unit economics of an AI feature, and it is the reason a product that looks profitable in a spreadsheet bleeds money in production.

The mistake is treating a conversation as a sequence of independent requests. It is not. Because LLM APIs are stateless, every turn re-sends the entire accumulated history. Turn one sends one unit of context. Turn two sends two. Turn ten sends ten. The total tokens billed across the session is the sum 1 + 2 + ... + N, which grows as N²/2 — quadratically — while your product almost certainly charges a flat, linear price.

The users who love your product most are the ones holding the longest conversations. They are also the ones quietly destroying your margins.

The Rate Limit That Became a Product Decision

· 10 min read
Tian Pan
Software Engineer

A rate limit used to be an infrastructure detail. You hit a 429, you retried with backoff, you queued the overflow, and nobody outside the on-call channel ever knew it happened. The user saw a response that was a few hundred milliseconds slower than usual. That was the whole story.

That story no longer holds for agentic features. When an agent hits a provider's tokens-per-minute ceiling halfway through a multi-step plan, the failure does not stay inside the infrastructure. It surfaces as a half-finished answer, a tool loop that stalls before the last call, or a user watching a spinner that will never resolve. The quota stopped being a backend capacity number and became a constraint that product has to design around — the same way product designs around a checkout flow or an empty state.

The Semantic Cache That Confidently Returns the Wrong Answer

· 9 min read
Tian Pan
Software Engineer

Two support users ask your agent almost the same question within a minute of each other. The first asks, "What's our refund window for EU orders?" The second asks, "What's our refund window for US orders?" The embeddings of those two sentences sit a hair's breadth apart — same length, same structure, one two-letter token of difference. Your semantic cache, tuned to a similarity threshold that looked perfectly reasonable in the demo, scores them as a match. The second user gets the first user's answer. The EU's 14-day cooling-off period is presented to a US customer as fact, in fluent prose, with no asterisk.

Nobody gets paged for this. The cache returned a 200. Latency was great. The cost dashboard shows a hit, which is the outcome everyone wanted. The only signal that anything went wrong is a customer acting on policy that does not apply to them — and that signal arrives days later, through a refund dispute, not through your monitoring.

This is the failure mode that makes semantic caching different from every cache you have built before. An exact-match cache can be stale, but it is never wrong — the key either matches or it doesn't. A semantic cache trades that guarantee away on purpose. It is designed to return answers for keys it has never seen, and the price of that latency win is a correctness risk that most teams never put a number on.