702 posts tagged with "llm"

Your LLM Judge Has a Length Bias, a Position Bias, and a Format Bias — and Nobody Is Auditing Yours

· 11 min read
Tian Pan
Software Engineer

A team I worked with last quarter watched their LLM-as-judge score climb from 78% to 91% over six weeks of prompt iteration. They shipped. Users hated it. The new prompt produced longer, more formatted, more confident-sounding answers — and the judge loved every one of them. The team had not built a smarter prompt. They had reverse-engineered their judge's biases.

This is the failure mode nobody on the team is auditing. LLM-as-judge has well-documented systematic biases: longer answers score higher regardless of quality, the first option in pairwise comparisons wins more often than chance, and outputs that look like the judge's own training distribution outscore outputs that do not. If you wired up an LLM judge twelve months ago and have never re-validated it against humans, your scores are not a quality signal — they are a measurement of how well your prompt has learned to game its own evaluator.

The depressing part is that the audit methodology to catch this is straightforward, the calibration discipline that prevents it is cheap, and almost no team runs either.
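One slice of such an audit, the position-bias check, can be sketched in a few lines: run every pairwise comparison twice with the order swapped and count how often the preferred answer changes. This is a minimal sketch, not the post's own methodology. The `judge` callable and its "first"/"second" return values are hypothetical stand-ins for whatever your judge harness exposes.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not a real library API):
# takes (first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs whose preferred answer changes when the order is swapped."""
    if not pairs:
        raise ValueError("need at least one pair to audit")
    flips = 0
    for a, b in pairs:
        winner_ab = a if judge(a, b) == "first" else b  # a shown first
        winner_ba = b if judge(b, a) == "first" else a  # b shown first
        if winner_ab != winner_ba:
            flips += 1
    return flips / len(pairs)


# Two toy judges for illustration only.
def always_first(x: str, y: str) -> str:
    return "first"  # pure position bias


def longer_wins(x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"  # pure length bias
```

A judge with no position bias scores 0.0 here. Note that the length-biased toy judge also scores 0.0, which is why a flip-rate check complements, but does not replace, validation against human labels.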

Long-Context vs RAG in 2026: Why It Is a Per-Feature Decision, Not an Architecture Religion

· 13 min read
Tian Pan
Software Engineer

The economics of long-context vs RAG have flipped twice in two years, and a team that picked an architecture in either of those windows is now paying the wrong tax everywhere. In 2024 the trend line said "stuff everything in the context window": the windows kept growing and the per-token price kept falling, so retrieval pipelines were dismissed as legacy plumbing. In 2025 the consensus reversed. Context-rot research showed that effective recall on million-token prompts collapsed in the middle of the window, latency on full-window calls turned into a UX problem, and the bills came back loud, so retrieval was rehabilitated. By 2026 the right answer is neither slogan. It is a per-feature decision, made at design time with a four-axis trade-off written down, because picking one architecture for the whole product is the cheap way to be wrong on every feature at once.

The mental model that keeps biting teams is treating long-context vs RAG as a roadmap commitment instead of a per-surface choice. You read one influential blog post, you pick a side, you hire engineers who specialize in that side, you write a platform doc that codifies it, and now every new feature gets the same architecture regardless of whether it fits. The features that need fresh data live with stale context. The features that need scalable corpora pay for retrieval infrastructure they will never use. The features that need citation provenance ship without it. None of these are bugs. They are the predictable cost of treating a feature-level decision as a product-level one.

Prompt Bisect: Binary-Searching the Edit That Broke Your Eval

· 10 min read
Tian Pan
Software Engineer

The eval scoreboard dropped two points overnight. The only thing that shipped between the green run and the red run is last week's prompt PR — the one with seventeen edits in it. Two reordered sections. Three new few-shots. A tightened refusal clause. A swapped role description. A handful of word-level rewordings someone called "polish." When the post-mortem starts, somebody says the obvious thing: "It must be one of those." And then they spend the next two days figuring out which.

Those two days are the most expensive way to find a single regression. The methodology that costs minutes instead is borrowed wholesale from a forty-year-old kernel-debugging trick: bisect the patch. Treat the prompt as a sequence of revertible hunks, run the eval suite as the predicate, and let binary search isolate the line that flipped the score. The math is the same math git bisect runs on commits, and the discipline it forces on prompt management is a side benefit worth more than the bisect itself.
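The search itself can be sketched as follows. This is a minimal version under stated assumptions: the hunks apply in a fixed order, the eval passes with no hunks applied and fails with all of them, and once a prefix of hunks fails every longer prefix fails too. The names `bisect_hunks` and `passes` are hypothetical; `passes` stands in for "apply these hunks to the baseline prompt and run the eval suite."

```python
from typing import Callable, List, Sequence


def bisect_hunks(hunks: Sequence[str], passes: Callable[[List[str]], bool]) -> int:
    """Return the index of the first hunk whose inclusion makes the eval fail.

    Assumes passes([]) is True, passes(all hunks) is False, and that failure
    is monotonic over prefixes (the same assumption git bisect makes).
    """
    if not hunks:
        raise ValueError("nothing to bisect")
    lo, hi = 0, len(hunks)  # invariant: prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(list(hunks[:mid])):
            lo = mid
        else:
            hi = mid
    return hi - 1  # the hunk that turned a passing prefix into a failing one
```

For a seventeen-edit PR this needs at most five eval runs. The monotonicity assumption is the part to watch: if two edits only break the eval in combination, a prefix search finds the later one and the result still needs a human look.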

Prompt-Eligibility: The Missing Column in Your Data Classification

· 11 min read
Tian Pan
Software Engineer

Pull up your company's data classification policy. Public, internal, confidential, restricted — four neat tiers, each mapped to a set of access controls and a list of approved storage locations. Now ask a question the policy was never written to answer: which of these tiers are allowed to leave the corporate perimeter as a token sequence sent to a third-party model API?

The answer is almost always silence. Not because the policy is wrong, but because it is incomplete. Every classification scheme in use today was designed for an access vector that asks "is this employee allowed to read this row?" The prompt layer introduced a different vector entirely: an authorized service reads the row, transforms it into a prompt, and ships it across the network to a vendor that may log it, train on it, or hold it in plaintext for thirty days. None of that is read-access. None of it is covered.

This is the missing column. Until you add it, your data classification document is confidently asserting a control posture you do not have.
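As a data structure, the extra column might look like the sketch below. The four tiers come from the policy described above; the `PromptEligibility` values and the specific tier-to-eligibility assignments are illustrative assumptions, not a recommended policy.

```python
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


class PromptEligibility(Enum):  # hypothetical values for the new column
    ALLOWED = "allowed"            # may be sent to a third-party model API as-is
    REDACT_FIRST = "redact_first"  # may be sent only after redaction
    FORBIDDEN = "forbidden"        # must never leave the perimeter in a prompt


# Illustrative assignments only; a real mapping would come out of a
# security and legal review, not a code sample.
PROMPT_ELIGIBILITY = {
    Tier.PUBLIC: PromptEligibility.ALLOWED,
    Tier.INTERNAL: PromptEligibility.REDACT_FIRST,
    Tier.CONFIDENTIAL: PromptEligibility.FORBIDDEN,
    Tier.RESTRICTED: PromptEligibility.FORBIDDEN,
}


def may_enter_prompt(tier: Tier, redacted: bool = False) -> bool:
    """Check the prompt-eligibility column; anything unmapped is denied."""
    rule = PROMPT_ELIGIBILITY.get(tier, PromptEligibility.FORBIDDEN)
    if rule is PromptEligibility.ALLOWED:
        return True
    if rule is PromptEligibility.REDACT_FIRST:
        return redacted
    return False
```

The point of the sketch is the shape: the check is separate from read access, so a service that is allowed to read a row still has to ask a second question before the row becomes tokens.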

Prompt-Version Skew Across Regions: The Unintended A/B Test Your CDN Ran for Six Hours

· 10 min read
Tian Pan
Software Engineer

You shipped a system-prompt change at 09:14. The rollout dashboard turned green at 09:31. By 11:00 your eval tracker still looked clean, the cost dashboard was unremarkable, and a customer-success engineer pinged the team: structured-output errors on the parser side were up about three percent in Asia-Pacific only. Nothing in North America. Nothing in Europe.

The rollout had paused itself at 67% region coverage because a non-load-bearing health check on one POP flapped during the cutover, and nobody had noticed. For six hours, us-east and eu-west were running prompt v47 while ap-south and ap-northeast were still on v46. You were running a live A/B test split by geography — except you didn't design the test, you couldn't see the test, and the eval suite that was supposed to catch quality regressions was hitting the new version in one region and shrugging.

This failure mode is not a bug in any single tool. It is the predictable consequence of pushing prompts through deployment systems built for a different kind of artifact.
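The skew in this story is cheap to detect if each region reports which prompt version it is actually serving. A minimal sketch, assuming you already have some way to collect that per-region report; the `skewed_regions` helper and the shape of the `served` mapping are hypothetical.

```python
from typing import Dict, List


def skewed_regions(served: Dict[str, str], expected: str) -> List[str]:
    """Regions whose served prompt version differs from the one the rollout claims."""
    return sorted(region for region, version in served.items() if version != expected)


# The six-hour window described above, as a per-region report.
served = {
    "us-east": "v47",
    "eu-west": "v47",
    "ap-south": "v46",
    "ap-northeast": "v46",
}
stale = skewed_regions(served, expected="v47")
```

An alert on a non-empty result would have surfaced the paused rollout without waiting for the parser errors to show up.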

The Indexing Policy Committee Nobody Convened: RAG Corpus Governance Beyond the One-Time Migration

· 9 min read
Tian Pan
Software Engineer

Two years ago, a team pointed their retrieval index at the wiki, the Zendesk export, and a snapshot of the public docs. Last week, that same index returned a deprecated runbook that told an SRE to restart a service that no longer exists. The runbook had been deprecated for eighteen months. Nobody owned its retirement, so nobody retired it. The agent confidently cited it. The model wasn't wrong; the corpus was.

This is the failure mode that doesn't show up in retrieval evals: the corpus is treated as a one-time engineering decision when it's actually an ongoing governance problem. The team that scoped the initial ingestion is long gone. The legal review that should have flagged the customer-confidential PDFs never happened, because nobody told legal there was a pipeline. The "freshness strategy" is a Slack message from someone who left in Q3. The retrieval index has become a shared inbox for every document anyone ever scraped, and the bar for inclusion has drifted to "whatever was easy to ingest."

Reasoning-Effort Budgeting: When Thinking Tokens Become a Finance Line Item

· 11 min read
Tian Pan
Software Engineer

The first time your finance team asks why a single user racked up a fifty-cent answer to a one-tenth-of-a-cent question, the call will not be about the model. It will be about the line on the invoice that did not exist twelve months ago: reasoning tokens. They look like output tokens on the bill, they bill at output-token rates on most providers, and they have no natural ceiling. A query that would have produced a four-hundred-token reply on a non-reasoning model can quietly burn eight thousand internal thinking tokens to get there — and the only person who notices is the one reconciling the spend.

For most of the API era, "tokens used" was an honest number. You sent a prompt in, you got a response out, and the bill was a clean function of both. Reasoning models broke that intuition. The model now generates a hidden, billable chain of thought before it emits the answer the caller will read, and the size of that chain depends on the model's own assessment of how hard the question was. The user-visible output may be a single sentence. The bill may be for ten pages.
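The arithmetic is worth writing down once. The sketch below uses the token counts from the example above (a four-hundred-token reply, eight thousand thinking tokens) and assumes reasoning tokens bill at the output rate. The price per million tokens is a made-up placeholder, not any provider's actual rate.

```python
def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Output-side tokens on the invoice when reasoning bills at the output rate."""
    return visible_tokens + reasoning_tokens


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    return tokens * usd_per_million_tokens / 1_000_000


HYPOTHETICAL_OUTPUT_PRICE = 10.0  # USD per million output tokens; placeholder value

plain = cost_usd(billed_output_tokens(400, 0), HYPOTHETICAL_OUTPUT_PRICE)
with_reasoning = cost_usd(billed_output_tokens(400, 8000), HYPOTHETICAL_OUTPUT_PRICE)
multiplier = billed_output_tokens(400, 8000) / billed_output_tokens(400, 0)
```

Whatever the real price is, it cancels out of the ratio: the same visible answer carries an output-side bill 21 times larger.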

Replan, Don't Retry: Why Most Agent Errors Aren't Transient

· 10 min read
Tian Pan
Software Engineer

A calendar-write returns 409 Conflict. The framework's default error handler kicks in: back off 200ms, retry. Same conflict. Back off 400ms, retry. Same conflict. Back off 800ms, retry. By the time the agent gives up and tells the user "I couldn't book the meeting," it has burned three seconds of latency budget proving something the very first response already told it: the slot is taken. The world has not changed. It will not change in 800 milliseconds. Retrying was never going to work, because nothing about this error was transient.

This is the most common error-handling bug in agent systems, and it is hiding in plain sight inside almost every framework that ships today. The retry-with-exponential-backoff pattern was imported wholesale from stateless HTTP clients — where it is exactly correct — into stateful planning loops where it is actively wrong. The right default for a tool error in an agent is not retry. It is replan.
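The change in default can be sketched as a small routing function: retry only what is plausibly transient, and hand everything else back to the planner. The set of "transient" status codes here is an illustrative assumption, and `next_action` is a hypothetical helper, not any framework's API.

```python
# Status codes treated as transient in this sketch; the exact set is a
# judgment call and an assumption, not a standard.
TRANSIENT_STATUSES = frozenset({408, 429, 502, 503, 504})


def next_action(status: int, attempts: int, max_retries: int = 3) -> str:
    """Decide what the agent loop does after a tool call returns `status`."""
    if 200 <= status < 300:
        return "continue"
    if status in TRANSIENT_STATUSES and attempts < max_retries:
        return "retry"   # the world may differ in a moment; try again
    return "replan"      # the world told us something; give it to the planner
```

Under this routing the 409 from the calendar write goes to the planner on the first response, with the conflict as new information, instead of after three backoffs.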

Sampling Parameter Inheritance: When Temperature 0.7 Leaks From the Planner Into the Verifier

· 10 min read
Tian Pan
Software Engineer

A verifier that flips its own answer eight percent of the time is not a flaky model. It is a sampling configuration bug that reached production because the framework defaulted to inheritance. The planner needed temperature=0.7 to brainstorm subtask decompositions. The verifier — the role whose entire job is to give a low-variance yes-or-no on whether the answer satisfies the rubric — was instantiated through the same harness call, and silently picked up the same temperature. Nobody set it that way on purpose. Nobody set it at all.

This is the most expensive parameter in your stack that nobody owns. It compounds across the call tree: the summarizer above the verifier, the structured-output extractor below it, and the retry loop wrapping the whole thing all consume the planner's "be creative" knob as if it were a global. The bill arrives in three places at once — eval flakiness, token spend, and the half-day a senior engineer spends bisecting a regression that turns out to be no regression at all.
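One way to take the knob out of inheritance is to make sampling configuration an explicit per-role lookup that fails loudly when a role has no entry. A minimal sketch: the planner's 0.7 comes from the example above, while the verifier's 0.0 and the names `Sampling`, `ROLE_SAMPLING`, and `sampling_for` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sampling:
    temperature: float


# Every role states its own sampling; nothing is inherited from a caller.
ROLE_SAMPLING = {
    "planner": Sampling(temperature=0.7),   # from the example above
    "verifier": Sampling(temperature=0.0),  # assumed low-variance setting
}


def sampling_for(role: str) -> Sampling:
    """Look up a role's sampling config; an unconfigured role is an error, not a default."""
    try:
        return ROLE_SAMPLING[role]
    except KeyError:
        raise LookupError(f"no sampling config declared for role {role!r}") from None
```

The design choice is the missing fallback: a new role such as a summarizer cannot silently pick up the planner's temperature, because the first call raises until someone decides what it should be.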

The AI Feature Metric Trap: Why DAU and Retention Lie About Stochastic Surfaces

· 11 min read
Tian Pan
Software Engineer

A PM walks into the AI feature review with a slide that reads "+12% engagement, +8% session length, retention up 3 points." The room nods. Two desks over, the support lead is staring at a different chart: tickets touching the AI surface are up 22%, and the most common resolution code is "user gave up, agent helped manually." Both numbers are real. Both come from the same product. The PM's dashboard is built on the assumption that the AI feature emits the same shape of event as the button it replaced. It doesn't. And the gap between what the dashboard counts and what the user experienced is where AI features quietly fail in plain sight.

The deterministic-feature playbook treats interaction as a click stream: user fires an event, the system reacts, the user moves on. AI features have a different event shape — a task arc with phases, retries, side trips to a human, and an offline judgment the telemetry never sees. Importing the deterministic dashboard onto that arc is the analytics equivalent of running 2018's interview loop against 2026's job. The numbers go up. The thing the numbers were supposed to predict goes down.

Your stop_reason Is Lying: Building the Real Stop Taxonomy Production Triage Needs

· 12 min read
Tian Pan
Software Engineer

The on-call engineer pulls up a trace. The model returned, the span closed clean, the API call shows stop_reason: end_turn. By every signal the platform offers, this was a successful generation. Three minutes later a customer reports that the agent confidently wrote half a config file, declared the operation complete, and moved on. The trace had no warning sign because the warning sign isn't in the API contract — the provider's stop reason has four to seven buckets, and the question your incident demands an answer to lives in the gap between them.

Stop reasons are the field engineers reach for first during triage and the field that lies most cleanly when it does. The values are designed for a runtime that needs to decide what to do next: was this turn complete, did a tool get requested, did a budget get exceeded, did safety intervene. They are not designed for a human reconstructing why an answer went wrong, and the difference between those two purposes is where production teams burn entire afternoons.

Streaming JSON Parsers: The Gap Between Tokens and Typed Objects

· 12 min read
Tian Pan
Software Engineer

The model is emitting JSON token by token. Your UI wants to render fields the moment they materialize — a confidence score before the long answer body, the arguments of a tool call as the model fills them in. Then someone wires up JSON.parse on every chunk and the whole thing falls over, because JSON.parse is all-or-nothing. It needs a balanced document to return anything. Until the model emits the closing brace, you have nothing to show.

This is not a parser problem you can fix with a try/catch. The standard JSON parser was designed against a content-length-known HTTP response. Partial input is not a state it models — it is "input error." When you treat a token stream as if it were an HTTP body, you inherit thirty years of "the document is either complete or invalid," and your UI pays the bill.
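To make the gap concrete, here is a best-effort sketch in Python, whose `json.loads` is all-or-nothing in the same way as `JSON.parse`. It tracks open strings, objects, and arrays, appends the closers the stream still owes, and returns a snapshot, or `None` when the prefix cannot be completed yet (for example, a key with no value). The `parse_partial` helper is hypothetical and deliberately naive: it rescans the whole buffer on every call, which a real streaming parser would avoid.

```python
import json
from typing import Any, Optional


def parse_partial(buf: str) -> Optional[Any]:
    """Best-effort snapshot of an incomplete JSON document, or None if not yet completable."""
    closers = []       # closing brackets we still owe, innermost last
    in_string = False
    escaped = False
    for ch in buf:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()

    candidate = buf
    if in_string:
        if escaped:
            candidate = candidate[:-1]  # drop a dangling backslash
        candidate += '"'                # close the unfinished string
    candidate = candidate.rstrip()
    if candidate.endswith(","):
        candidate = candidate[:-1]      # a trailing comma would be invalid JSON
    candidate += "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None                     # caller keeps the previous snapshot
```

With this, a UI can render the confidence score as soon as it arrives and keep the last good snapshot whenever a chunk ends mid-key or mid-number.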