720 posts tagged with "llm"

The Citation URL That Resolved But No Longer Said What the Model Quoted

· 10 min read
Tian Pan
Software Engineer

A RAG agent answers a customer's regulatory question with a tidy paragraph and a citation. The verification layer fetches the URL, sees a 200 OK, ticks the box, and ships. Six months later a compliance audit pulls the transcript, clicks the same link, and finds a page that now says the opposite of what the agent quoted. The URL is fine. The quote is fine in the transcript. The two no longer match. The customer's compliance officer asks whether the agent fabricated the quote, and the team cannot prove it didn't, because the only surviving evidence of what the URL used to say is the agent's own assertion of what it said.

This is not a hallucination in the usual sense. The model retrieved real content, faithfully extracted a real sentence, and emitted a real URL that still resolves. Every link-checker on earth would call this citation valid. The audit fails anyway, because the verification layer was measuring the wrong property. Reachability is not fidelity. A URL is a pointer to a mutable document under someone else's editorial control, and the moment the document changes, every transcript that quoted it becomes a hallucination report waiting to happen.
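One way to measure fidelity rather than reachability is to record evidence at answer time. The following is a minimal sketch, not the author's method: `CitationRecord`, `record_citation`, and `check_fidelity` are hypothetical names, and the sketch assumes the retrieved page text is available as a plain string.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CitationRecord:
    url: str
    quote: str
    content_sha256: str  # hash of the normalized page text as retrieved
    retrieved_at: str    # ISO 8601 UTC timestamp


def _normalize(text: str) -> str:
    # Collapse whitespace so reflowed markup does not count as a change.
    return " ".join(text.split())


def _digest(text: str) -> str:
    return hashlib.sha256(_normalize(text).encode("utf-8")).hexdigest()


def record_citation(url: str, page_text: str, quote: str) -> CitationRecord:
    """Capture evidence at answer time; reject quotes absent from the page."""
    if _normalize(quote) not in _normalize(page_text):
        raise ValueError("quote does not appear in the retrieved page text")
    now = datetime.now(timezone.utc).isoformat()
    return CitationRecord(url, quote, _digest(page_text), now)


def check_fidelity(record: CitationRecord, current_page_text: str) -> str:
    """Classify a stored citation against the page as it reads today."""
    if _digest(current_page_text) == record.content_sha256:
        return "unchanged"
    if _normalize(record.quote) in _normalize(current_page_text):
        return "page_changed_quote_present"
    return "page_changed_quote_missing"
```

A hash only shows that the page changed. Proving what the page used to say would also require archiving the retrieved text itself, keyed by the same hash.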

The Few-Shot Example Your Model Treated as Binding Precedent

· 10 min read
Tian Pan
Software Engineer

A user submits a question. Your model produces an answer that is confidently wrong in a very specific way: the format is perfect, the reasoning is well-structured, and a particular qualifier — one that does not apply to this question at all — appears in exactly the place a similar qualifier appeared in example three of your system prompt. Not a hallucination. Not a prompt injection. The model did precisely what the examples taught it to do, on a question those examples were never meant to cover.

This is the failure mode that few-shot prompting actively encourages and that most eval suites are structurally blind to. Your examples are not neutral demonstrations of "what good looks like." They are case law. The model selects the closest match by surface tokens and applies the precedent — including its constraints — to whatever case is in front of it.

The JSON Schema Your Output Passed and Your Downstream Consumer Rejected for Semantic Drift

· 10 min read
Tian Pan
Software Engineer

A JSON schema validates the shape of your output. It does not validate the meaning of the values inside that shape. For nine months, every output your AI pipeline produces passes validation cleanly, your monitoring shows schema validity at 100%, and your team treats a schema-valid response as a contractually correct one. Then a model upgrade ships, every output continues to validate, and your Slack alerting channel goes from 50 messages a day to 800 overnight.

The schema did not break. The distribution of values inside it did. That is the gap most AI teams discover in production: the JSON contract is a type system, not a behavior system, and a downstream consumer was depending on a value distribution the contract was never asked to enforce.
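A check on the distribution of values, as opposed to their shape, might look like the sketch below. It compares how often each value of one categorical field appears in a baseline window and a current window, using total variation distance. The field name `severity` and the sample counts in the usage note are illustrative assumptions, not details from the incident.

```python
from collections import Counter
from typing import Dict, Iterable


def value_shares(values: Iterable[str]) -> Dict[str, float]:
    """Return the share of observations taken by each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute shares of an empty sample")
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(baseline: Iterable[str], current: Iterable[str]) -> float:
    """0.0 means identical distributions; 1.0 means no overlap at all."""
    p = value_shares(baseline)
    q = value_shares(current)
    keys = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

For a hypothetical `severity` field that was 90% `low` before an upgrade and 20% `low` after it, the distance is about 0.7, even though every individual output still validates against the schema.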

The Logprobs Field Your Provider Removed That Broke Your Confidence Router Silently

· 12 min read
Tian Pan
Software Engineer

The most expensive line in the postmortem was the one nobody wrote: a 200 OK with a missing field. The router that was supposed to escalate hard questions to the stronger model had been escalating zero percent of traffic for six weeks. The cost dashboard was celebrating. The quality dashboard was sliding, but only on the hard-question slice the standing eval set underweighted. Everything looked like a win until a customer complained about a specific kind of question the system used to handle correctly.

The cause was a response shape change one tier up the contract stack. The provider's mid-tier plan had dropped per-token logprobs as part of what the release notes called a "tier-specific feature parity adjustment." The client still received valid JSON. The HTTP status was still 200. The model identifier in the response matched the model identifier in the request. The only thing that changed was that the field the router consumed to make its escalation decision was no longer there, and the defensive default added during an incident a year earlier had quietly become the production default for every request.
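A router can treat a missing signal as its own outcome instead of folding it into a default. The sketch below is an assumption about how such a router might be shaped: the function name `route`, the threshold value, and the choice to escalate when the signal is absent are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def route(token_logprobs: Optional[Sequence[float]],
          threshold: float = -0.5) -> Tuple[str, str]:
    """Return (action, reason) so a missing signal stays visible downstream.

    threshold is an illustrative mean-logprob cutoff, not a recommended value.
    """
    if not token_logprobs:
        # Field absent or empty: escalate, and tag the reason so a dashboard
        # can alert when this branch starts serving most of the traffic.
        return ("escalate", "signal_missing")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    if mean_logprob < threshold:
        return ("escalate", "low_confidence")
    return ("keep", "confident")
```

Counting the `signal_missing` reason separately is the point of the design: a jump from near zero to every request is a signal that the response shape changed, even when every status code is 200.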

The max_tokens Default Your Provider Raised That Doubled Your Tail Response Length

· 12 min read
Tian Pan
Software Engineer

Your incident timeline shows no deploys. Your code did not change. Your traffic mix did not change. Your prompts did not change. And yet your p99 output length doubled inside a week, your downstream rendering layer started clipping responses, and your output-token bill rose 38% on traffic that wasn't asking for longer answers. The change was real, the regression was measurable, and nothing in your version control system records it — because the value that moved was one your code never sent.

The provider raised an implicit default. The release notes filed it under "improved long-form behavior." The parameter in question was max_tokens, which your application has been omitting since day one because the documented default was generous and your outputs rarely came close. The default moved from 4096 to 8192 to accommodate longer reasoning in the provider's newer models. Your application got the new default whether you wanted it or not, because the absence of a parameter is itself a configuration choice — and the provider owns the right to change the value behind it.

This is the failure mode where a "no-op" release on the provider's side propagates through your system as a behavior change, a cost change, and a UX change all at once, and your team's only diagnostic signal is the bill arriving at the end of the month.
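One hedge against a moving default is to make the parameter impossible to omit in your own code. This is a minimal sketch with a hypothetical `build_request` helper and a generic payload shape; it does not reproduce any specific provider's API.

```python
from typing import Any, Dict, List


def build_request(model: str,
                  messages: List[Dict[str, str]],
                  *,
                  max_tokens: int) -> Dict[str, Any]:
    """Build a request payload; max_tokens is keyword-only with no default,
    so every call site has to state the limit it wants."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    return {"model": model, "messages": messages, "max_tokens": max_tokens}
```

Because the limit now lives in your code, a change to it shows up in version control instead of in the bill.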

The Model Card Benchmark Whose Methodology Shifted While Your Contract Cited the Number

· 11 min read
Tian Pan
Software Engineer

Your procurement team renewed the inference contract last quarter and noted, with quiet satisfaction, that the quality clause referencing "HumanEval pass@1 of 84%" had been comfortably exceeded by the provider's latest model card, which now reports 87%. Three points to the good. The clause is satisfied. The relationship is healthy. Meanwhile, your inference team's own regression suite — the one that actually exercises the tasks your product depends on — shows a 2% decline on held-out evaluation cases since the model update shipped. Both numbers are real. Only one of them is in the contract.

This is what it looks like when a marketing artifact is load-bearing in a legal document. The benchmark number on the model card is the headline of a measurement; the methodology that produced it is a footnote in an appendix nobody on the contract review chain reads. When the provider changes the methodology — switches from greedy decode to best-of-three sampling, adds a structured-output system message, swaps the prompt template to match the model's new chat tuning — the number moves in a way that has nothing to do with your traffic and everything to do with how the number is computed. Your contract clause cites the number. The counterparty controls the protocol that produces it. You've signed a clause whose meaning the other side can revise without violating it.

The Model Identifier Your Provider Re-Pointed to a Finetune for One Tenant and to Base for Everyone Else

· 11 min read
Tian Pan
Software Engineer

A customer support team escalates: "Your assistant used to handle refund-eligibility questions correctly. Last week it started getting them wrong." The on-call engineer pulls a transcript, replays the exact prompt against the same model identifier in a dev account, gets the correct answer, and closes the ticket as "cannot reproduce." Two weeks later the same complaint shows up from a different customer. The engineer replays again, in the same dev account, and gets the correct answer again. The team starts blaming a prompt change nobody made.

The model identifier in the request never changed. The string in the response field matched the string in the request field. The eval suite stayed green for six weeks. For the entire life of the account, the model serving production traffic had been a different set of weights from the model serving the eval suite. Six weeks ago the two became the same set of weights, and the team noticed only because a customer noticed first.

The OpenTelemetry Tail Sampler That Dropped Exactly the LLM Spans Your Post-Mortem Needed

· 11 min read
Tian Pan
Software Engineer

A user pings support: "the assistant told me to cancel my service to update my address, that's insane." Your team opens the incident, asks for the conversation ID, drops it into the tracing UI, and gets a polite "no spans found for this trace." The 24-hour retention window closed an hour ago. The tail sampler decided this conversation was a routine success because the response was a syntactically valid JSON object, returned with a 200, in 1.4 seconds. By every signal your collector understood, nothing happened.

The model returned a sentence that destroyed a customer relationship, and your observability pipeline classified it as uneventful. This is not a bug in the sampler. The sampler did exactly what you configured it to do. The problem is that the policy you wrote was designed for a request-response world where "success" and "worth keeping" were close enough to be the same thing, and you ported it unmodified into a system where they are not.
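A keep decision that separates "success" from "worth keeping" could be sketched as follows. This is plain Python rather than collector configuration, and the function name, its parameters, and the 10% ratio are assumptions for illustration only.

```python
import hashlib


def keep_trace(trace_id: str,
               *,
               is_error: bool,
               has_llm_span: bool,
               llm_keep_ratio: float = 0.1) -> bool:
    """Decide whether to retain a trace.

    Errors are always kept. Traces containing an LLM span are kept at a fixed
    ratio even when they look successful, because a 200 with valid JSON says
    nothing about whether the generated text was acceptable.
    """
    if is_error:
        return True
    if has_llm_span:
        # Map the trace ID to a stable bucket in [0, 1) so the decision is
        # deterministic for a given trace.
        digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000
        return bucket < llm_keep_ratio
    return False
```

A fixed ratio does not guarantee that the one conversation support asks about was retained; it only ensures that successful-looking LLM traces are not dropped as a class.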

The Persona Your System Prompt Offered That the Model Picked the Same Way Every Time

· 10 min read
Tian Pan
Software Engineer

A product team I talked to recently ran a three-arm A/B test on response personas — concise, thorough, conversational — for three weeks across every cohort. The system prompt described all three and asked the model to pick the one that best matched the user. When they opened the dataset to write the readout, one number stopped them cold: the "thorough" arm had 91% of the traffic. The other two were rounding error.

Their experiment platform had not flagged anything. No alert fired. The pipeline did exactly what they had told it to do. Three weeks of supposed multi-persona testing had produced a dataset that could only tell them about thorough. The other two arms were too thin to power any inference at all.

The instinct in the room was that the prompt needed work — better instructions, sharper distinctions between personas, a more deliberate example for the conversational case. That diagnosis would have been right ten years ago in a rules-driven router. It is wrong for a model. The prompt was not the variable. The router was.
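If the model acting as the router is the variable, one alternative is to assign the arm outside the model and pass only the chosen persona into the prompt. The sketch below assumes a stable `user_id` string and a hypothetical experiment name; it is an illustration, not the team's actual fix.

```python
import hashlib

PERSONAS = ("concise", "thorough", "conversational")


def assign_persona(user_id: str, experiment: str = "persona-v1") -> str:
    """Deterministically assign a user to one arm with roughly equal odds.

    Hashing the experiment name together with the user ID keeps a user's
    assignment stable within one experiment and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(PERSONAS)
    return PERSONAS[index]
```

With assignment done by hash, traffic per arm is set by the experiment design, and the model's own preference for one persona can no longer starve the other two.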

The PII Redactor That Protected Your Logs and Let the Model Leak the Outputs

· 12 min read
Tian Pan
Software Engineer

A PII redactor that runs only on inbound traffic is a one-way valve installed at the wrong end of the pipeline. It catches user-submitted names, emails, and account numbers before they reach your logs. It does nothing about the model's own outputs, where the model is actively assembling text that may contain those same identifiers, drawn from RAG retrievals, tool returns, conversation history, or content the user pasted from another tenant's data. Every team I've watched ship an input-side redactor has a follow-up ticket in the backlog labeled "output-side parity." Most of those tickets never close, because no incident surfaces the gap for six months, and after six months the ticket has accumulated enough re-prioritization to look like a feature request rather than a missing half of a security control.

The failure mode is invariant: input redaction is treated as the canonical control because it is the easier engineering problem and the easier audit story. You wrote a regex set, you ran a labeled benchmark, you proved precision and recall on a fixed corpus, you shipped it behind a feature flag, and the security review accepted it as the PII boundary. The output side has none of that benefit. The model's response is generative, the surface area is unbounded, and the test methodology — "what should it not say in any of infinitely many contexts" — is structurally harder than "what should we strip from a known input." So the team that ships the inlet treats the outlet as future work and the future never arrives until a customer reports another customer's email landing in their transcript.
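The structural half of output-side parity is small: run the same redactor on the way out as on the way in. The sketch below covers only email addresses with a deliberately simple pattern, and `guarded_reply` and the `generate` callable are hypothetical stand-ins for the model call.

```python
import re
from typing import Callable

# Simplified email pattern for illustration; a real redactor needs more types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace every email address with a fixed placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def guarded_reply(user_input: str, generate: Callable[[str], str]) -> str:
    """Apply the same redactor to the inlet and the outlet of the model call."""
    safe_input = redact(user_input)
    raw_output = generate(safe_input)
    # The output may carry identifiers from retrievals or tool returns that
    # never passed through the input redactor, so redact again.
    return redact(raw_output)
```

This does not solve the harder testing problem the output side poses; it only ensures the control exists at both ends.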

The Prompt Engineer Who Quietly Became Your Only Eval Set Reader

· 8 min read
Tian Pan
Software Engineer

The eval set is a file. It is also, secretly, a theory of what the AI feature is for. The two are not the same thing, and the team that confuses them has built a quality gate whose calibration depends on a single human's working memory. When that human leaves, the file stays and the theory walks out the door.

This is the failure mode you don't see in the org chart. You scoped a prompt engineering role. You hired someone good. They shipped the v1 prompts, looked at the thin benchmark, and rewrote it into something rich — a taxonomy of failure modes, weights per category, a labeling rubric that disambiguates edge cases. The eval set became the contract for "is this model good enough to ship." Six quarters later you discover that the contract is unreadable by anyone except the person who wrote it.

The Provider Auto-Router That Quietly Routed Your Premium Traffic To Haiku

· 10 min read
Tian Pan
Software Engineer

Your platform team adopted the provider's "auto" model identifier for cost reasons. The first dashboard after rollout was hard to argue with: a 34% spend reduction with no measurable quality drop on the weekly eval. Three months later, customer satisfaction on your shortest, highest-volume surface had been sliding since the rollout, and a product-led investigation eventually traced the regression to a model identifier nobody on the engineering team had touched. The code said "auto." The provider had been redefining what "auto" meant the whole time.

The lesson is not that auto-routing is bad. The lesson is that "auto" is a moving target whose distribution drifts with provider economics, and your eval's representativeness is the only check standing between vendor optimization and your product quality. If the eval does not match the traffic, the discount you celebrated is being paid out of a quality slope nobody is reviewing.