Skip to main content

678 posts tagged with "ai-engineering"

View all tags

Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It

· 11 min read
Tian Pan
Software Engineer

A product manager writes a paragraph: "The assistant should be helpful, accurate, and concise, and should never make the customer feel rushed." An engineer reads that paragraph, opens a YAML file, and writes 47 weighted criteria so the LLM-as-judge can produce a number on every trace. Six months later, that YAML file is the actual specification of the product. Every release is gated on it. Every regression alert fires on it. Every "this is shipping quality" decision routes through it. The PM has never read it.

This is the most common form of unintentional product ownership transfer in AI engineering today. The rubric is not a measurement of the spec — it is the spec, in the same way that a compiler is not a description of your language but the operational truth of it. And like compilers, rubrics have implementation details that silently determine semantics. Which failure mode gets a 0 versus a 0.5? Which criteria is weighted 0.3 versus 0.05? Which behavior is absent from the rubric and therefore goes uncounted entirely? Each of these is a product decision. None of them lived in the original brief.

The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse

· 11 min read
Tian Pan
Software Engineer

The most expensive failure mode in an LLM product is not a bad release. It is six consecutive good releases — by every internal scoreboard — while user trust quietly bleeds out. The offline eval score climbs every Friday demo. The CSAT line in the weekly business review goes flat, then dips, then nobody knows when it started dipping because nobody was triangulating the two charts. By the time a postmortem names it, the team has spent two quarters tuning a prompt against a dataset that stopped resembling reality somewhere around month three.

This is the eval-set-as-simulator drift, and it is the cleanest example I know of an old machine-learning lesson being rediscovered at full retail price by a generation of LLM teams who skipped the reading list. An eval suite is not a fixture. It is a simulator, and a simulator that is never re-calibrated against the system it claims to predict will eventually predict a different system.

Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.

That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.

Generative UI as a Production Discipline: When the Model Renders the Screen

· 12 min read
Tian Pan
Software Engineer

The button label that shipped to your users last Tuesday was never seen by a copywriter, never reviewed in Figma, never QA'd, and didn't exist until inference time. It was generated by a model that decided, mid-conversation, that the right way to collect a shipping address was a six-field form rendered inline rather than three more turns of prose. The form worked. The label was fine. Nobody on the team can tell you which model run produced it, because the trace was rotated out of hot storage and the eval suite tests text outputs, not component graphs.

This is generative UI in production: the model is no longer just a text generator that occasionally invokes a tool. It is a UI compiler whose output is a component tree, and the design system is now a contract the model is constrained to rather than a guideline a human loosely follows. The shift breaks an entire stack of assumptions — QA against static specs, accessibility audits of fixed layouts, copy review of finalized strings, design-system adherence checks at build time — and most teams ship the feature before they have replaced any of them.

The Hollow Explanation Problem: When Your Model's Reasoning Is Decoration, Not Evidence

· 11 min read
Tian Pan
Software Engineer

A loan-review tool flags an application. The reviewer clicks "explain" and gets four neat bullet points: income volatility over the last six months, credit utilization above 70%, a recent address change, two thin-file dependents. The rationale reads like something a careful underwriter would write. The reviewer approves the override and moves on.

The uncomfortable part: the model never used those signals to make the decision. They appeared in the explanation because they were the kind of factors that would justify a flag — not because the flag came from them. The actual computation was a narrow latent-feature pattern that the model can't articulate, plus a few correlations the explanation never mentions. The bullets are post-hoc rationalization, written to be credible rather than to be true.

This is the hollow explanation problem, and it is not the same as hallucination. Every individual claim in that explanation may be factually correct. The user's question — why did you decide that? — is the one being answered falsely.

Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

· 11 min read
Tian Pan
Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.

The Inverted Agent: When the User Is the Planner and the Model Is the Step-Executor

· 12 min read
Tian Pan
Software Engineer

Most agent products today implement a simple bargain: the model decides what to do, the user clicks "approve." This is the right shape for low-stakes consumer chat — booking a restaurant, summarizing an inbox, drafting a casual reply. It is catastrophically wrong for legal drafting, financial advisory, medical triage, and incident response, where the user holds the accountability the model never can, and where the cost of the wrong plan dwarfs the cost of any individual step.

The inverted agent flips the polarity. The user composes the plan as a sequence of named, reorderable steps. The model executes each step on demand — with full context, with tool access, with reasoning — but never decides what step comes next. The model can suggest, but suggestions are advisory, not autonomous. This is not a worse autonomous agent; it is a different product, with a strictly worse cost-and-latency profile and a strictly better trust profile, aimed at users who would otherwise decline to adopt the autonomous version at all.

The mistake teams keep making is treating "autonomy" as a default to push toward. It is a UX axis you choose per-surface. Get the polarity wrong and you ship a feature your highest-stakes users will quietly refuse to touch.

The Eval Pickle: When Your LLM Judge Gets Smarter Than the Model It Grades

· 9 min read
Tian Pan
Software Engineer

A regression alert fires on Monday morning. Faithfulness on your held-out eval set dropped from 0.86 to 0.78 over the weekend. Nobody shipped a new model. Nobody touched the prompt. Nobody changed the retrieval index. The on-call engineer spends three hours digging before noticing the only thing that changed was the judge model — the auto-evaluator quietly rolled forward to a newer snapshot that catches subtle hedging the old one waved through. Same answers. Same model. Worse score. Real number, fake regression.

This is the eval pickle: as your LLM-as-judge gets sharper, your scores on a frozen system slide down, and the dashboard that's supposed to detect regressions starts manufacturing them. The team that doesn't notice spends quarters chasing "quality drift" that lives entirely in the ruler.

Knowledge Graph Staleness Has a Different SLA Than Vector Staleness

· 10 min read
Tian Pan
Software Engineer

The vector index is wrong by approximately ten percent and nobody panics. The knowledge graph is wrong by one missing edge and somebody ships a wrong answer to a regulator. The two failure modes look identical from the data engineering org chart — both are "the index is stale" — and they sit behind the same change-data-capture pipeline with the same lag tolerance. The pipeline was sized for the vector workload because that was the louder consumer. The graph silently inherited those defaults, and the silence is the bug.

Vector retrieval and graph retrieval fail differently under staleness, and treating them as the same kind of lag problem is how you end up with a system that scores well on RAG benchmarks and is silently wrong on multi-hop queries — the silently-wrong case being, of course, the one users notice last. The fix is not faster pipelines. The fix is recognizing that "stale" means two different things, designing freshness tiers per edge class, and building the eval that catches the difference before a regulator does.

Your LLM Judge Has a Length Bias, a Position Bias, and a Format Bias — and Nobody Is Auditing Yours

· 11 min read
Tian Pan
Software Engineer

A team I worked with last quarter watched their LLM-as-judge score climb from 78% to 91% over six weeks of prompt iteration. They shipped. Users hated it. The new prompt produced longer, more formatted, more confident-sounding answers — and the judge loved every one of them. The team had not built a smarter prompt. They had reverse-engineered their judge's biases.

This is the failure mode nobody on the team is auditing. LLM-as-judge has well-documented systematic biases: longer answers score higher regardless of quality, the first option in pairwise comparisons wins more often than chance, and outputs that look like the judge's own training distribution outscore outputs that do not. If you wired up an LLM judge twelve months ago and have never re-validated it against humans, your scores are not a quality signal — they are a measurement of how well your prompt has learned to game its own evaluator.

The depressing part is that the audit methodology to catch this is straightforward, the calibration discipline that prevents it is cheap, and almost no team runs either.

Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

· 11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.

Long-Context vs RAG in 2026: Why It Is a Per-Feature Decision, Not an Architecture Religion

· 13 min read
Tian Pan
Software Engineer

The economics of long-context vs RAG have flipped twice in two years, and the team that picked an architecture in either of those windows is now paying the wrong tax everywhere. In 2024 the trend line said stuff everything in the context window because the windows kept growing and the per-token price kept falling, so retrieval pipelines were dismissed as legacy plumbing. In 2025 the consensus reversed: context rot research showed that the effective recall on million-token prompts collapsed in the middle of the window, latency on full-window calls turned into a UX problem, and the bills came back loud, so retrieval was rehabilitated. By 2026 the right answer is neither slogan. It is a per-feature decision, made at design time with a four-axis trade-off written down, because picking one architecture for the whole product is the cheap way to be wrong on every feature at once.

The mental model that keeps biting teams is treating long-context vs RAG as a roadmap commitment instead of a per-surface choice. You read one influential blog, you pick a side, you hire engineers who specialize in that side, you write a platform doc that codifies it, and now every new feature gets the same architecture regardless of whether it fits. The features that need fresh data live with stale context. The features that need scalable corpora pay for retrieval infrastructure they will never use. The features that need citation provenance ship without it. None of these are bugs. They are the predictable cost of treating a feature-level decision as a product-level one.