
458 posts tagged with "ai-engineering"


The Hollow Explanation Problem: When Your Model's Reasoning Is Decoration, Not Evidence

· 11 min read
Tian Pan
Software Engineer

A loan-review tool flags an application. The reviewer clicks "explain" and gets four neat bullet points: income volatility over the last six months, credit utilization above 70%, a recent address change, two thin-file dependents. The rationale reads like something a careful underwriter would write. The reviewer approves the override and moves on.

The uncomfortable part: the model never used those signals to make the decision. They appeared in the explanation because they were the kind of factors that would justify a flag — not because the flag came from them. The actual computation was a narrow latent-feature pattern that the model can't articulate, plus a few correlations the explanation never mentions. The bullets are post-hoc rationalization, written to be credible rather than to be true.

This is the hollow explanation problem, and it is not the same as hallucination. Every individual claim in that explanation may be factually correct. The user's question — why did you decide that? — is the one being answered falsely.

Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

· 11 min read
Tian Pan
Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.
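A minimal sketch of the instrumentation the excerpt describes, assuming you can wrap whatever iterator your streaming client returns. The function names, the percentile helper, and the 500 ms stall threshold are illustrative, not from the post.

```python
import time
from typing import Iterable, Iterator

def instrument_gaps(token_stream: Iterable[str], gaps_ms: list[float]) -> Iterator[str]:
    """Yield tokens unchanged while recording the gap before each one in milliseconds."""
    last = time.monotonic()
    for token in token_stream:
        now = time.monotonic()
        gaps_ms.append((now - last) * 1000.0)
        last = now
        yield token

def gap_percentile(gaps_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile over inter-token gaps."""
    ordered = sorted(gaps_ms)
    idx = max(0, min(len(ordered) - 1, int(round(pct / 100.0 * len(ordered))) - 1))
    return ordered[idx]

# Per-response metrics worth logging next to TTFT and total time:
# p95 inter-token gap, max gap, and a stall counter for gaps over ~500 ms.
```

Logged per response rather than per request, these are the numbers that surface the 2.5-second mid-stream pause a green p95 total-time chart hides.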

The Inverted Agent: When the User Is the Planner and the Model Is the Step-Executor

· 12 min read
Tian Pan
Software Engineer

Most agent products today implement a simple bargain: the model decides what to do, the user clicks "approve." This is the right shape for low-stakes consumer chat — booking a restaurant, summarizing an inbox, drafting a casual reply. It is catastrophically wrong for legal drafting, financial advisory, medical triage, and incident response, where the user holds the accountability the model never can, and where the cost of the wrong plan dwarfs the cost of any individual step.

The inverted agent flips the polarity. The user composes the plan as a sequence of named, reorderable steps. The model executes each step on demand — with full context, with tool access, with reasoning — but never decides what step comes next. The model can suggest, but suggestions are advisory, not autonomous. This is not a worse autonomous agent; it is a different product, with a strictly worse cost-and-latency profile and a strictly better trust profile, aimed at users who would otherwise decline to adopt the autonomous version at all.
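A sketch of the shape that paragraph describes: a user-owned plan of named, reorderable steps and a model that executes exactly one step when asked. The dataclasses and the `call_model` hook are illustrative; the post does not prescribe an API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str              # user-visible label, e.g. "summarize deposition"
    instruction: str       # what the model should do for this step
    result: str | None = None

@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)

    def reorder(self, order: list[int]) -> None:
        # The user, not the model, decides sequencing.
        self.steps = [self.steps[i] for i in order]

def execute_step(plan: Plan, index: int, call_model) -> str:
    """Run one step on demand; the model never chooses the next step."""
    step = plan.steps[index]
    context = [s.result for s in plan.steps[:index] if s.result]
    step.result = call_model(step.instruction, context)  # call_model is whatever LLM client you use
    return step.result
```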

The mistake teams keep making is treating "autonomy" as a default to push toward. It is a UX axis you choose per-surface. Get the polarity wrong and you ship a feature your highest-stakes users will quietly refuse to touch.

The Eval Pickle: When Your LLM Judge Gets Smarter Than the Model It Grades

· 9 min read
Tian Pan
Software Engineer

A regression alert fires on Monday morning. Faithfulness on your held-out eval set dropped from 0.86 to 0.78 over the weekend. Nobody shipped a new model. Nobody touched the prompt. Nobody changed the retrieval index. The on-call engineer spends three hours digging before noticing the only thing that changed was the judge model — the auto-evaluator quietly rolled forward to a newer snapshot that catches subtle hedging the old one waved through. Same answers. Same model. Worse score. Real number, fake regression.

This is the eval pickle: as your LLM-as-judge gets sharper, your scores on a frozen system slide down, and the dashboard that's supposed to detect regressions starts manufacturing them. The team that doesn't notice spends quarters chasing "quality drift" that lives entirely in the ruler.
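One way to make a judge roll-forward show up as a config diff instead of a mystery regression is to pin the judge and persist it with every score. A minimal sketch, with hypothetical model and prompt-version names.

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass(frozen=True)
class JudgeConfig:
    model: str = "judge-model-2025-06-01"    # pinned snapshot, never "latest"
    prompt_version: str = "faithfulness-v3"
    temperature: float = 0.0

def record_eval_run(scores: dict[str, float], judge: JudgeConfig, path: str) -> None:
    """Persist scores together with the exact ruler that produced them."""
    run = {"ts": time.time(), "judge": asdict(judge), "scores": scores}
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")
```

With the judge recorded alongside the number, comparing Monday's 0.78 against Friday's 0.86 starts with a judge diff rather than a three-hour hunt through deploys that never happened.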

Knowledge Graph Staleness Has a Different SLA Than Vector Staleness

· 10 min read
Tian Pan
Software Engineer

The vector index is wrong by approximately ten percent and nobody panics. The knowledge graph is wrong by one missing edge and somebody ships a wrong answer to a regulator. The two failure modes look identical from the data engineering org chart — both are "the index is stale" — and they sit behind the same change-data-capture pipeline with the same lag tolerance. The pipeline was sized for the vector workload because that was the louder consumer. The graph silently inherited those defaults, and the silence is the bug.

Vector retrieval and graph retrieval fail differently under staleness, and treating them as the same kind of lag problem is how you end up with a system that scores well on RAG benchmarks and is silently wrong on multi-hop queries — the silently-wrong case being, of course, the one users notice last. The fix is not faster pipelines. The fix is recognizing that "stale" means two different things, designing freshness tiers per edge class, and building the eval that catches the difference before a regulator does.
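A sketch of what "freshness tiers per edge class" could look like in practice; the edge classes and SLA values here are invented for illustration, not taken from the post.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tiers: how stale each edge class may be before answers are suspect.
FRESHNESS_SLA: dict[str, timedelta] = {
    "ownership":  timedelta(minutes=15),  # one wrong edge is a wrong answer to a regulator
    "reports_to": timedelta(hours=1),
    "mentions":   timedelta(hours=24),    # tolerates lag the way the vector index does
}

def stale_edge_classes(last_synced: dict[str, datetime]) -> list[str]:
    """Return edge classes whose CDC lag has blown through their tier."""
    now = datetime.now(timezone.utc)
    floor = datetime.min.replace(tzinfo=timezone.utc)
    return [cls for cls, sla in FRESHNESS_SLA.items()
            if now - last_synced.get(cls, floor) > sla]
```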

Your LLM Judge Has a Length Bias, a Position Bias, and a Format Bias — and Nobody Is Auditing Yours

· 11 min read
Tian Pan
Software Engineer

A team I worked with last quarter watched their LLM-as-judge score climb from 78% to 91% over six weeks of prompt iteration. They shipped. Users hated it. The new prompt produced longer, more formatted, more confident-sounding answers — and the judge loved every one of them. The team had not built a smarter prompt. They had reverse-engineered their judge's biases.

This is the failure mode nobody on the team is auditing. LLM-as-judge has well-documented systematic biases: longer answers score higher regardless of quality, the first option in pairwise comparisons wins more often than chance, and outputs that look like the judge's own training distribution outscore outputs that do not. If you wired up an LLM judge twelve months ago and have never re-validated it against humans, your scores are not a quality signal — they are a measurement of how well your prompt has learned to game its own evaluator.

The depressing part is that the audit methodology to catch this is straightforward, the calibration discipline that prevents it is cheap, and almost no team runs either.
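The position-bias half of that audit really is a few dozen lines: run every pairwise comparison twice with the order swapped and count how often the verdict flips. A sketch with the judge call left abstract; the length check is a deliberately crude illustration.

```python
def audit_position_bias(pairs, judge) -> float:
    """pairs: list of (answer_a, answer_b); judge(a, b) returns "A" or "B".
    Returns the fraction of comparisons whose winner flips when order is swapped."""
    flips = 0
    for a, b in pairs:
        first = judge(a, b)
        second = judge(b, a)
        # A consistent judge maps a win for `a` in the first run to "B" in the swapped run.
        if (first == "A") != (second == "B"):
            flips += 1
    return flips / len(pairs)

def audit_length_bias(scored, min_gap: int = 200) -> float:
    """scored: list of (answer_text, score). Do answers at least min_gap characters
    longer than the median outscore the rest on average?"""
    lengths = sorted(len(t) for t, _ in scored)
    median = lengths[len(lengths) // 2]
    long_scores = [s for t, s in scored if len(t) >= median + min_gap]
    rest = [s for t, s in scored if len(t) < median + min_gap]
    if not long_scores or not rest:
        return 0.0
    return sum(long_scores) / len(long_scores) - sum(rest) / len(rest)
```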

Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

· 11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing against, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.
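A sketch of the six rows the excerpt says the classic template is missing, captured per incident; the field names are paraphrased from the excerpt, not a standard.

```python
from dataclasses import dataclass

@dataclass
class LLMIncidentFields:
    """The variables that move silently in an LLM stack."""
    prompt_revision: str                   # e.g. revision of the prompt serving during the window
    model_slice: str                       # which model/version the router actually picked
    judge_config: str                      # evaluator model + prompt that scored (or missed) the regression
    retrieval_index_state: str             # index snapshot serving when quality complaints landed
    tool_schema_versions: dict[str, str]   # schema version per tool the planner composed against
    traffic_mix: str                       # segment/intent mix that hit during the bad window
```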

Long-Context vs RAG in 2026: Why It Is a Per-Feature Decision, Not an Architecture Religion

· 13 min read
Tian Pan
Software Engineer

The economics of long-context vs RAG have flipped twice in two years, and the team that picked an architecture in either of those windows is now paying the wrong tax everywhere. In 2024 the trend line said stuff everything in the context window because the windows kept growing and the per-token price kept falling, so retrieval pipelines were dismissed as legacy plumbing. In 2025 the consensus reversed: context rot research showed that the effective recall on million-token prompts collapsed in the middle of the window, latency on full-window calls turned into a UX problem, and the bills came back loud, so retrieval was rehabilitated. By 2026 the right answer is neither slogan. It is a per-feature decision, made at design time with a four-axis trade-off written down, because picking one architecture for the whole product is the cheap way to be wrong on every feature at once.

The mental model that keeps biting teams is treating long-context vs RAG as a roadmap commitment instead of a per-surface choice. You read one influential blog, you pick a side, you hire engineers who specialize in that side, you write a platform doc that codifies it, and now every new feature gets the same architecture regardless of whether it fits. The features that need fresh data live with stale context. The features that need scalable corpora pay for retrieval infrastructure they will never use. The features that need citation provenance ship without it. None of these are bugs. They are the predictable cost of treating a feature-level decision as a product-level one.
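The excerpt does not name its four axes beyond freshness, corpus scale, citation provenance, and latency/cost; under that assumption, a per-feature decision record might look like the sketch below, with all field names invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class ContextStrategy(Enum):
    LONG_CONTEXT = "long_context"
    RAG = "rag"

@dataclass
class FeatureContextDecision:
    feature: str
    freshness_needed: str        # real-time data vs a weekly snapshot is fine
    corpus_scale: str            # fits in one window vs millions of documents
    provenance_required: bool    # does the answer need citations back to sources?
    latency_cost_budget: str     # per-call budget that full-window prompts would blow
    choice: ContextStrategy
    rationale: str               # written at design time, revisited when the economics flip again
```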

The 30-Day Prompt Apprenticeship: Onboarding Engineers When 'Read the Code' Doesn't Work

· 12 min read
Tian Pan
Software Engineer

A senior engineer joins your team on Monday. By Friday they've shipped a TypeScript refactor that touches eleven files and passes review with two nits. The same engineer, two weeks later, opens the system prompt for your routing agent — 240 lines of instructions, three numbered example blocks, four "you must never" clauses, and a paragraph at the bottom that reads like an apology — and stares at it for an hour. They cannot tell you what would happen if you deleted lines 87–94. Neither can the engineer who wrote them six months ago.

This is the gap nobody puts on the onboarding doc. A prompt-heavy codebase looks like a codebase, lives in the same repo, runs through the same CI, and gets reviewed in the same PRs. But its semantics live somewhere else: in the observed behavior of a model that nobody on the team built, against a distribution of inputs nobody fully enumerated, with failure modes that surface as PRs to add a sentence rather than as bug reports. The traditional tools of code reading — types, signatures, tests, naming — do almost no work. A new hire who tries to "read the code" learns nothing about why each line is there, and a team that hands them a Notion doc and a Slack channel is implicitly outsourcing onboarding to the prompt's original author.

Prompt Asset Depreciation: The Maintenance Schedule Your AI Team Doesn't Keep

· 9 min read
Tian Pan
Software Engineer

Engineering leaders are comfortable with the idea that code rots. Dependencies need updating, infrastructure has lifecycle management, certificates expire on a calendar nobody disputes. Yet the prompt repository gets treated as a write-once-read-many artifact — even though it defines how your product talks to a probabilistic engine that ships behavior changes every six weeks.

The system prompt tuned six months ago against the model that was current then is still in production. The few-shot examples chosen against a tokenizer that has since changed are still being injected on every call. The reranker prompt was tuned against an embedding endpoint the vendor deprecated last quarter. Nobody scheduled a review. Nobody is going to.

This is not a hypothetical failure mode. When one team migrated their prompt suite — meticulously stabilized against GPT-4-32k — to GPT-4.1 and GPT-4.5-preview, only 95.1% and 97.3% of their regression tests passed. A 3-5% silent quality regression is not a rounding error in production; at any non-trivial scale it is a customer-visible degradation that nobody on the team intentionally shipped. And those are the teams that even had a regression test suite. The median team's "regression test" is whatever vibes the on-call engineer formed during the last incident.

The category we are missing is prompt asset depreciation: a maintenance discipline that treats every production prompt as a depreciating asset with a known lifespan, not a constant.
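A sketch of the maintenance schedule that discipline implies: every production prompt carries the model it was last validated against and a review date, and something flags the overdue ones. The registry shape and the 90-day default are illustrative.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class PromptAsset:
    path: str                     # where the prompt lives in the repo
    validated_against: str        # model snapshot it was last tuned and regression-tested on
    validated_on: date
    review_interval: timedelta = timedelta(days=90)

    @property
    def review_due(self) -> date:
        return self.validated_on + self.review_interval

def overdue(assets: list[PromptAsset], today: date | None = None) -> list[PromptAsset]:
    """Prompts past their depreciation schedule; candidates for re-validation."""
    today = today or date.today()
    return [a for a in assets if a.review_due < today]
```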

Prompt Bisect: Binary-Searching the Edit That Broke Your Eval

· 10 min read
Tian Pan
Software Engineer

The eval scoreboard dropped two points overnight. The only thing that shipped between the green run and the red run is last week's prompt PR — the one with seventeen edits in it. Two reordered sections. Three new few-shots. A tightened refusal clause. A swapped role description. A handful of word-level rewordings someone called "polish." When the post-mortem starts, somebody says the obvious thing: "It must be one of those." And then they spend the next two days figuring out which.

That two days is the most expensive way to find a single regression. The methodology that costs minutes instead is borrowed wholesale from a decades-old kernel-debugging trick: bisect the patch. Treat the prompt as a sequence of revertible hunks, run the eval suite as the predicate, and let binary search isolate the line that flipped the score. The math is the same math git bisect runs on commits, and the discipline it forces on prompt management is a side benefit worth more than the bisect itself.
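The mechanics are exactly what git bisect does, applied to an ordered list of prompt edits: find the first edit after which the eval predicate fails. A sketch assuming each edit can be expressed as a function that transforms the prompt text.

```python
from typing import Callable

Edit = Callable[[str], str]   # one revertible hunk: takes prompt text, returns edited text

def apply_edits(base: str, edits: list[Edit], n: int) -> str:
    """Prompt with only the first n edits applied."""
    prompt = base
    for edit in edits[:n]:
        prompt = edit(prompt)
    return prompt

def bisect_edits(base: str, edits: list[Edit], eval_passes: Callable[[str], bool]) -> int:
    """Return the index of the first edit whose inclusion makes the eval fail.
    Assumes the base prompt passes and the fully edited prompt fails (git bisect's contract)."""
    lo, hi = 0, len(edits)          # invariant: passes with lo edits applied, fails with hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if eval_passes(apply_edits(base, edits, mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1                   # index of the offending edit
```

For the seventeen-edit PR in the excerpt, that is roughly five eval runs instead of two days of guessing.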

Prompt-Eligibility: The Missing Column in Your Data Classification

· 11 min read
Tian Pan
Software Engineer

Pull up your company's data classification policy. Public, internal, confidential, restricted — four neat tiers, each mapped to a set of access controls and a list of approved storage locations. Now ask a question the policy was never written to answer: which of these tiers are allowed to leave the corporate perimeter as a token sequence sent to a third-party model API?

The answer is almost always silence. Not because the policy is wrong, but because it is incomplete. Every classification scheme in use today was designed for an access vector that asks "is this employee allowed to read this row?" The prompt layer introduced a different vector entirely: an authorized service reads the row, transforms it into a prompt, and ships it across the network to a vendor that may log it, train on it, or hold it in plaintext for thirty days. None of that is read-access. None of it is covered.

This is the missing column. Until you add it, your data classification document is confidently asserting a control posture you do not have.
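A sketch of what the extra column could look like, assuming the four tiers named in the excerpt; the eligibility values and the gate function are placeholders, not policy.

```python
from enum import Enum

class Tier(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"

# The missing column: may data of this tier leave the perimeter as prompt tokens,
# and under what condition? (Values here are illustrative.)
PROMPT_ELIGIBILITY: dict[Tier, str] = {
    Tier.PUBLIC:       "allowed",
    Tier.INTERNAL:     "allowed_with_zero_retention_vendor",
    Tier.CONFIDENTIAL: "redaction_required",
    Tier.RESTRICTED:   "never",
}

def assert_prompt_eligible(tier: Tier, vendor_zero_retention: bool, redacted: bool) -> None:
    """Gate to call before a service turns a classified row into prompt tokens."""
    rule = PROMPT_ELIGIBILITY[tier]
    if rule == "never":
        raise PermissionError(f"{tier.value} data may not be sent to a model API")
    if rule == "allowed_with_zero_retention_vendor" and not vendor_zero_retention:
        raise PermissionError("vendor must contractually disable logging and training retention")
    if rule == "redaction_required" and not redacted:
        raise PermissionError("redact before building the prompt")
```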