458 posts tagged with "ai-engineering"

The AI Feature OKR Mismatch: Why Quarterly Cadence Breaks AI Roadmaps

· 10 min read
Tian Pan
Software Engineer

The team commits to "ship the AI summarizer this quarter," gets it past the technical bar by week ten, takes a victory lap at the all-hands, and ships. Six weeks later the telemetry curve starts bending the wrong way — quietly, slowly, in a way nobody dashboards because nobody owns the shape. The OKR is already marked green. The next quarter's OKRs are already drafted around new launches. The summarizer is now somebody's second-priority maintenance job, and by quarter-end review the team is wondering why customer satisfaction on the feature dropped fifteen points without anything obvious changing.

This is not a bug in the team. It's a bug in the operating model. Quarterly OKRs were calibrated for software where a feature can be scoped, built, shipped, and then largely left alone until the next major rev. AI features don't have that shape. They have a launch curve and a sustain curve, and the sustain curve is where most of the value — and most of the risk — actually lives. The OKR template that treats them as deliverables with launch dates quietly produces a portfolio of demos that decay before the next planning cycle.

The AI Feature RACI: Why Four Green Dashboards Add Up to a Broken Product

· 11 min read
Tian Pan
Software Engineer

An AI feature regresses on a Tuesday. The eval CI is green. The guardrail dashboards are clean. The retrieval P95 is in line. The model provider had no incident. And yet the support queue is filling up with users who say the assistant "feels worse this week." The PM is the only person in the room who can name the regression, and even she cannot tell you which dashboard would have caught it. Welcome to the seam bug — the kind of failure where every individual artifact owner can prove their piece is fine, and the integrated experience is still broken.

This is the predictable result of how AI features get staffed. The owner-of-record list looks reasonable on paper: a prompt author owns the system prompt, an eval owner owns the offline test set and CI gates, a tool/retrieval owner owns the function calls and search index, a guardrail owner owns moderation and policy filters. Plus a model-selection decision that often lives outside all four — sometimes with a platform team, sometimes with whichever engineer most recently filed the procurement ticket. Five owners. Zero of them are on the hook for "does this feature work for the user."
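To make the seam visible at all, some teams add a small end-to-end check that scores the integrated answer rather than any single owned artifact. A minimal sketch, assuming a hypothetical run_assistant that exercises the full pipeline of prompt, retrieval, tools, and guardrails together:

```python
# A minimal sketch of a seam-level check: instead of asserting on each owned
# artifact in isolation, score the answer a real user would see.
# All names here (run_assistant, SeamCase, SEAM_CASES) are hypothetical.
from dataclasses import dataclass

@dataclass
class SeamCase:
    user_query: str
    must_mention: list[str]   # facts the final answer should surface

SEAM_CASES = [
    SeamCase("How do I rotate my API key?", ["Settings", "Rotate key"]),
]

def run_assistant(query: str) -> str:
    """Placeholder for the full pipeline: prompt + retrieval + tools + guardrails."""
    raise NotImplementedError

def seam_suite() -> float:
    """Pass rate over the integrated experience, not over any one component."""
    passed = 0
    for case in SEAM_CASES:
        answer = run_assistant(case.user_query)
        if all(fact.lower() in answer.lower() for fact in case.must_mention):
            passed += 1
    return passed / len(SEAM_CASES)
```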

The AI Interview Has No Signal: Why Your Loop Doesn't Identify People Who Ship LLM Products

· 10 min read
Tian Pan
Software Engineer

A team I know spent six months running their standard senior-engineer loop with an "AI round" bolted on. They interviewed seventy candidates. They hired three. None of the three shipped an agent that survived a production weekend. The team blamed the talent market. The talent market was fine. The loop was the problem.

The standard engineering interview was calibrated for a stack where correctness is verifiable, performance is measurable on a benchmark, and a good engineer is someone who can decompose a problem into deterministic components and reason about edge cases against a known specification. That stack still exists, and those skills still matter, but the cluster of skills that predicts shipping LLM products is largely orthogonal to it. Your loop is asking the right questions about the wrong job.

This is a structural problem, not a calibration nudge. Adding a forty-five-minute "AI round" to a loop calibrated for deterministic systems doesn't surface AI builders — it surfaces the intersection of classical-systems-strong and LLM-fluent candidates, which is a vanishingly small set, and produces six months of failed loops while everyone wonders where all the AI engineers went.

Chat History Is a Database. Stop Treating It Like Scrollback.

· 11 min read
Tian Pan
Software Engineer

The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.

The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.
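Here is a minimal sketch of the two mental models side by side; the ConversationStore, its working_set, and the keyword recall are illustrative stand-ins, not any particular framework's API:

```python
# Scrollback: keep the last N turns, lose everything older. The eviction policy
# is a slice nobody named.
def scrollback_context(history: list[dict], max_turns: int = 20) -> list[dict]:
    return history[-max_turns:]

# Database: append-only writes plus explicit queries over them. The hot working
# set, the eviction policy, and the query pattern become named decisions.
from dataclasses import dataclass, field

@dataclass
class ConversationStore:
    turns: list[dict] = field(default_factory=list)   # append-only write path

    def append(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def working_set(self, n: int = 8) -> list[dict]:
        """Hot tail: the last few turns, almost always needed."""
        return self.turns[-n:]

    def recall(self, query: str, k: int = 4) -> list[dict]:
        """Cold lookup: naive keyword relevance standing in for a real index."""
        scored = [
            (sum(w in t["content"].lower() for w in query.lower().split()), t)
            for t in self.turns[:-8]
        ]
        return [t for score, t in sorted(scored, key=lambda s: -s[0])[:k] if score]

    def build_context(self, query: str) -> list[dict]:
        """Query pattern: a little recalled history plus the hot tail."""
        return self.recall(query) + self.working_set()
```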

Counterfactual Logging: Log Enough Today to Replay Yesterday's Traffic Against Next Year's Model

· 13 min read
Tian Pan
Software Engineer

Every LLM team eventually gets the same email from a director: "Anthropic shipped a new Sonnet. Run our traffic against it and tell me by Friday whether we should switch." The team opens the production trace store, pulls last month's requests, queues them against the new model — and three hours in, somebody asks why the diff scores look insane on tool-using turns. The answer: nobody captured the tool responses in their original form. The traces logged the model's reply faithfully and stored a one-line summary of what each tool returned. Replaying those requests doesn't replay what the old model actually saw; it replays a heavily compressed projection of it. The migration evaluation isn't measuring the new model. It's measuring the new model talking to a different reality.

This is the failure mode I want to talk about. Most production LLM logs are output-shaped: they answer "what did the model say?" reasonably well, and answer "what did the model see?" only sketchily. That asymmetry is invisible until the day you need to replay history against a new model — at which point it becomes the entire story, because the gap between what was logged and what was sent is exactly the gap between a real evaluation and a fake one.

Call it counterfactual logging: capture today the inputs you'd need to ask "what would that other model have done with this exact request?" tomorrow. The bar isn't "we logged the request." The bar is "we can re-execute the request against a different model and trust the result is meaningful."
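What that bar implies for the log schema is roughly this; a minimal sketch, assuming an OpenAI-style chat-completions shape and hypothetical field names:

```python
# An input-shaped trace record: tool results are stored verbatim, not
# summarized, so the request can be re-executed against a different model.
import json
import time
import uuid
from dataclasses import dataclass, asdict, field

@dataclass
class ReplayableTrace:
    trace_id: str
    model: str
    system_prompt: str        # exact text as sent, not a template name
    messages: list[dict]      # full message list, including tool calls and
                              # raw tool responses in their original form
    tool_schemas: list[dict]  # the function definitions the model actually saw
    sampling: dict            # temperature, max_tokens, etc.
    output: dict              # what the model said (still worth keeping)
    created_at: float = field(default_factory=time.time)

def log_trace(model, system_prompt, messages, tool_schemas, sampling, output,
              path="traces.jsonl"):
    record = ReplayableTrace(str(uuid.uuid4()), model, system_prompt,
                             messages, tool_schemas, sampling, output)
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```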

The Wiki Has a Second Tenant: Why Docs for AI Agents Are Different from Docs for Humans

· 10 min read
Tian Pan
Software Engineer

A senior engineer at a mid-sized SaaS company spent two days last quarter chasing a deployment bug that turned out to be the agent's fault. The agent had read a runbook last updated in 2023, faithfully followed step three, and ran a command that no longer existed in the deploy tooling. The runbook still rendered fine in the wiki — the screenshots were even still legible — but it had silently become hostile to a reader who couldn't tell that the surrounding context was stale. The human authors had no idea the doc was now a load-bearing input for every new hire's AI assistant.

This is the quiet shift that has happened in most engineering orgs over the past eighteen months: the internal wiki has accumulated a second audience. The same Confluence pages, the same architecture diagrams, the same "how we deploy" gists are now being read by two distinct consumers — the engineers themselves and the AI assistants those engineers use. The two readers consume the same words under entirely different constraints and produce systematically different failure modes when the docs were written with only the first one in mind.
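One hedge is to make staleness machine-checkable before an agent treats a page as ground truth. A minimal sketch, assuming pages carry a hypothetical last_verified field in their front matter:

```python
# A freshness gate for agent-readable runbooks. The field name and threshold
# are illustrative, not a standard.
from datetime import date, timedelta
from typing import Optional

MAX_AGE = timedelta(days=180)

def is_safe_for_agents(frontmatter: dict, today: Optional[date] = None) -> bool:
    today = today or date.today()
    verified = frontmatter.get("last_verified")
    if verified is None:
        return False   # never verified: humans may cope, an agent should not
    return today - date.fromisoformat(verified) <= MAX_AGE

# The 2023 runbook from the anecdote would fail this check.
print(is_safe_for_agents({"last_verified": "2023-03-01"}))   # False (stale)
```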

Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait

· 11 min read
Tian Pan
Software Engineer

Green CI is not the statement "this prompt works." Green CI is the statement "the engineer who wrote the evals could not think of how this prompt should break." Those are very different claims, and the gap between them is where your production incidents live. An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Their dialect, their domain knowledge, their seniority, their pet failure modes, the model they happened to be using when they wrote the test cases. Everything that engineer would not think to test is, by construction, untested. And worse: they will keep extending the suite from the same vantage point, so the blind spot does not shrink as the suite grows. It calcifies.

This is the eval-author monoculture problem, and it is the most under-discussed reliability risk in AI engineering today. Teams obsess over judge bias, position bias, verbosity bias, leakage, and contamination — but the upstream bias is the bias of the human who decided what the test cases should be in the first place. Every other source of eval error gets amplified by it. If your suite was written by one person, you have a benchmark with a personality, and that personality is the silent ceiling on what your CI can ever catch.
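A first step is simply making the monoculture measurable. A minimal sketch, with hypothetical case and author fields:

```python
# Tag every eval case with its author and report how concentrated the suite is.
from collections import Counter

eval_cases = [
    {"id": "summarize-001", "author": "alice"},
    {"id": "summarize-002", "author": "alice"},
    {"id": "summarize-003", "author": "bob"},
]

def author_concentration(cases: list[dict]) -> dict:
    counts = Counter(c["author"] for c in cases)
    top_author, top_count = counts.most_common(1)[0]
    return {
        "authors": len(counts),
        "top_author": top_author,
        "top_share": top_count / len(cases),   # a high share is the monoculture smell
    }

print(author_concentration(eval_cases))
```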

The Eval Harness, Not the Prompt, Is Your Real Provider Lock-In

· 10 min read
Tian Pan
Software Engineer

Every "we'll just swap providers if needed" plan in the deck has a budget line for prompt rewrites. None of them has a line for the eval suite. That is the bug. The prompts are the visible coupling — the part you wrote, the part you can grep for, the part a junior engineer can rewrite in an afternoon. The eval harness is the invisible coupling, and it is the one that will eat a quarter of your roadmap when you actually try to migrate.

The pattern shows up the moment leverage matters. Your contract is up. A competitor releases a model that benchmarks better on your domain. Pricing on output tokens shifts under you. You go to run the candidate model through your eval suite to make the call, and within a day you discover that you cannot trust any score the harness produces, because the harness itself was written against the incumbent. You are not comparing models. You are comparing one model against a measurement instrument that was calibrated to the other one.
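The usual mitigation is to keep the scoring side of the harness behind a provider-neutral interface so the judge never touches a vendor payload. A minimal sketch, with an illustrative Completion shape rather than any SDK's response object:

```python
# Scoring sees only a neutral Completion, never a provider-specific response,
# so swapping the client does not invalidate the measurement instrument.
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class Completion:
    text: str
    tool_calls: list[dict]

class ModelClient(Protocol):
    def complete(self, messages: list[dict]) -> Completion: ...

def run_harness(client: ModelClient, cases: list[dict],
                score: Callable[[str, dict], float]) -> float:
    """Average score over the eval set, computed against the neutral shape."""
    scores = [score(client.complete(c["messages"]).text, c) for c in cases]
    return sum(scores) / len(scores)
```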

Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It

· 11 min read
Tian Pan
Software Engineer

A product manager writes a paragraph: "The assistant should be helpful, accurate, and concise, and should never make the customer feel rushed." An engineer reads that paragraph, opens a YAML file, and writes 47 weighted criteria so the LLM-as-judge can produce a number on every trace. Six months later, that YAML file is the actual specification of the product. Every release is gated on it. Every regression alert fires on it. Every "this is shipping quality" decision routes through it. The PM has never read it.

This is the most common form of unintentional product ownership transfer in AI engineering today. The rubric is not a measurement of the spec — it is the spec, in the same way that a compiler is not a description of your language but the operational truth of it. And like compilers, rubrics have implementation details that silently determine semantics. Which failure mode gets a 0 versus a 0.5? Which criterion is weighted 0.3 versus 0.05? Which behavior is absent from the rubric and therefore goes uncounted entirely? Each of these is a product decision. None of them lived in the original brief.
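A toy version of that YAML makes the point: every weight below is a product decision, and anything missing from the list is invisible to the gate. The criteria and numbers are illustrative, not the 47 from the anecdote:

```python
# A weighted rubric and the score the release gate actually runs on.
RUBRIC = [
    {"name": "factually_accurate", "weight": 0.40},
    {"name": "concise",            "weight": 0.15},
    {"name": "does_not_rush_user", "weight": 0.05},   # the PM's sentence, now worth 5%
]

def rubric_score(judge_scores: dict) -> float:
    """judge_scores maps criterion name to a 0.0..1.0 score from the LLM judge."""
    total_weight = sum(c["weight"] for c in RUBRIC)
    weighted = sum(c["weight"] * judge_scores.get(c["name"], 0.0) for c in RUBRIC)
    return weighted / total_weight

# A behavior absent from RUBRIC contributes nothing, no matter how badly it fails.
print(rubric_score({"factually_accurate": 1.0, "concise": 0.5}))
```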

The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse

· 11 min read
Tian Pan
Software Engineer

The most expensive failure mode in an LLM product is not a bad release. It is six consecutive good releases — by every internal scoreboard — while user trust quietly bleeds out. The offline eval score climbs every Friday demo. The CSAT line in the weekly business review goes flat, then dips, then nobody knows when it started dipping because nobody was triangulating the two charts. By the time a postmortem names it, the team has spent two quarters tuning a prompt against a dataset that stopped resembling reality somewhere around month three.

This is the eval-set-as-simulator drift, and it is the cleanest example I know of an old machine-learning lesson being rediscovered at full retail price by a generation of LLM teams who skipped the reading list. An eval suite is not a fixture. It is a simulator, and a simulator that is never re-calibrated against the system it claims to predict will eventually predict a different system.
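Re-calibration does not have to be elaborate. A minimal sketch of one proxy, assuming you can label recent production traffic with the same intent taxonomy as the eval set:

```python
# Fraction of recent production traffic whose intent appears in the frozen eval
# set. A downward drift month over month means the simulator no longer
# simulates the system it claims to predict.
from collections import Counter

def coverage_of_production(eval_intents: list[str], prod_intents: list[str]) -> float:
    eval_set = set(eval_intents)
    prod_counts = Counter(prod_intents)
    covered = sum(n for intent, n in prod_counts.items() if intent in eval_set)
    return covered / sum(prod_counts.values())

print(coverage_of_production(["refund", "billing"], ["refund", "export", "export"]))
```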

Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.

That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.
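The mechanical fix is to treat the examples as something you re-validate on every migration rather than carry forward. A minimal sketch, with run_eval standing in for whatever harness the team already has:

```python
# On a model migration, score the prompt with and without the few-shot
# examples before assuming they still help.
def run_eval(model: str, prompt: str, examples: list[str]) -> float:
    """Returns exact-match accuracy on the eval set (placeholder for your harness)."""
    raise NotImplementedError

def revalidate_few_shots(new_model: str, prompt: str, examples: list[str]) -> list[str]:
    with_examples = run_eval(new_model, prompt, examples)
    without_examples = run_eval(new_model, prompt, [])
    # If the examples no longer pay for themselves on the new pairing, drop
    # them rather than carrying rotted demonstrations forward.
    return examples if with_examples > without_examples else []
```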

Generative UI as a Production Discipline: When the Model Renders the Screen

· 12 min read
Tian Pan
Software Engineer

The button label that shipped to your users last Tuesday was never seen by a copywriter, never reviewed in Figma, never QA'd, and didn't exist until inference time. It was generated by a model that decided, mid-conversation, that the right way to collect a shipping address was a six-field form rendered inline rather than three more turns of prose. The form worked. The label was fine. Nobody on the team can tell you which model run produced it, because the trace was rotated out of hot storage and the eval suite tests text outputs, not component graphs.

This is generative UI in production: the model is no longer just a text generator that occasionally invokes a tool. It is a UI compiler whose output is a component tree, and the design system is now a contract the model is constrained to rather than a guideline a human loosely follows. The shift breaks an entire stack of assumptions — QA against static specs, accessibility audits of fixed layouts, copy review of finalized strings, design-system adherence checks at build time — and most teams ship the feature before they have replaced any of them.