Skip to main content

678 posts tagged with "ai-engineering"

View all tags

Browser Agent Session Bleed: When One Profile Serves Many Tenants

· 10 min read
Tian Pan
Software Engineer

A computer-use agent finishes a task on a customer's CRM, the worker pool returns the browser to its idle ring, the next request lands a few hundred milliseconds later, and the navigation to the dashboard succeeds — except it succeeds as the wrong user. The OAuth cookie from the previous session was still on the profile. The trace shows navigation succeeded, screenshot captured, action performed. Nothing in the run log says the agent was acting as someone who never asked it to.

This is the failure class that browser agents inherit silently from the libraries they're built on. Headless browser frameworks were designed for one user per profile because that's how a browser has worked for thirty years. When a worker pool reuses profiles to amortize the eight-second cold start of a fresh Chromium instance, that one-user assumption breaks, and the breakage is invisible to every layer of telemetry the team usually trusts.

The Eval Ceiling: When Your Golden Test Cases Stop Discriminating

· 10 min read
Tian Pan
Software Engineer

A year ago, your eval suite did its job beautifully. Candidate models came back with scores spread between 60 and 80, and the ranking told you something. The new fine-tune beat the baseline by six points; the cheaper model lost three. Decisions flowed from the numbers. Today, every candidate scores 95 or 96 or 97 on the same suite, and the spread has collapsed into noise. Your team is still running the eval, still reading the report, still using it to green-light migrations — but the report has stopped containing information.

This is not benchmark contamination. It is not world-drift decay. It is a measurement-instrument problem: your test cases were calibrated for a difficulty level that the platform passed. The ruler hasn't broken; the things you're measuring have outgrown it. And the team that doesn't notice keeps making model decisions with a tool whose discriminating range no longer overlaps the candidates being compared.

Eval Datasets Are Customer Data With a Right Answer Attached

· 12 min read
Tian Pan
Software Engineer

Your golden eval set is a privacy boundary your security team didn't know existed. It is built by sampling production traces, which means it is a curated collection of real customer queries — often containing names, emails, account numbers, transcripts of frustrated calls, half-typed credit card digits — paired with the canonical correct response on top, and then committed to whatever bucket the eval pipeline reads from.

That last part is what makes eval data uniquely dangerous. A raw production trace is sensitive because it captures what the customer said. An eval case is sensitive in a new way because it captures what the customer said plus the labeled correct answer. The label is a derivative work that someone, often an annotator or a domain expert, applied with intent. It signals "this is canonical." It gives the trace a longevity that the original log never had — log retention will eventually rotate the trace out, but the eval case is now a permanent test fixture that the team is committed to keeping green.

The Fallback That Became the Default: Why Your Tier Mix Needs an SLO

· 11 min read
Tian Pan
Software Engineer

The dashboard says the fallback fires on 0.5% of requests. The dashboard has been saying that for six months. Then someone re-runs telemetry from scratch and finds the secondary model is serving 38% of traffic and the canned-response tier is serving another 9%. The frontier-model "primary path" the team has been talking about in roadmap reviews is, in fact, the minority experience. Nobody noticed because no single alert ever fired — every demotion was a small, well-justified, locally correct decision, and the cumulative drift never crossed any threshold someone had thought to set.

This is the failure mode I want to name: the fallback that became the default. It is not an outage. It is not a regression in any single component. It is a slow rotation of the product surface where the degraded path stops being a safety net and starts being the experience. The team's mental model and production reality drift apart, and the gap is invisible because the only meters in place are designed to detect failure, not to detect mix.

I'll claim something stronger: if your AI feature has more than two tiers of service, your tier mix is itself an SLO, and if you aren't measuring it, you don't actually know what you ship.

Hyrum's Law for Streamed Reasoning: Pacing, Pauses, and Intermediate Tokens Are an Undocumented Contract

· 11 min read
Tian Pan
Software Engineer

A team upgrades from a frontier model to its faster successor. The eval suite is green. Final answers match. Tool-call schemas are identical. The structured outputs validate against the same JSON schema they always did. They ship. Within a day, support tickets pile up: "the assistant feels rushed," "it's not really thinking anymore," "something is off." The product manager pulls telemetry and finds task-completion rates unchanged. The engineering team double-checks the eval and the schema and finds nothing wrong. The complaint is real, but the contract — as the team defined it — is intact.

What changed is the texture of the stream. The old model paused for 800 milliseconds before calling a tool, emitted a "Let me check that..." preamble, and dribbled tokens at roughly 35 per second with natural-feeling clusters around clause boundaries. The new model emits tokens at 90 per second, never pauses, and skips the preamble entirely. None of that was in any documented contract. All of it was load-bearing.

This is Hyrum's law, and streaming makes its surface area enormous. Any observable behavior of your system will be depended on by somebody — and a streaming AI surface exposes far more observable behavior than the team realizes.

The Mixed PR Queue: Reviewer Throughput Is Now the Binding Constraint

· 10 min read
Tian Pan
Software Engineer

For the last twenty years, the Theory of Constraints answer in software delivery was the same: the bottleneck is producing code. We tooled around that assumption — pair programming, IDE autocompletion, faster CI, smaller services, all designed to push more code through a fixed-width review pipe. Then coding agents arrived, the production side of the pipe got 5–10x wider, and the review pipe stayed exactly the same width. A senior engineer who used to open three PRs a week now supervises a fleet that opens thirty in an afternoon. The team's velocity is no longer set by how fast anyone writes code. It's set by how fast a human can read it.

This is not a future problem. Median PR review time has been measured at +441% year over year in some samples, and 31% more PRs are merging with zero review — not by policy, but because reviewers gave up trying to keep pace. Stripe is shipping over a thousand agent-produced PRs per week. Feature-branch throughput grew 59% YoY in one benchmark while main-branch throughput fell 7% — code is being written, but it's not getting promoted, because it's stuck in review.

The Prompt Bench Press: Stress-Testing Prompts Outside the Happy Path

· 10 min read
Tian Pan
Software Engineer

A prompt that scores 92% on your eval set and 60% on real production traffic is not a prompt with a bug. It is a prompt whose evaluation set was structurally incapable of finding the bug. The gap is not noise. It is the consequence of optimizing against examples that share a register, a length distribution, a language, and a politeness level with the prompt's design intent — the very same intent that wrote the eval cases.

Real users do not cooperate with your design intent. They send three-word fragments, twelve-paragraph essays, code blocks pasted as questions, casual register that drops articles, formal register that adds honorifics, and queries in languages your few-shot examples never used. None of this is adversarial. It is just the input distribution. And if your eval set was curated by the same person who wrote the prompt, it almost certainly looks nothing like that distribution.

The discipline that closes this gap is not "more evals." It is a different kind of eval — a stress matrix that deliberately varies the dimensions your curated set holds constant, and that grades degradation curves rather than a single accuracy number. Call it the prompt bench press: you are not testing whether the prompt can do the work. You are testing how it fails as the input gets harder.

The Regional Model Rollout Lottery: When Your Product Quietly Behaves Differently by Continent

· 11 min read
Tian Pan
Software Engineer

A customer-success email lands on a Friday afternoon: "the model got worse for our German users." The team pulls up the eval dashboard. Scores are flat. Latency p95 is normal. The model name in the config is the same one shipped three weeks ago. Nothing changed. Except something did. The US endpoint quietly received the new model generation last sprint, the EU endpoint is still on the prior version because the provider hasn't completed the regional rollout yet, and the load balancer in front of both has been hiding the gap from every dashboard the team owns.

This is the regional model rollout lottery. Your "single model" abstraction is not single. It bifurcates the moment a provider stages a release across continents — which is most of the time, for most providers, in most years. The version string in your client SDK does not change when this happens. Your traces look identical. Your contract with the provider does not promise otherwise. And your eval suite, the artifact you trust to catch behavioral regressions, is almost certainly running from a CI box that lives in one region and hits whichever endpoint is geographically closest.

Sampling Drift: When Temperature and Top-P Become Tribal Knowledge

· 9 min read
Tian Pan
Software Engineer

Open the production config of any AI feature that has been live for more than a year and you will find an archaeological dig site. temperature: 0.7 because someone needed the demo to feel less robotic. top_p: 0.85 because a customer complained the outputs were too generic. frequency_penalty: 0.4 because there was a bad week in 2024 where a now-retired model kept repeating itself. None of these decisions are documented. None of them have been re-tested against the current foundation model. They run on every request, in every eval, in every A/B, shaping behavior nobody has consciously chosen since the original ticket got closed.

This is sampling drift. It is the slow accumulation of expedient sampler tweaks whose original justifications evaporate while their effects compound. The values in your config are not "tuned" — they are a fossil record of past incidents, scaled to the volume of your current traffic.

The reason it is invisible is structural. Every eval you run scores against the current sampling config, so the headline number always looks fine. There is no alarm that fires when a temperature value is two foundation-model versions out of date. There is no calendar invite that says "re-grid sampling parameters this quarter." The decay is silent until somebody runs a clean experiment and finds a quality lift, a token reduction, or both, sitting in plain sight at no engineering cost.

Spec, Code, Tests, One Author: The Independence You Quietly Lost

· 11 min read
Tian Pan
Software Engineer

When the same model writes the requirements, implements them, and authors the assertions that say it is correct, "all tests pass" is no longer evidence the feature works. It is evidence the model is internally consistent. Those are different things, and the difference is the entire point of having tests in the first place.

The standard story we tell about test suites is that they are a second opinion. The author wrote the code with one mental model of the requirement, the test author wrote the assertions with a slightly different mental model, and the points where the two models disagree are where the bugs live. That story depends on the test author having a different cognitive vantage point than the code author. Strip out the difference in vantage points and the test suite stops carrying any independent information about correctness — it only carries information about consistency.

The Specification Translation Tax: When Spec, Prompt, and Eval Drift Apart

· 11 min read
Tian Pan
Software Engineer

A PM writes a feature spec in English. An engineer translates it into a system prompt with idiomatic LLM patterns — chain-of-thought scaffolding, output format coercion, a few hedge clauses to cover failure modes the spec never mentioned. An eval author opens the same spec, re-reads it cold, and writes JSON test cases against their interpretation. Three weeks later, all three artifacts disagree, and nobody can tell whether a regression is a prompt bug, a spec-implementation gap, or an eval that was wrong from day one.

This is the specification translation tax. Traditional software has it too — the gap between PRD and code, between code and tests — but compilers and type systems narrow it. AI features have no such backstop. The prompt is documentation that the system actually reads. The eval is a contract that nobody signed. The spec is a description of intent that nobody enforces. Each is a translation of the same intent into a different medium, and without bidirectional consistency, behavior leaks in through whichever artifact is easiest to edit.

Voice Agent Turn-Taking: The 250ms Threshold That Reshapes Your Architecture

· 11 min read
Tian Pan
Software Engineer

Linguists who study turn-taking across languages keep arriving at the same number: the gap between speakers in casual conversation is roughly 200 to 300 milliseconds. Anything longer reads as hesitation, distance, or deference; anything shorter reads as interruption. That window is so tight that humans demonstrably begin formulating their reply before the other person finishes — listening and planning happen in parallel, not in sequence.

Voice agents that miss this window do not feel slightly slow. They feel wrong. A 700ms gap that nobody notices in a chat product feels like the agent is dim, distracted, or about to be interrupted by the user out of impatience. A 1.5-second gap and the user is already repeating themselves. Hitting the budget is not a polish task — it forces architectural choices that text agents never have to face, and those choices reshape how the whole stack is built.