Skip to main content

763 posts tagged with "ai-engineering"

View all tags

Agent Latency Budgets Are Trees, Not Lines — You Have Been Debugging the Wrong Axis

· 12 min read
Tian Pan
Software Engineer

A user reports "the assistant felt slow this morning." The on-call engineer pulls up the flame graph, sorts tool calls by duration descending, finds the slowest one — a 2.1-second vector search — optimizes it down to 900ms, ships the fix, and marks the incident resolved. A week later the same complaint arrives. The vector search is still 900ms. But the end-to-end latency on that query type has actually gotten worse. Nothing in the flame graph explains why.

This is what happens when an engineer debugs a tree on the line axis. Agent latency is not a waterfall of sequential steps — it is a nested tree of planning calls, tool subtrees, parallel fan-outs, retries, and recursive sub-agents. When the budget is structural but the tooling treats it as linear, local optimizations miss the actual violation, which lives in how time is distributed across branches, not how long any single call takes. You can make every leaf faster and still ship a p99 that is getting worse.

Your AI Product Needs an SRE Before It Needs Another Model

· 9 min read
Tian Pan
Software Engineer

The sharpest pattern I see in struggling AI teams is the gap between how sophisticated their model stack is and how primitive their operations are. A team will run three frontier models in production behind custom routing logic, a RAG pipeline with eight retrieval stages, and an agent that calls twenty tools. They will also have no on-call rotation, no SLOs, no runbooks, and a #incidents Slack channel where prompts are hotfixed live by whoever happens to be awake. The product is operating on 2026 model infrastructure and 2012 operational infrastructure, and every week the gap costs them another outage.

The instinct when this hurts is to reach for the model lever. Quality dipped? Try the new release. Latency spiked? Switch providers. Hallucinations in production? Add another guardrail prompt. None of this fixes the underlying problem, which is that nobody owns the system's reliability as a discipline. What these teams actually need — usually before they need another applied scientist — is their first SRE.

The benchmark leak: how your eval set quietly joins the training corpus

· 11 min read
Tian Pan
Software Engineer

The benchmark you trust most is the one most likely lying to you. Public evals are a closed loop: you publish the test, someone scrapes it, the next generation of models trains on the scrape, and the score on your trusted yardstick rises by ten points without anyone touching the underlying capability. The measurement apparatus stays still while the thing it measures shifts under it, and the gap between "the model is better at this benchmark" and "the model is better at this task" widens every quarter. By the time the divergence is loud enough to argue about, the eval has already shipped six leaderboard updates and three product roadmaps that all assumed the number meant something.

This is not a hypothetical failure mode. The non-public pre-RLHF GPT-4 base model has been shown to reproduce the BIG-Bench canary GUID verbatim, and Claude 3.5 Sonnet has done the same, both indicating that supposedly-quarantined task data ended up in training. Roughly 40% of HumanEval examples have been identified as contaminated, and removing the contaminated subset from GSM8K drops measured accuracy by about 13 points. SWE-bench Verified now shows a documented 10.6% data leakage rate, and OpenAI publicly stopped reporting it in late 2025 after their internal audit found every major frontier model could reproduce verbatim gold patches for some tasks. The numbers we use to compare models are increasingly numbers about memorization, not capability.

Conversation History Is a Liability Your Prompt Never Admits

· 10 min read
Tian Pan
Software Engineer

Read your product's analytics the next time a user says "the AI got dumber today." Filter to sessions over twenty turns. You will find the same U-shape every time: early turns score well, middle turns score well, late turns fall off a cliff. The prompt hasn't changed. The model hasn't changed. What changed is that every one of those late turns is carrying a payload of user typos, false starts, model hedges, corrections that were later reversed, tool outputs nobody re-read, and the fossilized remains of a goal that the user abandoned on turn four. Your prompt template treats this sediment as signal. The model does too. It shouldn't.

Chat history is not free context. It is a liability you are paying to re-send on every turn, and the dirtier it gets, the more it corrupts the answer you are billing the user for. The chat metaphor is the source of the confusion. Chat interfaces habituate users and engineers to treat the transcript as sacred — scrollable, append-only, never reset. That habit is imported wholesale into LLM applications even though it has no physical basis in how models process context. The model is stateless. The transcript is just a string you chose to grow. You can shrink it. You often should.

The Demo Loop Bias: How Your Dev Process Quietly Optimizes for Impressive Failures

· 10 min read
Tian Pan
Software Engineer

There is a particular kind of meeting that happens at every AI-product team, usually on Thursdays. Someone shares their screen, drops a prompt into a notebook, and runs three or four examples. The room reacts. People say "wow." Someone takes a screenshot for Slack. A decision gets made — ship it, swap models, change the temperature. No one writes down the failure rate, because no one measured it.

This is the demo loop, and it has a structural bias that almost no team accounts for: it does not select for the best output. It selects for the most legible output. Over weeks and months, your prompt evolves to produce answers that land in a meeting — confident, fluent, well-formatted, on-topic. Whether they are correct is a separate variable, and it is one your process is not measuring.

The result is what I call charismatic failure: outputs that are wrong in ways your demo loop has been trained, by selection pressure, to ignore.

The 'We'll Add Evals Later' Trap: How Measurement Debt Compounds

· 9 min read
Tian Pan
Software Engineer

Every team that ships an AI feature without evals tells themselves the same story: we'll add measurement later, after we find product-market fit, after the prompt stabilizes, after the next release. Six months later, the prompt has been touched by four engineers and two product managers, the behavior is load-bearing for three customer integrations, and the team discovers that "adding evals later" means reconstructing intent from production logs they never structured for that purpose. The quarter that was supposed to be new features becomes a quarter of archaeology.

This isn't a planning mistake. It's a compounding one. The team that skipped evals to ship faster is the same team that will spend twelve weeks rebuilding eval infrastructure from incomplete traces, disagreeing about what "correct" meant in February, and quietly removing features nobody can prove still work. The cost of catching up exceeds the cost of building in — not by a little, but by a multiplier that grows with every prompt edit that shipped without a regression check.

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

· 11 min read
Tian Pan
Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.

LLM-as-Compiler Is a Metaphor Your Codebase Can't Survive

· 10 min read
Tian Pan
Software Engineer

The pitch is seductive: describe the behavior in English, the model emits the code, ship it. Prompts become the source, artifacts become the target, and the LLM sits between them like gcc with a friendlier front-end. If that framing held, the rest of software engineering — review, refactoring, architecture — would be downstream of prompt quality. It does not hold. And the codebases built on the assumption that it does start failing in a pattern that is now boring to diagnose: around month six, nobody can explain why a particular function looks the way it does, and every incremental change produces a wave of duplicates.

The compiler metaphor is the root cause, not vibe coding, not model quality, not prompt skill. It is a category error that quietly excuses teams from doing the work that keeps a codebase coherent over years. When you believe the model is a compiler, the generated code is an implementation detail, the same way assembly is an implementation detail of a C program. When you are actually running a team of non-deterministic, context-limited collaborators, the generated code is the asset — and the prompts are closer to Slack messages than to source.

LLM-as-Judge Drift: When Your Evaluator Upgrades and All Your Numbers Move

· 11 min read
Tian Pan
Software Engineer

A regression suite that flips green-to-red without a single prompt change is usually one of three things: a broken test harness, a flaky retrieval store, or a judge that learned new taste over the weekend. The third one is the most common and the least debugged, because no commit in your repo caused it. The scoring model got a silent quality refresh, and every score you compare against last month's dashboard is now denominated in a different currency.

This is the uncomfortable part of LLM-as-judge: you have two moving models, not one. The candidate model is the thing you ship; the judge model is the thing that tells you how the candidate is doing. When both evolve independently, score deltas stop meaning what they used to, and the dashboard that your PM refreshes every morning quietly lies.

Markdown Beats JSON: The Output Format Tax You're Paying Without Measuring

· 11 min read
Tian Pan
Software Engineer

Most teams flip JSON mode on the day they ship and never measure what it costs them. The assumption is reasonable: structured output is a correctness win, so why wouldn't you take it? The answer is that strict JSON-mode constrained decoding routinely shaves 5–15% off reasoning accuracy on math, symbolic, and multi-step analysis tasks, and nobody notices because the evals were run before the format flag was flipped — or the evals measure parseability, not quality.

The output format is a decoding-time constraint, and like every constraint it warps the model's probability distribution. The warp is invisible when you look at logs: the JSON is valid, the schema matches, the field types line up. What you cannot see in the logs is the reasoning that the model would have produced in prose but could not fit inside the grammar you gave it. The format tax is real, well-documented in the literature, and almost universally unmeasured in production.

This post is about when to pay it, how to stop paying it when you don't have to, and what a format-choice decision tree actually looks like for engineers who want structured output and accuracy at the same time.

Mid-Flight Steering: Redirecting a Long-Running Agent Without Killing the Run

· 10 min read
Tian Pan
Software Engineer

Watch a developer use an agentic IDE for twenty minutes and you will see the same micro-drama play out three times. The agent starts a long task. Two tool calls in, the user realizes they want a functional component instead of a class, or a v2 endpoint instead of v1, or tests written in Vitest instead of Jest. They have exactly one lever: the red stop button. They press it. The agent dies mid-edit. They copy-paste the last prompt, append the correction, and pay for the first eight minutes of work twice.

The abort button is the wrong affordance. It treats "I want to adjust the plan" and "I want to throw away the run" as the same gesture. In practice they are as different as a steering wheel and an ejector seat, and conflating them is why so many agent products feel brittle the moment a task takes longer than a single screen of output.

Eval Passed, With All Tools Mocked: Why Your Agent's Hardest Failures Never Reach the Harness

· 9 min read
Tian Pan
Software Engineer

Your agent hits 94% on the eval suite. Your on-call rotation is on fire. Nobody in the room is lying; both numbers are honest. What's happening is that the harness is testing a prompt, and production is testing an agent, and those are two different artifacts that happen to share weights.

Mocked-tool evals are almost always how this gap opens. You stub search_orders, charge_card, and send_email with canned JSON, feed the model a user turn, and assert on the final response. The run is cheap, deterministic, and reproducible — every property a CI system loves. It is also silent on tool selection, latency, rate limits, partial failures, and retry behavior, which is to say silent on the set of failures that dominate post-incident reviews.