Skip to main content

553 posts tagged with "ai-engineering"

View all tags

Why Your Voice Agent Feels Rude: Turn-Taking Is a Latency Budget You Never Wrote Down

· 11 min read
Tian Pan
Software Engineer

The first time you ship a voice agent, you'll get the same complaint twice: "It interrupted me," and "It feels rude." Both are the same bug. The agent isn't impolite — it's running on a latency budget you never wrote down. The chat-style instinct that says "respond when complete" produces a system that, in voice, feels like talking to someone who keeps stepping on your sentences and going silent at all the wrong moments.

Conversational turn-taking in humans happens in a window of roughly 100 to 300 milliseconds, and it does so across every language ever measured. A median 200ms inter-speaker gap isn't an aspiration; it's the baseline humans calibrate against. Anything slower reads as confusion, anything faster reads as interruption, and a voice agent that doesn't model the rhythm explicitly is going to land in one bucket or the other on every turn.

The fix isn't a faster model. It's accepting that voice AI is a soft real-time system whose budget is set by human conversational physics, and writing the budget down before you ship.

Why AI-Generated Comments Rot Faster Than the Code They Describe

· 11 min read
Tian Pan
Software Engineer

When an agent writes a function and a comment in the same diff, the comment is not documentation. It is a paraphrase of the code at write-time, generated by the same model from the same context, and it is silently wrong the first time the code shifts. The function gets refactored, an argument changes type, an early-return gets added, the comment stays. By next quarter, the comment is encoding a specification that no longer matches the code, and the next reader trusts the comment because the comment is easier.

This is an old failure mode — humans-edit-code-comments-stay-stale — but agents accelerate it across three dimensions at once. Comment volume goes up because agents add a doc block to every function whether it needs one or not. The comments are grammatically perfect, so reviewers don't flag them as low-quality. And the comments paraphrase the code in different terms than the code actually executes, so they look like documentation but encode a second specification that drifts independently of the first.

Debate Diversity Collapse: When Three Agents Vote 3-0 Because They Read the Same Internet

· 11 min read
Tian Pan
Software Engineer

The architecture diagram says "ensemble of three frontier models, debate-and-reconcile, majority vote." The trace says all three agents converged on the same answer in round one and spent two more rounds politely paraphrasing each other. The eval says +0.4 points over a single call. The bill says 4.2x. Somewhere in there, somebody decided the panel was working.

Multi-agent debate is sold as a way to get disagreement-driven reasoning: three minds arguing toward a better answer than any one of them would reach alone. It depends on the agents actually disagreeing. Frontier LLMs trained on overlapping web corpora, instruction-tuned against overlapping preference datasets, and aligned against overlapping safety taxonomies share priors more than the architecture diagrams admit. After a round of "let's reconcile," what you observe is not three perspectives converging on truth — it is three samples from one distribution converging on the mode they were never that far from.

The pattern has a name in the recent literature: when an ensemble's vote-disagreement rate trends to zero independent of question difficulty, you have debate diversity collapse. The panel is still voting. The vote no longer carries information.

The Local-Maximum Trap in Prompt Iteration: How to Tell You're Tweaking the Wrong Thing

· 10 min read
Tian Pan
Software Engineer

There is a moment, six weeks into a serious LLM project, where the prompt iteration log starts to look like a therapy journal. Each tweak swaps one failure mode for another. Add a stricter "do not" clause and the model becomes evasive on cases it used to handle. Soften the tone and a different category of hallucination returns. The eval scoreboard hovers in a band three or four points wide, refusing to break out. Someone says, "let me try one more reordering," and another half day evaporates.

This is the local-maximum trap. The team is climbing a hill, but the hill does not go higher. The cruel part is that the hill is real — every prompt change does produce a measurable delta on some subset of cases, which is exactly the signal that keeps everyone tweaking. What's missing is the recognition that the ceiling above is not a prompt ceiling at all.

Accept Rate Is a Vanity Metric: Your Copilot ROI Hides in the 90 Seconds After the Keystroke

· 11 min read
Tian Pan
Software Engineer

The dashboard says your engineers accepted 45% of AI suggestions last quarter. Leadership reads that as "45% of a developer's time saved" and signs the renewal. The engineers, meanwhile, are quietly rewriting half of what they accepted, debugging the other half, and wondering why their sprints still feel the same length. Both sides are looking at the same number. Only one of them is looking at the right number.

The most quoted study of 2025 should have ended the vendor-dashboard era on its own. METR measured experienced open-source maintainers working on real issues in their own repos, with and without AI. The developers predicted AI would speed them up by 24%. After the experiment they still believed AI had sped them up by 20%. The stopwatch said they were 19% slower. A thirty-nine-point gap between the story and the data — and the story is what went into the quarterly review.

The Agent Capability Cliff: Why Your Model Upgrade Made the Easy 95% Perfect and the Hard 5% Your Worst Quarter

· 11 min read
Tian Pan
Software Engineer

You shipped the new model. Aggregate eval pass rate went from 91% to 96%. Product declared it a win in the all-hands. Six weeks later, the reliability team is having their worst quarter on record — not because there are more incidents, but because every single incident is now the kind that takes three engineers and two days to resolve.

This is the agent capability cliff, and it is one of the most counterintuitive failure modes in production AI. Model upgrades do not raise all tasks uniformly. They concentrate their gains on the bulk of your traffic — the easy and medium cases where the previous model was already correct most of the time — while the long tail of genuinely hard inputs sees only marginal improvement. Your failure surface narrows, but every remaining failure is a capability-frontier case that the previous model also missed and that no cheap prompt engineering will fix.

The cliff is not a flaw in the new model. It is a mismatch between how we measure model improvement (average pass rate on a mixed-difficulty eval set) and what actually lands in on-call rotations (the residual set of the hardest traffic, now unpadded by the easier failures that used to dominate the signal).

Agent Latency Budgets Are Trees, Not Lines — You Have Been Debugging the Wrong Axis

· 12 min read
Tian Pan
Software Engineer

A user reports "the assistant felt slow this morning." The on-call engineer pulls up the flame graph, sorts tool calls by duration descending, finds the slowest one — a 2.1-second vector search — optimizes it down to 900ms, ships the fix, and marks the incident resolved. A week later the same complaint arrives. The vector search is still 900ms. But the end-to-end latency on that query type has actually gotten worse. Nothing in the flame graph explains why.

This is what happens when an engineer debugs a tree on the line axis. Agent latency is not a waterfall of sequential steps — it is a nested tree of planning calls, tool subtrees, parallel fan-outs, retries, and recursive sub-agents. When the budget is structural but the tooling treats it as linear, local optimizations miss the actual violation, which lives in how time is distributed across branches, not how long any single call takes. You can make every leaf faster and still ship a p99 that is getting worse.

Your AI Product Needs an SRE Before It Needs Another Model

· 9 min read
Tian Pan
Software Engineer

The sharpest pattern I see in struggling AI teams is the gap between how sophisticated their model stack is and how primitive their operations are. A team will run three frontier models in production behind custom routing logic, a RAG pipeline with eight retrieval stages, and an agent that calls twenty tools. They will also have no on-call rotation, no SLOs, no runbooks, and a #incidents Slack channel where prompts are hotfixed live by whoever happens to be awake. The product is operating on 2026 model infrastructure and 2012 operational infrastructure, and every week the gap costs them another outage.

The instinct when this hurts is to reach for the model lever. Quality dipped? Try the new release. Latency spiked? Switch providers. Hallucinations in production? Add another guardrail prompt. None of this fixes the underlying problem, which is that nobody owns the system's reliability as a discipline. What these teams actually need — usually before they need another applied scientist — is their first SRE.

The benchmark leak: how your eval set quietly joins the training corpus

· 11 min read
Tian Pan
Software Engineer

The benchmark you trust most is the one most likely lying to you. Public evals are a closed loop: you publish the test, someone scrapes it, the next generation of models trains on the scrape, and the score on your trusted yardstick rises by ten points without anyone touching the underlying capability. The measurement apparatus stays still while the thing it measures shifts under it, and the gap between "the model is better at this benchmark" and "the model is better at this task" widens every quarter. By the time the divergence is loud enough to argue about, the eval has already shipped six leaderboard updates and three product roadmaps that all assumed the number meant something.

This is not a hypothetical failure mode. The non-public pre-RLHF GPT-4 base model has been shown to reproduce the BIG-Bench canary GUID verbatim, and Claude 3.5 Sonnet has done the same, both indicating that supposedly-quarantined task data ended up in training. Roughly 40% of HumanEval examples have been identified as contaminated, and removing the contaminated subset from GSM8K drops measured accuracy by about 13 points. SWE-bench Verified now shows a documented 10.6% data leakage rate, and OpenAI publicly stopped reporting it in late 2025 after their internal audit found every major frontier model could reproduce verbatim gold patches for some tasks. The numbers we use to compare models are increasingly numbers about memorization, not capability.

Conversation History Is a Liability Your Prompt Never Admits

· 10 min read
Tian Pan
Software Engineer

Read your product's analytics the next time a user says "the AI got dumber today." Filter to sessions over twenty turns. You will find the same U-shape every time: early turns score well, middle turns score well, late turns fall off a cliff. The prompt hasn't changed. The model hasn't changed. What changed is that every one of those late turns is carrying a payload of user typos, false starts, model hedges, corrections that were later reversed, tool outputs nobody re-read, and the fossilized remains of a goal that the user abandoned on turn four. Your prompt template treats this sediment as signal. The model does too. It shouldn't.

Chat history is not free context. It is a liability you are paying to re-send on every turn, and the dirtier it gets, the more it corrupts the answer you are billing the user for. The chat metaphor is the source of the confusion. Chat interfaces habituate users and engineers to treat the transcript as sacred — scrollable, append-only, never reset. That habit is imported wholesale into LLM applications even though it has no physical basis in how models process context. The model is stateless. The transcript is just a string you chose to grow. You can shrink it. You often should.

The Demo Loop Bias: How Your Dev Process Quietly Optimizes for Impressive Failures

· 10 min read
Tian Pan
Software Engineer

There is a particular kind of meeting that happens at every AI-product team, usually on Thursdays. Someone shares their screen, drops a prompt into a notebook, and runs three or four examples. The room reacts. People say "wow." Someone takes a screenshot for Slack. A decision gets made — ship it, swap models, change the temperature. No one writes down the failure rate, because no one measured it.

This is the demo loop, and it has a structural bias that almost no team accounts for: it does not select for the best output. It selects for the most legible output. Over weeks and months, your prompt evolves to produce answers that land in a meeting — confident, fluent, well-formatted, on-topic. Whether they are correct is a separate variable, and it is one your process is not measuring.

The result is what I call charismatic failure: outputs that are wrong in ways your demo loop has been trained, by selection pressure, to ignore.

The 'We'll Add Evals Later' Trap: How Measurement Debt Compounds

· 9 min read
Tian Pan
Software Engineer

Every team that ships an AI feature without evals tells themselves the same story: we'll add measurement later, after we find product-market fit, after the prompt stabilizes, after the next release. Six months later, the prompt has been touched by four engineers and two product managers, the behavior is load-bearing for three customer integrations, and the team discovers that "adding evals later" means reconstructing intent from production logs they never structured for that purpose. The quarter that was supposed to be new features becomes a quarter of archaeology.

This isn't a planning mistake. It's a compounding one. The team that skipped evals to ship faster is the same team that will spend twelve weeks rebuilding eval infrastructure from incomplete traces, disagreeing about what "correct" meant in February, and quietly removing features nobody can prove still work. The cost of catching up exceeds the cost of building in — not by a little, but by a multiplier that grows with every prompt edit that shipped without a regression check.