
553 posts tagged with "ai-engineering"


The Right-Edge Accuracy Drop: Why the Last 20% of Your Context Window Is a Trap

· 11 min read
Tian Pan
Software Engineer

A 200K-token context window is not a 200K-token context window. Fill it to the brim and the model you just paid for quietly becomes a worse version of itself — not at the middle, where "lost in the middle" would predict, but at the right edge, exactly where recency bias was supposed to save you. The label on the box sold you headroom; the silicon sells you a cliff.

This is a different failure mode from the one most teams have internalized. "Lost in the middle" trained a generation of prompt engineers to stuff the critical instruction at the top and the critical question at the bottom, confident that primacy and recency would carry the signal through. That heuristic silently breaks when utilization approaches the claimed window. The drop-off is not gradual, not linear, and not symmetric with how the model behaves at half-fill. Past a utilization threshold that varies by model, you are operating in a different regime, and the prompt shape that worked at 30K fails at 180K.

The economic temptation makes it worse. If you just paid for a million-token window, the pressure to use it is enormous — dump the entire repo, feed it every support ticket, hand it the quarterly filings and let it figure out what matters. That is how you get a confidently wrong answer that looks well-reasoned on the surface and disintegrates on audit.
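One pragmatic defense is to treat utilization itself as a budget rather than a ceiling to be reached. Here is a minimal sketch in that spirit; the per-model cutoffs, the encoding choice, and the fit_context helper are illustrative assumptions, not published vendor numbers.

```python
# Sketch: cap context utilization below a model-specific threshold
# instead of filling the advertised window. Cutoffs here are invented
# placeholders; calibrate them against your own long-context evals.
import tiktoken

CLAIMED_WINDOW = {
    "gpt-4o-2024-08-06": 128_000,
    "claude-opus-4-0": 200_000,
}
UTILIZATION_CEILING = {          # hypothetical fractions of the claimed window
    "gpt-4o-2024-08-06": 0.60,
    "claude-opus-4-0": 0.70,
}

def fit_context(model: str, chunks: list[str]) -> list[str]:
    """Keep the highest-priority chunks until the utilization budget runs out."""
    enc = tiktoken.get_encoding("o200k_base")   # stand-in tokenizer
    budget = int(CLAIMED_WINDOW[model] * UTILIZATION_CEILING[model])
    kept, used = [], 0
    for chunk in chunks:                        # chunks pre-sorted by relevance
        n = len(enc.encode(chunk))
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept
```

The point is not the exact numbers; it is that the cutoff is an explicit, evaled constant instead of whatever the repo dump happened to total.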

The Rubber-Stamp Collapse: Why AI-Authored PRs Are Hollowing Out Code Review

· 10 min read
Tian Pan
Software Engineer

A senior engineer approves a 400-line PR in four minutes. The diff is clean. Names are sensible. Tests pass. Two weeks later the on-call engineer is paging through a query that returns the right shape of rows but from the wrong column — user.updated_at where user.created_at was meant — and the cohort analysis dashboard has been quietly lying to the CFO for nine days. The reviewer was competent. The code was well-structured. The bug was invisible in the diff because it wasn't a syntactic smell. It was a semantic one, and the reviewer had nothing to anchor against because no one had written down what the change was supposed to do.

This is the failure mode that shows up once the majority of diffs in your repo start life as model output. Reviewers stop asking "is this correct?" and start asking "does this look like code?" The answer is almost always yes. AI-authored code is grammatically fluent in a way that bypasses the review heuristics engineers spent a decade sharpening on human-written slop.
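A hypothetical reconstruction makes the point. Nothing below is a syntax smell; the model, fields, and cohort function are invented for illustration, and the one-token bug survives any review that lacks a statement of intent.

```python
# Both fields are timestamps, the row shape is identical, and the diff
# reads as a plausible date filter. Only intent separates right from wrong.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    id: int
    created_at: datetime   # when the account was opened
    updated_at: datetime   # last profile edit

def signup_cohort(users: list[User], start: datetime) -> list[User]:
    # Bug: filtering on updated_at means any profile edit re-enrolls the
    # user, so the "signup" cohort quietly absorbs long-tenured accounts.
    return [u for u in users if u.updated_at >= start]   # meant: created_at
```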

Semantic Cache Is a Safety Problem, Not a Perf Win

· 12 min read
Tian Pan
Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative — there's a "70% bill reduction" deck floating around every AI engineering Slack — and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that "I can't log into my account, my email is [email protected]" and "I can't log into my account, my email is [email protected]" both vectorized within threshold of "I can't log into my account" and the cache cheerfully served Bob the response originally generated for Jane, including the password reset link she had requested.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."
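To preview the "by construction" part: similarity search should only ever narrow matches inside a namespace that was already matched exactly. A minimal sketch; the cache.search call and its parameters are assumptions standing in for whatever vector store you run.

```python
# Exact-match fields (tenant, user, prompt version) form the namespace;
# the embedding is only compared against entries inside it. Two users can
# land within 0.03 cosine of each other and still never share a hit.
import hashlib

def cache_namespace(tenant_id: str, user_id: str, prompt_version: str) -> str:
    raw = f"{tenant_id}:{user_id}:{prompt_version}"
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_response(cache, tenant_id, user_id, prompt_version,
                    query_embedding, threshold=0.97):
    namespace = cache_namespace(tenant_id, user_id, prompt_version)
    # `cache.search` is a placeholder for your vector store's scoped query.
    hits = cache.search(namespace=namespace,
                        vector=query_embedding,
                        min_score=threshold)
    return hits[0] if hits else None
```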

The Ship-and-Pin Trap: How Model Version Stability Becomes Deprecation Debt

· 9 min read
Tian Pan
Software Engineer

Pinning a model version in production feels like engineering discipline. You lock claude-opus-4-0 or gpt-4o-2024-08-06 into config, write a note in the README about why, and move on to shipping features. The output distribution stops shifting under you, the evals stay green, and the prompt tuning you did last quarter keeps working. What you've actually done is start a silent timer. Twelve to fifteen months later the deprecation email arrives, and three sprints' worth of undocumented behavioral dependencies — prompt tuning, eval calibration, output shape assumptions, temperature quirks — all come due at once.

This is the ship-and-pin trap. Pinning is correct in the short term and catastrophic in the long term, because the cost of stability compounds in places you aren't looking. The prompt that was "good enough" a year ago is now load-bearing in ways nobody documented. The JSON schema your downstream service expects was shaped to one model's tokenization habits. The few-shot examples you hand-tuned were tuned against a specific model's notion of helpfulness. When the provider retires the version string, none of these dependencies migrate automatically, and the work to re-qualify them always lands under deadline pressure.
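A cheap hedge is to make the pin carry its own expiry, so the timer is yours instead of the provider's. A minimal sketch, assuming the pin lives in config you control; the date and model string are examples, not a provider's published schedule.

```python
from datetime import date

PINNED_MODEL = "gpt-4o-2024-08-06"
REQUALIFY_BY = date(2025, 8, 1)   # assumed deadline, recorded at pin time

def model_for_request(today: date | None = None) -> str:
    """Return the pinned model, but fail loudly once the pin goes stale."""
    today = today or date.today()
    if today >= REQUALIFY_BY:
        # Trip in CI and staging months before the deprecation email does.
        raise RuntimeError(
            f"{PINNED_MODEL} passed its re-qualification date; "
            "re-run evals against the successor model before shipping."
        )
    return PINNED_MODEL
```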

The Synthetic Preference Trap: How AI-Ranked RLHF Quietly Drifts Your Model Into the Teacher's Voice

· 12 min read
Tian Pan
Software Engineer

The first sign is almost always the same: your internal eval dashboard is green, reward-model scores are climbing, DPO loss is trending down — and a customer on a Zoom call shrugs and says "it sounds like ChatGPT now." No one on the training team wants to hear that. The evals say the model is better. The annotators who shipped the last batch of preferences say the model is better. But the user is telling you the truth, and the dashboard is lying. What broke is not any single label. What broke is that your preference data is no longer yours.

This is the synthetic preference trap. Label budgets get squeezed, someone proposes using a stronger model to rank a second model's completions, the experiment ships, and for a while it looks like a free lunch. The student model learns to sound more like the teacher on every turn, and because your reward model was trained on data the teacher also influenced, your reward model cheerfully agrees. The user sees a product that reads exactly like every other product built on top of the same frontier API. The differentiation you thought you were buying with fine-tuning has been quietly distilled away.
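The trap is at least measurable if every preference pair records who ranked it. A minimal sketch with invented field names; the 50% budget is an arbitrary example, not a recommendation.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    ranker: str   # "human" or a teacher model id, e.g. "gpt-4o"

def synthetic_fraction(pairs: list[PreferencePair]) -> float:
    if not pairs:
        return 0.0
    return sum(p.ranker != "human" for p in pairs) / len(pairs)

def check_label_budget(pairs: list[PreferencePair], max_synthetic: float = 0.5):
    frac = synthetic_fraction(pairs)
    if frac > max_synthetic:
        raise ValueError(
            f"{frac:.0%} of preferences were ranked by a teacher model; "
            "the reward model is being trained toward the teacher's voice."
        )
```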

Your Tool Descriptions Are Prompts, Not API Docs

· 10 min read
Tian Pan
Software Engineer

The tool description is not documentation. It is the prompt the model reads, every single turn, to decide whether this tool fires and how. You are not writing for the developer integrating against the tool — the developer already has the schema, the types, the examples in the PR. You are writing for a stochastic reader that has never seen this codebase, is holding twenty other tool descriptions in the same context window, and has to pick one in the next forward pass.

Most teams don't write for that reader. They paste the OpenAPI summary into the description field, stick the JSON Schema under it, and ship. Then the agent undercalls the tool, confidently calls the wrong adjacent tool, or fires the right tool with parameters that were "obviously" wrong to any human reading the schema. The team blames the model. The model was reading exactly what you wrote.
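To make the contrast concrete, here are two registrations of the same hypothetical tool in the OpenAI function-calling shape. The schema is identical; the only change is the description the model actually reads.

```python
# Shared JSON Schema: the developer-facing contract is the same either way.
params = {
    "type": "object",
    "properties": {"invoice_id": {"type": "string"}},
    "required": ["invoice_id"],
}

# Version 1: the OpenAPI summary, pasted. True, and useless for tool choice.
pasted_from_openapi = {
    "name": "get_invoice",
    "description": "Returns the invoice object for a given invoice ID.",
    "parameters": params,
}

# Version 2: written for the stochastic reader choosing among twenty tools.
written_for_the_model = {
    "name": "get_invoice",
    "description": (
        "Fetch one invoice when the user references a specific invoice, "
        "order, or billing charge. Use only if you already have the "
        "invoice_id; if the user gave an email or a date range, call "
        "search_invoices first. Amounts are returned in cents, not dollars."
    ),
    "parameters": params,
}
```

The second description answers the three questions the model is actually deciding: when to fire, when not to, and what the output means.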

The Validator Trap: How Post-Hoc Guards Rot Your Prompt From the Inside

· 9 min read
Tian Pan
Software Engineer

The first time a validator catches a bad LLM output, it feels like a win. The second time, you tweak the prompt to make the failure less likely. By the twentieth time, nobody on the team can explain why three paragraphs of the prompt exist — they are scar tissue from incidents long forgotten, and the model is spending more tokens reading warnings than reasoning about the actual task.

This is the validator trap. Every post-hoc guard you add — a JSON schema check, a regex, a content classifier, a second LLM-as-judge — exerts feedback pressure on the upstream prompt. The prompt grows defensive instructions to appease the guard, the guard in turn catches a new class of failure, and you add more instructions. Each iteration looks local and sensible. In aggregate, the system gets slower, more expensive, and measurably worse at the task you originally designed it for.
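One way out is to keep the guard's feedback in the transcript and the logs rather than letting it calcify into the system prompt. A minimal sketch, where call_model and the JSON-only contract are assumptions for illustration:

```python
import json
import logging

log = logging.getLogger("guards")

def guarded_call(call_model, prompt: str, max_retries: int = 1) -> dict:
    """Validate post-hoc; feed failures back as one-off turns, not prompt scar tissue."""
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return json.loads(raw)   # the post-hoc guard
        except json.JSONDecodeError as err:
            # The failure is logged and counted per prompt version, so the
            # pressure shows up in a dashboard instead of paragraph twelve.
            log.warning("guard_fail attempt=%d err=%s", attempt, err)
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"That was not valid JSON ({err}). "
                           "Return only the JSON object.",
            })
    raise ValueError("output failed validation after retries")
```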

The AI Changelog Problem: Why Your Prompt Updates Are Breaking Other Teams

· 11 min read
Tian Pan
Software Engineer

A platform team ships a one-line tweak to the system prompt of their summarization service. No code review, no migration guide, no version bump — it's "just a prompt." Two weeks later, the legal product team finds out their compliance auto-redaction has been silently letting names through. The investigation eats a sprint. The fix is trivial. The damage is the trust.

This is the AI changelog problem in miniature. Behavior is now a first-class output of your system, and behavior changes when prompts, models, retrievers, or tool schemas change — none of which show up in a git diff of the consuming application. Teams that treat AI updates like backend deploys, where a Slack message in #releases is enough, end up reinventing the worst parts of the early-2010s "we'll just push and tell QA later" workflow.
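A lightweight fix is to give behavior a version of its own, one that changes when any behavioral input changes. A sketch of such a manifest; the fields and naming are illustrative, not a standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class BehaviorManifest:
    prompt_version: str      # e.g. "summarizer/2024-11-03"
    model: str               # exact pinned model string
    retriever_index: str     # embedding index snapshot id
    tool_schema_hash: str    # hash of the tool definitions

    def fingerprint(self) -> str:
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]
```

A one-line prompt tweak now produces a new fingerprint, which is what gets announced, diffed, and pinned by the downstream teams whose redaction logic depends on it.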

Design Your Agent State Machine Before You Write a Single Prompt

· 10 min read
Tian Pan
Software Engineer

Most engineers building their first LLM agent follow the same sequence: write a system prompt, add a loop that calls the model, sprinkle in some tool-calling logic, and watch it work on a simple test case. Six weeks later, the agent is an incomprehensible tangle of nested conditionals, prompt fragments pasted inside f-strings, and retry logic scattered across three files. Adding a feature requires reading the whole thing. A production bug means staring at a thousand-token context window trying to reconstruct what the model was "thinking."

This is the spaghetti agent problem, and it's nearly universal in teams that start with a prompt rather than a design. The fix isn't a better prompting technique or a different framework. It's a discipline: design your state machine before you write a single prompt.
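What that discipline looks like in code is small and boring, which is the point. A minimal sketch with invented state names, not a framework API:

```python
from enum import Enum, auto

class AgentState(Enum):
    PLAN = auto()
    CALL_TOOL = auto()
    REFLECT = auto()
    RESPOND = auto()
    FAIL = auto()

# Legal transitions are data. "Can the agent retry a tool from REFLECT?"
# becomes a dictionary lookup instead of an archaeology dig through f-strings.
TRANSITIONS = {
    AgentState.PLAN:      {AgentState.CALL_TOOL, AgentState.RESPOND},
    AgentState.CALL_TOOL: {AgentState.REFLECT, AgentState.FAIL},
    AgentState.REFLECT:   {AgentState.CALL_TOOL, AgentState.RESPOND},
    AgentState.RESPOND:   set(),
    AgentState.FAIL:      set(),
}

def transition(current: AgentState, nxt: AgentState) -> AgentState:
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {nxt.name}")
    return nxt
```

Prompts then get written per state, and every production trace reads as a path through the machine instead of a blob of tokens.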

The Attribution Gap: How to Trace a User Complaint Back to a Specific Model Decision

· 12 min read
Tian Pan
Software Engineer

A support ticket arrives: "Your AI gave me completely wrong advice about my insurance policy." You check the logs. You find a timestamp and a user ID. The actual model response is there, printed verbatim. But you have no idea which prompt version produced it, which context chunks were retrieved, whether a tool was called mid-chain, or which of the three model versions you've deployed in the past month actually handled that request. You can read the output. You cannot explain it.

This is the attribution gap — and it's the operational problem most AI teams hit six to eighteen months after they first ship a model-backed feature. The failure isn't in the model or the prompt; it's in the observability infrastructure. Traditional logging captures request-response pairs. LLM pipelines are not request-response pairs. They're decision trees: context retrieval, prompt assembly, optional tool calls, model inference, post-processing, conditional branching. When something goes wrong, you need the full tree, not just the leaf.
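Closing the gap starts with a record shaped like the tree. A minimal sketch; the field names are assumptions about what your pipeline would capture at the moment each decision is made:

```python
from dataclasses import dataclass, field

@dataclass
class LLMTrace:
    request_id: str
    user_id: str
    prompt_version: str                 # which template assembled the prompt
    model_version: str                  # exact model string that served this call
    retrieved_chunks: list[str] = field(default_factory=list)  # chunk ids, in order
    tool_calls: list[dict] = field(default_factory=list)       # name, args, result
    final_output: str = ""

# With one of these per request, "which of the three deployed model versions
# handled that ticket?" is a field read, not a forensic reconstruction.
```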

AI Code Review in Practice: What Automated PR Analysis Actually Catches and Consistently Misses

· 9 min read
Tian Pan
Software Engineer

Forty-seven percent of professional developers now use AI code review tools—up from 22% two years ago. Yet in the same period, AI-coauthored PRs have accumulated 1.7 times more post-merge bugs than human-written code, and change failure rates across the industry have climbed 30%. Something is wrong with how teams are deploying these tools, and the problem isn't the tools themselves.

The core issue is that engineers adopted AI review without understanding its capability profile. These systems operate at a 50–60% effectiveness ceiling on realistic codebases, excel at a narrow class of surface-level problems, and fail silently on exactly the errors that cause production incidents. Teams that treat AI review as a general-purpose quality gate get false confidence instead of actual coverage.

Why AI Feature Flags Are Not Regular Feature Flags

· 11 min read
Tian Pan
Software Engineer

Your canary deployment worked perfectly. Error rates stayed flat. Latency didn't spike. The dashboard showed green across the board. You rolled the new model out to 100% of traffic — and three weeks later your support queue filled up with users complaining that the AI "felt off" and "stopped being helpful."

This is the core problem with applying traditional feature flag mechanics to AI systems. A model can be degraded without being broken. It returns 200s, generates tokens at normal speed, and produces text that passes superficial validation — while simultaneously hallucinating more often, drifting toward terse or evasive answers, or regressing on the subtle reasoning patterns your users actually depend on. The telemetry you've been monitoring for years was never designed to catch this kind of failure.