Skip to main content

678 posts tagged with "ai-engineering"

View all tags

MCP Tool Deprecation: Why the Model Still Calls the Old Name

· 9 min read
Tian Pan
Software Engineer

You renamed get_user_email to lookup_contact six weeks ago. The new name shipped, the old handler was removed, the changelog noted it, and your eval set passed. Then last Tuesday a customer support engineer pinged you: an agent had returned an error on roughly three percent of its tool calls during the previous week — tool_not_found: get_user_email. The renamed-away name. The one nothing in the live system advertises anymore.

The prior is sticky. The model your agent is talking to was trained on a corpus where get_user_email was overwhelmingly the canonical way to ask "what is this person's email." Even when the tools array you pass at inference time lists only lookup_contact, the model occasionally — under certain context conditions, especially long traces or recovery-after-error states — falls back to the name it remembers. A hard cutover doesn't eliminate the long tail; it just turns soft failures into hard ones.

Mobile App Store Review Meets AI Features: The Deploy Cadence Collision

· 9 min read
Tian Pan
Software Engineer

A prompt regression lands in production at 9 AM. On the web app, an engineer rolls back the system prompt by lunch and the trace logs go quiet. On iOS, the same regression sits in the binary the App Store reviewed three weeks ago — and the team now has to choose between a server-side prompt swap that voids the store's review of the actual user-facing behavior, or an expedited review that costs 24-48 hours plus a soft favor with the platform team. Neither option is on the runbook.

This is the deploy cadence collision: web AI features iterate on the team's clock, mobile AI features iterate on the platform's clock, and most release trains were laid down before anyone thought to ask whether the prompt belongs on the same train as the binary. The result is a quietly accumulating tax — review delays, asymmetric rollback latency, undisclosed AI surfaces that fail privacy review on resubmit, and an entire class of AI bugs that mobile engineers fix at one-tenth the speed their web colleagues do.

On-Call at 3am for an AI Feature That Didn't 500

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 3:02 AM. You squint at your phone expecting the usual: a database failover, a CDN edge that wandered off, a 500 spike from a service nobody touched in eight months. Instead the alert reads: summarizer.eval-on-traffic.helpfulness rolling-1h: 4.21 → 4.05 (Δ -0.16). No HTTP error. No latency spike. No service is down. Every request the system served in the last hour returned a 200 with a body that parsed cleanly. And yet something is unmistakably worse than it was at midnight, and the rotation expects you to figure out what.

This is the on-call shift the standard runbook wasn't written for. The thing that broke didn't break — it regressed. The error budget you've been tracking for years is denominated in availability and latency, and the failure mode that paged you isn't visible in either. The page is real, the customer impact is real, and your usual diagnostic loop — check the deploy log, check the dependency graph, find the bad release, roll it back — runs into a wall the moment you realize that "the bad release" might be a 30-line system-prompt diff that landed at 4 PM yesterday and looked completely innocuous in code review.

The Prompt as Documentation: When the System Prompt Becomes the Only Artifact Anyone Trusts

· 10 min read
Tian Pan
Software Engineer

A product manager pings you in Slack asking what happens when a customer asks the assistant to cancel their subscription. You start typing the answer from memory, then second-guess yourself, then open the system prompt and read it for thirty seconds. You paste back a summary. They thank you and move on. Three hours later, support asks the same question. By Thursday, the head of partnerships pastes a screenshot of the prompt into a deal review.

This is the prompt-as-documentation anti-pattern, and the first time you notice it happening, it feels great. The artifact you spent six weeks tuning is now the canonical source of truth for what the product does. PMs are reading it. Support is reading it. Sales is reading it. Somewhere a designer is reading it. Your work is load-bearing in a way the old service-layer code never was, and you can prove it by counting the number of unrelated people who can pull the file from memory.

The Prompt Author Identity Problem: Three Roles Editing the Same File

· 13 min read
Tian Pan
Software Engineer

Pull up the git blame on any year-old production system prompt and you will find something the engineering team is not ready to admit: the file has three authors, none of whom share a definition of what a "change" is. The engineer who refactored the instruction blocks last month logged the commit as "no functional change, just reordering for clarity." The product manager who reads the file once a quarter would describe the same diff as "you rewrote the voice — customers will notice." The ML engineer running the regression suite would call it "you broke few-shot example three, and the eval has been red ever since."

All three are right. The prompt is simultaneously code, spec, and hyperparameter, and every team that ships an AI feature long enough discovers that the file's commit history is a slow-motion three-way authorship dispute that CODEOWNERS does not capture and the diff viewer does not surface.

Quarterly Model Migration: Make It a Calendar Event, Not a Fire Drill

· 11 min read
Tian Pan
Software Engineer

The deprecation email arrives on a Tuesday afternoon. The model your billing pipeline has depended on for fourteen months is now on a sixty-day timer. The prompt was tuned by an engineer who left in March. The eval suite hasn't been re-baselined since launch. The customer-success team is asking why "the AI feels different" on two enterprise accounts. Nobody put this on the roadmap, and nobody will own it cleanly, because in your org's mental model this is a one-off project — even though it is the fourth one this year.

Every team running an AI feature in production runs into the same realization within eighteen months: the foundation-model provider is operating on a deprecation cadence that the team did not plan for, and the team's migration response keeps being a reactive scramble triggered by a notification email. The fix is not a better playbook for the next migration — there are already plenty of those, and your team has probably written one. The fix is to stop treating migration as a project and start treating it as a recurring operational primitive. Put it on the calendar.

RAG Against a Phantom Inventory: When Your Corpus Describes Features Your Product Removed

· 11 min read
Tian Pan
Software Engineer

A customer asks your support agent how to do something. The agent retrieves three documentation chunks with high relevance scores, synthesizes a confident answer, and walks the customer through a five-step procedure that ends on a button that hasn't existed for four months. The customer files a ticket. The on-call engineer pulls the eval suite, finds it green, pulls the retrieval traces, finds them green too — the model didn't hallucinate, it faithfully quoted documentation describing a feature your product team renamed in the last quarterly release.

This is the failure mode I want to name: not a hallucination, not a retrieval miss, but a phantom inventory problem. Your retrieval corpus is a snapshot of a product surface that no longer exists. The vector store doesn't know the product changed. The eval suite doesn't know either. The only system that consistently catches it is the support ticket queue, and by the time a ticket is filed the customer has already been told to click a button that isn't there.

Rater Throughput Is the Hidden Bottleneck in Your Eval Pipeline

· 10 min read
Tian Pan
Software Engineer

The team plans an eval suite the way they plan a service: failure modes inventoried, rubric drafted, sample size argued over, judge calibration scheduled. Then they file the rater capacity as a footnote — "we'll get the annotation team to grade a few hundred per week" — and ship the rest. Six weeks later the rater queue is at 4,300 items, eval velocity has collapsed to one judge-calibration cycle per month, and someone in a planning review says the quiet part out loud: nobody capacity-planned the humans.

Rater throughput is the binding constraint on eval velocity in any AI system that takes human grading seriously, and the discipline that treats annotation as an SRE problem rather than a recruiting one is the one that ships. A human reviewer processes 50–100 examples per hour at expert difficulty, and an expert annotator caps out around 500–1,000 examples per week — those numbers are not a recruiting problem to be brute-forced with headcount. They are an operational property of the eval system that has to be modeled and budgeted the way you model database IOPS.

Repeat-Question Detection: The Session-Level Blind Spot Your Per-Turn Eval Cannot See

· 11 min read
Tian Pan
Software Engineer

A user opens your chat, asks a question, and gets back a response your eval suite would score 4.6 out of 5. Then they ask the same question with different words. Same answer. Same score. They try once more, this time with the kind of hedging language people use when they suspect the machine isn't listening — "what I'm actually trying to do is…" — and then they close the tab. From the model's perspective, three clean Q&A turns. From the dashboard's perspective, an engaged session. From the user's perspective, a product that failed them three times in a row and won't be opened again.

This is the failure mode per-turn evaluation cannot see. Each individual turn looked correct in isolation. The judge gave a thumbs up. The hallucination detector stayed quiet. The relevance score was high. And yet the conversation, as a whole, did not resolve anything — and that's the unit the user was actually evaluating you on.

The Retrieval Citation Tax: Why Compliance Adds 30% to Your RAG Token Bill

· 10 min read
Tian Pan
Software Engineer

A team I talked to recently sold their legal-AI product into a Fortune 500 in-house counsel office and added one line to their system prompt: "every factual claim must include an inline citation to the retrieved source." The product roadmap allocated a 5% buffer on their token budget for the new behavior. Sixty days after the regulated tenant went live, finance flagged a 34% jump in monthly inference spend. Nobody had broken the product. Nobody had shipped new features. The compliance requirement that closed the deal also quietly rewrote the unit economics underneath it.

This is the retrieval citation tax, and almost every RAG system serving a regulated industry — legal, healthcare, finance, audit-bound enterprise — eventually pays it. The tax is structural, not a bug. It comes from the way citation discipline forces the model into a different generation regime, and it shows up nowhere on the procurement spec the customer signed.

Shadow Evals: When Private Slices Replace Your Eval Rollup

· 10 min read
Tian Pan
Software Engineer

The fastest way to discover that your AI team has no eval discipline is to ask three engineers, in separate Slack DMs, "did your last prompt change improve quality?" — and watch them answer yes, all three of them, with three different numbers, against three different slices, on three different laptops, none of which is reproducible by anyone else in the room. That isn't an evals problem in the textbook sense. The textbook says you don't have evals. The reality is worse: you have too many evals, each of them privately owned, each of them measuring something real, and none of them rolling up into a single number the org can plan against.

This is the shadow eval anti-pattern, and most AI teams ship with it for longer than they admit. It looks productive — every engineer has a notebook, every PR comes with a screenshot of a pass rate, every standup mentions a "win on the long-tail slice" — and it survives quarterly reviews because the bar for "we do evals" is so low that running anything counts. But the org has no signal. Leadership cannot tell whether last month's three prompt edits moved the product forward or sideways, because the three engineers measured against three private slices and stopped tracking the previous baseline the moment they switched files.

Stale Few-Shot Examples and the Half-Life Your Prompt Repo Ignores

· 10 min read
Tian Pan
Software Engineer

Open the system prompt of any AI feature that has been in production for more than nine months. Scroll past the role description, past the formatting rules, past the safety guardrails. Stop at the block titled <examples> or ## Examples or whatever your team called it the day someone copied the first three good Slack threads into a code block. Read them. There is a 60% chance at least one of them references a feature that has been renamed, a button that no longer exists, or a workflow the product manager quietly killed two quarters ago.

The decay is not visible from the eval dashboard. The eval scores are green. They have been green for months. They are green because the eval set was authored against the same product surface the few-shots reference, and the two have aged together in lockstep. The model is performing a flawless impression of last year's product, on a test set that grades it for being faithful to last year's product, while real users interact with this year's product and quietly tolerate the resulting confabulations. This is the half-life nobody puts in the LLMOps roadmap.