861 posts tagged with "insider"

Free-Tier Traffic Is Your Real Eval Set

· 10 min read
Tian Pan
Software Engineer

The team optimizing the model against paid-cohort traces is grading itself on the easy distribution. Paying users have a workflow. They self-selected into the product because something about it justified pulling out a credit card, which means by the time they're in the eval set, they've already learned which prompts work, which features deliver, and which corners not to wander into. Free-tier users do none of that. They're anonymous, exploratory, often adversarial, often non-native English speakers stress-testing a product in their second language, and they exercise the long tail of failure modes the eval set was never built to cover.

This is the asymmetry that quietly eats the conversion funnel of every freemium AI product. The team grades the model against a curated sample drawn disproportionately from paid traces. The free-tier weird traces — the ones with no template, the ones where someone is genuinely trying to figure out what the product does — never get labeled, never get a regression test, and never inform the next prompt edit. The model gets better against the paid distribution and slowly worse against the distribution that decides whether free users ever upgrade.
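
One way to counter the skew is to sample the eval set per tier rather than from the pooled logs. The sketch below is a minimal illustration, not the author's method; the `tier` field, the quota values, and the function name are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_eval_sample(traces, quotas, seed=0):
    """Draw a fixed number of traces per tier instead of sampling the pool uniformly.

    traces: iterable of dicts with a hypothetical "tier" key ("free" or "paid").
    quotas: dict mapping tier -> number of traces wanted from that tier.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    by_tier = defaultdict(list)
    for trace in traces:
        by_tier[trace["tier"]].append(trace)

    sample = []
    for tier, wanted in quotas.items():
        pool = by_tier.get(tier, [])
        # Take the whole pool when a tier has fewer traces than its quota.
        sample.extend(rng.sample(pool, min(wanted, len(pool))))
    return sample
```

With a pool that is 90% paid traces, a uniform draw of ten would contain about one free-tier trace on average; a quota of `{"free": 5, "paid": 5}` guarantees five.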

The Internal Eval Set Is a Privacy Boundary Nobody Reviewed

· 11 min read
Tian Pan
Software Engineer

The dataset your AI team calls "the eval set" is, in most companies shipping LLM features, a collection of real customer conversations pulled from production logs. Nobody on the team thinks of it as a privacy event. The data never left the cluster. No new system was provisioned. No vendor was added. An engineer wrote a query, exported a few thousand traces into a labeling tool, and the team started grading model outputs against them. The legal team never heard about it because, from the inside, nothing changed — the same conversations that already lived in the same database were now also being read by a few engineers and a judge model.

That is the privacy boundary nobody reviewed. Customers gave you their messages so you could answer them. They did not give you their messages so you could measure your model against them. The two uses look identical at the storage layer and feel identical at the inference layer, but they are different processing purposes under every modern privacy regime — and the gap between the two is where the next round of compliance pain is going to land.

Latency-Aware Tool Selection: When 'Good Enough Now' Beats 'Best Available Later'

· 10 min read
Tian Pan
Software Engineer

The tool description in your agent's system prompt is a six-month-old eval artifact. It says search_pricing returns "fresh inventory data with structured pricing" and the planner believes it, because nothing in the prompt has updated since the day the description was tuned. The actual search_pricing endpoint has been sitting at p95 of 11 seconds for the last forty minutes because the upstream vendor is rate-limiting your account, and the cheaper search_cache tool — which the prompt describes as "may be slightly stale" — would return the same answer in 200ms. The planner picks search_pricing anyway, because the description still reads like it did during eval, and the planner has no signal about what either tool costs to call right now.

This is the structural failure of static tool descriptions. The planner is making routing decisions on a snapshot of a world that has moved on. Tool selection isn't really a capability question — most production agents have two or three tools that overlap heavily in what they can answer — it's a cost-of-waiting question, and the cost of waiting is the thing your prompt template doesn't see.
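
A rough sketch of what routing on the cost of waiting could look like, assuming the planner (or a wrapper around it) can read a live p95 latency per tool. `ToolStatus`, `pick_tool`, and the quality scores are hypothetical names and numbers invented for this example; only the tool names and latencies come from the scenario above.

```python
from dataclasses import dataclass


@dataclass
class ToolStatus:
    name: str
    quality: float        # offline eval score in [0, 1]; hypothetical values
    p95_latency_s: float  # live p95 latency over recent calls, in seconds


def pick_tool(candidates, latency_budget_s):
    """Pick the highest-quality tool whose live p95 fits the latency budget.

    If nothing fits, fall back to the fastest tool rather than failing.
    """
    if not candidates:
        raise ValueError("no candidate tools")
    within_budget = [t for t in candidates if t.p95_latency_s <= latency_budget_s]
    if within_budget:
        return max(within_budget, key=lambda t: t.quality)
    return min(candidates, key=lambda t: t.p95_latency_s)
```

With `search_pricing` at a p95 of 11 seconds and `search_cache` at 200ms, a two-second budget routes to `search_cache`; once the vendor recovers, the same call routes back to `search_pricing` without anyone editing the prompt.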

MCP Tool Deprecation: Why the Model Still Calls the Old Name

· 9 min read
Tian Pan
Software Engineer

You renamed get_user_email to lookup_contact six weeks ago. The new name shipped, the old handler was removed, the changelog noted it, and your eval set passed. Then last Tuesday a customer support engineer pinged you: an agent had returned an error on roughly three percent of its tool calls during the previous week — tool_not_found: get_user_email. The renamed-away name. The one nothing in the live system advertises anymore.

The prior is sticky. The model your agent is talking to was trained on a corpus where get_user_email was overwhelmingly the canonical way to ask "what is this person's email." Even when the tools array you pass at inference time lists only lookup_contact, the model occasionally — under certain context conditions, especially long traces or recovery-after-error states — falls back to the name it remembers. A hard cutover doesn't eliminate the long tail; it just turns soft failures into hard ones.
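
One possible softening of the cutover is an alias layer in the tool dispatcher. This is a sketch under assumptions, not a feature of MCP itself: `DEPRECATED_ALIASES`, `resolve_tool_call`, and the `on_alias_hit` hook are hypothetical, and the code assumes the renamed tool accepts the same arguments as the old one.

```python
# Hypothetical alias table: retired tool name -> current tool name.
DEPRECATED_ALIASES = {"get_user_email": "lookup_contact"}


def resolve_tool_call(name, live_tools, aliases=DEPRECATED_ALIASES, on_alias_hit=None):
    """Map a tool name emitted by the model onto a tool that actually exists.

    live_tools: set of tool names currently advertised to the model.
    on_alias_hit: optional callback(old, new) for counting stale-name calls.
    """
    if name in live_tools:
        return name
    target = aliases.get(name)
    if target is not None and target in live_tools:
        if on_alias_hit is not None:
            on_alias_hit(name, target)  # keep the long tail visible in metrics
        return target
    raise KeyError(f"tool_not_found: {name}")
```

Counting alias hits gives a signal for when the old name has really stopped appearing, which is a safer trigger for removing the alias than a calendar date.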

Model Migration Bills You Twice: The Eval Re-Anchoring Tax Nobody Prices

· 10 min read
Tian Pan
Software Engineer

Every model upgrade gets sold to the team as a swap: a one-line config change, a measurable win on latency or cost or quality, and a few days of prompt re-tuning to absorb the new model's quirks. The procurement deck shows per-token deltas, the engineering ticket lists the rollout phases, and the FP&A team books the quarterly savings. Then the eval scores come in and nobody recognizes them. Quality is flat where it should have moved. Two judges that used to agree are now diverging by ten points. The snapshot suite is red, but the diffs look like rewordings. Somebody in standup asks the question that should have been on the migration plan from day one: what is the model actually being scored against?

This is the second bill — the eval re-anchoring tax — and it is reliably larger than the first. The human-annotated reference scores were anchored to the previous model's output distribution. The LLM-as-judge graders were calibrated against the old model's failure modes. The snapshot fixtures captured the old model's wording. The team's intuition for "good output" was trained on the old model's stylistic tells. None of that survives the swap intact.

Per-Customer Prompt Forks: Why Your Next Model Migration Is 47 Migrations

· 12 min read
Tian Pan
Software Engineer

The CTO of an AI startup I talked to last month opened her laptop and showed me a number: 47. That was the count of distinct system prompts running in production, one per enterprise customer or per logical group of them. The base prompt had been forked once in month four for a healthcare customer that needed a softer refusal posture. Then once more for a legal customer that wanted citations. Then for a financial-services customer whose compliance team had a list of forbidden phrases. None of these felt like a big deal at the time. Each was a small ask, approved in isolation, that the account team could close the deal on.

Two years later, the model provider announced the cutover deadline for the version her prompts were tuned against. Her engineering team's first instinct was to run the eval suite against the new model. The eval suite was scoped to the base prompt. The base prompt was still serving customer zero, which had no overrides, and which represented roughly 9% of revenue.

The Prompt as Documentation: When the System Prompt Becomes the Only Artifact Anyone Trusts

· 10 min read
Tian Pan
Software Engineer

A product manager pings you in Slack asking what happens when a customer asks the assistant to cancel their subscription. You start typing the answer from memory, then second-guess yourself, then open the system prompt and read it for thirty seconds. You paste back a summary. They thank you and move on. Three hours later, support asks the same question. By Thursday, the head of partnerships pastes a screenshot of the prompt into a deal review.

This is the prompt-as-documentation anti-pattern, and the first time you notice it happening, it feels great. The artifact you spent six weeks tuning is now the canonical source of truth for what the product does. PMs are reading it. Support is reading it. Sales is reading it. Somewhere a designer is reading it. Your work is load-bearing in a way the old service-layer code never was, and you can prove it by counting the number of unrelated people who can pull the file from memory.

The Prompt Graph Inside Your Agent: Cross-Prompt Regression Chains Nobody Mapped

· 11 min read
Tian Pan
Software Engineer

A senior engineer ships a four-word edit to the planner prompt — "if uncertain, ask first." The planner's own eval set, which grades whether plans are reasonable, moves up by half a point. They merge. Two weeks later, the verifier's eval shows a three-point pass-rate regression and nobody can repro it. The root cause turns out to be that the planner now asks more clarifying questions, the executor receives shorter task descriptions on the second turn, the verifier's rubric was implicitly tuned against the previous executor's longer outputs, and an edit nobody flagged as risky has shifted three downstream distributions at once.

This is what happens when you treat the prompts inside an agent as a flat folder of files instead of as a graph with edges. The prompts have owners. The edges between them have nobody.
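
A minimal sketch of making those edges explicit, assuming the planner, executor, and verifier chain from the example above. The `PROMPT_EDGES` map and `downstream_prompts` helper are hypothetical; the idea is simply that an edit to one prompt should trigger the eval suites of everything downstream of it.

```python
from collections import deque

# Hypothetical edge map: each prompt -> the prompts that consume its output.
PROMPT_EDGES = {
    "planner": ["executor"],
    "executor": ["verifier"],
    "verifier": [],
}


def downstream_prompts(edges, changed):
    """Return every prompt reachable from `changed`, in breadth-first order."""
    seen = {changed}  # never report the edited prompt itself, even in a cycle
    order = []
    queue = deque(edges.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(edges.get(node, []))
    return order
```

Here `downstream_prompts(PROMPT_EDGES, "planner")` returns `["executor", "verifier"]`, which is the list of eval suites the four-word planner edit should have re-run before merge.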

Quarterly Model Migration: Make It a Calendar Event, Not a Fire Drill

· 11 min read
Tian Pan
Software Engineer

The deprecation email arrives on a Tuesday afternoon. The model your billing pipeline has depended on for fourteen months is now on a sixty-day timer. The prompt was tuned by an engineer who left in March. The eval suite hasn't been re-baselined since launch. The customer-success team is asking why "the AI feels different" on two enterprise accounts. Nobody put this on the roadmap, and nobody will own it cleanly, because in your org's mental model this is a one-off project — even though it is the fourth one this year.

Every team running an AI feature in production runs into the same realization within eighteen months: the foundation-model provider is operating on a deprecation cadence that the team did not plan for, and the team's migration response keeps being a reactive scramble triggered by a notification email. The fix is not a better playbook for the next migration — there are already plenty of those, and your team has probably written one. The fix is to stop treating migration as a project and start treating it as a recurring operational primitive. Put it on the calendar.

Rater Throughput Is the Hidden Bottleneck in Your Eval Pipeline

· 10 min read
Tian Pan
Software Engineer

The team plans an eval suite the way they plan a service: failure modes inventoried, rubric drafted, sample size argued over, judge calibration scheduled. Then they file the rater capacity as a footnote — "we'll get the annotation team to grade a few hundred per week" — and ship the rest. Six weeks later the rater queue is at 4,300 items, eval velocity has collapsed to one judge-calibration cycle per month, and someone in a planning review says the quiet part out loud: nobody capacity-planned the humans.

Rater throughput is the binding constraint on eval velocity in any AI system that takes human grading seriously, and the discipline that treats annotation as an SRE problem rather than a recruiting one is the one that ships. A human reviewer processes 50–100 examples per hour at expert difficulty, and an expert annotator caps out around 500–1,000 examples per week — those numbers are not a recruiting problem to be brute-forced with headcount. They are an operational property of the eval system that has to be modeled and budgeted the way you model database IOPS.
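
The budgeting step can be sketched as simple queue arithmetic. The function below is an illustration under assumptions: `weeks_to_drain` is a hypothetical helper, and the weekly arrival rate in the usage note is an invented figure, since the only numbers stated above are the backlog and the per-annotator range.

```python
import math


def weeks_to_drain(backlog, arrivals_per_week, raters, per_rater_per_week):
    """Estimate weeks until the rater queue is empty, or math.inf if it never is.

    Assumes constant arrivals and constant per-rater throughput.
    """
    capacity = raters * per_rater_per_week
    net = capacity - arrivals_per_week  # items removed from the queue per week
    if net <= 0:
        return math.inf  # the queue grows or holds steady; it never drains
    return math.ceil(backlog / net)
```

For a 4,300-item backlog with an assumed 300 new items per week, one annotator at 500 per week needs 22 weeks to clear the queue, and two annotators at the same rate need 7.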

Repeat-Question Detection: The Session-Level Blind Spot Your Per-Turn Eval Cannot See

· 11 min read
Tian Pan
Software Engineer

A user opens your chat, asks a question, and gets back a response your eval suite would score 4.6 out of 5. Then they ask the same question with different words. Same answer. Same score. They try once more, this time with the kind of hedging language people use when they suspect the machine isn't listening — "what I'm actually trying to do is…" — and then they close the tab. From the model's perspective, three clean Q&A turns. From the dashboard's perspective, an engaged session. From the user's perspective, a product that failed them three times in a row and won't be opened again.

This is the failure mode per-turn evaluation cannot see. Each individual turn looked correct in isolation. The judge gave a thumbs up. The hallucination detector stayed quiet. The relevance score was high. And yet the conversation, as a whole, did not resolve anything — and that's the unit the user was actually evaluating you on.
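
A rough sketch of a session-level check, assuming access to the user's turns in order. It uses token-set Jaccard overlap as a crude stand-in for similarity; a real detector would more likely compare embeddings, because rephrasings often share few exact words. The function names and the 0.5 threshold are assumptions for illustration.

```python
import re


def _tokens(text):
    """Lowercase word and number tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def repeat_question_turns(user_turns, threshold=0.5):
    """Return indices of user turns that closely resemble an earlier user turn."""
    repeats = []
    for i in range(1, len(user_turns)):
        if any(jaccard(user_turns[i], user_turns[j]) >= threshold for j in range(i)):
            repeats.append(i)
    return repeats
```

A session where this list is non-empty and the user then leaves is a candidate for the failure described above, regardless of how each individual turn was scored.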

The Retrieval Citation Tax: Why Compliance Adds 30% to Your RAG Token Bill

· 10 min read
Tian Pan
Software Engineer

A team I talked to recently sold their legal-AI product into a Fortune 500 in-house counsel office and added one line to their system prompt: "every factual claim must include an inline citation to the retrieved source." The product roadmap allocated a 5% buffer on their token budget for the new behavior. Sixty days after the regulated tenant went live, finance flagged a 34% jump in monthly inference spend. Nobody had broken the product. Nobody had shipped new features. The compliance requirement that closed the deal also quietly rewrote the unit economics underneath it.

This is the retrieval citation tax, and almost every RAG system serving a regulated industry — legal, healthcare, finance, audit-bound enterprise — eventually pays it. The tax is structural, not a bug. It comes from the way citation discipline forces the model into a different generation regime, and it shows up nowhere on the procurement spec the customer signed.