553 posts tagged with "ai-engineering"

LLM Model Routing Is Market Segmentation Disguised As A Cost Optimization

10 min read
Tian Pan
Software Engineer

The cost dashboard makes the case for itself. Sixty percent of traffic is "easy," a quick eval shows the smaller model lands within a couple of points on the global accuracy metric, and the routing layer ships behind a feature flag the same week. The graph bends. Finance is happy. The team moves on.

What nobody tracks is that the customer who hit the cheap path on Tuesday afternoon and the expensive path on Wednesday morning is now using two different products. The two models fail differently. They format differently. They refuse different things. They handle ambiguity, follow-up questions, and partial inputs with different defaults. From the customer's seat, the assistant developed amnesia overnight and nobody can tell them why — because internally, the change was filed as a finops win, not a product release.
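
To make the mechanism concrete: the routing layer is often little more than a threshold on a difficulty classifier. Here is a minimal sketch with hypothetical model names and scores; note what the returned decision does not carry.

```python
from dataclasses import dataclass

CHEAP_MODEL = "small-model-v1"       # hypothetical
EXPENSIVE_MODEL = "large-model-v1"   # hypothetical

@dataclass
class RouteDecision:
    model: str
    difficulty: float   # classifier score in [0, 1]

def route(difficulty_score: float, threshold: float = 0.4) -> RouteDecision:
    """Send "easy" traffic to the cheap model, the rest to the expensive one.

    Note what the decision does NOT carry: any record that the same customer
    may get a different model, with different failure modes, tomorrow.
    """
    model = CHEAP_MODEL if difficulty_score < threshold else EXPENSIVE_MODEL
    return RouteDecision(model=model, difficulty=difficulty_score)
```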

Multilingual Eval Cost Amplification: Why Seven Locales Doesn't Cost 7×

14 min read
Tian Pan
Software Engineer

The financial planning spreadsheet for the international launch had a clean line item: "extend eval coverage to seven new locales — assume 7× current eval cost." The English eval suite took two weeks and $40K to build, so seven locales would be $280K and a quarter of engineering time. The CFO signed it. The VP of Product signed it. The launch shipped.

Six months later the actual eval bill had crossed $310K and the team was still standing up the last two locales. The labeling vendor had churned through three replacements for the Portuguese-Brazilian pool because the first two kept producing inter-rater agreement scores an honest review would call random. The German judge model was scoring 6% lower than the English one on the same content — the team initially read this as a German model regression until a manual audit revealed the judge itself was the regression. And the eval lead was spending forty percent of their week on a question nobody had budgeted: how do we know when locale A's pass rate is actually worse than locale B's, versus when our cross-locale measurement is just noisier than the gap?
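
That last question has a standard, if unglamorous, starting point: treat each locale's pass rate as a proportion and ask whether the gap clears sampling noise before anyone declares a regression. A minimal sketch using a plain two-proportion z-test; the counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def pass_rate_gap_significant(pass_a: int, n_a: int,
                              pass_b: int, n_b: int,
                              alpha: float = 0.05) -> bool:
    """Two-proportion z-test: is the locale gap larger than sampling noise?"""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

# 78% vs 82% on 400 eval items per locale looks like a regression on a
# dashboard, but is not distinguishable from noise (p ≈ 0.16):
print(pass_rate_gap_significant(312, 400, 328, 400))  # False
# Caveat: the test assumes independent labels; a drifting judge model
# (like the German one above) violates that, so true uncertainty is larger.
```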

Your On-Call Rotation Needs an AI-Literacy Prerequisite Before It Pages Anyone at 2am

12 min read
Tian Pan
Software Engineer

A platform engineer with eight years of incident-response experience opens a 2am page that says "AI assistant degraded — error rate 12%." She checks the model latency dashboard: green. She checks the model API status page: green. She checks the deploy log: nothing shipped in the last 72 hours. She does what any competent on-call does next — she pages the AI team. The AI engineer wakes up, opens the trace dashboard the platform engineer didn't know existed, sees that a single retrieval tool has been timing out for the last four hours because a downstream search index lost a replica, and resolves the incident in eleven minutes. The AI engineer goes back to bed at 3:14am. The retrospective the next morning records "AI feature outage, resolved by AI team." Nobody writes down the actual lesson, which is that the on-call engineer could have triaged this in five minutes if she had ever been taught what an AI feature's failure surface looks like.

This is the rotation tax that AI features quietly impose on every engineering org I've worked with in the last two years. The shared on-call rotation that worked beautifully for a stack of stateless services and a few databases breaks down the moment one of those "services" is an LLM-backed feature. The on-call playbook your SRE team built across a decade of post-mortems is calibrated for a world where "something is broken" decomposes into CPU, memory, network, deploys, and dependency timeouts. AI features add three more axes — the model, the prompt, the retrieval pipeline — and four more shapes of failure that don't show up on the dashboards your on-call was trained to read.
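
The five-minute triage she was never taught is mostly attribution: decompose the feature's error rate by the first failing stage before paging anyone. A sketch, assuming a hypothetical trace schema with per-stage success flags:

```python
from collections import Counter

def triage(traces: list[dict]) -> dict[str, float]:
    """Attribute each failed trace to its first failing stage."""
    failures: Counter[str] = Counter()
    for trace in traces:
        # hypothetical stage shape: {"name": "search_index", "ok": False}
        for stage in trace.get("stages", []):
            if not stage["ok"]:
                failures[stage["name"]] += 1
                break
    return {name: count / len(traces) for name, count in failures.most_common()}

# A 12% error rate that decomposes to 11.8% "search_index" failures is a
# retrieval incident, not a model incident: a five-minute read, not a 2am page.
```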

On-Device AI Needs a Fleet Manager, Not a Model Card

12 min read
Tian Pan
Software Engineer

The on-device AI demo that shipped last quarter ran a single 4-bit Llama variant, ran it on a single test phone, and ran it well. Six months later, the same feature has a one-star tail of reviews complaining about heat, battery drain, or — worse — silent quality degradation that users only notice as "the AI got dumber on my old phone." The model didn't change. The fleet did. And the team that thought it was shipping a model has discovered, late, that it was actually shipping a fleet.

This is the gap that sinks most on-device AI launches: the strategy is built around picking the model, when the actual hard problem is delivering the right model to each device class, observing whether it's working, and rolling it back when it isn't. The discipline that closes that gap looks far more like CDN operations than like ML research — manifest-driven delivery, per-cohort telemetry, decoupled rollout channels, and a model-variant pipeline that produces N quantization tiers from one trained checkpoint. Most teams don't have any of that. They have a model card and a build artifact.
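
A minimal sketch of what manifest-driven delivery means in practice; the cohort names, quantization tiers, and URLs are hypothetical placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelVariant:
    checkpoint: str    # one trained checkpoint...
    quant: str         # ...many quantization tiers produced from it
    min_ram_gb: int
    url: str

# The cohort-to-variant mapping lives in data, not in the app binary, so a
# bad rollout is reversed by editing the manifest, not by shipping a release.
MANIFEST = {
    "flagship-2024": ModelVariant("assistant-v3", "q8", 12, "https://cdn.example.com/v3-q8.bin"),
    "midrange-2023": ModelVariant("assistant-v3", "q4", 8, "https://cdn.example.com/v3-q4.bin"),
    "legacy": ModelVariant("assistant-v2", "q4", 4, "https://cdn.example.com/v2-q4.bin"),
}

def resolve(cohort: str) -> ModelVariant:
    # Unknown or underpowered cohorts fall back to the most conservative tier.
    return MANIFEST.get(cohort, MANIFEST["legacy"])
```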

Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change

9 min read
Tian Pan
Software Engineer

A four-word edit on a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — once produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."

This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.
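
To see why the parser category is the most brittle, compare a regex pinned to the model's habitual preamble with one anchored on structure. The preamble string here is illustrative:

```python
import json
import re

# Pinned, without anyone noticing, to the model's habitual preamble under
# the old "output strictly valid JSON" instruction:
BRITTLE = re.compile(r"Here is the strictly valid JSON:\s*(\{.*\})", re.DOTALL)

# Tolerant alternative: anchor on structure, not on wording the prompt
# author never promised to keep.
TOLERANT = re.compile(r"\{.*\}", re.DOTALL)

def parse(response: str) -> dict:
    match = TOLERANT.search(response)
    if match is None:
        raise ValueError("no JSON object found in model response")
    return json.loads(match.group(0))
```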

A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.

Prompt Linting Is the Missing Layer Between Eval and Production

11 min read
Tian Pan
Software Engineer

The incident report read like a unit-test horror story. A prompt edit removed a five-line safety clause as part of a "preamble cleanup." Every eval in the suite passed. Every judge score held within tolerance. Two weeks later, a customer-facing assistant produced a response that should have been refused, the kind that triggers a Trust & Safety page at 11pm. The post-mortem traced the regression to a single deletion in a PR that nobody had flagged because the suite that was supposed to catch regressions had no opinion on whether the safety clause was present — it only had opinions on whether the model behaved well in the cases the suite remembered to ask about.

This is the gap between behavioral evals and structural correctness. Evals measure what the model produces; they do not measure what the prompt is. And prompts, like code, have a structural layer that exists independently of behavior — sections that must be present, references that must resolve, variables that must interpolate, length budgets that must hold, deprecated identifiers that must not appear. When that structural layer breaks, the behavior often stays green for a while, until the right edge case in production surfaces the failure as an incident.
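
A structural linter for that layer fits in a CI check. A minimal sketch; the required sections, deprecated identifiers, and length budget are hypothetical stand-ins for whatever your prompts actually promise:

```python
import re

REQUIRED_SECTIONS = ["## Role", "## Safety", "## Output format"]   # hypothetical
DEPRECATED_IDENTIFIERS = ["legacy_ticket_schema"]                  # hypothetical
MAX_CHARS = 6000   # crude stand-in for a real token budget

def lint(prompt: str, known_variables: set[str]) -> list[str]:
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in prompt:
            errors.append(f"missing required section: {section}")
    for ref in re.findall(r"\{(\w+)\}", prompt):
        if ref not in known_variables:
            errors.append(f"unresolved template variable: {{{ref}}}")
    for ident in DEPRECATED_IDENTIFIERS:
        if ident in prompt:
            errors.append(f"deprecated identifier present: {ident}")
    if len(prompt) > MAX_CHARS:
        errors.append(f"length budget exceeded: {len(prompt)} > {MAX_CHARS}")
    return errors
```

The deleted safety clause from the incident above fails this check on the PR itself, while every behavioral eval is still green.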

The Customer Record Hiding in Your Few-Shot Prompt Template

11 min read
Tian Pan
Software Engineer

The privacy auditor's question came two days before the SOC 2 renewal: "Why is the email field in your onboarding prompt's example a real customer address?" The product team rebuilt the chain in their heads. A year earlier, when they shipped the AI summarizer, someone needed a "see how this works" example for the few-shot template. They picked a representative customer record from staging, scrubbed the obvious fields — name, account ID, phone — and committed the file. The customer churned six months later. Their record was deleted from the database per the data retention policy. Their record was not deleted from the prompt template, which had been shipped to every tenant in production.

The team had assumed, like most teams, that the privacy boundary was the database. The prompt template was code. Code goes through review. Review doesn't flag PII because reviewers aren't looking for it in YAML strings labeled example_input:. The DLP scanner that catches PII in Slack messages and email attachments doesn't scan committed code, and even if it did, it wouldn't recognize a partially-scrubbed customer record as personal data because the fields it knew to look for had been removed. Everything that remained — the company size, the industry, the rare job title, the specific city — was data the scanner had no rule for.
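
One control that would have caught this refuses to trust pattern-matching at all: every example value committed to a prompt template must declare a synthetic or public provenance, or the build fails. A sketch assuming PyYAML and a hypothetical provenance-field convention:

```python
import sys
import yaml  # assumes PyYAML is available

ALLOWED_PROVENANCE = {"synthetic", "public-docs"}

def check_template(path: str) -> list[str]:
    with open(path) as f:
        doc = yaml.safe_load(f)
    problems = []
    for example in (doc or {}).get("examples", []):
        if example.get("provenance") not in ALLOWED_PROVENANCE:
            problems.append(f"{path}: example value lacks synthetic/public provenance")
    return problems

if __name__ == "__main__":
    # Run in CI against every committed prompt template file.
    failures = [p for path in sys.argv[1:] for p in check_template(path)]
    print("\n".join(failures))
    sys.exit(1 if failures else 0)
```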

The Refusal Latency Tax: Why Layered Guardrails Eat Your p95 Budget

10 min read
Tian Pan
Software Engineer

A team I talked to recently built what they called a "defense in depth" pipeline for their AI assistant. An input classifier checked for prompt injection. A jailbreak filter scanned for adversarial patterns. The model generated a response. An output moderation pass scanned the result. A refusal detector checked whether the model had punted, and if so, a reformulation step re-asked the question with a softer framing. The eval suite said the prompt produced answers in 1.4 seconds. Real users were waiting 3.8 seconds at the median and over 9 seconds at the p95.

Every safety layer is a round trip. Every round trip has a network hop, a queue time, a model load, and a decode. When you stack them serially in front of and behind the generative call, the latency budget you priced your product on dissolves — and almost no one accounted for it during design review. Worse: the slowest, most expensive path through your pipeline is the one that triggers on safety-adjacent prompts, which is exactly the long tail your safety story exists to handle. You are silently subsidizing that tail from the average user's bill.
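
The standard first mitigation is to stop paying for the pre-generation checks serially, since independent classifiers can share one round trip. A sketch with illustrative latencies and stubbed-out checks; the output moderation pass still serializes after decode, so this recovers only part of the budget:

```python
import asyncio

async def injection_check(prompt: str) -> bool:
    await asyncio.sleep(0.25)   # stand-in for one classifier round trip
    return True

async def jailbreak_check(prompt: str) -> bool:
    await asyncio.sleep(0.30)   # stand-in for another
    return True

async def generate(prompt: str) -> str:
    await asyncio.sleep(1.40)   # stand-in for the generative call
    return "answer"

async def guarded_generate(prompt: str) -> str:
    # Serial checks cost 0.25 + 0.30 s before decoding starts;
    # concurrent checks cost max(0.25, 0.30): one round trip, not two.
    ok_injection, ok_jailbreak = await asyncio.gather(
        injection_check(prompt), jailbreak_check(prompt)
    )
    if not (ok_injection and ok_jailbreak):
        return "refused"
    return await generate(prompt)

print(asyncio.run(guarded_generate("hello")))
```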

Retiring an AI Feature Is a Trust Event, Not a Deprecation

13 min read
Tian Pan
Software Engineer

The metrics tell you to kill it. Three percent of monthly actives. The eval refresh has slipped two cycles. The prompt has a // TODO: revisit when we move off the legacy ticket schema from a year ago. Your senior AI engineer spends a full week per month babysitting the thing — model upgrades, label drift, the one tool integration that flakes whenever the upstream API changes its date format. Every quarterly review, somebody asks why this assistant still exists, and every quarter the answer is "we haven't gotten to it yet."

So you write the deprecation memo. You copy the structure from the API sunset playbook your platform team perfected: T-minus-six-months announcement, a migration guide, a banner in the product, a webhook for partners, the usual Sunset: HTTP header. You ship it on a Tuesday. By Thursday afternoon, your CSMs are forwarding emails that don't sound like API deprecation complaints. They sound like breakup letters.

That's the moment most teams realize they took a category error to production. The thing you're retiring isn't an API. It's a relationship the user formed with something that talked back.

Retrieval Sprawl: When 'Just Add RAG' Becomes the Architectural Diversion

11 min read
Tian Pan
Software Engineer

The pattern is so familiar it's invisible. The model hallucinates a fact, so the team adds a retrieval step. Three weeks later, the model picks the wrong tool from a growing inventory, so they add a retrieval step on the tool catalog. The model's answers feel too generic, so they add a retrieval step on past good answers. A quarter passes, and the system is now a pile of retrievers gluing together a prompt that, fundamentally, still has the original problem.

What changed isn't the failure rate — it's the failure mode's name. "Model wrong" became "retrieval missed," which sounds more tractable but isn't. The eval suite scores higher because the retrieved context is, by construction, in-distribution for the test set. Production tells a different story, but by then the architecture has three retrieval layers, each with its own embedding model, index refresh cadence, and on-call rotation, and nobody wants to be the engineer who proposes ripping them out.

This is retrieval sprawl. It's an architectural diversion: a way of moving a hard problem (prompt design, model capability, ambiguous specifications) into a more comfortable problem (information retrieval engineering) without actually solving anything.

Your Review Queue Is Where the Autonomy Promise Goes to Die

10 min read
Tian Pan
Software Engineer

The AI feature ships with a clean safety story. Anything above the confidence threshold is auto-actioned. Anything below gets queued for a human. At launch, the queue is empty by 5 PM every day. Marketing puts "human-in-the-loop" on the slide. Compliance signs off. Everyone goes home.

Six months later the feature has 10x'd. The review team didn't. The queue carries a 72-hour backlog. An item that requires "human review" sits unread for three days, then gets approved by a tired reviewer who is averaging eleven seconds per decision because that is what it takes to keep the queue from doubling overnight. The product still says "every action is reviewed." The reality is that "human-in-the-loop" has degraded into "human in the queue eventually" — which is functionally autonomous operation with a paperwork lag.

The safety story didn't break with a bug. It broke with a staffing plan that nobody owned.
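
The staffing plan is recoverable arithmetic, which is exactly why someone should have owned it. All numbers below are illustrative, chosen to land near the scenario above:

```python
items_per_day = 400 * 10           # the feature 10x'd; the review team didn't
reviewer_seconds = 2 * 6 * 3600    # two reviewers, six focused hours each

print(reviewer_seconds / items_per_day)   # 10.8: the "eleven seconds per decision"

# At the careful two-minutes-per-item pace the safety story implicitly assumed:
daily_capacity = reviewer_seconds / 120   # 360 items reviewed properly per day
print(items_per_day - daily_capacity)     # 3640.0 items of silent backlog growth per day
```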

The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation

12 min read
Tian Pan
Software Engineer

The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.

Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.

The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.
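
The cheapest correction is to stop calibrating on one hour: run the same eval prompts at several wall-clock times and compare the distributions before trusting the green dashboard. A sketch, where the per-hour latency lists come from whatever harness you already run:

```python
import statistics

def summarize(latencies_ms: list[float]) -> dict[str, float]:
    latencies_ms = sorted(latencies_ms)
    return {
        "p50": statistics.median(latencies_ms),
        "p95": latencies_ms[int(0.95 * len(latencies_ms))],
    }

def diurnal_report(results_by_hour: dict[int, list[float]]) -> None:
    baseline = summarize(results_by_hour[2])  # the 2 AM run everyone trusts
    for hour in sorted(results_by_hour):
        s = summarize(results_by_hour[hour])
        print(f"{hour:02d}:00  p50={s['p50']:.0f}ms  p95={s['p95']:.0f}ms  "
              f"p95 vs 2AM: {s['p95'] / baseline['p95']:.1f}x")
```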