Skip to main content

678 posts tagged with "ai-engineering"

View all tags

The Refusal Latency Tax: Why Layered Guardrails Eat Your p95 Budget

· 10 min read
Tian Pan
Software Engineer

A team I talked to recently built what they called a "defense in depth" pipeline for their AI assistant. An input classifier checked for prompt injection. A jailbreak filter scanned for adversarial patterns. The model generated a response. An output moderation pass scanned the result. A refusal detector checked whether the model had punted, and if so, a reformulation step re-asked the question with a softer framing. The eval suite said the prompt produced answers in 1.4 seconds. Real users were waiting 3.8 seconds at the median and over 9 seconds at the p95.

Every safety layer is a round trip. Every round trip has a network hop, a queue time, a model load, and a decode. When you stack them serially in front of and behind the generative call, the latency budget you priced your product on dissolves — and almost no one accounted for it during design review. Worse: the slowest, most expensive path through your pipeline is the one that triggers on safety-adjacent prompts, which is exactly the long tail your safety story exists to handle. You are silently subsidizing that tail from the average user's bill.

Retiring an AI Feature Is a Trust Event, Not a Deprecation

· 13 min read
Tian Pan
Software Engineer

The metrics tell you to kill it. Three percent of monthly actives. The eval refresh has slipped two cycles. The prompt has a // TODO: revisit when we move off the legacy ticket schema from a year ago. Your senior AI engineer spends a full week per month babysitting the thing — model upgrades, label drift, the one tool integration that flakes whenever the upstream API changes its date format. Every quarterly review, somebody asks why this assistant still exists, and every quarter the answer is "we haven't gotten to it yet."

So you write the deprecation memo. You copy the structure from the API sunset playbook your platform team perfected: T-minus-six-months announcement, a migration guide, a banner in the product, a webhook for partners, the usual Sunset: HTTP header. You ship it on a Tuesday. By Thursday afternoon, your CSMs are forwarding emails that don't sound like API deprecation complaints. They sound like breakup letters.

That's the moment most teams realize they took a category error to production. The thing you're retiring isn't an API. It's a relationship the user formed with something that talked back.

Retrieval Sprawl: When 'Just Add RAG' Becomes the Architectural Diversion

· 11 min read
Tian Pan
Software Engineer

The pattern is so familiar it's invisible. The model hallucinates a fact, so the team adds a retrieval step. Three weeks later, the model picks the wrong tool from a growing inventory, so they add a retrieval step on the tool catalog. The model's answers feel too generic, so they add a retrieval step on past good answers. A quarter passes, and the system is now a pile of retrievers gluing together a prompt that, fundamentally, still has the original problem.

What changed isn't the failure rate — it's the failure mode's name. "Model wrong" became "retrieval missed," which sounds more tractable but isn't. The eval suite scores higher because the retrieved context is, by construction, in-distribution for the test set. Production tells a different story, but by then the architecture has three retrieval layers, each with its own embedding model, index refresh cadence, and on-call rotation, and nobody wants to be the engineer who proposes ripping them out.

This is retrieval sprawl. It's an architectural diversion: a way of moving a hard problem (prompt design, model capability, ambiguous specifications) into a more comfortable problem (information retrieval engineering) without actually solving anything.

Your Review Queue Is Where the Autonomy Promise Goes to Die

· 10 min read
Tian Pan
Software Engineer

The AI feature ships with a clean safety story. Anything above the confidence threshold is auto-actioned. Anything below gets queued for a human. At launch, the queue is empty by 5 PM every day. Marketing puts "human-in-the-loop" on the slide. Compliance signs off. Everyone goes home.

Six months later the feature has 10x'd. The review team didn't. The queue carries a 72-hour backlog. An item that requires "human review" sits unread for three days, then gets approved by a tired reviewer who is averaging eleven seconds per decision because that is what it takes to keep the queue from doubling overnight. The product still says "every action is reviewed." The reality is that "human-in-the-loop" has degraded into "human in the queue eventually" — which is functionally autonomous operation with a paperwork lag.

The safety story didn't break with a bug. It broke with a staffing plan that nobody owned.

The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation

· 12 min read
Tian Pan
Software Engineer

The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.

Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.

The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.

The Structured-Output Retry Loop Is Your Hidden Compute Waste

· 11 min read
Tian Pan
Software Engineer

Pull up your structured-output dashboard. The number it proudly shows is something like "98.4% schema compliance." That's the success rate — the fraction of requests that produced a valid JSON object on the first try. The team built a retry wrapper for the other 1.6%, shipped it, and moved on. Two quarters later, the inference bill is up 15% on a request volume that grew by 4%. The CFO wants a story. The engineers don't have one, because the dashboard that tracks structured-output success doesn't track structured-output cost.

Here's the part the dashboard is hiding: the failure path is not a single retry. The first re-prompt fixes the missing enum field but introduces a malformed nested array. The second re-prompt fixes the array but drops a required key. The third pass finally validates, but by then the request has burned four full inference calls plus the original generation, and your per-request token meter shows the sum, not the loop. From the meter's perspective it's one expensive request. From the cost line's perspective it's a stochastic loop you never priced.

This post is about what that loop actually does to your compute budget, why your existing observability can't see it, and the disciplines that make it visible and bounded.

Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute

· 11 min read
Tian Pan
Software Engineer

Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.

The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. The watts per token shift by 30× depending on which model tier served the request, by 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers — accurate in the most useless sense.

Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote

· 11 min read
Tian Pan
Software Engineer

The trace looks clean. The agent called get_inventory_status, the tool returned {"available": 142, "warehouse": "SEA-3"}, and the model wove that into a confident answer. The customer placed an order. The warehouse said the item had been out of stock since 9 a.m. The cached row was four hours old. Nobody on the team had decided four hours was acceptable — that was just whatever the cache framework defaulted to when the platform team wired up the wrapper.

This is the failure mode that gets misfiled as a hallucination. The model isn't confabulating; it is faithfully reasoning over a stale tool result that nobody bothered to label as stale. The trace logs a clean call and a clean response, the eval set never saw a stale-cache case, and the regression compounds quietly across every customer who hits the same TTL window.

Translation Is Not Localization: The Cultural-Calibration Debt Your Multilingual AI Just Defaulted On

· 12 min read
Tian Pan
Software Engineer

A multilingual launch that ships English prompts translated into N languages, with an English eval set translated into the same N languages, has not shipped a multilingual product. It has shipped one product N times, and made all the failure modes invisible to its own dashboards. The system is fluent and culturally off-key, and the metric the team optimized — translation quality — is the wrong axis to measure what users are reacting to.

The visible defect on launch day is small. A Japanese user receives a reply that is grammatically correct and conspicuously curt. An Indonesian user notices the assistant is cheerfully direct in a register that reads as rude. A Korean user gets advice framed around individual choice when the prompt was about a family decision. None of these are translation bugs. They are cultural-register bugs that translation cannot fix and translated evals cannot detect.

The 12-Month AI Feature Cliff: Why Your Production Models Decay on a Calendar Nobody Marked

· 11 min read
Tian Pan
Software Engineer

A feature ships at 92% pass rate. The launch deck celebrates it. Twelve months later the same feature is at 78% — no incident report, no failed deploy, no single change to point at, just a slow erosion that nobody owned watching for. The team blames "hallucinations" or "user behavior shift," picks a junior engineer to investigate, and sets a quarterly OKR to "improve quality." The OKR misses. The feature ships an apologetic dialog telling users the AI sometimes makes mistakes. Six months after that, it's deprecated and replaced with a new version that ships at 91% pass rate, and the cycle starts again.

This isn't bad luck. It's the second clock that AI features run on, the one that nobody marks on the release calendar at launch. Conventional software has feature decay too — dependency drift, codebase rot, the slow accumulation of half-applied refactors — but those decay on a clock the engineering org already understands and budgets for. AI features have all of that, plus a parallel set of decay sources that conventional amortization assumptions don't model: model deprecations, vendor weight rotations, distribution shift in user inputs, prompt patches that compound, judge calibration drift, and the quiet aging of an eval set that no longer represents what production traffic looks like.

The architectural realization that has to land — before the next AI feature ships, not after — is that AI features have a non-zero baseline maintenance cost. The feature isn't done when it launches. It's enrolled in a maintenance schedule it can't escape, and the team that didn't budget for that schedule is going to discover it the hard way.

The Two-Language Problem: Why Type Safety Stops at the Prompt Boundary

· 10 min read
Tian Pan
Software Engineer

Your codebase has two languages, and only one of them has a compiler. There is the strictly-typed code your team writes — TypeScript with strict: true, Python with mypy in CI, Go with its enforced returns — and then there is the prompt: a templated string that gets concatenated, sent to a remote model, and returns another string the runtime hopes to parse. Between those two regions, the type system goes blind. The IDE highlights nothing. The compiler complains about nothing. And the team that ships a feature on the strength of "but it typechecks" has put the load-bearing contract somewhere the contract checker cannot see.

The seam is well-disguised. From the outside it looks like a function call: generate(input: UserQuery): Promise<AgentResponse>. The signature is honest about what flows in and what flows out. The dishonest part is what happens between the call site and the response: the input is interpolated into a prompt template that references field names by string, the model is asked to produce a JSON object that conforms to a schema described in prose inside that prompt, the response comes back as a string that gets handed to a parser, and the parser returns something the type system can finally see again. Every typed expression on either side is asserting things about a region in the middle that has no static guarantees at all.

This isn't a theoretical concern. Teams report a baseline 10–20% schema-failure rate on naive structured outputs in production, and the failures concentrate on exactly the inputs where you can least afford a silent drop — long contexts, deep tool chains, edge-case users. The type system gave a false sense of correctness right up to the moment the malformed JSON came back and the runtime swallowed it.

The Agent Flight Recorder: Capture These Fields Before Your First Incident

· 13 min read
Tian Pan
Software Engineer

The first time an agent goes sideways in production — it deletes the wrong row, emails the wrong customer, burns $400 of inference on a single task, or tells a regulated user something legally exposed — the team opens the logs and discovers what they actually have: a CloudWatch stream of tool-call names with truncated arguments, a "user prompt" field that captured only the latest turn, and no record of which model version actually ran. The provider rolled the alias forward two weeks ago. The system prompt lives in a config service that wasn't snapshotted. Temperature wasn't logged because the framework default was 0.7 and "everyone knows that." The tool result that triggered the bad action exceeded the log line size and got truncated to "...".

You cannot reconstruct the decision. You can only guess. Six months later you have a pile of "why did it do that" reports with no answers, and the team starts treating the agent like weather — something that happens to you, not something you debug.

The flight recorder discipline is the cheapest thing you will ever ship that prevents this, and the most expensive thing you will ever ship if you wait until the first incident to start. The fields below are the bare minimum, the storage shape is non-negotiable, and the sampling and privacy boundaries have to be designed alongside — not retrofitted.