492 posts tagged with "ai-engineering"

Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute

· 11 min read
Tian Pan
Software Engineer

Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.

The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. The watts per token shift by 30× depending on which model tier served the request, by 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers — accurate in the most useless sense.
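
To make the category error concrete, here is a minimal sketch of segment-level accounting, assuming per-request telemetry with hypothetical fields (modelTier, cacheHit, a per-request energy attribution in joules) that most dashboards do not export today:

```ts
// Hypothetical per-request telemetry; real fields will differ.
type InferenceRecord = {
  modelTier: "small" | "medium" | "large"; // which model served the request
  cacheHit: boolean;                       // did the prefix cache hit?
  tokensOut: number;
  joules: number;                          // energy attributed to this request
};

// Tokens per joule (i.e., per watt-second), segmented by the factors
// that swing the ratio by an order of magnitude or more.
function tokensPerWattSecond(records: InferenceRecord[]): Map<string, number> {
  const buckets = new Map<string, { tokens: number; joules: number }>();
  for (const r of records) {
    const key = `${r.modelTier}/${r.cacheHit ? "hit" : "miss"}`;
    const b = buckets.get(key) ?? { tokens: 0, joules: 0 };
    b.tokens += r.tokensOut;
    b.joules += r.joules;
    buckets.set(key, b);
  }
  const ratios = new Map<string, number>();
  for (const [key, b] of buckets) ratios.set(key, b.tokens / b.joules);
  return ratios;
}
```

The per-bucket ratios are the answer the CFO was actually asking for; collapsing them into one fleet-wide number reproduces the scooters-and-18-wheelers average.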

Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote

· 11 min read
Tian Pan
Software Engineer

The trace looks clean. The agent called get_inventory_status, the tool returned {"available": 142, "warehouse": "SEA-3"}, and the model wove that into a confident answer. The customer placed an order. The warehouse said the item had been out of stock since 9 a.m. The cached row was four hours old. Nobody on the team had decided four hours was acceptable — that was just whatever the cache framework defaulted to when the platform team wired up the wrapper.

This is the failure mode that gets misfiled as a hallucination. The model isn't confabulating; it is faithfully reasoning over a stale tool result that nobody bothered to label as stale. The trace logs a clean call and a clean response, the eval set never saw a stale-cache case, and the regression compounds quietly across every customer who hits the same TTL window.
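
A minimal sketch of what writing the contract down could look like: a cache wrapper that refuses to run without an explicit freshness budget and returns the age alongside the value. The types and the TTL here are illustrative assumptions, not any particular framework's API:

```ts
// A freshness contract someone actually chose, not a framework default.
type CachedResult<T> = {
  value: T;
  fetchedAt: number;   // epoch ms when the underlying tool really ran
  maxAgeMs: number;    // the contract, stated explicitly
  stale: boolean;      // computed at read time and surfaced to the caller
};

class ToolCache<T> {
  private store = new Map<string, { value: T; fetchedAt: number }>();
  constructor(private maxAgeMs: number) {} // required: no silent default TTL

  set(key: string, value: T): void {
    this.store.set(key, { value, fetchedAt: Date.now() });
  }

  get(key: string): CachedResult<T> | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    const age = Date.now() - entry.fetchedAt;
    return {
      value: entry.value,
      fetchedAt: entry.fetchedAt,
      maxAgeMs: this.maxAgeMs,
      stale: age > this.maxAgeMs,
    };
  }
}
```

With the age on the wire, the prompt can say "inventory as of 9:04, four hours ago" instead of presenting the cached row as live, and the eval set can finally include a stale-cache case.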

Translation Is Not Localization: The Cultural-Calibration Debt Your Multilingual AI Just Defaulted On

· 12 min read
Tian Pan
Software Engineer

A multilingual launch that ships English prompts translated into N languages, with an English eval set translated into the same N languages, has not shipped a multilingual product. It has shipped one product N times, and made all the failure modes invisible to its own dashboards. The system is fluent and culturally off-key, and the metric the team optimized — translation quality — is the wrong axis to measure what users are reacting to.

The visible defect on launch day is small. A Japanese user receives a reply that is grammatically correct and conspicuously curt. An Indonesian user notices the assistant is cheerfully direct in a register that reads as rude. A Korean user gets advice framed around individual choice when the prompt was about a family decision. None of these are translation bugs. They are cultural-register bugs that translation cannot fix and translated evals cannot detect.

The 12-Month AI Feature Cliff: Why Your Production Models Decay on a Calendar Nobody Marked

· 11 min read
Tian Pan
Software Engineer

A feature ships at 92% pass rate. The launch deck celebrates it. Twelve months later the same feature is at 78% — no incident report, no failed deploy, no single change to point at, just a slow erosion that nobody owned watching for. The team blames "hallucinations" or "user behavior shift," picks a junior engineer to investigate, and sets a quarterly OKR to "improve quality." The OKR misses. The feature ships an apologetic dialog telling users the AI sometimes makes mistakes. Six months after that, it's deprecated and replaced with a new version that ships at 91% pass rate, and the cycle starts again.

This isn't bad luck. It's the second clock that AI features run on, the one that nobody marks on the release calendar at launch. Conventional software has feature decay too — dependency drift, codebase rot, the slow accumulation of half-applied refactors — but those decay on a clock the engineering org already understands and budgets for. AI features have all of that, plus a parallel set of decay sources that conventional amortization assumptions don't model: model deprecations, vendor weight rotations, distribution shift in user inputs, prompt patches that compound, judge calibration drift, and the quiet aging of an eval set that no longer represents what production traffic looks like.

The architectural realization that has to land — before the next AI feature ships, not after — is that AI features have a non-zero baseline maintenance cost. The feature isn't done when it launches. It's enrolled in a maintenance schedule it can't escape, and the team that didn't budget for that schedule is going to discover it the hard way.
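
A sketch of what budgeting for that schedule could look like in practice: a recurring job that re-runs the launch eval against the current production configuration and pages an owner when erosion exceeds a named budget. Every name and threshold below is an assumption:

```ts
// Hypothetical names throughout; the point is the schedule, not the API.
type EvalCase = { input: string; expected: string };

async function decayWatch(
  cases: EvalCase[],
  runFeature: (input: string) => Promise<string>,
  passes: (got: string, expected: string) => boolean,
  launchPassRate: number, // e.g. 0.92, recorded in the launch deck
  budget: number          // e.g. 0.03 of allowed erosion before paging
): Promise<void> {
  let passed = 0;
  for (const c of cases) {
    if (passes(await runFeature(c.input), c.expected)) passed++;
  }
  const rate = passed / cases.length;
  if (launchPassRate - rate > budget) {
    // Page the feature's owner here. The point is that the number has one.
    console.error(`decay alert: ${rate.toFixed(3)} vs launch ${launchPassRate}`);
  }
}
```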

The Two-Language Problem: Why Type Safety Stops at the Prompt Boundary

· 10 min read
Tian Pan
Software Engineer

Your codebase has two languages, and only one of them has a compiler. There is the strictly-typed code your team writes — TypeScript with strict: true, Python with mypy in CI, Go with its enforced returns — and then there is the prompt: a templated string that gets concatenated, sent to a remote model, and returns another string the runtime hopes to parse. Between those two regions, the type system goes blind. The IDE highlights nothing. The compiler complains about nothing. And the team that ships a feature on the strength of "but it typechecks" has put the load-bearing contract somewhere the contract checker cannot see.

The seam is well-disguised. From the outside it looks like a function call: generate(input: UserQuery): Promise<AgentResponse>. The signature is honest about what flows in and what flows out. The dishonest part is what happens between the call site and the response: the input is interpolated into a prompt template that references field names by string, the model is asked to produce a JSON object that conforms to a schema described in prose inside that prompt, the response comes back as a string that gets handed to a parser, and the parser returns something the type system can finally see again. Every typed expression on either side is asserting things about a region in the middle that has no static guarantees at all.

This isn't a theoretical concern. Teams report a baseline 10–20% schema-failure rate on naive structured outputs in production, and the failures concentrate on exactly the inputs where you can least afford a silent drop — long contexts, deep tool chains, edge-case users. The type system gave a false sense of correctness right up to the moment the malformed JSON came back and the runtime swallowed it.
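
One way to close the seam is to make a runtime schema the single source of truth on both sides of the boundary. A minimal sketch, using zod as the validator (an assumption; the argument does not depend on any particular library):

```ts
import { z } from "zod";

// One schema object owns the region the type checker cannot see.
const AgentResponse = z.object({
  answer: z.string(),
  confidence: z.number().min(0).max(1),
  citations: z.array(z.string()),
});
type AgentResponse = z.infer<typeof AgentResponse>;

async function generate(
  userQuery: string,
  callModel: (prompt: string) => Promise<string> // stand-in for the model API
): Promise<AgentResponse> {
  const raw = await callModel(
    "Answer as a JSON object with fields answer, confidence, citations.\n" +
      `Query: ${userQuery}`
  );
  let obj: unknown;
  try {
    obj = JSON.parse(raw);
  } catch {
    throw new Error("schema failure: model returned non-JSON output");
  }
  // This is where the 10-20% failures surface loudly, at the boundary,
  // instead of being swallowed downstream as a silent drop.
  const parsed = AgentResponse.safeParse(obj);
  if (!parsed.success) {
    throw new Error(`schema failure: ${parsed.error.message}`);
  }
  return parsed.data;
}
```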

The Agent Flight Recorder: Capture These Fields Before Your First Incident

· 12 min read
Tian Pan
Software Engineer

The first time an agent goes sideways in production — it deletes the wrong row, emails the wrong customer, burns $400 of inference on a single task, or tells a regulated user something legally exposed — the team opens the logs and discovers what they actually have: a CloudWatch stream of tool-call names with truncated arguments, a "user prompt" field that captured only the latest turn, and no record of which model version actually ran. The provider rolled the alias forward two weeks ago. The system prompt lives in a config service that wasn't snapshotted. Temperature wasn't logged because the framework default was 0.7 and "everyone knows that." The tool result that triggered the bad action exceeded the log line size and got truncated to "...".

You cannot reconstruct the decision. You can only guess. Six months later you have a pile of "why did it do that" reports with no answers, and the team starts treating the agent like weather — something that happens to you, not something you debug.

The flight recorder discipline is the cheapest thing you will ever ship that prevents this, and the most expensive thing you will ever ship if you wait until the first incident to start. The fields below are the bare minimum, the storage shape is non-negotiable, and the sampling and privacy boundaries have to be designed alongside them — not retrofitted.
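
As a starting point, here is a sketch of what one per-step record could look like. The field names are illustrative, not a standard; the principle is that each value is captured verbatim at call time:

```ts
// Illustrative field names; the discipline is capturing each value verbatim
// at call time, never reconstructing it later from mutable config.
interface AgentStepRecord {
  traceId: string;
  stepIndex: number;
  timestamp: string;          // ISO 8601
  modelVersion: string;       // the resolved version, never the rolling alias
  systemPromptSha256: string; // hash of a snapshot taken at call time
  temperature: number;        // logged even when it is "the default"
  messages: unknown[];        // the full conversation, not just the latest turn
  toolName?: string;
  toolArgs?: string;          // untruncated; oversize payloads go to blob storage
  toolResult?: string;        // with a pointer stored here, never "..."
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
}
```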

Agent SLOs Without Ground Truth: An Error Budget for Outputs You Can't Grade in Real Time

· 11 min read
Tian Pan
Software Engineer

Your agent platform has met its 99.9% "response success" SLO every quarter for a year. Tickets are up 40%. Retention on the agent-touched cohort is down. The on-call rotation is bored, the product manager is panicking, and the executive review keeps asking why the dashboard says everything is fine while the support queue says everything is on fire. The dashboard isn't lying. It's just measuring the wrong thing — because the SRE who wrote the SLO defined success as "the model API returned 200," and that was the only definition of success the telemetry could express in the first place.

This is the central problem of agent reliability engineering: the success signal is not a status code. It is a judgment about whether the agent did the right thing for a specific task, and that judgment is unavailable at request time, often unavailable at session time, and sometimes only resolvable days later when the user files a ticket, edits the output, or quietly stops coming back. You cannot put a 200-vs-500 boolean on a column that doesn't exist yet.

The reflex is to wait for ground truth before declaring an SLO. This is wrong. Reliability does not pause while you build a labeling pipeline. The right move is to write an error budget against proxies you know are imperfect, name them as proxies, set the policy that governs how the team responds when they trip, and back-fill ground truth into the calculation as you produce it. This post is about how to do that without lying to yourself.
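
A minimal sketch of that proxy-based budget, with the proxy signal and the 2% budget both standing in as assumptions:

```ts
// Labels arrive days late, if ever; proxies fire at session close.
type Outcome = {
  proxyFired: boolean;    // user edited output, retried, or filed a ticket
  label?: "good" | "bad"; // ground truth, back-filled when it exists
};

// Burn rate against a budget of presumed-bad outcomes. Ground truth
// overrides the proxy wherever it has landed.
function burnRate(outcomes: Outcome[], budget = 0.02): number {
  let bad = 0;
  for (const o of outcomes) {
    if (o.label !== undefined) {
      if (o.label === "bad") bad++;
    } else if (o.proxyFired) {
      bad++;
    }
  }
  return bad / outcomes.length / budget; // > 1 means the budget policy trips
}
```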

AI Cyber Insurance: The Coverage Gap Your Agent Will Find First

· 11 min read
Tian Pan
Software Engineer

A coding agent merges a change at 2 a.m. that takes a customer's production database offline for ninety minutes. A customer-support agent fans out and sends fourteen thousand misworded refund-denial emails before the loop is killed. An autonomous reconciliation workflow charges 2,800 cards twice. The damages are real, the audit trail names your company, and your finance team files the claim against the cyber policy that was renewed six weeks ago. The carrier's response is a polite letter explaining that the policy covers "unauthorized access by malicious third parties" and "social engineering of an employee" — and the agent was authenticated, the action was authorized, and no employee was deceived. Coverage denied. The loss sits on your balance sheet.

This is not a hypothetical edge case. It is the modal claim profile for the next eighteen months, and the insurance industry knows it. Cyber, E&O, and D&O policy language was calibrated against a threat model where breach severity is a function of records exfiltrated and incident response is a function of forensic hours billed. Agentic AI does not produce that shape of incident. It produces a shape the underwriter has no actuarial baseline for, and the carrier's first instinct — when the actuarial baseline is missing — is to write the exposure out of the policy entirely.

The AI Engineer Interview Is Broken: Stop Testing Implementation, Start Probing Eval-Design

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter rejected three candidates in a row from their AI engineer pipeline. All three failed the coding screen — the kind of problem where you implement a sliding-window deduplicator under a 35-minute timer. The team then hired the candidate who passed it. Four months later that engineer was the one who shipped the feature where the eval scored 92% in CI and the support queue lit up the day after launch. The eval was measuring exact-match against a curated test set. Production users phrased their queries differently. Nobody on the hiring panel had asked the candidate how they would have caught that gap.

That's the shape of the bug. The interview pipeline was screening for the skill that mattered least to the job and was blind to the skill that mattered most. The team did not have a "judgment" round. They had a coding round, a system-design round, and a behavioral round, and they were running the same loop they had run in 2021 — the one calibrated for engineers who were going to write deterministic code against stable libraries.

Why Deprecating an AI Feature Is Harder Than You Think: Users Built Trust Scaffolding You Can't See

· 10 min read
Tian Pan
Software Engineer

When OpenAI tried to pull GPT-4o from ChatGPT in August 2025, the backlash was strong enough — organized hashtags, paying users threatening to cancel, public reversal within days — that the company restored it as a default option and promised "substantial notice" before any future removals. The replacement was, by every benchmark the team cared about, better. None of that mattered. Users had spent months learning the model's quirks, calibrating their judgment to its failure modes, and integrating its specific phrasing into workflows the team had never instrumented. Replacing it with "the better version" reset that calibration to zero.

This is the failure mode that the standard deprecation playbook does not cover. Sunsetting a regular SaaS feature — announce, migrate, dark-launch the removal, retire — assumes the user contract is the API surface. For AI features, the contract is the observed behavior of the model: phrasings, tendencies, failure modes, the specific way it handles ambiguity. Users build scaffolding on top of that behavior, and most of the scaffolding lives in their heads, on their laptops, and in downstream systems your team never touches.

The AI Feature OKR Mismatch: Why Quarterly Cadence Breaks AI Roadmaps

· 10 min read
Tian Pan
Software Engineer

The team commits to "ship the AI summarizer this quarter," gets it past the technical bar by week ten, takes a victory lap at the all-hands, and ships. Six weeks later the telemetry curve starts bending the wrong way — quietly, slowly, in a way nobody dashboards because nobody owns the shape. The OKR is already marked green. The next quarter's OKRs are already drafted around new launches. The summarizer is now somebody's second-priority maintenance job, and by quarter-end review the team is wondering why customer satisfaction on the feature dropped fifteen points without anything obvious changing.

This is not a bug in the team. It's a bug in the operating model. Quarterly OKRs were calibrated for software where a feature can be scoped, built, shipped, and then largely left alone until the next major rev. AI features don't have that shape. They have a launch curve and a sustain curve, and the sustain curve is where most of the value — and most of the risk — actually lives. The OKR template that treats them as deliverables with launch dates quietly produces a portfolio of demos that decay before the next planning cycle.

The AI Feature RACI: Why Four Green Dashboards Add Up to a Broken Product

· 11 min read
Tian Pan
Software Engineer

An AI feature regresses on a Tuesday. The eval CI is green. The guardrail dashboards are clean. The retrieval P95 is in line. The model provider had no incident. And yet the support queue is filling up with users who say the assistant "feels worse this week." The PM is the only person in the room who can name the regression, and even she cannot tell you which dashboard would have caught it. Welcome to the seam bug — the kind of failure where every individual artifact owner can prove their piece is fine, and the integrated experience is still broken.

This is the predictable result of how AI features get staffed. The owner-of-record list looks reasonable on paper: a prompt author owns the system prompt, an eval owner owns the offline test set and CI gates, a tool/retrieval owner owns the function calls and search index, a guardrail owner owns moderation and policy filters. Plus a model-selection decision that often lives outside all four — sometimes with a platform team, sometimes with whichever engineer most recently filed the procurement ticket. Five owners. Zero of them are on the hook for "does this feature work for the user."