Skip to main content

678 posts tagged with "ai-engineering"

View all tags

The Off-Hours Cost Curve: Why Your AI Feature Spends Differently on Saturday Than on Tuesday

· 10 min read
Tian Pan
Software Engineer

The cost dashboard everyone looks at is a weekly rolling average, and that average is lying to you. Not in the sense that the number is wrong — it's a faithful arithmetic mean of a billing event stream — but in the sense that it is hiding the shape of the cost curve underneath. The hours between Friday evening and Monday morning consume tokens differently from the hours between Tuesday at 10am and Thursday at 4pm. The cohort active on Saturday at 3am is not the cohort active on Tuesday at 11am, and the per-user economics of those cohorts diverge by a factor that nobody writes down because the dashboard averaged it away.

Most teams discover this the first time a weekend automation script melts the budget. A LangChain agent gets into an infinite conversation cycle Friday night, runs for the better part of a week before anyone notices, and produces a five-figure invoice that has to be explained to finance on Monday morning. The post-incident review treats it as a one-off — bad retry logic, missing budget cap, didn't page on-call. But the same dashboard that hid the runaway loop is also hiding the steady-state version of the same phenomenon: a baseline of off-hours traffic whose unit economics are structurally worse than the business-hours baseline, every single week, and which the weekly average smooths into invisibility.

Prompt Edits Without PRs: The Velocity Metric Your AI Team Is Failing

· 9 min read
Tian Pan
Software Engineer

A head of engineering opens the velocity dashboard on a Monday morning. PRs merged per week, flat. Story points completed, flat. Lines changed, suspiciously low. The AI team is having a quiet quarter, the chart says. Two floors away, that team has rewritten the system prompt seven times in three weeks, swapped a tool description that doubled tool-call accuracy, added six new few-shot examples, and tuned the rerank instruction until the product feels like a different application. None of that work shows up in the PR graph. None of it is invisible to users.

The asymmetry between what AI teams change and what engineering dashboards measure has become the load-bearing misdiagnosis of 2026. Behavior change in an AI-heavy product is increasingly decoupled from code change, and the metrics that have governed software organizations for fifteen years — PR throughput, commit volume, lines touched — measure code change. A team can be reshaping production response distributions weekly and look idle on every chart leadership trusts.

Quantization Slippage: The Capability Tax Your Eval Set Was Never Built to Catch

· 11 min read
Tian Pan
Software Engineer

A self-hosted LLM team quantizes the production model from fp16 to int4. Memory drops 4×, throughput nearly doubles, the GPU bill shrinks, and the team reruns the same eval suite that gated the fp16 release. MMLU-Pro retains 98.1% of baseline. Aggregate quality looks fine. They ship.

Six weeks later, a support engineer notices the math tutoring feature has gotten quietly worse. The compliance team flags an uptick in policy-violation completions on adversarial prompts. The structured-output retry rate has crept from 1.4% to 6.8%. None of these show up on the eval dashboard, because the eval dashboard was built to validate a different model — the one that shared the same weights file but had four times more bits behind every activation.

This is quantization slippage. The cost analysis priced the memory win and the latency win. It did not price the eval re-anchoring that the swap silently demanded, and the eval suite, calibrated against the fp16 distribution, is now grading the wrong model with the wrong rubric.

The Rerun Antipattern: Why Rolling Again Doesn't Find Bugs

· 10 min read
Tian Pan
Software Engineer

The first thing most engineers do when an AI feature misbehaves is click "run" again. The model is stochastic, the thinking goes, so maybe this run was just unlucky. When the second attempt produces something that looks reasonable, the ticket gets closed. The team moves on. The actual bug — a stale tool response, a retrieval miss, a system-prompt conflict that fires only on inputs containing a specific token — sits in production, intact, waiting for the next user to trip it.

This is the rerun antipattern, and it is the most expensive debugging habit AI teams have inherited from the chatbot era. It feels rigorous because the model genuinely is non-deterministic. It looks like a variance probe. But almost no one writes down a hypothesis before they reroll, no one decides in advance how many runs would constitute evidence, and no one accounts for the tokens. What's happening is closer to slot-machine debugging: you pull the lever until the lights stop flashing red, and you walk away convinced the machine is fine.

The Sliding-Window Tax: Why a 30-Turn Conversation Costs More Than 30x a Single Turn

· 9 min read
Tian Pan
Software Engineer

The conversation looks healthy on the dashboard. Average tokens per call is sane, the p50 input length is comfortably inside the cached prefix, the provider invoice ticks up at the rate finance approved. Then someone exports a single 200-turn coding session and the line item for that one user is larger than the rest of the team's daily traffic combined. The dashboard wasn't lying — it was averaging. The bill comes from the long tail, and the long tail does not scale linearly with turn count.

Every multi-turn AI feature eventually meets this surprise. The per-call token count is the wrong unit of measurement, because the cost of a 30-turn conversation is not 30 times the cost of a single turn — it's something between 50× and 200×, depending on how the history is structured, how the prompt cache decays, and what tier the request lands in once the input crosses 200K tokens. The team that priced the feature off the per-call number is underwriting a tail it never modeled.

Snapshot Eval Decay: When Green CI Stops Meaning Your Product Still Works

· 11 min read
Tian Pan
Software Engineer

Six months of green CI is hiding the fact that roughly forty percent of your eval set no longer represents what users actually do with your product. The suite still runs. The judge still scores. The dashboards still glow. But the cases were written against a query distribution, a corpus, a tool surface, and a regulatory text that have all moved underneath them — and a green run now means "yesterday's product still works on yesterday's reality," which is not the question you are paying CI to answer.

This is snapshot eval decay, and it is the slowest, most expensive failure mode in AI evaluation. Slow because the suite never fails — staleness shows up as inability to discriminate between models, not as red builds. Expensive because by the time someone notices that a model swap which the evals approved caused a production regression, the team has already accumulated a year of "we ship when evals pass" muscle memory built on top of an asset that quietly stopped working.

The Streamed-Response Trace Schema Gap: Why Your APM Lies About LLM Latency

· 10 min read
Tian Pan
Software Engineer

A pager fires at 02:14: customer reports that the assistant "freezes mid-sentence" on long answers. You open the trace. The span for the LLM call shows 8.4 seconds — green, within SLO, no error attribute, finish reason stop. The dashboard widget that aggregates p95 latency for that endpoint is sitting at 9.1s, exactly where it has been for a month. By every signal the APM exposes, the request succeeded.

The user saw the first 200 milliseconds look great, watched the next four seconds produce a coherent paragraph, then watched the same three-sentence fragment repeat for the remaining four seconds before the connection ended. The stuck content loop is a real failure, and the trace knows nothing about it — because the trace was designed for a system that finishes when it returns, not for a system whose behavior is the wall of intermediate state it produced along the way.

Tenancy Leaks Through Few-Shot Examples: When Your Prompt Library Becomes a Cross-Customer Data Store

· 11 min read
Tian Pan
Software Engineer

Open the production system prompt of a maturing AI product, scroll past the role description, and you will almost always find a section labeled # Examples or ## Few-shot demonstrations. The examples are excellent — they are concrete, they are domain-specific, they pattern-match exactly the failure modes the eval set was struggling with last quarter. They are also, on closer inspection, real customer data. A real ticket ID from a real account. A phrasing pattern lifted verbatim from a support thread. An internal product code that one tenant uses and the rest of the customer base has never heard of.

The team that put them there is not careless. The examples got into the prompt the way good examples always get into prompts: someone mined production traces for cases the model handled poorly, picked the cleanest worked example, pasted it into the system message, watched the eval scores climb, and shipped. That pipeline — production trace to system prompt — is the most reliable prompt-improvement loop in modern LLM engineering. It is also a structural cross-tenant data leak that the team built without noticing, and the system prompt has quietly become a multi-tenant data store the data-processing agreement never priced.

Your Fine-Tuning Corpus Is a Codebase. Stop Shipping It Through a Bucket.

· 11 min read
Tian Pan
Software Engineer

By month nine of any serious fine-tuning project, your training corpus has more authors than your codebase. Synthetic generation pipelines wrote a few million examples. The vendor labeling firm contributed 80K rows from a workforce you have never met. An engineer added 47 examples last Tuesday to fix a regression they spotted in eval. A scraping job pulls production traces into a "supplementary" parquet file every night. A CSV someone dropped into S3 in February is still there, still in the training mix, and the person who wrote it left the company in March.

Now look at your application code repo. Every line is attributable to a named author. Every change went through a PR with at least one reviewer. Commits are signed. The main branch is protected. Merges require a second human. There is an audit log. If an auditor asks who wrote line 47 of payment_processor.py, you have an answer within seconds.

If they ask who wrote example 47 of the corpus that produced model v2.3, the honest answer is "a Mechanical Turk batch from 2024-Q2, vendor unknown, justification absent." Your fine-tuning corpus is a higher-privilege deployment surface than your codebase — it directly shapes model behavior in production — and you are shipping it through a bucket while you ship code through a reviewed PR. The threat model is inverted.

On Intelligence, Chapter by Chapter: A 2004 Book That Predicted Half of Modern AI

· 133 min read
Tian Pan
Software Engineer

A 2004 book about brains argued that intelligence is, fundamentally, prediction. Twenty-two years later, the dominant paradigm in AI is literally trained to predict the next token. That book deserves another reading.

On Intelligence by Jeff Hawkins (with Sandra Blakeslee) is one of those rare technical books whose central claim has aged well in the most awkward way possible. The framework was right about what the brain does. It was almost certainly wrong about how you should engineer a machine to do it. And it is still the cleanest mental model I know for explaining why your LLM hallucinates with such confidence.

What follows is a chapter-by-chapter summary written for an engineer who is shipping AI features in 2026, not for a neuroscience seminar. I'll resist the temptation to relitigate every claim and just give you the spine, with a working engineer's annotation where the chapter has something to say about what you're building next week.

The AI Engineering Perf Packet: Making Stochastic Work Legible at Promotion Review

· 11 min read
Tian Pan
Software Engineer

A senior engineer walks into the promotion calibration meeting. They shipped a fine-tuned reranker that lifted retrieval quality eight points. They built the eval harness that turned a two-week QA cycle into a one-hour CI gate. They authored the prompt change that drove a two-point conversion lift. By any reasonable measure, they had a defining year.

They don't get promoted. The packet, as written, reads like "I tuned some numbers." The colleague next to them — who shipped a CRUD feature behind a launch banner with QPS, latency, and a Friday demo — gets the nod instead. The committee is not malicious. It is using a vocabulary it has, applied to a packet that didn't translate the work into that vocabulary.

This failure mode is now common enough to be a pattern. AI engineering work doesn't decompose cleanly into the artifacts that calibration committees were trained to evaluate. The packet template was written for deterministic systems shipped in deterministic ways, and the engineers who do the most leveraged work in the AI stack are paying the tax.

The Difficulty Concentrator: AI Support Deflection Burns Out the Humans Left Behind

· 9 min read
Tian Pan
Software Engineer

The dashboard says everything is going well. Deflection up to 65 percent. Ticket volume down. Cost-per-contact halved. Then the support team starts quitting, and the exit interviews say something the dashboard has no column for: "every shift is the bad one."

This is the hidden mechanic of AI-augmented support. The deflection rate is not a measure of difficulty removed. It is a measure of difficulty concentrated. The cases that reach a human are no longer a representative sample of customer reality — they are the residue, the cases the AI couldn't close. And the residue is heavier than the average.