
553 posts tagged with "ai-engineering"


The Golden Dataset Decay Problem: When Your Eval Set Becomes a Liability

· 9 min read
Tian Pan
Software Engineer

Most teams treat their golden eval set like a constitution — permanent, authoritative, and expensive to touch. They spend weeks curating examples, getting domain experts to label them, and wiring them into CI. Then they move on.

Six months later, the eval suite reports an 87% pass rate while users are complaining about broken outputs. The evals haven't regressed — they've decayed. The dataset still measures what mattered in October. It just no longer measures what matters now.

This is the golden dataset decay problem, and it's more common than most teams admit.
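To make the decay measurable rather than anecdotal, one option (a minimal sketch; the category names and counts below are invented) is to compare the topic mix of the golden set against the topic mix of recent production traffic:

```python
from collections import Counter

def coverage_drift(golden_labels, recent_traffic_labels):
    """For each traffic category, how much more of production it represents
    than of the golden set. Large positive values = under-covered by evals."""
    golden = Counter(golden_labels)
    traffic = Counter(recent_traffic_labels)
    golden_total = sum(golden.values()) or 1
    traffic_total = sum(traffic.values()) or 1

    drift = {}
    for category, count in traffic.items():
        traffic_share = count / traffic_total
        golden_share = golden.get(category, 0) / golden_total
        drift[category] = traffic_share - golden_share
    return drift

# Hypothetical numbers: "billing" questions now dominate traffic but are
# barely present in the golden set curated last October.
print(coverage_drift(
    ["onboarding"] * 40 + ["billing"] * 5,
    ["billing"] * 30 + ["onboarding"] * 10,
))
```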

Graceful Tool-Call Failure: The Error Contract Your Agent UI Is Missing

· 11 min read
Tian Pan
Software Engineer

Every agent demo you've ever seen ended with a clean result. The tool call returned exactly the data the model expected, the response arrived in well under two seconds, and the final answer was crisp and correct. That's the demo. Production is something else.

In production, tools time out. APIs return 403s because a service account was rotated last Tuesday. Third-party enrichment endpoints return a 200 with a body that says {"status": "degraded", "data": null}. OAuth tokens expire at 3 AM on a Saturday. These aren't edge cases — they're the normal operating conditions of any agent that talks to the real world. The failure modes are predictable. The problem is that most agent architectures treat them as afterthoughts, and most agent UIs have no vocabulary for communicating them to users at all.
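As a rough illustration of what such an error contract could look like (the type and field names below are hypothetical, not taken from any particular framework), every tool call returns an envelope that distinguishes a usable result from a degraded or failed one, so the UI has something concrete to render:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional

class ToolStatus(Enum):
    OK = "ok"                # full result, safe to use as-is
    DEGRADED = "degraded"    # partial or stale data; show it, but say so
    FAILED = "failed"        # no usable data; surface the reason, offer retry

@dataclass
class ToolResult:
    status: ToolStatus
    data: Optional[Any] = None
    user_message: str = ""    # what the UI should tell the user
    retriable: bool = False   # whether "try again" is a sensible action

def call_enrichment_api(payload: dict) -> ToolResult:
    """Hypothetical wrapper: a 200 with empty data is still a degraded result."""
    resp = {"status": "degraded", "data": None}  # stand-in for a real HTTP call
    if resp.get("data") is None:
        return ToolResult(
            status=ToolStatus.DEGRADED,
            user_message="Enrichment data is temporarily unavailable; answering without it.",
            retriable=True,
        )
    return ToolResult(status=ToolStatus.OK, data=resp["data"])
```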

Defining Escalation Criteria That Actually Work in Human-AI Teams

· 10 min read
Tian Pan
Software Engineer

Most AI teams can tell you their containment rate — the percentage of interactions the AI handled without routing to a human. Far fewer can tell you whether that number is the right one.

Escalation criteria are the single most important design document in an AI-augmented team, and most teams don't have one. They have a threshold buried in a YAML file and an implicit assumption that the AI knows when it's stuck. That assumption is wrong in both directions: too high a threshold and humans spend their days redoing AI work; too low and users absorb AI errors without recourse. Both failures are invisible until they compound.
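A sketch of what explicit criteria might look like instead (all rules and numbers here are invented placeholders), where each escalation reason is named and reviewable rather than buried in a single threshold:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    confidence: float            # model's self-reported or calibrated confidence
    retries: int                 # how many times the agent retried a failing step
    topic: str                   # e.g. "refund", "legal", "general"
    user_requested_human: bool

# Hypothetical, explicit criteria: each one is a named, testable rule.
ESCALATE_TOPICS = {"legal", "refund_over_limit"}

def should_escalate(i: Interaction) -> tuple[bool, str]:
    if i.user_requested_human:
        return True, "user asked for a human"
    if i.topic in ESCALATE_TOPICS:
        return True, f"topic '{i.topic}' is always escalated"
    if i.retries >= 2:
        return True, "agent is stuck (multiple failed retries)"
    if i.confidence < 0.6:
        return True, "confidence below routing floor"
    return False, "contained"

print(should_escalate(Interaction(confidence=0.42, retries=0,
                                  topic="general", user_requested_human=False)))
```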

The Prompt Made Sense Last Year: Institutional Knowledge Decay in AI Systems

· 10 min read
Tian Pan
Software Engineer

There's a specific kind of dread that hits when you inherit an AI system from an engineer who just left. The system prompts are hundreds of lines long. There's a folder called evals/ with 340 test cases and no README. A comment in the code says # DO NOT CHANGE THIS — ask Chen and Chen is no longer reachable.

You don't know why the customer support bot is forbidden from discussing pricing on Tuesdays. You don't know which eval cases were written to catch a regression from six months ago versus which ones are just random examples. You don't know if the guardrail blocking certain product categories was a legal requirement, a compliance experiment, or something someone added because a VP saw one bad output.

The system still works. For now. But you can't safely change anything.

The Last-Mile Reliability Problem: Why 95% Accuracy Often Means 0% Usable

· 9 min read
Tian Pan
Software Engineer

You built an AI feature. You ran evals. You saw 95% accuracy on your test set. You shipped it. Six weeks later, users hate it and your team is quietly planning to roll it back.

This is the last-mile reliability problem, and it is probably the most common cause of AI feature failure in production today. It has nothing to do with your model being bad and everything to do with how average accuracy metrics hide the distribution of failures — and how certain failures are disproportionately expensive regardless of their statistical frequency.
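A back-of-the-envelope illustration with invented numbers: when failure costs are skewed, the 1% tail dominates the expected damage even at 95% accuracy.

```python
# Invented numbers: 95% of outputs are fine, but failures are not uniform.
# A small share of catastrophic failures carries most of the user-visible cost.
outcomes = [
    {"share": 0.95, "cost": 0},     # correct output
    {"share": 0.04, "cost": 1},     # minor error: user edits the output
    {"share": 0.01, "cost": 50},    # severe error: wrong number sent to a customer
]

expected_cost = sum(o["share"] * o["cost"] for o in outcomes)
print(expected_cost)  # 0.54 per interaction, driven almost entirely by the 1% tail
```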

The Latency Perception Gap: Why a 3-Second Stream Feels Faster Than a 1-Second Batch

· 11 min read
Tian Pan
Software Engineer

Your users don't have a stopwatch. They have feelings. And those feelings diverge from wall-clock reality in ways that matter enormously for how you build AI interfaces. A response that appears character-by-character over three seconds will consistently feel faster to users than a response that materializes all at once after one second — even though the batch system is objectively faster. This isn't irrational or a bug in human cognition. It's a well-documented perceptual phenomenon, and if you're building AI products without accounting for it, you're optimizing for the wrong metric.

This post breaks down the psychology behind latency perception, the metrics that actually predict user satisfaction, the frontend patterns that exploit these perceptual quirks, and when streaming adds more complexity than it's worth.
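For a concrete feel for the metrics involved, here is a minimal sketch (assuming you log token arrival times yourself; nothing here is tied to a specific provider) that separates time-to-first-token from total latency:

```python
def perceived_latency_metrics(token_timestamps: list[float], request_start: float) -> dict:
    """token_timestamps: wall-clock arrival time of each streamed token."""
    if not token_timestamps:
        return {"ttft": None, "total": None, "tokens_per_sec": None}
    ttft = token_timestamps[0] - request_start      # what the user feels first
    total = token_timestamps[-1] - request_start    # what the stopwatch says
    stream_duration = max(total - ttft, 1e-9)
    return {
        "ttft": ttft,
        "total": total,
        "tokens_per_sec": (len(token_timestamps) - 1) / stream_duration,
    }

# Invented example: the stream starts at 0.4s and finishes at 3.0s, versus a
# batch response that lands all at once at 1.0s. The stream begins responding sooner.
print(perceived_latency_metrics([0.4, 1.2, 2.1, 3.0], 0.0))
```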

Your Model Is Most Wrong When It Sounds Most Sure: LLM Calibration in Production

· 9 min read
Tian Pan
Software Engineer

There's a failure mode that bites teams repeatedly after they've solved the easier problems — hallucination filtering, output parsing, retry logic. The model is giving confident-sounding wrong answers, the confidence-based routing logic is trusting those wrong answers, and the system is silently misbehaving in production while the eval dashboard looks fine.

This isn't a prompting problem. It's a calibration problem, and it's baked into how modern LLMs are trained.
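One standard way to quantify the gap is expected calibration error; a minimal sketch over (confidence, correctness) pairs, with invented example data:

```python
def expected_calibration_error(pairs: list[tuple[float, bool]], n_bins: int = 10) -> float:
    """pairs: (model confidence in [0, 1], whether the answer was actually correct)."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in pairs:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    total = len(pairs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Invented data: the model says 0.9 a lot but is right far less often than that.
data = [(0.9, True), (0.9, False), (0.95, False), (0.6, True), (0.55, True)]
print(round(expected_calibration_error(data), 3))
```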

LLM Cost Forecasting Before You Ship: The Estimation Problem Most Teams Skip

· 9 min read
Tian Pan
Software Engineer

A team ships a support chatbot. In testing, the monthly bill looks manageable—a few hundred dollars across the engineering team's demo sessions. Three weeks into production, the invoice arrives: $47,000. Nobody had lied about the token counts. Nobody had made an arithmetic error. The production workload was simply a different animal than anything they'd simulated.

This pattern repeats constantly. Teams estimate LLM costs the way they estimate database query costs—by measuring a representative request and multiplying by expected volume. That mental model breaks badly for LLMs, because the two biggest cost drivers (output token length and tool-call overhead) are determined at inference time by behavior you cannot fully predict at design time.

This post is about how to forecast better before you ship, not how to optimize after the bill arrives.
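A sketch of forecasting from a distribution of request shapes rather than a single representative request; all prices, volumes, and token counts below are placeholders, not real quotes:

```python
# Placeholder prices (USD per 1M tokens) and traffic assumptions.
PRICE_IN, PRICE_OUT = 3.00, 15.00
REQUESTS_PER_MONTH = 2_000_000

def monthly_cost(input_tokens: float, output_tokens: float) -> float:
    per_request = (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1_000_000
    return per_request * REQUESTS_PER_MONTH

# The demo-session view: one "representative" request.
print(monthly_cost(input_tokens=800, output_tokens=150))

# The distribution view: long-tail conversations and tool-call loops dominate,
# even though they are a minority of requests.
scenarios = [
    (0.70, 800, 150),       # short, no tools
    (0.25, 3_000, 600),     # multi-turn with retrieved context
    (0.05, 12_000, 2_500),  # tool-call loops re-sending the whole history
]
print(sum(p * monthly_cost(i, o) for p, i, o in scenarios))
```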

LLMs as Data Engineers: The Silent Failures in AI-Driven ETL

· 11 min read
Tian Pan
Software Engineer

Your hand-coded ETL pipeline handles 95% of records correctly. The edge cases — the currency strings with commas, the inconsistently formatted dates, the nonstandard country codes — flow through to your data warehouse and quietly corrupt your dashboards. Nobody notices until a quarterly report looks wrong. You add another special case to the pipeline. The cycle continues.

LLMs can solve this. They infer schemas from raw samples, handle messy edge cases that no engineer anticipated, and transform unstructured documents into structured records at a fraction of the development time. Several teams have shipped this. Some of them have also had LLMs silently transform "$1,200,000" into 1200 instead of 1200000, flip severity scores from "high" to "low" with complete structural validity, and join on the wrong foreign key in ways that passed every schema check.

The problem isn't that LLMs are bad at data engineering. It's that their failure mode is exactly wrong for ETL: high confidence, no error thrown, structurally valid output.
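One mitigation is to cross-check the LLM's output against a deterministic reference parse; a minimal sketch for the currency case (function names and tolerance are arbitrary choices for illustration):

```python
import re

def parse_currency_reference(raw: str) -> int:
    """Deterministic reference parser used only to cross-check the LLM's output."""
    digits = re.sub(r"[^\d]", "", raw)
    return int(digits) if digits else 0

def check_llm_row(raw_value: str, llm_value: int, tolerance: float = 0.01) -> bool:
    """Flag rows where the LLM's number diverges from a dumb-but-honest parse."""
    reference = parse_currency_reference(raw_value)
    if reference == 0:
        return llm_value == 0
    return abs(llm_value - reference) / reference <= tolerance

print(check_llm_row("$1,200,000", 1_200_000))  # True:  agrees with the reference
print(check_llm_row("$1,200,000", 1_200))      # False: structurally valid, semantically wrong
```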

Model Deprecation Is a Systems Migration: How to Survive Provider Model Retirements

· 11 min read
Tian Pan
Software Engineer

A healthcare company running a production AI triage assistant gets the email every team dreads: their inference provider is retiring the model they're using in 90 days. They update the model string, run a quick manual smoke test, and ship the replacement. Three weeks later, the new model starts offering unsolicited diagnostic opinions. Token usage explodes 5×. Entire prompt templates break because the new model interprets instruction phrasing differently. JSON parsing fails because the output schema shifted.

This is not an edge case. It is the normal experience of surviving a model retirement when you treat it as a configuration change rather than a systems migration.

Model Upgrade as a Breaking Change: What Your Deployment Pipeline Is Missing

· 11 min read
Tian Pan
Software Engineer

When OpenAI deprecated max_tokens in favor of max_completion_tokens in their newer models, applications that had been running fine for months began returning 400 errors. No announcement triggered an alert. No error in your code. The model changed; your assumptions did not. This is the canonical story of a model upgrade as a breaking change — except most of them are quieter and therefore harder to catch.

Foundation model updates don't follow the same social contract as library releases. There's no BREAKING CHANGE: prefix in a git commit. There's no semver bump that tells your CI to fail. The output format narrows, the tone drifts, the JSON structure reorganizes, the reasoning path shortens — and downstream consumers discover it gradually, through degraded user experience and confused analytics, not a thrown exception.
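A sketch of one defensive pattern, a request-normalization shim that keeps the parameter mapping in one place (which models require which parameter is something you would maintain from the provider's changelog; the model name below is a placeholder):

```python
# One internal call site; per-model parameter differences handled here, not
# scattered across the codebase. The membership set is an assumption you keep
# up to date as the provider deprecates parameters.
MODELS_REQUIRING_MAX_COMPLETION_TOKENS = {"hypothetical-new-model"}

def build_chat_request(model: str, messages: list[dict], max_output: int) -> dict:
    request = {"model": model, "messages": messages}
    if model in MODELS_REQUIRING_MAX_COMPLETION_TOKENS:
        request["max_completion_tokens"] = max_output
    else:
        request["max_tokens"] = max_output
    return request

print(build_chat_request("hypothetical-new-model",
                         [{"role": "user", "content": "hi"}], 512))
```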

Multi-User AI Sessions: The Context Ownership Problem Nobody Designs For

· 9 min read
Tian Pan
Software Engineer

In August 2024, security researchers discovered that Slack AI would pull both public and private channel content into the same context window when answering a query. An attacker in a public channel could craft a message that, when ingested by Slack AI, would inject instructions into a victim's session — and since Slack AI doesn't cite its sources, the resulting data exfiltration was nearly untraceable. The attack could leak API keys embedded in private DMs. Slack patched it after responsible disclosure.

This wasn't a bug in the traditional sense. It was a consequence of treating context as a shared mutable resource with no per-user access control. And it's a mistake that most teams building shared AI assistants are making right now, just more quietly.
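A minimal sketch of the missing control, enforcing per-user access at retrieval time so nothing enters a user's context window that they could not already read (types and data below are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    allowed_users: set[str] = field(default_factory=set)  # ACL travels with the content

def build_context(query: str, candidates: list[Document], requesting_user: str) -> str:
    """Only content the requesting user could already read may enter their context."""
    visible = [d for d in candidates if requesting_user in d.allowed_users]
    return "\n\n".join(d.text for d in visible)

docs = [
    Document("Public channel: release notes for v2.", {"alice", "bob"}),
    Document("Private DM: production credentials discussion.", {"alice"}),  # never visible to bob
]
print(build_context("what changed in v2?", docs, requesting_user="bob"))
```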