
678 posts tagged with "ai-engineering"


The Compound Accuracy Problem: Why Your 95% Accurate Agent Fails 40% of the Time

· 11 min read
Tian Pan
Software Engineer

Your agent performs beautifully in isolation. You've benchmarked each step. You've measured per-step accuracy at 95%. You demo the system to stakeholders and it looks great. Then you ship it, and users report that it fails almost half the time.

The failure isn't a bug in any individual component. It's the math.
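The arithmetic behind that gap is worth seeing once: per-step accuracies multiply, so a handful of 95% steps compounds into a double-digit failure rate. A minimal sketch (the step counts are illustrative assumptions, not figures from the post):

```python
# End-to-end success when per-step accuracies multiply independently.
# The step counts below are illustrative; the headline "40% failure"
# corresponds to roughly ten 95%-accurate steps in a row.
per_step_accuracy = 0.95

for steps in (1, 5, 10, 14):
    end_to_end = per_step_accuracy ** steps
    print(f"{steps:>2} steps: {end_to_end:.1%} success, {1 - end_to_end:.1%} failure")

# 10 steps -> ~59.9% success, ~40.1% failure: the compounding, not any
# single component, is what sinks the agent.
```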

Contract Testing for AI Pipelines: Schema-Validated Handoffs Between AI Components

· 10 min read
Tian Pan
Software Engineer

Most AI pipeline failures aren't model failures. The model call succeeds. The output looks like JSON. The downstream stage breaks silently because a field was renamed, a type changed, or a nested object gained a new required property that the next stage doesn't know how to handle. The pipeline runs to completion and reports success. Somewhere in the data warehouse, numbers are wrong.

This is the contract testing problem for AI pipelines, and it's one of the most underaddressed reliability risks in production AI systems. According to recent infrastructure benchmarks, the average enterprise AI system experiences nearly five pipeline failures per month—each taking over twelve hours to resolve. The dominant cause isn't poor model quality. It's data quality and schema contract violations: 64% of AI risk lives at the schema layer.
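One way to make that contract explicit is to validate every handoff against a schema at the stage boundary. A minimal sketch using pydantic v2; the stage and field names are hypothetical, not the post's code:

```python
# Sketch of a schema contract between two pipeline stages, enforced with
# pydantic v2. The stage names and fields are illustrative assumptions.
from pydantic import BaseModel, ValidationError

class ExtractionOutput(BaseModel):
    """What the extraction stage promises to hand the enrichment stage."""
    customer_id: str
    sentiment: float       # a renamed field or changed type now fails loudly here
    topics: list[str]

def handoff(raw: dict) -> ExtractionOutput:
    try:
        return ExtractionOutput.model_validate(raw)
    except ValidationError as err:
        # Fail at the boundary instead of letting bad rows reach the warehouse.
        raise RuntimeError(f"Contract violation at extraction -> enrichment: {err}")
```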

Conversation State Is Not a Chat Array: Multi-Turn Session Design for Production

· 10 min read
Tian Pan
Software Engineer

Most multi-turn LLM applications store conversation history as an array of messages. It works fine in demos. It breaks in production in ways that take days to diagnose because the failures look like model problems, not infrastructure problems.

A user disconnects mid-conversation and reconnects to a different server instance—session gone. An agent reaches turn 47 in a complex task and the payload quietly exceeds the context window—no error, just wrong answers. A product manager asks "can we let users try a different approach from step 3?"—and the engineering answer is "no, not with how we built this." These are not edge cases. They are the predictable consequences of treating conversation state as a transient array rather than a first-class resource.
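What "first-class resource" can mean in practice: session state keyed by a session id and kept in a shared store, so reconnects and branching are operations on a resource rather than surgery on an in-process array. A rough sketch; the store interface and field names are assumptions:

```python
# Sketch: conversation state as a first-class resource keyed by session id,
# stored outside the serving process. Names and fields are illustrative.
import time

class SessionStore:
    """Any shared backend works (Redis, Postgres, DynamoDB); a dict stands in here."""
    def __init__(self):
        self._db = {}

    def save_turn(self, session_id: str, turn: dict) -> None:
        session = self._db.setdefault(session_id, {"turns": [], "updated_at": None})
        session["turns"].append(turn)
        session["updated_at"] = time.time()

    def load(self, session_id: str) -> dict:
        # A reconnect to a different server instance still finds the session.
        return self._db.get(session_id, {"turns": [], "updated_at": None})

    def fork_from(self, session_id: str, turn_index: int) -> str:
        # "Try a different approach from step 3" becomes a cheap branch.
        new_id = f"{session_id}:fork@{turn_index}"
        self._db[new_id] = {"turns": self._db[session_id]["turns"][:turn_index],
                            "updated_at": time.time()}
        return new_id
```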

The Data Quality Ceiling That Prompt Engineering Can't Break Through

· 11 min read
Tian Pan
Software Engineer

A telecommunications company spent months tuning prompts on their customer service chatbot. They iterated on system instructions, few-shot examples, chain-of-thought formatting. The hallucination rate stayed stubbornly above 50%. Then they audited their knowledge base and found it was filled with retired service plans, outdated billing information, and duplicate policy documents that contradicted each other. After fixing the data — not the prompts — hallucinations dropped to near zero. The fix that prompt engineering couldn't deliver took three weeks of data cleanup.

This is the data quality ceiling: a hard performance wall that blocks every LLM system fed on noisy, stale, or inconsistent data, and that no amount of prompt iteration can breach. It's one of the most common failure modes in production AI, and one of the most systematically underdiagnosed. Teams that hit this wall keep turning the prompt knobs when the problem is upstream.

EU AI Act Compliance Is an Engineering Problem: The Audit Trail You Have to Ship

· 10 min read
Tian Pan
Software Engineer

Most engineering teams building AI systems in 2026 understand that the EU AI Act exists. Very few understand what it actually requires them to build. The regulation's core obligations for high-risk AI systems — automatic event logging, human oversight mechanisms, risk management systems, technical documentation — are not policy artifacts that a legal team can produce on a deadline. They are engineering deliverables that require architectural decisions made at the start of a project, not in the final sprint before a compliance audit.

The hard deadline is August 2, 2026. High-risk AI systems deployed in the EU must be in full compliance with Articles 9 through 15. Organizations deploying AI in employment screening, credit scoring, benefits allocation, healthcare prioritization, biometric identification, or critical infrastructure management are in scope. If your system makes decisions that materially affect people in those domains and touches EU residents, it is almost certainly high-risk. And realistic compliance implementation timelines run 8 to 14 months — which means if you haven't started, you're already late.
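To make "automatic event logging" concrete as an engineering deliverable, here is one possible shape for an append-only decision record emitted per inference. The fields are illustrative engineering choices, not a reading of what Articles 9 through 15 require:

```python
# Sketch of an append-only audit record emitted for every high-risk decision.
# Field choices are illustrative, not a legal checklist; verify requirements
# against the regulation text with counsel.
import json, time, uuid

def audit_record(model_id: str, input_ref: str, output: str,
                 confidence: float, reviewer: str | None) -> dict:
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_id": model_id,        # which model and version made the decision
        "input_ref": input_ref,      # pointer to the stored input, not raw PII
        "output": output,
        "confidence": confidence,
        "human_reviewer": reviewer,  # human-oversight hook
    }

def log_event(record: dict, sink) -> None:
    sink.write(json.dumps(record) + "\n")  # append-only sink: file, Kafka, etc.
```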

The Golden Dataset Decay Problem: When Your Eval Set Becomes a Liability

· 9 min read
Tian Pan
Software Engineer

Most teams treat their golden eval set like a constitution — permanent, authoritative, and expensive to touch. They spend weeks curating examples, getting domain experts to label them, and wiring them into CI. Then they move on.

Six months later, the eval suite reports 87% pass rate while users are complaining about broken outputs. The evals haven't regressed — they've decayed. The dataset still measures what mattered in October. It just no longer measures what matters now.

This is the golden dataset decay problem, and it's more common than most teams admit.

Graceful Tool-Call Failure: The Error Contract Your Agent UI Is Missing

· 11 min read
Tian Pan
Software Engineer

Every agent demo you've ever seen ended with a clean result. The tool call returned exactly the data the model expected, the response arrived in well under two seconds, and the final answer was crisp and correct. That's the demo. Production is something else.

In production, tools time out. APIs return 403s because a service account was rotated last Tuesday. Third-party enrichment endpoints return a 200 with a body that says {"status": "degraded", "data": null}. OAuth tokens expire at 3 AM on a Saturday. These aren't edge cases — they're the normal operating conditions of any agent that talks to the real world. The failure modes are predictable. The problem is that most agent architectures treat them as afterthoughts, and most agent UIs have no vocabulary for communicating them to users at all.
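One way to give the agent and the UI a shared vocabulary for these failures is an explicit error contract on every tool result. A sketch under assumed names; the endpoint, error taxonomy, and messages are illustrative, not a standard:

```python
# Sketch of a tool-call error contract: every call returns a typed result,
# never a bare exception or a silently degraded payload. The taxonomy
# (ok / timeout / auth_error / degraded) is an illustrative assumption.
from dataclasses import dataclass
from typing import Any, Literal
import requests  # any HTTP client works; requests is used for the sketch

ENRICH_URL = "https://api.example.com/enrich"  # hypothetical endpoint

@dataclass
class ToolResult:
    status: Literal["ok", "timeout", "auth_error", "degraded"]
    data: Any = None
    user_message: str | None = None   # what the UI is allowed to show
    retryable: bool = False

def call_enrichment_api(payload: dict) -> ToolResult:
    try:
        resp = requests.post(ENRICH_URL, json=payload, timeout=2.0)
    except requests.Timeout:
        return ToolResult("timeout", user_message="The lookup is taking longer than usual.",
                          retryable=True)
    if resp.status_code == 403:
        return ToolResult("auth_error", user_message="We lost access to this data source.",
                          retryable=False)
    body = resp.json()
    if body.get("data") is None:       # the 200-but-degraded case from above
        return ToolResult("degraded", user_message="Partial results only.", retryable=True)
    return ToolResult("ok", data=body["data"])
```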

Defining Escalation Criteria That Actually Work in Human-AI Teams

· 10 min read
Tian Pan
Software Engineer

Most AI teams can tell you their containment rate — the percentage of interactions the AI handled without routing to a human. Far fewer can tell you whether that number is the right one.

Escalation criteria are the single most important design document in an AI-augmented team, and most teams don't have one. They have a threshold buried in a YAML file and an implicit assumption that the AI knows when it's stuck. That assumption is wrong in both directions: too high a threshold and humans spend their days redoing AI work; too low and users absorb AI errors without recourse. Both failures are invisible until they compound.
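A hedged sketch of what an explicit criteria document can compile down to: a function that returns both the decision and the reason, instead of a single confidence threshold in YAML. The criteria and numbers are invented for illustration:

```python
# Sketch: escalation criteria as an explicit, reviewable policy.
# Criteria and thresholds below are illustrative assumptions.
def should_escalate(confidence: float, turn_count: int,
                    topic: str, user_requested_human: bool) -> tuple[bool, str]:
    if user_requested_human:
        return True, "user asked for a human"
    if topic in {"billing_dispute", "account_closure"}:
        return True, "topic is on the always-escalate list"
    if confidence < 0.6:
        return True, "model confidence below floor"
    if turn_count > 8:
        return True, "conversation looping without resolution"
    return False, "contained"
```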

The Prompt Made Sense Last Year: Institutional Knowledge Decay in AI Systems

· 10 min read
Tian Pan
Software Engineer

There's a specific kind of dread that hits when you inherit an AI system from an engineer who just left. The system prompts are hundreds of lines long. There's a folder called evals/ with 340 test cases and no README. A comment in the code says # DO NOT CHANGE THIS — ask Chen. Chen is no longer reachable.

You don't know why the customer support bot is forbidden from discussing pricing on Tuesdays. You don't know which eval cases were written to catch a regression from six months ago versus which ones are just random examples. You don't know if the guardrail blocking certain product categories was a legal requirement, a compliance experiment, or something someone added because a VP saw one bad output.

The system still works. For now. But you can't safely change anything.

The Last-Mile Reliability Problem: Why 95% Accuracy Often Means 0% Usable

· 9 min read
Tian Pan
Software Engineer

You built an AI feature. You ran evals. You saw 95% accuracy on your test set. You shipped it. Six weeks later, users hate it and your team is quietly planning to roll it back.

This is the last-mile reliability problem, and it is probably the most common cause of AI feature failure in production today. It has nothing to do with your model being bad and everything to do with how average accuracy metrics hide the distribution of failures — and how certain failures are disproportionately expensive regardless of their statistical frequency.
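A small worked example of how the average hides the damage: weight each failure mode by its cost instead of counting every miss equally. The rates and costs below are invented for illustration:

```python
# Worked example (numbers invented): a system with 95% accuracy whose
# expected damage is dominated by its rarest, most expensive failures.
failures = {
    "minor_wording_issue":     {"rate": 0.040, "cost": 1},
    "wrong_but_plausible":     {"rate": 0.009, "cost": 50},
    "harmful_or_irreversible": {"rate": 0.001, "cost": 1000},
}

accuracy = 1 - sum(f["rate"] for f in failures.values())
expected_cost = sum(f["rate"] * f["cost"] for f in failures.values())

print(f"accuracy: {accuracy:.1%}")                       # 95.0%
print(f"expected cost per request: {expected_cost:.2f}") # 1.49, mostly from the 0.1% case
```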

The Latency Perception Gap: Why a 3-Second Stream Feels Faster Than a 1-Second Batch

· 11 min read
Tian Pan
Software Engineer

Your users don't have a stopwatch. They have feelings. And those feelings diverge from wall-clock reality in ways that matter enormously for how you build AI interfaces. A response that appears character-by-character over three seconds will consistently feel faster to users than a response that materializes all at once after one second — even though the batch system is objectively faster. This isn't irrational or a bug in human cognition. It's a well-documented perceptual phenomenon, and if you're building AI products without accounting for it, you're optimizing for the wrong metric.

This post breaks down the psychology behind latency perception, the metrics that actually predict user satisfaction, the frontend patterns that exploit these perceptual quirks, and when streaming adds more complexity than it's worth.
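If you want to optimize for the metric that tracks perception, measure time-to-first-token separately from total latency. A rough sketch over any stream of text chunks; the client shape is an assumption:

```python
# Sketch: measure time-to-first-token (TTFT) alongside total latency.
# `stream` can be any iterator of text chunks from a streaming client.
import time

def measure_stream(stream):
    start = time.monotonic()
    first_token_at = None
    chunks = []
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        chunks.append(chunk)
    total = time.monotonic() - start
    return {"ttft_s": first_token_at, "total_s": total, "text": "".join(chunks)}

# A 3-second stream with a 0.3s TTFT usually feels faster than a 1-second
# batch response whose TTFT is the full 1 second.
```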

Your Model Is Most Wrong When It Sounds Most Sure: LLM Calibration in Production

· 9 min read
Tian Pan
Software Engineer

There's a failure mode that bites teams repeatedly after they've solved the easier problems — hallucination filtering, output parsing, retry logic. The model is giving confident-sounding wrong answers, the confidence-based routing logic is trusting those wrong answers, and the system is silently misbehaving in production while the eval dashboard looks fine.

This isn't a prompting problem. It's a calibration problem, and it's baked into how modern LLMs are trained.
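A quick way to see whether you have this problem is a reliability table: bucket predictions by stated confidence and compare against observed accuracy. A sketch assuming you can pull (confidence, was_correct) pairs from your eval logs:

```python
# Sketch of a reliability check. Large gaps in the high-confidence buckets
# are exactly the "most wrong when most sure" failure. Input is assumed to
# be (confidence, was_correct) pairs from eval logs.
def reliability_table(preds: list[tuple[float, bool]], n_bins: int = 10) -> None:
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append(correct)
    for i, bucket in enumerate(bins):
        if not bucket:
            continue
        lo, hi = i / n_bins, (i + 1) / n_bins
        accuracy = sum(bucket) / len(bucket)
        print(f"confidence {lo:.1f}-{hi:.1f}: accuracy {accuracy:.2f} ({len(bucket)} samples)")
```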