Skip to main content

678 posts tagged with "ai-engineering"

View all tags

Benchmark Contamination: Why That 90% MMLU Score Doesn't Mean What You Think

· 8 min read
Tian Pan
Software Engineer

When GPT-4 scored 88% on MMLU, it felt like a watershed moment. MMLU — the Massive Multitask Language Understanding benchmark — tests 57 academic subjects from elementary math to professional law. An 88% accuracy across that breadth looked like strong evidence of genuine broad intelligence. Then researchers created MMLU-CF, a contamination-free variant that swapped out any questions with suspicious proximity to known training corpora. GPT-4o dropped to 73.4% — a 14.6 percentage point gap.

That gap isn't a small rounding error. It's the difference between "reliably correct on complex academic questions" and "reliably correct when you've seen the question before." For teams making model selection decisions based on leaderboard scores, it means buying a capability that doesn't fully exist.

Why 'Fix the Prompt' Is a Root Cause Fallacy: Blameless Postmortems for AI Systems

· 9 min read
Tian Pan
Software Engineer

Your LLM-powered feature starts returning nonsense. The on-call engineer pages the ML team. They look at the output, compare it to what the prompt was supposed to produce, and within the hour the ticket is resolved: "bad prompt — tweaked and redeployed." Incident closed. Postmortem written. Action items: "improve prompt engineering process."

Two weeks later, the same class of failure happens again. Different prompt, different feature — but the same invisible root cause.

The "fix the prompt" reflex is the AI engineering equivalent of blaming the last developer to touch a file. It gives postmortems a clean ending without requiring anyone to understand what actually broke. And unlike traditional software, where this reflex is merely lazy, in AI systems it's structurally dangerous — because non-deterministic systems fail in ways that prompt changes cannot fix.

Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache

· 11 min read
Tian Pan
Software Engineer

Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.

Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute didn't help — the bottleneck isn't where you assumed it was.

AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.

The Capability Elicitation Gap: Why Upgrading to a Newer Model Can Break Your Product

· 9 min read
Tian Pan
Software Engineer

You upgraded to the latest model and your product got worse. Not catastrophically — the new model scores higher on benchmarks, handles harder questions, and refuses fewer things it shouldn't. But the thing your product actually needs? It's regressed. Your carefully tuned prompts produce hedged, over-qualified outputs where you need confident assertions. Your domain-specific format instructions are being helpfully "improved" into something generic. The tight instruction-following that made your workflow reliable now feels like it's on autopilot.

This is the capability elicitation gap: the difference between what a model can do in principle and what it actually does under your prompt in production. And it gets systematically wider with each safety-focused training cycle.

Capacity Planning for AI Workloads: Why the Math Breaks When Tokens Are Your Resource

· 11 min read
Tian Pan
Software Engineer

Your GPU dashboard is lying to you. At 60% utilization, your inference cluster looks healthy. Users are experiencing 8-second time-to-first-token. The on-call engineer checks memory — also fine. Compute — fine. And yet the queue is growing and latency is spiking. This is what happens when you apply traditional capacity planning to LLM workloads: the metrics you trust point to the wrong places, and the actual bottleneck stays invisible until users start complaining.

The root problem is that LLMs consume a fundamentally different kind of resource. CPU services trade compute and memory. LLM services trade tokens — and tokens don't behave like requests.

The Cognitive Load Inversion: Why AI Suggestions Feel Helpful but Exhaust You

· 9 min read
Tian Pan
Software Engineer

There's a number in the AI productivity research that almost nobody talks about: 39 percentage points. In a study of experienced developers, participants predicted AI tools would make them 24% faster. After completing the tasks, they still believed they'd been 20% faster. The measured reality: they were 19% slower. The perception gap is 39 points—and it compounds with every sprint, every code review, every feature shipped.

This is the cognitive load inversion. AI tools are excellent at offloading the cheap cognitive work—writing syntactically correct code, drafting boilerplate, suggesting function names—while generating a harder class of cognitive work: continuous evaluation of uncertain outputs. You didn't eliminate cognitive effort. You automated the easy half and handed yourself the hard half.

Compound AI Systems: When Your Pipeline Is Smarter Than Any Single Model

· 9 min read
Tian Pan
Software Engineer

There is a persistent assumption in AI engineering that the path to better outputs is a better model. Bigger context window, fresher training data, higher benchmark scores. In practice, the teams shipping the most capable AI products are usually doing something different: they are assembling pipelines where multiple specialized components — a retriever, a reranker, a classifier, a code interpreter, and one or more language models — cooperate to handle a task that no single model could do reliably on its own.

This architectural pattern has a name — compound AI systems — and it is now the dominant paradigm for production AI. Understanding how to build these systems correctly, and where they fail when you don't, is one of the most important skills in applied AI engineering today.

The Context Window Cliff: Application-Level Strategies for Long Conversations

· 10 min read
Tian Pan
Software Engineer

A 90-minute support session. A research assistant that's been browsing documents for an hour. A coding agent that's touched a dozen files. All of these eventually hit the same wall — and when they do, they don't crash loudly. They get dumb.

The model starts forgetting what was decided twenty minutes ago. It contradicts itself. Retrieval results that should be obvious go missing. Users notice something is off but can't articulate why the assistant got worse. This is the context window cliff: not a hard error, but a gradual quality collapse that your monitoring almost certainly doesn't measure.

Expanding the context window doesn't fix this. Models with million-token windows still degrade on content in the middle, and even when they don't, you're paying for 100x more tokens while the model attends to a fraction of them. The solution is application-level context management — deliberate strategies for what stays in the window, what gets summarized, and what lives outside it entirely.

Continuous Deployment for AI Models: Your Rollback Signal Is Wrong

· 10 min read
Tian Pan
Software Engineer

Your deployment pipeline is green. Latency is nominal. Error rate: 0.02%. The new model version shipped successfully — or so your dashboard says.

Meanwhile, your customer-facing AI is subtly summarizing documents with less precision, hedging on questions it used to answer directly, and occasionally flattening the structured outputs your downstream pipeline depends on. No alerts fire. No on-call page triggers. The first signal you get is a support ticket, two weeks later.

This is the silent regression problem in AI deployments. Traditional rollback signals — HTTP errors, p99 latency, exception rates — are built for deterministic software. They cannot see behavioral drift. And as teams upgrade language models more frequently, the gap between "infrastructure is healthy" and "AI is working correctly" becomes a place where regressions hide.

The Conversation Designer's Hidden Role in AI Product Quality

· 10 min read
Tian Pan
Software Engineer

Most engineering teams treat system prompts as configuration files — technical strings to be iterated on quickly, stored in environment variables, and deployed with the same ceremony as changing a timeout value. The system prompt gets an inline comment. The error messages get none. The capability disclosure is whatever the PM typed into the Notion doc on launch day.

This is the root cause of an entire class of AI product failures that don't show up in your eval suite. The model answers the question. The latency is fine. The JSON validates. But users stop trusting the product after three sessions, and the weekly active usage curve never recovers.

The missing discipline is conversation design. And it shapes output quality in ways that most engineering instrumentation is architecturally blind to.

The AI Feature Sunset Playbook: Decommissioning Agents Without Breaking Your Users

· 10 min read
Tian Pan
Software Engineer

Most teams discover the same thing at the worst possible time: retiring an AI feature is nothing like deprecating an API. You add a sunset date to the docs, send the usual three-email sequence, flip the flag — and then watch your support queue spike 80% while users loudly explain that the replacement "doesn't work the same way." What they mean is: the old agent's quirks, its specific failure modes, its particular brand of wrong answer, had all become load-bearing. They'd built workflows around behavior they couldn't name until it was gone.

This is the core problem with AI feature deprecation. Deterministic APIs have explicit contracts. If you remove an endpoint, every caller that relied on it gets a 404. The breakage is traceable, finite, and predictable. Probabilistic AI outputs are different — users don't integrate the contract, they integrate the behavioral distribution. Removing a model doesn't just remove a capability; it removes a specific pattern of behavior that users may have spent months adapting to without realizing it.

Designing for Partial Completion: When Your Agent Gets 70% Done and Stops

· 10 min read
Tian Pan
Software Engineer

Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.

The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.