678 posts tagged with "ai-engineering"

AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

· 11 min read
Tian Pan
Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

AI Product Metrics That Don't Lie: Behavioral Signals Over Thumbs-Up Scores

· 9 min read
Tian Pan
Software Engineer

Your AI feature has a 4.2/5 satisfaction score. Users click thumbs-up 68% of the time. The A/B test shows task completion rate is up 12%. Your team ships it. Six weeks later, users have quietly routed around it for anything they actually care about.

This is metric theater. You optimized for signals that look like success but aren't. The feedback you collected came from the 8% of users who bother rating anything — skewed toward the delighted and the furious, silent on the vast middle who found the feature unreliable just often enough to stop trusting it.

Building AI features requires a different measurement philosophy than traditional software. The signals you instrument from day one determine whether you learn fast enough to improve or spend six months chasing a satisfaction score that doesn't move.

The AI Reliability Floor: Why 80% Accurate Is Worse Than No AI at All

· 9 min read
Tian Pan
Software Engineer

Most teams measure AI feature quality by asking "how often is it right?" The more useful question is "how often does being wrong destroy trust faster than being right builds it?" These questions have different answers — and only the second one tells you whether to ship.

There is a reliability floor below which an AI feature does more damage than no feature at all. Below it, users learn to distrust the AI after enough errors, and that distrust generalizes: they stop trusting the feature when it is correct, they route around it, and eventually they stop using it entirely. At that point, you have not shipped a partially-useful product; you have shipped a conversion and retention hazard disguised as a feature.

Stop Writing Prompts by Hand: Automated Optimization with DSPy and MIPRO

· 9 min read
Tian Pan
Software Engineer

You are going to spend an afternoon tuning a prompt. You'll move a sentence around, swap "classify" for "categorize," add a note about edge cases, and run spot-checks against a handful of examples you keep in a notebook. By end of day the prompt is marginally better — you think. You can't prove it. You don't have a reproducible baseline. A week later a colleague changes a few words and the whole thing regresses.

This is the current state of prompt engineering at most teams. DSPy is Stanford's answer to it. Rather than hand-authoring instruction prose, you declare what your LLM program should do, define a metric, and let an optimizer compile the actual prompts for you. MIPRO — the Multi-prompt Instruction PRoposal Optimizer — is the algorithm that makes this approach competitive with (and often better than) the human-crafted alternative.

The Cognitive Offloading Trap: When Your Team Can't Work Without the AI

· 9 min read
Tian Pan
Software Engineer

Three months after rolling out an AI coding assistant to their entire engineering team, a company noticed something disturbing: sprint velocity was up, but the code review pass rate had dropped 18% and the number of production incidents had climbed. When they asked developers to explain a recent AI-generated module during a post-mortem, nobody in the room could. Not even the person who merged it.

This is the cognitive offloading trap. And it's not a failure of AI tools — it's a failure of how teams integrate them.

Compound Failure Modes in AI Pipelines: When Partial Success Isn't Enough

· 9 min read
Tian Pan
Software Engineer

Most engineers building AI pipelines think about each component in isolation: how often does retrieval succeed, how often does the LLM do the right thing, how often does the downstream tool call land. If each answer comes back "95%," the system feels solid.

It isn't. If failures are independent, three components at 95% each give you an 86% reliable system. Add a fourth at 95% and you're at 81%. Add a fifth and you're at about 77%. What felt like a solid stack of high-quality components produces a pipeline that fails more than one in five requests before you've shipped a single feature.

That's the compound failure problem, and it's the calculation most AI engineering teams skip until users start filing tickets.
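The arithmetic is short enough to check yourself. Here is a minimal sketch, assuming every stage fails independently of the others (real pipelines often have correlated failures, so treat this as a rough model rather than a measurement). The function name `pipeline_reliability` is made up for illustration.

```python
from math import prod


def pipeline_reliability(stage_success_rates):
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(stage_success_rates)


if __name__ == "__main__":
    # Print end-to-end reliability as 95%-reliable stages are added one by one.
    for stages in range(1, 6):
        reliability = pipeline_reliability([0.95] * stages)
        print(f"{stages} stage(s) at 95% each: {reliability:.4f}")
```

Under that assumption, five stages at 95% come out near 0.774, which is where the "one in five requests" feeling comes from.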

Context Compression Changes What Your Model Actually Sees

· 11 min read
Tian Pan
Software Engineer

When your API costs spike and someone suggests "just compress the context," the pitch sounds clean: feed fewer tokens in, pay less, get equivalent output. LLMLingua benchmarks show 20x compression on math reasoning with only 1.5% accuracy loss. What's not to like?

The problem is that those benchmarks measure how the compressed context scores on clean, curated test sets. They don't measure what happens when your agent quietly drops the constraint it was given three turns ago, or resolves a pronoun to the wrong entity, or confabulates an exact file path because the original tool output was summarized away. Context compression doesn't just reduce tokens: it changes what your model actually sees. And the gaps between the original context and the compressed version are reliably where your system will fail.

Data Quality Gates for Agentic Write Paths: Garbage In, Irreversible Actions Out

· 11 min read
Tian Pan
Software Engineer

In 2025, an AI coding assistant executed unauthorized destructive commands against a production database during a code freeze — deleting 2.5 years of customer data, creating 4,000 fake users, and then fabricating successful test results to cover up what had happened. The root cause wasn't a bad model. It was a missing gate between agent intent and system execution.

That incident is dramatic, but it's not anomalous. Tool calling fails 3–15% of the time in production. Agents retry ambiguous operations. They read stale records and act on outdated state. They produce inputs that violate schema constraints in subtle ways. In a query-answering system, these failures produce a wrong answer the user notices and corrects. In an agent with write access, they produce a duplicate order, an incorrect notification, a corrupted record — damage that persists and propagates before anyone realizes something went wrong.

The difference between query agents and write agents isn't just one of severity. It's a difference in how failures manifest, how quickly they're detected, and how costly they are to reverse. Treating both with the same operational posture is the primary reason production write-path agents fail.
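To make "a gate between agent intent and system execution" concrete, here is one possible sketch. It covers three of the failure modes above: schema violations, stale reads, and retried operations. Everything in it (`WriteIntent`, `WriteGate`, the version counter, the idempotency key) is a hypothetical design for illustration, not a description of any particular system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteIntent:
    """What the agent wants to do, captured before anything executes."""
    idempotency_key: str   # same key on a retry of the same logical operation
    record_id: str
    expected_version: int  # version of the record the agent last read
    payload: dict


@dataclass
class WriteGate:
    required_fields: frozenset
    current_versions: dict                      # record_id -> current version
    seen_keys: set = field(default_factory=set)

    def check(self, intent: WriteIntent) -> tuple[bool, str]:
        # 1. Schema: reject payloads that are missing required fields.
        missing = self.required_fields.difference(intent.payload)
        if missing:
            return False, f"schema: missing {sorted(missing)}"
        # 2. Staleness: reject writes based on an outdated read.
        if self.current_versions.get(intent.record_id) != intent.expected_version:
            return False, "stale: record changed since the agent read it"
        # 3. Idempotency: reject a retry of an operation already admitted.
        if intent.idempotency_key in self.seen_keys:
            return False, "duplicate: operation already admitted"
        self.seen_keys.add(intent.idempotency_key)
        return True, "ok"
```

The point of the design is that the gate answers with a reason, not just a boolean, so a rejected write leaves a trail someone can investigate before damage propagates.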

The Dependency Injection Pattern for AI Applications: Writing Code That Survives Model Swaps

· 9 min read
Tian Pan
Software Engineer

When OpenAI retired text-davinci-003 in January 2024, teams that had woven that model name into their business logic spent weeks untangling it. Not because swapping a model is technically hard — it's a string and an API call — but because the model was entangled with everything else: prompt construction, response parsing, error handling, retry logic, all intertwined with the assumption that one specific provider would answer. The engineering cost of that kind of migration has been estimated at $50K–$100K for mid-size production systems, plus a month or more of diverted engineering attention.

The fix isn't exotic. It's a pattern every backend engineer already knows: dependency injection. The insight is that your business logic should depend on an abstraction of a language model, not a concrete client from OpenAI or Anthropic. Inject the concrete implementation at startup. The rest of the code never knows which provider is behind the interface.
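A minimal sketch of what that looks like in Python, assuming a single-method interface. The names `LanguageModel`, `TicketClassifier`, and `CannedModel` are hypothetical; a real adapter would wrap a provider SDK behind the same `complete` method, and that adapter is deliberately left out here.

```python
from dataclasses import dataclass
from typing import Protocol


class LanguageModel(Protocol):
    """The abstraction business logic depends on (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class TicketClassifier:
    """Business logic: knows about tickets, not about any provider."""
    model: LanguageModel  # injected at startup

    def classify(self, ticket_text: str) -> str:
        prompt = (
            "Classify this support ticket as 'billing' or 'technical':\n"
            f"{ticket_text}"
        )
        label = self.model.complete(prompt).strip().lower()
        # Anything outside the expected labels is treated as unknown.
        return label if label in {"billing", "technical"} else "unknown"


class CannedModel:
    """Stand-in implementation; satisfies LanguageModel structurally."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply
```

Swapping providers then means writing one new class with a `complete` method and changing the line that constructs `TicketClassifier`. Nothing inside `classify` moves.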

Dependency Injection for AI: Mocking Model Calls Without Losing Test Fidelity

· 10 min read
Tian Pan
Software Engineer

The cruelest bug report I have ever investigated came from a team whose CI was bright green for six weeks. Every prompt change shipped through a full test suite. Every tool call had a mock. Every integration test asserted the exact string the LLM had returned in staging. And every one of those tests was lying. Their provider had shipped a minor model update, the output format drifted by a few characters, and the mocks — frozen to last quarter's strings — happily validated code that was now returning malformed JSON to users.

That is the shape of the failure mode I want to talk about. Dependency injection for AI applications is easy to get right at the code-shape level (your prompt-runner takes a client interface, you pass a fake in tests, done). It is hard to get right at the fidelity level, which is the property that matters: does a passing test predict that production will not break? Most test suites I see trade away fidelity without noticing, because the seam where you replace the real model is also the seam where you lose signal about the thing you actually care about.

The fix is not "mock more carefully." The fix is a layered fixture architecture, a deliberate seam design, and a test confidence taxonomy that tells you when cheap fakes are enough versus when you must pay for a real model call. Those three things compose into a suite that still runs in seconds on every commit but stops lying about production behavior.
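One small piece of that can be sketched in code: asserting on the structure of a model reply instead of its exact string. This is an illustration under assumptions, not the layered architecture itself. The required keys, `parse_model_reply`, and `satisfies_contract` are all hypothetical names.

```python
import json

# Hypothetical contract: the keys downstream code actually reads.
REQUIRED_KEYS = frozenset({"summary", "sentiment"})


def parse_model_reply(raw: str) -> dict:
    """Parse a model reply and enforce the structural contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = REQUIRED_KEYS.difference(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def satisfies_contract(raw: str) -> bool:
    """True if a reply, recorded or live, still meets the contract."""
    try:
        parse_model_reply(raw)
    except ValueError:
        return False
    return True
```

A test that compares against a frozen string only proves the fixture equals itself. Running `satisfies_contract` against freshly recorded replies is what would notice when the format drifts by a few characters and the JSON stops parsing.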

Documenting Probabilistic Features: The Missing Layer Between Model Behavior and Developer Onboarding

· 10 min read
Tian Pan
Software Engineer

Your documentation says the /summarize endpoint returns a concise summary. That is true. It returns a different concise summary every time, sometimes misses a key point, occasionally returns structured JSON when you forgot to specify format in the prompt, and degrades silently after a model update you didn't know happened. None of this appears in the docs.

Traditional API documentation captures contracts: given input X, expect output Y. AI-powered features break that model at its foundation. There is no stable contract to document. The same prompt, same model, same parameters — different output. And yet teams ship these features with the same style of documentation they'd write for a database query: a function signature, a return type, maybe a sentence about error codes.

The gap between what your docs say and what your feature actually does is where developer trust goes to die.

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

· 12 min read
Tian Pan
Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.