780 posts tagged with "ai-engineering"

The Testing Pyramid Inverts for AI: Why Unit Tests Are the Wrong Investment for LLM Features

· 10 min read
Tian Pan
Software Engineer

Your team ships a new LLM feature. The unit tests pass. CI is green. You deploy. Then users start reporting that the AI "just doesn't work right" — answers are weirdly formatted, the agent picks the wrong tool, context gets lost halfway through a multi-step task. You look at the test suite and it's still green. Every test passes. The feature is broken.

This is not bad luck. It is what happens when you apply a deterministic testing philosophy to a probabilistic system. The classic testing pyramid — wide base of unit tests, smaller middle layer of integration tests, narrow top of end-to-end tests — rests on one assumption so fundamental that nobody writes it down: the code does the same thing every time. LLMs violate this assumption at every level. The testing strategy built on top of it needs to be rebuilt from scratch.
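One way to picture the shift is to replace a single pass/fail assertion with a pass rate measured over many samples. The sketch below is only an illustration under stated assumptions: `fake_llm_feature` is a hypothetical stand-in for a real model call, the 90% well-formed rate is made up, and the 0.8 threshold is an arbitrary example, not a recommendation from the post.

```python
import random


def fake_llm_feature(prompt, rng):
    """Hypothetical stand-in for a non-deterministic LLM call (assumption)."""
    # About 90% of calls return well-formed JSON; the rest return free text.
    if rng.random() < 0.9:
        return '{"answer": "42"}'
    return "Sure! The answer is 42."


def is_well_formed(output):
    """Crude format check: the output looks like a JSON object."""
    return output.startswith("{") and output.endswith("}")


def pass_rate(feature, check, prompt, trials, rng):
    """Run the feature `trials` times and return the fraction passing `check`."""
    passed = sum(check(feature(prompt, rng)) for _ in range(trials))
    return passed / trials


rng = random.Random(0)  # seeded so the run is repeatable
rate = pass_rate(fake_llm_feature, is_well_formed, "What is 6 * 7?", 200, rng)
print(f"pass rate over 200 trials: {rate:.2f}")

# Gate on a threshold instead of asserting one exact output.
THRESHOLD = 0.8  # arbitrary example value
print("gate passed" if rate >= THRESHOLD else "gate failed")
```

A single assertion on one sample would pass or fail depending on which output happened to come back; the pass rate makes the variability itself the thing being measured.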

Tokens Are a Finite Resource: A Budget Allocation Framework for Complex Agents

· 10 min read
Tian Pan
Software Engineer

Frontier models now advertise context windows of 200K, 1M, even 2M tokens. Engineering teams treat this as a solved problem and move on: the number is large, so surely we'll never hit it.

Then, six hours into an autonomous research task, the agent starts hallucinating the paths of files it edited three hours ago. A coding agent confidently opens a function it deleted in turn four. A document analysis pipeline begins contradicting conclusions it drew from the same document earlier in the session. These are not model failures. They are context budget failures: predictable, measurable, and almost entirely preventable if you treat the context window as the scarce compute resource it actually is.
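As a rough sketch of what treating the window as a budget could look like, the ledger below tracks token usage per category against fixed allocations. Everything here is an assumption made for illustration: the `ContextBudget` class, the category names, and the numbers are hypothetical and are not the framework from the post.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    """Hypothetical per-category token ledger for one context window."""
    limit: int                      # total window size in tokens
    allocations: dict               # category -> maximum tokens allowed
    used: dict = field(default_factory=dict)

    def __post_init__(self):
        # Allocations must fit inside the window.
        if sum(self.allocations.values()) > self.limit:
            raise ValueError("allocations exceed the context limit")

    def charge(self, category, tokens):
        """Record usage; return False (and record nothing) if over allocation."""
        current = self.used.get(category, 0)
        if current + tokens > self.allocations[category]:
            return False
        self.used[category] = current + tokens
        return True

    def remaining(self):
        """Tokens left in the whole window."""
        return self.limit - sum(self.used.values())


budget = ContextBudget(
    limit=200_000,
    allocations={"system": 5_000, "history": 120_000, "tool_output": 75_000},
)
first = budget.charge("tool_output", 60_000)   # fits within 75,000
second = budget.charge("tool_output", 20_000)  # would reach 80,000, so rejected
print(first, second, budget.remaining())
```

A rejected charge is the signal to summarize, evict, or truncate before the content enters the window, rather than discovering the overflow hours later.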

AI-Assisted Incident Response: How LLMs Change the SRE Playbook Without Replacing It

· 11 min read
Tian Pan
Software Engineer

Here is the paradox that nobody in the AIOps vendor space is advertising: organizations that invested over $1M in AI tooling for incident response saw their operational toil rise to 30% of engineering time—up from 25%, the first increase in five years. Teams expected the automation to replace manual work. Instead, they got a new job: verifying what the AI said before acting on it. The old tasks didn't go away. A verification layer appeared on top.

This is not an argument against AI in incident response. The same data shows a 40% reduction in mean time to resolution when AI is integrated well, and some teams report cutting investigation time from two hours to under thirty minutes. The argument is more precise: the failure modes of AI copilots are qualitatively different from the failure modes of traditional SRE tooling, and most teams aren't set up to catch them.

The AI Dependency Footprint: When Every Feature Adds a New Infrastructure Owner

· 9 min read
Tian Pan
Software Engineer

Your team shipped a RAG-powered search feature six months ago. It required a vector database, an embedding model, an annotation pipeline, a chunking service, and an evaluation harness. Each component made sense individually. But now you discover that three of those five components have no clear owner, two are running on engineers' personal cloud accounts, and one was quietly deprecated by its vendor without anyone noticing. The 3am page comes from a component nobody even remembers adding.

This is the AI dependency footprint problem: the compounding accumulation of infrastructure that each AI feature requires, combined with the organizational reality that teams rarely plan ownership for any of it before shipping.

AI Feature Decommissioning Forensics: What Dead Features Teach That Successful Ones Cannot

· 11 min read
Tian Pan
Software Engineer

Here's an uncomfortable pattern: the AI feature your team is about to launch next quarter already died at your company two years ago. It shipped under a different name, with a different prompt, solving a vaguely different problem, and it got quietly decommissioned after six months of flat adoption. Nobody wrote it up. Nobody connected the dots. The leading indicators that would have saved this cycle were sitting in dashboards that got archived along with the feature.

Most engineering orgs are elaborate machines for remembering successes. Launches get retrospectives, blog posts, internal celebrations. The features that got killed — the ones with 12% weekly active users despite a polished demo, the ones whose unit economics inverted when token costs compounded across a longer-than-expected tool chain, the ones users learned to trust, lost trust in, and then routed around — generate almost no institutional memory. And the failure patterns embedded in those deaths are exactly the ones your planning process has no way to price in.

The AI Incident Severity Taxonomy: When Is a Hallucination a Sev-0?

· 11 min read
Tian Pan
Software Engineer

A legal team's AI-powered research assistant fabricated three case citations and slipped them into a court filing. The citations looked plausible — real courts, real-sounding case names, coherent holdings. Nobody caught them before the brief was submitted. The incident cost the firm an emergency hearing, a public apology, and a bar inquiry.

Was that a sev-0? A sev-2? The answer depends on which framework you use — and traditional severity models will give you the wrong answer almost every time.

Software incident severity classification was built for deterministic systems. A service is either responding or it isn't. A database query either succeeds or throws an error. The failure modes are binary, the blame is traceable to a commit, and the fix is a rollback or a patch. AI systems break all three of those assumptions simultaneously, and organizations that apply traditional severity frameworks to LLM failures end up either panicking over noise or dismissing structural failures as one-off quirks.

AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

· 11 min read
Tian Pan
Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

AI Product Metrics That Don't Lie: Behavioral Signals Over Thumbs-Up Scores

· 9 min read
Tian Pan
Software Engineer

Your AI feature has a 4.2/5 satisfaction score. Users click thumbs-up 68% of the time. The A/B test shows task completion rate is up 12%. Your team ships it. Six weeks later, users have quietly routed around it for anything they actually care about.

This is metric theater. You optimized for signals that look like success but aren't. The feedback you collected came from the 8% of users who bother rating anything — skewed toward the delighted and the furious, silent on the vast middle who found the feature unreliable just often enough to stop trusting it.

Building AI features requires a different measurement philosophy than traditional software. The signals you instrument from day one determine whether you learn fast enough to improve or spend six months chasing a satisfaction score that doesn't move.

The AI Reliability Floor: Why 80% Accurate Is Worse Than No AI at All

· 9 min read
Tian Pan
Software Engineer

Most teams measure AI feature quality by asking "how often is it right?" The more useful question is "how often does being wrong destroy trust faster than being right builds it?" These questions have different answers — and only the second one tells you whether to ship.

There is a reliability floor below which an AI feature does more damage than no feature at all. Below it, users learn to distrust the AI after enough errors, and that distrust generalizes: they stop trusting the feature when it is correct, they route around it, and eventually they stop using it entirely. At that point, you have not shipped a partially-useful product; you have shipped a conversion and retention hazard disguised as a feature.

Stop Writing Prompts by Hand: Automated Optimization with DSPy and MIPRO

· 9 min read
Tian Pan
Software Engineer

You are going to spend an afternoon tuning a prompt. You'll move a sentence around, swap "classify" for "categorize," add a note about edge cases, and run spot-checks against a handful of examples you keep in a notebook. By end of day the prompt is marginally better — you think. You can't prove it. You don't have a reproducible baseline. A week later a colleague changes a few words and the whole thing regresses.

This is the current state of prompt engineering at most teams. DSPy is Stanford's answer to it. Rather than hand-authoring instruction prose, you declare what your LLM program should do, define a metric, and let an optimizer compile the actual prompts for you. MIPRO — the Multi-prompt Instruction PRoposal Optimizer — is the algorithm that makes this approach competitive with (and often better than) the human-crafted alternative.

The Cognitive Offloading Trap: When Your Team Can't Work Without the AI

· 9 min read
Tian Pan
Software Engineer

Three months after rolling out an AI coding assistant to its entire engineering team, a company noticed something disturbing: sprint velocity was up, but the code review pass rate had dropped 18% and the number of production incidents had climbed. When developers were asked to explain a recent AI-generated module during a post-mortem, nobody in the room could. Not even the person who merged it.

This is the cognitive offloading trap. And it's not a failure of AI tools — it's a failure of how teams integrate them.

Compound Failure Modes in AI Pipelines: When Partial Success Isn't Enough

· 9 min read
Tian Pan
Software Engineer

Most engineers building AI pipelines think about each component in isolation: how often does retrieval succeed, how often does the LLM do the right thing, how often does the downstream tool call land. If each answer comes back "95%," the system feels solid.

It isn't. Success rates multiply, so three components at 95% each give you an 86% reliable system. Add a fourth at 95% and you're at 81%. Add a fifth and you're at roughly 77%. What felt like a solid stack of high-quality components produces a pipeline that fails more than one in five requests before you've shipped a single feature.

That's the compound failure problem, and it's the calculation most AI engineering teams skip until users start filing tickets.
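The arithmetic is easy to check. This minimal sketch multiplies per-step success rates, assuming each step fails independently of the others (a simplifying assumption; correlated failures would change the numbers). The function name `pipeline_reliability` is illustrative.

```python
def pipeline_reliability(step_success_rates):
    """Probability that every step succeeds, assuming independent failures."""
    reliability = 1.0
    for rate in step_success_rates:
        reliability *= rate
    return reliability


for n in range(3, 6):
    print(n, round(pipeline_reliability([0.95] * n), 3))
# 3 0.857
# 4 0.815
# 5 0.774
```

Passing a list rather than a single rate makes it easy to plug in measured per-component numbers, which are rarely all equal in practice.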