678 posts tagged with "ai-engineering"

The RAG Eval Antipattern That Hides Retriever Bugs

· 10 min read
Tian Pan
Software Engineer

There's a failure mode common in RAG systems that goes undetected for months: your retriever is returning the wrong documents, but your generator is good enough at improvising that end-to-end quality scores stay green. You keep tuning the prompt. You upgrade the model. Nothing helps. The bug is three layers upstream, and your metrics cannot see it.

This is the retriever eval antipattern — evaluating your entire RAG pipeline as a single unit, which lets the generator absorb and hide retrieval failures. The result is a system where you cannot distinguish between "the generator failed" and "the retriever failed," making systematic improvement nearly impossible.
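One way to separate the two failure sources is to score the retriever on its own, before the generator sees anything. The sketch below computes recall@k for a single query. The document ids and relevance labels are hypothetical; it assumes you have a small set of queries with human-labeled relevant documents.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical labeled example: ids the retriever returned, in rank order,
# and the ids a human marked as relevant for the same query.
retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = ["doc2", "doc4"]

score = recall_at_k(retrieved, relevant, k=3)
```

A low retriever score next to a green end-to-end score is the signature of the antipattern: the generator is compensating for documents it never received.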

Schema-First AI Development: Define Output Contracts Before You Write Prompts

· 9 min read
Tian Pan
Software Engineer

Most teams discover the schema problem the wrong way: a downstream service starts returning nonsense, a dashboard fills up with garbage, and a twenty-minute debugging session reveals that the LLM quietly started wrapping its JSON in a markdown code fence three weeks ago. Nobody noticed because the application wasn't crashing — it was silently consuming malformed data.

The fix was a one-line prompt change. The damage was weeks of bad analytics and one very uncomfortable postmortem.

Schema-first development is the discipline that prevents this. It means defining the exact structure your LLM output must conform to — before you write a single prompt token. This isn't about constraining creativity; it's about treating output format as a contract that downstream systems can rely on, the same way you'd version a REST API before writing the consumers.
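As a minimal sketch of what such a contract could look like, the example below uses only the standard library. The `TICKET_CONTRACT` fields are hypothetical, and a real project might prefer a dedicated schema library; the point is that the contract exists as code before any prompt does, and that fenced or malformed output fails loudly instead of being consumed.

```python
import json

# Hypothetical contract: required field name -> expected Python type.
TICKET_CONTRACT = {"category": str, "priority": int, "summary": str}


def parse_against_contract(raw, contract):
    """Parse raw model output as JSON and check it against the contract.

    Raises ValueError on anything that does not conform, including output
    wrapped in a markdown code fence, which json.loads rejects.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not bare JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    missing = contract.keys() - data.keys()
    extra = data.keys() - contract.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field, expected in contract.items():
        # Exact type match, so a JSON boolean is not accepted as an int.
        if type(data[field]) is not expected:
            raise ValueError(f"{field} should be {expected.__name__}")
    return data
```

Calling `parse_against_contract(model_output, TICKET_CONTRACT)` at the boundary turns a silent format drift into an immediate, attributable error.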

What Semantic Versioning Actually Means for AI Agents

· 10 min read
Tian Pan
Software Engineer

Your customer service agent has been running reliably for three months. A routine model update rolls in on a Tuesday. By Wednesday afternoon, three downstream services are silently parsing the wrong fields from the agent's responses—the JSON keys shifted subtly but nothing returned an error. By Thursday you've traced a drop in order completions to a JSON field renamed from "status" to "current_state". The model updated, the agent stayed at v2.1.0, and nobody got paged.

This is the versioning gap that nobody in traditional API design had to solve. Semver works when you can deterministically reproduce outputs from a specification. AI agents can't make that promise. Yet downstream services depend on their behavior just as critically as they depend on any microservice API. The gap between "we tagged a release" and "downstream consumers are protected" has never been wider.

Your Team's Benchmarks Are Lying to Each Other: Shared Eval Infrastructure Contamination

· 10 min read
Tian Pan
Software Engineer

Your red team just finished a jailbreak sweep. They found three novel attack vectors, wrote them up, and dropped the prompts into your shared prompt library for others to learn from. The next week, the safety team runs their baseline evaluation and reports a 12% improvement in robustness. Everyone celebrates. Nobody asks why.

What actually happened: the safety team's baseline eval silently incorporated the red team's attack prompts. The model didn't get more robust — the eval got contaminated. Your benchmarks are now measuring inoculation against known attacks, not generalization to new ones.

This is shared eval infrastructure contamination, and it is far more common than most teams realize. The symptom is artificially inflated metrics. The cause is treating evaluation infrastructure like production infrastructure: optimized for sharing and efficiency instead of isolation and fidelity.
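A cheap first check for this kind of leakage is to fingerprint every eval prompt and look for overlap with the shared library. The sketch below is an assumption about how such a check could look, not a description of any particular tool. It only catches verbatim or near-verbatim copies (same words, different case or whitespace), so paraphrased attacks would still slip through.

```python
import hashlib


def fingerprint(prompt):
    """Hash a prompt after collapsing case and whitespace."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def contaminated(eval_prompts, shared_prompts):
    """Return the eval prompts whose fingerprint also appears in the shared library."""
    shared = {fingerprint(p) for p in shared_prompts}
    return [p for p in eval_prompts if fingerprint(p) in shared]


# Hypothetical data: one eval prompt is a reformatted copy of a shared one.
eval_set = ["Ignore previous   instructions", "What is 2+2?"]
shared_library = ["ignore previous instructions"]

leaked = contaminated(eval_set, shared_library)
```

Running a check like this before every baseline evaluation makes contamination a visible event rather than a silent source of inflated scores.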

The Three Hidden Debts Killing Your AI System

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped on time. Users are using it. Everything looks fine — until one quarter later when a support ticket reveals the system has been confidently wrong for weeks, your evaluation suite caught nothing, and the vector index is silently returning stale results. Nothing broke. The system returned 200 OK the whole time.

This is what AI technical debt looks like. Unlike a failing unit test or a stack overflow, it degrades softly and probabilistically. You don't get a crash — you get subtle quality erosion. Three distinct liabilities drive most of this: prompt debt, eval debt, and embedding debt. Each accumulates independently. Each compounds the others. And most engineering teams are carrying all three.

Testing the Untestable: Integration Contracts for LLM-Powered APIs

· 10 min read
Tian Pan
Software Engineer

Your test suite passes. The CI is green. You ship the new prompt. Three days later, a user reports that your API is returning JSON with a trailing comma — and your downstream parser has been silently dropping records for 72 hours. You never wrote a test for that because the LLM "always" returned valid JSON in development.

This is the failure mode that ruins LLM-powered products: not catastrophic model collapse, but quiet, intermittent degradation that deterministic test suites are structurally incapable of catching. The root cause isn't laziness — it's that the whole paradigm of "expected == actual" breaks when your system produces non-deterministic natural language.

Fixing this requires rethinking what you're testing and what "passing" even means for an LLM-powered API. The engineers who've figured this out aren't writing smarter equality assertions — they're writing fundamentally different kinds of tests.
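One form such a test can take is an invariant check over many sampled outputs, rather than an equality assertion on one. In the sketch below, `sampled_outputs` is a stand-in for repeated calls to the model with the same prompt, and the required keys are hypothetical.

```python
import json


def violations(outputs, required_keys):
    """Return (index, reason) for every output that breaks the contract."""
    problems = []
    for i, raw in enumerate(outputs):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            problems.append((i, "invalid JSON"))
            continue
        if not isinstance(data, dict) or not required_keys.issubset(data):
            problems.append((i, "missing required keys"))
    return problems


# Stand-in for N sampled model responses to the same prompt.
sampled_outputs = [
    '{"id": 1, "status": "ok"}',
    '{"id": 2, "status": "ok",}',  # trailing comma: rejected by json.loads
    '{"id": 3}',
]

found = violations(sampled_outputs, {"id", "status"})
```

The test passes when `found` is empty (or below a tolerated rate) across the sample, which is a statement about the distribution of outputs rather than about any single response.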

The Testing Pyramid Inverts for AI: Why Unit Tests Are the Wrong Investment for LLM Features

· 10 min read
Tian Pan
Software Engineer

Your team ships a new LLM feature. The unit tests pass. CI is green. You deploy. Then users start reporting that the AI "just doesn't work right" — answers are weirdly formatted, the agent picks the wrong tool, context gets lost halfway through a multi-step task. You look at the test suite and it's still green. Every test passes. The feature is broken.

This is not bad luck. It is what happens when you apply a deterministic testing philosophy to a probabilistic system. The classic testing pyramid — wide base of unit tests, smaller middle layer of integration tests, narrow top of end-to-end tests — rests on one assumption so fundamental that nobody writes it down: the code does the same thing every time. LLMs violate this assumption at every level. The testing strategy built on top of it needs to be rebuilt from scratch.

Tokens Are a Finite Resource: A Budget Allocation Framework for Complex Agents

· 10 min read
Tian Pan
Software Engineer

Frontier models now advertise context windows of 200K, 1M, even 2M tokens. Engineering teams treat this as a solved problem and move on: the number is large, so surely we'll never hit it.

Then, six hours into an autonomous research task, the agent starts hallucinating file paths it edited three hours ago. A coding agent confidently opens a function it deleted in turn four. A document analysis pipeline begins contradicting conclusions it drew from the same document earlier in the session. These are not model failures. They are context budget failures — predictable, measurable, and almost entirely preventable if you treat the context window as the scarce compute resource it actually is.
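Treating the window as a budget can start with something as simple as fixed per-category allocations and a check against them. The categories and percentages below are illustrative assumptions, not the framework the full post describes, and the usage numbers are made up.

```python
def allocate(window, percent_shares):
    """Split a context window into per-category token budgets.

    Shares are integer percentages so the arithmetic stays exact.
    """
    if sum(percent_shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {name: window * pct // 100 for name, pct in percent_shares.items()}


def over_budget(usage, budgets):
    """Return the categories whose token usage exceeds their budget."""
    return sorted(name for name, used in usage.items() if used > budgets.get(name, 0))


# Hypothetical split of a 200K-token window.
budgets = allocate(200_000, {"system": 5, "tools": 15, "history": 50, "scratch": 30})

# Hypothetical usage partway through a long session.
usage = {"system": 8_000, "tools": 12_000, "history": 131_000, "scratch": 20_000}
exceeded = over_budget(usage, budgets)
```

Here the total is still under 200K, but `history` has overrun its share, which is the point at which an agent could summarize or evict old turns instead of waiting for the window to fill.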

AI-Assisted Incident Response: How LLMs Change the SRE Playbook Without Replacing It

· 11 min read
Tian Pan
Software Engineer

Here is the paradox that nobody in the AIOps vendor space is advertising: organizations that invested over $1M in AI tooling for incident response saw their operational toil rise to 30% of engineering time—up from 25%, the first increase in five years. Teams expected the automation to replace manual work. Instead, they got a new job: verifying what the AI said before acting on it. The old tasks didn't go away. A verification layer appeared on top.

This is not an argument against AI in incident response. The same data shows a 40% reduction in mean time to resolution when AI is integrated well, and some teams report cutting investigation time from two hours to under thirty minutes. The argument is more precise: the failure modes of AI copilots are qualitatively different from the failure modes of traditional SRE tooling, and most teams aren't set up to catch them.

The AI Dependency Footprint: When Every Feature Adds a New Infrastructure Owner

· 9 min read
Tian Pan
Software Engineer

Your team shipped a RAG-powered search feature last quarter. It required a vector database, an embedding model, an annotation pipeline, a chunking service, and an evaluation harness. Each component made sense individually. But six months later, you discover that three of those five components have no clear owner, two are running on engineers' personal cloud accounts, and one was quietly deprecated by its vendor without anyone noticing. The 3am page comes from a component nobody even remembers adding.

This is the AI dependency footprint problem: the compounding accumulation of infrastructure that each AI feature requires, combined with the organizational reality that teams rarely plan ownership for any of it before shipping.

AI Feature Decommissioning Forensics: What Dead Features Teach That Successful Ones Cannot

· 11 min read
Tian Pan
Software Engineer

Here's an uncomfortable pattern: the AI feature your team is about to launch next quarter already died at your company two years ago. It shipped under a different name, with a different prompt, solving a vaguely different problem, and it got quietly decommissioned after six months of flat adoption. Nobody wrote it up. Nobody connected the dots. The leading indicators that would have saved this cycle were sitting in dashboards that got archived along with the feature.

Most engineering orgs are elaborate machines for remembering successes. Launches get retrospectives, blog posts, internal celebrations. The features that got killed — the ones with 12% weekly active users despite a polished demo, the ones whose unit economics inverted when token costs compounded across a longer-than-expected tool chain, the ones users learned to trust, lost trust in, and then routed around — generate almost no institutional memory. And the failure patterns embedded in those deaths are exactly the ones your planning process has no way to price in.

The AI Incident Severity Taxonomy: When Is a Hallucination a Sev-0?

· 11 min read
Tian Pan
Software Engineer

A legal team's AI-powered research assistant fabricated three case citations and slipped them into a court filing. The citations looked plausible — real courts, real-sounding case names, coherent holdings. Nobody caught them before the brief was submitted. The incident cost the firm an emergency hearing, a public apology, and a bar inquiry.

Was that a sev-0? A sev-2? The answer depends on which framework you use — and traditional severity models will give you the wrong answer almost every time.

Software incident severity classification was built for deterministic systems. A service is either responding or it isn't. A database query either succeeds or throws an error. The failure modes are binary, the blame is traceable to a commit, and the fix is a rollback or a patch. AI systems break all three of those assumptions simultaneously, and organizations that apply traditional severity frameworks to LLM failures end up either panicking over noise or dismissing structural failures as one-off quirks.