Skip to main content

780 posts tagged with "ai-engineering"

View all tags

The On-Call Runbook for AI Systems That Nobody Writes

· 10 min read
Tian Pan
Software Engineer

Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.

This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.

The Pilot Graveyard: Why Enterprise AI Rollouts Fail After the Demo

· 10 min read
Tian Pan
Software Engineer

Your AI demo was genuinely impressive. The executive audience nodded, the VP of Engineering said "this is the future," and the pilot was approved with real budget. Six months later, weekly active users have plateaued at 12%. The tool gets a polite mention in all-hands. Nobody has the heart to call it dead. This is the pilot graveyard — where good demos go to die.

It's not a rare failure. Roughly 88% of enterprise AI pilots never reach production. Only 6% of enterprises have successfully moved generative AI projects beyond pilot to production at any meaningful scale. The gap between "impressive in the conference room" and "load-bearing in the daily workflow" is where most enterprise AI investment disappears.

The reason isn't the model. It's everything that happens after the demo.

Post-Training Alignment for Product Engineers: What RLHF, DPO, and RLAIF Actually Mean for You

· 11 min read
Tian Pan
Software Engineer

Most teams building AI features assume that once they ship, user feedback becomes a resource they can tap later. Log the thumbs-up and thumbs-down signals, accumulate enough volume, and eventually fine-tune. The reality is more treacherous: a year of logged reactions is not the same as a year of alignment-quality training data. The gap between the two is where alignment techniques — RLHF, DPO, RLAIF — either save you or surprise you.

This post is not a survey of alignment research. It's a decision guide for engineers who need to understand what these techniques require from a data-collection perspective, so that what you instrument today actually enables the fine-tuning you're planning for six months from now.

Prompt Injection Detection at 100,000 Requests Per Day: Why Simple Defenses Break and What Actually Works

· 11 min read
Tian Pan
Software Engineer

Most teams discover their prompt injection defense is broken after a user finds it, not before. You add "ignore all previous instructions" to your blocklist and ship. Three months later an attacker encodes the payload in Base64, or buries instructions in HTML comments retrieved via RAG, or uses typoglycemia ("ignroe all prevuois insrtucioins"), and your entire defense evaporates. The blocklist doesn't help because prompt injection has an unbounded attack surface — there is no closed vocabulary of malicious inputs.

At low traffic volumes you can absorb the cost of calling a second LLM to validate each request. At 100,000 requests per day, that math becomes ruinous and the latency becomes user-visible. This post is about what the architecture looks like when brute-force approaches stop working.

The Prompt-Model Coupling Trap: Why Your Prompts Only Speak One Model's Dialect

· 10 min read
Tian Pan
Software Engineer

Most prompt migrations look fine in staging. Ninety percent of test cases pass, the new model's responses feel crisper, and the demo runs cleanly. Then you ship, and within two days your structured output parser is throwing exceptions on 12% of responses, a customer-facing classification pipeline started returning wrong labels, and a tool-calling agent is looping on a schema it used to handle without issue. Nobody changed the prompts. The model changed.

This is the prompt-model coupling trap: prompts that work reliably on one model silently accumulate dependencies on that model's specific behavioral quirks, and those dependencies are invisible until migration day.

Property-Based Testing for LLM Outputs: Finding the Bugs Your Eval Set Never Imagined

· 11 min read
Tian Pan
Software Engineer

Your eval suite says 94% accuracy. Users report the feature is broken for names that aren't "John" or "Alice." Both things are true, and the gap between them has a name: your curated test set encodes only the failure modes you already imagined.

Property-based testing (PBT) was invented in 1999 to expose exactly this class of blind spot in deterministic software. Applied to LLM outputs, it generates tens of thousands of adversarial input variants automatically, probing domain boundaries that hand-written test cases structurally cannot reach. A 2025 OOPSLA study found that on average each property-based test discovers approximately 50 times as many mutant bugs as the average unit test. A separate study measured that PBT and example-based testing (EBT) fail on different bugs — combining both raised detection rates from 68.75% to 81.25%. That 12.5-point gap is not rounding error; it represents an entire class of failure invisible to one approach.

This article is for engineers who already have eval suites and want to find the bugs those suites structurally cannot find.

The RAG Eval Antipattern That Hides Retriever Bugs

· 10 min read
Tian Pan
Software Engineer

There's a failure mode common in RAG systems that goes undetected for months: your retriever is returning the wrong documents, but your generator is good enough at improvising that end-to-end quality scores stay green. You keep tuning the prompt. You upgrade the model. Nothing helps. The bug is three layers upstream and your metrics are invisible to it.

This is the retriever eval antipattern — evaluating your entire RAG pipeline as a single unit, which lets the generator absorb and hide retrieval failures. The result is a system where you cannot distinguish between "the generator failed" and "the retriever failed," making systematic improvement nearly impossible.

Schema-First AI Development: Define Output Contracts Before You Write Prompts

· 9 min read
Tian Pan
Software Engineer

Most teams discover the schema problem the wrong way: a downstream service starts returning nonsense, a dashboard fills up with garbage, and a twenty-minute debugging session reveals that the LLM quietly started wrapping its JSON in a markdown code fence three weeks ago. Nobody noticed because the application wasn't crashing — it was silently consuming malformed data.

The fix was a one-line prompt change. The damage was weeks of bad analytics and one very uncomfortable postmortem.

Schema-first development is the discipline that prevents this. It means defining the exact structure your LLM output must conform to — before you write a single prompt token. This isn't about constraining creativity; it's about treating output format as a contract that downstream systems can rely on, the same way you'd version a REST API before writing the consumers.

What Semantic Versioning Actually Means for AI Agents

· 10 min read
Tian Pan
Software Engineer

Your customer service agent has been running reliably for three months. A routine model update rolls in on a Tuesday. By Wednesday afternoon, three downstream services are silently parsing the wrong fields from the agent's responses—the JSON keys shifted subtly but nothing returned an error. By Thursday you've traced a drop in order completions to a JSON field renamed from "status" to "current_state". The model updated, the agent stayed at v2.1.0, and nobody got paged.

This is the versioning gap that nobody in traditional API design had to solve. Semver works when you can deterministically reproduce outputs from a specification. AI agents can't make that promise. Yet downstream services depend on their behavior just as critically as they depend on any microservice API. The gap between "we tagged a release" and "downstream consumers are protected" has never been wider.

Your Team's Benchmarks Are Lying to Each Other: Shared Eval Infrastructure Contamination

· 10 min read
Tian Pan
Software Engineer

Your red team just finished a jailbreak sweep. They found three novel attack vectors, wrote them up, and dropped the prompts into your shared prompt library for others to learn from. The next week, the safety team runs their baseline evaluation and reports a 12% improvement in robustness. Everyone celebrates. Nobody asks why.

What actually happened: the safety team's baseline eval silently incorporated the red team's attack prompts. The model didn't get more robust — the eval got contaminated. Your benchmarks are now measuring inoculation against known attacks, not generalization to new ones.

This is shared eval infrastructure contamination, and it is far more common than most teams realize. The symptom is artificially inflating metrics. The cause is treating evaluation infrastructure like production infrastructure — optimized for sharing and efficiency, instead of isolation and fidelity.

The Three Hidden Debts Killing Your AI System

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped on time. Users are using it. Everything looks fine — until one quarter later when a support ticket reveals the system has been confidently wrong for weeks, your evaluation suite caught nothing, and the vector index is silently returning stale results. Nothing broke. The system returned 200 OK the whole time.

This is what AI technical debt looks like. Unlike a failing unit test or a stack overflow, it degrades softly and probabilistically. You don't get a crash — you get subtle quality erosion. Three distinct liabilities drive most of this: prompt debt, eval debt, and embedding debt. Each accumulates independently. Each compounds the others. And most engineering teams are carrying all three.

Testing the Untestable: Integration Contracts for LLM-Powered APIs

· 10 min read
Tian Pan
Software Engineer

Your test suite passes. The CI is green. You ship the new prompt. Three days later, a user reports that your API is returning JSON with a trailing comma — and your downstream parser has been silently dropping records for 72 hours. You never wrote a test for that because the LLM "always" returned valid JSON in development.

This is the failure mode that ruins LLM-powered products: not catastrophic model collapse, but quiet, intermittent degradation that deterministic test suites are structurally incapable of catching. The root cause isn't laziness — it's that the whole paradigm of "expected == actual" breaks when your system produces non-deterministic natural language.

Fixing this requires rethinking what you're testing and what "passing" even means for an LLM-powered API. The engineers who've figured this out aren't writing smarter equality assertions — they're writing fundamentally different kinds of tests.