Skip to main content

678 posts tagged with "ai-engineering"

View all tags

The Jagged Frontier: Why AI Fails at Easy Things and What It Means for Your Product

· 10 min read
Tian Pan
Software Engineer

A common assumption in AI product development goes something like this: if a model can handle a hard task, it can definitely handle an easier one nearby. This assumption is wrong, and it's responsible for a category of production failures that no amount of benchmark reading prepares you for.

The research term for the underlying phenomenon is the "jagged frontier" — AI's capability boundary isn't a smooth line that hard tasks sit outside of and easy tasks sit inside. It's a ragged, unpredictable shape. AI systems can write production-grade database query optimizers and still miscalculate whether two line segments on a diagram intersect. They can pass PhD-level science exams and fail children's riddle questions that involve spatial relationships. They can synthesize 50-page documents and then confidently hallucinate a summary of a paragraph they just read.

Why LLMs Make Confident Mistakes When Analyzing Your Product Data

· 11 min read
Tian Pan
Software Engineer

Product teams have started routing analytical questions directly to LLMs: "What's causing the churn spike?" "Why did conversion drop after the redesign?" "Which cohort should we focus retention spend on?" The outputs land in executive decks, drive roadmap decisions, and get presented to investors. The models answer confidently, in polished prose, with specific numbers. And a significant fraction of those answers are wrong in ways that don't announce themselves.

This isn't a general criticism of LLMs for data work. There are tasks where they genuinely help. The problem is that the failure modes are invisible — the model doesn't hedge, doesn't caveat, and doesn't distinguish between "I computed this from your data" and "I generated something that sounds like what this number should be." Practitioners who understand where the breakdowns happen can capture the genuine value and route around the landmines.

The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features

· 10 min read
Tian Pan
Software Engineer

Model routing is the first optimization most teams reach for. Route simple queries to a small cheap model, complex ones to a large capable model. It works well for managing cost and throughput. What it cannot fix is the wall you hit when the physics of cloud inference collide with a latency requirement of 100ms or less. A network round-trip from a mid-tier data center already consumes 30–80ms before a single token is generated. At that point, routing is irrelevant — you need to either run the model closer to the user or run a substantially smaller model. Both paths require compression decisions that most teams approach without a framework.

This is a guide for making those decisions. The three techniques — quantization, knowledge distillation, and on-device deployment — solve overlapping problems but have very different cost structures, quality profiles, and operational consequences.

The Multi-Turn Session State Collapse Problem

· 10 min read
Tian Pan
Software Engineer

Your per-request error rates look clean. Latency is within SLO. The LLM judge is scoring outputs at 87%. And then a user files a support ticket: "I told the bot my account number three times. It just asked me again." A different user: "It agreed to a refund, then two turns later denied the policy existed."

Single-turn failures are visible. The request comes in, the model hallucinates or refuses, your eval catches it, you fix the prompt. The feedback loop is tight. Multi-turn failures work differently: the session starts fine, degrades gradually turn by turn, and your monitoring never fires because each individual response is technically coherent. The problem is the session as a whole — and almost no team instruments for that.

Research across major frontier models (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro) shows an average 39% performance drop when moving from single-turn to multi-turn conversations. That number hides the real story: only about 16% of the drop is capability loss. The other 23 points are a reliability crisis — the gap between a model's best and worst performance on the same task doubles as conversation length grows. You're not just getting worse outputs; you're getting inconsistent ones.

Multi-User Shared AI Sessions: The Concurrency Problem Nobody Has Solved

· 12 min read
Tian Pan
Software Engineer

Most AI products are built for a single user with a single intent, a single conversation thread, and a single identity. This works well enough when the product is a personal productivity tool—a writing assistant, a code completion engine, a summarizer. But something happens when teams start using AI collaboratively: the product silently breaks in ways that are hard to diagnose and harder to fix. Two users prompt the AI simultaneously, and one of their inputs disappears. A context window shared across five engineers fills up with duplicated history. The AI responds to user A's question using user B's permissions. Nobody designed for any of this, because shipping multi-user shared context means confronting one of the hardest distributed systems problems in modern AI infrastructure.

This post is about what actually makes simultaneous multi-user AI sessions hard, what production teams have tried, and what the emerging architectural patterns are. If you are building a collaborative AI feature and wondering why it feels impossibly complex, this is why.

On-Call for AI Systems: Incident Response When the Bug Is the Model

· 11 min read
Tian Pan
Software Engineer

Your monitoring is green. Latency is nominal. Error rates are flat. And yet your customer support AI just told 10,000 users that returns are free — permanently — a policy that doesn't exist. No alert fired. No deploy happened. The model just decided to.

This is what on-call looks like for AI systems: a class of production failure that doesn't trigger the alarms you built, can't be traced to a line of code, and can't be fixed by rolling back the last deploy. Standard incident response playbooks — check the logs, identify the commit, revert the change, verify recovery — were designed for deterministic systems. Applied to LLMs, they miss the actual failure mode entirely.

Here's what actually works.

The On-Call Runbook for AI Systems That Nobody Writes

· 10 min read
Tian Pan
Software Engineer

Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.

This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.

The Pilot Graveyard: Why Enterprise AI Rollouts Fail After the Demo

· 10 min read
Tian Pan
Software Engineer

Your AI demo was genuinely impressive. The executive audience nodded, the VP of Engineering said "this is the future," and the pilot was approved with real budget. Six months later, weekly active users have plateaued at 12%. The tool gets a polite mention in all-hands. Nobody has the heart to call it dead. This is the pilot graveyard — where good demos go to die.

It's not a rare failure. Roughly 88% of enterprise AI pilots never reach production. Only 6% of enterprises have successfully moved generative AI projects beyond pilot to production at any meaningful scale. The gap between "impressive in the conference room" and "load-bearing in the daily workflow" is where most enterprise AI investment disappears.

The reason isn't the model. It's everything that happens after the demo.

Post-Training Alignment for Product Engineers: What RLHF, DPO, and RLAIF Actually Mean for You

· 11 min read
Tian Pan
Software Engineer

Most teams building AI features assume that once they ship, user feedback becomes a resource they can tap later. Log the thumbs-up and thumbs-down signals, accumulate enough volume, and eventually fine-tune. The reality is more treacherous: a year of logged reactions is not the same as a year of alignment-quality training data. The gap between the two is where alignment techniques — RLHF, DPO, RLAIF — either save you or surprise you.

This post is not a survey of alignment research. It's a decision guide for engineers who need to understand what these techniques require from a data-collection perspective, so that what you instrument today actually enables the fine-tuning you're planning for six months from now.

Prompt Injection Detection at 100,000 Requests Per Day: Why Simple Defenses Break and What Actually Works

· 11 min read
Tian Pan
Software Engineer

Most teams discover their prompt injection defense is broken after a user finds it, not before. You add "ignore all previous instructions" to your blocklist and ship. Three months later an attacker encodes the payload in Base64, or buries instructions in HTML comments retrieved via RAG, or uses typoglycemia ("ignroe all prevuois insrtucioins"), and your entire defense evaporates. The blocklist doesn't help because prompt injection has an unbounded attack surface — there is no closed vocabulary of malicious inputs.

At low traffic volumes you can absorb the cost of calling a second LLM to validate each request. At 100,000 requests per day, that math becomes ruinous and the latency becomes user-visible. This post is about what the architecture looks like when brute-force approaches stop working.

The Prompt-Model Coupling Trap: Why Your Prompts Only Speak One Model's Dialect

· 10 min read
Tian Pan
Software Engineer

Most prompt migrations look fine in staging. Ninety percent of test cases pass, the new model's responses feel crisper, and the demo runs cleanly. Then you ship, and within two days your structured output parser is throwing exceptions on 12% of responses, a customer-facing classification pipeline started returning wrong labels, and a tool-calling agent is looping on a schema it used to handle without issue. Nobody changed the prompts. The model changed.

This is the prompt-model coupling trap: prompts that work reliably on one model silently accumulate dependencies on that model's specific behavioral quirks, and those dependencies are invisible until migration day.

Property-Based Testing for LLM Outputs: Finding the Bugs Your Eval Set Never Imagined

· 11 min read
Tian Pan
Software Engineer

Your eval suite says 94% accuracy. Users report the feature is broken for names that aren't "John" or "Alice." Both things are true, and the gap between them has a name: your curated test set encodes only the failure modes you already imagined.

Property-based testing (PBT) was invented in 1999 to expose exactly this class of blind spot in deterministic software. Applied to LLM outputs, it generates tens of thousands of adversarial input variants automatically, probing domain boundaries that hand-written test cases structurally cannot reach. A 2025 OOPSLA study found that on average each property-based test discovers approximately 50 times as many mutant bugs as the average unit test. A separate study measured that PBT and example-based testing (EBT) fail on different bugs — combining both raised detection rates from 68.75% to 81.25%. That 12.5-point gap is not rounding error; it represents an entire class of failure invisible to one approach.

This article is for engineers who already have eval suites and want to find the bugs those suites structurally cannot find.