269 posts tagged with "ai-engineering"

SLOs for Non-Deterministic Systems: Defining Reliability When Every Response Is Different

8 min read
Tian Pan
Software Engineer

Your AI feature returns HTTP 200, completes in 180ms, and produces valid JSON. By every traditional SLI, the request succeeded. But the answer is wrong — a hallucinated product spec, a fabricated legal citation, a subtly incorrect calculation. Your monitoring is green. Your users are furious.

This is the fundamental disconnect that breaks SRE for AI systems. Traditional reliability engineering assumes a successful execution produces a correct result. Non-deterministic systems violate that assumption on every request. The same prompt, same context, same model version can produce a different — and differently wrong — answer each time.
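The excerpt above argues that traditional SLIs (availability, latency) say nothing about whether the answer was right. A minimal sketch of the idea, assuming a hypothetical offline grader (human review or an LLM judge) has already attached a `correctness` score to each sampled request:

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    http_ok: bool        # traditional SLI: did the request return successfully?
    latency_ms: float
    correctness: float   # 0.0-1.0 score from an offline grader (hypothetical)

def slis(window: list[RequestResult], latency_slo_ms: float = 500.0) -> dict:
    """Compute traditional SLIs alongside a correctness SLI over a graded window."""
    n = len(window)
    availability = sum(r.http_ok for r in window) / n
    latency_ok = sum(r.latency_ms <= latency_slo_ms for r in window) / n
    # The non-traditional SLI: fraction of responses judged correct.
    correctness = sum(r.correctness >= 0.8 for r in window) / n
    return {"availability": availability, "latency": latency_ok, "correctness": correctness}

window = [
    RequestResult(True, 180.0, 0.95),
    RequestResult(True, 150.0, 0.40),  # green by every traditional SLI, but wrong
    RequestResult(True, 900.0, 0.90),
]
print(slis(window))
```

The threshold (0.8) and the grading mechanism are placeholders; the point is only that correctness must be sampled and scored as its own SLI, because the 200-and-fast signals above it are blind to it.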

The Autonomy Dial: Five Levels for Shipping AI Features Without Betting the Company

11 min read
Tian Pan
Software Engineer

Most teams shipping AI features make the same mistake: they jump straight from "prototype that impressed the VP" to "fully autonomous in production." Then something goes wrong — a bad recommendation, an incorrect auto-response, a financial transaction that should never have been approved — and the entire feature gets pulled. Not dialed back. Pulled.

The problem is not that AI autonomy is dangerous. The problem is that most teams treat autonomy as a binary switch — off or on — when it should be a dial with distinct, instrumented positions between those two extremes.

The Forgetting Problem: When Unbounded Agent Memory Degrades Performance

9 min read
Tian Pan
Software Engineer

An agent that remembers everything eventually remembers nothing useful. This sounds like a paradox, but it's the lived experience of every team that has shipped a long-running AI agent without a forgetting strategy. The memory store grows, retrieval quality degrades, and one day your agent starts confidently referencing a user's former employer, a deprecated API endpoint, or a project requirement that was abandoned six months ago.

The industry has spent enormous energy on giving agents memory. Far less attention has gone to the harder problem: teaching agents what to forget.

The Instruction-Following Cliff: Why Adding One More Rule to Your System Prompt Breaks Three Others

7 min read
Tian Pan
Software Engineer

Your system prompt started at twelve lines. It worked beautifully. Then product wanted tone guidelines. Legal needed a disclaimer rule. The safety team added three more constraints. Now you're at forty rules and the model ignores half of them — but not the same half each time.

This is the instruction-following cliff: the point where adding one more rule to your prompt doesn't just degrade that rule's compliance — it destabilizes rules that were working fine yesterday. And unlike most engineering failures, this one is maddeningly non-deterministic.

The Trust Calibration Curve: How Users Learn to (Mis)Trust AI

9 min read
Tian Pan
Software Engineer

Most AI products die the same way. The demo works. The beta users rave. You ship. And then, about three months in, session length drops, the feature sits idle, and your most engaged early users start routing around the AI to use the underlying tool directly.

It's not a model quality problem. It's a trust calibration problem.

The over-trust → failure → over-correction lifecycle is the most reliable killer of AI product adoption, and it's almost entirely preventable if you understand what's actually happening. The research is clear, the failure modes are predictable, and the design patterns exist. Most teams ignore all of it until they're looking at the retention curve and wondering what went wrong.

The Accuracy Threshold Problem: When Your AI Feature Is Too Good to Ignore and Too Bad to Trust

10 min read
Tian Pan
Software Engineer

McDonald's deployed its AI voice ordering system to over 100 locations. In testing, it hit accuracy in the low-to-mid 80 percent range — numbers that seemed workable. Customers started posting videos of the system adding nine sweet teas to their order unprompted, placing bacon on ice cream, and confidently mishearing simple requests. Within two years, the partnership was dissolved and the technology removed from every location. The lab accuracy was real. The real-world distribution was not what the lab tested.

This is the accuracy threshold problem. There is a zone — roughly 70 to 85 percent accuracy — where an AI feature is accurate enough to look like it works, but not reliable enough to actually work without continuous human intervention. Teams ship into this zone because the numbers feel close enough. Users get confused because the feature is just good enough to lure them into reliance and just bad enough to fail when it matters.

Beam Search for Code Agents: Why Greedy Generation Is a Reliability Trap

11 min read
Tian Pan
Software Engineer

A code agent that passes 90% of HumanEval is not a reliable code agent. It's a code agent that performs well on problems designed to be solvable in a single pass. Give it a competitive programming problem with strict constraints, or a multi-file refactor with subtle interdependencies, and watch the pass rate crater to 20–30%. The model isn't failing because it lacks knowledge. It's failing because greedy, single-pass generation commits to the first plausible-looking token sequence and never looks back.

The fix isn't a better model. It's a better generation strategy. Recent research has established that applying tree exploration to code generation — branching across multiple candidate solutions, scoring partial programs, and pruning unpromising paths — improves pass rates by 30–130% on hard problems, with no change to the underlying model weights.
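The branching, scoring, and pruning loop the excerpt describes is ordinary beam search. A self-contained toy sketch — the `expand` and `prefix_score` functions below are illustrative stand-ins for a model's candidate continuations and a partial-program scorer (e.g. log-probability plus a test-pass signal):

```python
import heapq

def beam_search(expand, score, start, beam_width=3, max_steps=5):
    """Generic beam search: keep the top-k partial candidates at each step
    instead of greedily committing to the single best-looking continuation.

    expand(candidate) -> list of extended candidates
    score(candidate)  -> higher is better
    """
    beam = [start]
    for _ in range(max_steps):
        pool = [c for cand in beam for c in expand(cand)]
        if not pool:
            break
        # Prune: keep only the beam_width highest-scoring partial candidates.
        beam = heapq.nlargest(beam_width, pool, key=score)
    return max(beam, key=score)

# Toy stand-ins: grow strings over {'a', 'b'} toward a known target.
def expand(c):
    return [c + ch for ch in "ab"] if len(c) < 4 else []

def prefix_score(c, target="abba"):
    n = 0
    for x, y in zip(c, target):
        if x != y:
            break
        n += 1
    return n

print(beam_search(expand, prefix_score, "", beam_width=2, max_steps=4))
```

In a real code agent the candidates are partial programs, and the scorer is what makes or breaks the approach — the papers the post surveys score candidates with a mix of model likelihood and execution feedback.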

Cognitive Tool Scaffolding: Near-Reasoning-Model Performance Without the Price Tag

10 min read
Tian Pan
Software Engineer

Your reasoning model bill is high, but the capability gap might be narrower than you think. A standard 70B model running four structured cognitive operations on AIME 2024 math benchmarks jumps from 13% to 30% accuracy — closing much of the gap to o1-preview's 44%, at a fraction of the inference cost. On a more capable base model like GPT-4.1, the same technique pushes from 32% to 53%, which actually surpasses o1-preview on those benchmarks.

The technique is called cognitive tool scaffolding, and it's the latest evolution of a decade of research into making language models reason better without changing their weights.

The Cold Start Problem in AI Personalization

11 min read
Tian Pan
Software Engineer

A user signs up for your AI writing assistant. They type their first message. Your system has exactly one data point — and it has to decide: formal or casual? Verbose or terse? Technical depth or accessible overview? Most systems punt and serve a generic default. A few try to personalize immediately. The ones that personalize immediately often make things worse.

The cold start problem in AI personalization is not the same problem Netflix solved fifteen years ago. It is structurally harder, the failure modes are subtler, and the common fixes actively introduce new bugs. Here is what practitioners who have shipped personalization systems have learned about navigating it.

DAG-First Agent Orchestration: Why Linear Chains Break at Scale

10 min read
Tian Pan
Software Engineer

Most multi-agent systems start as a chain. Agent A calls Agent B, B calls C, C calls D. It works fine in demos, and it works fine with five agents on toy tasks. Then you add a sixth agent, a seventh, and the pipeline that once ran in eight seconds starts taking forty. You add a retry on step three, and now failures on step three silently cascade into corrupted state at step six. You try to add a parallel branch and discover your framework was never designed for that.

The problem is not the number of agents. The problem is the execution model. Linear chains serialize inherently parallel work, propagate failures in only one direction, and make partial recovery structurally impossible. The fix is not adding more infrastructure on top — it is rebuilding the execution model around a directed acyclic graph from the start.
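The DAG execution model the excerpt argues for can be sketched in a few lines with the standard library. This is a minimal illustration, not the post's implementation — the task names and the batch-at-a-time scheduling are simplifying assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run_dag(tasks, deps, max_workers=4):
    """Execute a DAG of tasks, running independent nodes in parallel.

    tasks: {name: callable(results_dict) -> result}
    deps:  {name: set of prerequisite names}
    """
    ts = TopologicalSorter(deps)
    ts.prepare()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while ts.is_active():
            # All currently-ready nodes are independent, so run them concurrently.
            ready = list(ts.get_ready())
            futures = {name: pool.submit(tasks[name], results) for name in ready}
            for name, fut in futures.items():
                results[name] = fut.result()
                ts.done(name)
    return results

# Hypothetical pipeline: one fetch fans out to two analyses that merge.
tasks = {
    "fetch": lambda r: "data",
    "analyze_a": lambda r: r["fetch"] + ":a",
    "analyze_b": lambda r: r["fetch"] + ":b",
    "merge": lambda r: (r["analyze_a"], r["analyze_b"]),
}
deps = {
    "fetch": set(),
    "analyze_a": {"fetch"},
    "analyze_b": {"fetch"},
    "merge": {"analyze_a", "analyze_b"},
}
print(run_dag(tasks, deps)["merge"])
```

Because dependencies are explicit, `analyze_a` and `analyze_b` run concurrently instead of being serialized, and a failure in one branch can be confined to that branch's dependents rather than corrupting downstream state — exactly the properties a linear chain cannot offer.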

The Explainability Trap: When AI Explanations Become a Liability

11 min read
Tian Pan
Software Engineer

Somewhere between the first stakeholder demand for "explainable AI" and the moment your product team spec'd out a "Why did the AI decide this?" feature, a trap was set. The trap is this: your model does not know why it made that decision, and asking it to explain doesn't produce an explanation — it produces text that looks like an explanation.

This distinction matters enormously in production. Not because users deserve better philosophy, but because post-hoc AI explanations are driving real-world harm through regulatory non-compliance, misdirected user behavior, and safety monitors that can be fooled. Engineers shipping explanation features without understanding this will build systems that satisfy legal checkboxes while making outcomes worse.

Fine-tuning vs. RAG for Knowledge Injection: The Decision Engineers Consistently Get Wrong

10 min read
Tian Pan
Software Engineer

A fintech team spent three months fine-tuning a model on their internal compliance documentation — thousands of regulatory PDFs, policy updates, and procedural guides. The results were mediocre. The model still hallucinated specific rule numbers. It forgot recent policy changes. And the one metric that actually mattered (whether advisors trusted its answers enough to stop double-checking) barely moved. Two weeks later, a different team built a RAG pipeline over the same document corpus. Advisors started trusting it within a week.

The fine-tuning team hadn't made a technical mistake. They'd made a definitional one: they were solving a knowledge retrieval problem with a behavior modification tool.