
907 posts tagged with "insider"


The AI Incident Runbook: When Your Agent Causes Real-World Harm

· 11 min read
Tian Pan
Software Engineer

Your agent just did something it shouldn't have. Maybe it sent emails to the wrong people. Maybe it executed a database write that should have been a read. Maybe it gave medical advice that sent a user to the hospital. You are now in an AI incident — and the playbook you've been using for software outages will not help you.

Traditional incident runbooks are built on a foundational assumption: given the same input, the system produces the same output. That assumption lets you reproduce the failure, bisect toward the cause, and verify the fix. None of that applies to a stochastic system operating on natural language. The same prompt through the same pipeline can produce different results across runs, providers, regions, and time. Documented AI incidents surged 56% from 2023 to 2024, yet most organizations still route these events through software incident processes designed for a fundamentally different class of problem.

This is the runbook they should have written.

The Annotation Economy: Why Every Label Source Has a Hidden Tax

· 9 min read
Tian Pan
Software Engineer

Most teams pick their annotation strategy by comparing unit costs: crowd workers run about $0.08 per label, LLM generation under $0.003, human domain experts around $1. Run the spreadsheet, pick the cheapest option that seems "good enough," and ship. This math consistently gets teams into trouble.

The actual decision is not about cost per label in isolation. Every label source carries a hidden quality tax — compounding costs in the form of garbage gradients, misleading eval curves, or months spent debugging production failures that clean labels would have caught at training time. The cheapest source is often the most expensive one when you count the downstream cost of trusting it.

The Feedback Loop You Never Closed: Turning User Behavior into AI Ground Truth

· 10 min read
Tian Pan
Software Engineer

Most teams building AI products spend weeks designing rating widgets, click-to-rate stars, thumbs-up/thumbs-down buttons. Then they look at the data six months later and find a 2% response rate — biased toward outlier experiences, dominated by people with strong opinions, and almost entirely useless for distinguishing a 7/10 output from a 9/10 one.

Meanwhile, every user session is generating a continuous stream of honest, unambiguous behavioral signals. The user who accepts a code suggestion and moves on is satisfied. The user who presses Ctrl+Z immediately is not. The user who rephrases their question four times in a row is telling you something explicit ratings will never capture: the first three responses failed. These signals exist whether you collect them or not. The question is whether you're closing the loop.
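As a rough illustration of the signals described above, here is a minimal sketch that turns a session's event stream into coarse implicit labels. The event names (`accept`, `undo`, `rephrase`), the 5-second undo window, and the threshold of three consecutive rephrases are all hypothetical choices for this example, not values from the post.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str   # hypothetical event names: "accept", "undo", "rephrase"
    t: float    # seconds since session start


def implicit_labels(events, undo_window=5.0, rephrase_limit=3):
    """Map an ordered list of session events to (polarity, reason) labels.

    Assumed rules for this sketch:
    - an accept followed directly by an undo within `undo_window` seconds is negative
    - any other accept is positive
    - `rephrase_limit` consecutive rephrases produce one negative label
    """
    labels = []
    rephrases = 0
    for i, ev in enumerate(events):
        if ev.kind == "rephrase":
            rephrases += 1
            if rephrases == rephrase_limit:
                labels.append(("negative", "repeated_rephrase"))
            continue
        rephrases = 0  # any other event breaks the rephrase streak
        if ev.kind == "accept":
            nxt = events[i + 1] if i + 1 < len(events) else None
            undone = (
                nxt is not None
                and nxt.kind == "undo"
                and nxt.t - ev.t <= undo_window
            )
            if undone:
                labels.append(("negative", "immediate_undo"))
            else:
                labels.append(("positive", "accepted"))
    return labels
```

A real pipeline would likely need richer events and per-product thresholds; the point of the sketch is only that the labels come from behavior already present in the session log.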

Benchmark Contamination: Why That 90% MMLU Score Doesn't Mean What You Think

· 8 min read
Tian Pan
Software Engineer

When GPT-4o scored 88% on MMLU, it felt like a watershed moment. MMLU, the Massive Multitask Language Understanding benchmark, tests 57 academic subjects from elementary math to professional law. An 88% accuracy across that breadth looked like strong evidence of genuine broad intelligence. Then researchers created MMLU-CF, a contamination-free variant that swapped out any questions with suspicious proximity to known training corpora. GPT-4o dropped to 73.4%, a 14.6 percentage point gap.

That gap isn't a small rounding error. It's the difference between "reliably correct on complex academic questions" and "reliably correct when you've seen the question before." For teams making model selection decisions based on leaderboard scores, it means buying a capability that doesn't fully exist.

Browser Agents in Production: The DOM Fragility Tax

· 13 min read
Tian Pan
Software Engineer

A calendar date picker broke a production browser agent for three days before anyone noticed. The designer had swapped a native <input type="date"> for a custom React component during a minor UI refresh. No API changed. No content moved. Just 24px cells in a new layout — and the vision model that had been reliably clicking the right dates now missed by one cell, silently booking appointments on the wrong day.

This is the DOM fragility tax: the ongoing operational cost of building automated agents on top of a web that was never designed to be operated by machines. Unlike most infrastructure taxes, it compounds. The web changes. Anti-bot defenses evolve. SPAs get more dynamic. And your agent quietly degrades.

Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache

· 11 min read
Tian Pan
Software Engineer

Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.

Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute didn't help — the bottleneck isn't where you assumed it was.

AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.

The Capability Elicitation Gap: Why Upgrading to a Newer Model Can Break Your Product

· 9 min read
Tian Pan
Software Engineer

You upgraded to the latest model and your product got worse. Not catastrophically — the new model scores higher on benchmarks, handles harder questions, and refuses fewer things it shouldn't. But the thing your product actually needs? It's regressed. Your carefully tuned prompts produce hedged, over-qualified outputs where you need confident assertions. Your domain-specific format instructions are being helpfully "improved" into something generic. The tight instruction-following that made your workflow reliable now feels like it's on autopilot.

This is the capability elicitation gap: the difference between what a model can do in principle and what it actually does under your prompt in production. And it gets systematically wider with each safety-focused training cycle.

The Cognitive Load Inversion: Why AI Suggestions Feel Helpful but Exhaust You

· 9 min read
Tian Pan
Software Engineer

There's a number in the AI productivity research that almost nobody talks about: 39 percentage points. In a study of experienced developers, participants predicted AI tools would make them 24% faster. After completing the tasks, they still believed they'd been 20% faster. The measured reality: they were 19% slower. The perception gap is 39 points—and it compounds with every sprint, every code review, every feature shipped.

This is the cognitive load inversion. AI tools are excellent at offloading the cheap cognitive work—writing syntactically correct code, drafting boilerplate, suggesting function names—while generating a harder class of cognitive work: continuous evaluation of uncertain outputs. You didn't eliminate cognitive effort. You automated the easy half and handed yourself the hard half.

Compound AI Systems: When Your Pipeline Is Smarter Than Any Single Model

· 9 min read
Tian Pan
Software Engineer

There is a persistent assumption in AI engineering that the path to better outputs is a better model. Bigger context window, fresher training data, higher benchmark scores. In practice, the teams shipping the most capable AI products are usually doing something different: they are assembling pipelines where multiple specialized components — a retriever, a reranker, a classifier, a code interpreter, and one or more language models — cooperate to handle a task that no single model could do reliably on its own.

This architectural pattern has a name — compound AI systems — and it is now the dominant paradigm for production AI. Understanding how to build these systems correctly, and where they fail when you don't, is one of the most important skills in applied AI engineering today.

Context Windows Aren't Free Storage: The Case for Explicit Eviction Policies

· 10 min read
Tian Pan
Software Engineer

Most engineering teams treat the LLM context window the way early web developers treated global variables: throw everything in, fix it later. The context is full of the last 40 conversation turns, three entire files from the repository, a dozen retrieved documents, and a system prompt that's grown by committee over six months. It works — until it doesn't, and by then it's hard to tell what's causing the degradation.

The context window is not heap memory. It is closer to a CPU register file: finite, expensive per unit, and its contents directly affect every computation the model performs. When you treat registers as scratch space and forget to manage them, programs crash in creative ways. When you treat context windows as scratch space, LLMs degrade silently and expensively.
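One way to picture an explicit eviction policy is the sketch below, which drops unpinned context items until the total fits a token budget. The `ContextItem` fields, the priority scale, and the "lowest priority, then oldest" ordering are assumptions made for illustration; the post does not prescribe a specific policy.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    tokens: int           # token count, assumed to be measured elsewhere
    priority: int         # hypothetical scale: higher means more important
    turn: int             # conversation turn at which the item was added
    pinned: bool = False  # pinned items (e.g. the system prompt) are never evicted


def evict(items, budget):
    """Return (kept, evicted) so that kept fits `budget` tokens where possible.

    Unpinned items are evicted lowest priority first, oldest first within a
    priority level. If pinned items alone exceed the budget, they are all kept
    and the caller has to handle the overflow.
    """
    kept = list(items)
    total = sum(item.tokens for item in kept)
    candidates = sorted(
        (item for item in kept if not item.pinned),
        key=lambda item: (item.priority, item.turn),
    )
    evicted = []
    for item in candidates:
        if total <= budget:
            break
        kept.remove(item)
        evicted.append(item)
        total -= item.tokens
    return kept, evicted
```

Making the policy a small, testable function is the main design choice here: what gets dropped becomes a reviewable decision rather than a side effect of truncation.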

The Conversation Designer's Hidden Role in AI Product Quality

· 10 min read
Tian Pan
Software Engineer

Most engineering teams treat system prompts as configuration files — technical strings to be iterated on quickly, stored in environment variables, and deployed with the same ceremony as changing a timeout value. The system prompt gets an inline comment. The error messages get none. The capability disclosure is whatever the PM typed into the Notion doc on launch day.

This is the root cause of an entire class of AI product failures that don't show up in your eval suite. The model answers the question. The latency is fine. The JSON validates. But users stop trusting the product after three sessions, and the weekly active usage curve never recovers.

The missing discipline is conversation design. And it shapes output quality in ways that most engineering instrumentation is architecturally blind to.

Corpus Architecture for RAG: The Indexing Decisions That Determine Quality Before Retrieval Starts

· 12 min read
Tian Pan
Software Engineer

When a RAG system returns the wrong answer, the post-mortem almost always focuses on the same suspects: the retrieval query, the similarity threshold, the reranker, the prompt. Teams spend days tuning these components while the actual cause sits untouched in the indexing pipeline. The failure happened weeks ago when someone decided on a chunk size.

Most RAG quality problems are architectural, not operational. They stem from decisions made at index time that silently shape what the LLM will ever be allowed to see. By the time a user complains, the retrieval system is doing exactly what it was designed to do — it's just that the design was wrong.