
578 posts tagged with "insider"


Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache

· 11 min read
Tian Pan
Software Engineer

Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.

Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute didn't help — the bottleneck isn't where you assumed it was.

AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.
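A back-of-envelope calculation shows why reactive scaling loses here: if a new replica takes minutes to become useful, the backlog that accumulates during that window is the real damage. A minimal sketch with purely illustrative numbers (cold-start time, per-GPU throughput, and spike rate are assumptions, not measurements):

```python
# Back-of-envelope sizing for an inference fleet facing a traffic spike.
# All numbers below are illustrative assumptions, not benchmarks.

COLD_START_S = 240          # time for a new GPU replica to load weights and serve
PER_GPU_RPS = 2.0           # sustainable requests/sec per replica at target latency
BASELINE_RPS = 40.0         # steady-state traffic
SPIKE_RPS = 120.0           # peak traffic during the event

baseline_replicas = BASELINE_RPS / PER_GPU_RPS                  # 20 replicas

# Requests that arrive beyond current capacity while new replicas warm up.
excess_rps = SPIKE_RPS - BASELINE_RPS
backlog_during_cold_start = excess_rps * COLD_START_S           # 19,200 queued requests

# Headroom you would need pre-provisioned to absorb the spike with no queueing.
prewarmed_headroom = SPIKE_RPS / PER_GPU_RPS - baseline_replicas  # 40 extra replicas

print(f"baseline replicas:         {baseline_replicas:.0f}")
print(f"backlog during cold start: {backlog_during_cold_start:,.0f} requests")
print(f"pre-warmed headroom:       {prewarmed_headroom:.0f} replicas")
```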

The Capability Elicitation Gap: Why Upgrading to a Newer Model Can Break Your Product

· 9 min read
Tian Pan
Software Engineer

You upgraded to the latest model and your product got worse. Not catastrophically — the new model scores higher on benchmarks, handles harder questions, and refuses fewer things it shouldn't. But the thing your product actually needs? It's regressed. Your carefully tuned prompts produce hedged, over-qualified outputs where you need confident assertions. Your domain-specific format instructions are being helpfully "improved" into something generic. The tight instruction-following that made your workflow reliable now feels like it's on autopilot.

This is the capability elicitation gap: the difference between what a model can do in principle and what it actually does under your prompt in production. And it gets systematically wider with each safety-focused training cycle.
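One way to make the gap visible before it reaches production is a regression gate that runs the prompts your product actually ships against both model versions and scores product-specific expectations rather than benchmark accuracy. A minimal sketch; `call_model` is a placeholder for whatever inference client you use, and the cases are illustrative:

```python
# Sketch of a model-upgrade regression gate. The point is to detect
# elicitation regressions on your prompts, not benchmark deltas.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    name: str
    prompt: str
    passes: Callable[[str], bool]   # product-specific check on the raw output

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your inference client here")

CASES = [
    Case("confident_assertion",
         "State the refund policy for orders older than 30 days in one sentence.",
         lambda out: "may" not in out.lower() and "might" not in out.lower()),
    Case("strict_format",
         "Return the order status as JSON with keys: id, status. Nothing else.",
         lambda out: out.strip().startswith("{") and out.strip().endswith("}")),
]

def elicitation_regressions(old_model: str, new_model: str) -> list[str]:
    regressions = []
    for case in CASES:
        old_ok = case.passes(call_model(old_model, case.prompt))
        new_ok = case.passes(call_model(new_model, case.prompt))
        if old_ok and not new_ok:
            regressions.append(case.name)
    return regressions
```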

The Cognitive Load Inversion: Why AI Suggestions Feel Helpful but Exhaust You

· 9 min read
Tian Pan
Software Engineer

There's a number in the AI productivity research that almost nobody talks about: 39 percentage points. In a study of experienced developers, participants predicted AI tools would make them 24% faster. After completing the tasks, they still believed they'd been 20% faster. The measured reality: they were 19% slower. The perception gap is 39 points—and it compounds with every sprint, every code review, every feature shipped.

This is the cognitive load inversion. AI tools are excellent at offloading the cheap cognitive work—writing syntactically correct code, drafting boilerplate, suggesting function names—while generating a harder class of cognitive work: continuous evaluation of uncertain outputs. You didn't eliminate cognitive effort. You automated the easy half and handed yourself the hard half.

Compound AI Systems: When Your Pipeline Is Smarter Than Any Single Model

· 9 min read
Tian Pan
Software Engineer

There is a persistent assumption in AI engineering that the path to better outputs is a better model. Bigger context window, fresher training data, higher benchmark scores. In practice, the teams shipping the most capable AI products are usually doing something different: they are assembling pipelines where multiple specialized components — a retriever, a reranker, a classifier, a code interpreter, and one or more language models — cooperate to handle a task that no single model could do reliably on its own.

This architectural pattern has a name — compound AI systems — and it is now the dominant paradigm for production AI. Understanding how to build these systems correctly, and where they fail when you don't, is one of the most important skills in applied AI engineering today.
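The shape of such a system matters more than any single component. A minimal sketch of the composition, assuming placeholder interfaces for each stage; every body here is a stub, and the point is that retrieval, reranking, routing, and generation are separate, independently testable pieces:

```python
# Minimal shape of a compound AI pipeline: retrieval, reranking, routing,
# and generation as separate stages composed by plain code.

from typing import Protocol

class Retriever(Protocol):
    def search(self, query: str, k: int) -> list[str]: ...

class Reranker(Protocol):
    def rerank(self, query: str, docs: list[str], top_n: int) -> list[str]: ...

class LLM(Protocol):
    def generate(self, prompt: str) -> str: ...

def answer(query: str, retriever: Retriever, reranker: Reranker,
           router_llm: LLM, writer_llm: LLM) -> str:
    # 1. A cheap router decides whether retrieval is needed at all.
    route = router_llm.generate(f"Answer 'lookup' or 'direct' for: {query}")
    if "direct" in route.lower():
        return writer_llm.generate(query)

    # 2. Over-fetch, then let the reranker pick what actually matters.
    candidates = retriever.search(query, k=50)
    context = reranker.rerank(query, candidates, top_n=5)

    # 3. The writer model sees only the curated context, not the raw corpus.
    prompt = "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {query}"
    return writer_llm.generate(prompt)
```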

Context Windows Aren't Free Storage: The Case for Explicit Eviction Policies

· 10 min read
Tian Pan
Software Engineer

Most engineering teams treat the LLM context window the way early web developers treated global variables: throw everything in, fix it later. The context is full of the last 40 conversation turns, three entire files from the repository, a dozen retrieved documents, and a system prompt that's grown by committee over six months. It works — until it doesn't, and by then it's hard to tell what's causing the degradation.

The context window is not heap memory. It is closer to a CPU register file: finite, expensive per unit, and its contents directly affect every computation the model performs. When you treat registers as scratch space and forget to manage them, programs crash in creative ways. When you treat context windows as scratch space, LLMs degrade silently and expensively.
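What an explicit eviction policy looks like in practice: every item in the window carries a priority and an age, and the manager evicts lowest-priority, oldest-first until the budget fits. A minimal sketch; the token count is a crude word-count stand-in and the priority scheme is an assumption:

```python
# Sketch of an explicit context eviction policy: priority plus recency,
# trimmed to a fixed token budget instead of growing without bound.

from dataclasses import dataclass, field

@dataclass
class ContextItem:
    text: str
    priority: int          # higher survives longer (system prompt, task spec)
    turn: int              # monotonically increasing conversation turn

    @property
    def tokens(self) -> int:
        return max(1, len(self.text.split()))   # rough approximation

@dataclass
class ContextWindow:
    budget_tokens: int
    items: list[ContextItem] = field(default_factory=list)

    def add(self, item: ContextItem) -> None:
        self.items.append(item)
        self._evict()

    def _evict(self) -> None:
        # Evict low-priority, stale items first; never touch priority >= 100.
        while sum(i.tokens for i in self.items) > self.budget_tokens:
            candidates = [i for i in self.items if i.priority < 100]
            if not candidates:
                break
            victim = min(candidates, key=lambda i: (i.priority, i.turn))
            self.items.remove(victim)
```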

The Conversation Designer's Hidden Role in AI Product Quality

· 10 min read
Tian Pan
Software Engineer

Most engineering teams treat system prompts as configuration files — technical strings to be iterated on quickly, stored in environment variables, and deployed with the same ceremony as changing a timeout value. The system prompt gets an inline comment. The error messages get none. The capability disclosure is whatever the PM typed into the Notion doc on launch day.

This is the root cause of an entire class of AI product failures that don't show up in your eval suite. The model answers the question. The latency is fine. The JSON validates. But users stop trusting the product after three sessions, and the weekly active usage curve never recovers.

The missing discipline is conversation design. And it shapes output quality in ways that most engineering instrumentation is architecturally blind to.

Corpus Architecture for RAG: The Indexing Decisions That Determine Quality Before Retrieval Starts

· 12 min read
Tian Pan
Software Engineer

When a RAG system returns the wrong answer, the post-mortem almost always focuses on the same suspects: the retrieval query, the similarity threshold, the reranker, the prompt. Teams spend days tuning these components while the actual cause sits untouched in the indexing pipeline. The failure happened weeks ago when someone decided on a chunk size.

Most RAG quality problems are architectural, not operational. They stem from decisions made at index time that silently shape what the LLM will ever be allowed to see. By the time a user complains, the retrieval system is doing exactly what it was designed to do — it's just that the design was wrong.
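A small step toward fixing this is making the index-time decisions explicit and versioned rather than buried in a one-off script, because chunk size, overlap, and the metadata attached to each chunk bound what retrieval can ever surface. A minimal sketch; the numbers and field names are illustrative assumptions:

```python
# Sketch of an explicit, versioned indexing policy for a RAG corpus.

from dataclasses import dataclass

@dataclass(frozen=True)
class IndexPolicy:
    version: str
    chunk_tokens: int        # too small: context lost; too large: diluted embeddings
    overlap_tokens: int      # preserves sentences that straddle chunk boundaries
    attach_metadata: tuple[str, ...]   # filters available at query time

POLICY = IndexPolicy(
    version="2024-06-chunks-512",
    chunk_tokens=512,
    overlap_tokens=64,
    attach_metadata=("source", "section_title", "last_modified"),
)

def chunk(words: list[str], policy: IndexPolicy) -> list[list[str]]:
    """Fixed-size chunking with overlap; a real pipeline might split on headings."""
    step = policy.chunk_tokens - policy.overlap_tokens
    return [words[i:i + policy.chunk_tokens] for i in range(0, len(words), step)]
```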

The Data Quality Tax in LLM Systems: Why Bad Input Hits Differently

· 9 min read
Tian Pan
Software Engineer

Your gradient boosting model degrades politely when data gets noisy. Accuracy drops, precision drops, a monitoring alert fires, and the on-call engineer knows exactly where to look. LLMs don't do that. Feed an LLM degraded, stale, or malformed input and it produces fluent, confident, authoritative-sounding output that is partially or entirely wrong — and the downstream system consuming it has no way to tell the difference.

This is the data quality tax: the compounding cost you pay when bad data enters an LLM pipeline, expressed not as lower confidence scores but as hallucinations dressed in the syntax of facts.
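Because the model will not flag stale or malformed input on its own, the gate has to live in the pipeline in front of the call. A minimal sketch of such a gate, assuming illustrative field names and thresholds:

```python
# Sketch of a data quality gate in front of an LLM call: refuse to generate
# rather than let the model narrate bad data fluently.

from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=24)
REQUIRED_FIELDS = ("account_id", "balance", "as_of")

def validate_record(record: dict) -> list[str]:
    problems = []
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, "", "N/A"):
            problems.append(f"missing or empty field: {name}")
    as_of = record.get("as_of")
    if isinstance(as_of, datetime):
        if datetime.now(timezone.utc) - as_of > MAX_STALENESS:
            problems.append(f"stale data: as_of={as_of.isoformat()}")
    else:
        problems.append("as_of is not a timestamp")
    return problems

def build_prompt(record: dict) -> str:
    problems = validate_record(record)
    if problems:
        # Fail loudly here; a confidence score will not appear downstream.
        raise ValueError("data quality gate failed: " + "; ".join(problems))
    return f"Summarize the account position: {record}"
```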

Dead Reckoning for Long-Running Agents: Knowing Where Your Agent Is Without Stopping It

· 11 min read
Tian Pan
Software Engineer

Before GPS, sailors used dead reckoning: take your last confirmed position, note your speed and heading, and project forward. It works until the accumulated error compounds into something irreversible—a reef you didn't see coming.

Long-running AI agents have exactly this problem. When an agent spends two hours orchestrating API calls, writing documents, and executing multi-step plans, the people running it often have no better visibility than a sailor without instruments. The agent either finishes or it doesn't. The failure mode isn't the crash—it's the silent loop that burns $30 in tokens while appearing to work, or the agent that "successfully" completes the wrong task because its world model drifted an hour into execution.

Production data makes this concrete: agents with undetected loops have been documented repeating the same tool call 58 times before manual intervention. A two-hour runaway at frontier model rates costs $15–40 before anyone notices. And the worst failures aren't the ones that error out—they're the 12–18% of "successful" runs that return plausible-looking wrong answers.
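The cheapest form of dead reckoning is telemetry that rides alongside the agent without pausing it: track tool-call signatures and spend as they happen, and raise a flag when the same call repeats or the budget burns with no new state. A minimal sketch with illustrative thresholds:

```python
# Sketch of a non-blocking progress monitor for a long-running agent.

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class AgentTelemetry:
    max_identical_calls: int = 5
    budget_usd: float = 10.0
    call_counts: Counter = field(default_factory=Counter)
    spend_usd: float = 0.0
    alerts: list[str] = field(default_factory=list)

    def record_tool_call(self, tool: str, args: str, cost_usd: float) -> None:
        signature = (tool, args)
        self.call_counts[signature] += 1
        self.spend_usd += cost_usd

        if self.call_counts[signature] == self.max_identical_calls:
            self.alerts.append(f"possible loop: {tool} repeated "
                               f"{self.max_identical_calls}x with identical args")
        if self.spend_usd > self.budget_usd:
            self.alerts.append(f"budget exceeded: ${self.spend_usd:.2f}")
```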

Decision Provenance in Agentic Systems: Audit Trails That Actually Work

· 13 min read
Tian Pan
Software Engineer

An agent running in your production system deletes 10,000 database records. The deletion matches valid business logic — the records were flagged correctly. But three months later, a regulator asks a simple question: who authorized this, and on what basis did the agent decide? You open your logs. You find the SQL statement. You find the timestamp. You find nothing else.

This is the decision provenance problem. You can prove that your agent acted; you cannot prove why, or whether that action was ever sanctioned by a human who understood what they were approving. With autonomous agents now executing workflows that span hours, dozens of tool calls, and decisions with real-world consequences, the gap between "we have logs" and "we have accountability" has become operationally dangerous.
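Closing that gap means writing a provenance record at the moment the agent takes a consequential action: not just what it did, but what it saw, which standing policy made the action eligible, and which human, if any, approved. A minimal sketch; the schema is an illustrative assumption, with a hash chain to make the trail tamper-evident:

```python
# Sketch of a decision provenance record captured at action time.

import hashlib, json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

def digest(context: str) -> str:
    return hashlib.sha256(context.encode()).hexdigest()

@dataclass
class DecisionRecord:
    action: str                  # e.g. the exact SQL statement issued
    rationale: str               # the agent's stated reason, captured verbatim
    inputs_digest: str           # hash of the retrieved context / rows examined
    policy_id: str               # the standing rule that made this action eligible
    approver: str | None         # human who approved, or None for autonomous
    prior_record_hash: str       # chains records so the trail is tamper-evident
    timestamp: str = ""

    def finalize(self) -> str:
        """Stamp the record and return its hash, used as the next record's prior hash."""
        self.timestamp = datetime.now(timezone.utc).isoformat()
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()
```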

The AI Feature Sunset Playbook: Decommissioning Agents Without Breaking Your Users

· 10 min read
Tian Pan
Software Engineer

Most teams discover the same thing at the worst possible time: retiring an AI feature is nothing like deprecating an API. You add a sunset date to the docs, send the usual three-email sequence, flip the flag — and then watch your support queue spike 80% while users loudly explain that the replacement "doesn't work the same way." What they mean is: the old agent's quirks, its specific failure modes, its particular brand of wrong answer, had all become load-bearing. They'd built workflows around behavior they couldn't name until it was gone.

This is the core problem with AI feature deprecation. Deterministic APIs have explicit contracts. If you remove an endpoint, every caller that relied on it gets a 404. The breakage is traceable, finite, and predictable. Probabilistic AI outputs are different — users don't integrate the contract, they integrate the behavioral distribution. Removing a model doesn't just remove a capability; it removes a specific pattern of behavior that users may have spent months adapting to without realizing it.

Designing for Partial Completion: When Your Agent Gets 70% Done and Stops

· 10 min read
Tian Pan
Software Engineer

Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.

The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.
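What that state machine looks like at its simplest: partial completion is a first-class state, each step records whether it committed a real-world side effect, and a stopped run leaves behind enough state to resume or compensate instead of just erroring. A minimal sketch with illustrative states and steps:

```python
# Sketch of a task state machine where "partially done" is a first-class state.

from dataclasses import dataclass, field
from enum import Enum, auto

class TaskState(Enum):
    PENDING = auto()
    RUNNING = auto()
    PARTIAL = auto()       # some steps committed real-world side effects
    COMPLETE = auto()
    COMPENSATING = auto()  # undoing committed steps after a decision to abort

@dataclass
class Step:
    name: str
    committed: bool = False          # did this step cause an external side effect?
    compensation: str | None = None  # how to undo it, e.g. "cancel_booking"

@dataclass
class AgentTask:
    steps: list[Step]
    state: TaskState = TaskState.PENDING

    def mark_stopped(self) -> TaskState:
        committed = [s for s in self.steps if s.committed]
        if len(committed) == len(self.steps):
            self.state = TaskState.COMPLETE
        elif committed:
            # Surface to the user exactly what is done and what is not.
            self.state = TaskState.PARTIAL
        else:
            self.state = TaskState.PENDING
        return self.state
```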