Blog

Page 147

12 articles

Production AI Incident Response: When Your Agent Goes Wrong at 3am
AI incidents don't look like software incidents — no stack traces, no 500 errors, just confident wrong answers and runaway loops. A practical guide to detection, triage, containment, and post-mortems for production LLM systems.
agent-reliability, llmops
Apr 9, 11 min
The Prompt Ownership Problem: What Happens When Every Team Treats Prompts as Configuration
A $340K production incident exposed what happens when prompts have no owner, no version history, and no review gate — and the lightweight governance model that prevents it.
prompt-management, llmops
Apr 9, 8 min
Prompt Sprawl: When System Prompts Grow Into Unmaintainable Legacy Code
System prompts that start at 200 tokens and balloon to 4,000 silently degrade LLM performance. How to audit, decompose, and architect modular prompts that stay maintainable — applying DRY, separation of concerns, and version control to prompt management.
insider, prompt-engineering
Apr 9, 9 min
Why the Chunking Problem Isn't Solved: How Naive RAG Pipelines Hallucinate on Long Documents
Fixed-size and semantic chunking both fail in predictable ways on production documents. Here's what the research shows about RAG chunking failures, and the evaluation and architecture patterns that close the accuracy gap.
rag, llm
Apr 9, 9 min
RAG's Dirty Secret: Your Retrieval Succeeds but Your Answers Are Still Wrong
Retrieval success doesn't guarantee correct answers. A third failure mode lurks between retrieval and generation—context sufficiency—where retrieved documents rank correctly but lack the specific information needed. Here's how to detect it and what to do about it.
rag, llm
Apr 9, 9 min
The RAG Freshness Problem: How Stale Embeddings Silently Wreck Retrieval Quality
Semantic similarity has no temporal dimension — stale embeddings score just as high as fresh ones. The CDC pipelines, decay-weighted scoring, and monitoring stack that keep production RAG systems from silently serving outdated answers.
insider, rag
Apr 9, 12 min
The Reasoning Model Premium in Agent Loops: When Thinking Pays and When It Doesn't
Reasoning models cost up to 86x more per query than standard models — and inside agent loops, that cost compounds with every iteration. A practical decision framework for when to route to reasoning models and when fast models are the better call.
insider, reasoning-models
Apr 9, 10 min
The Reasoning Trace Privacy Problem: How Chain-of-Thought Leaks Sensitive Data in Production
Reasoning models correctly identify sensitive data 98% of the time yet leak it in their chain-of-thought 33% of the time. Here's why the scratchpad is a distinct attack surface and what production teams need to do about it.
insider, llm
Apr 9, 9 min
The Reasoning Trace Privacy Problem: What Your CoT Logs Are Leaking
Chain-of-thought traces leak more PII than final model outputs, create a readable attack surface for prompt injection, and turn your observability stack into a GDPR liability. Here's what to do about it.
llm, privacy
Apr 9, 8 min
The Retry Storm Problem in Agentic Systems: Why Naive Retries Burn 200x the Tokens
Naive retry logic in AI agent systems creates exponential token-cost amplification across chained tool calls and multi-agent delegation. A layered defense architecture — circuit breakers, conversation-level budgets, deadline propagation, and honest degradation — prevents a single flaky API from cascading into a full agent meltdown.
ai-agents, reliability
Apr 9, 10 min
The Retry Storm Problem in Agentic Systems: Why Every Failed Tool Call Burns Your Token Budget
Naive retry logic across chained agent tool calls creates exponential cost amplification — a $0.01 task becomes a $2 meltdown. A four-layer defense stack with tool budgets, agent budgets, orchestration backpressure, and error classification prevents cascading failures in production AI agents.
agent-reliability, distributed-systems
Apr 9, 10 min
Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You
GPU memory planning for self-hosted LLMs is almost always wrong because teams size for model weights and ignore the KV cache. A breakdown of the math, quantization tradeoffs between INT4/FP8/FP16, framework selection, and the real break-even calculation for going off cloud APIs.
insider, llm-inference
Apr 9, 10 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
