Blog

Page 71

12 articles

The AI Incident Response Playbook: Diagnosing LLM Degradation in Production
When your LLM feature degrades in production, standard SRE runbooks leave you blind. Here's the diagnosis tree, prompt rollback strategy, and postmortem template built specifically for AI systems.
ai engineering, reliability
Apr 18 · 13 min
The AI Incident Runbook: When Your Agent Causes Real-World Harm
When an AI agent causes real-world harm, your existing outage runbook will mislead you. Here is the playbook built for stochastic systems: how to bound blast radius without stack traces, preserve evidence before it disappears, and investigate beyond 'the model hallucinated.'
insider, ai-engineering
Apr 18 · 11 min
The AI Output Copyright Trap: What Engineers Need to Know Before It's a Legal Problem
Training data memorization, derivative works doctrine, and output ownership are live legal disputes with direct engineering consequences. Here's the risk surface and the controls that actually reduce liability.
ai-engineering, legal
Apr 18 · 11 min
The AI Taste Problem: Measuring Quality When There's No Ground Truth
How to evaluate AI outputs when accuracy metrics are meaningless — the engineering discipline behind pairwise studies, inter-rater reliability, and LLM-as-judge for copywriting, creative content, and design.
ai, evaluation
Apr 18 · 11 min
The Three Silent Clocks of AI Technical Debt
AI-specific technical debt (prompt drift, eval erosion, and embedding staleness) compounds invisibly, unlike code debt. Here's how to detect each clock before it runs down.
ai-engineering, llmops
Apr 18 · 10 min
The Annotation Economy: Why Every Label Source Has a Hidden Tax
A decision framework for choosing between human domain experts, crowd workers, synthetic LLM generation, and behavioral inference for eval label sourcing—and when annotation-free is actually right.
insider, evaluation
Apr 18 · 9 min
Annotation-Free Evaluation: Measuring LLM Quality Before You Have Ground Truth
A practical guide to measuring LLM output quality in week one — before you have labeled data. Covers self-consistency, constraint satisfaction, behavioral invariants, and LLM-as-judge, with the failure modes of each.
llm-evaluation, ai-engineering
Apr 18 · 12 min
The Feedback Loop You Never Closed: Turning User Behavior into AI Ground Truth
Explicit thumbs-up ratings are a lie. Edit rates, retry patterns, and session abandonment reveal far more about AI quality — and you can turn them into eval datasets without an annotation budget.
insider, ai-engineering
Apr 18 · 10 min
Benchmark Contamination: Why That 90% MMLU Score Doesn't Mean What You Think
Frontier models score impressively on standard benchmarks, but contamination — where test data leaks into pretraining — inflates those numbers significantly. Here's what the gap actually looks like and how to design evaluations that give honest signals.
insider, ai-engineering
Apr 18 · 8 min
Why 'Fix the Prompt' Is a Root Cause Fallacy: Blameless Postmortems for AI Systems
The 'fix the prompt' reflex is displacing real root cause analysis in AI incident postmortems. Here's why it happens and how to apply blameless SRE culture to non-deterministic systems.
ai-engineering, sre
Apr 18 · 9 min
Board-Level AI Governance: The Five Decisions Only Executives Can Make
Most AI governance writing targets MLOps teams. But five strategic decisions can only be made at the board level — and the regulatory exposure for getting them wrong is growing fast.
ai, engineering-leadership
Apr 18 · 9 min
Browser Agents in Production: The DOM Fragility Tax
Browser and computer-use AI agents break in ways that neither benchmarks nor demos reveal. Here's what actually causes failures in production and the architectural patterns that keep them running.
insider, browser-agents
Apr 18 · 13 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
