Blog

Page 113

12 articles

1% Error Rate, 10 Million Users: The Math of AI Failures at Scale
Why accuracy metrics that look fine in offline evals become catastrophic at production volume, how to set SLOs for AI features that account for tail behavior, and the product decision of what to do when a model is good enough but still wrong millions of times per month.
production-ai, reliability
Apr 16 · 11 min
The AI Feature Deprecation Playbook: Shutting Down LLM Features Without Destroying User Trust
A practical guide for engineers and PMs on how to deprecate LLM-powered features cleanly — covering data lifecycle teardown, behavioral migration testing, user trust dynamics, and communication strategy.
ai-engineering, llm
Apr 16 · 12 min
What 'Done' Means for AI-Powered Features: Engineering the Perpetual Beta
AI-powered features never reach a stable 'done' state — model drift, world drift, and expectation drift create continuous iteration pressure. Here's the engineering and governance infrastructure that makes 'stable but evolving' feel like quality rather than incompleteness.
ai-engineering, llmops
Apr 16 · 10 min
The AI-Generated Code Maintenance Trap: What Teams Discover Six Months Too Late
Teams adopting coding agents see dramatic velocity gains in months one through three. By month twelve, many find themselves unable to ship features without understanding their own systems. Here's the failure pattern — and how to avoid it.
insider, ai
Apr 16 · 11 min
AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet
AI inference now produces 2.5–3.7% of global emissions and is growing 15% annually. Here's how to measure your team's contribution and why it will become a compliance concern whether you plan for it or not.
ai-engineering, sustainability
Apr 16 · 9 min
Choosing a Vector Database for Production: What Benchmarks Won't Tell You
Benchmark leaderboards measure the wrong things. Here's the evaluation framework that actually predicts whether your vector database will hold up in production.
vector-database, production
Apr 16 · 10 min
AI Oncall: What to Page On When Your System Thinks
How to design alerting for non-deterministic AI systems, what an AI incident looks like vs. a traditional failure, and runbook structures that actually help an on-call engineer at 2am.
insider, ai-engineering
Apr 16 · 11 min
When Everyone Has an AI Coding Agent: The Team Dynamics Nobody Warned You About
When every engineer on your team has an AI coding agent, individual productivity gains can quietly destroy collective code ownership, accelerate knowledge silos, and break code review culture — here's what to do about it.
insider, ai
Apr 16 · 10 min
The AI Product Metrics Trap: When Engagement Looks Like Value but Isn't
How teams measure session count and completion rate while missing what actually predicts value — and why the first 30 days of AI feature metrics are almost always wrong.
ai-engineering, product-metrics
Apr 16 · 11 min
AI for SRE Log Analysis: The Tiered Architecture That Actually Works
Real-time frontier model analysis of streaming logs is untenable on both cost and latency. Here's the tiered approach (fast anomaly detection gating selective LLM calls) that actually works in production.
observability, SRE
Apr 16 · 9 min
AI Succession Planning: What Happens When the Team That Knows the Prompts Leaves
When the engineer who wrote your system prompt leaves, the reasoning behind every phrasing decision leaves with them. Here's how to build AI systems that survive personnel changes.
insider, ai-engineering
Apr 16 · 11 min
AI User Research: What Users Actually Need Before You Write the First Prompt
Most AI features fail not because the technology is wrong, but because teams asked users what they wanted instead of observing what they actually do. Here's how to run user research that produces reliable behavioral signal before you build.
insider, ai-product
Apr 16 · 10 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
