2 posts tagged with "llmops"

Production AI Incident Response: When Your Agent Goes Wrong at 3am

· 11 min read
Tian Pan
Software Engineer

A runaway loop in a multi-agent cost-tracking system at a fintech startup ran undetected for eleven days. The cause: Agent A asked Agent B for clarification. Agent B asked Agent A for help interpreting the response. Neither had logic to break the loop. The $127 weekly bill became $47,000 before a human looked at the invoice.

No errors were thrown. No alarms fired. Latency was normal. The system was running exactly as designed—just running forever.

This is what AI incidents actually look like. They're not stack traces and 500 errors. They're silent behavioral failures, runaway loops, and plausible wrong answers delivered at production scale with full confidence. Your existing incident runbook almost certainly doesn't cover any of them.
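The loop the excerpt describes has a simple structural fix: every agent-to-agent handoff carries a hop budget, and exhausting it escalates to a human instead of continuing the exchange. A minimal sketch, with illustrative names (`Message`, `route`, `MAX_HOPS` are assumptions, not any particular framework's API):

```python
# Hypothetical sketch: a hop-count guard that breaks agent-to-agent
# clarification loops before they burn eleven days of spend.
from dataclasses import dataclass

MAX_HOPS = 5  # assumed budget before escalating to a human


@dataclass
class Message:
    sender: str
    recipient: str
    content: str
    hops: int = 0  # incremented on every agent-to-agent handoff


def route(msg: Message) -> str:
    """Forward a message between agents; escalate when the budget is spent."""
    if msg.hops >= MAX_HOPS:
        # The 3am failure mode: neither agent ever breaks the loop.
        # A hard budget turns an infinite exchange into an alert.
        return f"ESCALATE: {msg.sender}<->{msg.recipient} exceeded {MAX_HOPS} hops"
    msg.hops += 1
    return f"deliver to {msg.recipient} (hop {msg.hops})"
```

Because the guard lives in the routing layer rather than in either agent's prompt, it fires even when both models behave "correctly" and produce plausible, well-formed requests forever.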

The Prompt Ownership Problem: What Happens When Every Team Treats Prompts as Configuration

· 8 min read
Tian Pan
Software Engineer

A one-sentence change to a system prompt sat in production for 21 days before anyone noticed it was misclassifying thousands of mortgage documents. The estimated cost: $340,000 in operational inefficiency and SLA breaches. Nobody could say who made the change, when it was made, or why. The prompt lived in an environment variable that three teams had write access to, and no one considered it their responsibility to review.

This is the prompt ownership problem. As LLM-powered features proliferate across organizations, prompts have become the most consequential yet least governed artifacts in the stack. They control model behavior, shape user experience, enforce safety constraints, and define business logic — yet most teams manage them with less rigor than they'd apply to a CSS change.
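One way to make a prompt change attributable is to treat the prompt as a versioned artifact with a named owner and a content hash logged on every model call, so behavioral drift traces back to a specific revision. A minimal sketch under those assumptions (the `PromptArtifact` shape and all field values are illustrative, not a prescribed schema):

```python
# Hypothetical sketch: a prompt as a versioned, owned artifact rather
# than a free-floating environment variable three teams can write to.
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptArtifact:
    name: str
    version: str
    owner: str  # the team accountable for reviewing changes
    text: str

    @property
    def digest(self) -> str:
        # Content hash to log alongside every model call, so any
        # misclassification spike maps to an exact prompt revision.
        return hashlib.sha256(self.text.encode()).hexdigest()[:12]


classifier = PromptArtifact(
    name="mortgage-doc-classifier",
    version="2.3.1",
    owner="doc-intel-team",
    text="Classify the following mortgage document...",
)
```

With the digest in request logs, the 21-day mystery above collapses to a diff between two hashes in version control, answering who changed it, when, and in which release.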