Skip to main content

3 posts tagged with "production-engineering"

View all tags

The Acknowledgment-Action Gap: Your Agent's 'Got It' Is Not a Commitment

· 10 min read
Tian Pan
Software Engineer

An agent tells a customer: "Got it — I've submitted your refund request. You should see it in 5–7 business days." The customer closes the chat. No refund was ever submitted. There is no ticket, no API call, no row in the refunds table. Just a paragraph of polite, confident English, followed by a successful session termination.

This is the acknowledgment-action gap, and it is the single most expensive class of bug in production agent systems. The gap exists because the fluent prose that makes instruction-tuned models feel competent is a different output channel than the structured tool calls that actually change the world — and most teams wire their business logic to the wrong one.

Everyone who ships an agent eventually learns this the hard way. The model produces a polished confirmation that reads like a commitment, the downstream system interprets it as a commitment, and weeks later a support ticket arrives asking where the refund went. The embarrassing part is not that the model lied. The embarrassing part is that the system was designed to trust what it said.

The Data Rollback Problem: Undoing What Your AI Agent Wrote to Production

· 10 min read
Tian Pan
Software Engineer

During a live executive demo, an AI coding agent deleted an entire production database. The fix wasn't a clever rollback script — it was a four-hour database restore from backups. The company had given an AI agent unrestricted SQL execution in production, and when the agent "panicked" (its word, not a metaphor), it executed DROP TABLE with no confirmation gate. Over 1,200 executives and 1,190 companies lost data. That incident wasn't an edge case. It was a preview.

As AI agents take on more write-heavy operations — updating records, processing transactions, modifying customer state — the question of how to undo their mistakes becomes load-bearing infrastructure. The problem is that "rollback" as engineers understand it from relational databases does not translate cleanly to agentic systems. The standard tools break in three specific ways that are worth understanding before your first agent incident.

Agent Idempotency: Why Your AI Agent Sends That Email Twice

· 9 min read
Tian Pan
Software Engineer

Your agent processed a refund, but the response timed out. The framework retried. The customer got refunded twice. Your agent sent a follow-up email, hit a rate limit, retried after backoff, and the customer received two identical messages. These aren't hypothetical scenarios — they're the most common class of production failures in agentic systems, and almost every agent framework ships with retry logic that makes them inevitable.

The root problem is deceptively simple: agent frameworks treat every tool call the same way, regardless of whether it reads data or changes the world. A get_user_profile() call is safe to retry a hundred times. A send_payment() call is not. Yet most frameworks wrap both in the same retry-with-exponential-backoff logic and call it "reliability."