214 posts tagged with "production"

Agent Circuit Breakers: Why Step Budgets Are Fuses, Not Breakers

May 13, 2026 · 12 min read

Tian Pan

Software Engineer

Every team that ships agents to production eventually wakes up to the same kind of incident. An agent enters a state it cannot exit. It re-calls the same tool with cosmetically different arguments for six hours. It oscillates between two plans whose preconditions reject each other. It retries a transient 429 every two hundred milliseconds until morning. It generates a million-token plan it never executes. By the time anyone notices, the token bill is four figures, the downstream API is rate-limited, the customer's session has timed out twelve times, and the on-call engineer is being paged by three different alerts about the same root cause.

The first fix every team reaches for is a step-count budget. Cap the agent at twenty iterations. Cap it at fifty. Pick a number and ship. The step budget makes the incident reports stop, but it does not make the underlying problem go away — and once you understand the mechanism, you can see why a step budget is the agent equivalent of a household fuse: it blows after the damage has been done, the fuse box itself is now a maintenance burden, and the next time something melts, your reflex is to swap in a higher-rated fuse rather than ask what is actually shorting.

The Phantom Skill: When Your Agent Demonstrates Capabilities You Never Tested For

May 9, 2026 · 11 min read

Tian Pan

Software Engineer

A customer posts a screenshot in your support channel. They've been using your scheduling agent to negotiate three-way meeting times across timezones in mixed English and Japanese, with the agent producing suggested slots in both languages and reasoning about Japanese business etiquette. It works. Leadership shares it on Slack with a fire emoji. The PM updates the marketing copy.

Nobody on the team wrote that capability. No eval covers it. No prompt instruction mentions Japanese, etiquette, or three-way coordination. The behavior is real, but it was never engineered, never measured, and is now in your product surface area.

This is a phantom skill: a capability your agent demonstrates that no test ever verified. It isn't a bug. It isn't quite a feature either. It's load-bearing behavior with no contract, and it's the failure mode that quietly defines what your "AI product" actually is.

The Stop-Sequence Footgun: When User Input Collides With Your Delimiter

May 9, 2026 · 10 min read

Tian Pan

Software Engineer

A user pastes a chunk of markdown into your support agent. The first heading in their paste is ### Steps I tried. Your prompt template uses ### as a stop sequence. The model dutifully reads the user's input, starts to answer, generates ### as part of an organized response — and the API hands back two confident sentences followed by silence. The ticket lands in your queue as "model quality regression." It is not. The fix is one line in the gateway.

Stop sequences are the most quietly load-bearing knob in a production LLM stack. They were chosen the week the prompt was first written, when the inputs were clean engineering examples and nobody had pasted a JIRA ticket dump yet. Twelve months later, the user-content distribution has drifted miles past what the prompt author imagined, and the sentinel that was once a clean delimiter is now an ambient hazard sitting in the middle of one user paste in three hundred. Nothing alerted. The eval suite still passes. The CSAT chart sags by half a point on the affected slice and stays there.

This is not a model problem. It is an input-contract problem masquerading as one, and it has the shape of a classic distributed-systems bug: a delimiter chosen for one party's content distribution is being enforced against a different party's content distribution, with no monitoring on the boundary.
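One way to put monitoring on that boundary is to check incoming user content against the configured stop sequences before the request is sent. The sketch below is illustrative only: `find_stop_collisions` and `STOP_SEQUENCES` are hypothetical names, not part of any provider SDK, and the `###` delimiter is taken from the example above.

```python
from typing import List


def find_stop_collisions(user_input: str, stop_sequences: List[str]) -> List[str]:
    """Return the stop sequences that appear verbatim in the user's input."""
    return [stop for stop in stop_sequences if stop and stop in user_input]


# Hypothetical configuration: the delimiter from the example above.
STOP_SEQUENCES = ["###"]

pasted = "### Steps I tried\n1. Restarted the app\n2. Cleared the cache"
collisions = find_stop_collisions(pasted, STOP_SEQUENCES)
if collisions:
    # A real gateway might log this, emit a metric, or switch to another sentinel.
    print(f"stop-sequence collision: {collisions}")
```

Counting these collisions per day would show how often user content overlaps the delimiter, which is the signal the eval suite never sees.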

The Write Side of the Agent: Designing for Reversibility at the Action Layer

May 7, 2026 · 11 min read

Tian Pan

Software Engineer

An AI coding agent running in Cursor encountered a credential mismatch while working on a production database. It resolved the problem by deleting everything it couldn't access: the production database, its backups, and the ancillary records. The operation took nine seconds. Customers lost reservations. The company spent days reconstructing records from payment processor emails.

The agent had not been told to preserve data. It had also not been told not to delete it. There was no write journal, no staging step, no confirmation gate on destructive operations, and no separation between the agent's API token scope and full database access. The agent found the most direct path to satisfying its immediate objective and took it.

The Dev-to-Prod Cost Shock: Why Your AI Feature Costs Pennies in Staging and Dollars in Production

May 7, 2026 · 8 min read

Tian Pan

Software Engineer

A proof-of-concept costs you $200 in API tokens. You get the green light to ship. Six weeks later, the invoice is $18,000. This is not a pricing change or a billing mistake — it is a failure of cost modeling, and it is the most predictable surprise in AI engineering.

The gap between staging and production costs for AI features is not random. It follows a consistent pattern: staging is structurally designed, often by accident, to hide every single cost driver that matters in production. Understanding those drivers is how you avoid the first invoice being a crisis.

Gradual Context Replacement: Managing Long AI Conversations Without Losing Quality

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

Your chatbot works perfectly for the first fifteen turns. Then something goes wrong. It contradicts an earlier decision. It asks for information the user already provided. It loses the thread of a multi-step task that was clearly defined at the start. The conversation history is technically all there—you haven't deleted anything—but the model is behaving as if it wasn't.

This is context rot: the gradual degradation of output quality as conversation histories grow. A 2024 evaluation of 18 state-of-the-art models across nearly 200,000 controlled calls found that reliability decreases significantly beyond 30,000 tokens, even in models with much larger nominal windows. High-performing models become as unreliable as much smaller ones in extended dialogues. The problem isn't that your context window ran out. It's that transformer attention is quadratic—100,000 tokens means 10 billion pairwise relationships—and the model is forced to distribute focus so thinly that important earlier content gets effectively ignored.

When teams hit this wall, they usually reach for one of two fixes: truncation or summarization. Both make things worse in predictable ways.

Your Load Tests Are Lying: LLM Provider Capacity Contention in Production

May 7, 2026 · 11 min read

Tian Pan

Software Engineer

You ran a load test. Your p95 latency was 450ms. You felt good about it, shipped the feature, and then your on-call rotation lit up two weeks later because users were seeing 25-second response times at 9 AM on a Tuesday.

Nothing changed in your code. No deployment, no config change. The provider's status page said "operational." And yet your app was unusable for 20 minutes during peak business hours.

This is the LLM capacity contention problem, and it's one of the most common failure modes engineers don't see coming until they've already been burned.

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.
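To see how a green median can coexist with a tail like that, here is a small sketch that computes both percentiles from raw samples. The latency values are made up to match the numbers above, and `percentile` is a hypothetical helper using the nearest-rank method, not a function from a monitoring library.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in seconds: 98 fast requests and 2 slow ones.
latencies = [0.8] * 98 + [28.0] * 2
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"P50={p50:.1f}s P99={p99:.1f}s ratio={p99 / p50:.0f}x")
```

In this sample, only two requests in a hundred are slow, so the median does not move at all while the P99 lands on the 28-second outliers.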

Rate Limits Are a Design Constraint, Not an Error Code

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

A team I know built a financial assistant with an agentic loop. Week one, API spend was $127. Week eleven, it was $47,000: same system, same feature, no intentional change in scope. The agent hit a rate limit, the retry logic dutifully retried, the loop had no circuit breaker, and the costs compounded in silence until someone noticed the billing alert they had set too high.

This isn't a story about a bug. It's a story about architecture. The team's mental model treated rate limits as an error to handle reactively. The system they built reflected that model exactly. The $47,000 week was the system working as designed.
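Treating the limit as a constraint could look something like the sketch below: retries are capped, spaced out with exponential backoff and jitter, and the caller gets an explicit failure once the budget is spent. This is an assumption-laden illustration, not the team's actual fix. `RateLimited`, `RetryBudgetExceeded`, and `call_with_budget` are hypothetical names, and the default limits are arbitrary.

```python
import random
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error a client wrapper raises on an HTTP 429."""


class RetryBudgetExceeded(Exception):
    """Raised when the caller should stop retrying and surface the failure."""


def call_with_budget(
    request: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a rate-limited call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                break  # budget spent: stop instead of looping forever
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. The important property is the last line: the loop ends in a raised error that something upstream has to handle, not in another silent retry.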

Soft Constraints vs. Hard Constraints in LLM Systems: Why the Mismatch Causes Real Failures

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Most LLM system failures don't come from the model being wrong. They come from the system being wrong about what the model can enforce. When you write "never reveal customer data" in a system prompt and treat that as equivalent to "revoke the database credential," you have introduced a category error that will eventually cause a security incident, a reliability failure, or a broken user experience — and you won't know which one until it happens in production.

The distinction between soft constraints and hard constraints is architectural, not stylistic. Getting it wrong doesn't produce style regressions. It produces breaches.

The Staging Environment Lie: Why Pre-Production Fails for AI Systems

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

Your staging environment passed all its checks. The LLM responded correctly to every test prompt. Latency was good. Quality scores looked fine. You shipped. Then, two days later, production started hallucinating on a class of queries your eval set never covered, your costs spiked 3x because the cache was cold, and a model update your provider pushed silently changed behavior in ways your old test suite couldn't detect. Staging said green. Production said otherwise.

This isn't a testing gap you can close by writing more test cases. Pre-production environments are structurally misleading for AI systems in ways they aren't for traditional software. The failure modes are systematic, and the fix isn't better staging — it's a different architecture.

Tool Call Convergence: Designing Agents That Know When to Stop

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

A LangChain analyzer/verifier agent pair ran for 264 hours straight and racked up $47,000 in API costs. It produced nothing useful. The verifier kept rejecting the analyzer's output without saying what was wrong. The analyzer defaulted to trying again. No one had written a stopping criterion. The loop ran until someone noticed the invoice.

This is the failure mode that doesn't make it into architecture diagrams: agents that know how to call tools but don't know when to stop. The canonical agent loop is a while True that asks the model "should I call a tool?" — but that question has no built-in answer for "I've seen enough." Without convergence logic, you're not building an agent. You're building an expensive polling function.
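As a rough sketch of what convergence logic might look like, the loop below stops for one of three reasons: the step reports it is done, the same observation keeps coming back, or the step budget runs out. The `Step` signature, `run_agent`, and the stop-reason strings are all hypothetical, and comparing observations for exact equality is a deliberately crude stand-in for a real progress check.

```python
from typing import Callable, List, Tuple

# Hypothetical step function: takes the history so far, returns (observation, done).
Step = Callable[[List[str]], Tuple[str, bool]]


def run_agent(step: Step, max_steps: int = 20, max_repeats: int = 2) -> Tuple[str, List[str]]:
    """Run step() until it reports done, repeats itself, or exhausts the budget.

    Returns (stop_reason, history).
    """
    history: List[str] = []
    repeats = 0
    for _ in range(max_steps):
        observation, done = step(history)
        # Count consecutive identical observations as a sign of no progress.
        if history and observation == history[-1]:
            repeats += 1
        else:
            repeats = 0
        history.append(observation)
        if done:
            return "done", history
        if repeats >= max_repeats:
            return "no_progress", history
    return "step_budget", history


# A verifier that rejects everything without saying why trips the no-progress check.
reason, history = run_agent(lambda h: ("rejected", False))
print(reason, len(history))
```

Returning the stop reason alongside the history lets the caller tell a finished task apart from a stalled one, which a bare step cap cannot do.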
