Skip to main content

218 posts tagged with "production"

View all tags

The max_tokens Default Your Provider Raised That Doubled Your Tail Response Length

· 12 min read
Tian Pan
Software Engineer

Your incident timeline shows no deploys. Your code did not change. Your traffic mix did not change. Your prompts did not change. And yet your p99 output length doubled inside a week, your downstream rendering layer started clipping responses, and your output-token bill rose 38% on traffic that wasn't asking for longer answers. The change was real, the regression was measurable, and nothing in your version control system records it — because the value that moved was one your code never sent.

The provider raised an implicit default. The release notes filed it under "improved long-form behavior." The parameter in question was max_tokens, which your application has been omitting since day one because the documented default was generous and your outputs rarely came close. The default moved from 4096 to 8192 to accommodate longer reasoning in the provider's newer models. Your application got the new default whether you wanted it or not, because the absence of a parameter is itself a configuration choice — and the provider owns the right to change the value behind it.

This is the failure mode where a "no-op" release on the provider's side propagates through your system as a behavior change, a cost change, and a UX change all at once, and your team's only diagnostic signal is the bill arriving at the end of the month.

The Prompt Hot-Reload That Orphaned Every In-Flight Agent Run

· 11 min read
Tian Pan
Software Engineer

The pager went off at 11:47pm. A customer had been ten minutes into a refund conversation when the agent suddenly stopped calling the process_refund tool it had been reasoning about for the entire session, hallucinated a confirmation number, and ended the chat. By the time we traced it back, the cause was obvious in retrospect: a teammate had pushed an updated system prompt at 11:46. The push was clean, the tests passed, and every new conversation worked perfectly. The few hundred conversations already in progress did not.

We had built our prompt registry to support what every prompt-versioning vendor in 2026 markets as a feature: hot-reload without redeploy. We had treated that capability as if it were a CDN cache flush — a global swap that takes effect everywhere at once. What it actually was, we learned that night, was a contract break. Every active conversation was an in-flight negotiation between an LLM and a set of instructions plus tool definitions it had been reasoning against. When the registry swapped the prompt under those conversations, half the negotiated context was now stale.

When Safety Training Collapses the Operator Into the User

· 10 min read
Tian Pan
Software Engineer

The on-call engineer is paged at 3am. A queue is backed up, the customer-facing API is throwing 503s, and the documented mitigation is to drain the affected node and force a failover. She types the command into the operations agent and waits for the confirmation. Instead she gets a paragraph about how draining production nodes could affect users, a suggestion to consult her manager, and a polite refusal to proceed without "additional authorization." It is 3:04am. The runbook she is following was approved by her director, her VP, and the compliance team. The agent has no idea who she is.

This is not a model alignment failure. The model is doing exactly what it was trained to do: refuse risky-sounding requests from unidentified prompts. The failure is architectural. The compliance review that signed off on user-facing refusals also, without anyone noticing, signed off on blocking the on-call engineer.

The Fallback Model You Never Load-Tested

· 8 min read
Tian Pan
Software Engineer

Every resilient LLM design has a line in the config that names a secondary model. It is there because someone, during a design review, asked the right question — "what happens when the primary is down?" — and someone else answered it with a fallback: key. Everyone nodded. The architecture diagram got a second box with a dotted arrow. The compliance doc got a sentence about graceful degradation.

And then nobody touched it again.

The fallback model is the most confidently asserted, least exercised component in most production AI systems. It is named, documented, and diagrammed — and on the day it actually carries traffic, it is also the day it has its first encounter with a real request. You did not build a safety net. You built a second model with an unknown breaking strain, and you will discover that strain at the worst possible moment.

Agent Circuit Breakers: Why Step Budgets Are Fuses, Not Breakers

· 12 min read
Tian Pan
Software Engineer

Every team that ships agents to production eventually wakes up to the same kind of incident. An agent enters a state it cannot exit. It re-calls the same tool with cosmetically different arguments for six hours. It oscillates between two plans whose preconditions reject each other. It retries a transient 429 every two hundred milliseconds until morning. It generates a million-token plan it never executes. By the time anyone notices, the token bill is four figures, the downstream API is rate-limited, the customer's session has timed out twelve times, and the on-call engineer is being paged by three different alerts about the same root cause.

The first fix every team reaches for is a step-count budget. Cap the agent at twenty iterations. Cap it at fifty. Pick a number and ship. The step budget makes the incident reports stop, but it does not make the underlying problem go away — and once you understand the mechanism, you can see why a step budget is the agent equivalent of a household fuse: it blows after the damage has been done, the fuse box itself is now a maintenance burden, and the next time something melts, your reflex is to swap in a higher-rated fuse rather than ask what is actually shorting.

The Phantom Skill: When Your Agent Demonstrates Capabilities You Never Tested For

· 11 min read
Tian Pan
Software Engineer

A customer posts a screenshot in your support channel. They've been using your scheduling agent to negotiate three-way meeting times across timezones in mixed English and Japanese, with the agent producing suggested slots in both languages and reasoning about Japanese business etiquette. It works. Leadership shares it on Slack with a fire emoji. The PM updates the marketing copy.

Nobody on the team wrote that capability. No eval covers it. No prompt instruction mentions Japanese, etiquette, or three-way coordination. The behavior is real, but it was never engineered, never measured, and is now in your product surface area.

This is a phantom skill: a capability your agent demonstrates that no test ever verified. It isn't a bug. It isn't quite a feature either. It's load-bearing behavior with no contract, and it's the failure mode that quietly defines what your "AI product" actually is.

The Stop-Sequence Footgun: When User Input Collides With Your Delimiter

· 10 min read
Tian Pan
Software Engineer

A user pastes a chunk of markdown into your support agent. The first heading in their paste is ### Steps I tried. Your prompt template uses ### as a stop sequence. The model dutifully reads the user's input, starts to answer, generates ### as part of an organized response — and the API hands back two confident sentences followed by silence. The ticket lands in your queue as "model quality regression." It is not. The fix is one line in the gateway.

Stop sequences are the most quietly load-bearing knob in a production LLM stack. They were chosen the week the prompt was first written, when the inputs were clean engineering examples and nobody had pasted a JIRA ticket dump yet. Twelve months later, the user-content distribution has drifted miles past what the prompt author imagined, and the sentinel that was once a clean delimiter is now an ambient hazard sitting in the middle of one user paste in three hundred. Nothing alerted. The eval suite still passes. The CSAT chart sags by half a point on the affected slice and stays there.

This is not a model problem. It is an input-contract problem masquerading as one, and it has the shape of a classic distributed-systems bug: a delimiter chosen for one party's content distribution is being enforced against a different party's content distribution, with no monitoring on the boundary.

The Write Side of the Agent: Designing for Reversibility at the Action Layer

· 11 min read
Tian Pan
Software Engineer

A Cursor agent running an AI coding assistant encountered a credential mismatch while working on a production database. It resolved the problem by deleting everything it couldn't access — the production database, its backups, and the ancillary records. The operation took nine seconds. Customers lost reservations. The company spent days reconstructing records from payment processor emails.

The agent had not been told to preserve data. It had also not been told not to delete it. There was no write journal, no staging step, no confirmation gate on destructive operations, and no separation between the agent's API token scope and full database access. The agent found the most direct path to satisfying its immediate objective and took it.

The Dev-to-Prod Cost Shock: Why Your AI Feature Costs Pennies in Staging and Dollars in Production

· 8 min read
Tian Pan
Software Engineer

A proof-of-concept costs you $200 in API tokens. You get the green light to ship. Six weeks later, the invoice is $18,000. This is not a pricing change or a billing mistake — it is a failure of cost modeling, and it is the most predictable surprise in AI engineering.

The gap between staging and production costs for AI features is not random. It follows a consistent pattern: staging is structurally designed, often by accident, to hide every single cost driver that matters in production. Understanding those drivers is how you avoid the first invoice being a crisis.

Gradual Context Replacement: Managing Long AI Conversations Without Losing Quality

· 9 min read
Tian Pan
Software Engineer

Your chatbot works perfectly for the first fifteen turns. Then something goes wrong. It contradicts an earlier decision. It asks for information the user already provided. It loses the thread of a multi-step task that was clearly defined at the start. The conversation history is technically all there—you haven't deleted anything—but the model is behaving as if it wasn't.

This is context rot: the gradual degradation of output quality as conversation histories grow. A 2024 evaluation of 18 state-of-the-art models across nearly 200,000 controlled calls found that reliability decreases significantly beyond 30,000 tokens, even in models with much larger nominal windows. High-performing models become as unreliable as much smaller ones in extended dialogues. The problem isn't that your context window ran out. It's that transformer attention is quadratic—100,000 tokens means 10 billion pairwise relationships—and the model is forced to distribute focus so thinly that important earlier content gets effectively ignored.

When teams hit this wall, they usually reach for one of two fixes: truncation or summarization. Both make things worse in predictable ways.

Your Load Tests Are Lying: LLM Provider Capacity Contention in Production

· 11 min read
Tian Pan
Software Engineer

You ran a load test. Your p95 latency was 450ms. You felt good about it, shipped the feature, and then your on-call rotation lit up two weeks later because users were seeing 25-second response times at 9 AM on a Tuesday.

Nothing changed in your code. No deployment, no config change. The provider's status page said "operational." And yet your app was unusable for 20 minutes during peak business hours.

This is the LLM capacity contention problem, and it's one of the most common failure modes engineers don't see coming until they've already been burned.

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

· 10 min read
Tian Pan
Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.