233 posts tagged with "observability"

Hyrum's Law for Streamed Reasoning: Pacing, Pauses, and Intermediate Tokens Are an Undocumented Contract

· 11 min read
Tian Pan
Software Engineer

A team upgrades from a frontier model to its faster successor. The eval suite is green. Final answers match. Tool-call schemas are identical. The structured outputs validate against the same JSON schema they always did. They ship. Within a day, support tickets pile up: "the assistant feels rushed," "it's not really thinking anymore," "something is off." The product manager pulls telemetry and finds task-completion rates unchanged. The engineering team double-checks the eval and the schema and finds nothing wrong. The complaint is real, but the contract — as the team defined it — is intact.

What changed is the texture of the stream. The old model paused for 800 milliseconds before calling a tool, emitted a "Let me check that..." preamble, and dribbled tokens at roughly 35 per second with natural-feeling clusters around clause boundaries. The new model emits tokens at 90 per second, never pauses, and skips the preamble entirely. None of that was in any documented contract. All of it was load-bearing.

This is Hyrum's law, and streaming makes its surface area enormous. Any observable behavior of your system will be depended on by somebody — and a streaming AI surface exposes far more observable behavior than the team realizes.
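One way to make that texture visible is to record it next to the final answer. The sketch below is a minimal illustration, not anything from the post: `StreamTexture` and `measure_texture` are hypothetical names, and it assumes you can capture a monotonic timestamp for each streamed token.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StreamTexture:
    time_to_first_token_s: float
    tokens_per_second: float
    longest_pause_s: float


def measure_texture(request_start_s: float, arrivals_s: Sequence[float]) -> StreamTexture:
    """Summarize the pacing of one stream from per-token arrival timestamps."""
    if len(arrivals_s) < 2:
        raise ValueError("need at least two token arrivals")
    duration = arrivals_s[-1] - arrivals_s[0]
    if duration <= 0:
        raise ValueError("arrival timestamps must be increasing")
    # Gaps between consecutive tokens; the largest one is the longest pause.
    gaps = [later - earlier for earlier, later in zip(arrivals_s, arrivals_s[1:])]
    return StreamTexture(
        time_to_first_token_s=arrivals_s[0] - request_start_s,
        tokens_per_second=(len(arrivals_s) - 1) / duration,
        longest_pause_s=max(gaps),
    )
```

Comparing these three numbers across model versions would flag a jump from roughly 35 to 90 tokens per second, or a vanished 800 ms pause, even when every answer-level eval stays green.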

Multi-Axis Agent Bisection: When the Regression Lives in the Interaction

· 11 min read
Tian Pan
Software Engineer

Quality regressed overnight. The on-call engineer pulls up the dashboard, traces a few bad sessions, and starts the obvious bisection: the model provider rotated to a new snapshot at 02:00 UTC, so revert to the pinned older alias. Eval suite still red. Roll back yesterday's prompt change. Still red. Pin the retrieval index back to last week's version. Still red. Each owning team rolls back their own axis in isolation and reports "not us." Three hours in, nobody owns the diagnosis because nobody owns the interaction surface where the regression actually lives — the new model interpreting the new tool description in a way the old model never would have.

This is the failure mode single-axis tooling can't solve. git bisect works because the search space is one-dimensional: a linear sequence of commits. An agent doesn't have one timeline. It has four or five timelines running in parallel — model snapshot, system prompt, tool catalog, retrieval index, sampling config — each with its own owner, its own deploy cadence, and its own "rollback" button that returns just its axis to a known state. The regression you're chasing is often a two-factor interaction, and bisecting along any single axis returns false negatives because the bug only fires on the cross-product cell where the new model meets the new tool description.
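A rough sketch of what searching the cross-product could look like, assuming you have a known-good configuration, a known-bad one, and a `passes` callback that runs your eval against any mix of the two. The function name, the axis names, and the callback are hypothetical, and the search is a brute-force enumeration rather than a true bisection.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]


def find_interactions(
    good: Config,
    bad: Config,
    passes: Callable[[Config], bool],
    max_order: int = 2,
) -> List[Tuple[str, ...]]:
    """Return the smallest sets of axes that break the eval when moved from good to bad."""
    changed = [axis for axis in good if good[axis] != bad[axis]]
    culprits: List[Tuple[str, ...]] = []
    for order in range(1, max_order + 1):
        for axes in combinations(changed, order):
            # Skip supersets of a culprit that was already found at a lower order.
            if any(set(found) <= set(axes) for found in culprits):
                continue
            candidate = dict(good)
            for axis in axes:
                candidate[axis] = bad[axis]
            if not passes(candidate):
                culprits.append(axes)
    return culprits
```

With a regression that only fires when the new model meets the new tool catalog, every single-axis candidate passes and only the `("model", "tools")` pair fails. The cost grows with the number of axis pairs, so this only makes sense when each eval run is cheap enough to repeat.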

Tool Behavior Drift: The Schema Held, the Semantics Didn't

· 11 min read
Tian Pan
Software Engineer

Your contract tests are green. The schema validator is happy. The tool returns the same shape it did last quarter. And the user-facing answer has been quietly wrong for six weeks.

This is the failure mode that contract testing was never designed to catch. Contract tests verify that the wire format hasn't changed — that search() still returns { results: [{ id, title, score }] }, that create_event still accepts an ISO 8601 string, that the geocoder still emits { lat, lng }. What they don't catch is the moment the search endpoint starts ranking by recency instead of relevance, the calendar API silently snaps your 14:07 start time to 14:00 in the EU region, the geocoder picks a different point inside the same ambiguous polygon, or the LLM-classifier-as-a-tool is upgraded to a new model behind a stable endpoint and the false-positive rate moves four points in a category your eval set never sampled. The schema held. The behavior didn't. Your agent kept reading green checkmarks and produced regressed answers no error log captured.
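To make the gap concrete, here is a small sketch that puts a schema check next to a behavioral check on the same `search()` response. The golden ID list and both function names are assumptions for illustration. The point is only that the first check can stay green while the second moves.

```python
from typing import Any, Dict, List


def schema_holds(response: Dict[str, Any]) -> bool:
    """Contract-style check: the wire shape is still { results: [{ id, title, score }] }."""
    results = response.get("results")
    return isinstance(results, list) and all(
        isinstance(item, dict) and {"id", "title", "score"} <= item.keys()
        for item in results
    )


def ranking_overlap(response: Dict[str, Any], golden_ids: List[str], k: int = 3) -> float:
    """Behavioral check: fraction of the top-k results that match a golden top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = [item["id"] for item in response["results"][:k]]
    return len(set(top) & set(golden_ids[:k])) / k
```

Run on a fixed set of golden queries and alerted on when the overlap drops, a check like this would catch a switch from relevance to recency ranking that leaves the schema untouched.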

Tool Latency Tail: Why p99 Reshapes Agent Architecture and p50 Hides the Problem

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter launched a seven-step agent and built its latency budget the obvious way: search returns in 200ms, the SQL lookup takes 80ms, the email send is 150ms, and so on down the chain. Add the medians, sprinkle in some buffer, and the math says the agent fits comfortably inside its two-second SLA. The dashboards confirmed it for weeks. Median latency was beautiful. Then customers started complaining the feature was unusably slow, and the dashboards still looked green.

The story they were telling each other was wrong because they had built the architecture around sum(p50) while users were experiencing sum(p99). After three or four hops, the probability that at least one link in the chain has fallen into its own tail is no longer negligible. After seven independent hops it approaches a coin flip if "tail" means each tool's slowest 10 percent, and it is still about 7 percent if it means the slowest 1 percent. None of the per-tool dashboards ever turned red because none of the per-tool services were misbehaving. The problem was that nobody owned the multiplicative composition.
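The composition can be checked with a few lines of arithmetic. This sketch assumes the hops are independent and that each one lands in its tail with the same probability. Real tool latencies are often correlated, so treat the numbers as a rough guide. The function name is hypothetical.

```python
def p_any_hop_in_tail(hops: int, tail_fraction: float) -> float:
    """Probability that at least one of `hops` independent calls lands in its tail."""
    if hops < 0 or not 0.0 <= tail_fraction <= 1.0:
        raise ValueError("hops must be >= 0 and tail_fraction must be in [0, 1]")
    # The chance every hop stays out of its tail is (1 - tail_fraction) ** hops.
    return 1.0 - (1.0 - tail_fraction) ** hops


if __name__ == "__main__":
    for tail in (0.01, 0.05, 0.10):
        print(f"tail={tail:.2f}: {p_any_hop_in_tail(7, tail):.3f}")
```

For seven hops this gives about 0.068 when the tail is the slowest 1 percent of calls and about 0.522 when it is the slowest 10 percent.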

This is not a new lesson. Distributed-systems researchers have been writing about it for forty years. What's new is that every team building agents is rediscovering it, badly, on a deadline.

When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default

· 10 min read
Tian Pan
Software Engineer

The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.

This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.
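One hedged way to stop treating "accepted" as "done" is to give the agent a third state. The sketch below is illustrative only: `Outcome`, `classify`, and the `verify` callback (for example, a read-back of the refund ledger or a delivery-status lookup) are hypothetical names, not part of any particular framework.

```python
from enum import Enum
from typing import Any, Callable, Dict, Optional


class Outcome(Enum):
    CONFIRMED = "confirmed"                      # the effect was observed independently
    ACCEPTED_UNVERIFIED = "accepted_unverified"  # the tool said yes, nothing else did
    FAILED = "failed"


def classify(
    tool_response: Dict[str, Any],
    verify: Optional[Callable[[], bool]] = None,
) -> Outcome:
    """Separate 'the request was accepted' from 'the operation completed'."""
    if not tool_response.get("success"):
        return Outcome.FAILED
    if verify is None:
        return Outcome.ACCEPTED_UNVERIFIED
    return Outcome.CONFIRMED if verify() else Outcome.ACCEPTED_UNVERIFIED
```

An agent that reports anything other than `CONFIRMED` as "submitted, not yet confirmed" gives the user a summary that matches what the system actually knows.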

Wall-Clock Deadline Drift: Why Your Agent Thinks It Has Time It Doesn't

· 9 min read
Tian Pan
Software Engineer

A user clicks send. The agent is configured with a thirty-second budget. The planner inspects the task, sees a deep-research path that takes about twelve seconds and a quick lookup that takes three, and confidently picks the deep path because "we have plenty of time." Twenty-eight seconds later the response lands, two seconds past the SLA the team published last quarter. The dashboard says the agent's reasoning was correct. The retry logic was correct. The tool calls succeeded. Nobody can explain why the user's spinner sat for forty-six seconds.

The bug is not in any single component. It is in the seam between them, in a value the system never thought to refresh: the agent's belief about how much time is left. Somewhere between request acceptance and the model's next planning step, a transparent retry happened, the wall clock advanced, and the deadline metadata didn't. The model is now reasoning about a budget it cashed out fifteen seconds ago and doesn't know it.
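A minimal sketch of the alternative: store an absolute expiry once, at request acceptance, and recompute the remaining budget from a monotonic clock at every planning step. The `Deadline` class, `pick_path`, the path names, and the safety margin are assumptions for illustration. Path costs are ordered most thorough first.

```python
import time
from typing import Callable, Dict, Optional


class Deadline:
    """Absolute expiry fixed at request acceptance, never a cached 'seconds left' value."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining_s(self) -> float:
        # Recomputed on every call, so time spent in retries is always accounted for.
        return max(0.0, self._expires_at - self._clock())


def pick_path(
    deadline: Deadline,
    path_costs_s: Dict[str, float],
    safety_margin_s: float = 1.0,
) -> Optional[str]:
    """Return the first path (in preference order) that still fits the remaining budget."""
    remaining = deadline.remaining_s()
    for name, cost_s in path_costs_s.items():
        if cost_s + safety_margin_s <= remaining:
            return name
    return None
```

If a transparent retry consumes most of the budget before the planner runs, `pick_path` falls back to the quick lookup, or returns `None` so the caller can fail fast, instead of reasoning from the original thirty seconds.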

Capacity Math for Agent Loops: Why Your Provisioned Throughput Is Half of What You Think

· 11 min read
Tian Pan
Software Engineer

A team I worked with launched what they called a "modest" feature: an internal research assistant for a few hundred analysts. Their capacity model said one user request equals one model call, so they sized provisioned throughput against peak user QPS with the standard 30 percent burst headroom. On launch day they hit 429s within an hour. Traffic that should have used 40 percent of their reserved capacity saturated 100 percent, and the postmortem revealed a number nobody had multiplied in: the average request triggered 11 model calls, not one.

This is the most common capacity miss I see in agent rollouts. The math is not subtle and the failure mode is not exotic. The team asked the wrong unit question — they planned in user requests when the meter ticks in model calls — and the reservation they paid real money for evaporated under a load they would have called light if it had been a chat product.
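The missing multiplication is small enough to write down. This sketch uses hypothetical function names and assumes a flat average fan-out per request. Real loops have a distribution, so sizing against a high percentile of calls per request is safer than sizing against the mean.

```python
def required_model_qps(
    peak_user_qps: float,
    model_calls_per_request: float,
    burst_headroom: float = 0.30,
) -> float:
    """Provisioned throughput sized in model calls, the unit the meter actually ticks in."""
    return peak_user_qps * model_calls_per_request * (1.0 + burst_headroom)


def actual_utilization(planned_utilization: float, model_calls_per_request: float) -> float:
    """Utilization once each user request is counted as its real number of model calls."""
    return planned_utilization * model_calls_per_request


if __name__ == "__main__":
    # Figures from the anecdote: 40 percent planned utilization, 11 calls per request.
    print(f"{actual_utilization(0.40, 11):.1f}x of reserved capacity")
```

With the numbers from the story, the planned 40 percent becomes 4.4 times the reservation, which is why the 429s arrived within the hour.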

Agent State Diff: Why Eyeballing Two Traces Doesn't Scale

· 9 min read
Tian Pan
Software Engineer

A regression slips into production. The team picks the failing input, replays it against last week's prompt, and gets a different output. Now they have to figure out why — and the answer is buried in three megabytes of differing text, divergent tool-call sequences, and shuffled retrieved chunks that no human can productively diff. So they paste both transcripts into a side-by-side viewer, scroll for twenty minutes, conclude "the model just felt different today," and ship a hotfix that doesn't address the root cause because they never found it.

This is the agent state diff problem, and it is the first place where general-purpose engineering tooling stops working for agentic systems. A traditional regression bisect runs against deterministic code: the same input produces the same output, and git bisect walks history until you find the commit that broke it. Agent runs aren't deterministic, the inputs aren't a single string, and the "history" is a multi-axis envelope — model snapshot, sampling config, retrieved context, tool catalog, harness flags — any of which can independently change behavior.

The Audit Trail Mismatch: When User, Agent, and Tool Each Have Different Logs

· 10 min read
Tian Pan
Software Engineer

A regulator emails you a single question: did this user authorize this transaction? Six hours later, three engineers are in a chat trying to join the chat surface's conversation log to the planner agent's reasoning trace to the tool's API record. The chat log has a turn ID and the user-visible message but no tool call detail. The planner trace has a tool-invocation record with timestamps that drift from the chat log by several hundred milliseconds. The tool's log has the API call with its own correlation ID that appears nowhere in the agent's record. The downstream service's log has yet another ID with no link back. The team eventually reconstructs the answer by joining on user IDs and approximate timestamps, hopes nothing critical is off by a turn, and ships a PDF to legal.

This is the audit trail mismatch. Every layer's owner believes their logs are fine — and individually, they are. The joined view is the artifact that doesn't exist, and nobody owns its absence. The team only finds out it doesn't exist when an incident, a customer escalation, or a regulator forces the join.

Context Bloat: The AI Memory Leak You Cannot Grep For

· 12 min read
Tian Pan
Software Engineer

A long-running agent session that opened with a 2K context is now paying for 40K tokens of mostly-dead state. The retrieval results from turn three, the directory listing the agent already navigated past, the JSON dump from a tool call whose answer was a single integer — all of it is still riding shotgun on every subsequent inference call, billed in full, dragging on attention. The pattern is structurally identical to a memory leak: unbounded growth of unreferenced data. But no profiler will surface it, because the leak does not live in process memory. It lives inside the conversation history, and most agent frameworks ship without a collector.

The cost shows up in two places at once. The token bill grows quadratically — a 20-step loop where each step contributes 1,000 tokens produces roughly 210,000 cumulative input tokens, not 20,000, because every prior turn is rebilled on every subsequent call. And the model itself starts to degrade: by 50K tokens of accumulated noise, even a model with a 1M-token window has already lost double-digit points of accuracy on the actual task. You are paying more, to think worse, about a problem the model was already past three turns ago.
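The 210,000 figure follows from summing the context size across calls. A short sketch, assuming every step adds the same number of tokens and nothing is ever pruned or cached (the function name is hypothetical):

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens billed when every call resends all prior turns."""
    # Call number `turn` carries turn * tokens_per_step tokens of history.
    return sum(turn * tokens_per_step for turn in range(1, steps + 1))


if __name__ == "__main__":
    print(cumulative_input_tokens(20, 1_000))  # 210000
```

The closed form is tokens_per_step × steps × (steps + 1) / 2, so doubling the loop length roughly quadruples the input bill.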

Diurnal Latency: Why Your AI Feature Is Slowest at 9am ET

· 8 min read
Tian Pan
Software Engineer

Sometime in the last quarter, an engineer on your team opened a Slack thread that started with "the model got slow." They had a graph: p95 latency for your assistant feature climbed steadily from 7am, peaked around 10am Eastern, plateaued through lunch, and quietly recovered after 5pm. The shape repeated the next day, and the day after that. The team retraced their deploys, blamed a tokenizer change, then a context-length regression, then nothing in particular. The fix never landed because the bug never lived in your code.

Frontier model providers run shared inference fleets. When your users wake up, so does the rest of North America, plus the European afternoon, plus every internal tool at every other company that bought into the same API. Queue depth at the provider doubles, GPU contention rises, and your p95 doubles with it — without a single line of your codebase changing. It is the most predictable production incident in your stack and almost no team builds a dashboard for it.
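A dashboard for this can start as a per-hour percentile. The sketch below assumes you already log a Unix timestamp and a latency for each request. It buckets by UTC hour and uses a nearest-rank p95. The function names are hypothetical, and converting the buckets to Eastern time is left to the caller.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def nearest_rank_p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def p95_by_utc_hour(samples: Iterable[Tuple[float, float]]) -> Dict[int, float]:
    """Map each UTC hour (0-23) to the p95 latency of requests that started in it."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for unix_ts, latency_s in samples:
        hour = datetime.fromtimestamp(unix_ts, tz=timezone.utc).hour
        buckets[hour].append(latency_s)
    return {hour: nearest_rank_p95(vals) for hour, vals in sorted(buckets.items())}
```

Plotted over several days, a curve that repeats by hour regardless of your deploys points at shared provider load rather than at your own code.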

Hidden SDK Retries: Why You're Paying Twice and Don't Know It

· 10 min read
Tian Pan
Software Engineer

Open the OpenAI Python SDK source and you will find a quiet line: DEFAULT_MAX_RETRIES = 2. The Anthropic SDK ships the same default. Most TypeScript SDKs match. Two retries, exponential backoff, automatic on connection errors, 408, 409, 429, and any 5xx — fired before your code ever sees the failure. You do not configure this. You do not opt in. You usually do not know it is happening, because the metric your app records is request_count, not attempt_count, and the only span your tracer ever sees is the outer one the SDK closes after the final attempt.

This is fine, mostly, until it is not. Add an application-level retry decorator on top of that SDK call — the kind every team writes after their first 429 — and you have built a 3×3 storm: the SDK tries three times, your wrapper tries three times around the SDK, and a single user request fans out to nine inference calls during a provider degradation. The provider's bill counts every attempt. Your dashboards count one. The reconciliation, when someone finally runs it, is a quarter-end conversation nobody enjoys.
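The fan-out is easy to reproduce without touching a real provider. This sketch simulates it with stand-ins: `FakeProvider`, `sdk_call`, and `app_call` are hypothetical, and they model only the attempt counting, not backoff or any real SDK's internals.

```python
class ProviderDown(Exception):
    """Stand-in for a retryable failure such as a 429 or 5xx."""


class FakeProvider:
    """Always fails, and counts every attempt the way the provider's bill would."""

    def __init__(self) -> None:
        self.attempts = 0

    def call(self) -> str:
        self.attempts += 1
        raise ProviderDown()


def sdk_call(provider: FakeProvider, max_retries: int = 2) -> str:
    """Models an SDK that silently retries: 1 initial attempt + max_retries."""
    for attempt in range(max_retries + 1):
        try:
            return provider.call()
        except ProviderDown:
            if attempt == max_retries:
                raise
    raise AssertionError("unreachable")


def app_call(provider: FakeProvider, app_attempts: int = 3, sdk_max_retries: int = 2) -> str:
    """Models the application-level retry wrapper placed around the SDK call."""
    for attempt in range(app_attempts):
        try:
            return sdk_call(provider, sdk_max_retries)
        except ProviderDown:
            if attempt == app_attempts - 1:
                raise
    raise AssertionError("unreachable")
```

Calling `app_call` against the failing provider leaves `provider.attempts` at 9. Passing `sdk_max_retries=0` brings it back to 3, which is the usual argument for retrying in one layer and counting attempts, not requests.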