
330 posts tagged with "observability"


The Agent Wall-Clock Budget That Raced Your Tool's Own Timeout

· 11 min read
Tian Pan
Software Engineer

There is a class of agent bug that does not appear in any single component when you look at it in isolation. The model is fine. The tool is fine. The retry policy is fine. The timeout values are even, on paper, generous. And yet a tool that consistently completes in eight seconds keeps landing against an agent that has already declared it a failure at seven point nine, replanned around an "error" that never happened, and started a second call that the first call's result is about to collide with.

The bug is not in any of the boxes. It is in the gap between two clocks that nobody agreed should be the same clock.
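One way to picture the two clocks collapsing into one is a single shared deadline from which the tool's timeout is derived. The sketch below is a minimal illustration under that assumption, not the post's own fix; `Deadline`, `tool_timeout`, and the half-second margin are hypothetical names and values.

```python
import time
from typing import Callable


class Deadline:
    """One shared wall-clock budget, measured on a monotonic clock."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        # Seconds left in the budget, never negative.
        return max(0.0, self._expires_at - self._clock())


def tool_timeout(deadline: Deadline, margin_s: float = 0.5) -> float:
    """Timeout to hand the tool: the remaining budget minus a margin
    that leaves the agent time to receive and process the result."""
    return max(0.0, deadline.remaining() - margin_s)
```

Because the tool's timeout is computed from the agent's remaining budget, the tool can never be configured to outlive the caller that is waiting on it.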

The Citation Index Your Chunker Shifted by One When It Started Prefixing Line Numbers

· 11 min read
Tian Pan
Software Engineer

The chunker started prepending [line N] to every chunk. The eval went green. Every citation the model produced after that day pointed to the paragraph one position before the actual evidence, on every document, in the regulated industry the product serves. The team did not find out from the eval. The team found out from an auditor who looked at the cited sentence, read it, and pointed out that it contradicted the claim it was supposed to support.

This is the kind of regression that survives a code review, a manual QA pass on three sample documents, and a feature-flag rollout. None of those checks were wrong in isolation. They were all asking the same question — does a citation appear where one is expected — and none of them were asking the question the auditor asked, which is whether the citation points at the sentence the claim came from. The gap between those two questions is where the off-by-one lived for as long as it lived.

What makes this failure mode worth a separate write-up is not the bug itself. Off-by-one errors are old news. The interesting part is that the failure was produced by two systems that continued to agree on the structure of an integer while silently disagreeing about what the integer meant.

The Conversation Summary Your Agent Regenerated Each Turn Because the Cache Key Included a Timestamp

· 11 min read
Tian Pan
Software Engineer

A cache that is being written to but never read from is not a cache. It is a logging system with extra latency, billed by the kilobyte. And the cruelest version of this failure mode is the one where the cache looks healthy from every angle except the one that matters: the set calls succeed, the get calls return quickly, the keys are well-formed, the values are valid, the TTLs are sensible. The only thing wrong is that no get call ever finds the key a previous set call wrote, because a single field in the key changes every time it is computed.

This is the story of a debugging session that added a timestamp to a cache key "so I can tell which cache entry I'm looking at," and the system that quietly paid for fourteen extra LLM calls per conversation for two weeks before anyone noticed.
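The failure can be reproduced in a few lines. This is a hedged sketch: the key layout, the function names, and the `now_ns` parameter are assumptions made for illustration, since the excerpt does not show the real key format.

```python
import hashlib
import json


def _digest(payload: dict) -> str:
    # Canonical JSON so that equal payloads always hash to the same key.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def unstable_key(conversation_id: str, turns: list, now_ns: int) -> str:
    # Anti-pattern: the timestamp is part of the key, so the key differs on
    # every computation and no get can match an earlier set.
    return _digest({"conv": conversation_id, "turns": turns, "ts": now_ns})


def stable_key(conversation_id: str, turns: list) -> str:
    # The key depends only on the inputs that determine the summary.
    return _digest({"conv": conversation_id, "turns": turns})


def cache_value(summary: str, now_ns: int) -> dict:
    # Debug information lives in the value, where it cannot affect lookups.
    return {"summary": summary, "written_at_ns": now_ns}
```

Storing the timestamp in the value still answers "which cache entry am I looking at" without changing what a lookup matches.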

The Debug Logger That Put Your System Prompt in a Customer-Readable Audit Feed

· 10 min read
Tian Pan
Software Engineer

A security-conscious customer pulled their tenant's audit export, opened the JSON, and read the verbatim refusal policy, retrieval pipeline structure, and a handful of internal product identifiers from a field called llm.request.system. No exploit. No prompt injection. No jailbreak. Just a log field your platform team added six months earlier so engineers could correlate prompt versions with incidents — surfaced through a feed your enterprise team had separately opened to tenants for SOC 2 reasons.

The disclosure happened during a normal Wednesday afternoon. Your security team got paged by the customer, not by an alert. The incident timeline doesn't show a deploy on the day of the leak — the misconfiguration shipped on the day the audit feed expanded its field allowlist, which was a different team, a different sprint, and a different ticket. Both reviewers signed off on what they were looking at. Neither was looking at the composition.

The Embedding Deprecation That Halved Your Retrieval Recall Without a Deploy

· 10 min read
Tian Pan
Software Engineer

The most expensive embedding bug a RAG system can ship is the one where nothing in your repository changes. Your retrieval code is the same. Your index is the same. Your query path is the same. And one Tuesday in week six, somebody notices that the answers used to be better.

The provider posted a sunset notice for the embedding family your index was built against twelve months ago. The platform team filed it in a deprecations dashboard with a year of runway and moved on. The sunset path wasn't a hard cutoff — it was a quiet quality regression where the deprecated endpoint started routing to a "compatibility" successor that returned vectors in the same dimensionality and a subtly different semantic geometry. Query embeddings began drifting against the corpus you embedded a year ago. Recall@10 on your standing eval slid by 47% over six weeks. The team only traced it back when an unrelated quality dashboard crossed a threshold, dragging a senior engineer into a root-cause exercise that ended at an embedding endpoint no one on the call had touched in a year.

The Eval Harness That Ran on Yesterday's Prompt Template After Your Team Shipped a New One

· 9 min read
Tian Pan
Software Engineer

The incident timeline reads cleanly. At 9:02 your platform team pushed prompt-template@v38 to the config service. At 11:14 your dashboards showed everything green. At 16:51 someone in support flagged a spike in escalations. At 17:03 you opened the eval suite, found a regression score of 0.34, and rolled back. The post-mortem says "caught in eight hours, no customer harm beyond the 0.04% who saw it." Engineering leadership applauds the response time.

It is wrong. The regression was caught in zero hours. The eval suite running at 17:03 was the same eval suite running at 09:03. It had been pointed at v37 the entire time. The harness loaded the template from your config service at process startup, cached the rendered prompts as Python objects in module-level scope, and never reread the source. Your live traffic moved to v38 at 9:02. Your eval moved at 17:03, when someone restarted the worker pool to "rerun the regression." Eight hours of customer interactions ran against a prompt that no eval had ever scored, while the eval kept grading a prompt that no production request was using.
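A minimal sketch of one guard against this, assuming the config service can report the live template version: the harness compares its cached version with the live one before scoring and refuses to report a result for a stale template. `Template`, `EvalHarness`, and `fetch_template` are hypothetical names, not the team's actual harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Template:
    version: str
    text: str


class StaleTemplateError(RuntimeError):
    pass


class EvalHarness:
    def __init__(self, fetch_template: Callable[[], Template]):
        self._fetch = fetch_template
        # Loaded once at startup, as in the incident described above.
        self._cached = fetch_template()

    def refresh(self) -> None:
        self._cached = self._fetch()

    def run(self, score: Callable[[str], float]) -> dict:
        # Compare against the live version before scoring anything.
        live = self._fetch()
        if live.version != self._cached.version:
            raise StaleTemplateError(
                f"eval holds {self._cached.version}, production serves {live.version}"
            )
        # Stamp the result with the version that was actually scored.
        return {"template_version": self._cached.version,
                "score": score(self._cached.text)}
```

Stamping each result with the scored version also lets a dashboard show which template a green check refers to.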

The JSON Schema Your Output Passed and Your Downstream Consumer Rejected for Semantic Drift

· 10 min read
Tian Pan
Software Engineer

A JSON schema validates the shape of your output. It does not validate the meaning of the values inside that shape. For nine months, every output your AI pipeline produces passes validation cleanly, your monitoring shows schema validity at 100%, and your team treats a schema-valid response as a contractually correct one. Then a model upgrade ships, every output continues to validate, and your Slack alerting channel goes from 50 messages a day to 800 overnight.

The schema did not break. The distribution of values inside it did. That is the gap most AI teams discover in production: the JSON contract is a type system, not a behavior system, and a downstream consumer was depending on a value distribution the contract was never asked to enforce.
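To make the distinction concrete, here is a hedged sketch of a check that looks at the distribution of one field rather than its type. The `severity` field and the helper names are assumptions for illustration; the excerpt does not say which field drifted.

```python
from collections import Counter


def value_shares(outputs: list, field: str) -> dict:
    # Fraction of outputs carrying each distinct value of `field`.
    counts = Counter(o[field] for o in outputs)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def max_share_shift(baseline: dict, current: dict) -> float:
    # Largest absolute change in share for any single value.
    keys = set(baseline) | set(current)
    return max(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)


# Every record below would pass a schema that only requires
# `severity` to be one of "low" or "high".
before = [{"severity": "low"}] * 9 + [{"severity": "high"}] * 1
after = [{"severity": "low"}] * 2 + [{"severity": "high"}] * 8

shift = max_share_shift(value_shares(before, "severity"),
                        value_shares(after, "severity"))
```

Both batches are fully schema-valid, yet the share of `"high"` moves from 10% to 80%, which is the kind of change a downstream alerting rule would feel.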

The KV Cache Warm-Up Cron That Ran in Blue and Never in Green Because the Host Pinning Never Moved

· 11 min read
Tian Pan
Software Engineer

The incident review reconstructed a deployment from twelve days earlier as the cause of a 3.6× spend increase, and nobody on the call had been in the room when the change shipped. The deployment was routine: blue/green swap, traffic moved to green on schedule, blue decommissioned, the pipeline turned green, the release engineer closed the ticket. None of the production SLOs tripped. None of the application-layer alerts fired. The system ran exactly as designed.

What had been designed was a cron that pre-warmed the provider's prompt cache against the stable system-prompt prefix every five minutes. The warm-up gave the team a 91% cache hit rate on cold starts and roughly a 4× cost advantage on the first request per session. The cron had been authored a year ago when the blue/green pattern was first introduced, and its host selector was pinned to the blue pool to avoid running the warm-up twice during overlap windows. When green became the live color and blue went away, the cron lost its host and silently transitioned from "running every five minutes" to "running never." The cache hit rate decayed over the next 36 hours as the provider's cache TTL aged out the pre-warmed prefixes. The cost dashboard, averaging per-request cost across a daily window, smoothed the slope until the next billing cycle made it loud.

The Logprobs Field Your Provider Removed That Broke Your Confidence Router Silently

· 12 min read
Tian Pan
Software Engineer

The most expensive line in the postmortem was the one nobody wrote: a 200 OK with a missing field. The router that was supposed to escalate hard questions to the stronger model had been escalating zero percent of traffic for six weeks. The cost dashboard was celebrating. The quality dashboard was sliding, but only on the hard-question slice the standing eval set underweighted. Everything looked like a win until a customer complained about a specific kind of question the system used to handle correctly.

The cause was a response shape change one tier up the contract stack. The provider's mid-tier plan had dropped per-token logprobs as part of what the release notes called a "tier-specific feature parity adjustment." The client still received valid JSON. The HTTP status was still 200. The model identifier in the response matched the model identifier in the request. The only thing that changed was that the field the router consumed to make its escalation decision was no longer there, and the defensive default added during an incident a year earlier had quietly become the production default for every request.
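The pattern is easy to sketch. In the illustration below the response shape (a `logprobs` list of floats), the threshold, and the route names are all assumptions, not the provider's actual API; the point is only that a missing signal gets its own outcome instead of falling through to the cheap default.

```python
from typing import Optional


def mean_logprob(response: dict) -> Optional[float]:
    # Returns None when the field is absent or empty, instead of
    # substituting a confident-looking default value.
    tokens = response.get("logprobs")
    if not tokens:
        return None
    return sum(tokens) / len(tokens)


def route(response: dict, threshold: float = -1.0) -> str:
    score = mean_logprob(response)
    if score is None:
        # A missing signal is its own outcome, so it can be counted and
        # alerted on rather than looking like a confident answer.
        return "escalate_missing_signal"
    return "escalate" if score < threshold else "keep"
```

A router written this way turns "the field disappeared" into a visible spike in one route, rather than a quiet drop to zero escalations.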

The Model Identifier Your Provider Re-Pointed to a Finetune for One Tenant and to Base for Everyone Else

· 11 min read
Tian Pan
Software Engineer

A customer support team escalates: "Your assistant used to handle refund-eligibility questions correctly. Last week it started getting them wrong." The on-call engineer pulls a transcript, replays the exact prompt against the same model identifier in a dev account, gets the correct answer, and closes the ticket as "cannot reproduce." Two weeks later the same complaint shows up from a different customer. The engineer replays again, in the same dev account, and gets the correct answer again. The team starts blaming a prompt change nobody made.

The model identifier in the request never changed. The string in the response field matched the string in the request field. The eval suite stayed green for six weeks. The model serving production traffic was a different set of weights from the model serving the eval suite, and had been for the entire life of the account — except for the last six weeks, when it became the same set of weights and the team noticed only because a customer noticed first.

The Model Rollout Flag That Bucketed by Session and Drifted Your A/B Cohort

· 11 min read
Tian Pan
Software Engineer

The post-mortem opened with a sentence everyone in the room wanted to be true: the new model won by 4 percent on satisfaction, p less than 0.01, ship it. A month later a colder analysis found that the lift was a confound, the model was actually flat or slightly worse, and the team had spent the intervening weeks debating which prompt change had "caused" the win. Nothing about the model had caused anything. The experiment had been measuring the wrong thing because the flag service and the analysis pipeline disagreed, silently, about what a cohort was.

This is one of the most expensive failure modes in A/B testing because nothing in the system is broken. The flag service works. The experiment tracker works. The dashboard renders. The statistics are computed correctly on the data they receive. The failure lives in the seam between three components that each carry a different assumption about identity, and the seam is invisible until you go looking for it.

The Nightly Batch That Starved Your Interactive Traffic After a Quota Window Rewrite

· 11 min read
Tian Pan
Software Engineer

A cron job that ran cleanly for ten months is the most dangerous job in your system, because nothing in it changed and nothing in your code changed and the only thing that did change was a sentence in someone else's release notes that nobody on your team reads. The nightly embedding refresh that kicked off at 00:05 UTC every night, drained its work queue in under ten minutes, and went back to sleep was textbook. It coexisted with daytime interactive traffic by occupying the freshly-reset minute quota for a few minutes before users woke up, and by staying well under the daily allotment for the rest of the day. Then the provider rewrote how the daily window was accounted, kept the minute window unchanged, and left every signature your client tested against intact. The batch kept running clean. The interactive surface started returning 429s at 00:13 UTC every night. The team chased an upstream maintenance window that wasn't happening for a week.

The bug was never in your code. The bug was that "a daily limit" stopped meaning what it had meant the day before, and your scheduler was pinned to a wall-clock boundary that aligned with the old meaning. This post is about rate-limit accounting as a contract the provider can revise without breaking any signature, about how two independently-correct schedules compose into a denial-of-service pattern, and about the architectural moves that make a cron job stop being a time bomb wired to someone else's clock.