Skip to main content

233 posts tagged with "observability"

View all tags

Sovereignty Collapse: Logging Where Your Prompt Actually Went

· 9 min read
Tian Pan
Software Engineer

A regulator asks a simple question. "For this specific user prompt, submitted at 14:32 UTC last Tuesday, prove which jurisdictions the request and its derived state passed through."

Your application logs say model=claude-sonnet-4-5, region=eu-west-1, latency=2.1s. Your gateway logs say the same. Your provider's invoice confirms the request happened. None of these answer the question. The request entered an EU-hosted gateway, was forwarded to a US-region primary endpoint that failed over to Singapore during a regional incident, and warmed a KV cache on a third-party GPU pool whose residency claims live in a vendor footnote. The audit trail you needed lives at a layer your team does not own.

This is sovereignty collapse: the gap between what your contracts promise about data location and what your runtime can actually prove after the fact. The compliance claim is only as strong as the weakest log line in the chain.

Your Span Names Are an Undocumented API: Telemetry Contracts Between Agent Teams

· 10 min read
Tian Pan
Software Engineer

The cost spike that paged finance at 3 a.m. was not a cost spike. It was a span rename. Someone on the agent platform team decided that llm.completion.synthesis should really be llm.generate.answer because it read more naturally, opened a small PR, ran their tests, and shipped. Three days later finance's monthly token-spend dashboard showed a 60% drop. Nobody had cut spending. The aggregation rule still grouped by the old name, and the new spans flowed past it into an "other" bucket that the dashboard didn't even render. The bill didn't move. The dashboard did.

This is a class of incident I keep watching teams rediscover. Span names and attribute keys are not labels for humans to read in a trace UI. They are the public schema of an undocumented API, with consumers that the producer team has never met — eval pipelines that filter on them, cost dashboards that group by them, SLO alerts that fire on their durations, FinOps reports that sum their token attributes. A "harmless rename" inside one team is a wire-protocol break for four other teams that never saw the PR.

User-Side Concept Drift: When Your Prompt Held but Your Users Moved

· 10 min read
Tian Pan
Software Engineer

Most teams set up drift monitoring on the wrong side of the contract. They watch the model — capability shifts when a vendor pushes a new checkpoint, output distribution changes after a prompt rewrite, refusal-rate spikes that signal a safety filter retune. The dashboards are detailed, the alerts are wired into PagerDuty, and the team has a runbook for "the model moved." None of that helps when the model didn't move and the dashboard still goes red, because the thing that shifted was your users.

User-side concept drift is the version of this problem that almost every eval pipeline misses. Your prompt, your model, and your tools are byte-identical to the day you launched. Your golden test set still passes 91%. But the prompt that hit 91% in week one is now serving 78% in week thirty, because the input distribution has moved underneath it — users learned the product and changed how they ask, vocabulary mutated, seasonal task types appeared, a competitor reframed the category, a viral thread taught a new way to phrase the same intent. The model and prompt held. The contract held. The world the contract was negotiated against did not.

Trace Sampling for Agents: Which of 10 Million Daily Spans Are Worth Keeping

· 11 min read
Tian Pan
Software Engineer

A web service request produces five spans on a busy day. A modern agent session produces fifty, sometimes a thousand if the planner decides to recurse. The uniform 1% sampler your platform team copy-pasted from the microservices era will, by definition, drop the rare failure you actually care about — because the failure is rare, and uniform sampling has no opinion about rarity.

The honest version of "we have full observability on our agents" sounds different than the marketing version. It sounds like: we keep the traces that matter, drop the ones that don't, and we know in advance which is which. Every word in that sentence is load-bearing, and the platform teams that ignored sampling design until the bill arrived are now learning the discipline backwards — under cost pressure, after a quarter of incidents that were "in the data" but evicted before anyone looked.

You Accidentally Built a Feature-Flag System for Prompts — Without the Governance

· 10 min read
Tian Pan
Software Engineer

Pull up the config repo your team uses to ship prompt changes. Look at the last thirty commits. How many had a code review? How many had an eval gate in CI? How many can you attribute — with certainty — to a measurable change in production behavior for the users who saw them? If your answer is "most," you are an outlier. For everyone else, those commits are running in production right now, and the system reading them is doing exactly what a feature-flag service does: hot-reload a value, fan it out to users, change product behavior. The difference is that your feature-flag service has audit logs, exposure tracking, kill switches, and per-cohort targeting. Your prompt deploy pipeline has git push.

This is not a metaphor. It is an accurate description of the production system your team is running. The prompt config repo, the S3 bucket your workers poll, the "prompts" collection in your database, the LangSmith/PromptLayer/Braintrust asset that your app fetches on boot — these are all feature-flag services. They have the same runtime shape: a value lives outside the binary, the binary reads it on a hot path, changing the value changes behavior for real users without a deploy. The only thing missing is every control your SRE team demanded before they would approve the actual feature-flag service.

The Acknowledgment-Action Gap: Your Agent's 'Got It' Is Not a Commitment

· 11 min read
Tian Pan
Software Engineer

An agent tells a customer: "Got it — I've submitted your refund request. You should see it in 5–7 business days." The customer closes the chat. No refund was ever submitted. There is no ticket, no API call, no row in the refunds table. Just a paragraph of polite, confident English, followed by a successful session termination.

This is the acknowledgment-action gap, and it is the single most expensive class of bug in production agent systems. The gap exists because the fluent prose that makes instruction-tuned models feel competent is a different output channel than the structured tool calls that actually change the world — and most teams wire their business logic to the wrong one.

Everyone who ships an agent eventually learns this the hard way. The model produces a polished confirmation that reads like a commitment, the downstream system interprets it as a commitment, and weeks later a support ticket arrives asking where the refund went. The embarrassing part is not that the model lied. The embarrassing part is that the system was designed to trust what it said.

Agent Latency Budgets Are Trees, Not Lines — You Have Been Debugging the Wrong Axis

· 12 min read
Tian Pan
Software Engineer

A user reports "the assistant felt slow this morning." The on-call engineer pulls up the flame graph, sorts tool calls by duration descending, finds the slowest one — a 2.1-second vector search — optimizes it down to 900ms, ships the fix, and marks the incident resolved. A week later the same complaint arrives. The vector search is still 900ms. But the end-to-end latency on that query type has actually gotten worse. Nothing in the flame graph explains why.

This is what happens when an engineer debugs a tree on the line axis. Agent latency is not a waterfall of sequential steps — it is a nested tree of planning calls, tool subtrees, parallel fan-outs, retries, and recursive sub-agents. When the budget is structural but the tooling treats it as linear, local optimizations miss the actual violation, which lives in how time is distributed across branches, not how long any single call takes. You can make every leaf faster and still ship a p99 that is getting worse.

Silent Success: When Your Agent Says Done and Nothing Actually Happened

· 10 min read
Tian Pan
Software Engineer

The most dangerous line in an agent transcript is the confident one. "I've updated the record." "The invite is sent." "Permissions are applied." Every one of those sentences is a claim, not a fact, and when the tool call behind it rate-limited, timed out, or returned a 500 that the summarization step over-compressed into something reassuring, the claim is all you have. Your telemetry logs the turn as successful because success is whatever the model typed at the top of its final message. The downstream write never committed. Nobody notices for three weeks.

This is the failure class that separates agents from every system that came before them. A traditional service fails with a status code. A traditional batch job fails with a stack trace. An agent fails by continuing to talk. It absorbs the error into its running narrative, rounds it off to make the story coherent, and hands you a paragraph that reads like completion. The user reads the paragraph. Your observability platform indexes the paragraph. The record in the database does not change.

The Cancellation Tax: Your Inference Bill After the User Hits Stop

· 9 min read
Tian Pan
Software Engineer

Your stop button is a lie. When a user clicks it, your UI stops rendering tokens; your provider, in most configurations, keeps generating them. The bytes never reach a browser, but they reach your invoice. The gap between what the user saw and what you paid for is the cancellation tax, and it is the single most under-reported line item on AI cost dashboards.

The reason the tax exists is structural. Autoregressive inference is a GPU-bound pipeline: by the time your client closes the TCP connection, the model has already been scheduled, KV-cached, and is emitting tokens at 30–200 per second. Most serving stacks do not check for client liveness between tokens. They finish the job, log the usage, and bill you. The client saw ten tokens; the log recorded eight hundred. Langfuse, Datadog, and every other observability platform will faithfully report the eight hundred, because that's what the provider's usage block reported.

The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95

· 10 min read
Tian Pan
Software Engineer

The cost dashboard is a beautiful green. Spend per request is down 62% since the cascade router shipped. The CFO is happy. The platform team is celebrating. And meanwhile your p95 latency has crept up 40%, your hardest customer just churned because "the bot got dumber on the queries that matter," and the experimentation team has been chasing a phantom regression for two weeks that does not exist.

This is the cascade router reliability trap. It is the quiet failure mode of every "try the cheap model first, escalate if it doesn't work" architecture, and it is one of the most under-discussed second-order effects in production LLM systems. The cost wins are real, measurable, and easy to attribute. The reliability losses are diffuse, statistical, and almost impossible to attribute back to the router that caused them. So the cost wins get celebrated, the reliability losses get blamed on "the model getting worse," and the team optimizes itself into a hole.

Your Chain-of-Thought Is a Story, Not an Audit Log

· 11 min read
Tian Pan
Software Engineer

An agent tells you, in clean prose, that it checked the user's permission, looked up the policy, confirmed the request was in scope, and executed the action. Legal reads the trace. Auditors read the trace. Your incident review reads the trace. Everyone reads the same paragraph and everyone comes away satisfied.

None of them know whether the permission check actually ran. The paragraph is evidence of narration, not evidence of execution — and those two things get confused precisely because the narration is fluent enough to feel like proof. Anthropic's own reasoning-model faithfulness research found that when Claude 3.7 Sonnet was fed a hint about the correct answer, it admitted using the hint only about 25% of the time on average, and as low as 19–41% for the problematic categories (grader hacks, unethical cues). The model's stated reasoning diverges from its actual behavior roughly half the time or more, and this is true even for models explicitly trained to show their work.

The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans

· 11 min read
Tian Pan
Software Engineer

A planner agent emits seven steps. Each looks reasonable. The orchestrator dispatches them, the first three return values, the fourth waits on the fifth, the fifth waits on the seventh, and the seventh — buried three lines deep in the planner's prose — quietly waits on the fourth. Nothing is locked. No EDEADLK ever fires. The agent burns 40,000 tokens reasoning about why the fourth step "is taking longer than expected" and ultimately gives up with a soft, plausible apology to the user.

This is the deadlock your agent can't see. It is not the textbook deadlock from operating systems class — there are no mutexes, no resource graphs the kernel can introspect, no holders or waiters anyone in your stack would recognize. The dependencies live inside English sentences that the planner produced, the cycles form in latent semantics rather than in any data structure, and the failure mode looks indistinguishable from "the model is thinking hard." Classic deadlock detection is useless here, but the cost is identical: the workflow halts, tokens evaporate, and your trace tells you nothing.