23 posts tagged with "slo"

The Logprobs Field Your Provider Removed That Broke Your Confidence Router Silently

· 12 min read
Tian Pan
Software Engineer

The most expensive line in the postmortem was the one nobody wrote: a 200 OK with a missing field. The router that was supposed to escalate hard questions to the stronger model had been escalating zero percent of traffic for six weeks. The cost dashboard was celebrating. The quality dashboard was sliding, but only on the hard-question slice the standing eval set underweighted. Everything looked like a win until a customer complained about a specific kind of question the system used to handle correctly.

The cause was a response shape change one tier up the contract stack. The provider's mid-tier plan had dropped per-token logprobs as part of what the release notes called a "tier-specific feature parity adjustment." The client still received valid JSON. The HTTP status was still 200. The model identifier in the response matched the model identifier in the request. The only thing that changed was that the field the router consumed to make its escalation decision was no longer there, and the defensive default added during an incident a year earlier had quietly become the production default for every request.
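One way to keep a missing field from collapsing into a routing default is to treat absence as its own outcome. The following is a minimal sketch, not the router from the incident. It assumes the provider response is a dict with an optional `logprobs` list of per-token log probabilities. The field shape, the `route` function, and the 0.8 threshold are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RouteDecision:
    escalate: bool
    reason: str  # "confident", "low_confidence", or "signal_missing"


def mean_token_prob(logprobs: Optional[List[float]]) -> Optional[float]:
    """Geometric mean of token probabilities, or None if there is no signal."""
    if not logprobs:
        return None
    return math.exp(sum(logprobs) / len(logprobs))


def route(response: dict, threshold: float = 0.8) -> RouteDecision:
    # Hypothetical response shape: {"text": ..., "logprobs": [float, ...]}.
    confidence = mean_token_prob(response.get("logprobs"))
    if confidence is None:
        # Fail toward the stronger model and tag the reason, so a dashboard
        # can alert when "signal_missing" stops being rare.
        return RouteDecision(escalate=True, reason="signal_missing")
    if confidence < threshold:
        return RouteDecision(escalate=True, reason="low_confidence")
    return RouteDecision(escalate=False, reason="confident")
```

Counting decisions by `reason` is the part that matters: a jump in `signal_missing` from near zero to every request becomes visible even when each response is a 200 OK.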

Your Latency SLO Is a Function of Other Teams' Prompt Sizes

· 10 min read
Tian Pan
Software Engineer

Your chat product has been running quietly at a 1.5-second p99 latency SLO for months. The request rate is flat, the prompt sizes are flat, the model has not changed. Then, on a Tuesday afternoon, p99 jumps to 4.8 seconds and stays there. The on-call investigation finds no anomaly in the chat path: same requests-per-minute, same median prompt of around 800 tokens, same retry behavior on the SDK. The deploy log for the chat service is empty for the day. The breach lasts six hours.

The cause is in another team's repo. That morning, a long-document summarization feature shipped on the same organization key, with average prompts of 12,000 tokens. Their request rate is modest, a few hundred per minute, but each call burns through the shared tokens-per-minute budget fifteen times faster than one of yours. The provider's throttle fires on the chat path because the chat path draws from the same bucket the summarization team just emptied. Nobody changed your code, nobody breached anyone's planned capacity, and your SLO is now a function of a workload your team has never looked at.
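The arithmetic behind the shared bucket is short enough to write down. This sketch uses the prompt sizes from the scenario (800 and 12,000 tokens). The request rates and the size of the shared budget are hypothetical values chosen only to make the numbers concrete.

```python
def tokens_per_minute(requests_per_minute: int, avg_prompt_tokens: int) -> int:
    """Prompt-token demand a workload places on a shared TPM budget."""
    return requests_per_minute * avg_prompt_tokens


SHARED_TPM_BUDGET = 2_000_000  # hypothetical organization-level limit

chat = tokens_per_minute(1_000, 800)        # hypothetical chat request rate
summarizer = tokens_per_minute(300, 12_000)  # "a few hundred per minute"

per_call_ratio = 12_000 / 800                                   # 15.0
chat_utilization = chat / SHARED_TPM_BUDGET                     # 0.4
combined_utilization = (chat + summarizer) / SHARED_TPM_BUDGET  # 2.2
```

Under these assumed numbers the chat workload alone uses 40% of the budget, and the two together ask for 220% of it. The throttle that follows lands on whichever request arrives next, regardless of which team caused the overage.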

The CDN Edge Cache Your AI Feature Could Not Use Because the Response Varies Per User

· 9 min read
Tian Pan
Software Engineer

The product team set the SLO for the new AI summarizer at 200ms TTFB because that is what the rest of the product hits at p50. Nobody on the call asked where the 200ms came from. It came from a decade of static assets and JSON responses served out of a CDN edge cache with an 85% hit rate, where most requests never reached origin and the ones that did were small. The summarizer is per-user, generated fresh each call, and travels edge → origin → model provider on every request. The SLO was structurally unmeetable on day one. The team discovered this in week six, after the dashboard had been red the whole time.

This is a recurring pattern in AI feature launches. The latency bar an organization built on top of one set of physics gets inherited by a feature with completely different physics, and the gap between the inherited target and the achievable floor becomes a months-long mitigation project instead of a Day 0 design constraint. The numbers do not care that the SLO was negotiated with a customer in good faith.

The Latency-Budget Router That Was a Quality-Loss Router by Another Name

· 10 min read
Tian Pan
Software Engineer

A model router that optimizes a single loss function will deliver exactly what that loss function asks for, and nothing else. When the function is "stay under the p95 latency target," every query that would have benefited from extended reasoning gets snapped to the cheapest path the router can defend, because the fast model returns under the SLO and the slow-but-correct model would not. The latency dashboard turns green. The aggregate eval moves a fraction of a point and the team rounds it to noise. The per-slice view nobody graphs is where the actual regression lives: concentrated in the multi-step, ambiguous, and out-of-distribution queries that should have been routed to reasoning and instead got the model that finishes fast and is wrong with confidence.

This is not a routing bug. The router is doing exactly what it was built to do. The bug is in the framing — a system whose optimizer is denominated entirely in latency will produce quality regressions invisible to the metric the team is paid to keep green. It will then ship those regressions silently, because the people watching the dashboard are not the people watching the answers.
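The per-slice view is cheap to compute once eval results carry a slice label. Here is a minimal sketch with synthetic data invented for illustration. The slice names, the counts, and the `accuracy_by_slice` helper are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_slice(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per slice, plus an "__all__" entry for the aggregate."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for slice_name, is_correct in results:
        for key in (slice_name, "__all__"):
            total[key] += 1
            correct[key] += int(is_correct)
    return {name: correct[name] / total[name] for name in total}


# Synthetic eval: 90 simple queries and 10 multi-step queries.
before = accuracy_by_slice(
    [("simple", True)] * 81 + [("simple", False)] * 9
    + [("multi_step", True)] * 9 + [("multi_step", False)] * 1
)
after = accuracy_by_slice(
    [("simple", True)] * 85 + [("simple", False)] * 5
    + [("multi_step", True)] * 4 + [("multi_step", False)] * 6
)
```

In this made-up example the aggregate moves from 0.90 to 0.89, which is easy to round to noise, while the `multi_step` slice falls from 0.90 to 0.40.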

The Latency Budget Your Orchestrator Spent on Its Own Planning Step

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a week-long instrumentation pass on a customer-support agent that had, on paper, a perfectly reasonable median latency. P50 was inside SLO, P95 was uncomfortable but explainable, and the tool-call traces looked healthy. Then someone bucketed the spans by type and the room got quiet. The agent was spending roughly 58% of its wall-clock per run inside spans labeled "plan," "reflect," "decide-next-step," and "self-check." Tool execution (the database lookups, the CRM writes, the auth checks) accounted for under 30%. The work the agent was being measured on took less time than the work nobody was measuring.

That ratio is not a fluke. It is the natural state of any plan-act-observe loop that you do not actively police. The orchestrator is paid in latency for thinking and paid in latency for acting, and the thinking step is almost always cheaper to add than the acting step, so it grows unchecked. By the time you notice, "decide what to do next" has become its own line item — bigger than most of the line items you originally built the agent to serve.
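Bucketing spans by type needs very little code once traces are exported. This is a minimal sketch, assuming spans arrive as `(name, duration_ms)` pairs that are sequential and non-nested. The span names for planning come from the labels above. The tool span names and durations are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PLANNING_SPANS = {"plan", "reflect", "decide-next-step", "self-check"}


def time_share_by_kind(spans: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Fraction of total span time spent in planning vs. everything else."""
    totals: Dict[str, float] = defaultdict(float)
    for name, duration_ms in spans:
        kind = "planning" if name in PLANNING_SPANS else "other"
        totals[kind] += duration_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: ms / grand_total for kind, ms in totals.items()}


# Hypothetical single run; tool span names are illustrative only.
run = [
    ("plan", 300.0),
    ("tool:db_lookup", 200.0),
    ("reflect", 280.0),
    ("tool:crm_write", 100.0),
    ("self-check", 120.0),
]
shares = time_share_by_kind(run)
```

Nested or overlapping spans would double-count under this scheme, so a real trace needs self-time per span rather than raw duration.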

The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller

· 11 min read
Tian Pan
Software Engineer

Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.

This is the failure mode the provider shipped without anyone noticing. They did not break the rate. They renegotiated the unit. The same number of tokens arrives per second, but in a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.
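A throughput metric cannot tell these two streams apart, but a chunk-level metric can. This sketch takes the token count of each chunk seen in a fixed observation window. The four-token and single-token chunk sizes come from the scenario above. The 40 tokens per second rate and the `stream_stats` helper are hypothetical.

```python
from typing import Dict, List


def stream_stats(chunk_token_counts: List[int], window_s: float) -> Dict[str, float]:
    """Throughput and chunking shape for chunks observed in one window."""
    total_tokens = sum(chunk_token_counts)
    n_chunks = len(chunk_token_counts)
    return {
        "tokens_per_second": total_tokens / window_s,
        "chunks_per_second": n_chunks / window_s,
        "mean_tokens_per_chunk": total_tokens / n_chunks if n_chunks else 0.0,
    }


# One second of each stream, both delivering 40 tokens.
last_week = stream_stats([4] * 10, window_s=1.0)
this_week = stream_stats([1] * 40, window_s=1.0)
```

Both streams report the same `tokens_per_second`, so an SLO written on that number holds. Alerting on `mean_tokens_per_chunk` or `chunks_per_second` is what would have caught the change.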

The Voice Agent SLO Defined in Time-to-First-Audio Your Provider Measured in Time-to-First-Token

· 10 min read
Tian Pan
Software Engineer

The product spec says the user hears a response within 600 ms of finishing their sentence. The LLM provider's dashboard reports time-to-first-token at 280 ms. You are comfortably inside SLO on every chart you check. The user still complains the agent is laggy, and when you finally sit on a call yourself, there is a noticeable pause before the voice comes back, somewhere north of 600 ms, every time. The dashboard is not lying. It is measuring a number that does not include the TTS pipeline, the audio transport, or the jitter buffer on the receiving end. The 350 ms gap between the last token streamed and the first audio frame is real; it just is not on the LLM team's chart.

The bug is not in the model. The bug is in the SLO. It was defined at the wrong layer of the stack. The provider's egress is not the user's ear, and any latency contract that pretends otherwise will look healthy in production while the product feels broken.

Where You Defined 'First Token' Decided Whether Your Latency SLO Was Real

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a reasoning-tier upgrade on a Tuesday and started getting support tickets on Wednesday. Users were saying the assistant felt "broken," "frozen," "hung." The on-call engineer pulled up the latency dashboard and found nothing unusual. p99 first-token latency was 612 ms — comfortably under the 800 ms SLO that the team had spent a quarter establishing. The dashboard was green. The phone was ringing.

The bug turned out to be a single instrumentation decision made fourteen months earlier, before reasoning models existed in production. The metric labeled "first token" measured the timestamp on the first chunk emitted by the provider. After the upgrade, the first chunk was a reasoning token — invisible to the user, never rendered, but counted as "first" by the SLO. The model was emitting four to seven seconds of internal thoughts before the first user-visible character streamed. Every chart stayed green. Every user waited in the dark.

This is not a story about a bad metric. The metric was correct for the model it was designed against. It is a story about what happens when the boundary you instrumented stops being the boundary your users feel — and how dangerously easy it is to ship that drift without noticing.
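The two boundaries can be recorded side by side. This is a minimal sketch, assuming each streamed chunk carries a `kind` of either `"reasoning"` or `"text"`. That field, the tuple layout, and the example timings other than the 612 ms first chunk are hypothetical and will differ by provider.

```python
from typing import Iterable, Optional, Tuple

# (milliseconds since request start, kind, text); layout is hypothetical.
Chunk = Tuple[int, str, str]


def first_token_latencies(chunks: Iterable[Chunk]) -> Tuple[Optional[int], Optional[int]]:
    """Return (first chunk of any kind, first user-visible chunk) in ms."""
    first_chunk_ms: Optional[int] = None
    first_visible_ms: Optional[int] = None
    for elapsed_ms, kind, text in chunks:
        if first_chunk_ms is None:
            first_chunk_ms = elapsed_ms
        if kind == "text" and text:
            first_visible_ms = elapsed_ms
            break
    return first_chunk_ms, first_visible_ms


stream = [
    (612, "reasoning", "considering the question"),
    (2100, "reasoning", "checking constraints"),
    (5200, "text", "Sure"),
]
wire_first_ms, visible_first_ms = first_token_latencies(stream)
```

Here the wire-level metric reads 612 ms while the user-visible one reads 5200 ms. Emitting both makes the drift between them a number on a chart instead of a support ticket.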

Fourth-Party Risk: When Your Vendor's Vendor Owns Your Customer's Incident

· 11 min read
Tian Pan
Software Engineer

Your contract is with the model provider. Your runbook handles the case where that provider is degraded. Your status page subscription pages you when their dashboard turns yellow. You feel covered. Then one Wednesday afternoon the underlying cloud region your provider runs in starts brownouts, your provider's failover region is also affected because they consolidated capacity to control unit economics, and your product is half-down for ninety minutes because of a vendor decision two layers upstream from any contract you signed.

The customer postmortem request lands in your inbox the next morning. They want a root cause. The root cause lives in a layer your status page cannot see and your contract does not let you compel. That layer is what fourth-party risk actually is — not a procurement checkbox, but a silent dependency tier that propagates failures upward with attenuation but not absorption.

The Approval Queue That Became Your Critical Path

· 11 min read
Tian Pan
Software Engineer

The design doc said "human in the loop." The launch deck said "safe by default." The incident review six months later said the agent took ninety minutes to send a customer an invoice because the approver was at lunch. None of those documents were lying. They were describing the same component at different points on its load curve — and only one of them got the shape right.

When you put a human between an agent and an irreversible action, you have not added a safety primitive. You have added a service with a queue, a throughput limit, a quality-versus-load curve, and an availability profile. The team that ships the agent without naming that service has shipped a product whose critical path runs through a piece of infrastructure they refuse to operate.
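To see how quickly such a queue bends under load, a textbook M/M/1 model is enough for a first estimate. This is a sketch under strong assumptions: one approver, random (Poisson) arrivals, and exponentially distributed review times. Real approvers take lunch and batch their work, so treat the output as a lower bound on intuition, not a forecast. The function name and the rates are hypothetical.

```python
def mean_queue_wait_hours(arrivals_per_hour: float, approvals_per_hour: float) -> float:
    """Mean time a request waits before review in an M/M/1 queue.

    Uses Wq = lambda / (mu * (mu - lambda)); unbounded once arrivals
    reach or exceed the approver's throughput.
    """
    if arrivals_per_hour >= approvals_per_hour:
        return float("inf")
    return arrivals_per_hour / (
        approvals_per_hour * (approvals_per_hour - arrivals_per_hour)
    )


# One approver who can review 10 requests per hour (hypothetical).
wait_at_80_percent = mean_queue_wait_hours(8.0, 10.0)   # about 0.4 h
wait_at_95_percent = mean_queue_wait_hours(9.5, 10.0)   # about 1.9 h
```

Moving from 80% to 95% utilization multiplies the mean wait by nearly five in this model, which is the quality-versus-load curve the design doc never drew.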

The Vector Index Has a Staleness SLO Nobody Set

· 10 min read
Tian Pan
Software Engineer

A user asks your agent what the current price tier is for an enterprise plan. The agent retrieves a chunk, reads it, and answers: "$2,000 per month." Confident, sourced, formatted nicely. The problem is that pricing changed four days ago. The number the agent quoted was true last week. The chunk it retrieved was embedded before the change, and the index has not caught up.

Nobody decided this would happen. There was no design review where someone said "the agent may answer from data up to four days old." There is just a re-indexing job that runs nightly, or weekly, and a content team that edits the source whenever they feel like it, and a gap between those two clocks that nobody measures. That gap is a service level objective. It exists whether or not you wrote it down. The only question is whether you set it on purpose or inherited it by accident.
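Measuring the gap between the two clocks takes two timestamps per chunk. This is a minimal sketch, assuming the index stores when each chunk was embedded and the source system exposes a last-modified time. The `IndexedChunk` type, the field names, and the 24-hour objective are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class IndexedChunk:
    chunk_id: str
    embedded_at: datetime
    source_modified_at: datetime


def staleness(chunk: IndexedChunk, now: datetime) -> timedelta:
    """How long the index has been serving content that no longer matches the source."""
    if chunk.source_modified_at <= chunk.embedded_at:
        return timedelta(0)  # embedded after the last edit, so still fresh
    return now - chunk.source_modified_at


def slo_violations(
    chunks: Iterable[IndexedChunk], now: datetime, max_staleness: timedelta
) -> List[str]:
    return [c.chunk_id for c in chunks if staleness(c, now) > max_staleness]


now = datetime(2025, 1, 10)
chunks = [
    # Pricing page edited four days ago, embedded before the edit.
    IndexedChunk("pricing", datetime(2025, 1, 3), datetime(2025, 1, 6)),
    # FAQ re-embedded after its last edit.
    IndexedChunk("faq", datetime(2025, 1, 9), datetime(2025, 1, 2)),
]
violations = slo_violations(chunks, now, max_staleness=timedelta(hours=24))
```

Publishing the maximum staleness across the index as a metric turns the inherited objective into one that can be chosen and alerted on.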

The AI Feature With Two Latencies: You Measure One, Your Users Feel the Other

· 9 min read
Tian Pan
Software Engineer

A traditional HTTP request has one latency that matters: the time from request to response. The p95 of that number is the contract. SRE watches it, the SLO is written against it, and when it regresses someone gets paged. One number, one dashboard, one truth.

A streaming AI feature broke that model the moment the response became a stream, and most teams haven't noticed. There are now two latencies, and they diverge. Time-to-first-token is how long the user stares at a spinner before anything happens. Time-to-completion is how long until the answer is fully written. They are shaped by different forces, fixed by different levers, and felt by the user at completely different emotional weights — and almost every team instruments only the second one, because that's the number the HTTP framework hands them for free.
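Recording both latencies per request is a matter of keeping one extra timestamp. This is a minimal sketch, assuming each request logs its start, first-chunk, and end times in milliseconds. The record layout, the nearest-rank `percentile` helper, and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

# (start_ms, first_chunk_ms, end_ms) per request; layout is hypothetical.
RequestTiming = Tuple[int, int, int]


def percentile(values: List[int], pct: int) -> int:
    """Nearest-rank percentile using integer arithmetic."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def latency_summary(timings: List[RequestTiming], pct: int = 95) -> Dict[str, int]:
    ttft = [first - start for start, first, _ in timings]
    completion = [end - start for start, _, end in timings]
    return {
        "time_to_first_token_ms": percentile(ttft, pct),
        "time_to_completion_ms": percentile(completion, pct),
    }


timings = [
    (0, 300, 3000),
    (0, 400, 4000),
    (0, 500, 5000),
    (0, 2000, 9000),
]
summary = latency_summary(timings)
```

The two percentiles are tracked separately because a change that improves one can leave the other untouched or make it worse.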