18 posts tagged with "llm-infrastructure"

The Agent Wall-Clock Budget That Raced Your Tool's Own Timeout

· 11 min read
Tian Pan
Software Engineer

There is a class of agent bug that does not appear in any single component when you look at it in isolation. The model is fine. The tool is fine. The retry policy is fine. The timeout values are even, on paper, generous. And yet a tool that consistently completes in eight seconds keeps landing against an agent that has already declared it a failure at seven point nine, replanned around an "error" that never happened, and started a second call that the first call's result is about to collide with.

The bug is not in any of the boxes. It is in the gap between two clocks that nobody agreed should be the same clock.
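
One way to make the two clocks the same clock is to pass a single absolute deadline from the agent down to the tool call. The sketch below is illustrative only: `Deadline`, `can_start`, and the half-second margin are hypothetical names and values, not taken from any particular agent framework.

```python
import time


class Deadline:
    """A single absolute deadline shared by the agent and the tools it calls."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        # Hypothetical helper: the clock is injectable so it can be tested.
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left before the agent gives up, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def tool_timeout(self, tool_default_s: float, margin_s: float = 0.5) -> float:
        """Timeout to hand to the tool: never longer than the agent will wait."""
        return max(0.0, min(tool_default_s, self.remaining() - margin_s))


def can_start(deadline: Deadline, expected_s: float, margin_s: float = 0.5) -> bool:
    """Refuse to start a call that is expected to outlive the agent's budget."""
    return deadline.remaining() - margin_s >= expected_s
```

With a 7.9-second budget and a tool that typically takes eight seconds, `can_start` returns `False` before the call is made, so the mismatch surfaces as an explicit decision instead of a race.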

The Conversation Summary Your Agent Regenerated Each Turn Because the Cache Key Included a Timestamp

· 11 min read
Tian Pan
Software Engineer

A cache that is being written to but never read from is not a cache. It is a logging system with extra latency, billed by the kilobyte. And the cruelest version of this failure mode is the one where the cache looks healthy from every angle except the one that matters: the set calls succeed, the get calls return quickly, the keys are well-formed, the values are valid, the TTLs are sensible. The only thing wrong is that no get call ever finds the key a previous set call wrote, because a single field in the key changes every time it is computed.

This is the story of a debugging session that added a timestamp to a cache key "so I can tell which cache entry I'm looking at," and the system that quietly paid for fourteen extra LLM calls per conversation for two weeks before anyone noticed.
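
A minimal sketch of the fix, assuming the summary depends only on the conversation, its turns, and a prompt version: derive the key from exactly those inputs and move the debugging timestamp into the stored value. The function names and the `summary:` prefix are hypothetical.

```python
import hashlib
import json
import time


def summary_cache_key(conversation_id: str, turns: list[str], prompt_version: str) -> str:
    """Build a key only from inputs that determine the summary's content."""
    payload = json.dumps(
        {"conversation": conversation_id, "turns": turns, "prompt": prompt_version},
        sort_keys=True,  # stable field order, so equal inputs give equal keys
    )
    return "summary:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_entry(summary: str) -> dict:
    """Debug metadata such as the write time belongs in the value, not the key."""
    return {"summary": summary, "written_at": time.time()}
```

Two calls with the same inputs now produce the same key, and the question "which cache entry am I looking at" is still answerable from `written_at` on the value.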

The RAG Dedup Step That Broke Silently and Flooded Your Top-K With Near-Duplicates

· 10 min read
Tian Pan
Software Engineer

A retrieval-augmented generation pipeline can degrade for weeks without a single metric noticing. The relevance scores look fine. The retrieval latency is unchanged. The eval slice that touches the broken topic moves a quarter of a point in the wrong direction, and your weekly review chalks it up to noise. Then someone reads the actual context window the model received for a customer ticket and sees the same paragraph three times — once in title case, once in lowercase, once with the punctuation stripped — and you understand that your top-five has secretly been a top-two for a month.

This is the class of failure where the system is doing exactly what it was told to do. The retriever is returning the most similar vectors to the query. Each of those vectors is genuinely about the right topic. The index has no idea that three of them came from the same paragraph indexed three ways, because the ingestion-time dedup pass that was supposed to catch that case is silently skipping it.
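
A sketch of what an ingestion-time dedup pass for this case might look like, assuming exact-match fingerprints over normalized text are enough to catch the title-case, lowercase, and stripped-punctuation variants. The names are hypothetical, and real near-duplicate detection may need fuzzier matching than this.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold case, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(text: str) -> str:
    """Hash of the normalized text, used as the dedup identity."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def dedup(chunks: list[str]) -> tuple[list[str], int]:
    """Keep the first chunk per fingerprint and report how many were dropped."""
    seen: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp not in seen:
            seen.add(fp)
            kept.append(chunk)
    return kept, len(chunks) - len(kept)
```

Returning the dropped count is a deliberate choice: a dedup step that suddenly drops nothing is the signal that it has started skipping silently, and that number can be put on a dashboard.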

The Session Affinity Your Provider Load Balancer Quietly Ignored

· 11 min read
Tian Pan
Software Engineer

Your dashboard says cache hit rate is 71%. Your finance partner is pleased. Your latency p50 is fine. Then a customer support thread arrives from a long-running agent session: turn 14 took eleven seconds to produce the first token, turn 15 took eight, turn 16 took nine. You pull the trace. Every one of those turns reports a cache_read_input_tokens value of zero. The system prompt is sixteen thousand tokens. The user thinks the agent is broken. You think your provider is broken. Neither of you is right. The aggregate hit rate is a survivorship statistic — it averages over the short conversations that hit cache trivially and quietly absorbs the long conversations that have collapsed to cold-on-first-token mid-session.

This is the failure mode that no provider postmortem will ever describe to you, because from their telemetry the system is working as designed. The load balancer is making the routing decision it was told to make. The cache is being populated and evicted on the schedule it was told to follow. The hint you passed — the prompt_cache_key, the conversation ID, the user ID, whatever string you serialized into that field — was advisory the whole time, and "advisory" means "ignored when convenient." Under load, when a scaling event happens, when an upstream pod is draining, when the affinity-aware tier is saturated, your hint quietly degrades to a uniform routing decision. The request lands on a cold pod. The KV tensors that would have served the prefix at sub-millisecond cost are sixteen feet away in a sibling rack and unreachable. Your conversation pays full-prefix cost again, and your dashboard's headline number doesn't move because two thousand other one-turn conversations hit cache fine.
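
To see past the headline number, one option is to split the hit rate by how deep into a conversation each request was. This is a sketch under assumptions: each trace record is a dict with a hypothetical `turn` field alongside the `cache_read_input_tokens` value mentioned above, and the boundary of ten turns is arbitrary.

```python
from collections import defaultdict


def hit_rate_by_turn_bucket(records, boundary=10):
    """Return the cache hit rate for early turns and late turns separately.

    A request counts as a hit when cache_read_input_tokens is greater than zero.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        bucket = "late" if record["turn"] >= boundary else "early"
        totals[bucket] += 1
        if record["cache_read_input_tokens"] > 0:
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def overall_hit_rate(records):
    """The aggregate number a dashboard would typically show."""
    records = list(records)
    hit_count = sum(1 for r in records if r["cache_read_input_tokens"] > 0)
    return hit_count / len(records)
```

If the early bucket sits near 100% while the late bucket sits near zero, the aggregate is hiding exactly the sessions described above.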

The Inference Region Your Data Residency Policy Forgot to Pin

· 9 min read
Tian Pan
Software Engineer

The compliance audit always starts with the same question and your team always answers it the same way. "Where is customer data processed?" In the EU region, the slide deck says, and the SDK config screenshot confirms it, and the DPA promises it. Then the auditor pulls a sample of last quarter's request logs, joins them to the provider's per-request region header, and the room gets quiet. Something like four percent of EU enterprise prompts were served by a US-region inference node during a forty-minute capacity event the team did not know happened. The cache that holds reusable prefixes was in the global pool. The trace store the support team queries is in us-east. The DPA was a slide deck. The contract was a routing hint.

This is the kind of incident that does not show up in a postmortem because no service degraded. The model returned an answer, the user got a response, the latency graph stayed flat. The thing that broke is a thing the dashboards were never wired to see: the geographic path of the request through the provider's infrastructure. Engineers who would never confuse a us-east-1 URL with "the request actually executed in us-east-1" routinely make that exact mistake at the LLM API layer, because the provider's region parameter looks like the AWS one, behaves like the AWS one in the happy path, and silently degrades to "best effort" the moment the preferred region runs out of GPU.

The Backpressure Signal Your Inference Provider Refuses to Send

· 9 min read
Tian Pan
Software Engineer

Your retry logic backs off on 429. Your queue depth alarm fires when latency rises. Between those two signals there is a region of provider load where the right action is "slow down by twenty percent" — and the only thing the provider will tell you is the binary throttle that arrives too late. The single most useful signal for an agent fleet to coordinate on is the one no inference API actually exposes.

A 429 is a tombstone, not a warning. By the time you receive one, the provider has already decided your traffic is excessive, you have already wasted a request's worth of token accounting, and — if you are sharing a tenant with other consumers — they have probably gotten one too. The interesting failure mode is not the 429 itself; it is the seconds before it, when every client in the world is flying blind between "everything is fine" and "you are cut off."
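
Absent a provider signal, a client can only infer pressure from what it observes. The sketch below is one possible approximation, not a provider feature: a hypothetical `LatencyGovernor` that cuts its own concurrency limit by twenty percent when latency rises well above a baseline, halves it on an actual 429, and otherwise recovers one slot at a time. The thresholds are assumptions.

```python
class LatencyGovernor:
    """Client-side concurrency limit driven by observed latency (hypothetical)."""

    def __init__(self, baseline_s: float, max_concurrency: int,
                 slow_factor: float = 1.5, min_concurrency: int = 1):
        self.baseline_s = baseline_s
        self.slow_factor = slow_factor
        self.min_concurrency = min_concurrency
        self.max_concurrency = max_concurrency
        self.limit = float(max_concurrency)

    def observe(self, latency_s: float, throttled: bool = False) -> int:
        """Update the limit from one completed request and return it."""
        if throttled:
            # A 429 arrived: the late, binary signal. Back off hard.
            self.limit = max(self.min_concurrency, self.limit * 0.5)
        elif latency_s > self.baseline_s * self.slow_factor:
            # Latency is elevated: slow down by twenty percent.
            self.limit = max(self.min_concurrency, self.limit * 0.8)
        else:
            # Healthy response: recover gradually, one slot at a time.
            self.limit = min(self.max_concurrency, self.limit + 1)
        return int(self.limit)
```

Latency is a noisy proxy, which is the point of the complaint above: every client ends up guessing at a number the provider already knows.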

Capacity Planning When Every Request Thinks a Different Amount

· 10 min read
Tian Pan
Software Engineer

Classic capacity planning rests on a quiet assumption: requests are roughly interchangeable. A web server handles a login, a search, a checkout — and while those differ, they differ within a band. You measure requests per second, watch p50 and p99 latency, multiply by a safety factor, and provision. The model works because the unit of work — one request — has a stable cost.

Agent workloads break that assumption at the root. One query to your agent resolves in a single short completion: 300 tokens in, 200 out, done in two seconds. The next query, superficially identical, spawns a planning step, fans out to forty tool calls, re-reads its own growing context on every turn, and burns 1.2 million tokens over four minutes. Same endpoint. Same user. Same code path. The cost per request varied by three orders of magnitude, and nothing in the request told you which one you were about to get.
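
The arithmetic behind that claim is worth making explicit. The token counts below come from the example above; the mix of 99 light requests per heavy one is an assumption chosen only to show how far the mean drifts from the median.

```python
import math
import statistics

LIGHT_TOKENS = 300 + 200   # single short completion: 300 in, 200 out
HEAVY_TOKENS = 1_200_000   # planning step plus forty tool calls

# How far apart the two requests are in cost.
ratio = HEAVY_TOKENS / LIGHT_TOKENS
orders_of_magnitude = math.log10(ratio)


def summarize(costs):
    """Median, mean, and max token cost for a batch of requests."""
    return {
        "median": statistics.median(costs),
        "mean": statistics.mean(costs),
        "max": max(costs),
    }


# Assumed traffic mix (hypothetical): 99 light requests for every heavy one.
mix = [LIGHT_TOKENS] * 99 + [HEAVY_TOKENS]
stats = summarize(mix)
```

The ratio is 2,400, a little over three orders of magnitude. Under the assumed mix, the median request costs 500 tokens while the mean is 12,495, roughly 25 times higher, so provisioning from the typical request underestimates the load that a single heavy request in a hundred creates.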

Build vs Buy for the AI Gateway: The Decision That Locks in Your Next 18 Months

· 11 min read
Tian Pan
Software Engineer

The build-vs-buy decision for an AI gateway is almost never made on a framework. It is made on instinct in week one by an engineer who likes the problem, and then revisited in month nine by a director who is tired of the bill. Neither moment is when the decision should actually be made, and neither party is evaluating the choice on the axes that matter eighteen months from now.

The seductive thing about the build path is that month one is cheap. A two-hundred-line proxy in front of OpenAI, a switch statement that routes "claude" requests to Anthropic, a retry loop, and the team has shipped what looks like a gateway. Month nine, that proxy is twelve thousand lines of half-finished retry logic, prompt caching with broken invalidation, cost attribution that nobody trusts, fallback routing that triggered the wrong way during the last incident, an observability schema that diverged from the rest of the stack, and per-tenant rate limiting bolted on after the first enterprise customer asked. Every feature is a worse copy of something the buy path would have shipped on day one. The engineer who wrote the original two hundred lines has left.

The AI Gateway Is the SPOF Nobody Named

· 10 min read
Tian Pan
Software Engineer

The pitch sounded responsible. "Let's not hardcode OpenAI everywhere — we'll put a thin abstraction in front, then we can swap providers if we need to." Two years later, that thin abstraction is a service with its own deploy pipeline, its own SRE on-call, an eval gate that blocks bad prompts, a semantic cache that saves seven figures a year, a retry policy with provider-specific backoffs, an observability schema every dashboard depends on, and a key vault holding the credentials for six model vendors. Every AI feature in the company terminates there.

It is also, almost by accident, the single point of failure with the worst blast radius in the stack. Provider outages are routine: in 2025 OpenAI was tracked as having 294 outage events since January, and Anthropic logged 184.5 hours of total customer impact in December alone. When the primary LLM provider goes down, the gateway routes around it and most users never notice. When the gateway itself dies, every AI feature in every product stops simultaneously, the failover that was supposed to fire never gets a chance, and the postmortem opens with "the abstraction layer we built to insulate us from provider outages was the outage."

Prompt Cache as Covert Channel: TTFT Probing Leaks Cross-Tenant Prompts

· 11 min read
Tian Pan
Software Engineer

Prompt caching is the optimization that pays for itself the moment you turn it on. A long system prompt is hashed once, the KV state lives in GPU memory, and every subsequent request that reuses the prefix skips the prefill cost. Providers report 80% latency reduction and 90% input-cost reduction on cached requests, and at scale the math is irresistible: a single shared prefix amortized across millions of calls turns a line item into a rounding error.

The mechanism that makes the savings work is a shared resource whose hit-or-miss state is observable as latency. That observability is the side channel. A cache hit and a cache miss are distinguishable from outside the network, the difference is large and deterministic, and the optimization that earned its place on the cost dashboard has a second job nobody scoped: it leaks information about what other tenants on the same provider are doing right now.

The First Token Lies: Why Context Loading—Not Inference—Controls Your AI Feature's Latency

· 9 min read
Tian Pan
Software Engineer

Most AI latency conversations focus on the wrong thing. Teams obsess over GPU utilization, model quantization, and batch sizes. Meanwhile, the latency that actually annoys users—the pause before the AI says anything at all—is determined almost entirely by what happens before inference starts. The bottleneck is context, not compute.

Time-to-first-token (TTFT) is the metric that determines whether your AI feature feels responsive or sluggish. And TTFT is dominated by the prefill phase: the time it takes to process the full input context before a single output token is generated. On a 128K-token context, prefill can take seconds. The GPU is working hard, but the user sees nothing.

The solution isn't a better GPU. It's pre-loading the context before the user asks anything.

Per-Tenant Inference Isolation: When Shared Cache, Fine-Tunes, and Embeddings Leak Across Customers

· 12 min read
Tian Pan
Software Engineer

Multi-tenant SaaS solved data isolation a decade ago. Row-level security in Postgres, per-tenant encryption keys, S3 bucket policies scoped to tenant prefixes — by 2018 the playbook was so well-rehearsed that an auditor asking "show me how customer A's data cannot reach customer B" had a one-page answer with a citation per layer. AI features quietly reintroduced the question and the answer is no longer one page.

The interesting part is not that AI broke isolation. The interesting part is where it broke isolation: not at the data layer the audit team has been guarding for ten years, but at four new layers nobody put on the diagram. Prompt cache prefixes share KV state across requests in ways that turn time-to-first-token into a side channel. Fine-tunes trained on aggregated customer data memorize tenant-specific phrasing and surface it back to the wrong customer. Embedding indexes get partitioned logically by query filter when the threat model demands physical separation. KV-cache reuse across requests creates timing channels that nobody threat-modeled when "shared inference is fine" was a reasonable shortcut.

This post is about what changed and what the discipline looks like once you take the problem seriously.