
14 posts tagged with "llm-infrastructure"


The Inference Region Your Data Residency Policy Forgot to Pin

· 9 min read
Tian Pan
Software Engineer

The compliance audit always starts with the same question and your team always answers it the same way. "Where is customer data processed?" In the EU region, the slide deck says, and the SDK config screenshot confirms it, and the DPA promises it. Then the auditor pulls a sample of last quarter's request logs, joins them to the provider's per-request region header, and the room gets quiet. Something like four percent of EU enterprise prompts were served by a US-region inference node during a forty-minute capacity event the team did not know happened. The cache that holds reusable prefixes was in the global pool. The trace store the support team queries is in us-east. The DPA was a slide deck. The contract was a routing hint.

This is the kind of incident that does not show up in a postmortem because no service degraded. The model returned an answer, the user got a response, the latency graph stayed flat. The thing that broke is a thing the dashboards were never wired to see: the geographic path of the request through the provider's infrastructure. Engineers who would never confuse a us-east-1 URL with "the request actually executed in us-east-1" routinely make that exact mistake at the LLM API layer, because the provider's region parameter looks like the AWS one, behaves like the AWS one in the happy path, and silently degrades to "best effort" the moment the preferred region runs out of GPU.
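The auditor's join can be approximated in a few lines. This is a minimal sketch, assuming each logged request carries both the region the tenant's policy pins and the region the provider reported serving from; the `RequestLog` type and its field names are hypothetical, and whether a provider exposes a per-request region value at all varies.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    request_id: str
    pinned_region: str  # region required by the tenant's residency policy (hypothetical field)
    served_region: str  # region the provider reported for this request (hypothetical field)


def residency_violations(logs: list[RequestLog]) -> list[RequestLog]:
    """Return every request that was served outside its pinned region."""
    return [log for log in logs if log.served_region != log.pinned_region]


def violation_rate(logs: list[RequestLog]) -> float:
    """Fraction of requests served outside their pinned region."""
    if not logs:
        return 0.0
    return len(residency_violations(logs)) / len(logs)


# Illustrative sample data, not real logs.
sample = [
    RequestLog("r1", "eu", "eu"),
    RequestLog("r2", "eu", "us"),
    RequestLog("r3", "eu", "eu"),
    RequestLog("r4", "eu", "eu"),
]
print(violation_rate(sample))  # 0.25
```

Running a check like this continuously, rather than once per audit, is what turns the geographic path into something a dashboard can see.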

The Backpressure Signal Your Inference Provider Refuses to Send

· 9 min read
Tian Pan
Software Engineer

Your retry logic backs off on 429. Your queue depth alarm fires when latency rises. Between those two signals there is a region of provider load where the right action is "slow down by twenty percent" — and the only thing the provider will tell you is the binary throttle that arrives too late. The single most useful signal for an agent fleet to coordinate on is the one no inference API actually exposes.

A 429 is a tombstone, not a warning. By the time you receive one, the provider has already decided your traffic is excessive, you have already wasted a request's worth of token accounting, and — if you are sharing a tenant with other consumers — they have probably gotten one too. The interesting failure mode is not the 429 itself; it is the seconds before it, when every client in the world is flying blind between "everything is fine" and "you are cut off."
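Absent a provider-side signal, a client can only infer load from what it observes. The sketch below is one possible heuristic, not something an inference API provides: it treats latency inflation over a baseline as the warning and cuts the send rate by twenty percent, then recovers additively. The function name, thresholds, and rate units are all assumptions chosen for illustration.

```python
def next_rate(
    current_rate: float,
    latency_ms: float,
    baseline_ms: float,
    slow_threshold: float = 1.5,  # latency multiple treated as "provider is loaded" (assumed)
    cut: float = 0.20,            # the "slow down by twenty percent" step
    step: float = 1.0,            # additive recovery per healthy observation (assumed)
    max_rate: float = 100.0,      # ceiling in requests per second (assumed)
) -> float:
    """Return the new request rate after one latency observation."""
    if latency_ms > slow_threshold * baseline_ms:
        # Multiplicative decrease: back off before a 429 arrives.
        return current_rate * (1.0 - cut)
    # Additive increase: probe upward while latency looks healthy.
    return min(max_rate, current_rate + step)


rate = 50.0
rate = next_rate(rate, latency_ms=900.0, baseline_ms=400.0)  # latency inflated, rate drops to ~40
rate = next_rate(rate, latency_ms=420.0, baseline_ms=400.0)  # healthy, rate rises by one step
print(round(rate, 2))  # 41.0
```

Latency is a noisy proxy for provider load, which is exactly why an explicit backpressure signal would be more useful than anything a client can reconstruct on its own.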

Capacity Planning When Every Request Thinks a Different Amount

· 10 min read
Tian Pan
Software Engineer

Classic capacity planning rests on a quiet assumption: requests are roughly interchangeable. A web server handles a login, a search, a checkout — and while those differ, they differ within a band. You measure requests per second, watch p50 and p99 latency, multiply by a safety factor, and provision. The model works because the unit of work — one request — has a stable cost.

Agent workloads break that assumption at the root. One query to your agent resolves in a single short completion: 300 tokens in, 200 out, done in two seconds. The next query, superficially identical, spawns a planning step, fans out to forty tool calls, re-reads its own growing context on every turn, and burns 1.2 million tokens over four minutes. Same endpoint. Same user. Same code path. The cost per request varied by three orders of magnitude, and nothing in the request told you which one you were about to get.
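The spread between those two requests is easy to check, using only the token counts given above:

```python
import math

cheap_tokens = 300 + 200      # tokens in plus tokens out for the short completion
expensive_tokens = 1_200_000  # tokens burned by the fan-out request

ratio = expensive_tokens / cheap_tokens
orders_of_magnitude = math.log10(ratio)
print(ratio, round(orders_of_magnitude, 2))  # 2400.0 3.38
```

A 2,400x spread means an average cost per request says very little about what any individual request will consume.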

Build vs Buy for the AI Gateway: The Decision That Locks in Your Next 18 Months

· 11 min read
Tian Pan
Software Engineer

The build-vs-buy decision for an AI gateway is almost never made on a framework. It is made on instinct in week one by an engineer who likes the problem, and then revisited in month nine by a director who is tired of the bill. Neither moment is when the decision should actually be made, and neither party is evaluating the choice on the axes that matter eighteen months from now.

The seductive thing about the build path is that month one is cheap. A two-hundred-line proxy in front of OpenAI, a switch statement that routes "claude" requests to Anthropic, a retry loop, and the team has shipped what looks like a gateway. Month nine, that proxy is twelve thousand lines of half-finished retry logic, prompt caching with broken invalidation, cost attribution that nobody trusts, fallback routing that triggered the wrong way during the last incident, an observability schema that diverged from the rest of the stack, and per-tenant rate limiting bolted on after the first enterprise customer asked. Every feature is a worse copy of something the buy path would have shipped on day one. The engineer who wrote the original two hundred lines has left.

The AI Gateway Is the SPOF Nobody Named

· 10 min read
Tian Pan
Software Engineer

The pitch sounded responsible. "Let's not hardcode OpenAI everywhere — we'll put a thin abstraction in front, then we can swap providers if we need to." Two years later, that thin abstraction is a service with its own deploy pipeline, its own SRE on-call, an eval gate that blocks bad prompts, a semantic cache that saves seven figures a year, a retry policy with provider-specific backoffs, an observability schema every dashboard depends on, and a key vault holding the credentials for six model vendors. Every AI feature in the company terminates there.

It is also, almost by accident, the single point of failure with the worst blast radius in the stack. When the primary LLM provider goes down (in 2025, OpenAI was tracked as having 294 outage events since January, and Anthropic logged 184.5 hours of total customer impact in December alone), the gateway routes around it and most users never notice. When the gateway itself dies, every AI feature in every product stops simultaneously, the failover that was supposed to fire never gets a chance, and the postmortem opens with "the abstraction layer we built to insulate us from provider outages was the outage."

Prompt Cache as Covert Channel: TTFT Probing Leaks Cross-Tenant Prompts

· 11 min read
Tian Pan
Software Engineer

Prompt caching is the optimization that pays for itself the moment you turn it on. A long system prompt is hashed once, the KV state lives in GPU memory, and every subsequent request that reuses the prefix skips the prefill cost. Providers report 80% latency reduction and 90% input-cost reduction on cached requests, and at scale the math is irresistible: a single shared prefix amortized across millions of calls turns a line item into a rounding error.

The mechanism that makes the savings work is a shared resource whose hit-or-miss state is observable as latency. That observability is the side channel. A cache hit and a cache miss are distinguishable from outside the network, the difference is large and deterministic, and the optimization that earned its place on the cost dashboard has a second job nobody scoped: it leaks information about what other tenants on the same provider are doing right now.

The First Token Lies: Why Context Loading—Not Inference—Controls Your AI Feature's Latency

· 9 min read
Tian Pan
Software Engineer

Most AI latency conversations focus on the wrong thing. Teams obsess over GPU utilization, model quantization, and batch sizes. Meanwhile, the latency that actually annoys users—the pause before the AI says anything at all—is determined almost entirely by what happens before inference starts. The bottleneck is context, not compute.

Time-to-first-token (TTFT) is the metric that determines whether your AI feature feels responsive or sluggish. And TTFT is dominated by the prefill phase: the time it takes to process the full input context before a single output token is generated. On a 128K-token context, prefill can take seconds. The GPU is working hard, but the user sees nothing.

The solution isn't a better GPU. It's pre-loading the context before the user asks anything.

Per-Tenant Inference Isolation: When Shared Cache, Fine-Tunes, and Embeddings Leak Across Customers

· 12 min read
Tian Pan
Software Engineer

Multi-tenant SaaS solved data isolation a decade ago. Row-level security in Postgres, per-tenant encryption keys, S3 bucket policies scoped to tenant prefixes — by 2018 the playbook was so well-rehearsed that an auditor asking "show me how customer A's data cannot reach customer B" had a one-page answer with a citation per layer. AI features quietly reintroduced the question and the answer is no longer one page.

The interesting part is not that AI broke isolation. The interesting part is where it broke isolation: not at the data layer the audit team has been guarding for ten years, but at four new layers nobody put on the diagram. Prompt cache prefixes share KV state across requests in ways that turn time-to-first-token into a side channel. Fine-tunes trained on aggregated customer data memorize tenant-specific phrasing and surface it back to the wrong customer. Embedding indexes get partitioned logically by query filter when the threat model demands physical separation. KV-cache reuse across requests creates timing channels that nobody threat-modeled when "shared inference is fine" was a reasonable shortcut.

This post is about what changed and what the discipline looks like once you take the problem seriously.

Durable Agents: Why Async Queues Break for Long-Running AI Workflows

· 11 min read
Tian Pan
Software Engineer

An agent that works 95% of the time per step is not a 95% reliable agent. Chain twenty steps together and the end-to-end completion rate drops to 36%. This is the arithmetic most teams discover only after their agent hits production, and it is the reason so many "working" prototypes stall the moment real traffic arrives. The fix is not better prompts or bigger models. It is a boring piece of distributed systems infrastructure most AI teams try to avoid until the third outage forces their hand.
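The arithmetic behind that figure, assuming every step succeeds independently with the same probability:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return per_step ** steps


rate = end_to_end_success(0.95, 20)
print(f"{rate:.1%}")  # 35.8%
```

Real agent steps are rarely independent, so this is a rough model, but it shows how quickly per-step reliability compounds.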

The infrastructure is durable execution — the discipline of making a multi-step workflow survive crashes, restarts, and partial failures without losing its place. It is not a new idea. Temporal, Restate, DBOS, Inngest, and Azure Durable Task have been selling it for years. What is new in 2026 is that every serious agent framework has quietly admitted durable execution is table stakes: LangGraph now ships with a PostgresSaver checkpointer, the OpenAI Agents SDK exposes a resume primitive, Anthropic's Managed Agents runs on an internal durable substrate. If your agent architecture still rests on a Celery queue and optimism, you are solving in 2026 a problem the rest of the industry stopped pretending to ignore in 2024.

This post is about the architectural seam between a stateless LLM and the stateful workflow engine that has to wrap it. The seam is where reliability lives, and it is where most teams are currently writing bugs.
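To make "without losing its place" concrete, here is a toy checkpoint-and-resume loop. It is a sketch of the general idea only, not how Temporal, Restate, or any framework named above works internally; `run_workflow`, the JSON file layout, and the step signature are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable

Step = Callable[[list], object]  # each step receives the results of all prior steps


def run_workflow(steps: list[Step], checkpoint: Path) -> list:
    """Run steps in order, persisting progress so a restart resumes after the last completed step."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": 0, "results": []}

    for index in range(state["done"], len(steps)):
        result = steps[index](state["results"])  # may raise; completed steps stay saved
        state["results"].append(result)
        state["done"] = index + 1
        # Persist after every step. A production engine would make this write atomic.
        checkpoint.write_text(json.dumps(state))

    return state["results"]
```

If a step raises midway, calling `run_workflow` again with the same checkpoint path skips the completed steps and retries from the one that failed. In an agent, that means an expensive LLM call that already succeeded is not paid for twice.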

Agentic Web Data Extraction at Scale: When Agents Replace Scrapers

· 10 min read
Tian Pan
Software Engineer

The demo takes 20 minutes to build. You paste a URL, an LLM reads the HTML, and structured data comes out the other end. It feels like the future of web extraction has arrived.

Then you run it at 1,000 pages per hour. Costs spiral, blocks accumulate, and extracted fields start drifting in ways that don't look like errors — they look like normal data until your downstream pipeline has silently ingested three weeks of garbage. The "LLM reads the page" pattern is not wrong; it's just priced for prototype throughput.

Agentic web extraction genuinely solves problems that traditional scrapers cannot. But scaling it past proof-of-concept requires understanding a different set of failure modes than most teams expect.

Multi-User Shared AI Sessions: The Concurrency Problem Nobody Has Solved

· 12 min read
Tian Pan
Software Engineer

Most AI products are built for a single user with a single intent, a single conversation thread, and a single identity. This works well enough when the product is a personal productivity tool—a writing assistant, a code completion engine, a summarizer. But something happens when teams start using AI collaboratively: the product silently breaks in ways that are hard to diagnose and harder to fix. Two users prompt the AI simultaneously, and one of their inputs disappears. A context window shared across five engineers fills up with duplicated history. The AI responds to user A's question using user B's permissions. Nobody designed for any of this, because shipping multi-user shared context means confronting one of the hardest distributed systems problems in modern AI infrastructure.

This post is about what actually makes simultaneous multi-user AI sessions hard, what production teams have tried, and what the emerging architectural patterns are. If you are building a collaborative AI feature and wondering why it feels impossibly complex, this is why.

Agentic Task Complexity Estimation: Budget Tokens Before You Execute

· 10 min read
Tian Pan
Software Engineer

Two agents receive the same user message. One finishes in 3 seconds and 400 tokens. The other enters a Reflexion loop, burns through 40,000 tokens, hits the context limit mid-task, and produces a half-finished answer. Neither the agent nor the calling system predicted which outcome was coming. This is not an edge case — it is the default behavior when agents start tasks without any model of how deep the work will go.

LLM-based agents have no native sense of task scope before execution. A request that reads as simple in natural language might require a dozen tool calls and multiple planning cycles; a complex-sounding request might resolve in a single lookup. Without pre-execution complexity estimation, agents commit resources blindly: cumulative token spend grows quadratically as each turn re-reads the accumulated history, planning overhead dominates execution time, and by the time the system detects a problem, the early decisions that caused it are irreversible.
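The quadratic growth comes from re-reading history. As a simplified model, assume every turn appends a fixed number of tokens and the agent re-processes the whole history on each turn. The context itself then grows linearly, while total tokens processed grow with the square of the turn count. The fixed per-turn size is an assumption for illustration.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens processed when each turn re-reads all prior history."""
    # Turn n sees n * tokens_per_turn tokens of context.
    return sum(turn * tokens_per_turn for turn in range(1, turns + 1))


print(cumulative_input_tokens(10, 1000))  # 55000
print(cumulative_input_tokens(40, 1000))  # 820000
```

Quadrupling the number of turns multiplies the tokens processed by roughly fifteen, which is why an estimate made before execution matters more than a check made midway.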