Skip to main content

11 posts tagged with "llm-infrastructure"

View all tags

Build vs Buy for the AI Gateway: The Decision That Locks in Your Next 18 Months

· 11 min read
Tian Pan
Software Engineer

The build-vs-buy decision for an AI gateway is almost never made on a framework. It is made on instinct in week one by an engineer who likes the problem, and then revisited in month nine by a director who is tired of the bill. Neither moment is when the decision should actually be made, and neither party is evaluating the choice on the axes that matter eighteen months from now.

The seductive thing about the build path is that month one is cheap. A two-hundred-line proxy in front of OpenAI, a switch statement that routes "claude" requests to Anthropic, a retry loop, and the team has shipped what looks like a gateway. Month nine, that proxy is twelve thousand lines of half-finished retry logic, prompt caching with broken invalidation, cost attribution that nobody trusts, fallback routing that triggered the wrong way during the last incident, an observability schema that diverged from the rest of the stack, and per-tenant rate limiting bolted on after the first enterprise customer asked. Every feature is a worse copy of something the buy path would have shipped on day one. The engineer who wrote the original two hundred lines has left.

The AI Gateway Is the SPOF Nobody Named

· 10 min read
Tian Pan
Software Engineer

The pitch sounded responsible. "Let's not hardcode OpenAI everywhere — we'll put a thin abstraction in front, then we can swap providers if we need to." Two years later, that thin abstraction is a service with its own deploy pipeline, its own SRE on-call, an eval gate that blocks bad prompts, a semantic cache that saves seven figures a year, a retry policy with provider-specific backoffs, an observability schema every dashboard depends on, and a key vault holding the credentials for six model vendors. Every AI feature in the company terminates there.

It is also, almost by accident, the single point of failure with the worst blast radius in the stack. When the primary LLM provider goes down — and in 2025 OpenAI was tracked having 294 outage events since January, with Anthropic logging 184.5 hours of total customer impact in December alone — the gateway routes around it and most users never notice. When the gateway itself dies, every AI feature in every product simultaneously stops, the failover that was supposed to fire never gets a chance, and the postmortem opens with "the abstraction layer we built to insulate us from provider outages was the outage."

Prompt Cache as Covert Channel: TTFT Probing Leaks Cross-Tenant Prompts

· 11 min read
Tian Pan
Software Engineer

Prompt caching is the optimization that pays for itself the moment you turn it on. A long system prompt is hashed once, the KV state lives in GPU memory, and every subsequent request that reuses the prefix skips the prefill cost. Providers report 80% latency reduction and 90% input-cost reduction on cached requests, and at scale the math is irresistible: a single shared prefix amortized across millions of calls turns a line item into a rounding error.

The mechanism that makes the savings work is a shared resource whose hit-or-miss state is observable as latency. That observability is the side channel. A cache hit and a cache miss are distinguishable from outside the network, the difference is large and deterministic, and the optimization that earned its place on the cost dashboard has a second job nobody scoped: it leaks information about what other tenants on the same provider are doing right now.

The First Token Lies: Why Context Loading—Not Inference—Controls Your AI Feature's Latency

· 9 min read
Tian Pan
Software Engineer

Most AI latency conversations focus on the wrong thing. Teams obsess over GPU utilization, model quantization, and batch sizes. Meanwhile, the latency that actually annoys users—the pause before the AI says anything at all—is determined almost entirely by what happens before inference starts. The bottleneck is context, not compute.

Time-to-first-token (TTFT) is the metric that determines whether your AI feature feels responsive or sluggish. And TTFT is dominated by the prefill phase: the time it takes to process the full input context before a single output token is generated. On a 128K-token context, prefill can take seconds. The GPU is working hard, but the user sees nothing.

The solution isn't a better GPU. It's pre-loading the context before the user asks anything.

Per-Tenant Inference Isolation: When Shared Cache, Fine-Tunes, and Embeddings Leak Across Customers

· 12 min read
Tian Pan
Software Engineer

Multi-tenant SaaS solved data isolation a decade ago. Row-level security in Postgres, per-tenant encryption keys, S3 bucket policies scoped to tenant prefixes — by 2018 the playbook was so well-rehearsed that an auditor asking "show me how customer A's data cannot reach customer B" had a one-page answer with a citation per layer. AI features quietly reintroduced the question and the answer is no longer one page.

The interesting part is not that AI broke isolation. The interesting part is where it broke isolation: not at the data layer the audit team has been guarding for ten years, but at four new layers nobody put on the diagram. Prompt cache prefixes share KV state across requests in ways that turn time-to-first-token into a side channel. Fine-tunes trained on aggregated customer data memorize tenant-specific phrasing and surface it back to the wrong customer. Embedding indexes get partitioned logically by query filter when the threat model demands physical separation. KV-cache reuse across requests creates timing channels that nobody threat-modeled when "shared inference is fine" was a reasonable shortcut.

This post is about what changed and what the discipline looks like once you take the problem seriously.

Durable Agents: Why Async Queues Break for Long-Running AI Workflows

· 11 min read
Tian Pan
Software Engineer

An agent that works 95% of the time per step is not a 95% reliable agent. Chain twenty steps together and the end-to-end completion rate drops to 36%. This is the arithmetic most teams discover only after their agent hits production, and it is the reason so many "working" prototypes stall the moment real traffic arrives. The fix is not better prompts or bigger models. It is a boring piece of distributed systems infrastructure most AI teams try to avoid until the third outage forces their hand.

The infrastructure is durable execution — the discipline of making a multi-step workflow survive crashes, restarts, and partial failures without losing its place. It is not a new idea. Temporal, Restate, DBOS, Inngest, and Azure Durable Task have been selling it for years. What is new in 2026 is that every serious agent framework has quietly admitted durable execution is table stakes: LangGraph now ships with a PostgresSaver checkpointer, the OpenAI Agents SDK exposes a resume primitive, Anthropic's Managed Agents runs on an internal durable substrate. If your agent architecture still rests on a Celery queue and optimism, you are solving in 2026 a problem the rest of the industry stopped pretending to ignore in 2024.

This post is about the architectural seam between a stateless LLM and the stateful workflow engine that has to wrap it. The seam is where reliability lives, and it is where most teams are currently writing bugs.

Agentic Web Data Extraction at Scale: When Agents Replace Scrapers

· 10 min read
Tian Pan
Software Engineer

The demo takes 20 minutes to build. You paste a URL, an LLM reads the HTML, and structured data comes out the other end. It feels like the future of web extraction has arrived.

Then you run it at 1,000 pages per hour. Costs spiral, blocks accumulate, and extracted fields start drifting in ways that don't look like errors — they look like normal data until your downstream pipeline has silently ingested three weeks of garbage. The "LLM reads the page" pattern is not wrong; it's just priced for prototype throughput.

Agentic web extraction genuinely solves problems that traditional scrapers cannot. But scaling it past proof-of-concept requires understanding a different set of failure modes than most teams expect.

Multi-User Shared AI Sessions: The Concurrency Problem Nobody Has Solved

· 12 min read
Tian Pan
Software Engineer

Most AI products are built for a single user with a single intent, a single conversation thread, and a single identity. This works well enough when the product is a personal productivity tool—a writing assistant, a code completion engine, a summarizer. But something happens when teams start using AI collaboratively: the product silently breaks in ways that are hard to diagnose and harder to fix. Two users prompt the AI simultaneously, and one of their inputs disappears. A context window shared across five engineers fills up with duplicated history. The AI responds to user A's question using user B's permissions. Nobody designed for any of this, because shipping multi-user shared context means confronting one of the hardest distributed systems problems in modern AI infrastructure.

This post is about what actually makes simultaneous multi-user AI sessions hard, what production teams have tried, and what the emerging architectural patterns are. If you are building a collaborative AI feature and wondering why it feels impossibly complex, this is why.

Agentic Task Complexity Estimation: Budget Tokens Before You Execute

· 10 min read
Tian Pan
Software Engineer

Two agents receive the same user message. One finishes in 3 seconds and 400 tokens. The other enters a Reflexion loop, burns through 40,000 tokens, hits the context limit mid-task, and produces a half-finished answer. Neither the agent nor the calling system predicted which outcome was coming. This is not an edge case — it is the default behavior when agents start tasks without any model of how deep the work will go.

LLM-based agents have no native sense of task scope before execution. A request that reads as simple in natural language might require a dozen tool calls and multiple planning cycles; a complex-sounding request might resolve in a single lookup. Without pre-execution complexity estimation, agents commit resources blindly: the context window fills quadratically as turn history accumulates, planning overhead dominates execution time, and by the time the system detects a problem, the early decisions that caused it are irreversible.

The Batch LLM Pipeline Blind Spot: Offline Processing and the Queue Design Nobody Talks About

· 11 min read
Tian Pan
Software Engineer

Most teams building with LLMs optimize for the wrong workload. They obsess over time-to-first-token, streaming latency, and response speed — then discover that 60% or more of their LLM API spend goes to nightly summarization jobs, data enrichment pipelines, and classification runs that nobody watches in real time. The latency-first mental model that works for chat applications actively sabotages these offline workloads.

The batch LLM pipeline is the unglamorous workhorse of production AI. It's the nightly job that classifies 50,000 support tickets, the weekly pipeline that enriches your CRM with company descriptions, the daily run that generates embeddings for new documents. These workloads have fundamentally different design constraints than real-time serving, and treating them as slow versions of your chat API is where the problems start.

Cross-Tenant Data Leakage in Shared LLM Infrastructure: The Isolation Failures Nobody Tests For

· 11 min read
Tian Pan
Software Engineer

Most multi-tenant LLM products have a security gap that their engineers haven't tested for. Not a theoretical gap — a practical one, with documented attack vectors and real confirmed incidents. The gap is this: each layer of the modern AI stack introduces its own isolation primitive, and each one can fail silently in ways that let one customer's data reach another customer's context.

This isn't about prompt injection or jailbreaking. It's about the infrastructure itself — prompt caches, vector indexes, memory stores, and fine-tuning pipelines — and the organizational fiction of "isolation" that most teams ship without validating.