67 posts tagged with "infrastructure"

Your APIs Assumed One Human at a Time. Parallel Agents Broke the Contract.

· 12 min read
Tian Pan
Software Engineer

A backend engineer I know spent a Tuesday afternoon staring at a Datadog graph that had never spiked before: the per-user 429 counter on their internal calendar service. The customer complaining had not changed their behavior. They had simply turned on the assistant feature, which now spawned eight planning threads in parallel against the same calendar API every time the user said "find me time next week." The rate limiter — a perfectly reasonable 60 requests per minute per user, written years ago against a UI that physically could not click that fast — was firing within the first three seconds of every request and silently corrupting half the assistant's responses.

The rate limit was not the bug. The contract was the bug. That backend, like most internal services written before 2024, had a quietly enforced assumption baked into every layer: one user means one stream of activity, paced by a human's reaction time, with one cookie jar, one CSRF token, and one set of credentials that could be re-prompted if anything went wrong. Agents shred all five of those assumptions at once, and the failures show up as a constellation of unrelated incidents — 429 storms, last-write-wins corruption, audit logs you can't subpoena, re-auth loops that hang headless workers — that nobody connects until the pattern is named.

The shorthand I have been using with platform teams is this: every backend you own has an undocumented contract with its callers, and that contract was negotiated with humans. Agents are now showing up to renegotiate. You can either do the renegotiation deliberately, in code review, or you can do it during your next incident.
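
To make the 429 arithmetic concrete, here is a minimal sketch of how quickly a 60-per-minute window drains. The per-caller request rates (0.5 requests per second for a human in a UI, 3 per second for each planning thread) are illustrative assumptions, not measurements from the incident, and `seconds_until_first_429` is a hypothetical helper.

```python
from typing import Optional


def seconds_until_first_429(limit_per_minute: int, callers: int,
                            rate_per_caller: float) -> Optional[float]:
    """Approximate time until a 60-second window is exhausted.

    Returns None when the combined callers never exceed the limit.
    rate_per_caller is in requests per second (an assumed figure).
    """
    total_rate = callers * rate_per_caller  # requests per second, all callers
    if total_rate * 60 <= limit_per_minute:
        return None
    return limit_per_minute / total_rate


# One human clicking through a UI: never trips the limiter.
print(seconds_until_first_429(60, callers=1, rate_per_caller=0.5))
# Eight planning threads at an assumed 3 requests/second each.
print(seconds_until_first_429(60, callers=8, rate_per_caller=3.0))
```

Under those assumed rates the eight threads exhaust the budget in about two and a half seconds, which is consistent with a limiter that fires within the first three seconds of every request.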

Sovereignty Collapse: Logging Where Your Prompt Actually Went

· 9 min read
Tian Pan
Software Engineer

A regulator asks a simple question. "For this specific user prompt, submitted at 14:32 UTC last Tuesday, prove which jurisdictions the request and its derived state passed through."

Your application logs say model=claude-sonnet-4-5, region=eu-west-1, latency=2.1s. Your gateway logs say the same. Your provider's invoice confirms the request happened. None of these answer the question. The request entered an EU-hosted gateway, was forwarded to a US-region primary endpoint that failed over to Singapore during a regional incident, and warmed a KV cache on a third-party GPU pool whose residency claims live in a vendor footnote. The audit trail you needed lives at a layer your team does not own.

This is sovereignty collapse: the gap between what your contracts promise about data location and what your runtime can actually prove after the fact. The compliance claim is only as strong as the weakest log line in the chain.
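
One way to picture the missing evidence is a per-request trail that records every hop rather than only the entry region. The sketch below is an assumption about what such a record could look like; `Hop`, `RequestTrail`, the component names, and the request id are all hypothetical, and the region labels simply mirror the scenario above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hop:
    component: str  # which layer handled the request (hypothetical names)
    region: str     # where that layer ran, or "unknown" if unprovable
    reason: str     # why the request landed there


@dataclass
class RequestTrail:
    request_id: str
    hops: List[Hop] = field(default_factory=list)

    def record(self, component: str, region: str, reason: str = "routed") -> None:
        self.hops.append(Hop(component, region, reason))

    def jurisdictions(self) -> List[str]:
        """Regions touched, in first-seen order, without duplicates."""
        seen: List[str] = []
        for hop in self.hops:
            if hop.region not in seen:
                seen.append(hop.region)
        return seen


trail = RequestTrail(request_id="req-123")
trail.record("gateway", "eu-west-1")
trail.record("primary-endpoint", "US")
trail.record("failover-endpoint", "Singapore", reason="regional incident")
trail.record("kv-cache-gpu-pool", "unknown", reason="third-party pool")
print(trail.jurisdictions())
```

The `"unknown"` entry is the point: a trail like this makes the unprovable hop visible instead of letting `region=eu-west-1` stand in for the whole path.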

Trace Sampling for Agents: Which of 10 Million Daily Spans Are Worth Keeping

· 11 min read
Tian Pan
Software Engineer

A web service request produces five spans on a busy day. A modern agent session produces fifty, sometimes a thousand if the planner decides to recurse. The uniform 1% sampler your platform team copy-pasted from the microservices era will drop roughly 99 out of every 100 occurrences of the rare failure you actually care about, because uniform sampling has no opinion about rarity.

The honest version of "we have full observability on our agents" sounds different than the marketing version. It sounds like: we keep the traces that matter, drop the ones that don't, and we know in advance which is which. Every word in that sentence is load-bearing, and the platform teams that ignored sampling design until the bill arrived are now learning the discipline backwards — under cost pressure, after a quarter of incidents that were "in the data" but evicted before anyone looked.
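
As a rough illustration of "keep the traces that matter," here is a sketch of one possible policy: decide after the trace completes, always keep traces that contain an error, and sample the rest at a baseline rate. This is an assumed policy for illustration, not the design the post goes on to describe, and `keep_trace` is a hypothetical helper rather than any tracing library's API.

```python
import random


def keep_trace(has_error: bool, baseline_rate: float, rng: random.Random) -> bool:
    """Decide whether to retain a finished trace.

    Failures are always kept; everything else is sampled uniformly
    at baseline_rate (for example 0.01 for 1%).
    """
    if has_error:
        return True
    return rng.random() < baseline_rate


rng = random.Random(0)  # seeded so the demo is repeatable
# Simulated day: 1 failing trace per 1,000 (an assumed failure rate).
traces = [(i % 1000 == 0) for i in range(100_000)]
kept = [has_error for has_error in traces if keep_trace(has_error, 0.01, rng)]
print("failures seen:", sum(traces), "failures kept:", sum(kept))
```

With this policy every failing trace survives, while a uniform 1% sampler would be expected to keep about one in a hundred of them.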

The Embedding API Hidden Tax: Why Vector Spend Quietly Eclipses Generation

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter had a moment of quiet panic when their finance partner flagged the AI bill. They had assumed, like most teams do, that the expensive line item would be generation — the GPT-class calls behind chat, summarization, and agent reasoning. It wasn't. Their monthly embedding spend had silently crossed generation in January, doubled it by March, and was on track to triple it by mid-year. Nobody had modeled it because per-token pricing on embedding models looks like rounding error: two cents per million tokens for small, thirteen cents for large. At that rate, who budgets for it?

The answer is: anyone whose product survives past prototype and starts indexing things at scale. Semantic search over a growing corpus, duplicate detection, classification, clustering, reindexing when you swap models — every one of these workloads burns embedding tokens by the billion, not by the million. And unlike generation, which is gated by user requests, embedding throughput is only gated by what you decide to index. That decision rarely gets a cost review.

This post is about the specific mechanics of how embedding spend escalates, the architectural levers that bend the curve, and the breakeven math for moving off a hosted API onto something you run yourself.
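
The per-token prices quoted above are easy to turn into a back-of-the-envelope model. In this sketch only the two prices come from the text; the corpus size and the number of reindexes per year are made-up inputs, and `embedding_cost_usd` is a hypothetical helper.

```python
# Prices per million tokens, as quoted above.
PRICE_PER_MILLION_USD = {"small": 0.02, "large": 0.13}


def embedding_cost_usd(tokens: int, tier: str) -> float:
    """Cost of embedding `tokens` tokens at the given price tier."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_USD[tier]


corpus_tokens = 50_000_000_000  # assumed: a 50-billion-token corpus
reindexes_per_year = 4          # assumed: one model swap per quarter

for tier in PRICE_PER_MILLION_USD:
    one_pass = embedding_cost_usd(corpus_tokens, tier)
    print(f"{tier}: ${one_pass:,.0f} per pass, "
          f"${one_pass * reindexes_per_year:,.0f} per year")
```

The per-million price looks like rounding error until the token count is the whole corpus and the multiplier is every reindex.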

Your P99 Is Following a Stranger's Traffic: The Noisy-Neighbor Tax in Hosted LLM Inference

· 10 min read
Tian Pan
Software Engineer

Your dashboards are clean. The deployment from yesterday rolled back cleanly. The model version is pinned. The prompt didn't change. But your TTFT p99 just doubled, your customer success channel is on fire, and the only honest answer you can give is "it's the provider." That answer feels small — like a shrug — and it usually leads to a follow-up question that nobody on your team can answer: prove it.

This is the part of hosted LLM inference that the marketing pages do not discuss. When you call a frontier model API, you are sharing a GPU, a PCIe fabric, a continuous batch, and a KV-cache budget with workloads you cannot see. Your p99 is a function of their bursts. The economics of large-scale inference depend on multiplexing tenants tightly enough that hardware utilization stays north of 60-70%, which means your tail latency is structurally coupled to the largest, jankiest, lumpiest tenant on the same shard. You are not buying capacity; you are buying a slice of a queue that someone else is also standing in.

Multi-Model Reliability Is Not 2x: The Non-Linear Cost of a Second LLM Provider

· 13 min read
Tian Pan
Software Engineer

The naive calculation goes like this. Our primary provider has 99.3% uptime. Add a second provider with similar uptime and independent failures, and simultaneous failure drops to roughly 0.005%. Multiply cost by two, divide risk by roughly 140. Engineering leadership signs off on the 2x budget and the oncall rotation stops paging on provider outages. The spreadsheet says this is the best reliability investment on the roadmap.

Six months later the spreadsheet is wrong. The eval suite takes 3x as long to run, prompt changes need two PRs, the weekly regression report has two columns that disagree with each other, and nobody can remember which provider the staging fallback is currently routing to. The 2x budget is closer to 4–5x once the team tallies the human hours spent keeping both paths calibrated. The second provider is still technically serving traffic, but half the features have been quietly pinned to one side because keeping both in sync stopped being worth it.

This is the multi-model cost trap. The reliability math is correct; the operational math is the part teams get wrong. What follows is the cost decomposition of going multi-provider, the single-provider-with-degraded-mode option most teams should try first, and the narrow set of criteria that actually justify the nonlinear complexity.
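
The reliability half of the spreadsheet is short enough to write down. This sketch assumes the two providers fail independently, which is exactly the assumption the naive calculation makes; `joint_downtime` is a hypothetical helper.

```python
def joint_downtime(uptime_a: float, uptime_b: float) -> float:
    """Probability both providers are down at once, assuming independence."""
    return (1 - uptime_a) * (1 - uptime_b)


single = 1 - 0.993                  # 0.7% downtime for one provider
both = joint_downtime(0.993, 0.993)

print(f"single-provider downtime: {single:.3%}")
print(f"simultaneous downtime:    {both:.4%}")
print(f"risk reduction factor:    {single / both:.0f}x")
```

The factor works out to about 143x for a 2x spend, which is why the proposal looks so good on paper. Nothing in this function knows about eval suites, duplicated prompts, or calibration hours.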

Your RAG Chunker Is a Database Schema Nobody Code-Reviewed

· 11 min read
Tian Pan
Software Engineer

The first time a retrieval quality regression lands in your on-call channel, the debugging path almost always leads somewhere surprising. Not the embedding model. Not the reranker. Not the prompt. The culprit is a one-line change to the chunker — a tokenizer swap, a boundary rule tweak, a stride adjustment — that someone merged into a preprocessing notebook three sprints ago. The fix touched zero lines of production code. It rebuilt the index overnight. And now accuracy is down four points across every tenant.

The chunker is a database schema. Every field you extract, every boundary you draw, every stride you pick defines the shape of the rows that land in your vector index. Change any of them and you have altered the schema of an index that other parts of your system — retrieval logic, reranker features, evaluation harnesses, downstream prompts — depend on as if it were stable. But because the chunker usually lives in a notebook or a small Python module that nobody labels as "infrastructure," these changes ship with the rigor of a config tweak and the blast radius of an ALTER TABLE.
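
If the chunker is a schema, one lightweight way to treat it like one is to fingerprint its settings and store that fingerprint next to the index. The sketch below is an assumption about how that could look; `ChunkerConfig`, its fields, and `schema_fingerprint` are hypothetical names, not part of any particular RAG framework.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkerConfig:
    tokenizer: str      # hypothetical field names for illustration
    max_tokens: int
    stride: int
    boundary_rule: str


def schema_fingerprint(config: ChunkerConfig) -> str:
    """Stable short hash of every setting that shapes the index rows."""
    payload = json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


indexed_with = ChunkerConfig("tokenizer-a", 512, 64, "sentence")
current = ChunkerConfig("tokenizer-a", 512, 128, "sentence")  # stride tweak

if schema_fingerprint(current) != schema_fingerprint(indexed_with):
    print("chunker changed: treat this as a schema migration, not a config tweak")
```

A check like this could run in CI or at index-build time, so a stride adjustment in a notebook at least has to announce itself before it rebuilds the index.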

Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache

· 11 min read
Tian Pan
Software Engineer

Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.

Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute didn't help — the bottleneck isn't where you assumed it was.

AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.
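
A rough queueing sketch shows why the warm-up time matters so much. The 60-second and four-minute figures come from the scenario above; the arrival and capacity rates are invented for illustration, and the model deliberately ignores KV cache pressure and half-completed generations, which make the real picture worse.

```python
def backlog_after_warmup(arrival_rps: float, capacity_rps: float,
                         warmup_seconds: float) -> float:
    """Requests queued while new capacity is still warming up.

    Assumes a constant arrival rate and that no extra capacity
    exists until warm-up finishes.
    """
    return max(0.0, (arrival_rps - capacity_rps) * warmup_seconds)


arrival, capacity = 300, 100  # assumed: a 3x spike over steady-state capacity

print("containers (60 s):   ", backlog_after_warmup(arrival, capacity, 60))
print("GPU instances (240 s):", backlog_after_warmup(arrival, capacity, 240))
```

Under these assumed rates the same spike leaves four times the backlog behind the GPU autoscaler, before counting any of the memory effects on the instances that were already serving.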

Capacity Planning for AI Workloads: Why the Math Breaks When Tokens Are Your Resource

· 11 min read
Tian Pan
Software Engineer

Your GPU dashboard is lying to you. At 60% utilization, your inference cluster looks healthy. Users are experiencing 8-second time-to-first-token. The on-call engineer checks memory — also fine. Compute — fine. And yet the queue is growing and latency is spiking. This is what happens when you apply traditional capacity planning to LLM workloads: the metrics you trust point to the wrong places, and the actual bottleneck stays invisible until users start complaining.

The root problem is that LLMs consume a fundamentally different kind of resource. CPU services trade compute and memory. LLM services trade tokens — and tokens don't behave like requests.

Dev/Prod Parity for AI Apps: The Seven Ways Your Staging Environment Is Lying to You

· 11 min read
Tian Pan
Software Engineer

The 12-Factor App doctrine made dev/prod parity famous: keep development, staging, and production as similar as possible. For traditional web services, this is mostly achievable. For LLM applications, it is structurally impossible — and the gap is far larger than most teams realize.

The problem is not that developers are careless. It is that LLM applications depend on a class of infrastructure (cached computation, living model weights, evolving vector indexes, and stochastic generation) where staging and production differ not merely in degree but in kind. A staging environment that looks correct will lie to you in at least seven specific ways.

Evaluating AI Service Vendors Beyond Your LLM Provider

· 10 min read
Tian Pan
Software Engineer

Most engineering teams spend weeks evaluating LLM providers—benchmarking latency, testing accuracy, negotiating pricing. Then they pick an observability tool, a guardrail vendor, and an embedding provider in an afternoon, on the basis of a well-designed landing page and a favorable blog post. The asymmetry is backwards. Your LLM provider is probably a well-capitalized company with stable APIs. The niche vendors surrounding it often are not.

The AI service ecosystem has exploded into dozens of categories: guardrail vendors, embedding providers, observability and tracing tools, fine-tuning platforms, evaluation frameworks. Each category has ten startups competing for the same enterprise budgets. Some will be acquired. More will shut down. A few will pivot and deprecate your critical workflow with a 90-day notice email. Building on this ecosystem without rigorous evaluation is a form of technical debt that doesn't show up in your backlog until it's already a production incident.

Multi-Tenant AI Systems: Isolation, Customization, and Cost Attribution at Scale

· 10 min read
Tian Pan
Software Engineer

Most teams building SaaS products on top of LLMs discover the multi-tenancy problem the hard way: they ship fast using a single shared prompt config, then watch in horror as one customer's system prompt leaks into another's response, one enterprise client burns through everyone's rate limit, or the monthly AI bill arrives with no way to determine which customer caused 40% of the spend. The failure mode isn't theoretical—a 2025 paper at NDSS demonstrated that prefix caching in vLLM, SGLang, LightLLM, and DeepSpeed could be exploited to reconstruct another tenant's prompt with 99% accuracy using nothing more than timing signals and crafted requests.

Building multi-tenant AI infrastructure is not the same as multi-tenanting a traditional database. The shared components—inference servers, KV caches, embedding pipelines, retrieval indexes—each present distinct isolation challenges. This post covers the four problems you actually have to solve: isolation, customization, cost attribution, and per-tenant quality tracking.
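
Of the four problems, cost attribution is the easiest to sketch: if every inference call is logged with a tenant id and a cost, the "who caused 40% of the spend" question becomes a group-by. The record shape (`tenant_id`, `cost_usd`) and the tenant names below are assumptions for illustration, and `spend_share_by_tenant` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def spend_share_by_tenant(usage_records: Iterable[Mapping]) -> Dict[str, float]:
    """Fraction of total spend attributable to each tenant."""
    totals: Dict[str, float] = defaultdict(float)
    for record in usage_records:
        totals[record["tenant_id"]] += record["cost_usd"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {tenant: cost / grand_total for tenant, cost in totals.items()}


records = [
    {"tenant_id": "acme", "cost_usd": 4.0},
    {"tenant_id": "globex", "cost_usd": 2.0},
    {"tenant_id": "globex", "cost_usd": 4.0},
]
for tenant, share in spend_share_by_tenant(records).items():
    print(f"{tenant}: {share:.0%}")
```

The hard part is not the aggregation but making sure the tenant id is attached at the point where cost is incurred, including shared work such as embedding pipelines and cache warm-ups.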