59 posts tagged with "infrastructure"

Capacity Planning for AI Workloads: Why the Math Breaks When Tokens Are Your Resource

· 11 min read
Tian Pan
Software Engineer

Your GPU dashboard is lying to you. At 60% utilization, your inference cluster looks healthy. Users are experiencing 8-second time-to-first-token. The on-call engineer checks memory — also fine. Compute — fine. And yet the queue is growing and latency is spiking. This is what happens when you apply traditional capacity planning to LLM workloads: the metrics you trust point to the wrong places, and the actual bottleneck stays invisible until users start complaining.

The root problem is that LLMs consume a fundamentally different kind of resource. CPU services trade compute and memory. LLM services trade tokens — and tokens don't behave like requests.
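The shift from requests to tokens can be made concrete with a back-of-the-envelope capacity model. The throughput number below is a placeholder, not a benchmark; measure your own cluster's aggregate decode rate before trusting any figure like it.

```python
# Illustrative numbers only — substitute your cluster's measured decode throughput.
DECODE_TOKENS_PER_SEC_PER_GPU = 1_500   # assumed aggregate decode throughput
GPUS = 8

def max_concurrent_streams(target_tpot_ms: float) -> int:
    """How many streams the cluster can serve while keeping time-per-output-token
    under the target. Capacity is a function of token throughput, not request count:
    one request generating 4,000 tokens costs as much as forty generating 100."""
    required_tokens_per_sec = 1000 / target_tpot_ms          # per stream
    cluster_budget = DECODE_TOKENS_PER_SEC_PER_GPU * GPUS
    return int(cluster_budget // required_tokens_per_sec)

# A 50 ms/token SLO demands 20 tok/s per stream:
print(max_concurrent_streams(target_tpot_ms=50))  # → 600
```

Note that GPU utilization appears nowhere in this calculation, which is exactly why a 60%-utilized dashboard can coexist with an 8-second time-to-first-token.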

Dev/Prod Parity for AI Apps: The Seven Ways Your Staging Environment Is Lying to You

· 11 min read
Tian Pan
Software Engineer

The 12-Factor App doctrine made dev/prod parity famous: keep development, staging, and production as similar as possible. For traditional web services, this is mostly achievable. For LLM applications, it is structurally impossible — and the gap is far larger than most teams realize.

The problem is not that developers are careless. It is that LLM applications depend on a class of infrastructure (cached computation, living model weights, evolving vector indexes, and stochastic generation) where the differences between staging and production are not merely inconvenient but categorically different in kind. A staging environment that looks correct will lie to you in at least seven specific ways.

Evaluating AI Service Vendors Beyond Your LLM Provider

· 10 min read
Tian Pan
Software Engineer

Most engineering teams spend weeks evaluating LLM providers—benchmarking latency, testing accuracy, negotiating pricing. Then they pick an observability tool, a guardrail vendor, and an embedding provider in an afternoon, on the basis of a well-designed landing page and a favorable blog post. The asymmetry is backwards. Your LLM provider is probably a well-capitalized company with stable APIs. The niche vendors surrounding it often are not.

The AI service ecosystem has exploded into dozens of categories: guardrail vendors, embedding providers, observability and tracing tools, fine-tuning platforms, evaluation frameworks. Each category has ten startups competing for the same enterprise budgets. Some will be acquired. More will shut down. A few will pivot and deprecate your critical workflow with a 90-day notice email. Building on this ecosystem without rigorous evaluation is a form of technical debt that doesn't show up in your backlog until it's already a production incident.

Multi-Tenant AI Systems: Isolation, Customization, and Cost Attribution at Scale

· 10 min read
Tian Pan
Software Engineer

Most teams building SaaS products on top of LLMs discover the multi-tenancy problem the hard way: they ship fast using a single shared prompt config, then watch in horror as one customer's system prompt leaks into another's response, one enterprise client burns through everyone's rate limit, or the monthly AI bill arrives with no way to determine which customer caused 40% of the spend. The failure mode isn't theoretical—a 2025 paper at NDSS demonstrated that prefix caching in vLLM, SGLang, LightLLM, and DeepSpeed could be exploited to reconstruct another tenant's prompt with 99% accuracy using nothing more than timing signals and crafted requests.

Building multi-tenant AI infrastructure is not the same as multi-tenanting a traditional database. The shared components—inference servers, KV caches, embedding pipelines, retrieval indexes—each present distinct isolation challenges. This post covers the four problems you actually have to solve: isolation, customization, cost attribution, and per-tenant quality tracking.
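Of the four, cost attribution is the one teams can start on immediately: tag every request with a tenant ID and roll token usage up into spend. A minimal sketch, with made-up per-token prices (real pricing varies by model and provider):

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative prices per 1K tokens — substitute your provider's actual rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

@dataclass
class Usage:
    tenant_id: str
    input_tokens: int
    output_tokens: int

def attribute_costs(usages: list[Usage]) -> dict[str, float]:
    """Roll raw token usage up into per-tenant spend so the monthly AI bill
    can be traced back to individual customers instead of arriving as one
    undifferentiated number."""
    spend: dict[str, float] = defaultdict(float)
    for u in usages:
        spend[u.tenant_id] += (u.input_tokens / 1000) * PRICE_PER_1K["input"]
        spend[u.tenant_id] += (u.output_tokens / 1000) * PRICE_PER_1K["output"]
    return dict(spend)

print(attribute_costs([
    Usage("acme", input_tokens=120_000, output_tokens=40_000),
    Usage("globex", input_tokens=10_000, output_tokens=2_000),
]))
```

The hard part is not the arithmetic but the plumbing: the tenant ID has to survive every hop from the user-facing request down to the inference call, or the attribution is fiction.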

On-Device LLM Inference in Production: When Edge Models Are Right and What They Actually Cost

· 10 min read
Tian Pan
Software Engineer

Most teams decide to use on-device LLM inference the same way they decide to rewrite their database: impulsively, in response to a problem that a cheaper solution could have solved. The pitch is always compelling—no network round-trips, full privacy, zero inference costs—and the initial prototype validates it. Then six months post-ship, the model silently starts returning worse outputs, a new OS update breaks quantization compatibility, and your users on budget Android phones are running a version you can't push an update to.

This guide is about making that decision with eyes open. On-device inference is genuinely the right call in specific situations, but the cost structure is different from what teams expect, and the production failure modes are almost entirely unlike cloud LLM deployment.

Sandboxing Agents That Can Write Code: Least Privilege Is Not Optional

· 12 min read
Tian Pan
Software Engineer

Most teams ship their first code-executing agent with exactly one security control: API key scoping. They give the agent a GitHub token with repo:read and a shell with access to a working directory, and they call it "sandboxed." This is wrong in ways that become obvious only after an incident.

The threat model for an agent that can write and execute code is categorically different from the threat model for a web server or a CLI tool. The attack surface isn't the protocol boundary anymore — it's everything the agent reads. That includes git commits, documentation pages, API responses, database records, and any file it opens. Any of those inputs can contain a prompt injection that turns your research agent into a data exfiltration pipeline.
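A minimal starting point for least privilege is running agent-generated code in a child process with a stripped environment, resource limits, and a hard timeout. This POSIX-only sketch is a floor, not a sandbox — real deployments should layer containers or microVMs, a read-only filesystem, and network egress blocking on top:

```python
import resource
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: int = 5) -> str:
    """Execute agent-generated Python in a child process with no inherited
    environment (so API keys and tokens can't leak), CPU and memory caps,
    and a wall-clock timeout. POSIX-only due to preexec_fn/resource."""
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))  # 1 GiB

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name

    proc = subprocess.run(
        [sys.executable, "-I", path],   # -I: isolated mode, ignores env and site dirs
        env={},                         # empty environment: nothing to exfiltrate
        capture_output=True, text=True,
        timeout=timeout_s, preexec_fn=apply_limits,
    )
    return proc.stdout

print(run_untrusted("print(2 + 2)"))
```

Even this much already defeats the most common incident shape: injected code that reads `os.environ` looking for credentials finds an empty dictionary.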

SSE vs WebSockets vs gRPC Streaming for LLM Apps: The Protocol Decision That Bites You Later

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM features pick a streaming protocol the same way they pick a font: quickly, without much thought, and they live with the consequences for years. The first time the choice bites you is usually in production — a Cloudflare 524 timeout that corrupts your SSE stream, a WebSocket server that leaks memory under sustained load, or a gRPC-Web integration that worked fine in unit tests and silently fails when a client needs to send messages upstream. The protocol shapes your failure modes. Picking based on benchmark throughput is the wrong frame.

Every major LLM provider — OpenAI, Anthropic, Cohere, Hugging Face — streams tokens over Server-Sent Events. That fact is a strong prior, but not because SSE is fast. It's because SSE is stateless, trivially compatible with HTTP infrastructure, and its failure modes are predictable. The question is whether your application has requirements that force you off that path.
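Part of SSE's predictability is that the wire format fits in a dozen lines: `data:` lines accumulate until a blank line terminates the event. A minimal parser sketch — the `[DONE]` sentinel is patterned on common LLM APIs, not part of the SSE spec itself:

```python
import json

def parse_sse(lines):
    """Minimal SSE event parser: buffer `data:` lines, emit a parsed event
    at each blank-line boundary. Ignores `event:`/`id:` fields for brevity."""
    buf = []
    for line in lines:
        if line.startswith("data:"):
            buf.append(line[len("data:"):].strip())
        elif line == "" and buf:
            payload = "\n".join(buf)
            buf = []
            if payload == "[DONE]":    # provider-specific end-of-stream sentinel
                return
            yield json.loads(payload)

# Simulated token stream, one event per blank-line-terminated block:
stream = [
    'data: {"token": "Hel"}', "",
    'data: {"token": "lo"}', "",
    "data: [DONE]", "",
]
print("".join(event["token"] for event in parse_sse(stream)))  # → Hello
```

Because each event is a self-delimiting text frame over plain HTTP, every proxy, load balancer, and CDN in your stack already knows how to carry it — which is precisely the property WebSockets and gRPC streams have to fight for.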

AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet

· 9 min read
Tian Pan
Software Engineer

Every engineering team building on LLMs right now is making infrastructure decisions with a hidden cost they're not measuring. You track tokens. You track latency. You track API spend. But almost nobody tracks the carbon output of the inference workload they're running — and that gap is closing fast, from both the regulatory side and the market side.

AI systems now account for 2.5–3.7% of global greenhouse gas emissions, officially surpassing aviation's 2% contribution, and growing at 15% annually. US data centers running AI-specific servers consumed 53–76 TWh in 2024 alone — enough to power 7.2 million homes for a year. The scale is not hypothetical anymore, and the expectation that engineering teams will have visibility into their contribution is becoming a real organizational pressure.
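Measurement can start crudely: convert token counts into an energy estimate, then into grams of CO2e. Every constant below is an assumption for illustration — published per-token energy figures vary by orders of magnitude with model size, batching, and hardware, so treat this as a template to fill with your own measurements:

```python
# All constants are illustrative assumptions, not measurements.
JOULES_PER_OUTPUT_TOKEN = 2.0     # assumed inference energy per generated token
PUE = 1.2                         # data-center power usage effectiveness overhead
GRID_G_CO2_PER_KWH = 400.0        # assumed grid carbon intensity

def estimate_emissions_g(output_tokens: int) -> float:
    """Tokens → joules → kWh → grams CO2e, the same roll-up you already do
    for dollar cost, applied to carbon."""
    kwh = output_tokens * JOULES_PER_OUTPUT_TOKEN * PUE / 3_600_000
    return kwh * GRID_G_CO2_PER_KWH

# 10M generated tokens/day under these assumptions:
print(round(estimate_emissions_g(10_000_000), 1))  # → 2666.7 (grams CO2e)
```

The estimate will be wrong in absolute terms, but it is directionally useful immediately: it makes carbon a per-feature, per-tenant metric rather than a number that only exists in the provider's annual sustainability report.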

Choosing a Vector Database for Production: What Benchmarks Won't Tell You

· 10 min read
Tian Pan
Software Engineer

When engineers evaluate vector databases, they typically pull up public ANN benchmark results and pick whichever engine tops the recall-at-10 chart. Three months later, they're filing migration tickets. The benchmarks measured query throughput on a static, perfectly indexed dataset with a single client. Production looks nothing like that.

This guide covers the five dimensions that predict whether a vector database holds up under real workloads — and a decision framework for matching those dimensions to your stack.

Event-Driven Agent Scheduling: Why Cron + REST Calls Fail for Recurring AI Workloads

· 11 min read
Tian Pan
Software Engineer

The most common way teams schedule recurring AI agent jobs is also the most dangerous: a cron entry that fires a REST call every N minutes, which kicks off an LLM workflow, which either finishes or silently doesn't. This pattern feels fine in staging. In production, it creates a class of failures that are uniquely hard to detect, recover from, and reason about.

Cron was designed in 1975 for sysadmin scripts. The assumptions it encodes—short runtime, stateless execution, fire-and-forget outcomes—are wrong for LLM workloads in every dimension. Recurring AI agent jobs are long-running, stateful, expensive, and fail in ways that compound across retries. Using cron to schedule them is not just a reliability risk. It's a visibility risk. When things go wrong, you often won't know.
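The cheapest of cron's missing guarantees to retrofit is overlap protection: refusing to start a run while the previous one is still going. A POSIX-only sketch using an advisory file lock (the lock path is hypothetical) — a real scheduler would also persist run state and alert on missed or failed runs, which this deliberately omits:

```python
import fcntl
import sys

LOCK_PATH = "/tmp/agent-job.lock"   # hypothetical lock-file location

def run_exclusive(job):
    """Skip this invocation if the previous run still holds the lock — the
    overlap guard cron never gives you. Without it, a slow LLM workflow
    plus a short cron interval means concurrent runs doubling your spend."""
    with open(LOCK_PATH, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("previous run still in progress; skipping", file=sys.stderr)
            return None
        return job()   # lock releases when the file handle closes

result = run_exclusive(lambda: "report generated")
print(result)
```

This fixes exactly one of the failure modes above. Retries with cost ceilings, durable run history, and missed-run alerting are what push teams toward an actual workflow engine.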

LLM Rate Limits Are a Distributed Systems Problem

· 11 min read
Tian Pan
Software Engineer

Your AI product has two surfaces: a user-facing chat feature and a background report generation job. Both call the same LLM API under the same key. One afternoon, a support ticket arrives: "Chat responses are getting cut off halfway." No alerts fired. No 429s in the logs. The API was returning HTTP 200 the entire time.

What happened: the report generation job gradually consumed most of your shared token quota. Chat requests started completing, but only up to your max_tokens limit — semantically truncated, syntactically valid, silently wrong. Your standard monitoring never noticed because there was nothing to notice at the HTTP layer.

This is not an edge case. It is what happens when engineers treat LLM rate limits as a simple throttle problem instead of recognizing the class of distributed systems failure they actually are.
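The monitoring fix is to look one layer below HTTP status codes, at the stop reason the API reports with every completion. A sketch using the common chat-completions response shape — field names follow that convention and should be adjusted to your provider's schema:

```python
def is_truncated(response: dict) -> bool:
    """Treat a 'length' stop reason as a failure signal even though the
    HTTP status was 200: the model ran out of max_tokens budget, so the
    output is syntactically valid but semantically cut off."""
    reason = response["choices"][0].get("finish_reason")
    if reason == "length":
        # Emit a metric here (e.g. a truncated-responses counter tagged by
        # surface) so quota starvation fires an alert, not a support ticket.
        return True
    return False

print(is_truncated({"choices": [{"finish_reason": "length"}]}))  # → True
print(is_truncated({"choices": [{"finish_reason": "stop"}]}))    # → False
```

A truncation-rate alert per surface (chat vs. background jobs) would have caught the scenario above hours before the support ticket did.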

The Hidden Switching Costs of LLM Vendor Lock-In

· 11 min read
Tian Pan
Software Engineer

Most engineering teams believe they've insulated themselves from LLM vendor lock-in. They use LiteLLM to unify API calls. They avoid fine-tuning on hosted platforms. They keep raw data in their own storage. They feel safe. Then a provider announces a deprecation — or a competitor's pricing drops 40% — and the team discovers that the abstraction layer they built handles roughly 20% of the actual switching cost.

The other 80% is buried in places no one looked: system prompts written around a model's formatting quirks, eval suites calibrated to one model's refusal thresholds, embedding indexes that become incompatible the moment you change models, and user expectations shaped by behavioral patterns that simply don't transfer.