Skip to main content

76 posts tagged with "infrastructure"

View all tags

AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet

· 9 min read
Tian Pan
Software Engineer

Every engineering team building on LLMs right now is making infrastructure decisions with a hidden cost they're not measuring. You track tokens. You track latency. You track API spend. But almost nobody tracks the carbon output of the inference workload they're running — and that gap is closing fast, from both the regulatory side and the market side.

AI systems now account for 2.5–3.7% of global greenhouse gas emissions, officially surpassing aviation's 2% contribution, and growing at 15% annually. US data centers running AI-specific servers consumed 53–76 TWh in 2024 alone — enough to power 7.2 million homes for a year. The scale is not hypothetical anymore, and the expectation that engineering teams will have visibility into their contribution is becoming a real organizational pressure.

Choosing a Vector Database for Production: What Benchmarks Won't Tell You

· 10 min read
Tian Pan
Software Engineer

When engineers evaluate vector databases, they typically load ANN benchmarks and pick whoever tops the recall-at-10 chart. Three months later, they're filing migration tickets. The benchmarks measured query throughput on a static, perfectly-indexed dataset with a single client. Production looks nothing like that.

This guide covers the five dimensions that predict whether a vector database holds up under real workloads — and a decision framework for matching those dimensions to your stack.

Event-Driven Agent Scheduling: Why Cron + REST Calls Fail for Recurring AI Workloads

· 11 min read
Tian Pan
Software Engineer

The most common way teams schedule recurring AI agent jobs is also the most dangerous: a cron entry that fires a REST call every N minutes, which kicks off an LLM workflow, which either finishes or silently doesn't. This pattern feels fine in staging. In production, it creates a class of failures that are uniquely hard to detect, recover from, and reason about.

Cron was designed in 1975 for sysadmin scripts. The assumptions it encodes—short runtime, stateless execution, fire-and-forget outcomes—are wrong for LLM workloads in every dimension. Recurring AI agent jobs are long-running, stateful, expensive, and fail in ways that compound across retries. Using cron to schedule them is not just a reliability risk. It's a visibility risk. When things go wrong, you often won't know.

LLM Rate Limits Are a Distributed Systems Problem

· 11 min read
Tian Pan
Software Engineer

Your AI product has two surfaces: a user-facing chat feature and a background report generation job. Both call the same LLM API under the same key. One afternoon, a support ticket arrives: "Chat responses are getting cut off halfway." No alerts fired. No 429s in the logs. The API was returning HTTP 200 the entire time.

What happened: the report generation job gradually consumed most of your shared token quota. Chat requests started completing, but only up to your max_tokens limit — semantically truncated, syntactically valid, silently wrong. Your standard monitoring never noticed because there was nothing to notice at the HTTP layer.

This is not an edge case. It is what happens when engineers treat LLM rate limits as a simple throttle problem instead of recognizing the class of distributed systems failure they actually are.

The Hidden Switching Costs of LLM Vendor Lock-In

· 11 min read
Tian Pan
Software Engineer

Most engineering teams believe they've insulated themselves from LLM vendor lock-in. They use LiteLLM to unify API calls. They avoid fine-tuning on hosted platforms. They keep raw data in their own storage. They feel safe. Then a provider announces a deprecation — or a competitor's pricing drops 40% — and the team discovers that the abstraction layer they built handles roughly 20% of the actual switching cost.

The other 80% is buried in places no one looked: system prompts written around a model's formatting quirks, eval suites calibrated to one model's refusal thresholds, embedding indexes that become incompatible the moment you change models, and user expectations shaped by behavioral patterns that simply don't transfer.

Multi-Region LLM Serving: The Cache Locality Problem Nobody Warns You About

· 10 min read
Tian Pan
Software Engineer

When you run a stateless HTTP API across multiple regions, the routing problem is essentially solved. Put a global load balancer in front, distribute requests by geography, and the worst thing that happens is a slightly stale cache entry. Any replica can serve any request with identical results.

LLM inference breaks every one of these assumptions. The moment you add prompt caching — which you will, because the cost difference between a cache hit and a cache miss is roughly 10x — your service becomes stateful in ways that most infrastructure teams don't anticipate until they're staring at degraded latency numbers in their second region.

The Multi-Tenant LLM Problem: Noisy Neighbors, Isolation, and Fairness at Scale

· 12 min read
Tian Pan
Software Engineer

Your SaaS product launches with ten design customers. Everything works beautifully. Then you onboard a hundred tenants, and one of them — a power user running 200K-token context windows on a complex research workflow — causes every other customer's latency to spike. Support tickets start arriving. You look at your dashboards and see nothing obviously wrong: your model is healthy, your API returns 200s, and your p50 latency looks fine. Your p95 has silently tripled.

This is the noisy neighbor problem, and it hits LLM infrastructure harder than almost any other shared system. Here's why it's harder to solve than it is in databases — and the patterns that actually work.

Your Team's Benchmarks Are Lying to Each Other: Shared Eval Infrastructure Contamination

· 10 min read
Tian Pan
Software Engineer

Your red team just finished a jailbreak sweep. They found three novel attack vectors, wrote them up, and dropped the prompts into your shared prompt library for others to learn from. The next week, the safety team runs their baseline evaluation and reports a 12% improvement in robustness. Everyone celebrates. Nobody asks why.

What actually happened: the safety team's baseline eval silently incorporated the red team's attack prompts. The model didn't get more robust — the eval got contaminated. Your benchmarks are now measuring inoculation against known attacks, not generalization to new ones.

This is shared eval infrastructure contamination, and it is far more common than most teams realize. The symptom is artificially inflating metrics. The cause is treating evaluation infrastructure like production infrastructure — optimized for sharing and efficiency, instead of isolation and fidelity.

The AI Dependency Footprint: When Every Feature Adds a New Infrastructure Owner

· 9 min read
Tian Pan
Software Engineer

Your team shipped a RAG-powered search feature last quarter. It required a vector database, an embedding model, an annotation pipeline, a chunking service, and an evaluation harness. Each component made sense individually. But six months later, you discover that three of those five components have no clear owner, two are running on engineers' personal cloud accounts, and one was quietly deprecated by its vendor without anyone noticing. The 3am page comes from a component nobody even remembers adding.

This is the AI dependency footprint problem: the compounding accumulation of infrastructure that each AI feature requires, combined with the organizational reality that teams rarely plan ownership for any of it before shipping.

The Embedding Drift Problem: How Your Semantic Search Silently Degrades

· 9 min read
Tian Pan
Software Engineer

Your semantic search is probably getting worse right now, and your dashboards are not telling you.

There is no error log. No p99 spike. No failed health check. Queries still return results with high cosine similarity scores. But the relevance is quietly deteriorating, one missed term at a time, as the language your users type diverges from the language your embedding model was trained on.

This is the embedding drift problem. It is insidious precisely because it produces no visible failure signal — only a slow erosion of retrieval quality that users attribute to the product being "not that useful anymore" before they stop using it entirely.

The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing — the code is fine. The actual culprit is a batch summarization job a different team launched that evening, sharing the same provider API key, which has eaten the account's per-minute token budget for the next four hours. Nobody owns the shared key. Nobody owns the limit.

This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.

Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.

TTFT Is the Only Latency Metric Your Users Actually Feel

· 9 min read
Tian Pan
Software Engineer

Your model generates a 500-word response in 8 seconds. A competing model generates the same response in 12 seconds. Intuitively, yours should feel faster. But if your first token arrives at 2.5 seconds and theirs arrives at 400 milliseconds, your users will describe your product as slow — regardless of total generation time. This is the central paradox of LLM latency: the metric your infrastructure team optimizes for (end-to-end generation time, tokens per second) is not the metric your users experience. Time-to-first-token is.

TTFT is not a detail. It is the primary signal users use to judge whether your AI feature is responsive. Getting it wrong means building fast systems that feel slow.