67 posts tagged with "infrastructure"

On-Device LLM Inference in Production: When Edge Models Are Right and What They Actually Cost

· 10 min read
Tian Pan
Software Engineer

Most teams decide to use on-device LLM inference the same way they decide to rewrite their database: impulsively, in response to a problem that a cheaper solution could have solved. The pitch is always compelling—no network round-trips, full privacy, zero inference costs—and the initial prototype validates it. Then six months post-ship, the model silently starts returning worse outputs, a new OS update breaks quantization compatibility, and your users on budget Android phones are running a version you can't push an update to.

This guide is about making that decision with eyes open. On-device inference is genuinely the right call in specific situations, but the cost structure is different from what teams expect, and the production failure modes are almost entirely unlike cloud LLM deployment.

Sandboxing Agents That Can Write Code: Least Privilege Is Not Optional

· 12 min read
Tian Pan
Software Engineer

Most teams ship their first code-executing agent with exactly one security control: API key scoping. They give the agent a GitHub token with repo:read and a shell with access to a working directory, and they call it "sandboxed." This is wrong in ways that become obvious only after an incident.

The threat model for an agent that can write and execute code is categorically different from the threat model for a web server or a CLI tool. The attack surface isn't the protocol boundary anymore — it's everything the agent reads. That includes git commits, documentation pages, API responses, database records, and any file it opens. Any of those inputs can contain a prompt injection that turns your research agent into a data exfiltration pipeline.
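A minimal sketch of what least privilege can look like in practice, assuming agent-generated code runs inside a throwaway container rather than a host shell. The image name, mounts, and limits below are illustrative, not a prescription:

```python
import subprocess

def run_untrusted(code_file: str, workdir: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Execute agent-generated code in a locked-down container: no network,
    read-only filesystem, dropped capabilities, hard resource caps."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",          # no egress: an injected prompt can't exfiltrate
        "--read-only",                # immutable root filesystem
        "--tmpfs", "/tmp:size=64m",   # scratch space only
        "--cap-drop", "ALL",          # drop every Linux capability
        "--pids-limit", "128",        # no fork bombs
        "--memory", "512m", "--cpus", "1",
        "-v", f"{workdir}:/work:ro",  # the working directory is mounted read-only too
        "python:3.12-slim",           # illustrative image; pin and scan your own
        "python", f"/work/{code_file}",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
```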

SSE vs WebSockets vs gRPC Streaming for LLM Apps: The Protocol Decision That Bites You Later

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM features pick a streaming protocol the same way they pick a font: quickly, without much thought, and they live with the consequences for years. The first time the choice bites you is usually in production — a Cloudflare 524 timeout that corrupts your SSE stream, a WebSocket server that leaks memory under sustained load, or a gRPC-Web integration that works fine in unit tests and silently fails when a client needs to send messages upstream. The protocol shapes your failure modes. Picking based on benchmark throughput is the wrong frame.

Every major LLM provider — OpenAI, Anthropic, Cohere, Hugging Face — streams tokens over Server-Sent Events. That fact is a strong prior, but not because SSE is fast. It's because SSE is stateless, trivially compatible with HTTP infrastructure, and its failure modes are predictable. The question is whether your application has requirements that force you off that path.
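Part of why SSE stays simple: a client is just reading `data:` lines over a plain HTTP response. The sketch below follows OpenAI's documented convention (a JSON chunk per event, terminated by `data: [DONE]`); treat the payload details as an assumption for whichever provider you use.

```python
import json
import requests

def stream_completion(url: str, api_key: str, payload: dict):
    """Yield text deltas from an OpenAI-style SSE token stream."""
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {api_key}", "Accept": "text/event-stream"},
        json={**payload, "stream": True},
        stream=True,        # keep the connection open and read incrementally
        timeout=(5, 300),   # connect timeout, then read timeout between chunks
    )
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue        # skip keep-alive blank lines and comments
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if not chunk.get("choices"):
            continue
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta
```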

AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet

· 9 min read
Tian Pan
Software Engineer

Every engineering team building on LLMs right now is making infrastructure decisions with a hidden cost they're not measuring. You track tokens. You track latency. You track API spend. But almost nobody tracks the carbon output of the inference workload they're running — and that gap is closing fast, from both the regulatory side and the market side.

AI systems now account for an estimated 2.5–3.7% of global greenhouse gas emissions, surpassing aviation's roughly 2% share, and that footprint is growing around 15% annually. US data centers running AI-specific servers consumed 53–76 TWh in 2024 alone — enough to power 7.2 million homes for a year. The scale is not hypothetical anymore, and the expectation that engineering teams will have visibility into their contribution is becoming a real organizational pressure.
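A first-order number is within reach today with a back-of-the-envelope calculation per request. Every constant in the sketch below (energy per token, PUE, grid intensity) is an assumption; substitute measured values for your hardware, datacenter, and grid region.

```python
# Rough per-request carbon estimate for LLM inference. All constants are
# illustrative assumptions to be replaced with figures for your deployment.

JOULES_PER_OUTPUT_TOKEN = 3.0   # assumed inference energy per generated token
PUE = 1.2                       # datacenter power usage effectiveness (assumed)
GRID_G_CO2E_PER_KWH = 400.0     # grid carbon intensity in gCO2e/kWh (assumed)

def grams_co2e(output_tokens: int) -> float:
    joules = output_tokens * JOULES_PER_OUTPUT_TOKEN * PUE
    kwh = joules / 3.6e6                      # 1 kWh = 3.6 MJ
    return kwh * GRID_G_CO2E_PER_KWH

# Example: a 1,000-token completion is ~0.4 gCO2e under these assumptions.
```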

Choosing a Vector Database for Production: What Benchmarks Won't Tell You

· 10 min read
Tian Pan
Software Engineer

When engineers evaluate vector databases, they typically load ANN benchmarks and pick whoever tops the recall-at-10 chart. Three months later, they're filing migration tickets. The benchmarks measured query throughput on a static, perfectly indexed dataset with a single client. Production looks nothing like that.

This guide covers the five dimensions that predict whether a vector database holds up under real workloads — and a decision framework for matching those dimensions to your stack.

Event-Driven Agent Scheduling: Why Cron + REST Calls Fail for Recurring AI Workloads

· 11 min read
Tian Pan
Software Engineer

The most common way teams schedule recurring AI agent jobs is also the most dangerous: a cron entry that fires a REST call every N minutes, which kicks off an LLM workflow, which either finishes or silently doesn't. This pattern feels fine in staging. In production, it creates a class of failures that are uniquely hard to detect, recover from, and reason about.

Cron was designed in 1975 for sysadmin scripts. The assumptions it encodes—short runtime, stateless execution, fire-and-forget outcomes—are wrong for LLM workloads in every dimension. Recurring AI agent jobs are long-running, stateful, expensive, and fail in ways that compound across retries. Using cron to schedule them is not just a reliability risk. It's a visibility risk. When things go wrong, you often won't know.
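The most common symptom is overlapping runs: the next cron tick fires while the previous agent run is still in flight. A minimal mitigation, sketched here with a Redis lock (key names and TTLs are illustrative), at least makes that failure visible instead of silently doubling your spend:

```python
import redis

r = redis.Redis()

def try_start_run(job_id: str, max_runtime_s: int = 3600) -> bool:
    """Acquire a per-job lock before kicking off the workflow. If the previous
    run still holds it, skip this tick and record the overlap instead of
    silently launching a second (expensive) run."""
    acquired = r.set(f"agent-job:{job_id}:lock", "running", nx=True, ex=max_runtime_s)
    if not acquired:
        r.incr(f"agent-job:{job_id}:skipped")   # make the overlap observable
        return False
    return True

def finish_run(job_id: str) -> None:
    r.delete(f"agent-job:{job_id}:lock")
```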

LLM Rate Limits Are a Distributed Systems Problem

· 11 min read
Tian Pan
Software Engineer

Your AI product has two surfaces: a user-facing chat feature and a background report generation job. Both call the same LLM API under the same key. One afternoon, a support ticket arrives: "Chat responses are getting cut off halfway." No alerts fired. No 429s in the logs. The API was returning HTTP 200 the entire time.

What happened: the report generation job gradually consumed most of your shared token quota. Chat requests started completing, but only up to your max_tokens limit — semantically truncated, syntactically valid, silently wrong. Your standard monitoring never noticed because there was nothing to notice at the HTTP layer.

This is not an edge case. It is what happens when engineers treat LLM rate limits as a simple throttle problem instead of recognizing the class of distributed systems failure they actually are.
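One cheap guard is to stop trusting HTTP status alone and inspect the finish reason on every completion. In OpenAI-style responses that field is `finish_reason`, and a value of `"length"` means the output hit `max_tokens`, which is exactly the silent truncation above. Field names vary by provider, so treat this as a sketch:

```python
import logging

log = logging.getLogger("llm.guards")

def completion_text(response: dict) -> str:
    """Extract text from an OpenAI-style chat completion, flagging silent
    truncation: HTTP 200, valid JSON, but the model stopped on max_tokens."""
    choice = response["choices"][0]
    if choice.get("finish_reason") == "length":
        log.warning("llm_truncated_response", extra={"usage": response.get("usage")})
    return choice["message"]["content"]
```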

The Hidden Switching Costs of LLM Vendor Lock-In

· 11 min read
Tian Pan
Software Engineer

Most engineering teams believe they've insulated themselves from LLM vendor lock-in. They use LiteLLM to unify API calls. They avoid fine-tuning on hosted platforms. They keep raw data in their own storage. They feel safe. Then a provider announces a deprecation — or a competitor's pricing drops 40% — and the team discovers that the abstraction layer they built handles roughly 20% of the actual switching cost.

The other 80% is buried in places no one looked: system prompts written around a model's formatting quirks, eval suites calibrated to one model's refusal thresholds, embedding indexes that become incompatible the moment you change models, and user expectations shaped by behavioral patterns that simply don't transfer.

Multi-Region LLM Serving: The Cache Locality Problem Nobody Warns You About

· 10 min read
Tian Pan
Software Engineer

When you run a stateless HTTP API across multiple regions, the routing problem is essentially solved. Put a global load balancer in front, distribute requests by geography, and the worst thing that happens is a slightly stale cache entry. Any replica can serve any request with identical results.

LLM inference breaks every one of these assumptions. The moment you add prompt caching — which you will, because the cost difference between a cache hit and a cache miss is roughly 10x — your service becomes stateful in ways that most infrastructure teams don't anticipate until they're staring at degraded latency numbers in their second region.
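The usual fix is to stop routing purely by geography and add affinity on the cacheable prompt prefix, so requests sharing a prefix land on the replica that already holds the warm cache. A minimal sketch using consistent hashing on the prefix, with all names illustrative:

```python
import hashlib
from bisect import bisect

class PrefixAffinityRouter:
    """Route requests that share a prompt prefix to the same replica so prompt
    cache hits stay warm. Simple hash ring; replica names are illustrative."""

    def __init__(self, replicas: list[str], vnodes: int = 64):
        self._ring = sorted(
            (self._h(f"{rep}#{i}"), rep) for rep in replicas for i in range(vnodes)
        )
        self._keys = [k for k, _ in self._ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.sha256(s.encode()).hexdigest(), 16)

    def route(self, prompt: str, prefix_chars: int = 2048) -> str:
        # Hash only the stable prefix (system prompt, few-shot examples),
        # since that is the part the prompt cache keys on.
        key = self._h(prompt[:prefix_chars])
        idx = bisect(self._keys, key) % len(self._ring)
        return self._ring[idx][1]
```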

The Multi-Tenant LLM Problem: Noisy Neighbors, Isolation, and Fairness at Scale

· 12 min read
Tian Pan
Software Engineer

Your SaaS product launches with ten design partners. Everything works beautifully. Then you onboard a hundred tenants, and one of them — a power user running 200K-token context windows on a complex research workflow — causes every other customer's latency to spike. Support tickets start arriving. You look at your dashboards and see nothing obviously wrong: your model is healthy, your API returns 200s, and your p50 latency looks fine. Your p95 has silently tripled.

This is the noisy neighbor problem, and it hits LLM infrastructure harder than almost any other shared system. Here's why it's harder to solve than it is in databases — and the patterns that actually work.
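The workable patterns start from per-tenant admission control rather than one shared queue. As a rough sketch (rates and burst sizes are illustrative), a token-budget bucket per tenant caps how much of the shared capacity any single tenant can consume in a window:

```python
import time
from collections import defaultdict

class TenantTokenBucket:
    """Per-tenant budget over LLM tokens (not requests), refilled continuously.
    A 200K-context power user drains their own bucket, not everyone's p95."""

    def __init__(self, tokens_per_minute: int, burst: int):
        self.rate = tokens_per_minute / 60.0
        self.burst = burst
        self._level = defaultdict(lambda: float(burst))
        self._last = defaultdict(time.monotonic)

    def admit(self, tenant_id: str, estimated_tokens: int) -> bool:
        now = time.monotonic()
        elapsed = now - self._last[tenant_id]
        self._last[tenant_id] = now
        self._level[tenant_id] = min(self.burst, self._level[tenant_id] + elapsed * self.rate)
        if self._level[tenant_id] >= estimated_tokens:
            self._level[tenant_id] -= estimated_tokens
            return True
        return False   # queue, shed, or downgrade instead of degrading neighbors
```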

Your Team's Benchmarks Are Lying to Each Other: Shared Eval Infrastructure Contamination

· 10 min read
Tian Pan
Software Engineer

Your red team just finished a jailbreak sweep. They found three novel attack vectors, wrote them up, and dropped the prompts into your shared prompt library for others to learn from. The next week, the safety team runs their baseline evaluation and reports a 12% improvement in robustness. Everyone celebrates. Nobody asks why.

What actually happened: the safety team's baseline eval silently incorporated the red team's attack prompts. The model didn't get more robust — the eval got contaminated. Your benchmarks are now measuring inoculation against known attacks, not generalization to new ones.

This is shared eval infrastructure contamination, and it is far more common than most teams realize. The symptom is artificially inflated metrics. The cause is treating evaluation infrastructure like production infrastructure — optimized for sharing and efficiency rather than for isolation and fidelity.
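A cheap guardrail is to check, before every eval run, whether any baseline prompt also appears in shared infrastructure the team has been circulating. A minimal sketch, assuming prompts are plain strings and with the normalization left as a judgment call:

```python
import hashlib

def _fingerprint(prompt: str) -> str:
    # Normalize lightly so trivial whitespace edits don't hide an overlap.
    return hashlib.sha256(" ".join(prompt.split()).lower().encode()).hexdigest()

def contamination_report(eval_prompts: list[str], shared_library: list[str]) -> list[str]:
    """Return eval prompts that also exist in the shared prompt library.
    A non-empty result means the 'baseline' partly measures memorized attacks."""
    shared = {_fingerprint(p) for p in shared_library}
    return [p for p in eval_prompts if _fingerprint(p) in shared]
```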

The AI Dependency Footprint: When Every Feature Adds a New Infrastructure Owner

· 9 min read
Tian Pan
Software Engineer

Your team shipped a RAG-powered search feature last quarter. It required a vector database, an embedding model, an annotation pipeline, a chunking service, and an evaluation harness. Each component made sense individually. But six months later, you discover that three of those five components have no clear owner, two are running on engineers' personal cloud accounts, and one was quietly deprecated by its vendor without anyone noticing. The 3am page comes from a component nobody even remembers adding.

This is the AI dependency footprint problem: the compounding accumulation of infrastructure that each AI feature requires, combined with the organizational reality that teams rarely plan ownership for any of it before shipping.