Skip to main content

17 posts tagged with "infrastructure"

View all tags

The Streaming Infrastructure Behind Real-Time Agent UIs

· 12 min read
Tian Pan
Software Engineer

Most agent streaming implementations break in one of four ways: the proxy eats the stream silently, the user closes the tab and the agent runs forever burning tokens, the page refreshes and the task is simply gone, or a tool call fails mid-stream and the agent goes quietly idle. None of these are model problems. They are infrastructure problems that teams discover in production after their demo went fine on localhost.

This post is about that gap — the server-side architecture decisions that determine whether a real-time agent UI is actually reliable, not just impressive in a demo environment.

Multi-Tenant LLM API Infrastructure: What Breaks at Scale

· 9 min read
Tian Pan
Software Engineer

Most teams start with a single API key for their LLM provider, shared across everything. It works until it doesn't. Then one afternoon, a bulk job in the data pipeline consumes the entire rate limit and the user-facing chat feature goes silent. Or finance asks you to break down the $40k LLM bill by team, and you realize you have no way to answer that question.

A production API gateway in front of your LLM providers solves both of these problems — but it introduces a category of complexity that most teams underestimate until they're already in trouble.

Streaming AI Applications in Production: What Nobody Warns You About

· 10 min read
Tian Pan
Software Engineer

The first sign something is wrong: your staging environment streams perfectly, but in production every user sees a blank screen, then the entire response appears at once. You check the LLM provider — fine. You check the backend — fine. The server is streaming tokens. They just never make it to the browser.

The culprit, 90% of the time: NGINX is buffering your response.

This is the most common streaming failure mode, and it's entirely invisible unless you know to look for it. It also captures something broader about production streaming: the problems aren't usually in the LLM integration. They're in all the infrastructure between the model and the user.

LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries

· 11 min read
Tian Pan
Software Engineer

Most teams reach the same inflection point: LLM API costs are scaling faster than usage, and every query — whether "summarize this sentence" or "audit this 2,000-line codebase for security vulnerabilities" — hits the same expensive model. The fix isn't squeezing prompts. It's routing.

LLM routing means directing each request to the most appropriate model for that specific task. Not the most capable model. The right model — balancing cost, latency, and quality for what the query actually demands. Done well, routing cuts LLM costs by 50–85% with minimal quality degradation. Done poorly, it creates silent quality regressions you won't detect until users churn.

This post covers the mechanics, the tradeoffs, and what actually breaks in production.

Why Long-Running AI Agents Break in Production (And the Infrastructure to Fix It)

· 9 min read
Tian Pan
Software Engineer

Most AI agent demos work beautifully.

They run in under 30 seconds, hit three tools, and return a clean result. Then someone asks the agent to do something that actually matters — cross-reference a codebase, run a multi-stage data pipeline, process a batch of documents — and the whole thing falls apart in a cascade of timeouts, partial state, and duplicate side effects.

The problem is not the model. It is the infrastructure. Agents that run for minutes or hours face a completely different class of systems problems than agents that finish in seconds, and most teams hit this wall at the worst possible time: after they have already shipped something users depend on.