720 posts tagged with "llm"

Multi-Tenant LLM API Infrastructure: What Breaks at Scale

· 9 min read
Tian Pan
Software Engineer

Most teams start with a single API key for their LLM provider, shared across everything. It works until it doesn't. Then one afternoon, a bulk job in the data pipeline consumes the entire rate limit and the user-facing chat feature goes silent. Or finance asks you to break down the $40k LLM bill by team, and you realize you have no way to answer that question.

A production API gateway in front of your LLM providers solves both of these problems — but it introduces a category of complexity that most teams underestimate until they're already in trouble.
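As a rough illustration of how a gateway can address both problems, here is a minimal sketch of per-tenant rate limiting with usage accounting. The names `TenantBucket`, `Gateway`, and `admit`, and the limits used, are hypothetical and not taken from the post. A real gateway would also need shared state across instances and reconciliation against the provider's reported token counts.

```python
from collections import defaultdict


class TenantBucket:
    """Token bucket for one tenant (hypothetical helper for illustration)."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = float(capacity)            # maximum burst, in LLM tokens
        self.refill_per_sec = float(refill_per_sec)  # sustained budget per second
        self.tokens = float(capacity)              # start with a full bucket
        self.updated = now

    def try_consume(self, amount, now):
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if amount > self.tokens:
            return False
        self.tokens -= amount
        return True


class Gateway:
    """Admits requests per tenant and records usage for cost attribution."""

    def __init__(self, limits):
        # limits maps tenant name -> (capacity, refill_per_sec); values are assumptions.
        self.buckets = {t: TenantBucket(c, r) for t, (c, r) in limits.items()}
        self.usage = defaultdict(int)

    def admit(self, tenant, estimated_tokens, now):
        # `now` is passed in to keep the sketch deterministic; a live system
        # would typically use a monotonic clock here.
        allowed = self.buckets[tenant].try_consume(estimated_tokens, now)
        if allowed:
            self.usage[tenant] += estimated_tokens
        return allowed
```

With separate buckets, a bulk job that exhausts its own budget is rejected without touching the chat tenant's budget, and `usage` gives a per-tenant breakdown to start a cost report from.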

Multimodal LLM Inputs in Production: Vision, Documents, and the Failure Modes Nobody Warns You About

· 9 min read
Tian Pan
Software Engineer

Adding vision to an LLM application looks deceptively simple. You swap a text model for a multimodal one, pass in an image alongside your prompt, and the demo works brilliantly. Then you push to production and discover that half your invoices get the total wrong, tables in PDFs lose their structure, and low-quality scans produce confident hallucinations. The debugging is harder than anything you faced with text-only systems, because the failures are visual and the LLM will not tell you it cannot see clearly.

This post covers what actually goes wrong when you move multimodal LLM inputs from prototype to production, and the architectural decisions that prevent those failures.

Prompt Versioning in Production: The Engineering Discipline Teams Learn the Hard Way

· 10 min read
Tian Pan
Software Engineer

You get paged at 2am. Users are reporting garbage output. You SSH in, check logs, stare at traces — everything looks structurally fine. The model is responding. Latency is normal. But something is wrong with the answers. Then the question lands in your incident channel: "Which prompt version is actually running right now?"

If you can't answer that question in under thirty seconds, you have a prompt versioning problem.

Prompts are treated like configuration in most early-stage LLM projects. A product manager edits a string in a .env file, a developer pastes an updated instruction into a hardcoded constant, someone else pastes a slightly different version into a staging Slack channel. Eventually the versions diverge, and nobody has a complete picture of what's running where. The experimentation-phase casualness that got you to launch becomes a liability the moment you have real users.
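One way to make "which prompt version is running?" answerable is to derive the version from the prompt text itself and keep an explicit record of what is active. The sketch below assumes a content-hash scheme; `prompt_version` and `PromptRegistry` are hypothetical names, and the post may recommend a different mechanism.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Derive a short, stable version id from the prompt text."""
    # Normalize trailing whitespace so cosmetic edits do not create new versions.
    normalized = "\n".join(line.rstrip() for line in template.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


class PromptRegistry:
    """In-memory registry (illustrative only; a real one would be persistent)."""

    def __init__(self):
        self._by_version = {}  # (name, version) -> template
        self._active = {}      # name -> active version

    def register(self, name: str, template: str) -> str:
        version = prompt_version(template)
        self._by_version[(name, version)] = template
        return version

    def activate(self, name: str, version: str) -> None:
        if (name, version) not in self._by_version:
            raise KeyError(f"unknown version {version!r} for prompt {name!r}")
        self._active[name] = version

    def active(self, name: str):
        """Return (version, template) for the prompt currently in use."""
        version = self._active[name]
        return version, self._by_version[(name, version)]
```

Logging the version id returned by `active` alongside every model call is what lets an on-call engineer answer the question from a trace rather than from memory.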

Semantic Caching for LLM Applications: What the Benchmarks Don't Tell You

· 8 min read
Tian Pan
Software Engineer

Every vendor selling an LLM gateway will show you a slide with "95% cache hit rate." What that slide won't show you is the fine print: that number refers to match accuracy when a hit is found, not how often a hit is found in the first place. Real production systems see 20–45% hit rates — and that gap between marketing and reality is where most teams get burned.

Semantic caching is a genuinely useful technique. But deploying it without understanding its failure modes is how you end up returning wrong answers to users with high confidence, wondering why your support queue doubled.
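To make the distinction between hit rate and match accuracy concrete, here is a minimal sketch of a semantic cache that tracks its own hit rate. It assumes the caller supplies embeddings as plain lists of floats; the class name, the linear scan, and the 0.9 default threshold are illustrative assumptions rather than details from the post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Linear-scan semantic cache (a real system would use a vector index)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (embedding, answer)
        self.lookups = 0
        self.hits = 0

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        self.lookups += 1
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            self.hits += 1
            return best_answer
        return None

    def hit_rate(self):
        """Fraction of lookups served from cache, regardless of correctness."""
        return self.hits / self.lookups if self.lookups else 0.0
```

Note that `hit_rate` only says how often the cache answered. Whether a returned answer was actually right for the new query is a separate measurement, and lowering `threshold` raises the first number while putting the second at risk.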

JSON Mode Won't Save You: Structured Output Failures in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

When developers first wire up JSON mode, the result feels like a solved problem. The LLM stops returning markdown fences, prose apologies, and curly-brace-adjacent gibberish. The output parses. The tests pass. Production ships.

Then, three weeks later, a background job silently fails because the model returned {"status": "complete"} when the schema expected {"status": "completed"}. A data pipeline crashes because a required field came back as null instead of being omitted. An agent tool-call loop terminates early because the model embedded a stray newline inside a string value and the downstream parser choked on it.

JSON mode guarantees syntactically valid JSON. It does not guarantee that the JSON means what you think it means, contains the fields your application expects, or maintains semantic consistency across requests. These are different problems, and they require different solutions.
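The gap between "parses" and "means what you expect" can be shown with a small validation step after parsing. This is a sketch using only the standard library; the field name `status` comes from the example above, while the allowed values and the function name `parse_job_result` are assumptions made for illustration.

```python
import json

# Hypothetical set of values the downstream code knows how to handle.
ALLOWED_STATUS = {"pending", "completed", "failed"}


def parse_job_result(raw: str) -> dict:
    """Parse model output and reject values the application does not expect."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    return data
```

Here `{"status": "complete"}` is valid JSON and would pass JSON mode, yet it is rejected at the boundary with an explicit error instead of failing silently in a background job.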

The Tool Selection Problem: How Agents Choose What to Call When They Have Dozens of Tools

· 10 min read
Tian Pan
Software Engineer

Most agent demos work with five tools. Production systems have fifty. The gap between those two numbers is where most agent architectures fall apart.

When you give an LLM four tools and a clear task, it usually picks the right one. When you give it fifty tools, something more interesting happens: accuracy collapses, token costs balloon, and the failure mode often looks like the model hallucinating a tool call rather than admitting it doesn't know which tool to use. Research from the Berkeley Function Calling Leaderboard found accuracy dropping from 43% to just 2% on calendar scheduling tasks when the number of tools expanded from 4 to 51 across multiple domains. That is not a graceful degradation curve.

When Thinking Models Actually Help: A Production Decision Framework for Inference-Time Compute

· 10 min read
Tian Pan
Software Engineer

In one study, researchers asked reasoning models to compare two numbers: 0.9 and 0.11. One model took 42 seconds to answer. The arithmetic takes about a millisecond. The model spent nearly all of the remaining time thinking, and thinking badly. It re-examined its answer, doubted itself, reconsidered, and arrived at the correct conclusion it had already reached in its first three tokens.

This is the overthinking problem, and it is not a corner case. It is what happens when you apply inference-time compute indiscriminately to tasks that don't need it.

The emergence of reasoning models — o1, o3, DeepSeek R1, Claude with extended thinking — represents a genuine capability leap for hard problems. It also introduces a new class of production mistakes: deploying expensive, slow deliberation where fast, cheap generation was perfectly adequate. Getting this decision right is increasingly central to building AI systems that actually work.

The Three Memory Systems Every Production AI Agent Needs

· 10 min read
Tian Pan
Software Engineer

Most AI agents fail the same way: they work perfectly in demos and fall apart after the tenth real conversation. The agent that helped a user configure a billing integration last Tuesday has no idea who that user is today. It asks for their company name again. Then their plan tier. Then re-explains concepts the user already knows. The experience degrades from "useful assistant" to "chatbot with amnesia."

The instinct is to throw more context at the problem — stuff the conversation history into the prompt and call it solved. That works until it doesn't. At scale, full-context approaches become prohibitively expensive, and more troublingly, performance degrades as input grows. Research shows LLM accuracy drops measurably as context length increases, even within a model's advertised limits. A 1M-token context window is not a memory system.

The agents that work in production treat memory as a first-class architectural concern, not an afterthought. And the ones that get it right distinguish between three fundamentally different types of information that need to persist — each with different storage patterns, retrieval strategies, and decay characteristics.

What Nobody Tells You About Running MCP in Production

· 10 min read
Tian Pan
Software Engineer

The Model Context Protocol sells itself as a USB-C port for AI — plug any tool into any model and watch them talk. In practice, the first day feels like that. The second day you hit a scaling bug. By the third day you're reading CVEs about tool poisoning attacks you didn't know existed.

MCP is a genuinely useful standard. Introduced in late 2024 and quickly adopted across the industry, it has solved real integration friction between LLMs and external systems. But the gap between "got a demo working" and "running reliably under load with real users" is larger than most teams expect. Here's what that gap actually looks like.

Beyond JSON Mode: Getting Reliable Structured Outputs from LLMs in Production

· 9 min read
Tian Pan
Software Engineer

You deploy a pipeline that extracts customer intent from support tickets. You've tested it extensively. It works great. Three days after launch, an alert fires: the downstream service is crashing on KeyError: 'category'. The model started returning ticket_category instead of category — no prompt change, just a model update your provider rolled out silently.

This is the structured output problem. And JSON mode doesn't solve it.

What Your APM Dashboard Won't Tell You: LLM Observability in Production

· 10 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows 99.4% uptime, sub-500ms P95 latency, and a 0.1% error rate. Everything is green. Meanwhile, your support queue is filling with users complaining the AI gave them completely wrong answers. You have no idea why, because every request returned HTTP 200.

This is the fundamental difference between traditional observability and what you actually need for LLM systems. A language model can fail in ways that leave no trace in standard APM tooling: hallucinating facts, retrieving documents from the wrong product version, ignoring the system prompt after a code change modified it, or silently degrading on a specific query type after a model update. All of these look fine on your latency graph.

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

· 9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.
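The arithmetic behind those figures can be checked with a short calculation. The sketch below uses the prices and the 85% routing share quoted above, and assumes that routed and unrouted queries have the same token counts and that the router itself costs nothing; both are simplifications, and the function names are illustrative.

```python
def blended_cost(frontier_price, cheap_price, cheap_share):
    """Average price per million tokens when cheap_share of traffic is routed away."""
    return (1 - cheap_share) * frontier_price + cheap_share * cheap_price


def savings(frontier_price, cheap_price, cheap_share):
    """Fractional cost reduction relative to sending everything to the frontier model."""
    return 1 - blended_cost(frontier_price, cheap_price, cheap_share) / frontier_price


# Prices per million tokens, as quoted in the text.
input_blended = blended_cost(5.0, 1.0, 0.85)    # about $1.60 per million input tokens
output_blended = blended_cost(25.0, 5.0, 0.85)  # about $8.00 per million output tokens
reduction = savings(5.0, 1.0, 0.85)             # about 0.68, i.e. a 68% reduction
```

Under these assumptions the reduction is about 68% for both input and output tokens, which sits inside the 45–85% range cited; the actual figure depends on how traffic and token volume split between the two tiers.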