639 posts tagged with "llm"

Beyond JSON Mode: Getting Reliable Structured Outputs from LLMs in Production

· 9 min read
Tian Pan
Software Engineer

You deploy a pipeline that extracts customer intent from support tickets. You've tested it extensively. It works great. Three days after launch, an alert fires: the downstream service is crashing on KeyError: 'category'. The model started returning ticket_category instead of category — no prompt change, just a model update your provider rolled out silently.

This is the structured output problem. And JSON mode doesn't solve it.
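One way to turn the silent rename above into a loud, early failure is to validate the parsed response against the fields the downstream service expects. The following is a minimal sketch using only the standard library; the `REQUIRED_FIELDS` schema, the `SchemaError` class, and the `parse_intent` helper are illustrative names, not part of any provider's API.

```python
import json

# Hypothetical schema: the fields the downstream service reads, with their types.
REQUIRED_FIELDS = {"category": str}


class SchemaError(ValueError):
    """Raised when the model output parses as JSON but has the wrong shape."""


def parse_intent(raw: str) -> dict:
    """Parse a model response and check it against REQUIRED_FIELDS."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        # Include the keys we did get, so a rename like ticket_category is obvious.
        raise SchemaError(f"missing {missing}; got keys {sorted(data)}")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[name], expected_type):
            raise SchemaError(f"{name} should be {expected_type.__name__}")
    return data
```

With a check like this at the boundary, a response containing `ticket_category` raises a `SchemaError` that names the unexpected keys, instead of surfacing later as a `KeyError` inside another service.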

What Your APM Dashboard Won't Tell You: LLM Observability in Production

· 10 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows 99.4% uptime, sub-500ms P95 latency, and a 0.1% error rate. Everything is green. Meanwhile, your support queue is filling with users complaining the AI gave them completely wrong answers. You have no idea why, because every request returned HTTP 200.

This is the fundamental difference between traditional observability and what you actually need for LLM systems. A language model can fail in ways that leave no trace in standard APM tooling: hallucinating facts, retrieving documents from the wrong product version, ignoring the system prompt after a code change modified it, or silently degrading on a specific query type after a model update. All of these look fine on your latency graph.

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

· 9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.
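The arithmetic behind those figures can be checked with a short sketch. It uses the per-million-token prices and the 85% routing share quoted above, and it assumes the router itself adds no cost and that both models see the same token counts, which real workloads will not match exactly. The function names are illustrative.

```python
def blended_price(frontier_price: float, cheap_price: float, cheap_share: float) -> float:
    """Average price per million tokens when cheap_share of traffic goes to the cheap model."""
    return (1 - cheap_share) * frontier_price + cheap_share * cheap_price


def saving(frontier_price: float, cheap_price: float, cheap_share: float) -> float:
    """Fractional cost reduction relative to sending everything to the frontier model."""
    return 1 - blended_price(frontier_price, cheap_price, cheap_share) / frontier_price


# Prices from the text: $5 / $25 (frontier) and $1 / $5 (efficient) per million tokens.
input_saving = saving(5.0, 1.0, cheap_share=0.85)
output_saving = saving(25.0, 5.0, cheap_share=0.85)
print(f"input: {input_saving:.0%}, output: {output_saving:.0%}")
```

Under these assumptions both input and output costs drop by about 68%, which sits inside the 45–85% range cited above.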

Fine-Tuning Is Usually the Wrong Move: A Decision Framework for LLM Customization

· 9 min read
Tian Pan
Software Engineer

Most engineering teams building LLM products follow the same progression: prompt a base model, hit a performance ceiling, and immediately reach for fine-tuning as the solution. This instinct is wrong more often than it's right.

Fine-tuning is a powerful tool. It can unlock real performance gains, cut inference costs at scale, and give you precise control over model behavior. But it carries hidden costs — in data, time, infrastructure, and ongoing maintenance — that teams systematically underestimate. And in many cases, prompt engineering or retrieval augmentation would have gotten them there faster and cheaper.

This post gives you a concrete framework for when each approach wins, grounded in recent benchmarks and production experience.

Streaming AI Applications in Production: What Nobody Warns You About

· 10 min read
Tian Pan
Software Engineer

The first sign something is wrong: your staging environment streams perfectly, but in production every user sees a blank screen, then the entire response appears at once. You check the LLM provider — fine. You check the backend — fine. The server is streaming tokens. They just never make it to the browser.

The culprit, 90% of the time: NGINX is buffering your response.

This is the most common streaming failure mode, and it's entirely invisible unless you know to look for it. It also captures something broader about production streaming: the problems aren't usually in the LLM integration. They're in all the infrastructure between the model and the user.
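A common mitigation is to have the application ask NGINX not to buffer a given response by sending the `X-Accel-Buffering: no` response header, which NGINX honors for proxied responses unless configured to ignore it. The sketch below is framework-agnostic and only shows the headers and the server-sent-events framing; the `[DONE]` sentinel and the `sse_events` helper are assumptions for illustration, and how you attach the headers depends on your web framework.

```python
import json
from typing import Iterable, Iterator

# Headers for a server-sent-events response behind NGINX.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    # Asks NGINX to pass chunks through instead of buffering the whole response.
    "X-Accel-Buffering": "no",
}


def sse_events(tokens: Iterable[str]) -> Iterator[bytes]:
    """Frame each token as one SSE event; JSON-encode so newlines cannot break framing."""
    for token in tokens:
        yield f"data: {json.dumps(token)}\n\n".encode("utf-8")
    # Hypothetical end-of-stream marker so the client knows when to stop reading.
    yield b"data: [DONE]\n\n"
```

The same effect can be achieved in the NGINX configuration itself with `proxy_buffering off;` on the relevant location; the header approach keeps the decision next to the endpoint that needs it.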

In Defense of AI Evals, for Everyone

· 7 min read
Tian Pan
Software Engineer

Every few months, a new wave of "don't bother with evals" takes hold in the AI engineering community. The argument usually goes: evals are too expensive, too brittle, too hard to define, and ultimately not worth the overhead for a fast-moving product team. Ship, iterate, and trust your instincts.

This is bad advice that produces bad software. A 2026 LangChain survey found that only 52% of organizations run offline evaluations and just 37% run online evals against live traffic — yet 32% cite quality as their number one barrier to production deployment. That is not a coincidence.

Data Flywheels for LLM Applications: Closing the Loop Between Production and Improvement

· 9 min read
Tian Pan
Software Engineer

Most LLM applications launch, observe some failures, patch the prompt, and repeat. That's not a flywheel — it's a treadmill. A real data flywheel is a self-reinforcing loop: production generates feedback, feedback improves the system, the improved system generates better interactions, which generate better feedback. Each revolution compounds the last.

The difference matters because foundation models have erased the traditional moat. Everyone calls the same GPT-4o or Claude endpoint. The new moat is proprietary feedback data from real users doing real tasks — data that's expensive, slow, and impossible to replicate from the outside.

Model Context Protocol: The Standard That Finally Solves AI Tool Integration

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped an AI product knows the integration tax. You want your agent to read from a database, trigger a GitHub PR, and post a Slack message. So you write a database connector, a GitHub connector, and a Slack connector — each a custom blob of code embedded in your prompt pipeline. Multiply that across three products and five data sources, and you have fifteen different integration paths to maintain. Anthropic called this "the M×N problem," and they're right.

The Model Context Protocol (MCP), launched in November 2024 and now stewarded by the Linux Foundation, is the industry's answer. Think of it the way the Language Server Protocol (LSP) transformed code editors: before LSP, every editor had to implement its own support for each language, TypeScript included. After LSP, VS Code, Neovim, and Emacs can all talk to the same TypeScript language server. MCP applies the same logic to AI: write a server once, then connect it to any MCP-compatible client, whether that is Claude, ChatGPT, Cursor, or GitHub Copilot.
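The integration count from the example above can be written out directly. This sketch assumes the usual framing of the M×N problem: without a shared protocol every product needs its own connector to every data source, while with one each product implements a single client and each source a single server. The function names are illustrative.

```python
def point_to_point_integrations(products: int, sources: int) -> int:
    """Custom connectors needed when every product talks to every source directly."""
    return products * sources


def shared_protocol_components(products: int, sources: int) -> int:
    """Components needed when products and sources each implement the protocol once."""
    return products + sources


print(point_to_point_integrations(3, 5))  # 15 integration paths, as in the text
print(shared_protocol_components(3, 5))   # 8 components: 3 clients plus 5 servers
```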

Reasoning Models in Production: When to Use Them and When Not To

· 8 min read
Tian Pan
Software Engineer

Most teams that adopt reasoning models make the same mistake: they start using them everywhere. A new model drops with impressive benchmark numbers, and within a week it's handling customer support, document summarization, and the two genuinely hard problems it was actually built for. Then the infrastructure bill arrives.

Reasoning models — o3, Claude with extended thinking, DeepSeek R1, and their successors — are legitimately different from standard LLMs. They perform an internal chain-of-thought before producing output, spending more compute cycles to search through the problem space. That extra work produces real gains on tasks that require multi-step logic. It also costs 5–10× more per request and adds 10–60 seconds of latency. Neither of those is acceptable as a default.

Structured Outputs in Production: Engineering Reliable JSON from LLMs

· 10 min read
Tian Pan
Software Engineer

LLMs are text generators. Your application needs data structures. The gap between those two facts is where production bugs live.

Every team building with LLMs hits this wall. The model works great in the playground — returns something that looks like JSON, mostly has the right fields, usually passes a JSON.parse. Then you ship it, and your parsing layer starts throwing exceptions at 2am. The response had a trailing comma. Or a markdown code fence. Or the model decided to add an explanatory paragraph before the JSON. Or it hallucinated a field name.

The industry has spent three years converging on solutions to this problem. This is what that convergence looks like, and what still trips teams up.
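Two of the failure modes listed above, the markdown code fence and the explanatory paragraph before the JSON, can be handled with a small defensive parser. This is a minimal sketch rather than a complete solution: the `extract_json_object` helper is hypothetical, it assumes the first `{` in the response starts the payload, and it deliberately does not repair trailing commas, which still surface as a `ValueError`.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_object(text: str) -> dict:
    """Pull the first JSON object out of a model response that may wrap it in prose."""
    fenced = FENCE_RE.search(text)
    candidate = fenced.group(1) if fenced else text
    start = candidate.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response")
    # raw_decode parses one value and ignores whatever text follows it.
    obj, _end = json.JSONDecoder().raw_decode(candidate[start:])
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    return obj
```

A parser like this is a fallback for providers or models without schema-constrained decoding; it does nothing about hallucinated field names, which still need validation after parsing.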

Prompt Caching: The Optimization That Cuts LLM Costs by 90%

· 7 min read
Tian Pan
Software Engineer

Most teams building on LLMs are overpaying by 60–90%. Not because they're using the wrong model or prompting inefficiently — but because they're reprocessing the same tokens on every single request. Prompt caching fixes this, and it takes about ten minutes to implement. Yet it remains one of the most underutilized optimizations in production LLM systems.

Here's what's happening: every time you send a request to an LLM API, the model runs attention over every token in your prompt. If your system prompt is 10,000 tokens and you're handling 1,000 requests per day, you're paying to process 10 million tokens daily just for the static part of your prompt — context that never changes. Prompt caching stores the intermediate computation (the key-value attention states) so subsequent requests can skip that work entirely.
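The token arithmetic above is easy to reproduce. In this sketch the 10,000-token prompt and 1,000 requests per day come from the text, while the $5 per million input tokens and the 0.1× cached-read multiplier are assumed values for illustration; actual prices, cache-write surcharges, and cache expiry vary by provider and are ignored here.

```python
def static_tokens_per_day(prompt_tokens: int, requests_per_day: int) -> int:
    """Tokens spent each day re-sending the unchanging part of the prompt."""
    return prompt_tokens * requests_per_day


def static_cost_usd(tokens: int, price_per_million_usd: float,
                    cached_read_multiplier: float = 1.0) -> float:
    """Daily cost of those tokens, optionally billed at a discounted cached rate."""
    return tokens / 1_000_000 * price_per_million_usd * cached_read_multiplier


ASSUMED_PRICE = 5.0             # hypothetical USD per million input tokens
ASSUMED_CACHE_MULTIPLIER = 0.1  # hypothetical: cached reads billed at 10% of base

tokens = static_tokens_per_day(10_000, 1_000)  # 10 million tokens, as in the text
uncached = static_cost_usd(tokens, ASSUMED_PRICE)
cached = static_cost_usd(tokens, ASSUMED_PRICE, ASSUMED_CACHE_MULTIPLIER)
print(f"{tokens:,} tokens/day: ${uncached:.2f} uncached vs ${cached:.2f} cached")
```

Under these assumed prices the static portion of the prompt costs roughly $50 per day uncached and roughly $5 per day when every request after the first hits the cache.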

LLM Observability in Production: The Four Silent Failures Engineers Miss

· 9 min read
Tian Pan
Software Engineer

Most teams shipping LLM applications to production have a logging setup they mistake for observability. They store prompts and responses in a database, track token counts in a spreadsheet, and set up latency alerts in Datadog. Then a user reports the chatbot gave wrong answers for two days, and nobody can tell you why — because none of the data collected tells you whether the model was actually right.

Traditional monitoring answers "is the system up and how fast is it?" LLM observability answers a harder question: "is the system doing what it's supposed to do, and when did it stop?" That distinction matters enormously when your system's behavior is probabilistic, context-dependent, and often wrong in ways that don't trigger any alert.