
269 posts tagged with "ai-engineering"

Beyond JSON Mode: Getting Reliable Structured Outputs from LLMs in Production

· 9 min read
Tian Pan
Software Engineer

You deploy a pipeline that extracts customer intent from support tickets. You've tested it extensively. It works great. Three days after launch, an alert fires: the downstream service is crashing on KeyError: 'category'. The model started returning ticket_category instead of category — no prompt change, just a model update your provider rolled out silently.

This is the structured output problem. And JSON mode doesn't solve it.
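
One way to make that failure loud instead of silent is to validate model output at the boundary. A minimal sketch, assuming Pydantic; the ticket fields mirror the example above and are illustrative:

```python
# Validate LLM output against an explicit schema so drift fails at the boundary,
# not as a KeyError deep in a downstream service. Field names are illustrative.
from pydantic import BaseModel, ValidationError

class TicketIntent(BaseModel):
    category: str
    urgency: str

def parse_intent(raw: str) -> TicketIntent:
    try:
        return TicketIntent.model_validate_json(raw)
    except ValidationError as exc:
        # A silently renamed field (e.g. ticket_category) surfaces here, where
        # you can log it, retry, or fall back to a default.
        raise ValueError(f"Model output failed schema validation: {exc}") from exc
```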

What Your APM Dashboard Won't Tell You: LLM Observability in Production

· 10 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows 99.4% uptime, sub-500ms P95 latency, and a 0.1% error rate. Everything is green. Meanwhile, your support queue is filling with users complaining the AI gave them completely wrong answers. You have no idea why, because every request returned HTTP 200.

This is the fundamental difference between traditional observability and what you actually need for LLM systems. A language model can fail in ways that leave no trace in standard APM tooling: hallucinating facts, retrieving documents from the wrong product version, ignoring the system prompt after a code change modified it, or silently degrading on a specific query type after a model update. All of these look fine on your latency graph.

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

· 9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.
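
A quick back-of-the-envelope check of those numbers, assuming the prices quoted above and 85% of traffic routed to the efficient tier:

```python
# Blended per-million-token cost at the prices quoted above, assuming 85% of
# queries go to the efficient model and 15% to the frontier model.
FRONTIER = {"input": 5.00, "output": 25.00}   # $ per million tokens
EFFICIENT = {"input": 1.00, "output": 5.00}
CHEAP_SHARE = 0.85

for kind in ("input", "output"):
    blended = CHEAP_SHARE * EFFICIENT[kind] + (1 - CHEAP_SHARE) * FRONTIER[kind]
    saving = 1 - blended / FRONTIER[kind]
    print(f"{kind}: ${blended:.2f}/M tokens, {saving:.0%} below frontier-only")
# input: $1.60/M tokens, 68% below frontier-only
# output: $8.00/M tokens, 68% below frontier-only
```

The exact saving moves with your routing share and token mix, which is why the cited range spans 45–85%.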

Fine-Tuning Is Usually the Wrong Move: A Decision Framework for LLM Customization

· 9 min read
Tian Pan
Software Engineer

Most engineering teams building LLM products follow the same progression: prompt a base model, hit a performance ceiling, and immediately reach for fine-tuning as the solution. This instinct is wrong more often than it's right.

Fine-tuning is a powerful tool. It can unlock real performance gains, cut inference costs at scale, and give you precise control over model behavior. But it carries hidden costs — in data, time, infrastructure, and ongoing maintenance — that teams systematically underestimate. And in many cases, prompt engineering or retrieval augmentation would have gotten them there faster and cheaper.

This post gives you a concrete framework for when each approach wins, grounded in recent benchmarks and production experience.

In Defense of AI Evals, for Everyone

· 7 min read
Tian Pan
Software Engineer

Every few months, a new wave of "don't bother with evals" sentiment takes hold in the AI engineering community. The argument usually goes: evals are too expensive, too brittle, too hard to define, and ultimately not worth the overhead for a fast-moving product team. Ship, iterate, and trust your instincts.

This is bad advice that produces bad software. A 2026 LangChain survey found that only 52% of organizations run offline evaluations and just 37% run online evals against live traffic — yet 32% cite quality as their number one barrier to production deployment. That is not a coincidence.

Data Flywheels for LLM Applications: Closing the Loop Between Production and Improvement

· 9 min read
Tian Pan
Software Engineer

Most LLM applications launch, observe some failures, patch the prompt, and repeat. That's not a flywheel — it's a treadmill. A real data flywheel is a self-reinforcing loop: production generates feedback, feedback improves the system, the improved system generates better interactions, which generate better feedback. Each revolution compounds the last.

The difference matters because foundation models have erased the traditional moat. Everyone calls the same GPT-4o or Claude endpoint. The new moat is proprietary feedback data from real users doing real tasks — data that's expensive, slow, and impossible to replicate from the outside.

Reasoning Models in Production: When to Use Them and When Not To

· 8 min read
Tian Pan
Software Engineer

Most teams that adopt reasoning models make the same mistake: they start using them everywhere. A new model drops with impressive benchmark numbers, and within a week it's handling customer support, document summarization, and the two genuinely hard problems it was actually built for. Then the infrastructure bill arrives.

Reasoning models — o3, Claude with extended thinking, DeepSeek R1, and their successors — are legitimately different from standard LLMs. They perform an internal chain-of-thought before producing output, spending more compute cycles to search through the problem space. That extra work produces real gains on tasks that require multi-step logic. It also costs 5–10× more per request and adds 10–60 seconds of latency. Neither of those is acceptable as a default.

Structured Outputs in Production: Engineering Reliable JSON from LLMs

· 10 min read
Tian Pan
Software Engineer

LLMs are text generators. Your application needs data structures. The gap between those two facts is where production bugs live.

Every team building with LLMs hits this wall. The model works great in the playground — returns something that looks like JSON, mostly has the right fields, usually makes it through JSON.parse. Then you ship it, and your parsing layer starts throwing exceptions at 2am. The response had a trailing comma. Or a markdown code fence. Or the model decided to add an explanatory paragraph before the JSON. Or it hallucinated a field name.

The industry has spent three years converging on solutions to this problem. This is what that convergence looks like, and what still trips teams up.
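
A defensive extraction layer catches most of the wrapper failures above before anything tries to use the data. A rough sketch (the helper is hypothetical; trailing commas still need a retry or a more tolerant parser, and schema validation should follow):

```python
# Strip markdown fences and surrounding prose, then parse the outermost JSON
# object. A sketch, not a complete repair layer: trailing commas, for example,
# will still fail json.loads and need a retry or a tolerant parser.
import json
import re

def extract_json(raw: str) -> dict:
    cleaned = re.sub(r"`{3}(?:json)?", "", raw)          # drop code fences
    start, end = cleaned.find("{"), cleaned.rfind("}")   # skip leading/trailing prose
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(cleaned[start : end + 1])
```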

Prompt Injection in Production: The Attack Patterns That Actually Work and How to Stop Them

· 11 min read
Tian Pan
Software Engineer

Prompt injection is the number one vulnerability in the OWASP Top 10 for LLM applications — and the gap between how engineers think it works and how attackers actually exploit it keeps getting wider. A 2024 study tested 36 production LLM-integrated applications and found 31 susceptible. A 2025 red-team exercise found that 100% of published prompt defenses could be bypassed by human attackers given enough attempts.

The hard truth: the naive defenses most teams reach for first — system prompt warnings, keyword filters, output sanitization alone — fail against any attacker who tries more than one approach. What works is architectural: separating privilege, isolating untrusted data, and constraining what an LLM can actually do based on what it has seen.
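
One concrete form of constraining the model based on what it has seen is a taint flag: once untrusted content enters the context, high-privilege tools leave the allowlist. A sketch with illustrative tool names:

```python
# Once untrusted content (a web page, an email, a retrieved document) enters the
# context, the session is tainted and only read-only tools remain callable.
# Tool names and the Session shape are illustrative.
HIGH_PRIVILEGE = {"send_email", "delete_record", "transfer_funds"}
READ_ONLY = {"search_docs", "summarize", "lookup_order"}

class Session:
    def __init__(self) -> None:
        self.tainted = False

    def ingest_untrusted(self, content: str) -> str:
        self.tainted = True
        return content

    def allowed_tools(self) -> set[str]:
        return READ_ONLY if self.tainted else READ_ONLY | HIGH_PRIVILEGE
```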

This post is a field guide for engineers building production systems. No CTF-style toy examples — just the attack patterns causing real incidents and the defense patterns that measurably reduce risk.

LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries

· 11 min read
Tian Pan
Software Engineer

Most teams reach the same inflection point: LLM API costs are scaling faster than usage, and every query — whether "summarize this sentence" or "audit this 2,000-line codebase for security vulnerabilities" — hits the same expensive model. The fix isn't squeezing prompts. It's routing.

LLM routing means directing each request to the most appropriate model for that specific task. Not the most capable model. The right model — balancing cost, latency, and quality for what the query actually demands. Done well, routing cuts LLM costs by 50–85% with minimal quality degradation. Done poorly, it creates silent quality regressions you won't detect until users churn.

This post covers the mechanics, the tradeoffs, and what actually breaks in production.
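
In its simplest form, a router is a scoring function in front of a model choice. A minimal sketch; the model names and the complexity heuristic are placeholder assumptions, and production routers typically use a trained classifier instead:

```python
# Route short, low-complexity queries to a cheap model and escalate the rest.
# Model names and the heuristic are illustrative stand-ins for a real classifier.
def estimate_complexity(query: str) -> float:
    score = min(len(query) / 2000, 1.0)  # longer queries score higher
    if any(k in query.lower() for k in ("audit", "refactor", "prove", "debug")):
        score = max(score, 0.8)          # code-review-style verbs escalate
    return score

def route(query: str) -> str:
    return "efficient-model" if estimate_complexity(query) < 0.5 else "frontier-model"
```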

Token Budget Strategies for Production LLM Applications

· 10 min read
Tian Pan
Software Engineer

Most teams discover their context management problem the same way: a production agent that worked fine in demos starts hallucinating after 15 conversation turns. The logs show valid JSON, the model returned 200, and nobody changed the code. What changed was the accumulation — tool results, retrieved documents, and conversation history quietly filled the context window until the model was reasoning over 80,000 tokens of mixed-relevance content.

Context overflow is the obvious failure mode, but "context rot" is the insidious one. Research shows that LLM performance degrades before you hit the limit. As context grows, models exhibit a lost-in-the-middle effect: attention concentrates at the beginning and end of the input while content in the middle becomes unreliable. Instructions buried at turn 12 of a 30-turn conversation may effectively disappear. The model doesn't error out — it just quietly ignores them.
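
The baseline mitigation is an explicit token budget: keep the system prompt and the most recent turns that fit, and drop the oldest content first. A rough sketch, using a crude length-based estimate as a stand-in for a real tokenizer:

```python
# Keep the system prompt plus as many recent turns as fit in the budget,
# dropping the oldest turns first. len(text) // 4 is a rough token estimate;
# use the model's tokenizer in practice.
def fit_to_budget(system: str, turns: list[str], budget: int) -> list[str]:
    def tokens(text: str) -> int:
        return len(text) // 4

    kept: list[str] = []
    remaining = budget - tokens(system)
    for turn in reversed(turns):      # walk newest to oldest
        cost = tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return [system] + list(reversed(kept))
```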

Load Testing LLM Applications: Why k6 and Locust Lie to You

· 11 min read
Tian Pan
Software Engineer

You ran your load test. k6 reported 200ms average latency, 99th percentile under 800ms, zero errors at 50 concurrent users. You shipped to production. Within a week, users were reporting 8-second hangs, dropped connections, and token budget exhaustion mid-stream. What happened?

The test passed because you measured the wrong things. Conventional load testing tools were designed for stateless HTTP endpoints that return a complete response in milliseconds. LLM APIs behave like nothing those tools were built to model: they stream tokens over seconds, charge by the token rather than the request, saturate GPU memory rather than CPU threads, and respond completely differently depending on whether a cache is warm. A k6 script that hammer-tests your /chat/completions endpoint will produce numbers that look like performance data but contain almost no signal about what production actually looks like.
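
For a streaming endpoint, the numbers worth collecting are time-to-first-token and sustained tokens per second, not a single request timer. A sketch of measuring both, assuming a hypothetical stream_completion generator that yields tokens:

```python
# Measure time-to-first-token (TTFT) and tokens/sec for one streamed request.
# `stream_completion` is a hypothetical generator yielding tokens from your endpoint.
import time

def measure_stream(stream_completion, prompt: str) -> dict:
    start = time.monotonic()
    first_token_at = None
    token_count = 0
    for _ in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.monotonic()
        token_count += 1
    end = time.monotonic()
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "tokens_per_s": token_count / (end - start) if end > start else 0.0,
        "total_s": end - start,
    }
```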