
200 posts tagged with "ai-engineering"


Long-Context Models vs. RAG: When the 1M-Token Window Is the Wrong Tool

· 9 min read
Tian Pan
Software Engineer

When Gemini 1.5 Pro launched with a 1M-token context window, a wave of engineers declared RAG dead. The argument seemed airtight: why build a retrieval pipeline with chunkers, embeddings, vector databases, and re-rankers when you can just dump your entire knowledge base into the prompt and let the model figure it out?

That argument collapses under production load. Gemini 1.5 Pro achieves 99.7% recall on the "needle in a haystack" benchmark — a single fact hidden in a document. On realistic multi-fact retrieval, average recall hovers around 60%. That 40% miss rate isn't a benchmarking artifact; it's facts your system silently fails to surface to users. And the latency for a 1M-token request runs 30–60x slower than a RAG pipeline at roughly 1,250x the per-query cost.

Long-context models are a powerful tool. They're just not the right tool for most production retrieval workloads.
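Back-of-envelope arithmetic makes the gap concrete. A minimal sketch, assuming a hypothetical price of $2.50 per million input tokens and a RAG pipeline that retrieves roughly 800 tokens of context per query (both numbers illustrative, not any provider's actual pricing):

```python
# Compare per-query input cost: full 1M-token corpus in every prompt
# vs. a RAG pipeline that sends only the retrieved chunks.
PRICE_PER_MTOK = 2.50  # hypothetical $/million input tokens

def input_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Dollar cost of the input tokens for a single request."""
    return tokens / 1_000_000 * price_per_mtok

long_context = input_cost(1_000_000)  # entire knowledge base in the prompt
rag = input_cost(800)                 # top-k retrieved chunks only

print(f"long-context: ${long_context:.4f}/query")
print(f"rag:          ${rag:.4f}/query")
print(f"ratio:        {long_context / rag:.0f}x")
```

At an 800-token retrieval budget, the ratio works out to the roughly 1,250x per-query cost difference cited above; the exact multiple scales with how much context your retriever actually sends.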

Structured Concurrency for AI Pipelines: Why asyncio.gather() Isn't Enough

· 9 min read
Tian Pan
Software Engineer

When an LLM returns three tool calls in a single response, the obvious thing is to run them in parallel. You reach for asyncio.gather(), fan the calls out, collect the results, return them to the model. The code works in testing. It works in staging. Six weeks into production, you start noticing your application holding open HTTP connections it should have released. Token quota is draining faster than usage metrics suggest. Occasionally, a tool that sends an email fires twice.

The underlying issue is not the LLM or the tool — it's the concurrency primitive. asyncio.gather() was not designed for the failure modes that multi-step agent pipelines produce, and using it as the backbone of parallel tool execution creates problems that are invisible until they compound.
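A minimal sketch of the difference, assuming Python 3.11+ for asyncio.TaskGroup; the tool names and timings are illustrative. With gather(), one failing tool call leaves its siblings running, side effects and all; a TaskGroup cancels the survivors so the pipeline fails as a unit:

```python
import asyncio

side_effects: list[str] = []

async def send_email() -> str:
    # Slow tool with an irreversible side effect.
    await asyncio.sleep(0.05)
    side_effects.append("email sent")
    return "ok"

async def broken_tool() -> str:
    await asyncio.sleep(0.01)
    raise RuntimeError("tool failed")

async def with_gather() -> None:
    # gather() raises as soon as broken_tool fails, but send_email is
    # NOT cancelled -- it keeps running and the email still fires.
    await asyncio.gather(send_email(), broken_tool())

async def with_taskgroup() -> None:
    # The TaskGroup cancels the surviving sibling on failure, so the
    # email never goes out when the pipeline fails.
    async with asyncio.TaskGroup() as tg:
        tg.create_task(send_email())
        tg.create_task(broken_tool())
```

The leaked connections and double-fired emails in the scenario above are exactly what the orphaned send_email task looks like at scale.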

JSON Mode Won't Save You: Structured Output Failures in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

When developers first wire up JSON mode, the problem feels solved. The LLM stops returning markdown fences, prose apologies, and curly-brace-adjacent gibberish. The output parses. The tests pass. Production ships.

Then, three weeks later, a background job silently fails because the model returned {"status": "complete"} when the schema expected {"status": "completed"}. A data pipeline crashes because a required field came back as null instead of being omitted. An agent tool-call loop terminates early because the model embedded a stray newline inside a string value and the downstream parser choked on it.

JSON mode guarantees syntactically valid JSON. It does not guarantee that the JSON means what you think it means, contains the fields your application expects, or maintains semantic consistency across requests. These are different problems, and they require different solutions.
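One of those different solutions is a semantic validation layer after the parse. A minimal stdlib-only sketch; the field names and allowed values are illustrative, not a real schema:

```python
import json

# The values the downstream job actually understands -- "complete"
# (no "d") is valid JSON but still a bug.
ALLOWED_STATUS = {"pending", "in_progress", "completed"}

def validate_task(payload: str) -> dict:
    data = json.loads(payload)  # JSON mode guarantees this succeeds
    status = data.get("status")
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    if data.get("assignee") is None:  # required field must not be null
        raise ValueError("missing or null 'assignee'")
    return data
```

The point is where the failure surfaces: a loud ValueError at the boundary instead of a background job silently doing nothing three weeks later.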

Beyond JSON Mode: Getting Reliable Structured Outputs from LLMs in Production

· 9 min read
Tian Pan
Software Engineer

You deploy a pipeline that extracts customer intent from support tickets. You've tested it extensively. It works great. Three days after launch, an alert fires: the downstream service is crashing on KeyError: 'category'. The model started returning ticket_category instead of category — no prompt change, just a model update your provider rolled out silently.

This is the structured output problem. And JSON mode doesn't solve it.
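One mitigation is to normalize at the boundary: tolerate known aliases and fail loudly on anything else, so a silent model update surfaces as one clear error instead of a KeyError deep in a downstream service. A hedged sketch with illustrative field names:

```python
import json

# Field-name drift we have seen before and are willing to absorb.
ALIASES = {"ticket_category": "category"}
REQUIRED = {"category", "intent"}

def normalize(payload: str) -> dict:
    """Parse model output, map known aliases, and enforce required fields."""
    data = {ALIASES.get(k, k): v for k, v in json.loads(payload).items()}
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"schema drift, missing fields: {sorted(missing)}")
    return data
```

The alias table is a deliberate trade-off: it keeps the pipeline running through drift you have already diagnosed, while unknown drift still fails fast at one choke point.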

What Your APM Dashboard Won't Tell You: LLM Observability in Production

· 10 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows 99.4% uptime, sub-500ms P95 latency, and a 0.1% error rate. Everything is green. Meanwhile, your support queue is filling with users complaining the AI gave them completely wrong answers. You have no idea why, because every request returned HTTP 200.

This is the fundamental difference between traditional observability and what you actually need for LLM systems. A language model can fail in ways that leave no trace in standard APM tooling: hallucinating facts, retrieving documents from the wrong product version, ignoring the system prompt after a code change modified it, or silently degrading on a specific query type after a model update. All of these look fine on your latency graph.
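Closing that gap starts with recording LLM-specific fields alongside status and latency. A minimal sketch of such a trace record, stdlib only; the field set is illustrative rather than any particular vendor's schema:

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass
class LLMTrace:
    model: str
    system_prompt_hash: str        # detects silent prompt changes
    retrieved_doc_ids: list[str]   # detects wrong-version retrieval
    query_type: str                # enables per-segment quality tracking
    latency_ms: float
    http_status: int = 200         # "green" even when the answer is wrong

def trace(model: str, system_prompt: str, doc_ids: list[str],
          query_type: str, latency_ms: float) -> LLMTrace:
    # Hash the prompt rather than logging it verbatim: cheap to compare
    # across deploys, and it avoids leaking prompt contents into logs.
    digest = hashlib.sha256(system_prompt.encode()).hexdigest()[:12]
    return LLMTrace(model, digest, doc_ids, query_type, latency_ms)
```

With the prompt hash and retrieved document IDs on every request, "the system prompt changed after a code change" and "retrieval pulled the wrong product version" become queries against your logs instead of mysteries.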

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

· 9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.
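The arithmetic behind those savings is easy to sketch. A minimal illustration, assuming the $5/$1 per-million-input-token prices above and a stub complexity score standing in for a trained router (RouteLLM-style systems learn this score from preference data):

```python
def route(query: str, complexity_score: float, threshold: float = 0.7) -> str:
    """Send only queries above the threshold to the frontier model.

    A real router scores the query itself; here the score is supplied."""
    return "frontier-model" if complexity_score >= threshold else "efficient-model"

def blended_cost(frontier_frac: float,
                 frontier_cost: float = 5.0,
                 efficient_cost: float = 1.0) -> float:
    """Blended $/Mtok input cost given the fraction routed to frontier."""
    return frontier_frac * frontier_cost + (1 - frontier_frac) * efficient_cost
```

Routing 85% of traffic to the cheaper model yields a blended input cost of $1.60/Mtok against $5.00 all-frontier, a 68% reduction, squarely inside the 45–85% range the research reports.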

Fine-Tuning Is Usually the Wrong Move: A Decision Framework for LLM Customization

· 9 min read
Tian Pan
Software Engineer

Most engineering teams building LLM products follow the same progression: prompt a base model, hit a performance ceiling, and immediately reach for fine-tuning as the solution. This instinct is wrong more often than it's right.

Fine-tuning is a powerful tool. It can unlock real performance gains, cut inference costs at scale, and give you precise control over model behavior. But it carries hidden costs — in data, time, infrastructure, and ongoing maintenance — that teams systematically underestimate. And in many cases, prompt engineering or retrieval augmentation would have gotten them there faster and cheaper.

This post gives you a concrete framework for when each approach wins, grounded in recent benchmarks and production experience.

In Defense of AI Evals, for Everyone

· 7 min read
Tian Pan
Software Engineer

Every few months, a new wave of "don't bother with evals" takes hold in the AI engineering community. The argument usually goes: evals are too expensive, too brittle, too hard to define, and ultimately not worth the overhead for a fast-moving product team. Ship, iterate, and trust your instincts.

This is bad advice that produces bad software. A 2026 LangChain survey found that only 52% of organizations run offline evaluations and just 37% run online evals against live traffic — yet 32% cite quality as their number one barrier to production deployment. That is not a coincidence.

Data Flywheels for LLM Applications: Closing the Loop Between Production and Improvement

· 9 min read
Tian Pan
Software Engineer

Most LLM applications launch, observe some failures, patch the prompt, and repeat. That's not a flywheel — it's a treadmill. A real data flywheel is a self-reinforcing loop: production generates feedback, feedback improves the system, the improved system generates better interactions, which generate better feedback. Each revolution compounds the last.

The difference matters because foundation models have erased the traditional moat. Everyone calls the same GPT-4o or Claude endpoint. The new moat is proprietary feedback data from real users doing real tasks — data that's expensive, slow, and impossible to replicate from the outside.

Reasoning Models in Production: When to Use Them and When Not To

· 8 min read
Tian Pan
Software Engineer

Most teams that adopt reasoning models make the same mistake: they start using them everywhere. A new model drops with impressive benchmark numbers, and within a week it's handling customer support, document summarization, and the two genuinely hard problems it was actually built for. Then the infrastructure bill arrives.

Reasoning models — o3, Claude with extended thinking, DeepSeek R1, and their successors — are legitimately different from standard LLMs. They perform an internal chain-of-thought before producing output, spending more compute cycles to search through the problem space. That extra work produces real gains on tasks that require multi-step logic. It also costs 5–10× more per request and adds 10–60 seconds of latency. Neither of those is acceptable as a default.
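The fix is a gate, not a default. A minimal sketch where the task types and cost figures are illustrative, using the midpoint of the 5–10x premium above:

```python
# Task types worth the reasoning-model premium; everything else gets
# the standard model by default.
REASONING_TASKS = {"math_proof", "multi_step_planning", "hard_debugging"}

def pick_model(task_type: str) -> str:
    return "reasoning-model" if task_type in REASONING_TASKS else "standard-model"

def est_cost(task_type: str, base_cost: float = 0.01,
             reasoning_multiplier: float = 7.5) -> float:
    """Rough per-request cost, applying the premium only when gated in."""
    mult = reasoning_multiplier if pick_model(task_type) == "reasoning-model" else 1.0
    return base_cost * mult
```

The gate makes the trade explicit: the two genuinely hard problems pay the premium, and customer support and summarization never see the bill.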

Structured Outputs in Production: Engineering Reliable JSON from LLMs

· 10 min read
Tian Pan
Software Engineer

LLMs are text generators. Your application needs data structures. The gap between those two facts is where production bugs live.

Every team building with LLMs hits this wall. The model works great in the playground — returns something that looks like JSON, mostly has the right fields, usually passes JSON.parse. Then you ship it, and your parsing layer starts throwing exceptions at 2am. The response had a trailing comma. Or a markdown code fence. Or the model decided to add an explanatory paragraph before the JSON. Or it hallucinated a field name.

The industry has spent three years converging on solutions to this problem. This is what that convergence looks like, and what still trips teams up.
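One piece of that convergence is a defensive parse for exactly the failure modes above: markdown fences, leading prose, trailing commas. A last-resort repair sketch, not a substitute for constrained decoding:

```python
import json
import re

def parse_llm_json(text: str) -> dict:
    # Strip markdown code fences if present.
    text = re.sub(r"```(?:json)?", "", text)
    # Keep only the outermost {...} span, dropping surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    candidate = text[start : end + 1]
    # Remove trailing commas before } or ].
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)
```

Hallucinated field names are deliberately out of scope here: a repair layer can fix syntax, but schema drift needs validation downstream.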

Prompt Injection in Production: The Attack Patterns That Actually Work and How to Stop Them

· 11 min read
Tian Pan
Software Engineer

Prompt injection is the number one vulnerability in the OWASP Top 10 for LLM applications — and the gap between how engineers think it works and how attackers actually exploit it keeps getting wider. A 2024 study tested 36 production LLM-integrated applications and found 31 susceptible. A 2025 red-team exercise found that 100% of published prompt defenses could be bypassed by human attackers given enough attempts.

The hard truth: the naive defenses most teams reach for first — system prompt warnings, keyword filters, output sanitization alone — fail against any attacker who tries more than one approach. What works is architectural: separating privilege, isolating untrusted data, and constraining what an LLM can actually do based on what it has seen.

This post is a field guide for engineers building production systems. No CTF-style toy examples — just the attack patterns causing real incidents and the defense patterns that measurably reduce risk.
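The "constrain what an LLM can do based on what it has seen" idea reduces to taint tracking. A minimal sketch; the tool names are illustrative:

```python
# Tools safe to expose even after untrusted content enters the context.
UNTRUSTED_SAFE_TOOLS = {"search_docs", "summarize"}
ALL_TOOLS = UNTRUSTED_SAFE_TOOLS | {"send_email", "delete_record"}

class Session:
    """Tracks whether untrusted data has entered the model's context."""

    def __init__(self) -> None:
        self.tainted = False

    def ingest(self, content: str, trusted: bool) -> None:
        # Any untrusted ingestion permanently taints the session.
        if not trusted:
            self.tainted = True

    def allowed_tools(self) -> set[str]:
        # Once tainted, high-privilege tools are off the table
        # regardless of what the model asks for.
        return UNTRUSTED_SAFE_TOOLS if self.tainted else ALL_TOOLS
```

The enforcement lives outside the model: no system-prompt warning is involved, so no clever injection phrasing can talk the session back into the privileged tool set.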