Skip to main content

19 posts tagged with "ai-engineering"

View all tags

In Defense of AI Evals, for Everyone

· 7 min read
Tian Pan
Software Engineer

Every few months, a new wave of "don't bother with evals" takes hold in the AI engineering community. The argument usually goes: evals are too expensive, too brittle, too hard to define, and ultimately not worth the overhead for a fast-moving product team. Ship, iterate, and trust your instincts.

This is bad advice that produces bad software. A 2026 LangChain survey found that only 52% of organizations run offline evaluations and just 37% run online evals against live traffic — yet 32% cite quality as their number one barrier to production deployment. That is not a coincidence.

Data Flywheels for LLM Applications: Closing the Loop Between Production and Improvement

· 9 min read
Tian Pan
Software Engineer

Most LLM applications launch, observe some failures, patch the prompt, and repeat. That's not a flywheel — it's a treadmill. A real data flywheel is a self-reinforcing loop: production generates feedback, feedback improves the system, the improved system generates better interactions, which generate better feedback. Each revolution compounds the last.

The difference matters because foundation models have erased the traditional moat. Everyone calls the same GPT-4o or Claude endpoint. The new moat is proprietary feedback data from real users doing real tasks — data that's expensive, slow, and impossible to replicate from the outside.

Reasoning Models in Production: When to Use Them and When Not To

· 8 min read
Tian Pan
Software Engineer

Most teams that adopt reasoning models make the same mistake: they start using them everywhere. A new model drops with impressive benchmark numbers, and within a week it's handling customer support, document summarization, and the two genuinely hard problems it was actually built for. Then the infrastructure bill arrives.

Reasoning models — o3, Claude with extended thinking, DeepSeek R1, and their successors — are legitimately different from standard LLMs. They perform an internal chain-of-thought before producing output, spending more compute cycles to search through the problem space. That extra work produces real gains on tasks that require multi-step logic. It also costs 5–10× more per request and adds 10–60 seconds of latency. Neither of those is acceptable as a default.

Structured Outputs in Production: Engineering Reliable JSON from LLMs

· 10 min read
Tian Pan
Software Engineer

LLMs are text generators. Your application needs data structures. The gap between those two facts is where production bugs live.

Every team building with LLMs hits this wall. The model works great in the playground — returns something that looks like JSON, mostly has the right fields, usually passes a JSON.parse. Then you ship it, and your parsing layer starts throwing exceptions at 2am. The response had a trailing comma. Or a markdown code fence. Or the model decided to add an explanatory paragraph before the JSON. Or it hallucinated a field name.

The industry has spent three years converging on solutions to this problem. This is what that convergence looks like, and what still trips teams up.

Prompt Injection in Production: The Attack Patterns That Actually Work and How to Stop Them

· 11 min read
Tian Pan
Software Engineer

Prompt injection is the number one vulnerability in the OWASP Top 10 for LLM applications — and the gap between how engineers think it works and how attackers actually exploit it keeps getting wider. A 2024 study tested 36 production LLM-integrated applications and found 31 susceptible. A 2025 red-team found that 100% of published prompt defenses could be bypassed by human attackers given enough attempts.

The hard truth: the naive defenses most teams reach for first — system prompt warnings, keyword filters, output sanitization alone — fail against any attacker who tries more than one approach. What works is architectural: separating privilege, isolating untrusted data, and constraining what an LLM can actually do based on what it has seen.

This post is a field guide for engineers building production systems. No CTF-style toy examples — just the attack patterns causing real incidents and the defense patterns that measurably reduce risk.

LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries

· 11 min read
Tian Pan
Software Engineer

Most teams reach the same inflection point: LLM API costs are scaling faster than usage, and every query — whether "summarize this sentence" or "audit this 2,000-line codebase for security vulnerabilities" — hits the same expensive model. The fix isn't squeezing prompts. It's routing.

LLM routing means directing each request to the most appropriate model for that specific task. Not the most capable model. The right model — balancing cost, latency, and quality for what the query actually demands. Done well, routing cuts LLM costs by 50–85% with minimal quality degradation. Done poorly, it creates silent quality regressions you won't detect until users churn.

This post covers the mechanics, the tradeoffs, and what actually breaks in production.

Token Budget Strategies for Production LLM Applications

· 10 min read
Tian Pan
Software Engineer

Most teams discover their context management problem the same way: a production agent that worked fine in demos starts hallucinating after 15 conversation turns. The logs show valid JSON, the model returned 200, and nobody changed the code. What changed was the accumulation — tool results, retrieved documents, and conversation history quietly filled the context window until the model was reasoning over 80,000 tokens of mixed-relevance content.

Context overflow is the obvious failure mode, but "context rot" is the insidious one. Research shows that LLM performance degrades before you hit the limit. As context grows, models exhibit a lost-in-the-middle effect: attention concentrates at the beginning and end of the input while content in the middle becomes unreliable. Instructions buried at turn 12 of a 30-turn conversation may effectively disappear. The model doesn't error out — it just quietly ignores them.

LLM Guardrails in Production: What Actually Works

· 8 min read
Tian Pan
Software Engineer

Most teams ship their first LLM feature, get burned by a bad output in production, and then bolt on a guardrail as damage control. The result is a brittle system that blocks legitimate requests, slows down responses, and still fails on the edge cases that matter. Guardrails are worth getting right — but the naive approach will hurt you in ways you don't expect.

Here's what the tradeoffs actually look like, and how to build a guardrail layer that doesn't quietly destroy your product.

Prompt Engineering in Production: What Actually Matters

· 8 min read
Tian Pan
Software Engineer

Most engineers learn prompt engineering backwards. They start with "be creative" and "think step by step," iterate on a demo until it works, then discover in production that the model is hallucinating 15% of the time and their JSON parser is throwing exceptions every few hours. The techniques that make a chatbot feel impressive are often not the ones that make a production system reliable.

After a year of shipping LLM features into real systems, here's what actually separates prompts that work from prompts that hold up under load.

Tool Use in Production: Function Calling Patterns That Actually Work

· 9 min read
Tian Pan
Software Engineer

The most surprising thing about LLM function calling failures in production is where they come from. Not hallucinated reasoning. Not the model picking the wrong tool. The number one cause of agent flakiness is argument construction: wrong types, missing required fields, malformed JSON, hallucinated extra fields. The model is fine. Your schema is the problem.

This is good news, because schemas are cheap to fix.

Your AI Product Needs Evals

· 8 min read
Tian Pan
Software Engineer

Every AI product demo looks great. The model generates something plausible, the stakeholders nod along, and everyone leaves the meeting feeling optimistic. Then the product ships, real users appear, and things start going sideways in ways nobody anticipated. The team scrambles to fix one failure mode, inadvertently creates another, and after weeks of whack-a-mole, the prompt has grown into a 2,000-token monster that nobody fully understands anymore.

The root cause is almost always the same: no evaluation system. Teams that ship reliable AI products build evals early and treat them as infrastructure, not an afterthought. Teams that stall treat evaluation as something to worry about "once the product is more mature." By then, they're already stuck.

The Unglamorous Work Behind Rapidly Improving AI Products

· 9 min read
Tian Pan
Software Engineer

Most AI teams hit the same wall six weeks after launch. Initial demos were impressive, the prototype shipped on time, and early users said nice things. Then the gap between "good enough to show" and "good enough to keep" becomes unavoidable. The team scrambles — tweaking prompts, swapping models, adding guardrails — and the product barely moves.

The teams that actually improve quickly share one counterintuitive habit: they spend less time on architecture and more time staring at data. Not dashboards. Not aggregate metrics. The raw, ugly, individual failures that live inside conversation logs.

This is a field guide to the practices that separate fast-moving AI teams from ones that stay stuck.