Skip to main content

207 posts tagged with "llm"

View all tags

Structured Output in Production: Getting LLMs to Return Reliable JSON

· 8 min read
Tian Pan
Software Engineer

At some point in production, every LLM-powered application needs to stop treating model output as prose and start treating it as data. The moment you try to reliably extract a JSON object from a language model — and feed it downstream into a database, API call, or UI — you discover just how many ways this can go wrong. The model wraps JSON in markdown fences. It generates a valid object but omits required fields. It formats dates inconsistently across calls. It hallucinates enum values. Any one of these failures silently corrupts downstream state.

Structured output has evolved from an afterthought into a first-class concern for production LLM systems. This post covers the three main mechanisms for enforcing it, where each breaks down, and how to design schemas that keep quality high under constraint.

Prompt Engineering Deep Dive: From Basics to Advanced Techniques

· 10 min read
Tian Pan
Software Engineer

Most engineers treat prompts as magic words — tweak a phrase, hope it works, move on. That works fine for demos. In production, it produces a system where nobody knows why the model behaves differently on Tuesday than on Monday, and where a routine model update silently breaks three features. Prompt engineering done right is a discipline, not a ritual. This post covers the full stack: when to use each technique, what the benchmarks actually show, and where the traps are.

What AI Benchmarks Actually Measure (And Why You Shouldn't Trust the Leaderboard)

· 10 min read
Tian Pan
Software Engineer

When GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B all score 88–93% on MMLU, what does that number actually tell you about which model to deploy? The uncomfortable answer: almost nothing. The benchmark that once separated capable models from mediocre ones has saturated. Every frontier model aces it, yet they behave very differently in production. The gap between benchmark performance and real-world utility has never been wider, and understanding why is now essential for any engineer building on top of LLMs.

Benchmarks feel rigorous because they produce numbers. A number looks like measurement, and measurement looks like truth. But the legitimacy of a benchmark score depends entirely on the validity of what it's measuring—and that validity breaks down in ways that are rarely surfaced on leaderboards.

Tool Calling in Production: The Loop, the Pitfalls, and What Actually Works

· 9 min read
Tian Pan
Software Engineer

The first time your agent silently retries the same broken tool call three times before giving up, you realize that "just add tools" is not a production strategy. Tool calling unlocks genuine capabilities — external data, side effects, guaranteed-shape outputs — but the agentic loop that makes it work has sharp edges that don't show up in demos.

This post is about those edges: how the loop actually runs, the formatting rules that quietly destroy parallel execution, how to write tool descriptions that make the model choose correctly, and how to handle errors in a way that lets the model recover instead of spiral.

LLM Latency in Production: What Actually Moves the Needle

· 10 min read
Tian Pan
Software Engineer

Most LLM latency advice falls into one of two failure modes: it focuses on the wrong metric, or it recommends optimizations that are too hardware-specific to apply unless you're running your own inference cluster. If you're building on top of a hosted API or a managed inference provider, a lot of that advice is noise.

This post focuses on what actually moves the needle — techniques that apply whether you control the stack or not, grounded in production data rather than benchmark lab conditions.

LLM Guardrails in Production: Why One Layer Is Never Enough

· 10 min read
Tian Pan
Software Engineer

Here is a math problem that catches teams off guard: if you stack five guardrails and each one operates at 90% accuracy, your overall system correctness is not 90%—it is 59%. Stack ten guards at the same accuracy and you get under 35%. The compound error problem means that "adding more guardrails" can make a system less reliable than adding fewer, better-calibrated ones. Most teams discover this only after they've wired up a sprawling moderation pipeline and started watching their false-positive rate climb past anything users will tolerate.

Guardrails are not optional for production LLM applications. Hallucinations appear in roughly 31% of real-world LLM responses under normal conditions, and that figure climbs to 60–88% in regulated domains like law and medicine. Jailbreak attacks against modern models succeed at rates ranging from 57% to near-100% depending on the technique. But treating guardrails as a bolt-on compliance checkbox—rather than a carefully designed subsystem—is how teams end up with systems that block legitimate requests constantly while still missing adversarial ones.

Context Engineering: Why What You Feed the LLM Matters More Than How You Ask

· 11 min read
Tian Pan
Software Engineer

Most LLM quality problems aren't prompt problems. They're context problems.

You spend hours crafting the perfect system prompt. You add XML tags, chain-of-thought instructions, and careful persona definitions. You test it on a handful of inputs and it looks great. Then you ship it, and two weeks later you're staring at a ticket where the agent confidently told a user the wrong account balance — because it retrieved the previous user's transaction history. The model understood the instructions perfectly. It just had the wrong inputs.

This is the core distinction between prompt engineering and context engineering. Prompt engineering asks: "How should I phrase this?" Context engineering asks: "What does the model need to know right now, and how do I make sure it gets exactly that?" One is copywriting. The other is systems architecture.

LLM Guardrails in Production: What Actually Works

· 8 min read
Tian Pan
Software Engineer

Most teams ship their first LLM feature, get burned by a bad output in production, and then bolt on a guardrail as damage control. The result is a brittle system that blocks legitimate requests, slows down responses, and still fails on the edge cases that matter. Guardrails are worth getting right — but the naive approach will hurt you in ways you don't expect.

Here's what the tradeoffs actually look like, and how to build a guardrail layer that doesn't quietly destroy your product.

Fine-Tuning vs. Prompting: A Decision Framework for Production LLMs

· 8 min read
Tian Pan
Software Engineer

Most teams reach for fine-tuning too early or too late. The ones who fine-tune too early burn weeks on a training pipeline before realizing a better system prompt would have solved the problem. The ones who wait too long run expensive 70B inferences on millions of repetitive tasks while accepting accuracy that a fine-tuned 7B model could have beaten—at a tenth of the cost.

The decision is not about which technique is "better." It's about matching the right tool to your specific constraints: data volume, latency budget, accuracy requirements, and how stable the task definition is. Here's how to think through it.

Prompt Engineering in Production: What Actually Matters

· 8 min read
Tian Pan
Software Engineer

Most engineers learn prompt engineering backwards. They start with "be creative" and "think step by step," iterate on a demo until it works, then discover in production that the model is hallucinating 15% of the time and their JSON parser is throwing exceptions every few hours. The techniques that make a chatbot feel impressive are often not the ones that make a production system reliable.

After a year of shipping LLM features into real systems, here's what actually separates prompts that work from prompts that hold up under load.

Why Multi-Agent LLM Systems Fail (and How to Build Ones That Don't)

· 8 min read
Tian Pan
Software Engineer

Most multi-agent LLM systems deployed in production fail within weeks — not from infrastructure outages or model regressions, but from coordination problems that were baked in from the start. A comprehensive analysis of 1,642 execution traces across seven open-source frameworks found failure rates ranging from 41% to 86.7% on standard benchmarks. That's not a model quality problem. That's a systems engineering problem.

The uncomfortable finding: roughly 79% of those failures trace back to specification and coordination issues, not compute limits or model capability. You can swap in a better model and still watch your multi-agent pipeline collapse in the exact same way. Understanding why requires looking at the failure taxonomy carefully.

Tool Use in Production: Function Calling Patterns That Actually Work

· 9 min read
Tian Pan
Software Engineer

The most surprising thing about LLM function calling failures in production is where they come from. Not hallucinated reasoning. Not the model picking the wrong tool. The number one cause of agent flakiness is argument construction: wrong types, missing required fields, malformed JSON, hallucinated extra fields. The model is fine. Your schema is the problem.

This is good news, because schemas are cheap to fix.