Skip to main content

639 posts tagged with "llm"

View all tags

How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away

· 11 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.

The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.

The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.

The Intent Gap: When Your LLM Answers the Wrong Question Perfectly

· 9 min read
Tian Pan
Software Engineer

Intent misalignment is the single largest failure category in production LLM systems — responsible for 32% of all dissatisfactory responses, according to a large-scale analysis of real user interactions. It's not hallucination, not refusal, not format errors. It's models answering a question correctly while missing entirely what the user actually needed.

This is the intent gap: the distance between what a user says and what they mean. It's invisible to most eval suites, invisible to error logs, and invisible to the users themselves until they've wasted enough cycles to realize the output was technically right but practically useless.

The LLM Request Lifecycle Is a State Machine — Treat It Like One

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM request handling as a linear function: call the API, check for an exception, maybe retry once, return the result. In practice it's nothing like that. Between the moment a user triggers an LLM call and the moment a response reaches their screen, a request can traverse a dozen implicit states — attempting primary provider, waiting for backoff, switching to fallback, validating output, retrying with refined prompt — without any of those transitions being recorded or visible.

The result is debugging that happens after the fact from logs scattered across services, with no authoritative answer to "what did this request actually do?" Treating the LLM request lifecycle as an explicit finite state machine is the architectural move that makes that question answerable without archaeological work.

The LLM Request Lifecycle Your try/catch Is Missing

· 10 min read
Tian Pan
Software Engineer

The most dangerous failure your LLM stack can produce returns HTTP 200. The JSON parses. Your schema validation passes. No exception is raised. And the response is completely wrong — wrong facts, wrong structure, truncated mid-sentence, or fabricated from whole cloth.

A single try/catch around an LLM API call handles the easy failures: rate limits, server errors, network timeouts. These are the visible failures. The invisible ones — a model that hit its token limit and stopped mid-answer, an agent that looped 21 extra tool calls before finding the right parameter name, a validation retry that inflated your costs by 37% — produce no exceptions. They produce results.

The fix is not better error handling. It is modeling the LLM request lifecycle as an explicit state machine, where every state transition emits an observable span, and failure modes are first-class states rather than buried exception handlers.

The Long-Horizon Evaluation Gap: Why Your Agent Passes Every Benchmark and Still Fails in Production

· 11 min read
Tian Pan
Software Engineer

A model that scores 75% on SWE-Bench Verified falls below 25% on tasks that take a human engineer hours to complete. The same agent that reliably handles single-turn question answering can spiral into incoherent loops, hallucinate tool outputs, and forget its original goal when asked to coordinate a dozen steps toward an open-ended objective. The gap between benchmark number and production behavior isn't noise—it's structural, and understanding it is the difference between shipping something useful and shipping something that looks good in the demo.

This post is about that gap: why it exists, what specific failure modes emerge in long-horizon tasks that never appear in static evals, and what it takes to build an evaluation harness that actually catches them.

Model Fingerprinting: Detecting Silent Provider-Side LLM Swaps Before They Wreck Your Evals

· 10 min read
Tian Pan
Software Engineer

In April 2025, OpenAI pushed an update to GPT-4o without any API changelog entry, developer notification, or public announcement. Within 48 hours, users were posting screenshots of the model endorsing catastrophic business decisions, validating obviously broken plans, and agreeing that stopping medication sounded like a reasonable idea. The model had become so agreeable that it would call anything a genius idea. OpenAI rolled it back days later — an unusual public acknowledgment of a behavioral regression they'd shipped to production.

The deeper problem wasn't the sycophancy itself. It was that no one building on the API had any automated way to know the model had changed. Their evals were still passing. Their monitoring dashboards showed HTTP 200s. Their p95 latency looked fine. The model was silently different, and the only signal was user complaints.

This is the problem model fingerprinting solves.

Multimodal LLMs in Production: The Cost Math Nobody Runs Upfront

· 11 min read
Tian Pan
Software Engineer

Most teams add multimodal capabilities to an existing LLM pipeline without running the cost math first. They prototype with a few test images, it works, they ship — and then the first billing cycle arrives. The number is somewhere between embarrassing and catastrophic, depending on volume.

The problem isn't that multimodal AI is expensive in principle. It's that each modality has a distinct token arithmetic that compounds in ways that text-only intuition doesn't prepare you for. A single configuration parameter — video frame rate, image resolution mode, whether you're re-sending a system prompt every turn — can silently multiply your inference bill by 10x or more before you've noticed anything is wrong.

The Non-Determinism Tax: Building Reliable Pipelines on Probabilistic Infrastructure

· 9 min read
Tian Pan
Software Engineer

Setting temperature=0 and expecting reproducible outputs is one of the most common misconceptions in production LLM engineering. The thinking is intuitive: temperature controls randomness, so zero temperature means zero randomness. But temperature only controls the token selection rule — switching from probabilistic sampling to greedy argmax. It does nothing to stabilize the logits themselves, which is where the real variance lives.

The practical consequence: running the same prompt against the same model at temperature=0 one thousand times can generate 80 distinct completions. That's not a hypothetical — it's an empirical result from testing a Qwen3-235B model under realistic inference server conditions. Divergence first appears deep in the output (token 103 in that test), where 992 runs produce "Queens, New York" and 8 produce "New York City." Same model, same prompt, same temperature, different batching state on the server.

Why the Chunking Problem Isn't Solved: How Naive RAG Pipelines Hallucinate on Long Documents

· 9 min read
Tian Pan
Software Engineer

Most RAG tutorials treat chunking as a footnote: split your documents into 512-token chunks, embed them, store them in a vector database, and move on to the interesting parts. This works well enough on toy examples — Wikipedia articles, clean markdown docs, short PDFs. It falls apart in production.

A recent study deploying RAG for clinical decision support found that the fixed-size baseline achieved 13% fully accurate responses across 30 clinical questions. An adaptive chunking approach on the same corpus: 50% fully accurate (p=0.001). The documents were the same. The LLM was the same. Only the chunking changed. That gap is not a tuning problem or a prompt engineering problem. It is a structural failure in how most teams split documents.

RAG's Dirty Secret: Your Retrieval Succeeds but Your Answers Are Still Wrong

· 9 min read
Tian Pan
Software Engineer

Most teams building RAG systems think they have two failure modes: retrieval fails to find the relevant document, or the LLM hallucinates despite having it. The first is measured obsessively — recall@K, MRR, NDCG. The second is treated as the model's problem. Neither framing is complete.

There's a third failure mode that sits between them: retrieval succeeds (the relevant document ranks in the top-K), but the retrieved context doesn't actually contain enough information to answer the question correctly. The model gets confident, generates a plausible answer, and gets it wrong. Research on frontier models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 shows this happens at rates above 50% on multi-step queries — and most production systems have no instrumentation to detect it.

The Reasoning Trace Privacy Problem: How Chain-of-Thought Leaks Sensitive Data in Production

· 9 min read
Tian Pan
Software Engineer

Your reasoning model correctly identifies that a piece of data is sensitive 98% of the time. Yet it leaks that same data in its chain-of-thought 33% of the time. That gap — between knowing something is private and actually keeping it private — is the core of the reasoning trace privacy problem, and most production teams haven't built for it.

Extended thinking has become a standard tool for accuracy-hungry applications: customer support triage, medical coding assistance, legal document review, financial analysis. These are also exactly the domains where the data in the prompt is most sensitive. Deploying reasoning models in these contexts without understanding how traces handle that data is a significant exposure.

The Reasoning Trace Privacy Problem: What Your CoT Logs Are Leaking

· 8 min read
Tian Pan
Software Engineer

Most teams building on reasoning models treat privacy as a two-surface problem: sanitize the prompt going in, sanitize the response coming out. The reasoning trace in between gets logged wholesale for observability, surfaced to downstream systems for debugging, and sometimes passed back to users who asked to "see the thinking." That middle layer is where the real exposure lives — and most production deployments are not treating it like the liability it is.

Research from early 2026 quantified what practitioners have been observing anecdotally: large reasoning models (LRMs) leak personally identifiable information in their intermediate reasoning steps more often than in their final answers. In one study testing five open-source models across medical and financial scenarios, the finding was unambiguous — intermediate reasoning reliably surfaces PII that the final response had successfully withheld. The final answer is sanitized; the trace is not.