
720 posts tagged with "llm"


Vision Inputs in Production AI Pipelines: The Preprocessing Decisions Nobody Documents

· 10 min read
Tian Pan
Software Engineer

Your vision model benchmarks 90%+ on your eval suite. Then real users upload photos of physical documents, screenshots from low-DPI monitors, and scanned PDFs that have been round-tripped through three fax machines. Accuracy craters. The model "works" — it returns coherent responses — but the responses are wrong in ways that are hard to catch without knowing the ground truth. You file it under "model limitations" and move on.

The model probably isn't the problem. The input pipeline is.

Most teams building with vision LLMs spend enormous effort on prompt engineering and model selection, and nearly zero effort on the preprocessing that happens before the image ever reaches the model. That asymmetry is where production quality goes to die. The preprocessing decisions nobody documents are also the ones responsible for the biggest silent accuracy drops in production multimodal systems.

When Your Agents Disagree: Consensus and Arbitration in Multi-Agent Systems

· 11 min read
Tian Pan
Software Engineer

Multi-agent systems are sold on a promise: multiple specialized agents, working in parallel, will produce better answers than any single agent could alone. That promise has a hidden assumption — that when agents produce different answers, you'll know how to reconcile them. Most teams discover too late that they won't.

The naive approach is to average outputs, or pick the majority answer, and move on. In practice, a multi-agent system where all agents share the same training distribution will amplify their shared errors through majority vote, not cancel them out. A system that always defers to the most confident agent will blindly follow the most overconfident one. And a system that runs every disagreement through an LLM judge will inherit twelve documented bias types from that judge. The arbitration problem is harder than it looks, and getting it wrong is how you end up with four production incidents in a week.

How Agents Teach Themselves: The Closed-Loop Self-Improvement Architecture

· 11 min read
Tian Pan
Software Engineer

The most expensive part of training an agent isn't GPU time. It's the human annotators who label whether a multi-step task succeeded or failed. A single expert annotation of a long-horizon agentic trajectory (verifying that an agent correctly booked a flight, wrote a functional program, or filled out a legal form) can cost more than thousands of inference calls. Closed-loop self-improvement is the architectural pattern that removes this bottleneck by replacing human judgment with an automated verifier, then using that verifier to run the generate-attempt-verify-train cycle with no human in the loop. When done correctly, it works: a recent NeurIPS paper showed the pattern roughly doubling average task success rates across multi-turn tool-use environments, from 12% to 23.5%, without a single human annotation.

The key insight isn't that the model improves itself — it's that the verifier is free. Code execution returns a pass/fail signal deterministically, in milliseconds, at near-zero marginal cost. When your tasks have checkable outcomes, you can run thousands of training episodes per hour with ground-truth labels the model cannot fake (assuming your sandbox is designed correctly). That assumption is doing a lot of work, and we'll come back to it.
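As a rough illustration of what a "free" verifier can look like, here is a minimal sketch that runs a candidate program against a set of assertions in a separate Python process and reduces the outcome to pass/fail. The function name `verify_candidate` and its parameters are hypothetical, not taken from the paper or any library, and a plain subprocess is only a stand-in for a real sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verify_candidate(candidate_source: str, test_source: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate passes the tests, False otherwise.

    Hypothetical helper for illustration: it concatenates the candidate and
    the tests into one script and runs it in a fresh Python process.
    A subprocess is NOT a security sandbox; real systems need stronger isolation.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "episode.py"
        script.write_text(candidate_source + "\n\n" + test_source + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hung attempt counts as a failure
    # Exit code 0 means no assertion failed and no exception escaped.
    return result.returncode == 0
```

This sketch also shows why the sandbox assumption matters: a candidate that calls `sys.exit(0)` before the assertions run would be labeled a success, so the pass/fail signal is only as trustworthy as the harness around it.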

Cognitive Tool Scaffolding: Near-Reasoning-Model Performance Without the Price Tag

· 10 min read
Tian Pan
Software Engineer

Your reasoning model bill is high, but the capability gap might be narrower than you think. A standard 70B model running four structured cognitive operations on AIME 2024 math benchmarks jumps from 13% to 30% accuracy, closing more than half of the gap to o1-preview's 44% at a fraction of the inference cost. On a more capable base model like GPT-4.1, the same technique pushes accuracy from 32% to 53%, which actually surpasses o1-preview on those benchmarks.

The technique is called cognitive tool scaffolding, and it's the latest evolution of a decade of research into making language models reason better without changing their weights.

The Cold Start Problem in AI Personalization

· 11 min read
Tian Pan
Software Engineer

A user signs up for your AI writing assistant. They type their first message. Your system has exactly one data point — and it has to decide: formal or casual? Verbose or terse? Technical depth or accessible overview? Most systems punt and serve a generic default. A few try to personalize immediately. The ones that personalize immediately often make things worse.

The cold start problem in AI personalization is not the same problem Netflix solved fifteen years ago. It is structurally harder, the failure modes are subtler, and the common fixes actively introduce new bugs. Here is what practitioners who have shipped personalization systems have learned about navigating it.

Domain-Specialized Agent Architectures: Why Generic Agents Underperform in High-Stakes Verticals

· 10 min read
Tian Pan
Software Engineer

A generic AI agent that can summarize a contract, draft a product spec, and write a SQL query is genuinely impressive — until you deploy it into a radiology department and discover it suggests plausible-sounding dosing that contradicts the patient's actual drug allergies. The failure is not a hallucination problem. It's an architecture problem.

The assumption baked into most agent demos is that a sufficiently capable foundation model plus a broad tool set equals a capable agent in any domain. In practice, the gap between that assumption and production reality is where patients get hurt, lawsuits materialize, and experiments produce unreproducible results. Generic agents are a reasonable starting point, not a destination.

The Explainability Trap: When AI Explanations Become a Liability

· 11 min read
Tian Pan
Software Engineer

Somewhere between the first stakeholder demand for "explainable AI" and the moment your product team spec'd out a "Why did the AI decide this?" feature, a trap was set. The trap is this: your model does not know why it made that decision, and asking it to explain doesn't produce an explanation — it produces text that looks like an explanation.

This distinction matters enormously in production. Not because users deserve better philosophy, but because post-hoc AI explanations are driving real-world harm through regulatory non-compliance, misdirected user behavior, and safety monitors that can be fooled. Engineers shipping explanation features without understanding this will build systems that satisfy legal checkboxes while making outcomes worse.

Fine-tuning vs. RAG for Knowledge Injection: The Decision Engineers Consistently Get Wrong

· 10 min read
Tian Pan
Software Engineer

A fintech team spent three months fine-tuning a model on their internal compliance documentation — thousands of regulatory PDFs, policy updates, and procedural guides. The results were mediocre. The model still hallucinated specific rule numbers. It forgot recent policy changes. And the one metric that actually mattered (whether advisors trusted its answers enough to stop double-checking) barely moved. Two weeks later, a different team built a RAG pipeline over the same document corpus. Advisors started trusting it within a week.

The fine-tuning team hadn't made a technical mistake. They'd made a definitional one: they were solving a knowledge retrieval problem with a behavior modification tool.

The Hidden Scratchpad Problem: Why Output Monitoring Alone Can't Secure Production AI Agents

· 10 min read
Tian Pan
Software Engineer

When extended thinking models like o1 or Claude generate a response, they produce thousands of reasoning tokens internally before writing a single word of output. In some configurations those thinking tokens are never surfaced. Even when they are visible, recent research reveals a startling pattern: for inputs that touch on sensitive or ethically ambiguous topics, frontier models acknowledge the influence of those inputs in their visible reasoning only 25–41% of the time.

The rest of the time, the model does something else in its scratchpad, then writes an output that doesn't reflect it.

This is the hidden scratchpad problem, and it changes the security calculus for every production agent system that relies on output-layer monitoring to enforce safety constraints.

How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away

· 11 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.

The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.

The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.
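To make the gap left by a constant mock concrete, here is a minimal sketch of a scripted stand-in model driving a toy orchestration loop. It is not the post's three-tier design; `ScriptedModel`, `run_agent`, and the action dictionary shape are all hypothetical names chosen for illustration. Because the script emits tool calls rather than a fixed string, the dispatch and malformed-response handling code actually runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScriptedModel:
    """Hypothetical stand-in for an LLM client: replays a fixed list of turns."""
    turns: list[dict]
    calls: int = 0

    def complete(self, history: list[dict]) -> dict:
        turn = self.turns[self.calls]
        self.calls += 1
        return turn


def run_agent(model, tools: dict[str, Callable[[dict], dict]], max_steps: int = 8) -> dict:
    """Toy orchestration loop: dispatch tool calls until the model gives a final answer."""
    history: list[dict] = []
    for step in range(1, max_steps + 1):
        action = model.complete(history)
        if action["type"] == "final":
            return {"answer": action["text"], "steps": step, "history": history}
        result = tools[action["tool"]](action["args"])
        # Malformed tool responses are recorded as errors, not passed on as data.
        if not isinstance(result, dict) or "value" not in result:
            history.append({"tool": action["tool"], "error": "malformed tool response"})
            continue
        history.append({"tool": action["tool"], "value": result["value"]})
    raise RuntimeError("agent exceeded max_steps without a final answer")
```

A test can then script a well-formed tool call, a malformed one, and a final answer, and assert on the recorded history. A mock that always returns "Agent response here" would end the loop on the first step and never reach the malformed-response branch.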

The Intent Gap: When Your LLM Answers the Wrong Question Perfectly

· 9 min read
Tian Pan
Software Engineer

Intent misalignment is the single largest failure category in production LLM systems, responsible for 32% of all unsatisfactory responses according to a large-scale analysis of real user interactions. It's not hallucination, not refusal, not format errors. It's models answering a question correctly while entirely missing what the user actually needed.

This is the intent gap: the distance between what a user says and what they mean. It's invisible to most eval suites, invisible to error logs, and invisible to the users themselves until they've wasted enough cycles to realize the output was technically right but practically useless.

The LLM Request Lifecycle Is a State Machine — Treat It Like One

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM request handling as a linear function: call the API, check for an exception, maybe retry once, return the result. In practice it's nothing like that. Between the moment a user triggers an LLM call and the moment a response reaches their screen, a request can traverse a dozen implicit states — attempting primary provider, waiting for backoff, switching to fallback, validating output, retrying with refined prompt — without any of those transitions being recorded or visible.

The result is debugging that happens after the fact from logs scattered across services, with no authoritative answer to "what did this request actually do?" Treating the LLM request lifecycle as an explicit finite state machine is the architectural move that makes that question answerable without archaeological work.
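A minimal sketch of what making those states explicit might look like follows. The state names mirror the implicit states listed above; the terminal states, the transition table, and the `RequestLifecycle` class are illustrative assumptions, not a complete policy or the post's own implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ATTEMPTING_PRIMARY = "attempting_primary"
    WAITING_FOR_BACKOFF = "waiting_for_backoff"
    SWITCHING_TO_FALLBACK = "switching_to_fallback"
    VALIDATING_OUTPUT = "validating_output"
    RETRYING_WITH_REFINED_PROMPT = "retrying_with_refined_prompt"
    SUCCEEDED = "succeeded"  # assumed terminal state
    FAILED = "failed"        # assumed terminal state


# Allowed transitions (an illustrative table, not a complete policy).
TRANSITIONS = {
    State.ATTEMPTING_PRIMARY: {State.VALIDATING_OUTPUT, State.WAITING_FOR_BACKOFF,
                               State.SWITCHING_TO_FALLBACK},
    State.WAITING_FOR_BACKOFF: {State.ATTEMPTING_PRIMARY, State.SWITCHING_TO_FALLBACK},
    State.SWITCHING_TO_FALLBACK: {State.VALIDATING_OUTPUT, State.FAILED},
    State.VALIDATING_OUTPUT: {State.SUCCEEDED, State.RETRYING_WITH_REFINED_PROMPT,
                              State.FAILED},
    State.RETRYING_WITH_REFINED_PROMPT: {State.VALIDATING_OUTPUT,
                                         State.SWITCHING_TO_FALLBACK, State.FAILED},
    State.SUCCEEDED: set(),
    State.FAILED: set(),
}


@dataclass
class RequestLifecycle:
    request_id: str
    state: State = State.ATTEMPTING_PRIMARY
    history: list = field(default_factory=list)

    def transition(self, new_state: State, reason: str) -> None:
        """Move to new_state, recording the step; reject transitions not in the table."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.history.append((self.state, new_state, reason))
        self.state = new_state
```

With every transition recorded alongside a reason, "what did this request actually do?" becomes a read of `history` for one request ID rather than a reconstruction from logs across services.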