119 posts tagged with "llm"

How Agents Teach Themselves: The Closed-Loop Self-Improvement Architecture

· 11 min read
Tian Pan
Software Engineer

The most expensive part of training an agent isn't GPU time. It's the human annotators who label whether a multi-step task succeeded or failed. A single expert annotation of a long-horizon agentic trajectory — verifying that an agent correctly booked a flight, wrote a functional program, or filled out a legal form — can cost more than thousands of inference calls. Closed-loop self-improvement is the architectural pattern that eliminates this bottleneck by replacing human judgment with an automated verifier, then using that verifier to run the generate-attempt-verify-train cycle without any human in the loop. When done correctly, it works: a recent NeurIPS paper showed the pattern doubled average task success rates across multi-turn tool-use environments, going from 12% to 23.5%, without a single human annotation.

The key insight isn't that the model improves itself — it's that the verifier is free. Code execution returns a pass/fail signal deterministically, in milliseconds, at near-zero marginal cost. When your tasks have checkable outcomes, you can run thousands of training episodes per hour with ground-truth labels the model cannot fake (assuming your sandbox is designed correctly). That assumption is doing a lot of work, and we'll come back to it.
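The generate-attempt-verify-train cycle can be sketched in a few lines. This is a minimal toy, not the paper's implementation: `propose_solution` stands in for model sampling, `verify` stands in for sandboxed code execution, and the collected episodes would feed the fine-tuning step that closes the loop.

```python
import random

random.seed(0)  # deterministic for illustration

def propose_solution(task):
    # Stand-in for model sampling: guesses an integer answer.
    return random.randint(0, 10)

def verify(task, answer):
    # Deterministic ground-truth check -- the "free" verifier.
    # In practice this is code execution in a sandbox, not a comparison.
    return answer == task["expected"]

def self_improvement_round(tasks, attempts_per_task=64):
    """One generate-attempt-verify cycle: keep only verified episodes."""
    training_set = []
    for task in tasks:
        for _ in range(attempts_per_task):
            answer = propose_solution(task)
            if verify(task, answer):              # ground-truth label, no annotator
                training_set.append((task, answer))
                break                             # one verified trajectory per task
    return training_set

tasks = [{"prompt": f"compute {i}+{i}", "expected": 2 * i} for i in range(5)]
data = self_improvement_round(tasks)
# `data` would then be fed to a fine-tuning step (the "train" leg of the loop).
```

The point of the sketch is the shape of the loop: every label in `training_set` came from the verifier, so the marginal cost of a labeled trajectory is the cost of running the attempt, not the cost of an expert annotation.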

Cognitive Tool Scaffolding: Near-Reasoning-Model Performance Without the Price Tag

· 10 min read
Tian Pan
Software Engineer

Your reasoning model bill is high, but the capability gap might be narrower than you think. A standard 70B model running four structured cognitive operations on AIME 2024 math problems jumps from 13% to 30% accuracy, closing much of the gap to o1-preview's 44% at a fraction of the inference cost. On a more capable base model like GPT-4.1, the same technique pushes accuracy from 32% to 53%, which actually surpasses o1-preview on those benchmarks.

The technique is called cognitive tool scaffolding, and it's the latest evolution of a decade of research into making language models reason better without changing their weights.
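A minimal sketch of what scaffolding looks like in practice. The operation names and prompt wording here are illustrative assumptions, not a fixed API, and `call_model` is a stub standing in for a real completion call:

```python
# Each "cognitive tool" is a structured prompt the model runs before
# committing to an answer; the four operation names below are assumptions
# for illustration, loosely following the cognitive-tools literature.
COGNITIVE_TOOLS = {
    "understand_question": "Restate the problem and list knowns/unknowns:\n{q}",
    "recall_related": "Recall a solved problem similar to:\n{q}",
    "examine_answer": "Check this candidate answer for errors:\n{a}",
    "backtracking": "The attempt failed; identify the wrong step:\n{a}",
}

def scaffolded_solve(question, call_model):
    """Minimal scaffold: run operations in a fixed order and feed each
    result back as context for the final answer. No weights change."""
    context = []
    for name in ("understand_question", "recall_related"):
        context.append(call_model(COGNITIVE_TOOLS[name].format(q=question)))
    answer = call_model(f"Using this analysis:\n{context}\nSolve: {question}")
    critique = call_model(COGNITIVE_TOOLS["examine_answer"].format(a=answer))
    if "OK" in critique:
        return answer
    return call_model(COGNITIVE_TOOLS["backtracking"].format(a=answer))

# Stubbed model for illustration: approves the first candidate answer.
fake = lambda prompt: "OK: 42" if "Check" in prompt else "42"
result = scaffolded_solve("What is 6*7?", fake)
```

The extra structure costs a handful of additional inference calls per question, which is the trade the post's numbers are pricing against a dedicated reasoning model.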

The Cold Start Problem in AI Personalization

· 11 min read
Tian Pan
Software Engineer

A user signs up for your AI writing assistant. They type their first message. Your system has exactly one data point — and it has to decide: formal or casual? Verbose or terse? Technical depth or accessible overview? Most systems punt and serve a generic default. A few try to personalize immediately. The ones that personalize immediately often make things worse.

The cold start problem in AI personalization is not the same problem Netflix solved fifteen years ago. It is structurally harder, the failure modes are subtler, and the common fixes actively introduce new bugs. Here is what practitioners who have shipped personalization systems have learned about navigating it.

Domain-Specialized Agent Architectures: Why Generic Agents Underperform in High-Stakes Verticals

· 10 min read
Tian Pan
Software Engineer

A generic AI agent that can summarize a contract, draft a product spec, and write a SQL query is genuinely impressive — until you deploy it into a radiology department and discover it suggests plausible-sounding dosing that contradicts the patient's actual drug allergies. The failure is not a hallucination problem. It's an architecture problem.

The assumption baked into most agent demos is that a sufficiently capable foundation model plus a broad tool set equals a capable agent in any domain. In practice, the gap between that assumption and production reality is where patients get hurt, lawsuits materialize, and experiments produce unreproducible results. Generic agents are a reasonable starting point, not a destination.

The Explainability Trap: When AI Explanations Become a Liability

· 11 min read
Tian Pan
Software Engineer

Somewhere between the first stakeholder demand for "explainable AI" and the moment your product team spec'd out a "Why did the AI decide this?" feature, a trap was set. The trap is this: your model does not know why it made that decision, and asking it to explain doesn't produce an explanation — it produces text that looks like an explanation.

This distinction matters enormously in production. Not because users deserve better philosophy, but because post-hoc AI explanations are driving real-world harm through regulatory non-compliance, misdirected user behavior, and safety monitors that can be fooled. Engineers shipping explanation features without understanding this will build systems that satisfy legal checkboxes while making outcomes worse.

Fine-tuning vs. RAG for Knowledge Injection: The Decision Engineers Consistently Get Wrong

· 10 min read
Tian Pan
Software Engineer

A fintech team spent three months fine-tuning a model on their internal compliance documentation — thousands of regulatory PDFs, policy updates, and procedural guides. The results were mediocre. The model still hallucinated specific rule numbers. It forgot recent policy changes. And the one metric that actually mattered (whether advisors trusted its answers enough to stop double-checking) barely moved. Two weeks later, a different team built a RAG pipeline over the same document corpus. Advisors started trusting it within a week.

The fine-tuning team hadn't made a technical mistake. They'd made a definitional one: they were solving a knowledge retrieval problem with a behavior modification tool.

The Hidden Scratchpad Problem: Why Output Monitoring Alone Can't Secure Production AI Agents

· 10 min read
Tian Pan
Software Engineer

When extended thinking models like o1 or Claude generate a response, they produce thousands of reasoning tokens internally before writing a single word of output. In some configurations those thinking tokens are never surfaced. Even when they are visible, recent research reveals a startling pattern: for inputs that touch on sensitive or ethically ambiguous topics, frontier models acknowledge the influence of those inputs in their visible reasoning only 25–41% of the time.

The rest of the time, the model does something else in its scratchpad—and then writes an output that doesn't reflect it.

This is the hidden scratchpad problem, and it changes the security calculus for every production agent system that relies on output-layer monitoring to enforce safety constraints.

How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away

· 11 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.

The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.

The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.
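One of those tiers can be illustrated with a scripted-replay stub: a fake model that returns canned turns, including a malformed one, so CI exercises the real orchestration code. This is a hypothetical sketch with invented names, not the post's harness:

```python
import json

class ScriptedLLM:
    """Replays canned model turns so CI exercises the real orchestration
    layer (dispatch, recovery, state accumulation) without live API calls."""
    def __init__(self, turns):
        self.turns = iter(turns)

    def complete(self, messages):
        return next(self.turns)

def run_agent(llm, tools, max_steps=8):
    state = []
    for _ in range(max_steps):
        turn = llm.complete(state)
        if turn.get("final"):
            return {"answer": turn["final"], "steps": state}
        name, raw_args = turn["tool"], turn["args"]
        try:
            args = json.loads(raw_args)     # malformed args are a real failure mode
        except json.JSONDecodeError:
            state.append({"error": f"bad args for {name}"})
            continue                        # recover instead of crashing
        state.append({"tool": name, "result": tools[name](**args)})
    raise RuntimeError("agent exceeded max_steps")

# The script includes one malformed tool call -- the class of bug a mock
# that always returns "Agent response here" never reaches.
turns = [
    {"tool": "add", "args": '{"a": 1, "b": 2}'},
    {"tool": "add", "args": '{"a": 1, '},        # truncated JSON
    {"tool": "add", "args": '{"a": 3, "b": 4}'},
    {"final": "7"},
]
out = run_agent(ScriptedLLM(turns), {"add": lambda a, b: a + b})
```

Because the turns are fixed, the test is fast and deterministic, yet the malformed-response branch, tool dispatch, and state accumulation all run for real.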

The Intent Gap: When Your LLM Answers the Wrong Question Perfectly

· 9 min read
Tian Pan
Software Engineer

Intent misalignment is the single largest failure category in production LLM systems — responsible for 32% of all dissatisfactory responses, according to a large-scale analysis of real user interactions. It's not hallucination, not refusal, not format errors. It's models answering a question correctly while missing entirely what the user actually needed.

This is the intent gap: the distance between what a user says and what they mean. It's invisible to most eval suites, invisible to error logs, and invisible to the users themselves until they've wasted enough cycles to realize the output was technically right but practically useless.

Knowledge Distillation Economics: When Compressing a Frontier Model Actually Pays Off

· 11 min read
Tian Pan
Software Engineer

Most teams burning money on GPT-4o try the same thing first: swap to a cheaper model. GPT-4o mini is 16.7× cheaper per token, Llama 3.1 8B is self-hostable for pennies. But quality drops in ways that break production — the classification task that scored 94% on the frontier model crashes to 71% on the smaller one, or the extraction pipeline starts hallucinating fields that simply don't exist in the source document. So teams either stay on the expensive model and keep paying, or they accept degraded quality.

Knowledge distillation offers a third path: train a small model specifically to replicate the behavior of a large one on your task, not on general language understanding. Done right, you get small-model speed and cost with near-frontier accuracy. Done wrong, you inherit the teacher's confident mistakes at 10× the production volume. Understanding which outcome you get — and when the economics actually work — is what this post covers.
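The economics side reduces to a break-even calculation: a one-time distillation cost recouped through a per-request price gap. The numbers below are illustrative assumptions, not quoted prices:

```python
def breakeven_requests(train_cost, teacher_cost_per_req, student_cost_per_req):
    """Requests needed before the one-time distillation cost is recouped
    by the per-request saving of serving the student instead of the teacher."""
    saving = teacher_cost_per_req - student_cost_per_req
    if saving <= 0:
        return float("inf")  # student not cheaper: distillation never pays off
    return train_cost / saving

# Illustrative figures (assumptions): $4,000 to generate teacher labels and
# fine-tune, $0.01/request on the frontier model, $0.0006/request on the
# distilled student.
n = breakeven_requests(4_000, 0.01, 0.0006)
```

Under those assumptions the break-even sits in the low hundreds of thousands of requests, which is why distillation only makes sense for high-volume, stable tasks and not for workloads that change faster than you can retrain.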

The LLM Request Lifecycle Is a State Machine — Treat It Like One

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM request handling as a linear function: call the API, check for an exception, maybe retry once, return the result. In practice it's nothing like that. Between the moment a user triggers an LLM call and the moment a response reaches their screen, a request can traverse a dozen implicit states — attempting primary provider, waiting for backoff, switching to fallback, validating output, retrying with refined prompt — without any of those transitions being recorded or visible.

The result is debugging that happens after the fact from logs scattered across services, with no authoritative answer to "what did this request actually do?" Treating the LLM request lifecycle as an explicit finite state machine is the architectural move that makes that question answerable without archaeological work.
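A minimal sketch of what "explicit" means here, assuming an invented set of states; real systems would emit a span per transition rather than append to a list:

```python
from enum import Enum, auto

class ReqState(Enum):
    PRIMARY = auto()      # attempting primary provider
    BACKOFF = auto()      # waiting before retry
    FALLBACK = auto()     # switched provider
    VALIDATING = auto()   # checking the response
    DONE = auto()
    FAILED = auto()

# Legal transitions are written down once, not implied by control flow.
ALLOWED = {
    ReqState.PRIMARY:    {ReqState.VALIDATING, ReqState.BACKOFF, ReqState.FALLBACK},
    ReqState.BACKOFF:    {ReqState.PRIMARY, ReqState.FALLBACK},
    ReqState.FALLBACK:   {ReqState.VALIDATING, ReqState.FAILED},
    ReqState.VALIDATING: {ReqState.DONE, ReqState.PRIMARY, ReqState.FAILED},
}

class RequestFSM:
    def __init__(self):
        self.state = ReqState.PRIMARY
        self.history = [ReqState.PRIMARY]   # the authoritative trace

    def transition(self, nxt):
        if nxt not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {nxt}")
        self.state = nxt
        self.history.append(nxt)            # every hop is recorded, not inferred

fsm = RequestFSM()
fsm.transition(ReqState.BACKOFF)     # primary timed out
fsm.transition(ReqState.FALLBACK)    # backoff exhausted, switch provider
fsm.transition(ReqState.VALIDATING)
fsm.transition(ReqState.DONE)
```

After the fact, `fsm.history` answers "what did this request actually do?" directly, and illegal paths fail loudly at the transition instead of surfacing later as a confusing log.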

The LLM Request Lifecycle Your try/catch Is Missing

· 10 min read
Tian Pan
Software Engineer

The most dangerous failure your LLM stack can produce returns HTTP 200. The JSON parses. Your schema validation passes. No exception is raised. And the response is completely wrong — wrong facts, wrong structure, truncated mid-sentence, or fabricated from whole cloth.

A single try/catch around an LLM API call handles the easy failures: rate limits, server errors, network timeouts. These are the visible failures. The invisible ones — a model that hit its token limit and stopped mid-answer, an agent that looped 21 extra tool calls before finding the right parameter name, a validation retry that inflated your costs by 37% — produce no exceptions. They produce results.

The fix is not better error handling. It is modeling the LLM request lifecycle as an explicit state machine, where every state transition emits an observable span, and failure modes are first-class states rather than buried exception handlers.