Skip to main content

639 posts tagged with "llm"

View all tags

Stateful vs. Stateless AI Features: The Architectural Decision That Shapes Everything Downstream

· 12 min read
Tian Pan
Software Engineer

When a shopping assistant recommends baby products to a user who mentioned a pregnancy two years ago, nobody threw an exception. The system worked exactly as designed. The LLM returned a confident response with HTTP 200. The bug was in the data — a stale memory that was never invalidated — and it was completely invisible until a customer complained. That's the ghost that lives in stateful AI systems, and it behaves nothing like the bugs you're used to debugging.

The decision between stateful and stateless AI features looks deceptively simple on the surface. In practice, it's one of the earliest architectural choices you'll make for an AI product, and it propagates consequences through your storage layer, your debugging toolchain, your security posture, and your operational costs. Most teams make this decision implicitly, by defaulting to one pattern without examining the tradeoffs. This post is about making it explicitly.

Synthetic Data Pipelines That Don't Collapse: Generating Training Data at Scale

· 8 min read
Tian Pan
Software Engineer

Train a model on its own output, then train the next model on that model's output, and within three generations you've built a progressively dumber machine. This is model collapse — a degenerative process where each successive generation of synthetic training data narrows the distribution until the model forgets the long tail of rare but important patterns. A landmark Nature study confirmed what practitioners had observed anecdotally: even tiny fractions of synthetic contamination (as low as 1 in 1,000 samples) trigger measurable degradation in lexical, syntactic, and semantic diversity.

Yet synthetic data isn't optional. Real-world labeled data is expensive, scarce in specialized domains, and increasingly exhausted at the scale frontier models demand. The teams shipping successful fine-tunes in 2025–2026 aren't avoiding synthetic data — they're engineering their pipelines to generate it without collapsing. The difference between a productive pipeline and a self-poisoning one comes down to diversity preservation, verification loops, and knowing when to stop.

The Instruction-Following Cliff: Why Adding One More Rule to Your System Prompt Breaks Three Others

· 7 min read
Tian Pan
Software Engineer

Your system prompt started at twelve lines. It worked beautifully. Then product wanted tone guidelines. Legal needed a disclaimer rule. The safety team added three more constraints. Now you're at forty rules and the model ignores half of them — but not the same half each time.

This is the instruction-following cliff: the point where adding one more rule to your prompt doesn't just degrade that rule's compliance — it destabilizes rules that were working fine yesterday. And unlike most engineering failures, this one is maddeningly non-deterministic.

Vision Inputs in Production AI Pipelines: The Preprocessing Decisions Nobody Documents

· 10 min read
Tian Pan
Software Engineer

Your vision model benchmarks 90%+ on your eval suite. Then real users upload photos of physical documents, screenshots from low-DPI monitors, and scanned PDFs that have been round-tripped through three fax machines. Accuracy craters. The model "works" — it returns coherent responses — but the responses are wrong in ways that are hard to catch without knowing the ground truth. You file it under "model limitations" and move on.

The model probably isn't the problem. The input pipeline is.

Most teams building with vision LLMs spend enormous effort on prompt engineering and model selection, and nearly zero effort on the preprocessing that happens before the image ever reaches the model. That asymmetry is where production quality goes to die. The preprocessing decisions nobody documents are also the ones responsible for the biggest silent accuracy drops in production multimodal systems.

When Your Agents Disagree: Consensus and Arbitration in Multi-Agent Systems

· 11 min read
Tian Pan
Software Engineer

Multi-agent systems are sold on a promise: multiple specialized agents, working in parallel, will produce better answers than any single agent could alone. That promise has a hidden assumption — that when agents produce different answers, you'll know how to reconcile them. Most teams discover too late that they won't.

The naive approach is to average outputs, or pick the majority answer, and move on. In practice, a multi-agent system where all agents share the same training distribution will amplify their shared errors through majority vote, not cancel them out. A system that always defers to the most confident agent will blindly follow the most overconfident one. And a system that runs every disagreement through an LLM judge will inherit twelve documented bias types from that judge. The arbitration problem is harder than it looks, and getting it wrong is how you end up with four production incidents in a week.

How Agents Teach Themselves: The Closed-Loop Self-Improvement Architecture

· 11 min read
Tian Pan
Software Engineer

The most expensive part of training an agent isn't GPU time. It's the human annotators who label whether a multi-step task succeeded or failed. A single expert annotation of a long-horizon agentic trajectory — verifying that an agent correctly booked a flight, wrote a functional program, or filled out a legal form — can cost more than thousands of inference calls. Closed-loop self-improvement is the architectural pattern that eliminates this bottleneck by replacing human judgment with an automated verifier, then using that verifier to run the generate-attempt-verify-train cycle without any human in the loop. When done correctly, it works: a recent NeurIPS paper showed the pattern doubled average task success rates across multi-turn tool-use environments, going from 12% to 23.5%, without a single human annotation.

The key insight isn't that the model improves itself — it's that the verifier is free. Code execution returns a pass/fail signal deterministically, in milliseconds, at near-zero marginal cost. When your tasks have checkable outcomes, you can run thousands of training episodes per hour with ground-truth labels the model cannot fake (assuming your sandbox is designed correctly). That assumption is doing a lot of work, and we'll come back to it.

Cognitive Tool Scaffolding: Near-Reasoning-Model Performance Without the Price Tag

· 10 min read
Tian Pan
Software Engineer

Your reasoning model bill is high, but the capability gap might be narrower than you think. A standard 70B model running four structured cognitive operations on AIME 2024 math benchmarks jumps from 13% to 30% accuracy — nearly matching o1-preview's 44%, at a fraction of the inference cost. On a more capable base model like GPT-4.1, the same technique pushes from 32% to 53%, which actually surpasses o1-preview on those benchmarks.

The technique is called cognitive tool scaffolding, and it's the latest evolution of a decade of research into making language models reason better without changing their weights.

The Cold Start Problem in AI Personalization

· 11 min read
Tian Pan
Software Engineer

A user signs up for your AI writing assistant. They type their first message. Your system has exactly one data point — and it has to decide: formal or casual? Verbose or terse? Technical depth or accessible overview? Most systems punt and serve a generic default. A few try to personalize immediately. The ones that personalize immediately often make things worse.

The cold start problem in AI personalization is not the same problem Netflix solved fifteen years ago. It is structurally harder, the failure modes are subtler, and the common fixes actively introduce new bugs. Here is what practitioners who have shipped personalization systems have learned about navigating it.

Domain-Specialized Agent Architectures: Why Generic Agents Underperform in High-Stakes Verticals

· 10 min read
Tian Pan
Software Engineer

A generic AI agent that can summarize a contract, draft a product spec, and write a SQL query is genuinely impressive — until you deploy it into a radiology department and discover it suggests plausible-sounding dosing that contradicts the patient's actual drug allergies. The failure is not a hallucination problem. It's an architecture problem.

The assumption baked into most agent demos is that a sufficiently capable foundation model plus a broad tool set equals a capable agent in any domain. In practice, the gap between that assumption and production reality is where patients get hurt, lawsuits materialize, and experiments produce unreproducible results. Generic agents are a reasonable starting point, not a destination.

The Explainability Trap: When AI Explanations Become a Liability

· 11 min read
Tian Pan
Software Engineer

Somewhere between the first stakeholder demand for "explainable AI" and the moment your product team spec'd out a "Why did the AI decide this?" feature, a trap was set. The trap is this: your model does not know why it made that decision, and asking it to explain doesn't produce an explanation — it produces text that looks like an explanation.

This distinction matters enormously in production. Not because users deserve better philosophy, but because post-hoc AI explanations are driving real-world harm through regulatory non-compliance, misdirected user behavior, and safety monitors that can be fooled. Engineers shipping explanation features without understanding this will build systems that satisfy legal checkboxes while making outcomes worse.

Fine-tuning vs. RAG for Knowledge Injection: The Decision Engineers Consistently Get Wrong

· 10 min read
Tian Pan
Software Engineer

A fintech team spent three months fine-tuning a model on their internal compliance documentation — thousands of regulatory PDFs, policy updates, and procedural guides. The results were mediocre. The model still hallucinated specific rule numbers. It forgot recent policy changes. And the one metric that actually mattered (whether advisors trusted its answers enough to stop double-checking) barely moved. Two weeks later, a different team built a RAG pipeline over the same document corpus. Advisors started trusting it within a week.

The fine-tuning team hadn't made a technical mistake. They'd made a definitional one: they were solving a knowledge retrieval problem with a behavior modification tool.

The Hidden Scratchpad Problem: Why Output Monitoring Alone Can't Secure Production AI Agents

· 10 min read
Tian Pan
Software Engineer

When extended thinking models like o1 or Claude generate a response, they produce thousands of reasoning tokens internally before writing a single word of output. In some configurations those thinking tokens are never surfaced. Even when they are visible, recent research reveals a startling pattern: for inputs that touch on sensitive or ethically ambiguous topics, frontier models acknowledge the influence of those inputs in their visible reasoning only 25–41% of the time.

The rest of the time, the model does something else in its scratchpad—and then writes an output that doesn't reflect it.

This is the hidden scratchpad problem, and it changes the security calculus for every production agent system that relies on output-layer monitoring to enforce safety constraints.