720 posts tagged with "llm"

Ensemble vs. Debate: The Two Multi-Model Verification Paradigms and When Each Fails

· 9 min read
Tian Pan
Software Engineer

When a single LLM gives you the wrong answer, the instinct is to ask more models. Run three in parallel and take the majority — that's ensemble. Or put them in a room and let them argue it out — that's debate. Both feel rigorous. Both have peer-reviewed results behind them. And both fail in exactly the same way when the conditions aren't right, which is the part practitioners rarely discuss.

The failure mode isn't subtle: when all your models learned from the same data, carry the same biases, or were trained by people with the same worldview, asking more of them doesn't give you more signal. It gives you more confident noise. Recent research has put a number on this: the pairwise error correlation between top frontier models sits around r = 0.77. That means roughly 60% of error variance is shared. Three models from different providers are effectively 1.3 independent models, not 3.0.
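To make the arithmetic concrete, here is a minimal sketch of where the "roughly 60%" figure comes from: squaring a correlation coefficient gives the fraction of variance two variables share. The function name `shared_variance` is illustrative and not taken from the research mentioned above.

```python
def shared_variance(r: float) -> float:
    """Fraction of variance shared by two variables with Pearson correlation r."""
    return r ** 2


r = 0.77  # pairwise error correlation quoted in the paragraph above
print(f"shared error variance: {shared_variance(r):.0%}")
```

With r = 0.77 this comes out just under 60%, which is why adding a second or third correlated model buys far less than a second or third independent opinion would.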

The Feedback Signal Timing Problem: Why Your AI Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When Klarna deployed its AI customer service chatbot in early 2024, it processed 2.3 million conversations in the first month. Satisfaction scores matched human agents. Executives declared victory. By 2025, the company was quietly hiring back the human agents it had replaced.

What went wrong? The metrics told one story while users experienced another. The chatbot aced simple, transactional queries—order status, payment questions—but fell apart on complex disputes, fraud claims, and emotionally difficult conversations. CSAT scores averaged across all interaction types couldn't detect this. The system appeared to be working even as it was slowly eroding user trust.
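A small sketch of how a blended average can hide a failing segment. The records below are made-up illustration data, not Klarna's numbers, and the segment names and `csat_by_segment` helper are assumptions chosen for the example.

```python
from collections import defaultdict


def csat_by_segment(records):
    """Average satisfaction score per interaction type."""
    totals = defaultdict(lambda: [0.0, 0])  # segment -> [score sum, count]
    for segment, score in records:
        totals[segment][0] += score
        totals[segment][1] += 1
    return {segment: total / count for segment, (total, count) in totals.items()}


# Hypothetical data: 90 easy transactional chats, 10 hard disputes (1-5 scale).
records = [("order_status", 5)] * 90 + [("dispute", 2)] * 10

overall = sum(score for _, score in records) / len(records)
per_segment = csat_by_segment(records)

print(f"overall CSAT: {overall:.1f}")
for segment, avg in per_segment.items():
    print(f"{segment}: {avg:.1f}")
```

In this toy data the overall score is 4.7 while disputes sit at 2.0. The blended number looks healthy because the easy segment dominates the volume.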

This isn't a Klarna-specific failure. It's a pattern that repeats across AI product development: teams collect satisfaction signals, optimize against them, and discover too late that the signals were measuring something other than actual value. The problem isn't the tools—it's the timing mismatch between when feedback arrives and when the consequences of a response become clear.

Gradual Context Replacement: Managing Long AI Conversations Without Losing Quality

· 9 min read
Tian Pan
Software Engineer

Your chatbot works perfectly for the first fifteen turns. Then something goes wrong. It contradicts an earlier decision. It asks for information the user already provided. It loses the thread of a multi-step task that was clearly defined at the start. The conversation history is technically all there (you haven't deleted anything), but the model is behaving as if it weren't.

This is context rot: the gradual degradation of output quality as conversation histories grow. A 2024 evaluation of 18 state-of-the-art models across nearly 200,000 controlled calls found that reliability decreases significantly beyond 30,000 tokens, even in models with much larger nominal windows. High-performing models become as unreliable as much smaller ones in extended dialogues. The problem isn't that your context window ran out. It's that transformer attention is quadratic—100,000 tokens means 10 billion pairwise relationships—and the model is forced to distribute focus so thinly that important earlier content gets effectively ignored.
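The quadratic claim is easy to check by hand. The sketch below counts token pairs for full (non-causal) self-attention over a single sequence; it ignores layers, heads, and causal masking, which roughly halves the count. The function name is illustrative.

```python
def pairwise_relationships(num_tokens: int) -> int:
    """Query-key pairs that full self-attention scores for one sequence."""
    return num_tokens * num_tokens


for n in (1_000, 30_000, 100_000):
    print(f"{n:>7,} tokens -> {pairwise_relationships(n):,} pairs")
```

Going from 30,000 to 100,000 tokens multiplies the context by about 3.3x but the pair count by about 11x.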

When teams hit this wall, they usually reach for one of two fixes: truncation or summarization. Both make things worse in predictable ways.

The Helpful AI Paradox: Why Instruction-Following Is a Security Vulnerability

· 9 min read
Tian Pan
Software Engineer

There's an uncomfortable truth about LLMs that doesn't get discussed enough in product reviews: the property that makes them useful is identical to the property that makes them exploitable. An LLM that obediently follows instructions — any instructions, from any source, delivered in any format — will follow malicious instructions with the same cheerful compliance it applies to legitimate ones. The model cannot tell the difference.

This isn't a bug that will be patched away. It's an architectural reality. And as these systems take on more agentic roles — reading emails, browsing the web, executing code, calling APIs — the exposure surface grows in ways that most engineering teams haven't mapped.

LLM-as-Judge Adversarial Failures: When Your Eval Harness Gets Gamed

· 9 min read
Tian Pan
Software Engineer

Your LLM-as-judge gave your new model a clean bill of health. Win rates are up, rubric scores improved across the board, and the automated eval pipeline ran green. Then you shipped — and user satisfaction dropped.

This is not an edge case. Researchers built constant-output "null models" that produce the exact same response regardless of input and gamed AlpacaEval 2.0 to an 86.5% length-controlled win rate. The verified state of the art at the time was 57.5%. When a model with no task capability at all can top your leaderboard, your eval harness has a problem that's worth understanding systematically.

Your Load Tests Are Lying: LLM Provider Capacity Contention in Production

· 11 min read
Tian Pan
Software Engineer

You ran a load test. Your p95 latency was 450ms. You felt good about it, shipped the feature, and then your on-call rotation lit up two weeks later because users were seeing 25-second response times at 9 AM on a Tuesday.

Nothing changed in your code. No deployment, no config change. The provider's status page said "operational." And yet your app was unusable for 20 minutes during peak business hours.

This is the LLM capacity contention problem, and it's one of the most common failure modes engineers don't see coming until they've already been burned.

LLM Self-Debugging: When the Explanation Is the Signal vs. When It's the Lie

· 8 min read
Tian Pan
Software Engineer

When your LLM agent fails, the most tempting thing in the world is to ask it why. It will answer fluently, specifically, and with what feels like self-awareness. It might say: "I misunderstood the user's intent and retrieved documents about X when I should have targeted Y." That sounds exactly like a root cause. You write it down, open the prompt editor, and spend forty minutes chasing the wrong problem.

This is the central trap of LLM self-debugging. The model's explanation and the model's actual failure mechanism are two different things. Sometimes they overlap. Often they don't. Knowing which situation you're in before you act on the explanation is the discipline that separates fast debugging from expensive detours.

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

· 10 min read
Tian Pan
Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.
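For readers who want to reproduce this kind of gap from their own logs, here is a minimal nearest-rank percentile sketch. The latency samples are hypothetical, constructed to mirror the 800 ms and 28 s figures above, and `percentile` is an illustrative helper rather than any particular monitoring tool's definition (tools differ in how they interpolate).

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in seconds: mostly fast, a few very slow.
latencies_s = [0.8] * 98 + [28.0] * 2

p50 = percentile(latencies_s, 50)
p99 = percentile(latencies_s, 99)
print(f"P50={p50}s  P99={p99}s  ratio={p99 / p50:.0f}x")
```

Two slow requests out of a hundred leave the median untouched while defining the P99, which is why a green median dashboard says little about the worst user experiences.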

PII in the Prompt: The Data Minimization Patterns Your AI Pipeline Is Missing

· 12 min read
Tian Pan
Software Engineer

Research from 2025 found that 8.5% of prompts submitted to commercial LLMs contain sensitive information — PII, credentials, and internal file references. That statistic probably undersells the problem. It counts what users explicitly type. It doesn't count what your system silently adds: retrieved customer records, tool outputs from database queries, memories persisted from previous sessions, or fine-tuning data that wasn't scrubbed before training. Most AI pipelines leak PII not through user mistakes but through architectural blind spots that no single engineer owns.

The failure mode is almost always the same: a team ships an AI feature thinking "we don't send personal data," but personal data enters through the seams — in the RAG retrieval chunk that includes a customer's address, in the agent tool output that returns a full user profile, in the fine-tuning dataset that was exported from a CRM without redaction. GDPR's data minimization principle requires that you collect only what's necessary for a specific purpose. LLM architectures violate this by default.
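One place to start is a redaction pass on retrieved chunks and tool outputs before they are assembled into a prompt. The sketch below is deliberately minimal: the two regular expressions are simplistic assumptions that catch common email and phone formats only, and the `redact` helper and placeholder tokens are illustrative rather than part of any specific library.

```python
import re

# Simplified patterns for illustration; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


# Hypothetical retrieved chunk that would otherwise go straight into the prompt.
chunk = "Contact Jane at jane.doe@example.com or +1 415-555-0100 about order 1234."
print(redact(chunk))
```

Note what this does not catch: the name "Jane" passes through untouched. Pattern matching handles structured identifiers; names and addresses generally need entity recognition or, better, not retrieving those fields in the first place.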

The Ghost in the Weights: How Pretraining Residue Breaks Your Fine-Tuned Model in Production

· 10 min read
Tian Pan
Software Engineer

Your fine-tuned model passes your eval suite with 93% accuracy. You ship it. Three weeks later, a customer sends a screenshot: the model answered a question it had never seen in training with complete confidence — and it was completely wrong. The answer wasn't a hallucination in the usual sense. It was a memory. A pattern baked in during pretraining, resurfacing on a distribution the fine-tune never covered. This is pretraining residue, and it's one of the most underdiagnosed failure modes in production fine-tuning.

Fine-tuning adjusts weights. It does not retrain the model from scratch. The patterns — the calibration mechanisms, the confidence signals, the world-model priors — developed during pretraining at trillion-token scale remain in the weights. Your fine-tuning dataset, no matter how carefully curated, is a thin layer on top of a much deeper prior. When inputs arrive that fall outside your fine-tuning distribution, the model doesn't say "I don't know." It reaches back to pretraining and answers as if it does.

Privacy Mode That Actually Keeps Its Promise: Engineering User-Controlled Data Boundaries in AI Features

· 10 min read
Tian Pan
Software Engineer

In March 2026, a class action lawsuit alleged that Perplexity's "Incognito Mode" was routing conversational data and user identifiers to Meta and Google's ad networks — even for paying subscribers who had explicitly activated it. The feature was called incognito. Users assumed that meant private. The implementation said otherwise.

This is the most common failure mode in AI privacy modes: the name is marketing, the implementation is retention theater. Engineers ship a toggle. Legal approves the wording. Users flip the switch and trust it. And somewhere in the data pipeline, inputs are still flowing to a logging service, a training job, or a third-party analytics SDK that nobody remembered to gate.

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

· 8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, squeezed 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency and it barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.
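Tracing those untraced stages does not require heavy tooling to get started. Below is a minimal sketch using a context manager that accumulates wall-clock time per stage. The stage names and the placeholder work inside each block are assumptions for illustration; a real pipeline would wrap its own embedding, retrieval, reranking, and generation calls.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> accumulated seconds


@contextmanager
def traced(stage):
    """Accumulate wall-clock time spent inside the block under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


# Hypothetical stages with stand-in work; replace the bodies with real calls.
with traced("embed_query"):
    sum(range(10_000))
with traced("vector_search"):
    sorted(range(10_000), reverse=True)
with traced("inference"):
    sum(i * i for i in range(10_000))

total = sum(timings.values())
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.2f} ms ({seconds / total:.0%} of traced time)")
```

The point of the design is that every stage gets a number, including the ones nobody thought were worth measuring.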