Skip to main content

861 posts tagged with "insider"

View all tags

LLM-as-Judge Adversarial Failures: When Your Eval Harness Gets Gamed

· 9 min read
Tian Pan
Software Engineer

Your LLM-as-judge gave your new model a clean bill of health. Win rates are up, rubric scores improved across the board, and the automated eval pipeline ran green. Then you shipped — and user satisfaction dropped.

This is not an edge case. Researchers built constant-output "null models" that produce the exact same response regardless of input and gamed AlpacaEval 2.0 to an 86.5% length-controlled win rate. The verified state of the art at the time was 57.5%. When a model with no task capability at all can top your leaderboard, your eval harness has a problem that's worth understanding systematically.

The Ghost in the Weights: How Pretraining Residue Breaks Your Fine-Tuned Model in Production

· 10 min read
Tian Pan
Software Engineer

Your fine-tuned model passes your eval suite with 93% accuracy. You ship it. Three weeks later, a customer sends a screenshot: the model answered a question it had never seen in training with complete confidence — and it was completely wrong. The answer wasn't a hallucination in the usual sense. It was a memory. A pattern baked in during pretraining, resurfacing on a distribution the fine-tune never covered. This is pretraining residue, and it's one of the most underdiagnosed failure modes in production fine-tuning.

Fine-tuning adjusts weights. It does not retrain the model from scratch. The patterns — the calibration mechanisms, the confidence signals, the world-model priors — developed during pretraining at trillion-token scale remain in the weights. Your fine-tuning dataset, no matter how carefully curated, is a thin layer on top of a much deeper prior. When inputs arrive that fall outside your fine-tuning distribution, the model doesn't say "I don't know." It reaches back to pretraining and answers as if it does.

Privacy Mode That Actually Keeps Its Promise: Engineering User-Controlled Data Boundaries in AI Features

· 10 min read
Tian Pan
Software Engineer

In March 2026, a class action lawsuit alleged that Perplexity's "Incognito Mode" was routing conversational data and user identifiers to Meta and Google's ad networks — even for paying subscribers who had explicitly activated it. The feature was called incognito. Users assumed that meant private. The implementation said otherwise.

This is the most common failure mode in AI privacy modes: the name is marketing, the implementation is retention theater. Engineers ship a toggle. Legal approves the wording. Users flip the switch and trust it. And somewhere in the data pipeline, inputs are still flowing to a logging service, a training job, or a third-party analytics SDK that nobody remembered to gate.

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

· 8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, squeezed out 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency and it barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.

Prompt Injection Is Not Primarily an Attacker Problem

· 9 min read
Tian Pan
Software Engineer

Most teams defending against prompt injection picture an attacker: someone crafting a carefully engineered string to override an AI's instructions. That framing is wrong, and it's costing them. The harder version of this problem doesn't require attackers at all.

Every time your AI application ingests user-generated content — a product review, a support ticket, a document upload, a CRM note — it faces the same structural vulnerability. No malicious intent needed. The ordinary text that ordinary users produce for ordinary reasons can, at scale, behave identically to a deliberate injection. If your application is only defended against the adversarial case, you're defended against the minority case.

The RAG Eval Invalidation Paradox: Why Updating Your Knowledge Base Breaks Your Benchmarks

· 10 min read
Tian Pan
Software Engineer

Your RAG eval suite passes at 0.89 faithfulness. You add 5,000 new support documents to the knowledge base. You re-run the same evals. Faithfulness drops to 0.79. Your team files a model regression ticket.

Nothing regressed. Your eval just became a lie.

This is the RAG eval invalidation paradox: the moment you update your knowledge base, the evaluation set you built against the old index silently stops measuring what it was designed to measure. Most teams discover this months later — after burning engineering cycles on phantom regressions — if they ever discover it at all.

The Data Contract Problem in RAG: When Your Ingestion Pipeline Silently Breaks Retrieval Quality

· 10 min read
Tian Pan
Software Engineer

Your RAG system has a bug that doesn't throw exceptions. It doesn't spike your error rate. It doesn't show up in your latency dashboards. Instead, it quietly delivers confident, plausible-sounding answers that are wrong — and nobody notices for weeks.

This is the data contract problem in RAG: your ingestion pipeline is the source of truth for everything downstream, but it has no schema enforcement, no freshness guarantees, and no alerting when the shape of the world changes underneath it. Every time an upstream data source adds a field, a chunking parameter shifts, or an embedding model gets updated, your retrieval quality silently degrades.

Eighty percent of enterprise RAG projects experience critical failures in production. The most insidious of those failures don't announce themselves.

The Retrograde Accuracy Problem: Why AI Features Degrade as Your Product Grows

· 10 min read
Tian Pan
Software Engineer

Your AI feature ships clean. Accuracy on the eval set: 91%. Latency: acceptable. The team is proud. Six months later, users are complaining that the feature feels "dumb," support tickets are climbing, and your aggregate metrics are quietly 8% worse than launch day. Nobody changed the model. The underlying data pipeline is intact. What happened?

This is the retrograde accuracy problem. As your product grows — new features, new user segments, new edge cases, new flows — the input distribution your AI sees in production quietly drifts away from the distribution it was trained on. No model update. No data pipeline failure. The product itself outgrew the model.

Scheduling Fairness in Multi-Tenant LLM Inference: Why FIFO Is the Wrong Default

· 11 min read
Tian Pan
Software Engineer

Your company runs a shared LLM serving cluster. Two tenants use it: a customer-facing chatbot with a 500ms first-token latency SLO, and a batch document enrichment pipeline that processes thousands of long-context prompts overnight. One morning, the chatbot team pages you at 3am because their P95 TTFT spiked to 12 seconds. Root cause: the batch job started earlier than expected, filled the GPU memory with prefill work, and the chatbot's short requests sat in queue behind a parade of 8,000-token prompts. Your FIFO scheduler gave them equal priority. The chatbot's SLO was violated 4,000 times before you killed the batch job manually.

This failure mode is common, well-understood in theory, and surprisingly widespread in practice. Most teams deploy vLLM or TGI with the default FIFO scheduler, add multiple workloads over time, and only discover the priority inversion when an incident happens.

Your Eval Harness Is a Museum: How Production Failures Should Write Tomorrow's Tests

· 9 min read
Tian Pan
Software Engineer

Most AI teams build their eval suite once — carefully, thoughtfully, during the sprint before launch. They write cases for the edge scenarios they can imagine, document the expected outputs, get sign-off, and ship. Six months later, the suite still passes. The model has quietly gotten worse on the actual traffic hitting production, but the eval harness was authored before any of that traffic existed. It's still grading the answers to questions the author asked, not the questions users are asking.

That's the museum problem: an eval suite curated at one point in time accumulates relics. It proves the system handles the cases someone anticipated, not the cases that actually break it.

Ship Your AI Feature Before It Feels Ready

· 9 min read
Tian Pan
Software Engineer

Most AI features that ship late don't ship late because they're broken. They ship late because the team is still optimizing for a test suite that doesn't reflect how real users behave. The benchmarks look better each week. The evals trend upward. And the gap between "lab performance" and "production value" quietly widens.

The uncomfortable truth is that the first 500 real users will surface more actionable problems in two weeks than four more weeks of prompt tuning ever could. This is not an argument for shipping garbage. It's an argument for recognizing that your current calibration of "ready" is almost certainly miscalibrated — and that real usage data is the only thing that corrects it.

The SLA Illusion: Why 99.9% Uptime Means Nothing for AI-Powered Features

· 9 min read
Tian Pan
Software Engineer

Your dashboards are green. Latency is nominal. Error rate is 0.2%. Uptime is 99.97% for the month. And your AI assistant is confidently telling users the wrong thing, in the wrong format, at twice the expected length — and has been doing so for eleven days.

This is the SLA illusion: the infrastructure contract that covers the pipe, not the water flowing through it. For AI-powered features, the gap between "is it responding?" and "is it responding well?" is the gap where product quality quietly dies.