639 posts tagged with "llm"

The Ghost in the Weights: How Pretraining Residue Breaks Your Fine-Tuned Model in Production

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your fine-tuned model passes your eval suite with 93% accuracy. You ship it. Three weeks later, a customer sends a screenshot: the model answered a question it had never seen in training with complete confidence — and it was completely wrong. The answer wasn't a hallucination in the usual sense. It was a memory. A pattern baked in during pretraining, resurfacing on a distribution the fine-tune never covered. This is pretraining residue, and it's one of the most underdiagnosed failure modes in production fine-tuning.

Fine-tuning adjusts weights. It does not retrain the model from scratch. The patterns — the calibration mechanisms, the confidence signals, the world-model priors — developed during pretraining at trillion-token scale remain in the weights. Your fine-tuning dataset, no matter how carefully curated, is a thin layer on top of a much deeper prior. When inputs arrive that fall outside your fine-tuning distribution, the model doesn't say "I don't know." It reaches back to pretraining and answers as if it does.

Privacy Mode That Actually Keeps Its Promise: Engineering User-Controlled Data Boundaries in AI Features

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

In March 2026, a class action lawsuit alleged that Perplexity's "Incognito Mode" was routing conversational data and user identifiers to Meta and Google's ad networks — even for paying subscribers who had explicitly activated it. The feature was called incognito. Users assumed that meant private. The implementation said otherwise.

This is the most common failure mode in AI privacy modes: the name is marketing, the implementation is retention theater. Engineers ship a toggle. Legal approves the wording. Users flip the switch and trust it. And somewhere in the data pipeline, inputs are still flowing to a logging service, a training job, or a third-party analytics SDK that nobody remembered to gate.

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

May 7, 2026 · 8 min read

Tian Pan

Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, shaved 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency, and it had barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.
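One way to see where the time goes is to wrap every stage of a request in a timer and compare shares, rather than looking only at GPU metrics. The sketch below is a minimal illustration, not the article's tooling: the `StageTimer` class and the stage names are hypothetical, and `time.sleep` stands in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


timer = StageTimer()
# Hypothetical stages; time.sleep stands in for real work.
with timer.stage("embed_query"):
    time.sleep(0.01)
with timer.stage("vector_search"):
    time.sleep(0.02)
with timer.stage("inference"):
    time.sleep(0.03)

for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.0%}")
```

In a real service you would likely emit these timings as spans to your tracing system instead of printing them, but the per-stage breakdown is the part that matters.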

Prompt Injection in Multimodal Inputs: The Visual Attack Surface Your Text-Only Defense Misses

May 7, 2026 · 11 min read

Tian Pan

Software Engineer

When teams harden their AI pipelines against prompt injection, they usually focus on text: sanitizing user input strings, scanning outputs for exfiltrated data, filtering known jailbreak patterns. That work matters, but it addresses roughly half the attack surface of a modern AI system. The other half lives inside images, PDFs, audio clips, and charts — formats that bypass every text-scanning rule you've written, because the model processes them through entirely different pathways than it processes text.

Steganographic injection attacks against vision-language models achieve success rates around 24% across production models including GPT-4V, Claude, and LLaVA. That number isn't a lab artifact. It measures real attack payloads, hidden in ordinary-looking images, causing production models to deviate from their intended behavior. Your text injection scanner doesn't see any of it.

Prompt Injection Is Not Primarily an Attacker Problem

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams defending against prompt injection picture an attacker: someone crafting a carefully engineered string to override an AI's instructions. That framing is wrong, and it's costing them. The harder version of this problem doesn't require attackers at all.

Every time your AI application ingests user-generated content — a product review, a support ticket, a document upload, a CRM note — it faces the same structural vulnerability. No malicious intent needed. The ordinary text that ordinary users produce for ordinary reasons can, at scale, behave identically to a deliberate injection. If your application is only defended against the adversarial case, you're defended against the minority case.

Provenance Debt in AI Knowledge Bases: When Your RAG System Learns From Itself

May 7, 2026 · 8 min read

Tian Pan

Software Engineer

Your RAG system is probably indexing its own outputs. You just don't know it yet.

It starts innocuously: someone adds a quarterly summary document to the knowledge base. That summary was written by the same LLM that queries the knowledge base. Six months later, a developer adds AI-generated release notes. Then auto-generated support FAQs. Then a synthesized onboarding guide. None of these documents are labeled as AI-generated. To the retrieval system, they look identical to human-written primary sources. Now when your model retrieves context to answer a question, a significant portion of that context is the compressed, possibly-distorted output of a prior model run — and your accuracy metrics are still green.

This is provenance debt: the accumulation of AI-generated content in retrieval corpora without source markers, creating a feedback loop where each generation of model outputs becomes raw material for the next.
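A source marker can be as small as one field on each document. The sketch below assumes a hypothetical `origin` label recorded at ingestion and a hypothetical `select_context` helper that caps how much retrieved context may come from non-human sources. The 25% cap is an arbitrary illustration, not a recommended value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    origin: str  # assumed labels: "human", "ai_generated", or "unknown"


def select_context(candidates, max_ai_fraction=0.25):
    """Keep retrieval order, but cap the number of non-human-authored documents.

    Documents with an unknown origin are treated like AI-generated ones.
    """
    ai_budget = int(max_ai_fraction * len(candidates))
    selected = []
    for doc in candidates:
        if doc.origin == "human":
            selected.append(doc)
        elif ai_budget > 0:
            selected.append(doc)
            ai_budget -= 1
    return selected


retrieved = [
    Document("runbook-12", "Restart the worker pool...", "human"),
    Document("q3-summary", "In Q3 the team shipped...", "ai_generated"),
    Document("faq-auto-7", "Common questions include...", "ai_generated"),
    Document("design-doc-4", "The index is sharded by...", "human"),
]
context = select_context(retrieved)
print([d.doc_id for d in context])
```

Filtering at query time only works if the label exists, so the harder part is recording `origin` when a document enters the corpus.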

The Quiet Quitter Pattern: Why Your AI Engagement Metrics Are Lying to You

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

There's a specific failure mode that quietly destroys AI product metrics without anyone noticing. Your dashboard shows a 34% suggestion acceptance rate, strong DAU, and growing feature engagement. What the dashboard doesn't show: 60% of those accepted suggestions get immediately rewritten; the users who "engage" most are the ones who click the AI output, select all, and type their own response anyway; and the feature has zero measurable effect on downstream task completion.

This is the quiet quitter pattern: users who systematically route around an AI feature while still generating all the surface metrics of engaged users. They don't disable the feature — they just ignore its output. In your analytics, they look identical to your best AI users.
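One way to separate these users from genuinely engaged ones is to measure how much of an accepted suggestion survives into the final text. The sketch below assumes a hypothetical event shape (`accepted`, `suggestion`, `final_text`) and uses the standard library's `difflib.SequenceMatcher` as a rough similarity measure. The 0.8 threshold is an illustrative assumption.

```python
from difflib import SequenceMatcher


def retained_acceptance_rate(events, min_similarity=0.8):
    """Share of accepted suggestions whose final text still resembles the suggestion."""
    accepted = [e for e in events if e["accepted"]]
    if not accepted:
        return 0.0
    retained = sum(
        1
        for e in accepted
        if SequenceMatcher(None, e["suggestion"], e["final_text"]).ratio() >= min_similarity
    )
    return retained / len(accepted)


# Hypothetical events: one kept as-is, one accepted then fully replaced, one dismissed.
events = [
    {"accepted": True, "suggestion": "Thanks for reaching out.", "final_text": "Thanks for reaching out."},
    {"accepted": True, "suggestion": "Thanks for reaching out.", "final_text": "xyz"},
    {"accepted": False, "suggestion": "Happy to help!", "final_text": "No."},
]

raw_rate = sum(e["accepted"] for e in events) / len(events)
print(f"raw acceptance: {raw_rate:.0%}")
print(f"retained acceptance: {retained_acceptance_rate(events):.0%}")
```

A large gap between the raw and retained rates is the signature described above: the click happened, but the output did not survive.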

Quota Starvation: When Your AI Features Eat Each Other's Rate Limits

May 7, 2026 · 11 min read

Tian Pan

Software Engineer

At 2 AM, a scheduled report-generation job spins up fifty parallel LLM requests against your shared API key. By the time the 9 AM product demo starts, every real-time chat completion is silently timing out. Your error dashboards are green. No 429s in the logs. The model is returning responses — just ten seconds late, on a feature with a two-second SLA.

This is quota starvation. It does not look like an outage. It looks like the AI is "slow today."
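A common mitigation is to partition the shared quota so that a batch job cannot occupy capacity that interactive traffic needs. The sketch below is one minimal way to do that, assuming a hypothetical `PartitionedQuota` class that caps in-flight requests per feature. The feature names and limits are made up for illustration.

```python
import threading


class PartitionedQuota:
    """Caps in-flight requests per feature so one feature cannot use the whole quota."""

    def __init__(self, limits):
        self._limits = dict(limits)
        self._in_flight = {name: 0 for name in limits}
        self._lock = threading.Lock()

    def try_acquire(self, feature):
        """Reserve one slot for the feature; return False if its partition is full."""
        with self._lock:
            if self._in_flight[feature] >= self._limits[feature]:
                return False
            self._in_flight[feature] += 1
            return True

    def release(self, feature):
        """Free one slot after the request finishes."""
        with self._lock:
            if self._in_flight[feature] > 0:
                self._in_flight[feature] -= 1


# Hypothetical split of ten concurrent requests across two features.
quota = PartitionedQuota({"batch_reports": 2, "chat": 8})

# The report job tries to start fifty requests at once; only two get a slot.
granted = sum(quota.try_acquire("batch_reports") for _ in range(50))
print("batch requests admitted:", granted)
print("chat still admitted:", quota.try_acquire("chat"))
```

Rejected batch requests would be queued and retried later. The point is that the waiting happens in the job with no latency target rather than in the feature with a two-second SLA.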

The RAG Eval Invalidation Paradox: Why Updating Your Knowledge Base Breaks Your Benchmarks

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG eval suite passes at 0.89 faithfulness. You add 5,000 new support documents to the knowledge base. You re-run the same evals. Faithfulness drops to 0.79. Your team files a model regression ticket.

Nothing regressed. Your eval just became a lie.

This is the RAG eval invalidation paradox: the moment you update your knowledge base, the evaluation set you built against the old index silently stops measuring what it was designed to measure. Most teams discover this months later — after burning engineering cycles on phantom regressions — if they ever discover it at all.
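One inexpensive guard is to stamp each eval set with a fingerprint of the index it was built against and to check that stamp before trusting a score. The sketch below assumes hypothetical helpers `index_fingerprint` and `eval_is_stale` and fingerprints only document IDs. A real system would probably also hash content versions and chunking settings.

```python
import hashlib


def index_fingerprint(doc_ids):
    """Order-independent SHA-256 fingerprint of the document IDs in an index."""
    digest = hashlib.sha256()
    for doc_id in sorted(doc_ids):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


def eval_is_stale(eval_set, current_doc_ids):
    """True when the eval set was built against a different index than the current one."""
    return eval_set["index_fingerprint"] != index_fingerprint(current_doc_ids)


# Hypothetical eval set, stamped when it was authored.
old_index = ["kb-001", "kb-002", "kb-003"]
eval_set = {
    "name": "support-faithfulness",
    "index_fingerprint": index_fingerprint(old_index),
}

new_index = old_index + ["kb-004"]
print("stale before update:", eval_is_stale(eval_set, old_index))
print("stale after update:", eval_is_stale(eval_set, new_index))
```

With a check like this in the eval runner, a drop in score after a corpus update is flagged as a measurement change first, before anyone files it as a model regression.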

The Data Contract Problem in RAG: When Your Ingestion Pipeline Silently Breaks Retrieval Quality

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG system has a bug that doesn't throw exceptions. It doesn't spike your error rate. It doesn't show up in your latency dashboards. Instead, it quietly delivers confident, plausible-sounding answers that are wrong — and nobody notices for weeks.

This is the data contract problem in RAG: your ingestion pipeline is the source of truth for everything downstream, but it has no schema enforcement, no freshness guarantees, and no alerting when the shape of the world changes underneath it. Every time an upstream data source adds a field, a chunking parameter shifts, or an embedding model gets updated, your retrieval quality silently degrades.

Eighty percent of enterprise RAG projects experience critical failures in production. The most insidious of those failures don't announce themselves.
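A data contract for ingestion can start as a handful of explicit checks run on every record before it is indexed. The sketch below is an assumption-heavy illustration: the `CONTRACT` values (model name, embedding dimension, 30-day freshness window) and the record fields are hypothetical, not taken from any particular pipeline.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract; every value here is an illustrative assumption.
CONTRACT = {
    "required_fields": ("doc_id", "text", "embedding", "embedding_model", "source_updated_at"),
    "embedding_model": "example-embed-v1",
    "embedding_dim": 4,
    "max_age": timedelta(days=30),
}


def contract_violations(record, contract, now):
    """Return human-readable violations; an empty list means the record passes."""
    missing = [f for f in contract["required_fields"] if f not in record]
    if missing:
        return [f"missing fields: {missing}"]

    violations = []
    if record["embedding_model"] != contract["embedding_model"]:
        violations.append(f"unexpected embedding model: {record['embedding_model']}")
    if len(record["embedding"]) != contract["embedding_dim"]:
        violations.append(f"unexpected embedding dimension: {len(record['embedding'])}")
    if now - record["source_updated_at"] > contract["max_age"]:
        violations.append("source is older than the freshness window")
    return violations


now = datetime(2026, 5, 7, tzinfo=timezone.utc)
good = {
    "doc_id": "kb-001",
    "text": "Reset your password from the login page.",
    "embedding": [0.1, 0.2, 0.3, 0.4],
    "embedding_model": "example-embed-v1",
    "source_updated_at": now - timedelta(days=2),
}
# Same record after a silent model swap, and with a stale source.
bad = {**good, "embedding_model": "example-embed-v2", "source_updated_at": now - timedelta(days=90)}

print(contract_violations(good, CONTRACT, now))
print(contract_violations(bad, CONTRACT, now))
```

Routing non-empty results to an alert, rather than indexing the record anyway, is what turns a silent degradation into a visible failure.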

Rate Limits Are a Design Constraint, Not an Error Code

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

A team I know built a financial assistant with an agentic loop. Week one, API spend was $127. Week eleven, it was $47,000: same system, same feature, no intentional change in scope. The agent hit a rate limit, the retry logic dutifully retried, the loop had no circuit breaker, and the costs compounded in silence until someone noticed the billing alert they had set too high.

This isn't a story about a bug. It's a story about architecture. The team's mental model treated rate limits as an error to handle reactively. The system they built reflected that model exactly. The $47,000 week was the system working as designed.
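Treating the limit as a design constraint usually means giving retries an explicit budget and making the caller handle exhaustion. The sketch below shows one minimal form of that idea. `RateLimited`, `RetryBudgetExceeded`, and `call_with_budget` are hypothetical names, not any client library's API, and the attempt count and delays are arbitrary.

```python
import time


class RateLimited(Exception):
    """Stand-in for whatever your client raises on a rate-limit response."""


class RetryBudgetExceeded(Exception):
    """Raised when the budget is spent, so the caller must decide what happens next."""


def call_with_budget(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying rate-limit errors with exponential backoff up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts")
            sleep(base_delay * 2 ** attempt)


# Hypothetical call that is rate limited twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"


delays = []  # record the backoff delays instead of actually sleeping
result = call_with_budget(flaky, sleep=delays.append)
print(result, delays)
```

An agent loop built on this would stop and surface `RetryBudgetExceeded` instead of retrying indefinitely, which is the circuit breaker the system in the story lacked.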

Your Refusal Logs Are a Product Backlog in Disguise

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

Every AI product team has a security dashboard somewhere showing refused requests. Filters triggered, jailbreaks blocked, policy violations caught. The operational teams look at it to make sure the guardrails are holding. Nobody else looks at it at all.

That's a mistake. The requests your AI refuses are the most concentrated, honest user research signal you have access to. A user who tries three different phrasings to get your product to do something it won't do is telling you, with extraordinary clarity, exactly what they want and can't have. Treating that signal as a security artifact rather than a product artifact is leaving the richest feedback you'll ever collect on the floor.
