702 posts tagged with "llm"

LLM Self-Debugging: When the Explanation Is the Signal vs. When It's the Lie

· 8 min read
Tian Pan
Software Engineer

When your LLM agent fails, the most tempting thing in the world is to ask it why. It will answer fluently, specifically, and with what feels like self-awareness. It might say: "I misunderstood the user's intent and retrieved documents about X when I should have targeted Y." That sounds exactly like a root cause. You write it down, open the prompt editor, and spend forty minutes chasing the wrong problem.

This is the central trap of LLM self-debugging. The model's explanation and the model's actual failure mechanism are two different things. Sometimes they overlap. Often they don't. Knowing which situation you're in before you act on the explanation is the discipline that separates fast debugging from expensive detours.

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

· 10 min read
Tian Pan
Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.
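The arithmetic behind that ratio can be sketched in a few lines. This is a minimal illustration, not a measurement: the sample distribution below (97 fast requests, 3 slow ones) is an assumption chosen to reproduce the 800 ms and 28 s figures above, and `latency_percentiles` is a hypothetical helper name.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return (p50, p99) for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles:
    # index 49 is the 50th percentile, index 98 is the 99th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]


# Assumed sample: 97 requests at 800 ms and 3 requests at 28 s.
samples = [800.0] * 97 + [28_000.0] * 3

p50, p99 = latency_percentiles(samples)
ratio = p99 / p50

print(f"P50={p50:.0f} ms  P99={p99:.0f} ms  ratio={ratio:.0f}x")
```

Only 3% of requests are slow in this sample, so the median never moves, yet the P99 lands squarely on the slow requests. A dashboard that tracks only the median would stay green.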

PII in the Prompt: The Data Minimization Patterns Your AI Pipeline Is Missing

· 12 min read
Tian Pan
Software Engineer

Research from 2025 found that 8.5% of prompts submitted to commercial LLMs contain sensitive information — PII, credentials, and internal file references. That statistic probably undersells the problem. It counts what users explicitly type. It doesn't count what your system silently adds: retrieved customer records, tool outputs from database queries, memories persisted from previous sessions, or fine-tuning data that wasn't scrubbed before training. Most AI pipelines leak PII not through user mistakes but through architectural blind spots that no single engineer owns.

The failure mode is almost always the same: a team ships an AI feature thinking "we don't send personal data," but personal data enters through the seams — in the RAG retrieval chunk that includes a customer's address, in the agent tool output that returns a full user profile, in the fine-tuning dataset that was exported from a CRM without redaction. GDPR's data minimization principle requires that you collect only what's necessary for a specific purpose. LLM architectures violate this by default.

The Ghost in the Weights: How Pretraining Residue Breaks Your Fine-Tuned Model in Production

· 10 min read
Tian Pan
Software Engineer

Your fine-tuned model passes your eval suite with 93% accuracy. You ship it. Three weeks later, a customer sends a screenshot: the model answered a question it had never seen in training with complete confidence — and it was completely wrong. The answer wasn't a hallucination in the usual sense. It was a memory. A pattern baked in during pretraining, resurfacing on a distribution the fine-tune never covered. This is pretraining residue, and it's one of the most underdiagnosed failure modes in production fine-tuning.

Fine-tuning adjusts weights. It does not retrain the model from scratch. The patterns — the calibration mechanisms, the confidence signals, the world-model priors — developed during pretraining at trillion-token scale remain in the weights. Your fine-tuning dataset, no matter how carefully curated, is a thin layer on top of a much deeper prior. When inputs arrive that fall outside your fine-tuning distribution, the model doesn't say "I don't know." It reaches back to pretraining and answers as if it does.

Privacy Mode That Actually Keeps Its Promise: Engineering User-Controlled Data Boundaries in AI Features

· 10 min read
Tian Pan
Software Engineer

In March 2026, a class action lawsuit alleged that Perplexity's "Incognito Mode" was routing conversational data and user identifiers to Meta and Google's ad networks — even for paying subscribers who had explicitly activated it. The feature was called incognito. Users assumed that meant private. The implementation said otherwise.

This is the most common failure mode in AI privacy modes: the name is marketing, the implementation is retention theater. Engineers ship a toggle. Legal approves the wording. Users flip the switch and trust it. And somewhere in the data pipeline, inputs are still flowing to a logging service, a training job, or a third-party analytics SDK that nobody remembered to gate.

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

· 8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, shaved 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency, and it had barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.

Prompt Injection in Multimodal Inputs: The Visual Attack Surface Your Text-Only Defense Misses

· 11 min read
Tian Pan
Software Engineer

When teams harden their AI pipelines against prompt injection, they usually focus on text: sanitizing user input strings, scanning outputs for exfiltrated data, filtering known jailbreak patterns. That work matters, but it addresses roughly half the attack surface of a modern AI system. The other half lives inside images, PDFs, audio clips, and charts — formats that bypass every text-scanning rule you've written, because the model processes them through entirely different pathways than it processes text.

Steganographic injection attacks against vision-language models achieve success rates around 24% across production models including GPT-4V, Claude, and LLaVA. That number isn't a lab artifact. It measures real attack payloads, hidden in ordinary-looking images, causing production models to deviate from their intended behavior. Your text injection scanner doesn't see any of it.

Prompt Injection Is Not Primarily an Attacker Problem

· 9 min read
Tian Pan
Software Engineer

Most teams defending against prompt injection picture an attacker: someone crafting a carefully engineered string to override an AI's instructions. That framing is wrong, and it's costing them. The harder version of this problem doesn't require attackers at all.

Every time your AI application ingests user-generated content — a product review, a support ticket, a document upload, a CRM note — it faces the same structural vulnerability. No malicious intent needed. The ordinary text that ordinary users produce for ordinary reasons can, at scale, behave identically to a deliberate injection. If your application is only defended against the adversarial case, you're defended against the minority case.

Provenance Debt in AI Knowledge Bases: When Your RAG System Learns From Itself

· 8 min read
Tian Pan
Software Engineer

Your RAG system is probably indexing its own outputs. You just don't know it yet.

It starts innocuously: someone adds a quarterly summary document to the knowledge base. That summary was written by the same LLM that queries the knowledge base. Six months later, a developer adds AI-generated release notes. Then auto-generated support FAQs. Then a synthesized onboarding guide. None of these documents are labeled as AI-generated. To the retrieval system, they look identical to human-written primary sources. Now when your model retrieves context to answer a question, a significant portion of that context is the compressed, possibly-distorted output of a prior model run — and your accuracy metrics are still green.

This is provenance debt: the accumulation of AI-generated content in retrieval corpora without source markers, creating a feedback loop where each generation of model outputs becomes raw material for the next.
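One way to make this debt visible is to carry a source marker on every chunk and measure how much of a retrieved context is model-written. The sketch below assumes such a marker exists; the `Chunk` type, its `source` field, and the `ai_share` helper are all hypothetical names introduced for illustration, not part of any particular retrieval library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str  # assumed provenance marker: "human" or "ai"


def ai_share(chunks):
    """Fraction of retrieved chunks that are marked as AI-generated."""
    if not chunks:
        return 0.0
    return sum(c.source == "ai" for c in chunks) / len(chunks)


# Hypothetical retrieval result for a single query.
retrieved = [
    Chunk("Original incident report", "human"),
    Chunk("Quarterly summary", "ai"),
    Chunk("Auto-generated support FAQ", "ai"),
    Chunk("Synthesized onboarding guide", "ai"),
]

share = ai_share(retrieved)
print(f"{share:.0%} of retrieved context is AI-generated")
```

Without the marker, this number cannot be computed at all, which is the point: the accuracy metrics stay green because nothing distinguishes the two kinds of source.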

The Quiet Quitter Pattern: Why Your AI Engagement Metrics Are Lying to You

· 10 min read
Tian Pan
Software Engineer

There's a specific failure mode that quietly corrupts AI product metrics without anyone noticing. Your dashboard shows a 34% suggestion acceptance rate, strong DAU, and growing feature engagement. What the dashboard doesn't show is that 60% of those accepted suggestions get immediately rewritten; that the users who "engage" most are the ones who click the AI output, select all, and type their own response anyway; and that the feature has zero measurable effect on downstream task completion.

This is the quiet quitter pattern: users who systematically route around an AI feature while still generating all the surface metrics of engaged users. They don't disable the feature — they just ignore its output. In your analytics, they look identical to your best AI users.
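The gap between the surface metric and the real one can be sketched as follows. This is an illustration under stated assumptions: the `SuggestionEvent` record and its `rewritten_after_accept` flag are hypothetical (your analytics would need to capture such a signal), and the event counts are chosen only to match the 34% and 60% figures above.

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    accepted: bool
    rewritten_after_accept: bool  # assumed signal: user replaced the text


def acceptance_rates(events):
    """Return (raw acceptance rate, retained acceptance rate)."""
    if not events:
        raise ValueError("no suggestion events to measure")
    shown = len(events)
    accepted = sum(e.accepted for e in events)
    # A suggestion only counts as retained if it was accepted and kept.
    kept = sum(e.accepted and not e.rewritten_after_accept for e in events)
    return accepted / shown, kept / shown


# Assumed data: 500 suggestions shown, 170 accepted, 102 of those rewritten.
events = (
    [SuggestionEvent(True, True)] * 102
    + [SuggestionEvent(True, False)] * 68
    + [SuggestionEvent(False, False)] * 330
)

raw_rate, retained_rate = acceptance_rates(events)
print(f"raw acceptance: {raw_rate:.1%}, retained: {retained_rate:.1%}")
```

In this sample the dashboard number is 34%, but fewer than 14% of suggestions actually survive in the user's output.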

Quota Starvation: When Your AI Features Eat Each Other's Rate Limits

· 11 min read
Tian Pan
Software Engineer

At 2 AM, a scheduled report-generation job spins up fifty parallel LLM requests against your shared API key. By the time the 9 AM product demo starts, every real-time chat completion is silently timing out. Your error dashboards are green. No 429s in the logs. The model is returning responses — just ten seconds late, on a feature with a two-second SLA.

This is quota starvation. It does not look like an outage. It looks like the AI is "slow today."

The RAG Eval Invalidation Paradox: Why Updating Your Knowledge Base Breaks Your Benchmarks

· 10 min read
Tian Pan
Software Engineer

Your RAG eval suite passes at 0.89 faithfulness. You add 5,000 new support documents to the knowledge base. You re-run the same evals. Faithfulness drops to 0.79. Your team files a model regression ticket.

Nothing regressed. Your eval just became a lie.

This is the RAG eval invalidation paradox: the moment you update your knowledge base, the evaluation set you built against the old index silently stops measuring what it was designed to measure. Most teams discover this months later — after burning engineering cycles on phantom regressions — if they ever discover it at all.