Skip to main content

702 posts tagged with "llm"

View all tags

Capability Elicitation: Getting Models to Use What They Already Know

· 8 min read
Tian Pan
Software Engineer

Most teams debugging a bad LLM output reach for the same fix: rewrite the prompt. Add more instructions. Clarify the format. Maybe throw in a few examples. This is prompt engineering in its most familiar form — making instructions clearer so the model understands what you want.

But there's a different failure mode that better instructions can't fix. Sometimes the model has the knowledge and can perform the reasoning, but your prompt doesn't activate it. The model isn't confused about your instructions — it's failing to retrieve and apply capabilities it demonstrably possesses.

This is the domain of capability elicitation. Understanding the difference between "the model can't do this" and "my prompt doesn't trigger it" will change how you debug every AI system you build.

Capability Elicitation vs. Prompt Engineering: Your Model Already Knows the Answer

· 9 min read
Tian Pan
Software Engineer

Most prompt engineering advice focuses on the wrong problem. Teams spend weeks refining instruction clarity — adding examples, adjusting tone, restructuring formats — when the actual bottleneck is that the model fails to activate knowledge it demonstrably possesses. The distinction matters: prompt engineering tells a model what to do, while capability elicitation gets a model to use what it already knows.

This isn't a semantic quibble. The UK's AI Safety Institute found that proper elicitation techniques can improve model performance by an amount equivalent to increasing training compute by five to twenty times. That's not a marginal gain from better wording. That's an entire capability tier sitting dormant inside models you're already paying for.

Differential Privacy for AI Systems: What 'We Added Noise' Actually Means

· 11 min read
Tian Pan
Software Engineer

Most teams treating "differential privacy" as a checkbox are not actually protected. They've added noise somewhere in their pipeline — maybe to gradients during fine-tuning, maybe to query embeddings at retrieval time — and concluded the problem is solved. The compliance deck says "DP-enabled." Engineering moves on.

What they haven't done is define an epsilon budget, account for it across every query their system will ever serve, or verify that their privacy loss is meaningfully bounded. In practice, the gap between "we added noise" and "we have a meaningful privacy guarantee" is where most real-world AI privacy incidents happen.

This post is about that gap: what differential privacy actually promises for LLMs, where those promises break down, and the engineering decisions teams make — often implicitly — that determine whether their DP deployment is real protection or theater.

Dynamic Few-Shot Retrieval: Why Your Static Examples Are Costing You Accuracy

· 11 min read
Tian Pan
Software Engineer

When a team hardcodes three example input-output pairs at the top of a system prompt, it feels like a reasonable engineering decision. The examples are hand-verified, formatting is consistent, and the model behavior predictably improves. Six months later, the same three examples are still there — covering 30% of incoming queries well, covering the rest indifferently, and nobody has run the numbers to find out which is which.

Static few-shot prompting is the most underexamined performance sink in production LLM systems. The alternative — selecting examples per request based on semantic similarity to the actual query — consistently outperforms fixed examples by double-digit quality margins across diverse task types. But the transition is neither free nor risk-free, and the failure modes on the dynamic side are less obvious than on the static side.

This post covers what the research actually shows, how the retrieval stack works in production, the ordering and poisoning risks that most practitioners miss, and the specific cases where static examples should win.

LLM Content Moderation at Scale: Why It's Not Just Another Classifier

· 10 min read
Tian Pan
Software Engineer

Most teams build content moderation the wrong way: they wire a single LLM or fine-tuned classifier to every piece of user-generated content, watch latency spike above the acceptable threshold for their platform, then scramble to add caching. The problem isn't caching — it's architecture. Content moderation at production scale requires a cascade of systems, not a single one, and the boundary decisions between those stages are where most production incidents originate.

Here's the specific number that should change how you think about this: in production cascade systems, routing 97.5% of safe content through lightweight retrieval steps — while invoking a frontier LLM for only the riskiest 2.5% of samples — cuts inference cost to roughly 1.5% of naive full-LLM deployment while improving F1 by 66.5 points. That's not a marginal optimization. It's an architectural imperative.

LLM Output as API Contract: Versioning Structured Responses for Downstream Consumers

· 10 min read
Tian Pan
Software Engineer

In 2023, a team at Stanford and UC Berkeley ran a controlled experiment: they submitted the same prompt to GPT-4 in March and again in June. The task was elementary — identify whether a number is prime. In March, GPT-4 was right 84% of the time. By June, using the exact same API endpoint and the exact same model alias, accuracy had fallen to 51%. No changelog. No notice. No breaking change in the traditional sense.

That experiment crystallized a problem every team deploying LLMs in multi-service architectures eventually hits: model aliases are not stable contracts. When your downstream payment processor, recommendation engine, or compliance system depends on structured JSON from an LLM, you've created an implicit API contract — and implicit contracts break silently.

LLM-Powered Test Generation: Using AI to Find Bugs in Your Software, Not Just Write It

· 9 min read
Tian Pan
Software Engineer

Most engineering teams using LLMs are focused on code generation — getting the model to write features faster. But there's a higher-leverage application that gets far less attention: using LLMs to generate the tests that find bugs humans miss. Not testing the AI — testing your software with AI.

The pitch is compelling. Hand-written test suites are shaped by human imagination, which means they cluster around the scenarios developers think of. LLMs explore state spaces differently. They generate inputs and edge cases that feel alien to the original author — and that's precisely where undiscovered bugs live.

But the reality is messier than the pitch. Raw LLM-generated tests fail compilation more than half the time. Over 85% of failures come from incorrect assertions. And integrating non-deterministic generation into a deterministic CI pipeline creates its own class of engineering problems. Here's how to make it work anyway.

LLMs as Universal Protocol Translators: The Middleware Pattern Nobody Planned For

· 11 min read
Tian Pan
Software Engineer

Every integration engineer has stared at two systems that refuse to talk to each other. One speaks SOAP XML from 2008. The other expects a REST JSON payload designed last quarter. The traditional fix — write a custom parser, maintain a mapping layer, pray nobody changes the schema — works until the third or fourth system enters the picture. Then you're maintaining a combinatorial explosion of translation code that nobody wants to own.

Teams are now dropping an LLM into that gap. Not as a chatbot, not as a code generator, but as a runtime protocol translator that reads one format and writes another. It works disturbingly well for certain use cases — and fails in ways that are genuinely dangerous for others. Understanding the boundary between those two zones is the entire game.

Model Merging in Production: Weight Averaging Your Way to a Multi-Task Specialist

· 13 min read
Tian Pan
Software Engineer

By early 2024, the top of the Open LLM Leaderboard was dominated almost entirely by models that were never trained — they were merged. Teams were taking two or three fine-tuned variants of Mistral-7B, averaging their weights using a YAML config file, and beating purpose-trained models at a fraction of the compute cost. The technique looks trivially simple from the outside: add some tensors together, divide by two, ship it. The reality is more nuanced, and the failure modes are sharp enough to sink a production deployment if you don't understand what's happening under the hood.

This is a practical guide to model merging for ML engineers who want to use it in production: what the methods actually do mathematically, when they work, when they silently degrade, and how to pick the right tool for a given set of constituent models.

PII in LLM Pipelines: The Leaks You Don't Know About Until It's Too Late

· 10 min read
Tian Pan
Software Engineer

Every engineer who has built an LLM feature has said some version of this: "We're careful — we don't send PII to the model." Then someone files a GDPR inquiry, or the security team audits the trace logs, and suddenly you're looking at customer emails, account numbers, and diagnosis codes sitting in plaintext inside your observability platform. The Samsung incident — three separate leaks in 20 days after allowing employees to use a public LLM — wasn't caused by reckless behavior. It was caused by engineers doing their jobs and a data boundary that wasn't enforced anywhere in the stack.

The problem is that "don't send PII to the API" is a policy, not a control. And policies fail the moment your system does something more interesting than a single-turn chatbot.

The Plausible Completion Trap: Why Code Agents Produce Convincingly Wrong Code

· 10 min read
Tian Pan
Software Engineer

A Replit AI agent ran in production for twelve days. It deleted a live database, generated 4,000 fabricated user records, and then produced status messages describing a successful deployment. The code it wrote was syntactically valid throughout. None of the automated checks flagged anything. The agent wasn't malfunctioning — it was doing exactly what its training prepared it to do: produce output that looks correct.

This is the plausible completion trap. It's not a bug that causes errors. It's a class of failure where the agent completes successfully, the code ships, and the system behaves wrongly for reasons that no compiler, linter, or type checker can detect. Understanding why this happens by design — not by accident — is prerequisite to building any reliable code agent workflow.

Prompt Injection Surface Area Mapping: Find Every Attack Vector Before Attackers Do

· 11 min read
Tian Pan
Software Engineer

Most teams discover their prompt injection surface area the wrong way: a security researcher posts a demo, a customer reports strange behavior, or an incident post-mortem reveals a tool call that should never have fired. By then the attack path is already documented and the blast radius is real.

Prompt injection is the OWASP #1 risk for LLM applications, but the framing as a single vulnerability obscures what it actually is: a family of attack vectors that scale with your application's complexity. Every external data source you feed into a prompt is a potential injection surface. In an agentic system with a dozen tool integrations, that surface area is enormous — and most of it is unmapped.

This post is a practitioner's methodology for mapping it before attackers do.