Skip to main content

639 posts tagged with "llm"

View all tags

Dynamic Few-Shot Retrieval: Why Your Static Examples Are Costing You Accuracy

· 11 min read
Tian Pan
Software Engineer

When a team hardcodes three example input-output pairs at the top of a system prompt, it feels like a reasonable engineering decision. The examples are hand-verified, formatting is consistent, and the model behavior predictably improves. Six months later, the same three examples are still there — covering 30% of incoming queries well, covering the rest indifferently, and nobody has run the numbers to find out which is which.

Static few-shot prompting is the most underexamined performance sink in production LLM systems. The alternative — selecting examples per request based on semantic similarity to the actual query — consistently outperforms fixed examples by double-digit quality margins across diverse task types. But the transition is neither free nor risk-free, and the failure modes on the dynamic side are less obvious than on the static side.

This post covers what the research actually shows, how the retrieval stack works in production, the ordering and poisoning risks that most practitioners miss, and the specific cases where static examples should win.

LLM Content Moderation at Scale: Why It's Not Just Another Classifier

· 10 min read
Tian Pan
Software Engineer

Most teams build content moderation the wrong way: they wire a single LLM or fine-tuned classifier to every piece of user-generated content, watch latency spike above the acceptable threshold for their platform, then scramble to add caching. The problem isn't caching — it's architecture. Content moderation at production scale requires a cascade of systems, not a single one, and the boundary decisions between those stages are where most production incidents originate.

Here's the specific number that should change how you think about this: in production cascade systems, routing 97.5% of safe content through lightweight retrieval steps — while invoking a frontier LLM for only the riskiest 2.5% of samples — cuts inference cost to roughly 1.5% of naive full-LLM deployment while improving F1 by 66.5 points. That's not a marginal optimization. It's an architectural imperative.

LLM Output as API Contract: Versioning Structured Responses for Downstream Consumers

· 10 min read
Tian Pan
Software Engineer

In 2023, a team at Stanford and UC Berkeley ran a controlled experiment: they submitted the same prompt to GPT-4 in March and again in June. The task was elementary — identify whether a number is prime. In March, GPT-4 was right 84% of the time. By June, using the exact same API endpoint and the exact same model alias, accuracy had fallen to 51%. No changelog. No notice. No breaking change in the traditional sense.

That experiment crystallized a problem every team deploying LLMs in multi-service architectures eventually hits: model aliases are not stable contracts. When your downstream payment processor, recommendation engine, or compliance system depends on structured JSON from an LLM, you've created an implicit API contract — and implicit contracts break silently.

LLM-Powered Test Generation: Using AI to Find Bugs in Your Software, Not Just Write It

· 9 min read
Tian Pan
Software Engineer

Most engineering teams using LLMs are focused on code generation — getting the model to write features faster. But there's a higher-leverage application that gets far less attention: using LLMs to generate the tests that find bugs humans miss. Not testing the AI — testing your software with AI.

The pitch is compelling. Hand-written test suites are shaped by human imagination, which means they cluster around the scenarios developers think of. LLMs explore state spaces differently. They generate inputs and edge cases that feel alien to the original author — and that's precisely where undiscovered bugs live.

But the reality is messier than the pitch. Raw LLM-generated tests fail compilation more than half the time. Over 85% of failures come from incorrect assertions. And integrating non-deterministic generation into a deterministic CI pipeline creates its own class of engineering problems. Here's how to make it work anyway.

LLMs as Universal Protocol Translators: The Middleware Pattern Nobody Planned For

· 11 min read
Tian Pan
Software Engineer

Every integration engineer has stared at two systems that refuse to talk to each other. One speaks SOAP XML from 2008. The other expects a REST JSON payload designed last quarter. The traditional fix — write a custom parser, maintain a mapping layer, pray nobody changes the schema — works until the third or fourth system enters the picture. Then you're maintaining a combinatorial explosion of translation code that nobody wants to own.

Teams are now dropping an LLM into that gap. Not as a chatbot, not as a code generator, but as a runtime protocol translator that reads one format and writes another. It works disturbingly well for certain use cases — and fails in ways that are genuinely dangerous for others. Understanding the boundary between those two zones is the entire game.

Model Merging in Production: Weight Averaging Your Way to a Multi-Task Specialist

· 13 min read
Tian Pan
Software Engineer

By early 2024, the top of the Open LLM Leaderboard was dominated almost entirely by models that were never trained — they were merged. Teams were taking two or three fine-tuned variants of Mistral-7B, averaging their weights using a YAML config file, and beating purpose-trained models at a fraction of the compute cost. The technique looks trivially simple from the outside: add some tensors together, divide by two, ship it. The reality is more nuanced, and the failure modes are sharp enough to sink a production deployment if you don't understand what's happening under the hood.

This is a practical guide to model merging for ML engineers who want to use it in production: what the methods actually do mathematically, when they work, when they silently degrade, and how to pick the right tool for a given set of constituent models.

PII in LLM Pipelines: The Leaks You Don't Know About Until It's Too Late

· 10 min read
Tian Pan
Software Engineer

Every engineer who has built an LLM feature has said some version of this: "We're careful — we don't send PII to the model." Then someone files a GDPR inquiry, or the security team audits the trace logs, and suddenly you're looking at customer emails, account numbers, and diagnosis codes sitting in plaintext inside your observability platform. The Samsung incident — three separate leaks in 20 days after allowing employees to use a public LLM — wasn't caused by reckless behavior. It was caused by engineers doing their jobs and a data boundary that wasn't enforced anywhere in the stack.

The problem is that "don't send PII to the API" is a policy, not a control. And policies fail the moment your system does something more interesting than a single-turn chatbot.

The Plausible Completion Trap: Why Code Agents Produce Convincingly Wrong Code

· 10 min read
Tian Pan
Software Engineer

A Replit AI agent ran in production for twelve days. It deleted a live database, generated 4,000 fabricated user records, and then produced status messages describing a successful deployment. The code it wrote was syntactically valid throughout. None of the automated checks flagged anything. The agent wasn't malfunctioning — it was doing exactly what its training prepared it to do: produce output that looks correct.

This is the plausible completion trap. It's not a bug that causes errors. It's a class of failure where the agent completes successfully, the code ships, and the system behaves wrongly for reasons that no compiler, linter, or type checker can detect. Understanding why this happens by design — not by accident — is prerequisite to building any reliable code agent workflow.

Prompt Injection Surface Area Mapping: Find Every Attack Vector Before Attackers Do

· 11 min read
Tian Pan
Software Engineer

Most teams discover their prompt injection surface area the wrong way: a security researcher posts a demo, a customer reports strange behavior, or an incident post-mortem reveals a tool call that should never have fired. By then the attack path is already documented and the blast radius is real.

Prompt injection is the OWASP #1 risk for LLM applications, but the framing as a single vulnerability obscures what it actually is: a family of attack vectors that scale with your application's complexity. Every external data source you feed into a prompt is a potential injection surface. In an agentic system with a dozen tool integrations, that surface area is enormous — and most of it is unmapped.

This post is a practitioner's methodology for mapping it before attackers do.

Property-Based Testing for LLM Systems: Invariants That Hold Even When Outputs Don't

· 12 min read
Tian Pan
Software Engineer

A product team at a fintech company shipped an LLM-powered document summarizer. Their eval dataset — 200 hand-curated examples with human ratings — scored 87% quality. In production, the system occasionally returned summaries longer than the original documents when users uploaded short memos. The eval set had no memos under 300 words. The property "output length ≤ input length for summarization tasks" was never tested. Nobody noticed until a customer screenshotted the absurdity and posted it online.

This is the fundamental gap that property-based testing (PBT) fills. Eval datasets measure accuracy on what you thought to test. Property-based tests measure whether your system obeys a contract across the entire space of what could happen.

Coalesce Before You Call: The LLM Request Batching Pattern That Cuts Costs Without Slowing Users Down

· 11 min read
Tian Pan
Software Engineer

Most teams discover request coalescing the same way: through a surprisingly large invoice. They ship an LLM-backed feature, usage grows, and then the billing dashboard shows they're paying for fifty thousand requests a day when closer examination reveals that roughly thirty thousand of them were asking the same thing in slightly different words. Each paraphrase of "summarize this document" hit the model separately. Each near-duplicate triggered a full inference cycle. The cost scaled with traffic volume, not with the semantic diversity of what users actually wanted.

Request coalescing is the pattern that fixes this. It is not one technique but a layered architecture: in-flight deduplication to prevent concurrent duplicates, exact caching for repeated identical prompts, and semantic batching to catch the paraphrased variations in between. The order matters, the thresholds matter, and understanding where the pattern breaks down — particularly around streaming — is what separates a working implementation from one that saves money on a staging server but causes subtle bugs in production.

Schema-Driven Prompt Design: Letting Your Data Model Drive Your Prompt Structure

· 10 min read
Tian Pan
Software Engineer

Your data schema is your prompt. Most engineers treat these as separate concerns — you design your database schema to satisfy normal form rules, and you design your prompts to be clear and descriptive. But the shape of your entity schema has a direct, measurable effect on LLM output quality, and ignoring this relationship is one of the most expensive mistakes in production AI systems.

A team at a mid-sized e-commerce company discovered this when their product extraction pipeline started generating hallucinated model years. The fix wasn't better prompting. It was changing {"model": {"type": "string"}} to a field with an explicit description and a regex constraint. That single schema change — documented in the PARSE research — drove accuracy improvements of up to 64.7% on their extraction benchmark.