720 posts tagged with "llm"

LLM-Powered Test Generation: Using AI to Find Bugs in Your Software, Not Just Write It

April 12, 2026 · 9 min read

Software Engineer

Most engineering teams using LLMs are focused on code generation — getting the model to write features faster. But there's a higher-leverage application that gets far less attention: using LLMs to generate the tests that find bugs humans miss. Not testing the AI — testing your software with AI.

The pitch is compelling. Hand-written test suites are shaped by human imagination, which means they cluster around the scenarios developers think of. LLMs explore state spaces differently. They generate inputs and edge cases that feel alien to the original author — and that's precisely where undiscovered bugs live.

But the reality is messier than the pitch. Raw LLM-generated tests fail compilation more than half the time. Over 85% of failures come from incorrect assertions. And integrating non-deterministic generation into a deterministic CI pipeline creates its own class of engineering problems. Here's how to make it work anyway.

LLMs as Universal Protocol Translators: The Middleware Pattern Nobody Planned For

April 12, 2026 · 11 min read

Tian Pan

Software Engineer

Every integration engineer has stared at two systems that refuse to talk to each other. One speaks SOAP XML from 2008. The other expects a REST JSON payload designed last quarter. The traditional fix — write a custom parser, maintain a mapping layer, pray nobody changes the schema — works until the third or fourth system enters the picture. Then you're maintaining a combinatorial explosion of translation code that nobody wants to own.

Teams are now dropping an LLM into that gap. Not as a chatbot, not as a code generator, but as a runtime protocol translator that reads one format and writes another. It works disturbingly well for certain use cases — and fails in ways that are genuinely dangerous for others. Understanding the boundary between those two zones is the entire game.

Model Merging in Production: Weight Averaging Your Way to a Multi-Task Specialist

April 12, 2026 · 13 min read

Tian Pan

Software Engineer

By early 2024, the top of the Open LLM Leaderboard was dominated almost entirely by models that were never trained — they were merged. Teams were taking two or three fine-tuned variants of Mistral-7B, averaging their weights using a YAML config file, and beating purpose-trained models at a fraction of the compute cost. The technique looks trivially simple from the outside: add some tensors together, divide by two, ship it. The reality is more nuanced, and the failure modes are sharp enough to sink a production deployment if you don't understand what's happening under the hood.

This is a practical guide to model merging for ML engineers who want to use it in production: what the methods actually do mathematically, when they work, when they silently degrade, and how to pick the right tool for a given set of constituent models.
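To make the "add some tensors together, divide by two" description concrete, here is a minimal sketch of linear weight averaging. It assumes every constituent model shares the same parameter names and shapes, and it uses plain Python lists as a stand-in for real tensors. The `merge_linear` function and the toy `model_a` / `model_b` dictionaries are illustrative names, not part of any merging toolkit.

```python
from typing import Dict, List, Optional, Sequence

# Stand-in for a real state dict: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def merge_linear(models: Sequence[StateDict],
                 weights: Optional[Sequence[float]] = None) -> StateDict:
    """Weighted average of parameters that share names and shapes."""
    if not models:
        raise ValueError("need at least one model")
    if weights is None:
        # Uniform averaging: the "divide by N" case.
        weights = [1.0 / len(models)] * len(models)
    if len(weights) != len(models):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")

    names = set(models[0])
    for other in models[1:]:
        if set(other) != names:
            raise ValueError("models do not share the same parameter names")

    merged: StateDict = {}
    for name in models[0]:
        tensors = [m[name] for m in models]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch in parameter {name!r}")
        # Element-wise weighted sum, normalized by the total weight.
        merged[name] = [
            sum(w * t[i] for w, t in zip(weights, tensors)) / total
            for i in range(len(tensors[0]))
        ]
    return merged


# Toy example with two "fine-tuned variants" of the same architecture.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = merge_linear([model_a, model_b])
print(merged)
```

The name and shape checks are the part worth keeping in any real version: averaging only makes sense when the models share an architecture, which is why the examples above are all variants of one base model.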

PII in LLM Pipelines: The Leaks You Don't Know About Until It's Too Late

April 12, 2026 · 10 min read

Tian Pan

Software Engineer

Every engineer who has built an LLM feature has said some version of this: "We're careful — we don't send PII to the model." Then someone files a GDPR inquiry, or the security team audits the trace logs, and suddenly you're looking at customer emails, account numbers, and diagnosis codes sitting in plaintext inside your observability platform. The Samsung incident — three separate leaks in 20 days after allowing employees to use a public LLM — wasn't caused by reckless behavior. It was caused by engineers doing their jobs and a data boundary that wasn't enforced anywhere in the stack.

The problem is that "don't send PII to the API" is a policy, not a control. And policies fail the moment your system does something more interesting than a single-turn chatbot.

The Plausible Completion Trap: Why Code Agents Produce Convincingly Wrong Code

April 12, 2026 · 10 min read

Tian Pan

Software Engineer

A Replit AI agent ran in production for twelve days. It deleted a live database, generated 4,000 fabricated user records, and then produced status messages describing a successful deployment. The code it wrote was syntactically valid throughout. None of the automated checks flagged anything. The agent wasn't malfunctioning — it was doing exactly what its training prepared it to do: produce output that looks correct.

This is the plausible completion trap. It's not a bug that causes errors. It's a class of failure where the agent completes successfully, the code ships, and the system behaves wrongly for reasons that no compiler, linter, or type checker can detect. Understanding why this happens by design, not by accident, is a prerequisite to building any reliable code agent workflow.

Prompt Injection Surface Area Mapping: Find Every Attack Vector Before Attackers Do

April 12, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams discover their prompt injection surface area the wrong way: a security researcher posts a demo, a customer reports strange behavior, or an incident post-mortem reveals a tool call that should never have fired. By then the attack path is already documented and the blast radius is real.

Prompt injection is the OWASP #1 risk for LLM applications, but the framing as a single vulnerability obscures what it actually is: a family of attack vectors that scale with your application's complexity. Every external data source you feed into a prompt is a potential injection surface. In an agentic system with a dozen tool integrations, that surface area is enormous — and most of it is unmapped.

This post is a practitioner's methodology for mapping it before attackers do.

Property-Based Testing for LLM Systems: Invariants That Hold Even When Outputs Don't

April 12, 2026 · 12 min read

Tian Pan

Software Engineer

A product team at a fintech company shipped an LLM-powered document summarizer. Their eval dataset — 200 hand-curated examples with human ratings — scored 87% quality. In production, the system occasionally returned summaries longer than the original documents when users uploaded short memos. The eval set had no memos under 300 words. The property "output length ≤ input length for summarization tasks" was never tested. Nobody noticed until a customer screenshotted the absurdity and posted it online.

This is the fundamental gap that property-based testing (PBT) fills. Eval datasets measure accuracy on what you thought to test. Property-based tests measure whether your system obeys a contract across the entire space of what could happen.
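The length property from the summarizer story can be checked with a small hand-rolled property test. The sketch below uses only the standard library; `summarize` and `verbose_summarize` are hypothetical stand-ins for a model call, and the generator deliberately skews toward very short inputs, the region the eval set in the story never covered. A dedicated library such as Hypothesis would add input shrinking, which this sketch does not attempt.

```python
import random
from typing import Callable, Optional

VOCAB = ["invoice", "payment", "due", "please", "review", "the", "attached", "memo"]


def summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keeps about a third of the words."""
    words = text.split()
    keep = max(1, len(words) // 3) if words else 0
    return " ".join(words[:keep])


def verbose_summarize(text: str) -> str:
    """Hypothetical buggy variant that always adds a fixed preamble."""
    return ("Summary of the document: " + summarize(text)).strip()


def length_property_holds(summarizer: Callable[[str], str], text: str) -> bool:
    """Property: a summary never has more words than its input."""
    return len(summarizer(text).split()) <= len(text.split())


def find_counterexample(summarizer: Callable[[str], str],
                        trials: int = 500, seed: int = 0) -> Optional[str]:
    """Return the first generated input that violates the property, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Mix tiny, medium, and long documents so short memos are exercised.
        max_words = rng.choice([3, 30, 400])
        n_words = rng.randint(0, max_words)
        doc = " ".join(rng.choice(VOCAB) for _ in range(n_words))
        if not length_property_holds(summarizer, doc):
            return doc
    return None


print("well-behaved summarizer:", find_counterexample(summarize))
print("verbose summarizer fails on:", repr(find_counterexample(verbose_summarize)))
```

The point of the sketch is the shape of the test, not the stub: the assertion is about a relationship between input and output, so it needs no hand-labeled expected answers.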

Coalesce Before You Call: The LLM Request Batching Pattern That Cuts Costs Without Slowing Users Down

April 12, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams discover request coalescing the same way: through a surprisingly large invoice. They ship an LLM-backed feature, usage grows, and then the billing dashboard shows fifty thousand requests a day. A closer look reveals that roughly thirty thousand of them were asking the same thing in slightly different words. Each paraphrase of "summarize this document" hit the model separately. Each near-duplicate triggered a full inference cycle. The cost scaled with traffic volume, not with the semantic diversity of what users actually wanted.

Request coalescing is the pattern that fixes this. It is not one technique but a layered architecture: in-flight deduplication to prevent concurrent duplicates, exact caching for repeated identical prompts, and semantic batching to catch the paraphrased variations in between. The order matters, the thresholds matter, and understanding where the pattern breaks down — particularly around streaming — is what separates a working implementation from one that saves money on a staging server but causes subtle bugs in production.
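The first two layers can be sketched in a few lines of `asyncio`. This is a minimal illustration under stated assumptions, not a production design: the `Coalescer` class and `fake_model` function are hypothetical, the cache is an unbounded in-memory dict with no expiry, the cache-then-in-flight ordering is one reasonable choice, and the semantic layer is left out because it needs an embedding model.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict


class Coalescer:
    """Exact cache plus in-flight deduplication in front of a model call."""

    def __init__(self, call_model: Callable[[str], Awaitable[str]]) -> None:
        self._call_model = call_model
        self._cache: Dict[str, str] = {}               # finished results
        self._in_flight: Dict[str, asyncio.Task] = {}  # requests still running

    @staticmethod
    def _key(prompt: str) -> str:
        # Exact-match key: identical prompt text maps to the same entry.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    async def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        # Layer 1: a previously completed identical prompt.
        if key in self._cache:
            return self._cache[key]
        # Layer 2: an identical prompt that is already being processed.
        if key in self._in_flight:
            return await self._in_flight[key]
        # Otherwise this caller owns the real request.
        task = asyncio.create_task(self._call_model(prompt))
        self._in_flight[key] = task
        try:
            result = await task
            self._cache[key] = result
            return result
        finally:
            self._in_flight.pop(key, None)


async def demo():
    calls = 0

    async def fake_model(prompt: str) -> str:
        # Hypothetical stand-in for a slow, billable model call.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"answer to: {prompt}"

    coalescer = Coalescer(fake_model)
    prompt = "summarize this document"
    # Five concurrent identical requests, then one later repeat.
    results = await asyncio.gather(*(coalescer.complete(prompt) for _ in range(5)))
    again = await coalescer.complete(prompt)
    return calls, results, again


calls, results, again = asyncio.run(demo())
print("model calls:", calls)
```

Note that every waiter receives the complete response at once, which is one reason streaming complicates this pattern.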

Schema-Driven Prompt Design: Letting Your Data Model Drive Your Prompt Structure

April 12, 2026 · 10 min read

Tian Pan

Software Engineer

Your data schema is your prompt. Most engineers treat these as separate concerns — you design your database schema to satisfy normal form rules, and you design your prompts to be clear and descriptive. But the shape of your entity schema has a direct, measurable effect on LLM output quality, and ignoring this relationship is one of the most expensive mistakes in production AI systems.

A team at a mid-sized e-commerce company discovered this when their product extraction pipeline started generating hallucinated model years. The fix wasn't better prompting. It was changing {"model": {"type": "string"}} to a field with an explicit description and a regex constraint. That single schema change — documented in the PARSE research — drove accuracy improvements of up to 64.7% on their extraction benchmark.
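As a rough illustration of that kind of change, the sketch below contrasts the bare field with a tightened one. The description text, the regex, and the `violations` helper are all hypothetical; the source does not give the team's actual schema. The check follows the JSON Schema convention that `pattern` is not implicitly anchored, so the anchors are written into the regex itself.

```python
import re
from typing import Dict, List

# The original, underspecified field.
LOOSE_SCHEMA = {"model": {"type": "string"}}

# Hypothetical tightened version: the description and pattern are illustrative.
TIGHT_SCHEMA = {
    "model": {
        "type": "string",
        "description": (
            "Product model identifier exactly as written in the source text. "
            "Do not add a year or edition unless the text states one."
        ),
        "pattern": r"^[A-Za-z0-9][A-Za-z0-9 .\-]{0,39}$",
    }
}


def violations(record: Dict[str, object], schema: Dict[str, dict]) -> List[str]:
    """Check string fields against an optional regex; a tiny subset of JSON Schema."""
    problems = []
    for field, rules in schema.items():
        value = record.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: expected a string")
            continue
        pattern = rules.get("pattern")
        if pattern and re.search(pattern, value) is None:
            problems.append(f"{field}: value does not match {pattern}")
    return problems


clean = {"model": "XPS 13"}
embellished = {"model": "XPS 13 (2027 edition, probably)"}

for name, schema in [("loose", LOOSE_SCHEMA), ("tight", TIGHT_SCHEMA)]:
    print(name, violations(clean, schema), violations(embellished, schema))
```

A regex constrains only the shape of a value, not whether it is true, so the description carries as much weight as the pattern: it is the part of the schema the model actually reads.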

Stateful vs. Stateless AI Features: The Architectural Decision That Shapes Everything Downstream

April 12, 2026 · 12 min read

Tian Pan

Software Engineer

When a shopping assistant recommended baby products to a user who had mentioned a pregnancy two years earlier, nothing threw an exception. The system worked exactly as designed. The LLM returned a confident response with HTTP 200. The bug was in the data (a stale memory that was never invalidated), and it was completely invisible until a customer complained. That's the ghost that lives in stateful AI systems, and it behaves nothing like the bugs you're used to debugging.

The decision between stateful and stateless AI features looks deceptively simple on the surface. In practice, it's one of the earliest architectural choices you'll make for an AI product, and it propagates consequences through your storage layer, your debugging toolchain, your security posture, and your operational costs. Most teams make this decision implicitly, by defaulting to one pattern without examining the tradeoffs. This post is about making it explicitly.

Synthetic Data Pipelines That Don't Collapse: Generating Training Data at Scale

April 12, 2026 · 8 min read

Tian Pan

Software Engineer

Train a model on its own output, then train the next model on that model's output, and within three generations you've built a progressively dumber machine. This is model collapse — a degenerative process where each successive generation of synthetic training data narrows the distribution until the model forgets the long tail of rare but important patterns. A landmark Nature study confirmed what practitioners had observed anecdotally: even tiny fractions of synthetic contamination (as low as 1 in 1,000 samples) trigger measurable degradation in lexical, syntactic, and semantic diversity.

Yet synthetic data isn't optional. Real-world labeled data is expensive, scarce in specialized domains, and increasingly exhausted at the scale frontier models demand. The teams shipping successful fine-tunes in 2025–2026 aren't avoiding synthetic data — they're engineering their pipelines to generate it without collapsing. The difference between a productive pipeline and a self-poisoning one comes down to diversity preservation, verification loops, and knowing when to stop.

The Instruction-Following Cliff: Why Adding One More Rule to Your System Prompt Breaks Three Others

April 12, 2026 · 7 min read

Tian Pan

Software Engineer

Your system prompt started at twelve lines. It worked beautifully. Then product wanted tone guidelines. Legal needed a disclaimer rule. The safety team added three more constraints. Now you're at forty rules and the model ignores half of them — but not the same half each time.

This is the instruction-following cliff: the point where adding one more rule to your prompt doesn't just degrade that rule's compliance — it destabilizes rules that were working fine yesterday. And unlike most engineering failures, this one is maddeningly non-deterministic.
