Skip to main content

720 posts tagged with "llm"

View all tags

Open-Weight Models in Production: When Self-Hosting Actually Beats the API

· 8 min read
Tian Pan
Software Engineer

Every few months, someone on your team forwards a blog post about how Llama or Qwen "matches GPT-4" on some benchmark, followed by the inevitable question: "Why are we paying for API calls when we could just run this ourselves?" The math looks compelling on a napkin. The reality is that most teams who attempt self-hosting end up spending more than they saved, not because the models are bad, but because they underestimated everything that isn't the model.

That said, there are specific situations where self-hosting open-weight models is the clearly correct decision. The trick is knowing which situation you're actually in, rather than the one you wish you were in.

The Post-Framework Era: Build Agents with an API Client and a While Loop

· 8 min read
Tian Pan
Software Engineer

The most effective AI agents in production today look nothing like the framework demos. They are not directed acyclic graphs with seventeen node types. They are not multi-agent swarms coordinating through message buses. They are a prompt, a tool list, and a while loop — and they ship faster, break less, and cost less to maintain than their framework-heavy counterparts.

This is not a contrarian take for its own sake. It is the conclusion that team after team reaches after burning weeks on framework migration, abstraction debugging, and DSL archaeology. The pattern is so consistent it deserves a name: the post-framework era.

A/B Testing Non-Deterministic AI Features: Why Your Experimentation Framework Assumes the Wrong Null Hypothesis

· 10 min read
Tian Pan
Software Engineer

Your A/B testing framework was built for a world where the same input produces the same output. Change a button color, measure click-through rate, compute a p-value. The variance comes from user behavior, not from the feature itself. But when you ship an AI feature — a chatbot, a summarizer, a code assistant — the treatment arm has its own built-in randomness. Run the same prompt twice, get two different answers. Your experimentation infrastructure was never designed for this, and the consequences are worse than you think.

Most teams discover the problem the hard way: experiments that never reach significance, or worse, experiments that reach significance on noise. The standard A/B testing playbook doesn't just underperform with non-deterministic features — it actively misleads.

The Five Gates Your AI Demo Skipped: A Launch Readiness Checklist for LLM Features

· 12 min read
Tian Pan
Software Engineer

There's a pattern that repeats across AI feature launches: the demo wows the room, the feature ships, and within two weeks something catastrophic happens. Not a crash — those are easy to catch. Something subtler: the model confidently generates wrong information, costs spiral three times over projection, or latency spikes under real load make the feature unusable. The team scrambles, the feature gets quietly disabled, and everyone agrees to "do it better next time."

The problem isn't that the demo was bad. The problem is that the demo was the only test that mattered.

AI in the SRE Loop: What Works, What Breaks, and Where to Draw the Line

· 12 min read
Tian Pan
Software Engineer

Most production incidents don't fail because of missing tools. They fail because the person holding the pager doesn't have enough context fast enough. An engineer wakes up at 3 AM to a wall of firing alerts, spends the first 20 minutes piecing together what actually broke, another 20 minutes deciding which runbook applies, and by the time they're executing the fix, the incident has been open for nearly an hour. The raw fix might take 5 minutes.

AI can compress that context-gathering window from 40 minutes to under 2. That's the genuine value on the table. But "LLM helps your oncall" is not one product decision — it's a stack of decisions, each with its own failure mode, and some of those failure modes have consequences that a customer service chatbot hallucination doesn't.

Building Multilingual AI Products: The Quality Cliff Nobody Measures

· 11 min read
Tian Pan
Software Engineer

Your AI product scores 82% on your eval suite. You ship to 40 countries. Three months later, French and German users report quality similar to English. Hindi and Arabic users quietly stop using the feature. Your aggregate satisfaction score barely budges — because English-speaking users dominate the metric pool. The cliff was always there. You just weren't measuring it.

This is the default story for most teams shipping multilingual AI products. The quality gap isn't subtle. A state-of-the-art model like QwQ-32B drops from 70.7% on English reasoning benchmarks to 32.8% on Swahili — a 54% relative performance collapse on the best available model tested in 2025. And that's the best model. This gap doesn't disappear as models get larger. It shrinks for high-resource languages and stays wide for everyone else.

Capability Elicitation: Getting Models to Use What They Already Know

· 8 min read
Tian Pan
Software Engineer

Most teams debugging a bad LLM output reach for the same fix: rewrite the prompt. Add more instructions. Clarify the format. Maybe throw in a few examples. This is prompt engineering in its most familiar form — making instructions clearer so the model understands what you want.

But there's a different failure mode that better instructions can't fix. Sometimes the model has the knowledge and can perform the reasoning, but your prompt doesn't activate it. The model isn't confused about your instructions — it's failing to retrieve and apply capabilities it demonstrably possesses.

This is the domain of capability elicitation. Understanding the difference between "the model can't do this" and "my prompt doesn't trigger it" will change how you debug every AI system you build.

Capability Elicitation vs. Prompt Engineering: Your Model Already Knows the Answer

· 9 min read
Tian Pan
Software Engineer

Most prompt engineering advice focuses on the wrong problem. Teams spend weeks refining instruction clarity — adding examples, adjusting tone, restructuring formats — when the actual bottleneck is that the model fails to activate knowledge it demonstrably possesses. The distinction matters: prompt engineering tells a model what to do, while capability elicitation gets a model to use what it already knows.

This isn't a semantic quibble. The UK's AI Safety Institute found that proper elicitation techniques can improve model performance by an amount equivalent to increasing training compute by five to twenty times. That's not a marginal gain from better wording. That's an entire capability tier sitting dormant inside models you're already paying for.

Differential Privacy for AI Systems: What 'We Added Noise' Actually Means

· 11 min read
Tian Pan
Software Engineer

Most teams treating "differential privacy" as a checkbox are not actually protected. They've added noise somewhere in their pipeline — maybe to gradients during fine-tuning, maybe to query embeddings at retrieval time — and concluded the problem is solved. The compliance deck says "DP-enabled." Engineering moves on.

What they haven't done is define an epsilon budget, account for it across every query their system will ever serve, or verify that their privacy loss is meaningfully bounded. In practice, the gap between "we added noise" and "we have a meaningful privacy guarantee" is where most real-world AI privacy incidents happen.

This post is about that gap: what differential privacy actually promises for LLMs, where those promises break down, and the engineering decisions teams make — often implicitly — that determine whether their DP deployment is real protection or theater.

Dynamic Few-Shot Retrieval: Why Your Static Examples Are Costing You Accuracy

· 11 min read
Tian Pan
Software Engineer

When a team hardcodes three example input-output pairs at the top of a system prompt, it feels like a reasonable engineering decision. The examples are hand-verified, formatting is consistent, and the model behavior predictably improves. Six months later, the same three examples are still there — covering 30% of incoming queries well, covering the rest indifferently, and nobody has run the numbers to find out which is which.

Static few-shot prompting is the most underexamined performance sink in production LLM systems. The alternative — selecting examples per request based on semantic similarity to the actual query — consistently outperforms fixed examples by double-digit quality margins across diverse task types. But the transition is neither free nor risk-free, and the failure modes on the dynamic side are less obvious than on the static side.

This post covers what the research actually shows, how the retrieval stack works in production, the ordering and poisoning risks that most practitioners miss, and the specific cases where static examples should win.

LLM Content Moderation at Scale: Why It's Not Just Another Classifier

· 10 min read
Tian Pan
Software Engineer

Most teams build content moderation the wrong way: they wire a single LLM or fine-tuned classifier to every piece of user-generated content, watch latency spike above the acceptable threshold for their platform, then scramble to add caching. The problem isn't caching — it's architecture. Content moderation at production scale requires a cascade of systems, not a single one, and the boundary decisions between those stages are where most production incidents originate.

Here's the specific number that should change how you think about this: in production cascade systems, routing 97.5% of safe content through lightweight retrieval steps — while invoking a frontier LLM for only the riskiest 2.5% of samples — cuts inference cost to roughly 1.5% of naive full-LLM deployment while improving F1 by 66.5 points. That's not a marginal optimization. It's an architectural imperative.

LLM Output as API Contract: Versioning Structured Responses for Downstream Consumers

· 10 min read
Tian Pan
Software Engineer

In 2023, a team at Stanford and UC Berkeley ran a controlled experiment: they submitted the same prompt to GPT-4 in March and again in June. The task was elementary — identify whether a number is prime. In March, GPT-4 was right 84% of the time. By June, using the exact same API endpoint and the exact same model alias, accuracy had fallen to 51%. No changelog. No notice. No breaking change in the traditional sense.

That experiment crystallized a problem every team deploying LLMs in multi-service architectures eventually hits: model aliases are not stable contracts. When your downstream payment processor, recommendation engine, or compliance system depends on structured JSON from an LLM, you've created an implicit API contract — and implicit contracts break silently.