639 posts tagged with "llm"

Fine-Tuning Data Saturation: When Adding Examples Makes Your Model Worse

· 9 min read
Tian Pan
Software Engineer

There's a pattern that repeats across almost every fine-tuning project that runs past the initial demo: the team hits a quality plateau, decides they need more data, adds 50% more examples, retrains, and discovers the model is either identically mediocre or measurably worse. The instinct to add data is sound for most machine learning problems, where more signal generally helps. But fine-tuning has a saturation regime that pre-training does not, and most practitioners don't recognize when they've entered it.

A 2024 study testing LLM fine-tuning on the Qasper dataset found that expanding the training set from 500 to 1,000 examples caused Mixtral's accuracy score to drop from 4.04 to 3.28 and completeness from 3.75 to 2.58. This wasn't a hyperparameter bug. It was data saturation: the model had begun memorizing distribution noise rather than learning generalizable patterns. The team added fuel after the engine had already flooded.
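One way to notice this regime before committing to a larger training run is to track a held-out score at each training-set size and stop where the gain disappears. The sketch below is a minimal illustration, not the study's method: `find_saturation_point` and `min_gain` are hypothetical names, and the only real numbers used are the two accuracy scores quoted above.

```python
from typing import Optional, Sequence, Tuple


def find_saturation_point(
    curve: Sequence[Tuple[int, float]], min_gain: float = 0.0
) -> Optional[int]:
    """Return the last training-set size that was still worth using.

    `curve` holds (training_set_size, held_out_score) pairs. The function
    returns the size just before the first step whose score gain is at or
    below `min_gain`, or None if every step still improved the score.
    """
    points = sorted(curve)  # order by training-set size
    for (prev_size, prev_score), (_, score) in zip(points, points[1:]):
        if score - prev_score <= min_gain:
            return prev_size
    return None


# Accuracy scores quoted above for Mixtral on Qasper.
mixtral_accuracy = [(500, 4.04), (1000, 3.28)]
print(find_saturation_point(mixtral_accuracy))
```

In practice the curve would need more than two points, and the held-out set must stay fixed across runs for the comparison to mean anything.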

The First-Mover Disadvantage in AI: A Framework for Timing Your AI Feature Launch

· 10 min read
Tian Pan
Software Engineer

The conventional wisdom in tech—move fast, ship early, establish moats—turns lethal in AI at a particular moment in the model improvement curve. In 2023, dozens of teams built viable businesses around a single capability: let users upload a PDF and ask questions about it. Then OpenAI added native file upload to ChatGPT. The businesses didn't die because they were slow. They died because they were early.

This isn't an isolated incident. It's a structural feature of building on top of rapidly improving base models, and most launch timing frameworks were designed for slower-moving technology curves. The framework you used to decide when to ship a SaaS feature doesn't translate to AI—the inputs are different and the failure modes are entirely distinct.

The Frozen Feature Trap: When Your AI Differentiator Becomes a Maintenance Anchor

· 9 min read
Tian Pan
Software Engineer

In 2022, a team spent three months fine-tuning a BERT-based classifier to categorize customer support tickets. It was a genuine win — 94% accuracy where their old rule-based system topped out at 70%. Two years later, the same classifier runs on aging infrastructure, requires a specialist to retrain whenever categories shift, and gets beaten on a fresh benchmark by a zero-shot prompt to a frontier model. Nobody wants to touch it. The engineer who built it left. The current team is afraid that deprecating it will break something. The feature is frozen.

This is the frozen feature trap. It's one of the quieter forms of AI technical debt, and it's accumulating across the industry as teams discover that what looked like a moat was actually a hole they've been shoveling money into.

Function Calling vs Code Generation for Agent Actions: The Tradeoffs Nobody Benchmarks

· 10 min read
Tian Pan
Software Engineer

An agent running in production once received the instruction "clean up the test data" and executed a DROP TABLE command against a production database. The tool call succeeded. The audit log showed a perfectly structured JSON payload. The agent had done exactly what it was asked — just not what anyone meant. This isn't a story about prompt injection. It's a story about an architectural choice: the team had given their agent the ability to generate and execute arbitrary code, and they had underestimated what that actually means at runtime.

The choice between function calling and code generation as the action layer for AI agents is one of the most consequential decisions in agent architecture, and almost nobody benchmarks it directly. Papers measure accuracy on task completion; they rarely measure the failure modes that matter in production — silent semantic errors, irreversible side effects, security exposure surface, and debugging cost when something goes wrong.
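To make the contrast concrete, here is a minimal sketch of the function-calling side: the agent can only name a tool from an allowlist, and each tool enforces its own limits. Everything in it is hypothetical (`TOOLS`, `tool`, `delete_test_rows`, `dispatch`, and the `test_` table-name convention); it does not reflect any particular agent framework.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry: the agent can only invoke what is listed here.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as an allowed agent action."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def delete_test_rows(table: str, dry_run: bool = True) -> str:
    # Assumed convention: only tables prefixed with "test_" are disposable.
    if not table.startswith("test_"):
        raise PermissionError(f"refusing to touch non-test table: {table}")
    action = "would delete" if dry_run else "deleted"
    return f"{action} rows in {table}"


def dispatch(payload: str) -> Any:
    """Run a model-produced JSON tool call, rejecting unknown tools."""
    call = json.loads(payload)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


print(dispatch('{"name": "delete_test_rows", "arguments": {"table": "test_users"}}'))
```

With this shape, a request to "clean up the test data" can at worst reach the narrow actions the team registered. With generated code, the reachable surface is whatever the execution environment permits.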

The Generalization Cliff: How Fine-Tuning Creates Silent Capability Regressions

· 9 min read
Tian Pan
Software Engineer

A team at an enterprise software company fine-tuned a 7B model on customer support tickets. The target metric, resolution accuracy, improved by 12 percentage points. The team shipped it. Three weeks later, the product showed a failure mode nobody expected: the model had quietly lost the ability to handle multi-step questions. Users would ask something slightly outside the support domain and receive a confident but incoherent answer. The model had traded breadth it didn't know it needed for depth it could measure.

This is the generalization cliff: the silent capability degradation that follows narrow fine-tuning. Unlike a crash or a timeout, it produces no error. The model still responds. It just responds worse on tasks adjacent to its training distribution — and those tasks never appeared in the eval suite.
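A simple guard is to score the base and fine-tuned models on tasks adjacent to the target, not only on the target itself, and flag any drop. The sketch below assumes per-task scores are already computed; `find_regressions`, the task names, and the tolerance are illustrative assumptions, and the scores are made-up placeholders rather than measured results.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float], tuned: Dict[str, float], tolerance: float = 0.02
) -> Dict[str, float]:
    """Return {task: score_change} for tasks where the tuned model dropped
    by more than `tolerance` relative to the baseline model."""
    return {
        task: tuned[task] - base_score
        for task, base_score in baseline.items()
        if task in tuned and base_score - tuned[task] > tolerance
    }


# Placeholder scores for illustration only.
baseline = {"support_resolution": 0.5, "multi_step_qa": 0.75}
tuned = {"support_resolution": 0.625, "multi_step_qa": 0.5}
print(find_regressions(baseline, tuned))
```

The check is only as good as the task list: a capability that has no entry in `baseline` cannot be flagged, which is the gap the paragraph above describes.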

The Helpful-But-Wrong Problem: Operational Hallucination in Production AI Agents

· 9 min read
Tian Pan
Software Engineer

Your AI agent just completed a complex database migration task. It called the right tool, used proper terminology, referenced the correct library, and returned output that looks completely reasonable. Then your DBA runs it against a 50M-row production table — and the backup flag was wrong. The flag exists in a neighboring library version, it's syntactically valid, but it silently no-ops the backup step.

The agent wasn't hallucinating wildly. It was confident, fluent, and directionally correct. It was also operationally wrong in exactly the way that causes data loss.

This is the hallucination category the field underinvests in, the one that your evals are almost certainly not catching.

The Hyperparameter Illusion: Why Temperature and Top-P Are the Last Things to Tune

· 9 min read
Tian Pan
Software Engineer

When LLM outputs feel wrong, engineers reach for the temperature dial. It's one of the first moves in the debugging playbook — crank it down for more consistency, nudge it up for more creativity. It feels productive because it's easy to change and produces immediately visible effects. It is almost never the right move.

Temperature and top-p are the last 10% of output quality, not the first 90%. The variables that actually determine whether your model succeeds are context quality, instruction clarity, and model selection, in that order. Tuning sampling parameters on top of a broken prompt is like adjusting the seasoning on a dish that hasn't been cooked through. The fundamental problem doesn't move.

The Inherited AI System Audit: How to Take Ownership of an LLM Feature You Didn't Build

· 10 min read
Tian Pan
Software Engineer

Someone left. The onboarding doc says "ask Sarah" but Sarah is at a different company now. You're staring at a 900-line system prompt with sections titled things like ## DO NOT REMOVE THIS SECTION, and you have no idea what happens if you do.

This is the inherited AI system problem, and it's different from inheriting regular code. With legacy code, a determined engineer can trace execution paths, read tests, and reconstruct intent from behavior. With an inherited LLM feature, the prompt is the logic — but it's written in natural language, its failure modes are probabilistic, and the author's intent is trapped inside their head. There are no stack traces that tell you which guardrail fired and why.

Lazy Evaluation in AI Pipelines: Stop Calling the LLM Until You Have To

· 11 min read
Tian Pan
Software Engineer

Most AI pipelines are written as if every request deserves a full LLM call. The user submits a message, the pipeline passes it to the model, waits for a response, and returns it — every time, unconditionally. This works, but it's expensive, slow, and often unnecessary.

The fraction of requests that actually require a full LLM inference is smaller than most engineers assume. Research on token-level routing shows that only about 11% of tokens differ between a 1.5B and a 32B parameter model, and only 4.9% of tokens are genuinely "divergent" — meaning they alter the reasoning path if handled by the smaller model. Production semantic caches show that 65% of incoming traffic is semantically similar to something the pipeline has already answered. These aren't edge cases. They're the majority of your traffic, and you're paying full price to handle them.

The fix is lazy evaluation: don't invoke the expensive model until you've confirmed that the expensive model is actually needed.
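A minimal sketch of that idea, under stated assumptions: the cache here matches on normalized text as a stand-in for a real semantic cache, and `cheap` is any inexpensive handler that returns `None` when it cannot answer. `LazyPipeline`, `cheap`, `expensive`, and `expensive_calls` are hypothetical names, not an existing library.

```python
from typing import Callable, Dict, Optional


class LazyPipeline:
    """Try the cache, then a cheap handler, and only then the large model."""

    def __init__(
        self,
        cheap: Callable[[str], Optional[str]],
        expensive: Callable[[str], str],
    ) -> None:
        self.cheap = cheap
        self.expensive = expensive
        self.cache: Dict[str, str] = {}
        self.expensive_calls = 0  # how often the large model was really needed

    @staticmethod
    def _key(message: str) -> str:
        # Stand-in for semantic matching: lowercase and collapse whitespace.
        return " ".join(message.lower().split())

    def answer(self, message: str) -> str:
        key = self._key(message)
        if key in self.cache:  # tier 1: already answered
            return self.cache[key]
        result = self.cheap(message)  # tier 2: rules or a small model
        if result is None:  # tier 3: fall through to the large model
            self.expensive_calls += 1
            result = self.expensive(message)
        self.cache[key] = result
        return result


pipeline = LazyPipeline(
    cheap=lambda m: "Hello!" if m.strip().lower() == "hi" else None,
    expensive=lambda m: f"[large model answer to: {m}]",
)
pipeline.answer("hi")
pipeline.answer("Summarize our refund policy")
pipeline.answer("summarize our  refund policy")  # served from cache
print(pipeline.expensive_calls)
```

Tracking `expensive_calls` against total requests gives a direct measure of how much traffic actually needed the large model.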

LLM Code Review in Production: Building a Diff Pipeline That Engineers Actually Trust

· 9 min read
Tian Pan
Software Engineer

Most teams that deploy an LLM code reviewer discover the same failure mode within two weeks: the model produces 10–20 comments per pull request, 80% of which are noise. After the third PR where a developer dismisses every comment without reading any of them, the tool is effectively dead: notifications are routed to a channel no one watches, and the bot still spends compute on every push.

The problem isn't the model. It's that the teams shipped a comment generator and called it a reviewer.

The Feature Store Pattern for LLM Applications: Stop Retrieving What You Could Precompute

· 10 min read
Tian Pan
Software Engineer

Most teams building LLM applications eventually converge on the same ad-hoc architecture: a scatter of cron jobs computing user summaries, a vector database queried fresh on every request, a Redis cache added when latency got embarrassing, and three different codebases that all define "user preference" slightly differently. Only later, usually after a production incident, do they recognize what they built: a feature store — a bad one, assembled accidentally.

The feature store is one of the most battle-tested patterns in traditional ML infrastructure. Applied deliberately to LLM context assembly, it eliminates the latency, cost, and consistency problems that plague most retrieval pipelines. This post explains how.

Multi-Model Consensus: When One LLM Isn't Enough to Sign Off

· 11 min read
Tian Pan
Software Engineer

Your AI feature ships with 85% accuracy. Leadership is thrilled. Then a compliance audit finds that the wrong 15% of answers cluster around a specific regulatory interpretation, one that every model in your provider's family gets wrong in the same way. You called one model. It failed. And because you never compared it to anything else, you had no signal that the failure was systematic.

Multi-model consensus architecture is the structural answer to this problem. Instead of trusting a single LLM, you fan out to multiple models from different provider families, aggregate their responses, and route based on agreement. The disagreement pattern itself becomes a first-class signal in your system, not just a debugging artifact.

This approach costs 2–4× more per inference. For most use cases, that's obviously not worth it. But for a specific class of outputs — legal summaries, medical triage routing, financial risk flags, security assessments — the cost of a wrong answer so far exceeds the cost of extra inference that the math inverts almost immediately.
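The fan-out, aggregate, and route steps described above can be sketched as follows. This is an illustration under assumptions, not a reference design: `consensus`, `min_agreement`, and the route labels are hypothetical, each model is represented by a plain callable, and answers are compared by normalized exact match, which suits short labels but not free-form text.

```python
from collections import Counter
from typing import Callable, Dict, Tuple


def consensus(
    prompt: str,
    models: Dict[str, Callable[[str], str]],
    min_agreement: float = 1.0,
) -> Tuple[str, str, float]:
    """Query every model, then route on how strongly they agree.

    Returns (route, majority_answer, agreement_ratio).
    """
    if not models:
        raise ValueError("at least one model is required")
    # Fan out: one call per model (sequential here for simplicity).
    answers = {name: call(prompt).strip().lower() for name, call in models.items()}
    # Aggregate: find the most common normalized answer.
    top_answer, votes = Counter(answers.values()).most_common(1)[0]
    agreement = votes / len(answers)
    # Route: disagreement is treated as a signal, not discarded.
    route = "auto_accept" if agreement >= min_agreement else "human_review"
    return route, top_answer, agreement


# Stub callables standing in for models from different provider families.
models = {
    "provider_a": lambda p: "High risk",
    "provider_b": lambda p: "high risk",
    "provider_c": lambda p: "low risk",
}
print(consensus("Classify this transaction", models))
```

Requiring full agreement by default is a deliberately conservative choice for the high-stakes outputs listed above; lowering `min_agreement` trades review load for risk.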