Skip to main content

702 posts tagged with "llm"

View all tags

The AI Interview Has No Signal: Why Your Loop Doesn't Identify People Who Ship LLM Products

· 10 min read
Tian Pan
Software Engineer

A team I know spent six months running their standard senior-engineer loop with an "AI round" bolted on. They interviewed seventy candidates. They hired three. None of the three shipped an agent that survived a production weekend. The team blamed the talent market. The talent market was fine. The loop was the problem.

The standard engineering interview was calibrated for a stack where correctness is verifiable, performance is measurable on a benchmark, and a good engineer is someone who can decompose a problem into deterministic components and reason about edge cases against a known specification. That stack still exists, and those skills still matter, but the cluster of skills that predicts shipping LLM products is largely orthogonal to it. Your loop is asking the right questions about the wrong job.

This is a structural problem, not a calibration nudge. Adding a forty-five-minute "AI round" to a loop calibrated for deterministic systems doesn't surface AI builders — it surfaces the intersection of classical-systems-strong and LLM-fluent candidates, which is a vanishingly small set, and produces six months of failed loops while everyone wonders where all the AI engineers went.

Build vs Buy for Guardrails: The Moderation API Is Now on Your Safety-Critical Path

· 10 min read
Tian Pan
Software Engineer

The hosted moderation API you bought to ship faster is now a synchronous external dependency on your safety-critical path. That sentence isn't an opinion — it's the architecture diagram, redrawn honestly. On the day the vendor degrades, you have two choices and both of them are bad: fail open and the guardrail is useless precisely when something is probably wrong, or fail closed and a guardrail outage becomes a feature outage. Most teams discover which one they picked during the incident, not before.

The reason teams reach for a vendor here isn't laziness. Building a content classifier, a prompt-injection detector, and a PII redactor in-house looks like a six-month detour from the actual product, and the vendor has a free tier and a five-minute integration. The integration is genuinely fast. The architectural consequence is that a third party now sits in the request path of every user-facing generation, with availability, latency, and behavioral characteristics you don't control and didn't model.

This post is about treating that decision as an architectural one rather than a procurement one.

Calibrated Abstention: The Capability Every Layer of Your LLM Stack Punishes

· 11 min read
Tian Pan
Software Engineer

There is a capability your model could have that would, on the days it mattered, be worth more than any other behavioral upgrade you could ship: the ability to say "I don't have a reliable answer to this" and mean it. Not the keyword-matched safety refusal. Not the hedging tic the model picked up from RLHF on controversial topics. The real thing — a calibrated abstention that fires when, and only when, the model's internal evidence does not support a confident response.

You will never get it by accident. Every default in the LLM stack pushes the other way.

Chat History Is a Database. Stop Treating It Like Scrollback.

· 11 min read
Tian Pan
Software Engineer

The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.

The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.

The Eval Harness, Not the Prompt, Is Your Real Provider Lock-In

· 10 min read
Tian Pan
Software Engineer

Every "we'll just swap providers if needed" plan in the deck has a budget line for prompt rewrites. None of them has a line for the eval suite. That is the bug. The prompts are the visible coupling — the part you wrote, the part you can grep for, the part a junior engineer can rewrite in an afternoon. The eval harness is the invisible coupling, and it is the one that will eat a quarter of your roadmap when you actually try to migrate.

The pattern shows up the moment leverage matters. Your contract is up. A competitor releases a model that benchmarks better on your domain. Pricing on output tokens shifts under you. You go to run the candidate model through your eval suite to make the call, and within a day you discover that you cannot trust any score the harness produces, because the harness itself was written against the incumbent. You are not comparing models. You are comparing one model against a measurement instrument that was calibrated to the other one.

The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse

· 11 min read
Tian Pan
Software Engineer

The most expensive failure mode in an LLM product is not a bad release. It is six consecutive good releases — by every internal scoreboard — while user trust quietly bleeds out. The offline eval score climbs every Friday demo. The CSAT line in the weekly business review goes flat, then dips, then nobody knows when it started dipping because nobody was triangulating the two charts. By the time a postmortem names it, the team has spent two quarters tuning a prompt against a dataset that stopped resembling reality somewhere around month three.

This is the eval-set-as-simulator drift, and it is the cleanest example I know of an old machine-learning lesson being rediscovered at full retail price by a generation of LLM teams who skipped the reading list. An eval suite is not a fixture. It is a simulator, and a simulator that is never re-calibrated against the system it claims to predict will eventually predict a different system.

Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.

That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.

Found Capabilities: When Users Ship Features Your Team Never Roadmapped

· 10 min read
Tian Pan
Software Engineer

A customer emails support to ask why your CRM agent stopped drafting their NDAs. You did not know your CRM agent drafted NDAs. A power user complains that your support bot's Tagalog translations have gotten worse since last week. You did not know your support bot did Tagalog. A forum thread spreads a prompt that turns your code-review assistant into a passable security scanner, and within a quarter you are getting CVE reports filed against findings the assistant produced. Each of these is a feature with adoption, business impact, and zero institutional ownership — no eval, no SLA, no surface in the UX, no roadmap entry, and a quiet bus factor of one: the customer who figured it out.

This is what happens once your product is wrapped around a model whose capability surface is wider than the surface you scoped. Users explore the wider surface, find behaviors that solve their problems, build workflows on top of those behaviors, and then experience your next model upgrade as a regression even though nothing on your roadmap moved. The contract between you and your users is no longer the one you wrote down. It includes everything the model happened to do for them that you happened not to break.

Treating this as an engineering surprise — "we will harden the prompt, we will add a guardrail, we will catch it next time" — is a category error. Found capabilities are a product-management problem. The discipline is not preventing them; it is detecting them, deciding what to do with them, and remembering that you decided.

The Hollow Explanation Problem: When Your Model's Reasoning Is Decoration, Not Evidence

· 11 min read
Tian Pan
Software Engineer

A loan-review tool flags an application. The reviewer clicks "explain" and gets four neat bullet points: income volatility over the last six months, credit utilization above 70%, a recent address change, two thin-file dependents. The rationale reads like something a careful underwriter would write. The reviewer approves the override and moves on.

The uncomfortable part: the model never used those signals to make the decision. They appeared in the explanation because they were the kind of factors that would justify a flag — not because the flag came from them. The actual computation was a narrow latent-feature pattern that the model can't articulate, plus a few correlations the explanation never mentions. The bullets are post-hoc rationalization, written to be credible rather than to be true.

This is the hollow explanation problem, and it is not the same as hallucination. Every individual claim in that explanation may be factually correct. The user's question — why did you decide that? — is the one being answered falsely.

JSON Mode Is a Dialect, Not a Standard: The Silent Breakage in Your Fallback Path

· 11 min read
Tian Pan
Software Engineer

The first time I watched a fallback router cause a worse incident than the outage it was trying to mitigate, the postmortem document had a header that read: "Primary degraded for 11 minutes. Fallback degraded our parser for 6 days." Nobody had written code wrong. Nobody had skipped the schema review. The integration tests against the secondary provider had been green when the fallback was wired up, eighteen months earlier. What had happened in between was that one of the two providers had quietly tightened its enum coercion policy, and the contract our downstream parsers had been written against — a contract we believed was "JSON Schema, more or less" — had drifted from a shared standard into two slightly incompatible dialects.

This is the failure mode I keep seeing, and it keeps surprising teams that should know better. "JSON mode" sounds like a feature you turn on. It is not. It is a contract you maintain — separately, against every provider you might route to — and the contract drifts every quarter as vendors evolve their structured-output stacks. The "drop-in replacement" your provider docs gestured at when you signed the contract is, in production, a maintained translation layer whose absence converts your fallback path into a paper compliance artifact: present in the architecture diagram, broken on the day you needed it.

The Knowledge Cutoff Is a UX Surface, Not a Footnote

· 12 min read
Tian Pan
Software Engineer

The model has a knowledge cutoff. The user does not know what it is. The product, in almost every case, does not tell them. And on the day the user asks a question whose right answer changed three months ago, the assistant gives a confidently-stated wrong one — not because the model failed, but because the product never gave it a way to flag the gap. The trust contract between your users and your assistant is implicit, asymmetric, and silently broken every time the world moves and your UX pretends it didn't.

The dominant pattern is to treat the cutoff as a footnote: a line of disclosure copy buried in a help center, a /about page no one reads, a one-time tooltip dismissed in week one. That framing is a bug. Knowledge cutoff is not a property of the model the way "context length" is. It is a UX surface — instrumented, designed, and evolved — and treating it as anything less ships a product that confabulates around its own ignorance in a register the user cannot audit.