Skip to main content

861 posts tagged with "insider"

View all tags

Conversation Branching as a First-Class Primitive: Why Linear Threads Force Users to Kill and Restart

· 10 min read
Tian Pan
Software Engineer

The clearest signal that your chat product needs branching is also the easiest one to ignore: users keep copy-pasting old conversations into new sessions. They are not migrating providers. They are not bored. They are trying to ask "what if I had pushed back on that earlier assumption?" without losing the forty turns of context they spent building. The linear thread offers them exactly two options — overwrite the next message and lose the original, or start a new chat and lose the prefix. So they invent a third one with a clipboard.

Every time a user does this, your product is leaking a feature request through a workaround. The workaround is bad: it strips message metadata, breaks tool-call linkage, drops file attachments, and creates orphaned threads that no longer map to a coherent task. But it persists because the alternative — abandoning context that took thirty minutes to assemble — is worse. The conversation is structurally a tree. The UI insists it is a list. Users patch the gap manually.

The Demo Loop Bias: How Your Dev Process Quietly Optimizes for Impressive Failures

· 10 min read
Tian Pan
Software Engineer

There is a particular kind of meeting that happens at every AI-product team, usually on Thursdays. Someone shares their screen, drops a prompt into a notebook, and runs three or four examples. The room reacts. People say "wow." Someone takes a screenshot for Slack. A decision gets made — ship it, swap models, change the temperature. No one writes down the failure rate, because no one measured it.

This is the demo loop, and it has a structural bias that almost no team accounts for: it does not select for the best output. It selects for the most legible output. Over weeks and months, your prompt evolves to produce answers that land in a meeting — confident, fluent, well-formatted, on-topic. Whether they are correct is a separate variable, and it is one your process is not measuring.

The result is what I call charismatic failure: outputs that are wrong in ways your demo loop has been trained, by selection pressure, to ignore.

Embedding Model Rotation Is a Database Migration, Not a Deploy

· 11 min read
Tian Pan
Software Engineer

Somewhere in a staging channel, an engineer writes "bumping the embedder to v3, new model scored +4 on MTEB, merging after the smoke test." Two days later support tickets start trickling in about search results that feel "weirdly off." A week later retrieval precision is down fourteen points, cosine scores have collapsed from 0.85 into the 0.65 range, and nobody can explain why — because the deploy looked identical to the last five model bumps. It wasn't a deploy. It was a database migration wearing a deploy's costume.

Embedding model rotation is the most misfiled change type in AI infrastructure. It lands in your system through the same channels as a prompt tweak or a generation-model pin update — a config file, a PR, a CI check — so it gets the governance of a config change. But under the hood, a new embedder does not produce a better version of your old vectors. It produces vectors that live in a different coordinate system entirely, where cosine similarity across the two manifolds is a category error. The correct mental model is not "rev the dependency." It is "swap the primary key encoding on a fifty-million-row table while serving reads."

Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

· 10 min read
Tian Pan
Software Engineer

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.

The 'We'll Add Evals Later' Trap: How Measurement Debt Compounds

· 9 min read
Tian Pan
Software Engineer

Every team that ships an AI feature without evals tells themselves the same story: we'll add measurement later, after we find product-market fit, after the prompt stabilizes, after the next release. Six months later, the prompt has been touched by four engineers and two product managers, the behavior is load-bearing for three customer integrations, and the team discovers that "adding evals later" means reconstructing intent from production logs they never structured for that purpose. The quarter that was supposed to be new features becomes a quarter of archaeology.

This isn't a planning mistake. It's a compounding one. The team that skipped evals to ship faster is the same team that will spend twelve weeks rebuilding eval infrastructure from incomplete traces, disagreeing about what "correct" meant in February, and quietly removing features nobody can prove still work. The cost of catching up exceeds the cost of building in — not by a little, but by a multiplier that grows with every prompt edit that shipped without a regression check.

Your Fine-Tuning Corpus Is a GDPR Data Artifact, Not Just an ML Asset

· 11 min read
Tian Pan
Software Engineer

The moment your first fine-tune lands in production, your weights become a new kind of record your privacy program has never cataloged. A customer support transcript that made it into your training mix is no longer just a row in a database you can DELETE — it is now encoded, redundantly and non-extractably, into the parameters your API serves. The original record can be scrubbed from S3, erased from your warehouse, and removed from your RAG index, while the model continues to complete prompts with fragments of that customer's name, account ID, or medical history. The Data Protection Agreement your sales team signed promised you'd honor erasure requests. Nobody asked the ML team whether that was technically possible.

Research on PII extraction shows this is not hypothetical. The PII-Scope benchmark reports that adversarial extraction rates can increase up to fivefold against pretrained models under realistic query budgets, and membership inference attacks using self-prompt calibration have pushed AUC from 0.7 to 0.9 on fine-tuned models. Llama 3.2 1B, a small and widely copied base, has been demonstrated to memorize sensitive records present in its training set. The takeaway for anyone shipping fine-tunes on production traces is blunt: you cannot assume your weights forgot.

This matters because most fine-tuning pipelines were designed by ML engineers optimizing for loss, not by data stewards optimizing for Article 17. The result is an artifact whose legal status is ambiguous, whose lineage is rarely documented, and whose "delete user X" workflow doesn't exist.

Free Tier Abuse Economics: When Your AI Generosity Gets Ratio'd by Bots

· 10 min read
Tian Pan
Software Engineer

A startup CTO checked their OpenAI dashboard one morning and found a $67,000 invoice. Their normal monthly bill was $400. Nothing in their product had changed — no viral launch, no new feature, no marketing push. What had changed is that an attacker fingerprinted their endpoint, harvested a leaked key from a build artifact, and resold the inference at 40-60% below retail to buyers who paid in crypto. The startup paid the bill while the attacker pocketed the spread.

This is not the typical free tier abuse story SaaS founders tell each other. The typical story goes: a few power users abuse generous trials, churn rates spike, you tighten the limits, and unit economics recover within a quarter. That playbook is dead for AI products. The math broke when your unit cost per anonymous request stopped being effectively zero, and the abuse playbook scaled the moment your generosity could be liquidated for cash.

GPU Starvation: How One Tenant's Reasoning Prompt Stalls Your Shared Inference Endpoint

· 9 min read
Tian Pan
Software Engineer

Your dashboard says the GPU is healthy. Utilization hovers around 80%, throughput in tokens-per-second looks fine, cold starts are rare, and the model is the one you asked for. Yet your pager is going off because p99 latency has tripled, a handful of users are timing out, and support tickets all describe the same thing: "the app froze for twenty seconds, then came back." You pull a trace and find an unrelated customer's 28,000-token reasoning request sitting in the same batch as every stalled call. One tenant's deep-think prompt just ate everyone else's turn.

This is head-of-line blocking, and it is the failure mode that ruins shared LLM inference the moment reasoning models enter the traffic mix. The pattern is not new — storage systems and network stacks have fought it for decades — but it takes a specific shape on GPUs because of how continuous batching and KV-cache pinning work. Most teams design for average load and discover too late that "shared inference is cheaper" stops being true the instant request sizes stop being similar.

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

· 11 min read
Tian Pan
Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.

Inference Is Faster Than Your Database Now

· 10 min read
Tian Pan
Software Engineer

Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.

Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.

This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature store call, the auth middleware, and the Postgres lookup that was always that slow — nobody just cared when the model was taking nine-tenths of the budget.

Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

· 8 min read
Tian Pan
Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.

Markdown Beats JSON: The Output Format Tax You're Paying Without Measuring

· 11 min read
Tian Pan
Software Engineer

Most teams flip JSON mode on the day they ship and never measure what it costs them. The assumption is reasonable: structured output is a correctness win, so why wouldn't you take it? The answer is that strict JSON-mode constrained decoding routinely shaves 5–15% off reasoning accuracy on math, symbolic, and multi-step analysis tasks, and nobody notices because the evals were run before the format flag was flipped — or the evals measure parseability, not quality.

The output format is a decoding-time constraint, and like every constraint it warps the model's probability distribution. The warp is invisible when you look at logs: the JSON is valid, the schema matches, the field types line up. What you cannot see in the logs is the reasoning that the model would have produced in prose but could not fit inside the grammar you gave it. The format tax is real, well-documented in the literature, and almost universally unmeasured in production.

This post is about when to pay it, how to stop paying it when you don't have to, and what a format-choice decision tree actually looks like for engineers who want structured output and accuracy at the same time.