Skip to main content

861 posts tagged with "insider"

View all tags

The Few-Shot Saturation Curve: Why Adding More Examples Eventually Hurts

· 9 min read
Tian Pan
Software Engineer

A team testing Gemini 3 Flash on a route optimization task watched their model score 93% accuracy at zero-shot. They added examples, performance climbed — and then at eight examples it collapsed to 30%. That's not noise. That's the few-shot saturation curve biting hard, and it's a failure mode most engineers only discover after deploying a prompt that seemed fine at four examples and broken at twelve.

The intuition that more examples is strictly better is wrong. The data across 12 LLMs and dozens of task types shows three distinct failure patterns: steady plateau (gains flatten), peak regression (gains then crash), and selection-induced collapse (gains that evaporate when you switch example retrieval strategy). Understanding which pattern you're in changes how you build prompts, when you give up on few-shot entirely, and whether you should be fine-tuning instead.

Fine-Tuning Dataset Provenance: The Audit Question You Can't Answer Six Months Later

· 10 min read
Tian Pan
Software Engineer

Six months after you shipped your fine-tuned model, a regulator asks: "Which training examples came from users who have since revoked consent?" You open a spreadsheet, search a Slack archive, and find yourself reconstructing history from annotation batch emails and a README that hasn't been updated since the first sprint. This is the norm, not the exception. An audit of 44 major instruction fine-tuning datasets found over 70% of their licenses listed as "unspecified," with error rates above 50% in how license categories were actually applied. The provenance problem is structural, and it bites hardest when you can least afford it.

This post is about building a provenance registry for fine-tuning data before you need it — the schema, the audit scenarios that drive its requirements, and the production patterns that make it tractable without becoming a second job.

The Intent Classification Layer Most Agent Routers Skip

· 11 min read
Tian Pan
Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, which is statistically indistinguishable from random guessing.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.

Judge Model Independence: Why Your Eval Breaks When the Grader Shares Blind Spots with the Graded

· 9 min read
Tian Pan
Software Engineer

Your eval suite scores 91%. Users report the system feels unreliable. The post-mortem reveals the culprit: you used GPT-4o to both generate responses and grade them. The model was judging its own mirror image, and it liked what it saw.

This is the judge model independence problem. It is more widespread than most teams realize, the score inflation it produces is large enough to matter, and the fix is neither complicated nor expensive. But you have to know to look for it.

The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever

· 11 min read
Tian Pan
Software Engineer

Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.

Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.

The Model EOL Clock: Treating Provider LLMs as External Dependencies

· 11 min read
Tian Pan
Software Engineer

In January 2026, OpenAI retired several GPT models from ChatGPT with two weeks' notice — weeks after its CEO had publicly promised "plenty of notice" following an earlier backlash. For teams that had built workflows around those models, the announcement arrived like a pager alert on a Friday afternoon. The API remained unaffected that time. But it won't always.

Every model you're currently calling has a deprecation date. Some of those dates are already listed on your provider's documentation page. Others haven't been announced yet. The operational question isn't whether your production model will be retired — it's whether you'll find out in time to handle it gracefully, or scramble to migrate after users start seeing failures.

Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.

The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing — the code is fine. The actual culprit is a batch summarization job a different team launched that evening, sharing the same provider API key, which has eaten the account's per-minute token budget for the next four hours. Nobody owns the shared key. Nobody owns the limit.

This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.

Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.

PII in the Prompt Layer: The Privacy Engineering Gap Most Teams Ignore

· 12 min read
Tian Pan
Software Engineer

Your organization has a privacy policy. It says something reasonable about user data being handled carefully, retention limits, and compliance with GDPR and HIPAA. What it almost certainly does not say is whether the text of that user's name, email address, or medical history was transmitted verbatim to a hosted LLM API before any policy control was applied.

That gap — between the privacy policy you can point to and the privacy guarantee you can actually prove — is where most production LLM systems are silently failing. Research shows roughly 8.5% of prompts submitted to tools like ChatGPT and Copilot contain sensitive information, including PII, credentials, and internal file references. In enterprise environments where users paste emails, customer data, and support tickets into AI-assisted workflows, that number almost certainly runs higher.

The problem is not that developers are careless. It is that the LLM prompt layer was never designed as a data processing boundary. It inherits content from upstream systems — user input, RAG retrievals, agent context — without enforcing the data classification rules that govern every other part of the stack.

Pricing Your AI Product: Escaping the Compute Cost Trap

· 10 min read
Tian Pan
Software Engineer

There is a company charging £50 per month per user. Their AI feature consumes £30 in API fees. That leaves £20 to cover hosting, support, and profit — before accounting for a single refund or churned seat. They built a product users love, grew to thousands of subscribers, and unknowingly constructed a business where more customers means more losses.

This is not a cautionary tale about a bad idea. It is a cautionary tale about a pricing architecture imported from a world where the marginal cost of serving the next user was effectively zero. That world no longer fully applies when your product calls a language model.

Traditional SaaS gross margins run 70–90%. AI-forward companies are reporting 50–60% — and the gap is mostly explained by one line item: inference. When tokens are 20–40% of your cost of goods sold, the standard SaaS playbook inverts.

Proactive Agents: Event-Driven and Scheduled Automation for Background AI

· 11 min read
Tian Pan
Software Engineer

Almost every tutorial on building AI agents starts the same way: user types a message, agent reasons, agent responds. That model works fine for chatbots and copilots. It fails to describe the majority of production AI work that organizations are now deploying.

The agents that quietly matter most in enterprise environments don't wait for a message. They wake up when a database row changes, when a queue crosses a depth threshold, when a scheduled cron fires at 3 AM, or when monitoring detects that a metric drifted outside bounds. They act without a user present. When they fail, nobody notices until the damage has compounded.

Building these proactive agents requires a substantially different design vocabulary than building reactive assistants. The session-scoped mental model that works for conversational AI breaks down when your agent runs in a loop, retries in the background, and has no human to catch its mistakes.

The Retrieval Emptiness Problem: Why Your RAG Refuses to Say 'I Don't Know'

· 10 min read
Tian Pan
Software Engineer

Ask a production RAG system a question your corpus cannot answer and watch what happens. It rarely says "I don't have that information." Instead, it retrieves the five highest-ranked chunks — which, having nothing better to match, are the five least-bad chunks of unrelated content — and hands them to the model with a prompt that reads something like "answer the user's question using the context below." The model, trained to be helpful and now holding text that sort of resembles the topic, produces a confident answer. The answer is wrong in a way that's architecturally invisible: the retrieval succeeded, the generation succeeded, every span was grounded in a retrieved document, and the user walked away misled.

This is the retrieval emptiness problem. It isn't a bug in any single layer. It's the emergent behavior of a pipeline that treats "top-k" as a contract and never asks whether the top-k is any good. Research published at ICLR 2025 on "sufficient context" quantified the effect: when Gemma receives sufficient context, its hallucination rate on factual QA is around 10%. When it receives insufficient context — retrieved documents that don't actually contain the answer — that rate jumps to 66%. Adding retrieved documents to an under-specified query makes the model more confidently wrong, not less.