Skip to main content

907 posts tagged with "insider"

View all tags

Cohort-Aware Fine-Tuning: When One Model Isn't Enough But Per-User Is Too Much

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a fine-tuned model that beat their base by four points on their internal eval, then watched their top three customers churn over the following six weeks. The eval was fine. The aggregate was fine. The fine-tune just happened to win on the median user, who was a small-business buyer asking short factual questions, while silently regressing on the enterprise legal cohort whose long, citation-heavy queries had been the actual revenue driver. Nobody had sliced the eval by customer tier because nobody on the modeling side knew the customer tier mattered.

Most fine-tuning conversations live at one of two extremes. On one end, the "one fine-tune to rule them all" approach trains a single specialized model on a mix of all customer data and washes out the cohort-specific behavior that actually distinguished segments in the base model. On the other end, the "per-customer fine-tune" approach trains a separate adapter for each tenant, which is operationally tolerable below a hundred customers and falls apart somewhere around a few hundred. The interesting middle ground — where a small number of cohort-aware fine-tunes serve a segmented user base — is missing from most production playbooks.

The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

· 10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.

Your CS Team Built a Shadow Agent. That's Your Roadmap.

· 9 min read
Tian Pan
Software Engineer

A senior CSM in your support org spent a weekend wiring up an internal Slack bot. They wrote the system prompt themselves. They pointed it at the public docs, a Zendesk export of resolved tickets, and the changelog. Six weeks later it answers about 40% of the tier-1 questions their team used to type out by hand. Nobody on your engineering org chart knows it exists. The first time the platform team finds out, somebody from security will be asking why a service account is hitting Zendesk's API at 3am.

The default reaction is panic. Lock down the API token. Send a company-wide email about unsanctioned AI. Add a slide to the next governance review. Then promise that the platform team will build "the official version" next quarter, on the proper roadmap.

That reaction misses what actually happened. The CS team didn't go rogue — they built a working prototype of a product the engineering team hasn't shipped. They have real usage data, real prompt iteration cycles, and real user feedback. Your platform roadmap has none of those. Treating the bot as a compliance violation throws away the most accurate prioritization signal your AI program is going to get this year.

The Eval Automation Trap: When Your Pipeline Drifts Away From What Users Actually Want

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline scores are trending up. Response quality is improving. The LLM judge is catching more bad outputs. Your dashboard is green.

Meanwhile, a support ticket trickles in: "The assistant keeps giving me long, formal answers when I asked a simple question." Then another: "It stopped suggesting next steps. Used to do that automatically." Then your product manager shows you a chart: user satisfaction down 12% over the last quarter, correlated almost perfectly with the stretch where your automated eval metrics were climbing fastest.

This is the eval automation trap. Your measurement apparatus became optimized for itself rather than for what your users value — and because the feedback loop was entirely automated, nobody noticed until the damage was already in production.

The Fallback Cascade: Why Your AI Feature Needs Five Failure Modes, Not One

· 9 min read
Tian Pan
Software Engineer

Most AI features ship with exactly two states: working and broken. The model call succeeds and the feature responds; the model call fails and the user sees an error. This is the equivalent of building a web service with no load balancing, no cache, and a single database replica — technically functional until the moment it isn't.

The difference is that engineers learned database resilience patterns in the 1990s and have internalized them deeply. AI feature resilience is still being discovered the hard way, one production outage at a time. A payment processor lost $2.3M in a four-hour AI outage. A logistics company missed delivery windows for 30,000 packages when their routing model went down. Both failures shared a root cause: when the primary model was unavailable, there was nothing to fall back to.

The First 100 Tickets After You Launch an AI Feature

· 12 min read
Tian Pan
Software Engineer

The bug count after an AI launch is not a quality problem. It is a discovery sequence — a sequence so predictable that you can sketch it on a whiteboard before the launch announcement goes out, week by week, ticket by ticket, and be embarrassingly close to right by the time the dashboards catch up. Every team that ships an AI feature runs this sequence. The only choice is whether you run it with a runbook or with a series of unscheduled all-hands.

I have watched enough launches now to believe the sequence is not really about engineering quality. It is about an information gap. Pre-launch, the team has a synthetic traffic mix, a curated eval set, a happy-path demo, and a board deck. Post-launch, real users arrive with intents the synthetic traffic never modeled, a marketing team that runs campaigns engineering hears about secondhand, a model provider that ships changes the team did not authorize, and a privacy reviewer who was on vacation when the feature shipped. The sequence below is the friction that happens when those two worlds collide.

The LLM-as-Validator Antipattern: Why Your AI Quality Gate Has a Blind Spot

· 8 min read
Tian Pan
Software Engineer

Your AI feature ships with a quality gate: every response runs through a GPT-4 prompt that scores it on helpfulness, accuracy, and tone. Green scores trigger no alerts. The dashboard shows 97% pass rate. Meanwhile, your support tickets double.

The problem is structural. You used the same class of system that generates your outputs to validate those outputs. When the generator hallucinates a plausible-sounding fact, the judge — trained on the same distribution of internet text — reads the hallucination as credible and passes it through. Both models share the blind spot. Your quality gate is measuring confidence, not correctness.

MTBF Is Dead When Your Agent Self-Heals

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter had every dashboard green. Tool error rate flat at 0.3%. End-to-end success at 98%. SLO budget barely touched. They were also burning four times their projected token spend, and nobody could explain it. When they finally instrumented retry depth per trace, the picture inverted: the median successful request was making 2.7 tool calls instead of the 1.0 the architecture diagram promised. The agent was not failing. It was failing and recovering, over and over, inside the same span, and the success rate metric had no way to tell them.

This is the part of agentic reliability that the old reliability vocabulary cannot reach. MTBF — mean time between failures — assumes failures are punctuated, observable events you can count between. You measure the gap, you compute the mean, you alert when the gap shrinks. It worked for hard drives, networks, deterministic services. It does not work for systems that retry, reroute, fall back, and recover silently inside a single user-visible operation.

Open-Weight Licenses Are a Compliance Minefield Your Team Hasn't Mapped

· 9 min read
Tian Pan
Software Engineer

The word "open" is doing an extraordinary amount of work in "open-weight." When an engineer downloads a safetensors file from a model hub, they tend to file the act under the same mental category as npm install lodash — pull a dependency, ship a feature, move on. But the license that ships next to those weights is rarely Apache 2.0 or MIT. It is more often a custom community license with acceptable-use carve-outs, attribution requirements, derivative-naming rules, and user-count thresholds that switch the contract terms once your product gets popular. And almost none of it is enforced by the loader. The model runs whether you complied or not.

This is how compliance debt accumulates silently. The team that treats license review as a one-time download check is signing the company up for an audit finding that will ship years after the developer who clicked "I agree" has left. The fix is not a stricter procurement gate at the door — it is a discipline of treating model weights as a supply chain, with provenance, periodic re-review, and a manifest that traces every deployed inference path back to its upstream license.

Persona Overlays: When One Agent Needs Many Voices for Different Customer Cohorts

· 11 min read
Tian Pan
Software Engineer

A Fortune 500 procurement lead opens your support agent and asks why the SOC 2 report references a control your product no longer implements. Your agent answers in the same chipper voice it uses with hobbyists on the free tier — three exclamation points, an emoji, and a cheerful suggestion to "ping our team" with no escalation path or citation. The procurement lead forwards the screenshot to her CISO with one line: "This is who they sent to handle our compliance question." You lose the renewal not because the answer was wrong, but because the voice was wrong for the room.

Most teams ship one agent persona because the org chart has one support team. The customer base, however, is rarely that uniform. Enterprise buyers expect formality, citations, and named escalation paths. Self-serve users want quick answers and zero friction. Developers want code, not paragraphs. The single-persona agent reads as condescending to one cohort and unprofessional to another, and "let users pick a tone" punts a product decision to the user that the user shouldn't have to make.

The PRD for an AI Feature: Why Your Old Template Misses the Cliff

· 10 min read
Tian Pan
Software Engineer

The deterministic-software PRD template has aged into a kind of muscle memory. Problem statement, user stories, acceptance criteria, edge cases, success metrics, scope cuts. Engineers know how to read it. PMs know how to fill it in. Designers know which sections to lift wireframes from. It is a well-worn artifact that has shipped a generation of CRUD apps, dashboards, and SaaS workflows.

It also has no field for "what the model gets wrong five percent of the time." No field for "what we accept as a passing eval score." No field for "what the user sees when the model refuses to answer." No field for "which prompt version this PRD locks down, and who is allowed to change it after ship." Every AI feature shipped against that template is shipping with a hidden contract that nobody wrote down. Postmortems keep finding it the hard way.

The Pre-Launch Blast Radius Inventory: The Document Your Agent Team Forgot to Write

· 10 min read
Tian Pan
Software Engineer

The first hour of an agent incident is always the same. Someone notices the agent did something it shouldn't have — invoiced the wrong customer, deleted a calendar event for the CEO, posted a half-finished apology in a public Slack channel — and the response team starts asking questions nobody has written answers to. Which downstream system holds the audit log? Which on-call rotation owns that system? Was the call reversible, and within what window? Who owns the credential the agent used, and does that credential also let it touch other systems we haven't checked yet? The team that wrote the agent rarely owns those answers, because the answers live in the systems the agent calls, and nobody at launch wrote them down in one place.

That document is the blast radius inventory, and it is the artifact most agent teams discover the absence of during their first incident. It is not a security checklist, not a tool schema, not a runbook. It is an enumerated list of every external system the agent can touch and every fact you need on the worst day of that system's life. Teams that ship agents without one are betting that incident-response context can be reconstructed faster than the blast spreads, and that bet keeps losing as agents get more tools and the tools get more powerful.