Skip to main content

861 posts tagged with "insider"

View all tags

Choosing Eval Metrics Is a Product Decision, Not a Technical One

· 10 min read
Tian Pan
Software Engineer

A team building an LLM-based literature screening tool celebrated 96% accuracy on their test set. Their model was, by any standard engineering metric, performing excellently. There was one problem: it found zero true positives. It had learned to classify everything as irrelevant and still scored near-perfect accuracy, because relevant papers were rare in the dataset. The failure wasn't in the model — it was in the metric.

This failure mode is not exotic. It plays out silently across AI teams every week, in codebases where engineers select evaluation metrics the way they'd select a sorting algorithm: as a technical choice with a right answer. The framing is wrong. Metric selection is a product decision. It encodes which failure modes you're willing to tolerate, which users you're optimizing for, and what "good" actually means for your specific context. Getting this wrong produces eval suites that look rigorous and measure the wrong thing.

Golden Paths for AI Agents: How Platform Teams Can Enable Adoption Without Becoming a Bottleneck

· 11 min read
Tian Pan
Software Engineer

The most common failure mode for AI platform teams isn't technical. It's organizational: the central platform team becomes a gate that every product team must pass through to get any AI capability into production. Request queue grows. Cycle times balloon from days to weeks. Product teams get frustrated and start stitching together unofficial workarounds — hardcoded API keys, shadow LLM integrations, vendor accounts on personal credit cards. By the time the platform team notices, half the organization is running AI outside any governance structure.

The problem isn't that platform teams care about governance. It's that they implemented governance as an approval workflow instead of as infrastructure.

Training Data Self-Poisoning: When Your AI Feature Corrupts Its Own Ground Truth

· 10 min read
Tian Pan
Software Engineer

Your recommendation model launched three months ago. Click-through rates are up 18%. Watch time is climbing. The dashboard is green. Leadership is happy.

And your model is quietly destroying the data it will use to train its next version.

This is training data self-poisoning: a feedback loop where a deployed AI feature shifts user behavior in ways that corrupt the interaction data the model was originally trained to learn from. The worst part is that your standard engagement metrics will tell you everything is fine — right up until they don't.

The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features

· 8 min read
Tian Pan
Software Engineer

A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.

Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.

Agent Blast Radius: Bounding Worst-Case Impact Before Your Agent Misfires in Production

· 10 min read
Tian Pan
Software Engineer

Nine seconds. That's how long it took a Cursor AI agent to delete an entire production database, including all volume-level backups, while attempting to fix a credential mismatch. The agent had deletion permissions it never needed for any legitimate task. The blast radius was total because nobody had bounded it before deployment.

This isn't a story about model failure. It's a story about permission scope. The model did exactly what it calculated it should do. The engineering team just never asked: what's the worst this agent could do if it reasons incorrectly?

That question — answered systematically before deployment — is blast radius analysis.

Why AI Engineering Training Programs Are Perpetually Behind the Models

· 9 min read
Tian Pan
Software Engineer

In early 2023, a flood of corporate AI training programs launched with the same selling point: we will teach your engineers prompt engineering. By the time most of them finished their first cohort, the specific techniques they were teaching had already been automated away by the models themselves. By 2025, the role of "prompt engineer" — briefly advertised at $200,000 salaries — was effectively obsolete. The training programs are still running.

This is the AI curriculum trap. It is not a problem of effort or budget. Organizations invest heavily in structured AI training, certification programs, and hiring rubrics built around tool proficiency. But the tools change faster than any curriculum can track, and the result is a permanent, structural lag: training programs are always teaching the AI engineering of 18 months ago.

AI Feature Payback: The ROI Model Your Finance Team Won't Fight You On

· 10 min read
Tian Pan
Software Engineer

Every engineering team shipping AI features eventually hits the same wall: finance wants a spreadsheet that justifies the spend, and the spreadsheet you built doesn't actually work.

The problem isn't that AI features lack ROI. The problem is that AI economics break every assumption the standard ROI model was built on — fixed capital, linear cost curves, predictable timelines. Teams that treat AI spending like SaaS licensing get numbers that either look deceptively good before launch or collapse six months into production. The ten-fold gap between measured AI initiatives (55% ROI) and ad-hoc deployments (5.9% ROI) comes almost entirely from whether teams got the measurement model right before they shipped.

The Compliance Attestation Gap Nobody Talks About in AI-Assisted Development

· 9 min read
Tian Pan
Software Engineer

Your engineers are shipping AI-generated code every day. Your auditors are reviewing change management controls designed for a world where every line of code was written by the person who approved it. Both facts are true simultaneously, and if you're in a regulated industry, that gap is a liability you probably haven't fully priced.

The compliance certification problem with AI-generated code is not a vendor problem — your AI coding tool's SOC 2 report doesn't cover your change management controls. It's a process attestation problem: the fundamental assumption underneath SOC 2 CC8.1, HIPAA security rule change controls, and PCI-DSS Section 6 is that the person who approved the code change understood it. That assumption no longer holds.

The AI Onboarding Gap: Why Engineers Can't Learn What They Can't Test

· 11 min read
Tian Pan
Software Engineer

A new engineer joins an AI-heavy team. On their third day, they see a prompt with an awkward double negation in the system instructions. It looks like a bug. They clean it up — the kind of small polish any reasonable person would do. Two hours later, customer-facing classification accuracy on a critical pipeline drops from 91% to 74%. Nobody has any idea why.

This scenario plays out in some form at almost every team building on LLMs. The new engineer isn't careless. The prompt did look wrong. But that double negation was load-bearing in a way that only the person who wrote it — after weeks of experimentation — actually understood. And they never wrote that understanding down.

This is the AI onboarding gap: the chasm between what an AI codebase appears to do and what it actually does, and why that gap is invisible until someone falls into it.

AI as the Permanent Intern: The Role-Task Gap in Enterprise Workflows

· 9 min read
Tian Pan
Software Engineer

There's a pattern that appears in nearly every enterprise AI deployment: the tool performs brilliantly in the demo, ships to production, and then quietly stalls at 70–80% of its potential. Teams attribute the stall to model quality, context window limits, or retrieval failures. Most of the time, that diagnosis is wrong. The actual problem is that they're asking the AI to play a role it structurally cannot occupy — not yet, possibly not ever in its current form.

The gap between "AI can do this task" and "AI can play this role" is the most expensive misunderstanding in enterprise AI.

AI Pipeline Exception Handling: Hallucinations, Refusals, and Format Violations Are First-Class Errors

· 10 min read
Tian Pan
Software Engineer

Your AI pipeline reported zero errors last night. The output was completely wrong.

That's not a hypothetical. A recent industry report found that roughly 1 in 20 production LLM requests fail in ways that never surface as exceptions — valid HTTP 200, well-formed JSON, fluent prose, factually wrong. The observability stack stays green while the pipeline quietly lies to its users.

The root cause is an architectural assumption borrowed from traditional service engineering: that HTTP status codes and parse errors cover the failure space. They don't. LLM pipelines have at least four failure types that the underlying infrastructure cannot see — hallucinations, refusals, format violations, and context overflow — and treating them as edge cases instead of first-class error types is how production AI systems ship invisible bugs at scale.

Your AI Product's Dark Energy: The Background Compute Nobody Budgeted

· 10 min read
Tian Pan
Software Engineer

When your AI feature ships, you build a latency budget: how long does the model call take, how long does retrieval take, what's the p99 for the full request. What you almost certainly don't build is a budget for the inference that happens when no user is watching.

Every AI product with persistent state runs invisible work in the background. Documents get preprocessed when uploaded. Long conversations get re-summarized at session boundaries so the next session doesn't blow the context window. Proactive suggestions get generated on a schedule nobody set deliberately. Embeddings get regenerated when someone updates the schema. None of this shows up in your latency dashboard. Frequently it isn't in your cost model. Almost never is it in your monitoring.

This is your AI product's dark energy — the compute that explains the gap between what your inference bill should be and what it actually is.