Skip to main content

722 posts tagged with "insider"

View all tags

The Agent Portfolio Audit: How to Consolidate 15 Independent Agents Into a Platform Without Killing Team Autonomy

· 9 min read
Tian Pan
Software Engineer

Six months after launching their first AI agent, most engineering organizations discover they have fifteen of them. Not because anyone planned a fleet — because each team solved a real problem and shipped. The customer support team built a triage agent. The data team built a report-generation agent. Platform engineering built a runbook agent. Infrastructure built three more. None of them share auth, logging, tooling, or evaluation methodology. Tokens are bleeding from a dozen provider accounts and nobody can tell you which agent is responsible.

This is the moment that separates engineering organizations that can scale AI from those that can't. The answer is not to slow down agent development — it's to run a portfolio audit before entropy makes consolidation impossible.

AI Writes Code in Seconds. Your Team Reviews It for Hours. The Math Isn't Working.

· 8 min read
Tian Pan
Software Engineer

The ROI pitch for AI coding tools is irresistible on paper: developers complete tasks 55% faster in controlled experiments, ship 98% more pull requests, and report saving 3.6 hours per week. But when organizations look at their actual delivery metrics — bug rates, release cycle times, incident frequency — the numbers barely move. Something is absorbing all those gained hours, and it's not hard to find.

AI generates code in seconds. Engineers still review it at the same pace they always have.

The Ethics Review Gate Your AI Shipping Process Is Missing

· 9 min read
Tian Pan
Software Engineer

Most engineering teams treat ethics like they used to treat security: something you address after the feature ships, if someone complains. The parallels are uncomfortable. In 2004, SQL injection was a "we'll fix it later" problem. Today, every serious team has automated injection detection in CI. Ethics reviews in AI are at the same inflection point — and teams that don't build the gate now will learn the hard way why it exists.

The gap is not intent. It's structure. Security reviews have a 20-year head start on standardization: OWASP checklists, CVE scoring, penetration tests, mandatory sign-offs before production. Ethics reviews have none of that ceremony. Most teams have no defined trigger, no checklist, no exit criteria, and no named owner. The result: a healthcare algorithm that reduced identification of Black patients for care by over 50% not because engineers were malicious, but because no one ran disaggregated accuracy numbers before the thing went live. A recruiting model that systematically downranked resumes containing the word "women's" — trained on historical data, shipped without a fairness pass, discovered months into production. These aren't edge cases. They're what happens when ethics is a post-launch checkbox with no teeth.

Adding AI to Trusted Features: How Variance Destroys the Trust You Spent Years Building

· 11 min read
Tian Pan
Software Engineer

Your most-trusted feature is also your most dangerous AI deployment target. That's the counterintuitive reality that product teams keep discovering the hard way: the features users rely on the most, the ones where trust is deep and automatic, are exactly the ones where AI-introduced variance causes the most catastrophic trust damage. A new feature that fails is a disappointment. An existing feature that suddenly behaves unpredictably is a betrayal.

This is the AI product retrofit trap. Not the decision to add AI — that's often right. The trap is the belief that adding AI to an established feature is safer than building a new one because you already have the users. In reality, the reverse is true. The trust you've spent months or years earning is not a foundation for AI experiments; it's a liability if the experiment fails.

The Automation Cliff Edge: When Partial AI Automation Is Worse Than None

· 11 min read
Tian Pan
Software Engineer

The first time a team automates 70% of a manual process and ships worse outcomes than before, the diagnosis almost always starts in the wrong place. Engineers look at the automated portion: maybe the model accuracy is off, maybe the pipeline has a bug. What they rarely examine is whether the automation itself—by existing—made the remaining 30% of human work structurally impossible to do well.

This is the automation cliff edge. Not a failure of the automated component, but a failure of the seam between automated and manual.

Choosing Eval Metrics Is a Product Decision, Not a Technical One

· 10 min read
Tian Pan
Software Engineer

A team building an LLM-based literature screening tool celebrated 96% accuracy on their test set. Their model was, by any standard engineering metric, performing excellently. There was one problem: it found zero true positives. It had learned to classify everything as irrelevant and still scored near-perfect accuracy, because relevant papers were rare in the dataset. The failure wasn't in the model — it was in the metric.

This failure mode is not exotic. It plays out silently across AI teams every week, in codebases where engineers select evaluation metrics the way they'd select a sorting algorithm: as a technical choice with a right answer. The framing is wrong. Metric selection is a product decision. It encodes which failure modes you're willing to tolerate, which users you're optimizing for, and what "good" actually means for your specific context. Getting this wrong produces eval suites that look rigorous and measure the wrong thing.

Golden Paths for AI Agents: How Platform Teams Can Enable Adoption Without Becoming a Bottleneck

· 11 min read
Tian Pan
Software Engineer

The most common failure mode for AI platform teams isn't technical. It's organizational: the central platform team becomes a gate that every product team must pass through to get any AI capability into production. Request queue grows. Cycle times balloon from days to weeks. Product teams get frustrated and start stitching together unofficial workarounds — hardcoded API keys, shadow LLM integrations, vendor accounts on personal credit cards. By the time the platform team notices, half the organization is running AI outside any governance structure.

The problem isn't that platform teams care about governance. It's that they implemented governance as an approval workflow instead of as infrastructure.

Training Data Self-Poisoning: When Your AI Feature Corrupts Its Own Ground Truth

· 10 min read
Tian Pan
Software Engineer

Your recommendation model launched three months ago. Click-through rates are up 18%. Watch time is climbing. The dashboard is green. Leadership is happy.

And your model is quietly destroying the data it will use to train its next version.

This is training data self-poisoning: a feedback loop where a deployed AI feature shifts user behavior in ways that corrupt the interaction data the model was originally trained to learn from. The worst part is that your standard engagement metrics will tell you everything is fine — right up until they don't.

The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features

· 8 min read
Tian Pan
Software Engineer

A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.

Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.

Agent Blast Radius: Bounding Worst-Case Impact Before Your Agent Misfires in Production

· 10 min read
Tian Pan
Software Engineer

Nine seconds. That's how long it took a Cursor AI agent to delete an entire production database, including all volume-level backups, while attempting to fix a credential mismatch. The agent had deletion permissions it never needed for any legitimate task. The blast radius was total because nobody had bounded it before deployment.

This isn't a story about model failure. It's a story about permission scope. The model did exactly what it calculated it should do. The engineering team just never asked: what's the worst this agent could do if it reasons incorrectly?

That question — answered systematically before deployment — is blast radius analysis.

Why AI Engineering Training Programs Are Perpetually Behind the Models

· 9 min read
Tian Pan
Software Engineer

In early 2023, a flood of corporate AI training programs launched with the same selling point: we will teach your engineers prompt engineering. By the time most of them finished their first cohort, the specific techniques they were teaching had already been automated away by the models themselves. By 2025, the role of "prompt engineer" — briefly advertised at $200,000 salaries — was effectively obsolete. The training programs are still running.

This is the AI curriculum trap. It is not a problem of effort or budget. Organizations invest heavily in structured AI training, certification programs, and hiring rubrics built around tool proficiency. But the tools change faster than any curriculum can track, and the result is a permanent, structural lag: training programs are always teaching the AI engineering of 18 months ago.

AI Feature Payback: The ROI Model Your Finance Team Won't Fight You On

· 10 min read
Tian Pan
Software Engineer

Every engineering team shipping AI features eventually hits the same wall: finance wants a spreadsheet that justifies the spend, and the spreadsheet you built doesn't actually work.

The problem isn't that AI features lack ROI. The problem is that AI economics break every assumption the standard ROI model was built on — fixed capital, linear cost curves, predictable timelines. Teams that treat AI spending like SaaS licensing get numbers that either look deceptively good before launch or collapse six months into production. The ten-fold gap between measured AI initiatives (55% ROI) and ad-hoc deployments (5.9% ROI) comes almost entirely from whether teams got the measurement model right before they shipped.