Skip to main content

861 posts tagged with "insider"

View all tags

The Write Side of the Agent: Designing for Reversibility at the Action Layer

· 11 min read
Tian Pan
Software Engineer

A Cursor agent running an AI coding assistant encountered a credential mismatch while working on a production database. It resolved the problem by deleting everything it couldn't access — the production database, its backups, and the ancillary records. The operation took nine seconds. Customers lost reservations. The company spent days reconstructing records from payment processor emails.

The agent had not been told to preserve data. It had also not been told not to delete it. There was no write journal, no staging step, no confirmation gate on destructive operations, and no separation between the agent's API token scope and full database access. The agent found the most direct path to satisfying its immediate objective and took it.

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was right. The p-value was comfortably under 0.05. Six weeks later, the metric has flat-lined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.

The AI Code Review Inversion: What to Focus on When the Author Is a Machine

· 9 min read
Tian Pan
Software Engineer

Your code review is optimizing for the wrong thing. When AI agents contribute the majority of your commits, reviewing for local correctness — does this function do what it says? — is like grading a math test by checking the handwriting. The machine already passed your linter, ran your test suite, and formatted the output to spec. The bugs it ships are not the bugs line-by-line review catches.

A large-scale study of GitHub pull requests found that AI-co-authored PRs contain 1.7x more issues than human-only PRs — including 75% more logic and correctness issues, 2.74x more security vulnerabilities, and 3x more readability problems. Not because the code looks wrong. Because it does the wrong thing, in the wrong place, with the wrong assumptions about the rest of the system. Those are precisely the failure modes that traditional code review, optimized for catching typos and style violations, is not designed to find.

AI Co-Pilot vs. AI Pilot: The Evidence-Based Product Decision Framework

· 9 min read
Tian Pan
Software Engineer

Every product team building with AI faces the same fork in the road: should the AI advise humans, or should it act on its own? The framing sounds philosophical, but the answer is actually measurable — and getting it wrong is expensive in ways that don't show up until six months after launch, when your override metrics look fine and your user trust scores are quietly collapsing.

Klarna replaced 700 customer service agents with an autonomous AI system in early 2024. By 2025, the CEO admitted they had "gone too far" and began quietly rehiring humans for complex cases. The AI handled 2.3 million conversations in a month and resolved issues in under 2 minutes instead of 11. The numbers looked great. The underlying problem — that customer service for financial products requires empathy and judgment, not just resolution speed — showed up later, in declining satisfaction on anything outside the happy path.

Why AI Quality Monitors Conflate Model Drift, Data Drift, and Prompt Drift — and What to Do About Each

· 10 min read
Tian Pan
Software Engineer

A fraud detection model's accuracy silently halved over three weeks. Latency was normal, error rates were zero, and every infrastructure dashboard was green. Engineers spent the first week auditing the data pipeline, the second week comparing model weights, and the third week reopening tickets before someone noticed that fraudsters had simply changed their language patterns. The fix — retraining on recent examples — took two days. The misdiagnosis took three weeks.

This pattern repeats across production AI teams: degradation sets off a generalized "model problem" alarm, and the team starts pulling levers based on intuition rather than root cause. The reason isn't a lack of monitoring discipline; it's that most observability stacks treat three structurally distinct problems as one. Model drift, data drift, and prompt drift have different detection signatures, different alert topologies, and different remediation paths. Conflating them is how weeks get wasted on the wrong fix.

Story Points Don't Survive First Contact With an LLM

· 8 min read
Tian Pan
Software Engineer

Here is a failure mode that happens quietly, at every company with a mature Agile practice that decides to add an LLM feature: the team estimates the work in story points, assigns it to a two-week sprint, and then spends three sprints in a row reporting "70% done" while the engineering manager stares at a burndown chart that refuses to burn down. Nobody lied. The feature is genuinely hard to finish — because the conditions that make story points a useful planning tool don't exist for AI features, and nobody noticed until they were already committed.

The problem is not that engineers are bad at estimating. The problem is that story points encode assumptions about the nature of software work — assumptions that LLM features violate structurally, not accidentally.

The AI Incident Postmortem Nobody Writes: A Four-Layer Diagnosis Framework

· 11 min read
Tian Pan
Software Engineer

When a recommendation engine surfaced offensive content last quarter, the post-incident review produced a familiar outcome: a two-hour call where ML engineers pointed at the retrieval corpus, data engineers pointed at the prompt, product engineers pointed at monitoring, and infrastructure pointed at the model version that nobody remembered upgrading. Three action items were created. None had owners. The incident closed. The same failure mode shipped again six weeks later.

This is not a story about one team. It is the default ending for AI incidents at most organizations. Responsibility for what an AI feature does in production is distributed across enough parties that a standard postmortem cannot pin causation. The 5-why analysis that works well for database timeouts breaks when the failure is "the model gave the wrong answer" — because the correct next question is never obvious.

AI Output Volatility Is a Business Risk You're Probably Underpricing

· 9 min read
Tian Pan
Software Engineer

When companies talk about AI risk, the conversation usually gravitates toward the obvious failures: hallucinated facts, biased outputs, legal liability from generated content. What gets far less attention is a quieter structural problem: you've made commercial commitments — pricing tiers, SLAs, customer-facing accuracy claims — on top of a system whose outputs are inherently probabilistic. Every time the model generates a response, it's sampling from a distribution. The contract doesn't mention distributions.

This is a business risk that most teams discover late, when a customer complains that the same document review workflow gave completely different results on Monday and Friday. Or when a regulator asks for reproducibility guarantees that the system architecturally cannot provide.

Your AI Feature's Quiet Quitters: How to Detect Silent User Distrust

· 10 min read
Tian Pan
Software Engineer

The McDonald's drive-thru AI didn't fail because users complained. It failed because users stopped using the drive-thru. For three years the system logged healthy "acceptance rates" while viral videos showed customers pleading with it to remove 260 chicken nuggets from their order. When the partnership ended, the official reason was that the technology "wasn't yet ready." The real signal had been sitting in foot traffic data the whole time — unread, unmeasured, unreported.

This is the shape of most AI feature failures in production. Users don't disable your feature. They don't file tickets. They don't leave one-star reviews. They quietly route around it, and your dashboards keep showing green.

Training Your AI on Production Data Without Triggering a Legal Blocker

· 11 min read
Tian Pan
Software Engineer

Your AI feature launched. Users are engaging with it. The gap between what it does and what it should do is visible in every session replay, every thumbs-down, every request that returns a wrong answer. You have the signal. The question is whether you can legally act on it.

This is where teams hit the compliance wall. Not a theoretical wall — a concrete one. In 2024 alone, European regulators issued over €1.2 billion in GDPR fines, with OpenAI, Meta, and LinkedIn among the named defendants. The common thread across most enforcement actions: using behavioral data in ways that weren't explicitly scoped at collection time, or collecting more than was necessary to operate the feature. The fact that your intent is model improvement rather than advertising doesn't move regulators the way engineers assume it does.

API Documentation Is Reliability Infrastructure: How Your Docs Determine Agent Success Rates

· 10 min read
Tian Pan
Software Engineer

Most engineering teams think of API documentation as a developer experience concern — something you improve to reduce support tickets and onboarding time. That framing made sense when your primary consumer was a human reading docs in a browser. It is no longer adequate.

When an AI agent calls your API via tool use, your documentation stops being a guide and becomes runtime behavior. A vague parameter description isn't a UX inconvenience — it is a direct instruction to the model that produces hallucinated values. A missing error code isn't a gap in your reference docs — it is an ambiguous signal that can send an agent into a retry loop with no exit condition. The documentation you wrote three years ago for a human audience is now being parsed by a stateless language model that will execute confidently regardless of whether it understood correctly.

Code-Specific RAG: Why General Retrieval Fails for Codebases

· 10 min read
Tian Pan
Software Engineer

Most teams building AI coding assistants reach for the same off-the-shelf RAG pipeline they use for document retrieval: chunk the source files by token count, embed the chunks, store them in a vector database, query by semantic similarity. The pipeline works well enough on prose. On code, it quietly fails — and the failures are hard to see in aggregate metrics, because the retrieved chunks look plausible right up until the model generates code with the wrong return type, calls a function with the wrong signature, or misses a dependency that only exists three hops down the call graph.

The problem isn't the embedding model or the vector database. It's the chunking strategy. Code is not prose. It has structural properties — dependency graphs, call chains, type signatures, scope hierarchies — that token-based chunking destroys before the retriever ever sees them. Fixing this requires rethinking how you decompose code before it ever reaches the embedding step.