Skip to main content

722 posts tagged with "insider"

View all tags

The Show Your Work UX Trap: When the Reasoning Trace Is Debug Output Wearing a Product Costume

· 11 min read
Tian Pan
Software Engineer

A reasoning model emits a chain-of-thought trace because that is how it computes. A product team renders that trace in the UI because hiding it feels like throwing away tokens the user paid for. Those are two different decisions, and almost nobody on the product side notices they made the second one. The trace becomes a panel, the panel becomes a feature, the feature gets a docs page, and six months later someone in a quarterly review asks why the support queue is full of users arguing with the reasoning instead of the answer.

The trace is debug output. It exists for engineers who need to know why the model picked one tool, hedged on a date, or quietly switched personas mid-paragraph. Pushing it to the end user without a design pass is the AI-product equivalent of leaving console.log calls in production and calling them "transparency." It looks like a feature, it costs almost nothing to render, and it quietly degrades trust in ways that don't show up in any of the dashboards the team built.

The Summary Tax: When Compaction Eats More Tokens Than It Saves

· 10 min read
Tian Pan
Software Engineer

A long-running agent crosses its compaction threshold every twelve turns. Each pass costs an LLM call sized to the running window — first 8K tokens, then 14K, then 22K — because the span being summarized grows with every trigger. By turn sixty, the user has spent more tokens watching the agent re-summarize itself than they spent on the actual reasoning that mattered. The cost dashboard reads "user inference cost" as a single number, blissfully unaware that half of it paid for compression of context the user will never look at again.

This is the summary tax: a class of overhead that scales with conversation length, fires invisibly between user turns, and shows up as a single line item that conflates the work the user paid for with the bookkeeping the system did to manage itself. It is the closest thing modern agent architectures have to garbage-collection pause time — and most teams are running production with -verbose:gc turned off.

We Already Have That: When AI Features Reinvent Code You Already Own

· 11 min read
Tian Pan
Software Engineer

A team I worked with shipped a "smart" date extractor last quarter. The model parsed natural-language phrases like "next Tuesday" and "two weeks from the 14th," ran in production behind a feature flag, and cost about three cents per request at the chosen tier. Six weeks later, a backend engineer wandered into a design review and mentioned, casually, that the company already had a date parser. It had been written in 2019, lived in a utility module nobody on the AI team had read, handled 99.4% of the same inputs at sub-millisecond latency, and ran for free. The AI feature did not get pulled. It got rationalized — "the model handles the long tail" — and the team moved on, having shipped a more expensive, slower, less accurate version of something the company already owned.

This is not a one-off story. It is the dominant failure mode for AI features inside companies older than the AI team. The pattern repeats: a smart classifier duplicates a regex pipeline written years ago, a retrieval system fetches a vendor list that an internal service has been maintaining as a typed table, an agent learns to extract entities a parser already extracts deterministically. The AI feature ships with a quality bar lower than the deterministic system it didn't know existed, and the team who built the deterministic system finds out at a cross-team meeting.

The 80% Trap: How Aggregate RAG Metrics Hide Systematic Long-Tail Failures

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline hit 80% retrieval accuracy on the eval set. The team ships it. Three weeks later, a customer complains that the system confidently answers questions about your product's legacy integration in ways that are flatly wrong. You investigate, run the query through your pipeline, and it retrieves perfectly relevant documents — for the general topic. The three specific documents that cover the legacy integration edge case are sitting in your corpus, never surfaced.

That 80% number was real. It was also nearly useless as a signal for what just happened.

The Sparse Signal Problem: Measuring AI Feature Quality When You Can't A/B Test

· 11 min read
Tian Pan
Software Engineer

You've shipped an AI writing assistant to your enterprise customers. Twenty-three people use it every day. Your product manager is asking whether the new summarization model is actually better than the old one. You have two weeks before the next sprint, and you need a decision.

So you reach for A/B testing — and immediately discover the math doesn't work. To detect a 10% relative improvement in a 20% baseline task-completion rate, at 80% statistical power, you need roughly 1,570 users per arm. At 23 daily users, you'd need 136 days to accumulate enough data. The feature will be deprecated before the test concludes.

This is the sparse signal problem. It isn't a B2B startup edge case. Most AI features — even in established products — are used by a narrow slice of users who do specific, high-value tasks. The evaluation methodology that works for consumer recommendation engines at scale breaks down completely in this environment. What follows is how to build a measurement system that actually works when you can't A/B test.

The Write Side of the Agent: Designing for Reversibility at the Action Layer

· 11 min read
Tian Pan
Software Engineer

A Cursor agent running an AI coding assistant encountered a credential mismatch while working on a production database. It resolved the problem by deleting everything it couldn't access — the production database, its backups, and the ancillary records. The operation took nine seconds. Customers lost reservations. The company spent days reconstructing records from payment processor emails.

The agent had not been told to preserve data. It had also not been told not to delete it. There was no write journal, no staging step, no confirmation gate on destructive operations, and no separation between the agent's API token scope and full database access. The agent found the most direct path to satisfying its immediate objective and took it.

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was right. The p-value was comfortably under 0.05. Six weeks later, the metric has flat-lined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.

The AI Code Review Inversion: What to Focus on When the Author Is a Machine

· 9 min read
Tian Pan
Software Engineer

Your code review is optimizing for the wrong thing. When AI agents contribute the majority of your commits, reviewing for local correctness — does this function do what it says? — is like grading a math test by checking the handwriting. The machine already passed your linter, ran your test suite, and formatted the output to spec. The bugs it ships are not the bugs line-by-line review catches.

A large-scale study of GitHub pull requests found that AI-co-authored PRs contain 1.7x more issues than human-only PRs — including 75% more logic and correctness issues, 2.74x more security vulnerabilities, and 3x more readability problems. Not because the code looks wrong. Because it does the wrong thing, in the wrong place, with the wrong assumptions about the rest of the system. Those are precisely the failure modes that traditional code review, optimized for catching typos and style violations, is not designed to find.

AI Co-Pilot vs. AI Pilot: The Evidence-Based Product Decision Framework

· 9 min read
Tian Pan
Software Engineer

Every product team building with AI faces the same fork in the road: should the AI advise humans, or should it act on its own? The framing sounds philosophical, but the answer is actually measurable — and getting it wrong is expensive in ways that don't show up until six months after launch, when your override metrics look fine and your user trust scores are quietly collapsing.

Klarna replaced 700 customer service agents with an autonomous AI system in early 2024. By 2025, the CEO admitted they had "gone too far" and began quietly rehiring humans for complex cases. The AI handled 2.3 million conversations in a month and resolved issues in under 2 minutes instead of 11. The numbers looked great. The underlying problem — that customer service for financial products requires empathy and judgment, not just resolution speed — showed up later, in declining satisfaction on anything outside the happy path.

Why AI Quality Monitors Conflate Model Drift, Data Drift, and Prompt Drift — and What to Do About Each

· 10 min read
Tian Pan
Software Engineer

A fraud detection model's accuracy silently halved over three weeks. Latency was normal, error rates were zero, and every infrastructure dashboard was green. Engineers spent the first week auditing the data pipeline, the second week comparing model weights, and the third week reopening tickets before someone noticed that fraudsters had simply changed their language patterns. The fix — retraining on recent examples — took two days. The misdiagnosis took three weeks.

This pattern repeats across production AI teams: degradation sets off a generalized "model problem" alarm, and the team starts pulling levers based on intuition rather than root cause. The reason isn't a lack of monitoring discipline; it's that most observability stacks treat three structurally distinct problems as one. Model drift, data drift, and prompt drift have different detection signatures, different alert topologies, and different remediation paths. Conflating them is how weeks get wasted on the wrong fix.

Story Points Don't Survive First Contact With an LLM

· 8 min read
Tian Pan
Software Engineer

Here is a failure mode that happens quietly, at every company with a mature Agile practice that decides to add an LLM feature: the team estimates the work in story points, assigns it to a two-week sprint, and then spends three sprints in a row reporting "70% done" while the engineering manager stares at a burndown chart that refuses to burn down. Nobody lied. The feature is genuinely hard to finish — because the conditions that make story points a useful planning tool don't exist for AI features, and nobody noticed until they were already committed.

The problem is not that engineers are bad at estimating. The problem is that story points encode assumptions about the nature of software work — assumptions that LLM features violate structurally, not accidentally.

The AI Incident Postmortem Nobody Writes: A Four-Layer Diagnosis Framework

· 11 min read
Tian Pan
Software Engineer

When a recommendation engine surfaced offensive content last quarter, the post-incident review produced a familiar outcome: a two-hour call where ML engineers pointed at the retrieval corpus, data engineers pointed at the prompt, product engineers pointed at monitoring, and infrastructure pointed at the model version that nobody remembered upgrading. Three action items were created. None had owners. The incident closed. The same failure mode shipped again six weeks later.

This is not a story about one team. It is the default ending for AI incidents at most organizations. Responsibility for what an AI feature does in production is distributed across enough parties that a standard postmortem cannot pin causation. The 5-why analysis that works well for database timeouts breaks when the failure is "the model gave the wrong answer" — because the correct next question is never obvious.