907 posts tagged with "insider"

Prompt Regression Tests That Actually Block PRs

April 18, 2026 · 10 min read

Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, those outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints, but none of them is insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build. Their suites run in under three minutes and cost under a dollar per build, thanks to a few design decisions that make the problem tractable.
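To make "a bad prompt can fail a pull request" concrete, here is a minimal sketch of a merge gate. It is one plausible shape, not necessarily the design these teams use: the names `PromptCase`, `pass_rate`, and `gate` are hypothetical, the 0.9 threshold is an arbitrary assumption, and `fake_generate` is a stub standing in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptCase:
    """One regression case: an input plus a cheap, deterministic check."""
    name: str
    prompt_input: str
    check: Callable[[str], bool]


def pass_rate(cases: list[PromptCase], generate: Callable[[str], str]) -> float:
    """Run every case once and return the fraction whose check passed."""
    if not cases:
        raise ValueError("at least one case is required")
    passed = sum(1 for case in cases if case.check(generate(case.prompt_input)))
    return passed / len(cases)


def gate(cases: list[PromptCase], generate: Callable[[str], str],
         min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    return 0 if pass_rate(cases, generate) >= min_pass_rate else 1


def fake_generate(text: str) -> str:
    """Stub model (assumption): a real suite would call the model here."""
    return '{"sentiment": "positive"}' if "love" in text else "I am not sure."


CASES = [
    PromptCase("json_shape", "I love this product",
               lambda out: out.startswith("{")),
    PromptCase("no_hedging", "I love the new release",
               lambda out: "not sure" not in out),
    PromptCase("handles_negative", "This is terrible",
               lambda out: out.startswith("{")),
]

if __name__ == "__main__":
    # A non-zero exit code is what makes the CI job, and so the PR, fail.
    raise SystemExit(gate(CASES, fake_generate))
```

With the stub above, two of the three cases pass, so the default threshold blocks the merge. Gating on a pass rate rather than on every single case is one way to tolerate non-determinism without letting a broad regression through.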

Retrieval Debt: Why Your RAG Pipeline Degrades Silently Over Time

April 18, 2026 · 10 min read

Tian Pan

Software Engineer

Six months after you shipped your RAG pipeline, something changed. Users aren't complaining loudly — they're just trusting the answers a little less. Feedback ratings dropped from 4.2 to 3.7. A few support tickets reference "outdated information." Your engineers look at the logs and see no errors, no timeouts, no obvious regression. The retrieval pipeline looks healthy by every metric you've configured.

It isn't. It's rotting.

Retrieval debt is the accumulated technical decay in a vector index: stale embeddings that no longer represent current document content, tombstoned chunks from deleted records that pollute search results, and semantic drift between the encoder version that indexed your corpus and the encoder version now computing query embeddings. Unlike code rot, retrieval debt produces no stack traces. It produces subtly wrong answers with confident-looking citations.
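The three forms of decay named above can each be detected with metadata alone, before any answer quality metric moves. The sketch below assumes every index entry stores a hash of its source document and the encoder version that produced its vector. `IndexEntry`, `audit_index`, and the field names are hypothetical and do not correspond to any particular vector database's API.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    chunk_id: str
    doc_id: str
    source_hash: str      # hash of the source document when it was embedded
    encoder_version: str  # encoder that produced the stored vector


def audit_index(entries: list[IndexEntry],
                live_docs: dict[str, str],
                query_encoder_version: str) -> dict[str, list[str]]:
    """Classify chunks by kind of debt.

    live_docs maps doc_id -> hash of the document's current content.
    """
    report: dict[str, list[str]] = {
        "tombstoned": [], "stale": [], "encoder_drift": [],
    }
    for entry in entries:
        if entry.doc_id not in live_docs:
            # The source record was deleted but its chunk is still searchable.
            report["tombstoned"].append(entry.chunk_id)
            continue
        if entry.source_hash != live_docs[entry.doc_id]:
            # The document changed after the embedding was computed.
            report["stale"].append(entry.chunk_id)
        if entry.encoder_version != query_encoder_version:
            # Stored vector and query vector come from different encoders.
            report["encoder_drift"].append(entry.chunk_id)
    return report
```

Run on a schedule, an audit like this turns silent decay into a count that can be charted and alerted on.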

Writing Acceptance Criteria for Non-Deterministic AI Features

April 17, 2026 · 12 min read

Tian Pan

Software Engineer

Your engineering team has been building a document summarizer for three months. The spec says: "The summarizer should return accurate summaries." You ship it. Users complain the summaries are wrong half the time. A postmortem reveals no one could define what "accurate" meant in a way that was testable before launch.

This is the standard arc for AI feature development, and it happens because teams apply acceptance criteria patterns built for deterministic software to systems that are fundamentally probabilistic. An LLM-powered summarizer doesn't have a single "correct" output — it has a distribution of outputs, some acceptable and some not. Binary pass/fail specs don't map onto distributions.

The problem isn't just philosophical. It causes real pain: features launch with vague quality bars, regressions go undetected until users notice, and product and engineering can't agree on whether a feature is "done" because nobody specified what "done" means for a stochastic system. This post walks through the patterns that actually work.
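One way to turn "a distribution of outputs" into something testable is to state the criterion as a rate with a confidence requirement, for example "at least 80% of summaries are rated acceptable, with 95% confidence." The sketch below uses the Wilson score interval for that purpose. It is an illustration under stated assumptions, not necessarily one of the patterns the post covers: `meets_criterion` is a hypothetical helper, and the 80% bar is an example value.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / denom


def meets_criterion(successes: int, trials: int, required_rate: float) -> bool:
    """Pass only if the true acceptable rate is plausibly >= required_rate."""
    return wilson_lower_bound(successes, trials) >= required_rate


if __name__ == "__main__":
    # Same observed 90% rate, very different evidence.
    print(meets_criterion(90, 100, 0.80))  # True
    print(meets_criterion(9, 10, 0.80))    # False
```

The second call fails even though nine of ten samples were acceptable, because ten samples cannot rule out a true rate well below 80%. A criterion written this way forces product and engineering to agree on both the bar and the sample size before launch.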

The Silent Regression: How to Communicate AI Behavioral Changes Without Losing User Trust

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

Your power users are your canaries. When you ship a new model version or update a system prompt, aggregate evaluation metrics tick upward: task completion rates improve, hallucination scores drop, A/B tests declare victory. Then your most sophisticated users start filing bug reports. "It used to just do X. Now it lectures me first." "The formatting changed and broke my downstream parser." "I can't get it to stay in character anymore." They aren't imagining things. You shipped a regression; you just didn't see it in your dashboards.

This is the central paradox of AI product development: the users most harmed by behavioral drift are the ones who invested most in understanding the system's quirks. They built workflows around specific output patterns. They learned which prompts reliably triggered which behaviors. When you change the model, you don't just ship updates — you silently invalidate months of their calibration work.

AI-Assisted Codebase Migration at Scale: Automating the Upgrades Nobody Wants to Touch

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

When Airbnb needed to migrate 3,500 React test files from Enzyme to React Testing Library, they estimated the project at 1.5 years of manual effort. They shipped it in 6 weeks using an LLM-powered pipeline. When Google studied 39 distinct code migrations executed over 12 months by a team of 3 developers—595 code changes, 93,574 edits—they found that 74% of the edits were AI-generated, 87% of those were committed without human modification, and the overall migration timeline was cut by 50%.

These numbers are real. But so is this: during those same migrations, engineers spent approximately 50% of their time validating AI output—fixing context window failures, cleaning up hallucinated imports, and untangling business logic errors the tests didn't catch. The efficiency gains are genuine and the pain points are genuine. The question isn't whether AI belongs in code migrations; it's knowing exactly where it helps and where it creates more cleanup than it saves.

The AI-Generated Code Maintenance Trap: What Teams Discover Six Months Too Late

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

The pattern is almost universal across teams that adopted coding agents in 2023 and 2024. In month one, velocity doubles. In month three, management holds up the productivity metrics as evidence that AI investment is paying off. By month twelve, the engineering team can't explain half the codebase to new hires, refactoring has become prohibitively expensive, and engineers spend more time debugging AI-generated code than they would have spent writing it by hand.

This isn't a story about AI code being secretly bad. It's a story about how the quality characteristics of AI-generated code systematically defeat the organizational practices teams already had in place — and how those practices need to change before the debt compounds beyond recovery.

AI Oncall: What to Page On When Your System Thinks

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.
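Alerts for this failure mode have to look at behavior rather than infrastructure. As a rough sketch, the two checks below page on spend that outruns the budget and on agents that repeat the same action too many times within a window of recent steps. Every name here (`AgentStep`, `cost_overrun`, `looping_agents`, `should_page`) and both default thresholds are assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AgentStep:
    agent: str
    action: str      # e.g. tool name plus normalized arguments
    cost_usd: float


def cost_overrun(steps: list[AgentStep], budget_usd: float,
                 tolerance: float = 2.0) -> bool:
    """True when spend in the window exceeds budget times tolerance."""
    return sum(step.cost_usd for step in steps) > budget_usd * tolerance


def looping_agents(steps: list[AgentStep], max_repeats: int = 5) -> set[str]:
    """Agents that issued the same action more than max_repeats times."""
    counts = Counter((step.agent, step.action) for step in steps)
    return {agent for (agent, _), n in counts.items() if n > max_repeats}


def should_page(steps: list[AgentStep], budget_usd: float) -> list[str]:
    """Return the reasons to page; an empty list means stay quiet."""
    reasons = []
    if cost_overrun(steps, budget_usd):
        reasons.append("cost_overrun")
    loops = looping_agents(steps)
    if loops:
        reasons.append("loop:" + ",".join(sorted(loops)))
    return reasons
```

Neither check depends on an error being raised, which is the point: both can fire while every infrastructure dashboard stays green.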

When Everyone Has an AI Coding Agent: The Team Dynamics Nobody Warned You About

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

A team of twelve engineers adopts AI coding tools enthusiastically. Six months later, each engineer is merging nearly twice as many pull requests. The engineering manager celebrates. Then the on-call rotation starts paging. Debugging sessions last twice as long. Nobody can explain why a particular module was structured the way it was. The engineer who wrote it replies honestly: "I don't know — the AI generated most of it and it seemed fine."

This scenario is playing out at companies everywhere. The individual productivity story is real: developers finish tasks faster, write more tests, and clear backlogs more efficiently. The team-level story is more complicated, and most organizations aren't ready for it.

AI Succession Planning: What Happens When the Team That Knows the Prompts Leaves

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

The engineer who built your customer support AI leaves for another job. On their last day, you do an offboarding interview and ask them to document what they know. They write a few paragraphs explaining how the system works. Six months later, customer satisfaction scores start slipping. Someone suggests tightening the tone of the system prompt. Another engineer makes the edit, runs a few manual tests, and ships it. Three weeks later, you discover that a specific phrasing in the original system prompt was load-bearing in ways nobody knew — it was the only thing preventing the model from over-escalating tickets on Friday afternoons, a pattern the original engineer had noticed and quietly fixed with a single sentence.

No one knew that sentence existed for a reason. It looked like implementation detail. It was actually institutional knowledge.

AI User Research: What Users Actually Need Before You Write the First Prompt

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams decide they're building an AI feature, then ask users: "Would you want this?" Users say yes. The feature ships. Three months later, weekly active usage is at 12% and plateauing. The postmortem blames implementation or adoption, but the real failure happened before a single line of code was written — in the user research phase that felt thorough but was methodologically broken.

The core problem: users cannot accurately predict their preferences for capabilities they have never experienced. This isn't a minor wrinkle. A study on AI writing assistance found that systems designed from users' stated preferences achieved only 57.7% accuracy — actually underperforming naive baselines that ignored user-stated preferences entirely. You can do a user research sprint that runs for weeks, collect extensive qualitative feedback, and end up with a product nobody uses — not despite the research, but partly because of how it was conducted.

Ambient AI Architecture: Designing Always-On Agents That Don't Get Disabled

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams building ambient AI ship something users immediately turn off.

The pattern is consistent: the team demos the feature internally, everyone agrees it's useful in theory, and within two weeks of launch the disable rate exceeds 60%. This isn't a model quality problem. It's an architecture problem — and specifically an interrupt threshold problem. Teams design their ambient agents around what the AI can do rather than what users will tolerate when they didn't ask for help.

The gap between explicit invocation ("ask the AI") and ambient monitoring ("the AI watches and acts") is not just a UX question. It demands a fundamentally different system architecture, a different event model, and a different mental model for when an AI agent earns the right to speak.

Your Annotation Pipeline Is the Real Bottleneck in Your AI Product

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Every team working on an AI product eventually ships a feedback widget. Thumbs up. Thumbs down. Maybe a star rating or a correction field. The widget launches. The data flows. And then nothing changes about the model — for weeks, then months — while the team remains genuinely convinced they have a working feedback loop.

The widget was the easy part. The annotation pipeline behind it is where AI products actually stall.
