Skip to main content

861 posts tagged with "insider"

View all tags

Model Deprecation Readiness: Auditing Your Behavioral Dependency Before the 90-Day Countdown

· 8 min read
Tian Pan
Software Engineer

When Anthropic deprecated a Claude model last year, a company noticed — but only because a downstream parser started throwing errors in production. The culprit? The new model occasionally wrapped its JSON responses in markdown code blocks. The old model never did. Nobody had documented that assumption. Nobody had tested for it. The fix took an afternoon; the diagnosis took three days.

That pattern — silent behavioral dependency breaking loudly in production — is the defining failure mode of model migrations. You update a model ID, run a quick sanity check, and ship. Six weeks later, something subtle is wrong. Your JSON parsing is 0.6% more likely to fail. Your refusal rate on edge cases doubled. Your structured extraction misses a field it used to reliably populate. The diff isn't in the code — it's in the model's behavior, and you never wrote a contract for it.

With major providers now running on 60–180 day deprecation windows, and the pace of model releases accelerating, this is no longer a theoretical concern. It's a recurring operational challenge. Here's how to get ahead of it.

Model Routing in Production: When the Router Costs More Than It Saves

· 10 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.

The router itself was cheap — a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not 30%. The 40% it handled locally had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing — it just wasn't routing well.

This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.

Prompt Regression Tests That Actually Block PRs

· 10 min read
Tian Pan
Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build — they're running in under three minutes at under a dollar using a few design decisions that make the problem tractable.

Retrieval Debt: Why Your RAG Pipeline Degrades Silently Over Time

· 10 min read
Tian Pan
Software Engineer

Six months after you shipped your RAG pipeline, something changed. Users aren't complaining loudly — they're just trusting the answers a little less. Feedback ratings dropped from 4.2 to 3.7. A few support tickets reference "outdated information." Your engineers look at the logs and see no errors, no timeouts, no obvious regression. The retrieval pipeline looks healthy by every metric you've configured.

It isn't. It's rotting.

Retrieval debt is the accumulated technical decay in a vector index: stale embeddings that no longer represent current document content, tombstoned chunks from deleted records that pollute search results, and semantic drift between the encoder version that indexed your corpus and the encoder version now computing query embeddings. Unlike code rot, retrieval debt produces no stack traces. It produces subtly wrong answers with confident-looking citations.

Writing Acceptance Criteria for Non-Deterministic AI Features

· 12 min read
Tian Pan
Software Engineer

Your engineering team has been building a document summarizer for three months. The spec says: "The summarizer should return accurate summaries." You ship it. Users complain the summaries are wrong half the time. A postmortem reveals no one could define what "accurate" meant in a way that was testable before launch.

This is the standard arc for AI feature development, and it happens because teams apply acceptance criteria patterns built for deterministic software to systems that are fundamentally probabilistic. An LLM-powered summarizer doesn't have a single "correct" output — it has a distribution of outputs, some acceptable and some not. Binary pass/fail specs don't map onto distributions.

The problem isn't just philosophical. It causes real pain: features launch with vague quality bars, regressions go undetected until users notice, and product and engineering can't agree on whether a feature is "done" because nobody specified what "done" means for a stochastic system. This post walks through the patterns that actually work.

The Silent Regression: How to Communicate AI Behavioral Changes Without Losing User Trust

· 9 min read
Tian Pan
Software Engineer

Your power users are your canaries. When you ship a new model version or update a system prompt, aggregate evaluation metrics tick upward — task completion rates improve, hallucination scores drop, A/B tests declare victory. Then your most sophisticated users start filing bug reports. "It used to just do X. Now it lectures me first." "The formatting changed and broke my downstream parser." "I can't get it to stay in character anymore." They aren't imagining things. You shipped a regression, you just didn't see it in your dashboards.

This is the central paradox of AI product development: the users most harmed by behavioral drift are the ones who invested most in understanding the system's quirks. They built workflows around specific output patterns. They learned which prompts reliably triggered which behaviors. When you change the model, you don't just ship updates — you silently invalidate months of their calibration work.

AI-Assisted Codebase Migration at Scale: Automating the Upgrades Nobody Wants to Touch

· 11 min read
Tian Pan
Software Engineer

When Airbnb needed to migrate 3,500 React test files from Enzyme to React Testing Library, they estimated the project at 1.5 years of manual effort. They shipped it in 6 weeks using an LLM-powered pipeline. When Google studied 39 distinct code migrations executed over 12 months by a team of 3 developers—595 code changes, 93,574 edits—they found that 74% of the edits were AI-generated, 87% of those were committed without human modification, and the overall migration timeline was cut by 50%.

These numbers are real. But so is this: during those same migrations, engineers spent approximately 50% of their time validating AI output—fixing context window failures, cleaning up hallucinated imports, and untangling business logic errors the tests didn't catch. The efficiency gains are genuine and the pain points are genuine. The question isn't whether AI belongs in code migrations; it's knowing exactly where it helps and where it creates more cleanup than it saves.

The AI-Generated Code Maintenance Trap: What Teams Discover Six Months Too Late

· 11 min read
Tian Pan
Software Engineer

The pattern is almost universal across teams that adopted coding agents in 2023 and 2024. In month one, velocity doubles. In month three, management holds up the productivity metrics as evidence that AI investment is paying off. By month twelve, the engineering team can't explain half the codebase to new hires, refactoring has become prohibitively expensive, and engineers spend more time debugging AI-generated code than they would have spent writing it by hand.

This isn't a story about AI code being secretly bad. It's a story about how the quality characteristics of AI-generated code systematically defeat the organizational practices teams already had in place — and how those practices need to change before the debt compounds beyond recovery.

AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.

When Everyone Has an AI Coding Agent: The Team Dynamics Nobody Warned You About

· 10 min read
Tian Pan
Software Engineer

A team of twelve engineers adopts AI coding tools enthusiastically. Six months later, each engineer is merging nearly twice as many pull requests. The engineering manager celebrates. Then the on-call rotation starts paging. Debugging sessions last twice as long. Nobody can explain why a particular module was structured the way it was. The engineer who wrote it replies honestly: "I don't know — the AI generated most of it and it seemed fine."

This scenario is playing out at companies everywhere. The individual productivity story is real: developers finish tasks faster, write more tests, and clear backlogs more efficiently. The team-level story is more complicated, and most organizations aren't ready for it.

AI Succession Planning: What Happens When the Team That Knows the Prompts Leaves

· 11 min read
Tian Pan
Software Engineer

The engineer who built your customer support AI leaves for another job. On their last day, you do an offboarding interview and ask them to document what they know. They write a few paragraphs explaining how the system works. Six months later, customer satisfaction scores start slipping. Someone suggests tightening the tone of the system prompt. Another engineer makes the edit, runs a few manual tests, and ships it. Three weeks later, you discover that a specific phrasing in the original system prompt was load-bearing in ways nobody knew — it was the only thing preventing the model from over-escalating tickets on Friday afternoons, a pattern the original engineer had noticed and quietly fixed with a single sentence.

No one knew that sentence existed for a reason. It looked like implementation detail. It was actually institutional knowledge.

AI User Research: What Users Actually Need Before You Write the First Prompt

· 10 min read
Tian Pan
Software Engineer

Most teams decide they're building an AI feature, then ask users: "Would you want this?" Users say yes. The feature ships. Three months later, weekly active usage is at 12% and plateauing. The postmortem blames implementation or adoption, but the real failure happened before a single line of code was written — in the user research phase that felt thorough but was methodologically broken.

The core problem: users cannot accurately predict their preferences for capabilities they have never experienced. This isn't a minor wrinkle. A study on AI writing assistance found that systems designed from users' stated preferences achieved only 57.7% accuracy — actually underperforming naive baselines that ignored user-stated preferences entirely. You can do a user research sprint that runs for weeks, collect extensive qualitative feedback, and end up with a product nobody uses — not despite the research, but partly because of how it was conducted.