61 posts tagged with "ai"

The AI Feature Retirement Playbook: How to Sunset What Users Barely Adopted

· 11 min read
Tian Pan
Software Engineer

Your team shipped an AI-powered summarization feature six months ago. Adoption plateaued at 8% of users. The model calls cost $4,000 a month. The one engineer who built it has moved to a different team. And now the model provider is raising prices.

Every instinct says: kill it. But killing an AI feature turns out to be significantly harder than killing any other kind of feature — and most teams find this out the hard way, mid-retirement, when the compliance questions start arriving and the power users revolt.

This is the playbook that should have existed before you shipped the feature, and it is most useful right now, when you're staring at usage graphs that point unmistakably toward the exit.
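As a gut check, the kill decision usually starts with arithmetic no dashboard shows you. A back-of-the-envelope sketch in Python, using the figures from the scenario above (the total user count is hypothetical):

```python
# Back-of-the-envelope cost framing for the retirement decision.
# The total user count is hypothetical; the other figures come from
# the scenario above.

total_users = 50_000          # hypothetical user base
adoption_rate = 0.08          # plateaued adoption from the scenario
monthly_model_cost = 4_000    # USD per month, from the scenario

active_users = total_users * adoption_rate
cost_per_active_user = monthly_model_cost / active_users

print(f"Active users: {active_users:,.0f}")                        # 4,000
print(f"Cost per active user: ${cost_per_active_user:.2f}/month")  # $1.00
# A provider price hike multiplies that last number directly, which is
# worth knowing before the new rates land.
```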

The AI Taste Problem: Measuring Quality When There's No Ground Truth

· 11 min read
Tian Pan
Software Engineer

Here's a scenario that plays out on most AI product teams: someone in leadership asks whether the new copywriting model is better than the old one. The team runs their eval suite, the accuracy numbers look good, and they ship. Three weeks later, the marketing team quietly goes back to the old model because the new one "sounds off." The accuracy metrics were real. They just measured the wrong thing.

This is the AI taste problem. It shows up wherever your outputs are subjective — copywriting, design suggestions, creative content, tone adjustments, style recommendations. When there's no objective ground truth, traditional ML evaluation frameworks give you a false sense of confidence. And most teams don't have a systematic answer for what to do instead.
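One systematic alternative that teams often reach for is pairwise preference testing: instead of scoring outputs against a ground truth that doesn't exist, show raters the old and new outputs side by side and count which one wins. A minimal sketch, with hypothetical rater judgments:

```python
# Pairwise preference evaluation: count how often raters prefer the new
# model's output over the old model's for the same prompt.
# The judgments list is hypothetical data.
from math import sqrt

judgments = ["new", "old", "new", "tie", "new", "old", "new", "new"]

decisive = [j for j in judgments if j != "tie"]   # ties carry no signal here
wins = sum(1 for j in decisive if j == "new")
n = len(decisive)
win_rate = wins / n

# 95% normal-approximation interval; prefer a Wilson interval at small n.
margin = 1.96 * sqrt(win_rate * (1 - win_rate) / n)
print(f"New model win rate: {win_rate:.0%} +/- {margin:.0%} (n={n})")
```

A win rate still won't tell you why the new model "sounds off," but unlike holdout accuracy, it at least measures the judgment that matters.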

Board-Level AI Governance: The Five Decisions Only Executives Can Make

· 9 min read
Tian Pan
Software Engineer

A major insurer's AI system was denying coverage claims. When humans reviewed those decisions, 90% were found to be wrong. The insurer's engineering team had built a performant model. Their MLOps team had solid deployment pipelines. Their data scientists had rigorous evaluation metrics. None of that mattered, because no one at the board level had ever answered the question: what is our acceptable failure rate for AI decisions that affect whether a sick person gets treated?

That gap — between functional technical systems and missing executive decisions — is where AI governance most often breaks down in practice. The result is organizations that are simultaneously running AI in production and exposed to liability they've never formally acknowledged.

The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks

· 10 min read
Tian Pan
Software Engineer

In late 2024, OpenAI's o3 system scored 75.7% on the ARC-AGI benchmark — a test specifically designed to resist optimization. The AI research community celebrated. Then practitioners looked closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times more resources than the baseline. It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough.

This is the evaluation paradox. The moment a benchmark becomes the thing teams optimize for, it stops measuring what it was designed to measure. Goodhart's Law — "when a measure becomes a target, it ceases to be a good measure" — was coined in the context of 1970s economic policy, but it describes AI benchmarking with eerie precision.
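A basic hygiene step, given the contamination story above, is checking whether eval items leaked into training data at all. A minimal sketch, assuming you can sample the training corpus; exact matching is deliberately crude, and production checks typically use normalized n-gram overlap:

```python
# Flag eval items whose text already appears verbatim in training data.
# All data here is hypothetical.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

training_sample = {normalize(doc) for doc in [
    "The quick brown fox jumps over the lazy dog.",
    "Benchmark item 17: rotate the grid 90 degrees clockwise.",
]}

eval_items = [
    "Benchmark item 17: rotate the grid 90 degrees clockwise.",
    "A task the model has never seen before.",
]

contaminated = [e for e in eval_items if normalize(e) in training_sample]
print(f"{len(contaminated)}/{len(eval_items)} eval items appear verbatim in training data")
```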

The Cold Start Problem in AI Personalization: Being Useful Before You Have Data

· 11 min read
Tian Pan
Software Engineer

Most personalization systems are built around a flywheel: users interact, you learn their preferences, you show better recommendations, they interact more. The flywheel spins faster as data accumulates. The problem is that a flywheel needs momentum before it produces anything, and a new user arrives with none.

This is the cold start problem. And it's more dangerous than most teams recognize when they first ship personalization. A new user arrives with no history, no signal, and often a skeptical prior: "AI doesn't know me." You have roughly 5–15 minutes to prove otherwise before they form an opinion that determines whether they'll stay long enough to generate the data that would let you actually help them. Up to 75% of new users abandon products in the first week if that window goes badly.

The cold start problem isn't a data problem. It's an initialization problem. The engineering question is: what do you inject in place of history?
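One common answer is to initialize every new user with a population prior and shrink toward their own signals as interactions accumulate. A minimal sketch, with hypothetical items and scores:

```python
# Cold-start initialization via Bayesian-style shrinkage: with no history
# the population prior dominates; as interactions accumulate, the user's
# own signal takes over. Item names and scores are hypothetical.

population_prior = {"item_a": 0.9, "item_b": 0.7, "item_c": 0.4}

def score(item: str, user_signals: dict[str, float],
          n_interactions: int, prior_strength: float = 10.0) -> float:
    w_user = n_interactions / (n_interactions + prior_strength)
    return (w_user * user_signals.get(item, 0.0)
            + (1 - w_user) * population_prior[item])

print(score("item_a", {}, n_interactions=0))                 # 0.9: pure prior
print(score("item_a", {"item_a": 0.2}, n_interactions=30))   # 0.375: mostly the user
```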

Why '92% Accurate' Is Almost Always a Lie

· 8 min read
Tian Pan
Software Engineer

You launch an AI feature. The model gets 92% accuracy on your holdout set. You present this to the VP of Product, the legal team, and the head of customer success. Everyone nods. The feature ships.

Three months later, a customer segment you didn't specifically test is experiencing a 40% error rate. Legal is asking questions. Customer success is fielding escalations. The VP of Product wants to know why no one flagged this.

The 92% figure was technically correct. It was also nearly useless as a decision-making input — because headline accuracy collapses exactly the information that matters most.
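The fix begins with disaggregation: slice the same predictions by segment before anyone sees a top-line number. A minimal sketch with hypothetical counts that reproduce a 92% headline hiding a 40% error rate in one segment:

```python
# Per-segment accuracy breakdown of the same predictions that produce a
# 92% headline number. The counts are hypothetical.
from collections import defaultdict

results = ([("enterprise", True)] * 860 + [("enterprise", False)] * 40
           + [("smb", True)] * 60 + [("smb", False)] * 40)

by_segment: dict[str, list[bool]] = defaultdict(list)
for segment, correct in results:
    by_segment[segment].append(correct)

overall = sum(correct for _, correct in results) / len(results)
print(f"Headline accuracy: {overall:.0%}")                    # 92%
for segment, outcomes in by_segment.items():
    acc = sum(outcomes) / len(outcomes)
    print(f"  {segment}: {acc:.0%} accuracy (n={len(outcomes)})")
    # enterprise: 96%; smb: 60%, i.e. the 40% error rate above
```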

Sampling Parameters in Production: The Tuning Decisions Nobody Explains

· 11 min read
Tian Pan
Software Engineer

Most engineers treat LLM quality regressions as a prompt engineering problem or a model capability problem. They rewrite system prompts, try a newer model, or add few-shot examples. They rarely check the three numbers sitting silently at the top of every API call: temperature, top-p, and top-k. But those defaults shape every response your model produces, and leaving them at the wrong values causes output variance that teams blame on the model for months before realizing the culprit was a configuration value they never touched.

This isn't an introductory explainer. If you're running LLMs in production — for extraction pipelines, code generation, summarization, or any output that feeds into real systems — these are the mechanics and tradeoffs you need to understand before you can tune intelligently.
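For reference, this is what pinning those values explicitly looks like with the OpenAI Python SDK; the model name and prompt are placeholders, and note that this particular API exposes temperature and top-p but not top-k (some other providers do):

```python
# Pin sampling parameters explicitly instead of inheriting defaults.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Extract the invoice total as JSON."}],
    temperature=0.0,  # deterministic-leaning: right for extraction, wrong for copy
    top_p=1.0,        # leave the nucleus open when temperature already constrains
    seed=42,          # best-effort reproducibility across identical calls
)
print(response.choices[0].message.content)
```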

The Accessibility Gap in AI Interfaces Nobody Is Shipping Around

· 8 min read
Tian Pan
Software Engineer

Most AI teams run accessibility audits on their landing pages. Almost none run them on the chat interface itself. The gap isn't laziness — it's that the tools don't exist. WCAG 2.2 has no success criterion for streaming content, no standard for non-deterministic outputs, and no guidance for token-by-token delivery. Which means every AI product streaming responses into a <div> right now is operating in a compliance grey zone while breaking the experience for a significant portion of its users.

This isn't a minor edge case. Blind and low-vision users report information-seeking as their top AI use case. Users with dyslexia, ADHD, and cognitive disabilities are actively trying to use AI tools to reduce reading load — and the default implementation pattern actively makes things worse for them.
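One mitigation pattern is to buffer the token stream into sentence-sized chunks before it reaches the page, so an aria-live region announces coherent units instead of fragments. A minimal server-side sketch in Python, with a stand-in for the real token source:

```python
# Buffer a token stream into complete sentences before flushing to the
# client. The fake stream below stands in for a real streaming API.
import re

def sentence_chunks(token_stream):
    """Yield complete sentences from a stream of text fragments."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush on sentence-ending punctuation followed by whitespace.
        while (match := re.search(r"[.!?]\s", buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

fake_stream = ["Hel", "lo. ", "This is a sec", "ond sentence. ", "And a third."]
for chunk in sentence_chunks(fake_stream):
    print(chunk)  # each chunk is what an aria-live region would announce
```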

AI Code Review at Scale: When Your Bot Creates More Work Than It Saves

· 10 min read
Tian Pan
Software Engineer

Most teams that adopt an AI code reviewer go through the same arc: initial excitement, a burst of flagged issues that feel useful, then a slow drift toward ignoring the bot entirely. Within a few months, engineers have developed a muscle memory for dismissing AI comments without reading them. The tool still runs. The comments still appear. Nobody acts on them anymore.

This is not a tooling problem. It is a measurement problem. Teams deploy AI code review without ever defining what "net positive" looks like — and without that baseline, alert fatigue wins.
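The baseline is measurable, and the single most telling number is the action rate: what fraction of the bot's comments lead to an actual change before merge. A minimal sketch with hypothetical records you would normally pull from your code host's API:

```python
# Action rate for AI review comments: the fraction that resulted in a
# change before merge. The records are hypothetical.

bot_comments = [
    {"pr": 101, "acted_on": True},
    {"pr": 101, "acted_on": False},
    {"pr": 102, "acted_on": False},
    {"pr": 103, "acted_on": False},
    {"pr": 104, "acted_on": True},
]

action_rate = sum(c["acted_on"] for c in bot_comments) / len(bot_comments)
print(f"Action rate: {action_rate:.0%} over {len(bot_comments)} comments")
# Track this weekly: a steadily declining action rate is the quantitative
# shadow of the dismiss-without-reading drift described above.
```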

The AI-Generated Code Maintenance Trap: What Teams Discover Six Months Too Late

· 11 min read
Tian Pan
Software Engineer

The pattern is almost universal across teams that adopted coding agents in 2023 and 2024. In month one, velocity doubles. In month three, management holds up the productivity metrics as evidence that AI investment is paying off. By month twelve, the engineering team can't explain half the codebase to new hires, refactoring has become prohibitively expensive, and engineers spend more time debugging AI-generated code than they would have spent writing it by hand.

This isn't a story about AI code being secretly bad. It's a story about how the quality characteristics of AI-generated code systematically defeat the organizational practices teams already had in place — and how those practices need to change before the debt compounds beyond recovery.

When Everyone Has an AI Coding Agent: The Team Dynamics Nobody Warned You About

· 10 min read
Tian Pan
Software Engineer

A team of twelve engineers adopts AI coding tools enthusiastically. Six months later, each engineer is merging nearly twice as many pull requests. The engineering manager celebrates. Then the on-call rotation starts paging. Debugging sessions last twice as long. Nobody can explain why a particular module was structured the way it was. The engineer who wrote it replies honestly: "I don't know — the AI generated most of it and it seemed fine."

This scenario is playing out at companies everywhere. The individual productivity story is real: developers finish tasks faster, write more tests, and clear backlogs more efficiently. The team-level story is more complicated, and most organizations aren't ready for it.

The Copyright Exposure in AI-Generated Content: A Risk Framework for Engineering Teams

· 10 min read
Tian Pan
Software Engineer

GPT-4 reproduced exact passages from books in 43% of test prompts when asked to continue a given excerpt. In one 2025 study, researchers extracted nearly an entire book near-verbatim from a production LLM — no jailbreaking required, just a persistent prefix-feeding loop. If your product generates content using a language model, the copyright exposure is not a future risk. It is happening in your users' sessions today, and you probably have no instrumentation to catch it.

This is not primarily a legal article. It's an engineering article about a legal problem that engineering decisions either create or contain. Lawyers will tell you what constitutes infringement. This framework tells you where your system leaks, how to measure it, and what actually reduces risk versus what only looks like it does.
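To give a flavor of where a system leaks and how to measure it: one crude but real instrumentation pattern flags generations that share long verbatim n-grams with a protected reference corpus. A minimal sketch with hypothetical text; production systems use scalable indexes such as suffix arrays or Bloom filters rather than set intersection:

```python
# Flag model outputs that share long word n-grams with a reference corpus.
# The corpus and output below are hypothetical.

def ngrams(text: str, n: int) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}

N = 8  # verbatim runs of 8+ words are a common heuristic threshold

reference_corpus = ("it was the best of times it was the worst of times "
                    "it was the age of wisdom")
model_output = ("as the saying goes it was the best of times "
                "it was the worst of times indeed")

overlap = ngrams(model_output, N) & ngrams(reference_corpus, N)
if overlap:
    print(f"Flagged: {len(overlap)} shared {N}-gram(s), e.g. {next(iter(overlap))!r}")
```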