75 posts tagged with "ai"

AI Coding Agents on Legacy Codebases: Why They Fail Where You Need Them Most

April 19, 2026 · 9 min read

Software Engineer

The teams that most urgently need AI coding help are usually not the ones building new greenfield services. They're the ones maintaining 500,000-line Rails monoliths from 2012, COBOL payment systems that have processed billions of transactions, or microservice meshes where the original architects left three acquisitions ago. These are the codebases where a single misplaced refactor can introduce a silent data corruption bug that surfaces three weeks later in production.

And this is exactly where current AI coding agents fail most spectacularly.

The frustrating part is that the failure mode is invisible until it isn't. The agent produces code that compiles, passes existing tests, and looks reasonable in review. The problem surfaces in staging, in the nightly batch job, or in the edge case that only one customer hits on a specific day of the month.

AI Content Provenance in Production: C2PA, Audit Trails, and the Compliance Deadline Engineers Are Missing

April 19, 2026 · 12 min read

Tian Pan

Software Engineer

When the EU AI Act's transparency obligations take effect on August 2, 2026, every system that generates synthetic content for EU-resident users will need to mark that content with machine-readable provenance. Most engineering teams building AI products are vaguely aware of this. Far fewer have actually stood up the infrastructure to comply — and of those that have, a substantial fraction have implemented only part of what regulators require.

The dominant technical response to "AI content provenance" has been to point at C2PA (the Coalition for Content Provenance and Authenticity standard) and declare the problem solved. C2PA is important. It's real, it's being adopted by Adobe, Google, OpenAI, Sony, and Samsung, and it's the closest thing to a universal standard the industry has. But a C2PA implementation alone will not satisfy EU AI Act Article 50. It won't survive your CDN. And it won't prevent bad actors from producing "trusted" provenance for manipulated content.

This post is about what AI content provenance actually requires in production — the technical stack, the failure modes, and the compliance gaps that catch teams off guard.

The AI Feature Retirement Playbook: How to Sunset What Users Barely Adopted

April 19, 2026 · 11 min read

Tian Pan

Software Engineer

Your team shipped an AI-powered summarization feature six months ago. Adoption plateaued at 8% of users. The model calls cost $4,000 a month. The one engineer who built it has moved to a different team. And now the model provider is raising prices.

Every instinct says: kill it. But killing an AI feature turns out to be significantly harder than killing any other kind of feature — and most teams find this out the hard way, mid-retirement, when the compliance questions start arriving and the power users revolt.

This is the playbook that should exist before you ship the feature, but is most useful right now, when you're staring at usage graphs that point unmistakably toward the exit.

The AI Taste Problem: Measuring Quality When There's No Ground Truth

April 19, 2026 · 11 min read

Tian Pan

Software Engineer

Here's a scenario that plays out at most AI product teams: someone on leadership asks whether the new copywriting model is better than the old one. The team runs their eval suite, accuracy numbers look good, and they ship. Three weeks later, the marketing team quietly goes back to using the old model because the new one "sounds off." The accuracy metrics were real. They just measured the wrong thing.

This is the AI taste problem. It shows up wherever your outputs are subjective — copywriting, design suggestions, creative content, tone adjustments, style recommendations. When there's no objective ground truth, traditional ML evaluation frameworks give you a false sense of confidence. And most teams don't have a systematic answer for what to do instead.

Board-Level AI Governance: The Five Decisions Only Executives Can Make

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

A major insurer's AI system was denying coverage claims. When humans reviewed those decisions, 90% were found to be wrong. The insurer's engineering team had built a performant model. Their MLOps team had solid deployment pipelines. Their data scientists had rigorous evaluation metrics. None of that mattered, because no one at the board level had ever answered the question: what is our acceptable failure rate for AI decisions that affect whether a sick person gets treated?

That gap — between functional technical systems and missing executive decisions — is where AI governance most often breaks down in practice. The result is organizations that are simultaneously running AI in production and exposed to liability they've never formally acknowledged.

The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

In late 2024, OpenAI's o3 system scored 75.7% on the ARC-AGI benchmark — a test specifically designed to resist optimization. The AI research community celebrated. Then practitioners looked closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times more resources than the baseline. It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough.

This is the evaluation paradox. The moment a benchmark becomes the thing teams optimize for, it stops measuring what it was designed to measure. Goodhart's Law — "when a measure becomes a target, it ceases to be a good measure" — was articulated in 1970s economic policy, but it describes AI benchmarking with eerie precision.

The Cold Start Problem in AI Personalization: Being Useful Before You Have Data

April 18, 2026 · 11 min read

Tian Pan

Software Engineer

Most personalization systems are built around a flywheel: users interact, you learn their preferences, you show better recommendations, they interact more. The flywheel spins faster as data accumulates. The problem is the flywheel needs velocity to generate lift — and a new user has none.

This is the cold start problem. And it's more dangerous than most teams recognize when they first ship personalization. A new user arrives with no history, no signal, and often a skeptical prior: "AI doesn't know me." You have roughly 5–15 minutes to prove otherwise before they form an opinion that determines whether they'll stay long enough to generate the data that would let you actually help them. Up to 75% of new users abandon products in the first week if that window goes badly.

The cold start problem isn't a data problem. It's an initialization problem. The engineering question is: what do you inject in place of history?

Why '92% Accurate' Is Almost Always a Lie

April 18, 2026 · 8 min read

Tian Pan

Software Engineer

You launch an AI feature. The model gets 92% accuracy on your holdout set. You present this to the VP of Product, the legal team, and the head of customer success. Everyone nods. The feature ships.

Three months later, a customer segment you didn't specifically test is experiencing a 40% error rate. Legal is asking questions. Customer success is fielding escalations. The VP of Product wants to know why no one flagged this.

The 92% figure was technically correct. It was also nearly useless as a decision-making input — because headline accuracy collapses exactly the information that matters most.

Sampling Parameters in Production: The Tuning Decisions Nobody Explains

April 18, 2026 · 11 min read

Tian Pan

Software Engineer

Most engineers treat LLM quality regressions as a prompt engineering problem or a model capability problem. They rewrite system prompts, try a newer model, or add few-shot examples. They rarely check the three numbers sitting silently at the top of every API call: temperature, top-p, and top-k. But those defaults are shape-shifting every response your model produces, and tuning them wrong causes output variance that teams blame on the model for months before realizing the culprit was a configuration value they never touched.

This isn't an introductory explainer. If you're running LLMs in production—for extraction pipelines, code generation, summarization, or any output that feeds into real systems—these are the mechanics and tradeoffs you need to understand before you can tune intelligently.

The Accessibility Gap in AI Interfaces Nobody Is Shipping Around

April 17, 2026 · 8 min read

Tian Pan

Software Engineer

Most AI teams run accessibility audits on their landing pages. Almost none run them on the chat interface itself. The gap isn't laziness — it's that the tools don't exist. WCAG 2.2 has no success criterion for streaming content, no standard for non-deterministic outputs, and no guidance for token-by-token delivery. Which means every AI product streaming responses into a <div> right now is operating in a compliance grey zone while breaking the experience for a significant portion of its users.

This isn't a minor edge case. Blind and low-vision users report information-seeking as their top AI use case. Users with dyslexia, ADHD, and cognitive disabilities are actively trying to use AI tools to reduce reading load — and the default implementation pattern actively makes things worse for them.

AI Code Review at Scale: When Your Bot Creates More Work Than It Saves

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams that adopt an AI code reviewer go through the same arc: initial excitement, a burst of flagged issues that feel useful, then a slow drift toward ignoring the bot entirely. Within a few months, engineers have developed a muscle memory for dismissing AI comments without reading them. The tool still runs. The comments still appear. Nobody acts on them anymore.

This is not a tooling problem. It is a measurement problem. Teams deploy AI code review without ever defining what "net positive" looks like — and without that baseline, alert fatigue wins.

The AI-Generated Code Maintenance Trap: What Teams Discover Six Months Too Late

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

The pattern is almost universal across teams that adopted coding agents in 2023 and 2024. In month one, velocity doubles. In month three, management holds up the productivity metrics as evidence that AI investment is paying off. By month twelve, the engineering team can't explain half the codebase to new hires, refactoring has become prohibitively expensive, and engineers spend more time debugging AI-generated code than they would have spent writing it by hand.

This isn't a story about AI code being secretly bad. It's a story about how the quality characteristics of AI-generated code systematically defeat the organizational practices teams already had in place — and how those practices need to change before the debt compounds beyond recovery.

About Tian Pan