678 posts tagged with "ai-engineering"

The Accountability Transfer Problem: Why AI Gets Blamed for Decisions It Was Never Designed to Make Alone

· 10 min read
Tian Pan
Software Engineer

A major health insurer deployed an AI tool to evaluate post-acute care claims. The system had an error rate above 90% — meaning nine of every ten appealed denials were eventually overturned by human reviewers. Yet those denials weren't proactively corrected. Patients had to appeal, one by one. When the lawsuits came, the company's response was to point at the AI.

The AI denied nothing. Humans approved those denials at scale, embedded in a workflow they designed, in a system they chose to deploy. But "the AI decided" is a sentence that distributes blame in a direction that conveniently absolves the organization, the executives who approved the rollout, and the reviewers who signed off on each case.

This is the accountability transfer problem — and it's not a future risk. It's already endemic in production AI systems.

Why AI Coding Tools Amplify Juniors and Plateau Seniors

· 9 min read
Tian Pan
Software Engineer

Ask any VP of Engineering whether AI coding tools are a productivity win and they'll say yes. Put the same question to a staff engineer who lives in a ten-year-old codebase with six undocumented data models and a deployment process held together with shell scripts, and you'll get a different answer.

The productivity story for AI coding tools is bifurcated in a way that most organizations haven't fully processed. Junior engineers are seeing 27–39% gains in completed weekly tasks. Experienced developers are, in a controlled study of real-world issues, taking 19% longer to finish tasks when they have AI assistance than when they don't. Both results are consistent with how these tools work — and they lead to a management trap that's playing out quietly on engineering teams right now.

AI Fallback Design Is an Architecture Problem, Not an Afterthought

· 9 min read
Tian Pan
Software Engineer

When McDonald's pulled the plug on its AI drive-thru after three years of operation, the failure wasn't that the model was bad at understanding orders. The failure was architectural: there was no clear escalation path to a human cashier, no confidence threshold that would trigger a retry, and no defined behavior for the system when it was confused. The AI just kept trying. Customers kept getting frustrated. The happy path was well designed. Everything else wasn't.

That pattern repeats across almost every failed AI deployment. The model works in demos. It fails in production. And the post-mortem reveals the same root cause: fallback design was never part of the architecture. It was something someone planned to add later.
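As a rough illustration of the three missing pieces named above (a confidence threshold, a bounded retry, and an escalation path), here is a minimal Python sketch. The names `handle_order`, `Outcome`, and `recognize`, and the default threshold of 0.8, are assumptions made for this example; they do not describe any real drive-thru system.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Outcome:
    action: str    # "accept" or "escalate"
    text: str      # recognized order, or a handoff message
    attempts: int  # how many recognition attempts were made


def handle_order(
    recognize: Callable[[str], Tuple[str, float]],  # hypothetical model call
    utterance: str,
    threshold: float = 0.8,  # assumed value; tune per deployment
    max_retries: int = 1,
) -> Outcome:
    """Accept a confident result, retry a bounded number of times, then hand off."""
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        text, confidence = recognize(utterance)
        if confidence >= threshold:
            return Outcome("accept", text, attempts)
    # Defined behavior when the system stays confused: stop trying, escalate.
    return Outcome("escalate", "handing off to a human cashier", attempts)
```

The point of the sketch is that the escalation branch is part of the function's signature and return type from the start, rather than something bolted on after the happy path ships.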

AI Documentation Debt: How Stochastic Systems Break Your Technical Knowledge Base

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped cleanly. The documentation looked good: input schema, expected outputs, a worked example. Three months later, a model update arrives silently. The outputs shift. Your docs are wrong but nobody knows it yet — because they still look right.

This is the core of AI documentation debt, and it compounds faster than any other kind of technical debt because the failure is invisible until a user finds it.

The Coverage Illusion: Why AI-Generated Tests Inherit Your Code's Blind Spots

· 9 min read
Tian Pan
Software Engineer

An engineer on a small team spent three months delegating test generation to AI. Code coverage jumped from 47% to 72% to 98%. Every PR came back green. Then production broke. A race condition in user registration allowed duplicate emails due to database replication lag. A promo code endpoint returned null instead of zero when a code was invalid, and the payment calculation silently broke for 4,700 customers. The total damage: $47,000 in refunds and 66 hours of engineering time. The tests hadn't missed a few edge cases. The tests had covered the code that was written, not the system that was deployed.

This is the coverage illusion. And it's getting easier to fall into as AI-assisted development becomes the default.
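To make the promo-code failure concrete, the following Python sketch shows a test that pins down the contract of the invalid-code path (a numeric zero, never null) instead of merely executing the line. The function names and the promo table are hypothetical and are not taken from the incident described above.

```python
from typing import Optional

# Hypothetical promo table, for illustration only.
PROMO_CODES = {"SPRING10": 10.0}


def promo_discount(code: Optional[str]) -> float:
    """Return the discount for a code; unknown or missing codes yield 0.0, never None."""
    if code is None:
        return 0.0
    return PROMO_CODES.get(code, 0.0)


def order_total(subtotal: float, code: Optional[str]) -> float:
    """Apply the discount and clamp the total at zero."""
    return max(subtotal - promo_discount(code), 0.0)


def test_invalid_code_contract() -> None:
    # Assert the contract downstream code relies on, not just that the line ran.
    assert promo_discount("NOT-A-CODE") == 0.0
    assert order_total(50.0, "NOT-A-CODE") == 50.0
    assert order_total(5.0, "SPRING10") == 0.0
```

A generated test that only calls `promo_discount` with a valid code would still raise line coverage; it is the assertion on the invalid path that encodes what the rest of the system expects.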

Chain-of-Thought Has Two Failure Modes Nobody Talks About

· 9 min read
Tian Pan
Software Engineer

Chain-of-thought prompting was supposed to solve the black-box problem with language models. Show the work, verify the steps, understand how the model reached its conclusion. The idea is intuitively right — and that's the problem. It feels so obviously correct that practitioners deploy visible reasoning chains into production systems without asking a harder question: what if showing the work makes things worse?

Recent research from 2024–2026 has started to systematically document what that "worse" looks like. Visible reasoning chains cause two distinct failure modes that often go unnoticed until something breaks in production. The first is a user-side problem: intermediate reasoning steps anchor users to potentially wrong conclusions before they've seen the final answer. The second is a systems problem: reasoning traces create the illusion of an audit trail while being fundamentally unreliable as explanations of how the model actually decided.

The Co-Pilot Trap: Why Full Autopilot Ships Faster but Fails Harder

· 9 min read
Tian Pan
Software Engineer

There's a pattern in how AI features die in production: they start as copilots and get promoted to autopilots. The promotion happens for obvious reasons (cost reduction, scale, reduced headcount), and the reasoning sounds solid at demo time. Then the edge cases accumulate. A user-facing recommendation becomes a user-facing decision. A suggestion becomes an action. And when the first systematic failure lands, the engineering team discovers that the error tolerance assumptions baked into the original design were never re-evaluated.

This is the co-pilot trap: building an AI feature for one tier of the automation spectrum, then promoting it to a higher tier without rebuilding the failure model that tier requires.

The Copy-Paste Contagion: How AI-Assisted Development Spreads Architectural Anti-Patterns

· 11 min read
Tian Pan
Software Engineer

Your codebase has the same authentication logic implemented three different ways, and nobody on the team wrote any of them. A quick git blame shows the same engineer on all three files, but ask that engineer and they'll tell you they just accepted what the AI suggested and it "looked right." The anti-pattern didn't spread because someone was lazy. It spread because an AI model with no memory of your existing auth module generated plausible-looking implementations every time someone opened a new file and asked for help.

This is the copy-paste contagion, and it's structurally different from the classic copy-paste problem you already know how to fight.

Dynamic System Prompt Assembly: Composable AI Behavior at Request Time

· 10 min read
Tian Pan
Software Engineer

Most teams start with a single, monolithic system prompt. It works fine in demos. Then the product grows: you add a power user tier, a compliance mode for enterprise customers, a new tool the model can call, and a feature-flag experiment your growth team wants to A/B test. You add all of that to the same prompt. Six months in, you have 4,000 words of instructions that nobody fully understands, behavior that changes unpredictably when you edit one section, and a debugging process that amounts to "change something and see what happens."

The answer most teams reach for is composable, dynamically assembled system prompts — building the prompt from modular components at request time rather than maintaining a static text file. It's a sound architectural instinct, but the implementation surface is larger than it looks. Composable prompts introduce a new class of failure modes that static prompts simply don't have.
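One possible shape for request-time assembly is sketched below in Python. `PromptModule`, `assemble_prompt`, and the context keys `tier` and `tools` are illustrative assumptions, not an established API. The sketch orders modules deterministically and returns the list of module names alongside the prompt, so that a given response can later be traced to the exact components that produced it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PromptModule:
    name: str
    text: str
    priority: int  # lower values appear earlier in the prompt
    applies: Callable[[Dict[str, object]], bool]  # should this module be included?


def assemble_prompt(
    modules: List[PromptModule], ctx: Dict[str, object]
) -> Tuple[str, List[str]]:
    """Select applicable modules and join them in a stable order."""
    selected = sorted(
        (m for m in modules if m.applies(ctx)),
        key=lambda m: (m.priority, m.name),  # name breaks ties deterministically
    )
    names = [m.name for m in selected]
    prompt = "\n\n".join(m.text for m in selected)
    return prompt, names


# Hypothetical modules for the scenarios mentioned above.
MODULES = [
    PromptModule("base", "You are a helpful assistant.", 0, lambda ctx: True),
    PromptModule("compliance", "Follow compliance rules.", 10,
                 lambda ctx: ctx.get("tier") == "enterprise"),
    PromptModule("search_tool", "You may call the search tool.", 20,
                 lambda ctx: "search" in ctx.get("tools", [])),
]
```

Logging the returned names with each request is a cheap way to replace "change something and see what happens" with a record of which combination was actually in effect.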

The Expertise Cliff: Why AI Coding Agents Fail in Mature Codebases

· 8 min read
Tian Pan
Software Engineer

A 2025 controlled trial gave experienced developers access to AI coding tools and measured whether they got faster. The developers predicted a 24% speedup. After completing the study, they reported feeling roughly 20% faster. Objective measurement showed they were actually 19% slower.

This isn't a story about AI hype. It's a story about tacit knowledge — the undocumented "why" that lives inside every mature codebase and cannot be recovered by reading the code alone. AI agents are remarkably productive in greenfield systems precisely because there is little tacit knowledge to violate. They degrade in mature codebases for exactly the same reason.

The Feedback Provenance Gap: Why Your Training Signal Might Not Be What You Collected

· 8 min read
Tian Pan
Software Engineer

Most teams have excellent instrumentation on the feedback capture side. Thumbs-down clicks are logged. Star ratings flow into dashboards. Human annotation jobs write every preference pair to a table. The intake is clean, timestamped, and queryable.

What happens between that capture and the next model update is, for most teams, a black box.

The data gets filtered. Some annotations get weighted higher than others. Rare categories get upsampled. Near-duplicates get dropped. A prompt template change makes last month's labels inconsistent with this month's, but the merge happens anyway. By the time the signal reaches a reward model or fine-tuning job, it has passed through six transformation steps with no audit trail, no version pinning, and no way to trace a degraded model weight back to a specific corruption point in the pipeline.

This is the feedback provenance gap: teams know where feedback enters the system, but not what it becomes before it shapes model behavior.
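A minimal way to give such a pipeline an audit trail is to record a row count and a content hash after every transformation. The Python sketch below assumes feedback records are JSON-serializable dictionaries; `fingerprint`, `run_with_provenance`, and the step names are hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple

Records = List[Dict[str, object]]
Step = Tuple[str, Callable[[Records], Records]]


def fingerprint(records: Records) -> str:
    """Short content hash of the dataset in its current form."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_with_provenance(records: Records, steps: List[Step]) -> Tuple[Records, List[dict]]:
    """Apply each named step and log what the data looked like afterwards."""
    log = [{"step": "capture", "rows": len(records), "hash": fingerprint(records)}]
    for name, transform in steps:
        records = transform(records)
        log.append({"step": name, "rows": len(records), "hash": fingerprint(records)})
    return records, log


def drop_duplicates(records: Records) -> Records:
    """Example step: drop exact duplicates while preserving order."""
    seen, unique = set(), []
    for record in records:
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

With a log like this stored next to each training run, a regression can at least be narrowed to the step where the row count or hash diverged from the previous run.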

Graph Reasoning Gaps in LLMs: Scaffolding Relational Tasks That Fool Sequence-Trained Models

· 9 min read
Tian Pan
Software Engineer

A common mistake in AI system design is asking a language model to reason over a graph as if it were reading a document. The model will generate a confident, fluent answer. The answer will be wrong in a way that looks right — it will name real nodes, reference plausible paths, and describe relationships that almost exist. Then you discover your org-chart traversal hallucinates skip-level managers, your dependency resolution misses cycles in graphs over ten nodes, and your three-hop knowledge graph query has a 60% error rate at step two.

This is not a prompt quality problem. It is an architecture problem, and you can diagnose it before writing a single prompt.
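One common form of scaffolding, offered here as an assumption rather than as the approach this post goes on to describe, is to run the graph operation in ordinary code and give the model only the result to explain. The Python sketch below detects a dependency cycle with a depth-first search; `find_cycle` and the adjacency-list format are choices made for this example.

```python
from typing import Dict, List, Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a node path (first node repeated at the end), or None."""
    color: Dict[str, int] = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

The traversal is exact at any graph size that fits in memory, so the language model's job shrinks to describing a verified cycle instead of discovering one.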