678 posts tagged with "ai-engineering"

The Automation Cliff Edge: When Partial AI Automation Is Worse Than None

· 11 min read
Tian Pan
Software Engineer

The first time a team automates 70% of a manual process and ships worse outcomes than before, the diagnosis almost always starts in the wrong place. Engineers look at the automated portion: maybe the model accuracy is off, maybe the pipeline has a bug. What they rarely examine is whether the automation itself—by existing—made the remaining 30% of human work structurally impossible to do well.

This is the automation cliff edge. Not a failure of the automated component, but a failure of the seam between automated and manual.

Choosing Eval Metrics Is a Product Decision, Not a Technical One

· 10 min read
Tian Pan
Software Engineer

A team building an LLM-based literature screening tool celebrated 96% accuracy on their test set. Their model was, by any standard engineering metric, performing excellently. There was one problem: it found zero true positives. It had learned to classify everything as irrelevant and still scored near-perfect accuracy, because relevant papers were rare in the dataset. The failure wasn't in the model — it was in the metric.

This failure mode is not exotic. It plays out silently across AI teams every week, in codebases where engineers select evaluation metrics the way they'd select a sorting algorithm: as a technical choice with a right answer. The framing is wrong. Metric selection is a product decision. It encodes which failure modes you're willing to tolerate, which users you're optimizing for, and what "good" actually means for your specific context. Getting this wrong produces eval suites that look rigorous and measure the wrong thing.
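To make the arithmetic concrete, here is a minimal sketch of how a classifier can score 96% accuracy while finding nothing. The dataset size and the 4% prevalence are assumptions chosen to reproduce the headline number; the excerpt does not give the real figures.

```python
# Hypothetical screening set: 1,000 papers, of which 40 (4%) are relevant.
labels = [1] * 40 + [0] * 960

# A degenerate classifier that marks every paper as irrelevant.
predictions = [0] * len(labels)


def confusion(labels, predictions):
    """Return (tp, fp, fn, tn) for binary labels where 1 means relevant."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion(labels, predictions)

accuracy = (tp + tn) / len(labels)
# Recall is the share of relevant papers the tool actually surfaced.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} true_positives={tp}")
```

Accuracy rewards the 960 correct rejections and says nothing about the 40 missed papers. Recall exposes the failure immediately, and whether recall is the right headline metric is exactly the product decision in question.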

When AI Sounds Right but Isn't: LLM Confabulation in Technical and Scientific Domains

· 9 min read
Tian Pan
Software Engineer

The insidious thing about LLM confabulation in technical domains isn't that the model produces obviously wrong answers. It's that the model produces beautifully structured, confidently stated, technically plausible answers that are subtly wrong in ways that only domain experts catch — and often only after the damage is done.

A Monte Carlo physics simulation that initializes correctly but resamples particle positions from scratch at each step rather than making incremental updates. A chemical formula that follows the right naming conventions but has an incorrect oxidation state. An engineering specification that cites the right standard, references the right units, and has exactly the wrong load coefficient. Each output looks right. Each sounds authoritative. Each is wrong in ways that won't surface until someone runs the experiment, stress-tests the component, or critically reads the derivation.

The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features

· 8 min read
Tian Pan
Software Engineer

A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.

Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.
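One way to see why the textbook calculation breaks is to look at what extra output variance does to the required sample size. The sketch below uses the standard two-sample z-test approximation and assumes, purely for illustration, that generation noise is independent of user-to-user variation and of equal size. All numbers are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_arm(sigma, delta, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided, two-sample z-test.

    sigma: standard deviation of the metric within one arm
    delta: smallest difference in means worth detecting
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Deterministic feature: only user-to-user variation contributes.
user_sigma = 1.0
deterministic = samples_per_arm(sigma=user_sigma, delta=0.1)

# Assumed LLM feature: the same input can yield different outputs, which
# adds an independent variance component on top of user variation.
generation_sigma = 1.0
total_sigma = math.sqrt(user_sigma ** 2 + generation_sigma ** 2)
stochastic = samples_per_arm(sigma=total_sigma, delta=0.1)

print(deterministic, stochastic)
```

Under these assumptions the required sample roughly doubles. A test sized with the deterministic formula would be underpowered, and that is before asking whether the metric measures output quality at all.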

When Accuracy Becomes a Liability: How Users Build Workflows Around Your AI's Failure Modes

· 10 min read
Tian Pan
Software Engineer

A team ships an AI feature at 70% accuracy. Eighteen months pass. Users complain at first, then adapt and settle in. They learn which prompt phrases avoid the edge cases. They know to double-check outputs involving dates. They build a verification step into their workflow because the AI sometimes hallucinates specific field names. Then the team ships a new model. Accuracy jumps to 85%. Support tickets spike. The most frustrated users are the ones who were using the feature the most.

This is the accuracy-as-product-contract problem, and most AI teams discover it the hard way.

Agent Blast Radius: Bounding Worst-Case Impact Before Your Agent Misfires in Production

· 10 min read
Tian Pan
Software Engineer

Nine seconds. That's how long it took a Cursor AI agent to delete an entire production database, including all volume-level backups, while attempting to fix a credential mismatch. The agent had deletion permissions it never needed for any legitimate task. The blast radius was total because nobody had bounded it before deployment.

This isn't a story about model failure. It's a story about permission scope. The model did exactly what it calculated it should do. The engineering team just never asked: what's the worst this agent could do if it reasons incorrectly?

That question — answered systematically before deployment — is blast radius analysis.
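A minimal sketch of what answering that question could look like in code: an explicit allow-list per agent, plus a check that lists which destructive actions the agent could reach if it reasoned incorrectly. The action names and the `ToolPolicy` type are hypothetical and do not come from any particular framework.

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Raised when an agent requests an action outside its allow-list."""


@dataclass(frozen=True)
class ToolPolicy:
    allowed_actions: frozenset


# Hypothetical catalogue of actions that cannot be undone.
DESTRUCTIVE_ACTIONS = frozenset({"db.delete_database", "backup.delete_volume"})


def authorize(policy, action):
    """Allow the action only if it was explicitly granted."""
    if action not in policy.allowed_actions:
        raise PermissionDenied(f"action not granted: {action}")


def blast_radius(policy):
    """List the destructive actions this policy would permit."""
    return sorted(policy.allowed_actions & DESTRUCTIVE_ACTIONS)


# A credential-fixing agent needs to read config and rotate a credential.
scoped = ToolPolicy(frozenset({"config.read", "db.rotate_credential"}))

# The same agent running with everything the service account can do.
unscoped = ToolPolicy(
    frozenset({"config.read", "db.rotate_credential"}) | DESTRUCTIVE_ACTIONS
)

print(blast_radius(scoped))    # nothing destructive is reachable
print(blast_radius(unscoped))  # both destructive actions are reachable
```

Running `blast_radius` in a pre-deployment check turns the worst case into a reviewable list rather than something discovered in production.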

Agent Memory Contamination: How One Bad Tool Response Poisons a Whole Session

· 10 min read
Tian Pan
Software Engineer

Your agent completes 80% of a multi-step research task correctly, then confidently delivers a conclusion that's completely wrong. You trace back through the logs and find the culprit at step three: a tool call returned stale data, the agent integrated that data as fact, and every subsequent reasoning step built on that poisoned premise. By the end of the session, the agent was correct about everything except the thing that mattered.

This is agent memory contamination — and it's one of the most insidious reliability failures in production agentic systems. Unlike a crash or timeout, it produces a confident wrong answer. Observability tooling records a successful run. The user walks away with bad information.
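One possible guard, sketched under assumptions: tag every tool result with the time its underlying data was last refreshed, and refuse to admit stale results into the agent's working memory as facts. The `ToolResult` type, the field names, and the 24-hour threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ToolResult:
    tool: str
    payload: str
    data_as_of: datetime  # when the underlying data was last refreshed


def admit_to_memory(result, now, max_age):
    """Return True only if the result is fresh enough to treat as fact."""
    return now - result.data_as_of <= max_age


now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(hours=24)

fresh = ToolResult("price_lookup", "42.10", now - timedelta(hours=2))
stale = ToolResult("price_lookup", "37.80", now - timedelta(days=9))

memory = []
for result in (fresh, stale):
    if admit_to_memory(result, now, max_age):
        memory.append(result)
    # A rejected result would be re-fetched or flagged as unverified
    # rather than silently folded into later reasoning steps.

print([r.payload for r in memory])
```

A freshness check only catches one contamination source, but it shows the general shape: validate at the boundary where tool output becomes premise, because nothing downstream will question it.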

Agentic Systems Are Distributed Systems: Apply Microservices Lessons Before You Learn Them the Hard Way

· 12 min read
Tian Pan
Software Engineer

The failure rates for multi-agent AI systems in production are embarrassing. A landmark study analyzing over 1,600 execution traces across seven popular frameworks found failure rates ranging from 41% to 87%. Carnegie Mellon researchers put leading agent systems at 30–35% task completion on multi-step benchmarks. Gartner is predicting 40% of agentic AI projects will be cancelled by the end of 2027.

Here is the uncomfortable truth: these aren't AI problems. They're distributed systems problems that engineers already solved between 2010 and 2018, documented exhaustively in blog posts, conference talks, and eventually in Martin Kleppmann's Designing Data-Intensive Applications. The teams that are shipping reliable agent systems today aren't doing anything magical — they're applying circuit breakers, bulkheads, event sourcing, and idempotency keys. The teams that are failing are treating agents as a new paradigm when they're a new deployment target for old patterns.
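As an illustration of one of those patterns, here is a minimal in-memory sketch of idempotency keys applied to agent tool calls. The `IdempotentExecutor` class and the key format are hypothetical, and a production version would need a durable store rather than a dictionary.

```python
class IdempotentExecutor:
    """Run each side-effecting action at most once per idempotency key."""

    def __init__(self):
        self._results = {}

    def run(self, key, action):
        # A retry with the same key returns the stored result instead of
        # repeating the side effect.
        if key in self._results:
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


emails_sent = []


def send_email():
    emails_sent.append("invoice-1001")
    return "sent"


executor = IdempotentExecutor()

# The agent retries the same step three times, for example after timeouts.
outcomes = [executor.run("task-7:step-3:send-invoice", send_email) for _ in range(3)]

print(outcomes, len(emails_sent))
```

Deriving the key from the task and step, rather than from the attempt, is what makes agent retries safe: the third attempt sees the first attempt's result.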

Why AI Engineering Training Programs Are Perpetually Behind the Models

· 9 min read
Tian Pan
Software Engineer

In early 2023, a flood of corporate AI training programs launched with the same selling point: we will teach your engineers prompt engineering. By the time most of them finished their first cohort, the specific techniques they were teaching had already been automated away by the models themselves. By 2025, the role of "prompt engineer" — briefly advertised at $200,000 salaries — was effectively obsolete. The training programs are still running.

This is the AI curriculum trap. It is not a problem of effort or budget. Organizations invest heavily in structured AI training, certification programs, and hiring rubrics built around tool proficiency. But the tools change faster than any curriculum can track, and the result is a permanent, structural lag: training programs are always teaching the AI engineering of 18 months ago.

The Compliance Attestation Gap Nobody Talks About in AI-Assisted Development

· 9 min read
Tian Pan
Software Engineer

Your engineers are shipping AI-generated code every day. Your auditors are reviewing change management controls designed for a world where every line of code was written by the person who approved it. Both facts are true simultaneously, and if you're in a regulated industry, that gap is a liability you probably haven't fully priced.

The compliance certification problem with AI-generated code is not a vendor problem: your AI coding tool's SOC 2 report doesn't cover your change management controls. It's a process attestation problem. The fundamental assumption underneath SOC 2 CC8.1, the HIPAA Security Rule's change controls, and PCI DSS Requirement 6 is that the person who approved the code change understood it. That assumption no longer holds.

AI Model APIs Are Software Dependencies You Can't See, Pin, or Track

· 9 min read
Tian Pan
Software Engineer

When OpenAI silently pulled a GPT-4o update in April 2025 after engineers discovered the model had become wildly sycophantic — validating bad ideas, agreeing with factually wrong claims, and generally becoming useless for any task requiring honest feedback — most affected teams found out through Reddit and Hacker News. Their package.json showed nothing changed. Their lockfile was identical. Their deployment pipeline flagged zero dependency updates. From every standard software-supply-chain perspective, nothing happened.

That's the dependency you can't see: the foundation model behind your application.
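A partial mitigation can be sketched as a lockfile-style record for the model itself. This assumes your provider reports a versioned model identifier with each response, which not every provider or alias does. The file format, field names, and model identifiers below are hypothetical.

```python
import json

# Hypothetical "model lockfile" committed next to the dependency lockfile.
MODEL_LOCK = json.loads("""
{
  "model": "example-model-2025-01-15",
  "validated_on": "2025-02-01"
}
""")


def detect_model_drift(lock, reported_model):
    """Return a warning string if the serving model differs from the pin."""
    if reported_model != lock["model"]:
        return (
            f"model drift: validated against {lock['model']}, "
            f"but responses now report {reported_model}"
        )
    return None


# Assumed values read from the provider's response metadata.
unchanged = detect_model_drift(MODEL_LOCK, "example-model-2025-01-15")
changed = detect_model_drift(MODEL_LOCK, "example-model-2025-04-25")

print(unchanged)
print(changed)
```

An identifier match does not prove behavior is unchanged, so a check like this complements a regression eval suite rather than replacing it. What it does provide is a diff that shows up in the same place engineers already look for dependency changes.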

AI-Native API Design: Building Backends That Agents Can Actually Use

· 10 min read
Tian Pan
Software Engineer

Your REST API works fine. Documentation is thorough. Error codes are consistent. Every human-authored client you've ever tested handles it well. Then your team integrates an AI agent and within an hour it's generated 2,000 failed requests by retrying variations of an endpoint that doesn't exist — bulk_search_users, search_all_users, bulk_user_search — each attempt triggering real downstream processing.

This isn't a prompt engineering failure. It's an API design failure.

REST APIs were built for clients that parse documentation, respect contracts, and call exactly what's specified. AI agents are different: they reason about what an endpoint probably does based on names and descriptions, retry without tracking state, and treat error messages as instructions rather than diagnostic codes. Designing an API for an agentic caller requires rethinking assumptions that most backend engineers have never had to question.
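If agents treat error messages as instructions, the error for an unknown operation can be written as one. The sketch below returns a structured, non-retryable error that names the valid operations and suggests the closest match, using `difflib.get_close_matches` from the Python standard library. The operation names and the response shape are assumptions for illustration.

```python
import difflib

# Hypothetical set of operations the API actually exposes.
VALID_OPERATIONS = ["search_users", "get_user", "list_users"]


def unknown_operation_error(requested):
    """Build an error body that tells an agent what to do next."""
    suggestions = difflib.get_close_matches(requested, VALID_OPERATIONS, n=3)
    return {
        "status": 404,
        "error": "unknown_operation",
        "requested": requested,
        # Explicitly tell the caller that guessing more names will not help.
        "retryable": False,
        "did_you_mean": suggestions,
        "valid_operations": VALID_OPERATIONS,
        "instruction": "Call one of valid_operations. Do not retry with other names.",
    }


response = unknown_operation_error("bulk_search_users")
print(response["did_you_mean"])
```

The important design choice is rejecting unknown operations before any downstream processing runs, so a guessing agent costs a lookup rather than 2,000 real requests.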