
61 posts tagged with "ai"


AI Feature Payback: The ROI Model Your Finance Team Won't Fight You On

10 min read
Tian Pan
Software Engineer

Every engineering team shipping AI features eventually hits the same wall: finance wants a spreadsheet that justifies the spend, and the spreadsheet you built doesn't actually work.

The problem isn't that AI features lack ROI. The problem is that AI economics break every assumption the standard ROI model was built on — fixed capital, linear cost curves, predictable timelines. Teams that treat AI spending like SaaS licensing get numbers that either look deceptively good before launch or collapse six months into production. The nearly ten-fold gap between measured AI initiatives (55% ROI) and ad-hoc deployments (5.9% ROI) comes almost entirely from whether teams got the measurement model right before they shipped.
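To make the broken-assumption point concrete, here is a toy payback sketch; every number and name in it is illustrative, not from the post. Unlike a flat SaaS license, per-token inference cost scales with adoption, so an ROI projected at pilot volume quietly degrades at production volume unless the measurement model accounts for usage growth.

```python
# Toy payback model. All figures are hypothetical, for illustration only.
def monthly_cost(requests: int, tokens_per_request: int,
                 cost_per_1k_tokens: float, fixed_overhead: float) -> float:
    """AI feature cost scales with usage, unlike a flat SaaS license."""
    inference = requests * tokens_per_request / 1000 * cost_per_1k_tokens
    return inference + fixed_overhead

def simple_roi(monthly_value: float, cost: float) -> float:
    """Naive single-period ROI: (value - cost) / cost."""
    return (monthly_value - cost) / cost

# A flat-license mental model assumes cost stays fixed as adoption grows.
# With per-token pricing, the same feature gets less profitable the more it is used
# (holding delivered value constant, which is itself a measurement question).
for requests in (10_000, 100_000, 1_000_000):
    cost = monthly_cost(requests, tokens_per_request=2_000,
                        cost_per_1k_tokens=0.01, fixed_overhead=5_000)
    print(f"{requests:>9,} req/mo  cost=${cost:>9,.0f}  ROI={simple_roi(50_000, cost):>5.0%}")
```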

AI as the Permanent Intern: The Role-Task Gap in Enterprise Workflows

9 min read
Tian Pan
Software Engineer

There's a pattern that appears in nearly every enterprise AI deployment: the tool performs brilliantly in the demo, ships to production, and then quietly stalls at 70–80% of its potential. Teams attribute the stall to model quality, context window limits, or retrieval failures. Most of the time, that diagnosis is wrong. The actual problem is that they're asking the AI to play a role it structurally cannot occupy — not yet, possibly not ever in its current form.

The gap between "AI can do this task" and "AI can play this role" is the most expensive misunderstanding in enterprise AI.

Building Trust Recovery Flows: What Happens After Your AI Makes a Visible Mistake

9 min read
Tian Pan
Software Engineer

When Google's AI Overview told users to add glue to pizza sauce and eat rocks for digestive health, it didn't just embarrass a product team — it exposed a systemic gap in how we think about AI reliability. The failure wasn't just that the model was wrong. The failure was that the model was confidently wrong, in a high-visibility context, with no recovery path for the users it misled.

Trust in AI systems doesn't erode gradually. Research shows it follows a cliff-like collapse pattern: a single noticeable error can produce a disproportionate trust decline with measurable effect sizes. Only 29% of developers say they trust AI tools — an 11-point drop from the previous year, even as adoption climbs to 84%. We're building systems that people use but don't trust. That gap matters when your product ships agentic features that act on behalf of users.

This post is about what engineers and product builders should do after the mistake happens — not just how to prevent it.

The Context Limit Is a UX Problem: Why Silent Truncation Erodes User Trust

8 min read
Tian Pan
Software Engineer

A user spends an hour in a long coding session with an AI assistant. They've established conventions, shared codebase context, described a multi-file refactor in detail. Then, about 40 messages in, the AI starts giving advice that ignores everything it "knows." It recommends an approach they already rejected twenty minutes ago. When pressed, it seems confused.

No error was shown. No warning appeared. The model just quietly dropped earlier messages to make room for newer ones — and the user concluded the AI was unreliable.

This is not a model failure. It is a product design failure.
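One way to treat this as a product problem is to surface context pressure to the user before truncation happens. A minimal sketch, where the limit, the warning threshold, and the four-characters-per-token heuristic are all assumptions for illustration:

```python
# Sketch: tell the user about context pressure instead of silently dropping turns.
CONTEXT_LIMIT_TOKENS = 128_000
WARN_THRESHOLD = 0.8  # warn well before truncation actually kicks in

def estimate_tokens(messages: list[dict]) -> int:
    # Rough heuristic; a real implementation would use the model's own tokenizer.
    return sum(len(m["content"]) // 4 for m in messages)

def context_pressure_notice(messages: list[dict]) -> str | None:
    used = estimate_tokens(messages)
    ratio = used / CONTEXT_LIMIT_TOKENS
    if ratio >= 1.0:
        return ("Earlier parts of this conversation are no longer visible to the "
                "assistant. Consider pinning or summarizing key decisions before continuing.")
    if ratio >= WARN_THRESHOLD:
        return f"This conversation is using about {ratio:.0%} of the assistant's working memory."
    return None  # nothing to show yet

print(context_pressure_notice([{"role": "user", "content": "x" * 450_000}]))
```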

The Hidden Tax on Your AI Features: What Your Inference Bill Isn't Telling You

10 min read
Tian Pan
Software Engineer

When engineers pitch an AI feature, the cost conversation almost always centers on the inference API. How much per token? What's the monthly estimate at our expected call volume? Can we negotiate a volume discount? This is the wrong conversation — or at least an incomplete one.

In practice, the inference bill accounts for roughly 20–30% of what it actually costs to run a mature AI feature. The rest is distributed across a portfolio of costs that don't show up on your LLM provider's invoice: the vector database your retrieval pipeline depends on, the embedding jobs that populate it, the observability platform catching silent failures, the human reviewers validating model outputs, and the engineers who spend weeks tuning the prompts that make everything work. Teams discover this the hard way, usually six months after launch when they're trying to explain a cost center that's 3–5x higher than projected.
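A rough way to see the shape of the problem is to write the whole portfolio down next to the inference line item. The figures below are placeholders, not benchmarks; the point is that the invoice you watch is a minority of the total.

```python
# Illustrative monthly figures, not benchmarks; the point is the shape, not the values.
monthly_costs = {
    "inference_api": 4_000,            # the line item everyone watches
    "vector_database": 1_500,          # retrieval infrastructure
    "embedding_jobs": 800,             # keeping the index populated
    "observability_and_evals": 2_000,
    "human_review": 3_500,
    "prompt_engineering_time": 4_500,  # engineer-hours at a loaded rate
}

total = sum(monthly_costs.values())
share = monthly_costs["inference_api"] / total
print(f"Total: ${total:,}/month; the inference API is only {share:.0%} of it")
```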

Org-Level Goodhart: When Teams Game AI Adoption Metrics

9 min read
Tian Pan
Software Engineer

According to one study, 95% of generative AI pilots technically succeed—and 74% of companies using generative AI still haven't shown tangible business value. That gap is not a coincidence. It's a measurement problem dressed up as a technology problem, and most organizations won't diagnose it correctly because the people doing the measuring are the same ones being measured.

This is Goodhart's Law at the organizational level: once an AI adoption metric becomes a performance target, it ceases to measure what you care about. The metric keeps going up. The underlying outcome stays flat or gets worse.

Prompt Contract Testing: How Teams Building Different Agents Coordinate Without Breaking Each Other

10 min read
Tian Pan
Software Engineer

When two microservices diverge in their API assumptions, your integration tests catch it before production does. When two agents diverge in their prompt assumptions, you find out when a customer gets contradictory answers—or when a cascading failure takes down the entire pipeline. Multi-agent AI systems fail at rates of 41–87% in production. More than a third of those failures aren't model quality problems; they're coordination breakdowns: one agent changed how it formats output, another still expects the old schema, and nobody has a test for that.

The underlying problem is that agents communicate through implicit contracts. A research agent agrees—informally, in someone's mental model—to return results as a JSON object with a sources array. The orchestrating agent depends on that shape. Nobody writes this down. Nobody tests it. Six weeks later the research agent's prompt is refined to return a ranked list instead, and the orchestrator silently drops half its inputs.
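The fix the title points at is making that implicit contract explicit and testing it, the same way you would contract-test a microservice API. A minimal sketch, with hypothetical agent and field names:

```python
# Sketch of a prompt contract test: pin the output shape the orchestrator depends on,
# so a prompt refinement that changes the schema fails in CI instead of in production.
import json

def check_research_agent_contract(raw_output: str) -> dict:
    """The orchestrator assumes a JSON object with a non-empty `sources` array."""
    payload = json.loads(raw_output)
    assert isinstance(payload, dict), "expected a JSON object, got something else"
    assert "sources" in payload, "missing `sources`: the orchestrator will drop results"
    assert isinstance(payload["sources"], list) and payload["sources"], \
        "`sources` must be a non-empty array"
    return payload

def test_research_agent_contract():
    # In a real suite this would call the agent (or replay recorded outputs);
    # here the fixture just shows the shape being pinned.
    raw = '{"summary": "...", "sources": [{"url": "https://example.com", "title": "..."}]}'
    check_research_agent_contract(raw)
```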

AI System Design Advisor: What It Gets Right, What It Gets Confidently Wrong, and How to Tell the Difference

9 min read
Tian Pan
Software Engineer

A three-person team spent a quarter implementing event sourcing for an application serving 200 daily active users. The architecture was technically elegant. It was operationally ruinous. The design came from an AI recommendation, and the team accepted it because the reasoning was fluent, the tradeoff analysis sounded rigorous, and the system they ended up with looked exactly like the kind of thing you'd see on a senior engineer's architecture diagram.

That story is now a cautionary pattern, not an edge case. AI produces genuinely useful architectural input in specific, identifiable situations — and produces confidently wrong advice in situations that look nearly identical from the outside. The gap between them is not obvious if you approach AI as an answer machine. It becomes navigable if you approach it as a sparring partner.

The Budget Inversion Trap: Why Your Most Valuable AI Features Get the Cheapest Inference

8 min read
Tian Pan
Software Engineer

Most teams optimize AI inference costs by routing cheaper queries to cheaper models. That sounds reasonable — and it's backwards. The queries that go to cheap models first aren't the simple ones. They're the complex ones, because those are the expensive ones your FinOps dashboard flagged.

The result: your contract renewal workflow, the one that closes six-figure deals, runs on a model that hallucinates clause references. Your customer support triage — entry-level stuff, genuinely low-stakes — gets frontier model treatment because nobody complained about it yet.

This is the budget inversion trap. It's not caused by negligence. It's the predictable output of applying cost pressure without value context.
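The escape from the trap is routing on stakes rather than on spend alone. A minimal sketch of value-aware routing, where the tiers, thresholds, and example workflows are made up for illustration:

```python
# Sketch of value-aware routing: pick a model tier from what the workflow is worth
# and what a mistake costs, not from which line item a FinOps dashboard flagged.
MODEL_TIERS = {"frontier": 1.00, "mid": 0.20, "small": 0.02}  # relative cost per call

def pick_model(workflow_value_usd: float, error_cost_usd: float) -> str:
    """Cheapest tier whose quality risk is acceptable given the stakes."""
    stakes = workflow_value_usd + error_cost_usd
    if stakes >= 10_000:   # e.g. contract renewal: a hallucinated clause is expensive
        return "frontier"
    if stakes >= 500:
        return "mid"
    return "small"         # e.g. routine support triage

print(pick_model(workflow_value_usd=150_000, error_cost_usd=50_000))  # -> frontier
print(pick_model(workflow_value_usd=20, error_cost_usd=5))            # -> small
```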

The Co-Evolution Trap: How Your AI Feature's Success Is Quietly Destroying Its Evaluations

9 min read
Tian Pan
Software Engineer

Your AI feature launched. It's working well. Users are adopting it. Satisfaction scores are up. You go back and run the original eval suite—still green. Six months later, something is quietly wrong, but your dashboards don't show it yet.

This is the co-evolution trap. The moment your AI feature is deployed, it starts changing the people using it. They adapt their workflows, their phrasing, their expectations. That adaptation makes the distribution of inputs your feature actually processes diverge from the distribution you measured at launch. The eval suite stays green because it's frozen in the pre-deployment world. The real-world performance drifts in ways the suite never captures.
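One countermeasure is to measure how far live inputs have drifted from the frozen eval distribution, so "the suite is still green" stops being the only signal. A minimal sketch using a population stability index over prompt length as a stand-in feature; all data here is synthetic:

```python
# Sketch: quantify drift between the launch-time eval distribution and live traffic.
# Prompt length is a stand-in feature; a real check might track intent labels or embeddings.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between the launch-time sample and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

launch_lengths = np.random.default_rng(0).normal(400, 80, 2_000)  # eval set, frozen at launch
live_lengths = np.random.default_rng(1).normal(250, 60, 2_000)    # users adapted: shorter prompts
drift = psi(launch_lengths, live_lengths)
if drift > 0.2:  # a common rule of thumb: above 0.2 signals a significant shift
    print(f"Inputs have drifted (PSI = {drift:.2f}); the frozen eval suite may be stale")
```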

The First AI Feature Problem: Why What You Ship First Determines What Users Accept Next

9 min read
Tian Pan
Software Engineer

Most teams ship their boldest AI feature first. It's the one they've been working on for six months, the one that makes a good demo, the one that leadership is excited about. It fails in production — not catastrophically, just enough to make users uncomfortable — and suddenly every AI feature that follows inherits that skepticism. The team spends the next year wondering why adoption is flat even after they fixed the original problems.

This is the first AI feature problem. What you ship first establishes a precedent that persists long after the technical issues are resolved. User trust in AI is formed on the first failure, not the first success. The sequence of your launches matters more than the quality of any individual feature.

Why Your AI Sounds Wrong Even When It's Technically Correct

9 min read
Tian Pan
Software Engineer

A logistics chatbot received a message from a customer whose shipment had been lost for a week. The reply came back: "I'm not trained to care about that." Factually accurate. The system had correctly parsed the query, correctly identified that it lacked routing to address the issue, and correctly communicated its limitation. The answer was technically correct in every measurable sense. It was also a product disaster.

This is the register problem — and it's the failure mode your evals almost certainly aren't measuring.