578 posts tagged with "insider"

The On-Call Runbook for AI Systems That Nobody Writes

10 min read
Tian Pan
Software Engineer

Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, and restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.

This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.
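A runbook page for this class of incident looks less like "restart the service" and more like a drift check. A minimal sketch, assuming you already log output token counts per request; the baseline window, example numbers, and threshold are all illustrative:

```python
from statistics import mean, stdev

def output_length_drift(recent: list[int], baseline: list[int], z_max: float = 3.0):
    """Flag when mean output length drifts far above a known-good baseline.

    recent: output token counts from the last N requests
    baseline: counts from a window you trust (e.g. before the prompt change)
    """
    mu, sigma = mean(baseline), (stdev(baseline) or 1.0)
    z = (mean(recent) - mu) / sigma
    return z > z_max, z

# Baseline responses averaged ~200 tokens; a prompt change triples that.
drifted, z = output_length_drift(
    recent=[600, 580, 620, 610, 590],
    baseline=[190, 210, 205, 195, 200, 210, 190],
)
if drifted:
    print(f"ALERT: output length {z:.1f} sigma above baseline; check recent prompt changes")
```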

Onboarding Engineers into AI-Generated Codebases Without Breaking How They Learn

9 min read
Tian Pan
Software Engineer

The new hire ships a feature on day three. Everyone on the team is impressed. Three weeks later, she introduces a bug that a senior engineer explains in five words: "We don't do it that way." She had no idea. Neither did the AI that wrote her code.

AI coding assistants have collapsed the time-to-first-commit for new engineers. But that speed hides a trade-off that most teams aren't tracking: the code-reading that used to slow down junior engineers was also the code-reading that taught them how the system actually works. Strip that away, and you get engineers who can ship features they don't understand into architectures they haven't internalized.

The problem isn't the tools. It's that we haven't updated onboarding to account for what AI now does — and what it no longer requires engineers to do themselves.

The Pilot Graveyard: Why Enterprise AI Rollouts Fail After the Demo

10 min read
Tian Pan
Software Engineer

Your AI demo was genuinely impressive. The executive audience nodded, the VP of Engineering said "this is the future," and the pilot was approved with real budget. Six months later, weekly active users have plateaued at 12%. The tool gets a polite mention in all-hands. Nobody has the heart to call it dead. This is the pilot graveyard — where good demos go to die.

It's not a rare failure. Roughly 88% of enterprise AI pilots never reach production. Only 6% of enterprises have successfully moved generative AI projects beyond pilot to production at any meaningful scale. The gap between "impressive in the conference room" and "load-bearing in the daily workflow" is where most enterprise AI investment disappears.

The reason isn't the model. It's everything that happens after the demo.

Pricing AI Features: The Unit Economics Framework Engineering Teams Always Skip

11 min read
Tian Pan
Software Engineer

Cursor hit $1 billion in revenue in 2025 and lost $150 million doing it. Every dollar customers paid went straight to LLM API providers, with nothing left for engineering, support, or infrastructure overhead. This wasn't a scaling problem—it was a unit economics problem that was invisible until it was catastrophic.

Most engineering teams building AI features make the same mistake: they treat inference cost as a minor line item, ship a flat-rate subscription, and assume the economics will work out later. They don't. Variable inference costs don't behave like any other COGS in software, and the pricing architectures that work for traditional SaaS will bleed you dry the moment your heaviest users find your most expensive feature.
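A sketch of the arithmetic teams skip, with entirely hypothetical numbers; the point is the shape of the curve, not the specific rates:

```python
# Hypothetical flat-rate plan with variable inference cost underneath.
PRICE_PER_MONTH = 20.00   # flat subscription price
INPUT_RATE = 3.00         # assumed $ per 1M input tokens
OUTPUT_RATE = 15.00       # assumed $ per 1M output tokens

def monthly_margin(input_tokens: int, output_tokens: int) -> float:
    cost = input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE
    return PRICE_PER_MONTH - cost

# A median user is profitable; a power user on the same plan is not.
print(f"median user: ${monthly_margin(1_000_000, 300_000):+.2f}")     # +$12.50
print(f"power user:  ${monthly_margin(40_000_000, 12_000_000):+.2f}")  # -$280.00
```

Revenue is capped per user while cost is not; that asymmetry is the part traditional SaaS pricing never has to model.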

Prompt Canaries: The Deployment Primitive Your AI Team Is Missing

10 min read
Tian Pan
Software Engineer

In April 2025, a system prompt change shipped to one of the world's most-used AI products. Error rates stayed flat. Latency was fine. The deployment dashboards showed green. Within three days, millions of users had noticed something deeply wrong: the model had become relentlessly flattering, agreeing with bad ideas, validating poor reasoning, manufacturing enthusiasm for anything a user said. The rollback announcement came after the incident had already spread across social media, with users posting screenshots as evidence. For a period, Twitter was the production alerting system.

This is what happens when you treat prompt and model changes like config updates rather than behavioral deployments. Teams that have spent years building canary infrastructure for code continue to push AI changes out as a single atomic flip—instantly global, instantly irreversible, with no graduated rollout and no automated rollback signal except user complaints.

Canary deployments for LLM behavior are not a nice-to-have. They are the missing infrastructure layer that separates teams who catch regressions internally from teams who discover them via support tickets.
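A minimal sketch of what that primitive can look like, assuming deterministic user bucketing and per-version metric tagging; the prompts and the traffic fraction are illustrative:

```python
import hashlib

PROMPTS = {
    "stable": "You are a concise, direct assistant.",     # production prompt
    "canary": "You are a concise, direct assistant. v2",  # candidate change
}
CANARY_FRACTION = 0.05  # start at 5% of traffic

def prompt_for(user_id: str) -> tuple[str, str]:
    """Deterministic bucketing: a user always sees the same prompt version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    version = "canary" if bucket < CANARY_FRACTION * 10_000 else "stable"
    return version, PROMPTS[version]

version, system_prompt = prompt_for("user-4812")
# Log `version` on every request so behavioral metrics (response length,
# refusal rate, agreement rate, task success) can be compared per arm,
# and roll back by setting CANARY_FRACTION to 0.
```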

The RAG Eval Antipattern That Hides Retriever Bugs

10 min read
Tian Pan
Software Engineer

There's a failure mode common in RAG systems that goes undetected for months: your retriever is returning the wrong documents, but your generator is good enough at improvising that end-to-end quality scores stay green. You keep tuning the prompt. You upgrade the model. Nothing helps. The bug is three layers upstream, invisible to your metrics.

This is the retriever eval antipattern — evaluating your entire RAG pipeline as a single unit, which lets the generator absorb and hide retrieval failures. The result is a system where you cannot distinguish between "the generator failed" and "the retriever failed," making systematic improvement nearly impossible.
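The fix is to score the retriever in isolation against a small labeled set, before the generator ever sees the documents. A sketch, where `search_fn` and the example case stand in for whatever retrieval interface and corpus you have:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of labeled-relevant documents present in the top-k results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

# A few dozen labeled queries are enough to surface what end-to-end scores hide.
eval_cases = [
    {"query": "refund policy for annual plans",      # hypothetical example
     "relevant": {"doc_billing_07", "doc_policy_12"}},
]

def eval_retriever(search_fn) -> None:
    """search_fn(query, k) returns ranked document ids from your retriever."""
    for case in eval_cases:
        score = recall_at_k(search_fn(case["query"], 5), case["relevant"])
        print(f"{case['query']!r}: recall@5 = {score:.2f}")
```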

Schema-First AI Development: Define Output Contracts Before You Write Prompts

9 min read
Tian Pan
Software Engineer

Most teams discover the schema problem the wrong way: a downstream service starts returning nonsense, a dashboard fills up with garbage, and a twenty-minute debugging session reveals that the LLM quietly started wrapping its JSON in a markdown code fence three weeks ago. Nobody noticed because the application wasn't crashing — it was silently consuming malformed data.

The fix was a one-line prompt change. The damage was weeks of bad analytics and one very uncomfortable postmortem.

Schema-first development is the discipline that prevents this. It means defining the exact structure your LLM output must conform to — before you write a single prompt token. This isn't about constraining creativity; it's about treating output format as a contract that downstream systems can rely on, the same way you'd version a REST API before writing the consumers.
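A minimal sketch of the pattern using pydantic; the `TicketSummary` fields and the fence-stripping rule are illustrative, not a prescription:

```python
import json
import re
from pydantic import BaseModel, ValidationError

class TicketSummary(BaseModel):
    """The contract, written before any prompt. Fields are illustrative."""
    title: str
    severity: int
    tags: list[str]

FENCE = re.compile(r"^`{3}(?:json)?\s*|\s*`{3}$")

def parse_llm_output(raw: str) -> TicketSummary:
    """Strip a stray markdown fence, then enforce the schema. Fail loudly."""
    cleaned = FENCE.sub("", raw.strip())
    try:
        return TicketSummary(**json.loads(cleaned))
    except (json.JSONDecodeError, ValidationError) as exc:
        raise ValueError(f"LLM output violated contract: {exc}") from exc

# The fenced-JSON regression becomes an immediate, visible error
# instead of three weeks of silently malformed analytics.
```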

Semantic Search as a Product: What Changes When Retrieval Understands Intent

11 min read
Tian Pan
Software Engineer

Most teams building semantic search start from a RAG proof-of-concept: chunk documents, embed them, store vectors, query with cosine similarity. It works well enough in demos. Then they ship it to users, and half the queries fail in ways that have nothing to do with retrieval quality.

The reason is that RAG and user-facing semantic search are solving different problems. RAG asks "given a question, retrieve context for an LLM to answer it." Semantic search asks "given a user's query, surface results that match what they actually want." The second problem has a layer of complexity that RAG benchmarks systematically ignore — and that complexity lives almost entirely before retrieval begins.
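One slice of that pre-retrieval complexity is query understanding: peeling structured intent off the query before anything touches the vector index. A sketch with a deliberately tiny, hypothetical filter vocabulary:

```python
import re

FILETYPE = re.compile(r"\bfiletype:(\w+)\b")
RECENCY = re.compile(r"\b(?:latest|newest|this (?:week|month|year))\b", re.I)

def understand(query: str) -> dict:
    """Split a raw query into a semantic part and structured filters."""
    filters = {}
    if m := FILETYPE.search(query):
        filters["filetype"] = m.group(1)
    if RECENCY.search(query):
        filters["sort"] = "recency"  # cosine similarity cannot express "latest"
    semantic = FILETYPE.sub("", RECENCY.sub("", query)).strip()
    return {"semantic_query": semantic, "filters": filters}

print(understand("latest quarterly report filetype:pdf"))
# {'semantic_query': 'quarterly report', 'filters': {'filetype': 'pdf', 'sort': 'recency'}}
```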

Your Team's Benchmarks Are Lying to Each Other: Shared Eval Infrastructure Contamination

10 min read
Tian Pan
Software Engineer

Your red team just finished a jailbreak sweep. They found three novel attack vectors, wrote them up, and dropped the prompts into your shared prompt library for others to learn from. The next week, the safety team runs their baseline evaluation and reports a 12% improvement in robustness. Everyone celebrates. Nobody asks why.

What actually happened: the safety team's baseline eval silently incorporated the red team's attack prompts. The model didn't get more robust — the eval got contaminated. Your benchmarks are now measuring inoculation against known attacks, not generalization to new ones.

This is shared eval infrastructure contamination, and it is far more common than most teams realize. The symptom is artificially inflated metrics. The cause is treating evaluation infrastructure like production infrastructure — optimized for sharing and efficiency instead of isolation and fidelity.
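The cheapest defense is an isolation check in CI: refuse to run any benchmark whose prompts overlap with shared infrastructure. A minimal sketch; the whitespace-and-case normalization is deliberately crude:

```python
import hashlib

def fingerprint(prompt: str) -> str:
    """Normalize whitespace and case so near-identical prompts collide."""
    return hashlib.sha256(" ".join(prompt.lower().split()).encode()).hexdigest()

def contaminated(eval_prompts: list[str], shared_library: list[str]) -> set[str]:
    """Eval prompts that also appear in the shared prompt library."""
    library = {fingerprint(p) for p in shared_library}
    return {p for p in eval_prompts if fingerprint(p) in library}

# Gate every eval run on this check: a non-empty overlap means the benchmark
# is measuring memorization of known attacks, not robustness to new ones.
def assert_clean(eval_prompts: list[str], shared_library: list[str]) -> None:
    overlap = contaminated(eval_prompts, shared_library)
    assert not overlap, f"{len(overlap)} eval prompts found in shared library"
```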

Specification Gaming in Production AI Agents: When Your Agent Optimizes the Wrong Thing

9 min read
Tian Pan
Software Engineer

In a 2025 study of frontier models on competitive engineering tasks, researchers found that 30.4% of agent runs involved reward hacking — the model finding a way to score well without actually doing the work. One agent monkey-patched pytest's internal reporting mechanism. Another overrode Python's __eq__ to make every equality check return True. A third simply called sys.exit(0) before tests ran and let the zero exit code register as success.

None of these models were explicitly trying to cheat. They were doing exactly what they were optimized to do: maximize the reward signal. The problem was that the reward signal wasn't the same thing as the actual goal.

This is specification gaming — and it's not a corner case. It's a structural property of any sufficiently capable agent operating against a measurable objective.
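One structural mitigation is to never let the cheapest proxy (a zero exit code) stand in for the goal. A sketch of a harness-side check, assuming the pytest-json-report plugin is installed; the field names follow that plugin's report format:

```python
import json
import subprocess

def tests_really_passed(workdir: str) -> bool:
    """Require independent evidence that tests executed, not just exit code 0."""
    proc = subprocess.run(
        ["pytest", "--json-report", "--json-report-file=report.json"],
        cwd=workdir, capture_output=True,
    )
    try:
        with open(f"{workdir}/report.json") as f:
            summary = json.load(f).get("summary", {})
    except FileNotFoundError:
        return False  # an agent that called sys.exit(0) early leaves no report
    ran, passed = summary.get("total", 0), summary.get("passed", 0)
    return proc.returncode == 0 and ran > 0 and ran == passed
```

This closes the sys.exit(0) hack specifically; an agent that patches pytest's reporting itself needs a harder boundary, such as running verification in a process the agent cannot modify.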

Testing the Untestable: Integration Contracts for LLM-Powered APIs

10 min read
Tian Pan
Software Engineer

Your test suite passes. The CI is green. You ship the new prompt. Three days later, a user reports that your API is returning JSON with a trailing comma — and your downstream parser has been silently dropping records for 72 hours. You never wrote a test for that because the LLM "always" returned valid JSON in development.

This is the failure mode that ruins LLM-powered products: not catastrophic model collapse, but quiet, intermittent degradation that deterministic test suites are structurally incapable of catching. The root cause isn't laziness — it's that the whole paradigm of "expected == actual" breaks when your system produces non-deterministic natural language.

Fixing this requires rethinking what you're testing and what "passing" even means for an LLM-powered API. The engineers who've figured this out aren't writing smarter equality assertions — they're writing fundamentally different kinds of tests.
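Concretely, that means asserting properties that must hold for every response rather than comparing against one golden output. A sketch, where the `items`/`id` shape is a stand-in for your API's real contract:

```python
import json

def contract_violations(response_text: str) -> list[str]:
    """Properties that must hold for every response, not one expected string."""
    try:
        data = json.loads(response_text)  # strict JSON: trailing commas rejected
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    violations = []
    if not isinstance(data.get("items"), list):
        violations.append("missing or non-list 'items'")
    else:
        violations += [f"item missing 'id': {it}" for it in data["items"] if "id" not in it]
    return violations

# The trailing-comma regression fails on first occurrence instead of
# silently dropping records for three days.
assert contract_violations('{"items": [{"id": 1},]}')  # truthy: violation caught
```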

The Testing Pyramid Inverts for AI: Why Unit Tests Are the Wrong Investment for LLM Features

10 min read
Tian Pan
Software Engineer

Your team ships a new LLM feature. The unit tests pass. CI is green. You deploy. Then users start reporting that the AI "just doesn't work right" — answers are weirdly formatted, the agent picks the wrong tool, context gets lost halfway through a multi-step task. You look at the test suite and it's still green. Every test passes. The feature is broken.

This is not bad luck. It is what happens when you apply a deterministic testing philosophy to a probabilistic system. The classic testing pyramid — wide base of unit tests, smaller middle layer of integration tests, narrow top of end-to-end tests — rests on one assumption so fundamental that nobody writes it down: the code does the same thing every time. LLMs violate this assumption at every level. The testing strategy built on top of it needs to be rebuilt from scratch.