Skip to main content

861 posts tagged with "insider"

View all tags

Treating Your LLM Provider as an Unreliable Upstream: The Distributed Systems Playbook for AI

· 11 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Response times look fine. Error rates are near zero. And yet your users are filing tickets about garbage answers, your agent is making confidently wrong decisions, and your support queue is filling up with complaints that don't correlate with any infrastructure alert you have.

Welcome to the unique hell of depending on an LLM API in production. It's an upstream service that can fail you while returning a perfectly healthy 200 OK.

The AI Delegation Paradox: You Can't Evaluate Work You Can't Do Yourself

· 9 min read
Tian Pan
Software Engineer

Every engineer who has delegated a module to a contractor knows the feeling: the code comes back, the tests pass, the demo works — and you have no idea whether it's actually good. You didn't write it, you don't fully understand the decisions embedded in it, and the review you're about to do is more performance than practice. Now multiply that dynamic by every AI-assisted commit in your codebase.

The AI delegation paradox is simple to state and hard to escape: the skill you need most to evaluate AI-generated work is the same skill that atrophies fastest when you stop doing the work yourself. This isn't a future risk. It's happening now, measurably, across engineering organizations that have embraced AI coding tools.

CLAUDE.md as Codebase API: The Most Leveraged Documentation You'll Ever Write

· 9 min read
Tian Pan
Software Engineer

Most teams treat their CLAUDE.md the way they treat their README: write it once, forget it exists, wonder why nothing works. But a CLAUDE.md isn't documentation. It's an API contract between your codebase and every AI agent that touches it. Get it right, and every AI-assisted commit follows your architecture. Get it wrong — or worse, let it rot — and you're actively making your agent dumber with every session.

The AGENTbench study tested 138 real-world coding tasks across 12 repositories and found that auto-generated context files actually decreased agent success rates compared to having no context file at all. Three months of accumulated instructions, half describing a codebase that had moved on, don't guide an agent. They mislead it.

Knowledge Graphs Are Back: Why RAG Teams Are Adding Structure to Their Retrieval

· 8 min read
Tian Pan
Software Engineer

Your RAG pipeline answers single-fact questions beautifully. Ask it "What is our refund policy?" and it nails it every time. But ask "Which customers on the enterprise plan filed support tickets about the billing API within 30 days of their contract renewal?" and it falls apart. The answer exists in your data — scattered across three different document types, connected by relationships that cosine similarity cannot see.

This is the multi-hop reasoning problem, and it's the reason a growing number of production RAG teams are grafting knowledge graphs onto their vector retrieval pipelines. Not because graphs are trendy again, but because they've hit a concrete accuracy ceiling that no amount of chunk-size tuning or reranking can fix.

The MCP Composability Trap: When 'Just Add Another Server' Becomes Dependency Hell

· 9 min read
Tian Pan
Software Engineer

The MCP ecosystem has 10,000+ servers and 97 million SDK downloads. It also has 30 CVEs filed in sixty days, 502 server configurations with unpinned versions, and a supply chain attack that BCC'd every outgoing email to an attacker for fifteen versions before anyone noticed. The composability promise — "just plug in another MCP server" — is real. But so is the dependency sprawl it creates, and most teams discover the cost after they're already deep in integration debt.

If you've built production systems on npm, you've seen this movie before. The MCP ecosystem is speedrunning the same plot, except the packages have shell access to your machine and credentials to your production systems.

The 10x Prompt Engineer Myth: Why System Design Beats Prompt Wordsmithing

· 8 min read
Tian Pan
Software Engineer

There is a persistent belief in the AI engineering world that the difference between a mediocre LLM application and a great one comes down to prompt craftsmanship. Teams hire "prompt engineers," run dozens of A/B tests on phrasing, and spend weeks agonizing over whether "You must" outperforms "Please ensure." Meanwhile, the retrieval pipeline feeds garbage context, there is no output validation, and the error handling strategy is "hope the model gets it right."

The data tells a different story. The first five hours of prompt work on a typical LLM application yield roughly a 35% improvement. The next twenty hours deliver 5%. The next forty hours? About 1%. Teams that recognize this curve early and redirect effort into system design consistently outperform teams that keep polishing prompts.

The Model Deprecation Cliff: What Happens When Your Provider Sunsets the Model Your Product Depends On

· 8 min read
Tian Pan
Software Engineer

Most teams discover they are model-dependent the same way you discover a load-bearing wall — by trying to remove it. The deprecation email arrives, you swap the model identifier in your config, and your application starts returning confident, well-formatted, subtly wrong answers. No errors. No crashes. Just a slow bleed of trust that takes weeks to notice and months to repair.

This is the model deprecation cliff: the moment when a forced migration reveals that your "model-agnostic" system was never agnostic at all. Your prompts, your output parsers, your evaluation baselines, your users' expectations — all of them were quietly calibrated to behavioral quirks that are about to change on someone else's release schedule.

Vibe Coding Considered Harmful: When AI-Assisted Speed Kills Software Quality

· 8 min read
Tian Pan
Software Engineer

Andrej Karpathy coined "vibe coding" in early 2025 to describe a style of programming where you "fully give into the vibes, embrace exponentials, and forget that the code even exists." You describe what you want in natural language, the AI generates it, and you ship. It felt like a superpower. Within a year, the data started telling a different story.

A METR randomized controlled trial found that experienced open-source developers were 19% slower when using AI coding tools — despite predicting they'd be 24% faster, and still believing afterward they'd been 20% faster. A CodeRabbit analysis of 470 GitHub pull requests found AI co-authored code contained 1.7x more major issues than human-written code. And an Anthropic study of 52 engineers showed AI-assisted developers scored 17% lower on comprehension tests of their own codebases.

A/B Testing Non-Deterministic AI Features: Why Your Experimentation Framework Assumes the Wrong Null Hypothesis

· 10 min read
Tian Pan
Software Engineer

Your A/B testing framework was built for a world where the same input produces the same output. Change a button color, measure click-through rate, compute a p-value. The variance comes from user behavior, not from the feature itself. But when you ship an AI feature — a chatbot, a summarizer, a code assistant — the treatment arm has its own built-in randomness. Run the same prompt twice, get two different answers. Your experimentation infrastructure was never designed for this, and the consequences are worse than you think.

Most teams discover the problem the hard way: experiments that never reach significance, or worse, experiments that reach significance on noise. The standard A/B testing playbook doesn't just underperform with non-deterministic features — it actively misleads.

Agent Credential Rotation: The DevOps Problem Nobody Mapped to AI

· 8 min read
Tian Pan
Software Engineer

Every DevOps team has a credential rotation policy. Most have automated it for their services, CI pipelines, and databases. But the moment you deploy an autonomous AI agent that holds API keys across five different integrations, that rotation policy becomes a landmine. The agent is mid-task — triaging a bug, updating a ticket, sending a Slack notification — and suddenly its GitHub token expires. The process looks healthy. The logs show no crash. But silently, nothing works anymore.

This is the credential rotation problem that nobody mapped from DevOps to AI. Traditional rotation assumes predictable, human-managed workloads with clear boundaries. Autonomous agents shatter every one of those assumptions.

The AI Feature Adoption Curve Nobody Measures Correctly

· 10 min read
Tian Pan
Software Engineer

Your AI feature launched three months ago. DAU is up. Session length is climbing. Your dashboard looks green. But here is the uncomfortable question: are your users actually adopting the feature, or are they just tolerating it?

Most teams track AI feature adoption with the same metrics they use for traditional product features — daily active users, session duration, feature activation rates. These metrics worked fine when features behaved deterministically. Click a button, get a result, measure engagement. But AI features are fundamentally different: their outputs vary, their value is probabilistic, and users develop trust (or distrust) through repeated exposure. The standard metrics don't just fail to capture this — they actively mislead.

AI Feature Cannibalization: When Your Smart Feature Quietly Kills Your Core Product

· 10 min read
Tian Pan
Software Engineer

You ship an AI-powered summary feature for your document editor. Adoption is great — 40% of users activate it within the first week. Your PM writes a celebratory Slack message. Two months later, average session duration has dropped 25%, collaborative editing is down, and your power users are quietly churning. Nobody connects these trends to the shiny new feature because the dashboard that tracks the summary feature shows nothing but green.

This is AI feature cannibalization: when an AI shortcut solves a user's immediate problem while destroying the engagement loops that make your product worth paying for. It is one of the most insidious failure modes in product development today because every metric that tracks the feature itself looks healthy, even as the product-level metrics decay.